diff --git a/.gitattributes b/.gitattributes
index a6344aac8c09253b3b630fb776ae94478aa0275b..90613d37680204b8efdbacb33191608c248c5b14 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/case_1.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/case_2.png filter=lfs diff=lfs merge=lfs -text
diff --git a/README.md b/README.md
index 24fc926ca88f0b1f45b2f49b8d6463e6c1160416..c0b516fbaae17c5ba7cc135ac0332ad96ccfeb83 100644
--- a/README.md
+++ b/README.md
@@ -1,48 +1,107 @@
----
-title: Avocado On Toast
-emoji: 🥑
-colorFrom: green
-colorTo: yellow
-sdk: gradio
-app_file: app.py
-pinned: false
+# AVoCaDO: An AudioVisual Video Captioner Driven by Temporal Orchestration
+
---
-# Avocado On Toast
+## ✨ Overview
+Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. We introduce AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance under visual-only settings.
-Gradio Space that standardizes uploaded videos with ffmpeg, runs AVoCaDO, and
-writes JSONL outputs into a run-labeled dataset folder.
+## 🎬 Captioning Case of AVoCaDO
+
+An illustration of a video caption generated by AVoCaDO, featuring both precise audiovisual temporal alignment and accurate dialogue rendering.
-## Usage (Hugging Face Space)
+## 🚀 Getting Started
+Follow these simple steps to set up and run AVoCaDO on your machine.
-1. Create a new Gradio Space and connect it to this repo.
-2. Ensure `ffmpeg` and AVoCaDO are available in the Space image.
-3. Set `AVOCADO_CMD` as a Space secret or enter it in the UI textbox.
- - The command must accept `{input}` and `{output}` placeholders.
- - Example:
- ```bash
- python -m avocado_captioner.cli --input {input} --output {output}
- ```
-4. Launch the Space, upload videos, and click **Process videos**.
+### 1. Clone the Repository
+First, clone the project and navigate into the directory:
-Outputs are written to:
+```bash
+git clone https://github.com/AVoCaDO-Captioner/AVoCaDO.git
+cd AVoCaDO
+```
+### 2. Set Up the Environment
+Create and activate the Conda environment using the provided `environment.yml` file.
+
+```bash
+conda env create -f environment.yml
+conda activate AVoCaDO
```
-/data/dataset/runs//annotations.jsonl
+
+### 3. Quick Usage
+```bash
+python inference.py assets/case_1.mp4
```
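For multiple clips, the single-video command above can be wrapped in a small batch helper. This is a hypothetical sketch, not part of the repository: it assumes `inference.py` accepts one video path as its only positional argument, exactly as in the Quick Usage example, and the `caption_videos` helper name is illustrative.

```python
# Hypothetical batch helper (not part of the repository): builds one
# `python inference.py <video>` command per clip, assuming the CLI takes
# a single video path as shown in the Quick Usage example above.
import sys
from pathlib import Path


def caption_videos(video_dir: str, pattern: str = "*.mp4") -> list[list[str]]:
    """Return one inference command per matching video file, in sorted order."""
    return [
        [sys.executable, "inference.py", str(video)]
        for video in sorted(Path(video_dir).glob(pattern))
    ]


if __name__ == "__main__":
    for cmd in caption_videos("assets"):
        print(" ".join(cmd))
        # To actually run each command:
        # import subprocess; subprocess.run(cmd, check=True)
```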
-Each run produces `annotations.jsonl` and `manifest.json` under the run label.
+## 📈 Benchmark Evaluation
+We provide evaluation scripts for all benchmarks evaluated in our paper.
+
+### Direct Audiovisual Caption Evaluation
+1. **video-SALMONN2-testset:**
+ ```bash
+ bash eval_scripts/video-SALMONN2-testset/eval_video-SALMONN2-test.sh
+ ```
-## Configuration
+2. **UGC-VideoCap:**
+ ```bash
+ bash eval_scripts/UGC-VideoCap/eval_UGC-VideoCap.sh
+ ```
-- `DATA_ROOT` (default: `/data`) controls the root folder for uploads and output.
-- `AVOCADO_CMD` defines the AVoCaDO command used during processing.
+### QA-based Audiovisual Caption Evaluation
+1. **Daily-Omni:**
+ ```bash
+ bash eval_scripts/Daily-Omni/Daily-Omni_pipeline.sh
+ ```
-## Notes
+2. **WorldSense:**
+ ```bash
+ bash eval_scripts/WorldSense/WorldSense_pipeline.sh
+ ```
-The standardization step uses:
-- 720p scaling (`scale=-2:720`)
-- H.264 video (`libx264`, `crf=23`)
-- AAC audio (`128k`)
+### Visual-only Caption Evaluation
+1. **VDC:**
+ First, generate captions for the videos in the VDC benchmark.
+   ```bash
+ python eval_scripts/VDC/generate_caption.py \
+ --model_path \
+ --fout_path
+ ```
-Adjust these settings in `app.py` if you want a different quality/compute tradeoff.
+   Next, set up the judge server. This requires installing [SGLang](https://github.com/sgl-project/sglang) to deploy [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) as the judge model.
+   ```bash
+ # Deploy the judge model using SGLang
+ python -m sglang.launch_server \
+ --model-path path_to_Meta-Llama-3.1-8B-Instruct \
+ --port 30000 \
+ --dp 2 --tp 4
+ ```
+
+ Once the judge model is successfully deployed and running, you can start the evaluation.
+ ```bash
+ bash AVoCaDO/eval_scripts/VDC/evaluation.sh
+ ```
+
+2. **DREAM-1K:**
+ ```bash
+ bash eval_scripts/DREAM-1K/eval_DREAM-1K.sh
+ ```
+
+
+## ✒️ Citation
+
+If you find our work helpful for your research, please consider giving a star ⭐ and citing our paper. We appreciate your support!
+
+```bibtex
+@article{chen2025avocado,
+ title={AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration},
+ author={Chen, Xinlong and Ding, Yue and Lin, Weihong and Hua, Jingyun and Yao, Linli and Shi, Yang and Li, Bozhou and Zhang, Yuanxing and Liu, Qiang and Wan, Pengfei and others},
+ journal={arXiv preprint arXiv:2510.10395},
+ year={2025}
+}
+```
diff --git a/assets/avocado.ico b/assets/avocado.ico
new file mode 100644
index 0000000000000000000000000000000000000000..124f29e235691b5e139c10207ff95d907c170674
Binary files /dev/null and b/assets/avocado.ico differ
diff --git a/assets/case_1.mp4 b/assets/case_1.mp4
new file mode 100644
index 0000000000000000000000000000000000000000..4ea58bf86bcdac66270e113f77e2c27df0c97976
--- /dev/null
+++ b/assets/case_1.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:78c17b195eae977e5ffdeafb499fbeeec2ea50f9258973154c0da111c2b90b07
+size 6417271
diff --git a/assets/case_2.png b/assets/case_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..753d9e64bedf869840d220e602f0902272ff2427
--- /dev/null
+++ b/assets/case_2.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:46159b25d4560ca19ab7cbe605de1762e71ce1fc3c4f2ed72321af1b268fe2bc
+size 2559969
diff --git a/environment.yml b/environment.yml
new file mode 100644
index 0000000000000000000000000000000000000000..d1a08c87081963190c18e261ef6ff478c4a71238
--- /dev/null
+++ b/environment.yml
@@ -0,0 +1,327 @@
+name: AVoCaDO
+channels:
+ - conda-forge
+ - defaults
+dependencies:
+ - _libgcc_mutex=0.1=main
+ - _openmp_mutex=5.1=1_gnu
+ - bzip2=1.0.8=h5eee18b_6
+ - ca-certificates=2024.9.24=h06a4308_0
+ - ld_impl_linux-64=2.40=h12ee557_0
+ - libffi=3.4.4=h6a678d5_1
+ - libgcc-ng=11.2.0=h1234567_1
+ - libgomp=11.2.0=h1234567_1
+ - libstdcxx-ng=11.2.0=h1234567_1
+ - libuuid=1.41.5=h5eee18b_0
+ - mscorefonts=0.0.1=3
+ - ncurses=6.4=h6a678d5_0
+ - openssl=3.0.15=h5eee18b_0
+ - pip=24.2=py310h06a4308_0
+ - python=3.10.15=he870216_1
+ - readline=8.2=h5eee18b_0
+ - setuptools=75.1.0=py310h06a4308_0
+ - sqlite=3.45.3=h5eee18b_0
+ - tk=8.6.14=h39e8969_0
+ - wheel=0.44.0=py310h06a4308_0
+ - xz=5.4.6=h5eee18b_1
+ - zlib=1.2.13=h5eee18b_1
+ - pip:
+ - accelerate==0.34.1
+ - aiodns==3.2.0
+ - aiohappyeyeballs==2.4.3
+ - aiohttp==3.10.10
+ - aiosignal==1.3.1
+ - annotated-types==0.7.0
+ - antlr4-python3-runtime==4.11.0
+ - anyio==4.6.2.post1
+ - apscheduler==3.10.4
+ - asttokens==2.4.1
+ - async-timeout==4.0.3
+ - attrs==24.2.0
+ - audioread==3.0.1
+ - av==13.1.0
+ - backcall==0.2.0
+ - beartype==0.19.0
+ - beautifulsoup4==4.13.4
+ - bleach==6.2.0
+ - blinker==1.9.0
+ - cachetools==4.2.4
+ - cchardet==2.1.7
+ - certifi==2021.10.8
+ - cffi==1.17.1
+ - charset-normalizer==3.4.0
+ - cityhash==0.2.4.post11
+ - click==8.1.7
+ - cloudpickle==3.1.0
+ - colorama==0.4.6
+ - compressed-tensors==0.8.0
+ - contourpy==1.3.1
+ - cycler==0.12.1
+ - dapr==1.14.0
+ - dapr-ext-fastapi==1.14.0
+ - dask==2024.12.1
+ - datasets==3.1.0
+ - dbpool==1.2.1
+ - decorator==4.4.2
+ - decord==0.6.0
+ - deepspeed==0.16.2
+ - defusedxml==0.7.1
+ - dill==0.4.0
+ - diskcache==5.6.3
+ - distro==1.9.0
+ - docopt==0.6.2
+ - docstring-parser==0.16
+ - dsc-auth==0.1.18
+ - einops==0.8.0
+ - et-xmlfile==2.0.0
+ - exceptiongroup==1.2.2
+ - executing==2.1.0
+ - fastapi==0.115.5
+ - fastjsonschema==2.21.1
+ - ffmpeg-python==0.2.0
+ - filelock==3.13.1
+ - fire==0.7.0
+ - flash-attn==2.7.0.post2
+ - flask==3.1.0
+ - fonttools==4.55.0
+ - frozenlist==1.5.0
+ - fsspec==2024.2.0
+ - ftfy==6.3.1
+ - func-timeout==4.3.5
+ - future==1.0.0
+ - fvcore==0.1.5.post20221221
+ - gguf==0.10.0
+ - google-api-core==2.23.0
+ - google-auth==2.36.0
+ - google-cloud-aiplatform==1.71.1
+ - google-cloud-bigquery==3.27.0
+ - google-cloud-core==2.4.1
+ - google-cloud-resource-manager==1.13.1
+ - google-cloud-storage==2.18.2
+ - google-crc32c==1.6.0
+ - google-resumable-media==2.7.2
+ - googleapis-common-protos==1.66.0
+ - grpc-google-iam-v1==0.13.1
+ - grpcio==1.68.0
+ - grpcio-reflection==1.48.2
+ - grpcio-status==1.68.0
+ - h11==0.14.0
+ - h5py==3.12.1
+ - hf-xet==1.1.4
+ - hickle==5.0.3
+ - hiredis==2.4.0
+ - hjson==3.1.0
+ - httpcore==1.0.6
+ - httptools==0.6.4
+ - httpx==0.27.2
+ - huggingface-hub==0.33.0
+ - humanize==4.11.0
+ - icecream==2.1.3
+ - idna==3.10
+ - imageio==2.36.0
+ - imageio-ffmpeg==0.5.1
+ - importlib-metadata==8.5.0
+ - infra-component==1.4.7
+ - infra-framework==1.17.10
+ - infra-kconf==1.1.3
+ - infra-kess==1.1.5
+ - infra-keycenter==1.1.1
+ - infra-storage==1.3.1
+ - install==1.3.5
+ - interegular==0.3.3
+ - iopath==0.1.10
+ - ipdb==0.13.13
+ - ipython==8.12.3
+ - itsdangerous==2.2.0
+ - jedi==0.19.2
+ - jinja2==3.1.3
+ - jiter==0.7.0
+ - joblib==1.4.2
+ - jsonschema==4.23.0
+ - jsonschema-specifications==2024.10.1
+ - jupyter-client==8.6.3
+ - jupyter-core==5.8.1
+ - jupyterlab-pygments==0.3.0
+ - kazoo==2.10.0
+ - kiwisolver==1.4.7
+ - ks-kafka-python==2.0.3
+ - lark==1.2.2
+ - lazy-loader==0.4
+ - levenshtein==0.26.1
+ - librosa==0.11.0
+ - llvmlite==0.43.0
+ - lm-format-enforcer==0.10.6
+ - locket==1.0.0
+ - lxml==4.9.4
+ - lz4==3.1.10
+ - markupsafe==2.1.5
+ - matplotlib==3.10.0
+ - matplotlib-inline==0.1.7
+ - mergedeep==1.3.4
+ - mistral-common==1.5.1
+ - mistral-inference==1.5.0
+ - mistune==3.1.3
+ - moviepy==1.0.3
+ - mpmath==1.3.0
+ - msgpack==1.1.0
+ - msgpack-numpy==0.4.8
+ - msgspec==0.18.6
+ - multidict==6.1.0
+ - multiprocess==0.70.18
+ - mysql-connector-python==8.0.31
+ - nbclient==0.10.2
+ - nbconvert==7.16.6
+ - nbformat==5.10.4
+ - nest-asyncio==1.6.0
+ - networkx==3.2.1
+ - ninja==1.11.1.3
+ - numba==0.60.0
+ - numpy==1.26.3
+ - nvidia-cublas-cu12==12.4.5.8
+ - nvidia-cuda-cupti-cu12==12.4.127
+ - nvidia-cuda-nvrtc-cu12==12.4.127
+ - nvidia-cuda-runtime-cu12==12.4.127
+ - nvidia-cudnn-cu12==9.1.0.70
+ - nvidia-cufft-cu12==11.2.1.3
+ - nvidia-curand-cu12==10.3.5.147
+ - nvidia-cusolver-cu12==11.6.1.9
+ - nvidia-cusparse-cu12==12.3.1.170
+ - nvidia-cusparselt-cu12==0.6.2
+ - nvidia-ml-py==12.560.30
+ - nvidia-nccl-cu12==2.21.5
+ - nvidia-nvjitlink-cu12==12.4.127
+ - nvidia-nvtx-cu12==12.4.127
+ - nvitop==1.3.2
+ - omegaconf==2.3.0
+ - open-clip-torch==2.29.0
+ - openai==1.54.3
+ - opencv-python==4.10.0.84
+ - opencv-python-headless==4.10.0.84
+ - openpyxl==3.1.5
+ - outlines==0.0.46
+ - packaging==24.1
+ - pandas==2.2.3
+ - pandocfilters==1.5.1
+ - parameterized==0.9.0
+ - parso==0.8.4
+ - partd==1.4.2
+ - partial-json-parser==0.2.1.1.post4
+ - pathos==0.3.4
+ - peft==0.14.0
+ - pexpect==4.9.0
+ - pickleshare==0.7.5
+ - pillow==10.4.0
+ - pipreqs==0.5.0
+ - platformdirs==4.3.6
+ - pooch==1.8.2
+ - portalocker==3.0.0
+ - pox==0.3.6
+ - ppft==1.7.7
+ - prettytable==2.5.0
+ - proglog==0.1.10
+ - prometheus-client==0.21.0
+ - prometheus-fastapi-instrumentator==7.0.0
+ - prompt-toolkit==3.0.48
+ - propcache==0.2.0
+ - proto-plus==1.25.0
+ - protobuf==3.20.0
+ - psutil==6.1.0
+ - ptyprocess==0.7.0
+ - pure-eval==0.2.3
+ - py-cpuinfo==9.0.0
+ - pyairports==2.1.1
+ - pyarrow==18.0.0
+ - pyasn1==0.6.1
+ - pyasn1-modules==0.4.1
+ - pycares==4.4.0
+ - pycocoevalcap==1.2
+ - pycocotools==2.0.8
+ - pycountry==24.6.1
+ - pycparser==2.22
+ - pycryptodome==3.21.0
+ - pydantic==2.9.2
+ - pydantic-core==2.23.4
+ - pydub==0.25.1
+ - pygments==2.18.0
+ - pyparsing==3.2.0
+ - pysmhasher==0.2.5
+ - pysoundfile==0.9.0.post1
+ - python-dateutil==2.9.0.post0
+ - python-dotenv==1.0.1
+ - python-levenshtein==0.26.1
+ - python-snappy==0.6.1
+ - pytorchvideo==0.1.5
+ - pytube==15.0.0
+ - pytz==2021.3
+ - pytz-deprecation-shim==0.1.0.post0
+ - pyyaml==6.0.2
+ - pyzmq==26.2.0
+ - qwen-omni-utils==0.0.8
+ - qwen-vl-utils==0.0.8
+ - rapidfuzz==3.12.2
+ - ray==2.38.0
+ - redis==4.6.0
+ - referencing==0.35.1
+ - regex==2024.9.11
+ - requests==2.32.3
+ - rpds-py==0.21.0
+ - rsa==4.9
+ - safetensors==0.4.5
+ - scenedetect==0.6.4
+ - scikit-learn==1.6.0
+ - scipy==1.14.1
+ - seaborn==0.13.2
+ - sentencepiece==0.2.0
+ - setuptools-scm==8.1.0
+ - shapely==2.0.6
+ - simple-parsing==0.1.6
+ - six==1.16.0
+ - sniffio==1.3.1
+ - soundfile==0.13.1
+ - soupsieve==2.7
+ - soxr==0.5.0.post1
+ - sqlparse==0.4.4
+ - stack-data==0.6.3
+ - starlette==0.41.3
+ - sympy==1.13.1
+ - tabulate==0.9.0
+ - termcolor==2.5.0
+ - threadpoolctl==3.5.0
+ - tiktoken==0.7.0
+ - timm==1.0.12
+ - tinycss2==1.4.0
+ - tokenizers==0.21.1
+ - tomli==2.0.2
+ - toolz==1.0.0
+ - torch==2.6.0
+ - torchvision==0.21.0
+ - tornado==6.4.1
+ - tqdm==4.67.0
+ - traitlets==5.14.3
+ - transformers==4.52.3
+ - triton==3.2.0
+ - typeguard==4.4.1
+ - typing-extensions==4.12.2
+ - tzdata==2024.2
+ - tzlocal==4.3.1
+ - unpaddedbase64==2.1.0
+ - urllib3==1.26.20
+ - uvicorn==0.32.0
+ - uvloop==0.21.0
+ - vertexai==1.71.1
+ - vllm==0.6.3
+ - watchfiles==0.24.0
+ - wcwidth==0.2.13
+ - webencodings==0.5.1
+ - websockets==13.1
+ - werkzeug==3.1.3
+ - word2number==1.1
+ - xformers==0.0.27.post2
+ - xmltodict==0.12.0
+ - xxhash==3.5.0
+ - yacs==0.1.8
+ - yarg==0.1.9
+ - yarl==1.17.1
+ - zhon==2.1.1
+ - zipp==3.20.2
+ - zsvision==0.7.12
diff --git a/eval_scripts/DREAM-1K/dream_example.jsonl b/eval_scripts/DREAM-1K/dream_example.jsonl
new file mode 100644
index 0000000000000000000000000000000000000000..62e71b456ee6810a2997f8a0ab97251125a9a353
--- /dev/null
+++ b/eval_scripts/DREAM-1K/dream_example.jsonl
@@ -0,0 +1,1000 @@
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "135.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The little boy held a ball in his right hand, tapped the glove on his left hand, and then threw it forward. The ball quickly bounced back, knocking the child over, and the child's hat, gloves, shoes, socks, and clothes all flew off.", "text": "A boy stands on a mound of dirt, wearing a white cap, a yellow shirt with black stripes, and a brown glove, holding a baseball in his right hand. He prepares to pitch by raising his leg and positioning his glove. The boy throws the baseball with a strong motion. The baseball flies through the air, and the boy's hat, gloves, and other items are seen floating in the air. The word 'POW!' appears as the baseball hits the boy, causing him to fall backward onto the mound. The boy lies on the mound, surrounded by his scattered items.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 135, "extra_info": {"duration": 4.0, "events": ["The little boy held a ball.", "The little boy tapped the glove.", "The little boy threw the ball forward.", "The ball bounced back.", "The ball knocked the child over.", "The child's hat flew off.", "The child's gloves flew off.", "The child's shoes flew off.", "The child's socks flew off.", "The child's clothes flew off."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 135}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "457.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a living room, a man wearing a white T-shirt walks past the entrance, and a white cat jumps out of a cage beside the door, following the man to the center of the living room. A black dog springs out from the left side of the camera, heads toward the white cat, and nudges the cat with its nose. The white cat slightly backs away and then lifts its paws to swiftly and continuously swipe at the black dog's face, frightening the dog into retreating repeatedly. Throughout the entire process, the man in the white T-shirt stands by, watching them.", "text": "A man in a white shirt and jeans walks into a room with a tiled floor and two doors decorated with red Chinese ornaments. A white cat runs out from a cage and approaches the man. The cat continues walking towards the man, who is smiling and watching it. A black dog appears and approaches the cat, which then jumps up and starts playing with the dog. The dog and cat engage in playful interaction, with the cat swatting at the dog. The camera then shifts to show the dog moving around a table filled with various food items, including a watermelon, a pot, and several dishes. The man continues to watch the interaction between the cat and the dog.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 457, "extra_info": {"duration": 9.93, "events": ["Man walks past the entrance.", "White cat jumps out of cage.", "Cat follows man to living room center.", "Black dog springs out from left side.", "Dog heads toward white cat.", "Dog nudges cat with nose.", "Cat backs away from dog.", "Cat swipes at dog's face.", "Dog retreats from cat."], "n_events": 9, "n_shots": 1, "n_subjects": 3}, "vid": 457}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "359.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A crab is moving beside a person's foot; the camera switches to a long shot, showing the person squatting down and hammering at something. The camera then switches and pans up for a close-up of the person hammering.", "text": "A crab is seen moving on a rocky surface near a person's boot. The camera then shows a wider view of the beach, revealing two people sitting on the pebbles with a rocky cliff in the background. One of the individuals, wearing a blue coat, is holding a hammer and a stone, preparing to strike the stone. The person then looks down at the stone and begins to hammer it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 359, "extra_info": {"duration": 8.62, "events": ["A crab moves beside a person's foot", "The person squats down", "The person hammers at something"], "n_events": 3, "n_shots": 3, "n_subjects": 2}, "vid": 359}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "114.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the ground, there was a pile of tangled wires. Several tentacles dropped down from above and a fuse was burning. A red-haired girl ran over, lifted a baby from the tangled wires, and a small explosion occurred within the wires. The baby was laughing. The scene then cut to the doorway where the little girl, looking terrified, was holding the baby, and two other red-haired girls stood by the door watching them.", "text": "In a dimly lit room with ornate wallpaper and a large rolled-up carpet, a pile of colorful yarn balls is set on fire, causing a small flame to appear. The yarn balls begin to transform, with a yellow yarn ball rising and forming legs. The legs continue to extend, and a red, furry creature starts to emerge from the pile. The creature fully stands up, revealing a red, furry body and a yellow outfit. The creature, now standing, looks at the burning yarn ball. A girl with long red hair and glasses, wearing a green dress, runs in and picks up the creature, holding it close and smiling.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 114, "extra_info": {"duration": 3.88, "events": ["Tentacles dropped from above.", "A fuse was burning.", "Red-haired girl ran over.", "Girl lifted baby from wires.", "Small explosion occurred within wires.", "Baby laughed.", "Girl held baby looking terrified.", "Two red-haired girls stood by door.", "Red-haired girls watched other girl and baby."], "n_events": 9, "n_shots": 2, "n_subjects": 4}, "vid": 114}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "138.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera sequentially switches between three black and white drawings. The first drawing shows a little boy walking in front of a new village, the second depicts a letter being thrown out of a mailbox slot, with the boy turning his head back to look, and the third shows the little boy standing under a tree, holding a heart-shaped object with both hands.", "text": "A boy stands in front of a door labeled 'MAIL' with his hands on his hips, looking at the door. Suddenly, a letter flies out of the door and hits the boy on the head, causing him to turn his head to the right. The scene then switches to the boy standing in front of a tree, holding a heart-shaped object and looking sad. The scene switches back to the boy in front of the door, still looking at the door.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 138, "extra_info": {"duration": 8.12, "events": ["A little boy walks in front of a village", "A letter is thrown out of a mailbox slot", "The little boy turns his head back to look", "The little boy stands under a tree", "The little boy holds a heart-shaped object"], "n_events": 5, "n_shots": 3, "n_subjects": 1}, "vid": 138}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "305.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person slightly moved their lips towards the camera; the camera angle switched to the side of the man as he opened the door of a storage locker.", "text": "A man, wearing a hat and a dark coat, is seen in a close-up shot with a bandage on his forehead. He is standing in front of a row of lockers, and he uses his right hand to insert a key into the lock of one of the lockers. He then turns the key and opens the locker door.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 305, "extra_info": {"duration": 5.38, "events": ["A person slightly moved their lips towards the camera.", "The man opened the door of a storage locker."], "n_events": 2, "n_shots": 2, "n_subjects": 1}, "vid": 305}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "643.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A hand is lifting the mats layer by layer; the first mat is facing up, the second with its back facing up, and so on alternately. When the hand reaches the fifth mat, the video ends.", "text": "A hand is seen lifting the top layer of a stack of carpets, revealing a label underneath. The hand continues to lift the label, exposing more of the label's details. The hand then places the label back down and lifts the next layer of the stack. The hand repeats the process, lifting another layer and revealing a new label. The hand continues to lift and place the layers, showing the labels on each layer.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 643, "extra_info": {"duration": 6.36, "events": ["A hand lifts the first mat.", "The first mat is facing up.", "The hand lifts the second mat.", "The second mat's back is facing up.", "The hand lifts the third mat.", "The hand lifts the fourth mat.", "The hand reaches the fifth mat.", "The video ends."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 643}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "734.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A pair of hands picked a few fruits of varying sizes from the branches.", "text": "The video begins with a close-up of two apples hanging from a tree branch, surrounded by green leaves and illuminated by sunlight. A hand with dark nail polish reaches towards the apples. The hand grasps one of the apples and pulls it from the branch. The hand then reaches for another apple and pulls it from the tree. The hand continues to pick more apples from the tree, moving from one branch to another. The video ends with the hand holding several apples and moving out of the frame.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 734, "extra_info": {"duration": 17.85, "events": ["Hands reached towards the branches.", "Hands grasped fruits of varying sizes.", "Hands picked fruits from the branches."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 734}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "447.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two boys watched three basins filled with rice and flour, as well as a person holding a colander. The person scooped up some rice from one basin, and the rice fell through the colander back into the basin. Then, they used the colander to scoop up some millet from another basin, and the millet too fell through back into the basin. Lastly, the person used the colander to scoop up some flour from the final basin, but the flour did not fall through. Accompanying the person's actions, a voice in the video asks why the finest flour does not fall through. The background is an ordinary residential room.", "text": "Two children are watching as a person uses a strainer to sift rice from a plastic container into a metal bowl. The person continues to sift the rice, ensuring it is evenly distributed into the bowl. The person then sifts yellow powder from the same metal bowl into another container. Finally, the person sifts white powder from a plastic container into the same metal bowl. The background includes a kitchen setting with various items such as a microwave, a tripod, and some food items on the counter.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 447, "extra_info": {"duration": 8.63, "events": ["Two boys watched three basins.", "A person scooped rice with a colander.", "Rice fell through the colander.", "The person scooped millet with the colander.", "Millet fell through the colander.", "The person scooped flour with the colander.", "Flour did not fall through the colander.", "A voice asked why flour did not fall through."], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 447}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "42.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A massive robot and a person run toward the elevator with other robots chasing after them. The person presses the elevator button, but the elevator doors won't close because a dog is crouching in the middle of the doorway. The dog turns its head to look at the person.", "text": "In a futuristic cityscape with numerous robots and bright lights, a girl with purple hair runs towards an elevator as a robot follows her. The girl reaches the elevator and presses the button to close the door. The door begins to close, and a small dog, which was standing outside, starts to move towards the door. The door fully closes, and the dog turns around to face the girl.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 42, "extra_info": {"duration": 6.38, "events": ["A massive robot and a person run toward the elevator.", "Other robots chase after the massive robot and the person.", "The person presses the elevator button.", "The elevator doors fail to close.", "A dog crouches in the elevator doorway.", "The dog turns its head to look at the person."], "n_events": 6, "n_shots": 1, "n_subjects": 4}, "vid": 42}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "107.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A little boy, frightened, stands upright and then starts running forward. On both sides, there are three little girls watching him. The camera switches to the hallway, where the boy rushes down the stairs with the three girls following closely behind. The little boy stops at the stairway entrance, glances left and right, and then throws a blue ball from the left side of the screen, followed by the appearance of two little birds.", "text": "In a room filled with books and various objects, a character with a red hat and a red-haired girl with glasses stand together, looking surprised. The character with the red hat gestures with his hands and speaks, then turns and runs towards the door. The camera shifts to show the character with the red hat and the red-haired girl running down the stairs, followed by three other characters. As they descend, they drop three balls, one blue, one pink, and one yellow, onto the floor.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 107, "extra_info": {"duration": 4.38, "events": ["Little boy stands upright.", "Little boy starts running forward.", "Three little girls watch the boy.", "Little boy rushes down the stairs.", "Three little girls follow the boy.", "Little boy stops at the stairway entrance.", "Little boy glances left and right.", "Little boy throws a blue ball.", "Two little birds appear."], "n_events": 9, "n_shots": 2, "n_subjects": 3}, "vid": 107}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "627.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features several quick-cut shots, displaying various patterns and numbers, until a shot reveals \"HEAD\", followed by \"Image Start\", and then a countdown number appears inside a circle on the screen. Concurrently, a radius begins at the 12 o'clock position, sweeping clockwise around the entire circle. When the countdown number reaches 2, the screen turns black, and remains so until the video ends.", "text": "The video begins with a black screen, followed by the appearance of the word 'HEAD' in large, bold letters. The scene then transitions to a countdown timer starting from 10, displayed within a circular target-like graphic. The countdown continues, showing the numbers 9, 8, 7, 6, 5, 4, 3, 2, and finally 1, before the screen returns to black.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 627, "extra_info": {"duration": 12.96, "events": ["Video displays various patterns.", "Video displays various numbers.", "Shot reveals the word 'HEAD'.", "'Image Start' appears on screen.", "Countdown begins with a number inside a circle.", "A radius sweeps clockwise around the circle.", "Countdown number decreases.", "Countdown reaches number 2.", "Screen turns black.", "Video ends with black screen."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 627}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "832.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a street, a motorcycle carrying two people is seen. The rider is donned in a blue cowboy outfit, while the passenger, who has long hair, is dressed in pink and is seated behind the rider. When the traffic light turns red, causing the other vehicles to halt, this motorcycle blatantly flouts the traffic rule and runs the red light. An attempt by the traffic police to intervene is completely ignored by the rider in the blue cowboy suit. Disregarding the police warning, the rider continues his journey, showing a grave lack of respect towards the law.", "text": "A man wearing a helmet rides a motorcycle with a woman in a pink top and denim shorts sitting behind him, passing a colorful bus. The traffic light changes from yellow to red. The man on the motorcycle stops at the red light, and a police officer on the side of the road signals the motorcycle to move forward. The motorcycle continues to move forward as the police officer gestures and then walks away. The man on the motorcycle looks back at the police officer and then turns his head forward. The traffic light changes from red to yellow.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 832, "extra_info": {"duration": 6.01, "events": ["A motorcycle carries two people on a street.", "The rider wears a blue cowboy outfit.", "The passenger in pink sits behind the rider.", "The traffic light turns red.", "Other vehicles halt at the red light.", "The motorcycle runs the red light.", "The traffic police attempt to intervene.", "The rider in the blue cowboy suit ignores the police warning.", "The rider continues the journey."], "n_events": 9, "n_shots": 6, "n_subjects": 4}, "vid": 832}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "927.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a swimming pool, a kid outfitted with swimming goggles and donned in red shorts, struggles in the water, nearly drowning. The child's hands are continuously seen bobbing on the surface as he attempts to rise, yet sinks further in. Nearby, another individual, identifiable via a long braid and a swimming suit with goggles resting on their head, stands at the pool's edge. They reach out with their left hand to aid the struggling kid but are just out of reach. Swiftly, they turn to the nearby fence to retrieve tools to assist the child in peril.", "text": "In a swimming pool surrounded by residential buildings, a woman in a plaid swimsuit and goggles stands at the edge of the pool, gesturing with her hands as a boy swims towards her. The boy, wearing goggles, continues to swim towards the woman, who keeps her hands extended towards him. The boy reaches the woman and starts to swim past her, moving further into the pool. The woman watches as the boy swims away, and she begins to move towards the pool's edge. \nThe camera then shifts focus, capturing the surrounding environment, including the pool area and the buildings in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 927, "extra_info": {"duration": 6.97, "events": ["Kid outfitted with swimming goggles struggles in water", "Kid's hands bob on the surface", "Kid attempts to rise in water", "Kid sinks further in the pool", "Individual stands at the pool's edge", "Individual reaches out with left hand", "Individual is out of reach of kid", "Individual turns to the fence", "Individual retrieves tools from the fence", "Tools are intended to assist the struggling kid"], "n_events": 10, "n_shots": 2, "n_subjects": 3}, "vid": 927}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "304.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A black player leaps into the air to complete a dunk. The camera then cuts to the bench, where the coach raises his fist in excitement, and the substitute players stand up, stirred by the action. Nearby, a reporter hurriedly snaps photos. The camera switches to a frontal view of the substitute players, who, along with the audience, are cheering. The players continue to move around on the court while the camera shifts to an elderly person, panning from right to left across the elder's head. The background of the video is a basketball game.", "text": "A basketball player in a red jersey with the number 10 jumps and dunks the ball into the hoop. The crowd and teammates react with excitement, cheering and celebrating. A man in a suit, likely a coach or team member, raises his fist in celebration. The players in white jerseys, representing 'Texas Western,' also cheer and celebrate the moment. The camera then shifts to an older man in a suit, who is watching the scene intently.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 304, "extra_info": {"duration": 7.62, "events": ["Black player completes a dunk.", "Coach raises fist in excitement.", "Substitute players stand up.", "Reporter snaps photos.", "Substitute players cheer.", "Audience cheers.", "Players continue to move on court.", "Camera pans across elderly person's head."], "n_events": 8, "n_shots": 4, "n_subjects": 4}, "vid": 304}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "6.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two animated birds fly towards the door lock first and have a brief conversation. Then they fly out of a stone window into the sunlight.", "text": "A blue bird with a pink scarf flies towards a wooden door with a brass knocker. Two more blue birds join, and they all fly around the door. The birds land on the knocker and start talking to each other. They then take off and fly towards a window in a stone wall. The birds reach the window and look inside.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 6, "extra_info": {"duration": 6.13, "events": ["Two animated birds fly towards the door lock.", "Animated birds have a brief conversation.", "Animated birds fly out of the stone window.", "Animated birds enter the sunlight."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 6}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "109.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man and a woman are arguing face to face, getting closer as they become more heated in their dispute, ultimately reaching the point where their noses are pressed against each other as they continue to push forward while arguing. Several children nearby watch them, frightened, until the two almost have their faces touching, at which point they both suddenly realize in surprise.", "text": "In a room filled with bookshelves and scattered books, a group of characters, including a boy, a girl with glasses, and two smaller characters, stand together. The boy and the girl with glasses move closer to each other, and the boy begins to speak. The camera zooms in on the boy and the girl with glasses as they continue their conversation. The boy and the girl with glasses maintain eye contact and continue talking, while the other characters watch them. The camera then zooms out, showing the group of characters in the room again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 109, "extra_info": {"duration": 4.88, "events": ["A man and a woman start arguing.", "The man and the woman get closer during the argument.", "The man and the woman's noses press against each other.", "The man and the woman continue to push forward while arguing.", "Several children nearby begin to watch the argument.", "The children appear frightened by the argument.", "The man and the woman's faces almost touch.", "The man and the woman suddenly realize in surprise."], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 109}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "101.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A soldier propped himself up on one arm in bed, looked at the dog in front of him, then extinguished the kerosene lamp and lay down on his side to sleep.", "text": "A man is lying on a cot in a dimly lit room, with a lantern on a table beside him and an open book. He smiles and looks at a dog sitting in front of him. The man then reaches out and turns off the lantern, causing the room to become darker. He turns his head and lies down on the cot, appearing to fall asleep. The scene then switches to a different man in a different room, who is also lying on a cot and looking at the camera.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 101, "extra_info": {"duration": 3.12, "events": ["Soldier propped up on one arm in bed.", "Soldier looked at the dog.", "Soldier extinguished the kerosene lamp.", "Soldier lay down on his side.", "Soldier went to sleep."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 101}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "589.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a peach grove. The man is dressed in a blue short-sleeved shirt, tightly gripping a pair of tongs as he picks a peach from the tree. He turns around and places it into a basket on the ground. The camera switches to show the man looking upwards, reaching to take a peach off the tongs, and then turning towards the camera.", "text": "A man in a blue shirt and black pants is standing in a lush green orchard, using a long-handled tool to pick peaches from a tree. Two baskets filled with peaches are placed on the ground in front of him. He repeatedly extends the tool to reach the peaches on the tree and places them into the baskets. The man continues this process, moving the tool from one basket to another to collect the peaches. The background shows a dense orchard with green leaves and several peaches hanging from the trees. The man then uses the tool to pick a peach directly from the tree, bringing it down and placing it into the basket.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 589, "extra_info": {"duration": 3.98, "events": ["Man grips a pair of tongs", "Man picks a peach from the tree", "Man turns around", "Man places the peach into a basket", "Man looks upwards", "Man reaches to take a peach", "Man takes a peach off the tongs", "Man turns towards the camera"], "n_events": 8, "n_shots": 2, "n_subjects": 1}, "vid": 589}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "913.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "This video clip features a football game between a team in white tops and dark blue shorts and another team dressed in black jerseys. The player in white, wearing No. 8, attempts a shot at the goal which is effectively blocked by the goalkeeper clothed in yellow. This negates a score for the white team. The goalkeeper, a member of the black team donned in a yellow jersey, can be seen smiling broadly in the video.", "text": "On a soccer field, players from two teams are actively engaged in a match. A player in a white jersey receives the ball and passes it to a teammate, who then kicks the ball towards the goal. The goalkeeper in a yellow jersey dives to his left in an attempt to save the ball but misses. The ball goes into the net, and the goalkeeper lands on the ground. The goalkeeper then gets up from the ground. The scene then switches back to the field where the players continue their game.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 913, "extra_info": {"duration": 5.0, "events": ["Player in white wearing No. 8 attempts a shot at goal", "Goalkeeper clothed in yellow effectively blocks the shot", "Goalkeeper from black team prevents a score for the white team", "Goalkeeper in yellow jersey smiles broadly"], "n_events": 4, "n_shots": 2, "n_subjects": 3}, "vid": 913}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "466.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Three young men, each with both hands clutching a rope, ran forward together. One of them was wearing a life jacket. They ran to the riverbank, swinging out an arc over the water, aiming for the opposite shore. When they reached the middle of the river, the young man in white clothes fell into the water first, followed by the young man in the life jacket. The remaining young man held onto the rope, swinging a bit further, but he too quickly fell into the water. The young men held onto the rope and paddled forward some distance in the river before the video ended. Many people were watching them by the riverbank.", "text": "Two men, one wearing a red life jacket and the other in a white shirt, hold onto a rope and prepare to swing. They begin to swing forward, gaining momentum. The men swing over a grassy area and towards a body of water. They continue swinging and enter the water, creating splashes. The men swing further into the water, with one man still holding onto the rope. The men are seen in the water, with one man still holding onto the rope and the other man starting to let go.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 466, "extra_info": {"duration": 9.2, "events": ["Three young men clutched a rope and ran forward.", "The young men swung out over the river.", "One young man in white clothes fell into the water.", "One young man in a life jacket fell into the water.", "The last young man fell into the water.", "The young men paddled forward in the river."], "n_events": 6, "n_shots": 1, "n_subjects": 4}, "vid": 466}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "506.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Inside the room, a little white cat spread its hind legs and ran forward. It ran past a door and arrived in front of a black table, then crawled under the table.", "text": "A white cat is seen running on a tiled floor, moving towards a doorway. The cat continues running through the doorway into another room. The cat then turns and runs further into the room, heading towards a corner. The cat approaches a small table and a chair, and starts to circle around them. The cat then moves towards a small table with a bag on it, and begins to circle around it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 506, "extra_info": {"duration": 4.3, "events": ["A little white cat spread its hind legs.", "The little white cat ran forward.", "The little white cat ran past a door.", "The little white cat arrived in front of a black table.", "The little white cat crawled under the black table."], "n_events": 5, "n_shots": 1, "n_subjects": 1, "old_duration": 4.3, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/106.mp4"}, "vid": 506}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "796.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the bustling street, a red bus approaches, with motorcycles passing by in front, followed by some cars driving behind. A person riding a bicycle appears from the right and heads forward, while pedestrians move about on the road.", "text": "The video shows a busy street scene with a red double-decker bus, motorcycles, cars, and a cyclist moving through the intersection. The background features two large, ornate buildings with classical architecture, and the street is lined with pedestrians walking on the sidewalks. The traffic lights change, allowing vehicles to move forward while pedestrians cross the street. The scene is lively with various types of vehicles, including taxis, vans, and bicycles, navigating the intersection. The video captures the dynamic movement of the vehicles and the continuous flow of pedestrians.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 796, "extra_info": {"duration": 20.19, "events": ["A red bus approaches.", "Motorcycles pass by in front.", "Cars drive behind the bus.", "A person on a bicycle appears.", "The person rides the bicycle forward.", "Pedestrians move about on the road."], "n_events": 6, "n_shots": 1, "n_subjects": 6}, "vid": 796}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "619.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a barren loess land with no grass or trees, a leopard is running in pursuit of a quickly moving small object. They pass by another leopard walking in the opposite direction. Upon catching up with the small object, the chasing leopard pounces towards it, kicking up a cloud of dust, but the small object still manages to escape. The leopard watches the direction in which the small object fled and then slows to a gentle run forward.", "text": "In a dry, open landscape with scattered trees, a cheetah is seen running towards a toy on the ground. The cheetah leaps into the air, aiming for the toy. It lands and continues to chase the toy, which is being pulled along the ground. The cheetah makes contact with the toy, causing it to kick up dust as it runs. The cheetah slows down and eventually stops, lying on the ground to catch its breath. After a brief rest, the cheetah gets up and walks away from the toy.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 619, "extra_info": {"duration": 16.92, "events": ["A leopard runs after a small object.", "A small object moves quickly.", "Another leopard walks in the opposite direction.", "The chasing leopard pounces towards the small object.", "A cloud of dust rises.", "The small object escapes.", "The leopard watches the small object's escape direction.", "The leopard slows to a gentle run."], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 619}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "696.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person holds an envelope in their right hand and a lighter in their left. They light the bottom left corner of the envelope, and it continues to burn, moving towards the right side of the person. The video ends after most of the paper has been burned away.", "text": "A person holds a blank white sheet of paper against a black background and uses a lighter to ignite the bottom left corner of the paper. The flame starts small and gradually grows as the paper catches fire. The fire spreads along the bottom edge of the paper, creating a visible burn mark. The paper continues to burn, with the flames becoming more intense and spreading further up the paper. The paper is almost completely engulfed in flames, with the fire reaching the top edge. The paper is now fully burning, with the flames consuming the entire bottom and part of the top edge.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 696, "extra_info": {"duration": 20.32, "events": ["A person holds an envelope in their right hand.", "A person holds a lighter in their left hand.", "The person lights the bottom left corner of the envelope.", "The flame moves towards the right side of the envelope.", "The envelope continues to burn.", "The video ends after most of the envelope has burned away."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 696}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "662.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In an open field, two groups of horses gallop from right to left across the plain. The dust raised by their hooves catches the red glow of the setting sun. The camera follows the herd, moving from a position far behind the horses, gradually transitioning to the side of the herd.", "text": "In a vast desert landscape at sunset, a group of horses is seen running across the terrain, kicking up dust as they move. The camera follows the horses, capturing their movement from a distance and gradually getting closer. The horses continue to run, creating a trail of dust behind them, and the camera angle shifts to show different perspectives of their run. The scene is set against a backdrop of mountains and a setting sun, casting a warm glow over the entire scene.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 662, "extra_info": {"duration": 28.4, "events": ["Two groups of horses gallop across the plain.", "Horses raise dust with their hooves.", "Dust catches the glow of the setting sun.", "Camera follows the herd of horses.", "Camera moves from behind to the side of the herd."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 662}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "613.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a soccer field, and in the distance, yellow and blue chairs can be seen, as well as a patch of trees. Four men dressed in white short sleeves run into the soccer field with their backs to the camera, then stand facing each other in pairs. The man on the right kicks the soccer ball to the left, kicks it back, and then kicks it to the man beside him. The camera follows their movements.", "text": "Four men in white shirts walk through a gate onto a soccer field, with the camera following them from behind. They continue walking towards the field, and the camera angle shifts to show them from the front as they approach the center of the field. The men start to spread out and begin to interact with a soccer ball on the field. One man begins to dribble the ball while the others watch and move around. The man continues to control the ball, moving it around the field as the others observe.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 613, "extra_info": {"duration": 8.32, "events": ["Four men run into the soccer field", "Pairs of men stand facing each other", "Man on the right kicks soccer ball to the left", "Receiver kicks ball back to the man on the right", "Man on the right kicks ball to man beside him"], "n_events": 5, "n_shots": 1, "n_subjects": 4}, "vid": 613}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "401.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In front of a white wall, a gentleman wearing a hooded sweatshirt and a lady in a short skirt danced together to an upbeat tune.", "text": "A man stands in front of a gray shutter door, looking to his left and then forward, with a woman standing behind him. The man turns to face forward, and the woman steps out from behind him. The man and woman start dancing together, moving in sync with each other. They perform various dance moves, including spins and coordinated steps. The man and woman continue their dance routine, moving around each other and performing different dance moves. They end their dance by facing each other and then turning to face the camera.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 401, "extra_info": {"duration": 15.18, "events": ["The gentleman donned a hooded sweatshirt.", "The lady chose a short skirt.", "They paired up.", "The upbeat tune started.", "The gentleman and the lady danced.", "Their feet shuffled to the rhythm.", "They spun together.", "Their hands clapped.", "They twirled individually.", "They synchronized their movements."], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 401}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "735.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The eagle soars in the sky, the camera follows and films it, then it smoothly lands on a cliff face.", "text": "A vulture is seen flying in the clear blue sky, gradually descending as it approaches a rocky cliff. The vulture continues its descent, passing by clouds and trees, and gets closer to the cliff edge. The vulture lands on the edge of the cliff, spreading its wings as it settles.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 735, "extra_info": {"duration": 24.49, "events": ["The eagle soars in the sky.", "The camera follows the eagle.", "The eagle lands on a cliff face."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 735}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "224.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera pans left, the man stands up and steps back, a woman holding flowers walks forward, and the girl is sitting on the bed, holding something in her hand.", "text": "In a dimly lit room with a greenish hue, a man in a suit bends down and moves towards a table, passing by a wall adorned with framed photos and a decorative hanging light. A woman holding a vase of flowers approaches the table and places the vase on it. The man stands up and moves out of the frame, while the woman continues to place the vase on the table. The camera then pans left, revealing a woman sitting on a bed, holding and looking at a phone.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 224, "extra_info": {"duration": 4.88, "events": ["The camera pans left", "The man stands up", "The man steps back", "A woman holding flowers walks forward", "The girl is sitting on the bed", "The girl holds something in her hand"], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 224}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "292.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a street bustling with people, a man in a green shirt handed over his camera to the woman in front of him who was wearing a hat. The woman took a puff of her cigarette with her right hand and then threw it to the ground, exhaling the smoke. She then fiddled with the camera using both hands, and the lens quickly turned to the man in the green shirt. He gestured with his hands as he walked towards the woman.", "text": "In a park with many people, a man in a blue sweater and a colorful scarf hands a camera to a woman wearing a beanie and a denim jacket. The woman takes the camera from the man and starts walking forward while looking at the camera. She continues to walk, examining the camera closely, and then raises it to take a photo. The man gestures with his hand, signaling her to come back, and she returns to him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 292, "extra_info": {"duration": 6.88, "events": ["Man in green shirt handed over camera to woman.", "Woman with hat took puff of cigarette.", "Woman threw cigarette to ground.", "Woman exhaled smoke.", "Woman fiddled with camera.", "Camera lens turned to man in green shirt.", "Man gestured with hands.", "Man walked towards woman."], "n_events": 8, "n_shots": 2, "n_subjects": 3}, "vid": 292}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "105.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the trench, a soldier wearing glasses looked ahead. To his left, another soldier stood up, holding a dog and then also looked forward. To the left of the soldier holding the dog, another soldier stood up and faced forward. The camera then switched to what they were looking at: the sun was rising on the horizon, and a bird perched on the antlers of a deer nearby.", "text": "A soldier with glasses and a rifle looks to the left. Another soldier with a dog on his shoulder appears from behind the first soldier. The dog, a beagle, looks to the left and then to the right. A third soldier emerges from behind the first soldier. The three soldiers and the dog look to the left, then to the right, and finally forward. The camera then shows a bird perched on a barbed wire fence against a sunset sky. The scene returns to the soldiers and the dog, who are now looking forward.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 105, "extra_info": {"duration": 5.38, "events": ["A soldier in the trench looked ahead.", "Another soldier stood up while holding a dog.", "A third soldier stood up.", "All soldiers faced forward.", "The sun rose on the horizon.", "A bird perched on the deer's antlers."], "n_events": 6, "n_shots": 2, "n_subjects": 4}, "vid": 105}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "562.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is an interior with colorful lighting, surrounded by many mirrors. The girl runs forward and suddenly another girl appears from the right. The two collide, stumble backward, and fall to the ground, with their images of kneeling on the floor reflected in the mirrors.", "text": "A child wearing a white shirt and black shorts runs through a brightly lit, colorful tunnel with arches and swirling patterns. As the child continues running, another child appears from the opposite direction, and they collide. Both children fall to the ground, and the camera moves forward, capturing the scene from a closer perspective.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 562, "extra_info": {"duration": 2.75, "events": ["The girl runs forward", "Another girl appears from the right", "The two collide", "The two stumble backward", "The two fall to the ground", "Their images are reflected in the mirrors"], "n_events": 6, "n_shots": 1, "n_subjects": 2, "old_duration": 2.88, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/162_no_audio.mp4"}, "vid": 562}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "266.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person is riding a horse forward, the camera cuts to the back of the rider. Another person is running on the road, the camera cuts to the front of the person running on the road. Then it switches to the front of the rider, who raises a whip. The camera cuts again, and a group of people are running on the road. The rider swings the whip and knocks down the people in front, and there are also people falling to the ground as they run behind.", "text": "A man in black is riding a white horse, holding the reins and looking around. The camera shifts to show the man from behind as he rides forward, with a man in a blue coat running ahead of him. The man in black raises his whip and swings it at the man in the blue coat, who is running frantically. The man in black continues to ride forward, swinging his whip at other people running ahead of him, causing them to fall to the ground. The man in black rides past several people who have fallen, continuing his pursuit.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 266, "extra_info": {"duration": 13.12, "events": ["A person rides a horse forward.", "Another person runs on the road.", "The rider raises a whip.", "A group of people run on the road.", "The rider swings the whip.", "The whip knocks down people in front.", "People fall to the ground while running."], "n_events": 7, "n_shots": 8, "n_subjects": 3}, "vid": 266}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "216.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man ran past the tent, then slowly stopped in front of the gate, while electric sparks emitted from the top of the tent.", "text": "A man is running inside a large, dimly lit tent with a high ceiling and arched windows, with a bright light source visible through the windows. He continues running, passing by a column and moving towards the right side of the tent. The man runs past a cross and heads towards a door on the right side of the tent. He exits the tent through the door, entering a darker area outside the tent. The man walks further into the darker area, with the tent's interior still visible in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 216, "extra_info": {"duration": 4.25, "events": ["The man ran past the tent.", "The man slowly stopped.", "Electric sparks emitted from the tent top."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 216}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "223.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The blond man is baring his teeth and looking ahead. The blond man is pushed away by the person in front of him, and there is someone behind being hung up by a rope. Then another man pushes the black person forwards. The blond man at the back is hitting the person who is hung up, and then there is a close-up of the black person turning his face around.", "text": "In a dimly lit room with wooden beams and various objects hanging from the ceiling, a man in a green shirt turns his head to the right and punches another man in the face. The man who was punched turns his head to look at the man who punched him. The man who punched then turns and walks away, while the man who was punched continues to look at him. The man who punched then turns and walks towards the camera, speaking as he moves. In the background, another man is seen hanging from the ceiling, and other individuals are moving around.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 223, "extra_info": {"duration": 7.62, "events": ["The blond man bares his teeth.", "Person in front pushes the blond man away.", "Someone is hung up by a rope behind.", "Another man pushes the black person forwards.", "The blond man at the back hits the person hung up.", "The black person turns his face around."], "n_events": 6, "n_shots": 2, "n_subjects": 4}, "vid": 223}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "932.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features two individuals seated at a table. On the right is a person with short hair, garbed in a black t-shirt and gray jeans, accessorized by green sunglasses. This person is seated in a chair with an empty chair next to them, reading a newspaper with their right hand and holding a yellow pen in their left. Suddenly, another individual with long hair enters from the left. Dressed in a white shirt, black skirt, pink pantyhose, and black shoes, this person places both hands on the table and requests to use the empty chair. However, the person on the right refrains from responding, instead resting their feet on the chair and nonchalantly twirling the pen in their left hand, clearly indicating refusal. The person on the left expresses disappointment by raising both hands and walking away.", "text": "A woman approaches a man sitting at an outdoor table in front of a store, asking if he is using the chair. The man, wearing a blue t-shirt and sunglasses, responds by moving his chair and saying 'Yes.' The woman, dressed in a white shirt, skirt, and pink socks, turns around and gestures with her hands, then walks away. The man continues to read a newspaper while the woman leaves.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 932, "extra_info": {"duration": 5.0, "events": ["Person reads a newspaper with right hand.", "Person holds a yellow pen with left hand.", "Individual with long hair enters from the left.", "Individual with long hair places both hands on the table.", "Individual with long hair requests to use the empty chair.", "Person rests feet on the chair.", "Person twirls the pen in left hand.", "Individual with long hair raises both hands.", "Individual with long hair walks away."], "n_events": 9, "n_shots": 1, "n_subjects": 3}, "vid": 932}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "537.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera captures a low-angle shot of a man wearing a black coat moving a table to the doorway. After setting it down, he turns and walks towards the camera, bends over to pick up a chair, and then heads back towards the doorway. Turning to look at the camera, he swings one arm in front of the lens.", "text": "In a classroom, a man in a suit is seen pushing a desk and chair towards the back of the room. He continues to move the desk and chair, positioning them near the door. The man then turns around and walks back towards the camera, adjusting his suit jacket. He picks up another chair and places it next to the desk. Finally, he turns around and starts to push the desk and chair back towards the camera.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 537, "extra_info": {"duration": 7.3, "events": ["Camera captures a low-angle shot of a man.", "Man wearing black coat moves a table.", "Man positions the table at the doorway.", "Man turns and walks towards the camera.", "Man bends over to pick up a chair.", "Man carries the chair towards the doorway.", "Man turns to look at the camera.", "Man swings his arm in front of the lens."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 537}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "351.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man is sitting with his mouth moving slightly. The camera cuts to the man sitting at the living room table, looking at something in front of him, then cuts to the man observing the outside through the window.", "text": "A man is sitting in a wheelchair in a busy outdoor area with people in the background. The scene transitions to the man sitting on a kitchen counter, looking upwards in a dimly lit room with modern decor, including a table, chairs, and hanging lights. The man then stands up and walks to the window, looking out at the cityscape outside.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 351, "extra_info": {"duration": 20.75, "events": ["The man sits with his mouth moving slightly.", "The man looks at something on the living room table.", "The man observes the outside through the window."], "n_events": 3, "n_shots": 3, "n_subjects": 1}, "vid": 351}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "725.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A black woman wearing glasses is taking a selfie with her phone; she waves at the camera. In the distance, a man is also waving at the camera. Then, facing the camera, he performs a series of movements and strikes various poses.", "text": "In a forest with sunlight filtering through the trees, a woman with curly hair and glasses is holding a camera and smiling while waving her hand. Behind her, a man in a beige shirt and dark pants starts dancing, raising his arms and moving energetically. The woman continues to wave and smile at the camera, occasionally looking at the man. The man dances with various arm movements, including raising his arms and spinning around. The woman turns her head to look at the man while he continues his dance. The man claps his hands and continues dancing, moving closer to the woman. The woman keeps waving and smiling at the camera, while the man dances energetically in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 725, "extra_info": {"duration": 12.38, "events": ["A woman takes a selfie with her phone.", "A woman waves at the camera.", "A man in the distance waves at the camera.", "The man performs a series of movements.", "The man strikes various poses."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 725}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "455.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Underwater, a person wearing a wetsuit and fins lifts a square rock with both hands. Another person with a net bag swims up to the square rock, extends the net underneath it, where there are several large lobsters, and drives all the lobsters into the net bag with one hand.", "text": "A diver in a blue wetsuit is seen underwater, holding a net and a tool, and begins to move towards a submerged structure. The diver approaches a concrete slab on the ocean floor, which is covered in marine growth and surrounded by various underwater plants. As the diver gets closer, a group of lobsters becomes visible under the slab, moving around and interacting with each other. The diver uses the net to scoop up the lobsters, which are seen crawling and moving under the slab. The diver successfully catches several lobsters with the net, lifting them out from under the slab.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 455, "extra_info": {"duration": 8.6, "events": ["Person dives underwater.", "Person wears wetsuit and fins.", "Person lifts square rock underwater.", "Second person swims up to square rock.", "Second person extends net under square rock.", "Several large lobsters appear under rock.", "Second person drives lobsters into net bag."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 455}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "509.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On an outdoor dirt path, the camera captures from the front several puppies running towards the camera. As the puppies approach, the camera keeps retreating. Finally, the camera shoots from a low angle, focusing on one of the cream-colored puppies as it runs.", "text": "A small brown puppy is running on a sandy path with sparse vegetation, followed by a white puppy. More puppies join the chase, running in a line behind the first two. The camera focuses on a light-colored puppy running towards it, getting closer with each frame. The puppy continues to run, with the background showing a clear sky and distant trees. The camera then shifts to show the brown puppy again, still running on the sandy path.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 509, "extra_info": {"duration": 10.1, "events": ["Several puppies run towards the camera.", "The camera retreats from the approaching puppies.", "One cream-colored puppy becomes the focus.", "The puppy runs on an outdoor dirt path."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 509}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "467.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A little dog has one of its front paws on a skateboard, while the other paws forcefully push off the ground. In front of a flower bed, the dog turns and places all of its paws on the skateboard, which glides through the crowd. With skilled movements, the dog weaves and turns among the people. The camera consistently captures the action from the dog’s perspective, following its path as it moves along.", "text": "A small dog wearing a bright green jacket is seen riding a skateboard on a busy street, with several people standing nearby watching. The dog continues to skate forward, maintaining its balance on the skateboard as it moves past various pedestrians and storefronts. The dog skates through a crowded area, navigating around people and obstacles, and continues to move forward with ease. The dog skates further down the street, passing by more pedestrians and shops, and eventually moves away from the camera's view. The dog is seen from behind as it skates further down the street, with the camera capturing the bustling street scene in the background. The dog continues to skate forward, moving away from the camera and further into the distance.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 467, "extra_info": {"duration": 9.16, "events": ["A little dog places one paw on a skateboard.", "A little dog pushes off the ground with other paws.", "In front of a flower bed, the dog turns.", "The dog places all paws on the skateboard.", "The skateboard glides through the crowd.", "The dog weaves among the people.", "The dog turns skilled movements."], "n_events": 7, "n_shots": 1, "n_subjects": 2, "old_duration": 9.49, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/67_no_audio.mp4"}, "vid": 467}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "296.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man in white clothes stands leaning against a tree, the camera cuts to a soccer field, where a flock of birds takes flight from the field.", "text": "A man in a white shirt and blue jeans is seen leaning over a brick wall, appearing to be in distress, with his hands on his head. The scene then transitions to a sports field with goalposts and floodlights, where a large number of birds are gathered on the ground. The birds begin to take off, flying in a coordinated manner across the field.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 296, "extra_info": {"duration": 9.12, "events": ["A man stands leaning against a tree.", "Birds take flight from the soccer field."], "n_events": 2, "n_shots": 2, "n_subjects": 1}, "vid": 296}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "545.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman dismounted from the horseback, grabbed the saddle with an external force, stepped on the surface of the water, and moved forward following the horse. She turned her head to look at the camera, with water splashing around. The camera followed with the motion.", "text": "A woman is riding a brown horse through shallow water, creating splashes as they move. The background shows a shoreline with trees and buildings. The woman, initially shirtless, starts to put on a beige top while still riding the horse. She continues to ride the horse, now wearing the top, and the horse splashes water as they move. The woman appears to be enjoying the ride, smiling as she looks back.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 545, "extra_info": {"duration": 8.33, "events": ["The woman dismounted from the horseback.", "The woman grabbed the saddle.", "The woman exerted external force.", "The woman stepped on the water surface.", "The woman moved forward.", "The horse moved.", "The woman turned her head towards the camera.", "Water splashed around.", "The camera followed the motion."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 545}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "586.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is an interior undergoing renovation. The man is wearing a mask and dressed in a green short-sleeved shirt, tightly gripping a sledgehammer and repeatedly smashing the wall. The wall cracks, and bricks and plaster fall off one after another.", "text": "A man wearing a green shirt and blue gloves is using a long-handled tool to break through a concrete wall. He starts by hitting the lower part of the wall, creating a small hole. As he continues to strike, the hole expands, and more bricks and concrete fall to the ground. The man keeps hitting the wall, causing larger sections to collapse and more debris to accumulate on the floor. The wall continues to break down, with significant portions of it falling away, revealing the interior structure behind it. The man persistently strikes the wall, further enlarging the hole and causing more bricks to fall. The wall is almost completely destroyed, with large chunks of concrete and bricks scattered on the floor.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 586, "extra_info": {"duration": 6.0, "events": ["Interior is undergoing renovation.", "Man wears a mask.", "Man dresses in a green short-sleeved shirt.", "Man tightly grips a sledgehammer.", "Man repeatedly smashes the wall.", "Wall cracks.", "Bricks fall off.", "Plaster falls off."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 586}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "404.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a tree-lined path, clusters of boys and girls carrying backpacks walked along, when a boy ran into the middle of the road and broke into a street dance. Some turned their heads to watch him dance, while others continued on their way.", "text": "A group of people, including students in school uniforms, are walking on a sidewalk surrounded by trees and greenery. A person wearing a black cap, blue shirt, and black pants starts dancing in the middle of the sidewalk, moving energetically and changing positions. The onlookers, some of whom are also in school uniforms, stop and watch the dancer. The dancer continues to perform various moves, including arm gestures and leg kicks, while the onlookers remain engaged. The dancer then turns and walks away from the group, heading towards the left side of the frame.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 404, "extra_info": {"duration": 10.47, "events": ["Boys and girls walked along a tree-lined path.", "A boy ran into the middle of the road.", "The boy broke into a street dance.", "Some boys and girls turned their heads to watch.", "Others continued on their way."], "n_events": 5, "n_shots": 1, "n_subjects": 2, "old_duration": 10.47, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/4.mp4"}, "vid": 404}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "213.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman standing on a ladder hands a feather to a little elephant, the rope gets burned and then snaps, the woman falls from the ladder, and a person below quickly reacts.", "text": "A girl, standing on a platform, extends her hand to an elephant wearing a firefighter's hat and a yellow vest, who is holding a piece of paper. The elephant takes the paper from the girl's hand. The camera then shows a close-up of a rope catching fire. The girl pulls the rope, causing the elephant to be lifted into the air. The scene shifts to a clown running on a rotating platform, with an audience watching in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 213, "extra_info": {"duration": 5.12, "events": ["A woman stands on a ladder.", "The woman hands a feather to a little elephant.", "A rope gets burned.", "The rope snaps.", "The woman falls from the ladder.", "A person below quickly reacts."], "n_events": 6, "n_shots": 4, "n_subjects": 3}, "vid": 213}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "2.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "One character raises their hand and then puts it down. The other character walks away and is followed by the former, both characters go up a staircase together.", "text": "In a dimly lit industrial setting with large windows and various mechanical structures, a man with white hair raises his hand and gestures for a woman with red hair to turn around. The woman turns and starts walking away, while the man lowers his hand and turns to follow her. They walk together through the industrial area, passing by large machinery and metal structures. The man and woman continue walking side by side, moving deeper into the industrial environment.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 2, "extra_info": {"duration": 9.67, "events": ["One character raises their hand.", "The same character puts their hand down.", "The other character walks away.", "The first character follows the other.", "Both characters go up a staircase together."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 2}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "202.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A young male speaks in an outdoor snowy setting at night. The scene then cuts to two other individuals sitting by a fire. There are no significant movements or actions besides the speaking and facial expressions.", "text": "In a snowy outdoor setting with trees and a lit candle in the background, a boy wearing a fur-lined coat looks down and then raises his head to speak. A girl, also in a fur-lined coat, looks up at the boy. The boy continues to speak while the girl listens attentively. The girl then looks down, and the boy also looks down. The girl looks up again, and the boy continues to look down.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 202, "extra_info": {"duration": 9.47, "events": ["A young male speaks.", "The scene cuts to two individuals.", "Two individuals sit by a fire."], "n_events": 3, "n_shots": 1, "n_subjects": 3}, "vid": 202}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "51.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A lens with a reticle and a complex interface scans over the crowd as people dance.", "text": "The video begins with a close-up of a group of people dancing, focusing on their legs and feet. The camera then transitions to a digital interface overlay, showing various data and graphics as the dancers continue to move. The interface displays detailed information about the dancers, including their movements and physical attributes. The camera continues to move through the crowd, capturing different dancers and their movements, while the digital interface provides continuous data updates. The video concludes with a close-up of a dancer's back, showing the interface highlighting specific details about their physical characteristics.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 51, "extra_info": {"duration": 3.25, "events": ["A lens with a reticle scans over the crowd.", "People dance."], "n_events": 2, "n_shots": 1, "n_subjects": 1}, "vid": 51}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "525.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases an advanced Mercedes-Benz truck, first presenting the front of the vehicle, followed by the rear, the front body and the back body. The top of the truck is also capable of rising in layers.", "text": "The video begins with a close-up front view of a Mercedes-Benz truck, showcasing its detailed front grille and headlights. The camera then shifts to the rear of the truck, revealing a large spare tire and the text 'NAVIGATOR-BX' on the back. The truck is then shown from a side angle, displaying its full length and the attached trailer. The camera continues to move around the truck, providing a comprehensive view of its exterior, including the trailer's side and the building in the background. Finally, the camera returns to a front view of the truck, emphasizing its front grille and headlights.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 525, "extra_info": {"duration": 8.9, "events": ["Video showcases an advanced Mercedes-Benz truck", "Presentation begins with the front of the vehicle", "Camera moves to present the rear of the truck", "Front body of the truck is displayed", "Back body of the truck comes into view", "The top of the truck rises in layers"], "n_events": 6, "n_shots": 5, "n_subjects": 1}, "vid": 525}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "122.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera tilts upward from below, capturing a sphere of light above. A woman is floating and spinning around the light sphere. The shot switches to an even lower angle as several female characters peek out from beneath the floor one after another, looking up at the woman above. The camera cuts again, a beam of light shoots down from the light sphere, and the woman enters the beam. After a flash of bright light, the woman disappears, and the camera quickly pans up to film the light sphere.", "text": "A dark screen transitions to a scene where a girl is seen falling through a glowing, magical portal, with a bright light and energy surrounding her. The camera then shifts to show a group of girls with colorful hair watching the girl as she falls. The girl continues to fall through the portal, which emits a bright, greenish light. The camera follows the girl as she descends, and the light from the portal intensifies. The scene ends with the girl still falling through the portal, with the light illuminating the surroundings.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 122, "extra_info": {"duration": 8.38, "events": ["Camera tilts upward capturing a sphere of light", "A woman floats and spins around the light sphere", "Several female characters peek out from beneath the floor", "Female characters look up at the woman above", "A beam of light shoots down from the light sphere", "The woman enters the beam of light", "The woman disappears after a flash of bright light", "Camera quickly pans up to film the light sphere"], "n_events": 8, "n_shots": 4, "n_subjects": 2}, "vid": 122}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "781.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The subtitles move from left to right in sequence, consisting of the red \"warning,\" the yellow \"alert,\" the green \"stand by,\" and the blue \"online.\"", "text": "The video begins with a black screen. Red text 'WARNING' appears and then disappears. Yellow text 'ALERT' appears and then disappears. Green text 'STAND BY' appears and then disappears. Blue text 'ONLINE' appears and then disappears. Red text 'WARNING' appears again and then disappears.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 781, "extra_info": {"duration": 9.48, "events": ["The subtitles move from left to right.", "The red 'warning' appears sequentially.", "The yellow 'alert' follows after the red 'warning'.", "The green 'stand by' comes next in sequence.", "The blue 'online' displays lastly."], "n_events": 5, "n_shots": 1, "n_subjects": 4}, "vid": 781}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "204.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A silver car drives on a city road, makes a left turn at an intersection, and proceeds along a driveway towards a red brick building, as seen through a wrought iron fence.", "text": "A silver car with the license plate 'USA 1' drives through a busy street, passing a red double-decker bus and other vehicles, with Big Ben visible in the background. The car continues to drive through the city streets, moving past various buildings and other cars. The scene transitions to a view through an ornate iron gate, showing the car driving on a narrow road towards a large brick building with white-framed windows. The car approaches and enters the courtyard of the building, which has a green door and intricate architectural details.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 204, "extra_info": {"duration": 19.68, "events": ["A silver car drives on a city road.", "The silver car makes a left turn at an intersection.", "The silver car proceeds along a driveway.", "The silver car approaches a red brick building."], "n_events": 4, "n_shots": 2, "n_subjects": 1}, "vid": 204}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "820.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video captures an individual with long hair, dressed in a striking red wrap dress, strutting on stage with an enchanting smile. This person is seen wearing a ring on their right ring finger and a piece of gold jewelry on their left wrist. They gracefully show off their beautiful dress through a variety of poses. Taking on the role of a model during a T station show, they stir the interest of the audience, many of whom are pictured capturing their performance on camera.", "text": "A woman in a red dress with a colorful skirt runs past the audience, who are seated and watching her. She continues running and then starts dancing on the runway, moving gracefully and smiling. The audience, seated on both sides of the runway, watches her performance attentively. The woman spins and twirls on the runway, showcasing her dress and movements. She strikes a pose with one hand on her hip, smiling and looking confident. The background features a white, geometric-patterned wall, and the lighting highlights the runway.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 820, "extra_info": {"duration": 5.96, "events": ["The individual struts on stage.", "The individual smiles enchantingly.", "The individual wears a ring.", "The individual showcases a gold bracelet.", "The individual poses with the dress.", "The individual assumes a model role.", "The audience captures the performance."], "n_events": 7, "n_shots": 4, "n_subjects": 2}, "vid": 820}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "341.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows a split screen shot, with the left side featuring two male workers in the foreground and another worker visible in the background. On the right side, a rocket is flying upwards to the right, with a blast of flame propelling it forward from the tail of the rocket. Surrounding the flame is a large trail of smoke and dust. In the left side of the frame, one of the workers looks up, and the right side of the screen moves to the left, covering the entire frame. The rocket continues to fly when suddenly, a thick burst of flame erupts from below the rocket.", "text": "The video is split into two sections: on the left, a man wearing a headset and a yellow turtleneck is seen looking down, while on the right, a rocket is ascending into the night sky, leaving a trail of smoke behind it. The man then raises his head to look at the rocket. The camera then focuses solely on the rocket as it continues to ascend, with the smoke trail becoming more pronounced. Suddenly, the rocket is engulfed in a bright explosion, creating a large burst of light and smoke.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 341, "extra_info": {"duration": 7.88, "events": ["Two male workers stand in the foreground.", "Another worker is visible in the background.", "A rocket flies upwards with a blast of flame.", "Flame propels the rocket forward from its tail.", "A large trail of smoke and dust surrounds the flame.", "One worker looks up at the sky.", "The right side screen covers the entire frame.", "The rocket continues to fly.", "A thick burst of flame erupts from below the rocket."], "n_events": 9, "n_shots": 2, "n_subjects": 4}, "vid": 341}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "471.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A slightly overweight lady holding a tray walks forward. On the tray, there's a blue and white balloon. Suddenly, the balloon bursts, and a pigeon flies out. The lady gestures with her right hand, and the pigeon lands on it. The background is a stage glittering with lights.", "text": "A woman dressed in a black outfit with white polka dots and high heels is walking on a stage holding a tray with a white balloon on it. As she continues walking, the balloon on the tray starts to deflate. The balloon completely deflates, and the woman reacts by raising her hand and looking at the deflated balloon. She then smiles and continues to hold the tray with the deflated balloon.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 471, "extra_info": {"duration": 4.23, "events": ["Lady holds tray with balloon", "Balloon bursts unexpectedly", "Pigeon flies out of balloon", "Lady gestures with hand", "Pigeon lands on lady's hand"], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 471}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "719.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A lady shows an expression of surprise at something ahead, then opens her arms and runs over to embrace the three ladies in front of the camera.", "text": "A woman in a light pink dress stands in front of a mirror, clapping her hands and smiling as she looks at her reflection. She continues to clap and smile, then opens her arms wide. The woman in the red dress and another woman with a pink headband approach her, and they all embrace each other warmly. The woman in the pink dress then turns and walks away, while the other two women continue to hug.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 719, "extra_info": {"duration": 14.3, "events": ["Lady shows an expression of surprise.", "Lady opens her arms.", "Lady runs over to the three ladies.", "Lady embraces the three ladies."], "n_events": 4, "n_shots": 1, "n_subjects": 4}, "vid": 719}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "966.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, an individual with short hair is seen wearing a dark blue t-shirt with black stripes, black shorts, and dark blue shoes featuring leaf pattern. On the right wrist of the individual is a red watch, while on the left wrist there is a black watch. Both hands also sport a ring each. Additionally, a red cord is tied around the right foot. The individual is spotted sliding leftwards on a black skateboard. After crouching down briefly, the person jumps up to the first step, glides along the edge of the stairs to some distance, pivots, and leaps down once more. The movement of this person is from the right towards the left. Due to the potential risk involved, it is advised against imitating such high-risk maneuvers on skateboard.", "text": "A man is skateboarding on a smooth concrete surface in an outdoor area with trees and a cloudy sky in the background. He approaches a low concrete ledge and begins to crouch down, preparing for a trick. The man then jumps and flips the skateboard in the air, positioning it to land on the ledge. The skateboard lands on the ledge, and the man balances on it as it slides along the edge. The man continues to slide along the ledge, maintaining his balance. Finally, he lands back on the ground and continues skateboarding.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 966, "extra_info": {"duration": 4.96, "events": ["Individual slides leftwards on a skateboard", "Individual crouches down briefly", "Individual jumps up to the first step", "Individual glides along the edge of the stairs", "Individual pivots", "Individual leaps down from the stairs"], "n_events": 6, "n_shots": 2, "n_subjects": 2}, "vid": 966}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "536.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two teams, one dressed in red and the other in blue, are playing a volleyball match. The ball is hit into the air, with the camera following its movement. A woman uses both hands to hit the falling volleyball back into the air, while another woman runs forward, jumps up, and stretches out her right hand to strike the ball towards the opposite side. The players on the other side fail to catch it, and the ball hits the ground. The team in red gathers together, celebrating their point. Other team members can be seen moving around them, along with the seated referee.", "text": "In an indoor volleyball court with wooden flooring and a net in the middle, a player in a pink jersey with the number 7 hits the ball over the net. The ball travels over the net and is received by a player in a black jersey. The player in the black jersey sets the ball, and it is passed to another player in a black jersey. The player in the black jersey hits the ball back over the net. The ball is received by a player in a pink jersey, who then passes it to a teammate. The teammate in the pink jersey hits the ball back over the net. The ball is received by a player in a black jersey, who sets it. The player in the black jersey hits the ball back over the net. The ball is received by a player in a pink jersey, who then passes it to a teammate. The teammate in the pink jersey hits the ball back over the net. The ball is received by a player in a black jersey, who sets it. The player in the black jersey hits the ball back over the net. The ball is received by a player in a pink jersey, who then passes it to a teammate. The teammate in the pink jersey hits the ball back over the net. The ball is received by a player in a black jersey, who sets it. The player in the black jersey hits the ball back over the net. The ball is received by a player in a pink jersey, who then passes it to a teammate. The teammate in the pink jersey hits the ball back over the net. The ball is received by a player in a black jersey, who sets it. The player in the black jersey hits the ball back over the net. The ball is received by a player in a pink jersey, who then passes it to a teammate. The teammate in the pink jersey hits the ball back over the net. The ball is received by a player in a black jersey, who sets it. The player in the black jersey hits the ball back over the net. The ball is received by a player in a pink jersey, who then passes it to a teammate. The teammate in the pink jersey hits the ball back over the net. The ball is received by a player in a black jersey, who sets it. The player in the black jersey hits the ball back over the net. The ball is received by a player in a pink jersey, who then passes it to a teammate. The teammate in the pink jersey hits the ball back over the net. The ball is received by", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 536, "extra_info": {"duration": 7.2, "events": ["Two teams play a volleyball match.", "The ball is hit into the air.", "A woman hits the volleyball back.", "Another woman runs forward.", "The running woman jumps up.", "The jumping woman strikes the ball.", "Players on the other side fail to catch the ball.", "The ball hits the ground.", "The team in red celebrates.", "Other team members move around."], "n_events": 10, "n_shots": 1, "n_subjects": 4}, "vid": 536}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "623.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a lush green field, a lady wearing long sleeves intercepts a red ball rolling to the right. She then runs to the right side of the ball, gently nudges it forward, and then kicks it to the left. A lady wearing short sleeves runs over from the left side, meets the ball, and with one kick, sends it flying.", "text": "A person wearing jeans and white shoes runs across a grassy field. They kick a red ball, which rolls forward on the grass. The person continues to chase the ball, which keeps rolling ahead. Another person, wearing pink shoes and jeans, runs towards the ball and kicks it. The ball rolls further away as the second person continues to chase it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 623, "extra_info": {"duration": 6.47, "events": ["A red ball rolls to the right.", "Lady intercepts the red ball.", "Lady runs to the right side of the red ball.", "Lady nudges the red ball forward.", "Lady kicks the red ball to the left.", "Another lady runs over from the left side.", "Another lady meets the red ball.", "Another lady kicks the red ball, sending it flying."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 623}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "69.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A group of animated girl characters and a puppy are in a room, looking towards the door. Then one of the girls shows a frightened expression. The camera cuts to the door as a shadow kicks it open and staggers in. As the scene brightens, it reveals that the shadow is a girl holding several bath towels, with a cowboy hat on one of them.", "text": "A group of seven girls, each with distinct hair colors and styles, stand in a room with wooden floors and ornate decorations, looking at something off-screen. The girl with pink hair, holding a rolled-up item, turns her head and opens her mouth in surprise. The door opens, and a girl wearing a crown and carrying a large stack of towels enters the room. The girl with the towels looks at the group and starts speaking. The group of girls turns their heads to look at the girl with the towels.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 69, "extra_info": {"duration": 5.88, "events": ["A group of animated girl characters and a puppy look towards the door", "One of the girls shows a frightened expression", "A shadow kicks the door open", "The shadow staggers in", "The scene brightens", "It reveals the girl holding several bath towels", "A cowboy hat is on one of the towels"], "n_events": 7, "n_shots": 3, "n_subjects": 3}, "vid": 69}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "388.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "One person covered the man's face with a towel while another held a showerhead, spraying water onto the man's face, as the man struggled in pain.", "text": "A man is lying in a bathtub with his mouth open, appearing to be in distress. Another person approaches and begins to cover the man's face with a towel. The person then uses a showerhead to rinse the man's body, starting from his head and moving down to his torso. The man remains motionless as the person continues to rinse him, ensuring he is thoroughly cleaned.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 388, "extra_info": {"duration": 5.25, "events": ["One person covered the man's face with a towel", "Another person held a showerhead", "Water sprayed onto the man's face", "The man struggled", "The man experienced pain"], "n_events": 5, "n_shots": 1, "n_subjects": 3}, "vid": 388}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "338.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the car, a man took a paper bag from the side and pulled out a piece of food from it, and started eating it.", "text": "A man is sitting in the driver's seat of a car, facing the back seat. He turns his head slightly and reaches behind him to grab a paper bag. He opens the bag and starts to take out a piece of food. The man pulls out a piece of food from the bag and begins to eat it. He continues to eat the food while sitting in the car.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 338, "extra_info": {"duration": 5.5, "events": ["Man took paper bag from the side.", "Man pulled out piece of food from bag.", "Man started eating the food."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 338}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "96.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera tilts upward to capture a deep red sky, a hand emerges from a distant trench and throws a grenade. In the trench nearby, visible in the frame, a soldier crouches with hands over his head at the edge of the trench. The scene switches to two soldiers running along the trench when an explosion occurs at the edge, the blast wave knocking them down. One soldier lies on the ground, gasping for air.", "text": "The scene is set in a desolate, reddish landscape with barren trees and a distant structure under a red sky. A person emerges from the ground, raising their arms as a green object flies overhead. The person falls to the ground, covering their head with their hands. Two soldiers, holding rifles, run along a trench but are hit by the flying object, causing them to fall to the ground. The soldiers lie on the ground, seemingly unconscious or injured.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 96, "extra_info": {"duration": 4.38, "events": ["A hand throws a grenade from a trench.", "A soldier crouches with hands over his head.", "Two soldiers run along the trench.", "An explosion occurs at the trench edge.", "The blast wave knocks two soldiers down.", "One soldier lies on the ground gasping for air."], "n_events": 6, "n_shots": 3, "n_subjects": 3}, "vid": 96}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "294.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man is lying on his side in bed, a woman walks to the bedside, places down a scarf, reaches out to take off the man's glasses, and then gently grasps the man's hair with her hand.", "text": "A man is lying on his side in bed, wearing a gray jacket with a patch on the shoulder, in a dimly lit room with a lamp on the nightstand. A woman approaches from the left, reaches out, and places her hand on the man's back. The man, still lying down, begins to remove his glasses with his right hand. The woman then reaches out and takes the man's glasses from him. She gently touches the man's head with her hand. The woman then moves her hand away, and the man remains lying on the bed.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 294, "extra_info": {"duration": 10.5, "events": ["A man is lying on his side in bed", "A woman walks to the bedside", "A woman places down a scarf", "A woman reaches out", "A woman takes off the man's glasses", "A woman gently grasps the man's hair"], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 294}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "740.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person reaches out to pluck the lettuce leaves one by one, placing them on their left hand.", "text": "In a lush green field, a person reaches out and begins to pick fresh green lettuce leaves from the ground. The person's hands are seen carefully selecting and pulling the leaves, placing them into a pile. The background is filled with more green lettuce plants, indicating a healthy and abundant crop. The person continues to pick the leaves, ensuring they are fresh and ready for consumption.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 740, "extra_info": {"duration": 16.58, "events": ["A person reaches out.", "Person plucks lettuce leaves.", "Person places leaves on their left hand."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 740}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "521.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Inside an airplane cabin, skydivers jump out one after another from the hatch. After the last athlete jumps out of the plane, the camera follows the last athlete to film, with the sun in the distance, clouds beside them, and the earth below.", "text": "Several skydivers are seen free-falling through the sky, with clouds and a clear blue sky in the background. Inside the plane, skydivers are preparing to jump, with one person wearing a Red Bull helmet and gear. The person with the Red Bull helmet exits the plane, followed by others, and begins free-falling. The skydivers continue to fall, with the plane visible in the background, and the landscape below becoming more visible. The skydivers are seen in various positions as they descend, with the sun shining brightly in the sky. The video ends with the skydivers continuing their descent, surrounded by clouds and the landscape below.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 521, "extra_info": {"duration": 11.03, "events": ["Skydivers line up inside the airplane cabin.", "First athlete jumps out of the airplane hatch.", "Subsequent skydivers jump out one after another.", "Last athlete exits the airplane.", "Camera follows the last skydiver.", "Skydiver freefalls.", "Sun appears in the distance.", "Clouds are visible beside the skydiver.", "Earth is seen below.", "Skydiver stabilizes body position."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 521}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "293.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman is walking down from the stairs, and a man downstairs is putting on his coat, looking at the woman.", "text": "A woman in a white tank top and black skirt is descending a staircase with a black wrought iron railing, holding the railing with her right hand and looking forward while talking. The wall behind her is decorated with various framed pictures and posters. The scene then switches to a man standing in a room with a black bookshelf, a yellow bag on a chair, and a painting on the wall. He is putting on a black jacket, adjusting it as he speaks.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 293, "extra_info": {"duration": 3.0, "events": ["A woman walks down the stairs.", "A man downstairs puts on his coat.", "The man looks at the woman."], "n_events": 3, "n_shots": 2, "n_subjects": 2}, "vid": 293}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "435.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a forest. The woman reaches out to scoop out the white viscous material from the bowl and places it into the green basket behind her. The camera zooms in on the object as it falls into the basket. The woman bends down to pick up another piece and continues to place it into the basket. The camera then switches to the woman tearing the white viscous material off the tree, following the movement with the shot.", "text": "A person is seen collecting white liquid from a tree using a container placed under the liquid flow. The person then removes the container and places it into a green bucket. The person repeats the process by collecting more liquid from another container and placing it into the green bucket. The person then uses a tool to scrape the liquid from the tree trunk, ensuring all the liquid is collected. The video ends with a close-up of the liquid flowing from the tree trunk into the container.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 435, "extra_info": {"duration": 15.08, "events": ["The woman scoops out the white viscous material.", "The woman places the material into the green basket.", "The material falls into the basket.", "The woman bends down to pick up another piece.", "The woman continues to place the material into the basket.", "The woman tears the white viscous material off the tree."], "n_events": 6, "n_shots": 6, "n_subjects": 1}, "vid": 435}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "468.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a gauze rack hangs a nearly transparent piece of black gauze, embroidered with a monkey covering its mouth with its paws. An elderly grandma with graying hair threads her needle through the gauze. On the other side of the fabric is another elderly grandma wearing glasses. The camera pans around the gauze rack from the side of the grandma with graying hair to the grandma with glasses. The grandma with glasses receives the passed-through needle and thread, and threads it back again. On the side of the grandma with glasses, a white puppy with black eyes is being embroidered.", "text": "An elderly woman, wearing a plaid jacket and green gloves, is seen working on a piece of fabric with a monkey image on it, using a needle and thread. She continues to add details to the monkey's face, focusing intently on her work. The camera then shifts to show the woman working on a different part of the fabric, which now features a dog image. The woman adjusts the fabric and continues to add details to the dog's image, while another person assists her.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 468, "extra_info": {"duration": 10.63, "events": ["Piece of black gauze hangs on the rack", "Monkey is embroidered covering its mouth", "Grandma with graying hair threads her needle", "Needle passes through the gauze", "Camera pans around the gauze rack", "Grandma with glasses receives the needle", "Grandma with glasses threads the needle back", "A white puppy is being embroidered"], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 468}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "910.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "An individual of Caucasian descent, attired in black clothing, is seen gripping a knife in one hand and a white cloth in the other. They begin to wrap the cloth around their hand and then elevate the knife, executing a slashing move towards an African-American individual, also adorned in black clothing. The latter individual raises their left arm in defence, bending it as they do so, with a black cloth swathed around their left hand.", "text": "In a gymnasium with basketball hoops and a logo on the wall, a man in a black martial arts uniform holds a white cloth and a knife, demonstrating various movements while another man in a similar uniform stands in front of him, watching. The man with the knife performs a series of actions, including swinging the knife and moving the cloth, while the other man remains stationary, observing. The man with the knife continues to demonstrate, raising the knife and moving it in different directions, while the other man adjusts his stance and prepares to engage. The man with the knife then raises the knife high and brings it down towards the other man, who reacts by moving his body to avoid the strike. The man with the knife continues his demonstration, moving the cloth and knife in a controlled manner, while the other man remains in a defensive stance.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 910, "extra_info": {"duration": 9.01, "events": ["Individual grips a knife in one hand.", "Individual wraps a white cloth around their hand.", "Individual elevates the knife.", "Individual executes a slashing move towards another individual.", "African-American individual raises their left arm in defence.", "African-American individual bends their left arm.", "African-American individual has a black cloth around their left hand."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 910}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "975.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a Minecraft game scenario, another player controls their character, a green entity with a cube-shaped head wearing brown overalls. The character shakes their head left and right, moving around a green area aiming to reach a stand labeled \"life\". Unexpectedly, the character begins to fly. It remains static with its head, yet rotates its limbs and torso in an impossible real-life manner, ascending into the blue sky above. In the same stand, the text \"How did she?\" can be seen.", "text": "In a grassy, blocky environment, a character with a white and yellow outfit stands in front of a wooden structure with a blue and white roof, facing a zombie. The character turns to face the camera, revealing a green plant on its head. The character then turns back to the zombie and starts walking away from the structure. The character continues walking along a dirt path, moving further away from the structure. The character jumps into the air, and the camera follows the character's movement upwards. The character lands back in front of the wooden structure, facing the zombie again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 975, "extra_info": {"duration": 8.0, "events": ["Character shakes head left and right.", "Character moves around a green area.", "Character aims to reach a stand labeled 'life'.", "Character unexpectedly begins to fly.", "Character rotates limbs and torso while ascending.", "Character ascends into the blue sky."], "n_events": 6, "n_shots": 4, "n_subjects": 1}, "vid": 975}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "849.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases many individuals, both children and adults. A portion of these people are dressed in pants, while the remaining are garbed in swimwear like bikinis and other types of swimsuits. With a delighted spirit, they all run in the same direction, evidently heading towards a swimming or beach activity. Those clad in swimsuits could possibly be participating in an organized event or activity.", "text": "A large group of people, mostly in swimwear, are gathered on a gravel area surrounded by dense forest. They begin to move forward, raising their hands and cheering as they start running. The crowd continues to run, with some individuals wearing hats and others in various swimwear, including bikinis and swim trunks. The group maintains their energy, running in a coordinated manner, with some people clapping and others waving. The background remains consistent with the forested area, and the crowd appears to be enjoying the event, with some individuals taking photos or videos.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 849, "extra_info": {"duration": 9.01, "events": ["Individuals run towards a destination.", "Children participate in the activity.", "Adults join the running group.", "People dressed in pants run.", "Others wear swimwear.", "Participants head for swimming.", "Adults and children show excitement.", "People possibly join an event.", "Swimwear-clad individuals prepare for activity."], "n_events": 9, "n_shots": 1, "n_subjects": 1}, "vid": 849}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "884.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features a circular ash-filled hole situated within a brush in a forest. An individual pours water into this ash-filled hole to deter any potential fires. Following that, a person clad in black attire, a hat, and a knife nestled in their pocket, proceeds to gather the surrounding dirt from the ash-filled hole using a stick. The individual then collects the combined ashes and soil and discards them.", "text": "A black cast iron skillet is placed on the forest floor, surrounded by dry leaves and twigs. The scene shifts to a close-up of a pile of ashes, where a stick is used to stir and fan the embers, causing smoke to rise. A man, dressed in outdoor clothing and a cap, kneels on the ground and uses a stick to further stir the ashes, creating more smoke. He then uses his hands to cover the ashes with soil, ensuring they are completely extinguished. The final frame shows the empty skillet on the forest floor.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 884, "extra_info": {"duration": 6.0, "events": ["An individual pours water into the ash-filled hole.", "The individual in black gathers dirt with a stick.", "The individual collects mixed ashes and soil.", "The individual discards the ashes and soil."], "n_events": 4, "n_shots": 4, "n_subjects": 1}, "vid": 884}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "417.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A child in a blue and black school uniform is holding an umbrella. The camera pans to the right, and a child slides down the stairs. The camera switches to a side view as the child stands on the stairs and walks forward, then suddenly falls backwards and slides down the stairs. The people around look at him. After standing up, the child walks toward the steps. The background is an outdoor scene in the rain.", "text": "A person holding a checkered umbrella walks up a set of stairs, followed by a child in a colorful outfit. The child attempts to climb the stairs but loses balance and falls down the steps. The child rolls down the stairs and lands at the bottom. The child stands up and starts walking again, while the person with the checkered umbrella continues to ascend the stairs. The child continues to walk up the stairs, passing by other people who are also walking up and down the stairs.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 417, "extra_info": {"duration": 15.03, "events": ["A child holds an umbrella.", "A child slides down the stairs.", "The child stands on the stairs.", "The child walks forward.", "The child falls backwards.", "The child slides down the stairs.", "People around look at the child.", "The child stands up.", "The child walks toward the steps.", "It rains in the background."], "n_events": 10, "n_shots": 2, "n_subjects": 3}, "vid": 417}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "458.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "One person spreads a sheet of dough on a pot, then drizzles yellow sauce horizontally across it, followed by a drizzle of black sauce vertically. Next, crispy corn snacks are poured in, and then spread out evenly by hand. Four pieces of black chocolate are placed on top. The background is set in a kitchen.", "text": "A man places a silicone mold on a metal tray and secures it with a clamp. He then pours peanut butter into the mold, followed by chocolate sauce, creating a layered pattern. Next, he adds crushed nuts on top of the chocolate and peanut butter layers. He places chocolate bars on the nuts and presses them down. Finally, he removes the mold, revealing the finished dessert.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 458, "extra_info": {"duration": 10.23, "events": ["Person spreads dough on a pot", "Person drizzles yellow sauce horizontally across dough", "Person drizzles black sauce vertically across dough", "Person pours crispy corn snacks into pot", "Person spreads crispy corn snacks by hand", "Person places four pieces of black chocolate on top"], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 458}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "894.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video begins with cars moving in a customary and orderly fashion. Unexpectedly, a large white truck with red tires speeds up from the left rear, colliding with the first white car to its side. It then goes on to hit the second car which is blue and white. This causes the blue and white car to veer right and forward, initially colliding with a transport vehicle carrying a green car ahead.", "text": "The video shows a highway with several vehicles, including a white SUV in the center lane, a blue car in the left lane, and a large truck with a green and yellow trailer in the right lane. The white SUV is moving forward, and the blue car is gradually moving closer to the white SUV. As the blue car continues to move forward, a large truck appears from the left side of the frame and starts to enter the left lane. The large truck continues to move into the left lane, partially blocking the blue car. The large truck is now fully in the left lane, and the blue car is almost completely blocked by the truck. The large truck continues to move forward, and the blue car is now fully blocked by the truck.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 894, "extra_info": {"duration": 4.0, "events": ["Cars move in orderly fashion.", "Large white truck speeds up.", "Truck collides with first white car.", "Truck hits second blue and white car.", "Blue and white car veers right.", "Blue and white car moves forward.", "Blue and white car collides with transport vehicle.", "Transport vehicle carries green car."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 894}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "246.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman opens her mouth and looks forward as her hair flutters and blood flows from her nose. The camera cuts to her hands, from which blue flames are emitted and fly towards a raging fire in front of her. The flames gradually swell, and the camera then switches to a long shot of the fire. Another character stands in front of the fire, looking up at it, as the camera pans upward.", "text": "A woman with blonde hair, wearing a dark blue outfit, looks down with a concerned expression as her hair is blown by the wind. She then raises her head slightly. The scene shifts to a large explosion of fire and energy, with the woman standing on the ground, raising her hands as the energy wave approaches her. The energy wave continues to grow and intensify, forming a massive, fiery sphere. A man stands in front of the expanding energy sphere, which is now a large, glowing, and fiery mass. The woman is seen again, looking down with a distressed expression.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 246, "extra_info": {"duration": 7.0, "events": ["The woman opens her mouth.", "The woman's hair flutters.", "Blood flows from the woman's nose.", "The woman's hands emit blue flames.", "Blue flames fly towards the fire.", "The fire swells.", "Another character stands in front of the fire.", "Another character looks up at the fire.", "The camera pans upward."], "n_events": 9, "n_shots": 3, "n_subjects": 2}, "vid": 246}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "212.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman looks at the object in front of her, her body slightly moves, the camera cuts to the forest canopy, then moves down, circling around the man.", "text": "A woman with red hair is looking intently at something in front of her, with a blurred background of greenery. The camera then shifts to show leaves falling from the trees above. The camera continues to move downward, revealing a man with a beard and long hair, dressed in dark clothing, looking up at the falling leaves. The man then begins to walk forward, still looking up at the leaves.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 212, "extra_info": {"duration": 7.12, "events": ["The woman looks at the object.", "The woman's body slightly moves.", "The camera circles around the man."], "n_events": 3, "n_shots": 2, "n_subjects": 3}, "vid": 212}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "215.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the night, a thick fog descends in front of a door. A woman dressed in theatrical costume pushes past two guards and flings one of them into the water, causing splashes. The woman then turns and walks to the right.", "text": "A large, ominous gate with a red glow and a dark, eerie background is shown. A woman in a green, shell-like dress is being held by a man in a blue uniform. The woman struggles and kicks the man, causing him to fall. The man is thrown into the water, creating a large splash. The woman then runs away from the scene.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 215, "extra_info": {"duration": 6.38, "events": ["A thick fog descends in front of a door.", "A woman pushes past two guards.", "The woman flings a guard into the water.", "Water splashes upon the guard's impact.", "The woman turns.", "The woman walks to the right."], "n_events": 6, "n_shots": 4, "n_subjects": 3}, "vid": 215}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "678.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A black car slowly makes its way through the crowd of protesters, with onlookers watching as the vehicle passes by, some taking out their phones to take pictures of it.", "text": "A black SUV with 'SUBURBAN' written on the side is seen with its door open, and a person in a hoodie stands nearby. The SUV begins to move forward as the crowd, holding signs and taking photos, watches. The SUV continues to drive away, and the crowd starts to disperse. The SUV moves further down the street, and the crowd continues to follow and document the event. The SUV drives away, and the crowd begins to move in the opposite direction. The SUV continues to drive away, and the crowd disperses further. The SUV drives away, and the crowd continues to move in the opposite direction.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 678, "extra_info": {"duration": 11.14, "events": ["A black car moves through the crowd.", "Protesters watch the black car passing by.", "Onlookers take out their phones.", "People take pictures of the black car."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 678}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "100.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Three soldiers look toward the camera in surprise. Then, the soldier in the middle walks forward, out of the frame. The soldier wearing glasses on the left follows him, whereas the soldier on the right remains in place, watching the direction the first two soldiers left.", "text": "Three soldiers in green uniforms and hats stand in front of a building with multiple windows. The soldier in the middle has red hair and initially looks surprised, then turns his head to the left and starts speaking. He then turns his body to the left and begins to walk away. The soldier on the left, who is wearing glasses, also turns his head to the left and starts speaking. The soldier on the right remains still, looking forward.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 100, "extra_info": {"duration": 3.75, "events": ["Three soldiers look toward the camera in surprise.", "The soldier in the middle walks forward.", "The soldier in the middle exits the frame.", "The soldier wearing glasses follows the soldier in the middle.", "The soldier on the right stays in place."], "n_events": 5, "n_shots": 1, "n_subjects": 3}, "vid": 100}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "496.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A lady by a stream waters a fish with a mineral water bottle, and then throws the fish back into the stream.", "text": "A woman is sitting on a concrete ledge by a river, surrounded by lush greenery and a small waterfall. She holds a fish in her hands and pours water from a bottle onto it. After ensuring the fish is wet, she throws it into the river. The fish splashes into the water and swims away. The woman then looks at the water and smiles.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 496, "extra_info": {"duration": 5.83, "events": ["A lady stands by a stream.", "The lady waters a fish with a mineral water bottle.", "The lady throws the fish back into the stream."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 496}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "794.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The numbers appear in sequence enlarged at the center of the screen, and finally the numbers disappear, turning the screen white.", "text": "The video begins with a bright light source in a blue background. The number '10' appears in black, rotating and moving slightly. The number '9' appears next, rotating and moving. The number '8' appears, rotating and moving. The number '7' appears, rotating and moving. The number '6' appears, rotating and moving. The number '5' appears, rotating and moving. The number '4' appears, rotating and moving. The number '3' appears, rotating and moving. The number '2' appears, rotating and moving. The number '1' appears, rotating and moving. The screen then transitions to a bright white light. The number '10' reappears in black, rotating and moving.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 794, "extra_info": {"duration": 11.32, "events": ["Numbers appear in sequence.", "Numbers become enlarged.", "Numbers disappear.", "Screen turns white."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 794}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "7.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two birds hover over a dog and pull the dog's ears upwards. A white horse appears in the doorway of the cottage, observing the interaction between the birds and the dog.", "text": "A dog is lying on the ground in front of a stone house with a thatched roof, appearing to be asleep. Two blue birds fly into the scene from the left and approach the dog. The dog wakes up and looks at the birds. A horse's head appears at the open door of the house, and the dog raises its paw to greet the horse. The dog then turns its head to look at the birds, which are now perched on its paw. The dog starts to move its paw, causing the birds to fly around it. The dog continues to move its paw, and the birds follow its movements. The dog then lies back down, and the birds fly away.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 7, "extra_info": {"duration": 8.22, "events": ["Two birds hover over a dog", "The birds pull the dog's ears upwards", "A white horse appears in the doorway", "The horse observes the interaction"], "n_events": 4, "n_shots": 1, "n_subjects": 4}, "vid": 7}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "230.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman looks across, where another person with their back to the camera uses a stick to ignite a mixture. The camera follows the movement as the woman blows out the stick, causing sparks to fly from the bowl in front of her. A blue liquid is drawn into a syringe, and a pair of hands places the tray holding the syringe on a table. One hand picks up the syringe, the camera pans up, and there's a cherry on the tray, which reflects the shadow of the woman's movements.", "text": "In a dimly lit room filled with various bottles and decorative items, a woman wearing a black dress and multiple necklaces looks at a man in front of her. She uses a lit match to ignite a bowl containing dried rose petals, causing the petals to catch fire and spark. The woman then uses the burning match to ignite a green liquid in a small bowl. She stirs the green liquid with a dropper, mixing it thoroughly. Finally, she places the dropper and other tools into a tray, which reflects her image.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 230, "extra_info": {"duration": 12.12, "events": ["A woman looks across the room", "Person uses a stick to ignite a mixture", "Woman blows out the stick", "Sparks fly from the bowl", "Blue liquid is drawn into a syringe", "Hands place the tray with syringe on a table", "One hand picks up the syringe", "The camera pans up", "A cherry is on the tray", "The cherry reflects the shadow of the woman's movements"], "n_events": 10, "n_shots": 5, "n_subjects": 2}, "vid": 230}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "863.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases a scene on a road where a vehicle progressively slows down, coming to a halt at a red traffic light, adjacent to a black car. Both vehicles are poised to make a left turn, awaiting the traffic signal to turn green. Simultaneously, a white vehicle from the right intersection moves horizontally past. The black car along with the car next to it remains stationary until the light turns green, post the transit of the white vehicle. Various traffic signs are also detectable in the landscape.", "text": "The video shows a car driving on a multi-lane road with a red traffic light ahead. The car is initially behind a black SUV on the left lane and a white car on the right lane. As the car continues to move forward, the black SUV begins to turn left, and the white car starts to move forward. The white car continues to move forward and crosses the intersection, while the black SUV completes its turn and exits the frame. The car continues to drive straight through the intersection, following the white car. The background includes trees, traffic lights, and a mountainous landscape in the distance.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 863, "extra_info": {"duration": 6.01, "events": ["A vehicle slows down on the road.", "The vehicle comes to a halt at a red light.", "A black car waits at the same light.", "The intended direction for both cars is a left turn.", "A white vehicle moves horizontally past from the right.", "Both the black car and the camera's car stay still.", "The traffic light turns green after the white car passes.", "The black car and the camera's car start to turn left."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 863}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "16.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man jumps into mid-air, collides with the surface of the ice, the ice cracks, and the man slowly slides down to the ground, with two characters standing behind him.", "text": "The scene begins with a view of a large, icy wall. A man appears at the bottom of the wall and starts climbing upwards. He reaches the top of the wall and begins to slide down. The man continues to slide down the wall, eventually reaching the ground. He lands on the ground and lies face down, with his legs and arms spread out.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 16, "extra_info": {"duration": 5.12, "events": ["A man jumps into mid-air.", "The man collides with the ice surface.", "The ice cracks.", "The man slides down to the ground.", "Two characters stand behind the man."], "n_events": 5, "n_shots": 2, "n_subjects": 1}, "vid": 16}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "53.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A child character is talking to an adult character. The adult character raises their right hand and places a finger to their lips. The child character walks forward, comes to a box filled with Halloween items, and reveals a surprised expression.", "text": "In a room with blue walls and a green couch, a boy in a yellow sweater and glasses stands next to a man in a white apron and blue shirt, who is looking at him. The boy smiles and then turns to walk towards a box labeled 'Halloween Decorations' on the floor. The box contains various Halloween decorations, including a white ghost, a black bat, and a pumpkin. The boy reaches into the box and takes out a small pumpkin, while the man continues to watch him. The boy then turns and walks away from the box, and the man remains standing in the doorway, still observing the boy.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 53, "extra_info": {"duration": 5.38, "events": ["Child character talks to adult character.", "Adult character raises their right hand.", "Adult character places a finger to their lips.", "Child character walks forward.", "Child character comes to a box.", "Box is filled with Halloween items.", "Child character reveals a surprised expression."], "n_events": 7, "n_shots": 2, "n_subjects": 2}, "vid": 53}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "663.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the basketball court, in the foreground, a person is continuously dribbling while slightly squatting with bent knees, while in the distance, another person is dribbling towards the basket for a layup.", "text": "A person wearing white sneakers with red accents is dribbling a basketball on an outdoor court. In the background, another person in a red and black outfit is seen running and shooting a basketball towards the hoop. The dribbling person continues to move the ball up and down, while the shooter repeatedly attempts to make a basket. The court is surrounded by a green fence, and there are trees visible in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 663, "extra_info": {"duration": 9.83, "events": ["Person dribbles continuously while squatting.", "Person dribbles towards the basket.", "Person performs a layup."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 663}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "38.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The characters turned their heads to look up, and the woman in the middle spoke. The elder at a higher position tore off his clothes and lifted his spear, showing off his abs and said something.", "text": "A group of people, including a large tiger, are standing together in a dark, eerie environment with blue lighting. The people turn their heads to look forward, and the girl in the yellow dress starts speaking. The camera then shifts to a character with white hair, who is holding a staff and looking forward. The character's expression changes to anger, and he throws a creature off the cliff. The character then speaks while standing on the cliff.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 38, "extra_info": {"duration": 4.12, "events": ["Characters turned their heads upwards.", "The woman in the middle spoke.", "The elder tore off his clothes.", "The elder lifted his spear.", "The elder showed off his abs.", "The elder said something."], "n_events": 6, "n_shots": 2, "n_subjects": 3}, "vid": 38}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "307.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a street, a woman walks towards a flower shop ahead, moving forward along the road together with a man carrying an umbrella coming from the front. Many people are walking around. The camera switches to the front of a building where the man and woman walk past it. Vehicles are passing on the road, and someone walks towards the direction of the camera, startling a flock of birds into flight.", "text": "A man holding a cup walks through a rainy street, passing by people with umbrellas and a flower shop with potted flowers. He continues walking past various pedestrians, including a woman in a beige coat and hat, and a man in a gray coat. The man walks past a flower shop with a green awning and colorful flowers displayed outside. He then turns onto a street with a 'Dry C. Sanning' sign and a 'Beverage Bar' sign, and a green car drives by, emitting white smoke.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 307, "extra_info": {"duration": 8.25, "events": ["A woman walks towards a flower shop.", "A man carrying an umbrella moves forward.", "People walk around the street.", "The man and woman walk past a building.", "Vehicles pass on the road.", "Someone walks towards the camera.", "A flock of birds takes flight."], "n_events": 7, "n_shots": 2, "n_subjects": 3}, "vid": 307}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "409.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A lady is strutting in an empty parking lot, showing off black clothing and a white backpack; the camera switches from her back to her front, and then again to her side.", "text": "A woman is walking on a rooftop parking lot with buildings and construction cranes in the background. She turns around and starts walking towards the camera, wearing a black jacket, black pants, and white sneakers, with a white bag slung over her shoulder. She adjusts her jacket by pulling it down and then starts to walk away from the camera. She turns around again, showing the back of her outfit, and continues walking. She adjusts her bag and continues walking, occasionally looking back.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 409, "extra_info": {"duration": 9.17, "events": ["A lady struts in an empty parking lot.", "Lady shows off black clothing.", "Lady displays white backpack."], "n_events": 3, "n_shots": 3, "n_subjects": 1}, "vid": 409}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "630.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A pair of feet wearing orange shoes jumped onto a platform, then walked backward down the platform, then jumped back onto the platform, and walked down again, repeating this process until the video ended. The background was an outdoor sports field.", "text": "The video shows a nighttime outdoor setting with bright lights in the background, likely a sports field or track. A person wearing red shoes and patterned leggings is seen jumping rope. The person's feet are visible as they jump, and the rope is in motion, creating a blur effect. The background remains consistent with the lights and field visible throughout the video.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 630, "extra_info": {"duration": 11.88, "events": ["Feet wearing orange shoes jumped onto a platform", "Feet walked backward down the platform", "Feet jumped back onto the platform", "Feet walked down the platform again"], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 630}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "891.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "There are two people in the video, one in the blue with short hair is playing the game and the other with the long hair in the grey with blue icon is explaining the game. In the game, the game character with gray hair and a cartoon-patterned backpack on their back squatted down and moved to the left. Suddenly, a blue wall appeared, blocking a thrown grenade just in time. The commentator with long hair was shocked to see this scene and almost thought the companion was about to be accidentally hit by a grenade and was frightened to death. Then, the in-game character stood up and aimed their gun at the opponent to continue shooting.", "text": "The game character, equipped with a colorful backpack and holding a gun, moves forward in a virtual environment with various structures and obstacles. The character encounters an obstacle and stops, with the text 'I just heard A PIN' appearing on the screen. The character then turns around and aims a red and white object, which is revealed to be a grenade, at the obstacle. The character throws the grenade, and the text 'Ohhhhh' appears on the screen. The character continues to move forward after throwing the grenade.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 891, "extra_info": {"duration": 5.0, "events": ["Person with short hair plays game.", "Person with long hair explains game.", "Game character squats down.", "Game character moves left.", "Blue wall appears.", "Wall blocks thrown grenade.", "Commentator expresses shock.", "Game character stands up.", "Game character aims gun.", "Game character continues shooting."], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 891}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "570.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man alternates his hands to grasp objects as he climbs up the wall, his body swaying as it moves back and forth from left to right. The camera follows the man's movements.", "text": "A shirtless man in blue shorts and white socks is climbing a wooden wall with various handholds and blue grips, located in a gym with a blue padded floor and a sign that reads 'JUST DON'T QUIT.' He starts by hanging from the top rungs and gradually moves downward, using both his hands and feet to navigate the wall. The man continues to descend, reaching for higher holds and maintaining his balance. He makes steady progress, moving from one hold to another, and his body remains suspended in the air. The man reaches a point where he is almost at the bottom of the wall, still holding onto the grips with both hands. He continues his descent, showing strength and control as he moves downward. The man finally reaches a lower part of the wall, still climbing with determination.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 570, "extra_info": {"duration": 5.13, "events": ["The man alternates hands to grasp objects.", "The man climbs up the wall.", "The man's body sways back and forth.", "The man moves from left to right.", "The camera follows the man's movements."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 570}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "561.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man pulls on the bar with both hands, leaps upward and then releases his hands, clapping them in front of his chest. After landing, he reaches out to grab the bar again, then jumps down to the ground, performing a muscle-flexing pose. The camera then cuts to another man who is standing.", "text": "A man in a gray hoodie and shorts is performing pull-ups on a bar in a grassy area with buildings in the background. He jumps up and down while holding the bar, then claps his hands while still holding the bar. He continues to perform pull-ups, jumping up and down between each one. The scene then switches to another man in a black and white jacket standing in a garden area with trees and plants, pointing forward.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 561, "extra_info": {"duration": 2.7, "events": ["The man pulls on the bar", "The man leaps upward", "The man releases the bar", "The man claps hands in front of chest", "The man lands", "The man reaches out to grab the bar", "The man jumps down to the ground", "The man performs a muscle-flexing pose"], "n_events": 8, "n_shots": 2, "n_subjects": 2}, "vid": 561}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "827.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The individual is practicing boxing at home. The routine involves two right-hand punches followed by a right, left, and another right-hand punch combined with a squat. This pattern is then repeated using both right and left hands. The person moves to the right, executing two right-hand punches then steps back and repeats two right-hand punches again.", "text": "In a cluttered basement with exposed wooden beams and various gym equipment, a man in a blue sleeveless shirt and white shorts is practicing boxing on a black punching bag. He throws a series of punches, including jabs and hooks, while moving around the bag. The punching bag swings back and forth with each hit. The man continues his training, maintaining a focused stance and delivering powerful blows. The background includes a staircase, a mirror, and various items scattered around, indicating a home gym setup. Throughout the video, the text 'Boxing Workout! Please Subscribe!' is displayed at the bottom of the screen.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 827, "extra_info": {"duration": 11.0, "events": ["Individual practices boxing at home.", "Individual throws two right-hand punches.", "Individual performs a combo: right, left, right-hand punch with a squat.", "Pattern is repeated with both hands.", "Individual moves to the right.", "Individual executes two right-hand punches.", "Individual steps back.", "Individual repeats two right-hand punches."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 827}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "722.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two men and one woman are playing cards. The lady on the right is resting her cheeks in her hands. The man in the middle flips the cards onto the table, then collects all the chips. The man on the left and the lady on the right give a round of applause. Afterward, the man in the middle raises his hand and high-fives the man on the left. Finally, he turns to look at the lady, who returns to her pose with her cheeks resting in her hands.", "text": "Three people are sitting on the ground around a green table with poker chips and cards scattered on it. The man in the middle, wearing a beige cardigan and brown pants, picks up a card from the table and places it on the table. He then gathers the poker chips into a pile and pushes them forward. The two people on either side of him, one in a beige sweater and the other in a green sweater, clap their hands in approval. The man in the middle and the person on his left high-five each other. The man in the middle then looks at the person on his right, who is resting her head on her hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 722, "extra_info": {"duration": 7.67, "events": ["Two men and one woman play cards.", "The woman rests her cheeks in her hands.", "The man in the middle flips cards onto the table.", "The man in the middle collects all the chips.", "The man on the left and the woman applaud.", "The man in the middle raises his hand.", "The man in the middle high-fives the man on the left.", "The woman returns to her pose."], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 722}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "176.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The snake moves forward; the camera switches to three creatures sitting in a vehicle driving off into the distance, then switches back to the snake quickly advancing.", "text": "A blue creature with large eyes looks ahead in a dark, rocky environment. The scene shifts to a vehicle carrying three small creatures, including a purple one, a yellow one, and a gray one, as they drive away. The vehicle, marked with the license plate 'CZINBR,' speeds off into the distance, followed by several motorcycles. The blue creature then runs forward across the open landscape.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 176, "extra_info": {"duration": 5.38, "events": ["The snake moves forward.", "Three creatures sit in a vehicle.", "The vehicle drives off into the distance.", "The snake advances quickly."], "n_events": 4, "n_shots": 3, "n_subjects": 4}, "vid": 176}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "587.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a courtyard, surrounded by a small grove of trees and a house. The girl is running in front, occasionally looking back, with a little donkey following her as they circle around a pillar in the middle. The camera cuts to subtitles and images appearing on the screen.", "text": "In a grassy field with a wooden fence and a building in the background, a girl and a donkey are seen running around each other in a playful manner. The girl, wearing a light-colored shirt and shorts, and the donkey, with a light brown coat, chase each other in a circular pattern. The girl occasionally looks back at the donkey while running, and the donkey follows closely behind. The background includes a few other animals, such as goats, grazing near the building. The video ends with a screen displaying the text '来抖音 发现更多创作者' and '抖音号: An608'.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 587, "extra_info": {"duration": 7.72, "events": ["The girl runs in the courtyard", "The girl occasionally looks back", "A little donkey follows the girl", "They circle around a pillar"], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 587}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "773.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person opens their left hand, reaches out with their right hand to take a pill, and then closes their left hand again.", "text": "A hand is shown in a closed fist against a wooden background. The hand opens, revealing several white pills in the palm. Another hand reaches in and picks up one of the pills. The hand then closes again, concealing the pills.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 773, "extra_info": {"duration": 5.04, "events": ["A person opens their left hand.", "The person reaches out with their right hand.", "The person takes a pill.", "The person closes their left hand."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 773}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "970.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Very close-view of the setting sun, depicted as a bright, partially obscured object. The upper portion of the sun is a blend of yellow and orange with a few dark spots, while the lower sections exhibit a redder hue. Two dots can be seen on the sun. The surrounding scenery is enveloped in darkness. The sun, appearing as a luminous object, progressively moves towards the lower right, gradually disappearing from sight. It gives the impression of a glowing entity slowly descending.", "text": "The video shows a solar eclipse, with the sun partially obscured by the moon, creating a bright, glowing orange and red hue. The sun's surface is visible, with some dark spots and a distinct edge where the moon is covering it. The background is completely dark, emphasizing the brightness of the sun.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 970, "extra_info": {"duration": 11.0, "events": ["The sun sets.", "The sun moves towards the lower right.", "The sun gradually disappears from sight."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 970}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "41.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A flock of birds flew out from the bushes, startling a woman who was facing away from the camera. The woman turned to look at the creature behind her as the camera cut to a large bear emerging from the underbrush, with a cub following close behind, its head the only part visible amidst the grass. The shot then switched to a single-log bridge made from a dead tree. The woman walked out from the right side of the frame, followed by the bear. Both the woman and the bear headed towards the log bridge, with the ground covered in snow.", "text": "A girl with a long braid, wearing a yellow coat and a purple sash, stands in tall grass with her arms outstretched as birds fly away. She then turns her head to look at a bear standing nearby. The bear, with a red coat, moves through the grass, partially obscured by the tall blades. The bear's face becomes visible as it continues to move through the grass. The scene transitions to a snowy mountain landscape with a fallen tree bridge, where the girl and the bear are seen walking towards the bridge.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 41, "extra_info": {"duration": 7.5, "events": ["A flock of birds flew out from the bushes.", "A woman turned to look behind her.", "A large bear emerged from the underbrush.", "A cub followed the large bear.", "The woman walked out from the right side of the frame.", "The large bear followed the woman.", "The woman headed towards the log bridge.", "The large bear headed towards the log bridge.", "The ground was covered in snow."], "n_events": 9, "n_shots": 3, "n_subjects": 4}, "vid": 41}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "850.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts an individual attired in black shorts-overalls, wearing a necklace and a headset. This individual is seen collecting various sheets from their bedroom for laundering. Initially, the person is shown standing by a nightstand near the room's door, removing a left-side pillowcase which is to be washed. They then proceed to remove the right-side pillowcase from another pillow and toss the pillow back onto the bed. Subsequently, the individual moves to the side of the bed closest to the window, gathers the bedding and holds them in their grasp. These sheets are intended to be laundered in a washing machine.", "text": "In a bright bedroom with white walls and a large window, a woman wearing a black tank top, shorts, and headphones is seen making a bed. She starts by lifting the blanket and pillows from the bed, shaking them out vigorously. She continues to shake the blanket and pillows, ensuring they are properly fluffed and spread out. The woman then adjusts the pillows and blanket, making sure they are neatly arranged. She repeats the process, shaking out the blanket and pillows again to ensure they are fully spread out. Finally, she makes final adjustments to the pillows and blanket, ensuring everything is in place.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 850, "extra_info": {"duration": 8.98, "events": ["Individual removes a left-side pillowcase.", "Individual removes a right-side pillowcase.", "Individual tosses pillow onto bed.", "Individual gathers bedding from bed.", "Individual holds bedding for laundry."], "n_events": 5, "n_shots": 3, "n_subjects": 1}, "vid": 850}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "553.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man swims in the pool of a train carriage, then turns around to sit up, wipes the water off his face, and each takes a slice of watermelon to eat with the two men beside him. The camera pans to the left.", "text": "A small vehicle is covered with a striped tarp and a colorful umbrella, with a table set up on the back containing watermelon slices and bottles. A man in black is seen swimming in the watermelon water, moving towards the other two men who are sitting and talking. The man in black reaches the other two men and starts eating watermelon. The other two men continue to talk and eat watermelon while the man in black enjoys his watermelon. The man in black lies back in the watermelon water, holding a piece of watermelon, while the other two men continue to eat and talk.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 553, "extra_info": {"duration": 9.18, "events": ["A man swims in the pool of a train carriage.", "The man turns around to sit up.", "The man wipes the water off his face.", "The man takes a slice of watermelon.", "Two men beside the man each take a slice of watermelon.", "The men eat the watermelon."], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 553}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "285.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man bent over and, using his thick fingers, pressed and pulled open a drawer. A woman with long hair, holding a bottle, looked at the man, who then stood up straight and faced the woman.", "text": "In a kitchen with blue cabinets and a yellow wall, a man bends over and opens a drawer under the counter, while a woman stands nearby holding a bottle and watching him. The man continues to open the drawer and looks inside, then closes it. He repeats this action, opening and closing the drawer multiple times. The woman remains standing, holding the bottle and observing the man's actions. The man then stands up and moves to the side, while the woman continues to watch him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 285, "extra_info": {"duration": 4.0, "events": ["A man bent over.", "Man pressed open a drawer.", "Man pulled open the drawer.", "Woman holding a bottle looked at the man.", "Man stood up straight.", "Man faced the woman."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 285}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "260.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a bridge, a motorcyclist dressed in blue waves toward the front. The camera switches to the opposite side, where a man with a selfie stick is focusing intently on taking a selfie, with many people walking by his side. The camera then switches to behind the selfie-taking man as a motorcyclist wearing red comes speeding through and collides with him. The camera switches again, showing the motorcyclist in red racing down some stairs, with the motorcyclist in blue closely following behind, as other people on the stairs swiftly move aside to avoid them.", "text": "A motorcyclist in a blue uniform and helmet rides a police motorcycle on a bridge, making a hand gesture. A man holding a selfie stick with a phone attached, wearing a brown jacket, adjusts his hair and smiles while looking at the motorcyclist. The motorcyclist continues riding, and the man with the selfie stick steps back, still smiling. The motorcyclist rides past the man, who continues to step back. The scene shifts to a staircase where two motorcyclists, one in a red helmet and the other in a blue uniform, ride down the stairs, causing people to scatter and duck.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 260, "extra_info": {"duration": 6.0, "events": ["Motorcyclist waves toward the front.", "Man uses selfie stick to take a selfie.", "People walk by the selfie-taking man.", "Motorcyclist in red collides with the selfie-taking man.", "Motorcyclist in red races down the stairs.", "Motorcyclist in blue follows the motorcyclist in red.", "People move aside on the stairs."], "n_events": 7, "n_shots": 4, "n_subjects": 4}, "vid": 260}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "372.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man is hung up, with some people looking at him, others swaying their bodies. The camera cuts to a close-up as another man walks forward, reaches out with both hands, and pulls open the man's clothes.", "text": "In a medieval setting with stone walls and a wooden gate, a man in a brown tunic walks towards a group of soldiers standing near the gate. The man reaches the edge of the gate and begins to lift his shirt. A soldier in chainmail approaches the man and starts to pull his shirt up. The soldier continues to pull the man's shirt up, revealing his chest. The man appears to be in distress as the soldier pulls his shirt further up. The soldier finally pulls the man's shirt completely off, exposing his chest.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 372, "extra_info": {"duration": 5.75, "events": ["The man is hung up.", "Some people look at the man.", "Others sway their bodies.", "Another man walks forward.", "The man reaches out with both hands.", "The man pulls open the hung man's clothes."], "n_events": 6, "n_shots": 2, "n_subjects": 3}, "vid": 372}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "854.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video clip features two teams, one dressed in white and the other in red. A member of the red team attempts to prevent a player from the white team from scoring a goal. Despite the red team's efforts, the white team successfully scores a goal, to which the audience reacts with enthusiastic applause and cheer. The crowd seems particularly delighted with the goal, possibly because they are backing the white team. They express their joy towards the team which scored and for the player who made the score.", "text": "In a crowded basketball arena, a player in a white jersey jumps and dunks the ball into the hoop while being closely guarded by players in red jerseys. The crowd and players on the bench react with excitement, raising their arms and cheering. The scoreboard shows the game tied at 90-90 in the 2nd quarter, with 0.2 seconds remaining. The player who made the dunk lands and is immediately surrounded by teammates and opponents, all celebrating the successful play.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 854, "extra_info": {"duration": 5.0, "events": ["Player from the red team attempts to block the white team member.", "White team member shoots towards the goal.", "Red team fails to prevent the goal.", "White team scores a goal.", "Audience reacts with enthusiastic applause.", "Crowd cheers loudly.", "Fans express joy towards the white team.", "Supporters celebrate the player who scored."], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 854}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "736.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The golf ball rolls into the hole, a foot steps forward, and then a hand picks the ball out of the hole and leaves the frame.", "text": "A golf ball rolls towards the hole on a green surface, with a yellow flagstick in the hole and a person's foot visible in the background. The ball continues to roll and eventually falls into the hole. The person then reaches down and picks up the ball from the hole. The person stands up and walks away, leaving the hole and flagstick in the frame.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 736, "extra_info": {"duration": 6.84, "events": ["Golf ball rolls into the hole.", "Foot steps forward.", "Hand picks the ball out of the hole."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 736}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "625.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Several people holding weapons passed by the corner of a house, one of them holding a long rifle provided cover at the window, then two others holding rifles approached the front door. After that, a person wearing a hat, holding a pistol, approached the door from another direction. The camera panned left and right around the three individuals in front of the door. Then, someone inside the house opened the door, and the three at the entrance quickly rushed into the house. The door slowly closed, and a hand reached out to block it.", "text": "A group of armed individuals, dressed in black tactical gear and helmets, approach a house with a car parked in the driveway. They move quickly towards the front door, with one person leading and others following closely behind. The group reaches the front door and begins to position themselves around it, with one person aiming a gun at the door while others prepare to enter. The individuals continue to secure the area, with one person attempting to open the door while others maintain their positions with guns drawn. The door is eventually opened, and the group enters the house, with the last person stepping inside.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 625, "extra_info": {"duration": 22.27, "events": ["Several people holding weapons passed by a house corner.", "One person provided cover at the window with a long rifle.", "Two individuals approached the front door with rifles.", "A person wearing a hat approached the door with a pistol.", "The camera panned around the three individuals at the door.", "Someone inside the house opened the door.", "Three individuals quickly rushed into the house.", "The door slowly closed.", "A hand reached out to block the door."], "n_events": 9, "n_shots": 1, "n_subjects": 5}, "vid": 625}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "12.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A knight on horseback engages in a sword fight with a soldier, falls off the horse, and other knights and soldiers fight in the background.", "text": "In a grassy field with scattered bodies, several soldiers on horseback are engaged in combat. A soldier on a brown horse charges towards another soldier on a white horse. The brown horse collides with the white horse, causing the white horse to fall. The brown horse continues to move forward, and the white horse is seen lying on the ground. The brown horse then steps over the fallen white horse, continuing its movement.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 12, "extra_info": {"duration": 2.17, "events": ["A knight on horseback engages in a sword fight with a soldier.", "The knight falls off the horse.", "Other knights and soldiers fight in the background."], "n_events": 3, "n_shots": 1, "n_subjects": 4}, "vid": 12}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "70.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The girl with pink hair is saying something as she walks backwards, the girl with black hair blinks as she looks ahead. The camera cuts to the top of a bookshelf, a hand reaches up, not able to reach anything, then two hands reach up, still unable to reach anything.", "text": "In a kitchen with shelves filled with various items and a refrigerator in the background, a pink-haired girl with a surprised expression stands next to a purple-haired girl who is playing with dough on the counter. The pink-haired girl turns her head to look at the purple-haired girl, then turns back and starts to walk away. The scene then shifts to an exterior view of a building with three large cylindrical structures. A hand appears from the bottom of the frame, reaching up towards the building. Another hand joins, and both hands are seen reaching up towards the building.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 70, "extra_info": {"duration": 4.12, "events": ["The girl with pink hair walks backwards", "The girl with black hair blinks", "A hand reaches up to the top of the bookshelf", "Two hands reach up towards the top of the bookshelf"], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 70}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "959.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features two children, both outfitted in red hoodies with black squares. The child on the left is sporting a red Christmas hat and headphones while engaging in computer games and football. The child on the right, with curly and voluminous hair, presents a yellow cylindrical object and urges the child on the left to unbox it. As the child on the left lifts the lid of the box, a large grey object appearing from inside startles them and they leave the scene. Meanwhile, the child on the right bursts into a joyful dance.", "text": "A boy wearing a Santa hat and headphones is intensely playing a soccer video game on a computer, holding a game controller. The text 'Elhume' appears on the screen as the boy continues to play. The camera then shifts to show a girl in a red and black checkered shirt approaching the boy. The girl hands the boy a cup of snacks, and they both start eating. The girl then points at the computer screen and laughs, while the boy looks at the screen.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 959, "extra_info": {"duration": 9.0, "events": ["Child on the left engages in computer games.", "Child on the left plays with a football.", "Child on the left wears a red Christmas hat.", "Child on the left uses headphones.", "Child on the right presents a yellow cylindrical object.", "Child on the right urges the other to unbox it.", "Child on the left lifts the lid off the box.", "A large grey object appears from inside the box.", "Child on the left is startled and leaves the scene.", "Child on the right bursts into a dance."], "n_events": 10, "n_shots": 2, "n_subjects": 2}, "vid": 959}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "29.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man picks up a baby dressed in a puppy costume, the camera switches back and forth between the man and the baby. A boy lies on the ground watching them as the baby sticks out its tongue and licks towards the man.", "text": "A man in a red uniform with a hat bends down and picks up a small brown dog. The dog looks at the man with a frightened expression. The man looks at the dog with a stern expression. A child peeks out from behind a red and white striped curtain, watching the scene. The dog, still in the man's hands, looks at the child with a sad expression and then closes its eyes.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 29, "extra_info": {"duration": 6.0, "events": ["A man picks up a baby.", "The baby, dressed in a puppy costume, licks towards the man.", "A boy lies on the ground.", "The baby sticks out its tongue."], "n_events": 4, "n_shots": 5, "n_subjects": 3}, "vid": 29}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "129.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Snow falls from the sky as a little boy dressed in red looks ahead with a smile. He moves forward, and the camera angle switches to show the opposite view, where four children stand on each side, wearing ice skates and holding hands on the ice. The little boy in red skates into the midst of them and reaches out to hold the hand of a nearby child.", "text": "A boy wearing a brown hat and red jacket stands in a snowy landscape with a log in the background. He smiles and starts to move forward. The boy continues walking and exits the frame. The scene then shows a group of children standing in a line on the ice, holding hands. The boy skates towards the group, joins them, and they all stand together in a line. The camera then focuses on the boy's hand as he shakes hands with another child wearing an orange sleeve.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 129, "extra_info": {"duration": 5.38, "events": ["Snow falls from the sky.", "A little boy smiles.", "The little boy moves forward.", "Four children stand on the ice.", "The children hold hands.", "The little boy in red skates into the group.", "The little boy reaches out.", "The little boy holds the hand of another child."], "n_events": 8, "n_shots": 3, "n_subjects": 2}, "vid": 129}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "706.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "One person sprinkled some salt onto the vegetables in a glass bowl and then used a pair of spatulas, one in each hand, to toss and turn them.", "text": "In a kitchen setting, a person is seen preparing a salad in a glass bowl placed on a countertop. The bowl contains various ingredients including lettuce, cherry tomatoes, and cucumber slices. The person sprinkles seasoning over the salad using their hand. They then pick up two wooden salad servers and begin to mix the salad thoroughly. The person continues to mix the salad, ensuring all ingredients are well combined. After mixing, the person sprinkles more seasoning over the salad.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 706, "extra_info": {"duration": 16.35, "events": ["Person sprinkled salt onto the vegetables.", "Person picked up a pair of spatulas.", "Person used spatulas to toss the vegetables.", "Person used spatulas to turn the vegetables."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 706}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "383.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the forest, a group of people are running forward in succession. The camera switches to a man rowing backwards with both hands, turning his head to look behind him and then to his left. The scene shifts to a woman sitting on a wooden plank, floating on the water, turning her head to look behind her before turning forward again.", "text": "A group of people, dressed in various colorful outfits, run through a dense forest, moving quickly and appearing to be in a hurry. The scene then shifts to a boy rowing a boat on a river, turning his head to look back. The camera then shows a woman in a white dress sitting on a floating platform in the river, looking around and then reaching out her hand towards the water.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 383, "extra_info": {"duration": 8.75, "events": ["A group of people runs forward in the forest", "A man rows backwards with both hands", "The man turns his head to look behind", "The man turns his head to his left", "A woman sits on a wooden plank on water", "The woman turns her head to look behind", "The woman turns her head forward"], "n_events": 7, "n_shots": 3, "n_subjects": 3}, "vid": 383}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "319.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the deck of a ship, a man in blue clothes drew a line in front of several items with chalk, then the man stood up, placed the chalk in his left chest pocket, took a step forward, stood on a piece of equipment, and crossed his arms, looking ahead.", "text": "A man, dressed in light blue pants and a matching shirt, is kneeling on a wooden deck of a ship, placing a white board on the ground next to two other boards and some metal brackets. He then stands up and walks towards the ship's engine, which is surrounded by ropes and other equipment. The man reaches the engine, touches it, and then turns around to stand with his arms crossed, looking out at the sea.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 319, "extra_info": {"duration": 10.75, "events": ["A man drew a line with chalk", "The man stood up", "The man placed the chalk in his pocket", "The man took a step forward", "The man stood on a piece of equipment", "The man crossed his arms", "The man looked ahead"], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 319}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "165.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "During the night, in a town, a car quickly moved forward after turning at an intersection. A second car was reversing and followed the first car's direction after turning at the same intersection, swaying from side to side as it traveled on the road. The reversing car first knocked over items on the right side of the road and then crashed into vehicles parked on the left. The camera then cuts to the road where the car in front sped past with silhouettes visible inside, followed by the reversing car, swaying as it chased after the leading vehicle.", "text": "In a quiet suburban neighborhood at night, several cars are seen driving on the road. A green car and a yellow car are driving side by side, with the green car slightly ahead. The green car begins to lose control and veers towards the yellow car. The green car collides with the yellow car, causing the yellow car to spin out of control. The green car continues to spin and eventually crashes into a utility pole. The camera then shifts to show a large truck with three people inside, driving down the same road. The truck follows the green car, which is now driving away from the scene.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 165, "extra_info": {"duration": 5.38, "events": ["A car turned at an intersection.", "The car quickly moved forward.", "A second car reversed direction.", "The second car followed the first car.", "The second car swayed while traveling.", "The second car knocked over roadside items.", "The second car crashed into parked vehicles.", "The leading car sped past."], "n_events": 8, "n_shots": 2, "n_subjects": 2}, "vid": 165}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "809.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases a black and white dog being leashed by its owner, with a grey basket or container affixed on the rear wheel of a bicycle. Throughout the video, the owner rides the bicycle, prompting the dog to jump into the basket from the ground, and subsequently jump back onto the ground from the basket.", "text": "A man stands next to a bicycle with a crate attached, surrounded by trees and greenery. He is wearing a dark shirt, a cap, and sunglasses, and he gestures with his hands while talking. The scene shifts to the man, now wearing a red shirt and a helmet, standing on a dirt road with a bicycle and a dog on a leash. The dog circles around the man, who then lifts the dog and places it in the bicycle's crate. The man gets on the bicycle and starts riding down the dirt road, with the dog in the crate. The video ends with the man standing next to the bicycle again, continuing to talk.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 809, "extra_info": {"duration": 8.0, "events": ["Owner leashes the dog.", "Owner affixes grey basket on the bicycle rear wheel.", "Owner rides the bicycle.", "Dog jumps into the basket from the ground.", "Dog jumps onto the ground from the basket."], "n_events": 5, "n_shots": 3, "n_subjects": 2}, "vid": 809}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "201.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman and a man were initially talking, and then the woman turned around and raised her hand to call a taxi.", "text": "A woman is talking to a man while a yellow taxi passes by in the background. She wipes her eyes with her hand. She then turns around and starts walking away from the man. The woman raises her hand as she continues to walk.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 201, "extra_info": {"duration": 3.63, "events": ["A woman and a man were talking.", "The woman turned around.", "The woman raised her hand.", "The woman called a taxi."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 201}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "35.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A large and fluffy dog was scared and backed away by a black dog, revealing a little dog hiding beneath. The little dog hung its head and walked forward, as two black dogs from behind approached, encircling the little dog with the black dog in front. The black dogs lowered their heads, watching the little dog, which became frightened.", "text": "In a garden with colorful flowers, a large gray and white dog turns its head to look at a black bird standing in front of it. The dog then steps back, revealing a small brown dog with a yellow bow on its head. The small dog starts to move forward, and the two black birds, one on each side, begin to step back. The small dog continues to move forward, and the birds keep retreating, creating a path for the small dog.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 35, "extra_info": {"duration": 4.75, "events": ["A large and fluffy dog backed away.", "A black dog scared the large and fluffy dog.", "A little dog was revealed hiding beneath the large and fluffy dog.", "The little dog hung its head.", "The little dog walked forward.", "Two black dogs approached from behind.", "The black dogs encircled the little dog.", "The black dogs lowered their heads.", "The black dogs watched the little dog.", "The little dog became frightened."], "n_events": 10, "n_shots": 1, "n_subjects": 4}, "vid": 35}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "382.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person reaches out to stack and pick up the books from the table, moves forward, and gathers the books from another table. The camera switches to a wide angle as the man turns around with his back to the camera and walks forward, collecting the books from the tables in turn. The camera then switches back to the man retracting his hand.", "text": "A man in a black suit and white shirt is holding a stack of papers in his hands. He bends down and picks up a few more papers from the wooden benches in front of him. He then stands up and starts walking through the room, which has wooden benches and large windows with curtains. The man continues to walk, picking up more papers from the benches as he moves. He reaches another bench and picks up a few more papers.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 382, "extra_info": {"duration": 11.5, "events": ["A person reaches out to stack books on the table.", "A person picks up books from the table.", "A person moves forward.", "A person gathers books from another table.", "The man turns around with his back to the camera.", "The man walks forward.", "The man collects books from the tables in turn.", "The man retracts his hand."], "n_events": 8, "n_shots": 3, "n_subjects": 1}, "vid": 382}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "846.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "This video depicts an individual clad in black attire, carrying a red bag, engaging in conversation with a few drivers on the road. Initially, the individual attempts to converse with the driver of a white car; however, the outcome seems unfavorable. They continue their efforts to communicate with another driver of a white vehicle, but again, it appears unsuccessful. The video suggests that the individual is trying to secure a free ride.", "text": "A man with a red backpack walks towards a white car on a busy road, raising his hand to hail a ride. He continues to wave his hand as he approaches the car, but the car does not stop. The man then turns and walks towards a black car, raising his hand again to hail a ride. The black car stops, and the man bends down to pick up an item from the ground. Text appears on the screen, expressing frustration and disappointment. The man then walks away from the black car, and the video ends with him walking back towards the white car.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 846, "extra_info": {"duration": 11.83, "events": ["Individual wearing black approaches white car.", "Individual attempts conversation with first white car driver.", "First conversation appears unsuccessful.", "Individual approaches another white vehicle.", "Individual attempts conversation with second vehicle's driver.", "Second conversation seems unsuccessful.", "Individual carries red bag throughout."], "n_events": 7, "n_shots": 2, "n_subjects": 1}, "vid": 846}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "397.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man is sleeping on his side, and the scene switches to a boy riding a bicycle on the road.", "text": "A man is lying in bed, covered with a checkered blanket, and appears to be sleeping peacefully. The scene then transitions to a suburban neighborhood with houses and trees in the background. A person wearing a red jacket and a helmet rides a yellow bicycle down the street, moving away from the camera.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 397, "extra_info": {"duration": 5.75, "events": ["The man falls asleep.", "The boy rides a bicycle on the road."], "n_events": 2, "n_shots": 1, "n_subjects": 2}, "vid": 397}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "994.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In this video, an individual dressed in a white shirt, black tie, and black pants approaches a vending machine, clutching a piece of paper bearing a black circle. The individual places the paper on the glass cabinet door of the machine. Astonishingly, they extend their right hand into the black circle on the paper and pass through the glass door, thereby directly retrieving a snack from the machine.", "text": "A man in a white shirt and tie is seen looking ahead. He walks towards a vending machine filled with various snacks and drinks. The man approaches the vending machine and begins to open its door. He lifts the door and starts to reach inside. The man places his hand inside the vending machine, appearing to search for something. He continues to reach inside, seemingly trying to retrieve an item. The man then looks up, possibly to check his surroundings or to see if he has successfully retrieved the item.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 994, "extra_info": {"duration": 7.0, "events": ["Individual approaches a vending machine.", "Individual clutches a piece of paper.", "Piece of paper bears a black circle.", "Individual places the paper on the glass door.", "Individual's right hand extends towards the paper.", "Hand passes through the glass door.", "Individual retrieves a snack from the machine."], "n_events": 7, "n_shots": 4, "n_subjects": 1}, "vid": 994}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "177.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The girl opens her mouth wide, turns around to look ahead, the camera cuts to the girl climbing up, cuts to the girl stepping on the boy, cuts to the girl lying down, cuts to the boy looking down at the approaching creature, and then he squats down, picks up the creature, and jumps to toss it upwards.", "text": "A girl with red hair and a blue dress is seen looking up and speaking while holding onto a circular object. She then turns her head and starts to crawl out from a hole in the floor. The girl continues to crawl out, revealing her face and upper body. A boy, dressed in a purple and yellow outfit, is holding the girl's feet and looking up at her. The girl, now fully out of the hole, looks up at the boy. The boy, standing on a bed, throws a stuffed toy towards the girl. The girl catches the toy and looks up at the boy.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 177, "extra_info": {"duration": 6.0, "events": ["The girl opens her mouth wide.", "The girl turns around to look ahead.", "The girl climbs up.", "The girl steps on the boy.", "The girl lies down.", "The boy looks down at the approaching creature.", "The boy squats down.", "The boy picks up the creature.", "The boy jumps.", "The boy tosses the creature upwards."], "n_events": 10, "n_shots": 4, "n_subjects": 3}, "vid": 177}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "883.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Player number twenty-one from the team in yellow attire takes a corner kick. Numerous players from both teams, blue and yellow, are congregated in the penalty area. Yellow team's player number two fails to connect with the ball and it is subsequently met by player number five, who's behind and possesses a taller stature. He heads the ball and it sails over the goalie of the blue team, finding its way into the net and resulting in a score for the yellow team.", "text": "A soccer player in a yellow uniform prepares to take a free kick from the right side of the field, with a referee and another player in yellow standing nearby. The player kicks the ball, and it travels towards the goal as the goalkeeper and defenders in blue uniforms position themselves. The ball approaches the goal, and the goalkeeper dives to the right in an attempt to save it. The ball goes past the goalkeeper and into the net, resulting in a goal. The players in yellow celebrate the goal, while the players in blue react to the situation. The scene then shifts to another moment where a player in yellow prepares to take a free kick again, with the same setup as before.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 883, "extra_info": {"duration": 5.0, "events": ["Player number twenty-one takes corner kick.", "Players congregate in penalty area.", "Player number two misses the ball.", "Player number five heads the ball.", "Ball sails over the goalie.", "Ball finds its way into the net.", "Yellow team scores."], "n_events": 7, "n_shots": 1, "n_subjects": 4}, "vid": 883}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "180.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The girl turns her head to look at the yellow creature behind her. The camera pans left to reveal a small dog biting the creature. The shot switches to the dog biting down hard, then to the creature shaking its body in pain, and finally to the girl watching the camera, her brow slightly furrowed.", "text": "A large, round character with a single eye is seen looking at a girl who is lying on the edge of a cliff, holding onto the edge with one hand. The camera pans left to reveal a small animal, possibly a fox, licking the large character's body. The large character then turns its head and begins to roll down the cliff. The girl, with a concerned expression, watches the large character roll down and then turns her head to look in another direction.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 180, "extra_info": {"duration": 7.5, "events": ["The girl turns her head.", "The girl looks at the yellow creature.", "A small dog bites the yellow creature.", "The dog bites down hard.", "The yellow creature shakes its body.", "The yellow creature expresses pain.", "The girl watches the camera.", "The girl furrows her brow."], "n_events": 8, "n_shots": 4, "n_subjects": 3}, "vid": 180}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "267.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A group of navy soldiers and a bartender are in the tavern. One soldier is dancing on the table in the center; the bartender pulls the tablecloth off of the middle table. Four other soldiers jump onto the four tables at the side, while another five soldiers watch them from behind.", "text": "In a room with a bar counter in the background, a man in a white uniform and cap dances energetically on a table, while several other men in similar uniforms and a man in a vest watch and react. The man on the table continues to dance, moving his legs and arms in various directions, while the others around him start to mimic his movements. The man in the vest steps onto the table and begins to dance alongside the man on the table, and the others around them also start to join in, dancing on their own tables. The room is filled with men dancing on tables, creating a lively and synchronized dance routine.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 267, "extra_info": {"duration": 4.62, "events": ["A group of navy soldiers is in the tavern.", "A navy soldier dances on the table.", "The bartender pulls off the tablecloth.", "Four navy soldiers jump onto the side tables.", "Five navy soldiers watch from behind."], "n_events": 5, "n_shots": 1, "n_subjects": 3}, "vid": 267}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "423.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A gentleman pours a basket of tea leaves into a large iron pot, then stirs them exaggeratedly, followed by pouring a large basin of water. The video includes subtitles that match the character's actions.", "text": "A man stands outdoors in front of a large wok, holding a basket filled with green leaves. He lifts the basket and pours the leaves into the wok. The man then uses a shovel to stir the leaves in the wok. He continues to stir the leaves vigorously. The man pours water from a metal basin into the wok, causing the leaves to float and mix with the water. He then steps back and looks at the wok, which is now filled with water and leaves.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 423, "extra_info": {"duration": 9.95, "events": ["A gentleman pours tea leaves into a large iron pot.", "The gentleman stirs the tea leaves exaggeratedly.", "The gentleman pours a large basin of water."], "n_events": 3, "n_shots": 5, "n_subjects": 1}, "vid": 423}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "604.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person opens the cover of a book, places a green twig inside, and then closes the book. The camera remains fixed.", "text": "A stack of three old books is placed on a surface, with a pair of black glasses and a green fern leaf beside them. A hand reaches out and opens the top book, revealing the inside pages. The hand then places the green fern leaf into the open book. The hand closes the book, and the fern leaf remains inside. The hand opens the book again, revealing the fern leaf still inside. The hand closes the book once more, leaving the fern leaf inside.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 604, "extra_info": {"duration": 9.16, "events": ["Person opens the book cover.", "Person places a green twig inside the book.", "Person closes the book.", "Camera stays fixed."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 604}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "137.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Four children are playing baseball. A boy in a yellow shirt and a girl with brown hair are standing on a mound of dirt. On the right, a black-skinned boy points at the boy in the yellow shirt while speaking. The boy in the yellow shirt initially looks at the black-skinned boy, then turns his head towards the direction of the camera, revealing a puzzled expression. The girl first looks towards the direction of the camera, and then turns her gaze towards the boy in the yellow shirt. The frame freezes and the camera pans slightly to the right, revealing the edge of the picture on the right side.", "text": "In a grassy field with a clear sky and some clouds, four children are standing in a circle. The child in the yellow shirt and the child in the orange shirt are facing each other, while the child in the green shirt and the child in the red shirt are standing on either side. The child in the orange shirt points at the child in the yellow shirt and says something. The child in the yellow shirt turns his head to look at the child in the orange shirt and responds. The child in the orange shirt then turns his head to look at the child in the green shirt and continues speaking. The child in the green shirt turns her head to look at the child in the yellow shirt and smiles.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 137, "extra_info": {"duration": 4.38, "events": ["Four children play baseball.", "A boy in a yellow shirt stands on a mound of dirt.", "A girl with brown hair stands on a mound of dirt.", "A black-skinned boy points at the boy in the yellow shirt.", "The boy in the yellow shirt turns his head towards the camera.", "The boy in the yellow shirt reveals a puzzled expression.", "The girl with brown hair looks towards the direction of the camera.", "The girl with brown hair turns her gaze towards the boy in the yellow shirt.", "The camera frame freezes.", "The camera pans slightly to the right."], "n_events": 10, "n_shots": 1, "n_subjects": 4}, "vid": 137}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "173.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man grasps with both hands, the camera cuts to the man getting into the car, then cuts to him sitting in the passenger seat, the car moves forward, it cuts again to the man picking up some items to look at them, and then he turns around holding the items, while the boy lifts his head to look at the man.", "text": "A man in a yellow shirt is seen outside a vehicle, gesturing and talking to a boy inside the vehicle. The man then climbs into the vehicle, continuing to talk to the boy. The vehicle drives through a rocky desert landscape, and a large rock appears in front of it. The man retrieves a gun from the vehicle and starts searching through the back. He finds a grenade and prepares to throw it. The man then climbs onto the roof of the vehicle, holding the grenade. The boy inside the vehicle looks up at the man as he stands on the roof.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 173, "extra_info": {"duration": 7.25, "events": ["The man grasps with both hands.", "The man gets into the car.", "The man sits in the passenger seat.", "The car moves forward.", "The man picks up some items.", "The man looks at the items.", "The man turns around holding the items.", "The boy lifts his head."], "n_events": 8, "n_shots": 4, "n_subjects": 2}, "vid": 173}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "529.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two kittens were squatting at the corner, observing. The shadow of the cat on the right appeared on the adjacent glass. It leaned over to observe carefully, then suddenly leaped forward, extending its paw to interact with the other kitten. It then stood with one paw on the glass, while the other kitten reached out a paw and looked up at it.", "text": "In a kitchen setting, a spotted cat approaches a glass door and looks at its reflection. The cat then jumps up and starts pawing at the glass door. The cat continues to paw at the glass, trying to interact with its reflection. The cat stands on its hind legs, using its front paws to push against the glass door. Meanwhile, a smaller cat, which was initially hiding behind the door, starts to emerge and watch the spotted cat.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 529, "extra_info": {"duration": 10.7, "events": ["Two kittens were squatting at the corner.", "The shadow of the cat appeared on the glass.", "The cat on the right leaned over.", "The cat on the right leaped forward.", "The cat on the right extended its paw.", "The cat on the right stood with one paw on the glass.", "The other kitten reached out a paw.", "The other kitten looked up."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 529}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "700.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the gym, a lady swings a rope up and down with alternating hands, and after a while, she changes to swinging the rope up and down with both hands together. Next to her, a gentleman is kneeling on one leg and clapping his hands, while a lady first stands and then bends over to clap her hands.", "text": "In a well-lit gym with various exercise equipment, a woman in black athletic wear is vigorously swinging two ropes up and down. Behind her, a man and a woman, both dressed in black, are kneeling and clapping their hands in encouragement. The gym has large windows allowing natural light to fill the space, and there are weights and other fitness equipment visible in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 700, "extra_info": {"duration": 10.23, "events": ["Lady swings rope with alternating hands", "Lady switches to swinging rope with both hands", "Gentleman kneels on one leg", "Gentleman claps hands", "Lady stands next to gentleman", "Lady bends over to clap hands"], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 700}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "251.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A pair of hands opens a handbag and pulls out a power cord. The camera cuts to a woman crawling forward, reaching a wall outlet to plug in the power and charge her phone.", "text": "A woman is seen unzipping and opening a beige bag, revealing its contents. She takes out a phone from the bag and places it on the floor. She then moves to a different location, possibly a different room, and bends down to pick up a phone charger from the floor. She plugs the charger into a wall socket and looks at the phone.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 251, "extra_info": {"duration": 10.0, "events": ["Hands open a handbag", "Hands pull out a power cord", "Woman crawls forward", "Woman reaches a wall outlet", "Woman plugs in the power cord", "Woman charges her phone"], "n_events": 6, "n_shots": 2, "n_subjects": 1}, "vid": 251}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "616.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a small hillside, a man rides a mountain bike downhill from left to right. Then he turns his body, the mountain bike skids to a stop, the man plants his left foot on the hillside and looks back toward the camera.", "text": "The video begins with a view of a dirt path surrounded by trees and a hilly landscape in the background. A person on a bicycle appears from the left side of the frame, riding down the path. The cyclist continues to ride down the path, moving from left to right. The cyclist then slows down and comes to a stop, dismounting from the bicycle. The cyclist stands next to the bicycle, looking around the surroundings. The video ends with the cyclist and the bicycle stationary on the path, with the hilly landscape in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 616, "extra_info": {"duration": 5.29, "events": ["A man rides a mountain bike downhill.", "The mountain bike skids to a stop.", "The man turns his body.", "The man plants his left foot on the hillside.", "The man looks back toward the camera."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 616}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "858.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts a conversation between two individuals. The individual on the right is dressed in a yellow T-shirt, a blue denim jacket, and black sneakers. He sports brown curly hair, sunglasses, and a black waist bag. On the left, the second person is dressed in a blue T-shirt, blue pants, and gray shoes. This person has a short black beard but is bald. At a certain point, both individuals stand up and the person in the blue T-shirt picks up a phone and an army green cap that were placed on a rock. They then walk across the lawn and the individual in the blue T-shirt gestures with his right hand to place the cap on his head.", "text": "A man and a woman are sitting on chairs in a backyard with a lattice fence and green grass. The woman gestures with her hands while talking to the man, who is listening attentively. The woman then stands up, and the man begins to rise from his chair. The man bends down to pick up a hat and a phone from the ground. He stands up holding the hat and phone, and starts walking towards the woman. The woman turns her head to look at the man as he approaches her.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 858, "extra_info": {"duration": 4.97, "events": ["Two individuals converse.", "Individual in yellow T-shirt stands up.", "Individual in blue T-shirt stands up.", "Individual in blue T-shirt picks up phone.", "Individual in blue T-shirt picks up army green cap.", "Individuals walk across the lawn.", "Individual in blue T-shirt gestures to place cap on head."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 858}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "152.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera tilts down for an overhead shot, where an old man looks upward, pushed in the chest with a stick. He falls downward, and as the camera angle shifts to a low angle shot, a woman is seen holding the stick at the mouth of a well. The old man falls down into the well, and the camera then switches to the well's opening, where the woman gives a smile. The scene cuts to the ground level, where the woman takes a deep red hat and replaces the black hat on the head of a short man standing before her.", "text": "A gingerbread man with a mustache and candy decorations is seen inside a large candy cane. He is pushed through the candy cane by a hand, and he struggles as he is moved through the tunnel. The gingerbread man is then seen from the outside, looking through the candy cane's opening. A woman with blonde hair, wearing a purple dress and white gloves, is seen from the inside of the candy cane, smiling and talking. The woman then takes a red and yellow object from the gingerbread man and places it on his head.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 152, "extra_info": {"duration": 6.5, "events": ["Camera tilts down for an overhead shot.", "Old man looks upward.", "Woman pushes old man in chest with stick.", "Old man falls into the well.", "Camera angle shifts to a low angle shot.", "Woman holds the stick at the well's mouth.", "Woman smiles at the well's opening.", "The camera switches to ground level.", "Woman takes a deep red hat.", "Woman replaces black hat on short man's head."], "n_events": 10, "n_shots": 4, "n_subjects": 3}, "vid": 152}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "27.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The blue character wearing a hat holds a fork and reacts as he looks ahead, the camera then cuts to another blue character who is hiding behind a wooden curtain observing, and then closes the curtain.", "text": "In a lush, green forest with a mushroom house and various vegetables in the background, a blue character wearing a straw hat and green overalls holds a pitchfork in one hand and gestures with the other while speaking. The character then lowers its hand and continues to look forward. The scene shifts to a window with wooden shutters, where the same blue character peeks through the slats, looking around. The character then closes the shutters, completely hiding from view.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 27, "extra_info": {"duration": 2.88, "events": ["The blue character holds a fork.", "The blue character reacts.", "Another blue character hides behind a curtain.", "The wooden curtain closes."], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 27}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "908.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video portrays an individual clad in a white coat who has a black wristwatch on their left wrist. In the foreground, test tubes connected to brown pipes are visible. A bottle labelled \"lime water\" can be seen on the left side of the video. Carbon dioxide is introduced into the test tube through a flexible tube. As a result, the clear liquid in the test tube becomes cloudy and bubbles emerge from the surface.", "text": "A person in a lab coat is seen handling a bottle labeled 'True Water' and a glass container with a red cap on a wooden table. The person removes the red cap from the bottle and connects a red tube to the top of the bottle. The person then connects the other end of the tube to the glass container. The camera zooms in to show the tube extending into the bottle, and text appears on the screen explaining that 'Bubble carbon dioxide through limewater.' The person then places the red cap back on the bottle.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 908, "extra_info": {"duration": 11.98, "events": ["Individual enters wearing a white coat.", "Person puts on a black wristwatch.", "Test tubes connect to brown pipes.", "Bottle labeled 'lime water' appears.", "Carbon dioxide flows through a flexible tube.", "Clear liquid in test tube turns cloudy.", "Bubbles emerge from the liquid's surface."], "n_events": 7, "n_shots": 2, "n_subjects": 1}, "vid": 908}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "721.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person rides a mountain bike on a winding path, cycling quickly uphill and downhill, turning and moving forward, then gradually moving away from the camera.", "text": "A man wearing a helmet and a backpack rides a bicycle on a narrow dirt path surrounded by dry grass and trees. He approaches a small hill and begins to lift the front wheel of his bicycle off the ground. The man continues to ride up the hill, maintaining control of the bicycle. He successfully climbs the hill and continues riding along the path. The man rides further along the path, moving away from the camera. He then turns his head to look back while continuing to ride.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 721, "extra_info": {"duration": 8.12, "events": ["Person rides mountain bike on winding path.", "Person cycles quickly uphill.", "Person cycles quickly downhill.", "Person turns on the path.", "Person moves forward.", "Person gradually moves away from the camera."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 721}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "284.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "At night, a black car moves from the left side of the frame towards the right. In the shadows by the wall, a man wearing a hat stands with his arms folded, watching as the black car drives into a well-lit area and comes to a stop.", "text": "A black car with its headlights on drives down a wet, dimly lit street at night, passing by other parked cars and fire hydrants. The camera then shifts to a man standing against a brick wall, wearing a hat and a dark coat, who turns his head to the left. The black car continues to drive past the man, who remains standing and watching it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 284, "extra_info": {"duration": 10.25, "events": ["A black car moves from left to right.", "A man stands by the wall.", "The man folds his arms.", "The black car drives into a well-lit area.", "The black car comes to a stop."], "n_events": 5, "n_shots": 3, "n_subjects": 2}, "vid": 284}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "713.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the snowy ground, the camera captures the backs of two people who start to run forward. Another person with a backpack appears in the shot, running behind them. The two in front run over a low wall, and the person in the lead glances back before continuing to run forward. The last person also crosses the low wall, and the three of them run along the winding road towards the forest, one after the other.", "text": "In a snowy forest, a person in a gray coat and yellow hat walks alongside a person in an orange coat and red hat. The person in the orange coat starts running, and the person in the gray coat follows. A third person, wearing a white coat, yellow hat, and carrying a brown backpack, joins the chase, running behind the other two. The three individuals continue running through the snow-covered forest, with the person in the white coat trailing behind the other two.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 713, "extra_info": {"duration": 21.66, "events": ["Two people start to run on the snowy ground.", "Another person with a backpack runs behind the two.", "The front runners cross over a low wall.", "The person in the lead glances back.", "The last person crosses the low wall.", "The three individuals run towards the forest."], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 713}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "577.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man opens the trunk of the car and turns to run forward; three puppies jump out and run ahead. The camera follows the movement; the puppies look back as they continue to run forward. The man walks on, with one puppy running up behind him, and together they look to their right. The background is a mountain road.", "text": "A man stands behind a blue car parked on a road surrounded by greenery, and he opens the trunk. As the trunk opens, three dogs jump out and start running forward. The dogs run down the road, with the man following them. The dogs continue running, and the man walks behind them. The man and the dogs approach the car again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 577, "extra_info": {"duration": 7.18, "events": ["The man opens the trunk of the car", "The man turns to run forward", "Three puppies jump out of the trunk", "The puppies run ahead", "The puppies look back while running", "The man walks on", "One puppy runs up behind the man", "The man and one puppy look to their right"], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 577}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "649.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In an ice house, the camera slowly pans from left to right, displaying a variety of colorful ice sculptures within the ice house.", "text": "The video begins with a view of an ice sculpture room, showcasing various ice structures including a house, a tree, and a car, all illuminated by colorful lights. The camera slowly pans to the right, revealing more intricate ice sculptures, such as a large ice figure and a detailed ice car. The lighting changes subtly, highlighting different parts of the sculptures and creating a dynamic visual effect. The camera continues to move, showing a variety of ice creations, including a large ice figure with a detailed face and a smaller ice structure with a wheel. The video concludes with a close-up of the ice sculptures, emphasizing their craftsmanship and the vibrant lighting.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 649, "extra_info": {"duration": 9.68, "events": [], "n_events": 0, "n_shots": 1, "n_subjects": 1}, "vid": 649}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "533.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man, wearing a black vest, stepped on a skateboard and glided forward. Then, he looked down, bent his knees, and spread his arms to jump up and switch his stance on the skateboard to continue sliding. He turned to the left and attempted to repeat the previous maneuver but failed this time, falling to the ground and sitting back. Just then, an old man approached him head-on.", "text": "A man wearing a black cap, black vest, white shirt, and blue jeans is skateboarding on a paved path lined with cherry blossom trees. He attempts a trick by lifting the front of his skateboard and flipping it. The skateboard flips in the air, and he tries to land back on it. The skateboard lands on the ground, and he loses balance, falling to the ground. He then gets back up and continues skateboarding.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 533, "extra_info": {"duration": 6.37, "events": ["The man stepped on a skateboard.", "The man glided forward.", "The man looked down.", "The man bent his knees.", "The man spread his arms.", "The man jumped up.", "The man switched his stance on the skateboard.", "The man turned to the left.", "The man attempted to repeat the maneuver.", "The man fell to the ground."], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 533}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "565.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a vast expanse of grassland. In the distance, a person is descending from the sky with a paraglider, gliding toward the camera. With bent legs, they slide forward on the grassland. The camera pans to the right as the person continues to glide. Next to them, another person is holding a phone to film the scene, while another man in a black short-sleeved shirt turns to look at the camera, walks a few steps to the right, and raises his hands to gesture.", "text": "A person with a parachute is seen descending from the sky over a grassy field with a clear blue sky and scattered clouds. The person continues to descend, getting closer to the ground as they approach the field. The person lands on the ground, kneeling and then standing up. A woman with tattoos on her arm is seen taking photos of the person with a smartphone. Another person on the ground raises their hands and starts walking towards the camera. The scene then switches to a person in a red shirt standing on a grassy field with a mountainous background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 565, "extra_info": {"duration": 7.8, "events": ["Person descends from sky with paraglider.", "Person slides forward on grassland.", "Camera pans to the right.", "Second person holds phone to film.", "Man in black shirt turns to camera.", "Man walks a few steps to the right.", "Man raises hands to gesture."], "n_events": 7, "n_shots": 1, "n_subjects": 3}, "vid": 565}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "454.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "There was a circular damage on the ground. A person laid a long strip of gauze mesh over the damaged area, then used a tool to tear the mesh. After that, they layered another sheet of mesh and repeated the process until three layers of mesh were applied. The entire damaged ground was completely covered. Then the person used a tool to spread putty over the mesh and smoothed it out repeatedly.", "text": "A circular hole is visible on a gray surface. A person with tattooed hands places a piece of mesh over the hole and begins to apply adhesive around the edges using a trowel. The person continues to spread the adhesive evenly around the mesh, ensuring it is securely attached. The person then applies more adhesive to the mesh and smooths it out with the trowel. The person repeats the process, applying additional adhesive and smoothing the mesh. The person continues to ensure the mesh is properly adhered and smoothed out.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 454, "extra_info": {"duration": 9.13, "events": ["A circular damage appeared on the ground.", "A person laid a strip of gauze mesh over the area.", "The person used a tool to tear the mesh.", "The person layered another sheet of mesh.", "The process repeated until three layers were applied.", "The damaged ground got completely covered by mesh.", "The person spread putty over the mesh.", "The person smoothed out the putty repeatedly."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 454}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "282.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The shot pans from a low angle upwards, capturing a woman looking down from a railing. A man runs over and also looks down. The shot then switches to a bird's-eye view, showing a person lying on the ground in the middle of a building. The man pulls the woman away from the railing, and the shot changes to a level perspective as the man is seen leading the woman away.", "text": "A woman stands on a fire escape, holding onto the railing and looking down. A man appears behind her, also holding the railing and looking down. The woman turns her head to look at the man, and they both look down. The camera shifts to show a person lying on the ground below, looking up at them. The woman and the man continue to look down, with the woman turning her head to look at the man again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 282, "extra_info": {"duration": 7.0, "events": ["Camera pans upwards", "Woman looks down from railing", "Man runs over to railing", "Shot switches to bird's-eye view", "Person lies on the ground", "Man pulls woman away from railing", "Shot changes to level perspective", "Man leads woman away"], "n_events": 8, "n_shots": 3, "n_subjects": 2}, "vid": 282}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "399.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man falls backwards, and the camera switches to a woman standing by the water, watching as water splashes around. The people nearby look towards the man who has fallen into the water. The camera then switches to a man and a woman embracing each other in a garden, turning their heads to look to the left of the camera.", "text": "A man in a black shirt is seen near a pool, and he suddenly falls into the water, creating a large splash. A woman in a white dress with the number 8 on the back stands nearby, watching the man fall. The water splashes high into the air as the man hits the pool. The woman continues to stand and watch the splash, while a crowd of people in the background, decorated with Halloween lights and decorations, observe the scene. The scene then shifts to a man in a red and white striped shirt and a woman in a white dress with a fox tail, standing by the pool. The man turns his head to look at the woman, who then turns her head to look at him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 399, "extra_info": {"duration": 2.5, "events": ["The man falls backwards into the water.", "Water splashes around from the impact.", "People nearby turn their heads towards the man in the water.", "A man and a woman embrace each other in a garden.", "The man and woman turn their heads to the left."], "n_events": 5, "n_shots": 3, "n_subjects": 3}, "vid": 399}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "112.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the bus, a plump woman sitting in her seat looked towards the front seat where a little girl with red hair turned around and spoke to the plump woman. The plump woman sighed, and to their left, a man wearing a hat glanced at the packaged food in his hand and then reached out to give it to the red-haired girl.", "text": "A woman with a large afro, wearing a green cardigan, pink overalls, and a headband, is sitting on a bus and talking to a girl with pink hair sitting in front of her. The girl turns her head to look at the woman and then looks down. The camera then shifts to a boy with a red hat and a blue box labeled 'Make-Off' in front of him. The boy lowers the box and looks at it with a sad expression. The camera then shows the boy's head popping out from behind the woman, and he raises his hand. The camera returns to the woman and the girl, who are still sitting on the bus.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 112, "extra_info": {"duration": 9.75, "events": ["Plump woman looked towards front seat", "Little girl turned around", "Little girl spoke to plump woman", "Plump woman sighed", "Man glanced at packaged food", "Man reached out to little girl", "Man offered packaged food to little girl"], "n_events": 7, "n_shots": 3, "n_subjects": 3}, "vid": 112}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "539.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The image on the screen is shaking. A man in a black short-sleeve shirt reaches upwards with both hands, looking up. A yellow box falls, and the man catches it and looks towards the camera. The shot switches to the man opening the box, then to him taking out a hamburger, to him taking a bite, to him taking a second bite, to him taking a third bite, and to the man stuffing the rest into his mouth and finishing it.", "text": "A person wearing headphones looks up at various food items displayed above their head, including a burger, fries, chicken wings, pretzels, a milkshake, a drink, and a snack bar. They then pick up a McDonald's Big Mac box, open it, and take out the burger. The person shows the burger to the camera, takes a bite, and continues eating it. After finishing the burger, they put the box down and look up at the food items again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 539, "extra_info": {"duration": 10.97, "events": ["The image on the screen shakes.", "A man reaches upwards with both hands.", "A yellow box falls.", "The man catches the yellow box.", "The man looks towards the camera.", "The man opens the yellow box.", "The man takes out a hamburger.", "The man takes a bite of the hamburger.", "The man takes a second bite of the hamburger.", "The man finishes the hamburger by stuffing the rest into his mouth."], "n_events": 10, "n_shots": 7, "n_subjects": 1}, "vid": 539}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "249.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman sat on the bed, turned her head to look at the boy lying on the bed, then she turned around, got off the bed and left. The boy moved his head a bit and watched the direction in which the woman left.", "text": "A woman is sitting on a bed next to a child who is lying down with an orange mask on their face. The woman, dressed in a dark outfit, begins to turn her body to the left. She continues to turn and starts to get off the bed. The woman fully gets off the bed and moves out of the frame to the left. The child, still lying in bed with the orange mask, slightly turns their head to the right. The woman then reappears on the left side of the frame, looking towards the child.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 249, "extra_info": {"duration": 5.88, "events": ["The woman sat on the bed.", "The woman turned her head to look at the boy.", "The woman turned around.", "The woman got off the bed.", "The woman left.", "The boy moved his head a bit.", "The boy watched the direction the woman left."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 249}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "356.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two men are sitting on the couch watching TV. The camera cuts to their front, and they each raise one hand, swinging it up and down, as the man on the left turns to look at the man on the right.", "text": "Two men are sitting on a couch in a dimly lit room, watching a TV screen that shows a man in a white shirt. The man on the left, wearing glasses, starts mimicking the actions of the man on the TV by moving his arms and making gestures. The man on the right, wearing a gray shirt, turns his head to look at the man with glasses.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 356, "extra_info": {"duration": 5.62, "events": ["Two men sit on the couch.", "Men raise one hand each.", "Hands swing up and down.", "Man on left turns head.", "Man on left looks at man on right."], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 356}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "818.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts a football match taking place on a green football field. The teams playing are the green team and the blue team. The players in the green team are dressed in green uniforms, green socks, and black trousers, whilst those in the blue team are wearing blue uniforms, trousers, and socks. In the game, the blue team loses possession of the ball, giving the green team an opportunity to strike. The orange-dressed goalkeeper fails to save the ball, allowing the green team's player, numbered 8, to score a goal. The green team erupts in excitement, while the blue team fails to defend their goal resulting in the green team scoring a point.", "text": "On a soccer field, players in blue and yellow jerseys are actively engaged in a game. The blue team is advancing the ball towards the goal, with players from both teams moving strategically. The blue team player dribbles the ball past the midfield, evading yellow team defenders. As the blue team approaches the penalty area, the goalkeeper in red prepares to defend. The blue team player takes a shot at the goal, and the ball goes past the goalkeeper into the net. The blue team players celebrate the goal while the yellow team regroups. 
The game continues with both teams positioning themselves for the next play.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 818, "extra_info": {"duration": 10.01, "events": ["The football match takes place.", "The blue team loses possession.", "The green team seizes opportunity.", "The green player number 8 shoots.", "The orange-dressed goalkeeper fails to save.", "The ball crosses the goal line.", "The green team scores a goal.", "The green team celebrates excitedly.", "The blue team fails to defend."], "n_events": 9, "n_shots": 2, "n_subjects": 3}, "vid": 818}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "762.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The fighter jet moves forward, taking off into the sky, with thick smoke appearing on the runway as the camera pans to the right.", "text": "A jet aircraft is positioned on the deck of an aircraft carrier, with crew members in green and red uniforms standing nearby. The aircraft begins to accelerate, and smoke starts to emit from its rear as it gains speed. The aircraft continues to accelerate, with more smoke trailing behind it, and it starts to lift off the deck. The aircraft ascends into the air, leaving the deck behind, while the crew members watch. The aircraft flies away from the carrier, and the deck is left with smoke and the crew members still observing.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 762, "extra_info": {"duration": 6.97, "events": ["The fighter jet moves forward.", "The fighter jet takes off into the sky.", "Thick smoke appears on the runway."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 762}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "624.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the wooden floor, a person sits cross-legged in front of a square cushion, first picking up headphones and putting them on, then picking up the drink beside the square cushion and taking a sip through the straw before placing it back in its original spot. Lastly, the scene concludes with this person bringing both hands together in a gesture of salute or prayer.", "text": "A woman is sitting on a gray exercise mat on a wooden floor, with a smartphone, a pair of headphones, and two pink dumbbells placed on the mat. She picks up the headphones and puts them on. She then reaches for the orange juice bottle, opens it, and takes a sip. After drinking, she places the bottle back on the floor. She clasps her hands together and starts meditating. Finally, she removes the headphones and places them back on the mat.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 624, "extra_info": {"duration": 13.8, "events": ["A person sits cross-legged on the wooden floor.", "The person picks up headphones.", "The person puts on the headphones.", "The person picks up a drink.", "The person takes a sip through the straw.", "The person places the drink back.", "The person brings hands together in a gesture."], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 624}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "416.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman, holding a sickle, spins in a circle. The camera cuts to the woman as she swings her hand and slices open a bag, from which corn kernels fall out. The shot then switches to show the front legs of a pig standing up at the edge of its pen. The camera zooms in as the pig's mouth quivers slightly. The woman pours the corn kernels into a container in front of her. As the pig's body stirs, the woman, holding a small white scoop, gently pats the pig's head. The pig backs away. The woman is wearing brown boots and uses her foot to kick the corn kernels on the ground into the container. A white English subtitle appears in the middle of the frame while the woman, with her hands on her hips, walks towards the camera. The camera switches between these scenes, all set against the backdrop of a pigsty.", "text": "A woman holds a tool in her hand and uses it to open a white bag, causing corn to pour out. A pig in a pen looks at the camera with its mouth open, and the text 'WOW!' appears on the screen. The woman continues to pour corn from the bag into a container. The pig, now with a hat on its head, looks at the camera. The woman steps on the corn, and the text 'HANGIN' 'ROUND THIS FAMOUS PLACE' appears on the screen. 
The woman stands with her hands on her hips, wearing a hat and a brown outfit, and the text 'HANGIN' 'ROUND THIS FAMOUS PLACE' is displayed.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 416, "extra_info": {"duration": 10.3, "events": ["Woman spins holding a sickle.", "Woman slices open a bag with sickle.", "Corn kernels fall out of the bag.", "Pig’s mouth quivers slightly.", "Woman pours corn kernels into a container.", "Pig stirs in its pen.", "Woman pats pig's head with a scoop.", "Pig backs away from the woman.", "Woman kicks corn kernels into the container.", "Woman walks towards the camera."], "n_events": 10, "n_shots": 8, "n_subjects": 2}, "vid": 416}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "375.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man turned on the light, the camera switched to the woman gasping deeply. The man walked away, the shot switched to the man walking forward with his back to the camera, then turning around. It switched to the woman, who was panting and said something.", "text": "A man wearing a black leather jacket and gloves turns on a hanging light bulb in a dimly lit room with brick walls. He then approaches a woman who is tied to a chair with blue restraints, and she begins to speak. The man walks towards a door at the back of the room, turns around, and looks back at the woman. The woman continues to speak while looking forward.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 375, "extra_info": {"duration": 9.12, "events": ["The man turned on the light.", "The woman gasped deeply.", "The man walked away.", "The man turned around.", "The woman panted.", "The woman said something."], "n_events": 6, "n_shots": 4, "n_subjects": 2}, "vid": 375}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "211.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The light is twinkling; the girl turns her head, and the boy turns his head too. The scene shifts to the green perspective of an object, with plants swaying. The girl lowers her head to pick up the object, turns around, and hands it to the boy, then turns and leaves to the left.", "text": "In a dimly lit, cave-like environment with neon green and pink lights, a woman wearing a headlamp and a dirty outfit looks around and speaks. The camera then shifts to a view through the woman's headlamp, showing a green-lit plant area. The woman turns her head and picks up a tool from a table. She then walks towards a boy, who is also wearing a headlamp and a dirty outfit, and places the tool in front of him. The woman then turns and walks away.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 211, "extra_info": {"duration": 7.25, "events": ["The light twinkles", "The girl turns her head", "The boy turns his head", "Plants sway", "The girl lowers her head", "The girl picks up the object", "The girl turns around", "The girl hands the object to the boy", "The girl turns", "The girl leaves to the left"], "n_events": 10, "n_shots": 3, "n_subjects": 2}, "vid": 211}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "967.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts two individuals on the front lawn of a house. The individual on the left is seen with short hair, protective glasses, ear protectors, a green t-shirt featuring black text on the back, and plaid gray shorts. They wield a black handgun. The second individual, positioned on the right and towards the fence, is dressed in a black and white cap, dark green t-shirt, blue jeans, and ear protectors. They instinctively shield their right arm before collapsing onto the grass, subsequently kneeling. In response, the person on the left rushes over to their side.", "text": "In a backyard with a house covered in ivy, a man in a green shirt and jeans approaches a wooden fence and kneels down, appearing to be in pain. He then lies on the ground, rolling over and clutching his stomach. Another man in a green sleeveless shirt and plaid shorts, wearing a head-mounted device, approaches and gestures towards the man on the ground. The man on the ground starts to get up, while the man with the head-mounted device continues to gesture and move around. The man on the ground finally stands up and moves towards the fence.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 967, "extra_info": {"duration": 8.0, "events": ["The individual on the left wields a black handgun.", "The second individual shields their right arm.", "The second individual collapses onto the grass.", "The second individual subsequently kneels.", "The individual on the left rushes over."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 967}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "919.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person with black hair, donned in a red T-shirt, black shorts, and black shoes is running towards an orange cone. This individual starts at the beginning of a track and halts at a pontoon. Another participant, wearing a T-shirt patterned with red and pink colors bearing a world logo, an Olympic's emblem, and a 'UCS' insignia on its back, appears from this person's right. The person on the left is merely stretching their legs while the individual wearing the green shoes is engaging in jumps to warm up their body in preparation for the forthcoming exercises.", "text": "A man in a red shirt and black shorts is running on a red track surrounded by green trees and grass. He continues running until he reaches the starting blocks, where he stops and stands still. The man then places his hands on his hips and prepares to start running. Another man in a red shirt and black shorts with green shoes approaches the starting blocks. The first man bends down and positions himself on the starting blocks, while the second man gets ready to start running.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 919, "extra_info": {"duration": 10.01, "events": ["Person runs towards orange cone.", "Individual halts at pontoon.", "Participant appears from right.", "Person stretches legs.", "Individual wearing green shoes jumps."], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 919}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "890.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "An individual dressed in orange is engaged in a shooting game. He triumphantly celebrates after making a kill, expressing his excitement with a loud exclamation of 'wow', removing his headphones, and cheerfully jumping onto the bed.", "text": "The video begins with a black screen. The scene transitions to a first-person view of a player aiming a sniper rifle at the sky, with a small inset video of a person in an orange shirt playing the game. The player's scope focuses on a bird in the sky, and the bird is hit, resulting in a '4 KILLS' notification. The player then moves indoors, aiming the rifle at a room with a chair and a window. The player fires the rifle, and the person in the inset video reacts by moving their body and standing up. The video ends with a black screen.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 890, "extra_info": {"duration": 5.0, "events": ["Individual in orange engages in shooting game.", "Individual makes a kill in the game.", "Individual exclaims 'wow' loudly.", "Individual removes headphones.", "Individual jumps onto the bed."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 890}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "111.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man wearing a hat and three red-haired women looked through the window to the outside. The man said something, and then all four of them ducked down out of sight beneath the window.", "text": "A group of characters, including a person with glasses and a green hat, are seen through a window, looking inside. The person with glasses appears to be speaking. The characters then begin to move downward, gradually disappearing from the window. The window is left empty, showing the background of a street with buildings and lights.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 111, "extra_info": {"duration": 2.5, "events": ["A man wearing a hat looked through the window.", "Three red-haired women looked through the window.", "The man said something.", "All four individuals ducked down beneath the window."], "n_events": 4, "n_shots": 1, "n_subjects": 4}, "vid": 111}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "844.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "This video clip depicts an individual eating in a car that is not being driven by anyone. One might perceive that there may be potential for an accident or other hazards due to the absence of a driver.", "text": "A man is sitting in the driver's seat of a car, wearing a seatbelt, and eating a sandwich wrapped in paper. He continues to eat the sandwich, taking large bites. The camera then pans to the left, showing the empty front passenger seat and the car's interior. The scene transitions to a black background with a cracked glass pattern, and text appears that reads 'If it can't wait, pull over.' along with the Road Safety logo.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 844, "extra_info": {"duration": 6.0, "events": ["Individual eats in the car.", "Car moves without a driver."], "n_events": 2, "n_shots": 2, "n_subjects": 1}, "vid": 844}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "331.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man enters the driver's seat of a car, picks up his phone, and places it on the mount.", "text": "A man wearing a brown jacket and a patterned tie approaches a car and opens the driver's side door. He then sits down in the driver's seat, holding a phone in his right hand. The man places the phone on the car's dashboard, which is visible through the steering wheel and dashboard area.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 331, "extra_info": {"duration": 3.88, "events": ["A man enters the driver's seat.", "The man picks up his phone.", "The man places the phone on the mount."], "n_events": 3, "n_shots": 2, "n_subjects": 1}, "vid": 331}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "752.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A group of bats are flying in the night sky against the backdrop of a bright full moon.", "text": "The video features a bright full moon in the night sky, with a dark blue background and scattered stars. Silhouettes of bats are seen flying around the moon, moving in various directions and patterns. The bats' movements are dynamic, with some flying closer to the moon and others moving further away. The scene remains consistent with the moon as the central focus, while the bats continue their flight patterns around it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 752, "extra_info": {"duration": 27.03, "events": ["A group of bats flies in the night sky.", "The moon reaches full brightness."], "n_events": 2, "n_shots": 1, "n_subjects": 1}, "vid": 752}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "838.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In this video clip, a football match is taking place between two teams. A player from the red team strikes the ball towards the goal, but the goalkeeper from the blue team skillfully intercepts it, preventing the red team from scoring. If the goalkeeper had failed to catch the ball, allowing it to cross the goal line, the red team's score would have increased by one point. Despite the red team's efforts, their score remained unchanged, while the blue team's score also remained the same.", "text": "The soccer match takes place in a stadium filled with spectators. The red team is attacking the blue team's goal, with several players from both teams positioned near the penalty area. The red team player in the center of the penalty area receives the ball and prepares to pass it. The red team player passes the ball to a teammate on the left side of the penalty area. The teammate receives the ball and takes a shot towards the goal. The blue team's goalkeeper dives to the right in an attempt to save the shot. The ball goes past the goalkeeper and into the net, resulting in a goal for the red team.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 838, "extra_info": {"duration": 6.0, "events": ["A football match takes place.", "Player from red team strikes ball.", "Football heads towards blue team's goal.", "Goalkeeper from blue team intercepts ball.", "Goalkeeper prevents scoring.", "Red team's score remains unchanged.", "Blue team's score remains the same."], "n_events": 7, "n_shots": 1, "n_subjects": 3}, "vid": 838}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "178.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "One person reaches out to take a cookie; the camera cuts to this person eating the cookie. It then cuts to a group of people taking cookies one by one, before cutting to a woman looking around, with a crowd swaying behind her. The scene switches to a girl jumping up in cheer, a boy turns his head to look at her, and the girl lifts the boy into an embrace.", "text": "A tray filled with intricately decorated cookies is shown, and a hand reaches for one of the cookies. The scene shifts to a close-up of a person's mouth as they eat the cookie. The camera then shows a tray filled with heart-shaped cookies, and two hands reach for them. The scene changes to a woman in a blue outfit who is talking and gesturing with her hands, while the background shows people dancing. The camera then shows a boy and a girl standing together, with the girl jumping and hugging the boy.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 178, "extra_info": {"duration": 7.75, "events": ["One person takes a cookie.", "Person eats the cookie.", "Group of people take cookies.", "Woman looks around.", "Crowd sways.", "Girl jumps up in cheer.", "Boy turns his head.", "Girl embraces the boy."], "n_events": 8, "n_shots": 5, "n_subjects": 4}, "vid": 178}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "330.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The ATM machine is furiously spewing out cash, the camera switches to a man who yells in reaction to this, followed by the camera switching to the ATM screen which displays two lines of text.", "text": "Money is seen flying out of an ATM machine, with various denominations of bills visible. The ATM machine has a sign that reads 'NEED CASH?' and is branded with 'Radcis'. The scene then shifts to a man walking on a street with buildings and a bus in the background. The ATM machine's screen displays a message that reads 'Hummity: I thought you would have more money'. The final frame shows more money flying out of the ATM machine.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 330, "extra_info": {"duration": 4.5, "events": ["The ATM machine spews out cash.", "A man yells in reaction.", "The ATM screen displays two lines of text."], "n_events": 3, "n_shots": 3, "n_subjects": 2}, "vid": 330}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "262.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man puts down the phone from his ear and then places the phone on the table, looking at the man in the hospital bed.", "text": "A man, wearing a dark jacket and a blue shirt, is seen holding a phone to his ear in a hospital room. He then lowers the phone and looks at it. The man walks towards a hospital bed where a patient is lying, connected to medical equipment. He places the phone on a table next to the bed. The man stands with his hands in his pockets, looking at the patient. The patient, who is lying in a hospital bed with an oxygen mask, turns his head to look at the man.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 262, "extra_info": {"duration": 7.88, "events": ["The man puts down the phone from his ear.", "The man places the phone on the table.", "The man looks at the man in the hospital bed."], "n_events": 3, "n_shots": 2, "n_subjects": 2}, "vid": 262}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "730.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera captures a group of horses on the grassland, focusing on one of them. At first, it faces the camera, then turns and runs to the right. Soon after, another horse sprints forward, overtaking the former. The camera follows their movement.", "text": "A group of horses is standing in a field of tall grass, with a few horses in the background. One horse in the foreground begins to move forward, and the other horses start to follow. The horses begin to run, with two horses leading the way. The two leading horses continue running, with their manes and tails flowing in the wind. The horses run across the field, with the background showing a vast, open landscape. The horses continue running, with the camera following their movement. The horses run further into the distance, with the field and horizon visible.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 730, "extra_info": {"duration": 14.01, "events": ["Camera captures a group of horses on grassland", "One horse faces the camera", "The horse turns", "The horse runs to the right", "Another horse sprints forward", "The second horse overtakes the first horse", "Camera follows the movement of the horses"], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 730}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "875.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the darkness of the night, a white car proceeds along the road. Unexpectedly, a gray car emerges from the fork in the road ahead, darting from right to left. The white car has virtually no time to brake, and the distance between the two vehicles narrows dangerously. On the brink of collision, the video ends.", "text": "The video shows a car driving on a well-lit road at night, with the camera positioned inside the car, capturing the view through the windshield. The road is illuminated by streetlights, and there are buildings with lights on in the background. The car passes by several vehicles, including a truck and a car, which are visible on the right side of the road. As the car continues to drive, it approaches a section of the road with a barrier on the right side and a pedestrian crossing on the left. The car then turns left at an intersection, passing a pedestrian crossing and a parked car on the left side of the road. The car continues to drive straight on the new road, with the background showing more buildings and streetlights.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 875, "extra_info": {"duration": 5.0, "events": ["A white car proceeds along the road.", "A gray car emerges from the fork.", "The gray car darts from right to left.", "The white car brakes.", "The distance between the two cars narrows.", "The video ends before collision."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 875}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "379.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person shifted their chair, reached out to take a plate, and continued walking forward. The seated man turned his head to look at him; the camera angle moved up. The man holding the plate turned his head to look at the seated man, swung his hand, and smashed the plate onto the head of the seated man. The seated man then walked forward with his hands covering his head, and the camera switched to a side view of the standing man.", "text": "A man in a suit and tie is sitting at a table with a pitcher and glasses, in a wooden room with a staircase in the background. Another man in an orange shirt approaches and picks up a plate from the table. The man in the orange shirt walks away with the plate. He then returns, holding a knife, and stabs the plate, causing it to shatter and spill its contents on the man in the suit. The man in the suit reacts by bending over and moving away from the table. The man in the orange shirt looks at the man in the suit.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 379, "extra_info": {"duration": 5.5, "events": ["A person shifted their chair.", "A person reached out to take a plate.", "A person continued walking forward.", "The seated man turned his head to look.", "The man holding the plate turned his head to look.", "The man holding the plate swung his hand.", "The man holding the plate smashed the plate onto the seated man's head.", "The seated man walked forward with hands covering his head."], "n_events": 8, "n_shots": 2, "n_subjects": 2}, "vid": 379}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "988.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows an individual wearing blue pants, a red shirt, and a red cap participating in a race. He is initially behind six other runners – three on his left and three on his right – in an orange field. This individual, referred to as Mario, performs a spinning motion while sprinting ahead, successfully overtaking all the runners and securing the first position, leaving a trail of blue circles behind him.", "text": "In a crowded stadium, a group of characters, including a character in a red hat and blue overalls, a green character, a yellow character, a purple character, and a red character, are running on a track. The character in the red hat and blue overalls is in the lead, with a glowing trail beneath their feet. As they run, the character in the red hat and blue overalls continues to gain speed, while the other characters follow closely behind. The character in the red hat and blue overalls starts to emit a bright light from their feet, indicating an increase in speed. The character in the red hat and blue overalls maintains their lead, with the glowing trail becoming more pronounced. The character in the red hat and blue overalls continues to accelerate, with the other characters still in pursuit. The character in the red hat and blue overalls is now significantly ahead, with the glowing trail extending further behind them. 
The character in the red hat and blue overalls maintains their lead, with the other characters still running behind.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 988, "extra_info": {"duration": 5.0, "events": ["Mario participates in a race.", "Mario wears blue pants, a red shirt, and a cap.", "Mario starts behind six other runners.", "Runners are in an orange field.", "Mario performs a spinning motion.", "Mario sprints ahead of the competitors.", "Mario overtakes all the runners.", "Mario secures the first position.", "Mario leaves a trail of blue circles."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 988}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "917.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, a person wearing a gray t-shirt with a spider emblem on it and headphones is watching an anime video. In the video, a character kicks a grey-clothed character in the face and the gray-clothed character is left in a difficult situation. The camera then zooms out and a person with red hair and red clothes is seen surrounded by several gray-clothed people lying on the ground. Then, in another scene, this person is surrounded by many people wearing grey clothes and holding shields in their hands. Meanwhile, the person in the red outfit is leaning against a wall and smoking a cigarette. Then the person in the black t-shirt watching the video laughs for the character in Anime is smoking and there is a gathering army around him.", "text": "A man is watching an animated video on his screen, which shows a character with red hair and a black outfit in a dramatic pose against a cloudy background. The character is seen flying and then punching another character, causing the latter to be knocked down. The man reacts with surprise and excitement, placing his hand on his chest and then adjusting his earphones. The character in the video is then shown standing triumphantly over the fallen enemies, with the text 'That's it, Gaffer' appearing on the screen. 
The man continues to react with joy, smiling and adjusting his earphones.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 917, "extra_info": {"duration": 5.97, "events": ["A person watches an anime video.", "A character kicks a grey-clothed character in the face.", "The grey-clothed character remains in a difficult situation.", "A person with red hair stands surrounded by fallen people.", "Many people hold shields around the red-clothed person.", "The red-clothed person leans against a wall and smokes.", "The person in a black t-shirt laughs."], "n_events": 7, "n_shots": 1, "n_subjects": 4}, "vid": 917}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "658.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the sky, numerous seagulls flap their wings, trying to hover in place. A person placed some food from their left hand onto their right hand, then stretched the right hand upwards. A nearby seagull flew over, pecked at the food in the hand, and immediately flew away. Other seagulls continued to dance in the air. The person tossed the food from their right hand upwards, and the seagulls swooped toward the food. The video background features a sheer cliff with thick white fog above. Only the upper part of the person's arms and above is captured in the video.", "text": "A person extends their hand, offering food to a group of seagulls flying around them. The seagulls gather closely, flapping their wings and reaching for the food. The background features a rocky cliff with greenery and a cloudy sky. The person continues to hold out their hand as the seagulls fly in various directions, some getting closer to the hand while others fly away. The scene remains consistent with the seagulls actively feeding and the person maintaining their hand position.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 658, "extra_info": {"duration": 14.15, "events": ["Seagulls flap their wings in the sky.", "Person transfers food from left hand to right hand.", "Person stretches right hand upwards.", "Seagull pecks at food in person's hand.", "Seagull flies away immediately.", "Other seagulls continue hovering in air.", "Person tosses food upwards.", "Seagulls swoop toward the thrown food."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 658}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "769.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Someone uses scissors to cut the green cardstock into the shape of a Christmas tree, removes the excess parts, sets down the scissors, takes out a piece of cardboard and places it on the table, then puts the Christmas tree on the cardboard. Afterwards, they take out colorful ornaments and decorate it by placing them on top.", "text": "A person wearing a white sweater is seen cutting a green paper tree with pink-handled scissors. After cutting, they fold the tree and place it on a piece of brown paper. The person then starts arranging small colorful balls around the tree, placing them in a circular pattern. Finally, they pick up the scissors again to cut another piece of green paper.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 769, "extra_info": {"duration": 29.28, "events": ["Someone cuts the green cardstock into a Christmas tree shape.", "Excess parts are removed from the green cardstock.", "Scissors are set down on the table.", "Someone takes out a piece of cardboard.", "Cardboard is placed on the table.", "Christmas tree is put on the cardboard.", "Someone takes out colorful ornaments.", "Ornaments are placed on top of the Christmas tree."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 769}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "199.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera zooms in, the figures of SpongeBob and other creatures gradually blur, the shot switches to a pipe opening, the camera pulls back, SpongeBob stretches out his hand to press the switch button, and raises his hands joyfully laughing.", "text": "SpongeBob is seen standing on a bed surrounded by other characters, with various flags hanging above them. The camera zooms in on SpongeBob, who appears to be speaking. The scene then transitions to SpongeBob lying in bed, with a machine blowing bubbles towards him. SpongeBob, initially looking tired, starts to wake up and raises his arms in excitement. The room is decorated with a red brick wall, a bed with a purple blanket, a clock, a bucket, and a toy with pink hands.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 199, "extra_info": {"duration": 7.38, "events": ["The camera zooms in.", "Figures gradually blur.", "The shot switches to a pipe opening.", "The camera pulls back.", "SpongeBob stretches out his hand.", "SpongeBob presses the switch button.", "SpongeBob raises his hands.", "SpongeBob laughs joyfully."], "n_events": 8, "n_shots": 2, "n_subjects": 2}, "vid": 199}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "947.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features a person with long hair dressed in a red and blue suit, white pants, and gray shoes. This individual is seen running as a swiftly charging black bull narrowly misses them. With their hands raised above their head, chest pressing forward, and head inclined backward, the person skillfully dodges the bull's assault. Simultaneously, two others, dressed similarly and waving either a red or yellow cloth, are seen running towards the bull from behind.", "text": "In a bullfighting arena, two matadors in colorful uniforms are present, with one matador standing and the other walking. The walking matador starts running across the arena. A bull appears and charges at the running matador. The matador continues to run, evading the bull's charge. The bull follows closely behind the matador as he runs towards the edge of the arena. The matador maintains his pace, with the bull still in pursuit. The matador reaches the edge of the arena, with the bull still chasing him. The matador slows down as he approaches the barrier, while the bull continues to follow.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 947, "extra_info": {"duration": 5.0, "events": ["Person runs from charging bull.", "Bull narrowly misses person.", "Person dodges bull's assault.", "Person's hands raised above head.", "Person presses chest forward.", "Person inclines head backward.", "Two others run towards bull.", "First other waves red cloth.", "Second other waves yellow cloth."], "n_events": 9, "n_shots": 3, "n_subjects": 3}, "vid": 947}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "789.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Red smoke wafts through the air as a group of soldiers carrying shields and holding tools advance forward, with other pedestrians walking ahead.", "text": "The video begins with a red background and the 'video' logo displayed prominently. As the camera moves, red smoke starts to appear, partially obscuring the view. The smoke thickens, revealing a group of soldiers in riot gear, including helmets and shields, moving forward. The soldiers are holding shields labeled 'CARABIN' and are advancing in a line, with some holding batons. The background shows a clear sky and some trees, indicating an outdoor setting. The soldiers continue to move forward, with the red smoke still present, and the crowd in the background appears to be reacting to the situation. The video ends with the red smoke gradually dissipating, returning to the red background with the 'video' logo.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 789, "extra_info": {"duration": 10.91, "events": ["Red smoke wafts through the air", "Soldiers advance forward carrying shields", "Soldiers hold tools", "Pedestrians walk ahead"], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 789}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "737.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person moves a wooden spoon up and down with both hands, stirring the vegetables, and then reaches out to pick up a cellphone, swiping across the screen.", "text": "A person is seen mixing a salad in a glass bowl using two wooden spoons, surrounded by various fresh vegetables including tomatoes, cucumbers, and leafy greens on a wooden cutting board. The person continues to mix the salad thoroughly, ensuring all ingredients are well combined. After mixing, the person reaches for a smartphone on the table and starts using it. The person then resumes mixing the salad with the wooden spoons.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 737, "extra_info": {"duration": 13.3, "events": ["Person moves wooden spoon up.", "Person moves wooden spoon down.", "Person stirs the vegetables.", "Person reaches out.", "Person picks up cellphone.", "Person swipes across the screen."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 737}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "474.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Inside an off-road motorcycle track, a motorcycle flew off a dirt ramp and landed on the ground, followed closely by another motorcycle which flew off the ramp and landed right on the first one, knocking over the first motorcycle and its rider to the ground. The second motorcycle and its rider fell and tumbled forward. Several more motorcycles were approaching from behind.", "text": "A motorcyclist rides along a dirt track and approaches a dirt mound. The motorcyclist ascends the mound and begins to jump off the top. As the motorcyclist lands, another motorcyclist appears from the left side of the frame, also jumping off the mound. Both motorcyclists collide in mid-air, causing their bikes to flip over. The motorcyclists and their bikes fall to the ground, with the first motorcyclist sliding forward and the second motorcyclist rolling on the ground. The scene then cuts to a person in a red shirt standing in front of a building with a sign that reads 'MOTOCROSS'.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 474, "extra_info": {"duration": 4.63, "events": ["A motorcycle flew off a dirt ramp.", "The motorcycle landed on the ground.", "Another motorcycle flew off the ramp.", "The second motorcycle landed on the first one.", "The first motorcycle and rider were knocked over.", "The second motorcycle and its rider fell.", "The second rider tumbled forward.", "Several more motorcycles were approaching.", "More motorcycles were reaching the dirt ramp."], "n_events": 9, "n_shots": 1, "n_subjects": 3}, "vid": 474}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "253.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man lies in bed, covering his ears with a pillow. The camera cuts to an old man walking into the room from the doorway, holding a stick. The camera cuts again as the old man pokes the man's buttocks with the stick, causing the man to jump up from the bed in pain.", "text": "A man is lying on his side in bed, covered with a white blanket and resting his head on a white pillow, with his arm extended outwards. An older man enters the room, holding a red object in one hand and a glass in the other, and walks towards the bed. The older man approaches the bed and uses the red object to hit the man sleeping in the bed. The man in bed reacts by sitting up and turning around, then lies back down, still holding the pillow.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 253, "extra_info": {"duration": 6.62, "events": ["A man lies in bed.", "The man covers ears with pillow.", "An old man walks into the room.", "The old man holds a stick.", "The old man pokes the man's buttocks with stick.", "The man jumps up from bed in pain."], "n_events": 6, "n_shots": 3, "n_subjects": 2}, "vid": 253}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "707.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, a leopard is seen running forward. The camera gradually speeds up and pulls away from the running leopard, then captures another leopard also in motion. The camera continues to accelerate slowly, filming the scene of the two leopards running one after the other.", "text": "In a dry, open landscape with sparse trees and a clear blue sky, a cheetah is seen running across the terrain. The cheetah continues to run, moving from the center of the frame towards the right. Another cheetah appears in the background, also running in the same direction. Both cheetahs are now visible, running side by side across the landscape. The cheetahs maintain their pace, with the second cheetah slightly behind the first. The cheetahs continue their run, moving further to the right and slightly ahead of each other. The cheetahs are now closer to the right side of the frame, with the second cheetah still following the first. The cheetahs are seen running past a tree, with the second cheetah slightly behind the first. The cheetahs continue their run, moving further to the right and slightly ahead of each other.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 707, "extra_info": {"duration": 20.27, "events": ["A leopard is running forward.", "Another leopard starts running."], "n_events": 2, "n_shots": 1, "n_subjects": 2}, "vid": 707}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "642.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person lifts one corner of a semi-circular pastry with their left hand, draping it over the condiments in the middle. Then, with a knife in their right hand, they gently push the pastry, and lift up another corner to cover all of the condiments, turning the entire pastry into a triangle.", "text": "A person wearing gloves is seen on a griddle, placing a pancake with chocolate spread and sliced strawberries on it. The person then folds the pancake in half, revealing the chocolate spread and strawberries inside. The person continues to fold the pancake further, ensuring it is tightly wrapped. Finally, the person places the folded pancake on the griddle, completing the process.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 642, "extra_info": {"duration": 9.72, "events": ["A person lifts one corner of the semi-circular pastry.", "The corner drapes over the condiments.", "The person uses a knife to gently push the pastry.", "A person lifts another corner of the pastry.", "The person covers the condiments with the pastry.", "The pastry turns into a triangle."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 642}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "102.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A soldier opened his eyes, with a dog standing on him. The dog jumped off him, and the soldier got out of bed and stood up. Inside the tent, two other soldiers also sat up from their beds.", "text": "A man is lying on a bed with a dog standing on his chest, gently touching his chest with its paw. The man looks at the dog and then at the camera, while the dog continues to touch his chest. The dog then jumps off the man's chest and lands on the floor. The man sits up, and the dog walks towards another man who is sitting on a bed in the background. The man who was sitting on the bed gets up and stands, while the other man, who is sitting at a table, adjusts his hair.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 102, "extra_info": {"duration": 4.0, "events": ["A soldier opened his eyes.", "A dog stood on the soldier.", "The dog jumped off the soldier.", "The soldier got out of bed.", "The soldier stood up.", "Inside the tent, two other soldiers sat up.", "Two other soldiers got out of their beds."], "n_events": 7, "n_shots": 2, "n_subjects": 4}, "vid": 102}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "746.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Coins bearing portraits continually fall down, as the clock in the background moves upwards.", "text": "The video shows a blue background with multiple clocks and coins floating and rotating in the air. The clocks display various times, and the coins have different designs, including a profile of a person and a detailed engraving. The objects move in a continuous, fluid motion, creating a dynamic and visually engaging scene.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 746, "extra_info": {"duration": 24.9, "events": ["Coins fall down continuously.", "The clock moves upwards."], "n_events": 2, "n_shots": 1, "n_subjects": 2}, "vid": 746}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "377.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman walks down the stairs, with the man following behind. The camera cuts to the woman walking forward, following her movement as she opens the door and leans forward.", "text": "A woman wearing a red, fluffy coat walks forward on a stage with a dark, industrial background and some equipment visible. She continues walking past a man who is holding a microphone. The woman enters a dimly lit room, her hair flowing behind her. She approaches a door, opens it, and enters a small, dark bathroom. She closes the door behind her and sits down on the toilet.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 377, "extra_info": {"duration": 8.12, "events": ["The woman walks down the stairs.", "The man follows the woman.", "The woman opens the door.", "The woman leans forward."], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 377}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "689.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In front of the camera, a brown-haired girl is playing with a Tower of Hanoi puzzle. She places the disk on the last vacant peg of the Tower of Hanoi, then pushes the tower towards the direction of the camera before reaching out to grab another toy. In the distance, the silhouettes of two blurry children can be seen.", "text": "A girl with long hair is seen stacking colorful blocks on a wooden base, starting with a green block, followed by a yellow block, and then a red block. She continues to add more blocks, including a blue block and a green block, carefully placing them on the base. The girl then picks up a yellow block and places it on top of the stack. She adjusts the blocks to ensure they are properly aligned and stable. In the background, two other children are also engaged in stacking blocks.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 689, "extra_info": {"duration": 14.68, "events": ["Brown-haired girl plays with Tower of Hanoi puzzle.", "Girl places disk on last vacant peg.", "Girl pushes tower towards camera.", "Girl reaches out to grab another toy.", "Two blurry children appear in the distance."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 689}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "407.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A gentleman is participating in an obstacle course game. He agilely jumps over two moving barriers, and in an attempt to dodge the third obstacle, his leap is slightly misaligned, causing his feet to miss the platform. However, he quickly braces himself with his arms against the platform and effortlessly climbs up, then proceeds to smoothly navigate through the subsequent obstacles, easily completing the challenge. There are captions at the top of the screen expressing admiration for the contestant's rapid passage through the course. The background is that of a variety show channel from a television station.", "text": "A man in a white tank top and gray pants runs along a yellow platform above a pool of water, approaching a set of blue and yellow poles. He jumps onto the first pole, then onto the second pole, and continues to move forward. The man climbs over the poles, maintaining his balance as he progresses. He reaches the end of the platform, where he prepares to jump off. The man jumps off the platform and lands in the water, while a person in a red vest is seen in the water below. 
The background features a colorful mural with various advertisements and a clear blue sky.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 407, "extra_info": {"duration": 14.7, "events": ["Gentleman jumps over two moving barriers.", "Gentleman attempts to dodge the third obstacle.", "Gentleman's leap misaligns causing a missed platform landing.", "Gentleman braces himself with his arms against the platform.", "Gentleman climbs up onto the platform.", "Gentleman navigates through subsequent obstacles.", "Gentleman completes the challenge."], "n_events": 7, "n_shots": 1, "n_subjects": 2, "old_duration": 20.5, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/7_no_audio.mp4"}, "vid": 407}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "380.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The shot looks down as a man opens a lid with both hands and takes out a bottle to examine it.", "text": "In a tiled room with a checkered floor, a man in a suit stands next to a white urinal. He reaches out and begins to open the urinal lid. He continues to lift the lid, revealing the inside of the urinal. The man then reaches inside the urinal and retrieves a glass bottle. He holds the bottle up, examining it briefly. Finally, he turns his head to look at the bottle in his hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 380, "extra_info": {"duration": 7.0, "events": ["The camera angle points downward.", "A man approaches the lid.", "The man grasps the lid with both hands.", "The man opens the lid.", "The man reaches into the open space.", "The man takes out a bottle.", "The man examines the bottle."], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 380}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "632.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A baby is clutching a fork in its hand, trying to pick up fruit from a plate. The baby pokes at the fruit several times without success, until finally spearing a piece and bringing it towards its mouth. Then, the baby prods the plate a few times with the fork. The camera moves upwards to capture the baby's entire body and head, as well as the motion of the baby using the fork to stab downwards.", "text": "A child, wearing a light-colored sweater, is sitting in a high chair with a gray plate in front of them containing pieces of green and white food. The child uses a fork to pick up pieces of green food from the plate. The camera gradually moves upward, focusing on the child's hand and the fork. The child continues to hold the fork and looks at it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 632, "extra_info": {"duration": 12.02, "events": ["A baby clutches a fork.", "The baby tries to pick up fruit.", "The baby pokes at the fruit several times.", "The baby finally spears a piece of fruit.", "The baby brings the speared fruit towards its mouth.", "The baby prods the plate with the fork."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 632}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "992.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In this video, an individual donned in a brown suit jacket is seen obstinately attempting to operate a television using a remote control. Another individual, clad in a black suit jacket, repeatedly advises him that the remote does not control the TV. Despite this, the person continues to depress the remote's buttons. Eventually, a transparent swan ornament, perched atop the TV, is ejected by a suddenly-opening cover. The swan ornament collides with a hanging glass photo frame, containing a picture of a family – two people and a white dog – causing the glass to shatter.", "text": "In a well-lit room with a staircase, a woman in a green dress and a man in a suit enter, while a man in a brown suit holding a remote control turns to face them. The man in the brown suit points the remote control at a JVC VCR and presses a button, causing a glass swan ornament on top of the VCR to move. The man in the suit reacts with surprise, and the woman also looks shocked. The glass swan ornament continues to move, and the man in the suit appears to be explaining the situation. The scene then cuts to a photo of a couple with a dog on a boat.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 992, "extra_info": {"duration": 8.01, "events": ["Individual tries to operate TV with remote.", "Other individual advises remote doesn't control the TV.", "Individual presses remote's buttons.", "TV cover suddenly opens.", "Swan ornament is ejected from TV.", "Swan ornament collides with photo frame.", "Glass photo frame shatters."], "n_events": 7, "n_shots": 9, "n_subjects": 3}, "vid": 992}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "571.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The child wearing a blue coat opens the wooden fence, then turns around, extends their left hand to push the door open, walks forward, glances at the camera then looks away, as a yellow ox walks out from the cattle pen. The camera follows the movement of the child. The background is the cattle pen.", "text": "A child in a blue sweater with a cartoon character on the front approaches a wooden gate and begins to open it. The child successfully opens the gate and steps back, looking at the cow inside the pen. The child then turns around and starts walking away from the gate. The child continues to walk away, and the cow inside the pen begins to move towards the gate. The child turns back towards the gate as the cow gets closer. The child reaches the gate again and starts to close it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 571, "extra_info": {"duration": 10.23, "events": ["Child opens the wooden fence", "Child turns around", "Child extends left hand to push door", "Child walks forward", "Child glances at the camera", "Child looks away", "Yellow ox walks out from cattle pen"], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 571}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "697.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man and a woman are sitting on the snow facing the camera. The man is talking non-stop, while the woman is holding a cup and drinking water. In front of the camera, a figure runs from left to right, and after a while, runs back from the right to the left. After a bit more time, a woman wearing a thick scarf runs over from the left and sits to the right of the first woman, who then passes the water cup to the woman with the thick scarf.", "text": "A man and a woman are sitting on a bench in a snowy forest, with the man wearing a green jacket, yellow scarf, and brown beanie, and the woman in a brown jacket, purple scarf, and striped beanie. They are holding food and drinking from cups. The man wipes his nose and talks to the woman, who is drinking from her cup. A person in a brown coat approaches and sits down next to them. The woman in the brown coat hands her cup to the woman in the striped beanie, who takes it and drinks from it. The man continues to talk and wipe his nose while the woman in the striped beanie drinks from the cup.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 697, "extra_info": {"duration": 18.92, "events": ["A man and a woman sit on the snow.", "The man talks non-stop.", "The woman holds a cup and drinks water.", "A figure runs from left to right in front of the camera.", "The same figure runs back from right to left.", "A woman with a thick scarf runs over from the left.", "The woman with a thick scarf sits to the right of the first woman.", "The first woman passes the cup to the woman with the thick scarf."], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 697}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "824.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features a person wearing a green T-shirt, black shorts, gray sneakers, a ring on the left ring finger, and a black watch on the left wrist. This individual is performing easy push-ups in a room filled with sports equipment. A total of five push-ups were performed.", "text": "In a gym setting with various exercise equipment, a man in a green shirt and black shorts performs push-ups on a black mat. The text 'FEET TOGETHER OR UP TO 1' A PART' appears at the top of the screen, and 'HANDS JUST OUTSIDE SHOULDER WIDTH' is displayed at the bottom. The man lowers his body towards the mat and then pushes himself back up, bending his elbows until his shoulders are below his elbows. The text 'KEEP ABS TIGHT AND BODY STRAIGHT THROUGHOUT THE MOVEMENT' appears at the top of the screen, and a white line is drawn to indicate the correct form. The man continues to perform push-ups, maintaining a straight body and tight abs throughout the motion.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 824, "extra_info": {"duration": 9.01, "events": ["Person starts first push-up.", "Person reaches lowest point of first push-up.", "Person completes first push-up.", "Person starts second push-up.", "Person completes second push-up.", "Person starts third push-up.", "Person completes third push-up.", "Person starts fourth push-up.", "Person completes fourth push-up.", "Person starts fifth push-up."], "n_events": 10, "n_shots": 2, "n_subjects": 1}, "vid": 824}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "227.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The red car is being chased by a police car. The driver inside the vehicle is turning the steering wheel, navigating the car through a roundabout intersection, with the police closely following behind. Several cars are shuttling back and forth on the road.", "text": "A red convertible car is seen drifting on a dirt road at night, with its headlights illuminating the surroundings. The car continues to drift, creating a cloud of dust behind it. Inside the car, two people are visible, with one person in the driver's seat and another in the passenger seat. The car's side is shown as it continues to drift, with the driver and passenger still visible. The car is surrounded by several police cars with flashing lights, which are chasing it. The police cars are seen from an aerial view, forming a circular formation around the red car. The red car continues to drift and evade the police cars, which are closing in from all directions.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 227, "extra_info": {"duration": 8.25, "events": ["The red car is being chased by a police car.", "The driver navigates the red car through a roundabout.", "Cars are shuttling on the road.", "Police car closely follows the red car."], "n_events": 4, "n_shots": 6, "n_subjects": 2}, "vid": 227}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "440.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera captures from various angles as a woman swings a hammer to strike bamboo strips. A yellow dog and a black dog run by her side. The shot switches to a frontal view of the woman, then to an overhead shot, with a close-up of the bamboo being tightly bound by ropes. The woman steps on the bamboo with her right foot, then with her left foot, she pulls up the ropes. She hoists the bamboo strips and walks forward, the two dogs following her. The camera then switches to show the woman stopping in front of a pile of bamboo strips. She drops the strips and jumps into a pit, turning around to grab a basket. The background is a forested mountain landscape.", "text": "A woman is seen bending over a stack of bamboo on a stone ledge, surrounded by a stone wall and greenery. She uses a hammer to split the bamboo, with a dog nearby. The woman continues to hammer the bamboo, and the dog moves around her. She then ties the bamboo together with a rope. The woman carries the tied bamboo on her shoulder and walks away. She places the bamboo on the ground and then jumps down from a higher ledge. The woman continues to work with the bamboo, splitting and arranging it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 440, "extra_info": {"duration": 15.03, "events": ["A woman swings a hammer to strike bamboo strips.", "A yellow dog and a black dog run by the woman's side.", "The woman steps on the bamboo with her right foot.", "The woman pulls up the ropes with her left foot.", "The woman hoists the bamboo strips and walks forward.", "The two dogs follow the woman.", "The woman stops in front of a pile of bamboo strips.", "The woman drops the bamboo strips.", "The woman jumps into a pit.", "The woman grabs a basket."], "n_events": 10, "n_shots": 9, "n_subjects": 3, "old_duration": 15.03, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/40.mp4"}, "vid": 440}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "982.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video displays a player attacking an opponent using dual guns. The opponent retaliates by firing back. Subsequently, the opponent ducked, quickly rolled behind a stone carving for cover, and surprisingly reemerged close to the player, attacking fiercely. Meanwhile, the player skillfully twirled the guns, fired back and got hit.", "text": "The player, holding two guns, moves forward in a city street at night, with buildings and neon signs in the background. The player aims and shoots at an enemy, who is seen in the distance. The enemy is hit and starts to fall. The player continues to shoot, causing the enemy to explode and disintegrate. The player then moves to the right, continuing to shoot at the enemy. The player's score increases as they successfully eliminate the enemy. The player continues to move forward, with the score and game information displayed on the screen.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 982, "extra_info": {"duration": 6.0, "events": ["Player attacks opponent with dual guns", "Opponent fires back at player", "Opponent ducked for cover", "Opponent rolled behind a stone carving", "Opponent reemerged near the player", "Opponent attacked player fiercely", "Player twirled the guns", "Player fired back at opponent", "Player got hit by opponent"], "n_events": 9, "n_shots": 1, "n_subjects": 3}, "vid": 982}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "415.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is an expanse of sea where a person with a net is scooping up marine life in the water. The camera then cuts to three individuals standing at the side of a boat, pulling the creatures aboard. The process is filmed up close, and afterwards, they flip the marine life out and onto the deck of the ship.", "text": "A fishing net is seen in the water, and a large jellyfish is visible beneath it. The net is pulled closer to the boat, and the jellyfish is gradually lifted out of the water. The jellyfish is brought onto the boat, and three people on the boat work together to handle the net and the jellyfish. The jellyfish is placed on the boat, and the people continue to manage the net and the jellyfish. The jellyfish is fully brought onto the boat, and the people secure it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 415, "extra_info": {"duration": 15.07, "events": ["A person scoops up marine life with a net.", "Three individuals pull creatures aboard the boat.", "They flip marine life onto the deck of the ship."], "n_events": 3, "n_shots": 4, "n_subjects": 3, "old_duration": 18.5, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/15_no_audio.mp4"}, "vid": 415}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "514.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A panda tumbles forward from a wooden ramp it's constructed, rolling down the slope and onto the grass. It takes a turn and continues to roll down the grassy incline, then it hits a rock, pivots around, and keeps rolling downhill.", "text": "A panda is seen rolling down a wooden ramp in a grassy area with trees and a building in the background. The panda continues to roll off the ramp and onto the grass, eventually reaching the bottom of the ramp. The panda rolls further down the grassy slope, passing by rocks and a stone wall. The panda continues to roll down the slope, eventually reaching the bottom and coming to a stop.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 514, "extra_info": {"duration": 15.2, "events": ["A panda tumbles from a wooden ramp.", "Panda rolls down the slope.", "Panda lands on the grass.", "Panda takes a turn.", "Panda continues to roll down the incline.", "Panda hits a rock.", "Panda pivots around.", "Panda keeps rolling downhill."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 514}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "501.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man had a rope tied around his waist, stretched taut on both sides. He danced for a while and then stood up, spreading his arms wide with his fingers pointing to the sky. Next, the rope was pulled to the right, causing the man's body to lean to the right, hopping on his right foot. Then the rope was pulled to the left, making his body lean to the left, hopping on his left foot. The camera zoomed out to show two people pulling the rope on each side of the man, tugging it back and forth. In the distance, a fire truck was parked. Finally, the camera switched again, capturing a close-up of the man's dance moves.", "text": "A man in a white shirt and black pants is seen performing a series of movements on a grassy field with trees in the background. He starts by squatting and then stands up, raising his arms and pointing upwards. He then performs a series of kicks and jumps, moving to the left side of the frame. The camera zooms out to reveal two other individuals on either side of him, each holding a rope. The man continues to perform movements while the two individuals pull the rope, causing him to move back and forth. The man then stops, raises his hands, and bows slightly.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 501, "extra_info": {"duration": 13.18, "events": ["A man danced with a rope tied around his waist.", "The man stood up.", "The man spread his arms wide.", "The man's fingers pointed to the sky.", "The rope pulled the man to the right.", "The man hopped on his right foot.", "The rope pulled the man to the left.", "The man hopped on his left foot.", "Two people tugged the rope back and forth.", "The man performed dance moves in close-up."], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 501}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "951.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person dressed in a white Reebok t-shirt is captured riding a motorcycle in a selfie video. The individual is not wearing a helmet while on the motorcycle. The video then transitions to a shot of the road and the motorcycle. To the left side of the accelerating motorcycle, there are houses, while a verdant hill lies to the right. A little further, the motorcycle transits into a forested path. Transitioning back to a selfie video, the camera captures the individual who is now next to an electrical transformer, still observing the road.", "text": "The video begins with a view of a man riding a motorcycle, with the camera positioned to capture the scene from below, showing the sky and tree branches above. The man is wearing a light-colored shirt and is seen from a low angle as he rides. The camera then switches to a view of the motorcycle's dashboard, showing the speedometer and the road ahead as the man rides through a narrow path with buildings and greenery on either side. The camera angle changes back to the man's perspective, showing the sky and power lines as he continues riding. The man raises his hand to his mouth, possibly to adjust something or to eat. The video ends with a black screen.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 951, "extra_info": {"duration": 5.0, "events": ["Person rides motorcycle in selfie video.", "Person does not wear helmet on motorcycle.", "Motorcycle accelerates on road.", "Houses appear on left side of motorcycle.", "Verdant hill lies to right of motorcycle.", "Motorcycle enters forested path.", "Person stands next to electrical transformer."], "n_events": 7, "n_shots": 3, "n_subjects": 1}, "vid": 951}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "208.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A male actor runs from the theater aisle toward the camera, stretches out his arms, tips his hat to bow to the audience, and then throws his hat toward the back of the stage.", "text": "A man in a top hat and suit walks down the aisle of a theater, waving to the audience. He removes his hat and continues walking, spreading his arms wide. The man then turns around, removes his coat, and gestures towards the stage. He continues to walk down the aisle, speaking to the audience. The man then turns and walks towards the stage, while the audience claps and watches him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 208, "extra_info": {"duration": 14.22, "events": ["Male actor runs from the theater aisle.", "Male actor stretches out his arms.", "Male actor tips his hat.", "Male actor bows to the audience.", "Male actor throws his hat toward the back of the stage."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 208}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "671.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man and a woman, both dressed in white, argue in front of the camera. In the end, the woman pushes the man and walks off to the right side of the screen, and the man follows suit and leaves as well.", "text": "A man and a woman are standing in front of a gray wall, facing each other and having a conversation. The woman gestures with her hands while talking, and the man responds with hand movements. The woman appears to be upset, placing her hands on her chest and continuing to speak animatedly. The man tries to calm her down by placing his hands on her shoulders and gesturing with his hands. The woman continues to express her emotions, and the man keeps trying to soothe her. Eventually, the woman pushes the man away, and he stumbles out of the frame. The woman remains standing in front of the wall, looking in the direction where the man was.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 671, "extra_info": {"duration": 22.8, "events": ["A man and a woman argue in front of the camera.", "The woman pushes the man.", "The woman walks off to the right side of the screen.", "The man follows the woman."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 671}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "310.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera moves from right to left, viewing through the glass windows of the house in front, a woman is seen taking off her coat. The shot then switches to outside of the house, where a man wearing a hat emerges from behind a tree and then turns to walk towards the direction of the camera.", "text": "A woman, wearing a brown coat, is seen through a window as she removes her coat, revealing a white dress underneath. She continues to take off her coat and holds it in her hands. The scene then shifts to a man walking towards the camera, wearing a brown jacket and a cap, with a background of trees and a body of water.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 310, "extra_info": {"duration": 6.75, "events": ["Camera moves from right to left", "Woman takes off her coat", "Man wearing a hat emerges from behind a tree", "Man turns to walk towards the camera's direction"], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 310}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "940.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In this video, an individual wearing a black glove is seen handling some meat. They use a knife to cut the meat on a white chopping board. Once cut, the meat is placed in a transparent bowl on the board. Lastly, they add various spices into a separate transparent bowl.", "text": "A person wearing black gloves is slicing a piece of raw chicken breast on a white cutting board using a large knife. The person continues to cut the chicken into smaller pieces. The scene then shifts to the person placing the sliced chicken into a clear glass bowl. The person picks up a white plate containing various spices and herbs, and begins to pour the contents into the bowl. The spices and herbs are emptied into the bowl, mixing with the chicken.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 940, "extra_info": {"duration": 9.0, "events": ["Individual wears black glove.", "Individual handles meat.", "Knife cuts the meat.", "Meat is cut on chopping board.", "Cut meat is placed in bowl.", "Individual prepares spices.", "Spices are added to separate bowl."], "n_events": 7, "n_shots": 2, "n_subjects": 1}, "vid": 940}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "802.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video captures that at a traffic light junction, an orange taxi and an orange truck passed when the light turned green, followed by a red truck with black tires. When the light is yellow, the red truck stops and waits for the green light. When the light is green, the red truck and another orange truck behind the truck pass through the junction.", "text": "A woman with purple hair sits at a table with a globe, various stickers, and a framed picture of a taxi on the wall behind her. The scene transitions to a cartoon depiction of a street with a yellow taxi and a traffic light. The yellow taxi moves forward as the traffic light changes from green to red. A pink truck then stops at the red light. The truck remains stationary as the light changes from red to green. A 'SUBSCRIBE' button appears at the bottom of the screen, and a hand cursor clicks on it. The truck then moves forward as the light changes from green to red. The scene transitions back to the woman at the table.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 802, "extra_info": {"duration": 12.0, "events": ["Traffic light turns green.", "Orange taxi passes through the junction.", "Orange truck follows the orange taxi.", "Red truck with black tires passes through the junction.", "Traffic light turns yellow.", "Red truck stops at the yellow light.", "Traffic light turns green again.", "Red truck resumes motion through the junction.", "Another orange truck passes the junction."], "n_events": 9, "n_shots": 2, "n_subjects": 4}, "vid": 802}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "32.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man wearing the yellow jacket stepped forward, crossed his arms over his chest and made a gesture of thinking. He looked at the woman beside him who was holding a child. The woman blinked, and then the man spoke to her, waving his hands in front of his chest.", "text": "In a room with a vending machine, a 'Dino Zone' sign, and scattered debris, a man in a yellow jacket with a fur collar stands with his arms crossed, looking at a girl holding a doll. The girl, wearing a purple shirt and pink glasses, looks back at the man. The man then uncrosses his arms, gestures with his hands, and speaks to the girl. The girl, still holding the doll, closes her eyes and smiles. The man then crosses his arms again and continues to look at the girl.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 32, "extra_info": {"duration": 6.88, "events": ["The man stepped forward", "The man crossed his arms", "The man made a thinking gesture", "The woman holding a child blinked", "The man spoke to the woman", "The man waved his hands"], "n_events": 6, "n_shots": 3, "n_subjects": 3}, "vid": 32}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "560.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person is washing their hands when a little dog, carrying a cloth toy in its mouth, jumps onto the basin. The person takes away the toy, and the little dog drinks water flowing from the tap. The camera then cuts to footage of the dog continuing to drink water. The background is an indoor living room.", "text": "A person is washing their hands under a running faucet in a modern kitchen. A dog appears from the background, holding a pink stuffed toy in its mouth, and approaches the sink. The dog places its front paws on the edge of the sink and starts licking the water from the faucet. The dog continues to drink the water, occasionally adjusting its position and the toy in its mouth. The dog licks the water more vigorously, getting its face wet. The dog then moves slightly to the side, still holding the toy and drinking water.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 560, "extra_info": {"duration": 8.68, "events": ["A person washes their hands", "A little dog jumps onto the basin", "The little dog carries a cloth toy in its mouth", "The person takes the toy from the little dog", "The little dog drinks water from the tap", "The camera shows the dog drinking water", "The background shows an indoor living room"], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 560}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "146.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "At night, a treehouse dweller stands opposite a girl and a boy, makes a move towards the two children, the camera then cuts to behind the treehouse where the treehouse dweller picks up a bucket. The camera switches to the side of the treehouse, as the treehouse dweller turns slightly to the left. The angle changes to a top-down shot from the top of the treehouse, where the treehouse dweller makes a slight movement. Then the camera cuts to the two children; the little girl remains composed, while the little boy and a dog behind him show expressions of surprise.", "text": "At night, a treehouse is seen floating in the air, surrounded by a forest with a starry sky. Birds fly past the treehouse as it begins to descend. The treehouse continues to lower, and a bucket attached to a rope swings down to the ground. The children, a boy and a girl, watch the treehouse as it lands. The treehouse, now on the ground, is shown in close-up, revealing its detailed and colorful design. The camera then shifts to an aerial view, showing the treehouse from above. The boy and girl, along with a small dog, stand on the ground, looking up at the treehouse. The boy appears to be speaking while the girl looks on.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 146, "extra_info": {"duration": 8.62, "events": ["Treehouse dweller stands opposite a girl and a boy", "Treehouse dweller makes a move towards the children", "Treehouse dweller picks up a bucket", "Treehouse dweller turns slightly to the left", "Treehouse dweller makes a slight movement", "Little girl remains composed", "Little boy shows expressions of surprise", "Dog shows expressions of surprise"], "n_events": 8, "n_shots": 5, "n_subjects": 4}, "vid": 146}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "37.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The light shines on a boy and a robot, the boy stops his moving body, then turns around, the robot turns its head, a car with its headlights on approaches, and then stops.", "text": "In a dimly lit industrial area filled with stacked shipping containers, a car with its headlights on is seen approaching and gradually coming to a stop.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 37, "extra_info": {"duration": 3.12, "events": ["Light shines on a boy and a robot", "The boy stops moving", "The boy turns around", "The robot turns its head", "A car approaches with headlights on", "The car stops"], "n_events": 6, "n_shots": 2, "n_subjects": 2}, "vid": 37}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "47.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The pirate captain faces the camera and expresses an emotion, then rotates his eyes and tilts his head to look behind him. The crew members behind him are conversing. The captain adjusts his hat and then looks towards the camera, revealing a smile and making a gesture.", "text": "A bearded man in a pirate outfit is speaking while holding a map, with three other characters in the background on a ship. The bearded man continues to speak, and the characters behind him, including a man in a blue uniform and a white-haired character, react with surprise and concern. The bearded man turns his head to look at the characters behind him, who are now looking at him with worried expressions. The bearded man then adjusts his hat and continues to speak, gesturing with his hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 47, "extra_info": {"duration": 10.25, "events": ["Pirate captain expresses an emotion.", "Pirate captain rotates his eyes.", "Pirate captain tilts his head.", "Crew members converse.", "Pirate captain adjusts his hat.", "Pirate captain reveals a smile.", "Pirate captain looks towards the camera.", "Pirate captain makes a gesture."], "n_events": 8, "n_shots": 3, "n_subjects": 2}, "vid": 47}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "567.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Background is a classroom. Two girls stretch their arms out, suspended in the air. The girl behind runs up beside them, grabs their arms with both hands, stretches her legs out, and spins around in the air before landing on the ground, stepping back a few paces together. They laugh joyously, and the girl on the right gives a thumbs-up to the camera, then the three stand together.", "text": "In a room with a red door, fluorescent lights, and various items on shelves, two girls stand facing each other with their fists raised, while a third girl runs towards them from the background. The running girl reaches the two girls and performs a backflip, flipping over their arms. After the flip, the three girls continue dancing together, smiling and enjoying the moment.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 567, "extra_info": {"duration": 6.23, "events": ["Two girls stretch their arms out.", "The girl behind runs up beside them.", "The girl behind grabs their arms with both hands.", "The girl behind stretches her legs out.", "The girl behind spins around in the air.", "The girl behind lands on the ground.", "They step back a few paces together.", "They laugh joyously.", "The girl on the right gives a thumbs-up to the camera.", "The three girls stand together."], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 567}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "647.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person is in the kitchen peeling an onion. After peeling off a layer of the onion, they throw the peeled skin into the waste pile. Then, they wave their hands about, gesturing something, and afterwards, they pick up a kitchen knife with their right hand and start chopping the onion.", "text": "In a kitchen setting, a person wearing a white shirt and a beige apron is seen peeling a red onion over a wooden cutting board. The person continues to peel the onion, removing the outer layers. After peeling, the person places the onion on the cutting board and begins to slice it with a knife. The person then picks up another red onion and repeats the peeling process.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 647, "extra_info": {"duration": 17.25, "events": ["Person peels off a layer of the onion.", "Person throws peeled skin into the waste pile.", "Person waves their hands.", "Person picks up a kitchen knife.", "Person starts chopping the onion."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 647}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "675.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A delivery guy is riding a bicycle, carrying a yellow delivery box as he cycles forward.", "text": "A man wearing a white helmet and a gray shirt is riding a bicycle through a city street at night, carrying a large yellow delivery bag on his back. The background shows various storefronts and illuminated signs, indicating a busy urban area. The man continues to ride, passing by different buildings and streetlights, with the delivery bag prominently visible. The scene transitions to a more open area with fewer buildings and more open space, as the man rides further away from the camera.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 675, "extra_info": {"duration": 9.37, "events": ["A delivery guy rides a bicycle.", "The delivery guy carries a yellow delivery box."], "n_events": 2, "n_shots": 1, "n_subjects": 1}, "vid": 675}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "986.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts a sinister scene where a colossal snake's head descends from the ceiling, fiercely bites into a blue-shirted blue-collar worker, drawing blood and dragging the worker towards the ceiling. On the video's right side, two individuals, who both have black-strapped backpacks on, stand rooted at the door of the cockpit, faces etched with horror. On the left, one individual with long hair is attired in a green jacket, white T-shirt and blue jeans. The other person, with a short haircut, wears a gray T-shirt under a black coat and extends a hand protectively towards the long-haired individual.", "text": "The scene is set in a restaurant with yellow walls, white tables, and a bar counter filled with various bottles and glasses. A man in a blue shirt is seen talking and gesturing with his hand. Suddenly, a snake jumps onto the man's head, covering it completely. The man looks down at the snake on his head. Two people, a man and a woman, stand in front of the man, looking at the snake with wide eyes. The snake then falls to the ground.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 986, "extra_info": {"duration": 7.0, "events": ["Snake's head descends from the ceiling.", "Snake bites into a blue-shirted worker.", "Snake draws blood from the worker.", "Snake drags the worker towards the ceiling.", "Two individuals stand at the cockpit door.", "Individuals' faces show horror.", "Person in green jacket appears on the left.", "Person with short haircut extends a hand protectively."], "n_events": 8, "n_shots": 3, "n_subjects": 4}, "vid": 986}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "55.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A female character turns off the room's light and then closes the door; the camera switches to the outside, where the lady, along with a child character dressed in green, walks toward the courtyard entrance.", "text": "A woman, dressed in a purple outfit with a green and white collar, stands at the door and turns off the light switch. She then closes the door and locks it. The scene transitions to a nighttime view of a house decorated for Halloween, with a pumpkin on a hay bale, ghost decorations hanging from a tree, and a 'Happy Halloween' sign on the door. Two children, one in a green outfit and the other in a pink outfit with bunny ears, walk out of the house and onto the sidewalk. The woman is then shown again, still at the door.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 55, "extra_info": {"duration": 5.62, "events": ["Female character turns off light", "Female character closes the door", "Female character walks toward courtyard", "Child character follows the female character"], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 55}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "887.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A cook, dressed in a black t-shirt, chef's hat, and gray apron, drains noodles into a plate containing a liquid dish on a kitchen counter. The scene then transitions to an array of colourful foods and four ramen bowls. The ensuing shot reveals side dishes including meat, black fungus, scallions, and sesame seeds on the bowls. The cook then places seaweeds, marked with a logo, into the ramen bowls. A close-up shot of a ramen bowl follows.", "text": "In a busy kitchen, a man wearing a black shirt and an apron pours soup from a pot into several bowls lined up on the counter. He then places a piece of nori with a design on top of each bowl. The man continues to add more pieces of nori on top of the bowls. The camera zooms in to show a close-up of the final dish, which includes noodles, meat, and green onions, with a piece of nori on top. The man uses chopsticks to pick up the noodles from the bowl.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 887, "extra_info": {"duration": 11.0, "events": ["Cook drains noodles into a plate", "Scene transitions to colourful foods and ramen bowls", "Side dishes added to the bowls", "Cook places logo-marked seaweeds into ramen bowls", "Close-up of ramen bowl occurs"], "n_events": 5, "n_shots": 3, "n_subjects": 1}, "vid": 887}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "385.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person stirs an object; the camera cuts to a woman pouring liquid from a bottle, then with the bottle turned upside down, viewed from above, the camera cuts to the woman's profile, cuts to the object being poured into a mold, and then the camera follows the woman as she turns to the right, the object obscuring the woman's face.", "text": "A person holds a metal container and begins to pour its contents, which appear to be a fine powder, into a bowl. The person then picks up a blue bottle labeled 'Evanston's Pure Water' and examines it closely. The person pours the liquid from the bottle into a metal pan. The person then covers the pan with a lid.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 385, "extra_info": {"duration": 13.38, "events": ["A person stirs an object.", "A woman pours liquid from a bottle.", "The bottle is turned upside down.", "Liquid is poured into a mold.", "A woman turns to the right.", "The object obscures the woman's face."], "n_events": 6, "n_shots": 5, "n_subjects": 2}, "vid": 385}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "413.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman dressed in a white tank top is chopping lettuce, then she adds three raw eggs into the bowl, along with black pepper and sea salt, and then begins to stir with chopsticks. The camera angle remains unchanged. The background is a bright living room.", "text": "A woman is seen chopping green lettuce on a white cutting board using a knife. She then cracks three eggs into a glass bowl containing the chopped lettuce. Next, she sprinkles black pepper over the salad. Finally, she uses a wooden spoon to mix the eggs and lettuce together.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 413, "extra_info": {"duration": 7.3, "events": ["A woman chops lettuce.", "The woman adds three raw eggs into the bowl.", "The woman sprinkles black pepper into the bowl.", "The woman shakes sea salt into the bowl.", "The woman stirs the mixture with chopsticks."], "n_events": 5, "n_shots": 1, "n_subjects": 1, "old_duration": 10.43, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/13_no_audio.mp4"}, "vid": 413}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "935.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, a black car is driving on a two-way lane, passing by grassy lawns and small houses. Ultimately, it circles around in front of an open gray garage, adjusting its direction to parallel itself with the garage. The car is going to park inside the garage.", "text": "A black car is seen driving on a multi-lane road with trees and power lines in the background, reflecting the surroundings on its side mirror. The car continues to drive on a two-lane road surrounded by lush green trees and grassy areas. The camera then focuses on the car's wheel as it drives through a forested area. Finally, the car is shown parked in a driveway next to a house with a garage, and it begins to reverse.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 935, "extra_info": {"duration": 6.97, "events": ["A black car drives on a two-way lane.", "Car passes by grassy lawns.", "Car passes by small houses.", "Car circles around in front.", "Car adjusts its direction to parallel.", "Car prepares to park.", "Car enters the open gray garage."], "n_events": 7, "n_shots": 5, "n_subjects": 1}, "vid": 935}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "602.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera remains fixed. The background is a cafe. Three people are seated around a coffee table, a man dressed in a suit holding a pen and occasionally looking up to exchange words with the other two while pointing at a chart. The woman lifts her coffee cup to drink and then sets it down, the man in yellow clothes stretches out his hand to point at the chart, the woman turns to look at the man in yellow, then points at the chart and says something, setting down her coffee after finishing it, and the man in yellow nods in response.", "text": "In a cozy café, a man in a suit holds a clipboard with colorful charts and graphs, pointing at them while discussing with a woman in a blue cardigan who is taking notes. The woman occasionally sips coffee from a blue cup and continues to write in her notebook. Another man, seated across from them, gestures with his hand and speaks, while the woman briefly looks at him before returning her attention to the charts. The background shows a modern café setting with wooden tables, chairs, and a few other patrons.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 602, "extra_info": {"duration": 28.18, "events": ["Man in suit holds pen", "Man in suit exchanges words with others", "Man in suit points at chart", "Woman lifts coffee cup", "Woman sets down coffee cup", "Man in yellow clothes stretches out hand", "Man in yellow points at chart", "Woman turns to man in yellow", "Woman points at chart and speaks", "Man in yellow nods in response"], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 602}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "681.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the room, a man and a woman dressed in white opened a paper box and took out a piece of white clothing. The woman happily twirled and bounced around, while the man looked at her with a smile.", "text": "In a room with a white wall, a wooden shelf with books and plants, and several cardboard boxes, a man and a woman work together to open a cardboard box. The woman pulls out a white shirt from the box, examines it, and then puts it on. She spins around to show the shirt off, then removes it and places it back into the box. The man continues to hold the box open as the woman repeats the process.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 681, "extra_info": {"duration": 10.94, "events": ["A man and a woman opened a paper box", "The man and the woman took out a piece of clothing", "The woman twirled", "The woman bounced around", "The man looked at the woman with a smile"], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 681}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "984.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In this video, an individual donned in a red sweater and checkered pants takes note of the Christmas tree in the room, which is aflame. The person appears terrified and begins to frantically search the room for something useful.", "text": "A Christmas tree is engulfed in flames in a room filled with various toys and decorations. A man in a red sweater and checkered pants runs towards the burning tree, stopping in front of it and looking at it with a shocked expression. He then turns around, bends down, and reaches out to grab a chair. The man moves the chair closer to the burning tree, seemingly to try and extinguish the flames.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 984, "extra_info": {"duration": 5.0, "events": ["Individual takes note of the aflame Christmas tree.", "Individual appears terrified.", "Individual begins to search the room.", "Individual frantically searches for something useful."], "n_events": 4, "n_shots": 2, "n_subjects": 1}, "vid": 984}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "904.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The orange chameleon is quietly staying on the tree. The chameleon then changes color from dark red to pink as it crawls forward. The background of the image shows a water drop hanging on a tree, not on the chameleon.", "text": "A red chameleon is perched on a branch with a green, blurred background. The chameleon's eye is initially open and facing forward. The chameleon then begins to close its eye gradually. The eye continues to close until it is fully shut. The chameleon then starts to move forward along the branch.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 904, "extra_info": {"duration": 7.0, "events": ["The orange chameleon stays on the tree.", "The chameleon changes color from dark red to pink.", "The chameleon crawls forward.", "A water drop hangs on a tree."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 904}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "554.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man walks forward, picks up a piece of meat from the bag, extends his left hand to toss it into the air and then steps back. The lion leaps into the air, catches the chunk of meat in its jaws, and lands on the ground.", "text": "In a grassy area with a pond and trees in the background, a man stands facing a lion, holding a piece of meat in his hand. The man raises the meat above his head, and the lion looks up at it. The lion jumps up, extending its front paws towards the meat. The lion successfully catches the meat in mid-air and lands back on the ground. The man lowers his hand, and the lion stands on all fours, looking at the man.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 554, "extra_info": {"duration": 9.85, "events": ["The man walks forward.", "The man picks up a piece of meat.", "The man extends his left hand.", "The man tosses the meat into the air.", "The man steps back.", "The lion leaps into the air.", "The lion catches the chunk of meat in its jaws.", "The lion lands on the ground."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 554}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "394.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man stretched out his hand to check his watch, then continued walking forward with his hands in his pockets, while a person dressed in white approached from the opposite direction, surrounded by other pedestrians walking by.", "text": "A man stands by the river, looking to his left, with a bridge and people walking in the background. The scene shifts to the man walking on a wet sidewalk with ornate pillars and a fence, where he checks his watch. He continues walking as a woman approaches him from the opposite direction. The man stops and looks at the woman as she gets closer. The scene then shifts back to the man standing by the river.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 394, "extra_info": {"duration": 5.12, "events": ["The man checked his watch.", "The man continued walking forward.", "A person dressed in white approached.", "Pedestrians walked by."], "n_events": 4, "n_shots": 1, "n_subjects": 3}, "vid": 394}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "436.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera angles down and captures a boat moving right towards the shore. People on the boat move back and forth. A truck cab drives onto the boat, and other large trucks follow, driving off the vessel one by one. One large truck is making a U-turn. The camera holds its position steady.", "text": "A large ship is docked at a pier, with a ramp extending from the ship to the shore. The ship is loaded with numerous colorful trucks, and the ramp is initially empty. As the frames progress, the ramp begins to fill with trucks, starting from the bottom and moving towards the top of the ship. The trucks continue to move up the ramp, gradually filling the space on the ship. The trucks are seen driving up the ramp and onto the ship, with some trucks already on the ship and others still on the ramp. The ramp becomes increasingly crowded with trucks as they continue to move up. The trucks are now almost fully on the ship, with only a few still on the ramp. The ramp is now mostly empty, with the trucks having successfully boarded the ship.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 436, "extra_info": {"duration": 15.03, "events": ["Boat moves towards the shore", "People move back and forth on the boat", "Truck cab drives onto the boat", "Large trucks follow onto the boat", "Trucks drive off the boat one by one", "One truck makes a U-turn"], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 436}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "670.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, a black guy is gesturing with his right hand towards the front while continuously talking. In front of the camera, a white guy is wearing VR glasses and holding a game controller. He glances at it, and then repeatedly gestures towards the front.", "text": "In a dimly lit room with red neon lights, a man in a yellow sweater is wearing a VR headset and holding a game controller, while a man in a purple hoodie stands beside him, gesturing and talking. The man in the VR headset moves his body and hands, interacting with the virtual environment, while the man in the purple hoodie continues to gesture and speak, seemingly giving instructions or commentary. The man in the VR headset raises his hand and continues to move, while the man in the purple hoodie maintains his gestures and conversation. The man in the VR headset looks down at his controller, then raises it again, continuing his interaction with the virtual environment. The man in the purple hoodie continues to gesture and talk, while the man in the VR headset remains focused on the game.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 670, "extra_info": {"duration": 20.88, "events": ["A black guy gestures with his right hand towards the front.", "A black guy continuously talks.", "A white guy wearing VR glasses glances at a game controller.", "A white guy repeatedly gestures towards the front."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 670}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "162.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Rain is falling from the sky when suddenly a red car starts up and dashes forward. The camera switches to a blond man walking on the sidewalk. A black car pulls up beside the blond man, the door opens, and a person in black appears. They grab the man and pull him towards the car. The camera then switches to a view from inside the car looking out at the door as the blond man struggles, his hands gripping the edge of the door. The person in black grabs the blond man's shoulders. The blond man looks left and right before finally being pulled into the car by the person in black.", "text": "A red car with its headlights on is seen driving on a rainy night, moving from left to right. The car continues to drive past a building with a sign that reads 'CITY HONESTY' and a store with a blue awning. A man in a blue shirt and shorts is walking on the sidewalk, looking to his left. A black SUV approaches and stops next to the man. A man in a suit and sunglasses exits the SUV, grabbing the man's arms and pulling him towards the car. The man in the blue shirt struggles and tries to resist, but the man in the suit continues to hold him. The man in the blue shirt is forced into the car, and the man in the suit releases his grip. The red car continues to drive away in the rain.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 162, "extra_info": {"duration": 4.88, "events": ["Rain falls from the sky.", "Red car starts up.", "Red car dashes forward.", "Blond man walks on sidewalk.", "Black car pulls up beside blond man.", "Car door opens.", "Person in black appears.", "Person in black grabs blond man.", "Blond man struggles at car door.", "Person in black pulls blond man into car."], "n_events": 10, "n_shots": 3, "n_subjects": 2}, "vid": 162}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "113.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In front of a fireplace ablaze with fire, a man and a woman—two adults—look toward an object on the table in front of them. The woman is knitting a sweater. From the object emerges a girl's head, and then a hand stretches out from each side to reach for the wine glasses on the table. The camera switches to the opposite side, where a little boy is looking at them in surprise.", "text": "A man and a woman are sitting at a dining table with a large piece of meat in the center, surrounded by candles and wine glasses. The woman places her plate of food on the table and looks at the man. The man raises his hand, and a small character with glasses and a red hat emerges from the table, holding a fork and knife. The small character stands in front of the man and woman, who are holding wine glasses. The small character looks at the man and woman, who are still holding their wine glasses.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 113, "extra_info": {"duration": 3.5, "events": ["Fireplace ablaze with fire", "Man and woman look toward object", "Woman knits a sweater", "Girl's head emerges from object", "Girl's hand reaches for wine glass", "Camera switches sides", "Little boy looks in surprise"], "n_events": 7, "n_shots": 2, "n_subjects": 4}, "vid": 113}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "76.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A chubby woman is about to serve some rice, standing next to her is a boy, and then they both turn their heads to look in the direction of the camera. The shot changes to the woman opening the front door, where a man wearing glasses and a long-haired girl are standing. The camera then cuts back to the chubby woman, who says something to the man and girl.", "text": "In a cozy kitchen with a refrigerator adorned with magnets and notes, a girl in a pink shirt and red apron stirs a pot on the stove while a boy in a white shirt stands beside her. The girl turns her head to look at the boy, who also turns to look in the same direction. The girl then turns back to the stove and opens the door, revealing a man and a woman standing outside. The girl, wearing an apron with Chinese characters, speaks to the man and woman.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 76, "extra_info": {"duration": 5.75, "events": ["A woman prepares to serve rice.", "A boy stands next to the woman.", "The woman and boy turn heads towards the camera.", "The woman opens the front door.", "A man wearing glasses stands at the door.", "A long-haired girl stands next to the man.", "The woman speaks to the man and girl."], "n_events": 7, "n_shots": 3, "n_subjects": 4}, "vid": 76}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "336.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Inside a car, the person in the passenger seat opened a box, revealing the cake inside. The man in the driver's seat looked towards the passenger seat, and the person in the passenger seat took out a cake knife and cut a piece of the cake.", "text": "A person sitting in a car removes a basket from a cake placed on a green plate, revealing a partially eaten cake with white icing. The person then picks up a knife and begins to cut the cake. The camera shifts to show a man driving the car, holding the steering wheel with his right hand. The person continues to cut the cake with the knife.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 336, "extra_info": {"duration": 7.75, "events": ["The person in the passenger seat opened a box", "The box revealed the cake inside", "The man in the driver's seat looked towards the passenger seat", "The person in the passenger seat took out a cake knife", "The person in the passenger seat cut a piece of the cake"], "n_events": 5, "n_shots": 3, "n_subjects": 2}, "vid": 336}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "634.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person is using a pair of pet nail clippers to trim a dog's nails. They first clip the nails on the dog's left front paw, then pick up the right front paw to continue the trimming. The dog turns its head back and forth to cooperate.", "text": "A small dog with white and brown fur is sitting on a wooden bench with a metal lattice background. A person is holding the dog's paw and using a pair of nail clippers to trim the dog's nails. The dog looks around and occasionally moves its head while the person continues to clip its nails. The person adjusts the dog's paw and continues trimming the nails, ensuring each nail is properly clipped. The dog remains calm and cooperative throughout the process.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 634, "extra_info": {"duration": 19.35, "events": ["A person clips a dog's left front paw nails", "The person picks up the dog's right front paw", "The dog turns its head back", "The dog turns its head forth", "The person continues trimming the nails"], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 634}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "276.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two men are looking ahead, the camera cuts to the scene where both are wrestling on the ground. The man in black throws the man in red to the ground. The camera cuts back to the two men from before, showing excited expressions on their faces.", "text": "A man with a towel over his shoulder and a boy stand side by side in front of a colorful background with the words 'BOYS' and 'CO' visible. The man looks at the boy, who is speaking. The scene then shifts to a wrestling match where a wrestler in black lifts a wrestler in red and slams him to the ground. The camera returns to the man and the boy, who continue their conversation. The boy turns his head to look at the man, then turns back to face forward.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 276, "extra_info": {"duration": 5.62, "events": ["Two men look ahead", "Both men start wrestling", "Man in black throws man in red to the ground", "Men show excited expressions"], "n_events": 4, "n_shots": 3, "n_subjects": 4}, "vid": 276}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "679.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A young woman is live streaming with her phone set up in front of her. She holds a pen in her right hand and lifts a notebook with her left, pointing to a drawing inside as she explains it.", "text": "A woman is sitting on a bed in a well-lit room with framed pictures on the wall and a plant in the background. She holds a notebook with a floral drawing on the cover and shows it to the camera while talking. She points to the drawing with a pencil, explaining its details. She continues to gesture with the pencil, emphasizing certain parts of the drawing. She adjusts her hair with her hand while still holding the notebook and pencil. She flips the notebook to show the back cover and continues to talk.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 679, "extra_info": {"duration": 10.33, "events": ["A young woman starts live streaming.", "The young woman sets up her phone in front of her.", "The young woman holds a pen in her right hand.", "The young woman lifts a notebook with her left hand.", "The young woman points to a drawing inside the notebook.", "The young woman explains the drawing."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 679}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "87.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a hallway, four animated characters and a cat are bounced into the air by a spring on the floor. They open their eyes wide and their mouths in excitement, with light emanating from behind them. The scene then cuts to a different angle, looking up from below as they land on the floor above. A male adult turns around, making a victorious gesture, while the others walk away to the right.", "text": "A family of four, including a man, a woman, a girl, a boy, and a small yellow creature, stand in a hallway adorned with various decorations and paintings. The floor begins to open, revealing a spiral staircase. The family starts to float upwards as the staircase ascends. They continue to rise, smiling and looking around as they ascend. The family reaches the top of the staircase, where they land and stand on a platform. The scene then transitions back to the family standing in the hallway.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 87, "extra_info": {"duration": 7.25, "events": ["Four animated characters and a cat are bounced into the air", "Characters and cat open their eyes wide", "Characters and cat open their mouths in excitement", "Light emanates from behind the characters and cat", "Characters and cat land on the floor above", "A male adult turns around", "The male adult makes a victorious gesture", "Other characters walk away to the right"], "n_events": 8, "n_shots": 3, "n_subjects": 2}, "vid": 87}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "834.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video takes place on a vibrant green soccer field where an intense game is underway. One team is dressed in white jerseys and socks, while the opposing team is sporting red and black attire, complemented with black socks. A player from the white team skillfully controls the ball and successfully shoots it towards the rival team's goal. The goalkeeper, donned in orange, fails to stop the ball, resulting in a goal for the white team. The black and red team drops a point. In response, the crowd rises to their feet in an outpour of joy and merriment.", "text": "The soccer match is taking place between RMA and MCI, with the score at 0-0. The players are positioned on the field, with RMA in white and MCI in maroon jerseys. A player from MCI receives the ball and starts moving towards the goal, passing it to a teammate. The teammate receives the ball and takes a shot at the goal. The goalkeeper from RMA dives to save the shot but misses, and the ball goes into the net. The crowd in the stands reacts with excitement as the goal is scored. The scene then transitions to a different moment in the match, showing the players in their original positions.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 834, "extra_info": {"duration": 6.0, "events": ["Soccer game is underway on green field", "Player from white team controls the ball", "Player from white team shoots towards goal", "Goalkeeper in orange fails to stop the ball", "White team scores a goal", "Red and black team drops a point", "Crowd rises to their feet", "Crowd expresses joy and merriment"], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 834}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "957.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, an individual with long hair is seen. They are wearing a red t-shirt, which has the word 'family' written on it in white lettering and depicts a knight's helmet symbol. They also have on black tights displaying the Nike symbol, and are accessorizing with a white watch. The person walks by a beige sofa, an armchair, and a black table, all situated on a terrace, while carrying a tin container and a cloth bag. They proceed to descend wooden stairs, clad in their white shoes. Further on, this person heads towards a chicken coop to feed the chickens with the contents from the bag. The video shows the chickens inside the coop, near the fence. The scene also includes trees inside the coop, under which the chickens are gathered.", "text": "A woman wearing a red 'Family' shirt and black pants walks on a patio with a beige sofa set and a black coffee table, holding a metal container and a paper bag. She approaches the table and places the metal container on it. The camera then focuses on her hand holding the paper bag, showing the text and images on the bag. She continues walking down the steps, holding the bag and the container. The woman walks towards a fenced area with chickens inside, carrying the bag and the container.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 957, "extra_info": {"duration": 6.01, "events": ["Individual walks by a beige sofa.", "Individual walks by an armchair.", "Individual walks by a black table.", "Individual carries a tin container.", "Individual carries a cloth bag.", "Individual descends wooden stairs.", "Individual heads towards a chicken coop.", "Individual feeds the chickens.", "Chickens gather under the trees."], "n_events": 9, "n_shots": 4, "n_subjects": 1}, "vid": 957}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "10.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The character picks up a knife and attempts to cut their arm.", "text": "In a dimly lit room with a blue background, a girl with gray hair and a sleeveless top sits on a chair, looking at another girl with green hair and pink streaks. The gray-haired girl picks up a pen and shows it to the green-haired girl, who looks at it with a surprised expression. The gray-haired girl then turns and starts drawing on her own arm with the pen, while the green-haired girl watches and smiles. The green-haired girl then looks down, and the gray-haired girl continues to draw on her arm.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 10, "extra_info": {"duration": 20.77, "events": ["The character picks up a knife.", "The character attempts to cut their arm."], "n_events": 2, "n_shots": 1, "n_subjects": 1}, "vid": 10}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "408.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A tall and handsome man yawned and feigned surprise as he looked towards the camera. Then he gestured with his hands, stomped his foot and his pants changed from grey casual wear into black formal attire. Pretending to have come to a realization, he turned his head to complete a facial beauty touch-up. Afterwards, feigning helplessness, he took a couple of steps forward and changed into a fashionable jacket. Turning his head once more, he put on a pair of sunglasses.", "text": "A man in a blue sweater and gray sweatpants stands against a plain wall, adjusting his hair and making various hand gestures. He then changes into black pants, continues making gestures, and finally walks forward. The scene transitions to the man wearing a black and white outfit with a belt, where he makes more gestures and turns his head.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 408, "extra_info": {"duration": 12.5, "events": ["Man yawned and feigned surprise.", "Man gestured with hands.", "Man stomped foot and pants changed to black formal attire.", "Man turned head for facial touch-up.", "Man feigned helplessness.", "Man took a couple of steps forward.", "Man changed into a fashionable jacket.", "Man turned head again.", "Man put on a pair of sunglasses."], "n_events": 9, "n_shots": 1, "n_subjects": 1}, "vid": 408}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "599.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a water channel at the foot of a mountain. A sheep walks out of the channel, shakes its head, and moves forward. The man retracts his stick, placing it on another sheep, holding its neck as he guides it forward, while several other men are busy within the sheep pen. The camera angle remains fixed and unchanged.", "text": "In a rocky outdoor setting, a man wearing a hat uses a long stick to guide a goat through a narrow water channel. The goat initially resists but eventually follows the stick's direction, moving through the water. Another goat follows closely behind, also being guided through the water. The man continues to guide the goats, ensuring they move through the channel without slipping. In the background, several people are seen herding more goats into the channel.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 599, "extra_info": {"duration": 5.83, "events": ["A sheep walks out of the water channel.", "Sheep shakes its head.", "Sheep moves forward.", "The man retracts his stick.", "Man places the stick on another sheep.", "Man holds the sheep's neck.", "Man guides the sheep forward.", "Several men are busy within the sheep pen."], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 599}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "186.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person reaches out and opens the bottle cap, the camera cuts to a girl pouring the contents out of the bottle, it then cuts to the liquid from the bottle diffusing and rising up around the body of a dragon statue, with the camera following the movement to capture the scene.", "text": "A hand holding a small bottle filled with glowing purple liquid appears in front of a dragon's body. The girl, dressed in a blue outfit with a red bow tie, shakes the bottle, causing the liquid inside to glow brighter. She then pours the glowing liquid onto the dragon's body, which starts to react. The liquid spreads across the dragon's body, creating a sparkling effect. The dragon's body begins to glow and transform, with the liquid continuing to spread and illuminate the dragon. The transformation intensifies as the dragon's body becomes more radiant and colorful.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 186, "extra_info": {"duration": 7.12, "events": ["A person reaches out and opens the bottle cap.", "A girl pours the contents out of the bottle.", "The liquid diffuses around the dragon statue.", "The liquid rises up around the dragon statue."], "n_events": 4, "n_shots": 3, "n_subjects": 1}, "vid": 186}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "411.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman sits on the suitcase moving towards the camera, then stands up and starts to dance. The scene shifts from the waiting hall to different locations, including a street with residential buildings in the background, a pedestrian walkway with an overpass in the background, the entrance of a mall, and the entrance to a scenic spot.", "text": "A woman is sitting on a luggage cart in an airport terminal, moving forward while holding the handle. She then gets off the cart and starts dancing energetically. The scene transitions to the woman dancing on a street with a bus and buildings in the background. She continues dancing in different locations, including in front of a building and on a busy street with pedestrians. The video ends with the woman dancing in a plaza with a crowd around her.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 411, "extra_info": {"duration": 15.2, "events": ["The woman sits on the suitcase.", "The suitcase moves towards the camera.", "The woman stands up.", "The woman starts to dance.", "The scene shifts to a street.", "The scene shifts to a pedestrian walkway.", "The scene shifts to the entrance of a mall.", "The scene shifts to the entrance to a scenic spot."], "n_events": 8, "n_shots": 5, "n_subjects": 1}, "vid": 411}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "334.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man walked towards a yellow car, followed by a woman. When the man got to the car door, he held something in his left hand and used the car key in his right hand to open the door. He then leaned over and stuffed the item from his left hand under the driver's seat.", "text": "A man and a woman walk towards a parked car on a street with vintage cars and buildings in the background. The man reaches for the car key and inserts it into the door lock. He then opens the car door and bends down to look inside. The man continues to inspect the interior of the car, focusing on the dashboard area.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 334, "extra_info": {"duration": 10.5, "events": ["A man walked towards a yellow car", "A woman followed the man", "The man reached the car door", "The man held something in his left hand", "The man used the car key to open the door", "The man leaned over", "The man stuffed the item under the driver's seat"], "n_events": 7, "n_shots": 3, "n_subjects": 3}, "vid": 334}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "787.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman bent down and said something to the baby, swayed her body, then opened her mouth wide, looked up, lifted the baby up with both hands, and kissed the baby's belly.", "text": "A woman is holding a baby in her arms, smiling and looking at the baby. She then looks up and starts talking while still holding the baby. The woman lifts the baby up and down, and the baby looks around and smiles. The woman continues to hold the baby close, and the baby appears to be happy and engaged. The woman then brings the baby closer to her face, and the baby starts to laugh. The woman continues to hold the baby, who is now smiling and laughing.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 787, "extra_info": {"duration": 21.62, "events": ["The woman bent down.", "The woman said something to the baby.", "The woman swayed her body.", "The woman opened her mouth wide.", "The woman looked up.", "The woman lifted the baby up with both hands.", "The woman kissed the baby's belly."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 787}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "320.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man with a red hat takes a fish out of a bucket, walks down a small wooden bridge above the pool, and the camera switches to the man standing on a ladder, holding the fish with one hand outstretched. An orca leaps into the air and eats the fish from the man's hand, then falls back down. The camera then switches again, focusing on a railing in front of him.", "text": "A man in a suit and red hat stands on a metal walkway by a pool, holding a fish in his right hand and looking around. He then walks towards the camera, still holding the fish. The man stands on a ladder next to a lighthouse, raising the fish in his right hand. Suddenly, a large fish jumps out of the water and grabs the fish from his hand. The man turns to look at the fish as it swims away. The scene then shifts to a window with a radio on the sill, showing the time as 2:03 AM.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 320, "extra_info": {"duration": 10.25, "events": ["Man takes fish out of bucket", "Man walks down wooden bridge", "Man stands on ladder", "Man holds fish outstretched", "Orca leaps into the air", "Orca eats fish from man's hand", "Orca falls back down"], "n_events": 7, "n_shots": 3, "n_subjects": 1}, "vid": 320}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "636.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Around a long table, there are a few people. The table is covered with many food containers filled with food. The two people on the left are wrapping the containers with aluminum foil, while a character in blue clothing on the right is handing over the unwrapped containers to the two people on the left.", "text": "A table is filled with several aluminum trays containing food, and a stack of empty trays is placed on the table. A man in a blue shirt begins to fill the trays with food, while another person assists by placing lids on the trays. The man continues to fill the trays, and the person with the lids continues to place them on the trays. The man and the person with the lids work together to ensure all trays are filled and covered. The background shows other people assisting with the food distribution, and the table is covered with more trays and food items.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 636, "extra_info": {"duration": 18.49, "events": ["People sit around a long table", "Table is covered with food containers", "People on left wrap containers with aluminum foil", "Character in blue hands over containers", "Food is contained for storage"], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 636}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "666.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, the lady on the right gestures with both hands while talking to someone beside her, while the gentleman on the left takes notes with a pen. The camera pans to the left, focusing on another lady with brown hair. This lady points at the computer screen in front of her and says a few words, while the man who was taking notes earlier stops writing and also says something.", "text": "A woman with blue nail polish gestures with her hands while talking, and a man in a suit and glasses looks at her attentively. The woman continues to gesture and then starts writing on a notepad. The man remains focused on the notepad as the woman writes. Another woman with curly hair joins the conversation, pointing at the laptop screen and speaking. The man and the first woman continue to look at the laptop screen, with the man holding a pen. The first woman continues to gesture and talk while the man listens.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 666, "extra_info": {"duration": 12.53, "events": ["A lady gestures with both hands while talking.", "A gentleman takes notes with a pen.", "The camera pans to the left.", "A lady with brown hair points at the computer screen.", "The lady with brown hair says a few words.", "The gentleman stops writing.", "The gentleman says something."], "n_events": 7, "n_shots": 1, "n_subjects": 3}, "vid": 666}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "77.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A boy in a blue suit turns around and crouches in front of a pile of gift boxes. There is space built with gift boxes in front, leaving a small window. The boy looks through the window to see a girl with long hair crying inside.", "text": "A man in a blue suit and brown shoes walks towards a large pile of gift boxes in a room with a kitchen counter in the background. He stops in front of the pile, bends down, and places his hands on his knees. He then crouches down to look inside one of the boxes. Inside the box, a woman with long black hair is seen looking down and speaking. The man continues to look at the woman inside the box.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 77, "extra_info": {"duration": 5.75, "events": ["A boy turns around.", "A boy crouches.", "A boy looks through a small window.", "A girl cries inside a space."], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 77}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "346.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a restaurant, in front of the camera, there are three people sitting on either side of a table, with a man on the left, a man and a woman on the right. The waiter comes over, places a dish on the table, and then walks away. The man on the left turns his head to speak to the waiter. The woman across from him uses her finger to wipe her eyes twice, and the man tilts his head. There are other people dining in the restaurant as well.", "text": "In a dimly lit restaurant with colorful balloons in the background, a man with a beard sits at a table with a woman and another man, all looking at the table. The woman wipes her nose with her hand and then places her hand on the table. A waiter approaches and places a plate on the table in front of the man with the beard. The man with the beard turns his head to look at the waiter and then turns back to look at the woman. The waiter walks away, and the man with the beard continues to look at the woman.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 346, "extra_info": {"duration": 6.38, "events": ["Waiter places a dish on the table.", "Waiter walks away from the table.", "Man on the left turns head to speak.", "Woman wipes her eyes with her finger twice.", "Man on the left tilts his head."], "n_events": 5, "n_shots": 1, "n_subjects": 4}, "vid": 346}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "828.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A child clad in gray attire gleefully runs towards a house's door, intending to prank its occupants. The mischievous child rings the doorbell and then conceals himself. A bemused individual, dressed in green, opens the door to find nobody there. As confusion grows, this individual sticks their head out the door to take a thorough look around their vicinity, perplexed about who might be playing a prank or who could have mistakenly rung their doorbell. Despite their efforts, they are left clueless and shut the door.", "text": "A boy in a gray hoodie and green pants runs down a sidewalk between apartment buildings. He reaches a door, knocks on it, and then runs away. The boy hides behind a pillar, peeking out to observe. The scene shifts to a man in a yellow hoodie opening a door and looking outside. The man leans out of the door, then closes it and turns to look at the door handle. The boy is seen running towards the door again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 828, "extra_info": {"duration": 11.0, "events": ["A child runs towards a house's door.", "The child rings the doorbell.", "The child hides.", "An individual opens the door.", "The individual looks around.", "The individual shows confusion.", "The individual sticks their head out.", "The individual searches the vicinity.", "The individual finds no one.", "The individual closes the door."], "n_events": 10, "n_shots": 6, "n_subjects": 2}, "vid": 828}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "403.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, a lady dressed in black is taking a selfie in the park. The first shot shows her turning around to face the camera, putting on sunglasses. The second shot captures her walking gracefully towards the camera. In the third shot, she opens her arms and runs away from the camera. The fourth shot shows her waving trekking poles as she walks forward.", "text": "A woman in a black dress is walking on a bridge surrounded by trees, holding a cane in her right hand. She turns around, flips her hair, and puts on sunglasses. She continues walking, adjusting her hair and sunglasses. She starts running along the bridge, with the sun shining through the trees. She then turns around, removes her sunglasses, and continues walking.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 403, "extra_info": {"duration": 9.57, "events": ["The lady turns to face the camera.", "The lady puts on sunglasses.", "The lady walks towards the camera.", "The lady opens her arms.", "The lady runs away from the camera.", "The lady waves trekking poles."], "n_events": 6, "n_shots": 4, "n_subjects": 1}, "vid": 403}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "754.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman squeezes the hand cream onto the back of her hand, then closes the lid and sets it aside, spreads it out with her fingers, and then rubs her hands together reciprocally.", "text": "A woman, wearing a white sleeveless top, holds a white bottle with a pointed nozzle in her right hand and positions it over her left hand. She squeezes the bottle, dispensing a small amount of cream onto her left hand. She then uses her right hand to spread the cream evenly across her left hand. The woman continues to rub the cream into her hand, ensuring it is well absorbed. Finally, she holds the bottle again, ready to dispense more cream if needed.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 754, "extra_info": {"duration": 11.61, "events": ["The woman squeezes hand cream onto her hand.", "The woman closes the hand cream's lid.", "The woman sets the hand cream aside.", "The woman spreads the cream with her fingers.", "The woman rubs her hands together."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 754}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "999.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts a blonde-haired person, sporting a black swimsuit and a ponytail, standing on an orange jumping platform. With their back to the right, they face left and raise their hands. Following that, they jump, wrapping their arms around their legs and performing a spin in mid-air. They conclude by diving into the water, hand-first and feet-up, creating a considerable splash.", "text": "A woman in a black swimsuit stands on a high diving platform with her arms raised above her head, preparing to dive. She maintains her stance for a few moments, keeping her arms extended. She then bends her knees slightly and jumps off the platform, initiating a dive. The woman performs a series of flips and twists in the air as she descends towards the pool. The background shows an indoor swimming pool facility with a high ceiling and a diving platform structure.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 999, "extra_info": {"duration": 6.0, "events": ["Blonde-haired person stands on jumping platform.", "Person faces left.", "Person raises hands.", "Person jumps from platform.", "Person wraps arms around legs.", "Person performs a spin in mid-air.", "Person dives into the water.", "Person enters water hand-first.", "Person's feet go up.", "Splash occurs in water."], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 999}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "726.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man jumped up and grabbed the bar with both hands, then did three pull-ups.", "text": "In a gym setting with a metal frame structure, a man in a black shirt and shorts approaches the frame and grabs the top bar with both hands. He begins to perform pull-ups, lifting his body up and down while maintaining a firm grip on the bar. The man continues this exercise, repeatedly pulling himself up and lowering back down. The background includes a chalkboard, windows, and various gym equipment.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 726, "extra_info": {"duration": 14.51, "events": ["The man jumped up.", "The man grabbed the bar.", "The man did a pull-up.", "The man did a second pull-up.", "The man did a third pull-up."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 726}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "621.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A hummingbird rapidly flutters its wings, hovering in the air. Then, the hummingbird flies in front of a flower, extends its beak into the stamen, and sips the nectar before flying out backward. Following that, it flies in front of the flower again and sips the nectar. This process is repeated three times before the hummingbird finally flies upwards and away.", "text": "A vibrant red and green bird is seen hovering in front of a cluster of orange and yellow flowers, flapping its wings rapidly. The bird repeatedly dips its beak into the flowers, likely feeding on nectar. The background is a blurred green, indicating a natural, outdoor setting.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 621, "extra_info": {"duration": 13.41, "events": ["A hummingbird flutters its wings.", "The hummingbird hovers in the air.", "Hummingbird extends beak into the flower's stamen.", "Hummingbird sips nectar from the flower.", "The hummingbird flies out backward.", "Hummingbird approaches the flower again.", "Hummingbird repeats nectar sipping.", "The process is repeated three times.", "Hummingbird flies upwards.", "Hummingbird flies away."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 621}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "368.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman, with her back to the camera, takes off her coat and walks forward.", "text": "In a dark, rainy alleyway with a few illuminated buildings in the background, a person is seen from behind as they begin to remove their jacket. The person continues to pull the jacket off, revealing a police vest underneath. The vest has the word 'POLICE' clearly visible on the back. The person fully removes the jacket, leaving the police vest fully exposed.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 368, "extra_info": {"duration": 4.0, "events": ["A woman takes off her coat.", "The woman walks forward."], "n_events": 2, "n_shots": 1, "n_subjects": 1}, "vid": 368}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "934.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows a person wearing a grey t-shirt, who turns on the tap near a sink surrounded by various cleaning products. The individual rolls up their sleeves and pours a skin cleanser, specifically Cetaphil, into their hand. Then, the person rubs the hands together and proceeds to wash the face. After washing, they turn off the tap and dry their hands with a white towel. Finally, the person dries their face with a green towel.", "text": "A man turns on a modern faucet, and water starts flowing. He then picks up a bottle of Cetaphil Gentle Skin Cleanser from a shelf filled with various toiletries. The man uses the cleanser to wash his face under the running water. After washing his face, he turns off the faucet and dries his face with a towel. Finally, he turns on the faucet again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 934, "extra_info": {"duration": 9.0, "events": ["Person turns on the tap", "Person pours Cetaphil into hand", "Person rubs hands together", "Person washes face", "Person turns off the tap", "Person dries hands with white towel", "Person dries face with green towel"], "n_events": 7, "n_shots": 8, "n_subjects": 1}, "vid": 934}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "868.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video displays a scene of snowy landscapes, with an orange deer lying on the ice, eyes half-closed. Suddenly, the deer opens its eyes wide and notices a small grey-and-white penguin walking along the railroad tracks toward a plant bearing red fruit. A black train swiftly approaches the penguin from a distance on the same tracks.", "text": "A camel with large goggles is lying on the snow, looking down at the ground. The camel's eyes gradually open, revealing a surprised expression. The camera then shifts to show a snow-covered railway track. A small bird appears on the track, looking around. The camera quickly pans to the right, revealing a steam locomotive moving along the track, emitting white smoke.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 868, "extra_info": {"duration": 5.0, "events": ["An orange deer lies on the ice.", "The orange deer opens its eyes wide.", "A small grey-and-white penguin walks along the railroad tracks.", "The small grey-and-white penguin heads toward a plant.", "A plant bears red fruit.", "A black train swiftly approaches on the tracks.", "The black train moves toward the small grey-and-white penguin."], "n_events": 7, "n_shots": 3, "n_subjects": 2}, "vid": 868}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "786.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "After the plane landed, it continued to taxi forward, with dust and smoke rising from all around. The camera follows the movement.", "text": "A large U.S. Air Force aircraft is seen approaching for landing with its landing gear extended, against a backdrop of mountains and a clear sky. The aircraft continues its descent, getting closer to the runway as it prepares for touchdown. The aircraft's wheels make contact with the runway, creating a cloud of dust as it rolls forward. The camera follows the aircraft as it moves along the runway, capturing the dust cloud it generates. The aircraft continues to roll down the runway, with the dust cloud becoming more prominent. The aircraft moves further down the runway, eventually exiting the frame, leaving only the dust cloud visible.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 786, "extra_info": {"duration": 18.45, "events": ["The plane landed.", "The plane taxied forward.", "Dust rose from the ground.", "Smoke rose from the ground."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 786}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "860.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video displays a car behind a grey vehicle that fails to stop at a red light signal and proceeds to drive through it.", "text": "The video shows a car driving through an intersection on a clear day with a blue sky. The car approaches a traffic light and begins to turn right, passing a white van on the right side of the road. The car continues straight, passing a gas station on the right and a pizza shop on the left. The car approaches another intersection with a traffic light, which is red, and continues straight. The car passes a construction site with orange traffic cones and a sign indicating a speed limit of 35 mph. The car continues straight, passing a building with a sign that reads 'WOWS' and a gas station on the right. The car approaches another intersection with a traffic light, which is red, and continues straight. The car passes a building with a sign that reads 'WOWS' and a gas station on the right. The car continues straight, passing a building with a sign that reads 'WOWS' and a gas station on the right. The car approaches another intersection with a traffic light, which is red, and continues straight. The car passes a building with a sign that reads 'WOWS' and a gas station on the right. The car continues straight, passing a building with a sign that reads 'WOWS' and a gas station on the right. The car approaches another intersection with a traffic light, which is red, and continues straight. The car passes a building with a sign that reads 'WOWS' and a gas station on the right. The car continues straight, passing a building with a sign that reads 'WOWS' and a gas station on the right. The car approaches another intersection with a traffic light, which is red, and continues straight. 
The car passes a building with a sign that reads 'WOWS' and a gas station on the right. The car continues straight, passing a building with a sign that reads 'WOWS' and a gas station on the right. The car approaches another intersection with a traffic light, which is red, and continues straight. The car passes a building with a sign that reads 'WOWS' and a gas station on the right. The car continues straight, passing a building with a sign that reads 'WOWS' and a gas station on the right. The car approaches another intersection with a traffic light, which is red, and continues straight. The car passes a building with a sign that reads 'WOWS' and a gas station on the right. The car continues straight", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 860, "extra_info": {"duration": 5.0, "events": ["The car follows a grey vehicle.", "Grey vehicle approaches a red light.", "Red light signal activates.", "Grey vehicle fails to stop.", "Grey vehicle drives through the red light."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 860}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "487.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a spacious area, a man rides a bicycle towards the camera. As he approaches a speed bump, he lifts the handlebars and the bike raises upwards, continuing on one wheel. The front wheel comes off the bike, hits the speed bump, bounces into the air, and then lands on a blue-patterned round barrel. It bounces up again and lands on a yellow-green patterned barrel, where the wheel bounces up once more. Just then, the man arrives, grabs the front wheel with one hand and continues forward. As he approaches a turn ahead, the man slows down, riding while preparing to reattach the front wheel back onto the bike.", "text": "A man rides a bicycle towards a small ramp under a covered area at night. He approaches the ramp and begins to lift the front wheel of the bicycle. The man successfully lifts the front wheel off the ground and continues to ride while holding the front wheel in the air. He rides past a blue barrel and a green barrel, maintaining balance with the front wheel elevated. The man continues to ride, passing by a large pillar and moving towards a more open area. He rides past a yellow-lit wall and a staircase, still holding the front wheel in the air. 
Finally, he rides towards the camera, maintaining his balance with the front wheel elevated.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 487, "extra_info": {"duration": 9.7, "events": ["Man rides bicycle toward camera", "Man lifts handlebars of bicycle", "Bike raises onto one wheel", "Front wheel detaches from bicycle", "Detached wheel hits speed bump", "Wheel lands on blue-patterned barrel", "Wheel bounces onto yellow-green patterned barrel", "Man grabs front wheel with hand", "Man continues forward on bicycle", "Man slows down to reattach front wheel"], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 487}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "607.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Against the backdrop of the starry sky, light radiates outward from the center. Shortly thereafter, fragments of letters fly from off-screen to the center of the screen to form the name of the movie. After the movie title briefly and slowly enlarges forward, it disintegrates back into fragments and flies off the screen.", "text": "The video begins with a dark background filled with small, scattered particles and light streaks emanating from the center. The text 'FATHER'S' appears in large, bold letters, followed by the word 'DAY' in a smaller font below it. The text remains on the screen for a few moments, with the light streaks continuing to move outward. The text then starts to disintegrate and move out of the frame, leaving the background with the light streaks and particles. The text reappears, moving in a similar manner, and then disintegrates again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 607, "extra_info": {"duration": 10.03, "events": ["Light radiates outward from the center.", "Fragments of letters fly from off-screen to the center.", "The movie title forms from the flying fragments.", "The movie title enlarges forward.", "The movie title disintegrates into fragments.", "Fragments fly off the screen."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 607}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "898.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video clip shows a racing game with two competing black cars that are angled side by side. Neither driver is willing to concede to the other and their persistent side-by-side driving makes them susceptible to flipping over, thus potentially leading to an imminent accident.", "text": "The video shows a race track with three cars, two of which are Volkswagen race cars, driving closely together. The cars are moving at high speed, with the leading car slightly ahead of the others. The background includes a fenced track, trees, and buildings, indicating an urban or suburban setting. The cars maintain their positions relative to each other as they navigate the track, with the leading car occasionally adjusting its position. The track has a mix of straight and curved sections, and the cars are seen maneuvering through these turns. The video captures the intensity of the race, with the cars closely packed and competing for position.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 898, "extra_info": {"duration": 5.0, "events": ["Two black cars race side by side.", "Drivers refuse to concede to each other.", "Both cars maintain persistent side-by-side position.", "Cars risk flipping over due to their angle.", "The likelihood of an accident increases."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 898}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "912.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts an individual extracting a wire panel. Upon removal of this wire board, the red and white wires appear unprotected, devoid of any electrical tape. The red wire, also known as the live or active wire, and the white wire, known as the neutral wire, are exposed. This leaves potential for electrical mishaps. If these exposed wires are inadvertently touched, the person may suffer from an electric shock and life-threatening consequences may ensue, requiring immediate medical attention and potentially an ambulance.", "text": "A person holds a gray electrical outlet with two wires, one red and one white, in front of a wall with a circular hole. The person uses their left hand to hold the outlet steady while their right hand begins to strip the insulation off the red wire. The right hand continues to strip the insulation, revealing the bare wire. The person then pulls the stripped wire out of the outlet. The person holds the stripped wire in their left hand, showing the bare wire, while the outlet is still held in their right hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 912, "extra_info": {"duration": 5.0, "events": ["Individual extracts a wire panel.", "Red and white wires are exposed.", "Person may inadvertently touch the wires.", "Person suffers from an electric shock.", "Life-threatening consequences ensue.", "Immediate medical attention is required.", "An ambulance may be needed."], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 912}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "788.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A close-up of a man's hands being handcuffed, two other men grabbing this man's arms and walking forward together. The camera follows the movement until they stop at the doorway, where the man on the right reaches into his pocket searching for something, while the other two men turn to the left.", "text": "A man in a gray shirt is being handcuffed by two officers, one on each side, as they secure his hands behind his back. The officers then guide the man, who is now handcuffed, towards a building entrance. The man is led by the officers through a courtyard and towards the entrance of the building. The officers continue to escort the man, who is still handcuffed, as they approach the door. The officers reach the door and begin to open it, preparing to enter the building.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 788, "extra_info": {"duration": 19.72, "events": ["Hands being handcuffed", "Two men grab the man's arms", "Group walks forward together", "They stop at the doorway", "Man on the right reaches into his pocket", "Other two men turn to the left"], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 788}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "357.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera moves downward, a group of people open a door and walk forward, a dog in front follows them, and then the camera switches to their backs.", "text": "The camera moves downward to focus on a large, ornate wooden door with golden handles and a blue and gold decorated archway. The door begins to open, revealing a group of people and a small dog inside. The group, consisting of individuals in various outfits, including a person in a white and red suit and another in a blue outfit, steps forward as the door opens wider. The small dog, wearing a harness, walks ahead of the group. The group continues to move forward, entering a grand hall with stained glass windows and wooden paneling.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 357, "extra_info": {"duration": 5.0, "events": ["The camera moves downward", "A group of people open a door", "People walk forward", "A dog follows the group of people"], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 357}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "882.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows three individuals next to a caravan. A child dressed in a purple T-shirt is seen standing on tiptoe, struggling to open the black door of a caravan that is equipped with a white, foldable partition. After a while, the child finally manages to get it open. An adult wearing a blue hat and a red jacket then enters the frame taking a selfie. Several caravans can be seen in the background. As the adult pans his camera around, another adult wearing a pink T-shirt pops into the scene behind him.", "text": "A young child stands on a small step ladder in front of a large, black and gray RV, reaching up to touch a white object on the RV's exterior. The scene then shifts to a man in a red plaid shirt and a cap, who is talking and moving around in front of the RV. A woman in a pink shirt appears in the background, smiling and looking at the man. The video concludes with a return to the child continuing to touch the white object on the RV.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 882, "extra_info": {"duration": 9.98, "events": ["A child struggles to open a caravan door.", "The child opens the caravan door.", "An adult enters the frame taking a selfie.", "The adult pans his camera around.", "Another adult pops into the scene."], "n_events": 5, "n_shots": 2, "n_subjects": 3}, "vid": 882}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "210.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man was peeking from behind a tree, another man carrying firewood walked by, the man quickly hid behind the tree, the man carrying the firewood continued to walk forward, the man behind the tree raised a handgun with both hands, aiming at the man carrying the firewood, who slowly turned around.", "text": "A man wearing a beanie peeks from behind a tree in a snowy forest, observing his surroundings. He then steps out from behind the tree and walks forward, carrying a bundle of wood. The man hides behind another tree, watching as another man, also in a beanie, emerges from behind a tree and raises a gun. The second man, with a backpack, turns his head to look back at the first man. The first man continues to observe the second man as he turns around.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 210, "extra_info": {"duration": 10.5, "events": ["A man peeked from behind a tree.", "Another man walked by carrying firewood.", "The first man quickly hid behind the tree.", "The man with firewood continued to walk forward.", "The first man raised a handgun with both hands.", "The man with firewood turned around slowly."], "n_events": 6, "n_shots": 5, "n_subjects": 2}, "vid": 210}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "1000.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features animated characters: a large, overweight gray rabbit and a purple squirrel in a brown field surrounded by greenery and trees. The gray rabbit directs the purple squirrel to a certain location. The purple squirrel starts heading there with a hazelnut in its hand, displaying signs of fear. Stopping at a small green bush, the squirrel seems to inspect it. The camera then pulls back, revealing the bush as a wooden trap. The trap springs, launching the purple squirrel into the air and making it drop the hazelnut. The scene then shifts to the squirrel flying through the air.", "text": "In a forest setting with green grass and trees, a large, white, rabbit-like creature stands with its hands on its hips, while a small, gray mouse-like creature holds a ball and looks at the larger creature. The mouse-like creature starts to move to the left, still holding the ball, and the larger creature continues to watch. The mouse-like creature sits down on the ground, still holding the ball, and the larger creature remains standing with its hands on its hips. The mouse-like creature then throws the ball into the air, and the ball is seen flying high above the trees. 
The mouse-like creature looks up at the ball as it ascends.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 1000, "extra_info": {"duration": 12.0, "events": ["Gray rabbit directs purple squirrel to a location.", "Purple squirrel carries hazelnut.", "Purple squirrel displays fear.", "Purple squirrel stops at green bush.", "Purple squirrel inspects green bush.", "Bush reveals itself as wooden trap.", "Wooden trap springs.", "Wooden trap launches purple squirrel into the air.", "Purple squirrel drops hazelnut."], "n_events": 9, "n_shots": 2, "n_subjects": 2}, "vid": 1000}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "729.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera affixed to the train captures from the side the train moving forward in a snowy landscape, with the image gradually darkening as the train enters a tunnel.", "text": "A red train is seen moving along a snow-covered mountainous landscape, approaching a stone bridge with multiple arches. The train continues to move forward, getting closer to the bridge, and begins to enter a tunnel carved into the mountain. The train fully enters the tunnel, and the screen gradually fades to black, displaying the 'video' logo.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 729, "extra_info": {"duration": 22.36, "events": ["The camera captures the train moving forward.", "The train advances in a snowy landscape.", "The image darkens gradually.", "The train enters a tunnel."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 729}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "821.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features one person clad in a blue shirt and white pants with brown hair and red nail polish. Famously, there are four parrots in view. The person uses food to lure the parrots. Of note is the third parrot from the left, or second from the right. This particular parrot spreads its wings and notably flies to the person's right hand from atop the cage. Distinguishing it from the rest are green feathers and an orange bosom.", "text": "A woman with red hair walks towards a large birdcage in a living room, which contains several colorful parrots perched on various levels. She is wearing a black shirt with 'FAN 12' written on the back and white shorts. The living room has a fireplace, a TV, and various decorative items. The woman turns around and smiles at the camera, then turns back to face the birdcage. She extends her hand towards a red and green parrot, which flies towards her and lands on her hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 821, "extra_info": {"duration": 10.01, "events": ["One person lures parrots with food.", "Third parrot from left flies to person's right hand."], "n_events": 2, "n_shots": 1, "n_subjects": 3}, "vid": 821}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "520.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "An athlete is skiing, sliding up a slope, soaring into the air, spinning, and then landing steadily backwards. He then turns around, glides down a track, and once again lands steadily backwards.", "text": "A skier, dressed in a gray jacket and black pants, is seen skiing down a snowy slope, making turns and leaving tracks in the snow. The skier approaches a snow ramp and launches into the air, performing a spin mid-air. The skier lands smoothly on the snow and continues skiing down the slope. The skier then approaches another ramp, jumps off it, and performs a trick in the air. The skier lands on a colorful rail, slides across it, and continues skiing down the slope. The skier makes another turn and continues skiing down the slope.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 520, "extra_info": {"duration": 7.3, "events": ["Athlete skis up a slope", "Athlete slides into the air", "Athlete performs spinning", "Athlete lands backwards steadily", "Athlete turns around", "Athlete glides down a track", "Athlete lands steadily backwards"], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 520}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "171.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man adjusted his crooked beard with his hand, and the scene switched to him parking the vehicle on the edge of a cliff. The camera zoomed out to show a panoramic view of the vehicle at the cliff's edge, then cut to the man turning around to look back. It then switched to the man climbing out of the window and reaching out to grab the ladder on the roof of the vehicle.", "text": "A man with blonde hair is driving a vehicle, holding the steering wheel with his right hand and touching his mustache with his left hand. The vehicle, which is red and white, is seen precariously perched on the edge of a cliff. The man looks ahead with a concerned expression, then quickly turns his head to the left and reaches out to grab the ladder attached to the vehicle. The ladder is occupied by a child and a small creature, and the man grabs the ladder to prevent the vehicle from falling.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 171, "extra_info": {"duration": 7.25, "events": ["The man adjusted his crooked beard", "The man parked the vehicle on the cliff's edge", "The man turned around to look back", "The man climbed out of the vehicle's window", "The man reached out to grab the ladder on the roof"], "n_events": 5, "n_shots": 4, "n_subjects": 1}, "vid": 171}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "579.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Three people are riding on a motorcycle, with the two people at the sides moving forward. The woman is holding the front of the motorcycle and suddenly stands up and turns around. The camera pans to the left, and the woman sets down the motorcycle, riding forward with her feet on the motorcycle. Then she stretches her legs out straight and leans back.", "text": "Three individuals are seated on motorcycles in a covered area, with the person in the middle starting to move forward. The person in the middle begins to ride the motorcycle, lifting it slightly off the ground. The person continues to ride the motorcycle, moving forward and turning it around. The person rides the motorcycle towards a building entrance, passing by a glass door and a red sign.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 579, "extra_info": {"duration": 6.17, "events": ["Three people are riding on a motorcycle.", "Two people at the sides move forward.", "The woman holds the front of the motorcycle.", "The woman suddenly stands up.", "The woman turns around.", "The woman sets down the motorcycle.", "The woman rides forward with her feet on the motorcycle.", "The woman stretches her legs out straight.", "The woman leans back."], "n_events": 9, "n_shots": 1, "n_subjects": 3, "old_duration": 6.17, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/179.mp4"}, "vid": 579}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "761.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A family sits around on a picnic blanket, the woman takes out bread from the basket, while the man takes out plates to put down bananas, and then takes out two bowls, places an apple into each, and hands them to two girls. They kneel and sit; the girl in front picks up her hair and turns around to look, and the girl in the back reaches out to take some food.", "text": "A family is having a picnic on a grassy field, with a blanket spread out and various fruits and bread on it. The man takes out bananas and bread from a basket and places them on the blanket. The woman and the children help by taking out more food items, including apples and grapes, and placing them on the blanket. The man continues to take out more food, including more bananas and bread, while the woman and the children assist. The family continues to arrange the food on the blanket, with the woman and the children actively participating. The man and the woman take out more food, including apples and bread, and place them on the blanket. The family continues to arrange the food, with the woman and the children helping. The man and the woman take out more food, including bananas and bread, and place them on the blanket. The family continues to arrange the food, with the woman and the children actively participating.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 761, "extra_info": {"duration": 13.32, "events": ["Family sits on picnic blanket.", "Woman takes out bread from basket.", "Man takes out plates.", "Man places bananas on plates.", "Man takes out two bowls.", "Man places an apple into each bowl.", "Man hands bowls to two girls.", "Two girls kneel and sit.", "Girl in front picks up her hair.", "Girl in back reaches out for food."], "n_events": 10, "n_shots": 1, "n_subjects": 4}, "vid": 761}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "948.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts a semi-nude individual, clad in black trousers, a black hat, and black shoes. This person takes a bold leap from the roof of a green multi-storey building, landing on the rooftop of another building which hosts an antenna. The jumper carries no safety gear throughout this stunt. Soon after, another person comes into view, adorned with white makeup and several faux scars made with the same. Two more people are seen fastening some sort of white safety gear on a person in a green T-shirt. Meanwhile, the individual with makeup raises their hands in the air.", "text": "A man in black jumps off a moving truck, grabbing the door handle as he descends towards an orange car on the road. The scene then shifts to a shirtless man jumping from a high building, landing on a rooftop. The camera then shows a man with a painted face and a green shirt, who is speaking and gesturing with his hands. Finally, the man with the painted face is seen standing with his arms outstretched, surrounded by several people in a park setting.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 948, "extra_info": {"duration": 6.0, "events": ["Individual jumps from one rooftop to another.", "Person lands on a building hosting an antenna.", "Another person with white makeup comes into view.", "More people fasten safety gear on a person.", "Person with makeup raises hands in the air."], "n_events": 5, "n_shots": 4, "n_subjects": 4}, "vid": 948}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "50.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A letter moves forward, a man in the snow walks ahead. Then the envelope falls down in front of the man. The scene shifts to a close-up of the man turning his head to look to the side, the camera zooms out as the man looks to the other side, then he bends down preparing to pick up the letter on the ground.", "text": "An elderly man with a long white beard, wearing a green cloak, walks through a snowy village with ice-covered buildings and trees. He notices a brown envelope floating in the air and turns to look at it. The envelope continues to float and eventually lands on the ground in front of him. The man looks down at the envelope, then looks up and around the village. He bends down to pick up the envelope from the ground.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 50, "extra_info": {"duration": 20.75, "events": ["A letter moves forward", "A man walks in the snow", "The envelope falls to the ground", "The man turns his head", "The man looks to the side", "The camera zooms out", "The man prepares to pick up the letter"], "n_events": 7, "n_shots": 3, "n_subjects": 1}, "vid": 50}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "288.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera revolves around a man and a woman who are talking face to face, panning from one side of the two to the other, with people walking by on the side.", "text": "A man and a woman are walking side by side on a tree-lined street with colorful buildings in the background. The man, wearing a blue jacket and a backpack, is talking to the woman, who is dressed in a denim jacket, a beanie, and a scarf. As they walk, the woman turns her head to look at the man and smiles. The man continues to talk while the woman maintains her smile and looks at him. The background shows other pedestrians walking and cars parked along the street.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 288, "extra_info": {"duration": 3.88, "events": ["The camera revolves around a man and a woman.", "The man and the woman are talking to each other.", "People are walking by the conversation."], "n_events": 3, "n_shots": 1, "n_subjects": 3}, "vid": 288}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "275.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man in a suit and another man in a vest are sitting on the stairs in front of the door, while a man in a leather jacket stands in front of the door. The two seated men are smoking. The man in the leather jacket takes a step forward, the man in the vest hands over the cigarette in his hand, and the man in the leather jacket takes it and lights it, then walks over and leans against the pillar on the left.", "text": "Three men are sitting on the steps in front of a house with an American flag hanging on the door. The man in the middle, wearing a leather jacket, stands up and starts smoking a cigarette. The man on the left, dressed in a suit, also lights a cigarette. The man in the middle hands a cigarette to the man on the right, who is wearing a vest and a shirt. The man in the middle then walks to the right, turns around, and walks back to the left.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 275, "extra_info": {"duration": 14.0, "events": ["Two men sit on the stairs.", "The two men smoke.", "A man in a leather jacket stands.", "The man in a leather jacket steps forward.", "The man in a vest hands over a cigarette.", "The man in a leather jacket takes the cigarette.", "The man in a leather jacket lights the cigarette.", "The man in a leather jacket walks over.", "The man in a leather jacket leans against the pillar."], "n_events": 9, "n_shots": 1, "n_subjects": 3}, "vid": 275}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "335.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The mirror reflected a man lying on the bed and a woman leaning against the headboard. The camera angle then switched to the bed, where the man glanced at the woman several times before sitting up to check the clock on the table, then lay back down again.", "text": "In a dimly lit room, a man and a woman are lying in bed, with a lamp on a bedside table illuminating the scene. The man, wearing a plaid shirt, turns his head to look at the woman, who is wearing a yellow shirt. The man then shifts his gaze towards a clock on the wall, which shows the time as 3:15. After checking the time, he turns his head back to look at the woman again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 335, "extra_info": {"duration": 11.38, "events": ["The mirror reflected a man lying on the bed.", "A woman leaned against the headboard.", "The man glanced at the woman several times.", "The man sat up to check the clock.", "The man lay back down again."], "n_events": 5, "n_shots": 4, "n_subjects": 2}, "vid": 335}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "452.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A girl with both arms covered in sludge, lifts a bucket by the handle with her left hand and tilts it by the bottom with her right, tipping out a bucket of crawfish still smeared with mud. The camera pans right to a little boy, his body daubed in sludge, who brings over a pair of trousers caked with mud, from which he pours yet another heap of mud-covered crawfish. The backdrop is the entrance of a rural house.", "text": "A woman, covered in mud, lifts a bucket filled with live crayfish and pours them onto the ground. The crayfish scatter and move around on the wet ground. A boy, also covered in mud, uses a net to gather the crayfish into a pile. The boy then picks up a bag and starts to pour the crayfish from the bag onto the ground. The crayfish continue to scatter and move around on the ground as the boy empties the bag.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 452, "extra_info": {"duration": 8.87, "events": ["A girl lifts a bucket by the handle with her left hand.", "The girl tilts the bucket with her right hand.", "The bucket tips out mud-smeared crawfish.", "A little boy brings over a pair of trousers.", "The boy pours out mud-covered crawfish from the trousers.", "Crawfish hit the ground near the rural house."], "n_events": 6, "n_shots": 1, "n_subjects": 2, "old_duration": 10.43, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/52_no_audio.mp4"}, "vid": 452}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "867.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The individual dressed in khaki attire manipulates a blue phone using their index finger to enable silent mode. Following this action, the individual flips the phone upside down and tosses it toward the bed. The person then proceeds to push the keyboard, which is black, white, and red in color, forward. After a small hand gesture, the person picks up a blue book titled \"Osce Revision Guide\" and holds it up to the camera.", "text": "A man stands in a well-lit room with large windows, a bed, and a desk with a computer, a keyboard, and various items. He picks up a phone and starts using it, with a 'Silent Mode On' notification appearing above his head. He then turns around, gestures towards the computer, and places the phone on the desk. The man picks up a book from the desk and shows it to the camera, with a small video playing in the top left corner. He then picks up the phone again and continues to use it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 867, "extra_info": {"duration": 9.01, "events": ["Individual enables silent mode on blue phone", "Individual flips phone upside down", "Individual tosses phone toward bed", "Individual pushes keyboard forward", "Individual makes small hand gesture", "Individual picks up blue book", "Individual holds blue book up to camera"], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 867}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "657.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Below the black screen, a burst of flame suddenly ignites, with little sparks floating above the fire. The light of the fire gradually dims until it fades away, leaving behind the sparks, which also slowly extinguish, decreasing in number until the screen turns black again.", "text": "The video begins with a black screen displaying the 'video' logo. Suddenly, a large explosion occurs, filling the screen with bright flames and sparks. The explosion continues to expand, with flames and debris scattering in all directions. As the explosion subsides, the screen starts to darken, with the flames and sparks gradually fading away. The screen continues to darken until it is almost completely black, with only a few remaining sparks visible.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 657, "extra_info": {"duration": 7.03, "events": ["Flame ignites below the black screen", "Sparks float above the fire", "Fire light gradually dims", "Fire light fades away", "Sparks slowly extinguish", "Number of sparks decreases", "Screen turns black again"], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 657}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "784.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person is continuously hitting a punching bag in a dark environment.", "text": "In a dimly lit environment with two bright lights in the background, a man wearing a hooded jacket is seen punching a heavy bag. The man is consistently hitting the bag with both hands, moving around it as he punches. The heavy bag swings slightly with each hit, indicating the force of the punches. The man's movements are quick and deliberate, showing a focused and intense workout.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 784, "extra_info": {"duration": 17.22, "events": ["A person hits a punching bag.", "Punching bag swings upon impact."], "n_events": 2, "n_shots": 1, "n_subjects": 1}, "vid": 784}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "815.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features two individuals engaged in a friendly boxing match, seven spectators are observing the scene attentively. One of the fighters is attired in a dark blue t-shirt, blue pants, and red boxing gloves, while the other is dressed in a white t-shirt, black pants, and black boxing gloves. They swiftly exchange blows, showcasing strong camaraderie and sportsmanship. The fighter in dark blue outperforms his counterpart by successfully landing six attacks, despite the fighter in white's earnest attempts to evade them. In the midst of the action, a spectator wearing a white t-shirt is seen capturing the thrilling match on his mobile phone. The dark blue fighter is particularly adept at targeting the face of the fighter in white, landing a total of six punches throughout the contest.", "text": "In a room with white walls and a gray carpet, two men wearing boxing gloves are sparring while a group of people watch them. The man in the blue shirt and red gloves throws punches at the man in the white shirt and black gloves, who is wearing black gloves. The man in the white shirt dodges and counters with his own punches. The audience, consisting of men sitting and standing, observes the sparring session, with some recording the event on their phones. The man in the white shirt lands a punch on the man in the blue shirt, causing him to step back. The sparring continues with both men exchanging punches and defensive maneuvers.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 815, "extra_info": {"duration": 10.01, "events": ["Two individuals engage in a friendly boxing match.", "Seven spectators observe the match attentively.", "Fighter in dark blue lands six attacks.", "Fighter in white attempts to evade attacks.", "Spectator captures the match on mobile phone.", "Fighter in dark blue targets the other's face."], "n_events": 6, "n_shots": 4, "n_subjects": 3}, "vid": 815}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "386.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "At night, the rocket crashes to the ground and explodes, sending up a burst of flames. The scene switches to daytime, showing a rocket placed on flat ground exploding after falling over.", "text": "At night, a rocket is seen on a wet surface with a bright light at its base, indicating it is about to launch. The rocket begins to lift off, creating a bright flash and smoke. The rocket continues to ascend, with more smoke and light emanating from its base. The rocket reaches a higher altitude, and the light and smoke become more intense. The rocket starts to descend, and the light and smoke continue to intensify. The rocket lands on the wet surface, creating a large splash and a significant amount of smoke. The rocket continues to slide on the wet surface, with smoke and debris scattering around. The rocket comes to a stop, and the smoke and debris continue to spread.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 386, "extra_info": {"duration": 7.5, "events": ["The rocket crashes to the ground at night.", "The rocket explosion sends up a burst of flames.", "The scene shows a rocket placed on flat ground in daytime.", "The rocket falls over.", "The rocket explodes after falling."], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 386}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "381.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The boy and the girl are writing, the girl looks towards the camera, the camera switches to a woman making a phone call, who turns her head to look to the left of the frame.", "text": "A girl in a pink shirt and a boy in a blue shirt are sitting at a table with various items including a bowl of colorful candies, a cup, a glass of juice, and some stationery. The girl is looking down at something on the table, while the boy is writing in a notebook. The girl then turns her head to look forward. The scene then shifts to a woman holding a phone to her ear, looking forward. She then turns her head to the side and begins to hang up the phone.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 381, "extra_info": {"duration": 3.5, "events": ["The boy writes.", "The girl writes.", "The girl looks towards the camera.", "The woman makes a phone call.", "The woman turns her head to the left."], "n_events": 5, "n_shots": 2, "n_subjects": 3}, "vid": 381}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "654.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The lionfish is slowly swimming in the water by waving its pectoral fins, with other fish swimming nearby. The background is a shallow sea with corals, where the water surface is gently undulating.", "text": "A large, striped fish with a yellow mouth swims steadily over a coral reef, with various corals and small fish visible in the background. The fish moves closer to the camera, revealing more details of its pattern and features, including its eyes and fins. As it continues to swim, the fish's body fills the frame, showing its full length and the intricate patterns on its scales. The fish then swims away from the camera, moving further into the background, while the coral reef and other marine life remain visible.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 654, "extra_info": {"duration": 11.34, "events": ["The lionfish is swimming in the water.", "The lionfish is waving its pectoral fins.", "Other fish are swimming nearby.", "The water surface is gently undulating."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 654}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "515.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In an aquarium, a female diver dressed as a mermaid dances through the water. She makes a heart shape with her hands, followed by a gesture of shooting an arrow and blowing a kiss. A cluster of heart-shaped lights appears in the water, encircling the mermaid. Afterwards, several heart-shaped rings of light emerge, as the mermaid turns and swims away. Many spectators are seen through the glass, watching the mermaid's performance.", "text": "A woman dressed in a white top and a blue mermaid tail is performing underwater in front of a rocky background. She moves her arms and body gracefully, interacting with the audience who are recording her with their phones. The woman continues to dance, making various hand gestures and movements. As she dances, heart-shaped animations appear around her, creating a magical effect. The woman then swims to the right, with more heart-shaped animations following her.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 515, "extra_info": {"duration": 11.1, "events": ["A female diver dances through the water.", "The female diver makes a heart shape with her hands.", "The female diver acts like shooting an arrow.", "The female diver blows a kiss.", "Heart-shaped lights appear in the water.", "Heart-shaped lights encircle the mermaid.", "Heart-shaped rings of light emerge in the water.", "The mermaid turns in the water.", "The mermaid swims away.", "Spectators watch the mermaid's performance through the glass."], "n_events": 10, "n_shots": 1, "n_subjects": 2, "old_duration": 11.1, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/115.mp4"}, "vid": 515}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "500.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two people carrying a large plastic bag approached the edge of a pond. They placed the bag into the water and withdrew the plastic, and a koi fish floated in the water, slowly swimming forward.", "text": "Two individuals, dressed in black jackets and white pants, walk along a stone-paved path towards a pond, carrying a plastic bag containing a large orange fish. They approach the edge of the pond and begin to lower the bag into the water. The person in front carefully releases the fish into the pond, ensuring it is fully submerged. The fish, now in the water, starts to swim away from the edge. The fish continues to swim further into the pond, moving away from the edge.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 500, "extra_info": {"duration": 9.15, "events": ["Two people approached the edge of a pond.", "People carried a large plastic bag.", "People placed the plastic bag into the water.", "People withdrew the plastic bag.", "A koi fish floated in the water.", "Koi fish began to swim forward."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 500}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "150.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man dressed in red sits on the edge of the stage holding a treasure chest. He opens the chest, looks at the gold coins inside, then grabs a handful and throws them back in, bursting into laughter.", "text": "A man dressed in a purple outfit with gold trim walks onto a stage holding a treasure chest. He sits down on the edge of the stage, still holding the chest. The man opens the chest and looks inside, revealing a large pile of gold coins. He reaches into the chest and starts to take out the gold coins, smiling as he does so.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 150, "extra_info": {"duration": 5.0, "events": ["A man sits on the stage edge.", "The man opens the treasure chest.", "The man looks at the gold coins.", "The man grabs a handful of coins.", "The man throws the coins into the chest.", "The man bursts into laughter."], "n_events": 6, "n_shots": 2, "n_subjects": 1}, "vid": 150}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "660.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video captured the process of mass printing of the US dollar. Initially, uncut sheets of US dollars quickly flashed one by one from the right to the left of the screen. Subsequently, the full sheets moved continuously upwards. Then, the sheets slowly progressed from the top down through a chute in the machine, followed by a slow push to the right. The entire process was operated by machines.", "text": "The video begins with a close-up of a machine printing $100 bills, showing the detailed features of the bills as they move along the conveyor belt. The camera then shifts to a wider view, revealing multiple $100 bills being printed and aligned on the machine. The machine continues to print and stack the bills, with the camera capturing the process from different angles. Finally, the machine completes the printing and stacking of the bills, with the camera focusing on the neatly arranged stacks of $100 bills.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 660, "extra_info": {"duration": 27.19, "events": ["Uncut sheets of dollars flashed from right to left.", "Full sheets moved continuously upwards.", "Sheets progressed top down through a chute.", "Sheets received a slow push to the right.", "Machines operated the entire printing process."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 660}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "317.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera faces the stream flowing ahead and slowly moves forward. The shot changes to an aerial view of an open space, where three bald monks are seen walking from below the camera. The camera gradually lowers and shifts to a rear view of the three monks.", "text": "The video begins with a serene view of a waterfall surrounded by lush green trees, with water cascading over rocks into a calm pool below. The scene then transitions to an overhead view of three monks dressed in orange robes walking in a line on a paved path. The monks are holding small items in their hands as they walk, and the camera follows their movement from above.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 317, "extra_info": {"duration": 11.0, "events": ["The camera moves forward", "Three bald monks walk", "The camera lowers", "The camera shifts to a rear view"], "n_events": 4, "n_shots": 2, "n_subjects": 3}, "vid": 317}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "989.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, the dark-skinned person with long orange hair, wearing a red shirt and black jacket and purple glasses, says a few words and immediately transforms into a white person with long blond hair and a ponytail wearing a blue headband, white top and blue skirt in the grassy area.", "text": "A woman with orange hair, wearing a red turtleneck, a black leather jacket, and white glasses, stands in a green, forested area and smiles while looking forward. The scene then transitions to another woman with brown hair, wearing a white blouse with lace details and a gray skirt, who is also in the same green, forested area. She closes her eyes, then opens them and turns her head to the side.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 989, "extra_info": {"duration": 5.0, "events": ["The person speaks.", "The person transforms.", "The transformed person appears in a grassy area."], "n_events": 3, "n_shots": 2, "n_subjects": 2}, "vid": 989}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "58.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "There are already two adults, a man and a woman, sitting on the sofa. A little girl comes to the sofa and sits down. The female character hands the little girl a neatly folded green blanket, which the girl then unfolds and covers her legs with.", "text": "A girl dressed as a doctor, wearing a white coat and a stethoscope, kneels beside a boy who is holding a yellow and brown checkered pillow, and they both look at each other. The girl then turns her head to look at a girl wearing a crown, who is sitting next to the boy. The crowned girl takes a green bag from the boy and hands it to the girl in the doctor's outfit. The girl in the doctor's outfit takes the bag and starts to open it. The girl in the doctor's outfit then turns her head to look at the boy and the crowned girl.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 58, "extra_info": {"duration": 7.75, "events": ["Two adults sit on the sofa.", "A little girl comes to the sofa.", "The little girl sits down on the sofa.", "The female adult hands the little girl a green blanket.", "The little girl unfolds the green blanket.", "The little girl covers her legs with the green blanket."], "n_events": 6, "n_shots": 3, "n_subjects": 3}, "vid": 58}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "639.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera pans from above a palm tree, rotating to the left while simultaneously descending in height. Upon reaching the beach, the shot changes to move forward, gradually catching up to two ladies wearing bikinis, who are running along the beach.", "text": "The video begins with an aerial view of a beach lined with palm trees and several parked cars, with the ocean in the background. The camera moves forward, passing over the palm trees and cars, and continues along the beach. As the camera progresses, it approaches two women running along the shoreline. The women are seen running towards the camera, with the ocean and distant mountains visible in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 639, "extra_info": {"duration": 16.18, "events": ["Camera pans from above a palm tree.", "Camera rotates to the left.", "Camera descends in height.", "Two ladies run along the beach."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 639}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "398.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "At the beginning, the man and woman are holding hands, then they separate. The man pulls off the dust cover from a vehicle, revealing a Mercedes-Benz. He turns his head to look at the woman, who is standing by watching.", "text": "In a spacious, industrial-style room with large windows and brick walls, a man and a woman walk towards a car covered with a green cloth. The man, wearing a light blue shirt and beige pants, and the woman, dressed in a white shirt and jeans, stand in front of the covered car. The man begins to pull the cloth off the car, revealing a silver Mercedes-Benz. The woman watches as the man continues to remove the cloth, fully exposing the car. The man holds the cloth in his hands, and the woman turns her head to look at the car.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 398, "extra_info": {"duration": 5.38, "events": ["Man and woman hold hands.", "Man and woman separate.", "Man pulls dust cover off vehicle.", "Vehicle is revealed as a Mercedes-Benz.", "Man turns head to look at woman.", "Woman stands by watching."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 398}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "920.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the hurdle race, the princess takes the lead, followed by the turtle in second and Super Mario in the last position as they clear the first hurdle. Super Mario then accelerates, overtaking the turtle to claim second place. They all proceed to pass the second hurdle. As Super Mario continues to increase his speed, he begins to close in on the princess. All three continue to race and successfully pass the third hurdle. Just before the fourth hurdle, Super Mario takes over the lead from the princess.", "text": "The scene is set in a stadium with a red track and blue barriers, surrounded by a crowd of spectators. The characters, including a green character, a red character, and a yellow character, are running on the track. The green character is consistently in the lead, followed by the red character, and then the yellow character. The red character is seen jumping over hurdles and occasionally emitting sparks. The yellow character is also seen jumping over hurdles and emitting sparks. The green character maintains its lead throughout the race, while the red character and yellow character follow closely behind. The red character occasionally emits sparks as it runs. The yellow character continues to follow the red character, both maintaining their positions relative to each other. The green character remains in the lead, with the red character and yellow character trailing behind. The red character and yellow character continue to jump over hurdles, with the red character occasionally emitting sparks. The green character maintains its lead, while the red character and yellow character follow closely behind. 
The red character and yellow character continue to jump over hurdles, with the red character occasionally emitting sparks. The green character remains in the lead, while the red character and yellow character follow closely behind. The red character and yellow character continue to jump over hurdles, with the red character occasionally emitting sparks. The green character maintains its lead, while the red character and yellow character follow closely behind. The red character and yellow character continue to jump over hurdles, with the red character occasionally emitting sparks. The green character remains in the lead, while the red character and yellow character follow closely behind. The red character and yellow character continue to jump over hurdles, with the red character occasionally emitting sparks. The green character maintains its lead, while the red character and yellow character follow closely behind. The red character and yellow character continue to jump over hurdles, with the red character occasionally emitting sparks. The green character remains in the lead, while the red character and yellow character follow closely behind. The red character and yellow character continue to jump over hurdles, with the red character occasionally emitting sparks. The green character maintains its lead, while the red character and yellow character follow closely behind. The red character and yellow character continue to jump over hurdles, with the red character occasionally emitting sparks. The green character remains in the lead, while the red character and yellow character follow closely behind. The red character and yellow character continue to jump over hurdles, with the red character occasionally emitting sparks. The green character maintains its lead, while the red character and yellow character follow closely behind. 
The red character", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 920, "extra_info": {"duration": 8.0, "events": ["Princess takes the lead.", "Turtle follows in second.", "Super Mario clears first hurdle.", "Super Mario overtakes turtle.", "All pass second hurdle.", "Super Mario closes on princess.", "All clear third hurdle.", "Super Mario leads before fourth hurdle."], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 920}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "71.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A group of animated characters each holds a gemstone in their hands, with the gemstones emitting a dazzling light. The camera cuts to a pair of feet that also shine brightly, which are then replaced by a sparkling pair of shoes. The camera switches again to a hand that glows dazzlingly, and then a brown glove is put on. Afterwards, the character, wearing the glove, clenches the fist tightly.", "text": "A group of animated characters, including a girl with long red hair, a blue unicorn, a yellow pony, and others, stand together in a forest setting, looking forward. The girl in the center holds a glowing red object, and the characters around her also hold glowing objects. The scene transitions to a close-up of a pair of blue shoes with wings, which start to glow and transform. The shoes change into golden boots with wings, and the transformation continues. A hand reaches out and grabs a glowing object, which emits bright light. The hand then holds a brown object with a green gem and a golden ring.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 71, "extra_info": {"duration": 3.62, "events": ["Animated characters hold gemstones.", "Gemstones emit dazzling light.", "Feet shine brightly.", "Sparkling shoes replace feet.", "Hand glows dazzlingly.", "Character puts on brown glove.", "Character clenches fist tightly."], "n_events": 7, "n_shots": 3, "n_subjects": 2}, "vid": 71}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "813.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video captures a pair standing adjacent to a shelf in a room. The person on the right of the frame is donned in a shirt with blue and white vertical stripes, brown footwear, and carries a black balloon affixed to a stick. The person on the left, having long blonde hair, is attired in a pink and white checked dress complemented by golden sandals. The individual with long hair pops the black balloon, leading to a shower of small yellow papers.", "text": "In a room decorated with a bookshelf, a white cabinet, and a colorful painting on the wall, a man in a blue striped shirt and beige pants holds a black balloon while a woman in a pink checkered dress stands beside him. The woman raises her hand and appears to be speaking as the man holds the balloon. The balloon is then popped, releasing a shower of gold confetti. The confetti falls around them, and the camera moves downward, showing the confetti scattered on the floor.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 813, "extra_info": {"duration": 6.01, "events": ["Person on the right carries a black balloon.", "Person with long hair pops the balloon.", "Small yellow papers shower from the popped balloon."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 813}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "185.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The orange creature shuffled the basket, the camera cut to the contents spilling out of the plate, the creature fell forward and lay on the ground. The pink creature ran over holding the basket, then pointed at the yellow creature and laughed loudly before falling backward, losing balance and falling forward onto the ground. The orange creature turned around, pointed at the pink creature and laughed heartily. The pink creature sat up, picked up an object and threw it at the orange creature, then pointed at it and laughed out loud. The orange creature also picked up an item and threw it at the pink creature.", "text": "A basket filled with cookies is seen on a table. The basket is lifted by a character with orange hands and a red hat, while another character with a purple hat and a basket of cookies runs towards them. The character with the red hat loses balance and falls, causing the cookies to spill out and scatter on the floor. The character with the purple hat also loses balance and falls, dropping their basket of cookies. Both characters laugh and look at the mess they have created. 
The character with the purple hat gets up and starts picking up the scattered cookies.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 185, "extra_info": {"duration": 11.0, "events": ["The orange creature shuffled the basket.", "Contents spilled out of the plate.", "The orange creature fell forward onto the ground.", "The pink creature ran over holding the basket.", "The pink creature pointed at the yellow creature and laughed.", "The pink creature lost balance and fell onto the ground.", "The orange creature turned and laughed at the pink creature.", "The pink creature sat up.", "The pink creature threw an object at the orange creature.", "The orange creature threw an item at the pink creature."], "n_events": 10, "n_shots": 2, "n_subjects": 2}, "vid": 185}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "24.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man falls from the cliff on the ice surface, dropping the manuscript he was holding off the cliff. He grabs onto the ice with his hands, causing scratches to appear on the surface. In the end, a creature falls down, and the scene concludes with it being suspended by a rope.", "text": "A bald character with a scarred head and wearing a blue outfit is seen struggling to climb a blue icy wall. The character reaches out and grabs a pickaxe, but loses grip and falls off the wall. The character's hands, wearing gloves, are shown gripping the icy wall tightly as they fall. The character's face shows a pained expression as they continue to fall. The scene then shifts to a red, furry character with a large nose and spiky hair, who is also hanging from a rope and looking down.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 24, "extra_info": {"duration": 3.5, "events": ["A man falls from the cliff.", "The man drops the manuscript.", "The manuscript falls off the cliff.", "The man grabs onto the ice.", "Scratches appear on the ice surface.", "A creature falls down.", "The creature is suspended by a rope."], "n_events": 7, "n_shots": 4, "n_subjects": 2}, "vid": 24}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "775.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The person clad in red clothing is holding a stopwatch while standing in the background, as two women crouch down in a starting position and then run forward together, with the camera following their movement.", "text": "In an indoor track and field facility, a man in a red jacket stands in the background holding a stopwatch, while two women, one in black and the other in blue, are crouched at the starting line, preparing to run. The women begin to rise from their starting positions, getting into a sprinting stance. They start running forward, with the woman in black slightly ahead of the woman in blue. The camera follows their movement as they sprint down the track, maintaining their positions relative to each other. The background shows hurdles and other track equipment, indicating a training environment.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 775, "extra_info": {"duration": 6.9, "events": ["Person holds a stopwatch.", "Two women crouch down.", "Women run forward together."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 775}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "996.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In this video, an individual donned in an orange shirt and gray shorts is seen placing a black and white football on the grass. A referee dressed in a blue shirt and black pants stands behind the player. The player in orange takes a run up and kicks the ball with his right foot. The ball soars towards the goal, and the goalkeeper clad in yellow clothing and blue gloves raises both hands in an attempt to block it, but fails to do so. The ball zooms past the goalkeeper, into the net, catching fire and then transforming into a burst of vibrant fireworks. A person dressed in red cheers on the grass, raising both hands in celebration.", "text": "The video begins with an aerial view of a brightly lit stadium at night, showing the field and surrounding parking lot. A hand places a soccer ball on the grass in front of the goal, with a goalkeeper standing in front of the net. The camera then shows a player in a red shirt preparing to take a free kick, with the goalkeeper ready to defend. The player kicks the ball, and the goalkeeper dives to the right in an attempt to save it. The ball flies past the goalkeeper and into the net. The scene then transitions to a figure engulfed in flames, rising into the sky amidst smoke.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 996, "extra_info": {"duration": 8.0, "events": ["Player places football on grass", "Player kicks ball with right foot", "Ball soars towards goal", "Goalkeeper attempts to block ball", "Goalkeeper fails to stop ball", "Ball zooms into net", "Ball catches fire", "Ball transforms into fireworks", "Person in red cheers raising hands"], "n_events": 9, "n_shots": 4, "n_subjects": 4}, "vid": 996}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "459.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A gentleman, holding a beer bottle in his right hand, walked through the entrance. He glanced around and wedged the bottle cap on the edge of the door, then forcefully kicked the door with his foot. Afterward, he turned and walked back. The impact of the door shattered the bottle top. The gentleman looked at the broken bottle top in surprise, as the damaged door fell toward him, striking the back of his head. He sat down on the ground, dazed by the hit, rubbing the back of his head with his left hand while still holding the beer bottle in his right hand.", "text": "A man in a white shirt and black pants is in a room with a bunk bed and a wooden door. He holds a green bottle and approaches the door, attempting to open it. As he kicks the door, it falls, and the bottle slips from his hand, spilling its contents on the floor. The man reacts by turning around, dropping the bottle, and kneeling on the floor, holding his head in pain.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 459, "extra_info": {"duration": 6.75, "events": ["Gentleman walked through the entrance.", "Gentleman glanced around.", "Gentleman wedged bottle cap on door edge.", "Gentleman kicked the door with his foot.", "Gentleman turned and walked back.", "Door impact shattered the bottle top.", "Broken door fell toward the gentleman.", "Door struck the gentleman on the head.", "Gentleman sat down on the ground.", "Gentleman rubbed the back of his head."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 459}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "482.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A high-mountain skier is carrying his skis, walking up along a mountain ridge. He uses his skis to cut a transverse groove into the accumulated snow on the ridge to facilitate stepping, forming a row of blocky paths in the snow behind him. The camera then cuts to the mountaintop, as the skier slides down from the summit. He leaps off a cliff, rotating in the air, and lands steadily on the slope, continuing to glide downwards.", "text": "A person, dressed in winter gear and carrying skis, climbs a steep, snow-covered slope, using their hands to pull themselves up. They reach the top of the slope, standing on the edge and looking down at the valley below. The person then jumps off the edge, skiing down the steep, snowy mountain face. The skier navigates the descent, making sharp turns and maintaining balance as they descend the slope.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 482, "extra_info": {"duration": 9.67, "events": ["High-mountain skier carries skis up mountain ridge.", "Skier cuts transverse groove in snow.", "Skier forms row of blocky paths.", "Skier walks along mountain ridge.", "Skier slides down from summit.", "Skier leaps off cliff.", "Skier rotates in the air.", "Skier lands steadily on slope.", "Skier continues to glide downwards."], "n_events": 9, "n_shots": 2, "n_subjects": 1}, "vid": 482}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "987.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, an individual wearing glasses along with a white gown and red vest is shown receiving information from another person sporting a gray coat and a beard. The bearded person sits opposite the glasses-wearing individual. Secretly, the individual in glasses gestures with his fingers and left hand towards a distant sniper. Spotting the signal, the sniper who previously had his gun targeted, discontinues his action and refrains from taking the shot.", "text": "A man wearing a red vest is seen speaking while gesturing with his hand, with a large building in the background. He then turns his head to the side. The camera shifts to show a hand resting on a railing, with the background still showing the building. The camera then moves to reveal a man in a green shirt aiming a rifle from behind the railing. The final frame shows a wider view of the scene, with the man in the green shirt aiming his rifle at a group of people standing on a platform in front of a large building.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 987, "extra_info": {"duration": 6.0, "events": ["Individual in glasses receives information from bearded person", "Bearded person sits opposite individual in glasses", "Individual in glasses gestures with fingers", "Individual in glasses gestures with left hand", "Individual in glasses signals to distant sniper", "Sniper previously targets with gun", "Sniper spots the signal", "Sniper discontinues his action", "Sniper refrains from taking the shot"], "n_events": 9, "n_shots": 4, "n_subjects": 2}, "vid": 987}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "814.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A conversation is happening between people which is incidental to five cars moving by in the background. Four cars are moving from the right to the left of the frame and one car from the left to the right. Of those four cars, the first is black, the second and the third are grey, and the fourth car is white.", "text": "A group of people, including children and adults, are gathered on the side of a road with bicycles. The children are wearing helmets and school uniforms, and they are engaged in a conversation with two adults, one in a suit and the other in a vest. The background shows a tree-lined street with cars passing by. The children appear to be explaining something to the adults, who are listening attentively. The adults occasionally gesture with their hands while talking. The children continue to speak, and the adults respond, maintaining a friendly and engaged conversation.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 814, "extra_info": {"duration": 9.99, "events": ["People engage in conversation.", "First black car moves right to left.", "Second grey car moves right to left.", "Third grey car moves right to left.", "Fourth white car moves right to left.", "One car moves left to right."], "n_events": 6, "n_shots": 1, "n_subjects": 4}, "vid": 814}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "526.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A group of motorcyclists are riding mini motorcycles on a racetrack. At a curve, the riders lean their bodies and take the turn with a tilt.", "text": "A group of motorcyclists are racing on a track, leaning into a turn with their bodies close to the ground. The camera follows the riders as they navigate the curve, with the motorcyclist in the blue and red suit leading the pack. The rider in the black suit with the number 16 on their bike is closely followed by the rider in the yellow and black suit. The camera continues to track the riders as they exit the turn, with the rider in the yellow and black suit moving ahead. The video ends with a close-up of the rider in the yellow and black suit, showing their focused posture as they ride.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 526, "extra_info": {"duration": 6.2, "events": ["A group of motorcyclists ride mini motorcycles.", "Motorcyclists approach a curve on the racetrack.", "Motorcyclists lean their bodies.", "Motorcyclists take the turn.", "Motorcycles tilt on the curve."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 526}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "852.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features a recording from a car's perspective. It suddenly decelerates, even coming to a full stop in the middle of the road. As a result, the vehicle following behind collides with it. After the impact, the car spins to face the adjacent lane where a white truck is approaching and ends up in a head-on collision with it.", "text": "The video is split into two sections: the left side shows a man sitting in the passenger seat of a car, wearing a seatbelt, and talking while looking forward. The right side shows the view from the car's dashboard camera, capturing the road ahead as the car moves forward. The man continues to talk and occasionally looks around. Suddenly, he reacts with surprise, moving his body and raising his hand. The car continues to drive forward as the man regains his composure.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 852, "extra_info": {"duration": 7.01, "events": ["The car decelerates suddenly.", "The car comes to a full stop on the road.", "The following vehicle collides with the stopping car.", "The stopping car spins post-collision.", "The spinning car faces the adjacent lane.", "A white truck approaches in the adjacent lane.", "The spinning car collides head-on with the white truck."], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 852}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "563.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man in a black short-sleeved shirt walks towards the camera, picks up the nunchucks in front of him, taps them together and gives a couple of pulls as a demonstration, then performs nunchuck movements with hooking actions to the left and right. The scene ends with a cross of his arms. The background is an indoor living room with a computer and a computer desk around.", "text": "In a room with a desk, computer, and various items on shelves, a man in a black shirt and beige pants holds two metal pipes connected by a chain. He demonstrates how to separate the pipes by pulling them apart. He then reassembles the pipes and starts performing a series of movements, including swinging the pipes in various directions. The man continues to manipulate the pipes, showing different techniques and movements. Finally, he brings the pipes back together and holds them in a ready position.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 563, "extra_info": {"duration": 10.36, "events": ["The man walks towards the camera", "The man picks up the nunchucks", "The man taps nunchucks together", "The man gives a couple of pulls", "The man performs nunchuck movements", "The man hooks to the left", "The man hooks to the right", "The scene ends with the man crossing arms"], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 563}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "464.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Inside a cement pipe, a little girl forcefully kicks a yellow soccer ball. The ball spins around the pipe twice, passing between the girl's legs, and continues to spin upwards. The girl leans forward slightly and catches the ball with the black bag she is carrying on her back. She then looks back to confirm the ball has fallen into the black bag, steps sideways twice to get out of the pipe, turns to face the camera, and with a smile, takes the black bag off her shoulder.", "text": "A child in a red jacket and black pants stands inside a large circular concrete pipe, facing a yellow ball on the ground. The child kicks the ball, which flies out of the pipe. The child turns around and kicks the ball back into the pipe. The child continues to kick the ball around inside the pipe, moving from one side to the other. The child looks towards the camera and smiles. The child continues to kick the ball around inside the pipe, moving from one side to the other.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 464, "extra_info": {"duration": 9.4, "events": ["The girl kicks the yellow soccer ball inside the cement pipe", "The ball spins around the pipe twice", "Soccer ball passes between the girl's legs", "Ball continues to spin upwards", "The girl leans forward slightly", "The girl catches the ball with the black bag", "The girl looks back to confirm ball is in black bag", "The girl steps sideways twice to exit the pipe", "The girl turns to face the camera", "The girl smiles and removes the black bag from shoulder"], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 464}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "479.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A row of motorcycle riders is poised to take off. Once the barriers in front of the motorcycles are lowered, the riders surge forth from the starting point. However, the barriers in front of two motorcycles are not properly set, preventing these two riders from immediately dashing onto the track. These two riders adjust their positions and mount the track only after the others have sped away. They raise their hands one after the other to signal backward before continuing with the race.", "text": "A group of motorcyclists are lined up at the starting line, ready to race, with a dirt track and spectators in the background. The starting gate drops, and the motorcyclists accelerate forward, kicking up dirt as they begin the race. The camera captures the motorcyclists speeding down the track, with the crowd watching from the sidelines. The motorcyclists continue to race, with one rider in a white and blue outfit leading the pack. The race progresses with the riders navigating the dirt track, creating trails of dust behind them.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 479, "extra_info": {"duration": 11.1, "events": ["Motorcycle riders are poised at the starting point.", "Barriers in front of the motorcycles are lowered.", "Riders surge forth from the starting point.", "Two barriers fail to lower properly.", "Two riders are prevented from dashing onto the track.", "Riders adjust their positions after the barrier malfunction.", "Delayed riders mount the track.", "The first delayed rider raises hand to signal backward.", "The second delayed rider raises hand to signal backward.", "Both delayed riders continue with the race."], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 479}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "362.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The lights flicker as the woman walks forward, looking up above. The camera cuts to the light tube bursting and falling off.", "text": "A woman wearing a teal uniform with pink trim and a white towel over her shoulder walks forward in a room with a tiled floor and a potted plant in the background. She stops and looks up, her expression changing to one of surprise. The camera then shifts to a ceiling light fixture, which suddenly shatters, scattering glass and light bulbs.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 362, "extra_info": {"duration": 3.62, "events": ["The lights flicker.", "The woman walks forward.", "The woman looks up.", "The light tube bursts.", "The light tube falls off."], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 362}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "390.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman took the wine bottle from the man's hand, then poured out all the alcohol inside. The camera pans to a close-up of the pouring process, then cuts back to the man opposite. He shows an expression of helplessness, slightly turning his head, while the woman also turns her head to look at the man.", "text": "In a forested outdoor setting with a wooden structure in the background, a woman with curly hair and a patterned cardigan looks down and then picks up a bottle of liquor. She turns and begins to pour the liquor into a cup held by a man with long hair, who is wearing a white tank top. The man watches as the liquor is poured, and then he turns his head to look at the woman.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 390, "extra_info": {"duration": 5.75, "events": ["The woman took the wine bottle from the man's hand.", "The woman poured out all the alcohol.", "The man showed an expression of helplessness.", "The man slightly turned his head.", "The woman turned her head to look at the man."], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 390}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "279.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two men looked ahead, the man on the right pulled up the zipper of his jacket, and then both turned around and walked back, as another man approached from behind them.", "text": "In a locker room with green lockers and a window in the background, a blonde boy leans into a locker while a dark-haired boy stands beside him. The dark-haired boy adjusts his collar with his right hand. The blonde boy continues to look inside the locker, then turns and starts to close the locker door. The dark-haired boy follows the blonde boy as they both walk away from the locker.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 279, "extra_info": {"duration": 4.5, "events": ["Two men looked ahead.", "The man on the right pulled up his zipper.", "Both men turned around.", "Both men walked back.", "Another man approached from behind."], "n_events": 5, "n_shots": 1, "n_subjects": 3}, "vid": 279}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "861.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts a red car preparing to make a right turn. Simultaneously, a grey car emerges from the right side of the red car. The red car then halts to yield to the grey car.", "text": "A red car is seen driving on a road with a cloudy sky in the background, approaching an intersection. The car continues to move forward and begins to turn right at the intersection. The car completes the turn and starts to drive away from the camera, moving towards the right side of the frame. The camera captures the car from different angles as it drives away, showing the surrounding buildings and other vehicles on the road. The car continues to drive away, and the camera angle shifts slightly to show more of the road and the buildings. The car is now further away, and the camera captures the empty road and the buildings in the background. The car is seen driving away from the camera, and the camera angle shows the road and the buildings from a different perspective.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 861, "extra_info": {"duration": 8.98, "events": ["Red car prepares to make a right turn.", "Grey car emerges from right side of red car.", "Red car halts to yield to grey car."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 861}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "498.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "At the entrance of a house, with hail falling outside, a person tapped a little boy on the shoulder. The boy quickly dashed into the hail, ran to the roadside, took out an electric scooter key, inserted it into the keyhole, unlocked the scooter seat, and took out a red rain cape. Then he ran back towards the entrance.", "text": "A person is holding a child who is sitting on a ledge, with potted plants and a scooter in the background. The child, wearing a blue shirt and shorts, is then released and runs towards the scooter. The child reaches the scooter, which is parked on a wet street, and starts to open the seat. The child lifts the seat and takes out a red piece of clothing. The child then turns around and walks back towards the ledge, holding the red clothing.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 498, "extra_info": {"duration": 7.95, "events": ["A person tapped the boy on the shoulder.", "The boy dashed into the hail.", "The boy ran to the roadside.", "The boy took out an electric scooter key.", "The boy inserted the key into the keyhole.", "The boy unlocked the scooter seat.", "The boy took out a red rain cape.", "The boy ran back towards the entrance."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 498}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "961.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person with blonde hair, wearing a black T-shirt, white bracelet and black shorts throws off an orange bedspread and slowly gets out of bed. This person goes to the window, opens the curtain and fills it with sunlight. Then, this person walks to the other window next to the mirror and opens the curtain of that window. Then this person enters the bathroom through a door with towels hanging on it.", "text": "A woman is lying in bed under a red blanket in a dimly lit room with a window covered by blinds. She sits up and pulls the blanket off herself. She gets out of bed and walks towards the window, opening the blinds to let in light. The room becomes brighter as she opens the blinds. She walks towards the door, and the text 'GoodMorning' appears on the screen. She continues walking towards the door, and the text 'I have a story time for y'all' appears. She enters the bathroom, and the text disappears.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 961, "extra_info": {"duration": 10.01, "events": ["Person throws off orange bedspread", "Person gets out of bed", "Person opens the curtain", "Sunlight fills the room", "Person walks to the other window", "Person opens the second curtain", "Person enters the bathroom", "Towels hang on the door"], "n_events": 8, "n_shots": 4, "n_subjects": 1}, "vid": 961}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "347.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The door of a prison cell is open, a male officer is holding the door, and a female officer, holding the keys, looks inside the cell. Inside, there is a female prisoner who stands up and walks out of the cell. The camera angle switches to the back of the female prisoner as the female officer grabs the prisoner's arm and walks to the right, while the male officer closes the cell door. The camera then switches to a corridor where the female prisoner is in the middle, flanked by the two officers, as all three walk forward.", "text": "A woman in a blue uniform stands outside a cell, holding a key, while a man inside the cell looks at her. The woman turns her head to look at the man inside the cell. The camera then shows a woman inside the cell, who gets up and walks towards the door. The man inside the cell opens the door, and the woman in the blue uniform and the man outside the cell enter the room. The three of them walk down a hallway together. The woman in the blue uniform and the man outside the cell look at the woman inside the cell.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 347, "extra_info": {"duration": 11.88, "events": ["Prison cell door opens", "Male officer holds the door", "Female officer holds keys", "Female officer inspects the cell", "Female prisoner stands up", "Female prisoner walks out of the cell", "Female officer grabs prisoner's arm", "Officers escort prisoner to the right", "Male officer closes the cell door", "Officers walk forward with prisoner"], "n_events": 10, "n_shots": 4, "n_subjects": 3}, "vid": 347}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "755.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two metal dice fall down, roll forward, and then come to a stop.", "text": "The video begins with a black background. A single die appears from the top of the screen and starts falling. Another die appears and falls next to the first die. Both dice continue to fall and land on the screen. The dice settle on the screen, with one die showing a different number than the other. The dice remain stationary on the screen. The screen goes black again. The dice reappear from the top of the screen and start falling.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 755, "extra_info": {"duration": 5.34, "events": ["Two metal dice fall down.", "Metal dice roll forward.", "Metal dice come to a stop."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 755}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "905.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video clip features a glass filled with ice. Over time, the ice slowly begins to sink in the water as it melts.", "text": "A glass filled with water and ice cubes is placed against a dark background. The ice cubes inside the glass begin to move and shift positions as the water level slightly decreases. The ice cubes continue to move and settle at the bottom of the glass as the water level stabilizes.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 905, "extra_info": {"duration": 12.0, "events": ["Ice is present in the glass.", "Ice slowly starts to melt.", "Melted ice causes water level to rise.", "Ice cubes begin sinking.", "Final ice cube disappears into water."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 905}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "426.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a basketball court, a team wearing black jerseys is playing a basketball game against various other teams. One of the guards, using skilled techniques of dribbling, penetrating, and passing, helps his team score goals frequently.", "text": "A man in a black sleeveless shirt dribbles the basketball while being guarded by a man in a white shirt on an outdoor court surrounded by a crowd of spectators. The man in black passes the ball to a teammate, who then makes a layup. The man in black retrieves the ball and dribbles again, evading another defender and making another layup. The man in black continues to dribble and pass the ball to his teammates, who make successful layups. The man in black dribbles past another defender and makes a layup. The man in black retrieves the ball and dribbles again, evading another defender and making another layup.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 426, "extra_info": {"duration": 15.05, "events": ["Team in black jerseys plays basketball game.", "Guard dribbles the ball.", "Guard penetrates the defense.", "Guard passes the ball.", "Team scores a goal."], "n_events": 5, "n_shots": 4, "n_subjects": 2}, "vid": 426}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "848.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video clip depicts a person enjoying a meal at a restaurant. On the table, there's a dish filled with a variety of cakes, a spoon, and some red sauce. The assorted cakes include a yellow pancake and a macaron. The person initially presents the yellow pancake to the camera and proceeds to break it into two halves. Following this, they apply some red sauce to one half of the pancake.", "text": "A woman with red hair and tattoos is sitting at a table in a café named 'Royal Quarter,' holding a piece of food and taking a bite. She then looks down at the food in her hand, which is a scone, and continues to hold it. The camera shifts to show a display case with various pastries and desserts, including scones, macarons, and other baked goods. The woman then uses a spoon to scoop some red jam from a small bowl and spreads it on the scone. She continues to hold the scone and the jam, preparing to eat it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 848, "extra_info": {"duration": 10.01, "events": ["A person enjoys a meal at a restaurant", "Dish displays a variety of cakes", "Person presents yellow pancake to the camera", "Person breaks yellow pancake into two halves", "Person applies red sauce to half of the pancake"], "n_events": 5, "n_shots": 3, "n_subjects": 3}, "vid": 848}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "717.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera shakes as it films the Mona Lisa painting. As the lens pulls back, many people's heads come into view, with numerous individuals holding up their mobile phones or cameras to take pictures. A man dressed in black steps out from the crowd and turns to walk away. He then looks down at the camera in his hands. The camera angle changes to another direction, showing a dense crowd of visitors moving around inside the exhibition hall.", "text": "The video begins with a close-up of a framed painting on a wall, showing a woman in a dark dress. The camera then zooms out to reveal a crowd of people taking photos of the painting with their phones and cameras. The camera continues to pull back, showing more of the crowd and the surrounding gallery, which is filled with other paintings. The crowd is diverse, with people of various ages and appearances, all focused on capturing the moment. The gallery is spacious, with high ceilings and large, ornate paintings displayed on the walls.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 717, "extra_info": {"duration": 20.77, "events": ["The camera films the Mona Lisa painting.", "The lens pulls back.", "People's heads come into view.", "Individuals hold up their mobile phones or cameras.", "A man dressed in black steps out from the crowd.", "The man turns to walk away.", "The man looks down at his camera.", "Visitors move around inside the exhibition hall."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 717}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "652.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A well-built man is wrapping his right hand fingers, palm, and wrist with a black cloth bandage, and his left hand is already wrapped with the bandage.", "text": "In a boxing gym with a blue and white boxing ring in the background, a person wearing a white shirt is seen wrapping black hand wraps around their hands. The person starts by wrapping the hand wrap around their wrist and then continues to wrap it around their hand, ensuring it is secure. The wrapping process involves multiple layers to provide support and protection. The person adjusts the wrap to fit snugly around their hand and wrist, making sure it is tightly secured. The wrapping continues with the person pulling the wrap through and tightening it around their hand and wrist. The process is methodical and careful, ensuring the wrap is properly aligned and secure.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 652, "extra_info": {"duration": 22.86, "events": ["Man wraps right hand fingers with black cloth bandage.", "Man wraps right hand palm with black cloth bandage.", "Man wraps right hand wrist with black cloth bandage.", "Left hand is already wrapped with black cloth bandage."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 652}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "744.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A dark shadow against a red background moved upwards, then morphed into a humanoid shape, and subsequently transformed into black lines that vanished upwards. Next, it changed into a terrifying human face with an agape bloodied mouth flying towards the camera. Blackness took over the entire screen, and finally, red once again spread across the entire screen.", "text": "The video begins with a red background and the 'video' logo in the center. A black shape starts to appear from the bottom of the screen, gradually forming a skull-like figure. The skull figure becomes more defined and detailed, with the 'video' logo remaining in the center. The skull figure continues to expand and become more prominent, filling the screen. The skull figure then starts to disintegrate and fade away, leaving the red background. The skull figure reappears, now more abstract and fragmented, before it completely disappears. The video ends with the red background and the 'video' logo in the center.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 744, "extra_info": {"duration": 8.24, "events": ["A dark shadow moved upwards against a red background.", "The shadow morphed into a humanoid shape.", "The humanoid shape transformed into black lines.", "Black lines vanished upwards.", "The shape changed into a terrifying human face.", "A bloodied mouth flew towards the camera.", "Blackness took over the entire screen.", "Red spread across the entire screen."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 744}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "134.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The bald little boy sat on the mat while the little girl stood with her hands behind her back, talking to the boy. The little boy walked away to the left, then the screen faded to dark. The little boy came back and continued to sit on the mat. The little girl smiled, turned around, and turned on the projector in front of her.", "text": "In a room with a green couch and a blue curtain, a boy in a yellow shirt with black stripes and a girl in a blue dress are sitting on a blue rug, facing each other. The boy looks at the girl, who is holding a projector on a small table. The boy's expression changes to a frown, and he starts to get up. The girl's expression also changes to a frown as the boy walks away. The room's lighting dims, and the girl smiles and turns on the projector, directing the light towards the window. The boy returns, and the girl turns back to face him, smiling.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 134, "extra_info": {"duration": 8.38, "events": ["The bald little boy sat on the mat.", "The little girl talked to the bald little boy.", "The bald little boy walked away to the left.", "The screen faded to dark.", "The bald little boy came back.", "The bald little boy continued to sit on the mat.", "The little girl smiled.", "The little girl turned around.", "The little girl turned on the projector."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 134}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "222.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man looked back, and a woman walked up and opened the door. She turned her head to look at the person coming out. After that, a person in black and another in a trench coat came out. The one in the trench coat turned around and walked forward, speaking to the woman while pointing at her.", "text": "The scene is set in a festive store decorated with Christmas ornaments, lights, and a sign that reads 'WONDERFUL.' A man in a beige coat and a woman in a green elf costume walk towards the store entrance. The woman opens the door and steps outside, followed by the man. The woman holds the door open as the man exits, and she turns to look at him. The man turns back to face the woman and starts talking to her.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 222, "extra_info": {"duration": 6.38, "events": ["The man looked back", "A woman walked up", "The woman opened the door", "The woman turned her head", "A person in black came out", "Another in a trench coat came out", "The one in the trench coat turned around", "The one in the trench coat walked forward", "The one in the trench coat spoke to the woman", "The one in the trench coat pointed at the woman"], "n_events": 10, "n_shots": 3, "n_subjects": 3}, "vid": 222}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "648.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A knife blade has a bright red chili pepper skewered on its tip, with a droplet of water poised to fall on either side of the pepper. The camera gradually zooms in, magnifying the knife tip and the pepper slowly. Then, a mist of water showers down from above, covering the surface of the chili with beads of water, making the entire scene even more vivid and tantalizing.", "text": "A red chili pepper with water droplets is placed on a metal fork against a black background. The camera gradually zooms in on the chili pepper, and water droplets begin to fall from the chili pepper. The water droplets continue to fall, creating a small stream of water. The camera continues to zoom in, focusing on the chili pepper and the falling water droplets.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 648, "extra_info": {"duration": 8.77, "events": ["A knife blade skewers a bright red chili pepper.", "A droplet of water poises on either side of the pepper.", "The camera zooms in on the knife tip and pepper.", "A mist of water showers down from above.", "Water covers the chili with beads."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 648}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "594.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The yellow little bird picks up a small ring and walks to the left, standing on a wooden board, it slips the small ring in its beak onto the pole, then turns around and walks to the right, picks up another small ring and returns to the pole to slip it on. The bird repeats the actions above, and the video ends with the bird approaching the pole with the last ring in its beak.", "text": "A yellow and white bird is seen holding an orange ring in its beak while standing on a wooden platform with a vertical post. The bird moves around the platform, occasionally dropping the ring and picking it up again. The bird continues to move around the platform, sometimes placing the ring on the post and then picking it up again. The bird picks up a blue ring and continues to move around the platform, occasionally dropping and picking up the rings. The bird places the blue ring on the post and then picks it up again, continuing to move around the platform.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 594, "extra_info": {"duration": 10.8, "events": ["Yellow bird picks up a small ring", "Bird walks to the left", "Bird slips ring onto the pole", "Bird turns around", "Bird walks to the right", "Bird picks up another small ring", "Bird returns to the pole", "Bird slips second ring on", "Bird repeats actions", "Bird approaches pole with last ring"], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 594}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "543.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A little donkey is running around in circles inside the pen. Another donkey walks over, and the camera follows the movement of the little donkey as it runs up to the other one. The background is a pen in winter, covered with a thick layer of snow.", "text": "A small donkey is seen walking in a snowy enclosure with a green fence in the background. The donkey approaches a tree and starts to circle around it. It then moves towards a metal structure, possibly a feeding area, and continues to circle around the tree. The donkey appears to be exploring its surroundings, moving back and forth between the tree and the metal structure. Another larger donkey is visible in the background, standing near the fence. The small donkey continues to circle around the tree and the metal structure, seemingly curious about its environment.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 543, "extra_info": {"duration": 7.58, "events": ["A little donkey runs around in circles.", "Another donkey walks over.", "The little donkey runs up to the other donkey."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 543}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "546.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The boy glides forward with his left foot on the skateboard, then turns and continues to glide with both feet on the board. After crouching down, he slides off the steps, and then raises his hands. Meanwhile, a person walks ahead, with the camera following the movement.", "text": "A young boy is seen riding a skateboard on a paved path in a park, approaching a set of steps. He begins to descend the steps, maintaining his balance as he moves down. The boy continues to skate down the steps, keeping his arms outstretched for balance. He successfully navigates the steps and lands on the flat ground at the bottom. The boy then skates away from the steps, moving towards the open area of the park.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 546, "extra_info": {"duration": 8.93, "events": ["The boy glides forward on the skateboard.", "The boy turns on the skateboard.", "The boy continues gliding with both feet on the board.", "The boy crouches down on the skateboard.", "The boy slides off the steps.", "The boy raises his hands.", "A person walks ahead."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 546}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "495.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a small boat in the river, a person manipulated the oar and approached two cormorants. This person picked up a net with a long pole and scooped the cormorants out of the river; the two cormorants were fighting over a big fish.", "text": "A person on a boat is holding a stick and approaching a bird swimming in the water. The bird is seen struggling in the water as the person gets closer. The person uses the stick to grab a fish that is floating near the bird. The bird continues to struggle as the person pulls the fish closer to the boat. The person successfully grabs the fish with the stick and lifts it out of the water. The bird is still in the water, flapping its wings as the person holds the fish.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 495, "extra_info": {"duration": 11.27, "events": ["Person on boat manipulated the oar.", "Person approached two cormorants.", "Person picked up a net with a long pole.", "Person scooped the cormorants out of the river.", "Two cormorants were fighting.", "Cormorants fought over a big fish."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 495}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "159.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera captures several individuals wearing prison uniforms sitting inside a cell. A bald, burly man is propping himself up with his right hand on a stool. He then stands up, hands in his pockets. The blond man to his left glances at the blue bag on the stool, exchanges a look with a brown-haired man further to the left, and then turns his head to pick up the blue bag from the stool.", "text": "In a prison setting, a group of inmates in orange uniforms are seated on a bench. The bald inmate, wearing uniform number 78001, stands up and places his hands on his hips. The yellow-haired inmate, wearing uniform number 3001, turns his head to look at the bald inmate, then turns back and picks up a small jar from the bench. The yellow-haired inmate then turns his head to look at the bald inmate again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 159, "extra_info": {"duration": 3.88, "events": ["Several individuals sit inside a cell.", "A bald man propps himself up with his right hand.", "The bald man stands up.", "The bald man places hands in pockets.", "A blond man glances at the blue bag.", "The blond man exchanges a look with the brown-haired man.", "The blond man turns his head.", "The blond man picks up the blue bag."], "n_events": 8, "n_shots": 1, "n_subjects": 4}, "vid": 159}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "166.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A koala gestures in various ways and speaks towards the camera, when suddenly some petals block its mouth. It falls to the ground, eyes closed. A yellow scorpion looks at the koala, takes a step back, blinks, and glances to the upper right.", "text": "A gray koala with large ears and blue eyes is seen talking and gesturing with its hand in a garden with a brick wall and green bushes in the background. The koala then falls to the ground, appearing tired or exhausted, with some pink substance on its nose. An orange creature with large eyes and a small body, holding an orange object, approaches the koala and starts talking to it. The koala remains lying on the ground with its eyes closed, while the orange creature continues to speak.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 166, "extra_info": {"duration": 6.25, "events": ["A koala gestures towards the camera.", "A koala speaks.", "Petals block the koala's mouth.", "The koala falls to the ground.", "The koala's eyes close.", "A yellow scorpion looks at the koala.", "The yellow scorpion takes a step back.", "The yellow scorpion blinks.", "The yellow scorpion glances to the upper right."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 166}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "615.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a bright, tidy living room. A woman in a blue dress is half-kneeling on the ground, with her right hand stretched out. A puppy is walking towards the left, the camera pans to the right, and a baby walks towards the camera. Then the baby moves to the front of a man in a green short-sleeved shirt, who picks up the baby into his arms.", "text": "In a cozy living room with a wooden floor, a woman in a blue dress kneels on the floor, smiling and watching a child in a striped shirt and gray pants walk towards the camera. The child takes small steps forward, approaching a hand that is extended towards him. The child reaches the hand and is lifted up by the person behind the camera. The woman continues to smile and watch as the child is lifted.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 615, "extra_info": {"duration": 10.21, "events": ["Woman half-kneels on the ground.", "Woman stretches out her right hand.", "Puppy walks towards the left.", "Baby walks towards the camera.", "Baby moves to the front of the man.", "Man picks up the baby into his arms."], "n_events": 6, "n_shots": 1, "n_subjects": 4}, "vid": 615}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "420.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A little pig moves to open the wrapper of an ice cream, then pulls it back and moves toward the camera, eating the ice cream as it starts to melt. The pig then takes away the ice cream stick in its mouth and continues to eat. The camera remains fixed throughout the scene. The background shows a white tabletop.", "text": "A pink pig toy is seen next to a packaged ice cream bar on a white background. The pig toy starts to pull the wrapping off the ice cream bar, revealing the ice cream inside. The pig toy continues to pull the wrapping, fully exposing the ice cream bar. The pig toy then begins to bite and eat the ice cream, causing the ice cream to melt and drip onto the surface. The pig toy continues to eat the ice cream, creating a large pool of melted ice cream around it. The pig toy moves around the melted ice cream, enjoying its meal.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 420, "extra_info": {"duration": 15.03, "events": ["Little pig opens ice cream wrapper", "Little pig pulls back", "Little pig moves toward camera", "Little pig eats melting ice cream", "Ice cream starts to melt", "Little pig takes ice cream stick in mouth", "Little pig continues to eat"], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 420}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "738.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person threw a pile of documents into a burning fire, the one on top with the title \"contract agreement\" written on it, which quickly burned to ashes.", "text": "A fire is burning in a fireplace with logs and embers visible. A hand holding a document labeled 'AGREEMENT' appears and throws it into the fire. The document lands on the embers and begins to catch fire. The document continues to burn, with flames engulfing it. The document is almost completely burned, with only a few remnants left. The fire continues to burn the document until it is completely destroyed.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 738, "extra_info": {"duration": 20.95, "events": ["A person threw a pile of documents into a fire.", "The fire burned the documents.", "The top document titled 'contract agreement' quickly turned to ashes."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 738}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "488.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a forest clearing, a person rides a mountain bike, soaring down from a high slope and swinging left and right as they plunge down the curving gully. The camera then cuts to another spot in the forest where a different rider, also on a mountain bike, leaps into the air from behind a dirt mound and quickly zooms down the slope. As they descend, the rider swings from side to side to maintain the bike's speed and balance, before the final shot switches to a stretch of woodland.", "text": "In a forested area with a dirt trail, a person on a bicycle approaches a small hill and begins to descend. The cyclist navigates the trail, making a sharp turn and continuing down the path. The cyclist loses balance and falls off the bike, sliding down the trail. Another cyclist appears, riding down the same trail and making a turn. This second cyclist also loses balance and falls off the bike, sliding down the trail. The first cyclist is seen again, riding down the trail and making a turn.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 488, "extra_info": {"duration": 7.53, "events": ["A person rides a mountain bike in a forest clearing.", "The person soars down from a high slope.", "The rider swings left and right down the curving gully.", "A different rider leaps into the air from a dirt mound.", "The different rider quickly zooms down the slope.", "The different rider swings side to side to maintain speed and balance."], "n_events": 6, "n_shots": 3, "n_subjects": 2}, "vid": 488}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "674.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "First, a series of small explosions occurred on the ground, followed by a large explosion that caused the ground to bounce up, and then the entire screen was filled with dust.", "text": "The video begins with a view of a quarry site, showing a dirt and rock landscape with a river and hills in the background. Suddenly, an explosion occurs, creating a large cloud of dust and debris that rapidly expands and obscures the view. The dust cloud continues to grow, completely covering the quarry site and the surrounding area. The video ends with the dust cloud fully enveloping the scene, leaving only the 'video' watermark visible.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 674, "extra_info": {"duration": 9.38, "events": ["A series of small explosions occurred on the ground.", "A large explosion caused the ground to bounce up.", "Dust filled the entire screen."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 674}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "124.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the store, a brown-haired man pushes a shopping cart forward, followed by a blond-haired man. The blond man takes a piece of clothing off a mannequin. Behind them, there are two women. The two men walk straight out of the exit, where a red light is flashing. As they leave the store, the brown-haired man pushes the shopping cart forward, and the cart rolls through a bed of flowers and onto the grass. There are also some people around the store.", "text": "In a store with shelves of books and sports merchandise, a blonde boy picks up a red and yellow sports jersey from a display and examines it closely. He then turns and walks towards the exit, followed by a brown-haired boy who is pushing a shopping cart filled with items. The blonde boy covers his mouth with his hand as they exit the store, while the brown-haired boy continues to push the cart. The camera then shows the two boys walking out of the store and onto the street, with the shopping cart moving away from the store. The scene ends with the blonde boy still covering his mouth and the brown-haired boy looking ahead.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 124, "extra_info": {"duration": 6.38, "events": ["Brown-haired man pushes shopping cart", "Blond-haired man removes clothing from mannequin", "Two men walk towards exit", "Red light flashes at exit", "Men leave the store", "Shopping cart rolls through flower bed", "Cart moves onto the grass", "People are present around the store"], "n_events": 8, "n_shots": 3, "n_subjects": 3}, "vid": 124}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "901.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases a cooking process where garlic is sautéed, together with butter in the pan. The cook consistently stirs this mix with a spatula causing the butter to gradually melt. As the liquid volume increases, the mixture begins to bubble and the ingredients turn mushy. The pan's contents continually bubble due to being boiled. Finally, shrimps and seasonings are added once the garlic and butter have fully melted.", "text": "A person adds a piece of butter to a pan containing chopped garlic and stirs it with a spatula. The text 'GAR'S KITCHEN' appears on the screen as the butter melts and mixes with the garlic. The person continues to stir the mixture until it is fully melted and bubbling. The text 'GARLIC SHRIMP PASTA' appears as the person adds shrimp to the pan. The shrimp are stirred into the garlic mixture, and the person continues to cook the shrimp in the pan.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 901, "extra_info": {"duration": 10.0, "events": ["Cook sautés garlic in the pan.", "Butter is added to the pan.", "Cook stirs the mixture with a spatula.", "Butter gradually melts in the pan.", "Liquid volume increases in the pan.", "The mixture begins to bubble.", "Ingredients turn mushy.", "Contents boil in the pan.", "Shrimps are added to the pan.", "Seasonings are introduced to the mixture."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 901}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "941.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, an elderly individual is wearing brown pants, a shirt, and a vest. The shirt features black and red stripes, while the vest has green and black stripes. The person has white hair and is wearing a black hat. He place a skinned fish on the lower grill rack of an oven. Other fish can be seen on the upper grill rack of the oven. Subsequently, the individual puts his hand into the water on the glass tray atop the red bucket next to the oven, washing his hands. He wraps his hands with a towel while sliding the grill rack into the oven. Positioned atop the oven are a bottle and a white bowl. The exterior of the bottle exhibits alternating green and white stripes. The towel is adorned with white and blue stripes. Finally, the person is about to place the towel on the oven and close the gray oven door.", "text": "An elderly man, wearing a red and black checkered shirt, a green plaid vest, and a blue cap, is seen placing pieces of meat onto a grill. He carefully arranges the meat on the grill, ensuring they are evenly spaced. After placing the meat, he reaches for a cloth and begins to clean the grill. 
The background shows a parked Jeep and a red cooler with a hose connected to it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 941, "extra_info": {"duration": 9.98, "events": ["Elderly individual places a skinned fish on lower grill rack.", "Other fish are on upper grill rack.", "Elderly individual washes hands in water on glass tray.", "Elderly individual wraps hands with striped towel.", "Elderly individual slides grill rack into the oven.", "A bottle is positioned atop the oven.", "Elderly individual is about to place towel on the oven.", "Elderly individual closes gray oven door."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 941}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "194.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "SpongeBob reached out to grab a bag and opened it; the air flow from inside the bag surged upwards, deforming SpongeBob’s face as he got blown by it. The camera zooms out, and a creature reaches out to pinch the bag closed and says something to SpongeBob, while Patrick Star turns his head to look to the left.", "text": "SpongeBob, standing on a green wooden deck with an anchor symbol in the background, points to the left and then looks down at a bag on the ground. He picks up the bag and starts to open it, revealing a large number of tentacles inside. Patrick, standing next to SpongeBob, watches with a surprised expression. A green character with glasses and a pink dress appears from the left, holding the bag and talking to SpongeBob. SpongeBob looks at the green character and then back at the bag, while Patrick continues to watch. The green character continues to hold the bag and talk, and SpongeBob looks at the bag with a concerned expression.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 194, "extra_info": {"duration": 4.62, "events": ["SpongeBob reached out to grab a bag", "SpongeBob opened the bag", "Air from inside the bag surged upwards", "SpongeBob's face got deformed by the air flow", "A creature reached out to pinch the bag closed", "The creature said something to SpongeBob", "Patrick Star turned his head to look left"], "n_events": 7, "n_shots": 1, "n_subjects": 3}, "vid": 194}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "72.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A young boy is walking forward, holding the hand of a girl in blue clothing. Behind them, a girl with red and yellow hair stands with her arms crossed, looking at them angrily. Where they have just left, colorful specks of light are scattered. The girl with red and yellow hair's expression turns to astonishment. The camera then switches to show the back of the young boy and the girl in blue. The colorful specks of light fall from the pockets of the boy's shorts. The girl with yellow and red hair, along with a small dog, watches them as they continue to walk ahead.", "text": "In a forest setting, a boy in a red shirt extends his hand to a girl with glasses, who is wearing a blue shirt and a purple skirt with star patterns. The girl takes the boy's hand, and they start walking together. A blonde girl with a red and yellow outfit watches them, crossing her arms and looking surprised as sparkles appear around her. The boy's pants begin to sparkle as they walk, and they pass by a cart filled with firewood. 
Another character, wearing a red and yellow striped shirt, appears in the background, seemingly following them.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 72, "extra_info": {"duration": 7.38, "events": ["A young boy walks forward holding a girl's hand.", "A girl in blue clothing accompanies the young boy.", "A girl with red and yellow hair stands with arms crossed.", "A girl with red and yellow hair looks angrily.", "Colorful specks of light scatter from a location.", "The girl with red and yellow hair's expression changes to astonishment.", "Colorful specks of light fall from the boy's pockets.", "A small dog watches the young boy and the girl.", "The young boy and the girl in blue continue walking."], "n_events": 9, "n_shots": 2, "n_subjects": 4}, "vid": 72}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "998.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows a child with short blond hair in a room, wearing a beige t-shirt, and carrying pants, walking towards a chest of drawers. A second child, with curly black hair dressed in a pink dress, is also present. The room features a chest of drawers, toys, and dolls scattered on the floor. Additionally, two children in purple pyjamas are trying to climb the white chest of drawers, introducing the risk of it falling over them, near a yellow toy truck. One of these children opens the uppermost drawer, while the child in the pink dress attempts to climb onto a black table situated under a screen. These activities are considered high-risk due to the potential of injury from tumbling furniture.", "text": "A young boy in a gray shirt with a black pocket holds a green and brown patterned cloth in his right hand and walks towards a white dresser with star-shaped handles. The scene then shifts to a girl in a pink dress with curly hair, who is reaching up to a TV stand. The camera then shows a wider view of a room with a patterned rug, toys, and a crib, where two children in purple outfits are playing on the dresser. The boy in the gray shirt is seen again, now holding a black piece of clothing and placing it on the dresser. 
Finally, the girl in the pink dress is shown standing on a small step stool, reaching up to the TV stand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 998, "extra_info": {"duration": 6.0, "events": ["Child with short blond hair walks towards chest of drawers.", "Child with curly black hair tries to climb black table.", "Two children in purple pyjamas try to climb chest of drawers.", "One child in purple pyjamas opens uppermost drawer of chest.", "Toys and dolls are scattered on the floor.", "Child in pink dress approaches black table.", "Child risks furniture tumbling while climbing.", "Child potentially injures self by climbing furniture.", "Child in beige t-shirt carries pants.", "Child near yellow toy truck introduces falling risk."], "n_events": 10, "n_shots": 6, "n_subjects": 4}, "vid": 998}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "226.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Four people look ahead and clap, among them the woman at the front turns her head to look at the man to her right, after making eye contact the two shake hands and slightly nod, then look forward again. The man wearing glasses walks over to them and shakes hands with each in turn, below the scene text appears as an introduction.", "text": "A group of people are seated and clapping in a room with a modern, industrial background. A woman named Chen Shu turns to her right and starts talking to an older man with white hair, who turns his head to look at her. The woman then turns back to face forward. The scene shifts to a man in a beige suit and glasses, who bends down and speaks to the woman and the older man. The woman and the older man turn their heads to look at the man.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 226, "extra_info": {"duration": 9.0, "events": ["Four people clap while looking ahead.", "The woman at the front turns her head.", "The woman and the man to her right make eye contact.", "The woman and the man to her right shake hands.", "The woman and the man slightly nod at each other.", "They look forward again.", "The man wearing glasses walks over to the group.", "The man wearing glasses shakes hands with the first person.", "The man wearing glasses shakes hands with the second person.", "Text appears below the scene as an introduction."], "n_events": 10, "n_shots": 2, "n_subjects": 3}, "vid": 226}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "503.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video is playing in slow motion, showing a scene where several characters with white hair are fighting in a room. In the center of the screen, a long-haired woman lifts a vase and smashes it towards the short-haired man on the right. The white-haired man behind raises a painting canvas and slams it downwards. The long-haired woman on the right side of the screen throws a stack of papers. The short-haired man is struck down, the canvas shatters, and after throwing the papers, the long-haired woman steps forward and kicks towards the long-haired woman in the middle. Paper and petals flutter through the air. On the left side of the screen, a woman with brown hair stands motionless.", "text": "In a room with brown curtains and a red carpet, a woman in black holds a bouquet of flowers and faces a woman in a striped shirt. The woman in black raises the bouquet and swings it towards a man in black, who is holding a picture frame. The man in black is hit by the bouquet, causing him to fall to the ground. The woman in black continues to hold the bouquet as the man in black falls, and papers from the picture frame scatter in the air. 
The woman in black and another woman in black react to the fall, with the woman in black bending down and the other woman in black looking at the fallen man.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 503, "extra_info": {"duration": 8.6, "events": ["Long-haired woman lifts a vase.", "Long-haired woman smashes vase towards short-haired man.", "White-haired man raises a painting canvas.", "White-haired man slams canvas downwards.", "Long-haired woman throws stack of papers.", "Short-haired man is struck down.", "Painting canvas shatters.", "Long-haired woman kicks towards another long-haired woman.", "Paper and petals flutter through the air."], "n_events": 9, "n_shots": 1, "n_subjects": 5}, "vid": 503}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "168.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera captures a flagpole, with a national flag on the left descending. On the right side, a snake clamps onto a line attached to the flagpole with its mouth, rapidly ascending. Lined up on the snake’s back from head to tail are a lizard, a scorpion, and a spider. The snake’s body swings, causing the scorpion and spider below to sway as well. As they reach the top of the flagpole, the four animals fly into the air following their momentum. The camera then switches to an aerial view, showing the four animals spreading all their limbs in mid-air. The camera angle changes again to a ground-level view. The snake, lizard, and scorpion land first, followed by the spider, which lands last, covering the other three animals with its legs.", "text": "Four characters, including a blue creature with large eyes, a gray creature, an orange creature, and a purple creature, are seen hanging from a rope attached to a pole. They are lifted off the ground and begin to swing through the air. The characters continue to swing higher into the sky, with the blue creature leading the way. As they reach the peak of their swing, they are seen floating in the air against a clear blue sky with some clouds. The characters then start to descend, landing on a road below. 
They crash onto the road, with the blue creature in the front, followed by the gray, orange, and purple creatures.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 168, "extra_info": {"duration": 6.5, "events": ["National flag descends on the left.", "Snake clamps onto line with mouth.", "Snake ascends flagpole rapidly.", "Lizard lines up on snake's back.", "Scorpion lines up on snake's back.", "Spider lines up on snake's back.", "Snake's body swings with animals swaying.", "Animals fly into the air from flagpole.", "Snake, lizard, and scorpion land first.", "Spider lands, covering other animals with legs."], "n_events": 10, "n_shots": 4, "n_subjects": 4}, "vid": 168}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "74.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A little boy in a white short-sleeved shirt is showing off something to a little girl in green clothes. The little girl makes an unusually angry face and gesture at the boy, then the boy quickly runs away, looking back and laughing at the girl as he does. The little girl chases after the boy, and a plump auntie watches them from behind. They are chasing each other down a hallway, with the girl trying to kick the boy, but he runs away.\n\nThe scene shifts to the front of a chicken's gravestone, and the camera pans back from the gravestone as a pair of hands scatter some chicken feathers into a hole in front of it and then bury it with earth. The camera then switches to the opposite side of the gravestone, where the boy and girl are crying in front of it. Shortly after, the boy puts one hand on the girl's shoulder, and then both children take out a chicken drumstick with their left hands and start eating.", "text": "In a room with a pink wall and various items, a boy points to a drawing on the wall while a girl stands beside him, and a woman stands nearby. The boy and girl then run out of the room, followed by the woman. They run through a street with shops and people, heading towards a sign that reads 'Clucky' with a date 'Sep 5 - June 19'. The boy and girl place a bunch of leaves on the ground and cover them with dirt. 
The scene then shifts to the boy and girl standing in front of a large object, with the boy placing his hand on the girl's shoulder and then picking up two pieces of food.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 74, "extra_info": {"duration": 11.0, "events": ["A boy shows something to a girl.", "The girl makes an angry face and gesture at the boy.", "The boy runs away, laughing.", "The girl chases after the boy.", "The boy evades the girl's attempted kick.", "A plump auntie watches the children from behind.", "A pair of hands scatters chicken feathers into a hole.", "A pair of hands buries the hole with earth.", "The boy and girl cry in front of the gravestone.", "The boy and girl start eating chicken drumsticks."], "n_events": 10, "n_shots": 3, "n_subjects": 3}, "vid": 74}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "197.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "What the ruffian boss said, after turning around and pressing the red button, the camera moves up, Brother Octopus looks up at the moving lines within the machine, while the other creatures behind him are eating, their bodies slightly moving.", "text": "In a room with a futuristic design, several characters are seated at tables, eating and drinking. A green insect-like character stands in front of a tall, wheeled machine, looking up at it. The insect character raises its hand and presses a button on the machine, causing the machine to start moving. The camera then shifts to show a large, white character with a yellow shirt and purple shoes, who is watching a TV screen. The TV screen displays a green heartbeat line that starts to move. The camera then returns to the insect character, which continues to look up at the machine.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 197, "extra_info": {"duration": 4.88, "events": ["Ruffian boss turns around.", "Ruffian boss presses the red button.", "Brother Octopus looks up.", "Machine lines move.", "Creatures behind Brother Octopus eat.", "Creatures' bodies move slightly."], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 197}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "869.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "This video explains how to land quickly in the PubG game through a screen recording. It commences with the player deploying a parachute. After it is opened, the player scans for nearby players. Next, the player swiftly glides towards a predetermined spot, intermittently checking the large map displayed on the screen. The intended landing point is the rooftop of a housing structure adorned with a Hyundai billboard. The player spots a rival just as he is about to touch down.", "text": "The video begins with an aerial view of a town, showing various buildings and roads. A parachute is deployed, and the person descends towards the town. The person continues to descend, passing over a landscape with hills and a river in the background. The person approaches the town, getting closer to the buildings. The person descends further, targeting a specific building labeled 'Mastery of Martial Arts Academy.' The person continues to descend, getting closer to the building's rooftop. The person lands on the rooftop of the building, with the parachute still deployed. The person then ascends back into the sky, with the parachute still open.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 869, "extra_info": {"duration": 8.0, "events": ["Player deploys parachute", "Parachute opens", "Player scans for nearby players", "Player glides towards a spot", "Player intermittently checks the map", "Landing point is a rooftop", "Rooftop adorned with a Hyundai billboard", "Player spots a rival", "Player is about to touch down"], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 869}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "174.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The girl bent down to open the small door at the rear of the bus, the camera cut to a close-up of the door being opened, and four reptilian creatures walked forward. The girl watched them, as three of the creatures leaped forward, and the snake looked back in the girl's direction.", "text": "A girl with blonde hair and a blue bow leans out of the window of a yellow school bus, looking around curiously. Inside the bus, several colorful creatures, including a purple creature, an orange creature, and a blue creature, are seen moving towards the door. The blue creature opens the door, and the creatures start to move towards the exit. The girl continues to watch them intently from the window. The creatures, including the purple and orange ones, begin to jump out of the bus. The blue creature remains at the door, looking at the creatures as they exit.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 174, "extra_info": {"duration": 4.38, "events": ["The girl bent down to open the small door.", "Four reptilian creatures walked forward.", "Three of the creatures leaped forward.", "The snake looked back in the girl's direction."], "n_events": 4, "n_shots": 4, "n_subjects": 3}, "vid": 174}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "893.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video clip showcases a person driving a motorcycle at a high speed. The rapidly increasing speed is evident from the dash board readings. The motorcycle then collides with a white car and flips over. In the end, only some lights can be seen.", "text": "The video shows a motorcycle riding at night on a road, with the speedometer displaying speeds ranging from 150 to 190 km/h. The rider navigates through traffic, passing by various vehicles and streetlights. The motorcycle's dashboard is illuminated, showing the speed and other metrics. The rider maneuvers through the traffic, passing cars and other vehicles. The road is well-lit with streetlights, and the motorcycle continues to speed up, reaching speeds over 190 km/h. The video ends with the motorcycle speeding through a tunnel, with the speedometer showing 190 km/h.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 893, "extra_info": {"duration": 6.97, "events": ["Person drives motorcycle at high speed.", "Speed increases on dashboard readings.", "Motorcycle collides with white car.", "Motorcycle flips over.", "Lights become visible at the end."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 893}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "83.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A boy with a backpack runs while watering the flowers in the corridor with a watering can. He puts down the can and runs up the stairs. At the stair corner, two men are playing chess while a small dog watches on. The boy runs past and tosses the fish in his hand to the dog, then continues upwards. The dog catches the fish and settles back down, while the two chess players remain in the same position but with their eyes widened.", "text": "A boy with a backpack runs through a narrow alley, holding a watering can, and places it on the ground. He then runs up the stairs. The scene shifts to two men sitting and playing a game with a dog, who is sitting in front of them. The boy jumps over the table, grabs a piece of food from the game, and runs away. The dog turns its head to look at the food in its mouth.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 83, "extra_info": {"duration": 4.12, "events": ["A boy runs while watering flowers.", "The boy puts down the watering can.", "The boy runs up the stairs.", "Two men play chess at the stair corner.", "A small dog watches the two men.", "The boy tosses a fish to the dog.", "The dog catches the fish.", "The chess players widen their eyes."], "n_events": 8, "n_shots": 2, "n_subjects": 4}, "vid": 83}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "291.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman with long hair glanced at the wall, then turned and walked towards the exit. The man standing in the room, in disbelief, covered his forehead with his hand. The woman walked to the end of the room, turned around, and walked back. When she reached the door, she spoke to the man with a smile, and the man lowered his hand and smiled back at her.", "text": "The scene is set in a cozy, yellow-walled room decorated with various ornaments, including golden and silver star-shaped decorations, a lamp, and a framed picture on the wall. A man stands near the left side of the room, looking at a woman who is walking towards the right side of the room. The woman continues walking towards the right, passing by a small table with a lamp and a chair. She then turns around and walks back towards the man, who is now smiling and covering his face with his hand. The woman stops in front of the man, looks at him, and then turns around to walk back towards the right side of the room.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 291, "extra_info": {"duration": 9.0, "events": ["A woman glanced at the wall.", "The woman turned towards the exit.", "A man covered his forehead with his hand.", "The woman walked to the end of the room.", "The woman turned around.", "The woman walked back to the door.", "The woman spoke to the man with a smile.", "The man lowered his hand.", "The man smiled back at the woman."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 291}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "771.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person, holding a curved stick, runs forward, then comes to a sudden stop, and turns around to run back.", "text": "A man wearing a white jersey, black pants, and a helmet is skating on an ice rink while holding a hockey stick. He skates from the left side of the frame towards the right, maintaining a steady pace. The background shows the ice rink's boards and a few empty seats. The man continues to skate, making a turn to the right as he moves across the frame. He then skates back towards the left side of the frame, maintaining his speed and control. The video ends with the man skating out of the frame to the left.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 771, "extra_info": {"duration": 5.0, "events": ["A person runs forward holding a curved stick.", "The person comes to a sudden stop.", "The person turns around.", "The person runs back."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 771}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "18.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman in green clothes opens her mouth wide and places her hands on her hips, while the woman in purple clothes clenches her fist. The camera switches to a palace where a man and the woman in purple clothes are dancing hand in hand, with the woman's gaze following the man.", "text": "In a grand hall with ornate columns and red curtains, a woman in a purple dress with a green brooch and a woman in a green dress with a blue bow are dancing. The woman in green turns her head to look at the woman in purple, who is looking forward. The scene then shifts to a man in a white military uniform with gold epaulettes and a woman in a purple dress with a yellow bow, who are dancing together. The man holds the woman's hand and they move closer, with the woman smiling and looking at the man.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 18, "extra_info": {"duration": 2.62, "events": ["The woman in green clothes opens her mouth wide.", "The woman in green clothes places her hands on her hips.", "The woman in purple clothes clenches her fist.", "The man and the woman in purple clothes dance hand in hand.", "The woman in purple clothes follows the man with her gaze."], "n_events": 5, "n_shots": 2, "n_subjects": 4}, "vid": 18}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "57.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a room, the wall facing us has white letters that read \"This is my tree house\". The camera zooms in on the white text, and then cuts to a character dressed in grey clothing with glasses showing a surprised expression. He blinks once, and then the scene refreshes, the character's clothing changes to white, and the window next to him turns into a wooden wall.", "text": "The camera zooms in on a wooden wall with the text 'THIS IS MY TREE HOUSE' written on it. The scene then shifts to a character with glasses and bunny ears, who is looking to the right and speaking. The character's eyes gradually turn black. The scene transitions to another character with glasses and a white coat, who is looking to the right and speaking. The final frame shows the wooden wall again with the text 'THIS IS MY TREE HOUSE' written on it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 57, "extra_info": {"duration": 4.5, "events": ["Camera zooms in on the white text.", "Character shows a surprised expression.", "Character blinks once.", "Scene refreshes.", "Character's clothing changes to white.", "Window turns into a wooden wall."], "n_events": 6, "n_shots": 3, "n_subjects": 1}, "vid": 57}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "960.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "There is an animated moon, a house and trees on the screen. Then, a blonde animated character wearing a red cloak with black dots and a blue outfit appears to be doing something with hands on a white area with black parts on it. The character with a bag with a picture of a cat and its paw on it is seen from different angles, then the character's black bandana and t-shirt with a smiley face symbol appear while the character laughs and runs on a gray floor. Then the character goes through a door with a black frame and a white middle.", "text": "The video begins with a view of a night sky featuring a full moon and a small star, with a house and a forest in the background. The scene then shifts to a character with spiky hair and a red cape, who is seen unlocking a door. As the door opens, a cat with a black and white pattern on its back is revealed, sitting on the character's backpack. The character, now with a happy expression, runs out of the room and exits the house. The final frame shows the exterior of the house at night, with the moon still visible in the sky.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 960, "extra_info": {"duration": 7.01, "events": ["The moon appears on the screen.", "A house and trees appear on the screen.", "A blonde animated character wearing a red cloak with black dots emerges.", "The character does something with hands on a white area with black parts.", "The character with a cat picture bag is seen from different angles.", "The character wearing a black bandana and t-shirt laughs.", "The character runs on a gray floor.", "The character goes through a door with a black frame."], "n_events": 8, "n_shots": 3, "n_subjects": 3}, "vid": 960}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "842.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In this video clip, an individual dressed in a white vest and blue pants is performing sit-ups on a black mat. They are also wearing black shoes. Over the span of nine seconds, the person performs a total of three sit-ups.", "text": "The video begins with a dark screen, which then transitions to a man lying on a black exercise mat on a blue floor, wearing a white tank top, blue shorts with green and white stripes, and black shoes. The man starts performing sit-ups, lifting his upper body towards his knees and then lowering back down. The background is a plain black wall, and there is a logo and text 'Fitnessyard.com' in the top right corner of the screen. The man continues this exercise routine, repeatedly lifting his upper body and lowering it back down.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 842, "extra_info": {"duration": 9.0, "events": ["Individual lies down on a black mat.", "Individual secures feet on the ground.", "Individual places hands behind head.", "Individual lifts torso upwards.", "Individual completes one sit-up.", "Individual lowers torso back down.", "Individual performs second sit-up.", "Individual returns to starting position.", "Individual completes third sit-up.", "Individual finishes the exercise."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 842}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "601.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Four people are gathered around a wooden table laden with books. The person seated at the center of the frame is operating a computer, then rotates the laptop to show the screen to the others. Holding a pen, they point at something on the screen and on the paper in front of them. The person on the right writes something on the paper, while the two people on the left raise and lower their hands. The camera angle remains fixed.", "text": "Several people are seated around a wooden table, each holding a pen and writing on a piece of paper. One person is holding a laptop, and another person is holding a pencil. The person with the pencil moves it towards the paper, while the others continue writing. The person with the pencil then places it on the table and picks up a red pencil. They use the red pencil to point at the paper, and the others continue writing. The person with the red pencil continues to point at the paper, while the others keep writing.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 601, "extra_info": {"duration": 18.1, "events": ["Four people gather around a wooden table.", "Person at center operates a computer.", "Person at center rotates the laptop.", "Person at center shows screen to others.", "Person at center points at the screen with a pen.", "Person at center points at the paper with a pen.", "Person on the right writes on the paper.", "Person on the left raises hand.", "Person on the left lowers hand.", "Another person on the left raises hand."], "n_events": 10, "n_shots": 1, "n_subjects": 4}, "vid": 601}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "767.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "An ant, gripping a green leaf, flipped over the leaf to climb onto its surface and continued forward, passing by other ants along the way.", "text": "A single ant is seen on a green leaf, pulling a piece of the leaf with its mouthparts. The ant continues to pull the leaf, gradually moving it along the leaf's edge. As the ant progresses, more ants join in, and they all work together to move the leaf. The group of ants successfully pulls the leaf off the stem and moves it across the leaf. The ants continue to carry the leaf, moving it further along the leaf's edge.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 767, "extra_info": {"duration": 8.98, "events": ["An ant gripped a green leaf.", "The ant flipped over the leaf.", "The ant climbed onto the leaf's surface.", "The ant continued forward.", "The ant passed by other ants."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 767}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "659.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a grassy soccer field, a pair of feet wearing black soccer shoes approach the ball. The left foot nudges the ball to the right, and then the right foot pulls back to scoop the ball up. After that, the right foot juggles the ball twice, followed by the left and right feet alternately keeping the ball in the air. The video almost exclusively captures the person's shadow and the movement of the ball. Then, the camera gradually moves upward from the feet, sequentially showing the lower legs, knees, thighs clad in black shorts, hands, and the upper body dressed in a blue T-shirt. Meanwhile, the person continues to juggle the ball with alternating left and right feet. Just before the video ends, the individual turns their back to the camera and flicks the ball with the lower leg.", "text": "A person is seen on a grassy field, using their feet to control a soccer ball. The person skillfully maneuvers the ball, keeping it in the air through continuous kicks. The ball is lifted and moved around with precision, showcasing the person's control and agility. The person continues to juggle the ball, maintaining it in the air with smooth and coordinated movements. The ball is then brought closer to the ground, and the person prepares to place it down. The person finally places the ball on the ground, completing the sequence of actions.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 659, "extra_info": {"duration": 12.22, "events": ["Feet approach the soccer ball.", "Left foot nudges ball to the right.", "Right foot scoops up the ball.", "Right foot juggles the ball twice.", "Left and right feet alternately juggle the ball.", "Camera moves upward from feet to lower legs.", "Camera shows knees, then thighs in black shorts.", "Camera displays hands and upper body in blue T-shirt.", "Person continues to juggle ball with feet.", "Individual turns back and flicks ball with lower leg."], "n_events": 10, "n_shots": 3, "n_subjects": 1}, "vid": 659}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "853.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video captures an intersection on a crisp day with both lateral and longitudinal traffic in view. There are traffic signals in place to regulate the flow. The video displays a considerable amount of traffic. It features a grey car ahead of a red one, with several cars moving left to right. An individual is seen crossing the street behind a white car, also moving from left to right, indicating the traffic lights allow passage in that direction. The grey car in front of the red car is making a turn, suggesting the red car can do the same. However, the red car abruptly moves straight ahead, failing to make the turn, and collides with the white car traversing from left to right, leading to a traffic mishap.", "text": "The video shows a busy road with several vehicles, including a bus, cars, and vans, moving along the road. The camera then shifts to a different angle, focusing on a roundabout where multiple vehicles, including a white van, a black car, and a white car, are seen driving. The vehicles navigate the roundabout, with the white van moving from the center to the right side of the frame, followed by the black car and the white car. The background includes a grassy area and some buildings, with the road curving around the roundabout.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 853, "extra_info": {"duration": 6.0, "events": ["Lateral and longitudinal traffic flows at intersection.", "Traffic signals regulate the flow.", "A grey car turns at the intersection.", "A red car moves straight ahead.", "Several cars move from left to right.", "An individual crosses the street.", "The white car moves from left to right.", "Pedestrians given passage by traffic lights.", "The grey car precedes the red car.", "The red car collides with the white car."], "n_events": 10, "n_shots": 1, "n_subjects": 4}, "vid": 853}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "823.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video is set in a classroom where a person, wearing a red coat and a white T-shirt underneath, ignites a piece of paper on a table using a lighter. The paper promptly catches fire, causing shock and consternation among the four individuals sitting nearby.", "text": "A hand holding a blue lighter ignites a piece of paper on a desk, causing the paper to catch fire. The fire grows larger as the hand continues to hold the lighter. The camera then shows a classroom with a teacher in blue standing and looking at the burning paper, while three animated characters with bunny ears react with surprise and concern. The teacher and the characters continue to watch the burning paper, with the characters showing various expressions of shock and worry. The camera then focuses back on the burning paper.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 823, "extra_info": {"duration": 10.0, "events": ["Person in red coat ignites paper with lighter", "Paper catches fire on table", "Four individuals react to fire", "Shock appears on individuals' faces", "Consternation spreads among the individuals"], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 823}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "31.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two dogs on the grass look forward and wag their tails, behind them a flock of sheep. The camera then cuts to a city park, where a black wolf enters the frame from the left, baring its teeth and looking ahead before lowering its mouth to the ground.", "text": "A large brown dog and a small brown and white dog stand in front of a sheep pen, with the large dog smiling and the small dog wagging its tail. The large dog then stops smiling and closes its mouth. The scene transitions to a beautiful garden with blooming trees and flowers, and a city skyline in the background. A black dog with a collar appears, looking around and then lowering its head. The scene then returns to the large brown dog and the small brown and white dog in front of the sheep pen.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 31, "extra_info": {"duration": 5.75, "events": ["Two dogs look forward.", "Two dogs wag their tails.", "A flock of sheep stands behind the two dogs.", "A black wolf enters the frame from the left.", "The black wolf bares its teeth.", "The black wolf looks ahead.", "The black wolf lowers its mouth to the ground."], "n_events": 7, "n_shots": 2, "n_subjects": 4}, "vid": 31}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "591.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is inside a bowling alley. A woman wearing yellow leggings and a black short-sleeved shirt stands with her back to the camera, holding a bowling ball with both hands, leaning forward and stepping out to roll the ball, her right hand swinging. The bowling ball rolls forward and knocks down the white pins ahead. The woman straightens up and watches.", "text": "A woman wearing a black shirt and yellow leggings is at a bowling alley, preparing to bowl. She holds a bowling ball and begins her approach, bending forward and swinging her arm back. She releases the ball, which rolls down the lane towards the pins. The ball continues to roll and eventually hits the pins, knocking them down. The woman watches the ball's progress and then starts to walk away from the lane.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 591, "extra_info": {"duration": 5.43, "events": ["The woman leans forward", "The woman steps out", "The woman's right hand swings", "The bowling ball rolls forward", "The bowling ball knocks down pins", "The woman straightens up", "The woman watches"], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 591}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "581.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Four cars drive down the hillside in succession. In the middle of the scene, a large hammer is spinning, knocking the first car out of the frame, while the other three continue moving forward.", "text": "A road is shown with a green hillside and trees on both sides, and a hammer with a long handle is seen swinging back and forth. A white car appears on the road and is hit by the hammer, causing it to flip over and fall off the road. The hammer continues to swing as the white car disappears from view. A black car then appears on the road and is hit by the hammer, causing it to flip over and fall off the road. The hammer continues to swing as the black car disappears from view. Another black car appears on the road and is hit by the hammer, causing it to flip over and fall off the road. The hammer continues to swing as the second black car disappears from view. A third black car appears on the road and is hit by the hammer, causing it to flip over and fall off the road. The hammer continues to swing as the third black car disappears from view.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 581, "extra_info": {"duration": 6.13, "events": ["Four cars drive down the hillside.", "A large hammer spins in the middle of the scene.", "The hammer knocks the first car out of the frame.", "The second car continues moving forward.", "The third car continues moving forward.", "The fourth car continues moving forward."], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 581}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "956.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, an individual with short hair is seen sporting a necklace, a white T-shirt, jeans, and glasses standing in a bathroom. They showcase a 'Loreal' brand mascara to the camera, which they are holding in their hand. On the bathroom counter, one can notice a makeup bag and a small stick immersed in some kind of yellow liquid. The camera angle then shifts, showing this person adjusting the white mirror, removing their glasses, and preparing to apply makeup.", "text": "In a bathroom, a woman wearing a white t-shirt with a leopard print design and blue jeans reaches into a bag on the counter and takes out a lipstick. She examines the lipstick closely, holding it up for a better view. The camera zooms in to show the label 'VOLUMINOUS BROSSE COURBEE' on the lipstick. The woman then sits down on the edge of the bathtub, holding the lipstick in her hands.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 956, "extra_info": {"duration": 6.0, "events": ["The individual showcases a 'Loreal' mascara to the camera.", "The individual adjusts the white mirror.", "The individual removes their glasses.", "The individual prepares to apply makeup."], "n_events": 4, "n_shots": 2, "n_subjects": 1}, "vid": 956}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "606.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera captures footage inside a clear swimming pool, with the surface of the water gently rippling. On the right side of the frame, a woman's legs enter the water vertically from above. After bobbing up and down in the water twice, the woman turns around, pushes off the edge of the pool, and emerges from the water.", "text": "The video begins with an empty swimming pool, showing the pool's lanes and the water surface. A woman jumps into the pool, creating a splash as she enters the water. She continues to descend, with bubbles and water movement visible around her. The woman swims forward, her body becoming more visible as she moves through the water. She continues to swim, with her arms and legs moving in a coordinated manner. The woman swims further away, and the water around her becomes less turbulent. The pool is left empty again, with the water calm and the lanes clearly visible.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 606, "extra_info": {"duration": 10.42, "events": ["Camera captures footage inside a swimming pool", "Water surface gently ripples", "Woman's legs enter the water vertically", "Woman's legs bob up in the water", "Woman's legs bob down in the water", "Woman turns around in the water", "Woman pushes off from the pool's edge", "Woman emerges from the water"], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 606}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "573.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person wearing an orange helmet and dressed in a black short-sleeve shirt rides a motorcycle towards the mountain top. The camera follows the movement; dust flies along the way. As they are about to reach the summit, the person stands up from the motorcycle, plants their feet on the ground, then turns around, swings one leg over the bike, and continues to rev the engine, with dust enveloping them. The background is a mountain.", "text": "A man wearing an orange helmet rides a motorcycle on a dirt path towards a steep, rocky cliff. He ascends the cliff, kicking up a trail of dust behind him. As he reaches the top, he continues to ride up the cliff, creating a large cloud of dust. The man reaches the peak of the cliff and begins to turn his motorcycle around. He completes the turn and starts descending the other side of the cliff, still surrounded by a thick cloud of dust.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 573, "extra_info": {"duration": 10.6, "events": ["Person rides motorcycle towards the mountain top.", "Motorcycle stirs up dust along the way.", "Person stands up from the motorcycle.", "Person plants feet on the ground.", "Person turns around.", "Person swings leg over the bike.", "Person continues to rev the motorcycle's engine.", "Dust envelops the person."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 573}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "110.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In front of the fireplace, a person wearing a hat is showing a cellphone to a little girl sitting on the left. The little girl remains still, reading her book, paying no attention to the person with the hat. The person with the hat then turns to show the cellphone to two little girls on the right, but they too remain still, reading their books, ignoring him.", "text": "In a richly decorated room with green wallpaper, framed pictures, and ornate furniture, a character wearing a green hat and holding a book jumps up from the fireplace and lands on the floor. The character then turns around and starts dancing energetically in front of the fireplace. Meanwhile, two other characters, one on the left and one on the right, are sitting in armchairs, reading books and watching the dancing character.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 110, "extra_info": {"duration": 3.25, "events": ["Person shows cellphone to little girl.", "Little girl reads her book.", "Person turns to two little girls.", "Two little girls read their books.", "Person shows cellphone to two little girls."], "n_events": 5, "n_shots": 1, "n_subjects": 4}, "vid": 110}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "256.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two men and a girl are in the park, the man on the left flips a coin, and the man on the right bends over to catch it.", "text": "In a park with benches and trees, a girl holding a pink umbrella talks to two men standing in front of her. The man on the left, wearing a black jacket, takes out some money and hands it to the man on the right, who is wearing a beanie and a jacket. The man on the right then bends down and takes the money from the man on the left. The girl continues to talk to the two men throughout the interaction.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 256, "extra_info": {"duration": 4.75, "events": ["Two men and a girl are in the park", "The man on the left flips a coin", "The man on the right bends over", "The man on the right catches the coin"], "n_events": 4, "n_shots": 1, "n_subjects": 3}, "vid": 256}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "665.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, three people are sitting at a table drinking. A man and a woman face the camera holding brown bottles, while another person, only their hand visible to the camera, is holding a green bottle. The three of them exchange a few words and then clink their bottles together.", "text": "Three people are sitting at a table outdoors, each holding a bottle. The person on the left holds a brown bottle, the person in the middle holds a brown bottle, and the person on the right holds a green bottle. They raise their bottles and clink them together in a toast. After the toast, they lower their bottles back to the table.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 665, "extra_info": {"duration": 6.08, "events": ["Three people sit at a table.", "A man holds a brown bottle.", "A woman holds a brown bottle.", "Another person's hand holds a green bottle.", "The three people exchange words.", "They clink their bottles together."], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 665}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "877.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In this video, a black car is smoothly driving on the left, nearing a cross-section with a fresh green traffic light. Suddenly, a large white truck emerges from the right intersection. The black car attempts to swerve to the left to avoid a collision, but fails as the white truck collides with it.", "text": "The video shows a car driving on a multi-lane road with a clear sky and some trees in the background. As the car approaches an intersection, a truck appears from the right side of the frame and starts crossing the road. The truck continues to move across the intersection, and the car begins to slow down. The truck completes its crossing, and the car comes to a stop behind the truck. The truck continues to move forward, and the car remains stationary. The truck drives away, and the car starts to move again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 877, "extra_info": {"duration": 6.0, "events": ["A black car drives on the left.", "The black car nears a cross-section.", "A traffic light turns green.", "A large white truck emerges from the right.", "The black car swerves to the left.", "The white truck collides with the black car."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 877}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "595.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a small path at the entrance of a residential house. The man is holding dolls in both hands, looking down, he takes a step forward and steps on an object. His eyes move, then he looks downward, discards the doll in his right hand, and takes a step back. He repeats the stepping forward and stepping back actions, while also throwing away the doll in his left hand, and takes off his black coat. The final scene shows the man bending over with both hands open, standing on one foot.", "text": "A man stands in an alley holding two stuffed toys, one in each hand, and a tablet on the ground in front of him. He steps on the tablet, causing the toys to fall to the ground. The man looks at the tablet, then steps on it again, causing the toys to fall once more. He reacts with surprise, stepping back and looking at the tablet. The man then removes his jacket, revealing a white shirt underneath, and continues to look at the tablet.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 595, "extra_info": {"duration": 8.4, "events": ["Man holds dolls in both hands.", "Man takes a step forward.", "Man steps on an object.", "Man's eyes move.", "Man looks downward.", "Man discards doll from his right hand.", "Man takes a step back.", "Man throws away doll from his left hand.", "Man takes off his black coat.", "Man stands on one foot bending over."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 595}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "439.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman pushes the door forcefully, then looks up. She picks up an iron shovel to prop against the eaves, pushes the door back with her right foot, turns around to put down the shovel, and the camera cuts to the side, showing the woman sweeping the floor with a broom. The scene switches to the woman washing yellow food in a bowl under a tap, then cuts again to the woman scattering the yellow food on the ground, lifting up a wooden lid, and a flock of chickens run out to eat the food.", "text": "A woman, wearing a colorful headscarf, a pink apron, and sandals, opens a metal door of a brick building and steps inside. She retrieves a shovel from inside the building and uses it to sweep the ground. She then uses a red broom to further clean the area. The woman washes a bowl at a water tap, then pours the contents onto the ground. She opens a wicker basket and releases several chickens and a rabbit onto the ground. Finally, she closes the door.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 439, "extra_info": {"duration": 15.03, "events": ["The woman pushes the door forcefully.", "The woman looks up.", "The woman picks up an iron shovel.", "The woman props shovel against the eaves.", "The woman pushes the door with her right foot.", "The woman puts down the shovel.", "The woman sweeps the floor with a broom.", "The woman washes yellow food in a bowl.", "The woman scatters the yellow food on the ground.", "The chickens run out to eat the food."], "n_events": 10, "n_shots": 3, "n_subjects": 2}, "vid": 439}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "84.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The boy sits on the bed, leaning against the wall, his eyes gazing diagonally at the wall behind him. On the other side of the wall, several people press their ears against it, listening to the sounds. Suddenly, they widen their eyes and quickly leave, all except for one big-headed kid who reacts a bit more slowly. The camera pans to the left, revealing the entire room of the boy. The boy turns his eyes forward and sits up, supporting himself with his hands on the bed.", "text": "A boy is sitting on a bed in a room with a red plaid blanket, wearing a white tank top and blue shorts with white polka dots. The room has a window with red frames, a desk with a green lamp, a green apple, and a plant. There are posters on the wall, including one with a red background and a white figure. The boy looks to his right as a group of people, including a person holding a bowl of popcorn, move away from the window. The boy then turns his head to look at the group and starts to get up from the bed.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 84, "extra_info": {"duration": 6.38, "events": ["The boy gazes diagonally at the wall.", "People press their ears against the wall.", "People's eyes widen.", "People quickly leave the room.", "One big-headed kid reacts slowly.", "The boy sits up.", "The boy supports himself with his hands on the bed."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 84}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "611.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background features a blurry silhouette of a person. After waving goodbye, the figure turns and leaves to the left, lightly pushing the door with their left hand. A hand lightly sways in front of the camera, and the man reaches forward to close the door before leaving. The camera then moves to the right, ending with a close-up of the room's door lock.", "text": "A hand is extended towards a door, with a person in white standing outside, waving. The person in white continues to wave as they approach the door. The person in white reaches the door and begins to close it. The door is fully closed, and the person in white is no longer visible. A man in an orange shirt approaches the door and starts to close it. The man in the orange shirt continues to close the door, and the door is almost fully shut. The door is completely closed, and the man in the orange shirt is no longer visible.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 611, "extra_info": {"duration": 8.68, "events": ["Person waves goodbye", "Person turns", "Person leaves to the left", "Person pushes door with left hand", "Hand sways in front of camera", "Man reaches forward", "Man closes door", "Camera moves to the right", "Close-up of door lock"], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 611}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "550.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Some men are jumping together on a trampoline, and after one of them falls, he is bounced high into the air. The camera captures an overhead shot of the man being bounced upwards, then he falls back down, and the camera switches to show the men moving backwards.", "text": "A group of people are gathered around a trampoline in front of a large building with a patterned facade. One person in an orange outfit is seen jumping high into the air, performing flips and spins. The person continues to jump and flip, while others on the trampoline and around it watch and record the event. The camera angle shifts to an overhead view, showing multiple people on the trampoline performing flips and jumps simultaneously. The person in the orange outfit continues to jump and flip, while others on the trampoline and around it continue to watch and record. The video ends with the person in the orange outfit landing back on the trampoline.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 550, "extra_info": {"duration": 7.0, "events": ["Men jump together on a trampoline.", "One man falls on the trampoline.", "The man is bounced high into the air.", "The man falls back down onto the trampoline.", "Men move backwards."], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 550}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "272.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person is swinging an axe down, repeatedly striking with force. The camera angle shifts from the man's upper body to slowly pan down to the ground, where the man is using the axe to chop at a wooden stake in front of him.", "text": "A man stands in a wooded area with dense green foliage in the background, holding an axe with both hands. He begins to swing the axe down forcefully, aiming at a log on the ground. The man continues to swing the axe repeatedly, striking the log with increasing force. The log is gradually split into smaller pieces as the man maintains his vigorous chopping motion. The man finally splits the log into several pieces, which are scattered around the stump.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 272, "extra_info": {"duration": 8.38, "events": ["A person swings an axe down.", "The axe repeatedly strikes with force.", "The man chops at a wooden stake."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 272}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "843.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In this video clip, an individual places a rubber band around the middle finger and thumb of their right hand. They then begin to spin the rubber band using their left hand. Unexpectedly, the rubber band slips off the middle finger.", "text": "A person places a green rubber band on their left hand, which is resting on a wooden table. The person then uses their right hand to press down on the rubber band, causing it to stretch and form a loop around their left hand. The person continues to press and adjust the rubber band, ensuring it is securely wrapped around their hand. Finally, the person releases the rubber band, leaving it looped around their left hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 843, "extra_info": {"duration": 5.99, "events": ["Individual places a rubber band around middle finger and thumb.", "Individual begins to spin the rubber band with left hand.", "Rubber band slips off the middle finger."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 843}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "880.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video displays two people, clad in red and white garments, holding hands on a seaside surrounded by other individuals. The sky is heavily clouded with a torrential downpour. One person, donned in white clothes and yellow pants, is seen holding a black umbrella over the pair. Another person, notable for standing near a table on the far right, holds an open umbrella. Amid the palm trees, a photographer in white captures the moment. Suddenly, a bolt of lightning strikes, causing everyone to react in fear and direct their attention towards the lightning. Despite the frightful situation, the individual wearing a yellow dress and another in a wedding dress continue to hold hands. Following the alarming event, the photographer departs from the scene.", "text": "A group of people, including a photographer, are gathered on a beach near the water, with some holding umbrellas and others standing in the rain. The photographer is taking pictures of the group, who are posing for the camera. The group includes individuals in red and white dresses, and they are standing on a sandy area with a table and chairs nearby. The background shows a cloudy sky and the ocean. The group starts to move, with some people walking away from the photographer. The photographer continues to take pictures as the group disperses.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 880, "extra_info": {"duration": 6.0, "events": ["Two people hold hands on a seaside.", "Individual in white clothes holds a black umbrella.", "Person near a table holds an open umbrella.", "Photographer captures the moment among palm trees.", "Lightning strikes, causing a reaction from everyone.", "Individual in yellow dress and another continue to hold hands.", "Photographer departs from the scene."], "n_events": 7, "n_shots": 1, "n_subjects": 5}, "vid": 880}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "444.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The shot is taken from behind as the large dog stands up, opens the glass door, and runs out, only to get frightened and suddenly turn back. After opening the door, it takes refuge behind a table, sits down, and looks straight ahead while the camera follows the movement. The background is an office with a computer desk.", "text": "A dog stands in front of a glass door, wagging its tail and looking outside. The dog then turns around and starts running towards the door. It jumps and pushes the door open with its head. The dog runs through the now-open door into the room. Inside the room, the dog walks towards a chair and jumps onto it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 444, "extra_info": {"duration": 8.53, "events": ["The large dog stands up", "The large dog opens the glass door", "The large dog runs out", "The large dog gets frightened", "The large dog turns back", "The large dog takes refuge behind a table", "The large dog sits down", "The large dog looks straight ahead", "The camera follows the large dog's movement"], "n_events": 9, "n_shots": 1, "n_subjects": 1}, "vid": 444}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "993.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In this video, a person wearing a brown suit is observing an oil painting hung on a red wall. The painting has a wooden frame and it portrays an elderly person with long white hair, dressed in a black gown. A green plant can be seen beside the painting, placed on the left side. A gray wall is depicted in the painting's background, with a small picture hung in a black frame at the upper left corner. Suddenly, the person in the brown suit sneezes towards the painting, and upon contact, the spray lands on the painting's face, leaving it marked with water and dirt. The person, instinctively touching their nose after the sneeze, is astonished to see the blemish on the painting.", "text": "A man in a suit stands in front of a framed painting of a woman in a black dress and white headscarf, looking at it intently. He then turns his head and body away from the painting, appearing to be in discomfort. The man bends over, covering his mouth with his hand, and seems to be vomiting. He then straightens up, still looking at the painting, and continues to speak.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 993, "extra_info": {"duration": 6.01, "events": ["A person observes an oil painting.", "The oil painting portrays an elderly person.", "A green plant is placed beside the painting.", "The person in the brown suit sneezes towards the painting.", "Spray lands on the painting's face.", "The painting is marked with water and dirt.", "The person touches their nose after sneezing.", "The person is astonished to see the blemish."], "n_events": 8, "n_shots": 4, "n_subjects": 2}, "vid": 993}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "516.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, a black child dressed in a red ski suit skis down a slope. The camera follows the child as they make turns, sometimes slowing down, until finally, the child raises their arms in cheer and comes to a stop.", "text": "A child dressed in a red snowsuit and black helmet is snowboarding down a snowy slope with mountains in the background. The child maintains balance while moving forward, occasionally adjusting their stance. The child then raises both arms in the air, appearing to celebrate or express joy. After holding the pose for a moment, the child lowers their arms and continues to snowboard down the slope.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 516, "extra_info": {"duration": 8.13, "events": ["A black child skis down a slope.", "The child makes turns.", "The child slows down.", "The child raises their arms.", "The child comes to a stop."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 516}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "857.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts an individual outfitted in a sleeveless shirt, denim shorts, and slippers. The person is seen setting up a trap system - placing the final piece and throwing a carrot into a self-made cage trap. The person tactfully imitates a bird's movements as they handle the carrot. The self-built cage houses three green bamboo sticks. Upon the carrot's entry into the bamboo cage, a wire mesh door promptly shuts.", "text": "A man is crouching next to a black plastic crate with multiple holes, holding a carrot in his right hand. He places the carrot into the crate through one of the holes. The man continues to place more carrots into the crate, one by one. The crate is now partially filled with carrots. The man then covers the crate with a black mesh net, securing it in place. The net is fully covering the crate, and the carrots are visible through the mesh. The man continues to place more carrots into the crate, ensuring they are all inside.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 857, "extra_info": {"duration": 6.01, "events": ["Individual sets up a trap system.", "Individual places the final piece of trap.", "Individual throws a carrot into the cage.", "Individual imitates a bird's movements.", "The carrot enters the bamboo cage.", "Wire mesh door shuts promptly."], "n_events": 6, "n_shots": 3, "n_subjects": 1}, "vid": 857}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "791.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The Canadian flag flutters in the wind, the camera cuts to a close-up to show the image of the Canadian flag fluttering in the breeze.", "text": "The video shows a Canadian flag waving in the wind against a backdrop of a blue sky with scattered clouds. The flag is red and white with a prominent maple leaf in the center. The flag's movement is dynamic, with the fabric rippling and fluttering as it is blown by the wind. The camera gradually zooms in on the flag, bringing it closer into view while maintaining focus on the flag's details and the sky in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 791, "extra_info": {"duration": 11.08, "events": ["The Canadian flag flutters in the wind."], "n_events": 1, "n_shots": 2, "n_subjects": 1}, "vid": 791}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "792.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera moves downward, displaying a panoramic view of Earth, with subtitles and icons appearing in the middle. The camera zooms in, the subtitles and icons disappear, and the image changes from blue to green and then turns completely white.", "text": "The video begins with a view of Earth from space, showing the North American continent against a starry background. The camera gradually zooms out, revealing more of the Earth's surface, including the Asian continent. As the camera continues to zoom in, the location of Taipei, Taiwan, is highlighted with a label. The zoom continues, bringing the viewer closer to the island of Taiwan, showing detailed geographical features. The camera zooms in further, transitioning from a global view to a close-up aerial view of the island's terrain. The video ends with a close-up of the island's landscape, before fading to white.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 792, "extra_info": {"duration": 11.04, "events": ["The camera moves downward.", "Subtitles appear in the middle.", "Icons appear in the middle.", "The camera zooms in.", "The subtitles disappear.", "The icons disappear.", "The image changes from blue to green.", "The image turns completely white."], "n_events": 8, "n_shots": 2, "n_subjects": 1}, "vid": 792}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "128.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A bald boy is sitting on a platform, with his right hand propping up his cheek and his left hand resting on his knee, talking to Snoopy in front of him. Snoopy walks forward two steps, hops onto the platform, sits down, and mimics the boy's pose with his right hand on his cheek and his left hand supporting his knee.", "text": "In a backyard with a wooden fence and autumn foliage, a boy in a yellow and black striped shirt sits on a small platform, talking to a dog. The dog, standing in front of a red bowl, listens attentively. The dog then turns around, sits down, and starts scratching its head. The boy continues to talk to the dog, which remains seated and scratching its head.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 128, "extra_info": {"duration": 4.75, "events": ["A bald boy is sitting on a platform.", "The bald boy props up his cheek with right hand.", "The bald boy rests his left hand on his knee.", "Snoopy walks forward two steps.", "Snoopy hops onto the platform.", "Snoopy sits down.", "Snoopy mimics the boy's pose.", "Snoopy props up his cheek with right hand.", "Snoopy supports his knee with left hand."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 128}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "425.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a beach, a swing hangs between two palm trees, with a lady sitting on it. Behind the lady, a man pushes her to start swinging quickly then moves out of the camera frame. Taking advantage of the moment when the lady swings backward, another man who had already been waiting in front heads towards her, running with his phone to capture the moment on the swing. As the lady reaches the peak of her swing and begins to move back, the man passes underneath and continues to film.", "text": "On a sunny beach with clear blue skies and calm waters, a woman in a black swimsuit and sunglasses is swinging on a swing set. A man in a black shirt and shorts runs towards the swing and pushes it, causing the woman to swing higher. Another man in a blue shirt and shorts, standing nearby, starts taking photos of the woman as she swings. The woman continues to swing back and forth, enjoying the ride, while the man in the blue shirt continues to take photos. The background features several boats anchored in the water and a few people swimming and relaxing near the shore.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 425, "extra_info": {"duration": 10.8, "events": ["Swing hangs between two palm trees.", "Lady sits on the swing.", "Man pushes the lady to swing quickly.", "Pushing man moves out of the camera frame.", "Another man runs with his phone towards the swing.", "Lady reaches the peak of her swing.", "Lady begins to move back.", "Running man passes underneath the swing.", "Running man continues to film."], "n_events": 9, "n_shots": 1, "n_subjects": 3}, "vid": 425}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "130.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera showcases a series of comic images in front. It moves down to the right, passing a simple sketch, then moves downwards and fixes on Snoopy lying on top of the red house. After a while, Snoopy opens his eyes and sits up.", "text": "The video begins with a comic strip showing a boy sitting and reading a book, with a dog running in the background. The camera then moves to another comic strip where two boys are sitting and talking. The scene transitions to a cartoon where a dog is lying on top of a red house, looking around. The dog then turns its head to look behind it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 130, "extra_info": {"duration": 9.38, "events": ["Camera showcases a series of comic images", "Camera moves down to the right", "Camera passes a simple sketch", "Camera moves downwards", "Camera fixes on Snoopy", "Snoopy lies on top of the red house", "Snoopy opens eyes", "Snoopy sits up"], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 130}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "758.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A blue sedan drifts into the frame and then drives away, with the passenger holding onto the safety handle in the car.", "text": "In an industrial area with various buildings, trucks, and palm trees in the background, a blue car enters the frame from the right side and begins to drift, creating a cloud of smoke from its tires. The car continues to drift, moving towards the left side of the frame, with the smoke trail becoming more pronounced. The car exits the frame to the left, leaving a trail of smoke behind. The background remains consistent throughout, showing the same industrial setting with buildings and trucks.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 758, "extra_info": {"duration": 4.59, "events": ["A blue sedan drifts into the frame.", "The blue sedan drives away.", "Passenger holds onto the safety handle."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 758}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "384.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman lay on her side reading a book, sighed, put down the book, reached out to open the box, and rummaged through the things inside.", "text": "A woman, dressed in a white nightgown, is sitting up in bed and reading a book under the light of a bedside lamp. She adjusts the book in her hands, moving it slightly. She then reaches out with her right hand towards a box on the bedside table. She opens the box and looks inside.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 384, "extra_info": {"duration": 9.0, "events": ["The woman lay reading a book.", "The woman sighed.", "The woman put down the book.", "The woman reached out.", "The woman opened the box.", "The woman rummaged through the box's contents."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 384}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "963.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video portrays Formula One vehicles speeding on a track, with a large forest in the background. As the leading cars zip around the bend at high speed, the trailing vehicles are also negotiating the curve. Some cars take a broader path around the next corner, while others, aiming to be faster and billowing smoke, navigate the turn more narrowly. Meanwhile, a blue car assumes the lead in the race.", "text": "A group of race cars is seen approaching the starting line under a 'TotalEnergies' banner, with the timer showing 11:31:50. The cars begin to accelerate as they pass under the banner, with the timer now at 11:31:51. The cars continue to speed up, moving closer to the camera, and the timer shows 11:31:52. As the cars get nearer, the timer reaches 11:31:53, and the cars start to spread out slightly. The cars are now very close to the camera, with the timer at 11:31:54, and the leading cars begin to emit white smoke. The leading cars continue to emit white smoke as they pass by the camera, with the timer showing 11:31:55. The cars move further away from the camera, still emitting white smoke, and the timer shows 11:31:56.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 963, "extra_info": {"duration": 6.0, "events": ["Formula One vehicles speed on a track", "Cars zip around the bend", "Leading cars navigate the curve at high speed", "Trailing vehicles negotiate the curve", "Some cars take a broader path around the corner", "Other cars navigate the turn more narrowly", "Cars billow smoke while turning", "A blue car assumes the lead"], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 963}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "760.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A group of people and a man in a spacesuit are dancing indoors.", "text": "In a room decorated with colorful flags and party lights, a man dressed in a spacesuit with illuminated helmet and gloves dances energetically in the center. He moves his arms and body rhythmically, interacting with the people around him who are also dancing. The background shows other partygoers enjoying the music and dancing, with some holding drinks. The man in the spacesuit continues to dance, occasionally looking around and engaging with the crowd. The lighting in the room changes colors, adding to the festive atmosphere. The man in the spacesuit remains the focal point, dancing with enthusiasm and interacting with the people around him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 760, "extra_info": {"duration": 15.16, "events": ["A group of people dances indoors.", "A man in a spacesuit dances indoors."], "n_events": 2, "n_shots": 1, "n_subjects": 2}, "vid": 760}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "314.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman lies in bed, facing away from the camera towards the headboard. After a moment, she turns her head to face the ceiling. The camera angle changes, and a bouquet of artificial flowers is lifted up from the bottom of the screen by a hand. The woman stands up and looks up at the flowers, then she walks from the left side of the flowers to the right side.", "text": "A woman is lying on her side in bed, covered with a white blanket, and wearing a gray sweater. She begins to turn her head and body to the side. The camera then shifts to show the woman holding a pink artificial tree with multiple branches and leaves. She raises the tree and starts to adjust it, moving it closer to her face. The woman continues to adjust the tree, ensuring it is properly positioned.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 314, "extra_info": {"duration": 9.75, "events": ["A woman lies in bed.", "Woman turns her head to face the ceiling.", "Hand lifts a bouquet of artificial flowers.", "Woman stands up.", "Woman looks up at the flowers.", "Woman walks from left to right side of the flowers."], "n_events": 6, "n_shots": 2, "n_subjects": 1}, "vid": 314}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "437.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A little boy is sitting at the dining table, scooping food into another boy's bowl with a spoon held by his foot. The other boy says something, and extends his chopsticks with his right hand. The little boy then switches the spoon from one foot to the other, and using his feet, he grabs the chopsticks attempting to pick up the green beans from the bowl. The camera follows the movement of the little boy's feet. The background is a living room.", "text": "A child is sitting on a table with various dishes, including a bowl of mixed vegetables, a bowl of green beans, and a bowl of rice. The child uses a spoon to scoop food from a yellow pot and places it into a bowl of rice held by an adult. The adult then hands the child a pair of chopsticks, which the child uses to pick up food from the yellow pot. The child continues to use the chopsticks to pick up food from the yellow pot and the bowl of green beans.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 437, "extra_info": {"duration": 15.08, "events": ["A little boy sits at the dining table.", "The little boy scoops food into another boy's bowl with foot.", "Another boy speaks.", "Another boy extends chopsticks with right hand.", "The little boy switches the spoon from one foot to the other.", "Using feet, the little boy grabs the chopsticks.", "The little boy attempts to pick up green beans with chopsticks."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 437}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "97.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A soldier in a blue helmet turned his head to nod towards the back, and the camera switched to a soldier in green who glanced at the blue-helmeted soldier, then looked forward in confusion. The blue-helmeted soldier had already stood up and charged into the dense fog ahead. After a while, the camera switched back to the man in green, who was accompanied by a dog. Both climbed out of the trench and rushed forward following the direction of the blue-helmeted soldier.", "text": "A man in a blue uniform and helmet is aiming a rifle in a foggy, desolate landscape. He turns his head to look behind him. Another man in a green uniform and helmet, also holding a rifle, turns his head to look in the same direction. A person in a blue coat and hat runs into the fog and disappears. The man in the green uniform and his dog, who is wearing a collar, get up and start running.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 97, "extra_info": {"duration": 11.12, "events": ["A soldier nodded towards the back.", "A soldier glanced at another soldier.", "A soldier stood up.", "A soldier charged into dense fog.", "A man climbed out of the trench.", "A dog climbed out of the trench.", "Man and dog rushed forward."], "n_events": 7, "n_shots": 5, "n_subjects": 3}, "vid": 97}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "805.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "This video illustrates a person cooking in the kitchen. In the first snippet, an induction cooker is observed on the kitchen counter and a grey-blue casserole dish is placed on it. The person proceeds to add ingredients to the pot. The subsequent clip shows the person chopping a sausage. The person is clad in thin black gloves, with a black watch featured on their right wrist. They demonstrate the method of cutting the sausage, gripping it on a cutting board with their left hand while slicing it into pieces with a knife held in their right hand.", "text": "A person wearing black gloves is stirring a pot of food with a wooden spoon, which contains chopped vegetables including green, red, and yellow bell peppers, and onions. The camera zooms in to show a close-up of the vegetables being stirred. The scene then shifts to the person placing two sausages on a wooden cutting board and cutting them in half. The person continues to cut the sausages into smaller pieces. The camera then returns to the pot, showing the vegetables being stirred again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 805, "extra_info": {"duration": 11.98, "events": ["Person places casserole dish on induction cooker", "Person adds ingredients to the pot", "Person chops a sausage", "Person grips sausage with left hand", "Person slices sausage with knife"], "n_events": 5, "n_shots": 2, "n_subjects": 1}, "vid": 805}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "873.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the featured video clip, two teams are engaged in a soccer match. Player number 23, donned in a white shirt and green shorts, stands out of bounds, attempting to pass the ball to a teammate clad in white attire, green pants, black socks, and green shoes. Concurrently, an opposing player wearing a red shirt and shorts makes an effort to steal the ball from player 23's teammate.", "text": "A soccer ball rolls along the edge of the field, moving from the center towards the left side of the frame. The ball continues to roll and eventually exits the frame to the left. The camera then shifts to show a player in a white jersey preparing to throw the ball back into play. The player throws the ball, and it is seen flying towards the field. The ball lands on the field, and players in red and white jerseys are seen moving and positioning themselves for the next play.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 873, "extra_info": {"duration": 5.99, "events": ["Player 23 attempts to pass the ball", "Opposing player tries to steal the ball", "Teammate waits for the pass"], "n_events": 3, "n_shots": 2, "n_subjects": 2}, "vid": 873}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "9.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Camera pans right in an old industrial workshop to a snowy landscape, displaying group of penguins.", "text": "The camera moves downward, revealing a dimly lit room with large windows and various mechanical chains hanging from the ceiling. As the camera continues to move, the view shifts to show the outside through the windows, where penguins are seen walking on the snow. The camera zooms in on the window, focusing on the penguins as they move closer to the glass. The penguins continue to approach the window, with one penguin standing prominently in the center. The camera then shifts back to the mechanical chains hanging from the ceiling.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 9, "extra_info": {"duration": 13.84, "events": ["Camera pans right in an old industrial workshop", "Camera displays a snowy landscape", "Group of penguins appears"], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 9}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "370.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman sits on the steps smoking, turns around, rummages through her bag, and turns on the screen of her mobile phone.", "text": "A woman is sitting on the steps of a porch, smoking a cigarette while looking to her left. She then turns her head and reaches into her bag, which is placed beside her. She pulls out a phone from her bag and starts using it while still holding the cigarette in her other hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 370, "extra_info": {"duration": 8.25, "events": ["The woman sits on the steps.", "The woman smokes.", "The woman turns around.", "The woman rummages through her bag.", "The woman turns on the mobile phone screen."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 370}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "43.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera moves to the right, the blue fish turns its head to observe its surroundings, then turns around to look at the sea anemone which is in motion. The scene shifts to the inside of the sea anemone where two clownfish are sleeping. The blue fish makes a facial expression and speaks.", "text": "A blue fish with large eyes swims through a coral reef, moving from right to left. The fish continues to swim past various corals and anemones, gradually turning its body to the left. The fish then swims behind a large, purple, textured coral structure. The scene transitions to show the fish peeking from behind the coral, looking at two sleeping fish inside a crevice. The fish then moves further into the crevice, partially hiding behind the coral.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 43, "extra_info": {"duration": 5.25, "events": ["The blue fish turns its head.", "The blue fish turns around.", "The sea anemone is in motion.", "Two clownfish are sleeping inside the sea anemone.", "The blue fish makes a facial expression.", "The blue fish speaks."], "n_events": 6, "n_shots": 2, "n_subjects": 3}, "vid": 43}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "580.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A high-rise building caught fire, with black smoke billowing around. The phrases \"119 Training for Combat\" and \"Drone Firefighting\" were written on the building. A red drone, carrying two hoses on its left and bottom sides, was spraying water into the building to extinguish the fire.", "text": "A drone is seen flying in the sky, holding a hose that is spraying water onto a building with a fire on the upper floor. The building has a sign in Chinese characters on the left side, and the fire is visible through the windows on the upper floor. The drone moves slightly as it continues to spray water, gradually extinguishing the fire. The water stream from the drone is directed at the burning area, and the fire appears to be diminishing. The drone maintains its position and continues to spray water, ensuring the fire is completely put out.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 580, "extra_info": {"duration": 10.0, "events": ["High-rise building caught fire.", "Black smoke billowed around high-rise building.", "Phrases appeared on high-rise building.", "Red drone carried two hoses.", "Red drone sprayed water into high-rise building.", "Water extinguished the fire in high-rise building."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 580}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "551.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man runs forward holding a skimboard, throws it onto the water's surface, quickly steps on it, and glides across the water. Then he jumps up, flips in the air, and falls back down, creating a splash. The camera follows the man as he moves. The background is a seaside setting.", "text": "A man holding a surfboard runs down a sandy slope towards the beach, passing by a group of people sitting under trees. He continues running along the beach, approaching the water. The man places the surfboard on the sand and steps onto it, riding the wave. He gains momentum and performs a backflip off the wave, landing back on the water.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 551, "extra_info": {"duration": 5.0, "events": ["A man runs forward holding a skimboard.", "Man throws skimboard onto the water's surface.", "Man quickly steps on the skimboard.", "Man glides across the water.", "Man jumps up from the skimboard.", "Man flips in the air.", "Man falls back down into the water.", "Water creates a splash."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 551}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "203.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A boy runs down a set of stairs and chases after another boy down a cobblestone street. A man in a brown suit and hat runs down the stairs and chases after the boys.", "text": "In a stone-paved alley with arched doorways and a staircase, a group of children run down the stairs and along the path. A man in a suit and hat descends the stairs, holding a briefcase, and starts running after the children. The man continues to chase the children, who are running ahead of him. The man reaches the bottom of the stairs and continues running down the alley. The children keep running further down the path, while the man follows closely behind.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 203, "extra_info": {"duration": 9.67, "events": ["A boy runs down a set of stairs.", "A boy chases after another boy.", "A man runs down the stairs.", "A man chases after the boys."], "n_events": 4, "n_shots": 1, "n_subjects": 3}, "vid": 203}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "605.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a grassy field. Many bubbles are drifting in from the left, and the girl on the right reaches out to catch them. The girl on the left spins around, and then the two of them grasp hands tightly and happily spin around on the grass. Afterwards, they continue to catch bubbles. The camera remains fixed.", "text": "In a sunlit park with a green lawn, two girls in pink jackets and jeans are playing with soap bubbles. They start by reaching out to pop the bubbles around them. The girls then hold hands and begin to spin around, smiling and enjoying the moment. They continue to spin and reach out to pop more bubbles, moving in a circular motion. The girls laugh and interact with each other, occasionally looking at the bubbles and each other. They keep spinning and reaching out, fully immersed in the playful activity.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 605, "extra_info": {"duration": 7.7, "events": ["Bubbles drift in from the left", "Girl on the right reaches out to catch bubbles", "Girl on the left spins around", "Both girls grasp hands tightly", "Girls happily spin around on the grass", "Girls continue to catch bubbles"], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 605}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "15.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The girl whistles as she and the boy look upward together. The camera angle switches from their front to their back, and a large bird flies toward them. The large bird grabs the boy and flies away, while the girl turns to watch the direction in which the large bird is flying. The bird ascends into the sky, where other birds and burning objects are present.", "text": "Two characters in armor stand on a wooden platform with a fiery background, looking ahead. The character on the right touches their face and then turns to look at the character on the left. They both turn to face the sky as a dragon flies towards them. The character on the left raises a weapon and points it at the approaching dragon. The dragon flies closer and attacks the character on the left, knocking them off the platform. The character on the right watches as the dragon flies away with the character on the left. The dragon continues to fly away, leaving the character on the right standing alone on the platform.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 15, "extra_info": {"duration": 5.38, "events": ["The girl whistles.", "The girl and the boy look upward together.", "A large bird flies toward the girl and the boy.", "The large bird grabs the boy.", "The large bird flies away with the boy.", "The girl turns to watch the large bird.", "The large bird ascends into the sky.", "Other birds are present in the sky.", "Burning objects are present in the sky."], "n_events": 9, "n_shots": 3, "n_subjects": 4}, "vid": 15}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "806.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video portrays an individual with dark skin, curly hair, and a white T-shirt, utilizing an electric toothbrush to brush their teeth. Following this, they turn on the bathtub tap, situated to their right, and the shower starts running. The turning of the faucet to the right is intended to source water for the bath. Concurrently, a basket with toiletries can be seen stored.", "text": "A man in a white tank top is brushing his teeth in a bathroom with white towels hanging on the wall and various toiletries on the counter. He then turns off the faucet by turning the handle. The camera shifts to show the showerhead, which is turned on and water is flowing. The camera then moves to show the entire shower area, including the showerhead, handle, and curtain. Finally, the man is seen again brushing his teeth in front of the mirror.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 806, "extra_info": {"duration": 11.97, "events": ["Individual brushes teeth with electric toothbrush.", "Individual turns on bathtub tap.", "Shower starts running.", "Basket with toiletries appears."], "n_events": 4, "n_shots": 3, "n_subjects": 1}, "vid": 806}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "559.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two men are standing in the middle of a stone wall, one taller and one shorter, passing the ball back and forth, with another person walking around on the left.", "text": "In a narrow, rocky canyon with green moss-covered walls, a man in a blue shirt and gray shorts stands on a rock, throwing a yellow and black soccer ball to a woman in a white shirt and shorts who is standing further down the canyon. The woman catches the ball and throws it back to the man. The man catches the ball and throws it back to the woman, who catches it again. This sequence of throwing and catching continues multiple times.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 559, "extra_info": {"duration": 7.93, "events": ["Two men stand in the middle of a stone wall.", "One man passes the ball to the other man.", "The other man passes the ball back to the first man.", "A person walks around on the left."], "n_events": 4, "n_shots": 1, "n_subjects": 3}, "vid": 559}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "995.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person wearing yellow-striped flip-flops and black socks bends down, breaks off the sodium metal from the tip of a stick on a newspaper, and prepares it to throw it into the water in a plastic bottle. The person performs this experiment on a soil surface with fallen tree leaves. To be careful, the person wearing flip-flops and a gray tracksuit plucks the metal from the end of the stick with their right hand, bends down, cautiously extends their right arm, and puts it into the plastic bottle. Reacting violently with water, this large piece of metal first emits white smoke and then explodes, sending the bottle crashing to the ground.", "text": "A person is crouching on the ground next to a plastic bottle, a blue cup, and some scattered leaves and rocks. The person stands up and moves closer to the bottle, reaching out to touch it. The person then steps back, and the bottle starts to emit smoke. The smoke intensifies, and the bottle is eventually blown up, scattering into pieces on the ground. The person crouches down again, and the scene returns to the initial setup with the bottle, cup, and scattered items.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 995, "extra_info": {"duration": 12.0, "events": ["Person breaks off sodium metal on newspaper", "Person prepares sodium for tossing into water", "Person plucks sodium metal with right hand", "Person bends down with sodium metal", "Person cautiously extends right arm with sodium", "Person puts sodium into the plastic bottle", "Sodium reacts violently with water", "White smoke emits from the bottle", "Sodium explosion occurs", "Bottle crashes to the ground"], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 995}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "239.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man displays an object, takes a few steps forward, and speaks, then the camera shifts to the interior where two people are cleaning the floor. The man wearing a hat moves to the left, walking towards another man, and speaks to him.", "text": "A man in a trench coat holds up a small, dark object in his right hand and speaks while walking forward in a dimly lit, industrial-looking area with a multi-story building in the background. The scene then shifts to an indoor setting with various machinery and equipment. A man in a suit stands near a large, frosted window, observing the surroundings. Two men, one in a dark uniform and the other in a cap, walk towards the man in the suit and stop beside him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 239, "extra_info": {"duration": 4.75, "events": ["A man displays an object.", "The man takes a few steps forward.", "The man speaks.", "Two people clean the floor.", "The man wearing a hat moves to the left.", "The man wearing a hat walks towards another man.", "The man wearing a hat speaks to another man."], "n_events": 7, "n_shots": 2, "n_subjects": 4}, "vid": 239}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "702.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the water, an octopus swims away into the distance followed by a group of small fish. The octopus stops at a spot, its body quickly blending in with the surrounding environment. The following group of small fish quickly disperses, and then a few of them swim back towards the octopus.", "text": "An octopus is seen on a coral reef, surrounded by various corals and a school of small, orange fish. The octopus begins to move, using its tentacles to navigate through the water. It continues to swim, gradually moving away from the camera, while the fish swim around it. The octopus becomes less visible as it swims further into the distance, blending into the coral reef environment. The fish continue to swim around the coral reef, maintaining their presence in the scene.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 702, "extra_info": {"duration": 20.81, "events": ["An octopus swims away in the water.", "A group of small fish follows the octopus.", "The octopus stops at a spot.", "The octopus camouflages with the environment.", "The group of small fish quickly disperses.", "A few small fish swim towards the octopus."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 702}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "46.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man claps and speaks as the table with a face opens its eyes, floating and flying around the room. A character turns their head in reaction to this, and the table with a face stops beside the other tables.", "text": "In a room with yellow curtains and white-covered tables, a man in a black cape raises his hand and speaks, causing a table to rise from the ground. The man then stands up and watches as the table floats in the air. The table continues to float and move around the room, eventually landing on another table. The man in the black cape continues to speak and gesture with his hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 46, "extra_info": {"duration": 8.5, "events": ["The man claps.", "The man speaks.", "The table with a face opens its eyes.", "The table with a face floats.", "The table with a face flies around the room.", "A character turns their head.", "The table with a face stops flying.", "The table with a face positions itself beside other tables."], "n_events": 8, "n_shots": 2, "n_subjects": 3}, "vid": 46}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "432.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a bookshelf. The camera moves to the right, as a woman wearing headphones looks upward, picks up a flashlight from the table, and turns it on. The flashlight is shining toward the camera, then the scene switches to show the woman holding the flashlight while walking down a smoke-filled corridor, ending with images of several pigs.", "text": "A woman in a school uniform sits at a cluttered desk with various items including a pen, a small bottle, and some papers, and puts on headphones. She then picks up a flashlight from the desk and turns it on, illuminating the area in front of her. The scene transitions to the woman walking down a dimly lit corridor, holding the flashlight and scanning the surroundings.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 432, "extra_info": {"duration": 8.0, "events": ["The camera moves to the right", "A woman looks upward", "The woman picks up a flashlight", "The woman turns on the flashlight", "The flashlight shines toward the camera", "The woman walks down a smoke-filled corridor", "Images of several pigs appear"], "n_events": 7, "n_shots": 3, "n_subjects": 2, "old_duration": 10.03, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/32_no_audio.mp4"}, "vid": 432}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "795.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The giraffe runs forward on the savannah, with the camera following the movement, and then other giraffes appear. Some are standing still, while others are running along with it. Then it stops, turns its head, and looks to the right, and the other giraffes walk forward.", "text": "A giraffe is seen running across a grassy field with a clear sky in the background. The giraffe continues to run, gradually moving from the left to the right side of the frame. As the giraffe runs, it approaches a group of other giraffes that are standing and walking in the field. The group of giraffes becomes more visible as the running giraffe gets closer. The running giraffe joins the group of giraffes, and they all stand together in the field.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 795, "extra_info": {"duration": 30.16, "events": ["The giraffe runs forward on the savannah.", "Other giraffes appear.", "Some giraffes are standing still.", "Other giraffes are running.", "The giraffe stops.", "The giraffe turns its head.", "The giraffe looks to the right.", "Other giraffes walk forward."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 795}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "170.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A sightseeing boat full of tourists sailed from the left to the right side of the river. Afterwards, several animals successively popped their heads out of the water and then they swam towards the left side.", "text": "A boat labeled 'Outback Bird Watchers' moves from left to right along a river, passing by a lush, green forest with large trees and dense foliage. The boat carries several passengers, including a person in a blue shirt and another in a green shirt. As the boat continues to move, it reveals a group of animals, including a platypus, a bird, a blue creature, and a purple creature, swimming in the river. The animals continue to swim as the boat exits the frame, leaving the animals in the river.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 170, "extra_info": {"duration": 6.5, "events": ["A sightseeing boat sailed from left to right side of river.", "Several animals popped their heads out of the water.", "Animals swam towards the left side."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 170}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "810.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "From the video, we can see the dog is kept on a leash by the owner who trys to pull the dog behind a circle with purple, white and yellow. Then the dog jumped out of the circle, and then climbed the stairs.", "text": "A small dog with a green ball in its mouth is led by a person on a chain across a green indoor field, with various agility equipment visible in the background. The dog drops the ball and starts to move around, while the person holding the chain bends down to interact with the dog. The scene then shifts to a different dog, which is seen sitting and then jumping through a colorful hoop. The dog successfully navigates the hoop and continues to move forward, with a timer and the name 'BOCA' appearing on the screen.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 810, "extra_info": {"duration": 9.99, "events": ["The owner keeps the dog on a leash.", "The owner tries to pull the dog behind a circle.", "The dog jumps out of the circle.", "The dog climbs the stairs."], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 810}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "673.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A girl is playing with clay on the table, flattening the clay with her hands. Next to her is a child wearing a yellow sweater, who starts with one hand on the table, and at the end of the video, he stands up.", "text": "A girl in a white shirt and brown apron is playing with pink play dough on a white table, using a red plastic mold to cut shapes. She presses the mold into the play dough, lifts it, and places it back down to cut another shape. The girl continues to press and lift the mold, creating multiple shapes from the play dough. Another child in a yellow sweater is also playing with play dough on the left side of the table. The background shows a colorful room with various toys and decorations.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 673, "extra_info": {"duration": 13.16, "events": ["A girl plays with clay on the table.", "The girl flattens the clay with her hands.", "A child with a yellow sweater has one hand on the table.", "The child in the yellow sweater stands up."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 673}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "915.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video illustrates an 8-person race, with the athlete wearing the number 184 in white leading and eventually winning. Overwhelmed by excitement, this athlete raises their hand, displays the national flag to the crowd, and kisses it.", "text": "Several athletes are running on a blue track with a dry, brown field in the background. The runners are wearing different colored uniforms, with one in a yellow and blue uniform leading the pack. As they approach the finish line, the runner in the yellow and blue uniform maintains his lead. The runner crosses the finish line first, followed by the others. The runner in the yellow and blue uniform raises his arm in celebration and then lowers it. He then looks down and adjusts his uniform. The scene then cuts back to the runners continuing their race.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 915, "extra_info": {"duration": 11.01, "events": ["Athlete number 184 leads the race.", "Athlete number 184 wins the race.", "Athlete number 184 raises their hand in excitement.", "Athlete number 184 displays the national flag to the crowd.", "Athlete number 184 kisses the national flag."], "n_events": 5, "n_shots": 2, "n_subjects": 8}, "vid": 915}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "263.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two motorcycles sped past on the road, among other vehicles. The camera then switches to a white-bearded rider who pulls out a gun and fires a shot ahead, hitting a tree in front of him. The rider in front dodges and continues riding forward.", "text": "A motorcycle speeds down a deserted road, passing a car in the distance. The motorcycle continues to speed up, moving closer to the camera. The rider, wearing a white helmet and black leather jacket, raises a gun and points it forward. The camera then shows two motorcycles speeding down the road, with a car following behind. As the motorcycles pass by, a cactus on the side of the road explodes, sending debris flying.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 263, "extra_info": {"duration": 3.88, "events": ["Two motorcycles sped past on the road.", "Other vehicles were present on the road.", "A white-bearded rider pulled out a gun.", "The white-bearded rider fired a shot.", "The shot hit a tree in front.", "The rider in front dodged the shot.", "The dodging rider continued riding forward."], "n_events": 7, "n_shots": 3, "n_subjects": 5}, "vid": 263}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "669.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man and woman with graying hair danced in the room, and after the woman did a spin, the two hugged each other.", "text": "In a modern kitchen with a black metal shelf and a white cabinet, a man and a woman hold hands and begin to dance. They raise their joined hands above their heads and spin around. They then lower their hands and embrace each other warmly.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 669, "extra_info": {"duration": 5.8, "events": ["Man and woman danced in a room.", "Woman did a spin.", "Man and woman hugged each other."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 669}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "286.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman lies on her side on the sofa making a phone call. The camera then switches to show the woman putting down the phone with her left hand and looking at the notebook in her right hand. The camera switches again to show the woman lying on her stomach on the sofa making a phone call.", "text": "A woman is lying on a gray sofa in a cozy living room, holding a green rotary phone to her ear. She adjusts her position, placing her hand on her head while continuing to talk on the phone. The camera zooms in to show her holding a book in one hand and the phone in the other. She then moves her body, shifting her position on the sofa while still holding the phone. The room is decorated with various items, including a green coffee table with a plate of colorful objects, a bottle, and some candles.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 286, "extra_info": {"duration": 5.5, "events": ["A woman lies on her side on the sofa.", "The woman makes a phone call.", "The woman puts down the phone with her left hand.", "The woman looks at the notebook in her right hand.", "The woman lies on her stomach on the sofa.", "The woman makes another phone call."], "n_events": 6, "n_shots": 3, "n_subjects": 1}, "vid": 286}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "400.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man took off his glasses, looked into the distance, and blinked.", "text": "A man wearing a brown coat, a knitted sweater, and a checkered flat cap is standing in front of a building with a red awning and a snowy background. He looks up and then raises his right hand to his face. He uses his gloved hand to remove his glasses. After removing his glasses, he looks forward and begins to speak.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 400, "extra_info": {"duration": 4.25, "events": ["The man took off his glasses.", "The man looked into the distance.", "The man blinked."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 400}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "412.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person carrying a bag opens the door, and the camera cuts to this person opening a freezer, pouring the ice cream from the bag into it, and then closing the freezer door after a simple rearrangement. The camera switches to the person moving the keyboard forward, then placing a plate of ice cream on the computer desk.", "text": "A person uses a key to unlock a door and opens it, revealing a room with boxes on the floor. The person then opens a refrigerator, revealing various items inside, and starts pouring the contents of a white bag into the refrigerator. The person continues to empty the bag, filling the refrigerator with various snacks and items. After closing the refrigerator door, the person moves to a desk, turns on a keyboard with a glowing blue light, and then picks up a tray filled with various snacks.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 412, "extra_info": {"duration": 12.6, "events": ["Person opens the door", "Person opens the freezer", "Person pours ice cream into the freezer", "Person rearranges items in the freezer", "Person closes the freezer door", "Person moves the keyboard forward", "Person places a plate of ice cream on the desk"], "n_events": 7, "n_shots": 5, "n_subjects": 1}, "vid": 412}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "68.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman is holding a microphone and speaking. Next to her, a man is watching her. The woman extends the microphone forward, and the scene switches to six girls on the opposite side, where a character with green hair raises her hand to speak.", "text": "In a forest setting, a girl with long pink hair wearing a flower crown and a yellow top holds a microphone and speaks while standing next to a boy with green hair in a red shirt and green pants. The girl then extends the microphone towards a group of girls standing in front of her. The group of girls, dressed in various colorful outfits, look at the microphone and react with smiles and raised hands.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 68, "extra_info": {"duration": 2.62, "events": ["A woman holds a microphone.", "A woman speaks.", "A man watches the woman.", "The woman extends the microphone forward.", "A character with green hair raises her hand."], "n_events": 5, "n_shots": 2, "n_subjects": 4}, "vid": 68}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "481.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "By the side of a pond, two workers dressed in red are cleaning. Inside the pool, an animal places a red buoy on the side of the pond, while a bald male worker is cleaning the pool from the shore with his hands. The animal then places a yellow buoy beside the pool as well. The camera cuts to the male worker sitting in front of a bucket with the small animal standing on the edge of the pool massaging his back. Then they both play a game of catching and throwing items. Following this, the male worker rubs the animal's head with a white object, and the small animal hands over white items one by one to the male worker. There are Chinese subtitles at the bottom of the screen.", "text": "Two men in red jackets and white boots are cleaning the pool area, with one man mopping and the other sweeping. A seal emerges from the water and approaches the men, who are focused on their cleaning tasks. The seal climbs out of the water and interacts with the men, who are now crouching by the poolside. The seal stands on its hind legs and uses its front flippers to take food from the men's hands. The men feed the seal, which eagerly accepts the food. The seal continues to interact with the men, who are smiling and enjoying the interaction.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 481, "extra_info": {"duration": 10.4, "events": ["Two workers clean by the pond.", "Animal places red buoy on the pond side.", "Bald worker cleans the pond from the shore.", "Animal places yellow buoy beside the pond.", "Bald worker sits in front of a bucket.", "Small animal massages bald worker's back.", "Bald worker and small animal play catch.", "Bald worker rubs small animal's head with white object.", "Small animal hands white items to bald worker."], "n_events": 9, "n_shots": 6, "n_subjects": 3}, "vid": 481}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "25.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The door opened, two dogs were thrown into the trash bin, a girl came out from the door and yelled, then she closed the door.", "text": "The scene is set against a brick wall with a green dumpster filled with various items, including wooden planks and boxes, positioned next to a blue door. The door opens, and a dog with a long tail and a white body with black spots jumps out, landing in the dumpster. The dog then jumps out of the dumpster and lands on the ground. A girl in a blue dress appears at the door, looking at the dog and the dumpster. She then closes the door and turns to look at the dog. The door opens again, and the dog jumps out once more, landing in the dumpster. The girl closes the door and turns to look at the dog.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 25, "extra_info": {"duration": 3.38, "events": ["The door opened", "Two dogs were thrown into the trash bin", "A girl came out from the door", "The girl yelled", "The girl closed the door"], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 25}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "906.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts metal strips covered in rust. A man initially sprays a particular solution onto the metal strip, followed by using a high-pressure water gun for cleaning. The rust is then removed by the water, resulting in a color change to gray.", "text": "The video begins with a view of a car's interior, showing a person using a tool to clean the car's interior. The scene then shifts to an outdoor setting where a person is spraying a cleaning solution onto a metal surface. The text 'DRX THE ULTIMATE CLEANER' appears on the screen as the person continues to spray the solution. The person then uses a tool to clean the metal surface, and the text 'NO SANDING OR GRINDING' appears on the screen. The person continues to clean the metal surface with the tool, and the text 'NO SANDING OR GRINDING' remains on the screen. The person then uses a high-pressure water jet to clean the metal surface, and the text 'NO SANDING OR GRINDING' is still visible.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 906, "extra_info": {"duration": 10.01, "events": ["A man sprays a solution onto the rust-covered metal strip.", "A man uses a high-pressure water gun on the metal strip.", "The water removes the rust from the metal strip.", "The metal strip changes color to gray."], "n_events": 4, "n_shots": 2, "n_subjects": 1}, "vid": 906}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "366.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman walks forward, reaches out with her right hand, the camera switches to the woman bumping into a lamp, covering her forehead with both hands, pulling back her left hand, saying something, then switches to another woman sitting and eating, looking up, and switches again to the woman reaching out to snatch snacks from the girl's hands.", "text": "A family is seated around a dining table with various dishes and drinks, including a Christmas tree in the background. A girl in a pink dress is holding a bag of chips and starts to open it. A woman standing up from the table reacts with surprise, raising her hands and touching her head. The woman continues to express her surprise, while a woman in a green sweater watches the scene. The woman in the gray sweater moves towards the girl, and the girl in the pink dress opens the bag of chips.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 366, "extra_info": {"duration": 5.75, "events": ["The woman walks forward.", "The woman reaches out with her right hand.", "The woman bumps into a lamp.", "The woman covers her forehead with both hands.", "The woman pulls back her left hand.", "The woman says something.", "Another woman looks up while sitting and eating.", "The woman snatches snacks from the girl's hands."], "n_events": 8, "n_shots": 4, "n_subjects": 3}, "vid": 366}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "902.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts a series of vehicles travelling along an asphalt road, lined on either side by forest. Two cars, one gray and one black, pass the camera from inside a third vehicle which is following a yellow truck along the same route. Abruptly, the black car switches to an incorrect lane, causing the yellow truck to swerve to the right in an attempt to avoid a collision. Despite the truck's efforts, a collision occurs between the black car and the truck. The impact sends fragments of both vehicles flying, and the black car spins out of control. The vehicle with the camera applies the brakes and comes to a stop.", "text": "The video shows a highway with a truck driving in the right lane and several cars in the left lane. A silver car in the left lane begins to move towards the right lane, approaching the truck. The silver car collides with the truck, causing the truck to veer off the road and crash into the grassy shoulder. The impact causes debris to scatter, and the truck comes to a stop on the shoulder. The scene then transitions to a different location where a person is seen standing on a platform, preparing to jump.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 902, "extra_info": {"duration": 5.0, "events": ["Two cars pass the camera.", "Gray car and black car travel along the road.", "Third vehicle follows a yellow truck.", "Black car switches to an incorrect lane.", "Yellow truck swerves to the right.", "Collision occurs between the black car and the truck.", "Fragments fly from both vehicles.", "Black car spins out of control.", "Vehicle with camera applies the brakes.", "Vehicle with camera comes to a stop."], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 902}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "255.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows a monitor with an elderly person speaking to the camera. An old man and a woman are seated behind a table facing the camera. The old man slides an envelope forward, and the woman watches. The camera then cuts to a younger person across from them, who takes the envelope and looks at its cover, before the camera switches to a view of the envelope's cover.", "text": "A man in a red shirt is seen on a TV screen, holding a black folder and speaking while closing it. The scene then shifts to a conference room where a man in a suit and a woman are seated at a table with coffee cups. The man in the suit reaches out and slides a piece of paper across the table. The camera then shows a man in a brown jacket reaching out to pick up the paper. The final frames show a close-up of the paper, revealing it to be a check made out to 'Jason Stevens' for 'NOSA GATTLLE RANCH, TX 77864'.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 255, "extra_info": {"duration": 6.12, "events": ["Elderly person speaks to the camera.", "Old man slides an envelope forward.", "Woman watches the old man.", "Younger person takes the envelope.", "Camera shows the envelope's cover."], "n_events": 5, "n_shots": 4, "n_subjects": 4}, "vid": 255}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "371.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman dances in the center of the stage, the camera switches to the back of the woman; she swings her arms and steps back. The camera then cuts to a man raising his arms, removing his helmet to say something, and the surrounding audience looks towards him.", "text": "A woman in a red dress is dancing energetically on a stage with a dark background illuminated by small purple lights. She continues to dance, moving her arms and spinning around. The camera then shifts to show her from behind as she walks towards the audience, who are seated at tables decorated with balloons. The audience, including a man in a black leather jacket and helmet, raises his hands in excitement and then lowers them while speaking.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 371, "extra_info": {"duration": 8.25, "events": ["A woman dances in the center of the stage.", "The woman swings her arms.", "The woman steps back.", "A man raises his arms.", "A man removes his helmet.", "A man says something.", "The audience looks towards the man."], "n_events": 7, "n_shots": 3, "n_subjects": 3}, "vid": 371}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "990.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features a brown cat, donned in a white hat, licking its left front paw inside a glass box. In an unexpected turn of events, a black mouse drops from the top right. The cat, which is bedecked with a white napkin around its neck and a petite white hat, watches as the mouse frantically tries to escape. Abruptly, two cartoon characters pop into sight, both sporting brilliant white attire and hats against their black skin. The character found on the right then takes a shot at the mouse with a yellow pistol, unleashing a laser beam. The mouse, exposed to the laser, diminishes immediately in size.", "text": "A cat wearing a chef's hat is trapped in a glass box, licking its paw and looking at a mouse outside the box. The mouse approaches the glass, and the cat turns to look at it. The cat then puts on a chef's hat and starts dancing. Two other characters, also in chef's hats, appear and hold a laser gun. They aim the laser gun at the mouse and shoot it, causing the mouse to disappear. The cat, now with a shocked expression, watches as the mouse vanishes.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 990, "extra_info": {"duration": 9.0, "events": ["Brown cat licks its left front paw.", "Black mouse drops from the top right.", "Brown cat watches the mouse trying to escape.", "Two cartoon characters pop into sight.", "Cartoon character on the right takes a shot at the mouse.", "Yellow pistol unleashes a laser beam.", "Laser beam hits the mouse.", "Mouse immediately diminishes in size."], "n_events": 8, "n_shots": 4, "n_subjects": 4}, "vid": 990}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "976.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video portrays an individual in a lilac cardigan, dusting a green nightstand. They use a green duster and a white napkin in their cleaning process. The nightstand is adorned with various decorative items and electronics. Behind the nightstand is a window, with a white curtain embroidered in yellow hanging adjacent to it.", "text": "A person is seen cleaning a light blue dresser with a green cloth, starting by wiping the surface. The dresser has various items on it, including a green and white device, a small brown object, a framed photo, a wicker basket, and a white object. The person continues to wipe the surface, moving the cloth across the dresser. They then pick up a small item from the dresser and place it into the wicker basket. The person continues to clean the dresser, moving the cloth back and forth. Finally, they pick up another item from the dresser and place it into the wicker basket.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 976, "extra_info": {"duration": 4.97, "events": ["Individual dusts a green nightstand with a green duster.", "Individual wipes the nightstand with a white napkin.", "Curtain hangs adjacent to the window."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 976}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "830.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The car is moving in a straight line on a road. At a junction, the car veers to the left, moving onto another road.", "text": "The video shows the interior of a Tesla car, with a person driving on a two-lane road in a suburban area. The dashboard displays a navigation screen with a map and a rearview camera feed showing the road behind. The car approaches an intersection where a white van is seen crossing from the left lane. The driver begins to turn the steering wheel to the left, aligning the car with the left lane. The car continues to turn left, passing the white van and moving into the left lane. The car completes the turn and continues driving straight on the left lane, following the road ahead.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 830, "extra_info": {"duration": 10.0, "events": ["The car moves in a straight line.", "The car veers to the left.", "The car enters another road."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 830}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "240.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The SUV reversed and unraveled the barbed wire. Its bumper fell off, startling the soldiers behind who then stepped back. The SUV drove forward as the soldiers got into the Humvee. The SUV sped away, leaving the spiked bumper on the road.", "text": "A red SUV crashes into a roadside barrier, causing the front of the vehicle to break and debris to scatter. The scene shifts to the interior of the vehicle, showing the driver and passenger as the vehicle comes to a stop. The view changes to the perspective from inside the vehicle, showing soldiers in military uniforms running towards the vehicle. The soldiers approach the vehicle and begin to inspect the damage. The camera then shows the soldiers standing next to the vehicle, examining the wreckage. Finally, the camera focuses on the roadside barrier, which is now damaged and leaning against a road sign.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 240, "extra_info": {"duration": 9.62, "events": ["The SUV reversed.", "The SUV unraveled the barbed wire.", "The SUV's bumper fell off.", "The soldiers stepped back.", "The SUV drove forward.", "The soldiers got into the Humvee.", "The SUV sped away.", "The spiked bumper remained on the road."], "n_events": 8, "n_shots": 5, "n_subjects": 2}, "vid": 240}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "600.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a kindergarten corridor. Four children and a long-haired woman are sitting against the wall. Another woman with short hair, wearing a red short-sleeved top and black sports pants, is using both hands to pull the children up from their seats in turns. After they jump up, they sit back down. Finally, the woman starts to pull on the long-haired woman who is still seated. A group of children run out from the classroom, with a little girl in front leading the charge, grabbing the short-haired woman's clothing to help. In the end, the long-haired woman is successfully pulled up, and the video ends with her being lifted and jumping into the air.", "text": "In a colorful hallway with space-themed decorations, a woman in a pink shirt and black pants helps children stand up from their seats by holding their hands. She then guides each child to stand up and hold her hand. The woman and the children hold hands and start to move together. As they move, more children join in, and the woman and the children continue to hold hands and move forward.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 600, "extra_info": {"duration": 15.2, "events": ["Children and woman sit against the wall.", "Woman in red pulls children up in turns.", "Children jump up.", "Children sit back down.", "Woman in red attempts to pull long-haired woman up.", "Group of children run out from the classroom.", "Little girl leads group, grabbing woman in red's clothing.", "Children help to pull long-haired woman up.", "Long-haired woman gets pulled up successfully.", "Long-haired woman jumps into the air."], "n_events": 10, "n_shots": 1, "n_subjects": 4}, "vid": 600}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "714.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A submarine slowly surfaced, with many ships in the distance. The camera panned around the submarine from left to right, and at the end of the video, a speedboat moved forward along the submarine.", "text": "A submarine is seen moving forward on the surface of the ocean, creating waves as it progresses. The camera gradually moves closer to the submarine, capturing its details and the surrounding water. In the background, several ships are visible on the horizon, and the sky is clear with a few clouds. As the camera continues to move, a small boat appears in the distance, approaching the submarine.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 714, "extra_info": {"duration": 8.88, "events": ["Submarine surfaced slowly.", "Ships appeared in the distance.", "Speedboat moved forward alongside the submarine."], "n_events": 3, "n_shots": 1, "n_subjects": 3}, "vid": 714}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "175.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The creature pushed off the lid of the barrel, shook the banana peel off its head, looked at another creature crawling out of the box, and then looked to the right.", "text": "In an alleyway with brick walls and a green trash can, a box labeled 'Yoga Diet' is placed on the ground next to two black garbage bags and a can of soda. A blue creature with yellow hair emerges from the trash can, looking around. The creature then throws a banana peel onto the ground. The box on the ground opens, revealing a purple creature and a small yellow creature inside. The blue creature continues to look around as the box remains open.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 175, "extra_info": {"duration": 5.62, "events": ["The creature pushed off the lid of the barrel", "The creature shook the banana peel off its head", "The creature looked at another creature", "Another creature crawled out of the box", "The creature looked to the right"], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 175}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "297.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person lays their head against the car window, looking inside. Then, they stand up, and the window reflects their shadow. They turn their head to look behind, and afterwards, they walk away.", "text": "A man, wearing a light-colored jacket and a watch, is seen pressing his face against the car window, appearing distressed. He then begins to move away from the window, his reflection visible on the glass. The man continues to move away, and his reflection gradually fades from the window. The car window is now empty, with no reflection visible. The man reappears, pressing his face against the window again, showing signs of distress.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 297, "extra_info": {"duration": 5.88, "events": ["A person lays head against the car window", "The person stands up", "Car window reflects the person's shadow", "The person turns head to look behind", "The person walks away"], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 297}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "680.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two men and two women are playing the game of Rock-Paper-Scissors together. They played two rounds; in the first round, the man in red on the left side of the camera was eliminated. Then they proceeded to the second round of the game.", "text": "Four people are sitting on steps outdoors, with a blurred background of buildings and greenery. They are engaged in a hand game, moving their hands in various gestures. The person on the far left, wearing an orange sweater, and the person next to them, in a purple shirt, are actively participating in the game. The person in the denim jacket and the person in the yellow shirt are also involved, with the person in the denim jacket making expressive hand movements. The group appears to be enjoying the game, smiling and laughing as they continue to play.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 680, "extra_info": {"duration": 10.85, "events": ["Two men and two women start playing Rock-Paper-Scissors.", "Players show their chosen signs simultaneously.", "First round of the game concludes.", "The man in red is eliminated.", "Four players prepare for another round.", "Second round of the game begins.", "Players reveal their choices again."], "n_events": 7, "n_shots": 1, "n_subjects": 4}, "vid": 680}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "453.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "At the entrance of a school, a group of students are walking out of the gates. A rope has been pulled up next to the gate, with relatives who came to pick up students being held back behind the rope. The closest to the camera is a boy in a blue outfit and a little girl in a white dress. The boy in blue looks back at the little girl, who watches another male student hopping out. When that student reaches the little girl, she opens her arms wide, and the student picks her up and carries her forward.", "text": "A young girl in a white dress with a colorful skirt stands in front of a traffic cone, holding a yellow rope, while a boy in a blue shirt and backpack stands beside her. A group of students in green and gray uniforms walk towards the girl from the background. The girl raises her hands and reaches out to the approaching students. One of the students, wearing a green jacket and gray pants, notices the girl and bends down to pick her up. The student lifts the girl and carries her away, while the other students continue walking in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 453, "extra_info": {"duration": 9.2, "events": ["Students walk out of school gates.", "A rope has been pulled up.", "Relatives wait behind the rope.", "Boy in blue looks back.", "Little girl watches another student.", "Student hops out of school.", "Little girl opens her arms wide.", "Student picks little girl up.", "Student carries little girl forward."], "n_events": 9, "n_shots": 1, "n_subjects": 3}, "vid": 453}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "698.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, it starts with darkness, then from the top it begins to turn red, as the red area gradually expands, the screen begins to brighten. Soon after, the sun emerges from the dark clouds, illuminating the sky.", "text": "The video begins with a dark sky and the 'video' logo displayed prominently in the center. As the camera moves upward, the sky starts to show a faint orange hue at the top. The orange color becomes more pronounced, and the sun begins to emerge from behind the clouds. The sun continues to rise, becoming more visible and brighter, illuminating the sky with a vibrant orange glow. The sun is now fully visible, casting a bright light across the sky, with the 'video' logo still in the center. The video ends with the sun fully risen, filling the sky with a warm, orange light, and the 'video' logo remains in the center.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 698, "extra_info": {"duration": 11.6, "events": ["Darkness envelops the screen.", "Red color begins to appear from the top.", "The red area expands gradually.", "Screen starts to brighten.", "Sun emerges from dark clouds.", "Sun illuminates the sky."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 698}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "544.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two athletes are riding forward on their bicycles when the person on the right suddenly falls to the left, knocking the person on the left down to the ground. The camera then cuts to an athlete bent over, repairing the bicycle.", "text": "Two cyclists in red jerseys are riding closely together on a road lined with spectators behind barriers. The cyclist on the left loses control and falls off his bike, sliding on the ground. The other cyclist continues riding forward. The fallen cyclist gets up and starts inspecting his bike, checking the front wheel and other parts. The scene then switches to a group of cyclists riding in a line on a road.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 544, "extra_info": {"duration": 8.5, "events": ["Two athletes ride forward on bicycles.", "Person on right falls to the left.", "Person on left is knocked down to the ground.", "Athlete bends over.", "Athlete repairs the bicycle."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 544}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "127.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A bald little boy walks out of a house with a red door, the scene cuts to a bus stop where a boy with thinning hair speaks to the approaching bald boy.", "text": "A yellow house with a red door and a black roof is shown with a small cloud in the sky. A boy in a red shirt opens the door and steps out of the house, holding a bag. The boy walks down the steps and onto the sidewalk. The scene shifts to a boy in an orange shirt standing on the sidewalk near a 'no parking' sign, smiling. The boy in the red shirt walks towards the boy in the orange shirt, who turns his head to look at him. The boy in the red shirt continues walking towards the boy in the orange shirt, who is still smiling. The scene shifts back to the boy in the red shirt standing in front of the house, holding a bag.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 127, "extra_info": {"duration": 6.12, "events": ["Bald little boy walks out of a house.", "Boy with thinning hair speaks to the bald boy."], "n_events": 2, "n_shots": 2, "n_subjects": 2}, "vid": 127}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "978.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video the driver is driving normally, the snow on the windshield may affect the line of sight, in the intersection appeared a car driving with lights on and this car is turning. The oncoming car with lights on should stop and wait the other car to pass through according to traffic rules but it didn't. The vehicle shooting video had no time to brake , and the two cars is about to have an accident at the cross.", "text": "The video shows a car driving on a snow-covered road at night, with buildings and streetlights visible on both sides. As the car continues to move forward, the headlights of an oncoming vehicle become visible in the distance. The oncoming vehicle's headlights grow brighter and more intense as it approaches. The car continues to drive towards the oncoming vehicle, which is now very close. The oncoming vehicle's headlights are now very bright, illuminating the entire road ahead. The car passes the oncoming vehicle, which is now very close and its headlights are still very bright. The car continues to drive past the oncoming vehicle, which is now out of the frame.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 978, "extra_info": {"duration": 7.0, "events": ["Driver drives with snow on windshield", "Snow affects line of sight", "Car at intersection turns with lights on", "Oncoming car with lights on breaks traffic rule", "Oncoming car with lights on fails to stop", "Vehicle recording video approaches the intersection", "Driver in video has no time to brake", "Two cars are about to collide at the cross"], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 978}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "205.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two men sit in the chair. A boy and a girl enter the scene. A man beckons the boy over and talks to him.", "text": "Two men are sitting on a porch, one in a suit and the other in a vest, with a drink on the man's lap. A girl in a blue dress runs out of the house and opens the door. She steps out and closes the door behind her. A boy in a blue shirt and suspenders walks towards the man in the vest. The man in the vest turns his head to look at the boy and extends his hand to pat the boy's shoulder. The man in the suit also turns his head to look at the boy.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 205, "extra_info": {"duration": 8.22, "events": ["Two men sit in chairs.", "A boy and a girl enter the scene.", "A man beckons the boy over.", "The man talks to the boy."], "n_events": 4, "n_shots": 2, "n_subjects": 4}, "vid": 205}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "143.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man in a red outfit is posed with one hand behind him and the other over his chest, leans forward, and then straightens his body with one hand behind and the other placed in front of his stomach. A blonde woman comes over and supports the man's arm. A short woman walks past the camera toward the man in the red outfit and the blonde woman, and both the man in red and the blonde woman look towards the short woman.", "text": "In a grand hall with a detailed backdrop of a city and mountains, a man in a purple suit with a yellow vest and a woman in a blue outfit stand side by side. The man raises his right hand while the woman looks at him. The woman quickly moves to the man's side and grabs his arm. She then pulls him back, causing him to lean forward. The woman continues to hold the man's arm, and they both look towards a small figure with a large, round head that is walking towards them. The small figure, dressed in blue, approaches the man and woman, who turn their heads to look at it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 143, "extra_info": {"duration": 4.12, "events": ["Man in red outfit leans forward.", "Man in red outfit straightens his body.", "Blonde woman supports man's arm.", "Short woman walks towards the man and blonde woman.", "Man in red and blonde woman look towards short woman."], "n_events": 5, "n_shots": 1, "n_subjects": 3}, "vid": 143}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "312.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A female police officer with glasses strides toward the camera, swinging a baton. She stops in front of a cell, turns around to look inside, where a girl dressed in white is facing the wall, curled up and lying on the floor.", "text": "A man wearing a beige uniform and sunglasses walks down a hallway holding a baton in his right hand. He approaches a cell door and stops in front of it. Inside the cell, a person is lying on the floor, covered in feathers, with a pillow and some scattered items nearby. The man looks at the person inside the cell, then turns his head to look down.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 312, "extra_info": {"duration": 5.75, "events": ["A female police officer strides toward the camera.", "The female police officer swings a baton.", "The female police officer stops in front of a cell.", "The female police officer turns around.", "A girl lies on the floor facing the wall."], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 312}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "825.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A black and brown cat is seen standing upright in a room, with a transparent, plastic storage box in front of it. The box contains brown food and has a blue lid. The cat reaches inside the box with its right paw, takes out some food, and proceeds to eat it.", "text": "A cat with a collar is standing on its hind legs, using its front paws to dig into a blue and transparent litter box filled with litter. The cat continues to dig and move its head around inside the litter box, seemingly searching for something. The background shows a tiled floor and a white door, with a blue mat and a green bowl placed next to the litter box. Throughout the video, there are various overlays and text, including a 'SUBSCRIBE' button, a 'Get Your Daily Dose' section, and a 'Lovers' section promoting funny pet videos.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 825, "extra_info": {"duration": 6.0, "events": ["The cat stands upright.", "The cat reaches inside the box.", "The cat takes out some food.", "The cat eats the food."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 825}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "710.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The screen transitions from flickering to a normal state, divided into many small squares by black bars, with each square containing an eye of a different color. Each eye is moving, and the camera gradually focuses on a yellow eye, before the screen reverts to a flickering state again.", "text": "The video begins with a black and white abstract pattern. The scene transitions to a grid of various eye images, including different colors and conditions. The camera zooms in on a specific eye image, focusing on the details of the eye. The zoom continues, bringing the eye image into closer view. The video ends with the eye image becoming blurred and transitioning back to the abstract pattern.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 710, "extra_info": {"duration": 12.54, "events": ["Screen transitions from flickering to normal state.", "Black bars divide screen into small squares.", "Each square contains a moving eye.", "Eyes display different colors.", "Camera focuses on a yellow eye.", "Screen reverts to a flickering state."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 710}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "34.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A stone fell into the pond, splashing water, and the dinosaur by the riverbank turned its head to look back.", "text": "The scene begins with a calm pond surrounded by green plants and a rocky shore. Suddenly, a stone is thrown into the pond, creating a splash and ripples on the water's surface. The camera then shifts to a green dinosaur standing on a rocky ground with grass in the background. The dinosaur, initially looking down, gradually raises its head and looks upwards.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 34, "extra_info": {"duration": 3.88, "events": ["A stone fell into the pond.", "Water splashed.", "The dinosaur turned its head.", "The dinosaur looked back."], "n_events": 4, "n_shots": 2, "n_subjects": 1}, "vid": 34}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "321.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man and a woman are sitting on the bed in the cabin, the woman looks at the man, the man looks forward, the camera then cuts to the corridor, the woman walks ahead, the man comes down from a distant staircase and looks towards the woman.", "text": "In a small, well-lit room with a desk, chair, and various office supplies, a man in a blue uniform and red cap sits on a bed, looking down at his hands, while a woman in a green jacket stands beside him, also looking down. The room has a sink, a fire extinguisher, and a 'Caution' sign on the door. The scene then shifts to a hallway where a woman with a bandaged hand and a bag walks forward, talking, while a man in a blue uniform descends a staircase in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 321, "extra_info": {"duration": 4.25, "events": ["A man and a woman sit on the bed.", "The woman looks at the man.", "The man looks forward.", "The woman walks in the corridor.", "The man comes down the staircase.", "The man looks towards the woman."], "n_events": 6, "n_shots": 2, "n_subjects": 2}, "vid": 321}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "183.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The soldier takes a deep breath and runs forward. The camera switches to a woman assuming a combat stance. It then cuts to the soldier flying past the woman, who opens her mouth in surprise. The camera switches and zooms in, focusing on the woman's eyes widening and her brows slightly furrowing. It then switches to an object flying in and sticking into a wooden stake.", "text": "A gingerbread soldier stands on a stage with a pink and white background, looking surprised. The soldier then starts to move forward. The camera zooms out to reveal a woman in a purple dress and a man in a green outfit with a crown, surrounded by other gingerbread characters. The soldier is shot by a cannon, flying through the air towards the woman. The woman looks shocked and turns her head to the side. The camera zooms in on the woman's face as she watches the soldier fly past her. The camera then shows the soldier hitting a wooden post.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 183, "extra_info": {"duration": 7.12, "events": ["The soldier takes a deep breath.", "The soldier runs forward.", "The woman assumes a combat stance.", "The soldier flies past the woman.", "The woman opens her mouth in surprise.", "The camera focuses on the woman's eyes.", "The woman's brows furrow slightly.", "An object flies in.", "The object sticks into a wooden stake."], "n_events": 9, "n_shots": 5, "n_subjects": 2}, "vid": 183}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "646.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A fixed camera shot displays a coast and the distant sunset, with the sun dyeing the sky and clouds in shades of orange. As the sun gradually sets, the clouds slowly ascend, and the shades of the clouds and sky change, while the water's surface, constantly rippling, reflects various colors in the sunlight.", "text": "The video begins with a bright sun shining over the sea, casting a golden reflection on the water. As the sun starts to set, the sky gradually changes color, transitioning from bright yellow to deeper shades of orange and red. The sun continues to descend, moving closer to the horizon, and the sky becomes more saturated with warm hues. Eventually, the sun dips below the horizon, leaving a vibrant orange and pink sky. The scene concludes with the sun completely set, and the sky filled with the remnants of the sunset, creating a serene and picturesque view.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 646, "extra_info": {"duration": 16.08, "events": ["The sun sets on the horizon.", "The sky turns shades of orange.", "Clouds gradually ascend in the sky.", "The clouds change colors.", "Water's surface ripples continuously.", "Reflected sunlight colors the water."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 646}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "363.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man reached out to take the object handed over by the person in front of him, examined it for a bit, then put it in his mouth and bit down, looked at it again, and then raised his head to look forward.", "text": "In a room filled with shelves and various electronic equipment, a man in a white shirt with a patterned collar receives a small item from another person. He takes the item and begins to examine it closely. The man then brings the item to his mouth and tastes it. After tasting, he looks up and starts speaking to the person in front of him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 363, "extra_info": {"duration": 6.5, "events": ["The man reached out to take the object.", "Person in front of the man handed over the object.", "The man examined the object.", "The man bit down on the object.", "The man looked at the object again.", "The man raised his head to look forward."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 363}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "220.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The mechanical arm reaches into the water, attempting to retrieve glasses that emit a yellow glow. The camera cuts to a different angle from a camera, then switches back to the mechanical arm. In the water with the yellow glow, the mechanical arm crushes the goggles and slowly withdraws from the water.", "text": "In a brightly lit environment with yellow lights, a pair of robotic arms are seen manipulating a transparent object. The arms move the object around, adjusting its position and orientation. The camera then shifts focus to a close-up of a machine with a circular component and a light source, showing the machine's internal mechanism. The camera returns to the robotic arms, which continue to manipulate the object, breaking it into smaller pieces.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 220, "extra_info": {"duration": 9.75, "events": ["The mechanical arm reaches into the water.", "The mechanical arm attempts to retrieve the glasses.", "The glasses emit a yellow glow.", "The mechanical arm crushes the glasses.", "The mechanical arm withdraws from the water."], "n_events": 5, "n_shots": 3, "n_subjects": 1}, "vid": 220}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "433.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person is holding a basket of mangroves, washing shells in the creek. He picks up the basket, and the camera cuts to him placing the basket on a table, picking up a shell and putting it on a wooden board. Standing on a rope, he presses down to secure the shell and saws it into long strips. The camera captures the man sawing the shell from different angles, then switches to a frontal view of the man as he pulls the saw vigorously. Afterward, it switches to him picking up one of the long strips and sawing it into a square shape.", "text": "A basket filled with oysters is placed in water, and the person lifts it out, causing water to splash. The person places the basket on a table and begins to open the oysters. The person then picks up a stone and places it on a wooden block, securing it with a rope. The person uses a saw to cut the stone into smaller pieces. The cut stone pieces are placed on the ground, and the person ties the stone pieces together with a rope. The person then uses the saw to cut the tied stone pieces.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 433, "extra_info": {"duration": 15.03, "events": ["Person holds basket of mangroves.", "Person washes shells in creek.", "Person picks up the basket.", "Person places basket on table.", "Person picks up a shell.", "Person puts shell on wooden board.", "Person stands on rope.", "Person secures shell with rope.", "Person saws shell into long strips.", "Person cuts long strip into square shape."], "n_events": 10, "n_shots": 6, "n_subjects": 1}, "vid": 433}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "3.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Several animals jumped into a yellow pipe one by one, and the last brown bird jumped directly off the platform.", "text": "A hawk, a mouse, and a cat are perched on the edge of a building, with a yellow trash can below them. The mouse and the cat are seen rummaging through the trash can, while the hawk watches them. The mouse and the cat continue to search through the trash can, and the hawk remains observant. The mouse and the cat find something in the trash can and start to move away. The hawk turns its head to look around, then turns back to the trash can. The hawk jumps into the trash can, joining the mouse and the cat.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 3, "extra_info": {"duration": 4.88, "events": ["Animals jumped into a yellow pipe", "A brown bird jumped off the platform"], "n_events": 2, "n_shots": 1, "n_subjects": 6}, "vid": 3}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "156.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows many people dining in a restaurant. The camera cuts to a table where a brown-haired man takes a hamburger passed to him by a server and then moves forward to pick up a plate of vegetables. Behind him, a blond-haired man turns around and frantically starts eating fries from the plate of a woman behind him. Then he snatches the plate out of the woman's hands, and she looks at him in shock.", "text": "In a busy cafeteria, people are seen queuing at the dessert, grill, and sides counters, with some already seated and eating. A woman in a green dress hands a tray with a burger and fries to a man in a blue shirt, while another woman in a blue shirt hands a tray with a salad to a man in a black shirt. The man in the blue shirt turns to the left and picks up a salad from the counter, while the man in the black shirt turns to the right and picks up a salad from the counter. The woman in the green dress turns to look at the man in the blue shirt, who is now holding the tray with the burger and fries.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 156, "extra_info": {"duration": 5.75, "events": ["People dine in a restaurant.", "A brown-haired man receives a hamburger from a server.", "The brown-haired man picks up a plate of vegetables.", "A blond-haired man turns around.", "The blond-haired man frantically eats fries.", "The blond-haired man snatches a plate from a woman.", "The woman looks at the blond-haired man in shock."], "n_events": 7, "n_shots": 2, "n_subjects": 4}, "vid": 156}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "234.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman is riding a horse; a man comes driving a carriage; the woman dismounts from the horse; a man in front of the crowd moves and watches her; the woman runs towards the back of the carriage.", "text": "A woman in a yellow dress rides a white horse in front of a stone building, with a black horse standing nearby. A man in black, sitting on a cart, looks ahead and then turns his head to the right. The black horse moves forward, and the crowd holding torches watches them. The woman dismounts from the white horse and runs towards a cart. A man in a red coat walks forward, and the woman runs towards the cart. The woman reaches the cart and prepares to get on it. The man in black, still on the cart, looks ahead.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 234, "extra_info": {"duration": 10.62, "events": ["A woman rides a horse.", "A man drives a carriage.", "The woman dismounts from the horse.", "A man moves in front of the crowd.", "The man watches the woman.", "The woman runs towards the carriage's back."], "n_events": 6, "n_shots": 7, "n_subjects": 6}, "vid": 234}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "683.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A vehicle moves forward along a winding road, with the camera following it. Gradually, the camera catches up and overtakes the vehicle to film another car at a fork in the road ahead.", "text": "A white convertible car is driving on a winding road surrounded by grassy hills and trees under a cloudy sky. The camera follows the car from behind, gradually moving to the side and then to the front as the car continues along the road. The car maintains a steady speed as it navigates the curves of the road, with the camera capturing different angles of the vehicle and the surrounding landscape.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 683, "extra_info": {"duration": 20.72, "events": ["A vehicle moves forward along a winding road.", "The camera follows the vehicle.", "The camera catches up with the vehicle.", "The camera overtakes the vehicle.", "The camera films another car.", "Another car waits at a fork in the road ahead."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 683}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "597.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A monkey sits in front of the camera, with its hands placed on a plate, looking upwards. Some colorful candies are poured in from the right side. The monkey's gaze follows the movement of the candies, then it reaches out to grab the candies and puts them into its mouth to eat, occasionally looking up. The camera remains fixed.", "text": "A monkey wearing a light blue shirt with a Superman logo sits in front of a white plate against a yellow background. The monkey looks around and then a hand appears, pouring colorful candies into the plate. The monkey starts picking up the candies and eating them. After eating some, the monkey looks up and then back down at the plate.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 597, "extra_info": {"duration": 4.35, "events": ["Monkey sits in front of the camera.", "Monkey places hands on the plate.", "Colorful candies are poured onto the plate.", "Monkey's gaze follows the movement of the candies.", "Monkey reaches out to grab the candies.", "Monkey puts candies into its mouth.", "Monkey eats the candies.", "Monkey occasionally looks up."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 597}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "478.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a challenge-based show, a man dressed in a green short-sleeved white T-shirt prepares to start at the starting point. After raising his hand to signal, he quickly jumps onto the treadmill, runs swiftly across it, then nimbly and agilely sprints over one obstacle after another.", "text": "A man stands on a wooden platform by a lake, preparing to jump onto a moving platform. He steps onto the moving platform and begins to run forward. The man continues running across the platform, which is elevated above the water. He approaches a series of colorful obstacles and jumps onto a red and yellow log. The man balances on the log and continues running towards a blue structure with a basketball hoop. He reaches the end of the log and prepares to jump onto the blue structure.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 478, "extra_info": {"duration": 15.08, "events": ["Man signals to start", "Man jumps onto treadmill", "Man runs across treadmill", "Man sprints over obstacles"], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 478}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "973.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "At home, there are two kids sitting on the sofa, both with blue shirt and curly black hair. The loser of them, who is on the left side wears blue tape on the head, a necklace, and a watch on their right hand. The kids are playing rock-paper-scissors, so both of them show the gestures of scissors after hitting their left hands with their right hands for 3 times, and then for the fourth time, the kid on the right shows the gesture of scissors while the kid on the left shows the paper. Then the kid on the right is happy and the kid on the left with the necklace is getting upset because the kid on the right won the game of rock-paper-scissors after 3 rounds draw. Then the other kid started smiling.", "text": "Two children are sitting on a couch in a living room with a staircase in the background, both wearing blue shirts and one with a blue headband. They start playing a hand game, pointing at each other and moving their hands in sync. The child with the headband gestures with their hands while the other child continues to play. The child with the headband raises their hands, and the other child continues the game. The child with the headband makes a gesture with their hands, and the other child responds with a similar gesture. The child with the headband continues to gesture, and the other child follows along. The child with the headband makes a final gesture, and the other child responds with a similar gesture. The child with the headband makes a final gesture, and the other child responds with a similar gesture. The child with the headband makes a final gesture, and the other child responds with a similar gesture.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 973, "extra_info": {"duration": 8.98, "events": ["Two kids sit on the sofa.", "Both kids hit left hands with right hands three times.", "Kids show scissors gesture simultaneously.", "Kid on the right shows scissors again.", "Kid on the left shows paper.", "Kid on the right becomes happy.", "Kid on the left with a necklace gets upset.", "Kid on the right starts smiling."], "n_events": 8, "n_shots": 7, "n_subjects": 2}, "vid": 973}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "812.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows an individual wearing a black, woolly sun-protective sleeve unlocking their car and shutting the door from the inside. Subsequently, they hold a piece of paper with a QR code. The top section of the door's interior is grey, and the bottom is black. A car key with a red plastic ring and two attached cards is visible. The key is primarily black, with a smidgen of red in the middle.", "text": "A woman, wearing a mask and a black vest over a white shirt, walks out of a closet into a living room with a dog following her. She removes her shoes and places them on the floor. The scene then shifts to a close-up of a hand inserting a key into a car's ignition and turning it. The hand then inserts a keychain with a tag labeled 'n' into the ignition. The hand pulls the keychain out of the ignition. The scene changes to a hand closing the car door. Finally, the hand holds a piece of paper with a QR code and contact information for 'Ashley Mayne' from 'Fulham FC'.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 812, "extra_info": {"duration": 5.99, "events": ["Individual wearing sun-protective sleeve unlocks the car", "Individual shuts the car door from the inside", "Individual holds a piece of paper with a QR code", "Car key with red plastic ring becomes visible"], "n_events": 4, "n_shots": 4, "n_subjects": 2}, "vid": 812}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "731.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman is holding a mobile phone and wearing headphones while dancing. During this time, she taps on the phone screen a few times and then raises her right hand and waves it as she continues to dance.", "text": "A woman with curly hair, wearing a white sweater and black top, holds a smartphone in her right hand and turns around while looking at the phone. She puts on wireless earbuds and starts dancing, moving her arms and body rhythmically. She continues to dance, raising her arms and moving her head and body in various directions. The woman then looks at her phone again, possibly checking something, before resuming her dance.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 731, "extra_info": {"duration": 28.2, "events": ["A woman holds a mobile phone.", "A woman wears headphones.", "A woman dances.", "A woman taps on the phone screen.", "A woman raises her right hand.", "A woman waves her right hand."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 731}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "23.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A blue car was being chased by a police car on a winding road at night. At an intersection, another police car came and joined the pursuit. The newly joined police car collided with another police car behind it, and a hubcap flew towards the camera. The other police cars continued to chase the blue car. The camera then cuts to the interior of the car, showing the passenger side where a cow, resting its hand on the backrest, turns its head to look at and speak to the camera, with the rearview mirror reflecting the police cars.", "text": "At night, a car is seen driving on a dirt road surrounded by green fields and wooden fences, pursued by several police cars with flashing lights. The police cars chase the car closely, causing it to swerve and create a cloud of dust. The car continues to evade the police, making sharp turns and changing lanes to avoid being caught. The scene then shifts to the interior of the car, revealing a cow driving and turning its head to speak.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 23, "extra_info": {"duration": 5.0, "events": ["A blue car was chased by a police car on a winding road.", "At an intersection, another police car joined the pursuit.", "The newly joined police car collided with another police car.", "A hubcap flew towards the camera.", "The other police cars continued the chase.", "A cow rested its hand on the backrest.", "The cow turned its head to look at the camera.", "The cow spoke to the camera.", "The rearview mirror reflected the police cars."], "n_events": 9, "n_shots": 2, "n_subjects": 1}, "vid": 23}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "188.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "SpongeBob reached out and said something, ran forward, the camera switched to SpongeBob opening the door and saying something, then he closed the door, took a step forward, and Patrick Star widened his eyes.", "text": "SpongeBob, wearing a red shirt and tie, is seen talking and gesturing with his hands while standing next to a large man holding a spatula. SpongeBob then transforms into a stream of bubbles and disappears. The scene shifts to a pink character with a sad expression, who is watching the door. The door opens, revealing SpongeBob looking shocked and scared. The pink character turns its head to look at SpongeBob. SpongeBob then closes the door and starts to walk away.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 188, "extra_info": {"duration": 4.62, "events": ["SpongeBob reached out", "SpongeBob said something", "SpongeBob ran forward", "SpongeBob opened the door", "SpongeBob said something", "SpongeBob closed the door", "SpongeBob took a step forward", "Patrick Star widened his eyes"], "n_events": 8, "n_shots": 2, "n_subjects": 2}, "vid": 188}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "640.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person holds a pair of tweezers, twirling them in a stainless steel plate full of food, rolling up a bundle of food, and then placing the rolled-up food onto a white plate nearby, using their hand to push the food off the tweezers onto the plate.", "text": "A person is seen using tongs to lift a portion of yellow pasta from a pan filled with various ingredients, including shrimp, squid, and red peppers, and transferring it onto a white plate. The person continues to add more pasta from the pan onto the plate, ensuring an even distribution. The plate is gradually filled with the pasta, and the person adjusts the placement to make sure it is neatly arranged. The final frame shows the plate with a neatly arranged portion of the pasta, ready to be served.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 640, "extra_info": {"duration": 6.88, "events": ["Person holds tweezers.", "Person twirls tweezers in plate.", "Food rolls up into bundle.", "Person places rolled-up food on white plate.", "Person uses hand to push food off tweezers."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 640}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "323.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, a woman is using clippers to cut her hair. She pushes the clippers to the back of her head, turns to look at the screen, then faces the screen and smiles slightly, before putting the clippers on her head again.", "text": "In a room with a closed door and a towel hanging in the background, a woman uses her left hand to hold her hair up while her right hand holds an electric clipper. She begins trimming the hair on the top of her head, moving the clipper upwards. She continues to trim the hair, moving the clipper downwards. The woman then turns her head slightly to the side and smiles. She resumes trimming the hair on the top of her head, moving the clipper upwards again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 323, "extra_info": {"duration": 6.0, "events": ["A woman uses clippers to cut her hair.", "The woman pushes the clippers to the back of her head.", "The woman turns to look at the screen.", "The woman faces the screen.", "The woman smiles slightly.", "The woman puts the clippers on her head again."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 323}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "236.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The boy runs in the woods, shaking his head, looking around. The camera follows the movement, then cuts to a blonde woman speaking beside a tree, turning and swaying her body.", "text": "A boy is running through a dense forest with tall trees and lush greenery, wearing a brown and orange shirt and beige pants. He continues running, looking around as he moves forward. The camera then shifts to a girl leaning against a tree, wearing a floral shirt, red pants, and colorful arm sleeves. She turns her head to the left and starts speaking.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 236, "extra_info": {"duration": 4.0, "events": ["The boy runs in the woods.", "The boy shakes his head.", "The boy looks around.", "The blonde woman speaks beside a tree.", "The blonde woman turns.", "The blonde woman sways her body."], "n_events": 6, "n_shots": 2, "n_subjects": 2}, "vid": 236}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "36.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man looks up in surprise, gazes upward, then turns and runs away. The wall behind him collapses, and the man leaps to avoid it but is then struck down to the ground by the debris.", "text": "A man is seen running forward in a room filled with floating blue crystals. He looks up and then raises his hands in fear. The man continues to run, but the crystals start to move towards him. He jumps and performs a flip in the air, narrowly avoiding the moving crystals. The man lands on the ground and rolls to avoid the falling crystals. He then stands up and looks ahead, appearing relieved.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 36, "extra_info": {"duration": 7.0, "events": ["Man looks up in surprise.", "Man gazes upward.", "Man turns.", "Man runs away.", "Wall collapses.", "Man leaps.", "Debris strikes man to the ground."], "n_events": 7, "n_shots": 2, "n_subjects": 1}, "vid": 36}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "252.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A little girl holding a pink umbrella angrily yells in the direction of the camera, then turns around and walks away. The camera switches to the opposite side, and a man walks away helplessly, with two other people passing by beside him.", "text": "A girl holding a pink umbrella stands in front of a fountain, speaking while looking forward. She then turns around and starts walking away from the camera, still holding the umbrella. The scene shifts to a man in a brown shirt and striped pants standing with his hands in his pockets, looking to the left. A man in a suit and a woman in blue scrubs walk towards him from the left side of the frame. The man in the brown shirt turns around to face them as they approach.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 252, "extra_info": {"duration": 4.5, "events": ["A little girl yells in the direction of the camera.", "A little girl turns around.", "A little girl walks away.", "A man walks away helplessly.", "Two people pass by the man."], "n_events": 5, "n_shots": 2, "n_subjects": 3}, "vid": 252}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "151.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A boy with a head ornament clenched his fists, said something, and then stood up. The camera cuts to a room where a pumpkin-headed figure stood up. He swayed from side to side, and explosions occurred beneath him. The camera then switches to a frontal shot of an old witch; explosions happen behind her, emitting purple smoke, as the pumpkin head is set ablaze, and the old woman looks back in horror.", "text": "A boy, wearing a woven hat, looks ahead with a concerned expression and speaks while holding his collar. The camera then moves downward, revealing the boy's purple and yellow outfit. The scene shifts to a room with a wooden door and shelves filled with various items, where a pumpkin-headed character emerges from the ground, shaking its head and causing purple energy to appear. The pumpkin-headed character then turns and walks away, while a man with gray hair and a red hat, who is in the same room, turns to look at the pumpkin-headed character.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 151, "extra_info": {"duration": 5.0, "events": ["A boy clenched his fists.", "A boy said something.", "A boy stood up.", "A pumpkin-headed figure stood up.", "The pumpkin-headed figure swayed from side to side.", "Explosions occurred beneath the pumpkin-headed figure.", "An old witch was shown from the front.", "Explosions emitted purple smoke behind the old witch.", "The pumpkin head was set ablaze.", "The old witch looked back in horror."], "n_events": 10, "n_shots": 3, "n_subjects": 3}, "vid": 151}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "45.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera zooms in on a group photo and a hand turning off a switch. The woman closes the door, and glowing stars shine through the darkness of the room. The camera pans around the room, and the girl takes out a glowing object from under the blanket.", "text": "A photo of two animated characters is taped to a wooden door. A hand reaches out and turns off a light switch on a wall. A woman opens the door and enters a dimly lit room where a girl is lying in bed. The room is dark, and the walls are decorated with star-shaped lights. The girl looks up at the ceiling, which is covered with star-shaped lights. She then turns off the light, and the room becomes darker. The girl, now in a sleeping position, holds a glowing toy in her hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 45, "extra_info": {"duration": 13.25, "events": ["Camera zooms in on a group photo", "Hand turns off a switch", "Woman closes the door", "Glowing stars shine through darkness", "Camera pans around the room", "Girl takes out glowing object from under blanket"], "n_events": 6, "n_shots": 5, "n_subjects": 2}, "vid": 45}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "914.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two players, one in red jerseys representing Hungary and the other in yellow jerseys representing Brazil, are engaged in a game of teqball. The player in the red jersey and white shoes lifts the ball into the air, while his teammate, wearing a red jersey and blue shoes, deftly controls the ball with his foot and redirects it into the opponent's area. The ball bounces twice on the yellow team's side, thereby earning a point for the red team. In celebration, the player in the red jersey and blue shoes pumps his fists in triumph and embraces his teammate. He then quickly shakes his head and moves his fingers in excitement.", "text": "The video shows a foosball match between two teams, one in yellow and the other in red, with a referee in white standing in the center. The yellow team scores a goal, and the red team attempts to return the ball. The red team player kicks the ball towards the goal, but it is deflected by the yellow team. The referee signals the goal, and the yellow team celebrates. The camera then focuses on the red team player, who smiles and makes a gesture with his hand. The player then looks down, and the video ends with a close-up of the player's face.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 914, "extra_info": {"duration": 6.97, "events": ["Player in red jersey lifts the ball into the air", "Teammate in red jersey controls the ball with foot", "Teammate redirects the ball into the opponent's area", "Ball bounces twice on the yellow team's side", "Red team earns a point", "Player in red jersey and blue shoes pumps fists in triumph", "Player in red jersey and white shoes embraces teammate", "Player in red jersey and blue shoes shakes head", "Player in red jersey and blue shoes moves fingers in excitement"], "n_events": 9, "n_shots": 1, "n_subjects": 3}, "vid": 914}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "250.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman looks ahead and speaks, while the man with his head lowered turns to look to the side. The camera then cuts to the man moving quickly around the room, placing clothes into a suitcase, as the woman leans against the door watching him.", "text": "A woman with long hair is seen speaking while looking forward, with a background of wooden blinds and a beach view. A man, wearing a dark shirt, turns his head to look at the woman. The man then picks up a suitcase from the bed and starts packing clothes into it. The woman, now in a white top and shorts, watches the man as he continues to pack. The man throws the packed suitcase onto the floor and then picks up a bag from the bed. The woman continues to watch the man as he moves around the room.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 250, "extra_info": {"duration": 9.88, "events": ["The woman speaks.", "The man turns his head.", "The man places clothes into a suitcase.", "The woman leans against the door."], "n_events": 4, "n_shots": 8, "n_subjects": 2}, "vid": 250}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "306.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man first supports the chin of the woman in front of him with his left hand, then switches to his right hand, and then kisses the woman on her forehead.", "text": "In a dimly lit hallway with wall-mounted lamps and a patterned carpet, a man gently touches the woman's neck with his hand. He then moves his hand to her chin, holding it tenderly. The man leans in and kisses the woman's forehead while still holding her chin. After the kiss, he lowers his hand and continues to hold her neck.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 306, "extra_info": {"duration": 5.62, "events": ["Man supports woman's chin with his left hand.", "Man switches to his right hand.", "Man kisses woman on her forehead."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 306}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "872.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In this video clip, a blue vehicle is shot down while hovering over a desert. After it's hitting, the blue vehicle catches fire and emits smoke while it continues to traverse the desert.", "text": "A blue and black vehicle is seen in a desert environment, initially stationary near a large red arch. The vehicle begins to move forward, leaving a trail of red smoke behind it. As it continues to drive, it passes by several floating rings and other vehicles in the background. The vehicle maintains its speed and direction, moving further into the desert. The vehicle's speed increases, and it continues to drive across the sandy terrain, leaving a clear trail of red smoke. The vehicle approaches a more open area of the desert, with fewer obstacles and other vehicles visible. The vehicle continues to drive forward, heading towards a distant point in the desert.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 872, "extra_info": {"duration": 9.0, "events": ["A blue vehicle hovers over a desert.", "The blue vehicle is shot down.", "The blue vehicle catches fire.", "The blue vehicle emits smoke.", "The blue vehicle continues to traverse the desert."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 872}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "923.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video takes place in a living room where an adult with short hair, dressed in black shorts and a red T-shirt, descends a set of brown stairs. The adult is carrying a small monkey, which is decked out in white clothes, on a couch. Also in the living room are two long-haired children engaged in play. The kid on the right, dressed in a white sleeveless T-shirt and red shorts, has their hair styled in a ponytail and is seated on a yellow and blue toy car. The kid on the left, donned in a white sleeveless T-shirt along with white shorts, approaches on foot. Both kids are barefoot. Spotting the adult carrying the monkey downstairs, the children draw near. The one with the toy car drives over, while the other walks. They both halt on reaching the adult. The child dressed in a white sleeveless T-shirt and pink shorts shows a desire to extend a hug to the monkey.", "text": "In a spacious, well-furnished room with wooden floors, a man in a red shirt and black shorts descends a staircase while holding a baby. He wipes his forehead with his hand as he walks down. The man approaches two children who are playing with a green toy car. He bends down and interacts with the children, who are sitting on the toy car. The man then stands up and continues to watch the children as they play.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 923, "extra_info": {"duration": 5.0, "events": ["Adult descends brown stairs.", "Adult carries monkey.", "Child on toy car drives over.", "Child on foot approaches.", "Children halt near adult.", "Child in white and pink expresses desire to hug monkey."], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 923}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "845.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A monkey clad in a green suit climbs onto the lap of an individual wearing a blue T-shirt. Subsequently, the monkey leaps onto an orange chair before jumping onto the floor beside the person. After stumbling briefly, the monkey descends from the chair. As the monkey gradually moves away from the person, the individual gestures a wave towards the monkey. The monkey then hops down a short step and continues to saunter away. The monkey strides past a vibrant painting towards the exit.", "text": "A man sitting at a desk with a computer and a monitor places a baby on the floor in front of him. The man gestures with his hands as the baby starts to crawl away from him. The baby continues to crawl across the floor, moving past a desk and a large framed painting. The baby walks towards the door, passing by a table and a chair. The baby reaches the door and exits the room.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 845, "extra_info": {"duration": 12.0, "events": ["Monkey climbs onto individual's lap.", "Monkey leaps onto orange chair.", "Monkey jumps onto floor.", "Monkey stumbles briefly.", "Monkey descends from chair.", "Individual gestures a wave towards monkey.", "Monkey hops down a step.", "Monkey saunters away.", "Monkey strides past painting.", "Monkey heads towards exit."], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 845}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "922.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video begins with a visual of a white trash can filled with a bottle having a yellow cap, a black bag, and trash held by a person's hand. A text reading, \"Your bin is filled to the brim\" appears on the right. This text soon changes to \"But you are too lazy\". As the person's hand comes closer, the lid of the trash can opens wider, and the person throws their trash. The text on the right side now reads, \"What options do you have?\", which is eventually replaced by a text saying \"check out this\" against a white background. The latter part of the video shows a person pressing the trash can button which closes the trash lid, automatically securing the garbage bag inside. When the process is completed, the person retrieves the garbage bag and removes it from the trash can.", "text": "A hand is seen opening the lid of a white trash bin filled with various items, including a plastic bottle and a black object, while the text 'Your bin is filled to the brim' appears on the right side of the screen. The hand then closes the lid, and the text changes to 'But you are too lazy.' The screen goes white, and the text 'What options do you have?' appears. The hand opens the lid again, and the text 'Check out this' appears. The hand presses a button on the side of the bin, and the lid opens automatically, revealing a black plastic bag inside. The hand reaches into the bin and removes the black plastic bag. The screen goes white again, and the text 'By pressing one button, Townew deals with it all' appears.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 922, "extra_info": {"duration": 10.98, "events": ["Trash can lid opens wider", "Person's hand throws trash into the can", "Text on video changes", "Person presses trash can button", "Trash can lid closes automatically", "Person secures the garbage bag inside", "Person retrieves the garbage bag", "Person removes the garbage bag from the trash can"], "n_events": 8, "n_shots": 3, "n_subjects": 1}, "vid": 922}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "56.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Three children characters huddle their heads together. Then two characters walk off to the left side of the screen, and the character with the blue cape walks into a door and then closes it behind them.", "text": "In a room with green walls and a wooden door, a girl in a pink outfit with bunny ears holds the hand of a girl in a white outfit with a stethoscope, while a boy in a blue outfit with a crown and bunny ears stands nearby. The girl in pink releases the girl's hand and turns to walk away. The girl in white also turns and starts to walk away. The boy in the crown and blue outfit turns and walks towards the door. He opens the door and exits the room. The door closes behind him, leaving the room empty.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 56, "extra_info": {"duration": 5.38, "events": ["Three children characters huddle their heads together.", "Two characters walk off to the left side.", "Character with blue cape enters a door.", "Character with blue cape closes the door behind them."], "n_events": 4, "n_shots": 1, "n_subjects": 3}, "vid": 56}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "635.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person dressed in white clothing and wearing a white headscarf is distributing food to people on the left. After the people on the left receive the food, they walk away, and someone on the right brings out new food to give to the person in white clothing. They continue to hand out food to the new arrivals on the left.", "text": "A woman dressed in white, wearing a turban and gloves, is seen distributing food to people. She hands food to an elderly man and a man in a blue shirt, who are both wearing masks. The woman continues to distribute food to more people, including a man in a red jacket. She then turns to her left and continues distributing food to another person. The background shows a green tent and some trees, indicating an outdoor setting.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 635, "extra_info": {"duration": 8.71, "events": ["Person in white distributes food to people on the left.", "People on the left receive the food.", "People on the left walk away.", "Someone on the right brings out new food.", "Person in white receives new food.", "Person in white continues to hand out food.", "New arrivals on the left receive food."], "n_events": 7, "n_shots": 1, "n_subjects": 4}, "vid": 635}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "257.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man takes a drink from a glass, the camera cuts to him gesturing with his glass forward, the camera cuts to an overweight man opposite him also raising his glass, the camera cuts back to the first man as he continues to drink. Beside him, a woman smiles as she watches, the camera then cuts to the overweight man holding his glass and smiling at him, while many people behind the overweight man are also smiling as they watch them.", "text": "A man wearing a feathered headpiece drinks from a cup while sitting next to a woman. The woman claps her hands as the man continues to drink. The man then raises the cup and looks to his right, while the woman continues to clap. The camera shifts to show a man in a hat drinking from a cup and then lowering it. The man with the feathered headpiece smiles and looks at the man in the hat. The camera then focuses on the man in the hat, who is holding the cup and smiling.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 257, "extra_info": {"duration": 4.25, "events": ["A man takes a drink from a glass.", "A man gestures with his glass forward.", "An overweight man opposite him raises his glass.", "The first man continues to drink.", "A woman beside the first man smiles.", "The overweight man holds his glass and smiles.", "Many people behind the overweight man smile."], "n_events": 7, "n_shots": 5, "n_subjects": 4}, "vid": 257}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "751.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person holds a shovel, steps on it to turn over the soil in front of them, then moves one step to the right, and continues to turn over the ground.", "text": "A person wearing white sneakers and jeans is seen using a shovel to dig into the soil. The person repeatedly inserts the shovel into the ground, lifts it, and throws the soil to the side. The background consists of a plowed field with clumps of soil scattered around. The person continues this action multiple times, moving the shovel and soil in a consistent manner.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 751, "extra_info": {"duration": 17.95, "events": ["Person holds a shovel.", "Person steps on the shovel.", "Shovel turns over the soil.", "Person moves one step to the right.", "Person continues to turn over the ground."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 751}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "8.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman moves around a man and drags him into a door. A blue cat-like observer is watching aside.", "text": "In a room with purple lighting and floral wallpaper, a red-headed character with glasses and a long tongue makes a gesture with his hand. A black-haired character approaches from behind and grabs the red-headed character's hand, pulling him along. The black-haired character continues to lead the red-headed character towards a door, guiding him with her hands on his shoulders. They move towards the door, and the black-haired character pushes the red-headed character through the doorway. The red-headed character is pushed out of the room, and the door closes behind them.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 8, "extra_info": {"duration": 9.54, "events": ["A woman moves around a man.", "A woman drags the man into a door.", "A blue cat-like observer watches."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 8}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "800.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Multiple screens arranged in a matrix are moving up and down within the frame, each screen displaying different content that is constantly changing.", "text": "The video begins with a series of colorful, abstract shapes and patterns that resemble video frames, with various symbols and numbers overlaid on them. As the video progresses, the shapes and patterns start to blur and overlap, creating a dynamic and chaotic visual effect. The background remains consistent with a mix of bright colors and abstract designs, while the symbols and numbers continue to shift and change positions. The video maintains this abstract and chaotic visual style throughout, with no clear narrative or recognizable objects.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 800, "extra_info": {"duration": 14.25, "events": ["Multiple screens are moving up.", "Multiple screens are moving down.", "Each screen displays different content.", "Content on screens is constantly changing."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 800}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "358.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "One person picks a fruit, puts it in their mouth to eat, and the camera pans upward.", "text": "A man wearing fingerless gloves and multiple bracelets reaches into a bush and picks a black berry. He holds the berry in his gloved hand and brings it closer to his mouth. The man then puts the berry into his mouth and starts chewing. He looks to his left and begins to speak. In the background, another person is seen moving.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 358, "extra_info": {"duration": 5.38, "events": ["Person picks a fruit.", "Person puts fruit in mouth.", "Person eats the fruit.", "Camera pans upward."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 358}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "871.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, a long-haired blonde individual is in a household bathroom. They are garbed in a gray outfit. To the left of the frame is a partially ajar door. The room itself exudes a white ambiance. Behind the person is a shower, with a towel draped on the wall to their right. They are holding a white toothbrush and are in the act of brushing their teeth. The caption \"DAY 1 OF MY CANDIDA CLEANSE\" appears at the bottom. Subsequently, they comb their lengthy hair using a black comb with both hands. Following this, the scene transitions to a bedroom where we see the individual asleep in a bed draped with white sheets, positioned on the right side. Center screen, a hand gestures a half-heart symbol. The scene then changes again, and the individual is seen standing in a kitchen, holding a yellow lemon squeezer and extracting juice from a lemon into a glass.", "text": "A woman in a gray tank top holds a pair of scissors and begins her day with a candia cleanse. She brushes her teeth with a white toothbrush. She then ties her hair up and starts exercising with a small dumbbell. The scene shifts to her making a hand gesture in front of a bed where a person is sleeping. She then peels a banana and places it into a glass jar on a kitchen counter. Finally, she holds the scissors again, continuing her candia cleanse.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 871, "extra_info": {"duration": 6.01, "events": ["The individual brushes teeth with a toothbrush", "The individual combs hair with a black comb", "The individual sleeps in bed", "A hand gestures a half-heart symbol", "The individual squeezes lemon juice into a glass"], "n_events": 5, "n_shots": 5, "n_subjects": 1}, "vid": 871}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "929.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video captures a scene in a classroom where a teacher with long brown hair, dressed in a white shirt, pants, and a purple jacket, is chatting and laughing with the students. A student dressed in a black uniform is also shown laughing while seated. The teacher then places a bag on the table upfront, next to a blackboard that has writings like \"Mr. Finn\" and \"Notes Quiz Friday - History\". In the next scene, a glasses-wearing, blonde student wearing a grey uniform is seen handling a toy rocket which suddenly launches, trailing smoke and hitting the wall near the blackboard. This action draws the attention of all the students towards the blonde student. Upon turning around, the teacher appears visibly shocked and frightened due to the unexpected rocket launch.", "text": "In a classroom with colorful decorations and a chalkboard, a man in a purple jacket and tie is seen moving around energetically. A boy in a school uniform sits at his desk, watching the man and smiling. The man picks up a laptop from his desk and starts interacting with it. Another boy, also in a school uniform, is seen launching a model rocket from his desk. The rocket flies across the classroom, and the man turns around to avoid it. The students in the classroom turn their attention to the man as he moves around. The man continues to move around the classroom, and the students watch him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 929, "extra_info": {"duration": 6.97, "events": ["Teacher chats and laughs with students.", "Student laughs while seated.", "Teacher places a bag on the table.", "Toy rocket launches in classroom.", "Rocket trails smoke.", "Rocket hits the wall near the blackboard.", "Students turn attention to blonde student.", "Teacher turns around.", "Teacher appears shocked.", "Teacher appears frightened."], "n_events": 10, "n_shots": 7, "n_subjects": 4}, "vid": 929}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "694.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A pig is walking on the ground, rooting in the soil. Then, the camera shows two pigs, one large and one small. The new, larger pig sniffs the pig in front, followed by another pig that crosses the frame from left to right.", "text": "In a forest setting, a wild boar is seen walking forward and lowering its head to sniff the ground. The boar continues to sniff the ground, moving its head closer to the soil. Another boar appears in the background, also sniffing the ground. The first boar lifts its head slightly and then continues to sniff the ground. The boar then turns its body to the right and starts to walk away.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 694, "extra_info": {"duration": 10.21, "events": ["A pig walks on the ground.", "A pig roots in the soil.", "A larger pig sniffs the smaller pig.", "Another pig crosses the frame."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 694}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "840.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video portrays the preparations of firefighters before embarking on a mission. Each frame distinctly highlights the firefighters in their respective gear. These include individuals dressed in red safety helmets, some donning yellow helmets, a single individual wearing a white safety helmet, and notably, one person not adhering to the safety standard by wearing a black cap instead of a helmet.", "text": "A group of firefighters in yellow uniforms and helmets are gathered around a red truck and a trailer labeled 'Los Angeles City Fire Department Urban Search and Rescue' as they prepare their equipment. One firefighter with a red helmet is seen putting on a backpack. The scene then shifts to a group of firefighters standing in front of a control panel, where a man in a black cap is checking documents. One firefighter turns to look at the camera and then turns back to the control panel. The final scene shows a man in a safety vest and hard hat opening a door with a sign that reads 'All visitors must obtain a site rescue monitor and badge tag before entering.'", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 840, "extra_info": {"duration": 8.0, "events": ["Firefighters prepare before a mission", "Individuals dress in red safety helmets", "Some firefighters don yellow helmets", "A person wears a white safety helmet", "One person dons a black cap"], "n_events": 5, "n_shots": 3, "n_subjects": 2}, "vid": 840}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "723.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Inside the cabin, a person at the cabin door gestured upwards with their hand, then pointed outside. Then, one by one, individuals jumped out of the door. After a brief pause, another two people jumped out from the door, and each person in front then successively deployed their parachutes.", "text": "Inside a military aircraft, a man in a beige jumpsuit and helmet gestures with his hand, signaling to a man in black gear. The man in black gear steps forward and jumps out of the aircraft, followed by the man in beige gear. Both men free-fall through the sky, with the man in black gear performing a flip. The man in beige gear follows, also performing a flip. The camera captures their descent from various angles, showing the vast sky and clouds below. The video ends with the men continuing their free-fall, with the camera focusing on their descent.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 723, "extra_info": {"duration": 18.82, "events": ["A person gestured upwards at the cabin door.", "The same person pointed outside the cabin.", "An individual jumped out of the door.", "Another individual jumped out of the door.", "Yet another individual jumped out of the door.", "Two people consecutively jumped out from the door.", "The first person deployed their parachute.", "The second person deployed their parachute."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 723}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "324.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman is combing her hair in front of a mirror, first combing the hair on the left side, then the hair on the right side.", "text": "In a well-lit bathroom with two mirrors and a sink, a woman in a white robe stands in front of the mirror. She begins by brushing her hair with a hairbrush, moving it from the top of her head downwards. She continues to brush her hair, ensuring it is smooth and tidy. After finishing brushing, she adjusts her hair with her hands, making sure it is in place. She then resumes brushing her hair, focusing on different sections to ensure it is well-groomed. Finally, she makes a final adjustment to her hair, ensuring it looks neat and tidy.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 324, "extra_info": {"duration": 6.38, "events": ["A woman is combing her hair.", "The woman combs the left side of her hair.", "The woman combs the right side of her hair."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 324}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "931.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, an individual with short hair, wearing blue jeans and a white T-shirt, is seen dribbling a ball. A second person with long hair, dressed in a white T-shirt, yellow vest, black pants, and a hat, tries to play defense. The short-haired person knocks the long-haired person to the ground with their left hand, jumps to take a shot, and scores. Triumphantly, this person raises their right hand as a sign of success. Clearly, scoring the basket brings this person great joy.", "text": "A man wearing a cap, a reflective vest, and a white t-shirt with a graphic design is playing basketball with a boy on an outdoor court. The boy, dressed in a white t-shirt with a peace symbol and jeans, dribbles the ball while the man defends him. The boy maneuvers around the man, who is closely guarding him. The man loses his balance and falls to the ground as the boy continues to dribble the ball. The boy then turns and shoots the ball towards the hoop. The ball goes through the hoop, and the boy raises his hand in celebration. The man, still on the ground, looks up at the boy.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 931, "extra_info": {"duration": 5.97, "events": ["Individual with short hair dribbles a ball.", "Second person with long hair plays defense.", "Short-haired individual knocks long-haired person to the ground.", "Short-haired individual jumps to take a shot.", "Short-haired individual scores.", "Short-haired individual raises right hand in triumph."], "n_events": 6, "n_shots": 2, "n_subjects": 2}, "vid": 931}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "33.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The robot took a step forward, holding out a toy in the direction of the child, then threw it to the ground. After that, it took out a backpack, and the child approached and quickly took the backpack. Then, a hand reached into the backpack and turned to look at the robot.", "text": "In a backyard at night, a robot with green eyes and a small creature in its hand stands near a house with a patio and some plants. The robot drops the creature to the ground. The creature lies on the ground as the robot retracts its hand. The camera then focuses on a girl with purple hair standing in front of the robot. The girl walks towards the robot and picks up the creature. She looks at the robot while holding the creature.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 33, "extra_info": {"duration": 5.0, "events": ["The robot took a step forward.", "The robot held out a toy.", "The robot threw the toy to the ground.", "The robot took out a backpack.", "The child approached the robot.", "The child quickly took the backpack.", "A hand reached into the backpack.", "The hand turned to look at the robot."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 33}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "206.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person is initially seen reading a magazine, then lowers it to reveal their face, tosses it aside, sits up, and stands to approach a barred window. Another individual passes by the cell in the background.", "text": "A man is lying in bed, reading a Weidic magazine with a cover featuring a woman and a man. He lowers the magazine, revealing his face, and looks around. He then gets out of bed, grabs the bed frame, and stands up. The man approaches the bars of the cell and starts banging on them. A guard walks by as the man continues to bang on the bars.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 206, "extra_info": {"duration": 11.97, "events": ["Person reads a magazine.", "Person lowers the magazine.", "Person reveals their face.", "Person tosses the magazine aside.", "Person sits up.", "Person stands up.", "Person approaches a barred window.", "Another individual passes by the cell."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 206}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "631.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The traffic light at the side of the street is green, with bicycles passing by in both directions. As the light turns from yellow to red, the flow of bicycles comes to a stop.", "text": "The video shows a close-up of a traffic light with a green bicycle signal, indicating it is safe for cyclists to cross the street. In the background, several cyclists are seen riding past the traffic light, moving from left to right. The traffic light then changes to a yellow signal, signaling cyclists to prepare to stop. Finally, the traffic light turns red, indicating that cyclists should stop. Throughout the video, pedestrians are also visible in the background, walking along the sidewalk.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 631, "extra_info": {"duration": 18.08, "events": ["The traffic light turns green.", "Bicycles pass by in both directions.", "The traffic light turns yellow.", "The traffic light turns red.", "The flow of bicycles comes to a stop."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 631}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "540.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman dribbles the basketball with one arm while squatting down to pick up a jacket and slips on one of its sleeves. Then, with a forceful pat, the basketball flies out of the frame. The woman looks up, and as it falls back, she moves forward, continuing to dribble the basketball. The camera follows the movement.", "text": "In a gymnasium with wooden walls and a wooden floor, a woman in a dark blue 'Volleyball Witches' sweatshirt and black pants is juggling a volleyball with her right hand. She then squats down and picks up a white jacket from the floor. She stands up and starts to put on the jacket while continuing to juggle the volleyball with her right hand. The woman successfully puts on the jacket and continues to juggle the volleyball with her right hand. She then turns around and walks towards the camera, still juggling the volleyball.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 540, "extra_info": {"duration": 15.1, "events": ["A woman dribbles the basketball with one arm.", "The woman squats down.", "The woman picks up a jacket.", "The woman slips on a jacket sleeve.", "The woman pats the basketball forcefully.", "The basketball flies out of the frame.", "The woman looks up.", "The basketball falls back.", "The woman moves forward.", "The woman continues to dribble the basketball."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 540}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "231.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the tunnel, a woman turns around and runs, stopping by a cage, opens the cage, and looks at a fire extinguisher. The camera then cuts to a man running from inside the tunnel.", "text": "A woman, dressed in a dark coat, turns and starts running down a dimly lit, industrial-looking corridor with brick walls and metal railings. She glances back while running, then stops and looks around cautiously. She approaches a sign on a chain-link fence that reads 'GEFAHR! Brennhar Flügelsigkeit' and begins to unlock a gate. The gate opens to reveal two large gas cylinders labeled '015' and '016'. The scene then shifts to a man running down the same corridor, approaching the woman.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 231, "extra_info": {"duration": 13.12, "events": ["A woman turns around in the tunnel.", "The woman runs in the tunnel.", "The woman stops by a cage.", "The woman opens the cage.", "The woman looks at a fire extinguisher.", "A man runs from inside the tunnel."], "n_events": 6, "n_shots": 4, "n_subjects": 2}, "vid": 231}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "864.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video portrays an individual wearing glasses, a cap, green t-shirt and shorts, white socks, and shoes, who is jogging at a consistent speed on a deserted road bounded by barriers on one side. This person is running alone. The surrounding area is teeming with verdant trees and grass on both sides of the road. After jogging vigorously along this pathway, the person proceeds to ascend the stairs leading towards a hill blanketed by green grass.", "text": "The video begins with an aerial view of a person running on a narrow road surrounded by lush green hills and trees. The camera then zooms in to show a close-up of the person's white running shoes as they run. The camera angle changes to a side view, revealing the person wearing a green outfit and a cap, running along the road. The camera then shifts to a rear view, showing the person continuing to run down the road. Finally, the camera returns to an aerial view, following the person as they run further down the road.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 864, "extra_info": {"duration": 6.0, "events": ["Individual wearing glasses jogs on deserted road.", "Individual runs at consistent speed.", "Road is bounded by barriers.", "Jogger ascends stairs.", "Stairs lead towards a grass-blanketed hill."], "n_events": 5, "n_shots": 4, "n_subjects": 1}, "vid": 864}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "328.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man hurriedly gets out of the car. The camera cuts to the outside of the car, where the man quickly gets out of the taxi, closes the door, and runs forward. As he runs, he looks towards the right front, and the camera switches to show an electronic screen displaying text.", "text": "A man in a white shirt is sitting in the back seat of a car, talking to the driver who is wearing a headset and a beaded necklace. The man then turns and opens the car door, stepping out of the vehicle. He closes the car door and starts running away from the car. The camera then shifts to a digital sign displaying the message 'COME HOME DISAPSHIT' as the man continues to run.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 328, "extra_info": {"duration": 5.38, "events": ["Man hurriedly exits the car", "Man closes the car door", "Man runs forward", "Man looks towards the right front", "Camera shows an electronic screen displaying text"], "n_events": 5, "n_shots": 3, "n_subjects": 1}, "vid": 328}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "684.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera captures the view of the sunset from the mountaintop. A woman in a denim jacket and a man in a windbreaker enter the screen from the right, one after another. They both glance at the sunset, then look at each other. The woman smiles, while the man speaks to her.", "text": "The video begins with a scenic view of a sunset over a town, with the sun partially obscured by clouds and the landscape covered in trees and houses. A woman with long hair, wearing a denim jacket and a red hoodie, walks into the frame from the right, followed by a man in an orange jacket and a blue backpack. The man holds the woman's hand as they walk together, and they stop to look at each other. The woman turns her head to face the man, and they share a moment of eye contact. The video ends with the same scenic view of the sunset and the town.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 684, "extra_info": {"duration": 7.19, "events": ["The camera captures the sunset from the mountaintop.", "A woman in a denim jacket enters the screen from the right.", "A man in a windbreaker enters the screen from the right.", "The woman glances at the sunset.", "The man looks at the sunset.", "The woman and the man look at each other.", "The woman smiles.", "The man speaks to the woman."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 684}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "701.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, there are two birds: one perched on a tree branch and another fluttering its wings in the air. Then, a third bird flies up from below. The bird that was flying leaves the screen, while the third bird flaps its wings for a while. The bird that was perched on the branch flies down and joins the third bird, flying away together.", "text": "A hummingbird with a blue throat and green body is perched on a branch, looking around. Another hummingbird flies in from the left, flapping its wings rapidly. The perched hummingbird turns its head to look at the approaching bird. The approaching bird lands on the branch next to the perched bird. The perched bird then turns its head to look at the new bird. The new bird flies away, and the perched bird remains on the branch.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 701, "extra_info": {"duration": 10.64, "events": ["One bird is perched on a tree branch.", "Another bird is fluttering its wings in the air.", "A third bird flies up from below.", "The flying bird leaves the screen.", "The third bird flaps its wings.", "The perched bird flies down.", "The perched bird joins the third bird.", "The perched bird and the third bird fly away together."], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 701}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "66.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A character with black hair got one of his feet stuck in the door and was lying on the ground. He propped himself up to stand, his foot still stuck. He glanced at the object in his hand, then turned his head to look behind him. Someone was watching him. Frightened, he hurriedly closed the door and leaned against it with a scared expression on his face.", "text": "A boy is lying on the ground in front of a door, holding a green object close to his chest. He begins to roll over and starts to get up. The boy stands up and looks at the door, where a bald man is standing and watching him. The boy smiles and says something to the bald man. He then turns around, closes the door, and leans against it, looking surprised. The scene ends with the boy lying on the ground again, still holding the green object.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 66, "extra_info": {"duration": 5.25, "events": ["Character's foot got stuck in the door.", "Character lay on the ground.", "Character propped himself up.", "Character's foot remained stuck.", "Character glanced at the object in hand.", "Character turned head to look behind.", "Character noticed someone watching.", "Character, frightened, hurriedly closed the door.", "Character leaned against the door.", "Character displayed a scared expression."], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 66}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "556.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person descends from the sky with a paraglider, skimming the water with their feet, creating a splash, then lands on the grass. They grab the paraglider and sprint forward rapidly. The camera follows the movement, then cuts to a snowy day, where one person lands with a paraglider while another person holds onto the ropes beneath, descending together forward.", "text": "A person with a parachute is seen descending from the sky over a scenic landscape with mountains and a lake. The person continues to descend, getting closer to the lake's surface. The person lands on the water, creating a splash as they glide across the lake. The person then runs out of the water onto the grassy shore, with the parachute still deployed. The scene transitions to the person landing on a snowy slope, with the parachute still in the air.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 556, "extra_info": {"duration": 10.2, "events": ["Person descends from sky with paraglider.", "Person skims water with feet creating splash.", "Person lands on grass.", "Person grabs the paraglider.", "Person sprints forward.", "Person lands with paraglider on snowy day.", "Another person holds onto ropes beneath paraglider.", "Persons descend together forward."], "n_events": 8, "n_shots": 2, "n_subjects": 3}, "vid": 556}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "430.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A lady wearing gloves is riding on a pelican. She forcefully pries open the pelican's beak and pours a little duck out of the pelican's mouth. Then, she steps over the pelican, pinches its beak, and carries it away. During the process, a little girl is watching the whole time by her side, and there are other people walking around nearby.", "text": "A woman, wearing a pink shirt and jeans, is seen interacting with a large bird inside an enclosure, using a tool to feed it. The bird opens its beak wide as the woman continues to feed it. A child approaches the enclosure, and the woman continues to feed the bird. The woman then lifts the bird by its beak and starts to move it towards the edge of the enclosure. The bird is carried out of the enclosure, and the woman walks away with it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 430, "extra_info": {"duration": 15.13, "events": ["Lady rides on a pelican.", "Lady pries open the pelican's beak.", "Lady pours a little duck out of the beak.", "Lady steps over the pelican.", "Lady pinches the pelican's beak.", "Lady carries the pelican away.", "Little girl watches the process.", "People walk around nearby."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 430}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "299.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man bends down to look into the mirror, then straightens up, turns to face the camera, looks back at the mirror, and the camera moves forward, pushing all the way to the edge of the mirror near a photo of a woman.", "text": "A man is leaning over a cluttered vanity table, which is filled with various items including flowers, makeup, and personal belongings. He straightens up and looks at his reflection in the mirror, then steps back while keeping his hands in his pockets. The camera gradually zooms in on the vanity table, focusing on the items and the mirror. The camera continues to zoom in, revealing more details of the vanity, including a white feather boa and several photos pinned to the mirror.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 299, "extra_info": {"duration": 10.5, "events": ["A man bends down to look into the mirror.", "The man straightens up.", "The man turns to face the camera.", "The man looks back at the mirror.", "The camera moves forward to the edge of the mirror."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 299}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "572.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man wearing a shirt stands with his right hand grasping a stick stretched forward. Another man, clad in a black coat, glances at a stool beside him, then extends his right hand to move the stool away, placing it behind himself to sit down. The man in the shirt shakes his arms, leans his body backward, and falls into the water pool behind him, causing water to splash in all directions. The camera follows the movement as the man in the black coat turns his head to look at him. The background features a pond surrounded by a stand of trees.", "text": "Two men are standing on a concrete platform by a body of water, with one man holding a fishing rod and the other standing beside a chair. The man in black reaches for the chair and begins to pull it towards himself. He then sits down on the chair, positioning it over a small water pit. The man in the blue shirt, who is still holding the fishing rod, loses his balance and falls into the water. The man in black remains seated on the chair, while the man in the blue shirt is seen in the water.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 572, "extra_info": {"duration": 5.33, "events": ["Man grasps stick with right hand.", "Man in black coat glances at stool.", "Man in black coat moves stool away.", "Man in black coat sits down.", "Man in shirt shakes arms.", "Man in shirt leans body backward.", "Man in shirt falls into water pool.", "Water splashes in all directions.", "Man in black coat turns head to look."], "n_events": 9, "n_shots": 1, "n_subjects": 2, "old_duration": 5.33, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/172.mp4"}, "vid": 572}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "969.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A hand, clad in a blue shirt, is seen holding a plastic container filled with a turmeric-colored liquid. The individual shakes the container before placing it on a table. Subsequently, a second person with dark skin inserts a pink straw into the container and begins to blow into it. The liquid changes color to red, but after continued blowing, it reverts back to its original turmeric color.", "text": "A hand is seen holding a small plastic cup with an orange liquid at the bottom, and the hand begins to move the cup forward on a white surface. The hand then releases the cup, leaving it on the surface. Another hand appears, holding a pink straw, and inserts it into the cup. The person then drinks from the cup using the straw.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 969, "extra_info": {"duration": 12.0, "events": ["A hand holds a plastic container.", "The individual shakes the plastic container.", "The hand places the plastic container on a table.", "A second person inserts a pink straw into the container.", "The second person blows into the straw.", "The liquid changes color to red.", "The liquid reverts back to turmeric color."], "n_events": 7, "n_shots": 2, "n_subjects": 2}, "vid": 969}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "327.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman picks up a cell phone, a child crawls out from the blankets, and the woman, child, and another older girl sit down together. The woman presses on the screen to start taking a selfie.", "text": "In a dimly lit room, a woman is sitting on a bed with two children, one of whom is holding a stuffed toy. The woman puts down the book she was holding and picks up a phone. The children, who are sitting on the bed, turn to face the woman and the phone. The woman extends the phone towards the children, and they all look at the screen. The woman and the children continue to look at the phone screen, with the woman holding the phone and the children watching intently. The woman and the children remain focused on the phone screen, with the woman holding the phone and the children leaning in closer. The woman and the children continue to look at the phone screen, with the woman holding the phone and the children watching closely. The woman and the children remain focused on the phone screen, with the woman holding the phone and the children leaning in closer.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 327, "extra_info": {"duration": 8.0, "events": ["A woman picks up a cell phone.", "A child crawls out from the blankets.", "The woman sits down.", "An older girl sits down.", "The woman presses on the screen."], "n_events": 5, "n_shots": 1, "n_subjects": 3}, "vid": 327}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "65.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A security guard used his hand to block the intense light coming from ahead as he looked forward. A tall character approached the security guard, lifted his foot, and kicked the guard, sending him flying back into a wall. The security guard then slid to the ground as two security guards inside a window on the wall showed expressions of shock and horror. Subsequently, one of the guards quickly ran away, pointing outside the window, while the other guard, after taking a sip of water, hurriedly ran off as well.", "text": "A boy in a police uniform points at a tall, thin figure walking towards him in a dimly lit area with a greenish glow. The figure stops and kicks the boy, causing him to fall. The scene then shifts to the boy standing outside a window, making a gesture with his hand. Inside the room, two other boys in police uniforms watch him. One of the boys inside the room picks up a cup and drinks from it. The boy outside the window then runs towards the window and looks inside.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 65, "extra_info": {"duration": 5.12, "events": ["Security guard blocked light with hand.", "Tall character approached security guard.", "Tall character kicked security guard.", "Security guard sent flying into wall.", "Security guard slid to ground.", "Two security guards showed shock and horror.", "One security guard pointed outside window.", "One security guard took water sip.", "Same security guard ran off."], "n_events": 9, "n_shots": 2, "n_subjects": 4}, "vid": 65}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "924.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Three individuals are exercising side by side on a road. Among them, there is an adult dressed in a short-sleeve gray T-shirt and black shorts. Two children, with their hair styled in ponytails, are wearing white short-sleeved T-shirts and shorts. A tricycle painted in yellow and blue stands between the two kids. The child on the right sports pink slippers, while the one on the left has on white slippers. The adult, attired in black shorts, a gray short-sleeved T-shirt and brown slippers, is situated to the right of the children. They all are engaged in leg jumping exercises on the roadside. A monkey clothed in a white short-sleeved T-shirt and yellow shorts jumps from the steering wheel of a stroller onto the middle child who is wearing glasses, thus interrupting the child's workout. The adult, with a smile, picks up the monkey with both hands and repositions it on the wheel of the stroller.", "text": "Two children in white pajamas are pushing a tricycle with a baby sitting on it along a sidewalk at night. A man in a gray shirt and black shorts follows them closely. The child in the front turns around and starts walking back towards the camera. The child in the back continues to push the tricycle. The man reaches the child in the back and takes the baby from the tricycle. The man then places the baby back on the tricycle. 
The children continue to walk along the sidewalk.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 924, "extra_info": {"duration": 9.0, "events": ["Three individuals exercise on a road", "Adult dressed in gray T-shirt stands right of children", "Two children wear white T-shirts and shorts", "A tricycle stands between two kids", "Child on the right wears pink slippers", "Child on the left wears white slippers", "Adult wears brown slippers", "They all engage in leg jumping exercises", "Monkey jumps from stroller onto middle child", "Adult picks up monkey and repositions it on stroller"], "n_events": 10, "n_shots": 1, "n_subjects": 4}, "vid": 924}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "17.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The elephant was sitting on a person and a gorilla, and turned its head to look at them. The person and the gorilla stood up with effort, and the elephant leaned backward. Then the person ran forward, and the gorilla lifted the elephant and looked in the direction the person had gone.", "text": "In a lush, jungle-like environment with dense foliage and a warm, glowing background, a large red elephant, a gorilla, and a human-like creature are seen together. The elephant begins to rise, causing the gorilla and the human-like creature to react. The gorilla and the human-like creature start to move away from the elephant as it continues to rise. The gorilla and the human-like creature fully separate from the elephant and begin to move in different directions. The gorilla and the human-like creature continue to move away from the elephant, which is now fully upright. The gorilla and the human-like creature are now further apart from the elephant, which remains in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 17, "extra_info": {"duration": 3.75, "events": ["The elephant sat on a person and a gorilla.", "The elephant turned its head to look at them.", "The person and the gorilla stood up with effort.", "The elephant leaned backward.", "The person ran forward.", "The gorilla lifted the elephant.", "The gorilla looked in the direction the person had gone."], "n_events": 7, "n_shots": 1, "n_subjects": 3}, "vid": 17}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "720.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, a full screen of starlight moves towards the direction of the screen. Metallic text flies from the bottom left of the screen to the center, and then the text rotates to reveal the words “HAPPY BIRTHDAY”. Afterwards, the text stops, and countless stars fall from above. Finally, the stars slowly fade away.", "text": "The video begins with a starry background filled with small, glowing stars. Golden and silver 3D letters start to appear and move into view, gradually forming the words 'HAPPY BIRTHDAY'. As the letters settle into place, colorful stars begin to emerge from the letters, floating upwards. The stars continue to rise, filling the space around the 'HAPPY BIRTHDAY' text. The stars then start to disperse and fade away, leaving the 'HAPPY BIRTHDAY' text prominently displayed against the starry background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 720, "extra_info": {"duration": 14.01, "events": ["Starlight moves towards the screen", "Metallic text flies from bottom left to center", "Metallic text rotates to reveal 'HAPPY BIRTHDAY'", "Metallic text stops in the center", "Countless stars fall from above", "Stars slowly fade away"], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 720}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "712.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Beside a black car on the snowy ground, a person bent down for a look at the tire, then stepped forward to kick it, followed by pressing on the tire with their hand, and then stood up straight.", "text": "A person wearing a gray coat, black pants, and brown boots stands next to a black car in a snowy environment. The person bends down and steps on the snow-covered ground with their right foot. They then bend down further and use their right hand to touch the car's side. After touching the car, the person stands up and steps back, moving out of the frame. The scene then transitions to a red and black striped background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 712, "extra_info": {"duration": 7.54, "events": ["A person bent down beside a black car.", "The person examined the tire.", "The person kicked the tire.", "The person pressed on the tire with their hand.", "The person stood up straight."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 712}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "78.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the dead of night, a car sways as it approaches from a distance towards the camera. There are two people inside the vehicle. From high above in the distance, a large iron ball is suspended by a chain, swinging towards the car. Atop the iron ball stands a man in white clothing. The man in white jumps off the iron ball, opens his arms, and scrambles to the back of the car. Through the rear car window, the man can be seen hanging on with both arms, looking inside the vehicle, before he ultimately closes his eyes.", "text": "A white van is driving on a highway at night, approaching a road sign indicating a speed limit of 100 km/h. The van continues to move forward, and the driver, a large man, is seen smiling and talking. The camera zooms in on the driver and the passenger, who is a smaller man. The camera then shifts to show a boy hanging from a chain, suspended in the air. The boy climbs onto a large ball and jumps off, landing on the roof of the van. The boy grabs onto the van's roof and looks inside the vehicle.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 78, "extra_info": {"duration": 7.5, "events": ["A car sways approaching the camera.", "Two people sit inside the vehicle.", "A large iron ball swings towards the car.", "A man in white stands atop the iron ball.", "The man in white jumps off the iron ball.", "The man in white opens his arms.", "The man in white scrambles to the back of the car.", "The man in white hangs on with both arms.", "The man in white looks inside the vehicle.", "The man in white closes his eyes."], "n_events": 10, "n_shots": 4, "n_subjects": 3}, "vid": 78}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "86.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A red dragon flies into the sky, a firework explodes, bursting into a screen full of golden light. The red dragon passes through the golden glow, with fireworks continuing to bloom behind it. The red dragon starts to fall downwards; underneath its head, there is a man and a woman being swung side to side. The fireworks behind them extinguish into black smoke as the red dragon wobbles downward.", "text": "A colorful dragon is seen flying over a cityscape with tall buildings in the background. The dragon breathes a bright, fiery ball of light from its mouth. The light expands into a massive explosion, filling the sky with golden sparks. The dragon, now with a more detailed and expressive face, continues to fly forward, carrying two people on its back. The dragon and the two people fly past the buildings, with the dragon's head and body becoming more visible. The dragon and the two people continue to fly, moving closer to the camera, with the cityscape in the background. The dragon and the two people fly past the buildings, with the dragon's head and body becoming more detailed. The dragon and the two people fly past the buildings, with the cityscape in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 86, "extra_info": {"duration": 6.12, "events": ["A red dragon flies into the sky.", "A firework explodes in the sky.", "The red dragon passes through the golden glow.", "Fireworks continue to bloom behind the red dragon.", "The red dragon starts to fall downwards.", "A man and a woman are swung side to side.", "Fireworks extinguish into black smoke.", "The red dragon wobbles downward."], "n_events": 8, "n_shots": 2, "n_subjects": 3}, "vid": 86}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "655.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person wearing a blue sweater is sitting behind a white table, fiddling with the vegetables in a bowl with a fork, then picking up a few slices and bringing them to their mouth. A strand of long hair sways in front of their chest as their body moves. The video shows the person's head and chest area.", "text": "A person wearing a blue sweater is holding a white bowl filled with a salad that includes various greens and some white dressing. The person uses a fork to mix the salad thoroughly, ensuring the dressing is evenly distributed. After mixing, the person lifts a forkful of the salad, bringing it towards their mouth. The person then eats the salad.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 655, "extra_info": {"duration": 6.71, "events": ["Person sits behind the white table.", "Person fiddles with vegetables in bowl with fork.", "Person picks up vegetable slices.", "Person brings vegetable slices to mouth.", "Strand of long hair sways in front of chest."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 655}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "745.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person is counting money and then touching the embossed print on the bills with their right hand.", "text": "A person is holding a stack of $100 bills in their hands, with the bills fanned out and the person's fingers moving over them. The person's fingers are seen counting and flipping through the bills, revealing the serial numbers and the portrait of Benjamin Franklin on the bills. The background is blurred, focusing on the hands and the money.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 745, "extra_info": {"duration": 14.48, "events": ["A person counts money.", "The person touches the embossed print on the bills."], "n_events": 2, "n_shots": 1, "n_subjects": 1}, "vid": 745}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "590.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person holds a strawberry, subtitles appear and then disappear on the screen. A small dog turns around and runs into the dog crate, subtitles appear and then disappear on the screen. The little dog turns around, extends its paw to close the door, and then sits down, subtitles appear and then disappear on the screen. The person brings the strawberry close to the dog's nose, and it takes a sniff. The background is an indoor living room. The camera follows the movement of the person's hand.", "text": "A dog sits on the floor in front of a cage, looking at a hand holding a strawberry. The dog turns and runs into the cage, then turns back to look at the hand. The dog jumps up and uses its paw to open the cage door. The dog stands inside the cage, looking at the hand holding the strawberry. The dog sniffs the strawberry and then sits back down, looking at the hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 590, "extra_info": {"duration": 8.63, "events": ["Person holds a strawberry.", "Subtitles appear and disappear.", "Small dog turns around.", "Small dog runs into the crate.", "Subtitles appear and disappear.", "Small dog extends paw to close the door.", "Small dog sits down.", "Subtitles appear and disappear.", "Person brings strawberry close to dog's nose.", "Dog takes a sniff."], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 590}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "524.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two little kittens are lying on the trunk of the tree. The camera shifts to the kitten above, which opens its mouth and makes a sound. Then, the camera zooms out to a full view, and the two little kittens begin to climb up the tree one after the other.", "text": "Two orange kittens are seen climbing a tree, with one kitten higher up and the other lower down. The lower kitten looks up at the higher kitten, which is meowing. The lower kitten then looks directly at the camera and meows. The higher kitten continues to meow while the lower kitten looks up at it. The lower kitten starts to climb up the tree, following the higher kitten. The higher kitten begins to ascend the tree, and the lower kitten continues to follow.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 524, "extra_info": {"duration": 7.87, "events": ["Two kittens lie on the trunk of a tree.", "The camera shifts to the upper kitten.", "The upper kitten opens its mouth.", "The upper kitten makes a sound.", "The camera zooms out for a full view.", "The first kitten begins to climb the tree.", "The second kitten follows climbing up the tree."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 524}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "686.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the track field, a black athlete is in a sprinting starting position. Subsequently, he raises his hips then takes off running. The camera captures the athlete running towards the lens from the front until the athlete runs out of the frame.", "text": "A man is in a crouched starting position on a track, with his hands on the ground and his body leaning forward. He begins to push off from the starting blocks, transitioning from a crouched position to a standing position. The man starts running forward, with his arms moving in a running motion. The camera follows his movement, capturing his upper body and face as he runs. The man continues to run, with the camera zooming in on his face, showing his focused expression. The video ends with a close-up of the man's face, and the 'video' logo is visible throughout the video.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 686, "extra_info": {"duration": 10.72, "events": ["A black athlete assumes a sprinting starting position.", "The black athlete raises his hips.", "The black athlete takes off running.", "The athlete runs towards the camera.", "The athlete runs out of the frame."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 686}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "693.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the foreground, there is an open laptop on the table. In the background, a little girl is lying on her side in bed, getting ready to sleep. A lady opens a blanket and drapes it over the little girl, then she leans down to kiss the little girl. The little girl also turns her head towards the lady, returning the kiss.", "text": "In a room with a bed and a laptop on a table, a woman is seen making a bed for a girl who is lying on the bed. The laptop screen in the foreground displays a green background with tracking markers and a logo. The woman adjusts the bedspread, ensuring it is properly arranged. She then gently touches the girl's head, checking on her. After ensuring the bed is made, the woman stands up and moves away from the bed.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 693, "extra_info": {"duration": 9.51, "events": ["Laptop is open on the table.", "Little girl lies in bed.", "Lady opens the blanket.", "Lady drapes the blanket over the little girl.", "Lady leans down.", "Lady kisses the little girl.", "Little girl turns her head.", "Little girl returns the kiss."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 693}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "628.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A fixed camera captures the activities of a day in the square in front of it, with numerous individuals engaging in various activities. The passage of time on camera hastens, and the activities of the people blur rapidly. The camera starts in the morning, with the sunlight gradually brightening and people's activities steadily increasing. By evening, the sunlight fades, street lights come on, and a light show begins in the distance. People's activities reach a climax, and eventually, everyone gradually disperses.", "text": "The video begins with a view of a large open plaza with a city skyline in the background, where people are seen walking and sitting. As the video progresses, more people appear, some walking in groups, others sitting on the steps, and a few taking photos. The sky gradually darkens, and the city lights start to illuminate, creating a beautiful contrast with the darkening sky. By the end of the video, the plaza is filled with people, and the city lights are fully on, casting a warm glow over the scene.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 628, "extra_info": {"duration": 49.32, "events": ["Camera captures a day's activities in the square.", "People start their morning routines.", "Sunlight brightens the environment.", "Activities of individuals increase.", "Evening arrives and sunlight fades.", "Street lights come on.", "A light show begins in the distance.", "People's activities reach a climax.", "Individuals gradually disperse."], "n_events": 9, "n_shots": 1, "n_subjects": 1}, "vid": 628}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "841.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, an individual pours blue curacao from a small bottle into a glass situated on a table. Following this, the same individual adds simple syrup to the glass, using the same bottle. Lastly, lemon juice is added to the glass from the same bottle. Throughout the video, specific liquid measurements are displayed at the top left corner of the screen.", "text": "A hand holds a metal jigger and pours 1/2 oz of blue curaçao into a clear glass with a diamond pattern on a wooden surface. The hand then pours 1/2 oz of simple syrup into the same glass. Finally, the hand pours 1/2 oz of lemon juice into the glass.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 841, "extra_info": {"duration": 8.0, "events": ["Individual pours blue curacao into a glass.", "Individual adds simple syrup using the bottle.", "Individual adds lemon juice from the bottle.", "Liquid measurements display at top left corner."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 841}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "981.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows a kitchen with red wooden flooring and white doors. Upon entering, there is a long black countertop positioned on the left, with a trash can tucked beneath it. Several kitchen papers and two white plastic bags are scattered on the countertop. Encased between the plastic bags and kitchen papers are a small white mop and a small green broom, both with wooden handles. A grey toaster is situated in front of the wall. The countertop houses two sinks with a faucet located between them, facing the window. Some white flowers in a glass vase and a bunch of bananas are displayed in front of the window. In the top right corner of the connected sinks, a yellow dish soap bottle and a blue dish soap bottle can be seen. The left sink houses white dishes and three glass cups that require cleaning. A black refrigerator is positioned near the table.", "text": "The video begins with a view of a kitchen featuring white cabinets, a black countertop, and a wooden floor. Various items are scattered on the countertop, including a paper towel roll, a broom, a plastic bag, a plant, and some cleaning supplies. The camera gradually moves forward, focusing on the countertop and the sink area. The sink is filled with dishes, including glasses and plates, and there are cleaning products and a dish rack placed around it. The camera continues to move closer to the sink, providing a clearer view of the dishes and cleaning supplies. The video ends with a close-up of the sink and the surrounding area, highlighting the cluttered state of the kitchen.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 981, "extra_info": {"duration": 8.01, "events": [], "n_events": 0, "n_shots": 1, "n_subjects": 1}, "vid": 981}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "705.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A pair of hands wearing blue gloves is preparing for a nucleic acid test. A person opens their mouth, and the tester inserts a cotton swab deep into their mouth for sampling. After the sampling is completed, the person being tested puts on a mask, and the tester seals the test reagent in a test tube. Then, the next person comes forward to prepare for the test.", "text": "A person wearing blue gloves and a white protective suit holds a cotton swab and a wooden stick, preparing to test a woman. The person inserts the cotton swab into the woman's mouth. The woman holds a small bag and a yellow item, which she hands to the person. The person, now wearing a face shield and mask, administers the test to the woman. The person then places the collected sample into a plastic bag and seals it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 705, "extra_info": {"duration": 26.89, "events": ["Hands wearing blue gloves prepare for a nucleic acid test", "A person opens their mouth", "Tester inserts a cotton swab into the person's mouth", "Tester takes a sample from the person's mouth", "The person being tested puts on a mask", "Tester seals the test reagent in a test tube", "The next person comes forward for the test"], "n_events": 7, "n_shots": 1, "n_subjects": 3}, "vid": 705}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "943.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows a shark preparing to attack a person on the left, who is outfitted in a black wetsuit and carries an oxygen tank on their back. As the shark opens its mouth to attack, an individual on the right thrusts a round hook into the shark's mouth. Immediately, the person on the left maneuvers swiftly, taking shelter beneath the shark's body.", "text": "A kayak is seen moving through the water near a rocky shoreline. The scene then transitions to an underwater view where a diver is surrounded by various fish, including striped and patterned ones. The diver is approached by a large shark, and the diver begins to move towards the shark. The diver reaches out and grabs the shark's fin as it swims past.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 943, "extra_info": {"duration": 6.01, "events": ["Shark prepares to attack person on left", "Person on left carries oxygen tank", "Shark opens mouth", "Individual on right thrusts hook into shark's mouth", "Person on left maneuvers swiftly", "Person on left takes shelter beneath shark's body"], "n_events": 6, "n_shots": 2, "n_subjects": 3}, "vid": 943}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "770.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A suction cup swings back and forth, sucking away all the coins on the table, and in the end, the screen fades to black.", "text": "A pile of various coins, including gold, silver, and bronze, is scattered on a dark surface. A metal cone begins to tilt and move over the coins, causing them to shift and scatter further. The cone continues to move, pushing more coins off the pile and onto the surface. The coins are gradually pushed off the pile and spread out across the surface. The cone moves out of the frame, leaving the scattered coins on the surface.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 770, "extra_info": {"duration": 19.92, "events": ["A suction cup swings back and forth.", "The suction cup sucks away all the coins.", "The screen fades to black."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 770}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "281.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man leans forward with both hands on a red railing, shouting out towards the front. A lady glances back while gliding forward on the ice rink, then turns her whole body to face the camera, waving in the direction of the camera, and continues to slide forward backwards. The camera then switches back to the man, who is speaking towards the front.", "text": "A man in a brown jacket and purple shirt is leaning on the railing of an ice rink, looking forward and talking. The background shows a blue wall with red bleachers, a 'Gold' sign, and a 'Reno County Better Club Certifications' banner. The camera then shifts to a woman with a ponytail, wearing a maroon sweater, who is skating on the ice rink. She turns around, smiles, and waves her hand while continuing to skate. The camera then returns to the man, who is still leaning on the railing and talking.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 281, "extra_info": {"duration": 4.38, "events": ["A man leans forward on a red railing.", "The man shouts out towards the front.", "A lady glances back while gliding on ice.", "The lady turns her whole body to face the camera.", "The lady waves in the direction of the camera.", "The lady continues to slide forward backwards."], "n_events": 6, "n_shots": 3, "n_subjects": 2}, "vid": 281}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "63.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A character wearing a hat moves their head downward and then lifts it up, followed by using both hands to scoop water from below to wash their face. Then, the character cups their hands together to draw a handful of water to sip into their mouth. Afterward, the camera switches to a full view, where the character with the hat is kneeling in front of the toilet, hands placed on the toilet seat. Subsequently, they press a button on the side of the toilet, and a stream of water shoots upward from the toilet, to which the hatted character leans forward to drink.", "text": "The video begins with a view of a tiled wall. A man in a gray suit and black hat appears, holding a pink object. He then wipes his face with his hand, removing the pink object. The man continues to wipe his face, looking relieved. He then looks at the toilet in front of him, which has an Apple logo on it. The man kneels in front of the toilet, pressing a button on the toilet tank. Water starts to flow from the toilet. The man turns his head to look at the camera.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 63, "extra_info": {"duration": 9.25, "events": ["Character wearing a hat moves head downward.", "Character lifts head up.", "Character uses hands to scoop water.", "Character washes face with water.", "Character cups hands to draw water.", "Character sips water into mouth.", "Character kneels in front of the toilet.", "Character places hands on toilet seat.", "Character presses button on the toilet.", "Water shoots upward from the toilet."], "n_events": 10, "n_shots": 2, "n_subjects": 1}, "vid": 63}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "511.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Inside a concrete pipe, a little girl kicks a black and white soccer ball forcefully. The ball rolls around the concrete pipe twice, passes between the girl's legs, and continues to spin upwards. The girl leans forward slightly and catches the soccer ball on the black backpack she is carrying on her back. The girl looks back to confirm the ball has fallen into the backpack, then smiles and walks out of the concrete pipe.", "text": "A girl in a purple jacket and black pants with a backpack kicks a soccer ball inside a large concrete pipe. The ball bounces back and hits her in the face, causing her to bend over and look at the ball. She then turns around, waves her hand, and walks out of the pipe. The girl returns to the pipe and kicks the ball again, which bounces back and hits her in the face once more. She bends over and looks at the ball, then turns around, waves her hand, and walks out of the pipe.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 511, "extra_info": {"duration": 6.13, "events": ["A little girl kicks a soccer ball inside a concrete pipe.", "The soccer ball rolls around the concrete pipe twice.", "The soccer ball passes between the little girl's legs.", "The soccer ball continues to spin upwards.", "The little girl leans forward slightly.", "The little girl catches the soccer ball on her backpack.", "The soccer ball falls into the backpack.", "The little girl looks back to confirm the ball.", "The little girl smiles.", "The little girl walks out of the concrete pipe."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 511}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "139.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera pans from a painting of Snoopy standing on a red house to the right, moving to an image of a little boy writing on paper. The boy is biting the end of his pen, with one hand propping up his cheek. Then the image transforms into a live shot, the child's supporting hand and index finger shift slightly, a gust of wind blows, and the papers on the boy's study desk are swept away. The little boy looks on in confusion, and then turns his head to look to the left.", "text": "The video begins with a split screen showing a dog sitting on a red block with a green object in its mouth on the left, and a baby sitting at a table on the right. The camera then focuses on the baby, who is holding a pen and writing on a piece of paper. The baby continues to write, and the paper starts to fly away from the table. The baby looks up and turns its head to the left, watching the paper fly away.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 139, "extra_info": {"duration": 6.62, "events": ["The camera pans from a painting of Snoopy to an image of a boy.", "The boy bites the end of his pen.", "The boy props up his cheek with one hand.", "The image transforms into a live shot.", "The boy's supporting hand and index finger shift slightly.", "A gust of wind blows.", "Papers on the boy's desk are swept away.", "The boy looks on in confusion.", "The boy turns his head to the left."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 139}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "633.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera focuses on a person's legs; they're wearing a pair of orange running shoes while jogging. Suddenly, he stops, ties the lace of his right shoe, then stands up and continues running. The background is an outdoor sports stadium.", "text": "A person wearing red running shoes and patterned leggings is running on a track at night. The person stops running and squats down to tie the laces of their right shoe. They pull the laces tight and begin to form a knot. The person continues to tie the laces, ensuring they are secure. After tying the laces, the person stands up and starts running again. The camera then shifts focus to the empty track.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 633, "extra_info": {"duration": 11.5, "events": ["Person jogs in orange running shoes.", "Person stops jogging.", "Person ties lace of right shoe.", "Person stands up.", "Person resumes running."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 633}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "259.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man in the jacket watched the man opposite him, who was wearing floral clothing, throw the Molotov cocktail in his hand into a nearby car and then run away. The man in the jacket watched as the fire started, and then he also ran off.", "text": "A man in a black jacket stands next to a black car by the waterfront, talking to another man in a patterned hoodie who is holding a bottle with a flame. The man in the hoodie turns and throws the flaming bottle into the car. The man in the black jacket watches as the bottle ignites the car's interior, causing a large fire. The man in the black jacket continues to watch as the car burns intensely. The man in the hoodie walks away from the scene.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 259, "extra_info": {"duration": 5.25, "events": ["Man in floral clothing threw Molotov cocktail into car", "Man in floral clothing ran away", "Fire started in the car", "Man in jacket ran off"], "n_events": 4, "n_shots": 3, "n_subjects": 2}, "vid": 259}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "184.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The octopus carries two children forward, as the camera switches to three mermaids holding tridents.", "text": "In a dense forest with twisted trees and a misty atmosphere, two purple creatures with large bodies and small heads are seen swimming in a body of water. They are carrying two children on their backs as they move forward. The creatures continue to swim towards a large, ornate tree with a green, intricately designed door in the background. As they approach the tree, they notice several mermaids sitting on the branches, holding spears. The mermaids are dressed in colorful outfits, with one in a pink dress and another in a blue dress.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 184, "extra_info": {"duration": 5.38, "events": ["The octopus carries two children forward.", "Three mermaids hold tridents."], "n_events": 2, "n_shots": 2, "n_subjects": 3}, "vid": 184}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "836.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video clip features a white and blue bus approaching. To the right of the bus, there's a group of cyclists riding. If these cyclists were to suddenly make a left turn, it may cause an accident. Such a crash could potentially harm multiple individuals, including drivers, passengers, and cyclists, and damage the bus or bicycles.", "text": "A man wearing a silver jacket and carrying a yellow bag rides a bicycle on a sidewalk next to a building with a hedge and a signpost. The scene then shifts to a busy street with a bus approaching from the left side of the frame. Several cyclists, including those on bicycles and electric scooters, ride along the sidewalk as the bus moves forward. The bus continues to move forward, passing by the cyclists who are riding in the same direction. The bus moves further down the street, and more cyclists, including those on electric scooters, continue to ride along the sidewalk. The bus and cyclists move past the camera, with the bus gradually moving out of the frame. The scene returns to the man on the bicycle riding on the sidewalk.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 836, "extra_info": {"duration": 5.0, "events": ["White and blue bus approaches", "Group of cyclists rides to the right", "Cyclists make a left turn", "Accident occurs", "Individuals get potentially harmed", "Bus sustains damage", "Bicycles get damaged"], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 836}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "661.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera angles down from above to capture the rippling surface of the water, then pushes forward and gradually lifts, reaching the riverbank to reveal the stones by the river, a stretch of land devoid of vegetation, and shrubbery in the distance.", "text": "The video begins with a view of the ocean surface, reflecting sunlight and showing gentle waves. The camera then starts to move forward, gradually revealing the edge of a rocky coastline. As the camera continues to move, the rocky shore becomes more visible, and the waves crash against the rocks. The camera ascends slightly, showing more of the rocky terrain and the surrounding landscape, which includes sparse vegetation and a hilly background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 661, "extra_info": {"duration": 7.09, "events": ["The camera angles down from above.", "The water surface ripples.", "The camera pushes forward.", "The camera gradually lifts.", "The camera reaches the riverbank.", "The riverbank reveals the stones.", "The land devoid of vegetation appears.", "Shrubbery in the distance becomes visible."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 661}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "928.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A child donned in a yellow T-shirt with white dots walks up to another child who is attired in a white T-shirt, emblazoned with a unicorn, paired with a hijab that covers her braided hair. This child is also holding a phone. Present is another child, wearing a half-pink, half-white striped T-shirt and a backpack, the child points at the hijab-wearing, unicorn T-shirt clad child, laughing. As this child continues to laugh, the child in the yellow T-shirt starts dancing in front of the child wearing the hijab and the unicorn T-shirt, evoking laughter as well. This subsequently depresses and saddens the child wearing the hijab and the white T-shirt with a unicorn in the middle.", "text": "A boy in a yellow shirt with white spots covers his mouth with his hand while standing in a parking lot with several cars and houses in the background. The camera moves to show a girl in a white shirt with a yellow and black headband, who is holding a phone and looking at the boy. The girl in the white shirt touches her nose and then points at the boy. Another girl in a pink and white shirt with braided hair joins them, pointing at the boy and smiling. The boy in the yellow shirt starts dancing and making gestures, while the girl in the pink and white shirt continues to point and smile. The boy in the yellow shirt continues to dance and gesture, and the girl in the pink and white shirt laughs and points at him. 
The boy in the yellow shirt keeps dancing and gesturing, and the girl in the pink and white shirt continues to point and smile.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 928, "extra_info": {"duration": 10.01, "events": ["Child in yellow T-shirt walks up to another child.", "Child in hijab holds a phone.", "Child in striped T-shirt points at child in hijab.", "Child in striped T-shirt laughs.", "Child in yellow T-shirt starts dancing.", "Child in hijab starts laughing.", "Child in hijab becomes sad."], "n_events": 7, "n_shots": 4, "n_subjects": 3}, "vid": 928}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "768.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person takes out several bags and boxes from a shopping cart and places them into the trunk, then reaches out to close the trunk.", "text": "A woman, wearing a light-colored coat and a red dress, is seen at a supermarket checkout area with a shopping cart filled with items. She begins to take out plastic bags from the cart and places them into the trunk of a car. The woman continues to empty the cart, placing more bags into the trunk. After emptying the cart, she pushes it away and closes the trunk. The scene gradually darkens as the woman leaves the frame.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 768, "extra_info": {"duration": 17.02, "events": ["A person takes out several bags from a shopping cart.", "A person takes out several boxes from a shopping cart.", "A person places bags into the trunk.", "A person places boxes into the trunk.", "A person reaches out to close the trunk."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 768}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "289.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A bare-chested man leans forward, glancing at the clothes he is putting on. He then slips his arms into the sleeves with effort, buttons up, and looks to the left.", "text": "In a dimly lit room with a window adorned with green foliage and string lights, a shirtless man begins to put on a black jacket. He first inserts his right arm into the sleeve and then his left arm. After putting on the jacket, he adjusts it by pulling it down and straightening it. He then starts to button the jacket, beginning from the top and moving downwards.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 289, "extra_info": {"duration": 4.0, "events": ["A bare-chested man leans forward.", "Man glances at the clothes.", "Man slips arms into sleeves.", "Man buttons up the clothes.", "Man looks to the left."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 289}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "727.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A group of people are gathered around a table with an architectural model, discussing the layout plan. A man reaches out to pick up a white windmill model, while another woman moves a blue cardboard representing a lake to the right side.", "text": "The scene is set on a table with a model of a house, a wind turbine, and several blue circular pieces representing water bodies. A person on the left is holding a pencil and appears to be drawing or writing on a notepad. Another person reaches out and picks up one of the blue circular pieces from the table. The person then places the blue circular piece next to the model house. The person on the left continues to hold the pencil and notepad, while the other person adjusts the position of the blue circular piece. The background includes a laptop, a green roll of paper, and various other items on the table.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 727, "extra_info": {"duration": 6.6, "events": ["A group of people gather around a table.", "People discuss the architectural layout plan.", "A man picks up a white windmill model.", "A woman moves a blue cardboard.", "The blue cardboard representing a lake shifts to the right side."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 727}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "892.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In this video clip,the person in blue vest is running so fast from the left. Then a police car pulles up and some policemen in green vests get out of the car. These policemen start to run towards the direction where the person in blue runs towards. Obviously, the person in blue seems to have commited a crime and these policemen is catching the person. Afraid of being caught by the police, the person abandoned the car and ran.", "text": "The video shows a street scene with several cars driving on a road lined with buildings and parked vehicles. A man in a blue shirt exits a parked car on the left side of the road and starts walking towards the sidewalk. He continues walking along the sidewalk, passing by a black car parked on the side of the road. The man in the blue shirt reaches the sidewalk and begins to walk away from the camera. Meanwhile, a white car drives past him on the road. Another man in a green vest appears and starts walking towards the sidewalk. The man in the green vest continues walking along the sidewalk, passing by the parked black car. The white car continues to drive past the man in the green vest. 
The man in the green vest reaches the sidewalk and continues walking away from the camera.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 892, "extra_info": {"duration": 11.0, "events": ["Person in blue vest runs from left.", "Police car pulls up.", "Policemen in green vests exit police car.", "Policemen start running towards person in blue vest's direction.", "Person in blue vest seems to have committed a crime.", "Policemen are chasing person in blue vest.", "Person in blue vest is afraid.", "Person in blue vest abandons a car.", "Person in blue vest runs away."], "n_events": 9, "n_shots": 1, "n_subjects": 3}, "vid": 892}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "265.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Inside the helicopter, two individuals look ahead, while the camera captures an aerial view of the road below where two motorcycles speed along, accompanied by other vehicles on the road. The camera then switches again to capture the profile of a person riding a motorcycle at high speed on the road.", "text": "Two individuals, one wearing a white helmet and the other a black helmet, are seated in a helicopter, looking out the window at the landscape below. The camera then shifts to an aerial view, showing a road with several vehicles, including a motorcycle, moving along it. The scene transitions to a police officer riding a motorcycle on the road, with the officer looking ahead and speaking. The camera then returns to the interior of the helicopter, showing the two individuals still looking out the window.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 265, "extra_info": {"duration": 4.38, "events": ["Two individuals look ahead inside the helicopter.", "The helicopter captures an aerial view of the road.", "Two motorcycles speed along the road.", "Other vehicles accompany the motorcycles on the road.", "The camera captures the profile of a speeding motorcyclist."], "n_events": 5, "n_shots": 3, "n_subjects": 4}, "vid": 265}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "764.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person kicks a football to a man, the man catches the ball, takes a jump forward and kicks towards the goal. There are two other men blocking, but the man manages to kick the ball into the goal and jogs forward.", "text": "A man in a white shirt and black shorts is seen dribbling a soccer ball on a green field with a goalpost and exercise equipment in the background. He continues to control the ball, moving it forward with his feet. Another man in a white shirt and black shorts approaches and attempts to intercept the ball. The first man skillfully maneuvers the ball, evading the second man's attempts to take it. He then kicks the ball towards the goal. The ball flies past the goalkeeper and into the net. The first man runs towards the goal, celebrating his successful shot.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 764, "extra_info": {"duration": 6.72, "events": ["A person kicks a football to another man.", "The man catches the football.", "The man takes a jump forward.", "The man kicks the ball towards the goal.", "Two other men attempt to block.", "The ball goes into the goal.", "The man jogs forward."], "n_events": 7, "n_shots": 1, "n_subjects": 4}, "vid": 764}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "964.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows a person's video camera and the inside of a car. First of all, the steering wheel with the AS emblem on of the black-coated car is visible, and then the instrument panel. Then the front view of this car with red exterior trim is shown. Afterwards, a shot is seen from the side mirrors, and at this time, it is seen that a white car is coming behind the red car. Afterwards, the front part of this white car is visible from the exterior shot and the rims of this car are black. Afterwards, his path is seen from a different perspective. Afterwards, various lighting and a bridge appear on another road where it is dark. The grey rim of this Alfa Romeo brand car and the emblem and red part inside is shown. Then, finally, the rearview and exhaust of the red car are shown.", "text": "A man is seen driving a car, with a close-up of the steering wheel and dashboard showing the time and date. The car's headlights turn on, and the man continues driving, passing by other vehicles on the road. The car is shown driving on a highway, with the man still visible in the corner of the screen. The car drives through a city at night, passing by tall buildings and a bridge. The car's logo is shown, followed by a close-up of the car's wheel and exhaust. The car is seen from the rear, with its taillights on, and the man is still visible in the corner of the screen. 
The car's exhaust is shown in detail, and the man is seen driving again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 964, "extra_info": {"duration": 11.0, "events": ["Camera pans from the steering wheel to the instrument panel.", "Camera captures the front view of the red-trimmed car.", "White car appears in the side mirror.", "Front of the white car becomes visible from the outside.", "Camera view switches to a different road perspective.", "Various lighting and a bridge appear on the road.", "Camera shows the Alfa Romeo's grey rim and red interior.", "Camera reveals the rearview and exhaust of the red car."], "n_events": 8, "n_shots": 10, "n_subjects": 1}, "vid": 964}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "983.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The footage showcases gameplay featuring the character Jjonak from the game Overwatch. The player deftly backtracks, wields the sword, and then proceeds toward the adversaries through a tunnel. The character has a blue circle adorning its back. Simultaneously, the grey adversary in front of the yellow character fires red rays at the yellow character. Swiftly, the yellow character lunges forward, shattering the grey opponent's shield, thereby emitting a yellow glow. The yellow character then strikes the opponent against the wall near the open door, causing the opponent's defeat.", "text": "The video begins with a dark, indistinct scene that gradually brightens to reveal a first-person perspective of a player in a video game. The player is engaged in a firefight, shooting at enemies in a street environment with buildings and greenery. The player's weapon, a futuristic gun, is seen firing at various targets, including a character named 'Atronique' who is repeatedly identified in the on-screen text. The player continues to shoot, causing explosions and damage to the enemies, with the on-screen text indicating the player's actions and the names of the opponents. The player's weapon emits bright flashes and energy beams as it fires, and the enemies are seen taking damage and being defeated. The player moves forward, continuing to shoot at the remaining enemies, with the on-screen text showing the player's actions and the opponents' names. 
The video ends with the player still engaged in combat, with the on-screen text indicating the player's actions and the opponents' names.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 983, "extra_info": {"duration": 5.0, "events": ["Jjonak backtracks in the gameplay.", "Jjonak wields the sword.", "Jjonak moves through the tunnel.", "Blue circle adorns Jjonak's back.", "Grey adversary fires red rays at Jjonak.", "Jjonak lunges forward.", "Jjonak shatters the grey adversary's shield.", "Shield emits a yellow glow.", "Jjonak strikes the adversary against the wall.", "Grey adversary is defeated near the open door."], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 983}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "392.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man initially knelt on the mat spread out on the beach, then propped himself up with his hands to stand, and then turned to face the sea.", "text": "A man is kneeling on a beach towel on the sand, with the ocean and a few people in the background. He begins to push himself up from the towel, using his hands for support. As he stands up, he adjusts his shirt by pulling it down. Once fully standing, he turns around and starts to walk away from the towel. The man then turns back towards the towel and begins to kneel down again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 392, "extra_info": {"duration": 4.62, "events": ["The man knelt on the mat.", "The man propped himself up with his hands.", "The man stood up.", "The man turned to face the sea."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 392}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "287.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows a man with a briefcase on his back and a bunch of flowers in his left hand, walking towards a grave while making a phone call, as the camera pans from left to right in the cemetery.", "text": "The scene is set in a cemetery with several gravestones and a large tree in the background. A man in a suit, carrying a bag over his shoulder, walks through the cemetery while talking on the phone. He continues walking past various gravestones, including one with the name 'Finnegan' and another with the name 'Hodge'. The man approaches a gravestone with the name 'Finnegan' and stops in front of it. He then looks at the gravestone while still on the phone.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 287, "extra_info": {"duration": 4.88, "events": ["A man walks towards a grave.", "The man carries a briefcase on his back.", "The man holds a bunch of flowers in his left hand.", "The man makes a phone call."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 287}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "301.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A long-haired man was knocked down by another man wearing a white hat. Then, the man with the white hat opened the car door, got inside, and started the vehicle. The man who had been knocked down raised his hand, prompting the man in the white hat to stop the car. He then used the car door to hit the fallen man before driving off. The man lying on the ground got up from the ground.", "text": "A man in a gray jacket runs towards a black car in a parking lot with palm trees and a cloudy sky in the background. He turns around and runs back towards the car. The man in a white tank top and white cap opens the car door and gets inside. The man in the gray jacket reaches the car, places his hand on the window, and waves goodbye. The man in the white tank top starts the car and drives away. The man in the gray jacket watches the car as it leaves.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 301, "extra_info": {"duration": 13.25, "events": ["A man knocked down a long-haired man.", "The man in the white hat opened the car door.", "The man in the white hat got inside the car.", "The man in the white hat started the vehicle.", "The long-haired man raised his hand.", "The man in the white hat stopped the car.", "The man in the white hat hit the long-haired man with the car door.", "The man in the white hat drove off.", "The long-haired man got up from the ground."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 301}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "92.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A soldier is running forward holding a dog, while another soldier is sitting on the ground, watching them run past, panting heavily.", "text": "A man in a green military uniform is holding a brown and white dog in his arms, gently rocking it back and forth. The background is a dimly lit, reddish environment, possibly a trench or a dugout. The man then sits down on the ground, still holding the dog, and looks to his left. Another person in a similar uniform walks past him from left to right. The man then turns his head to face forward and begins to speak.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 92, "extra_info": {"duration": 4.5, "events": ["A soldier runs forward.", "The soldier holds a dog.", "Another soldier sits on the ground.", "The sitting soldier watches them run past.", "The sitting soldier pants heavily."], "n_events": 5, "n_shots": 2, "n_subjects": 3}, "vid": 92}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "421.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a target range devoid of vegetation, a warrior presses bullets into a magazine on the ground, then walks up to a rifle, inserts the magazine into the rifle, followed by getting down, pulling the bolt, aiming, and firing. The scene switches to the side of the warrior, where a row of warriors is firing with the same weapons and stance.", "text": "The video begins with a black screen displaying some text. A person in camouflage uniform is seen loading bullets into a magazine. The person then attaches the magazine to a rifle. The person, now wearing a cap and earmuffs, aims the rifle and starts shooting. The camera captures the rifle from different angles as it is being aimed and fired. The person continues to shoot while lying on the ground, with the rifle supported by a bipod. The video ends with a close-up of the person's hands loading bullets into the magazine.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 421, "extra_info": {"duration": 15.2, "events": ["Warrior presses bullets into a magazine", "Warrior walks up to a rifle", "Warrior inserts the magazine into the rifle", "Warrior gets down", "Warrior pulls the bolt", "Warrior aims", "Warrior fires", "A row of warriors fires weapons", "Warriors adopt the same stance"], "n_events": 9, "n_shots": 4, "n_subjects": 2}, "vid": 421}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "153.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman in white clothes is embracing a man in green clothes, as a priest holding a book says something to the woman in white. A server watches them, and the woman in white throws the man in green forward. The camera cuts to the front, where the man in green is thrown down, a stool overturned on the ground, while a little girl and a small dog from the other side of the wall are watching what is happening.", "text": "In a grand setting with a priest holding a book and a man in a yellow outfit holding a crown, a woman in a white dress with a heart design embraces a man in a green outfit with a crown. The woman then turns and runs away, causing the man to be thrown off his feet. The man lands on the ground, and a red pot with a face on it rolls towards him. The man looks at the pot and starts speaking to it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 153, "extra_info": {"duration": 2.5, "events": ["A woman in white clothes embraces a man in green clothes.", "A priest holding a book says something to the woman in white.", "A server watches the woman and man.", "The woman in white throws the man in green forward.", "The man in green is thrown down.", "A stool overturns on the ground.", "A little girl watches the event.", "A small dog watches the event."], "n_events": 8, "n_shots": 2, "n_subjects": 5}, "vid": 153}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "622.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the stable, a horse is bowing its head to eat the hay on the ground. The camera begins to show the movements of the horse's hooves and muzzle, then gradually pans up to reveal the entire head and mane of the horse. It then pans back down to the horse's legs and mouth. Finally, after the horse finishes eating the hay, it lifts its head away from the screen, leaving only the shot of the horse's legs as it ends.", "text": "A horse is seen eating hay from the ground in a stable, with its head lowered and mouth moving as it chews. The camera gradually moves upward, revealing more of the horse's head and neck. The horse continues to eat, and the camera captures its face and mane in close-up. The horse's head moves slightly as it eats, and the camera focuses on its face and the hay it is consuming. The camera then shifts downward, showing the horse's legs and the hay on the ground.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 622, "extra_info": {"duration": 21.88, "events": ["Horse bows head to eat hay.", "Horse's hooves move.", "Horse's muzzle moves.", "Horse's head and mane become visible.", "Horse's legs come into view.", "Horse's mouth eats hay.", "Horse lifts head away from ground.", "Horse's legs remain in the shot."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 622}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "376.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man walks forward to the bed and sits down, with the camera following his movement.", "text": "A man in a blue shirt is seen walking to the right in a room with light blue walls and a painting on the wall. He continues walking past a bed with a metal frame and a nightstand with a lamp and some items on it. The man approaches the bed and begins to sit down on it. He then sits on the bed, facing the nightstand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 376, "extra_info": {"duration": 5.5, "events": ["The man walks forward.", "The man sits down on the bed."], "n_events": 2, "n_shots": 1, "n_subjects": 1}, "vid": 376}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "103.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a bomb crater, a soldier with a blue helmet takes out a hand grenade and hands it to a soldier next to him in a green uniform. The soldier in green is aiming his gun forward and immediately takes the grenade and tosses it ahead. Then the blue helmet soldier lies down and takes aim forward with his gun, while the soldier in green covers his head with both hands and ducks down. Many soldiers are charging forward behind them. The scene switches to the opposing position, which is shrouded in smoke and deserted.", "text": "Two soldiers are lying in a trench, with one in blue and the other in green, both holding rifles. The soldier in blue turns his head and hands a hammer to the soldier in green. The soldier in green takes the hammer and throws it away. The soldier in green then lies down, covering his head with his arm. The soldier in blue continues to aim his rifle forward. In the background, a group of soldiers is seen running across the battlefield.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 103, "extra_info": {"duration": 4.25, "events": ["Soldier in a bomb crater takes out a hand grenade.", "Soldier with blue helmet hands grenade to soldier in green uniform.", "Soldier in green uniform tosses hand grenade ahead.", "Soldier with blue helmet lies down, takes aim with gun.", "Soldier in green uniform covers head, ducks down.", "Many soldiers charge forward."], "n_events": 6, "n_shots": 2, "n_subjects": 3}, "vid": 103}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "949.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features a blue barrel with a brown interior. Inside the barrel, an explosive fuse, which is yellow on the inside and orange on the outside, is shown. A small hole is present at the bottom of the barrel, and there's a transfer pipe through the hole. Inside the barrel, there is a metal piece hanging. The camera angle switches to show the exterior of the blue barrel placed among miscellaneous warehouse items. An imminent dust explosion is anticipated.", "text": "Two men wearing red helmets and work clothes are standing in a warehouse with metal shelves and cardboard boxes in the background, looking down at a blue barrel. The camera then focuses on the inside of the barrel, showing a circular object with a red center and a metal rod on top. The camera then shows the blue barrel from a distance, with the men still standing next to it. The men continue to look at the barrel.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 949, "extra_info": {"duration": 8.0, "events": ["A blue barrel with a brown interior appears.", "An explosive fuse burns inside the barrel.", "The fuse color transitions from yellow to orange.", "A small hole at the barrel's bottom is revealed.", "A transfer pipe runs through the barrel's hole.", "A metal piece hangs inside the barrel.", "The blue barrel is situated in a warehouse.", "Objects surround the blue barrel in the warehouse."], "n_events": 8, "n_shots": 3, "n_subjects": 2}, "vid": 949}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "117.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows a round button with red and white stripes. A hand appears and presses the button several times. The camera then cuts to a red-haired girl standing in front of a house. As the door opens, a small-headed, large-bodied officer emerges from a door just big enough for him to pass through, with a small baby hanging on his right chest.", "text": "A red and white striped button is shown on a pink background. A hand with a purple sleeve presses the button. The door, which has a red frame and a pink background, opens. A character dressed in a colorful outfit with a white hat and blue cloud-like decorations on their head appears at the door. The character, who has a small pink creature on their chest, turns to face the person with red hair standing outside.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 117, "extra_info": {"duration": 6.5, "events": ["A hand presses the round button multiple times.", "The red-haired girl stands in front of a house.", "The door opens.", "A small-headed officer emerges from the door.", "A small baby hangs on the officer's right chest."], "n_events": 5, "n_shots": 2, "n_subjects": 3}, "vid": 117}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "644.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A car passes by; the reflection in the car's windows shows two people walking past. After the car has completely passed by, a parking station on the road is revealed, and the front of a black sedan peeks out from behind a billboard, with other vehicles also parked on the street. At the end of the video, someone walks from right to left on the sidewalk.", "text": "The video begins with a blurred view of a bus stop, showing a map and a signboard in the background, with the 'video' logo overlaying the scene. The camera then moves to the right, revealing more of the bus stop and a person walking by. The scene transitions to a clearer view of a bus stop with a 'Welcome!' sign on the glass door, and a person is seen entering a car parked nearby. The camera remains focused on the bus stop and the car, with the 'Welcome!' sign clearly visible.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 644, "extra_info": {"duration": 9.8, "events": ["A car passes by.", "Two people walk past.", "The car reveals a parking station.", "A black sedan peeks from behind a billboard.", "Other vehicles are parked on the street.", "Someone walks from right to left on the sidewalk."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 644}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "309.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows the lush foliage of plants, with a grasshopper climbing up a leaf. The camera angle then switches to a low-angle shot, capturing a large flock of birds flying in the sky.", "text": "A grasshopper is seen climbing up a green leaf, moving steadily upward. The scene then transitions to a view of the sky, where a large flock of birds is flying in a coordinated manner, creating a dense, moving pattern against the blue sky.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 309, "extra_info": {"duration": 7.75, "events": ["Plants show lush foliage.", "Grasshopper climbs up a leaf.", "Birds fly in the sky."], "n_events": 3, "n_shots": 2, "n_subjects": 2}, "vid": 309}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "664.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the sky, an eagle is soaring. The camera follows it, first moving downward, then moving to the left to keep the eagle in the center of the frame.", "text": "A turkey vulture is seen flying in the sky with a backdrop of clouds and a house in the background. The camera follows the vulture as it descends towards a cliffside with trees and a house on top. The vulture continues to fly over the cliff, with the camera tracking its movement. The vulture flies over the edge of the cliff, with the Grand Canyon visible in the background. The vulture continues its flight, moving further away from the cliff and into the open sky.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 664, "extra_info": {"duration": 10.94, "events": ["An eagle is soaring in the sky.", "The camera follows the eagle.", "The camera moves downward.", "The camera moves to the left."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 664}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "126.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a spaceship, a female astronaut floats and activates the switch of a door in front of her. Once the door opens, the female astronaut's feet touch down onto the ground.", "text": "A woman in a blue uniform with a ponytail is seen pushing a large circular door with both hands in a room with a metallic and industrial design. She continues to push the door, gradually opening it. As the door opens further, she steps back slightly and uses her left hand to hold the door in place. She then uses both hands to push the door further open, revealing a control panel inside.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 126, "extra_info": {"duration": 3.25, "events": ["A female astronaut floats in a spaceship.", "The female astronaut activates a switch.", "A door opens in front of the female astronaut.", "The female astronaut's feet touch down onto the ground."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 126}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "688.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a forest, a beautiful lady looks towards the camera, then walks into the depths of the woods. As she walks, she looks back at the camera, walks for a while, then runs for a few steps before turning back to face the camera.", "text": "A woman in a light-colored sweater and jeans is in a forest with tall trees. She initially holds onto a tree with both hands, then releases the tree and starts to move away. She turns around, adjusts her hair with one hand, and begins to spin around. She continues to spin and walk, occasionally looking back and smiling. She then walks forward, adjusting her hair again, and continues to move through the forest.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 688, "extra_info": {"duration": 16.58, "events": ["A lady looks towards the camera.", "The lady walks into the woods.", "The lady looks back at the camera.", "The lady walks for a while.", "The lady runs for a few steps.", "The lady turns to face the camera."], "n_events": 6, "n_shots": 2, "n_subjects": 1}, "vid": 688}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "462.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a rural lane, a shirtless man is chasing after a dog, which is holding a rabbit in its mouth. Just before reaching the camera, the man is about to catch up with the dog, which turns its head back while still holding the rabbit in its jaws.", "text": "A shirtless man is running on a paved path surrounded by lush greenery, with a dog following closely behind him. The dog, wearing a purple outfit, is carrying a small animal in its mouth as it runs. The man continues to run, looking back at the dog, while the dog maintains its pace, holding the animal securely. The dog gets closer to the camera, still carrying the animal, as the man runs ahead. The dog, with the animal in its mouth, runs past the camera, showing a red tag on its ear. The man continues to run ahead, with the dog following closely behind.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 462, "extra_info": {"duration": 10.18, "events": ["A shirtless man chases a dog.", "A dog carries a rabbit in its mouth.", "The dog turns its head back.", "The shirtless man is about to catch the dog.", "The dog continues holding the rabbit.", "The shirtless man runs on a rural lane.", "The chase approaches the camera.", "The dog evades the shirtless man's grasp."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 462}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "968.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases a white candle, ensconced atop a brown table, whose flame suddenly extinguishes. The wick, however, continues to glow and burn for a few more seconds before turning black and leaving behind a thin trail of smoke.", "text": "A white candle is burning steadily against a plain background. The flame remains steady for a few moments. The flame starts to diminish and eventually goes out, leaving a small red ember at the wick. Smoke begins to rise from the extinguished wick, creating a swirling pattern in the air. The smoke continues to rise and spread, becoming more diffuse. The smoke starts to dissipate, and the wick remains red and glowing. The flame reappears, resuming its steady burn.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 968, "extra_info": {"duration": 5.0, "events": ["A white candle rests atop a brown table.", "The candle's flame suddenly extinguishes.", "The candle's wick continues to glow.", "The wick burns for a few more seconds.", "The wick turns black.", "A thin trail of smoke rises from the wick."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 968}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "217.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The doctor wearing glasses nodded, then took off his mask to speak, while the camera switched to another man glaring ahead with a woman standing with her mouth open.", "text": "A man in a white lab coat and tie is seen in a room with a window and cabinets in the background. He is wearing a face mask and safety goggles. The man looks up and begins to remove his face mask with his right hand, which is wearing a purple glove. He continues to take off the mask, revealing his face. The man then starts speaking while looking forward. The camera then cuts to another man in a white shirt, who is looking at a woman with her mouth open.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 217, "extra_info": {"duration": 4.12, "events": ["The doctor nodded.", "The doctor took off his mask.", "Another man glared ahead.", "A woman stood with mouth open."], "n_events": 4, "n_shots": 2, "n_subjects": 3}, "vid": 217}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "318.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video was shot on a dark night, with the camera pointed at the pitch-black sea ahead, then quickly panning left to capture a man wearing a red hat. The camera then switches to a seaside platform, where the man with the red hat is standing sideways at the edge. An airplane flies overhead, and the man watches as it passes before turning back to face the camera and walking forward, with the lights on the platform swaying.", "text": "A boat is seen moving on the water at night. The camera pans to reveal a man in a red beanie and blue shirt standing on the shore, looking towards the water. The scene shifts to a man in blue clothing walking on a pier at night, with a plane flying overhead. The man continues walking on the pier, occasionally looking back at the plane. He then turns around and walks back towards the camera, before turning around again and continuing to walk on the pier. The camera then cuts back to the man in the red beanie, who is still looking towards the water.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 318, "extra_info": {"duration": 11.5, "events": ["Camera captured pitch-black sea", "Camera panned to a man with a red hat", "Man with a red hat stood on a seaside platform", "An airplane flew overhead", "Man with a red hat watched the airplane", "Man with a red hat turned back to face the camera", "Man with a red hat walked forward", "Lights swayed on the platform"], "n_events": 8, "n_shots": 2, "n_subjects": 1}, "vid": 318}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "238.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man dressed in brown clothes sat at the edge of the table, and someone was walking around in a room behind him. Then the man turned his head, and around the tables sat others wearing the same clothes. Another man came up beside him and reached out to grab an apple from the table. The man dressed in brown clothes grabbed this man's hand, while a bald man sitting at the next table turned his head to watch everything.", "text": "A man in a gray shirt is sitting and looking forward, while another man in black walks from left to right in the background. The man in the gray shirt then turns his head to the right. The camera shifts to show the man in the gray shirt from the side, and he is seen holding a yellow ball. Another man in a gray suit approaches and places his hand on the table, while the man in the gray shirt looks at him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 238, "extra_info": {"duration": 7.5, "events": ["Man sat at the table's edge.", "Someone walked around in the room.", "Man turned his head.", "Others sat around the tables.", "Another man approached the table.", "Man reached out for an apple.", "Man in brown clothes grabbed the approaching man's hand.", "Bald man turned his head to watch."], "n_events": 8, "n_shots": 2, "n_subjects": 3}, "vid": 238}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "119.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A mermaid looks downward, then swims downward, the camera cuts to her catching up with a male mermaid, swimming forward by his side.", "text": "A woman with long blonde hair, wearing a pink and blue top with pearl accessories, looks down and speaks while adjusting her top. She then bends forward and starts to swim. The camera follows her as she swims through the water, revealing a man with a fish tail swimming alongside her. They both swim towards a glowing pink platform. The woman and the man continue to swim together, approaching the platform.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 119, "extra_info": {"duration": 4.25, "events": ["A mermaid looks downward.", "A mermaid swims downward.", "A mermaid catches up with a male mermaid.", "A mermaid swims by the male mermaid's side."], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 119}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "67.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A character wearing shorts gets their foot stuck in a door. Another character, dressed in black, throws a punch at the shorts-wearing character. The person in shorts twists their body against the door and simultaneously opens it. The character in black charges through the door and falls to the ground. As the door closes, the shorts-wearing character also falls to the floor, dropping something from their hand. A short person rushes over to snatch the item, but the individual in shorts quickly grabs it and stands up, holding the item high with one hand and steadying themselves on the ground with one foot. The short person jumps several times but is unable to reach it. The character in shorts throws a punch and sends the short person flying.", "text": "A boy with a shark fin on his back, wearing a white shirt and blue shorts, stands in front of a red door labeled '103' and points at a green object in his hand while talking. A man in black, standing to the left, reacts with surprise and fear, raising his hands and stepping back. The boy kicks the man, sending him flying through the air and crashing into the door. The boy then raises the green object above his head and throws it at the man, who is now on the ground. 
The man gets up and runs towards the boy, who is still holding the green object.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 67, "extra_info": {"duration": 6.5, "events": ["Character's foot gets stuck in door.", "Character in black throws punch.", "Character in shorts opens door twisting body.", "Character in black charges through door.", "Character in black falls to ground.", "Door closes, character in shorts falls.", "Character in shorts drops item from hand.", "Short person rushes to snatch item.", "Character in shorts grabs item and stands up.", "Character in shorts punches, sends short person flying."], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 67}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "465.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a green patch of ground, a man wearing jeans holds a jump rope in his hand. In front of him, there is a group of poodles. He starts to swing the jump rope and begins jumping, and the poodles follow suit. However, the first attempt fails as the rope catches on the man's legs. Then the man continues to swing the jump rope, and he and the poodles jump in unison, successfully completing two cycles of jumping. Around them, many people are filming the scene with their mobile phones.", "text": "A man stands on a green mat, holding a rope, and begins to jump rope with a group of small dogs dressed in colorful outfits. The dogs, initially standing close together, start to jump in unison as the man swings the rope. The man continues to jump rope, and the dogs follow his movements, jumping higher and higher. The background shows a crowd of onlookers, some of whom are taking photos or videos, and a decorative setup with orange and red elements. The man maintains his rhythm, and the dogs keep jumping, creating a lively and entertaining scene.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 465, "extra_info": {"duration": 6.35, "events": ["Man holds a jump rope.", "Poodles stand in front of the man.", "Man starts swinging the jump rope.", "Man begins jumping.", "Poodles attempt to jump.", "Rope catches on the man's legs.", "Man continues to swing the jump rope.", "Man and poodles jump together.", "They complete two cycles of jumping.", "People film the scene with mobile phones."], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 465}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "88.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a room filled with books and sundries, a blue cat walks from right to left, the camera moves from bottom to top, revealing multiple hammocks, with a red creature on the top layer swinging looking out through a round window.", "text": "In a room filled with stacks of books, a cello, and various musical instruments, a blue cat walks forward and turns its head to look around. The camera then moves upward, revealing a wall covered with floating sheets of paper. As the camera continues to move up, it shows a hammock hanging from the ceiling, with a pink creature sitting on it, looking out of a circular window.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 88, "extra_info": {"duration": 6.12, "events": ["A blue cat walks from right to left.", "The camera moves from bottom to top.", "A red creature swings in the top hammock.", "The red creature looks out through a round window."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 88}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "133.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A bald little boy sits on a platform, his right hand propping his cheek and his left hand resting on his knee. Snoopy, holding a food bowl, looks up at the sky, then throws the bowl away and runs off to the right. The camera cuts to a red house where Snoopy runs up, jumps onto the roof, bares his teeth for a moment, then sits down and puts on a flying cap, a red scarf, and goggles. Next, he extends his hands forward as if driving. Two plumes of black smoke emerge from behind the red house, which then moves forward and flies up out of the frame.", "text": "A bald boy in a yellow and black striped shirt sits on a small platform, looking at a dog holding a red object, with a wooden fence and autumn foliage in the background. The dog drops the red object, which transforms into a red box. The dog runs towards the box, jumps onto it, and sits on top. The dog puts on a green helmet and a red scarf, then puts on goggles. The dog looks around, and smoke starts to appear around the box. 
The dog, now wearing the helmet and goggles, looks to the left as the smoke continues to rise.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 133, "extra_info": {"duration": 9.5, "events": ["A bald little boy props his cheek with his right hand.", "A bald little boy rests his left hand on his knee.", "Snoopy looks up at the sky.", "Snoopy throws the food bowl away.", "Snoopy runs off to the right.", "Snoopy jumps onto the red house roof.", "Snoopy bares his teeth.", "Snoopy sits down on the roof.", "Snoopy puts on a flying cap, a red scarf, and goggles.", "The red house moves forward and flies up out of the frame."], "n_events": 10, "n_shots": 2, "n_subjects": 2}, "vid": 133}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "741.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Animated image, showing several red blood cells flowing smoothly in a blood vessel, followed by a sudden constriction in the vessel where the red blood cells become blocked.", "text": "The video shows a close-up view of a blood vessel with red blood cells flowing through it. The blood cells are seen moving in a continuous stream, with some cells appearing to change shape and position as they travel. The background is a soft, light-colored surface, likely representing the interior of a body. The blood cells are depicted in a detailed, 3D animation style, with a focus on their movement and interaction within the vessel.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 741, "extra_info": {"duration": 13.4, "events": ["Animated image appears", "Red blood cells flow in a blood vessel", "Blood vessel suddenly constricts", "Red blood cells become blocked"], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 741}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "593.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a yard. A man wearing an apron is holding a basin of water, splashing it toward the security guard sitting and sleeping next to him. The security guard is startled awake, stands up and looks down, then turns to the right, extends his right hand forward, and says something. Another man wearing a green hat, carrying a bucket, comes over and says something to the security guard. After sitting down, the security guard looks down at his water-soaked clothes, shakes his body, turns his head to look at the man and says something, then looks down at his own pants. The man with the green hat lifts the bucket and splashes the security guard with water. The security guard raises both hands, shaking his body.", "text": "A man in a checkered shirt and jeans, holding a metal basin, approaches a seated man in a uniform and pours water on him, causing the seated man to react and move. The man in the checkered shirt then runs into a building. A man wearing a green helmet and holding a bucket of water approaches the seated man and starts talking to him. The man in the green helmet then pours the water from the bucket onto the seated man, who reacts by moving and trying to avoid the water. 
The man in the checkered shirt reappears and pours water on the seated man again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 593, "extra_info": {"duration": 8.23, "events": ["Man splashes water on security guard", "Security guard stands up", "Security guard looks down", "Security guard turns right and extends hand", "Man in green hat approaches security guard", "Security guard sits down", "Security guard shakes body", "Security guard turns head towards man", "Man with green hat splashes security guard", "Security guard raises both hands"], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 593}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "921.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases a horse race. A racer, clad in yellow, is riding a horse originating from the eighth track, identified by the horse's yellow hooves. Despite the start of the race, the racer continues to maintain the last position.", "text": "The video shows a horse race at the People's Turf Plc, with eight horses and their jockeys lined up at the starting gate, ready to begin the race. As the gate opens, the horses burst forward, and the jockeys lean forward, urging their horses to gain speed. The horses continue to accelerate, with the jockeys maintaining their forward-leaning posture. The race progresses with the horses running at full speed, and the jockeys still leaning forward, trying to gain an advantage. The horses spread out slightly as they move further down the track, with the jockeys still focused on their horses. The background shows a crowd of spectators watching the race from the stands, and several vehicles, including an ambulance, are parked along the track. The race continues with the horses maintaining their speed, and the jockeys still leaning forward, while the crowd watches intently.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 921, "extra_info": {"duration": 5.0, "events": ["A horse race starts.", "Racer in yellow rides a horse.", "Horse's hooves touch the track.", "Racer maintains last position."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 921}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "502.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A lady has a glass bowl placed in front of her, with one hand on the left side of the frame adding seasoning to the bowl, and another hand on the right side pouring in clear water. Then, the lady stirs the seasoning in the bowl until it becomes a paste. After that, the lady takes a round wooden plate and places a cut cucumber in a raincoat style on it; she then pours the mixed seasoning over the cucumber. With chopsticks, she maneuvers the cucumber to coat it evenly with the seasoning, and finally, she picks up a piece of the cucumber and eats it.", "text": "A woman adds garlic, sugar, and chili powder to a bowl, followed by pouring hot oil over the mixture and stirring it. She then places a cucumber on a wooden plate, pours the spicy mixture over it, and uses chopsticks to mix and coat the cucumber. Finally, she picks up the cucumber with chopsticks and takes a bite.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 502, "extra_info": {"duration": 9.63, "events": ["Lady adds seasoning to the glass bowl.", "Hand pours clear water into the bowl.", "Lady stirs seasoning in bowl into paste.", "Lady places cut cucumber on wooden plate.", "Lady pours seasoning over the cucumber.", "Lady coats cucumber with seasoning using chopsticks.", "Lady eats a piece of seasoned cucumber."], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 502}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "505.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A little calico cat was running along the side of the road, and when it reached a courtyard, the little cat turned into it and came to a door. The cat pressed its nose against the screen door and then looked behind. There was a string around the cat's neck, which was being pulled by someone outside the frame of the camera.", "text": "A cat on a leash walks along a path bordered by greenery and a metal guardrail, moving steadily forward. The cat continues walking, passing by various plants and a signpost, and approaches a wooden structure with an 'OPEN' sign. The cat reaches the entrance of the wooden structure, climbs up the steps, and enters through a door. Inside, the cat looks around and then turns its head to face the camera.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 505, "extra_info": {"duration": 11.23, "events": ["Calico cat ran along the road.", "Calico cat turned into a courtyard.", "Calico cat reached a door.", "Calico cat pressed its nose against the screen door.", "Calico cat looked behind.", "String was around the cat's neck.", "Someone pulled the string."], "n_events": 7, "n_shots": 1, "n_subjects": 1, "old_duration": 11.23, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/105.mp4"}, "vid": 505}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "295.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man in white clothes is pressing down on another man, moving towards the house ahead, while the trunk of the car behind them slowly opens.", "text": "In a suburban neighborhood, a man and a woman are seen walking together with their arms around each other, moving towards a house with a mailbox in the foreground. The house has a green exterior with a window and a door, and there is a car parked in the driveway. The man and woman continue walking towards the house, and the man opens the door. They enter the house, and the man closes the door behind them.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 295, "extra_info": {"duration": 6.0, "events": ["A man presses down on another man.", "They move towards the house.", "The trunk of the car opens slowly."], "n_events": 3, "n_shots": 1, "n_subjects": 3}, "vid": 295}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "757.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person stands in the swimming pool, with both hands braced against the edge of the pool. With a forceful leap, they step onto the ground and finally stand up and walk forward out of the frame.", "text": "A woman, wearing a black swim cap and goggles, is in a swimming pool and reaches the edge of the pool. She submerges her head underwater and then resurfaces. She begins to push herself up from the pool edge, using her hands for support. She continues to rise, lifting her legs out of the water. She stands up fully and starts to walk out of the pool. The pool area is empty, with no other people visible.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 757, "extra_info": {"duration": 6.63, "events": ["Person stands in the swimming pool.", "Person braces hands against the pool edge.", "Person leaps forcefully.", "Person steps onto the ground.", "Person stands up.", "Person walks forward.", "Person exits the frame."], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 757}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "692.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows a scene where a man is leaning against a wall, his mouth moving as if talking or singing. The camera pans down from his head, gradually revealing the man's hands playing a guitar. Finally, the camera slowly moves up from the guitar, panning back towards the man's head.", "text": "A man is standing against a wall on a street with a building featuring balconies and windows in the background. He is wearing a dark coat and a scarf, and he begins to sing while looking to his left. The camera then moves down to show him playing an acoustic guitar. As he continues to play, a car drives by in the background. The camera then moves back up to focus on the man's face as he continues to sing.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 692, "extra_info": {"duration": 22.96, "events": ["The man leans against the wall.", "The man's mouth moves.", "The camera pans down from the man's head.", "The man's hands play the guitar.", "The camera moves up from the guitar."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 692}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "724.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera is filming underwater, a little girl stands in the pool, then she pinches her nose with both hands and squats into the water, and then slowly floats in the water.", "text": "A girl in a pink swimsuit with a colorful waistband is underwater in a swimming pool. She begins to swim forward, moving her arms and legs. She then stops and starts to blow bubbles from her mouth. The girl continues to blow bubbles while looking towards the camera. She then begins to swim forward again, moving her arms and legs.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 724, "extra_info": {"duration": 13.55, "events": ["Camera films underwater.", "Little girl stands in the pool.", "Little girl pinches her nose with both hands.", "Little girl squats into the water.", "Little girl slowly floats in the water."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 724}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "13.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two men close a window behind them, then a woman appears behind the window and smashes it with a hammer, causing the men to look at the shattered window and then turn to continue talking to each other.", "text": "Two men, one in a purple suit and the other in a blue shirt, are seen opening a large window from the outside. They look inside and notice a woman standing behind the glass. The men high-five each other and then turn to look at the woman. The woman, holding a hammer, smashes the glass, creating a large hole. The men react with surprise and concern as they watch the woman. The woman then throws the hammer away, and the men continue to look at her.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 13, "extra_info": {"duration": 5.67, "events": ["Two men close a window behind themselves.", "A woman appears behind the window.", "The woman smashes the window with a hammer.", "The window shatters.", "The two men look at the shattered window.", "The two men turn to continue talking."], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 13}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "463.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the snowy area in front of the house, a man came up behind a crouching child with a shovel in hand. He rapidly pushed the shovel along the ground until it reached under the child's feet, causing the child to fall and sit on the shovel. The man then ran forward, pushing the shovel with the seated child, turned at the roadside, and headed back.", "text": "In a snowy courtyard with a building marked '163' and a gate with a decorative iron fence, a man uses a shovel to push a child sitting in a sled. The man starts by shoveling snow and then begins to pull the sled with the child on it. The man continues to pull the sled, moving it across the snow-covered ground. The man walks further away from the building, pulling the sled with the child on it. The man and the sled move towards a more open area of the courtyard, with the child sitting comfortably in the sled. The man continues to pull the sled, moving it across the snow-covered ground. The man and the sled are now further away from the building, heading towards a more open area of the courtyard.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 463, "extra_info": {"duration": 9.28, "events": ["A man approached a crouching child from behind.", "The man readied a shovel in hand.", "Man rapidly pushed shovel along the ground.", "Shovel reached under the child's feet.", "Child fell and sat on the shovel.", "Man ran forward with the shovel.", "Seated child was pushed by the man.", "Man turned at the roadside.", "Man headed back with the child on shovel."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 463}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "865.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows eight swimmers competing in a two-hundred-meter race. The swimmer in lane number four outperforms the others, swimming faster and finishing first. This swimmer is wearing a white swimming cap. Trailing behind is the contestant in lane number five. The swimmer in lane number four completes the course in one minute and forty-two seconds. Representatives are stationed at the finish point, waiting for all swimmers to finish.", "text": "The video shows a swimming competition with multiple swimmers in lanes, swimming freestyle towards the finish line. The swimmers are in their respective lanes, marked by blue, yellow, and red lane dividers, and are making steady progress through the water. The background includes advertisements for various sponsors such as 'Microplus,' 'Fludra,' and 'LNS,' and the event is identified as the '2000 Frankfurt Marathon.' The timer at the bottom of the screen counts down from 1:35.9 to 1:41.3 as the swimmers approach the finish line.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 865, "extra_info": {"duration": 7.0, "events": ["Eight swimmers compete in a race.", "Swimmer in lane four outperforms the others.", "Swimmer in lane four finishes first.", "Swimmer in lane five trails behind.", "Swimmer in lane four completes the course.", "Representatives wait at the finish point.", "All swimmers finish the race."], "n_events": 7, "n_shots": 1, "n_subjects": 8}, "vid": 865}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "742.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The boy sits on the ground holding a toy, stands up with one hand supporting him on the floor, walks towards the camera, and then turns his head to look to his right.", "text": "A child is sitting on the floor in a living room, playing with a yellow toy block. The child looks up and then starts to crawl forward while holding the toy block. The child stands up, still holding the toy block, and begins to walk forward. The child continues walking towards the camera, smiling and holding the toy block. The child gets closer to the camera, and the camera follows the child's movement. The child turns to the side, and the camera captures the child from the side.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 742, "extra_info": {"duration": 5.62, "events": ["The boy sits on the ground with a toy.", "The boy stands up supporting himself with one hand.", "The boy walks towards the camera.", "The boy turns his head to look right."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 742}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "90.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A soldier in blue attire with a large beard and another soldier with a green helmet hold their guns and look forward. Opposite them, two soldiers wearing gas masks also hold their guns aimed at them. The camera shifts back to the soldier in blue with the large beard; he tilts his head slightly. Then the camera switches to a low-angle shot looking up at the two soldiers with gas masks as they gaze at a dog on the other side. Both of them crouch down, put down their guns, and prepare to surrender.", "text": "Two soldiers, one in a green uniform and the other in a blue uniform, aim their guns forward in a trench with sandbags in the background. The scene then shifts to two soldiers in green uniforms and gas masks, also aiming their guns forward in a wooden structure. The camera focuses on the bearded soldier in the blue uniform, who looks ahead and then turns his head to the side. The two soldiers in gas masks lower their guns and begin to crouch down. The bearded soldier in the blue uniform also lowers his gun and looks ahead.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 90, "extra_info": {"duration": 9.12, "events": ["A soldier in blue attire holds a gun.", "Another soldier with a green helmet holds a gun.", "Two soldiers wearing gas masks aim their guns.", "Soldier in blue attire tilts his head.", "Two soldiers with gas masks gaze at the dog.", "Two soldiers with gas masks crouch down.", "Two soldiers with gas masks put down their guns.", "Two soldiers with gas masks prepare to surrender."], "n_events": 8, "n_shots": 4, "n_subjects": 3}, "vid": 90}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "300.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, the silhouette of a man sits motionless atop a wall, the camera angle switches to the man's side profile, and the man lifts his head to look forward.", "text": "A man is sitting on a concrete wall in an outdoor area with a tree and some buildings in the background. The camera gradually zooms in on the man, who is sitting with his arms wrapped around his knees and his head resting on his arms. There is a water bottle and a sandwich placed on the wall next to him. The man then lifts his head and looks forward. The camera then switches back to a wider shot of the man sitting on the wall.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 300, "extra_info": {"duration": 7.62, "events": ["The silhouette of a man sits atop a wall.", "The man lifts his head."], "n_events": 2, "n_shots": 2, "n_subjects": 1}, "vid": 300}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "280.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera angle starts from below and pans up to capture a lady's face, then the lady turns her head upwards, and the camera switches to a wide shot, revealing that the lady is participating in a figure skating competition, with judges and spectators watching in the background.", "text": "A woman with her hair tied back and secured with a hairpin is seen in a close-up shot, wearing a red dress adorned with colorful sequins. She gradually lowers her head, looking downwards. She then begins to spin on the ice rink, extending her arms gracefully. The background shows an ice rink with a banner that reads 'GLAND AMCHAMPIONSHIPS' and several spectators seated in the stands.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 280, "extra_info": {"duration": 6.88, "events": ["Camera pans up to capture a lady's face", "Lady turns her head upwards", "Camera reveals a figure skating competition", "Judges watch the performance", "Spectators observe from the background"], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 280}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "169.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In front of a fish tank, a snake slowly lifts its head, looking left and right. It gazes at the picture of mountains illuminated by the light ahead, then buries its head back into its body.", "text": "A blue snake is coiled on a rock in front of a glass enclosure containing various colorful creatures. The snake begins to rise, revealing its head and eyes. The snake's eyes widen as it looks at a display in the background, which shows a vibrant scene with mountains, trees, and several colorful creatures. The snake's eyes show a range of emotions, from curiosity to surprise, as it continues to observe the display. The snake then turns its head to the side and lowers its head back into its coiled position.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 169, "extra_info": {"duration": 5.88, "events": ["Snake lifts its head.", "Snake looks left.", "Snake looks right.", "Snake gazes at the picture.", "Light illuminates the mountains.", "Snake buries its head into its body."], "n_events": 6, "n_shots": 3, "n_subjects": 1}, "vid": 169}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "207.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man and a woman converse, and then the woman closes her eyes and prepares to kiss the man.", "text": "A man in a suit and tie is standing in a busy indoor area, looking at a woman with long blonde hair. The woman, also in a suit, looks down and then raises her head to speak to the man. She then moves closer to him. The background is filled with people, some of whom are engaged in conversations or activities.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 207, "extra_info": {"duration": 4.42, "events": ["A man and a woman converse.", "The woman closes her eyes.", "The woman prepares to kiss the man."], "n_events": 3, "n_shots": 2, "n_subjects": 3}, "vid": 207}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "93.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two soldiers are holding guns aimed forward, with a dog by their side. Opposite them, three surrendering soldiers raise their hands high, slowly walking through the intersection of two trenches. The two soldiers slowly turn, continuously aiming their guns at the three surrendering soldiers.", "text": "In a dimly lit, enclosed area with wooden planks and makeshift structures, a man in a blue outfit and cap is seen aiming and firing a gun. A man in a yellow outfit, also holding a gun, follows closely behind the man in blue. Four soldiers, dressed in green uniforms and helmets, raise their hands and move forward in a defensive posture. The man in blue continues to aim and fire his gun, while the man in yellow maintains his position behind him. The soldiers keep their hands raised and move cautiously, seemingly in response to the gunfire.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 93, "extra_info": {"duration": 3.75, "events": ["Two soldiers aim guns forward", "A dog stands by the soldiers' side", "Three soldiers raise hands high", "Surrendering soldiers walk through the intersection", "Two soldiers turn slowly", "Two soldiers continuously aim at the surrendering soldiers"], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 93}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "793.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man is wearing headphones, dancing with his arms spread out, moving back and forth, walking forward, and dancing face to face with another man. They high-five and then hug each other, with other people dancing around them.", "text": "A man in a suit and tie, wearing headphones, dances energetically on a balcony with a scenic view of the sea and boats in the background. He moves his arms and body rhythmically, while people around him record his dance with their phones. The man continues to dance, moving closer to the railing, and a woman bends down to pick up a bottle from the ground. The man then dances towards a man in black, who joins in the dance, and they high-five each other. They then embrace each other warmly.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 793, "extra_info": {"duration": 24.46, "events": ["The man dances with arms spread out.", "The man moves back and forth.", "The man walks forward.", "Two men dance face to face.", "The men high-five each other.", "The men hug each other.", "Other people dance around them."], "n_events": 7, "n_shots": 1, "n_subjects": 3}, "vid": 793}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "907.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video presents a blue-gray voltage tester. The tester is inserted into a socket located on a white wall, upon which it instantly flashes a red light and produces a sound.", "text": "A hand holds a white and blue electric toothbrush in front of a wall socket with a green cord plugged into it. The hand presses a blue button on the toothbrush, causing the button to light up. The hand then moves the toothbrush towards the wall socket and plugs it in. The toothbrush's tip lights up red, indicating it is powered on.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 907, "extra_info": {"duration": 6.97, "events": ["The voltage tester is presented.", "The voltage tester is inserted into a socket.", "The voltage tester flashes a red light.", "The voltage tester produces a sound."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 907}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "519.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman struck a martial arts pose in front of a strength tester, then spun her body around, lifted her left foot, suspended herself in midair, and prepared to kick the strength tester with her right foot. Someone beside her admired the woman's movements.", "text": "In an indoor arcade with various gaming machines and a '4SK' sign in the background, a woman in a beige top and striped pants stands in front of a punching machine, preparing to punch it. A person in black clothing walks past her from right to left. The woman in beige starts to move her body and prepares to punch the machine. She then jumps and kicks the punching machine with her right leg. After the kick, she lands and continues to move her body.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 519, "extra_info": {"duration": 3.4, "events": ["A woman struck a martial arts pose.", "The woman spun her body around.", "The woman lifted her left foot.", "The woman suspended herself in midair.", "The woman prepared to kick the strength tester.", "Someone admired the woman's movements."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 519}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "780.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The white smoke moves quickly, with two holographic images of airplanes appearing in succession and flying to the left and right, while the image of the Earth in the center continues to rotate.", "text": "The video begins with a view of a glowing Earth surrounded by stars and clouds. A blue airplane appears from the bottom of the frame, flying towards the Earth. The airplane continues to move across the screen, passing in front of the Earth. The airplane's path is highlighted by a bright light as it moves. The airplane exits the frame, leaving the glowing Earth and clouds in the background. The airplane reappears from the bottom of the frame, flying towards the Earth again. The airplane continues its journey, passing in front of the Earth once more. The airplane exits the frame, and the glowing Earth remains in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 780, "extra_info": {"duration": 10.0, "events": ["White smoke moves quickly.", "First holographic airplane image appears.", "Second holographic airplane image appears.", "First airplane image flies to the left.", "Second airplane image flies to the right.", "Image of Earth continues to rotate."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 780}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "531.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Three little puppies chased forward. The two on the sides tumbled down headfirst from the steps. The camera switched to the three little ones running in a line on the stairs, with the two at the back falling down. The camera then switched again, showing them running side by side, with the two on the left falling down, and the other one also falling down after running a distance.", "text": "Three puppies are running along a sidewalk next to a brick wall and a bush, with a car parked in the background. They continue running and start to lose their balance, with one puppy falling off the ledge. The scene then shifts to the puppies running along a different ledge, with one puppy falling off. Finally, the puppies are seen running down a set of steps, with one puppy falling off the edge. The video ends with a black screen.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 531, "extra_info": {"duration": 10.27, "events": ["Three little puppies chased forward.", "Two puppies tumbled down from the steps.", "Three puppies ran in a line on stairs.", "Two back puppies fell down.", "Puppies ran side by side.", "Two left puppies fell down.", "Other puppy fell down after running a distance."], "n_events": 7, "n_shots": 3, "n_subjects": 3}, "vid": 531}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "350.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the sky, an airplane spirals upward as it flies.", "text": "A small airplane is seen flying over a dark, expansive body of water. The camera gradually zooms in on the airplane as it continues its flight. The airplane ascends and begins to perform a series of maneuvers, including a roll and a turn. The background transitions from the dark water to a sky with scattered clouds as the airplane continues its flight. The airplane completes its maneuvers and continues to fly steadily in the clear sky.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 350, "extra_info": {"duration": 6.38, "events": ["An airplane spirals upward in the sky."], "n_events": 1, "n_shots": 1, "n_subjects": 1}, "vid": 350}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "486.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a beach, some people in swimwear gathered around as a man clutching a skateboard charged forward. The man stepped with one foot onto a dirt mound and leaped like a fish, flying through the air for a distance before placing the skateboard under him and landing on a slide. Lying on the skateboard, he slid down the track towards point X, continuing until he reached the water. The man spread his arms and kept gliding forward in the water, heading towards a makeshift gate constructed with three white sticks. In front of this gate, there was another, slightly smaller makeshift gate.", "text": "A group of people stand on a sandy beach, watching a man holding a blue board run towards a white tarp. The man jumps onto the tarp, holding the board, and slides down a sandy slope towards the water. He continues sliding down the slope, gaining speed as he approaches the water. The man reaches the water and is pulled by a boat, creating a wake as he moves across the water.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 486, "extra_info": {"duration": 8.67, "events": ["Man holding skateboard charges forward.", "Man steps onto a dirt mound.", "Man leaps through the air.", "Man places skateboard under himself.", "Man lands on a slide.", "Man slides down the track.", "Man reaches the water.", "Man spreads arms and glides in water.", "Man heads towards a makeshift gate.", "Another smaller makeshift gate stands in front."], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 486}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "241.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two children fell from the sky towards the ground, flailing their limbs in the air, passing through the treetops, and landing in the water near a waterfall, splashing water around.", "text": "A girl in a gray dress is seen flying through the air with her arms outstretched, surrounded by large, colorful clouds against a lush green forest background. As she continues to fly, the camera follows her descent, showing her moving further away from the camera and towards a dense forest. The scene transitions to a view of the forest canopy, where the girl is seen falling through the trees. The camera then captures her descent from above, showing her falling towards a serene lake with a waterfall in the background. Finally, the girl splashes into the lake, creating a large splash.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 241, "extra_info": {"duration": 6.75, "events": ["Two children fell from the sky.", "Children flailed their limbs in the air.", "They passed through the treetops.", "Children landed in the water.", "They splashed water around near a waterfall."], "n_events": 5, "n_shots": 3, "n_subjects": 2}, "vid": 241}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "489.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man stands in the corner of a pool, holding a long iron pair of tongs. He leans forward and swiftly extends the tongs into the water, gripping a crocodile. The crocodile thrashes from side to side in a struggle, but the man flips it onto the ground and continues to hold it with the tongs. He also comes down to the ground, and as the crocodile rolls a few times unable to break free from the tongs, the man seizes the crocodile by its tail.", "text": "A man stands in a shallow pool, holding a stick and reaching into the water. He grabs a crocodile with the stick, causing the crocodile to splash and struggle. The man pulls the crocodile out of the water and onto the concrete edge. He then drags the crocodile onto the ground, continuing to hold it with the stick.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 489, "extra_info": {"duration": 9.35, "events": ["A man stands in the corner of a pool.", "The man leans forward.", "The man extends the tongs into the water.", "The tongs grip a crocodile.", "The crocodile thrashes from side to side.", "The man flips the crocodile onto the ground.", "The man holds the crocodile with the tongs.", "The man comes down to the ground.", "The crocodile rolls a few times.", "The man seizes the crocodile by its tail."], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 489}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "158.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "At night, a mouse was scurrying across a road when a bus with its headlights on approached from the distance. The headlights shone on the mouse, which ran across the road in panic. As the bus passed by the camera, the shot switched to the inside of the bus. Inside, the passengers were sleeping. The camera panned from left to right inside the bus, and finally focused on two front-row seats on the right, where a brown-haired man was sleeping and a blond-haired man was awake, looking out of the window. Then he turned his head forward.", "text": "In a desert landscape at night, a mouse appears on the road under a full moon. A bus with headlights on drives towards the mouse. The bus passes over the mouse, which is hit by the headlights. The scene shifts to the interior of the bus, showing passengers sleeping. The camera focuses on two passengers, one with brown hair and the other with yellow hair, who are sleeping. The yellow-haired passenger turns his head to look at the brown-haired passenger.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 158, "extra_info": {"duration": 11.0, "events": ["A mouse scurried across a road at night.", "A bus approached with headlights on.", "Headlights shone on the mouse.", "The mouse ran across the road in panic.", "The bus passed by the camera.", "Passengers inside the bus were sleeping.", "Camera panned inside the bus.", "A brown-haired man was sleeping in a front-row seat.", "A blond-haired man looked out of the window.", "The blond-haired man turned his head forward."], "n_events": 10, "n_shots": 2, "n_subjects": 4}, "vid": 158}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "807.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video captures a football match taking place on a green field between two teams. One team is dressed in yellow attire with matching yellow socks, and the other team sports green clothes complimented with green socks. The goalkeeper from the yellow team intercepts the ball, thereby defending the goal. The green team's goal is under assault but manages to prevent a score, thanks to a well-placed block by a player wearing a green jersey. Ultimately, no points are accrued by either team.", "text": "A soccer match is taking place on a green field with a goalpost and a brick wall in the background. The goalkeeper in blue dives to the ground as the ball is kicked towards the goal. The camera then shifts to a wider view of the field, showing players in yellow and green jerseys moving the ball towards the goal. The ball is passed among the yellow team players as they advance towards the goal. A yellow team player takes a shot at the goal, and the goalkeeper in red dives to save it but misses. The ball goes into the net, and a yellow team player falls to the ground in celebration. The scene then cuts back to the initial moment where the goalkeeper in blue is on the ground, and the ball is near the goalpost.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 807, "extra_info": {"duration": 10.0, "events": ["Football match takes place on field.", "Team in yellow attire intercepts the ball.", "Goalkeeper from yellow team defends goal.", "Green team's goal faces assault.", "Player in green jersey blocks the ball."], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 807}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "333.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera shoots from inside the car towards the front, through the car window glass we see a man in a white tank top standing in front of the car. The driver shifts gears, the camera then switches to the perspective of the man in the white tank top outside the vehicle, looking at the man and woman in the driver and passenger seats. The camera angle switches again towards the direction of the car but focuses on the woman's side outside the vehicle. The man in the white tank top is leaning on a stick, tilting his head to look inside; he watches for a while, then shows a faint smile. Next, the camera switches back to the perspective of the man in the white tank top, as the woman in the passenger seat pushes the door open, preparing to get out of the car.", "text": "A man stands outside a car holding a tool, looking at the occupants inside the car. The man inside the car, who is driving, turns the steering wheel and gives a thumbs-up gesture. The man outside the car, still holding the tool, smiles and looks down. The woman in the passenger seat turns her head to look at the man driving, then turns back to face forward. The man inside the car opens the door and prepares to exit.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 333, "extra_info": {"duration": 7.62, "events": ["Man in white tank top stands in front of the car.", "Driver shifts gears.", "Man in white tank top leans on a stick.", "Man in white tank top tilts his head to look inside.", "Man in white tank top shows a faint smile.", "Woman in passenger seat pushes the door open."], "n_events": 6, "n_shots": 4, "n_subjects": 3}, "vid": 333}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "886.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows a person, clothed in a brown suit coat and a red tie, sitting on a train seat and appearing drowsy. They raise their right hand and rub their right eye. Suddenly, they observe a child, clad in white and sporting short hair, walking past the window outside the train. Seeing this, the person has the urge to disembark the train to locate this child.", "text": "A man in a suit and red tie is sleeping on a train, with another man sitting beside him. A person walks past the sleeping man, causing him to wake up and rub his eyes. The man continues to rub his eyes as another person walks by. The camera then shows the view outside the train window, where a woman with a suitcase and a man with a skateboard walk by. The man on the train looks out the window and then lowers his hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 886, "extra_info": {"duration": 9.97, "events": ["Person sits on train seat.", "Person appears drowsy.", "Person raises right hand.", "Person rubs right eye.", "Child walks past train window.", "Person observes the child.", "Person feels urge to disembark.", "Person attempts to locate child."], "n_events": 8, "n_shots": 3, "n_subjects": 4}, "vid": 886}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "899.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts a game character named iUseNoMouse, represented by green health points, attacking a tower. Suddenly, a character with red health points charges forward, striking iUseNoMouse. Just before being attacked, iUseNoMouse collects a life capsule under the turret and escapes towards his birthpoint.", "text": "In a game environment, a character named 'Hulahhun' is seen attacking a turret, causing it to emit a red laser beam. The turret's health decreases as it is hit, and it eventually explodes, creating a large fireball. The character continues to move around the turret, avoiding the fireball and other attacks. The character then moves towards a nearby water body, passing by other characters and structures. The character reaches the water's edge, with the turret's remains visible in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 899, "extra_info": {"duration": 5.0, "events": ["iUseNoMouse attacks a tower", "Character with red health points charges forward", "Character with red health points strikes iUseNoMouse", "iUseNoMouse collects a life capsule", "iUseNoMouse escapes towards his birthpoint"], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 899}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "191.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man leaned against the wall observing his surroundings and took deep breaths, then relaxed and looked down, when suddenly a pair of hands grabbed and choked the man, pulling him up.", "text": "A man wearing a red hoodie with the word 'CLOSED' written on it stands against a brick wall background, looking around with a concerned expression. He then starts speaking while maintaining his stance. As he continues to talk, two black hands appear from the top of the frame and grab him. The man is lifted off the ground by the hands.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 191, "extra_info": {"duration": 4.38, "events": ["The man leaned against the wall.", "The man observed his surroundings.", "The man took deep breaths.", "The man relaxed.", "The man looked down.", "A pair of hands grabbed the man.", "The hands choked the man.", "The hands pulled the man up."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 191}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "885.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts two stainless steel pots on a heated black stove. One pot is covered and boiling, while the other is filled with boiling water containing four ears of corn. A person uses tongs to turn the corn, then sets the tongs aside, picks up the lid, and secures it on the pot.", "text": "A person is seen holding a pair of tongs and placing three ears of corn into a pot of boiling water on a stove. The person adjusts the corn in the pot to ensure they are evenly submerged. The person then places a lid on the pot and waits for the corn to boil. After a while, the person removes the lid from the pot.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 885, "extra_info": {"duration": 6.0, "events": ["A stove heats two stainless steel pots.", "One pot covered and boiling.", "Boiling water contains four ears of corn in the other pot.", "A person turns the corn with tongs.", "The person sets the tongs aside.", "The person picks up the lid.", "The person secures the lid on the pot with corn."], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 885}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "342.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video displayed a photograph featuring a woman, a man, and three children. The camera then switched to a room where two male staff members were helping a male astronaut put on his spacesuit, with another staff member also working nearby.", "text": "The video begins with a black-and-white photo of a family at a playground, where a girl is swinging, a boy is standing on a swing, and two adults and two other children are posing for the camera. The scene then transitions to a modern setting where a man in a spacesuit is being assisted by two other men in white uniforms with NASA logos. The man in the spacesuit is having his suit secured by the two men, who are adjusting and fastening various parts of the suit.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 342, "extra_info": {"duration": 6.62, "events": ["Photograph displayed a woman, a man, and three children.", "Camera switched to a room.", "Two male staff members helped a male astronaut.", "A male astronaut put on his spacesuit.", "Another staff member worked nearby."], "n_events": 5, "n_shots": 2, "n_subjects": 4}, "vid": 342}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "273.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man and a woman are in a room, looking at each other. The camera cuts to the woman as she cries in distress and then lowers her head. The man watches her and then takes something out of the refrigerator next to him and leaves the room, with the woman covering her face with her hands.", "text": "A man and a woman are standing close to each other in a dimly lit room with patterned wallpaper and a window in the background. The woman looks at the man, who is wearing a green shirt, and then lowers her head. The man, who has a mustache, moves closer to the woman and kisses her. After the kiss, the man turns and walks towards the door, while the woman, still looking down, covers her face with her hands. The room has a dining table with plates and cups, and the man exits through the door, leaving the woman standing by the refrigerator.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 273, "extra_info": {"duration": 14.0, "events": ["A man and a woman look at each other.", "The woman cries in distress.", "The woman lowers her head.", "The man takes something out of the refrigerator.", "The man leaves the room.", "The woman covers her face with her hands."], "n_events": 6, "n_shots": 3, "n_subjects": 2}, "vid": 273}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "588.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man is wearing a red hat and a black coat. He kneels on one knee and firmly grips an axe, chopping at the tree in front of him. Then, he looks up, stands up, and steps back as the tree falls to the ground. A video appears in the bottom right corner of the screen.", "text": "In a dense forest, a man wearing an orange helmet and black clothing is seen kneeling beside a large tree, using an axe to chop at its base. He continues to chop the tree, creating a visible notch in the trunk. The man then looks up at the tree, preparing for the next strike. He swings the axe and successfully splits the tree, causing it to fall. The tree falls to the ground, and the man is seen in the background as the tree collapses. The video ends with a scene from a game, showing a character in a forest environment.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 588, "extra_info": {"duration": 9.6, "events": ["The man kneels on one knee.", "The man grips an axe.", "The man chops at the tree.", "The man looks up.", "The man stands up.", "The man steps back.", "The tree falls to the ground."], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 588}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "242.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman is lying on the bed, stroking her pet, then the scene switches to another girl who takes out a small booklet from the bedside table, casually closes the drawer, and tilts her head to look at the booklet.", "text": "A woman, lying on a bed with a patterned blanket, reaches out to pet a small dog wearing a red sweater with a bow. She then turns her head to look forward. The scene transitions to the woman sitting up in bed, reaching for a letter on a bedside table. She picks up the letter and looks at it, then glances upwards.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 242, "extra_info": {"duration": 5.88, "events": ["A woman lies on the bed.", "A woman strokes her pet.", "Another girl takes out a small booklet.", "Another girl closes the bedside table drawer.", "Another girl tilts her head to look at the booklet."], "n_events": 5, "n_shots": 2, "n_subjects": 3}, "vid": 242}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "484.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two black boys are competing in a game. They start by pushing a soccer ball from one end of a long red table toward the other end, where there is a hole with a net hanging underneath it. The soccer balls fall into the hole one after the other. The boy on the left runs out first, while the boy on the right, upon seeing the soccer ball go into the hole, gives a clap before running forward. They both reach the yellow table, with the left boy picking up three white balls and the right boy picking up three green balls. Starting from behind a line, they each push their balls forward. There are six holes at the far end of the table, and the balls from both sides roll past the holes, hit a bumper, and roll back toward the boys.", "text": "Two boys, one in an orange shirt and the other in a red shirt, stand at red tables with soccer balls in their hands, preparing to throw them into the nets. They throw the soccer balls into the nets simultaneously. The boy in the red shirt claps and runs towards the camera, while the boy in the orange shirt walks away. The camera then shifts to show the boys playing a game of table soccer with white and green balls on a yellow table. The boy in the orange shirt throws a white ball, and the boy in the red shirt throws a green ball. 
The white ball rolls into the net, and the green ball rolls towards the other side of the table.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 484, "extra_info": {"duration": 10.2, "events": ["Two boys push a soccer ball across a table", "Soccer balls fall into a hole with a net", "Boy on the left runs out of balls first", "Boy on the right claps after scoring", "Both boys run to the yellow table", "Boy on the left picks up three white balls", "Boy on the right picks up three green balls", "Both boys push balls from behind a line", "Balls roll past the holes at the table's end", "Balls hit a bumper and roll back toward the boys"], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 484}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "955.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features a person with long hair tied back using a blue headdress cord, wearing blue and white striped pajamas. The setting is a bedroom with a gray bed featuring a gray headboard and white bedding. The bed is flanked by gray lights on either side. Firstly, the person places 4 gray pillows on the ground to the left of the bed. Next, the person moves to the right side of the bed to continue repositioning more pillows on the ground. Subsequently, a gray quilt is spread out over the bed. Lastly, the 4 pillows initially set on the ground are divided into two piles and stacked neatly.", "text": "In a bedroom with a large bed, a woman in blue pajamas starts by removing the pillows from the bed and placing them on the floor. She then begins to adjust the bedspread, smoothing it out and arranging it neatly. After ensuring the bedspread is properly arranged, she picks up the pillows and places them at the head of the bed. She continues to adjust the pillows and the bedspread, making sure everything is in place. Finally, she steps back to inspect her work, ensuring the bed is neatly made.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 955, "extra_info": {"duration": 8.0, "events": ["Person places 4 gray pillows on the ground.", "Person moves to the right side of the bed.", "Person repositions more pillows on the ground.", "Person spreads a gray quilt over the bed.", "Person stacks 2 piles of pillows neatly."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 955}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "402.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera focuses on a woman's profile, with one hand displaying a bottle of makeup in front of the lens. Then another hand taps the bottle with a small brush, after which foundation is applied to the lady's forehead. The camera switches to a side view, continuing to apply makeup to the lady's face.", "text": "A woman is seated with a red background behind her. A hand holding a bottle of foundation appears and begins to apply the foundation to her face using a brush. The brush is moved across her forehead, cheeks, and chin, spreading the foundation evenly. The hand continues to apply the foundation to her forehead, cheeks, and chin, ensuring an even coverage. The brush is then used to blend the foundation on her cheeks and under her eyes. The hand continues to apply the foundation to her forehead, cheeks, and chin, ensuring an even coverage.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 402, "extra_info": {"duration": 15.03, "events": ["The camera focuses on a woman's profile.", "A hand displays a bottle of makeup in front of the lens.", "Another hand taps the bottle with a small brush.", "Foundation is applied to the woman's forehead.", "Makeup is applied to the woman's face."], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 402}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "603.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The setting is a bar. A waiter walks towards the camera, places down two glasses of drinks, then turns and leaves. The two people sitting opposite each other at the table reach out to pick up the glasses, drink them down, and then put them back. The camera remains fixed.", "text": "The video begins with a blurred view of a bar counter with various bottles in the background. A bartender in a white shirt approaches the counter and places two drinks, one with an orange slice and the other with a green garnish, on the counter. The bartender then steps back, and two customers, one with a watch and the other with black nail polish, reach for the drinks. The customers pick up their respective drinks and take a sip. The bartender returns to the counter, and the customers continue to drink their beverages.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 603, "extra_info": {"duration": 24.72, "events": ["Waiter walks towards the camera", "Waiter places down two glasses", "Waiter turns", "Waiter leaves", "First person reaches out to pick up a glass", "Second person reaches out to pick up a glass", "First person drinks the drink", "Second person drinks the drink", "First person puts the glass back", "Second person puts the glass back"], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 603}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "494.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In an open area, a truck is parked with two rows of people standing perpendicular to the truck, leaving a long passage between the truck door and the camera. In front of the camera, there is a soccer ball placed. A person takes a few steps to get a run-up and kicks the soccer ball, which flies towards the truck door, passes through the door, and then flies out from the door on the other side of the truck.", "text": "In an outdoor setting with a clear sky, a man in a blue jacket and black pants stands on a dirt ground, preparing to kick a soccer ball. He runs towards the ball, which is placed on the ground in front of him. The man kicks the ball, sending it forward. After kicking, he continues to run forward, following the ball's trajectory. The ball rolls away, and the man slows down and comes to a stop. He then turns his head to look back at the ball. In the background, a group of people, including children in red uniforms, stand in two lines, watching the man.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 494, "extra_info": {"duration": 5.15, "events": ["A truck is parked in an open area", "Two rows of people stand perpendicular to the truck", "A long passage is left between the truck door and the camera", "A soccer ball is placed in front of the camera", "A person takes a few steps to get a run-up", "The person kicks the soccer ball", "The soccer ball flies towards the truck door", "The soccer ball passes through the truck door", "The soccer ball flies out from the door on the other side of the truck"], "n_events": 9, "n_shots": 1, "n_subjects": 3}, "vid": 494}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "549.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman dressed in a floral skirt held a cane, took off her glasses to glance at the man beside her, then put her glasses back on and tapped the man with them before continuing to walk forward. The man suddenly stood up straight, touching his butt, then turned his head to look at the woman, and then moved forward, extending his right hand. He bent over twice to touch his knee that had been struck. The woman, holding the cane, tapped the ground left and right as she walked forward.", "text": "A man is leaning on a blue railing, looking at his phone, while a woman in a floral dress and red shoes walks towards him, holding a cane. The man puts down his phone and starts walking towards the woman. He gestures with his hand and appears to be talking to her. The woman continues walking forward, and the man follows her, still gesturing and talking. The woman adjusts her sunglasses as she walks, and the man continues to follow her, occasionally looking at her and gesturing. The background shows a riverside path with greenery and a cloudy sky.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 549, "extra_info": {"duration": 10.43, "events": ["Woman took off her glasses.", "Woman glanced at the man.", "Woman put her glasses back on.", "Woman tapped the man with her glasses.", "Woman continued to walk forward.", "Man stood up straight.", "Man touched his butt.", "Man turned his head to look at the woman.", "Man moved forward, extending his right hand.", "Man bent over to touch his knee."], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 549}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "535.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person with a paraglider spirals down to the surface of the lake from the sky, gliding on with one foot touching the water, causing a spray of water to fly in all directions. The camera follows the movement of the individual,", "text": "A person is seen paragliding in the sky, with a blue and white paraglider against a backdrop of clouds. The paraglider begins to descend, and the person starts to control the lines. The paraglider continues to lower, and the person maneuvers it towards a body of water. The paraglider approaches the water, and the person prepares to land. The paraglider touches down on the water, creating a splash. The person lands smoothly on the water, with the paraglider still attached.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 535, "extra_info": {"duration": 10.53, "events": ["Person spirals down from sky", "Person with paraglider approaches lake surface", "Person touches water with foot", "Spray of water flies in all directions"], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 535}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "946.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The yellow dinosaur, dressed in a police uniform, is sitting inside a blue police car and is in pursuit of a brown-red truck driven by a blue dinosaur wearing red clothing and a red hat. Inside the brown-red truck is an open bag of money. The blue dinosaur in red clothing and a red hat, along with four other blue dinosaurs dressed in red, proceeds to bomb the bank. Then, they make their getaway in the truck. Subsequently, the police car chases them and collides with their vehicle, causing the money to spill out.", "text": "The video begins with a title screen displaying 'Dinosaur Police Car' and a play button icon, with a police car and a red truck driving side by side on a city street. The scene then shifts to a bank with a large hole in its facade, and a group of small blue characters with red hats emerge from the hole and run towards the right side of the screen. The characters continue running past the bank and the vehicles, heading towards the right side of the screen. The police car and the red truck follow the characters as they move through the city street, passing by various buildings and vehicles. 
The video concludes with the characters and the vehicles continuing their journey through the city, with the 'Dinosaur Police Car' title screen reappearing at the top of the screen.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 946, "extra_info": {"duration": 6.0, "events": ["Yellow dinosaur drives police car in pursuit.", "Blue dinosaur operates brown-red truck.", "Brown-red truck contains open bag of money.", "Blue dinosaur bombs the bank.", "Four blue dinosaurs assist in bank bombing.", "Blue dinosaur and accomplices escape in truck.", "Police car chases brown-red truck.", "Police car collides with brown-red truck.", "Money spills out from brown-red truck."], "n_events": 9, "n_shots": 3, "n_subjects": 5}, "vid": 946}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "345.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a large room, a man is watching a boxing match on a laptop while stroking a dog in front of him. Behind the sofa, another man is bent over inspecting something which emits a few flashes of light. The camera shifts to the man behind the sofa, who is smoking. At a farther distance, a woman is sitting on a bed, sewing something with needle and thread.", "text": "In a dimly lit, ornately decorated room with classical paintings and lamps, a man sits on a sofa petting a dog while watching a video on a laptop placed on a small table. In the background, a woman is sitting on a bed, and another man is bent over, lighting a cigarette. The man on the sofa continues to pet the dog, which is sitting beside him, while the man in the background lights his cigarette and then removes it from his mouth.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 345, "extra_info": {"duration": 6.75, "events": ["A man watches a boxing match on a laptop.", "The man strokes a dog.", "Another man inspects something behind the sofa.", "Something emits flashes of light.", "The man behind the sofa smokes.", "A woman sits on a bed.", "The woman sews something with needle and thread."], "n_events": 7, "n_shots": 2, "n_subjects": 4}, "vid": 345}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "925.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video clip features two children in blue outfits seated at a table, possibly siblings. The child in the blue skirt is pulling something out. There's also an individual in a blue shirt and another in black attire in the kitchen. The person in black, presumably the mother, is opening the refrigerator to find something to eat. From the photographs displayed on the wall, it's discernible that the four individuals are likely a family, with two adults presumably the parents, and the children their offspring.", "text": "In a dining room with a large clock on the wall, a girl in a blue tank top and a boy in a blue shirt are sitting at a table with food and drinks. The girl reaches out to the table and picks up a bag of food, while the boy continues eating. A person in a blue shirt walks past the camera, partially obscuring the view. The girl places the bag of food on the table and starts to open it. The scene then shifts to a kitchen where a man in a blue shirt walks towards the camera, and a woman in a green shirt with 'BLESSED' written on it walks to the refrigerator. The woman opens the refrigerator and starts looking inside, while the man continues walking towards the camera.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 925, "extra_info": {"duration": 10.0, "events": ["Child in blue skirt pulls something out.", "Individual in blue shirt sits at table.", "Person in black attire opens refrigerator.", "Mother searches for something to eat."], "n_events": 4, "n_shots": 2, "n_subjects": 4}, "vid": 925}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "20.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A creature bites into a bottle of drink to crush it, then its eyes widen; the creature and the boy start to dance, with the creature lying on the ground looking up at the boy.", "text": "A white, furry creature with large eyes and a wide mouth is seen in front of wooden crates, opening its mouth wide and sticking out its tongue. The creature then puts a red can into its mouth, and the can starts to disintegrate and spill its contents. The creature's eyes widen in shock as it continues to chew on the can. The creature then turns around and starts to roll on the ground, while a boy in a green jersey with the number 00 watches and laughs.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 20, "extra_info": {"duration": 6.88, "events": ["Creature bites into a bottle of drink", "Creature crushes the bottle", "Creature's eyes widen", "Creature and boy start to dance", "Creature lies on the ground", "Boy looks at creature"], "n_events": 6, "n_shots": 2, "n_subjects": 2}, "vid": 20}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "410.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a patch of grass, a handsome young man dressed in a brown suit stands sideways to the camera, tossing fresh flowers into the air with his hand and catching them again. The scene switches to show the young man walking away from the camera with flowers in his hand behind his back, then he turns around and looks back at the camera.", "text": "The video begins with a bright, overexposed scene, transitioning to a man in a maroon suit standing on a grassy field with mountains in the background. He looks up as a bouquet of flowers is thrown towards him. The man catches the bouquet with his right hand. The camera then shows the man from behind, revealing the bouquet in his hand. He turns around, holding the bouquet behind his back, and looks towards the camera. The video ends with the man still holding the bouquet behind his back, looking at the camera.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 410, "extra_info": {"duration": 8.7, "events": ["Young man tosses flowers into the air.", "Flowers are caught by young man.", "Young man walks with flowers behind back.", "Young man turns around.", "Young man looks at the camera."], "n_events": 5, "n_shots": 2, "n_subjects": 1}, "vid": 410}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "264.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A hand reached out to touch the leg of the person in front, followed by a fist swinging towards someone's face. The camera then cuts to a standing man punching another person sitting on the ground, knocking him out. The camera switches again to show the person on the ground knocked unconscious, while bystanders look on.", "text": "A hand is seen gripping a person's leg, pulling it slightly. The hand then moves to the person's face and begins to punch them. The person being punched is thrown to the ground by another individual. The person who was punched lies on the ground, appearing to be in pain.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 264, "extra_info": {"duration": 3.12, "events": ["A hand touches another person's leg.", "A fist swings towards a face.", "A standing man punches a sitting person.", "The punched person falls to the ground.", "Bystanders witness the unconscious person on the ground."], "n_events": 5, "n_shots": 4, "n_subjects": 3}, "vid": 264}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "106.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the other side of the camera lens, a character peeks out from beside a door. A character wearing green clothes turns to look at the person peeking. The person who was peeking moves from beside the door to the center of the doorway, standing there uncertainly for a moment. Then, they crouch down and slowly shift to the other side of the door before quickly vanishing. Two other characters in the room are looking towards the doorway.", "text": "In a room with green patterned wallpaper and a wooden cabinet, a character wearing a green and red outfit with a hat is seen holding a bowl. The character in the hat peeks through a doorway, looking at the character in green. The character in the hat then starts dancing, lifting one leg and moving energetically. After dancing, the character in the hat crouches down and begins to crawl away. The character in green turns their head to watch the character in the hat as they crawl away.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 106, "extra_info": {"duration": 3.75, "events": ["A character peeks out from beside a door.", "A character wearing green clothes turns to look.", "The first character moves to the center of the doorway.", "The first character stands uncertainly for a moment.", "The first character crouches down.", "The first character shifts to the other side.", "The first character quickly vanishes from the doorway.", "Two characters look towards the doorway."], "n_events": 8, "n_shots": 1, "n_subjects": 4}, "vid": 106}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "548.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person dressed in a wingsuit bends over and jumps off a mountain, opening their limbs and gliding down through the valleys, with a backdrop of mountain peaks.", "text": "A person wearing a black suit and a large backpack stands on the edge of a cliff, preparing to jump. The person bends their knees and leans forward, positioning themselves for the jump. The person then leaps off the cliff, spreading their arms and legs as they begin to descend. The person continues to fall, moving further away from the cliff edge and into the valley below. The person is seen falling deeper into the valley, with the rocky cliff walls surrounding them.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 548, "extra_info": {"duration": 6.6, "events": ["Person in wingsuit bends over", "Person jumps off a mountain", "Person opens limbs", "Person glides down", "Person passes through valleys", "Mountain peaks serve as backdrop"], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 548}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "353.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man receives an item with his right hand from another man. The two exchange a glance. The man on the left turns his head for a quick look to the left side and then walks towards the camera. After both men turn around and sit down, they make eye contact again.", "text": "A man in a gray shirt hands a box to a man in a green shirt while standing in a store with shelves filled with various items and a fan in the background. The man in the green shirt takes the box and they both walk further into the store. The man in the green shirt bends down to place the box on a chair, followed by the man in the gray shirt who also bends down to place his box on the chair. They both sit down on the chairs, open the boxes, and start examining the contents. The man in the gray shirt turns his head to look at the man in the green shirt, who is also looking at the box.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 353, "extra_info": {"duration": 9.75, "events": ["A man receives an item with his right hand from another man.", "The two men exchange a glance.", "The man on the left turns his head to the left side.", "The man on the left walks towards the camera.", "Both men turn around.", "Both men sit down.", "The two men make eye contact again."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 353}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "977.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases a blue car parked on a black asphalt road enveloped in grey smoke. Following this, another car can be seen driving along a three-lane asphalt road, occupying the middle lane. After overtaking a few stationary cars in the left lane, the car approaches a highway, seemingly about to collide with the blue car that's moving erratically across its path.", "text": "A race car is seen driving on a track, creating a large cloud of smoke behind it. Text appears on the screen, reading 'DASHCAM UTILITY' and 'BAD DRIVERS, CAR CRASHES AND MORE...' . The scene then transitions to a highway where the camera is mounted on a moving vehicle, capturing the view of the road ahead . The vehicle continues to drive forward, passing under a bridge and approaching a traffic light . The vehicle slows down and comes to a stop at the traffic light, with other cars visible on the road .", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 977, "extra_info": {"duration": 10.0, "events": ["Blue car parked on the road.", "Grey smoke envelops the blue car.", "Another car drives along the three-lane road.", "The car occupies the middle lane.", "The car overtakes stationary cars in the left lane.", "The car approaches a highway.", "Blue car moves erratically.", "Cars seem about to collide."], "n_events": 8, "n_shots": 2, "n_subjects": 3}, "vid": 977}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "261.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man drives a car, drifting around a corner, while a white car chases after him, rushing towards the roadside. A man in a yellow hat quickly jumps into the bed of a pickup truck, and then the white car crashes into it.", "text": "A man wearing a hat is driving a car through a city street. The car speeds through the street, passing by pedestrians and buildings. The car makes a sharp turn, causing it to emit smoke from its tires. The car continues to speed through the intersection, narrowly missing pedestrians. The car crashes into a truck, causing the driver to jump out of the vehicle. The truck's rear end is shown with the Ford logo visible.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 261, "extra_info": {"duration": 7.5, "events": ["A man drives a car.", "The car drifts around a corner.", "A white car chases the first car.", "The white car rushes towards the roadside.", "A man in a yellow hat jumps into a pickup truck.", "The white car crashes into the pickup truck."], "n_events": 6, "n_shots": 7, "n_subjects": 4}, "vid": 261}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "608.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a chocolate cake, three rows of candles were inserted. A pair of clenched fists moved slightly backward, then the candles were blown out. Following this, the pair of clenched hands opened up and began to clap, with another pair faintly visible clapping on the left side of the screen.", "text": "A chocolate cake with five lit candles is placed on a table, with a person in the background. The candles are burning brightly, and the person's hands are visible in the background. The candles start to flicker and gradually diminish in size. The flames on the candles continue to shrink until they are almost extinguished. The candles are now mostly out, with only a few small flames remaining. The candles are completely extinguished, and the person in the background claps their hands. The candles are relit, and the flames are burning brightly again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 608, "extra_info": {"duration": 8.36, "events": ["Three rows of candles were inserted into a chocolate cake.", "A pair of clenched fists moved slightly backward.", "The candles were blown out.", "The pair of clenched hands opened up.", "The opened hands began to clap.", "Another pair of hands started clapping on the left side."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 608}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "763.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The smoke drifts upward, entering the smoke detector, and then the red light on the alarm flashes a few times before lighting up.", "text": "A white smoke detector is mounted on a dark background, with smoke beginning to appear around it. The smoke intensifies, and the red light on the smoke detector starts to flash. The red light continues to flash as the smoke thickens, and the light becomes more prominent. The smoke detector's red light flashes intermittently as the smoke persists.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 763, "extra_info": {"duration": 20.85, "events": ["Smoke drifts upward", "Smoke enters the smoke detector", "Red light on the alarm flashes", "Red light stays illuminated"], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 763}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "798.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man and woman sit facing each other in conversation. The woman says something as she looks up at the man and then raises her left hand to signal him to leave. The man angrily slams the pen in his hand down on the table, then stands up and walks away. The woman watches the scene of the man leaving and then withdraws her hand.", "text": "In an office setting with a wooden and white geometric wall background, a man and a woman are seated at a table with a laptop, a pen holder, and some papers. The woman is talking and gesturing with her hands while the man listens attentively. The woman then points to the right with her right hand, and the man turns his head to look in the direction she is pointing. The man then stands up and walks away, leaving the woman sitting alone at the table. The woman continues to look in the direction the man pointed, then lowers her hand and looks at the man.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 798, "extra_info": {"duration": 12.92, "events": ["Woman says something to man.", "Woman looks up at man.", "Woman raises left hand to signal man.", "Man slams pen down on table.", "Man stands up.", "Man walks away.", "Woman watches man leaving.", "Woman withdraws hand."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 798}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "799.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman, while speaking to the horse, caressed it as it chewed on its fodder.", "text": "A woman, wearing a beige sweater, stands in a wooden stable and gently pets a brown and white horse. She smiles and talks to the horse, which moves its head closer to her. The woman continues to pet the horse, occasionally looking at it and speaking. She raises her hand to touch the horse's nose and then kisses its nose. The horse responds by moving its head closer, and the woman continues to pet and kiss the horse. The woman appears to be enjoying the interaction, smiling and talking to the horse throughout the video.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 799, "extra_info": {"duration": 28.9, "events": ["The woman speaks to the horse.", "The woman caresses the horse.", "The horse chews on its fodder."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 799}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "120.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A male merman extended his hand to a female merman in an inviting gesture, and the female merman placed her hand on the male merman’s. The male merman then led the female merman away, swimming upwards to the left, as a seahorse and a dolphin looked on. The two animals exchanged glances. Around them, many other merpeople were present. The scene then shifts to show the male and female merman starting to dance.", "text": "In a vibrant, underwater-themed setting with colorful lights and mermaids in the background, a man in a blue outfit with a crown and a woman in a pink outfit with a pearl necklace extend their hands towards each other. They hold hands and begin to spin around, lifting off the ground. As they spin, a pink seahorse and a blue dolphin appear, floating in the air. The man and woman continue to spin, with the woman wrapping her arms around the man's neck.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 120, "extra_info": {"duration": 4.88, "events": ["A male merman extended his hand to a female merman", "The female merman placed her hand on the male merman's hand", "The male merman led the female merman upwards", "The male merman and the female merman swam to the left", "A seahorse and a dolphin looked on", "The seahorse and the dolphin exchanged glances", "Merpeople surrounded the two mermen", "The male merman and the female merman started to dance"], "n_events": 8, "n_shots": 2, "n_subjects": 4}, "vid": 120}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "373.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A group of people are fighting in the water, one man is pushed down, and the camera cuts to this man being held under the water by another person, struggling.", "text": "In a forested area, a group of people are engaged in a chaotic battle, with water splashing around them as they fight. A man in brown clothing raises his sword and swings it, while others around him also engage in combat. The man continues to fight, moving through the water and attacking his opponents. He swings his sword at an enemy, who is knocked down into the water. The man then grabs the fallen enemy and begins to pull them out of the water.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 373, "extra_info": {"duration": 5.75, "events": ["A group of people fight in the water.", "One man pushes another man down.", "Another person holds a man under the water.", "A man struggles underwater."], "n_events": 4, "n_shots": 2, "n_subjects": 3}, "vid": 373}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "125.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two people approach a food store, with the brown-haired man in front leading while holding a black object in his hand. They open the door and enter the store. The camera then cuts to the store door being kicked open by the brown-haired man, who comes out carrying a large stack of food; the blond-haired man follows him, carrying a slightly smaller amount of food. As they both exit, some of the food falls to the ground.", "text": "A boy and a girl walk towards a restaurant named 'Sand Castle Foods' on a sunny day, with a palm tree and a bench visible in the background. The boy holds a camera and takes a photo of the restaurant's entrance, which has a sign indicating 'Gameday Soon! Beach Ready! Reach for Noodles!' and 'Sand Castle Foods' with a 'Closed' sign. The boy then opens the door and enters the restaurant, followed by the girl. Inside, the boy and girl are seen carrying a large stack of food items, which they drop on the floor as they exit the restaurant. The boy and girl then walk away from the restaurant, with the boy still holding the camera.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 125, "extra_info": {"duration": 7.0, "events": ["Two people approach a food store.", "Brown-haired man leads holding a black object.", "They open the store door.", "Brown-haired man enters the store.", "Brown-haired man kicks the store door open.", "Brown-haired man exits carrying a large stack of food.", "Blond-haired man exits carrying food.", "Food falls to the ground."], "n_events": 8, "n_shots": 2, "n_subjects": 2}, "vid": 125}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "142.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A group of people are swaying left and right with their hands raised high, while only a little boy stands, looking around at the others in confusion. The camera zooms in on the boy's upper body; he takes a bite of the biscuit in his hand, and then excitedly raises his arms and starts to sway left and right too. The camera then switches to show the boy and many people next to him swaying together. Once again, the camera angle changes to capture the back of a woman swaying, as well as the sea of people in front of her moving in unison.", "text": "A group of people, including a boy holding a heart-shaped cookie, are dancing energetically with their arms raised in a festive atmosphere. The boy looks around, then takes a bite of his cookie. He continues to dance, raising his arms and smiling. The camera zooms out to show the entire group dancing in unison. The scene then shifts to a woman on a balcony, who is also dancing and waving her arms, with a large crowd of people dancing below her. The camera returns to the boy, who continues to dance and look around.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 142, "extra_info": {"duration": 9.62, "events": ["A group of people sway left and right.", "A little boy stands looking around in confusion.", "The little boy takes a bite of a biscuit.", "The little boy raises his arms.", "The little boy starts to sway.", "People next to the boy sway together.", "A woman sways to the rhythm.", "A sea of people moves in unison."], "n_events": 8, "n_shots": 4, "n_subjects": 3}, "vid": 142}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "30.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A dog with a red scarf is sitting on an object, flying forward. The camera switches to a moving train with white smoke emerging above. An airplane is flying, turning a corner. A dog with a purple scarf stands on the airplane, looking backward. The dog with the red scarf makes a move, accelerating through the air, approaching the airplane. Both dogs reach out their paws toward each other. Their scarves flutter in the wind. The ears of the dog with the purple scarf are also blown by the wind. The dog with the purple scarf clutches the tail wing of the airplane, turns its head, as the plane flies away. The purple scarf drifts into view, covering the camera, and the screen fades to black.", "text": "A white dog wearing a green helmet and goggles sits on a red platform, looking ahead with a sad expression. The scene shifts to a red biplane flying over a train on a bridge, with the plane tilting as it maneuvers. The white dog continues to watch the plane, which is now flying very close to the train. A pink dog with a purple scarf and pink ears is seen on the plane's wing, holding onto the wing with one paw. The white dog reaches out to the pink dog, who is struggling to hold on. The pink dog loses its grip and falls off the wing, floating in the air. 
The white dog remains on the platform, looking ahead with a sad expression.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 30, "extra_info": {"duration": 8.75, "events": ["A dog sits on a flying object.", "The train emits white smoke.", "An airplane turns a corner.", "A dog stands on the airplane.", "The dog with red scarf accelerates through the air.", "Both dogs reach out their paws.", "The dog with purple scarf clutches the tail wing.", "The purple scarf flutters in the wind.", "Ears of the dog with purple scarf are blown by wind.", "The purple scarf covers the camera."], "n_events": 10, "n_shots": 6, "n_subjects": 2}, "vid": 30}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "121.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A purple-haired female ghost is speaking to the camera; she then flies upwards, followed by six other female ghosts. The camera angle shifts to show them flying past a row of buildings, heading towards the darkness in the distance.", "text": "A woman with long, wavy purple hair and wearing a pink dress with a corset and chains speaks while looking forward. She then turns her head to the side and begins to float upwards. The woman, along with four other girls, starts to float upwards together. The camera then shifts to show the five girls floating upwards in a large, book-filled room with tall shelves and a high ceiling.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 121, "extra_info": {"duration": 8.62, "events": ["A purple-haired female ghost speaks to the camera.", "The purple-haired female ghost flies upwards.", "Six other female ghosts follow the purple-haired ghost.", "The group of female ghosts flies past a row of buildings.", "Female ghosts head towards the darkness."], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 121}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "283.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person wearing a hat is driving, slowly moving forward on the road. Through the windshield and side window, one can see a man wearing a duckbill cap and a woman in a green long coat walking in the opposite direction of the car on the right-hand sidewalk. The camera follows the gaze of the person inside the car, initially showing the front of the two pedestrians, then their side profiles, and finally their backs.", "text": "A man wearing a hat is driving a blue car on a street lined with parked cars and buildings. The car moves forward, passing by several parked cars and a few moving vehicles. As the car continues, it approaches a sidewalk where a man and a woman are walking. The man and woman, dressed in coats, walk side by side on the sidewalk, passing by a building with stairs and a fence. The car continues to move past the couple, who are walking away from the camera. The car drives past a parked car and some discarded items on the sidewalk, including a refrigerator and some boxes.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 283, "extra_info": {"duration": 9.75, "events": ["A person drives forward on the road.", "A man in a duckbill cap walks on the sidewalk.", "A woman in a green long coat walks on the sidewalk.", "The man and woman walk in the opposite direction of the car.", "The camera follows the gaze of the person inside the car.", "The camera shows the front of the pedestrians.", "The camera shows the side profiles of the pedestrians.", "The camera shows the backs of the pedestrians."], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 283}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "164.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A blond-haired man runs towards the makeshift toilet ahead, while the brown-haired man watches him. After the blond-haired man enters the toilet and closes the door, the camera cuts to the inside of the toilet. The blond-haired man sits on the toilet and picks up his cellphone to make a call.", "text": "A yellow-haired boy runs towards a blue portable toilet in a construction site with scaffolding and dirt in the background. He reaches the toilet, opens the door, and enters. The boy then sits on the toilet and starts talking on a phone. The scene then switches back to the yellow-haired boy running towards the toilet while a brown-haired boy watches him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 164, "extra_info": {"duration": 5.0, "events": ["Blond-haired man runs towards the makeshift toilet.", "Blond-haired man enters the makeshift toilet.", "Blond-haired man closes the toilet door.", "Blond-haired man sits on the toilet.", "Blond-haired man picks up his cellphone.", "Blond-haired man makes a call."], "n_events": 6, "n_shots": 2, "n_subjects": 2}, "vid": 164}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "419.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The person dressed in a green mascot suit sat in a chair and put down the helmet, then turned to look at the person next to them dressed in a brown mascot suit. They were startled and sat up straight, poking at them with an object. The person in the brown mascot suit fell backwards to the ground and then stood up. The person in the green mascot suit picked up the mascot's helmet, turned around, and ran towards the person dressed in brown.", "text": "A person in a green costume sits on a bench next to a large orange mascot in a plaza with a decorative background. The person in green reaches for a blue object on the bench and starts to pull it. The orange mascot, noticing the action, begins to move away from the bench. The orange mascot falls to the ground as the person in green continues to pull the blue object. The person in green stands up and runs away, while the orange mascot gets up and starts to follow. The person in green returns to the bench, and the orange mascot stands up and faces the person in green.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 419, "extra_info": {"duration": 8.15, "events": ["The person in the green mascot suit sat in a chair.", "The person in the green mascot suit put down the helmet.", "The person in the green mascot suit turned to look at the companion.", "The person in the green mascot suit was startled.", "The person in the green mascot suit sat up straight.", "The person in the green mascot suit poked the companion with an object.", "The person in the brown mascot suit fell backwards to the ground.", "The person in the brown mascot suit stood up.", "The person in the green mascot suit picked up the mascot helmet.", "The person in the green mascot suit ran towards the person in brown."], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 419}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "585.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A pile of duck corpses is dumped from a basket into a pit in the ground. As the camera zooms out, a girl clad in blue sets down the basket, crouches down to press on the bodies of the ducks, and then pushes the adjacent soil over to bury them, while a small dog behind her turns its head to look over. The background features a woodland mountain landscape.", "text": "In a forested area, a woman in a blue shirt and black pants is seen crouching and placing a group of small, featherless birds into a hole in the ground. She then uses a basket to pour more birds into the hole, ensuring they are all inside. After emptying the basket, she begins to cover the birds with soil, using her hands to pat the dirt over them. A small dog is visible in the background, observing the woman's actions.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 585, "extra_info": {"duration": 7.83, "events": ["A pile of duck corpses is dumped into a pit.", "A girl sets down the basket.", "The girl crouches down.", "The girl presses on the duck bodies.", "The girl pushes soil over the duck corpses.", "A small dog turns its head."], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 585}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "21.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera takes a bird's-eye view, a red-haired monster looks up, its mouth moves slightly, the camera pans upwards, and the monsters are arranged into a heart shape.", "text": "A red-haired character with a white flower in their hair is seen holding a brown bag and looking inside it. The camera gradually zooms out, revealing a group of small, colorful characters wearing various hats standing around the red-haired character. As the camera continues to pull back, the group of characters forms a heart shape around the red-haired character. The background is a vibrant, patterned area with various designs and colors, including a large red and white striped umbrella. The red-haired character then stands up and looks at the heart-shaped formation of the other characters.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 21, "extra_info": {"duration": 7.0, "events": ["Camera takes a bird's-eye view", "Red-haired monster looks up", "Red-haired monster's mouth moves slightly", "Camera pans upwards", "Monsters arrange into a heart shape"], "n_events": 5, "n_shots": 1, "n_subjects": 3}, "vid": 21}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "22.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The boy and the baby turned to face the mirror, and at the same time, sunglasses slide down from above their heads and are put on. Then the two begin to dance. The camera zooms in on the car's rear windshield with a sign that reads \"Baby Inside\" as the car moves forward and passes a sign for a dog company. The camera pans up past the sign to reveal a large building belonging to the dog company, which is emitting thick smoke.", "text": "Two characters, one in a black suit and the other in a blue shirt with a bow tie, turn to face the camera in a bathroom with green tiled walls. The character in the blue shirt puts on sunglasses and starts talking while the other character crosses his arms. The scene then shifts to the interior of a car, where a yellow sign with the text 'BABY ON BOARD' is displayed. The car drives towards and under a large, green, archway with the word 'PUPPYCO' and a dog logo on it. The car continues driving past the archway, revealing a factory with smokestacks emitting smoke in the background. The scene then returns to the bathroom mirror, showing the two characters again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 22, "extra_info": {"duration": 9.5, "events": ["The boy and the baby turned to face the mirror", "Sunglasses slide down from above onto their heads", "The boy and the baby began to dance", "The camera zooms in on the car's rear windshield", "The car moves forward", "The car passes a sign for a dog company", "A sign for the dog company reads 'Baby Inside'", "The camera pans up past the dog company sign", "The large building belonging to the dog company is revealed", "Thick smoke emits from the large building"], "n_events": 10, "n_shots": 2, "n_subjects": 2}, "vid": 22}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "81.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera looks forward from within a crowd, where a group of people are playing drums, with many spectators watching alongside. Then, the drummers and drums part to either side, clearing a path down the middle, through which a lion dancer emerges. The scene switches to a view within the crowd, where a bald man and a man with a crew cut are angrily staring ahead.", "text": "A group of people dressed in red uniforms are playing red drums in a street parade, with onlookers clapping and watching the performance. A lion dancer, wearing a colorful lion costume, emerges from the center of the drummers and begins to move forward. The lion dancer continues to move forward, getting closer to the camera, while the drummers keep playing. The camera then shifts to show the lion dancer from inside the lion costume, revealing the faces of the performers inside.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 81, "extra_info": {"duration": 3.12, "events": ["The camera looks forward from within a crowd.", "People are playing drums.", "Spectators are watching the performance.", "The drummers and drums part to either side.", "A path down the middle clears.", "A lion dancer emerges through the path.", "Bald man stares angrily ahead.", "Man with a crew cut stares angrily ahead."], "n_events": 8, "n_shots": 2, "n_subjects": 3}, "vid": 81}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "431.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Background: Disneyland. A woman carrying a black backpack walks forward, greets someone in a fox costume, then approaches someone in a rabbit costume and hugs her. During the hug, the woman's hat falls off. The fox character points to the hat on the ground, and the woman squats down to pick up the hat and puts it back on her head. Afterwards, the woman and the fox character hug, but during the process, the hat falls off again.", "text": "Two women, one in a blue jacket and the other in a blue dress with a black backpack, walk towards a character area with a blue and white striped background. They approach two characters in blue uniforms, one with a red fox head and the other with a gray rabbit head. The characters greet the women, and the woman in the blue dress hugs the gray rabbit character. The red fox character gestures and talks to the woman, who then bends down to pick up a hat from the ground. She puts the hat on her head and hugs the red fox character. The gray rabbit character watches them and then turns around.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 431, "extra_info": {"duration": 15.2, "events": ["A woman greets someone in a fox costume.", "The woman approaches someone in a rabbit costume.", "The woman hugs the person in the rabbit costume.", "The woman's hat falls off during the hug.", "The character in the fox costume points to the fallen hat.", "The woman squats down to pick up the hat.", "The woman puts the hat back on her head.", "The woman and the character in the fox costume hug.", "The woman's hat falls off again during the hug."], "n_events": 9, "n_shots": 1, "n_subjects": 3}, "vid": 431}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "732.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera moves forward along the road, capturing the seaside buildings, and ultimately, the camera pans left following the road to film the vast expanse of the ocean.", "text": "The video begins with a view of a pathway lined with palm trees and greenery, with a building on the right and a dock with boats on the left. The camera moves forward along the pathway, passing by more palm trees and a white building with a red roof. As the camera continues to move, it approaches a small white structure with a red roof, and then turns left, revealing a staircase leading down to a dock. The camera descends the staircase, showing a clear view of the dock and the sea, with a boat docked at the end. The video ends with a view of the dock and the sea, with the camera having completed its descent down the staircase.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 732, "extra_info": {"duration": 10.08, "events": ["The camera moves forward along the road.", "The camera captures seaside buildings.", "The camera pans left.", "The camera films the ocean."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 732}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "574.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Background of a chicken coop. A group of chickens are walking around a boy wearing a blue short-sleeved shirt. The boy is bending over with a basin in his hands, scooping out a basin of corn from the bucket in front of him, then turns around and jumps up, swinging his arms, scattering the corn around. The chicks are startled by the falling corn and run away in all directions.", "text": "A boy is crouching in a pink bucket filled with corn, surrounded by a large group of chickens in an outdoor area. He scoops up a bowl of corn from the bucket and stands up. The boy turns around and starts walking towards the chickens, holding the bowl of corn. He raises the bowl above his head and begins to throw the corn into the air. The corn scatters as the boy continues to throw it, and the chickens eagerly follow him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 574, "extra_info": {"duration": 4.03, "events": ["A group of chickens walk around.", "A boy bends over with a basin.", "The boy scoops out corn from a bucket.", "The boy turns around.", "The boy jumps up.", "The boy swings his arms.", "The boy scatters the corn around.", "The chicks are startled by the corn.", "The chicks run away in all directions."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 574}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "739.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The puppy squatted down, then suddenly backed away, wagging its tail and moving back and forth, revealing two other puppies beside the food bowl behind it, one of which withdrew its paw that had been reaching into the food bowl.", "text": "A small puppy with brown and white fur is seen walking on a wooden floor towards the camera, with a blurred background of a room with glass doors. The puppy continues to move forward, and another puppy appears in the background. The first puppy reaches a metal food bowl and starts eating, while the second puppy joins in. The third puppy, which was initially in the background, moves closer to the food bowl and starts eating as well. The three puppies are now all eating from the food bowl, with the first puppy moving away from the bowl.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 739, "extra_info": {"duration": 15.05, "events": ["The puppy squatted down.", "The puppy suddenly backed away.", "The puppy wagged its tail.", "The puppy moved back and forth.", "Two other puppies appeared beside the food bowl.", "One of the other puppies withdrew its paw.", "The other puppy had been reaching into the food bowl."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 739}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "73.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man in a white short-sleeved shirt rides an electric scooter on the road, raising one hand and waving to the right, then weaving through traffic. The camera cuts to a scene in front of a billboard advertisement as the man rides the electric scooter down some steps and turns to the left, then leans back to greet a lady hanging clothes. The lady opens her eyes wide in surprise.", "text": "A boy stands on a street holding a watering can and a bag, looking ahead. The scene shifts to the boy riding a yellow scooter through a busy city street, waving his hand as he navigates through traffic. He continues riding, passing by other cyclists and vehicles, and heads towards a set of stairs. The boy rides up the stairs, passing by a large advertisement featuring a woman holding shopping bags. He then turns and rides past a woman who is hanging clothes on a line. The scene returns to the boy standing on the street with the watering can and bag.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 73, "extra_info": {"duration": 7.12, "events": ["A man rides an electric scooter on the road.", "Man raises hand and waves to the right.", "Man weaves through traffic.", "Man rides the electric scooter down some steps.", "Man turns to the left.", "Man leans back to greet a lady.", "Lady hanging clothes opens her eyes wide in surprise."], "n_events": 7, "n_shots": 2, "n_subjects": 3}, "vid": 73}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "361.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person folds the paper with both hands, then pours the contents from the paper into a cup, as the camera switches to a close-up of the woman looking down.", "text": "In a dimly lit setting, a person is seen holding a piece of paper with a dark mark on it, which they begin to lift from a table. The person continues to lift the paper, revealing more of the dark mark as they move it towards a small cup. The paper is held above the cup, and the person points at the dark mark on the paper. The camera then shifts focus to the person's face, showing them looking down at the paper.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 361, "extra_info": {"duration": 5.62, "events": ["A person folds the paper with both hands.", "The person pours the contents from the paper into a cup.", "The camera shows the woman looking down."], "n_events": 3, "n_shots": 2, "n_subjects": 1}, "vid": 361}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "804.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features three animated characters: an elderly individual with glasses sitting in a red manual wheelchair; a young person sporting a ponytail in a red motorized wheelchair; and a third character with a prosthetic right foot, standing to the left of the second person.", "text": "Three cartoon characters are standing together, with one holding a dog, another using crutches, and the third using a wheelchair. The scene then shifts to a single character in a wheelchair, with the text 'Many people need assistive technology' appearing on the right. As the character in the wheelchair moves, icons representing different assistive technologies, such as a crutch and a hearing aid, appear next to the text. The text 'live, work and play' appears sequentially next to the icons. The scene then returns to the three characters, with the character in the wheelchair now facing the other two.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 804, "extra_info": {"duration": 8.0, "events": [], "n_events": 0, "n_shots": 1, "n_subjects": 3}, "vid": 804}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "313.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The judge lifted the hand of the athlete on the right, then let it go, the camera switched to the audience seating, where the audience members clapped and cheered for the athlete.", "text": "A wrestler in a black and yellow singlet with a headgear raises his right hand in victory, holding it up as the referee's hand is placed on top of his. The wrestler then lowers his hand and bows his head. The camera shifts to the audience, where a group of people, including a girl with a yellow pom-pom, clap and cheer enthusiastically.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 313, "extra_info": {"duration": 3.88, "events": ["The judge lifted the athlete's hand.", "The judge let the athlete's hand go.", "The audience members clapped.", "The audience members cheered for the athlete."], "n_events": 4, "n_shots": 2, "n_subjects": 3}, "vid": 313}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "490.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a rural kitchen, beside the stove, there are three large bowls arranged with seasonings inside. The dumplings in the big pot have been cooked, and a person is using a strainer ladle, scooping out the dumplings in turn and pouring them into the bowls.", "text": "In a kitchen setting, a large pan filled with boiling water and dumplings is placed on a stove, with three bowls of dipping sauce and a wooden lid nearby. A person uses a skimmer to scoop the dumplings from the pan and transfer them into the bowls of dipping sauce. The dumplings are carefully placed into the sauce, filling the bowls. The process continues until all the dumplings are transferred from the pan to the bowls. The final frame shows the three bowls filled with dumplings and sauce, ready to be served.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 490, "extra_info": {"duration": 9.58, "events": ["Three large bowls are arranged beside the stove.", "Seasonings fill the three large bowls.", "Dumplings in the big pot finish cooking.", "A person scoops out dumplings with a strainer ladle.", "Person pours dumplings into the first bowl.", "Person pours dumplings into the second bowl.", "Person pours dumplings into the third bowl."], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 490}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "782.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two men shake hands, bump fists, point at each other, swaying their bodies.", "text": "In a sunlit outdoor setting, two men engage in a series of hand gestures, starting with a handshake. They then transition to a fist bump, followed by a series of finger snaps. The men then step back, and the man on the left looks down while the man on the right moves to the right.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 782, "extra_info": {"duration": 5.67, "events": ["Two men shake hands.", "They bump fists.", "They point at each other.", "Men sway their bodies."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 782}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "547.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Three kittens climbed onto the old man's legs, and the old man one by one removed the kittens from his legs and put them back on the ground, while another kitten climbed under the tree on the right. The camera remained steady.", "text": "An elderly man is sitting on a small stool in a garden, holding a shovel in his right hand. Several kittens are around him, and he is gently petting them. One kitten climbs onto his lap, and he lifts it up to hold it in his hands. The man continues to pet the kittens, and they appear to be playful and affectionate. The background shows a garden with some plants and a brick structure in the distance.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 547, "extra_info": {"duration": 6.1, "events": ["Three kittens climbed onto the old man's legs.", "The old man removed one kitten from his legs.", "The old man removed another kitten from his legs.", "The old man removed the third kitten from his legs.", "A kitten climbed under the tree on the right."], "n_events": 5, "n_shots": 1, "n_subjects": 3}, "vid": 547}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "492.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person stretches out their right hand, with a slice of ham sausage on the tip of each finger. A small dog approaches and eats the ham sausage off the index finger, revealing a long fingernail. Then, the dog proceeds to eat the ham sausages off the other fingers in succession, with each slice of ham sausage skewered on long fingernails.", "text": "A hand with long, manicured nails and a gold ring holds several pieces of food in front of a dog. The dog eagerly sniffs and licks the food from the hand. The dog continues to lick and bite the food, showing interest and enjoyment. The hand moves slightly, and the dog follows the food with its mouth, taking small bites. The dog licks the food off the hand, and the hand adjusts the position of the food. The dog continues to lick and bite the food, showing a consistent interest in the treat.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 492, "extra_info": {"duration": 12.75, "events": ["A person stretches out their right hand.", "A slice of ham sausage is on each fingertip.", "A small dog approaches the person.", "The dog eats the ham sausage off the index finger.", "The dog reveals a long fingernail on the index finger.", "The dog eats the ham sausage off the middle finger.", "The dog eats the ham sausage off the ring finger.", "The dog eats the ham sausage off the little finger.", "The dog eats the ham sausage off the thumb."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 492}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "653.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person is washing lettuce in the sink, carefully inspecting each leaf as clean water drips from the lettuce into the sink. The person holds the lettuce upright to let it drain, while another person holds a white towel with deep blue stripes, placing it to the left of the sink.", "text": "A person is seen washing a head of lettuce under running water in a stainless steel sink, with their hands thoroughly rinsing the leaves. The person continues to scrub and rinse the lettuce, ensuring all the leaves are clean. After thoroughly washing the lettuce, the person lifts it out of the water and shakes off the excess water. The person then places the lettuce on a white cloth to dry.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 653, "extra_info": {"duration": 8.08, "events": ["Person washes lettuce in the sink.", "Person inspects each lettuce leaf.", "Clean water drips from lettuce into sink.", "Person holds lettuce upright to drain.", "Another person holds a white towel.", "Second person places towel to left of sink."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 653}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "971.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video commences with scenes of a cloudy, lightning-filled atmosphere. The setting is a grassy area where the weather transitions from lightning conditions to an intense tornado, extending from thunderstorms down to the ground. The scene is filled with high-velocity winds swirling around nearby houses. In this tumult, two individuals in blue wind-resistant clothing struggle against the hostile conditions. One of these individuals is seen with a watch on his left hand.", "text": "The video begins with a dark night sky illuminated by a bright lightning strike, followed by a series of lightning flashes. The scene then transitions to a large tornado forming over a grassy field, with debris being carried by the wind. The text 'Tornadoes' appears, explaining that tornadoes are formed by rotating columns of air extending from thunderstorms down to the ground. The camera then shifts to a person in a blue raincoat, struggling against the strong wind, with the text 'Wind speeds of over 300 mph, most violent storms on Earth. Can create massive damage along their path' appearing on the screen. The video concludes with another lightning strike in the night sky.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 971, "extra_info": {"duration": 10.0, "events": ["Clouds fill the atmosphere with lightning.", "Weather changes to an intense tornado.", "Tornado extends from thunderstorms to the ground.", "High-velocity winds swirl around houses.", "Two individuals struggle against the winds.", "One individual checks the watch on left hand."], "n_events": 6, "n_shots": 3, "n_subjects": 2}, "vid": 971}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "598.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is the interior of a taekwondo dojo. A woman in a blue short-sleeved shirt stretches her hands forward, shouts loudly, and kicks out with her right leg towards the balloons in front of her. The balloons are kicked and burst one by one. After retracting her leg, the woman stands up and repeats the previous action, kicking towards higher balloons. As the balloons burst, the woman stands again after lowering her right leg. The camera remains fixed.", "text": "In a room with a yellow wall featuring a Chinese flag and a poster, a woman in a blue shirt and dark pants stands in front of a punching bag with yellow and transparent balloons. She raises her fists and prepares to kick. She then kicks the balloons on the punching bag, causing them to pop. After the first kick, she steps back and prepares for another kick. She kicks the punching bag again, breaking more balloons. The balloons on the punching bag are gradually reduced as she continues her kicks.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 598, "extra_info": {"duration": 6.33, "events": ["Woman stretches hands forward", "Woman shouts loudly", "Woman kicks right leg out", "Balloons burst", "Woman retracts her leg", "Woman stands up", "Woman kicks towards higher balloons", "Higher balloons burst", "Woman lowers her right leg", "Camera remains fixed"], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 598}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "507.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a field, a white dog joyfully bounds forward. It comes to a person's side who is jogging slowly, and the little white dog continues to hop along cheerfully. In the distance of the frame are palm trees, blue sky, and white clouds.", "text": "A white dog is seen running on a grassy field near a picnic table and a trash can, with a sandy hill and a small structure in the background. The dog continues running across the grass, moving away from the picnic area. The dog runs past a large tree and towards a man standing in the distance. The man, dressed in dark clothing, turns his head to look at the dog as it approaches him. The dog circles around the man, who remains standing and watching. The dog then continues running further into the open field, with the man standing in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 507, "extra_info": {"duration": 13.23, "events": ["A white dog bounds forward in a field.", "The white dog reaches the person's side.", "The person jogs slowly.", "The white dog hops along cheerfully."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 507}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "243.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the field, a group of people are playing. A little girl runs towards the direction of the camera. The shot switches to the interior, where a man is sitting with his arms crossed over his chest. The woman behind him watches the man and walks towards him. The shot switches back to the little girl, following her as she moves, stopping in the frame and smiling forward.", "text": "A group of children are running and playing in a field filled with tall grass and wildflowers under a partly cloudy sky. A girl in a purple dress runs towards the camera, smiling and talking. The scene then cuts to a man in a denim shirt and a woman in a white dress standing in a dark environment. The man crosses his arms and speaks, while the woman looks at him and then turns her head to look forward. The scene then switches back to the girl in the purple dress, who stops running and continues talking while standing in the field.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 243, "extra_info": {"duration": 9.75, "events": ["Group of people play in the field", "Little girl runs towards the camera", "Man sits with arms crossed over chest", "Woman behind man watches the man", "Woman walks towards the man", "Little girl stops and smiles"], "n_events": 6, "n_shots": 3, "n_subjects": 4}, "vid": 243}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "62.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man dressed in a blue suit looks to the right; the camera then cuts to him pushing open a door to look inside the room. The shot switches to the interior of the room, where a large number of gift boxes are arranged on and under the table.", "text": "A boy in a blue suit with a pink tie and holding a green and yellow bag turns his head to look at a door in a modern, well-lit room with a staircase and abstract sculptures in the background. He then opens the door slightly and peeks inside, holding the door with his left hand. The camera then shifts to show a table filled with various colorful gift boxes and bags, some of which are stacked on top of each other, while others are scattered around the table.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 62, "extra_info": {"duration": 6.75, "events": ["A man looks to the right.", "A man pushes open a door.", "A large number of gift boxes are arranged on a table.", "Gift boxes are arranged under a table."], "n_events": 4, "n_shots": 3, "n_subjects": 1}, "vid": 62}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "414.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man dressed in red clothes puts the cow's head into the pot, sprinkles in salt and chicken essence. A chicken walks by in the background. He then adds soy sauce, gestures with his finger, says something to the camera, and then covers the pot with a lid. The background is a desolate snowy mountain. The camera angle remains fixed.", "text": "A man stands outdoors in a snowy landscape, holding a cow's head with horns and submerging it into a large pot filled with a red, spicy liquid. He then adds salt and corn flour to the pot, followed by soy sauce. The man stirs the mixture and covers the pot with a lid, allowing it to cook for 5 hours. Finally, he lifts the cow's head out of the pot.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 414, "extra_info": {"duration": 11.5, "events": ["Man puts cow's head into pot", "Man sprinkles salt into pot", "Man adds chicken essence", "Chicken walks by", "Man pours soy sauce into pot", "Man gestures to camera", "Man speaks to camera", "Man covers pot with lid"], "n_events": 8, "n_shots": 1, "n_subjects": 1, "old_duration": 15.53, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/14_no_audio.mp4"}, "vid": 414}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "541.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man stands in front of the camera, stretches out his left hand, then looks down at the monkey hugging his leg. He strokes it with both hands and squats down. Suddenly, the monkey reaches out and hits the man in the face, and the man steps back to the right.", "text": "A man stands on a grassy area in front of a building, raising his right hand while a monkey in a blue jacket stands behind him, holding onto his leg. The man looks at the monkey and then bends down to pet it. The monkey turns its head to look at the man, who then crouches down. The monkey raises its hands and pushes the man, causing him to fall to the ground. The monkey then turns around and looks at the man.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 541, "extra_info": {"duration": 4.7, "events": ["The man stretches out his left hand.", "The man looks down at the monkey.", "The monkey hugs the man's leg.", "The man strokes the monkey with both hands.", "The man squats down.", "The monkey reaches out.", "The monkey hits the man in the face.", "The man steps back to the right."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 541}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "337.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man looks ahead with lowered eyes. The woman opposite him waves her hand and slaps the man across the face. The man's head twists to the side from the impact, and then he looks forward again.", "text": "In a dimly lit room with wooden paneling, a man in a light-colored shirt stands facing a woman with curly blonde hair. The woman, wearing a light-colored top, moves closer to the man and begins to speak. The man listens attentively, maintaining eye contact with the woman. The woman continues to speak, and the man slightly turns his head to the side, appearing to respond.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 337, "extra_info": {"duration": 2.38, "events": ["Man looks ahead with lowered eyes.", "Woman opposite man waves her hand.", "Woman slaps man across the face.", "Man's head twists to the side from impact.", "Man looks forward again."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 337}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "958.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video displays a projection of text combined with an \"eighty-five percent\" sign. Several individuals, seated, are seen applauding. Subsequently, an individual with long hair, attired in a pink jacket, white trousers, and white shoes approaches the projected image amidst the sound of applause.", "text": "A group of people are seated in a room with a large window and a screen displaying '85.00%' and some text in the background. They are clapping while looking at the screen. The camera moves from right to left, showing more people clapping. A woman in a pink outfit walks towards the screen, holding a remote control. She turns around and starts speaking to the group. The camera then moves back to the right, showing the group continuing to clap.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 958, "extra_info": {"duration": 5.0, "events": ["Text projects onto a screen.", "An 'eighty-five percent' sign appears.", "Several individuals applaud while seated.", "An individual with long hair walks forward.", "The individual wears a pink jacket.", "The individual dons white trousers and shoes.", "The individual approaches the projected image.", "Applause sounds in the room."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 958}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "938.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, an individual, clad in a black T-shirt with a yellow emblem, is situated in the bus's luggage compartment. Fearful of being spotted by the authorities, the individual reclines and seeks cover behind the stored luggage. The camera shifts focus to the trunk's opening where assorted luggage can be seen. Suddenly, the baggage handler dressed in white catches sight of the individual in black attempting to take a red suitcase. It's clear the individual in black risks getting caught by the bus driver.", "text": "A man wearing a black shirt with 'MALIA' written in yellow is sitting in a dimly lit vehicle, adjusting his position and looking around. He then turns his head towards the camera and starts speaking. The camera moves closer to his face, and text appears on the screen saying 'HE'S GOING TO SEE ME.' The camera then shifts to show the interior of the vehicle, revealing a bus driver in a white shirt and black pants, who is handling a red suitcase.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 938, "extra_info": {"duration": 8.6, "events": ["Individual reclines in luggage compartment.", "Individual seeks cover behind luggage.", "Camera shows trunk's opening with luggage.", "Baggage handler catches sight of individual.", "Individual attempts to take red suitcase.", "Individual risks getting caught by bus driver."], "n_events": 6, "n_shots": 3, "n_subjects": 2}, "vid": 938}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "344.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two boxers are competing in the ring. The one in red shorts throws a punch, the one in black shorts ducks to avoid it, then rushes into the embrace of the boxer in red shorts. The camera cuts to a scene where the boxer in black shorts falls to the ground. The referee comes over to separate the two fighters, and the boxer in black shorts stands up and moves forward. Outside the ring, the crowd watches the match with excitement.", "text": "In a boxing ring, two boxers are engaged in a match, with one boxer throwing punches while the other defends. The boxer on the right lands a punch, causing the boxer on the left to bend over in pain. The boxer on the right then grabs the other boxer and throws him to the ground. The referee moves towards the fallen boxer, who is now on the ground, while the standing boxer turns to face the referee. The audience around the ring begins to applaud as the referee checks on the fallen boxer.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 344, "extra_info": {"duration": 7.0, "events": ["Boxer in red shorts throws a punch.", "Boxer in black shorts ducks.", "Boxer in black shorts rushes into an embrace.", "Boxer in black shorts falls to the ground.", "Referee separates the two fighters.", "Boxer in black shorts stands up.", "Boxer in black shorts moves forward.", "Crowd watches the match."], "n_events": 8, "n_shots": 2, "n_subjects": 4}, "vid": 344}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "448.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "At the entrance of a room, a little boy placed a bundle of plants on the ground, then flipped up the hem of his jacket, shaking out a lot of elm seeds from his clothes. The boy then reached into his pants and pulled out more elm seeds. Following this, he shook his pant legs, and even more elm seeds fell out from there. Lastly, the boy picked up the bundle of plants that was at the doorway and set it down in front of the scattered elm seeds all over the floor. In the background, there was the sound of singing and a woman's laughter.", "text": "A boy is seen carrying a large bundle of flowers into a room through a doorway. He places the flowers on the floor and begins to empty the contents of his shirt pocket onto the floor. The boy continues to empty the flowers from his pocket, scattering them all over the floor. After emptying the pocket, he starts to gather the remaining flowers from the floor. The boy then stands up, looking at the scattered flowers on the floor.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 448, "extra_info": {"duration": 12.13, "events": ["A little boy placed a bundle of plants on the ground.", "The little boy flipped up the hem of his jacket.", "Elm seeds shook out from the little boy's jacket.", "The little boy pulled out more elm seeds from his pants.", "The little boy shook his pant legs.", "More elm seeds fell out from the little boy's pant legs.", "The little boy picked up the bundle of plants.", "The little boy set the bundle down in front of scattered seeds.", "Sound of singing is audible in the background.", "A woman's laughter is heard in the background."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 448}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "40.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A group of dogs are barking at an elderly person and a child. They take a few steps backward, with a rope extending upwards behind both the elderly person and the child. The elderly person holds something up to block, while the child clings to the elder's arm, ducking backward. The camera then switches to a black dog that has jumped up from below, staring ahead.", "text": "An elderly man with white hair and glasses, holding a boy in his arms, looks down at two dogs barking at him. The boy, wearing a yellow shirt and a cap, appears frightened and clings to the man. The camera then shifts to show the two dogs, one white and one brown, barking aggressively. A third dog, a Doberman, runs up to the group and stops in front of the man, looking at him with a serious expression. The camera then returns to the elderly man and the boy, who continue to look at the dogs.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 40, "extra_info": {"duration": 5.5, "events": ["A group of dogs bark at an elderly person and a child.", "The elderly person and the child take a few steps backward.", "A rope extends upwards behind the elderly person and the child.", "The elderly person holds something up to block.", "The child clings to the elderly person's arm.", "The child ducks backward.", "A black dog jumps up from below.", "The black dog stares ahead."], "n_events": 8, "n_shots": 3, "n_subjects": 5}, "vid": 40}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "847.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "This video clip showcases a person sitting in the passenger seat of a moving car, filming. As the car moves forward, the trees appear to be moving backward.", "text": "The video shows a car driving on a well-lit road at night, with streetlights illuminating the path ahead. The windshield wipers are actively clearing the rain from the windshield, ensuring visibility. The road is lined with trees and barriers on both sides, and the car is following a vehicle with red taillights in front. As the car continues to drive, a person appears on the right side of the road, waving their hand. The car passes the person and continues down the road, with the streetlights and trees remaining consistent in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 847, "extra_info": {"duration": 5.0, "events": ["A person sits in the passenger seat.", "The car moves forward.", "Trees appear to move backward."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 847}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "391.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two women stood up one after another, and then embraced each other.", "text": "Two women are sitting on chairs outdoors, with a black bag placed between them and a white chair to the left. The woman with curly hair hands an item to the woman with black hair, who then stands up. The woman with black hair turns around and walks towards the woman with curly hair. The woman with curly hair stands up and they embrace each other. After the hug, the woman with black hair sits back down.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 391, "extra_info": {"duration": 6.25, "events": ["The first woman stood up.", "The second woman stood up.", "The two women embraced each other."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 391}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "332.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man dressed in grey-blue clothes gestured a thumbs-up toward the front. Behind him, a man in red clothes and another person wearing a hat were squatting down, looking ahead. The camera then cuts to a person in red clothes on the opposite side, who rolls a basketball forward. The basketball rolls along the ground toward the man in grey-blue clothes, while a woman in red clothes walks towards them from the background. The camera switches back to the man in grey-blue, who kicks the basketball into the air with one foot. The man in red and the one with the hat behind him stand up to watch. The man in red clothes on the opposite side spreads his arms to catch the flying basketball and turns his head to look to the right.", "text": "A man in a gray and blue shirt stands next to a man in a red shirt and a referee, giving a thumbs-up gesture and speaking. The scene then shifts to a man in a red shirt on a baseball field, who kicks a red ball. The ball rolls on the ground, and the man in the gray and blue shirt kicks it. The man in the red shirt jumps and catches the ball.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 332, "extra_info": {"duration": 4.88, "events": ["A man gives a thumbs-up.", "Two people squat, looking ahead.", "Person in red rolls a basketball.", "Woman in red walks forward.", "Man in grey-blue kicks basketball into air.", "Two people stand up to watch.", "Man in red catches the basketball.", "Man in red turns head to right."], "n_events": 8, "n_shots": 4, "n_subjects": 5}, "vid": 332}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "513.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a zoo, a panda played on a makeshift swing composed of three poles, two ropes, and a tire. As it sat on the tire, ready to swing forward, it didn't even start to sway before the panda toppled backwards onto the ground. Afterwards, the panda rolled around on the ground.", "text": "In an indoor enclosure with a blue sky and green foliage background, a panda is seen sitting on a tire swing, holding onto the ropes with its front paws. The panda begins to swing back and forth, gradually gaining momentum. As it continues to swing, it loses balance and starts to fall forward. The panda falls off the swing, landing on the ground with its back and legs. The swing remains in motion as the panda lies on the ground.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 513, "extra_info": {"duration": 6.43, "events": ["A panda climbed onto a makeshift swing.", "A panda sat on the tire.", "A panda attempted to swing forward.", "A panda toppled backwards onto the ground.", "A panda rolled around on the ground."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 513}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "933.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "This video clip features a person wearing a blue shirt and black pants, and a baby monkey wearing colorful clothing. The person places a glass of milk on the table and takes a grape from the monkey's hand. He then proceeds to peel the grape while the monkey eats another grape. It is clear that the person is peeling the grape to feed it to the baby monkey.", "text": "In a kitchen setting, a man in a light blue shirt approaches a small monkey sitting on a chair at a glass table. The man places a glass of milk on the table in front of the monkey. He then picks up a piece of fruit and starts peeling it. The monkey watches attentively and then reaches out to grab the fruit from the man's hand. The man continues to peel the fruit while the monkey holds it. The man then places the glass of milk back on the table and continues to peel the fruit.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 933, "extra_info": {"duration": 9.0, "events": ["A person places a glass of milk on the table", "The person takes a grape from the baby monkey's hand", "The person peels the grape", "The baby monkey eats another grape"], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 933}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "564.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The puppy bites the columnar object that is fixed to the wall in front of it. It leans forward then backs away, trying repeatedly, then it reaches out with its paws and looks down. The camera moves to the right. The background is an indoor corridor.", "text": "A small, fluffy puppy is lying on a gray and white striped rug in a hallway with wooden floors and white walls. The puppy notices a black object with a white tip on the floor near the wall and starts to move towards it. The puppy continues to approach the object, showing curiosity and interest. The puppy then stands up on its hind legs, using its front paws to reach for the object. The puppy repeatedly paws at the object, trying to interact with it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 564, "extra_info": {"duration": 9.53, "events": ["The puppy bites the columnar object.", "The puppy leans forward.", "The puppy backs away.", "The puppy tries repeatedly.", "The puppy reaches out with its paws.", "The puppy looks down."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 564}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "819.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The animation features two individuals garbed in white. To the left stands a slender figure and to the right, a corpulent one. The person on the right holds a controller in hand and guides it towards a chicken. With a press of a button on the controller, a burst of yellow light is emitted which consequently transforms the chicken into an egg. Meanwhile, a person wearing black-framed glasses and a white cap, presumably in a lab setting, points a similar machine at the chicken.", "text": "In a laboratory setting, two blue creatures in white lab coats are working on a device on a table, with one creature holding a screwdriver and the other holding a small device. The device on the table lights up with a yellow arrow pointing to the right. The creature with the device raises it, and the arrow on the device points to the right. The creature then points the device at a chicken in a nest, causing the chicken to explode and disappear. The two creatures laugh and celebrate, with one creature holding the device and the other gesturing excitedly. They then turn to face each other, still smiling and laughing.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 819, "extra_info": {"duration": 10.0, "events": ["The slender figure stands to the left.", "The corpulent individual stands to the right.", "The corpulent individual holds a controller.", "The corpulent individual guides the controller towards the chicken.", "The corpulent individual presses a button on the controller.", "A burst of yellow light is emitted from the controller.", "The chicken transforms into an egg.", "A person wears black-framed glasses and a white cap.", "The person points a machine at the chicken."], "n_events": 9, "n_shots": 4, "n_subjects": 3}, "vid": 819}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "247.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man wearing a yellow robe is holding an item with both hands, turning his head to look at the person behind him. The person behind reacts, afterward the camera moves downward, and a woman picks up the item from the water and places it into a plate.", "text": "In a dimly lit room with arched windows, a man wearing a crown and a long robe stands on a platform, holding a bowl and looking down. Beside him, another person in similar attire holds a bowl and looks at the man. The man in the crown turns his head to look at the person beside him, then turns back and throws the contents of his bowl into a large, illuminated basin below. The camera then shifts to show a woman sitting by the basin, who looks up as the man throws the contents into the water.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 247, "extra_info": {"duration": 6.0, "events": ["Man in yellow robe holds item with both hands.", "Man turns head toward person behind.", "Person behind reacts to man turning head.", "Woman retrieves item from water.", "Woman places item into plate."], "n_events": 5, "n_shots": 1, "n_subjects": 3}, "vid": 247}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "811.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video records On the badminton court, a player wearing a red shirt and a player wearing white autumn clothes playing badminton. There were a few referees sitting around and spectators watching the game. The badminton fell to the ground during the ninth pass. The badminton ball was hit to the ground once by the person in red clothes. The badminton drops once in front of the person in white clothes and blue pants. The white team loses points. The player failed to catch the badminton for one time after the badminton flew over the net. The badminton fell down between two white lines for once.", "text": "The video shows a badminton match between two players, one in a red shirt and the other in a white shirt, on a green court with white boundary lines. The court is surrounded by blue walls with various sponsor logos, including 'Chengdu 2017' and 'Pinduoduo'. The scoreboard at the top left corner indicates the match is between China and Vietnam, with China leading 17-1. The players are actively engaged in the game, with the player in the red shirt serving the shuttlecock and the player in the white shirt returning it. The audience is seated in the background, watching the match intently. The players move swiftly across the court, hitting the shuttlecock back and forth over the net. The match appears to be intense, with both players showing agility and skill.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 811, "extra_info": {"duration": 9.0, "events": ["Player in red shirt hits badminton.", "Badminton falls to ground on ninth pass.", "Referees sit around the court.", "Spectators watch the game.", "Person in red clothes strikes badminton to ground.", "Badminton drops in front of person in white clothes.", "White team loses point.", "Player fails to catch flying badminton.", "Badminton lands between two white lines."], "n_events": 9, "n_shots": 1, "n_subjects": 3}, "vid": 811}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "930.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video clip features two individuals situated in a room. The first individual can be seen clad in a pink and gray jacket paired with blue jeans, while the other is donning a purple jacket and white pants. The younger of the two individuals appears to be drying their hair with a blue towel, whilst the older one appears to be preparing some food. The person in the purple jacket proceeds to pick up a yellow plate off the table and takes a seat. Based on visual cues, it's possible that they could either eat the meal from the orange plate in their hand, or choose to consume food from the plate on the table, which has a set of cutlery beside it.", "text": "A woman in a purple cardigan and dark dress is preparing breakfast in a kitchen with red cabinets, various kitchen utensils, and colorful decorations. She picks up a plate of food and walks towards the table, while another woman in a pink and gray outfit with a blue towel over her shoulder follows her. The woman in the purple cardigan places the plate of food on the table and sits down, while the woman in the pink and gray outfit stands by the counter, watching her. The woman in the purple cardigan starts talking and gesturing with her hands, while the woman in the pink and gray outfit continues to watch her.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 930, "extra_info": {"duration": 11.97, "events": ["Individual dries hair with towel", "Older individual prepares food", "Person picks up yellow plate", "Individual takes a seat"], "n_events": 4, "n_shots": 3, "n_subjects": 2}, "vid": 930}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "874.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two individuals with long hair are strolling down the street. The one on the left has blond, curly hair and is wearing a beige coat, while the individual on the right, engrossed in a smartphone held in their left hand, is dressed in a gray suit, complete with gray clothing, pants, and a purple necktie. Despite attempts at conversation by their companion, the individual in the suit remains engrossed in the phone. Approaching them is a bald individual wearing a red t-shirt, blue jeans, and sunglasses. A collision occurs between the person in red and the individual in the suit, specifically with their left shoulders. The person in red raises their hands in apology, to which the suit-clad individual responds with anger, turning and shouting at the apologetic person in red.", "text": "A man in a suit is looking at his phone while walking alongside a woman with curly hair who has her arms crossed and is talking. The man continues to look at his phone as they walk past a man in a red shirt. The man in the suit turns around to face the man in the red shirt, who is talking to him. The man in the suit then turns back and starts speaking, raising his hand slightly.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 874, "extra_info": {"duration": 4.96, "events": ["Two individuals stroll down the street.", "Individual with blond hair wears beige coat.", "Individual in gray suit uses smartphone.", "Blond individual attempts conversation.", "Individual in gray suit ignores attempts.", "Bald individual approaches wearing red t-shirt.", "Bald individual walks wearing blue jeans.", "Bald individual sports sunglasses.", "Collision occurs between individuals in suit and in red.", "Individual in red raises hands in apology."], "n_events": 10, "n_shots": 3, "n_subjects": 3}, "vid": 874}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "116.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A blonde woman stands at the entrance of a house, holding a green \"For Sale\" sign in her hand. A few silhouettes pass by her. The woman walks forward a few steps with a smile, then glances at the house. Turning back, she sees that the silhouettes have moved on. Her expression changes to one of anger, and she puts down the sign and stamps her foot forcefully.", "text": "A woman in a pink outfit stands on a sidewalk next to a 'For Sale' sign that reads 'For Sale 4 Real Estate' and 'Sandy Village' with a 'Sold' sign in her hand. She turns her head to the left and smiles, then turns her head back to face forward. She lifts her right leg and holds the 'Sold' sign in her right hand, then lowers her leg and looks down at the sign. She turns her head to the left again, then turns her head back to face forward. The woman continues to hold the 'Sold' sign in her right hand while standing in the same position.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 116, "extra_info": {"duration": 3.62, "events": ["Blonde woman stands at house entrance.", "Woman holds green 'For Sale' sign.", "Silhouettes pass by blonde woman.", "Woman walks forward a few steps.", "Woman smiles.", "Woman glances at house.", "Woman turns back.", "Silhouettes have moved on.", "Woman's expression changes to anger.", "Woman puts down sign and stamps foot."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 116}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "343.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a workroom, a man stands in front of a cabinet with its doors open, taking things out from inside. He is holding the items in his arms, having taken so many that he has to use his chin to keep them in place. Afterwards, he closes the cabinet doors, turns around, and walks over to a metal rack. He picks up a bucket of drinks and walks behind the rack.", "text": "In a dimly lit kitchen with stainless steel counters and shelves filled with various items, a man opens a refrigerator and retrieves a container. He then closes the refrigerator door and walks towards the counter. The man places the container on the counter and picks up a bottle of milk. He turns and walks towards the left side of the frame, disappearing from view. The man then reappears, opens the refrigerator again, and retrieves another container.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 343, "extra_info": {"duration": 7.88, "events": ["A man stands in front of an open cabinet", "The man takes items out of the cabinet", "The man holds items using his chin", "The man closes the cabinet doors", "The man turns around", "The man walks over to a metal rack", "The man picks up a bucket of drinks", "The man walks behind the rack"], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 343}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "614.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a gym, with several treadmills around. A person in a white short-sleeved shirt is holding a spray bottle in their left hand, spraying cleaner on the handle and screen of a treadmill, while wiping with a cloth in their right hand. The camera remains fixed.", "text": "A woman wearing a white shirt tied at the waist and black pants is seen in a gym setting, holding a spray bottle in her right hand and a blue cloth in her left hand. She sprays the disinfectant onto the control panel of a treadmill. She then uses the blue cloth to wipe the control panel thoroughly. After cleaning, she continues to spray the disinfectant on the control panel and wipes it again. Finally, she holds the spray bottle and cloth, ready to clean the control panel once more.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 614, "extra_info": {"duration": 12.63, "events": ["Person sprays cleaner on treadmill handle.", "Person sprays cleaner on treadmill screen.", "Person wipes handle with a cloth.", "Person wipes screen with a cloth."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 614}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "193.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person's fingers are placed on the mobile phone screen. The camera switches to show this individual looking at the phone screen, frowning slightly. The camera switches again, capturing from the side as the person puts the phone down, lies on their side, and rests both hands by the pillow.", "text": "The phone screen displays a text message that reads, 'You left this here yesterday... Some things always fit,' with a heart emoji. The camera then shows a close-up of a person's eyes looking at the phone. The person, lying in bed, puts the phone down on the bedside table. The person then lies down, resting their head on a pillow, and looks at the phone screen.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 193, "extra_info": {"duration": 14.5, "events": ["Person places fingers on mobile phone screen.", "Individual frowns while looking at phone screen.", "Person puts phone down.", "Individual lies on side.", "Individual rests hands by pillow."], "n_events": 5, "n_shots": 3, "n_subjects": 1}, "vid": 193}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "179.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man moves forward quickly, then turns a corner. The camera switches to show him rapidly circling around two people. It then cuts to a woman reaching up to touch her temples and slightly shaking her head, followed by a scene where the woman extends a leg and trips the man.", "text": "A man in a purple and gold outfit with a bow tie stands with his arms outstretched in a room filled with purple glowing objects and platforms. The camera zooms in on the man as he continues to hold his arms out. The scene shifts to an overhead view of the man and a woman standing in the center of a circular area surrounded by spinning objects. The man and woman are encircled by a pink glowing ring that spins rapidly. The woman, wearing a blue outfit, touches her ear and closes her eyes, appearing to be affected by the spinning. The camera then focuses on the woman's foot, which is stepping on a glowing spot on the ground.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 179, "extra_info": {"duration": 5.5, "events": ["The man moves forward quickly.", "The man turns a corner.", "The man rapidly circles around two people.", "The woman reaches up to touch her temples.", "The woman slightly shakes her head.", "The woman extends a leg.", "The woman trips the man."], "n_events": 7, "n_shots": 4, "n_subjects": 3}, "vid": 179}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "339.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In front of a field, a woman lifts a little girl high up, the camera then cuts to an open space in front of a house, where a boy in green clothes, with his back to the camera, throws an American football to another boy dressed in black. The ball is intercepted by a man wearing a white shirt, who turns around, pats the head of the boy in black, puts his arm around his shoulder, and walks towards the camera.", "text": "A woman is seen holding a child in her arms, and the scene transitions to a boy standing on the street in front of a building with a sign that reads 'Diamond Hardware.' The boy throws a football, and a man catches it. The man walks towards the boy, pats his head, and places his hand on the boy's shoulder. The man then turns and walks towards the woman and child, with the text 'BRAUN MAYOR' appearing on the screen.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 339, "extra_info": {"duration": 4.5, "events": ["Woman lifts little girl.", "Boy in green throws football.", "Man intercepts football.", "Man pats boy in black's head.", "Man puts arm around boy in black's shoulder.", "Man and boy in black walk towards camera."], "n_events": 6, "n_shots": 2, "n_subjects": 5}, "vid": 339}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "4.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A mouse is pushing a bowl and making a frightened expression. A cat falling down, causing the bowl to fly up. A mouse holding a key is covered under the bowl. The cat grabbing the bowl.", "text": "In a dimly lit room with a stone wall background, a small mouse in an orange outfit is seen pulling a string attached to a white bucket. The mouse continues to pull the string, causing the bucket to move. A large black cat appears, grabbing the bucket with its paw and starting to roll on the ground. The cat, now on its back, continues to roll while holding the bucket. The cat's expression changes to a menacing one as it looks at the mouse, which is now holding a key. The cat, with a fierce expression, grabs the bucket and continues to roll on the ground. The mouse, still holding the key, moves away from the bucket.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 4, "extra_info": {"duration": 4.05, "events": ["A mouse pushes a bowl.", "A mouse makes a frightened expression.", "A cat falls down.", "The bowl flies up.", "A mouse holds a key.", "A mouse is covered under the bowl.", "The cat grabs the bowl."], "n_events": 7, "n_shots": 1, "n_subjects": 3}, "vid": 4}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "469.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "By the riverbank, a crew-cut man was walking while trying to strike up a conversation with a lady holding a cellphone next to him. As he got closer to the lady, he tried to act suave by lifting his arm, bending his leg, and leaning against a nearby pillar. No sooner had he leaned against the pillar than it collapsed, causing both him and the pillar to fall into the river.", "text": "A man stands by the riverbank, holding his head with one hand and leaning against a wooden post. He then walks towards a woman who is holding a phone and talking to him. The man leans against the post, smiling and talking to the woman. He then loses his balance, falls, and falls into the river. The man surfaces in the water, and the wooden post he was leaning on also falls into the river.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 469, "extra_info": {"duration": 4.53, "events": ["Crew-cut man walked by the riverbank.", "Crew-cut man tried to strike up a conversation.", "Lady held a cellphone.", "Crew-cut man got closer to the lady.", "Crew-cut man lifted his arm.", "Crew-cut man bent his leg.", "Crew-cut man leaned against the pillar.", "The pillar collapsed.", "Crew-cut man fell into the river.", "The pillar fell into the river."], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 469}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "576.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person threw a watermelon into the gaping mouth of a hippopotamus, which then crushed the watermelon. The camera panned downward, showing the watermelon juice flowing out. The background featured a pool with a forest behind it.", "text": "A hippopotamus with its mouth wide open is seen near a water body, with a watermelon placed in front of it. A hand places the watermelon into the hippopotamus's mouth. The hippopotamus closes its mouth, swallowing the watermelon. The hippopotamus then chews the watermelon, and some of it spills out onto the ground. The scene repeats with the hippopotamus opening its mouth wide again, and the hand places another watermelon in front of it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 576, "extra_info": {"duration": 6.13, "events": ["Person threw a watermelon into the mouth of a hippopotamus", "Hippopotamus crushed the watermelon", "Watermelon juice flowed out", "Camera panned downward"], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 576}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "326.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a night with only a faint glow, a woman sat on the ground with her left leg stretched out in front of her, and a person lying in front of her. The woman took a flashlight in her right hand, shining it towards the left side of her left leg, and with her left hand, she picked up a notebook and flipped it open, looking at it while pointing with her finger.", "text": "In a dimly lit cave with a small fire burning in the background, a woman sits on the ground next to a person lying face down. She looks at the person lying down, then turns her head to the side and reaches for a flashlight. She turns on the flashlight and uses it to illuminate the ground in front of her. She then picks up a piece of paper from the ground and examines it under the flashlight's light.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 326, "extra_info": {"duration": 4.5, "events": ["A night presents faint glow.", "A woman sits on the ground.", "Woman's left leg is stretched out.", "A person lies in front of the woman.", "Woman takes flashlight with her right hand.", "Woman shines flashlight towards left side of her leg.", "Woman picks up notebook with her left hand.", "Woman flips notebook open.", "Woman looks at notebook.", "Woman points with her finger at notebook."], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 326}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "229.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Underneath the elephant's body, it uses its trunk to blow the feathers away. The camera cuts to a boy behind two railings speaking as he looks ahead, and the girl turns her head to look forward as well.", "text": "An elephant, standing in a pen with a metal fence and a grassy background, lowers its head and begins to bend down. The elephant continues to bend down, using its trunk to pick up some hay from the ground. The elephant then lifts its head slightly while still holding the hay in its trunk. The scene shifts to a boy and a girl standing by the fence, with the boy speaking and the girl turning her head to look at him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 229, "extra_info": {"duration": 6.12, "events": ["The elephant uses its trunk to blow away feathers.", "A boy speaks behind two railings.", "A girl turns her head to look forward."], "n_events": 3, "n_shots": 3, "n_subjects": 3}, "vid": 229}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "445.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The person on the left pours a black liquid, the person in the middle catches it with both hands and then splashes it onto their cheeks, while the person on the right scatters powder. The individual on the left scoops up a bowl of liquid and pours it over the head of the person in the middle. A man dressed in white clothes approaches from the left and, together with the person on the right, sprinkles powder on the person in the middle. The background features yellow house buildings. The camera angle remains fixed and unchanged.", "text": "A person is seen applying a black substance to a statue, which is seated and has long, braided hair. The person continues to spread the substance over the statue's head and body, ensuring it is thoroughly covered. Another person joins in, helping to apply the substance to the statue's head and shoulders. The two individuals work together, ensuring the substance is evenly distributed over the statue's entire body. The background shows a street scene with buildings, a flag, and a few people passing by.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 445, "extra_info": {"duration": 14.83, "events": ["Person on the left pours black liquid.", "Person in the middle catches black liquid with hands.", "Person in the middle splashes black liquid onto cheeks.", "Person on the right scatters powder.", "Person on the left scoops up bowl of liquid.", "Person on the left pours liquid over person in the middle's head.", "Man in white clothes approaches from the left.", "Man in white and person on the right sprinkle powder on person in the middle."], "n_events": 8, "n_shots": 1, "n_subjects": 4}, "vid": 445}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "424.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A male diver was performing a springboard dive when he made a mistake and landed flat on the water's surface.", "text": "A diver is seen in mid-air, performing a dive from a high platform. The diver begins to descend towards the water, with their body fully extended and arms raised. As the diver approaches the water, they start to tuck their body and prepare for entry. The diver enters the water with a splash, creating a large splash upon impact. The water ripples and splashes as the diver submerges. The background shows a banner with the text '2024年全国青年跳水冠军赛' and several spectators and officials seated around the pool.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 424, "extra_info": {"duration": 4.27, "events": ["A male diver starts a springboard dive.", "The male diver performs somersaults in the air.", "The male diver makes a mistake during the dive.", "The male diver attempts to regain form.", "The male diver fails to recover posture.", "The male diver lands flat on the water's surface."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 424}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "349.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman walks in front of the camera, moving from darkness into a lit area, then back into darkness again. Behind her, a car with its headlights on turns away, while the camera follows the woman's movements.", "text": "A woman is walking forward in a dimly lit environment, with a concrete wall in the background. As she continues to walk, the lighting gradually becomes brighter, revealing more details of her and her surroundings. A car with headlights on appears in the background, approaching from behind her. The car continues to follow her as she walks, with its headlights illuminating the scene.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 349, "extra_info": {"duration": 7.0, "events": ["A woman walks in front of the camera.", "The woman moves from darkness into a lit area.", "The woman then returns back into darkness.", "A car with headlights on turns away behind the woman.", "The camera follows the woman's movements."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 349}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "510.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Inside a kindergarten classroom, a boy and a girl are having a competition. They use a long-handled wooden spoon to move a ball from one circle to another. The boy moves a green ball between green circles, while the girl moves a blue ball between blue circles. The boy quickly picks up the ball and runs back and forth, while the girl merely walks briskly. The boy's speed is clearly much faster than the girl's. A group of children sit nearby, watching their competition.", "text": "In a room with wooden floors and a few chairs, a boy in a blue sleeveless shirt and dark shorts uses a yellow stick to hit green balls into a green hoop. He then runs to the right, picks up another green ball, and continues to hit it into the hoop. The boy repeats this action multiple times, moving around the room and hitting the balls into the hoop. A girl in a white dress, also using a yellow stick, follows a similar pattern, hitting blue balls into a blue hoop. The boy and girl continue to play, hitting the balls into their respective hoops and moving around the room.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 510, "extra_info": {"duration": 10.03, "events": ["Boy and girl start a competition.", "Boy uses wooden spoon to move green ball.", "Girl uses wooden spoon to move blue ball.", "Boy quickly picks up the ball.", "Boy runs back and forth between circles.", "Girl walks briskly between circles.", "Children sit nearby watching the competition."], "n_events": 7, "n_shots": 1, "n_subjects": 3}, "vid": 510}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "620.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person wearing a belt slices an avocado and arranges it on chocolate toast, then pinches the crumbs from the bowl in front, sprinkling them on the avocado. Afterwards, they take a cherry tomato, cut it into four pieces, and also place it on top of the avocado.", "text": "A person is seen cutting an avocado in half on a wooden cutting board, with various ingredients like lettuce, cucumber, and avocado slices around. The person then slices the avocado into smaller pieces and sprinkles seeds on it. Next, the person cuts a tomato into small pieces and adds them on top of the avocado. Finally, the person picks up another avocado and prepares to cut it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 620, "extra_info": {"duration": 37.92, "events": ["Person slices an avocado", "Person arranges avocado on chocolate toast", "Person pinches crumbs from the bowl", "Person sprinkles crumbs on the avocado", "Person takes a cherry tomato", "Person cuts cherry tomato into four pieces", "Person places tomato pieces on the avocado"], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 620}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "695.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a night, with streetlights illuminated, a roller coaster slides past the camera, and in the distance, a pirate ship is swaying.", "text": "The video shows a nighttime amusement park with a brightly lit roller coaster in operation. The roller coaster car, filled with passengers, moves along the track, illuminated by colorful lights. The background features various other amusement rides and attractions, also lit up, creating a vibrant and lively atmosphere. The roller coaster car continues to move back and forth, passing under a metal structure that frames the scene. The lights on the roller coaster and the surrounding rides create a dynamic and festive environment.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 695, "extra_info": {"duration": 6.01, "events": ["Roller coaster slides past the camera.", "Pirate ship sways in the distance."], "n_events": 2, "n_shots": 1, "n_subjects": 2}, "vid": 695}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "61.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A character wearing short sleeves reaches out and leads a character in a red cape out of the room. The camera then cuts to a television screen where subtitles are scrolling upwards. The camera switches again to show seven characters watching TV in the room.", "text": "In a dimly lit room with wooden barrels and a brick wall, a man with a mustache and a child in a red cape stand in front of a door. The man gestures towards the door while the child holds a green object. The man and the child start walking towards the door, with the child carrying the green object. They continue walking and exit the room, leaving the door behind. The scene transitions to a television screen displaying a list of names, which scrolls down. The room is then shown with various characters, including a rabbit, a girl, and other animals, all watching the television.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 61, "extra_info": {"duration": 9.88, "events": ["A character reaches out.", "A character leads another character.", "Subtitles scroll upwards on a television screen.", "Seven characters watch television."], "n_events": 4, "n_shots": 3, "n_subjects": 3}, "vid": 61}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "711.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, there is an ongoing game of chess. A hand picks up a knight and captures an opposing pawn. In the foreground, there is a silhouette of a person who makes a slight movement with their hand.", "text": "The video shows a chessboard with various pieces arranged for a game. A hand reaches out and picks up a white knight piece from the board. The hand then moves the white knight piece to a new position on the board. The hand places the white knight piece in its new position. The hand then picks up a black knight piece from the board. The hand moves the black knight piece to a new position on the board. The hand places the black knight piece in its new position.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 711, "extra_info": {"duration": 6.7, "events": ["A game of chess is ongoing.", "A hand picks up a knight.", "The knight captures an opposing pawn.", "A silhouette makes a slight hand movement."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 711}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "950.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person clad in a yellow shirt and black hat faces the camera and makes remarks. Up next, an individual wearing a black apron and a grey t-shirt, brandishes a transparent cup filled with pink smoothie in their left hand. The kitchen counter is laden with assorted foods and culinary utensils. In an unexpected turn, the person upends the cup as though intending to pour the smoothie over their head, but interestingly, the smoothie does not gush out for several ensuing seconds.", "text": "A man in a yellow shirt is speaking to the camera in a kitchen, gesturing with his hands and moving closer to the camera. The scene then shifts to another man in a gray shirt and black apron, who is in the same kitchen. He removes the lid from a blender containing a red mixture, lifts the blender, and pours the contents into a bowl.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 950, "extra_info": {"duration": 6.01, "events": ["A person makes remarks facing the camera.", "An individual brandishes a transparent cup.", "The cup contains pink smoothie.", "The individual upends the cup.", "The smoothie does not gush out."], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 950}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "340.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A black cowboy rides a horse, looking solemnly downwards. He takes off the cowboy hat from his head and places it over his chest. Beside him, a woman on horseback also rests her cowboy hat on her chest. Next to the woman is a group of riderless horses.", "text": "A man wearing a denim jacket, a bandana, and gloves begins to lift his cowboy hat with his right hand. He continues to raise the hat, revealing his head. The man fully removes his hat, holding it in his hand. In the background, a woman on horseback, wearing a red headscarf and holding her hat, turns her horse to face the man. The woman appears to be speaking to the man as she looks at him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 340, "extra_info": {"duration": 7.62, "events": ["A black cowboy rides a horse.", "The cowboy looks downward solemnly.", "The cowboy takes off the cowboy hat.", "The cowboy places the cowboy hat over the cowboy's chest.", "A woman on horseback rests her cowboy hat on her chest.", "A group of riderless horses accompany the riders."], "n_events": 6, "n_shots": 2, "n_subjects": 2}, "vid": 340}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "235.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The albino orangutan showed aggression, then picked up a concrete object from the side and smashed it towards the iron mesh gate, while a bald man shouted at it.", "text": "A woman and a man stand in front of a cage, looking at a large, aggressive ape inside. The man turns to face the ape, and the woman also turns to look at the ape. The man and woman move closer to the cage, and the ape roars at them. The man and woman step back, and the ape continues to roar and move towards the cage. The man and woman look at the ape, which is now closer to the cage. The ape reaches up and grabs the cage door with both hands. The man and woman react with shock and fear as the ape holds onto the cage door.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 235, "extra_info": {"duration": 8.88, "events": ["The albino orangutan showed aggression.", "The albino orangutan picked up a concrete object.", "The albino orangutan smashed the concrete object towards the gate.", "A bald man shouted at the albino orangutan."], "n_events": 4, "n_shots": 9, "n_subjects": 4}, "vid": 235}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "214.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman, holding an axe, ran to one side and leaned against the wall. Taking advantage of the light, she observed the surroundings of the corridor.", "text": "A woman is seen in a dark environment, partially illuminated by a light source. She walks down a dimly lit corridor, holding an axe in her right hand. She stops and looks around cautiously, holding the axe close to her body. The corridor is narrow and appears to be part of a larger, industrial or institutional setting, with metal walls and a grated floor. The woman continues to look around, her expression tense and alert.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 214, "extra_info": {"duration": 9.88, "events": ["The woman grasped an axe.", "The woman ran to one side.", "The woman leaned against the wall.", "The woman used the light to observe.", "The woman surveyed the corridor's surroundings."], "n_events": 5, "n_shots": 4, "n_subjects": 1}, "vid": 214}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "136.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Against the blue drawing background, the names of the production staff switch on the screen multiple times.", "text": "The video begins with a blue background featuring several white, cloud-like shapes scattered around. Text appears in the center of the screen, listing the executive producers: Josh Scherba, Anne Lot, and Stephanie Betts. The text remains on the screen for a few moments before disappearing. New text appears, listing the executive producers again: Paige Broadcock and Craig Schultz. This text also remains on the screen for a few moments before disappearing. The text changes to list the executive producer: Michael Bonfiglio. This text remains on the screen for a few moments before disappearing. The text changes to list the co-executive producer: Meredith Kaufers. This text remains on the screen for a few moments before disappearing. The text returns to list the executive producers: Josh Scherba, Anne Lot, and Stephanie Betts.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 136, "extra_info": {"duration": 5.25, "events": ["Production staff names switch on the screen."], "n_events": 1, "n_shots": 1, "n_subjects": 1}, "vid": 136}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "198.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "SpongeBob moves forward but suddenly stops. Patrick Star bends forward, and another creature says something and turns to leave. The camera switches to Patrick sitting upright, saying something, and SpongeBob runs over and says something.", "text": "SpongeBob, wearing a 'GG' hat, walks forward in a room with checkered flooring and various framed pictures on the walls. He stops and looks ahead, while Patrick, also wearing a 'GG' hat, is sitting at a counter with a green character who is serving ice cream. Patrick turns his head to look at SpongeBob, then turns back to the green character and starts talking. Patrick holds a spoon and looks at the ice cream, which has a face and is decorated with cherries and bananas. SpongeBob, now with a sad expression, walks forward and looks at Patrick.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 198, "extra_info": {"duration": 5.38, "events": ["SpongeBob moves forward.", "SpongeBob suddenly stops.", "Patrick Star bends forward.", "Another creature speaks.", "Another creature turns to leave.", "Patrick Star sits upright.", "Patrick Star says something.", "SpongeBob runs over.", "SpongeBob says something."], "n_events": 9, "n_shots": 2, "n_subjects": 3}, "vid": 198}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "816.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases a pet shop with an array of orange and white birds in an aviary. There are also black and white cats seen in a litter box and different types of fishes swimming in fish tanks. The order of their appearance is: first, the birds; second, the cats; and third, the fish.", "text": "A man exits a black car parked in front of a pet store named 'PET STORE' and walks towards the entrance. The camera then shows a birdcage with several colorful birds perched on branches. Inside the store, there are multiple cages with various birds, including a green parrot and other colorful birds. A colorful bird is seen walking on the ground inside a cage. The store also has cages with small animals, including a dog in a blue and white cage. The video then shows a pond with several fish swimming and plants growing in it. Finally, the camera focuses on an aquarium with various fish and plants, labeled with different species and prices.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 816, "extra_info": {"duration": 11.01, "events": ["Orange and white birds perch in an aviary.", "Black and white cats scratch in a litter box.", "Different types of fish glide through water in tanks."], "n_events": 3, "n_shots": 7, "n_subjects": 5}, "vid": 816}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "837.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video clip features a football game involving two teams dressed in red and black respectively. A certain player from the red team, identifiable by his white socks, abruptly hits a player from the black team. Following this incident, the black team member is seen on the ground, prompting celebratory reactions from the black team. The referee intervenes to adjudicate a committed foul. The player in red jersey with '5' as his number is deemed to have broken the rules, thus relegating the fault to the red team and not their opponents, the black team.", "text": "The soccer match takes place in a stadium filled with spectators. The ball is initially in the air near the center of the field, with players from both teams positioned around it. The ball is then passed to a player in a black uniform, who starts running towards the left side of the field. The player in black continues to advance with the ball, evading players in red uniforms. As the player in black approaches the penalty area, he is tackled by a player in a red uniform, causing him to fall to the ground. The referee, who is following the play, signals a foul by raising his arms. The players from both teams react to the foul, with some raising their arms in protest. The scene then transitions to a different moment in the match, showing the players in their starting positions on the field.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 837, "extra_info": {"duration": 11.0, "events": ["A player from the red team hits a player from the black team.", "The player from the black team falls to the ground.", "The black team members react with celebration.", "The referee intervenes to adjudicate the foul.", "The referee identifies the player in red jersey number 5.", "The referee attributes the foul to the red team."], "n_events": 6, "n_shots": 2, "n_subjects": 3}, "vid": 837}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "360.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman blew out the candles, then turned to her side and leaned against something, covering her body with a blanket.", "text": "In a dimly lit room, a woman is seen holding a candle on a table, which is the only source of light. She then turns away from the candle and begins to sit down. The woman continues to lower herself into a chair, adjusting her position. She pulls a blanket over herself, covering her body. Finally, she settles into a comfortable position, with the candle still burning beside her.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 360, "extra_info": {"duration": 5.0, "events": ["The woman blew out the candles.", "The woman turned to her side.", "The woman leaned against something.", "The woman covered her body with a blanket."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 360}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "733.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person walks up to the scooter, withdraws the kickstand with the left foot, steps on the deck with the right foot, and rides away forward.", "text": "A black electric scooter is parked on a paved path with a white line in the background. A person wearing white sneakers and blue jeans approaches the scooter. The person steps onto the scooter with one foot, then the other, and begins to push the scooter forward. The person continues to push the scooter, moving it along the path. The scooter is pushed further away, and the person is no longer visible. The scooter is left parked on the path as the person has moved out of the frame.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 733, "extra_info": {"duration": 8.17, "events": ["A person walks up to the scooter.", "Person withdraws the kickstand with the left foot.", "Person steps on the deck with the right foot.", "Person rides away forward."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 733}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "115.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a room filled with burning candles, three lit candles slowly ascend diagonally upwards. Suddenly, the candles on the wall are all blown out. The camera switches to four children, with a slightly older boy in the middle holding three candles. The candles on the distant wall are also blown out. Lastly, the three candles held by the boy are extinguished, and in the darkness, their eyes look upwards. Suddenly, a burst of red smoke explodes behind them, and then a tall monster appears within the smoke. They turn around to look at the monster.", "text": "The camera moves upward, revealing a room filled with numerous lit candles, with some candles starting to extinguish. The scene then shifts to a group of four characters standing in front of the candles, looking ahead. As the candles continue to extinguish, the room becomes darker. Suddenly, a red, smoky figure appears in front of the group, causing them to turn and face the figure.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 115, "extra_info": {"duration": 6.25, "events": ["Three candles ascend diagonally upwards.", "All candles on the wall are blown out.", "Four children stand with an older boy holding candles.", "Candles on distant wall are blown out.", "Three candles held by the older boy are extinguished.", "Children look upwards in the darkness.", "Red smoke bursts behind the children.", "A tall monster appears in the smoke.", "Children turn around to face the monster."], "n_events": 9, "n_shots": 2, "n_subjects": 3}, "vid": 115}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "750.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Four people are playing soccer, with two passing the ball and two blocking. One of them runs forward and kicks the ball, which goes in, and teammates raise their hands to cheer. One person from the defending team holds their head with both hands in dismay, while another retrieves the ball from the goal, cradling it in their arms.", "text": "Three men are playing soccer on a field with a goal in the background. One man dribbles the ball while the other two watch. The man with the ball passes it to another player, who then kicks it towards the goal. The ball goes into the net, and the players celebrate by raising their arms and cheering. The man who scored the goal picks up the ball and walks away from the goal. The other players also walk away from the goal area.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 750, "extra_info": {"duration": 9.15, "events": ["Four people play soccer.", "Two players pass the ball.", "Two players attempt to block.", "One player runs forward.", "One player kicks the ball.", "Ball goes into the goal.", "Teammates raise hands to cheer.", "One defender holds head in dismay.", "Another defender retrieves the ball.", "Defender cradles ball in arms."], "n_events": 10, "n_shots": 1, "n_subjects": 4}, "vid": 750}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "364.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man picks up a metal coin and observes it; the scene switches to a close-up of the metal coin in his hand; it switches again to the man standing up and looking at a tree in front of him; then switches to the man reaching out and placing his hand on the tree.", "text": "A man, wearing a dark hoodie, is seen in a forested area with sunlight filtering through the trees. He picks up a coin and examines it closely, showing its detailed design. The man then stands up and looks around, appearing to be cautious or curious. He approaches a tree and places the coin on its trunk.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 364, "extra_info": {"duration": 10.5, "events": ["The man picks up a metal coin.", "The man observes the metal coin.", "The man stands up.", "The man looks at a tree.", "The man reaches out.", "The man places his hand on the tree."], "n_events": 6, "n_shots": 3, "n_subjects": 1}, "vid": 364}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "95.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "An officer wearing a hat reached out and patted the head of the puppy in front of him, and the puppy picked up an object in front and sat on the ground looking at the officer as the officer reached out to take the object from the dog's mouth. Behind the officer stood a person at attention, motionless.", "text": "A man in a green uniform and cap crouches down and pets a brown dog with a white chest, who is sitting on the floor in front of him. The man then retracts his hand and looks at the dog, which looks up at him. The dog lowers its head to pick up a bone from the floor with its mouth. The dog then raises its head and looks at the man, who smiles and extends his hand towards the dog. The dog looks at the man's hand and then lowers its head to pick up the bone again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 95, "extra_info": {"duration": 4.12, "events": ["The officer patted the puppy's head.", "The puppy picked up an object.", "The puppy sat on the ground.", "The puppy looked at the officer.", "The officer reached out to the puppy.", "The officer took the object from the puppy's mouth.", "A person stood at attention behind the officer."], "n_events": 7, "n_shots": 1, "n_subjects": 3}, "vid": 95}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "352.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera zooms out, three people stand on different balconies, their bodies moving slightly.", "text": "A woman is standing on a balcony, holding a child in her arms and looking out at the cityscape at night. The balcony is part of a multi-story building, and the interior of the apartment is visible through the glass doors. The woman is wearing a light-colored top and dark shorts, and the child is dressed in dark clothing. The balcony has a railing, and there is a small table with a lamp on it. The background shows a cityscape with lights and buildings.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 352, "extra_info": {"duration": 11.62, "events": ["The camera zooms out.", "Three people stand on different balconies.", "Bodies move slightly."], "n_events": 3, "n_shots": 1, "n_subjects": 3}, "vid": 352}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "268.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person opens the back door of a car and takes out a bundle of rope. Another person at the front door watches him. He turns around and walks towards the back of the vehicle, fiddling with the rope as he goes.", "text": "A man wearing a hat and a dark suit stands beside a black vintage car, while another man in a black suit and cap stands nearby. The man in the hat opens the car door and begins to retrieve a rope from inside the car. He fully exits the car, holding the rope in his hands. The man in the hat starts to walk away from the car, looking down at the rope. He continues to walk, focusing on the rope as he moves further away from the car.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 268, "extra_info": {"duration": 5.75, "events": ["A person opens the back door of a car.", "A person takes out a bundle of rope.", "Another person watches from the front door.", "A person turns around.", "A person walks towards the back of the vehicle.", "A person fiddles with the rope."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 268}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "228.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman is leaning forward at a table, with a group of people watching her and making various movements and expressions. The woman licks her fingers and chews while looking to the right. Two men are looking at her, the man on the right is extending a hand, while the man on the left is supporting himself on a pillar.", "text": "In a kitchen filled with various jars, bottles, and cooking utensils, a woman in a pink sweater and floral apron tastes something from her finger and then looks at the people in front of her. Four women, one holding a dog, react with surprise and extend their hands towards the woman in pink. The woman in pink continues to taste and contemplate the flavor, while the others maintain their surprised expressions and gestures. The scene then shifts to two men, one holding the dog, who also react with surprise and extend their hands towards the woman in pink.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 228, "extra_info": {"duration": 3.38, "events": ["A woman leans forward at a table.", "A group of people watch the woman.", "People make various movements and expressions.", "The woman licks her fingers.", "The woman chews.", "The woman looks to the right.", "A man on the right extends his hand.", "A man on the left supports himself on a pillar."], "n_events": 8, "n_shots": 4, "n_subjects": 5}, "vid": 228}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "85.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the stage, a person is performing martial arts as the lights shine down. The performer is a boy wearing a blue suit. The camera switches to the audience, where a girl in the middle watches without an expression, while the others are smiling and clapping. The camera switches back to the stage, where, after performing a gesture, the boy kicks upward into a split stance and then holds still in place.", "text": "A boy in a blue suit performs a series of martial arts moves on a stage with a spotlight and smoke effects, while an audience watches. The audience, including a man in glasses and a woman in a white dress, reacts with surprise and then starts clapping. The boy continues his performance, executing a high kick and other martial arts moves. The audience, including a boy in a gray suit, continues to watch and applaud.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 85, "extra_info": {"duration": 5.62, "events": ["A person performs martial arts on stage.", "Lights shine down on stage.", "The boy wears a blue suit.", "The audience watches the performance.", "A girl watches without expression.", "Others smile and clap.", "The boy performs a gesture.", "The boy kicks upward into a split stance.", "The boy holds still in place."], "n_events": 9, "n_shots": 3, "n_subjects": 3}, "vid": 85}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "718.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person jumps onto a platform with both legs, then squats down, followed by returning to the ground, jumps back onto the platform to squat again, and finally steps down from the platform.", "text": "In a well-lit outdoor skate park at night, a person wearing patterned leggings and red shoes is seen performing a series of squats on a wooden platform. The background features several tall light poles and a few people in the distance. The person starts in a standing position, then squats down, and stands up again, repeating this motion several times.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 718, "extra_info": {"duration": 5.23, "events": ["Person jumps onto platform", "Person squats down", "Person returns to ground", "Person jumps back onto platform", "Person squats down again", "Person steps down from platform"], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 718}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "866.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A basketball player, wearing jersey number twenty-three in white, tries to bypass an opponent in a blue jersey, number fifteen, to swiftly approach the basket. However, he misses the layup. The player's teammate, wearing jersey number fifteen in white, secures the rebound and scores the point. The crowd erupts in cheers, delighted by the scored point. Subsequently, player number thirty-three in a blue jersey readies himself to set the ball in motion from the sidelines.", "text": "The basketball game is taking place in the Basketball Africa League, with the score showing Pathitos BL leading Rivers Hoops 15-18. A player in a blue jersey is dribbling the ball quickly towards the basket, pursued by players in white jerseys. The player in blue continues to advance, passing the ball to a teammate who is also in a blue jersey. The teammate receives the ball and jumps to make a layup. The ball goes through the hoop, scoring a point for the blue team. The players in white jerseys attempt to block the shot but are unsuccessful. The blue team players start to move back on defense as the ball is secured.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 866, "extra_info": {"duration": 6.0, "events": ["Basketball player in jersey number twenty-three attempts bypass.", "Player in jersey number twenty-three misses the layup.", "Teammate in jersey number fifteen secures the rebound.", "Teammate in jersey number fifteen scores the point.", "The crowd erupts in cheers.", "Player in jersey number thirty-three readies to inbound ball."], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 866}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "277.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Someone struck the drum set with their hands, a man watched the person playing the drums, the person playing spoke something, a woman outside walked to the door, the camera switched to the upper body of the woman, and she looked at them.", "text": "In a room with a drum set, a man in a gray shirt is playing the drums while a man in a blue shirt stands with his arms crossed, observing him. The man in the gray shirt continues to play the drums, and the man in the blue shirt remains in the same position. A man in a red shirt enters the room from the left side and walks towards the man in the gray shirt. The man in the gray shirt stops playing and looks at the man in the red shirt. The camera then switches to a woman in a red dress standing at the door, looking at the man in the gray shirt.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 277, "extra_info": {"duration": 4.62, "events": ["Someone struck the drum set with their hands.", "A man watched the person playing the drums.", "The person playing the drums spoke.", "A woman outside walked to the door.", "The woman looked at them."], "n_events": 5, "n_shots": 2, "n_subjects": 3}, "vid": 277}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "618.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two people, each holding a bucket of popcorn, face a character in a yellow outfit who is carefully inspecting the tickets. Afterward, they tear the tickets with both hands, hand the stubs to the person on the right, and raise a hand toward the left. Then the two individuals turn and walk to the left.", "text": "Two people are holding red and white striped popcorn buckets, while a third person holds two blue cards. The person with the cards points at the cards and then takes out a piece of paper from one of the cards. The person with the paper hands it to the person holding the cards, who then takes it. The person with the paper raises their hand and walks away, while the person holding the cards remains in place. The scene then switches back to the two people holding the popcorn buckets and the person with the cards.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 618, "extra_info": {"duration": 11.72, "events": ["Two people hold buckets of popcorn.", "Character in yellow outfit inspects tickets.", "Character tears tickets with both hands.", "Character hands stubs to person on the right.", "Character raises hand toward the left.", "Two individuals turn.", "Two individuals walk to the left."], "n_events": 7, "n_shots": 1, "n_subjects": 3}, "vid": 618}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "703.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera angle looks down as a hand reaches forward, pinching a small piece of food between its fingers. A little dog, squatting on the ground, watches the hand holding the food. Then, the dog stands up on its hind legs and takes the food from the hand using its mouth to eat it, before settling back on all fours and looking upwards.", "text": "In a room with a wooden floor and a white rug, a person holds a small treat in their hand, extending it towards a small brown dog sitting on the rug. The dog notices the treat and starts to move closer. The dog jumps up towards the treat, reaching out with its mouth. The dog successfully takes the treat from the person's hand. After receiving the treat, the dog lowers its head and begins to chew.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 703, "extra_info": {"duration": 5.0, "events": ["A hand pinches a piece of food.", "A little dog watches the hand.", "The dog stands up on hind legs.", "The dog takes the food using its mouth.", "The dog eats the food.", "The dog settles back on all fours.", "The dog looks upwards."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 703}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "542.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A rooster runs forward, with the camera following the movement. A man stands in front of a fence with his hands in his pockets, smiling as he looks at the rooster and walks towards it. Other hens follow by his side. The camera then cuts to the man walking forward, with the hens following behind him, as he looks back over his shoulder.", "text": "A chicken is seen running energetically on a grassy area, with several other chickens and a person in the background. The person, dressed in a green jacket, black pants, and boots, stands with hands in pockets, watching the chicken. The person then starts walking towards the camera, smiling, while the chickens follow closely behind. The scene transitions to a different location with a greenhouse in the background, where the person continues walking with the chickens following.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 542, "extra_info": {"duration": 10.5, "events": ["A rooster runs forward.", "The camera follows the rooster.", "A man stands in front of a fence.", "The man walks towards the rooster.", "Hens follow by the man's side.", "The man walks forward.", "Hens follow behind the man.", "The man looks back over his shoulder."], "n_events": 8, "n_shots": 2, "n_subjects": 3, "old_duration": 10.5, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/142.mp4"}, "vid": 542}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "316.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the room, a woman sits on the floor leaning against the bed, holding a square cup in her hand, while a man lies on the bed, pouring a glass of wine for the woman. Then the man puts the wine bottle on the floor, the woman takes a sip, and afterwards the man props his arm on the bed and lifts his head to look upwards, as the woman tilts her head to look diagonally downward after finishing her drink.", "text": "In a dimly lit bedroom, a man and a woman are lying on a bed with white sheets. The man, lying on his side, picks up a bottle of wine and shows it to the woman, who is sitting up and holding a small wooden box. The woman smiles and picks up a cup from the bed. The man then places the bottle back on the bed and starts to get up. The woman takes a sip from her cup while the man continues to move. The man sits up, and the woman continues to drink from her cup. The man then picks up the bottle again and shows it to the woman.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 316, "extra_info": {"duration": 5.0, "events": ["Woman sits on floor leaning against bed.", "Woman holds square cup in hand.", "Man lies on bed pouring wine for woman.", "Man places wine bottle on floor.", "Woman takes a sip of wine.", "Man props arm on bed.", "Man lifts head to look upwards.", "Woman tilts head to look diagonally downward."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 316}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "558.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman runs forward a bit, then jumps and mounts the cow's back. The cow is grazing; the man walks up beside the cow, bends down to look, then pulls the grass away, and the cow turns and follows, running forward.", "text": "A woman approaches a buffalo that is eating from a blue bucket in a rural setting with wooden structures and a brick house in the background. She reaches out and grabs the buffalo's tail, then climbs onto its back. The woman sits on the buffalo's back as it continues to eat from the bucket. A man in a white shirt runs over and picks up the bucket. The buffalo, now with the woman on its back, starts to move forward, dragging the bucket with the man. The buffalo and the woman move away from the initial spot, heading towards a more open area.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 558, "extra_info": {"duration": 9.83, "events": ["The woman runs forward.", "The woman jumps.", "The woman mounts the cow's back.", "The cow grazes.", "The man walks up beside the cow.", "The man bends down.", "The man pulls the grass away.", "The cow turns.", "The cow follows the man.", "The cow runs forward."], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 558}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "916.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts a running competition where players vie for the top spot. All lanes except for the sixth one are filled with competitors. As the race takes a turn, the participant in the fifth lane breaks away from the pack. This racer, wearing number 492 and identified by long black hair, a pink T-shirt, black shorts, and shoes to match, surges ahead of the other contestants and finishes first. Overcome with joy, they raise their hand in jubilation upon crossing the finish line.", "text": "The video shows a group of female athletes running on a blue track, with the camera following their progress from left to right. The athletes are wearing different colored uniforms, with some in red and others in black and white. The background includes a grassy area and a few spectators or officials standing on the sidelines. The athletes maintain a steady pace as they run, with the camera capturing their movements from various angles. The time displayed on the screen changes as the athletes progress, indicating the elapsed time of the race.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 916, "extra_info": {"duration": 7.97, "events": ["Competition begins with a running race.", "Players compete for the top spot.", "Sixth lane remains empty.", "Race takes a turn.", "Participant in fifth lane breaks away.", "Racer number 492 surges ahead.", "Racer number 492 finishes first.", "Racer raises hand in jubilation."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 916}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "748.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man smiled and stretched out both hands forward, the girl ran over and embraced the man. In the girl's hand, she was holding a greeting card with a heart drawn on it.", "text": "A man in a white t-shirt stands in a modern kitchen with white cabinets, jars, and various kitchen items on the counter, extending his arms forward. A young girl in a pink dress runs towards him, and he opens his arms to embrace her. The man lifts the girl into his arms, and they hug tightly. The girl holds a card with a red heart on it, which she shows to the man while they continue to hug. The man smiles and holds the girl close, and the girl removes the card from her hand. The man continues to hold the girl, and they both appear happy and content.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 748, "extra_info": {"duration": 5.51, "events": ["The man smiled.", "The man stretched out both hands.", "The girl ran over.", "The girl embraced the man.", "The girl held a greeting card."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 748}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "962.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts an individual with curly hair, wearing a white t-shirt, carrying another person sporting a pink bandana and a pink t-shirt on their back in a kitchen. The curly-haired individual appears to be jostling the other person. Subsequently, a third person, a child in a black and pink hoodie, appears in a different room. The curly-haired individual then shares some words that lead the child in the black and pink hoodie to begin crying on a black sofa. The child wears a white bandana on their head.", "text": "In a kitchen with a washing machine, a blender, and a coffee maker, a woman with long curly hair and a child in a pink hoodie and yellow skirt are seen talking and gesturing with their hands. The child, wearing a pink headband, asks, 'Mummy can I go on your back too?' The woman responds, 'No, you can't,' and the child looks disappointed, covering their mouth with their hand and crying on a black couch.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 962, "extra_info": {"duration": 10.01, "events": ["Individual with curly hair carries person with pink bandana.", "Curly-haired individual jostles person on their back.", "Child in black and pink hoodie appears in room.", "Curly-haired individual speaks to child.", "Child begins crying on black sofa."], "n_events": 5, "n_shots": 4, "n_subjects": 3}, "vid": 962}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "374.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man stands with his arms raised, looking ahead. The camera switches to a woman holding an umbrella, turning her head to look down to her right. The scene cuts to a firearm, then cuts back to the woman as she turns to her right, reaching sideways to grab something, and then she lifts the gun to aim at the man.", "text": "A man in a white t-shirt stands in a room with wooden doors and a window with white curtains, holding a small object in his right hand and raising his left hand while speaking. A woman in a black outfit approaches him, holding an umbrella with a gun attached to it, and points the gun at him. The camera then shows a close-up of a gun and a black glove on a wooden table. The woman, still holding the umbrella with the gun, raises the gun and points it at the man. The man continues to hold his hands up and speaks.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 374, "extra_info": {"duration": 11.5, "events": ["The man raises his arms.", "The woman holding an umbrella turns her head.", "The woman reaches sideways to grab something.", "The woman lifts the gun to aim at the man."], "n_events": 4, "n_shots": 4, "n_subjects": 2}, "vid": 374}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "953.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases four motorcycle riders racing on a dirt track. Each rider expertly drifts their motorcycles around the bends on the track. The rider with the white helmet initially leads the race but is overtaken as the rider with the blue helmet successfully maneuvers a bending corner. Later, while traversing a flat stretch, they enter another bend; the rider with the red helmet now surpasses the white helmet rider. Consequently, the blue helmet rider leads the race, the white helmet rider is in second place, and the red helmet rider is in third place.", "text": "Four motorcyclists are racing on a dirt track, with a crowd of spectators watching from the stands. The riders navigate a turn, with the motorcyclist in red leading the pack. As they continue through the turn, the motorcyclist in red maintains his lead, closely followed by the motorcyclist in green. The motorcyclist in red and the motorcyclist in green are neck-and-neck, with the motorcyclist in red slightly ahead. The motorcyclist in red and the motorcyclist in green continue to race closely, with the motorcyclist in red still in the lead. The motorcyclist in red and the motorcyclist in green are still in a tight race, with the motorcyclist in red maintaining his position. The motorcyclist in red and the motorcyclist in green are still neck-and-neck, with the motorcyclist in red slightly ahead. 
The motorcyclist in red and the motorcyclist in green continue to race closely, with the motorcyclist in red maintaining his lead.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 953, "extra_info": {"duration": 11.0, "events": ["Four motorcycle riders race on a dirt track.", "Riders drift motorcycles around bends.", "Rider with white helmet leads the race.", "Rider with blue helmet overtakes on a corner.", "Riders traverse a flat stretch.", "Rider with red helmet surpasses the white helmet rider.", "Blue helmet rider leads the race.", "White helmet rider is in second place.", "Red helmet rider takes third place."], "n_events": 9, "n_shots": 2, "n_subjects": 4}, "vid": 953}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "26.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A beaver is speaking and then bends down to pick up an object and pushes it towards the fox's face. The scene switches to a creature taking off its shoes, kicking them forward, standing on the ground with feet apart, and the camera pans up to show the creature opening its mouth.", "text": "In a cave with hanging vines, a bear in an orange shirt talks to a fox in a blue shirt, while another fox in a blue shirt stands nearby. The bear bends down and picks up a piece of food, then feeds it to the fox in the blue shirt. The fox in the blue shirt eats the food, and the bear watches. The camera then shows the fox in the blue shirt's feet, where the shoes are removed, revealing furry paws. The camera moves up to show the fox in the blue shirt's face, and finally, the bear and the fox in the blue shirt look at each other.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 26, "extra_info": {"duration": 9.5, "events": ["A beaver speaks.", "A beaver bends down.", "A beaver picks up an object.", "A beaver pushes the object towards the fox's face.", "A creature takes off its shoes.", "A creature kicks its shoes forward.", "A creature stands on the ground with feet apart.", "A camera pans up.", "A creature opens its mouth."], "n_events": 9, "n_shots": 2, "n_subjects": 4}, "vid": 26}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "59.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A little girl is talking to a kitten, then she lifts the kitten onto her lap, and then picks up the round tag on the kitten's neck to look at the words on it.", "text": "A girl with red hair, wearing a green dress and white apron, is sitting on the steps of a building with the word 'ELEVENBOUGEN' in the background, talking to a gray cat with a red collar. The cat looks at the girl and then at the camera, appearing to respond. The girl then hugs the cat, and the cat looks content. The girl continues to hold the cat, which closes its eyes and appears relaxed. The girl looks at the cat and then at the camera, while the cat remains in her arms. The camera zooms in on the cat's collar, revealing a tag that reads 'My name is inkv Krueber, I live at 1428 Elm Street.' The girl and the cat then look at the camera together.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 59, "extra_info": {"duration": 5.88, "events": ["A little girl talks to a kitten.", "The little girl lifts the kitten onto her lap.", "The little girl picks up the round tag on the kitten's neck.", "The little girl looks at the words on the round tag."], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 59}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "182.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The girl approached the stone wall and then disappeared; the boy leant back, the camera switched to show the boy getting up and running forward, then switched again to show the boy crashing into the stone wall and falling down.", "text": "In a desert-like environment with rocky cliffs and sparse vegetation, a girl in a blue dress walks towards a large rock wall, emitting a pink glow from her hands. The glow intensifies and then fades as she disappears from the scene. A boy, sitting on the ground, watches her leave and then stands up, looking surprised. He then runs towards the rock wall and hides behind it, peeking out to observe the area. The girl reappears from the right side of the frame, still glowing.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 182, "extra_info": {"duration": 5.12, "events": ["The girl approached the stone wall.", "The girl disappeared.", "The boy leant back.", "The boy got up.", "The boy ran forward.", "The boy crashed into the stone wall.", "The boy fell down."], "n_events": 7, "n_shots": 3, "n_subjects": 2}, "vid": 182}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "456.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera pans down from the treetop to capture a person tapping the outer shell of a durian with the back of a small knife. Then, they pick the durian from the tree and toss it down, where another person, wearing white gloves, stretches out their arms to catch the durian and then turns to place it on the ground. The person on the tree continues to gently tap another durian with the back of the small knife, then picks it and tosses it down.", "text": "A person is seen climbing a tree and reaching for a durian fruit hanging from a branch. The person uses a knife to cut the durian from the branch. The person then hands the durian to another individual standing below, who catches it. The person continues to cut more durians from the tree and hands them down to the individuals below. The background shows a lush green area with several people standing on the ground, some holding baskets to collect the durians.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 456, "extra_info": {"duration": 10.0, "events": ["Camera pans down from the treetop.", "Person taps durian with back of knife.", "Person picks the durian from the tree.", "Person tosses the durian down.", "Another person catches the durian with white gloves.", "Person wearing gloves places the durian on the ground.", "Person on the tree taps another durian with knife.", "Person picks the second durian.", "Person tosses the second durian down."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 456}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "499.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A bare-chested man, holding a plank of wood in his hand, accelerates forward into a run, reaching the riverside he throws the plank into the river, then leaps forward onto it. The plank glides through the water as the man stands atop it, maintaining his balance. He and the plank continue to slide until reaching the opposite bank, rushing up the shore, where a person wearing a hat records a video of the man with a cellphone.", "text": "A shirtless man in white shorts is seen holding a surfboard and running across a grassy area in a park. He continues running towards the edge of a pond, passing by trees and a pathway. The man reaches the edge of the pond and jumps onto the surfboard, which is placed on the water. He starts surfing on the water, moving smoothly across the pond. As he surfs, he approaches the shore where another person is standing and watching him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 499, "extra_info": {"duration": 6.8, "events": ["A bare-chessed man accelerates into a run.", "Man reaches the riverside.", "Man throws the plank into the river.", "Man leaps onto the plank.", "The plank glides through the water.", "Man stands atop the plank, maintaining balance.", "Man and plank slide to the opposite bank.", "Man rushes up the shore.", "Person records a video of the man."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 499}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "19.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A character is suspended in midair, making gestures and speaking as the camera pans down to a group of characters gathered in front of a door, among them one wielding a flaming sword. Then, the crowd begins to fight, with the one holding the flaming sword swinging it around and knocking a person to the ground.", "text": "A character in red armor is seen hanging from a rope on a ship's rigging. The scene shifts to a character in black armor holding a flaming torch, surrounded by other characters in armor. The character with the torch runs down the steps of the ship, illuminating the area with the torch. The character continues to move down the steps, passing by other characters who are either standing or sitting. The character with the torch engages in combat with another character, exchanging blows. The character with the torch is hit and stumbles, but continues to fight.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 19, "extra_info": {"duration": 8.62, "events": ["A character is suspended in midair.", "The character makes gestures.", "The character speaks.", "The camera pans down to a group of characters.", "A character wields a flaming sword.", "The crowd begins to fight.", "The character with the flaming sword swings it around.", "The character with the flaming sword knocks a person to the ground."], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 19}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "856.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases a basketball game between two teams: one clad in blue uniforms and the other in white uniforms. The team in white is on defense while a player from the blue team, designated with the number 27, holds the ball. The blue team player makes an attempt to swerve past the defensive line set by the white team. This culminates into the player shooting a successful three-pointer. The crowd instantly erupts into a chorus of cheers, rising to their feet in applause as the blue team secures a victory. It's evident that the audience is exuberant about supporting the winning team.", "text": "The basketball game is taking place in a crowded arena with players from two teams, one in blue jerseys and the other in white jerseys, actively engaged on the court. A player in a blue jersey dribbles the ball towards the basket, evading defenders from the opposing team. He jumps and shoots the ball towards the hoop. The ball goes through the hoop, scoring a point. The player in the blue jersey bends over, appearing to be in pain or exhaustion, while other players react to the successful shot. 
The crowd in the background is cheering and celebrating the successful play.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 856, "extra_info": {"duration": 6.01, "events": ["The basketball game is ongoing.", "The team in blue is in possession of the ball.", "Player 27 from the blue team holds the ball.", "Player 27 attempts to swerve past the white team's defense.", "Player 27 shoots a three-pointer.", "Player 27 scores the three-pointer.", "The crowd erupts into cheers.", "Spectators rise to their feet.", "Spectators applaud the blue team.", "The blue team secures a victory."], "n_events": 10, "n_shots": 3, "n_subjects": 4}, "vid": 856}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "395.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The airplane passes by, the camera follows and pans to the left, revealing a herd of elephants on the grassland.", "text": "A small white airplane with red and black stripes is seen landing on a grassy field, moving from left to right. As the airplane continues to move, a group of elephants comes into view, standing and walking in the field. The elephants, including adults and calves, are seen moving and interacting with each other as the airplane passes by.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 395, "extra_info": {"duration": 4.75, "events": ["The airplane passes by.", "The camera follows the airplane.", "The camera pans to the left.", "A herd of elephants appears on the grassland."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 395}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "449.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In an open area, a man picked up a flatbread with both hands, threw it up into the air, and the flatbread flew in a circle before returning to the man's hands. Then the man took the flatbread and spun around to play with it behind his back, and finally threw the bread into the air once again. In the background, there were sounds of a boy exclaiming in disbelief.", "text": "A man is seen preparing a pancake on a table with several stacked pancakes and a stack of red cups in the background. He then throws the pancake into the air and starts spinning around. The man catches the pancake and begins to perform a series of tricks, tossing and catching it with his hands. He continues to spin and dance while skillfully handling the pancake. The man then throws the pancake into the air and watches it fly away.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 449, "extra_info": {"duration": 15.08, "events": ["Man picked up a flatbread with both hands.", "Man threw flatbread up into the air.", "Flatbread flew in a circle.", "Flatbread returned to man's hands.", "Man spun around with flatbread.", "Man played with flatbread behind his back.", "Man threw flatbread into the air again.", "Boy exclaimed in disbelief."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 449}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "942.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows contestants competing on trike model motorcycles. While navigating through a bend, they adjust their leg positions tactfully to make the turn. The competitor wearing grey shoes, who is recording the video, overtakes the contestant labeled number one from the left. Surrounded by pastures on either side, they gear up to make a left turn. As they move, they dynamically lift their right leg during a left turn and their left leg during a right turn.", "text": "A group of people are riding various types of vehicles, including a motorcycle and several go-karts, on a paved road surrounded by greenery and hills. The riders are moving at high speed, with the motorcycle rider following closely behind the go-karts. As they approach a curve, the go-karts begin to lean to the side, indicating a turn. The motorcycle rider continues to follow the go-karts, maintaining a close distance. The go-karts navigate the curve, with the riders adjusting their positions to maintain balance. The motorcycle rider follows the go-karts as they continue through the turn, with the road curving to the left. The group continues to ride through the curve, with the go-karts leading and the motorcycle following closely behind.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 942, "extra_info": {"duration": 10.01, "events": ["Contestants compete on trike model motorcycles.", "Riders adjust leg positions through a bend.", "Competitor overtakes from the left.", "Competitor wearing grey shoes records the video.", "Riders prepare for a left turn.", "Riders lift right leg during left turn.", "Riders lift left leg during right turn."], "n_events": 7, "n_shots": 1, "n_subjects": 3}, "vid": 942}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "566.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is filled with a bunch of weeds and leaves. A boy in a white short-sleeved shirt is sitting on the ground, placing a spoon on the ground, then he looks up at the camera before picking up a flaky yellow object and placing it inside a mango peel. He puts on the shoes made out of mango peels, and the camera switches to the boy standing up and alternating his feet to show off the mango shoes. Then, he looks up, claps his hands and smiles.", "text": "A boy is sitting on the ground with a coconut, a spoon, and a mango in front of him. He uses the spoon to scoop out the mango flesh and eats it. He then puts the spoon down and starts to put the mango pieces into his shoes. The boy stands up, wearing the mango shoes, and claps his hands. He then sits back down and continues to eat the mango.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 566, "extra_info": {"duration": 9.57, "events": ["A boy sits on the ground.", "The boy places a spoon on the ground.", "The boy looks up at the camera.", "The boy picks up a flaky yellow object.", "The boy places the object inside a mango peel.", "The boy puts on shoes made of mango peels.", "The boy stands up.", "The boy alternates his feet.", "The boy claps his hands.", "The boy smiles."], "n_events": 10, "n_shots": 1, "n_subjects": 1, "old_duration": 12.53, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/166_no_audio.mp4"}, "vid": 566}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "237.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera pans from a distance to close-up on a building where a man is tied up by an object. The shot switches to two other men: one is climbing upward to the rooftop, while the other man hangs off the edge of the building, glancing down below in panic as debris falls around him. The camera cuts back to the man, showing his surroundings from various angles. He leisurely rotates his wrist and then looks off in one direction.", "text": "A man wearing a suit with mechanical arms is seen descending rapidly in a vertical shaft. The camera angle shifts to show the man from above as he continues to descend, with the cityscape visible below. The man looks down at the ground, where another man is seen clinging to the edge of a building, struggling to hold on. The man in the suit continues to descend, looking back at the man on the building. The man on the building continues to struggle, trying to maintain his grip. The man in the suit reaches the bottom of the shaft and looks back at the man on the building.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 237, "extra_info": {"duration": 18.25, "events": ["The camera pans on a building.", "A man is tied up by an object.", "One man climbs upward to the rooftop.", "Another man hangs off the building's edge.", "Debris falls around the hanging man.", "The tied man rotates his wrist.", "The tied man looks off in a direction."], "n_events": 7, "n_shots": 5, "n_subjects": 3}, "vid": 237}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "532.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman stood up straight, her right foot stepping on an inline skate, while holding another in her left hand. She shifted slightly, glided to the left, then dropped the inline skate she was holding and stretched out her left leg to step on it. With both feet on the inline skates, she alternated her legs back and forth, moving to the left.", "text": "A woman wearing a light green dress, white sneakers, and a white cardigan is roller skating in an outdoor area with a concrete floor and yellow and white striped bollards. She starts by standing on the skates and then begins to move forward, lifting her right foot off the ground and extending her arms for balance. She continues to skate forward, maintaining her balance with her arms outstretched. The background shows a modern architectural structure with a curved roof and some greenery in the distance. The woman skates smoothly, moving away from the camera and further into the open area.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 532, "extra_info": {"duration": 8.43, "events": ["The woman stood up straight.", "The woman's right foot stepped on an inline skate.", "The woman held another inline skate in her left hand.", "The woman shifted slightly.", "The woman glided to the left.", "The woman dropped the inline skate.", "The woman stretched out her left leg.", "The woman's left leg stepped on the inline skate.", "With both feet on skates, the woman alternated her legs.", "The woman moved to the left."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 532}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "157.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A group of people are holding guns, pointing towards the broken vehicles in front, speaking to the front, subtitles appear at the bottom. A brown-haired man and a blond man fall out of the car, with the brown-haired man crawling forward a couple of times.", "text": "A group of people, including a man in a suit and tie, a woman in a blue suit, and several others, are pointing guns at a green car with its headlights open, and the text 'Surrender warriors from Arcurus!' appears on the screen. The man in the suit speaks, and the text disappears. The scene then shifts to the green car, which is upside down and has a hole in the wall, with debris scattered around. A boy in the car opens the door and starts to get out, while another boy is seen lying on the ground inside the car.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 157, "extra_info": {"duration": 7.5, "events": ["People are holding guns.", "People are pointing guns towards vehicles.", "Subtitles appear at the bottom.", "The brown-haired man falls out of the car.", "The blond man falls out of the car.", "The brown-haired man crawls forward."], "n_events": 6, "n_shots": 2, "n_subjects": 3}, "vid": 157}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "797.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera pans left, showing the vast plains and lake water as well as distant mountains, with several houses around.", "text": "The video begins with a view of a serene lake surrounded by lush green hills and mountains in the background. As the camera moves to the left, a few houses and a small village come into view, nestled along the lake's edge. The camera continues to pan left, revealing more of the village, including a few red-roofed buildings and a grassy field. The camera then passes by a few trees and a barn, with the lake and mountains still visible in the background. Finally, the camera moves past the barn and continues to show the expansive green fields and the lake, with the mountains still in the distance.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 797, "extra_info": {"duration": 18.12, "events": ["The camera pans left."], "n_events": 1, "n_shots": 1, "n_subjects": 1}, "vid": 797}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "997.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases an individual on the left with long, curly brown hair, dressed in a white sweater, carrying books, walking towards the right. An individual on the right, outfitted in a black coat and blue jeans, possesses a monitor for a head, displaying a constant succession of different profile pictures. The individual on the right abruptly knocks the books from the hands of the person on the left and aggressively shoves the latter against a red wall.", "text": "A woman with long, curly hair is seen walking and talking. The scene then shifts to a person wearing a black jacket and jeans, with a photo of a woman named Cassie Martinez on their head, walking towards the camera. The person with the photo on their head continues walking and approaches the woman with curly hair. The person with the photo on their head suddenly grabs the woman's backpack and pushes her against a brick wall. The woman with curly hair turns her head to look at the person with the photo on their head. The person with the photo on their head then shows a photo of Alex Brown to the woman.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 997, "extra_info": {"duration": 6.01, "events": ["Individual with books walks towards the right.", "Individual with monitor head displays profile pictures.", "Individual with monitor head knocks books from individual's hands.", "Individual with monitor head shoves individual against red wall."], "n_events": 4, "n_shots": 4, "n_subjects": 2}, "vid": 997}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "747.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman with blonde hair opens her mouth wide in awe as she watches the woman with black hair in front of her, who places a headpiece on the blonde-haired woman's head. The blonde woman places her hands on her cheeks, shaking her head and swaying with joy. Afterwards, the woman with black hair turns around to receive a piece of gauze handed over by another woman on the left, and then she places it on the blonde woman's head. As the camera zooms out, the woman with blonde hair waves her hands excitedly and twists her body.", "text": "In a pink-themed room with white cloud decorations, a woman in a white dress holds a 'Bride' crown and shows it to another woman in a pink outfit. The woman in the pink outfit places the crown on the head of the woman in the white dress, who reacts with excitement, covering her mouth with her hands and smiling. The woman in the pink outfit adjusts the crown and veil on the woman in the white dress, who continues to smile and laugh. The woman in the white dress, now wearing the crown and veil, turns around and starts to walk away, while the other two women follow her.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 747, "extra_info": {"duration": 18.9, "events": ["Blonde-haired woman opens mouth in awe", "Woman with black hair places headpiece on blonde-haired woman", "Blonde-haired woman puts hands on cheeks", "Blonde-haired woman shakes head", "Blonde-haired woman sways with joy", "Woman with black hair receives piece of gauze from another woman", "Woman with black hair places gauze on blonde-haired woman's head", "Blonde-haired woman waves hands excitedly", "Blonde-haired woman twists body"], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 747}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "451.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a road submerged in puddle water, a man, holding a pair of shoes in his hand, rolled up his trousers, and with bare feet, waded through the water towards the opposite side of the street. Just as he was about to reach the curb, he leapt forward, jumping onto the curb, but immediately slipped and fell, ending up sitting on the ground.", "text": "A man stands in a flooded street, holding a pair of shoes in his hands. He begins to walk through the water, stepping carefully. The man continues to walk, occasionally looking around as he moves forward. He then starts to run, still holding the shoes, and heads towards the right side of the frame. The man runs past an orange car parked on the right side of the street.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 451, "extra_info": {"duration": 10.8, "events": ["A man held a pair of shoes in his hand.", "The man rolled up his trousers.", "The man waded through puddle water with bare feet.", "The man walked towards the opposite side of the street.", "The man leapt onto the curb.", "The man immediately slipped.", "The man fell to the ground.", "The man ended up sitting on the ground."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 451}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "1.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A character emerges from smoke and fire, then half knelt on the ground and made a gesture to give instructions, while another group of characters rushed forward and grabbed two green monsters.", "text": "A robot with the number '04114' on its back is seen on the ground, surrounded by smoke and debris. The robot begins to rise, raising its right hand and waving. More robots, also with numbers on their backs, start to appear and move towards the center. The robots gather around two green creatures, one small and one large, who are sitting on the ground. The large green creature stands up with the help of the surrounding robots, while the small green creature remains seated.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 1, "extra_info": {"duration": 5.09, "events": ["A character emerges from smoke and fire.", "The character half kneels on the ground.", "The character makes a gesture to give instructions.", "A group of characters rushes forward.", "The group of characters grabs two green monsters."], "n_events": 5, "n_shots": 2, "n_subjects": 4}, "vid": 1}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "393.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man and a woman are crouched, searching through the belongings of a person lying on the ground, when a police car approaches from a distance and comes to a stop. The camera cuts to show the man and woman noticing the arrival of the police car. The scene switches again to show two police officers, one male and one female, exiting the vehicle and walking towards them.", "text": "At night, a man and a woman are kneeling beside an injured person lying on the ground near a monument, with city lights and a police car in the background. The man begins to check the injured person's condition while the woman watches. A police car with flashing lights arrives and stops nearby. The man continues to check the injured person, and the woman looks on. The police car's door opens, and a police officer steps out, followed by another officer. The two officers walk towards the injured person and the man and woman.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 393, "extra_info": {"duration": 9.0, "events": ["A man and a woman crouch.", "They search through someone's belongings.", "A person lies on the ground.", "A police car approaches.", "The police car comes to a stop.", "The man and woman notice the police car.", "Two police officers exit the police car.", "One male officer walks towards the man and woman.", "One female officer walks towards the man and woman."], "n_events": 9, "n_shots": 3, "n_subjects": 2}, "vid": 393}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "64.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A character wearing a suit and glasses is looking at and speaking to the screen ahead. The shot switches to the screen, where another character takes a cellphone away from their ear and quickly shrinks to disappear into the screen. On the screen, three charts are displayed showing downward trends. The reflection in the screen shows the face of the character with glasses. The shot then switches back to the front of the character, who takes off his glasses and places them on the table, leans back in his chair, and then raises his hand to touch his forehead.", "text": "A man in a suit and tie is sitting at a desk in an office with a cityscape background, looking at a computer screen. The screen shows a graph with a green line and various data points. The man, now wearing glasses, looks at the screen and then removes his glasses, placing them on the desk. He then rubs his eyes with his hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 64, "extra_info": {"duration": 7.12, "events": ["Character wearing suit speaks to the screen.", "Another character takes cellphone away from their ear.", "Another character shrinks to disappear into the screen.", "Three charts showing downward trends are displayed.", "The reflection reveals the face of the character with glasses.", "Character wearing suit takes off his glasses.", "Glasses are placed on the table.", "Character wearing suit leans back in chair.", "Character wearing suit raises hand to touch forehead."], "n_events": 9, "n_shots": 3, "n_subjects": 2}, "vid": 64}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "534.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A child wearing protective gear is skateboarding on a U-shaped track, sliding from left to right. The camera follows the movement, and as the child soars into the air, they quickly spin their body, switching directions before landing back on the track and continuing to slide left. In the air, the child flips both their body and skateboard, and after landing, they slide to the right.", "text": "A person wearing a white helmet and protective gear is skateboarding in an indoor skate park with a large wooden ramp. The skateboarder starts at the bottom of the ramp and begins to ascend the curved surface. The skateboarder gains momentum and launches into the air, performing a spin while in mid-air. The skateboarder lands smoothly on the ramp and continues to skate down the slope. The skateboarder repeats the process, ascending the ramp again and performing another spin in the air. The skateboarder lands and continues to skate down the slope, maintaining balance and speed.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 534, "extra_info": {"duration": 5.7, "events": ["Child wears protective gear", "Child skateboards on U-shaped track", "Child slides from left to right", "Child soars into the air", "Child quickly spins body", "Child switches directions", "Child lands back on track", "Child slides left", "Child flips body and skateboard", "Child slides to the right"], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 534}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "94.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the map, a red dashed line snakes from left to right and stops at a marker. A vehicle enters from the left side of the map and gradually covers most of the dashed line.", "text": "The video begins with a detailed map of a region, showing various towns and cities such as Charleau-Thierry, Domremy, and Metz, with red markers indicating specific locations. A red line starts to appear, connecting the towns of Charleau-Thierry and Metz, and continues to extend towards the town of Metz. The red line then forms a loop, indicating a route or path. As the red line completes the loop, a horse-drawn carriage appears from the left side of the frame and moves across the map, passing over the red line. The carriage continues to move across the map, eventually exiting the frame on the right side. The map remains visible with the red line and markers, and the carriage is no longer in the frame.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 94, "extra_info": {"duration": 4.75, "events": ["A red dashed line snakes across the map.", "The line stops at a marker.", "A vehicle enters from the left side of the map.", "The vehicle covers most of the dashed line."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 94}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "303.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera captures a panoramic shot of a bathroom scene where a woman is looking into a mirror, and a man is sitting on the toilet. The shot switches to a close-up of the woman, who adjusts her hair. The man stands up from the toilet, the woman closes the cabinet door in front of the mirror, and the man picks up a towel from the sink to dry his hands. The woman then turns around to speak with the man.", "text": "In a bathroom with a white sink, mirror, and a window with frosted glass, a woman in a knitted sweater and jeans adjusts her hair in front of the mirror. A man, dressed in a dark sweater with white stripes, sits on the toilet and then stands up, picking up a towel from the sink. He begins to dry his hands with the towel while the woman continues to adjust her hair. The woman then turns to face the man and starts talking to him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 303, "extra_info": {"duration": 7.12, "events": ["Woman looks into mirror", "Man sits on toilet", "Woman adjusts her hair", "Man stands up from toilet", "Woman closes cabinet door", "Man picks up towel from sink", "Man dries hands with towel", "Woman turns around", "Woman speaks with man"], "n_events": 9, "n_shots": 4, "n_subjects": 2}, "vid": 303}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "677.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On an assembly line, workers dressed in white and blue clothes are operating on a moving platform. The video is played in fast-forward from beginning to end.", "text": "The video begins with a blurred view of a factory setting, showing a conveyor belt with electronic components. A worker in a white lab coat and gloves places a component onto the conveyor belt. The camera moves along the conveyor belt, showing more workers in similar attire working on the components. The workers are seen handling and assembling the electronic components, with some using tools to secure the parts. The background includes various machinery, shelves, and equipment typical of an electronics manufacturing facility. The workers continue their tasks, with some placing components into boxes and others inspecting the assembled units. The video concludes with a wider view of the factory floor, showing multiple workers engaged in their respective tasks along the conveyor belt.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 677, "extra_info": {"duration": 8.08, "events": ["Workers operate on a moving platform", "Assembly line moves continuously", "Workers dress in uniforms", "Video plays in fast-forward", "Workers perform tasks efficiently", "Platform carries products along the line", "Video starts", "Video ends"], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 677}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "716.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A young man and woman are playing video games with controllers. They both synchronize a movement with their bodies, then the woman lifts her arm to press down on the man's arm. The man dodges and continues operating the controller. Afterwards, the woman lets go of her controller with her left hand, covers the man's eyes for a short moment then brings her hand back to continue playing. The man shows a look of frustration, while the woman laughs joyfully. She twists her body, and then she turns around, caresses the man's face with her left hand, kisses him, and rests her head against his face. Finally, she speaks to the man.", "text": "A man and a woman are sitting on a couch in a dimly lit room with a purple and pink patterned wall in the background, playing a video game with controllers. They are both focused on the game, smiling and laughing as they play. The woman gets excited and covers the man's eyes with her hand, then they both laugh and continue playing. The woman wins the game, raises her arms in celebration, and then hugs the man, kissing his cheek and hugging him tightly. They both look at the screen, and the woman continues to hold the man, showing affection.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 716, "extra_info": {"duration": 18.32, "events": ["A young man and woman play video games with controllers.", "Both synchronize a movement with their bodies.", "The woman lifts her arm to press the man's arm.", "The man dodges and continues playing.", "The woman covers the man's eyes briefly.", "The man shows frustration.", "The woman laughs joyfully.", "The woman twists her body.", "The woman turns around and caresses the man's face.", "The woman kisses the man and rests her head against his."], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 716}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "578.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man opens the car door, stretches out his right leg, jumps into the water, lies forward, then stands up straight, holding mud in both hands. In a rush, he looks around, and finally, he looks down. The background is a road amidst the mountains and forests.", "text": "A silver van is parked on a concrete path next to a muddy water body, with a forested background. The van's rear door opens, and a man begins to exit the vehicle. The man steps out of the van and jumps into the muddy water. He lands in the water, creating a splash. The man struggles to get up, splashing water around him. He stands up and starts to walk through the water, looking around. The man continues to walk through the water, occasionally looking back at the van.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 578, "extra_info": {"duration": 6.48, "events": ["The man opens the car door", "The man stretches out his right leg", "The man jumps into the water", "The man lies forward in the water", "The man stands up straight", "The man holds mud in both hands", "The man looks around in a rush", "The man looks down"], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 578}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "626.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the center of a black background screen, there is a rounded rectangle with a downward triangle. At its center is a white heart shape and a number. The number starts at 1 and increases one by one: 2, 3, 4, 5, 6... then quickly increases to two-digit numbers, three-digit numbers, until it reaches five digits, finally stopping at 99999. As the number of digits increases, the length of the rounded rectangle also expands accordingly.", "text": "The video begins with a black background. A red heart-shaped icon with a white heart and the number '1' appears in the center. The number on the icon starts to increase sequentially, changing from '1' to '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 626, "extra_info": {"duration": 12.0, "events": ["The number starts at 1.", "The number increases to 2.", "The number increments to 3.", "The number counts up to 4.", "The number ascends to 5.", "The number progresses to 6.", "The number quickly increases to two-digit numbers.", "The number rapidly climbs to three-digit numbers.", "The number swiftly jumps to five-digit numbers.", "The number stops at 99999."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 626}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "936.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts two individuals in suits standing on opposite sides of a table, each extending their right hand for a handshake. Upon completing the handshake, the person wearing a black suit on the right gestures with his right hand again, palm facing up, inviting the individual in the gray suit with long hair on the left to sit down instead of standing.", "text": "A man in a suit is sitting at a white table with a laptop and a pen on it, looking to his left. He then turns his head to face forward and extends his hand. A woman approaches from the left, and they shake hands. The woman then sits down on the chair next to the man. The man sits back in his chair, and the woman adjusts her position. The man then extends his hand again, and the woman approaches him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 936, "extra_info": {"duration": 5.0, "events": ["Two individuals stand on opposite sides of a table.", "Each extends their right hand.", "They complete the handshake.", "Person in black suit gestures with right hand.", "Gesture invites individual in gray suit to sit down."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 936}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "348.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a room adorned with photographs, a man on the left is holding a camera, taking pictures of a woman in red across from him. The camera angle switches to the man's perspective, as the woman looks into the camera lens and gives a slight nod.", "text": "In a dimly lit room filled with shelves of books, boxes, and various items, a man in a blue shirt and dark pants holds a camera and prepares to take a photo of a woman standing in front of him. The woman, dressed in a red coat and dark shirt, stands still and looks at the man. The man adjusts his camera and begins to take a photo of the woman. The camera then focuses on the woman, showing her standing still and looking at the man.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 348, "extra_info": {"duration": 5.62, "events": ["A man holds a camera.", "The man takes pictures of the woman.", "The woman looks into the camera lens.", "The woman gives a slight nod."], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 348}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "557.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background features an indoor climbing wall. A woman is moving and climbing up the wall, coordinated with her hands and feet, then looks down while holding onto an object with one hand. The camera follows the movement.", "text": "A woman is climbing an indoor rock wall with various colored holds, including yellow, black, blue, and orange. She starts by reaching for a black hold with her right hand and a blue hold with her left foot. She then shifts her weight and moves her right foot to a higher blue hold while her left hand reaches for a higher hold. She continues to ascend by grabbing a larger blue hold with her right hand and a smaller blue hold with her left hand. She reaches for a higher blue hold with her right hand and a green hold with her left hand. She then grabs a larger blue hold with her right hand and a smaller blue hold with her left hand, pulling herself up. She reaches for a higher blue hold with her right hand and a smaller blue hold with her left hand, continuing her climb. She grabs a larger blue hold with her right hand and a smaller blue hold with her left hand, pulling herself up. She reaches for a higher blue hold with her right hand and a smaller blue hold with her left hand, continuing her climb.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 557, "extra_info": {"duration": 5.9, "events": ["A woman climbs up the indoor wall.", "The woman coordinates her hands and feet.", "The woman looks down.", "The woman holds onto an object with one hand."], "n_events": 4, "n_shots": 1, "n_subjects": 1, "old_duration": 5.9, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/157.mp4"}, "vid": 557}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "308.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows a scene inside a restaurant, where a man places the newspaper he is holding onto a chair, turns around to face another gentlemen seated across the table. The man across the table takes out a cigarette and puts it in his mouth, while other people in the restaurant are moving about.", "text": "In a dimly lit diner with a rainy window in the background, a man in black is holding a newspaper and talking to a man in a denim jacket. The man in black puts down the newspaper and starts gesturing with his hands while talking. The man in the denim jacket listens attentively, then takes out a cigarette and puts it in his mouth. The man in black continues to talk and gestures towards the man in the denim jacket. A person walks by outside the window as the conversation continues.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 308, "extra_info": {"duration": 4.25, "events": ["A man places the newspaper onto a chair.", "The man turns around to face another seated man.", "The seated man takes out a cigarette.", "The seated man puts the cigarette in his mouth.", "People in the restaurant are moving about."], "n_events": 5, "n_shots": 1, "n_subjects": 3}, "vid": 308}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "232.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man looked at the elephant in front of him, then the elephant walked forward, and the man's gaze followed the movement of the elephant. The camera switched to the backside of the elephant, where a person wearing a red hat was pushing the elephant forward, with several other people standing nearby, turning their heads to look at the elephant. The camera then switched again, showing a few people pushing against a barrier in front of them.", "text": "A bald man in a suit stands facing an elephant in a dimly lit environment. The elephant begins to move forward, and the bald man turns his head to follow its movement. The elephant continues to walk, and a man in a blue uniform with a red hat starts to climb onto the elephant's back. The man in the blue uniform successfully climbs onto the elephant's back, while the bald man and two other men, one in a suit and the other in a uniform, watch the scene. The camera then shifts to show a group of people holding onto bars, looking towards the elephant.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 232, "extra_info": {"duration": 8.5, "events": ["The man looked at the elephant.", "The elephant walked forward.", "The man's gaze followed the elephant.", "A person wearing a red hat pushed the elephant.", "Several people turned their heads to look at the elephant.", "A few people pushed against a barrier."], "n_events": 6, "n_shots": 4, "n_subjects": 5}, "vid": 232}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "290.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A middle-aged man and woman walked towards the camera, looking at a woman with a hat in the direction of the camera. The woman quickly turned around, and the camera cut to the ground. A cigarette butt was thrown onto the floor, then the camera switched back to the woman with the hat standing with a man in a green outfit. The woman covered her mouth with her hand, and the camera cut to the ground again, where a foot stepped on the cigarette butt.", "text": "A man and a woman walk down a busy street, passing by a shop with a 'PAN' sign on the door. The woman turns her head to look at a man and a woman standing by the shop entrance. The camera then focuses on the ground as a cigarette falls and lands on the pavement. The woman by the shop entrance bends down and picks up the cigarette. The camera then shows a foot stepping on the cigarette, crushing it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 290, "extra_info": {"duration": 3.62, "events": ["Middle-aged man and woman walked towards the camera.", "Woman with hat looked at the camera direction.", "Woman with hat quickly turned around.", "Cigarette butt was thrown onto the floor.", "Woman with hat stood with man in green outfit.", "Woman with hat covered her mouth with her hand.", "Foot stepped on the cigarette butt."], "n_events": 7, "n_shots": 4, "n_subjects": 5}, "vid": 290}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "822.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "This video presents an individual in the process of creating flower specimens. The person, who has blond hair and is wearing a black T-shirt, opens a hefty book and places some pink flowers inside. They then close the book and exert pressure on it by using their right elbow. The person continues to use their elbow to press down on the book several times, effectively pressing the flowers within. They put significant effort into pressing the book.", "text": "A person wearing a black t-shirt with a graphic design and Chinese text is standing in a kitchen, flipping through a book on a white table. The table has a glass of water and some flowers on the left side. The person continues to flip through the book, then starts to bend down, placing their hands on their knees. The person leans further down, bringing their head closer to the table, and then straightens up again, resuming their position.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 822, "extra_info": {"duration": 9.99, "events": ["Individual opens a hefty book.", "Individual places pink flowers inside book.", "Individual closes the book.", "Individual presses on book with right elbow.", "Individual repositions elbow to press book multiple times.", "Individual exerts significant effort pressing the book."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 822}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "766.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man lifts the blanket with both hands and embraces the woman in front of him, while the woman wraps her arms around herself. They lean their heads close together and smile. The woman turns around to make eye contact with the man, and he drapes the blanket over both of their heads, leaning back with his body as the woman extends her left leg.", "text": "A man and a woman are sitting on a bed in a cozy room with a mirror on the wall and a vase with dried flowers on a table. The man, sitting behind the woman, picks up a blanket and drapes it over them. The woman crosses her arms and smiles as the man wraps the blanket around them. The man then pulls the blanket over his head, covering both of them. The woman turns her head to look at the man, and they both smile. The man then pulls the blanket off, revealing the woman's legs.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 766, "extra_info": {"duration": 10.14, "events": ["The man lifts the blanket with both hands.", "The man embraces the woman.", "The woman wraps her arms around herself.", "They lean their heads close together.", "They smile at each other.", "The woman turns around to make eye contact.", "The man drapes the blanket over both heads.", "The man leans back with his body.", "The woman extends her left leg."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 766}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "596.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The backdrop is inside a soccer field. A woman dressed in a black coat is running forward, while a soccer player on the right, wearing a red short-sleeved shirt, turns around. As he bends over with one hand supporting himself on the ground, his right leg stretches out toward the woman. She falls forward and rolls on the ground. The soccer player turns his head to look at the woman, and the two teammates on the left also turn to look at her.", "text": "A woman in a black coat is running on a soccer field towards a player in a pink shirt. As she gets closer, the player in the pink shirt starts to fall to the ground. The woman continues running and trips over the fallen player, causing her to fall as well. She lands on the ground and rolls over. The woman then gets up and continues running towards the player in the pink shirt.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 596, "extra_info": {"duration": 4.3, "events": ["A woman runs forward on the soccer field.", "A soccer player turns around.", "The soccer player bends over.", "The soccer player's right leg stretches out.", "The woman falls forward.", "The woman rolls on the ground.", "The soccer player looks at the woman.", "Two teammates turn to look at the woman."], "n_events": 8, "n_shots": 1, "n_subjects": 4}, "vid": 596}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "181.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The octopus grabbed the boy and girl with its tentacles, and one tentacle took out the girl's bag. The camera then cuts to the octopus emptying the contents of the bag, before retracting its tentacle, looking at the bag in surprise, and continuing to pour things out, switching to the boy looking up.", "text": "A large purple creature with two large eyes is seen with two children in its pouches. The creature's tongue extends out, and the children turn to look at it. The creature's tongue retracts, and it picks up a sword from the ground. The creature then places the sword into the pouch of the boy. The boy looks at the sword and then turns his head to look at the girl. The girl also turns her head to look at the boy.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 181, "extra_info": {"duration": 9.25, "events": ["The octopus grabbed the boy with its tentacles.", "The octopus grabbed the girl with its tentacles.", "A tentacle took out the girl's bag.", "The octopus emptied the contents of the girl's bag.", "The octopus's tentacle retracted.", "The octopus looked at the girl's bag in surprise.", "The octopus continued to pour things out of the girl's bag.", "The boy looked up."], "n_events": 8, "n_shots": 3, "n_subjects": 3}, "vid": 181}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "896.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases a badminton match between two pairs - one from the white team and the other from the red team. The score is currently at 14:16, with the white team trailing. The red team is dressed in red t-shirts and black shorts, while the white team soldiers on in white t-shirts and black shorts. In the ongoing rally, the red team serves first. The white then returns the shuttlecock only for the red team to reply shortly after. A player from the white team, wearing a red headband, makes a heroic dive to return the shuttlecock but fails as it had already hit the ground. Consequently, the white team loses the point, pushing the score to 14:17, now in favor of the red team.", "text": "The badminton match is taking place in a large indoor arena with a green court and blue surroundings. The score is displayed on the left side of the screen, showing China leading Japan 15-14. The players are engaged in a rally, with the team in white and the team in red exchanging shots. The player in white on the right side of the court dives to the ground to return the shuttlecock but misses and falls. The player in white on the left side of the court continues to play, while the player in red on the right side of the court prepares to return the shuttlecock. The player in white on the right side of the court falls to the ground, and the player in red on the right side of the court continues to play. 
The player in white on the left side of the court moves to the center of the court, and the player in red on the right side of the court prepares to return the shuttlecock.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 896, "extra_info": {"duration": 5.0, "events": ["The score reaches 14:16.", "The red team serves.", "The white team returns the shuttlecock.", "The red team replies with a hit.", "A player from the white team dives.", "The shuttlecock hits the ground.", "The red team scores a point.", "The score updates to 14:17."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 896}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "954.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts a person with brown hair, sitting on the rocks positioned above a stream in a forest. The person is surrounded by numerous trees and sizable boulders, stretching their feet towards the stream. The viewer can see the individual's blue shorts as they lower their left leg into the cool stream water, apparently preparing to immerse themselves in it.", "text": "The video begins with a view of a small, clear pond surrounded by rocks and trees, with the text 'DAY 27' displayed in the top left corner. The camera then shifts to a shirtless man standing near the pond, who is talking and looking around. He gestures with his hand while continuing to speak. The man then sits down on a rock at the edge of the pond, with his legs hanging over the water. The camera captures the serene environment, showing the pond, rocks, and surrounding trees.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 954, "extra_info": {"duration": 5.0, "events": ["Person sits on rocks.", "Person stretches feet towards stream.", "Person lowers left leg into water."], "n_events": 3, "n_shots": 4, "n_subjects": 1}, "vid": 954}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "991.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows astronauts in the cockpit of a spacecraft, manipulating several buttons on the control panel which results in the spacecraft shaking violently. An astronaut tries to stabilize the craft by adjusting a control lever, but their terrified and scared expressions indicate the dire state of the situation.", "text": "A man wearing a headset is seen operating a control panel with various buttons and a steering wheel, gesturing with his hands as he speaks. He then reaches out and presses several buttons on the control panel, causing the lights to change. The man continues to manipulate the controls, turning the steering wheel and making more hand gestures. The camera then shifts to show another man with a beard, who is also wearing a headset and appears to be speaking.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 991, "extra_info": {"duration": 10.0, "events": ["Astronauts manipulate buttons on the control panel.", "Control panel usage causes spacecraft to shake violently.", "Astronaut adjusts a control lever.", "Astronauts display terrified expressions.", "Astronauts attempt to stabilize the spacecraft."], "n_events": 5, "n_shots": 6, "n_subjects": 2}, "vid": 991}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "315.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the bed in the bedroom, a man covers a woman with a blanket and wraps it tightly around her, then makes eye contact with her.", "text": "In a dimly lit room, a man is seen leaning over a woman who is lying in bed. He begins to pull the blanket over her, covering her up. The man continues to adjust the blanket, ensuring it is properly covering the woman. After covering her, he sits back slightly and looks at her. The man then leans in closer to the woman, appearing to check on her or speak to her.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 315, "extra_info": {"duration": 5.62, "events": ["Man covers woman with a blanket.", "Man wraps blanket tightly around woman.", "Man makes eye contact with woman."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 315}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "952.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features a pan containing some sort of liquid mixture positioned on a stove. A rope is hung over the pan. Behind the stove, a yellow bag of potato chips with an attached string can be seen. Next to these items on the kitchen counter, there are red potato chips and different kinds of boxed food items. The pan contents are heated, forming a black substance. As the heat intensifies, smoke begins to leak and the black substance amasses at the bottom of the pan.", "text": "A pot of water is heating up on a stove, with a bag of 'Kettle Brand' popcorn and a red bag of chips placed on a table in the background. The water in the pot starts to boil, and the text 'Getting hotter...' appears on the screen. The water continues to boil, and the text 'Smoke starts to roll...' appears. The water in the pot begins to cool down, and the text 'Heating up the grease...' appears.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 952, "extra_info": {"duration": 11.0, "events": ["A pan with liquid mixture is positioned on a stove", "A rope is hung over the pan", "A yellow bag of potato chips is seen with an attached string", "Red potato chips and boxed food items are placed on the counter", "The pan's contents are heated", "A black substance forms in the pan", "Smoke begins to leak from the pan", "The black substance amasses at the pan's bottom"], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 952}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "144.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A short woman hands a piece of paper to a man in red clothes, who leans forward to take it and looks at it.", "text": "In a grand, ornate hallway with golden arches and warm lighting, a woman with gray hair and glasses, dressed in a blue outfit, holds up a ticket and speaks to a man in a red and gold suit. The man, standing tall and looking at the woman, turns his head to look at the ticket. He then reaches out and takes the ticket from the woman. The man examines the ticket closely, smiling as he reads it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 144, "extra_info": {"duration": 4.25, "events": ["A short woman hands a piece of paper.", "Man in red clothes leans forward.", "Man in red clothes takes the paper.", "Man in red clothes looks at the paper."], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 144}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "945.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts a light brown snake with a dark brown pattern, coiled around a deer's neck and limbs in the grass. The snake proceeds to open its large mouth, beginning to swallow the deer from the muzzle, before moving onto the eye region, which remains open.", "text": "A deer is lying on the ground in a grassy area, with its head and neck partially covered by green foliage. The camera gradually moves closer to the deer, revealing more details of its body and the surrounding vegetation. The deer's head is slightly turned, and its eyes are closed, indicating it is either resting or dead. The camera continues to zoom in, showing the deer's face and the snake coiled around its neck. The snake's head is visible, and it appears to be feeding on the deer's neck. The camera then pulls back slightly, providing a wider view of the scene, showing the deer's body and the snake still coiled around it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 945, "extra_info": {"duration": 8.01, "events": ["Snake coils around deer's neck and limbs.", "Snake opens its large mouth.", "Snake begins to swallow the deer from the muzzle.", "Snake proceeds to the deer's eye region.", "Deer's eye remains open."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 945}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "434.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man is lying on the ground, using his hands to scrape away the black sand. His left hand moves up and down, and a yellow creature emerges, which the man then picks up with both hands. The background is a green grassy field. The camera follows the man's movements.", "text": "A man, wearing a patterned shirt and yellow pants, is seen digging into a muddy area with his hands. He uncovers a yellow snake partially buried in the mud. The man continues to dig and pull the snake out of the mud. He successfully extracts the snake, which is fully visible and yellow in color. The man holds the snake up, showing it to the camera. He then places the snake back into the mud.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 434, "extra_info": {"duration": 11.8, "events": ["The man lies on the ground.", "The man uses his hands to scrape away black sand.", "The man's left hand moves up and down.", "A yellow creature emerges from the sand.", "The man picks up the yellow creature with both hands."], "n_events": 5, "n_shots": 1, "n_subjects": 1, "old_duration": 11.8, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/34.mp4"}, "vid": 434}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "808.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "This animated short film features an individual discussing three methods to unwind and alleviate stress. The first method is reading a book on a sofa. The second is taking a warm bath in a bathtub. Finally, the third method is listening to music while lying on a bed.", "text": "A green character and a white character with a leaf on its head are standing against a plain background. The white character gestures with its hand and smiles, while the green character looks at it. The white character raises its hand and points upwards, and a sign with the text 'Relax & De-stress' appears above them. The scene then shifts to the green character sitting in a chair, reading a book titled 'The Cat in the Hat' in a room with a bookshelf and a lamp. The green character is then seen taking a bubble bath with a rubber duck, and finally, the green character is lying in bed with headphones on, connected to a device.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 808, "extra_info": {"duration": 9.97, "events": ["Individual discusses stress relief methods.", "Individual reads a book on a sofa.", "Individual takes a warm bath in a bathtub.", "Individual listens to music on a bed."], "n_events": 4, "n_shots": 4, "n_subjects": 2}, "vid": 808}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "470.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a training field, a little boy with both hands gripped a short crutch with a hook. He jumped off a mat, hooked the first ring with the crutch, and as he swung forward, he bypassed the second ring, directly hooking the third one. Then he continued to exert force, successively hooking each ring in front of him. Upon reaching the last one, he turned his body to face the approaching direction and hooked the second-to-last ring.", "text": "In an indoor gym with various climbing equipment, a boy stands on a blue padded platform holding a pink handle. He jumps off the platform and grabs the handle, starting to swing forward. The boy continues to swing forward, moving past several red rings hanging from a metal structure. He swings further, passing more rings and moving towards a wall with climbing holds. The boy swings past the wall, continuing his forward motion. He swings back towards the starting point, passing more rings. The boy completes his swing and lands back on the platform.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 470, "extra_info": {"duration": 9.2, "events": ["Little boy grips short crutch with both hands.", "Boy jumps off a mat.", "Boy hooks first ring with the crutch.", "Boy swings forward, bypassing second ring.", "Boy hooks third ring directly.", "Boy exerts force, hooking successive rings.", "Boy reaches last ring.", "Boy turns body to face approaching direction.", "Boy hooks second-to-last ring with crutch."], "n_events": 9, "n_shots": 1, "n_subjects": 1}, "vid": 470}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "429.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A lady in a blue dress picked up a pillow with a portrait printed on it, claiming to perform a magic trick to turn it into a living person. She threw the pillow upwards, and when the pillow fell back down, her arms were holding another lady. The lady in the blue dress hugged the other woman and turned around with her, then set her down, and both of them looked back together. The video is accompanied by subtitles that match the actions of the two people.", "text": "A girl in a blue dress with white accents and white shoes stands in front of a wall with neon signs spelling 'O-DOG' and 'DANCE' in the background. She picks up a pillow with a picture of herself on it and starts walking forward. She throws the pillow into the air. Another girl in a light blue top and jeans runs in and lifts the first girl, spinning her around. The two girls then turn to face the camera, with the second girl holding the first girl in her arms. They both smile and look at the camera, with text appearing on the screen. The first girl then bends down.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 429, "extra_info": {"duration": 11.8, "events": ["Lady picked up a pillow with a portrait.", "Lady claimed to perform a magic trick.", "Lady threw the pillow upwards.", "Pillow fell back down.", "Lady's arms were holding another lady.", "Lady in blue dress hugged the other woman.", "Lady in blue dress turned around with the other woman.", "Lady in blue dress set the other woman down.", "Both women looked back together.", "Video subtitles matched the actions."], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 429}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "569.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The girls, dressed in green coats, black skirts, and wearing white gloves, performed a routine with their guns. First, they held the guns with both hands at their left side, muzzles pointing straight up. Then, they touched their mouths with their left hands, stretched forward, tilted the guns to the right and then straightened them, rotated, held them horizontally, and rotated again.", "text": "A group of girls dressed in green jackets, black skirts, and white knee-high socks are performing a baton routine on a green field. They are holding batons with yellow tassels and are synchronized in their movements, raising and lowering the batons in unison. The background shows a fenced area with some structures and a few orange cones. The girls continue their routine, moving the batons in various patterns and directions, maintaining their formation throughout.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 569, "extra_info": {"duration": 5.83, "events": ["The girls held guns with both hands at left side.", "Guns' muzzles pointed straight up.", "The girls touched their mouths with left hands.", "The girls stretched forward.", "The girls tilted guns to the right.", "The girls straightened the guns.", "The girls rotated.", "The girls held guns horizontally.", "The girls rotated again."], "n_events": 9, "n_shots": 1, "n_subjects": 2, "old_duration": 5.83, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/169.mp4"}, "vid": 569}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "638.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Several flower pots are laid out on the ground, some of which already have plants in them. A pair of hands holds a handful of potting soil and places it into the middle flower pot. The potting soil is then spread around the plant in the pot and pressed down firmly by hand. Afterward, more potting soil is grabbed and sprinkled evenly into the flower pot.", "text": "A person is seen planting small green plants into terracotta pots filled with soil. The person uses their hands to carefully place the plants into the pots and then covers them with soil. The background shows more pots and some gardening tools, indicating a garden or nursery setting. The person continues to plant more plants, ensuring each one is securely placed and covered with soil.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 638, "extra_info": {"duration": 12.01, "events": ["Hands hold a handful of potting soil.", "Hands place potting soil into the middle flower pot.", "Hands spread potting soil around the plant.", "Hands press down the potting soil firmly.", "Hands grab more potting soil.", "Hands sprinkle potting soil evenly into the flower pot."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 638}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "743.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person uses a colander to scoop food from a metal basin into a bowl in their hand twice, then places the colander back in the metal basin, and sets the bowl on the table.", "text": "A person uses a ladle to scoop food from a large metal pot filled with a red, spicy-looking soup containing various ingredients such as onions and chili peppers. The person lifts the ladle filled with food and transfers it into a white bowl held in their other hand. The person continues to scoop more food from the pot and repeat the process, filling the bowl with the soup and its contents.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 743, "extra_info": {"duration": 16.88, "events": ["Person scoops food with a colander.", "Person transfers food from colander to a bowl.", "Person repeats the scooping action.", "Person repeats the transferring action.", "Person places the colander in the metal basin.", "Person sets the bowl on the table."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 743}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "141.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A bald boy yelled something to the right, then turned and ran to the left. A boy with messy hair was playing in the mud. The bald boy ran past the mud-playing boy, who then turned to look in the direction the bald boy had run. The camera moved to the left, revealing the bald boy hiding behind a tree, with the tree blocking the middle of him, leaving the sides of his face and body exposed.", "text": "In a grassy field with a tree in the background, a boy with spiky hair sits in a hole, holding a stick and surrounded by a stack of pancakes. A boy in a yellow shirt stands nearby, watching the spiky-haired boy. The boy in the yellow shirt runs towards the spiky-haired boy, who turns his head to look at him. The spiky-haired boy then turns his head to look at the boy in the yellow shirt, who is now standing by a tree. The spiky-haired boy starts talking to the boy in the yellow shirt, who listens attentively. The boy in the yellow shirt then runs away, leaving the spiky-haired boy sitting alone in the hole.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 141, "extra_info": {"duration": 4.62, "events": ["A bald boy yelled to the right.", "The bald boy turned and ran to the left.", "A boy with messy hair played in the mud.", "The bald boy ran past the mud-playing boy.", "The boy with messy hair turned to look.", "The bald boy hid behind a tree."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 141}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "167.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A hand slaps a green button and the camera cuts to a construction elevator. A koala falls onto the elevator, which is also carrying other creatures. Then the elevator rapidly descends. The camera angle switches to a top-down view from the roof, showing the elevator descending. The camera switches again to a side view of the elevator to continue displaying the rapid descent. Once more, the camera switches, showing the koala holding a small snake, with the snake and other animals widening their eyes in fear. The camera then cuts to a purple spider, whose six legs are suspended in air. Finally, the camera shifts to focus on the face of a grey lizard.", "text": "A close-up of a control panel with red, yellow, and green buttons is shown, with a hand reaching towards the green button. The scene shifts to a group of characters, including a white creature, a blue creature, and several purple creatures, who are looking up in fear. The camera then shows the control panel from a distance, indicating that the platform is moving upwards. The characters on the platform continue to look up in fear as the platform ascends. The purple creatures on the platform start to move forward, with one of them looking ahead.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 167, "extra_info": {"duration": 5.75, "events": ["A hand slaps a green button", "A koala falls onto the elevator", "The elevator rapidly descends", "The koala holds a small snake", "The snake and other animals widen their eyes", "A purple spider's six legs are suspended in air", "The camera focuses on the face of a grey lizard"], "n_events": 7, "n_shots": 7, "n_subjects": 8}, "vid": 167}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "200.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "SpongeBob and Patrick Star lowered their heads and walked forward. Suddenly, SpongeBob stretched out his hand, pointed forward and said something, then together with Patrick Star, they raised their heads to look ahead.", "text": "SpongeBob and Patrick are walking side by side in a desert-like environment with a dark, starry sky in the background. SpongeBob appears to be talking while Patrick looks sad and keeps his head down. SpongeBob continues to talk, and Patrick remains sad, with tears visible on his face. SpongeBob then points at Patrick and says something, causing Patrick to look up briefly. Patrick then looks down again, and SpongeBob continues to talk while Patrick remains sad.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 200, "extra_info": {"duration": 4.0, "events": ["SpongeBob and Patrick Star lowered their heads.", "SpongeBob and Patrick Star walked forward.", "SpongeBob stretched out his hand.", "SpongeBob pointed forward.", "SpongeBob and Patrick Star raised their heads.", "SpongeBob said something.", "SpongeBob and Patrick Star looked ahead."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 200}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "148.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A little girl and a dog look forward in surprise from behind a wall. The camera switches to the opposite side, where a woman in white clothes takes a crown from a yellow-clad servant. She looks at the crown greedily and then raises it to put it on herself.", "text": "A girl with red hair and a fox with sprinkles on its face look ahead with surprised expressions. The scene then shifts to a woman in a white dress holding a crown, standing next to a man in a yellow suit and another man in a red robe holding a green book with a lion emblem. The woman turns and takes the crown from the man in yellow. She then places the crown on her head, smiling as she does so. The man in the red robe and the man in the yellow suit watch her, with the man in the red robe eventually falling to the ground.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 148, "extra_info": {"duration": 5.0, "events": ["A little girl and a dog look in surprise.", "A woman takes a crown from a servant.", "The woman looks at the crown greedily.", "The woman raises the crown.", "The woman puts the crown on herself."], "n_events": 5, "n_shots": 2, "n_subjects": 4}, "vid": 148}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "512.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A pair of twin babies are playing in a room. They come to a cabinet, where one of them leans on it with both hands. The other comes behind him, lifts the skirt, and pulls down the pants to take a look at his buttocks. Then, they switch roles; one who was observing now leans on the cabinet with both hands, and the one who was in front moves behind, lifts the skirt, pulls down the pants, and takes a look at the buttocks.", "text": "In a kitchen with wooden cabinets, a microwave, and various kitchen items, two children in pink shirts and patterned pants are seen interacting. The child on the left pulls the other child's shirt, causing the latter to turn and face the camera. The child on the left then turns and walks away, while the other child continues to hold onto the counter. The child on the left returns and pulls the other child's shirt again, and they both turn to face the camera. The child on the left then turns and walks away once more, while the other child continues to hold onto the counter.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 512, "extra_info": {"duration": 15.1, "events": ["Twin babies play in a room.", "One twin leans on a cabinet.", "The other twin lifts one's skirt.", "The other twin pulls down one's pants.", "Twin babies switch roles.", "Former observing twin leans on the cabinet.", "Former leaning twin lifts other's skirt.", "Former leaning twin pulls down other's pants."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 512}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "790.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Four people are queueing up in a frenzy, a man with his arms raised swaying back and forth, dancing, while the other three are holding microphones and singing. They make eye contact and interact from time to time, then they raise their glasses, clink them together, and continue to sway their bodies while singing.", "text": "Four people are sitting on a couch in a festive setting with Christmas decorations and lights in the background. They are wearing Santa hats and holding microphones and glasses of champagne. The group is singing and dancing, with one person in the background energetically moving their arms and dancing. The group continues to sing and dance, occasionally raising their glasses in a toast. The festive atmosphere is enhanced by the Christmas lights and decorations, creating a joyful and celebratory mood.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 790, "extra_info": {"duration": 18.65, "events": ["Four people queue up in a frenzy.", "A man dances with arms raised.", "Three individuals sing into microphones.", "People make eye contact periodically.", "Individuals interact during singing.", "All raise their glasses.", "They clink glasses together.", "People continue to sway while singing."], "n_events": 8, "n_shots": 1, "n_subjects": 4}, "vid": 790}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "682.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the room, one person is mopping the floor with a mop, another person is wiping the table with a spray bottle, and another person is cleaning the dishes.", "text": "In a modern kitchen with a wooden floor, a person wearing green gloves and pink sneakers is mopping the floor with a blue-handled mop. In the background, a woman in a green top and brown pants is wiping a table, while a man in an orange shirt and beige pants is organizing items on the counter. The kitchen features a round table with wooden legs, a white chair, and a green sofa. The man occasionally looks at the woman and then back at the items on the counter.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 682, "extra_info": {"duration": 16.16, "events": ["One person mops the floor.", "Another person wipes the table.", "That person uses a spray bottle.", "Another person cleans the dishes."], "n_events": 4, "n_shots": 1, "n_subjects": 3}, "vid": 682}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "690.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the city's evening, a homemade tank travels on the road. From afar to nearby, one can see the lookout personnel on top of the tank and the driver in the middle.", "text": "The video begins with a view of a cityscape at dusk, featuring tall buildings and a pink sky, as a tank drives along a road. The camera follows the tank as it moves forward, revealing more details of the tank and its surroundings, including trees and streetlights. The camera then zooms in on the tank, showing a soldier inside, who is looking out from a hatch. The camera continues to move along the side of the tank, capturing the soldier's actions and the tank's details, such as its tracks and equipment. The video concludes with a wider shot of the tank driving past, with the cityscape still visible in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 690, "extra_info": {"duration": 20.19, "events": ["The homemade tank travels on the city road.", "The evening falls over the city.", "The tank moves from afar to nearby.", "The lookout personnel surveys from atop the tank.", "The driver operates the tank from the middle."], "n_events": 5, "n_shots": 1, "n_subjects": 3}, "vid": 690}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "472.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera faces a colorful corridor, where a lady appears from the far end. She is wearing roller skates and glides over the undulating ground. With her back to the camera, she skates into a spacious room. Once inside, the lady bends her whole body down, one foot on the ground, the other leg extended flat, and both arms open wide, gliding backwards close to the ground. Other people in the room watch the lady's performance.", "text": "The video begins with a view of a brightly lit hallway with colorful lights and a wooden floor, where a woman is seen roller skating towards the camera. She continues to skate forward, gaining speed as she approaches the camera. The woman then transitions from standing to a low, sliding position, extending her arms and legs to balance herself on the floor. She skates on her hands and knees, moving smoothly across the wooden floor. The background shows a room with pink walls, air conditioning units, and several people watching her performance.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 472, "extra_info": {"duration": 5.48, "events": ["A lady appears from the far end of a corridor.", "The lady glides over undulating ground wearing roller skates.", "The lady skates into a spacious room.", "The lady bends her whole body down.", "The lady glides backwards close to the ground.", "Other people watch the lady's performance."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 472}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "817.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases a sunny day where two individuals are practicing their rugby football skills on a field. The rugby football they're using is brown. There's one individual, identified as person A, with their hair tied in a ponytail, donned in a white T-shirt, blue pants, white socks, and gray shoes. They also sport a black watch on their right hand. The other, identified as person B, is noticeable with numerous black braids, a blue coat, gray shorts, and stud earrings. The video begins with person A sending the first pass and person B skillfully catching it. They proceed to pass the ball back and forth a total of five times, enjoying their game.", "text": "On a sports field with a goalpost and bleachers in the background, a man in a blue jacket and gray shorts stands facing a woman in a white shirt and blue shorts. The woman throws a football to the man, who catches it and prepares to throw it back. The man throws the football to the woman, who catches it and throws it back to him. This sequence of throwing and catching continues multiple times.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 817, "extra_info": {"duration": 11.0, "events": ["Person A sends the first pass", "Person B catches the ball skillfully", "They pass the ball back and forth", "Person A catches the return pass", "Person B receives another pass", "The ball is thrown to Person A again", "Person B makes a successful catch", "Another pass is completed to Person B", "Person A receives the ball once more", "They enjoy their game of rugby football"], "n_events": 10, "n_shots": 1, "n_subjects": 2}, "vid": 817}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "378.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman is holding a guitar with a smile, one hand placed on her chest, while the man beside her is playing the guitar with his head down. The camera shifts to the woman swaying her body as the audience below applauds.", "text": "A woman, dressed in a white outfit with a guitar, is on stage with a microphone in front of her. She smiles and looks down, then raises her head and touches her necklace. She turns her head to the right and begins to walk towards the audience, who are clapping and cheering. The camera then switches to a view from behind the woman, showing her walking towards the applauding audience.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 378, "extra_info": {"duration": 3.62, "events": ["A woman holds a guitar with a smile.", "The woman places one hand on her chest.", "A man plays the guitar beside the woman.", "The woman sways her body.", "The audience below applauds."], "n_events": 5, "n_shots": 2, "n_subjects": 3}, "vid": 378}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "612.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two people dressed in white lab coats are standing in front of a sink, rubbing their hands as they wash. They then reach out to turn off the faucet, and the person closer to the camera lifts their hands up. The camera angle remains fixed.", "text": "A person is washing their hands under a running faucet in a white sink, with another person washing their hands in the background. The person continues to rub their hands together under the water. They then turn off the faucet with their right hand. After turning off the faucet, they place their hands back under the running water to rinse them.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 612, "extra_info": {"duration": 7.62, "events": ["Two people stand in front of a sink", "People dressed in white lab coats wash hands", "They rub their hands together", "People reach out to turn off the faucet", "Person closer to the camera lifts hands up"], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 612}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "676.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera focuses on the profile of a female character, then she turns her head to face the camera. Her glasses have a mechanical sheen, and the skin texture on the right side of her face is illuminated by an internal light.", "text": "A woman with red eyes is initially looking to the left in a dark environment with snow falling. She then turns her head to face the camera. As she turns, a holographic image of a mechanical device appears on her face, starting from her left eye and spreading across her face. The holographic image continues to expand, covering more of her face, and then starts to fade away. Finally, she turns her head back to the left, and the holographic image completely disappears.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 676, "extra_info": {"duration": 10.12, "events": ["The camera focuses on the female character's profile.", "The female character turns her head to face the camera.", "Glasses on the female character's face gleam mechanically.", "The internal light illuminates the female character's skin texture."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 676}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "163.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two odd-looking individuals fell to the ground and struggled to sit up. A car coming from the left side hit them again, knocking them down. The person on the right side managed to sit up once more.", "text": "Two characters are seen falling to the ground in a nighttime urban setting with buildings and streetlights in the background. They lie on the ground, with one character on their back and the other on their side. The character on the side turns their head to look at the other character. A green car drives by and stops near them. The character on the side gets up and kicks the other character, causing them to fall back down. The character on the side then turns their head to look at the other character.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 163, "extra_info": {"duration": 7.88, "events": ["Two individuals fell to the ground.", "The individuals struggled to sit up.", "A car hit the individuals again.", "The car was coming from the left side.", "One individual managed to sit up once more."], "n_events": 5, "n_shots": 1, "n_subjects": 3}, "vid": 163}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "131.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A bald little boy speaks to the girl next to him, then turns and walks away. The little girl, smiling, takes two steps forward, brings her hands to her chin, and speaks to the blond-haired boy, who turns his head to look in the direction where the bald little boy left.", "text": "In a room with a purple wall and blue floor, a bald boy in a yellow shirt with black zigzag patterns is talking to a blonde boy in a striped shirt and a black-haired girl in a blue dress. The bald boy continues speaking while the blonde boy and the black-haired girl listen. The bald boy then turns and walks away, leaving the blonde boy and the black-haired girl behind. The black-haired girl turns around, clasps her hands together, and starts talking while smiling. The blonde boy, sitting on a red piano, looks at the black-haired girl and then turns his head to look forward.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 131, "extra_info": {"duration": 4.25, "events": ["Bald little boy speaks to girl.", "Bald little boy turns away.", "Bald little boy walks away.", "Girl takes two steps forward.", "Girl brings hands to chin.", "Girl speaks to blond-haired boy.", "Blond-haired boy turns head."], "n_events": 7, "n_shots": 1, "n_subjects": 3}, "vid": 131}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "245.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man lies on the ground, then props himself up with one hand and turns his head, following this by kneeling on both legs, and lifting his upper body. The camera then cuts to an opening, and the man slowly stands up.", "text": "A man wearing a brown jacket and a gray beanie is lying on the ground, covered in snow, and begins to push himself up with his hands. He continues to rise, looking around as he does so. The camera then shifts to show a person lying inside a snow-covered fireplace. The man, now on his knees, looks towards the fireplace.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 245, "extra_info": {"duration": 9.25, "events": ["The man lies on the ground.", "The man props himself up with one hand.", "The man turns his head.", "The man kneels on both legs.", "The man lifts his upper body.", "The man slowly stands up."], "n_events": 6, "n_shots": 3, "n_subjects": 1}, "vid": 245}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "974.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features two pandas sitting on the grass attempting to solve a puzzle. The puzzle consists of four pieces that together form a picture of a watermelon. The finished version of the puzzle is visible in the top left corner of the screen. Both pandas are actively engaging in matching the puzzle pieces, with their activities indicated by a blinking red cross on the screen. The panda on the right, notable for its pink bow, is struggling with the placement of puzzle pieces. On the left, the other panda is seen laughing at the challenge the right panda seems to be experiencing. The setting includes green trees in the backdrop, with scattered daisies on the grassy floor.", "text": "A green background with a small image of a watermelon in the top left corner and a larger image of watermelon slices in the center is shown. A black shape appears and moves over the watermelon slices, covering them. A red 'X' appears over the covered watermelon slices, indicating they are incorrect. The scene then switches to two pandas, one with a pink bow, standing in a green outdoor setting with trees in the background. The panda with the pink bow closes its eyes and smiles, while the other panda also smiles and raises its hand. The scene switches back to the watermelon image.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 974, "extra_info": {"duration": 6.0, "events": ["Two pandas sit on the grass.", "Pandas attempt to solve a puzzle.", "Puzzle pieces form a picture of a watermelon.", "A finished puzzle version is visible.", "A blinking red cross indicates activities.", "Panda with a pink bow struggles with pieces.", "Other panda laughs at the challenge.", "Green trees stand in the backdrop.", "Daisies scatter on the grassy floor."], "n_events": 9, "n_shots": 2, "n_subjects": 2}, "vid": 974}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "460.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A slender lady is seated in front of a square table, picking up a fork and preparing to eat the food on her plate. Behind her, a bearded man wearing gloves scatters a pile of powder onto her table and snatches the fork from the lady's hand to place it on the plate. He then pinches and pulls at the food on the plate. The lady stands up, takes her crossbody bag, and walks past the man, casually knocking over a drink cup on the table with her bag and then strides away. The background is within a restaurant, with other diners looking on at them.", "text": "A woman in a green top and white pants is sitting at an outdoor restaurant table with a plate of food, a glass of tea, and a pink bag. A man in a black uniform with 'Sudan' written on it approaches her and starts sprinkling green powder onto her food. The woman looks up at the man as he continues to sprinkle the powder. She then stands up, picks up her pink bag, and begins to walk away from the table. The man follows her, and she walks down the street, passing by other people sitting at the restaurant.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 460, "extra_info": {"duration": 13.63, "events": ["Lady prepares to eat with a fork.", "Man scatters powder onto table.", "Man snatches fork from lady's hand.", "Man places fork on the plate.", "Man pinches and pulls at food.", "Lady stands up from table.", "Lady takes her crossbody bag.", "Lady walks past the man.", "Lady knocks over drink cup with bag.", "Lady strides away from table."], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 460}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "568.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Three people wearing blue short sleeves came out through the door one after another, jumping up to touch an object tied to the door frame with their heads, and then walked out of the frame. A girl walked under the door frame and looked up, with others following behind her.", "text": "A group of students in blue uniforms and backpacks walk out of a building through a doorway. The first student, a boy, walks ahead, followed by a girl carrying a large blue backpack and a red bag. As the girl steps out, her red bag suddenly flies up and hits her on the head. The boy behind her turns his head to look at her and then continues walking forward. The girl, now holding her head, turns around and starts to walk back into the building. The boy continues walking forward, and more students in similar uniforms follow behind.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 568, "extra_info": {"duration": 4.68, "events": ["Three people came out through the door.", "First person jumped to touch the object.", "Second person jumped to touch the object.", "Third person jumped to touch the object.", "Three people walked out of the frame.", "A girl walked under the door frame.", "Girl looked up at the object.", "Others followed the girl behind her."], "n_events": 8, "n_shots": 1, "n_subjects": 3, "old_duration": 5.53, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/168_no_audio.mp4"}, "vid": 568}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "192.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man said something holding the microphone, then handed the microphone to the approaching woman, who reached out to take the microphone and said something.", "text": "A man holding a microphone speaks while a woman in a yellow dress stands in front of him, smiling. The man extends the microphone towards the woman, who reaches out and takes it from him. The woman then begins to speak into the microphone. In the background, there are three other people standing, one holding a drink and two others dressed in formal attire.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 192, "extra_info": {"duration": 4.0, "events": ["The man spoke into the microphone.", "The man extended the microphone.", "The woman approached.", "The woman reached out.", "The woman took the microphone.", "The woman spoke."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 192}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "406.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "At the foot of a mountain, a man dressed in an orange storm jacket waves a hiking stick. The camera then switches to another man at the base of a different mountain; this one is wearing a blue storm jacket and a hat, also holding a hiking stick and posing. At the end of the video, the man in the blue storm jacket reaches out to grab the cellphone that he is being filmed with. The backgrounds of both shots feature blue skies, white clouds, and distant mountain peaks.", "text": "A man wearing an orange jacket and a white shirt stands in front of a stone with '360°' written on it, holding a selfie stick. He begins to spin the stick around his body, moving it from one side to the other. The man continues to spin the stick, eventually turning around to face the camera. The scene then shifts to another man in a black jacket with a hood, who is also holding a selfie stick. He adjusts the stick and then extends his hand towards the camera.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 406, "extra_info": {"duration": 8.1, "events": ["Man in orange jacket waves hiking stick.", "Camera switches to another man.", "Man in blue jacket poses with hiking stick.", "Man in blue jacket reaches out to grab cellphone."], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 406}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "354.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The food rotates inside the microwave oven, the camera cuts to a man standing in front of a table operating a laptop, the camera then cuts to a close-up of the man's face.", "text": "A microwave is placed on a counter, surrounded by various items including medicine bottles, canned goods, and a toy. The microwave's digital display shows the time changing from 00:03 to 00:08. A man is standing in a laundry room, leaning on a washing machine while using a laptop placed on top of it. The man is focused on the laptop screen, and the camera gradually zooms in on his face. The microwave's display changes from 00:08 to 00:07.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 354, "extra_info": {"duration": 9.25, "events": ["Food rotates inside the microwave oven.", "Man operates a laptop.", "Camera shows a close-up of the man's face."], "n_events": 3, "n_shots": 3, "n_subjects": 1}, "vid": 354}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "979.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video demonstrates a vehicle, from which the recording is taking place, attempting to overtake a black car on the right. Unexpectedly, the black car swerves to the left causing an unforeseen accident. Losing control over its direction, the recording vehicle crashes into the trees on the right side of the road.", "text": "The video shows a car driving on a two-lane road surrounded by lush green trees and a clear blue sky. The car approaches a black car in the right lane, which is driving in the same direction. The black car begins to move towards the right lane, gradually getting closer to the camera. The black car continues to move into the right lane, passing by the camera. The black car exits the frame to the right, and the camera continues to move forward, passing a white car in the right lane. The camera then approaches a large truck in the right lane, which is driving in the same direction. The camera continues to move forward, passing the truck and continuing down the road.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 979, "extra_info": {"duration": 5.0, "events": ["The vehicle attempts to overtake a black car on the right", "The black car swerves to the left unexpectedly", "The vehicle loses control", "The vehicle crashes into the trees on the right side"], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 979}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "322.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a bright room, a formally dressed man is sitting on a sofa speaking to the camera, with several staff members doing their own things behind him. Suddenly, a light bulb bursts and the screen goes dark. After a while, a faint light appears on the left side of the screen, revealing a staff member.", "text": "In a well-lit room with wooden walls and a large water tank in the background, a man in a tuxedo sits on a red sofa, looking forward. Behind him, a woman in a white dress is seated at a table, and a man in a white shirt is standing near the water tank. The room is populated with several people, including waiters in red hats and a bartender. The scene then transitions to complete darkness, with only faint outlines of the room visible.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 322, "extra_info": {"duration": 4.5, "events": ["A man is speaking to the camera", "A light bulb bursts", "The screen goes dark", "A faint light appears", "A staff member is revealed"], "n_events": 5, "n_shots": 3, "n_subjects": 2}, "vid": 322}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "538.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera follows the girl as she moves forward, holding a bouquet of flowers and a yellow paper bag, running forward to approach a boy. The boy takes the flowers and laughs joyously, while the girl covers her mouth and giggles bashfully. The two make eye contact as the people around them applaud.", "text": "A girl in a school uniform with a ponytail walks down a street holding a bouquet of flowers, passing by several people, including a woman in a beige coat and another in a black coat. She continues walking towards a group of people, including a woman in a kimono and a man in a black suit. The girl reaches the man in the black suit, who is holding a bouquet of flowers, and they exchange the flowers. The girl then turns around and starts walking back in the direction she came from.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 538, "extra_info": {"duration": 12.97, "events": ["The girl runs forward.", "The girl holds a bouquet of flowers.", "The girl carries a yellow paper bag.", "The girl approaches a boy.", "The boy takes the flowers.", "The boy laughs joyously.", "The girl covers her mouth.", "The girl giggles bashfully.", "The girl and the boy make eye contact.", "The people around applaud."], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 538}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "939.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person donned in a fluorescent yellow life vest and a blue helmet attempts to leap from one red ball to another, but fails, falling into the water instead. Concurrently on the shore, three individuals clad in yellow life vests are seen laughing heartily. One of them wears a fluorescent yellow life vest with pink stripes and sports grey pigtails, while the next is in a fluorescent yellow life vest with grey stripes, and the last one wears a fluorescent yellow life vest with blue stripes. Also, a person in a red helmet and a fluorescent green life jacket attempts the same jump between the red balls, similarly failing and falling into the water.", "text": "A person wearing a blue helmet and yellow life jacket is seen on a red platform, preparing to jump onto a series of red spherical objects floating on the water. The person jumps from one red sphere to the next, but loses balance and falls into the water. The scene then replays the moment of the fall, showing the person in mid-air before hitting the water. The camera then cuts to three people on land, all wearing yellow life jackets and helmets, laughing and reacting to the event. 
The video ends with a shot of the person on the red spheres, with the text 'BY THE WAY' and 'YOU'RE ONLY ALLOWED TO' appearing on the screen.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 939, "extra_info": {"duration": 5.97, "events": ["Person attempts to leap from one red ball to another.", "Person falls into the water.", "Three individuals laugh heartily on the shore.", "Individual with grey pigtails wears fluorescent life vest.", "Individual with grey stripes on life vest laughs.", "Individual with blue stripes on life vest chuckles.", "Person in red helmet attempts the jump.", "Person in fluorescent green life jacket falls into water."], "n_events": 8, "n_shots": 5, "n_subjects": 6}, "vid": 939}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "530.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The puppies lay down in front of the food, and a person took the buns one by one and tossed them to them. The puppies grabbed the buns and then turned around to eat elsewhere, while the Border Collie in the middle watched them and walked forward.", "text": "Three dogs, two black and one golden, are eagerly looking at a tray filled with dumplings and lemons, with a background of other dogs playing on a paved area. A hand reaches into the tray and picks up a dumpling, offering it to the dogs. The dogs start taking turns eating the dumplings from the hand. More dogs join in, and the hand continues to distribute the dumplings to the eager dogs. The dogs continue to eat the dumplings, with some dogs moving around to get a better position. The hand keeps picking up dumplings and offering them to the dogs, who are all focused on getting a share. The background shows more dogs playing and moving around, adding to the lively scene.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 530, "extra_info": {"duration": 9.2, "events": ["Puppies lay down in front of the food", "Person took a bun", "Person tossed a bun to a puppy", "A puppy grabbed the bun", "Puppy turned around", "Puppy walked to eat elsewhere", "Border Collie watched the puppies", "Border Collie walked forward"], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 530}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "428.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video showcases four scenes. The first scene is in a spacious indoor area, where two people are cleaning. A person lying on the floor gets up and grabs a cleaning caution sign to hit one of the cleaners. The second scene is inside a subway, where someone suddenly snatches a piece of pizza from the hands of a seated person, then runs off the subway. The third scene features a person wearing a turban, who kisses the head of a seated man, who turns around in surprise to look at him. The fourth scene depicts a person with dreadlocks, reaching through a subway window from outside to snatch the hat off a person inside.", "text": "In a subway station, a man in a yellow vest is mopping the floor while another man in black is cleaning the wall. The man in black notices the man in the yellow vest and starts to move away. The man in black then approaches a man sitting on the subway, who is smoking, and they exchange a few words. The man in black then boards the subway and sits down, while the man who was smoking looks at him. 
The scene then shifts to a woman who is closing the subway door.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 428, "extra_info": {"duration": 15.08, "events": ["A person gets up from the floor.", "The person grabs a cleaning caution sign.", "The person hits a cleaner with the sign.", "Someone snatches a piece of pizza in a subway.", "The snatcher runs off the subway.", "A person wearing a turban kisses a seated man's head.", "The seated man turns around.", "A person with dreadlocks reaches through a subway window.", "The person with dreadlocks snatches a hat.", "The person inside the subway loses their hat."], "n_events": 10, "n_shots": 4, "n_subjects": 6}, "vid": 428}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "123.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A police officer, with arms crossed, speaks to two people in suits ahead of him. The two in suits turn around, and the person on the right takes a photo and tucks it into the breast pocket of the jacket, then both leave. Behind them, another officer is working at a computer.", "text": "In an office setting, a man in a beige uniform with a sheriff's badge sits at a desk with his arms crossed, speaking to two men in suits standing in front of him. The man on the right, wearing a blue tie, takes out a photo from his pocket and shows it to the man on the left, who is wearing a red tie. The man on the left turns his head to look at the photo, while the man in the beige uniform continues to watch them. In the background, another man is seen working at a computer.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 123, "extra_info": {"duration": 2.5, "events": ["A police officer speaks to two people in suits.", "The person on the right takes a photo.", "The person tucks the photo into the breast pocket.", "Both people in suits leave.", "Another officer works at a computer."], "n_events": 5, "n_shots": 1, "n_subjects": 4}, "vid": 123}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "298.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "At sunset, a person with a cane walks forward, followed by someone behind them. The camera cuts to the person with the cane sitting down, gazing at the setting sun.", "text": "On a beach at sunset, a man with a cane walks alongside another man who is not using a cane. The man with the cane continues walking forward, while the other man follows closely behind. The man with the cane bends down and places his cane on the ground. The camera then shifts focus to the man who is not using a cane, who also bends down and places his cane on the ground. Both men stand still, facing the sunset over the ocean.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 298, "extra_info": {"duration": 6.62, "events": ["Sun sets.", "Person walks forward with a cane.", "Someone follows the person with a cane.", "Person with a cane sits down.", "Person gazes at the setting sun."], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 298}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "483.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Several lions stand atop the canopy of a withered tree, causing the tree to topple over. The lions run out from the foliage, and the scene shifts to an off-road vehicle. Two lions run towards the vehicle from the left side, and a few more lions are by the car. After lingering briefly beside the vehicle, they walk towards the direction of the camera.", "text": "Two lions are seen resting on a tree branch in a grassy field. The tree branch begins to break under their weight, causing the lions to fall to the ground. The scene then shifts to a group of lions running towards a parked vehicle on a dirt road. The lions continue to approach the vehicle, with some of them getting closer to the front of the vehicle.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 483, "extra_info": {"duration": 7.2, "events": ["Several lions stand atop a withered tree's canopy.", "The withered tree topples over.", "Lions run out from the foliage.", "Two lions approach the off-road vehicle from the left.", "More lions congregate by the vehicle.", "Lions linger briefly beside the vehicle.", "Lions walk towards the camera's direction."], "n_events": 7, "n_shots": 2, "n_subjects": 3}, "vid": 483}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "609.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the screen, there is a counter with several boxes of popcorn on it. Behind the counter, a person's hand manipulates an electronic device a few times, and then places the device on top of the counter. In front of the counter, on the left side, a person takes out a mobile phone to scan the electronic device, while on the right side, another person places their right hand on the counter.", "text": "A person holds a payment terminal in their right hand, with several red and white striped popcorn boxes on the table in front of them. The person rotates the terminal and places it on the table. Another person, standing next to the table, uses their phone to tap on the terminal. The phone is then removed from the terminal, and the terminal remains on the table. The person holding the terminal then picks it up again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 609, "extra_info": {"duration": 6.2, "events": ["A counter displays several boxes of popcorn.", "A person manipulates an electronic device behind the counter.", "The person places the electronic device on the counter.", "A person takes out a mobile phone in front of the counter.", "The person scans the electronic device with the mobile phone.", "Another person places their right hand on the counter."], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 609}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "862.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "This video clip illustrates a scene at an intersection where multiple cars are waiting for the signal. A black car unexpectedly drives on the wrong path, ignoring the red traffic signal. Simultaneously, a gray car is preparing to cross the intersection from the right and narrowly avoids a collision with the errant black car. The gray car manages to brake in time while the black car swerves to the left, preventing an accident. The incident concludes with the gray car and the following vehicles safely proceeding through the intersection.", "text": "A white Chevrolet SUV is seen driving on a road with a yellow dividing line, surrounded by other parked cars, including a red car and a black car. The white SUV continues to move forward, and a black car passes by it from the right side. The scene then shifts to a different location at an intersection, where the white SUV is now stopped at a red light. Several cars, including a blue SUV and a silver minivan, are seen driving through the intersection. 
The white SUV remains stationary as the traffic continues to move through the intersection.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 862, "extra_info": {"duration": 9.01, "events": ["Cars wait for the signal at an intersection.", "A black car drives on the wrong path.", "The black car ignores the red traffic signal.", "A gray car prepares to cross the intersection.", "The gray car narrowly avoids a collision.", "The gray car manages to brake in time.", "The black car swerves to the left.", "The incident prevents an accident.", "The gray car proceeds through the intersection.", "The following vehicles safely continue."], "n_events": 10, "n_shots": 2, "n_subjects": 2}, "vid": 862}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "154.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A bald man fires his gun ahead, the surrounding scenery quickly recedes, the camera angle switches to the front, a car sways from side to side, bullets strike the area around the car, the camera angle switches to the front of the car, inside there are two people, a brown-haired man is driving the car, and looks outside along with the other person in the car.", "text": "A man in a suit is seen aiming and firing a large gun from the window of a car at night, with sparks flying from the gun's barrel. The camera then shifts to show the back of two men sitting in the car, with the window shattered by the gunfire. The men inside the car, one with brown hair and the other with yellow hair, turn their heads to look at the damage. The camera then focuses on the two men inside the car, showing their reactions as they look at the shattered window. Finally, the camera returns to the man in the suit, who continues to aim and fire the gun.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 154, "extra_info": {"duration": 6.12, "events": ["A bald man fires his gun ahead.", "The surrounding scenery quickly recedes.", "A car sways from side to side.", "Bullets strike the area around the car.", "A brown-haired man is driving the car.", "The brown-haired man looks outside.", "The other person in the car looks outside."], "n_events": 7, "n_shots": 3, "n_subjects": 3}, "vid": 154}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "839.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, a person first uses a knife to cut tofu into cubes on a cutting board. Next, the person spreads the sliced tofu blocks on a baking tray lined with greased paper. Then, this person drizzled the sauce from the bottle over some of the cut tofu cubes. Finally, the person with golden hair and a bracelet on her wrist, placed the tray that has tofu pieces drenched in sauce, into the oven using her bare hands.", "text": "A person uses a knife to cut a block of tofu into smaller cubes on a wooden cutting board. The person then places the cut tofu cubes onto a baking tray lined with parchment paper. The person arranges the tofu cubes evenly on the tray. Next, the person pours a liquid, likely honey, over the tofu cubes. Finally, the person places the tray into an oven to bake the tofu cubes.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 839, "extra_info": {"duration": 12.0, "events": ["Person cuts tofu into cubes.", "Person spreads tofu cubes on baking tray.", "Person lines baking tray with greased paper.", "Person drizzles sauce over tofu cubes.", "Person with golden hair places tray in oven."], "n_events": 5, "n_shots": 2, "n_subjects": 1}, "vid": 839}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "274.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two men are drinking at the bar, with people moving behind them. The blond man takes a sip of his drink, and the black-haired man takes a puff of a cigarette. A woman appears from the left of the frame and sits next to the blond man. The three glance at each other, and the blond man looks at the woman before picking up his glass to take a drink.", "text": "Two men are sitting at a bar, with one smoking a cigarette and the other drinking from a glass. The man drinking from his glass takes a sip and then puts it down. The man smoking puts down his cigarette and turns to look at a woman who has just sat down. The man drinking picks up his glass again and turns to speak to the woman. The woman turns her head to look at the man drinking, and they exchange glances. The man drinking then turns his head to look forward.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 274, "extra_info": {"duration": 11.62, "events": ["Two men drink at the bar.", "People move behind the two men.", "The blond man sips his drink.", "The black-haired man puffs a cigarette.", "A woman sits next to the blond man.", "The three glance at each other.", "The blond man looks at the woman.", "The blond man picks up his glass.", "The blond man takes a drink."], "n_events": 9, "n_shots": 2, "n_subjects": 4}, "vid": 274}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "508.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In an aquarium, a caretaker was walking back and forth on the shore. A dolphin leaped out of the water and then splashed back in, followed by another dolphin doing the same. Then, both dolphins leaped out of the water together, twirled in the air before plunging back into the water while another dolphin continued to swim along the shore the whole time.", "text": "A man in a blue jacket and black pants walks along a platform by the water, with a dolphin leaping out of the water and landing back in. The man continues walking as another dolphin jumps out of the water and lands back in. The man stops and looks at the dolphins as they jump out of the water. The man then turns around and starts walking back towards the platform. The dolphins jump out of the water again, creating splashes. The man continues walking along the platform as the dolphins jump out of the water once more.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 508, "extra_info": {"duration": 8.9, "events": ["Caretaker walked back and forth on the shore", "First dolphin leaped out of the water", "First dolphin splashed back into the water", "Second dolphin leaped out of the water", "Second dolphin plunged into the water", "Both dolphins leaped out of the water together", "Dolphins twirled in the air", "Dolphins plunged back into the water", "Third dolphin swam along the shore"], "n_events": 9, "n_shots": 1, "n_subjects": 4, "old_duration": 8.9, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/108.mp4"}, "vid": 508}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "909.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In this video clip, a table is set with three clear glasses of water, a model helicopter, and a Barbie doll. A person wearing an orange glove methodically drops white objects into each glass. Upon immersion, the water appears to boil and emit a white gas, much like dry ice.", "text": "On a table, there are three jars, a toy helicopter, and a doll in a purple dress with wings. A person wearing orange gloves pours dry ice from a glass into the leftmost jar, causing white vapor to appear. The person then pours dry ice from the same glass into the middle jar, creating more vapor. Finally, the person pours dry ice from the glass into the rightmost jar, producing a significant amount of white vapor that envelops the doll.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 909, "extra_info": {"duration": 9.0, "events": ["A table is set with glasses, a helicopter, and a Barbie doll.", "A person wears an orange glove.", "The person drops white objects into the first glass of water.", "Water in the first glass appears to boil.", "White gas emits from the first glass.", "The person drops white objects into the second glass of water.", "Water in the second glass appears to boil.", "White gas emits from the second glass.", "The person drops white objects into the third glass of water.", "Water in the third glass appears to boil."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 909}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "140.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A bald little boy looks to the right, then turns his head towards the camera and smiles. The camera cuts to the back of the bald boy, who is sitting on the grass next to a boy in orange clothing, both watching the stars in front of them. The bald boy stands up and faces the boy in orange, who turns his head to look at the bald boy.", "text": "A bald boy in a red shirt is looking to the right against a starry night sky. He then turns his head to face forward, and a small mark appears on his forehead. The boy smiles and places his hands on his chest. The camera then shows both the bald boy and a boy in an orange shirt sitting on the grass, looking at the stars. The bald boy turns to face the boy in the orange shirt, stands up, and starts talking to him. The camera then focuses on the bald boy's face as he looks to the right.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 140, "extra_info": {"duration": 5.12, "events": ["The bald little boy looks to the right", "The bald little boy turns his head towards the camera", "The bald little boy smiles", "The bald little boy sits on the grass", "The bald little boy watches the stars", "The bald little boy stands up", "The boy in orange turns his head to look at the bald boy"], "n_events": 7, "n_shots": 2, "n_subjects": 2}, "vid": 140}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "918.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Players in red t-shirts, black shorts, black socks and black shoes are jumping and moving forward.They pass a yellow dummy while jumping. In addition to the same players, another player in a black t-shirt, black tracksuit and black shoes formed a circle. They raise their hands up and spread their legs to the sides.", "text": "Four individuals in red jackets and black shorts are running on a green field, each performing a high-knee running exercise while moving in a circular pattern around a series of yellow and white cones. The camera captures their movement from a top-down perspective, showing their synchronized steps and the arrangement of the cones. As they continue their exercise, the camera angle shifts to a closer view, focusing on their upper bodies and the details of their movements. The background includes a yellow pole and additional cones, indicating a structured training environment.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 918, "extra_info": {"duration": 5.0, "events": ["Players jump and move forward.", "Players pass a yellow dummy while jumping.", "Another player joins to form a circle.", "Players raise their hands up.", "Players spread their legs to the sides."], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 918}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "367.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A woman is running along the highway, glancing to the left of the frame, as the camera follows the movement, a man riding a bicycle approaches from behind.", "text": "A woman is seen writing music on a sheet of paper with a pencil, with sheet music and a notebook on the table. The scene then transitions to the woman running along a riverside path, wearing a gray hoodie and black pants. As she runs, she passes by a stone wall on her left and a hedge on her right. A man on a bicycle appears behind her, riding in the same direction. The woman briefly glances to her right while continuing to run. The scene then cuts back to the woman writing music on the sheet of paper.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 367, "extra_info": {"duration": 7.88, "events": ["A woman runs along the highway.", "The woman glances to the left.", "A man on a bicycle approaches from behind."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 367}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "826.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "This video showcases a team battle in the game League of Legends. The red team progress forward, killing the blue team's characters one by one without sustaining any casualty. Various spells are cast by the red team, with animations of these spells visible on-screen. Notably, two players from the red team are responsible for eliminating three characters from the blue team. In this four on four matchup, the red team successfully takes down three characters from the blue team, leaving them with just one.", "text": "In a forested area with stone structures, a group of characters engage in combat, with various colorful effects and abilities being used. The characters, identified by their names and health bars, are seen casting spells and attacking each other, creating vibrant visual effects such as blue, purple, and red energy waves. The characters move around the battlefield, with some characters being hit and taking damage, as indicated by the decreasing health bars. The combat intensifies as more characters join in, and the effects become more pronounced, with some characters being surrounded by multiple energy waves. The characters continue to clash, with the effects becoming more chaotic and colorful, indicating a high level of engagement. The characters start to retreat, with the effects gradually diminishing, and the health bars of the characters begin to recover.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 826, "extra_info": {"duration": 7.0, "events": ["The red team advances forward", "The red team eliminates the first blue character", "The red team kills the second blue character", "The red team takes out the third blue character", "Various spells are cast by the red team", "Animations of spells appear on-screen", "Two red team players eliminate three blue characters", "The red team engages in a four on four matchup", "The red team leaves the blue team with one character"], "n_events": 9, "n_shots": 1, "n_subjects": 1}, "vid": 826}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "480.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "During a speed climbing competition, a Chinese female athlete grasped a protruding rock with both hands, pushed against the wall with her right foot, and pointed her left toe to the ground. When the signal to start was given, the athlete quickly ascended upward, using her hands and feet without pause. Upon reaching the summit, she leaped to slap a red square with her hand, and the display above showed the abbreviation for China, the national flag, the athlete's name, and her time of 6.54. Following that, the rope attached to the athlete began to slide downward, while the athlete herself raised her arms in cheer. Below the display appeared orange text displaying \"First place, 6.54 seconds,\" and in smaller font below, the name and score of the second-place athlete were shown.", "text": "A woman in a red outfit starts climbing a wall with red holds, beginning at 0.2 seconds. She continues to ascend, reaching 1.7 seconds, and then 2.7 seconds. As she climbs higher, she reaches 4.5 seconds and 6.5 seconds. She reaches the top of the wall, and the timer shows 6.54 seconds. The woman is seen celebrating her climb, with the text '6.54秒 第一名' (6.54 seconds, first place) appearing on the screen. The scene then switches to another woman in a red outfit starting her climb at 0.0 seconds.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 480, "extra_info": {"duration": 12.27, "events": ["Chinese athlete grasped a rock with both hands.", "Athlete pushed against the wall with right foot.", "Athlete pointed left toe to the ground.", "Athlete quickly ascended the wall.", "Athlete leaped to slap a red square.", "Display showed China's national flag and athlete's time.", "Rope attached to athlete began to slide downward.", "Athlete raised her arms in cheer.", "Display updated with 'First place, 6.54 seconds'.", "Display showed second-place athlete's name and score."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 480}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "783.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera angle is a high-angle shot moving forward to capture the car driving on the road, surrounded by various buildings.", "text": "The video shows an aerial view of a busy street with multiple lanes of traffic, including cars, a school bus, and a truck, moving in both directions. The street is lined with various buildings, some with graffiti on their walls, and there are pedestrian sidewalks on both sides. The camera moves forward steadily, capturing the flow of traffic and the urban environment.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 783, "extra_info": {"duration": 16.73, "events": ["Camera moves forward", "Car drives on the road"], "n_events": 2, "n_shots": 1, "n_subjects": 1}, "vid": 783}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "365.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man points ahead, and the man behind him slightly turns his head to look forward. The camera cuts to the direction of the finger, focusing on a cross pattern on the tree, and zooms in for a close-up shot.", "text": "In a forest setting, a boy in a white shirt holds a stick in his left hand and points to the right with his right hand while speaking. Behind him, another boy in a denim shirt with a leaf necklace looks at the boy pointing. The camera then shifts focus to a tree with an 'X' carved on it, and a hand points at the tree. The hand gradually moves out of the frame, leaving the tree with the 'X' mark in view.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 365, "extra_info": {"duration": 5.5, "events": ["The man points ahead.", "The man behind slightly turns his head."], "n_events": 2, "n_shots": 2, "n_subjects": 2}, "vid": 365}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "418.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is an outdoor fitness area surrounded by trees. A man wearing a brown hoodie leaps forward, grabs onto a bar and swings, flipping once in the air, then grabs the bar to do two spins, he hops over the bar and continues with another flip, then does two more flips in the air, grabs the bar again and keeps swinging, with onlookers watching him. The camera follows the man's movements.", "text": "In a park with autumn foliage, a man in a gray hoodie and black pants approaches a series of blue poles and begins to swing from one to the next. He performs a series of acrobatic maneuvers, including flips and handstands, as he moves along the poles. The man continues his routine, executing a handstand and other dynamic movements while maintaining his balance on the poles. He completes his sequence by landing on the ground and preparing to swing again. Throughout the video, several onlookers, including children and adults, watch and record the performance with their phones.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 418, "extra_info": {"duration": 15.05, "events": ["Man in brown hoodie leaps forward", "Man grabs onto a bar", "Man swings, flipping once in the air", "Man grabs the bar to do two spins", "Man hops over the bar", "Man performs another flip", "Man does two more flips in the air", "Man grabs the bar again", "Man keeps swinging", "Onlookers watch the man"], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 418}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "672.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A large group of penguins jumped into the water one by one from the rocky shore. Just as the last penguin jumped in, the tide receded, leaving it lying on the rocks. It wasn't until the water rose again that it swam into the water.", "text": "A group of penguins is seen on a rocky shore, with some already in the water and others preparing to enter. The penguins begin to waddle into the water, with one penguin diving in first. More penguins follow, entering the water and swimming away from the shore. The penguins continue to waddle into the water, with some already swimming and others still on the shore. The last few penguins enter the water, and the group starts to disperse. The penguins are now mostly in the water, with only a few still on the shore. The penguins are seen swimming further into the water, and the shore is left with only a few penguins.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 672, "extra_info": {"duration": 19.72, "events": ["A large group of penguins jumped into the water.", "Penguins jumped from the rocky shore one by one.", "The last penguin jumped towards the water.", "The tide receded after the last penguin jumped.", "The last penguin lay on the rocks.", "Water rose again.", "The last penguin swam into the water."], "n_events": 7, "n_shots": 1, "n_subjects": 1}, "vid": 672}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "944.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video captures a boxing match between two competitors. The boxer wearing white shorts labeled ''INDIA'' and red gloves attempts to land punches on his opponent, donned in black shorts and blue gloves. The blue-gloved boxer adopts a defensive stance, raising his arms to protect his face, thereby mitigating potential injury such as a beat to the head, a broken nose, or a nose bleed. A person wearing headphones is positioned behind the boxers, recording the match. When the red-gloved boxer is unable to strike his opponent's face, he switches tactics, aiming blows at the abdominal region before attempting another headshot. However, the blue-gloved boxer manages to deflect these attacks successfully, to the cheers of the spectators.", "text": "In a boxing ring, two boxers, one wearing white shorts with 'INDIA' written on the back and the other in black shorts, face each other. The boxer in white shorts moves forward, throwing punches at his opponent, who is wearing black shorts and blue gloves. The opponent in black shorts attempts to defend and counterattack. The referee, dressed in a white shirt, moves around the ring, closely observing the match. The audience, seated around the ring, watches the fight intently. The background features a banner with the text 'Delight Everly Monument' and various advertisements. The boxers continue to exchange punches, with the boxer in white shorts landing several blows. The opponent in black shorts tries to block and counter the attacks. The match progresses with both boxers showing determination and skill.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 944, "extra_info": {"duration": 12.0, "events": ["Boxer in white shorts attacks with red gloves.", "Boxer in black shorts raises arms to protect face.", "Boxer in white shorts aims punches at opponent's abdomen.", "Boxer in white shorts attempts headshot on opponent.", "Boxer in black shorts deflects the punches.", "Spectators cheer during the match.", "Person in headphones records the match."], "n_events": 7, "n_shots": 2, "n_subjects": 4}, "vid": 944}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "629.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera focuses on the side profile of a man's nose and mouth; a hand holding a comb runs through the man's beard, and then a trimmer is used to meticulously groom the beard on the man's face and around his mouth. Finally, the man says two sentences.", "text": "A person is seen using an electric trimmer to trim the beard of a man, starting from the side of the face and moving towards the neck. The trimmer is held close to the skin, and the person carefully shaves off the hair, ensuring an even cut. The background is a solid blue color, and the man's ear and part of his neck are visible. The trimmer is moved methodically, covering different sections of the beard to achieve a neat and tidy appearance.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 629, "extra_info": {"duration": 16.68, "events": ["Camera focuses on the side profile of a man's nose and mouth.", "A hand runs a comb through the man's beard.", "A trimmer grooms the beard on the man's face.", "A trimmer grooms around the man's mouth.", "The man says two sentences."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 629}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "244.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman turns her head to look to the right, the camera follows the movement, she walks over and stares at a photo on the mantelpiece, her hands resting on the mantelpiece. The camera switches to the woman's front, as she looks captivated at what's in front of her.", "text": "A woman with blonde hair tied in a ponytail and wearing glasses walks towards a shelf in a room with a lit lamp and bookshelves in the background. She approaches the shelf, which has a framed photo of a family, a ceramic turtle, and a small vase. She places her hands on the shelf and looks at the photo. The camera then focuses on her face as she continues to look at the photo.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 244, "extra_info": {"duration": 6.25, "events": ["The woman turns her head to look right.", "The woman walks over to the mantelpiece.", "The woman stares at a photo on the mantelpiece.", "The woman rests her hands on the mantelpiece."], "n_events": 4, "n_shots": 2, "n_subjects": 1}, "vid": 244}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "833.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video portrays a basketball match between a team in white uniforms and one in blue uniforms. The player in blue obtains the ball and jumps, simultaneously pursued by two players in white, attempting to intercept the ball. The player in blue proceeds to take a shot but fails. As both the players in white and blue make an attempt to secure the rebound, the player in blue tumbles to the ground. The player in white unknowingly turns and is tripped over by the fallen player in blue.", "text": "The basketball game is taking place in an indoor arena with 'Basketball Africa League' branding on the walls. The score is Pathitos BC 39 and Rivers Hoppers 20, with 3:51 remaining in the third quarter. A player in a blue jersey dribbles the ball towards the basket, evading defenders from the white team. The player in blue jumps and attempts a layup. The ball goes through the hoop, scoring a point for the blue team. A player from the white team falls to the ground near the basket. The white team players react to the play, with some raising their arms in response. The camera then shifts to show the players running back to their positions on the court.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 833, "extra_info": {"duration": 7.0, "events": ["Player in blue obtains the ball.", "Player in blue jumps.", "Two players in white attempt to intercept.", "Player in blue takes a shot.", "Player in blue fails the shot.", "Players scramble for the rebound.", "Player in blue tumbles to the ground.", "Player in white turns.", "Player in white is tripped over.", "Fallen player causes the trip."], "n_events": 10, "n_shots": 1, "n_subjects": 4}, "vid": 833}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "248.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera pulls away from a sign to reveal a house. The text on the right side slowly fades away as the camera switches to a poster about a hero's disappearance. The camera moves upward towards the top of the poster, and text on the left side of the frame appears, indicating that someone inside the house is shopping.", "text": "The video begins with a close-up of a 'Foreclosure for Sale Bank Owned' sign in a field, with the text 'Production Designer Patrick Tatopulos' appearing on the screen. The camera then pans left, revealing a house in the background as the text remains on the screen. The scene transitions to a close-up of a newspaper titled 'Metropolis Post' with the headline 'Did They Return to Their Planet?' and images of superheroes. The camera moves upward, showing more of the newspaper and the text 'Director of Photography Fabian Wagner, BSC' appearing on the screen. The camera continues to move upward, revealing a window with a view of a room filled with flowers and people, while the text 'Executive Producers Christopher Nolan, Emma Thomas, Jim Rowe, Ben Affleck' appears on the screen.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 248, "extra_info": {"duration": 12.12, "events": ["Camera pulls away from a sign", "Text fades away", "Camera reveals a house", "Camera switches to a poster", "Hero's disappearance is featured on the poster", "Camera moves upward", "Text appears indicating shopping"], "n_events": 7, "n_shots": 2, "n_subjects": 2}, "vid": 248}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "39.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The green creature looked in the direction the mermaid was pointing. Following that, the mermaid snatched the key from the creature's neck, took a glance behind, and then swam away. The creature turned around and chased after her. Holding the key, the mermaid swam towards a passage and bent down to hand over the key.", "text": "A woman in a pink and white outfit points at a green creature with a red mane, who is smiling and looking at her. The woman reaches out and grabs the key from the creature's neck. The creature's expression changes to one of surprise and pain as the woman pulls the key away. The woman, now holding the key, turns and swims away from the creature. The creature watches her as she swims towards a door in the background. The woman continues to swim towards the door, holding the key up in her hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 39, "extra_info": {"duration": 4.0, "events": ["The green creature looked in the direction the mermaid pointed.", "The mermaid snatched the key from the green creature's neck.", "The mermaid took a glance behind.", "The mermaid swam away.", "The green creature turned around.", "The green creature chased after the mermaid.", "The mermaid swam towards a passage.", "The mermaid bent down.", "The mermaid handed over the key."], "n_events": 9, "n_shots": 2, "n_subjects": 2}, "vid": 39}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "881.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video portrays an individual wearing a long-sleeved swimsuit in hues of orange, yellow, and gray. This person, characterized by long hair, is spotted placing their hands on both sides of a blue surfboard, slowly lifting their body. Gradually, they arch their back, lean forward, and stand upright, their feet perfectly aligned with the surfboard's centerline. Subsequently, they stand sideways on the moving board, extending their arms and lowering their center of gravity to maintain optimal balance.", "text": "A woman is lying on a surfboard in the water, paddling with her hands to move forward. She gradually lifts her body off the board, transitioning from a lying to a kneeling position. She then stands up on the surfboard, maintaining her balance as she rides the wave. The woman continues to stand and balance on the surfboard, moving forward with the wave. The text 'Lynanne St. Denis, Esquimeau Program' appears on the screen as she rides the wave. The woman then lies back down on the surfboard, continuing to paddle.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 881, "extra_info": {"duration": 5.0, "events": ["Individual places hands on surfboard sides.", "Individual lifts body above surfboard.", "Individual arches back on surfboard.", "Individual leans forward on surfboard.", "Individual stands upright on surfboard.", "Individual aligns feet with surfboard's centerline.", "Individual stands sideways on moving surfboard.", "Individual extends arms for balance.", "Individual lowers center of gravity on surfboard."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 881}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "329.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A large group of people are jumping around together, the camera focuses on a man in floral clothing and a woman in pink clothing, then the shot switches to their upper bodies. The woman wraps her arms around the man's neck, and then the two kiss each other.", "text": "A large group of people are dancing energetically at a party, with some holding cups and others raising their hands in the air. A man in a colorful shirt is seen dancing and moving his body enthusiastically. A woman in a pink top and a man in a colorful shirt turn towards each other and start dancing closely. They embrace and kiss each other while continuing to dance. The background is filled with other partygoers who are also dancing and enjoying the music.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 329, "extra_info": {"duration": 6.38, "events": ["A large group of people jump around together", "The camera focuses on a man in floral clothing", "The camera focuses on a woman in pink clothing", "The woman wraps her arms around the man's neck", "The man and the woman kiss each other"], "n_events": 5, "n_shots": 2, "n_subjects": 3}, "vid": 329}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "28.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The dog is standing on a drum, wagging its tail to DJ, while the green creature is playing the piano.", "text": "In a room filled with various musical instruments, including a green furry creature playing a large organ, a red dog stands on a small stool in front of a set of red drums. The dog looks up at the green creature, which is playing the organ, and then turns its head to the left. The dog begins to play the drums with its paws, moving its head and tail energetically. The dog continues to play the drums, occasionally looking up at the green creature. The dog then turns its head to the left, continuing to play the drums with enthusiasm.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 28, "extra_info": {"duration": 3.75, "events": ["The dog stands on a drum.", "The dog wags its tail.", "The DJ plays music.", "The green creature plays the piano."], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 28}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "900.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person with braided hair, wearing a light pink t-shirt, gray trousers, and a black bag with yellow stripes is skateboarding amidst a crowd. This individual is laughing with a classmate who has long black hair, wears a necklace, and is dressed in a black t-shirt. Four people are walking together in a location resembling a school garden. The first person on the left has blond hair, wears glasses, a black t-shirt, and carries a green backpack. The second individual is the person with the braided hair who is also seen laughing. The third individual has short black hair and is wearing a gray hoodie with orange text on it. They also have keys hanging around their neck as a form of necklace and is holding a mobile phone. The last person has long black hair, wears a black t-shirt along with a gray jacket, and has a cell phone in their back pocket. This person is seen smiling. All of them are working on the same whiteboard within a classroom, indicating they are classmates attending the same class.", "text": "A woman in a pink sweater and jeans runs on a busy campus pathway, smiling as she moves forward. She continues walking among other students, some with backpacks and others with bicycles. The woman turns her head to the side, and a girl in a black top and gray skirt smiles at her. The woman then approaches a group of four people, including a boy in a gray hoodie and three girls, and they all smile and interact with each other. The woman is then seen writing on a whiteboard in a classroom. Finally, she is shown running again on the campus pathway.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 900, "extra_info": {"duration": 6.97, "events": ["Person with braided hair skateboards amidst a crowd.", "Individual with braided hair laughs with classmate with long black hair.", "Four people walk together in a school garden.", "Person on the left carries a green backpack.", "Individual with short black hair holds a mobile phone.", "Person with long black hair smiles.", "Classmates work on the same whiteboard in a classroom."], "n_events": 7, "n_shots": 7, "n_subjects": 4}, "vid": 900}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "903.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a verdant playing field, two American football teams, clad in diverging uniforms, advance towards each other. One team is adorned in green and white, while the other sports black and red attire. The player sporting the number two on the red team launches the ball towards the upper right corner of the field. Simultaneously, at the center of the field, players from the opposing teams block each other's paths and engage in tussles. Two players from the white team attempt to intercept the airborne ball but fail to secure it. Subsequently, a player with the number eight on the red team successfully catches the ball mid-air, earning a point for his side. Consequently, the red team's score increases from 3 to 9.", "text": "The football game is in progress with the Atlanta team leading the New York team 3-0 in the 1st quarter, 1:46 remaining. The Atlanta team is on offense, and the quarterback drops back to pass as the offensive line blocks the New York defenders. The quarterback throws the ball downfield towards the end zone. A receiver from the Atlanta team catches the ball in the end zone, evading a defender from the New York team. The receiver secures the catch and celebrates the touchdown by raising his arms and running towards the end zone. The receiver is then congratulated by a teammate as they celebrate the touchdown.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 903, "extra_info": {"duration": 6.0, "events": ["Two football teams approach each other on the field.", "Player number two launches the ball to the field's corner.", "Players engage in blocks and tussles at the center.", "Two white team players attempt to intercept the ball.", "Red team's player number eight catches the ball mid-air.", "Red team's score increases from 3 to 9."], "n_events": 6, "n_shots": 2, "n_subjects": 3}, "vid": 903}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "389.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A hand opens the lid of a cardboard box, revealing the hand grenade inside. The camera then cuts to a man in military uniform leaning over, followed by a cut back to the cardboard box he has opened.", "text": "A hand reaches towards a cardboard box labeled 'WELCOME TO THE TEAM' and 'FRAGILE' and begins to lift the lid. As the lid is lifted, a wrapped object is revealed inside the box. The camera then shifts to show a man in military uniform, identified by the name tag 'BRIGGS,' looking down at the box. The camera returns to the box, showing the wrapped object more clearly.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 389, "extra_info": {"duration": 5.5, "events": ["A hand opens the lid of a cardboard box.", "The hand grenade is revealed inside the cardboard box.", "A man in military uniform leans over.", "The camera cuts back to the cardboard box."], "n_events": 4, "n_shots": 3, "n_subjects": 1}, "vid": 389}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "254.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Inside the restaurant, a man looks across, and the camera cuts to a woman walking away angrily. The camera switches back to the man, then to a wide shot. The man watches the woman leave, then glances back once more before dejectedly leaning back. Behind him, a man and a woman have been watching them all along, and there are other people dining in the restaurant.", "text": "A man is sitting at a table in a restaurant, looking forward and talking. A woman in a black dress with a necklace sits down at the table, and the man continues to talk while gesturing with his hands. The woman stands up and walks away, and the man turns his head to look at her, then turns back to continue talking. The man looks at the woman again, and she looks down.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 254, "extra_info": {"duration": 9.75, "events": ["A man looks across the restaurant.", "A woman walks away angrily.", "The man watches the woman leave.", "The man glances back.", "The man leans back dejectedly.", "A man and a woman watch the scene.", "Other people dine in the restaurant."], "n_events": 7, "n_shots": 5, "n_subjects": 3}, "vid": 254}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "528.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A puppy is lying down on the ground, while another puppy runs forward. The camera cuts to a puppy running forward, then tumbling and falling down the stairs. The background transitions from a summer park to a small path in a winter park, with the surroundings changing from green plants to snowy scenes.", "text": "Two fluffy puppies are running on a paved path with greenery in the background. The scene transitions to a single puppy running on a brick pathway towards a set of steps. The puppy continues running and approaches the steps. As it reaches the edge of the steps, it loses its balance and falls forward, rolling over onto the ground. The puppy then gets up and continues running.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 528, "extra_info": {"duration": 3.83, "events": ["A puppy lies down on the ground.", "Another puppy runs forward.", "The puppy tumbles.", "The puppy falls down the stairs.", "The background transitions to winter.", "Snowy scenes appear."], "n_events": 6, "n_shots": 2, "n_subjects": 3}, "vid": 528}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "5.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The black cat blows out a candle. Then the scene turns to this black cat holding a bowl when suddenly a plate hits its head.", "text": "A black cat with a red nose and green eyes is seen with its mouth open, showing its teeth and tongue, and then it turns its head to the left. The cat continues to turn its head, and a flame appears from the left side of the screen. The cat looks at the flame, and a worm emerges from the flame. The cat's expression changes to one of surprise as the worm moves closer. The cat then lies on the ground, holding a blue cup, and a shell falls on its face, breaking and scattering pieces. The cat looks up with a surprised expression.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 5, "extra_info": {"duration": 4.05, "events": ["The black cat blows out a candle.", "The black cat holds a bowl.", "A plate hits the black cat's head."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 5}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "889.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a gameplay sequence, an airplane is soaring through the sky. The player pulls out a map to pinpoint their location.", "text": "The video shows a plane flying over a coastal area with a map displayed on the right side of the screen. The plane is seen from behind, and the map on the right side shows various locations and routes. The plane continues to fly forward, and the map on the right side updates to show the plane's position and the distance to different locations. The plane flies over the coast, and the map on the right side shows the plane's trajectory and the distance to 'Georgopol' and 'Hospital'. The plane continues to fly forward, and the map on the right side shows the plane's position and the distance to 'Georgopol' and 'Hospital'.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 889, "extra_info": {"duration": 10.01, "events": ["An airplane soars through the sky.", "The player pulls out a map.", "The player pinpoints their location on the map."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 889}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "491.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the edge of a field, a tricycle is parked, with two watermelons placed in front of it. A man, bare-chested, leaps up, grabbing the railings of the tricycle, spinning and inverting on top of the vehicle, performing multiple gymnastic moves.", "text": "In a rural setting with a field and some buildings in the background, a shirtless man approaches a small yellow truck with two watermelons placed on the ground in front of it. He places his hands on the truck's rear and begins to lift his body, performing a series of acrobatic movements. The man continues to move his body, transitioning from a handstand to a flip, and then back to a handstand position. He maintains the handstand while moving his legs in a controlled manner, demonstrating strength and balance.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 491, "extra_info": {"duration": 4.97, "events": ["A man leaps up.", "The man grabs the tricycle railings.", "The man spins on top of the tricycle.", "The man inverts on the tricycle.", "The man performs gymnastic moves."], "n_events": 5, "n_shots": 1, "n_subjects": 1, "old_duration": 6.63, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/91_no_audio.mp4"}, "vid": 491}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "709.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video captures the side profile of a person's face, focusing on the eye area. The person blinks frequently and their head shakes slightly, as if they are crying. Then, the person uses a tissue to wipe away tears from the corner of their eyes.", "text": "A woman with gray hair is seen in close-up, with her eyes closed and a tear rolling down her cheek. She uses a white tissue to wipe away the tear from her eye. The woman continues to wipe her eye with the tissue, ensuring the tear is removed. She then looks slightly to the side, still holding the tissue near her face. The woman continues to wipe her eye, making sure it is clean.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 709, "extra_info": {"duration": 8.08, "events": ["The person blinks frequently.", "The person's head shakes slightly.", "The person uses a tissue.", "Tears are wiped from the corner of the eyes."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 709}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "442.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a field. The camera remains fixed. A person holds a knife with both hands, moves forward, bends over, and slices open a watermelon on the ground.", "text": "A person wearing a hat and long sleeves is seen holding a large knife and standing next to a row of watermelons on the ground. The person begins to cut the first watermelon in the row, making a vertical incision. As the knife continues to cut, the watermelon starts to split open, revealing its red interior. The person continues to cut through the watermelon, fully opening it to show the red flesh inside. The person then moves the knife to the next watermelon in the row and begins to cut it. The watermelon is successfully cut in half, revealing its red interior. The person continues to cut the remaining watermelons in the row, ensuring each one is opened.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 442, "extra_info": {"duration": 7.83, "events": ["A person holds a knife with both hands.", "The person moves forward.", "The person bends over.", "The person slices open a watermelon on the ground."], "n_events": 4, "n_shots": 1, "n_subjects": 1, "old_duration": 7.83, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/42.mp4"}, "vid": 442}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "575.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A green parrot walks forward, bends down to pick up a bottle cap, steps on the trash can beside it, and the lid of the can is opened. It tosses the bottle cap into the can, moves its foot away, and the lid closes automatically. Then it picks up another cap nearby and repeats the same action. The camera remains fixed.", "text": "A green and gray parrot is on a light wooden table with a red container and several metal bottle caps in front of it. The parrot picks up a bottle cap with its beak and moves towards the red container. It opens the container and places the bottle cap inside. The parrot repeats this action, picking up another bottle cap and placing it into the container. The background features a blurred image of a bird with a colorful pattern.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 575, "extra_info": {"duration": 6.43, "events": ["A green parrot walks forward.", "Parrot bends down.", "Parrot picks up a bottle cap.", "Parrot steps on the trash can.", "Trash can lid opens.", "Parrot tosses the bottle cap into the can.", "Parrot moves its foot away.", "Trash can lid closes automatically.", "Parrot picks up another bottle cap.", "Parrot repeats the toss action."], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 575}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "774.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man wearing red pants is walking along the platform, the subway train approaches from the distance, arriving at the platform and coming to a stop. Then, several people enter the frame from the right side, the subway doors open, and people board and disembark the train.", "text": "A subway train approaches the platform from the distance, gradually getting closer. The train arrives at the platform, and the doors open. Passengers begin to board the train, with some already inside and others waiting to enter. The train doors remain open as more passengers board, and the platform becomes busier with people moving around.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 774, "extra_info": {"duration": 28.95, "events": ["Man walks along the platform", "Subway train approaches from the distance", "Subway train arrives at the platform", "Subway train comes to a stop", "People enter the frame from the right side", "Subway doors open", "People board the train", "People disembark the train"], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 774}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "473.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Four people dressed in monkey costumes first stood in a line and then took turns strutting towards the camera, striking poses. In the end, all four of them danced together and struck a final pose to conclude the video shoot. The filming background was a small road by a street lined with rows of trees but devoid of any vehicles.", "text": "In a park with a paved path and lush green trees, a person in a brown costume with green leaves around their waist starts dancing, moving their arms and legs energetically. Another person in a yellow costume with a black collar and a leopard print skirt joins in, dancing alongside the first person. More people in yellow costumes join the dance, forming a group and performing synchronized dance moves. The group continues to dance together, moving in unison and performing various dance steps. The video ends with the group standing still, having completed their dance routine.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 473, "extra_info": {"duration": 7.05, "events": ["Four people in monkey costumes lined up", "Each person took turns walking towards the camera", "They each struck poses individually", "All four danced together", "Four people in monkey costumes ended with a final pose"], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 473}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "387.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person turns on the tap, fills a cup with water, and the scene switches to someone cooking a steak with a spatula.", "text": "A person turns on a faucet and begins to fill a glass with water. The person then moves their hand away from the faucet and uses a cloth to hold the glass. The scene shifts to the person using tongs to flip pieces of meat in a frying pan.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 387, "extra_info": {"duration": 4.25, "events": ["Person turns on the tap", "Person fills a cup with water", "Scene switches", "Someone cooks a steak", "Someone uses a spatula"], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 387}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "518.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A cartoon piglet is pole dancing; the piglet performs enchanting pole dance moves on the stage, then jumps onto the pole to spin, accelerates the spinning speed after looking into the camera, and finally spins out of the frame.", "text": "A small pig stands on a platform and looks up at a pole, then begins to climb it. The pig continues to ascend the pole, using its front legs to grip it and its hind legs to push off the platform. The pig climbs higher, wrapping its body around the pole and looking around as it goes. The pig reaches a higher point on the pole, still holding on with its front legs and looking around. The pig then loses its grip and falls off the pole, disappearing from the frame.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 518, "extra_info": {"duration": 7.2, "events": ["Cartoon piglet performs pole dance moves on stage.", "Cartoon piglet jumps onto the pole.", "Cartoon piglet starts to spin on the pole.", "Cartoon piglet accelerates spinning speed.", "Cartoon piglet spins out of the frame."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 518}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "777.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A yellow cat is lying on a person's legs, fiddling with a toy with its paws, then falls off and climbs back up to continue playing. Finally, the cat jumps off the person's legs.", "text": "A ginger cat is lying on a blue cushion, playing with a string toy. The cat swats at the toy with its paws, bites it, and rolls around on the cushion. The cat continues to play, grabbing the toy with its mouth and paws, and eventually gets the toy stuck around its neck. The cat struggles to free itself, turning over and moving around on the cushion. The cat finally manages to get the toy off its neck and continues to play with it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 777, "extra_info": {"duration": 25.56, "events": ["A yellow cat lies on a person's legs", "The cat fiddles with a toy", "The cat falls off the person's legs", "The cat climbs back up onto the person's legs", "The cat continues playing with the toy", "The cat jumps off the person's legs"], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 777}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "475.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a slope covered with snow, a person dressed in green is crawling forward, grasping a red rope. A person in a grey jacket is behind them, shouldering the rope and pulling forcefully. Behind the person in green, there is another individual wearing red who is holding a mountaineering pole, striking it beside the person in green, and then following behind them.", "text": "A person in a green jacket is lying on their stomach on a snowy slope, holding onto a red rope with both hands, and being pulled upwards by another person in a gray jacket. The person in the green jacket struggles to maintain their grip and balance as they are pulled up the slope. The person in the gray jacket continues to pull the rope, gradually moving the person in the green jacket higher up the slope. The person in the green jacket is almost at the top of the slope, still holding onto the rope tightly. The person in the gray jacket is now at the top of the slope, while the person in the green jacket is still being pulled up. The person in the green jacket is now closer to the top, with the person in the gray jacket still pulling the rope. The person in the green jacket is almost at the top, and the person in the gray jacket is still pulling the rope. The person in the green jacket is now at the top of the slope, and the person in the gray jacket is still pulling the rope.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 475, "extra_info": {"duration": 8.97, "events": ["A person in green crawls forward on snow.", "The person in green grasps a red rope.", "A person in grey pulls the rope forcefully.", "The person in red holds a mountaineering pole.", "The person in red strikes the pole beside the green person.", "The person in red follows the green person."], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 475}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "704.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A pair of feet walks to the left on the grass, followed one after another by little dogs appearing in the frame. With each step, these feet are lifted high to avoid stepping on the puppies. Moving from the left to the right and back again, the little dogs frolic alongside these feet.", "text": "In a grassy outdoor setting, a person wearing jeans and sneakers is seen interacting with a group of small puppies. The puppies, of various colors including brown, black, and white, are energetically circling around the person's legs, jumping and playfully biting at their shoes. The person moves their feet, encouraging the puppies to follow and engage with them. The puppies continue to circle and jump around the person's legs, showing playful and curious behavior. The scene remains consistent with the puppies actively interacting with the person, who occasionally lifts their feet to provoke more playful reactions from the puppies.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 704, "extra_info": {"duration": 16.95, "events": ["A pair of feet walks to the left on the grass.", "Little dogs appear following the feet.", "Feet are lifted high to avoid stepping on puppies.", "Feet move from the left to the right.", "Feet move back again.", "Little dogs frolic alongside the feet."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 704}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "778.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A group of people in uniform, fully armed and carrying shields, surrounded another group holding batons, and a series of conflicts broke out between them. The camera captured the scenes of the conflict from various angles. The latter hit the shields with their batons, and then one of them was pushed to the ground, followed by others piling on top. Afterwards, another person kicked a shield.", "text": "A group of soldiers in camouflage uniforms and helmets are seen forming a line with shields, facing a crowd of protesters. The protesters, some holding signs, approach the soldiers, and the soldiers begin to push and shove the protesters. The soldiers use their shields to block the protesters, and the situation escalates as more soldiers join in to control the crowd. The soldiers continue to push the protesters, and some protesters are seen being pushed to the ground. The soldiers form a tighter line with their shields, and the protesters are forced back. The soldiers then move forward, pushing the protesters further away from the line.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 778, "extra_info": {"duration": 22.42, "events": ["Group in uniform surrounded another group holding batons.", "Groups engaged in a series of conflicts.", "Group holding batons hit the shields.", "Uniformed person pushed a baton-holder to the ground.", "Others piled on top of the fallen person.", "Person kicked a shield."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 778}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "147.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A phoenix is soaring in the sky. A woman in blue clothes watches as the phoenix flies by, shedding red specks of light. The camera follows the woman's perspective as it rotates. The woman turns her head to look at a phoenix feather falling from the sky. She reaches out her hand to catch it, and a female sprite emerges from the water.", "text": "In a dimly lit, ornate room with a wooden dock, a bird with glowing feathers flies away from a girl standing on the dock. The girl, wearing a blue outfit with a bow, watches the bird as it flies past her. The bird continues to fly towards the water, and the girl follows its movement with her eyes. The bird lands on the girl's hand, and she gently places it into the water. The bird dives into the water, and a girl with elaborate hair accessories emerges from the water, smiling at the girl on the dock.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 147, "extra_info": {"duration": 7.62, "events": ["A phoenix soars in the sky.", "The phoenix sheds red specks of light.", "The woman watches the phoenix.", "The woman turns her head.", "A phoenix feather falls from the sky.", "The woman reaches out her hand.", "The woman catches the phoenix feather.", "A female sprite emerges from the water."], "n_events": 8, "n_shots": 2, "n_subjects": 3}, "vid": 147}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "258.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man lies on a hospital bed. A bearded man walks from the corridor to the entrance of the ward, glances at the man on the bed, then looks forward. A female police officer, with her arm in a sling, approaches from the opposite direction. The bearded man picks up the item in his hand.", "text": "A man is lying in a hospital bed, wearing a hospital gown with blue patterns and an oxygen tube in his nose. Another man, holding a 'PINK'S' food bag, walks down a hospital hallway towards the room. He enters the room where the man in the bed is lying. A female nurse with a cast on her arm approaches the man with the food bag and starts talking to him.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 258, "extra_info": {"duration": 6.5, "events": ["A man lies on a hospital bed", "A bearded man walks from the corridor", "The bearded man glances at the man on the bed", "A female police officer approaches", "The bearded man picks up the item"], "n_events": 5, "n_shots": 3, "n_subjects": 3}, "vid": 258}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "801.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts a clear day where a red car rushed toward some individuals and a bicycle. The people managed to evade the car, however, the bicycle was utterly destroyed in the process. The bicycle was noticeably yellow and black in color.", "text": "A red Mini Cooper with the license plate '6UGELIO' drives into a parking lot and comes to a stop. Three people, two boys and a girl, react with surprise and start running away from the car. The car continues to move forward, hitting a boy lying on the ground, causing him to fall. The boy on the ground holds a dollar bill and looks up as the car passes by. The two boys and the girl, who are sitting on the ground, look at the car and then start walking towards it. A woman carrying a bag walks past them. The car remains stationary in the parking lot.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 801, "extra_info": {"duration": 10.01, "events": ["A clear day presents.", "A red car rushes toward individuals.", "Individuals evade the red car.", "The red car heads toward a bicycle.", "The red car destroys the yellow and black bicycle."], "n_events": 5, "n_shots": 10, "n_subjects": 3}, "vid": 801}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "759.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person is carrying a bicycle on their shoulders while walking up a hillside, then puts down the bicycle and continues forward by pushing it.", "text": "A man wearing a helmet and a backpack is seen carrying a bicycle over his shoulder while walking on a grassy path with scattered leaves. He continues to walk uphill, maintaining his grip on the bicycle. The man then lowers the bicycle to the ground and begins to push it forward. He resumes carrying the bicycle over his shoulder as he continues along the path.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 759, "extra_info": {"duration": 10.44, "events": ["Person carries bicycle on shoulders.", "Person walks up a hillside.", "Person puts down the bicycle.", "Person pushes the bicycle forward."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 759}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "218.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The character knocked the woman to the ground, then grabbed her and threw her against the wall, before raising an axe and leaping towards her. The woman dodged to the side and spun downwards, grabbing onto the wall to hang there, with a sword in her other hand.", "text": "A woman in armor is seen fighting against a group of enemies in a dimly lit, industrial setting with sparks flying around. She is knocked down by a powerful enemy wielding a weapon with a flaming blade. The woman, holding her shield, looks up as the enemy continues to attack her. The enemy swings their weapon, and the woman is thrown against the wall. The woman, now on the wall, prepares to counterattack as the enemy continues to advance. The woman swings her weapon, hitting the enemy, causing them to stagger. The woman continues to fight, using her shield and weapon to fend off the attackers.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 218, "extra_info": {"duration": 8.12, "events": ["Character knocked woman to ground", "Character grabbed woman", "Character threw woman against wall", "Character raised axe", "Character leaped towards woman", "Woman dodged to the side", "Woman spun downwards", "Woman grabbed onto wall", "Woman hung on wall", "Woman held sword in hand"], "n_events": 10, "n_shots": 5, "n_subjects": 3}, "vid": 218}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "753.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two children are repeatedly exchanging the toys in their hands on a red tabletop.", "text": "In a room with a red box in the foreground and a blurred background, two hands appear, one holding a yellow object and the other holding a small toy car. The hands place the yellow object and the toy car on the red box. The hands then move away, leaving the red box in the frame. The hands reappear, with one hand holding the yellow object and the other hand holding the toy car. The hands place the yellow object and the toy car on the red box again. The hands move away once more, leaving the red box in the frame.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 753, "extra_info": {"duration": 6.88, "events": ["Two children exchange toys", "Toys move from one hand to another", "Exchange happens on a red tabletop"], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 753}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "221.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man in the driver's seat looked to the right, the camera angle switched to inside, the woman used her ID card to unlock the door, turned around and opened the door, then walked forward. The door was about to close when the man in the suit inside came over to open it. The man in the suit turned his head to look around and then walked forward with the blonde woman, as the door slowly closed.", "text": "A man is sitting in a car at night, looking to his right. The scene then shifts to a hospital hallway where a nurse in blue scrubs walks towards a door, opens it, and enters the room. The nurse closes the door behind her. Shortly after, a man and a woman enter the hallway through the same door, walking side by side.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 221, "extra_info": {"duration": 13.75, "events": ["Man in driver's seat looked right", "Woman used ID card to unlock door", "Woman turned around", "Woman opened the door", "Woman walked forward", "Door was about to close", "Man in suit came to open it", "Man in suit looked around", "Man in suit walked forward with woman", "Door closed slowly"], "n_events": 10, "n_shots": 2, "n_subjects": 4}, "vid": 221}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "75.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man wearing a hat is speaking frantically to someone in front of him. The camera switches to a view of the man's back, at his buttocks, where a red tail appears. The tail touches the butt of a woman in front of him. The camera then cuts to their faces, where the woman gives the man a coquettish and annoyed look, and then closes her eyes and moves to kiss him. As they kiss, the man's mouth transforms into a red animal-like snout. The man's eyes widen with shock. After kissing the woman, the man covers his mouth and quickly runs off. The woman looks in the direction of the man as he runs away.", "text": "A man in a gray suit with a pink tie and a hat is talking to a woman in a black dress with black gloves in a room with beige curtains and a staircase in the background. The man's tail, which is initially gray, starts to glow with pink sparkles and transforms into a pink, glittery tail. The woman, looking surprised, continues to talk to the man. The man smiles and then kisses the woman on the cheek. The man's tail retracts back into his body, and the woman looks at him with a mix of surprise and amusement.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 75, "extra_info": {"duration": 8.25, "events": ["A man speaks frantically to someone.", "A red tail appears from the man's back.", "The tail touches the woman's butt.", "The woman gives the man a coquettish and annoyed look.", "The woman closes her eyes.", "The woman moves to kiss the man.", "The man's mouth transforms into an animal-like snout.", "The man's eyes widen with shock.", "The man covers his mouth.", "The man runs off."], "n_events": 10, "n_shots": 3, "n_subjects": 3}, "vid": 75}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "691.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows an empty expanse of ice, after which several ice hockey players rush in from the right side of the frame. They come to a stop in the center of the screen, sliding back, but one of them fails to stop properly and slips, falling down onto the ice. The other players have already slid out of the frame, leaving only him to slowly and painfully get back up.", "text": "The video begins with an empty ice rink, showing the boards and the ice surface. A hockey player in a blue jersey skates into the frame from the right, followed by two other players. The player in the blue jersey is checked by a player in a white jersey, causing him to fall to the ice. The player in the blue jersey remains on the ice, holding his leg in apparent pain. He then starts to get up, still holding his leg, and begins to skate away. The video ends with the ice rink empty again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 691, "extra_info": {"duration": 7.13, "events": ["Ice expanse appears on the screen.", "Ice hockey players rush in from the right.", "Players come to a stop in the center.", "One player fails to stop properly.", "The failing player slips on the ice.", "Player falls down onto the ice.", "Other players slide out of the frame.", "The fallen player slowly gets back up."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 691}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "195.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The boss with a ruffian demeanor walked forward, reached out to open the drawer, which was filled with documents, and he climbed onto the edge of the drawer and said something.", "text": "A green insect with antennae walks towards a set of drawers in a room with a tiled floor. The insect reaches out and pulls the handle of the middle drawer, which is labeled 'PRIVATE PLEASURE.' The drawer opens, revealing a compartment filled with small, yellow objects. The insect peeks inside the drawer, looking at the contents. After a moment, the insect steps back and closes the drawer.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 195, "extra_info": {"duration": 3.38, "events": ["The boss walked forward.", "The boss reached out.", "The boss opened the drawer.", "The drawer was filled with documents.", "The boss climbed onto the drawer edge.", "The boss said something."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 195}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "728.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman places slices of the cut lemon into the cup one by one.", "text": "In a kitchen setting, a woman is seen preparing a drink on a wooden table. She places a slice of lemon into a glass jar filled with a yellow liquid, which is surrounded by various fruits including oranges, avocados, and potatoes. She then picks up another slice of lemon and places it into the jar. The woman continues to add more slices of lemon into the jar, ensuring they are evenly distributed. The background shows a cutting board with more oranges and a bowl of potatoes, indicating she is in the process of preparing the drink.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 728, "extra_info": {"duration": 5.11, "events": ["The woman cuts a lemon.", "The woman places a lemon slice into the cup."], "n_events": 2, "n_shots": 1, "n_subjects": 1}, "vid": 728}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "685.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, a lady dressed in a yellow windbreaker walks up from a slope filled with gravel, reaching the camera, a man in grey clothes waiting there reaches out a hand to pull the lady up the slope, and the lady climbs up and walks out of the frame.", "text": "A woman in a yellow jacket and gray pants is seen walking down a rocky slope, adjusting her hair with her right hand. She continues to descend, looking down and moving cautiously. As she approaches a man standing at the bottom of the slope, she extends her left hand towards him. The man reaches out and helps her by holding her hand, and she takes his assistance to climb up. The woman successfully climbs up with the man's help and continues walking past him. The background consists of a rocky terrain with scattered stones and a few large rocks.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 685, "extra_info": {"duration": 11.43, "events": ["A lady walks up a gravel slope.", "The man waits at the top of the slope.", "The man reaches out his hand.", "The lady grabs the man's hand.", "The man pulls the lady up the slope.", "The lady climbs up the slope.", "The lady walks out of the frame."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 685}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "895.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the right side of the video, an individual carrying luggage can be seen standing in front of a store while using a mobile phone. Suddenly, a white car halts on the right side of the road, obstructing the path of a white FedEx truck behind it. The FedEx logo is evident on the side of the truck in vivid shades of purple and green. We can infer that the FedEx truck may need to overtake the white car to continue on its path.", "text": "The video shows a street scene with a white car driving on the road and a FedEx truck approaching from the left side of the frame. The white car continues to move forward, and the FedEx truck starts to enter the frame from the left. The FedEx truck, labeled 'Ground,' moves further into the frame, partially obscuring the view of the white car. The FedEx truck continues to move forward, fully entering the frame and blocking the view of the white car. The white car is now partially obscured by the FedEx truck, which is now fully in the frame. The FedEx truck continues to move forward, and the white car is almost completely obscured. The FedEx truck moves further into the frame, and the white car is now mostly out of view. The FedEx truck continues to move forward, and the white car is no longer visible.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 895, "extra_info": {"duration": 6.01, "events": ["Individual stands in front of a store.", "Individual uses mobile phone.", "White car halts on the right side of the road.", "White car obstructs the path of the FedEx truck.", "FedEx truck is behind the white car.", "FedEx truck may need to overtake the white car."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 895}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "450.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a waterlogged road, a man carrying a pair of shoes in his hand rolled up his trouser legs, walking barefoot through the water towards the other side of the street. Just as he was about to reach the curb, he leaped forward, jumping onto the curb, but immediately slipped and fell, ending up sitting on the ground.", "text": "A man stands in a flooded street, holding a pair of shoes in his hands. He begins to walk through the water, stepping carefully. The man continues to walk, occasionally looking around as he moves forward. He then starts to run, still holding the shoes, and heads towards the right side of the frame. The man runs past an orange car parked on the right side of the street.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 450, "extra_info": {"duration": 10.8, "events": ["A man carried shoes in his hand.", "The man rolled up his trouser legs.", "The man walked barefoot through water.", "The man approached the other side.", "The man leaped toward the curb.", "The man slipped on the curb.", "The man fell onto the ground.", "The man ended up sitting."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 450}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "49.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a classroom, a female character looks on in surprise at a device in front of her that generates a large amount of electric current, which then surges towards a frog.", "text": "A character with a white face and black hair is seen in a dark room, raising its hands and looking distressed. The scene shifts to a girl with glasses sitting at a desk, reading a book and looking surprised. The classroom is filled with other characters, all of whom are watching the scene unfold. A device in the center of the room begins to emit electrical sparks, which intensify and spread. The electrical sparks move towards a tray containing a small creature, which is then electrocuted. The girl with glasses continues to watch the event unfold, looking shocked.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 49, "extra_info": {"duration": 5.62, "events": ["A female character looks on in surprise.", "A device generates a large amount of electric current.", "Electric current surges towards a frog."], "n_events": 3, "n_shots": 4, "n_subjects": 3}, "vid": 49}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "965.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features a brown-haired person wearing a Superman costume and a blonde person wearing a gray-throated hudi. The person wearing the Superman costume is trying to eat a yellow meal from a blue plate on a burgundy table with a knife in his hand. There is a broken eggshell next to this yellow dish. Meanwhile, the person with the grey hood laughs. Behind them are firefighter helmets and a colourful ball.", "text": "A knife is seen spreading a yellow substance on a blue plate with a broken eggshell on it. The camera zooms out to show two women sitting at a table with the plate in front of them. One woman, wearing a gray sweater, holds a blue canister and moves it to the side. The woman in the blue Superman shirt uses the knife to spread the yellow substance on the plate. She then leans forward and eats the yellow substance directly from the plate. The woman in the gray sweater laughs as the woman in the blue shirt continues to eat. The camera zooms in on the plate, showing the yellow substance being spread and eaten.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 965, "extra_info": {"duration": 4.97, "events": ["Brown-haired person tries to eat a yellow meal", "Brown-haired person holds a knife", "Broken eggshell rests beside the yellow dish", "Blonde person laughs", "Firefighter helmets are located behind the people", "Colourful ball is present in the background"], "n_events": 6, "n_shots": 5, "n_subjects": 2}, "vid": 965}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "14.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Numerous small creatures crawl upward from the ground to the top of a car's exterior, causing two characters inside the car to react.", "text": "A green lizard is seen on the car roof, surrounded by numerous insects. The lizard begins to move towards the car window, where a girl and a boy are sitting inside. The girl and boy react with excitement, clapping their hands and smiling as the lizard approaches the window. The lizard continues to move closer to the window, and the insects on the car roof start to swarm around the lizard. The lizard reaches the window, and the insects begin to interact with it, creating a lively scene on the car roof. The girl and boy continue to watch and smile as the lizard and insects interact, with the insects moving around the car roof and the lizard.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 14, "extra_info": {"duration": 1.71, "events": ["Small creatures crawl upward from the ground.", "Creatures reach the top of the car's exterior.", "Two characters inside the car react."], "n_events": 3, "n_shots": 1, "n_subjects": 3}, "vid": 14}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "715.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "One person places the soccer ball on the line, and in the distance, another person stands in front of the goal, assuming the position of a goalkeeper. The person kicks the ball towards the goal. The goalkeeper dives in the wrong direction, and the ball enters the net. The shooter slowly runs towards the goal.", "text": "A man in black shorts and black socks holds a soccer ball with both hands on a soccer field, with a goal and another man in a white shirt and black shorts in the background. He places the ball on the ground and steps back. The man in the white shirt, who is the goalkeeper, prepares to defend the goal. The man in black shorts kicks the ball towards the goal. The goalkeeper dives to his left but fails to stop the ball, which goes into the net. The man in black shorts walks away from the goal, and the goalkeeper stands up and watches the ball.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 715, "extra_info": {"duration": 7.98, "events": ["One person places the soccer ball on the line", "Another person assumes the position of a goalkeeper", "The person kicks the soccer ball towards the goal", "The goalkeeper dives in the wrong direction", "The soccer ball enters the net", "The shooter runs towards the goal"], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 715}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "617.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Against a blurred background, a man looks ahead, then puts on his helmet, fastens the chin strap, straightens his collar, and adjusts the helmet again.", "text": "A man stands outdoors with a blurred background of hills and trees, wearing a dark jacket and a black turtleneck. He looks to his left and then raises a black helmet with both hands. He places the helmet on his head and adjusts it. He secures the helmet by fastening the straps under his chin. After ensuring the helmet is properly fastened, he adjusts the fit by pulling the straps and making final adjustments. Finally, he looks to his left again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 617, "extra_info": {"duration": 10.26, "events": ["Man looks ahead", "Man puts on helmet", "Man fastens chin strap", "Man straightens collar", "Man adjusts helmet"], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 617}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "396.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two men are throwing a basketball to each other, while on their right side, another two men are conversing. As this is happening, two women are walking past them from the left.", "text": "In a grassy area with a brick building in the background, a man in a gray jacket holds a basketball and prepares to throw it. He throws the ball to a man in a black jacket, who catches it and prepares to throw it back. The man in the black jacket throws the ball to the man in the gray jacket, who catches it and prepares to throw it again. The man in the black jacket throws the ball to the man in the gray jacket, who catches it and prepares to throw it back. Throughout the sequence, other people are seen walking and standing in the background, and a person in a white jacket walks past the camera.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 396, "extra_info": {"duration": 5.88, "events": ["Two men throw a basketball to each other", "Another two men converse", "Two women walk past from the left"], "n_events": 3, "n_shots": 1, "n_subjects": 3}, "vid": 396}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "271.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "As the subway doors were closing, a person dashed into the train carriage, holding a suitcase which got stuck outside the doors. Inside the subway, he forcefully pulled at the suitcase handle, while an elderly person next to him offered support. The camera then cuts to the subway slowly starting to move forward, with him still holding onto the suitcase without letting go.", "text": "A man in a suit runs towards a train and opens the door, holding a box in his hand. He steps onto the train, still holding the box, and begins to enter. A woman walks past the train door as the man continues to board. Inside the train, the man bends down and places the box on the floor. An elderly man with white hair assists the man in securing the box inside the train. The train starts to move as the man and the elderly man ensure the box is properly placed.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 271, "extra_info": {"duration": 8.12, "events": ["Person dashed into the subway carriage.", "Suitcase got stuck outside the subway doors.", "Person pulled at the suitcase handle forcefully.", "Elderly person offered support to the person.", "Subway started to move forward."], "n_events": 5, "n_shots": 3, "n_subjects": 3}, "vid": 271}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "493.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the table, there is a utensil, and a small dog dressed in a floral outfit is using its right front paw to press a button on the utensil. With each press, a few black beans fall from underneath the utensil and land in a plate. The small dog finishes eating the beans, then presses the button again, and continues eating the beans.", "text": "A dog wearing a red floral shirt is seen interacting with a blue and white automatic pet feeder filled with food. The dog uses its paw to push the food into the feeder's bowl, causing the food to scatter. The dog continues to push the food into the bowl, occasionally dipping its head into the bowl to eat. The background shows a tiled floor with a red and black circular design, and another green and white automatic pet feeder is visible in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 493, "extra_info": {"duration": 14.41, "events": ["A small dog presses a button on a utensil", "Black beans fall into a plate", "The small dog eats the beans", "The small dog presses the button again"], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 493}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "785.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person in a spacesuit is driving a four-wheeled vehicle on a planet in outer space, and the camera switches to a close-up shot of the person's face and the front of the vehicle.", "text": "An astronaut in a spacesuit drives a lunar rover across the moon's surface, with the rover equipped with various scientific instruments and a solar panel. The rover continues to move forward, kicking up dust as it goes. The camera then shifts to a close-up of the astronaut's helmet, showing the reflection of the lunar surface and equipment. The camera pans to the right, revealing a lunar module in the background. The camera then focuses on a lunar module with a solar panel and a camera, which is positioned on the moon's surface.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 785, "extra_info": {"duration": 16.98, "events": ["A person drives a four-wheeled vehicle on a planet."], "n_events": 1, "n_shots": 2, "n_subjects": 2}, "vid": 785}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "190.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The ruffian boss opened his mouth wide, sweating profusely, looking ahead. The camera zooms out to reveal a group of creatures raising their hands to pass Karen to the right. The camera then cuts to Karen's smiling expression.", "text": "A green bug with antennae and a headset is seen in a dark environment, initially looking scared. The bug then opens its mouth wide, appearing to shout or scream. The bug is shown standing in a hole in the ground, looking up. The camera then shifts to a group of colorful characters below, who are watching the bug. The characters raise a mechanical arm with a TV screen on it, and the TV screen lights up. The characters cheer and wave their hands in excitement. The TV screen shows the bug's face, and the characters continue to watch it. The TV screen then turns off, and the bug's expression changes to one of fear.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 190, "extra_info": {"duration": 5.12, "events": ["The ruffian boss opened his mouth wide.", "The ruffian boss sweated profusely.", "A group of creatures raised their hands.", "The creatures passed Karen to the right.", "Karen smiled."], "n_events": 5, "n_shots": 2, "n_subjects": 3}, "vid": 190}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "527.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The puppy chases after the duckling as it runs forward, then knocks it down to the ground. The duckling stands up, and then it turns around and continues the chase, running towards the camera.", "text": "A small puppy and a duckling are walking on a paved road with a white fence and trees in the background. The puppy follows the duckling closely, and they both move forward. The puppy gets closer to the duckling, and they start to interact. The puppy nudges the duckling, causing it to fall over. The puppy then circles around the duckling, which gets up and starts to move away. The puppy continues to follow the duckling, which keeps moving forward.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 527, "extra_info": {"duration": 7.8, "events": ["The puppy chases the duckling.", "The duckling runs forward.", "The puppy knocks the duckling down.", "The duckling stands up.", "The duckling turns around.", "The duckling continues the chase.", "The duckling runs towards the camera."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 527}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "98.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the trench, the soldier in green uniform charged towards the pile of wood ahead. He took off his helmet and moved one plank after another, revealing a dog lying inside.", "text": "In a desolate, reddish environment with sandbags and a wooden gate in the background, a soldier in green uniform and helmet runs towards a pile of wooden planks. He removes his helmet and kneels down to inspect the planks. The camera then focuses on the planks, revealing a dog lying underneath. The soldier begins to move the planks aside, uncovering the dog more clearly. He continues to clear the planks, fully exposing the dog lying on the ground.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 98, "extra_info": {"duration": 4.12, "events": ["Soldier charged towards the pile of wood", "Soldier took off his helmet", "Soldier moved one plank", "Soldier revealed a dog lying inside"], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 98}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "427.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a mountain road, several monkeys are attacking a group of tourists. Among them, a woman is scared into screaming and running away, and a man is also frightened by the monkeys and runs down the mountain. Eventually, after descending a step, the man turns around to face the monkey. The monkey dodges to the right, and the man supports himself with a tree branch. The dodging monkey, along with another one, comes back and once again makes an aggressive gesture towards the tourists.", "text": "A group of people, including a man and a woman, are walking down a set of stone steps in a forested area. As they descend, a monkey appears and starts chasing them. The man and woman react by trying to avoid the monkey, with the man bending down and the woman stepping back. The man then uses a stick to shoo the monkey away, causing it to retreat. The group continues to move down the steps, with the monkey running off to the side. The man and woman continue to walk down the steps, while the monkey moves further away.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 427, "extra_info": {"duration": 10.5, "events": ["Monkeys attack a group of tourists on a mountain road.", "A woman screams and runs away.", "A man runs down the mountain.", "The man descends a step.", "The man turns around to face the monkey.", "The monkey dodges to the right.", "The man supports himself with a tree branch.", "The dodging monkey comes back.", "Another monkey comes back.", "Monkeys make an aggressive gesture towards the tourists."], "n_events": 10, "n_shots": 1, "n_subjects": 4}, "vid": 427}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "926.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a room setting, a lengthy-haired individual, donned in a gray T-shirt, stands in front of a camera. The individual, whose back faces the camera, restfully leans against a pool table, located on the right, with hands behind their back, holding a thin, rectangular, brown envelope or paper bag. The individual also sports long blonde hair. A second person, dressed in a black T-shirt, blue shorts, and black shoes, emerges from a staircase positioned upfront. The person gradually descends the stairs, strolling past the camera whilst pointing at it with their right hand and subsequently gesturing a greeting by shaking their left hand at the camera. The person in black situates on the left, while the person in gray positions on the right. They both display joyous smiles on their faces. The individual wearing gray finally brings forth the hidden brown object from behind their back.", "text": "A woman in a gray dress stands in a living room, facing a staircase and a table covered with a patterned cloth. A man in a black shirt and blue shorts descends the stairs and walks towards the woman. The woman turns her head to look at the man as he approaches. The man points at the camera and says something, then turns to look at the woman. The woman turns her head to look at the man, then turns back to face the camera. The man walks back up the stairs and exits the frame.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 926, "extra_info": {"duration": 8.99, "events": ["Long-haired individual leans against a pool table.", "Individual holds a brown envelope behind their back.", "Second person emerges from the staircase.", "Second person descends the stairs.", "Second person points at the camera with their right hand.", "Second person shakes their left hand at the camera.", "Both individuals position themselves on opposite sides.", "Both individuals display smiles.", "Long-haired individual reveals the brown object."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 926}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "189.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "SpongeBob says something to Patrick, leaps forward, and runs away. Patrick turns his head to see a skeleton body thrown to the right. The camera then cuts to SpongeBob running forward, switches to Patrick sitting and blinking, propping himself up with one hand while leaning back, and looking down to his left.", "text": "A pink character and a yellow character are driving a burger-shaped car with a 'K' flag in a graveyard filled with skulls. The yellow character turns to look at the pink character, who then turns to look forward. The yellow character gets out of the car and starts running through the graveyard, causing the skulls to scatter. The pink character turns to look at the yellow character, who then gets back into the car. The yellow character gets out of the car again and runs through the graveyard, causing more skulls to scatter.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 189, "extra_info": {"duration": 8.0, "events": ["SpongeBob speaks to Patrick.", "SpongeBob leaps forward.", "SpongeBob runs away.", "Patrick turns his head.", "Skeleton body is thrown to the right.", "Patrick sits and blinks.", "Patrick props himself up with one hand.", "Patrick leans back.", "Patrick looks down to his left."], "n_events": 9, "n_shots": 3, "n_subjects": 2}, "vid": 189}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "870.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, player number two from the blue team skillfully manages an airborne ball, then proceeds towards the goal of the orange team. With the defense of the orange team closing in, player number two swiftly pulls back the ball, masterfully fakes a move, and drives the ball towards the goal once again. Following another fake, player number two outsmarts player number four of the orange team, causing him to topple. Subsequently, player number two passes the ball inward, and player number twenty-one of the blue team kicks the ball into the net, scoring a goal. The player who scored the goal then begins to run in excitement.", "text": "On a soccer field, players in blue and red jerseys are actively engaged in a match. A blue team player receives the ball and starts dribbling towards the goal, evading red team defenders. The player continues to advance, skillfully maneuvering the ball past the defenders. As he approaches the penalty area, he is closely followed by a red team defender. The blue team player makes a quick move, causing the defender to fall to the ground. The player then takes a shot at the goal, aiming towards the bottom right corner. The goalkeeper dives in an attempt to save the shot but misses, and the ball goes into the net.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 870, "extra_info": {"duration": 11.98, "events": ["Player number two skillfully manages airborne ball.", "Player number two advances towards orange team's goal.", "Player number two performs swift pullback of the ball.", "Player number two fakes a move against the defense.", "Player number two drives the ball towards the goal again.", "Player number two executes another fake.", "Player number two causes player number four to topple.", "Player number two passes the ball inward.", "Player number twenty-one kicks the ball into the net.", "Player number twenty-one starts running in excitement."], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 870}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "831.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts a living room where some pieces of food are scattered on the floor. A brown and white dog is seen lying down while munching on these floor foods. Apart from this dog, there is another smaller dog with black and white fur. This small canine is being hand-fed by a person.", "text": "A dog is lying on a carpeted floor in a living room, with a few pieces of carrot scattered in front of it. The dog opens its mouth and looks around, then lowers its head to sniff the carrots on the floor. The dog picks up a carrot with its mouth and starts chewing. The scene then shifts to a small dog standing in front of a hand holding a piece of carrot. The small dog licks the carrot from the hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 831, "extra_info": {"duration": 10.01, "events": ["Pieces of food scatter on the floor.", "A brown and white dog lies down.", "The brown and white dog munches on floor foods.", "A smaller black and white dog appears.", "A person hand-feeds the smaller black and white dog."], "n_events": 5, "n_shots": 2, "n_subjects": 3}, "vid": 831}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "855.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "This video clip features two football teams, one dressed in yellow shirts while the other in white and blue. A player from the team dressed in blue and white successfully scores a goal despite the yellow team's attempt to block him. However, they fail to stop him. The crowd erupts in cheers for the renowned football player Messi.", "text": "A soccer match is taking place with players in yellow and white jerseys on a green field. The white team is advancing the ball towards the yellow team's goal, with several players from both teams positioned around the penalty area. A white team player dribbles the ball past the yellow team's defenders and takes a shot at the goal. The ball goes past the goalkeeper and into the net, resulting in a goal. The crowd in the stands, wearing blue and white, cheers and waves flags in celebration. The white team player who scored the goal runs towards the goal, while the yellow team players react to the goal. The match continues with the white team maintaining possession of the ball and advancing towards the yellow team's goal again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 855, "extra_info": {"duration": 10.0, "events": ["Two football teams compete on the field.", "One team wears yellow shirts.", "The other team dons white and blue.", "A player from the blue and white team shoots.", "The player scores a goal.", "The yellow team attempts to block the shot.", "The block attempt by the yellow team fails.", "The crowd erupts in cheers.", "The crowd celebrates football player Messi."], "n_events": 9, "n_shots": 5, "n_subjects": 4}, "vid": 855}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "829.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The individual clothed in black and wearing gloves tosses a ball. Another person, attired in white and positioned upright, intercepts and throws the ball towards a different direction. The person in black executes a dive in an attempt to catch the ball but misses. Simultaneously, the individual in white, also wearing gloves, succeeds in catching the ball.", "text": "The pitcher, wearing a white uniform with the number 17, prepares to throw the ball while the catcher and umpire are positioned behind home plate. The pitcher throws the ball, and the batter, also in a white uniform, stands ready to hit. The batter swings and makes contact with the ball, sending it towards the infield. The batter slides towards first base as the fielders react to the play. The fielder, wearing a white uniform with the number 19, tags the base while the umpire signals the play. The scene then cuts back to the pitcher preparing to throw the ball again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 829, "extra_info": {"duration": 7.0, "events": ["Individual in black tosses a ball", "Person in white intercepts the ball", "Person in white throws the ball", "Individual in black executes a dive", "Individual in black misses the ball", "Person in white catches the ball"], "n_events": 6, "n_shots": 2, "n_subjects": 8}, "vid": 829}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "749.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The dog shook the water off its body, then turned its head to look back.", "text": "A wet dog is seen shaking its body vigorously, causing water to splash around it. The dog continues to shake, with water droplets flying in various directions. The dog then turns its head to the left, looking directly at the camera. The background shows a green, outdoor setting with trees and some red foliage.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 749, "extra_info": {"duration": 10.71, "events": ["The dog shook the water off its body.", "The dog turned its head to look back."], "n_events": 2, "n_shots": 1, "n_subjects": 1}, "vid": 749}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "583.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a kindergarten playground where the children are playing a game of jumping over bamboo poles. Four little kids are standing with their hands on their hips, stretching out their right leg then retracting it, and then jumping over the bamboo pole with alternating legs to the left. Another child sits behind, holding onto the bamboo pole with hands opening and closing it.", "text": "A group of children are standing on a colorful playground mat, with a rope in front of them. The children begin to step onto the rope, starting with the child in the white jacket. The child in the white jacket lifts one foot and places it on the rope, followed by the child in the yellow jacket. The children continue to step onto the rope, with the child in the white jacket leading the way. The children maintain their balance as they move forward on the rope. The child in the white jacket continues to walk forward, while the other children follow behind. The children keep moving forward on the rope, with the child in the white jacket at the front.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 583, "extra_info": {"duration": 3.6, "events": ["Children play in a playground.", "Kids stretch out their right legs.", "Kids retract their right legs.", "Kids jump over a bamboo pole.", "Children alternate legs for jumping.", "A child sits behind the pole.", "The sitting child holds onto the bamboo pole.", "The sitting child opens and closes hands on the pole."], "n_events": 8, "n_shots": 1, "n_subjects": 2}, "vid": 583}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "699.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A foot clad in orange sports shoes steps onto a platform, then a pair of hands reach down to tie the shoelaces, and then the foot is placed back down.", "text": "The video begins with a close-up of a black bench in a blurred background, likely an indoor sports facility. A person wearing red shoes and white socks steps onto the bench. The person then bends down and starts tying the laces of their red shoes. After securing the laces, the person stands up and steps off the bench. The bench is left empty again, with the background remaining blurred.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 699, "extra_info": {"duration": 12.4, "events": ["Foot steps onto platform", "Hands reach down", "Hands tie shoelaces", "Foot is placed back down"], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 699}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "651.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows a person walking across a sandy area in a long black skirt, barefoot, leaving behind a trail of footprints. The video only displays the lower part of the walker's legs, from the calves down.", "text": "A person wearing a black dress is walking barefoot on a sandy desert, leaving a trail of footprints behind them. The dress flows with each step, and the person's feet sink slightly into the sand with each step. The background is a vast, empty desert with no other objects or people visible.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 651, "extra_info": {"duration": 8.76, "events": ["Person walks across sandy area.", "Person wears long black skirt.", "Person goes barefoot.", "Person leaves trail of footprints."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 651}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "851.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "An elder with blonde hair, donned in a blue sweater and a blue apron with flower print, is seen holding a vacuum-sealed package of plant-based fries. He unrolls the four corners of the package over the kitchen sink.", "text": "In a kitchen setting, a woman wearing a blue shirt and a floral apron holds a wrapped item in her right hand, which appears to be a piece of food, and shows it to the camera while speaking. She then uses her left hand to adjust the wrapping of the item, continuing to talk. The background includes a sink with a blue cloth hanging over it, a dish rack with various items, and a countertop with a round, baked item on a piece of parchment paper.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 851, "extra_info": {"duration": 4.0, "events": ["The elder holds a vacuum-sealed package.", "The elder unrolls the package over the sink."], "n_events": 2, "n_shots": 1, "n_subjects": 1}, "vid": 851}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "108.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman embraced the man and spun around once. A snowflake fluttered down from the sky, which the man caught with the tip of his index finger, and then both the man and woman kissed towards the snowflake.", "text": "A man with spiky hair and a woman with curly red hair are standing in front of a blue and red background, holding hands and smiling at each other. The man and woman begin to spin around while still holding hands. As they continue to spin, snowflakes start to appear in the air. The man and woman stop spinning and look at each other, with the man pointing at the woman. The woman then takes a snowflake from the man's hand and brings it to her mouth.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 108, "extra_info": {"duration": 4.5, "events": ["The woman embraced the man.", "They spun around once.", "A snowflake fluttered down from the sky.", "The man caught the snowflake with his index finger.", "The man and woman kissed towards the snowflake."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 108}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "325.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera angles down from the entrance of the camp towards the inside, where two girls are walking into the camp. Inside, many children are playing. The shot then cuts to a body of water, where a boy runs along a wooden platform by the lakeside before leaping into the water. Nearby, some people are kissing while others are walking forward. The camera then switches to capture a sign with the camp rules, and once again cuts to an archery range, where a girl is drawing a bow, with others beside her also engaged in archery activities.", "text": "Two girls walk towards a camp entrance labeled 'Camp Nightwing,' with many children playing in the background. A boy runs towards a dock, jumps off, and dives into the lake, creating a splash. The camera then shows a sign that reads 'Camp Nightwing Lake Rules,' stating that swimming is allowed but no camping is permitted. The scene shifts to a group of children practicing archery in an open field.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 325, "extra_info": {"duration": 4.5, "events": ["Two girls walk into the camp.", "Children play inside the camp.", "A boy runs along a wooden platform.", "The boy leaps into the water.", "Some people kiss by the body of water.", "Others walk by the body of water.", "A sign displays the camp rules.", "A girl is drawing a bow at the archery range.", "Others engage in archery activities."], "n_events": 9, "n_shots": 4, "n_subjects": 5}, "vid": 325}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "879.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the game, a character is seen pointing a red gun out of a window. Suddenly, the character is shot and jumps down to the grass from the second floor. The surroundings depict broken walls around the lawn. The character moves towards the wall and spots an enemy shooting from the second floor of the building opposite. The character returns fire towards the enemy in the opposite building.", "text": "The player, holding a gun with a red and yellow design, moves from a window onto a rooftop, aiming at a distant target. The player fires the gun, and the target is hit, causing a blue mark to appear on the wall. The player continues to move along the rooftop, aiming at another target. The player uses the scope to aim at a distant target, who is standing on a balcony. The player fires, and the target is hit, resulting in a 'You knocked out' message appearing on the screen. The player then moves back to the window, aiming at the target again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 879, "extra_info": {"duration": 8.01, "events": ["A character points a red gun out of a window.", "The character is shot.", "The character jumps down to the grass from the second floor.", "The character moves towards the wall.", "The character spots an enemy shooting from the opposite building.", "The character returns fire towards the enemy."], "n_events": 6, "n_shots": 1, "n_subjects": 1}, "vid": 879}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "438.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Fingers press the button on the table, and a little dog pops its head out from the center of the table. When a finger is placed on the dog's nose, the little dog spits a dice out from its mouth, and the two people begin to draw mahjong tiles. The scene shifts to the person opposite knocking down the mahjong tiles; white subtitles appear then disappear. The little dog lowers its head, and white subtitles appear again in the scene as the two people tidy up the mahjong tiles.", "text": "A dog is seen through a circular hole in a table, surrounded by blue and white tiles, with a small red object in front of it. A hand points at the red object, and the dog lifts its head to look at it. The hand then points at the dog's nose, and the dog remains still, watching the hand. The hand moves away, and the dog continues to look at the red object. The hand starts to move the tiles around the dog, and the dog lowers its head back down. The hand continues to move the tiles, and the dog remains in the same position.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 438, "extra_info": {"duration": 11.1, "events": ["Fingers press the button on the table", "A little dog pops its head out from the table", "A finger is placed on the little dog's nose", "The little dog spits a dice out from its mouth", "Two people begin to draw mahjong tiles", "Opposite person knocks down the mahjong tiles", "White subtitles appear then disappear", "The little dog lowers its head", "White subtitles appear again", "Two people tidy up the mahjong tiles"], "n_events": 10, "n_shots": 2, "n_subjects": 3}, "vid": 438}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "911.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, we are navigated through a dimly lit maze with gray walls that are adorned with bright blue lights. The only source of illumination comes from the direction our character is facing. This character keeps moving forward, turning left twice, then making a right turn, finally appearing at the exit where a huge glowing sphere stands amidst towering trees.", "text": "The video begins with a view of a dark, futuristic corridor with blue lights on the walls. The camera moves forward through the corridor, and the lights on the walls start to illuminate the path. As the camera continues to move, the lights on the walls become brighter, and the corridor appears to be well-lit. The camera approaches a door at the end of the corridor, and the lights on the walls are fully illuminated. The camera passes through the door, revealing a bright, glowing sphere outside. The sphere is shown in detail, with a complex, crystalline structure. The camera then moves back into the corridor, and the lights on the walls start to dim.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 911, "extra_info": {"duration": 8.0, "events": ["Video navigates through a dimly lit maze.", "Character moves forward within the maze.", "Character turns left.", "Character turns left again.", "Character makes a right turn.", "Character reaches the exit.", "A glowing sphere stands at the exit.", "Towering trees surround the sphere."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 911}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "972.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video depicts an interaction between an individual with long hair and one with short hair. The long-haired person says something, resulting in the short-haired person seeming helpless and unhappy.", "text": "In a modern, well-furnished living room with a beige sofa, colorful cushions, and a glass coffee table, a woman wearing a leather jacket and a necklace talks to a man holding a glass. The man, dressed in a purple shirt and holding a glass, responds to her. The man then turns and walks away, while the woman follows him, continuing to speak. The man walks towards a different part of the room, which features a pool with a flamingo float, a bar area, and various decorative items. The man stops and looks around, while the woman stands behind him, watching.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 972, "extra_info": {"duration": 8.0, "events": ["Individual with long hair speaks.", "Individual with short hair appears helpless.", "Individual with short hair looks unhappy."], "n_events": 3, "n_shots": 4, "n_subjects": 2}, "vid": 972}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "172.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The creature blinked and then let out a shout. The camera zoomed in, showing the creature's tongue wiggling. The shot then switched to glass shattering, with the two creatures beside it standing up to look ahead.", "text": "A yellow creature with large eyes and red appendages looks surprised and begins to swallow an orange object. The creature's mouth opens wide as it continues to swallow the object. The camera then shows a glass enclosure containing three creatures: a purple creature, a gray creature, and the yellow creature. The glass enclosure shatters, and the yellow creature reacts with a shocked expression. The gray creature and the purple creature also react to the shattered glass.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 172, "extra_info": {"duration": 4.25, "events": ["The creature blinked.", "The creature let out a shout.", "The creature's tongue wiggled.", "Glass shattered.", "Two creatures stood up.", "Two creatures looked ahead."], "n_events": 6, "n_shots": 2, "n_subjects": 3}, "vid": 172}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "523.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, a person is making a dish. First, they cut the meat into small pieces, then add greens, pickled vegetables, and marinated dried tofu to a bowl of rice. Next, they add pork trotter meat and the sliced char siu, half an egg, and finally drizzle with sauce. Text on the screen highlights the key steps and asks how much this bowl of cuisine costs.", "text": "A person wearing gloves is slicing a piece of roasted meat on a cutting board. The person then places the sliced meat onto a red bowl filled with rice. Next, they add some green vegetables on top of the rice. They continue by adding some pickled vegetables on top of the green vegetables. The person then places a piece of roasted meat on top of the pickled vegetables. They add more pieces of roasted meat on top of the previous layer. Finally, they place a halved boiled egg next to the meat and pour some sauce over the entire dish.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 523, "extra_info": {"duration": 9.03, "events": ["A person cuts the meat into small pieces", "A person adds greens to a bowl of rice", "A person adds pickled vegetables to the bowl", "A person adds marinated dried tofu to the bowl", "A person adds pork trotter meat to the bowl", "A person adds sliced char siu to the bowl", "A person places half an egg into the bowl", "A person drizzles sauce over the bowl", "Text on the screen highlights the key steps", "Text asks the cost of the bowl of cuisine"], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 523}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "311.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A man is standing in front of the wooden door of a stone house, looking back towards the direction of a girl. The camera films from behind the girl and then pans around her from left to right, rotating to face the girl front on.", "text": "A man stands in front of a wooden door of a stone building, looking at a woman with curly hair who is facing him. The man then turns and walks away from the door. The camera shifts to follow the woman as she turns and walks away from the building. She continues walking, and the camera captures her from behind as she moves further away. The woman then turns her head to look back towards the building.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 311, "extra_info": {"duration": 6.0, "events": ["A man stands in front of a wooden door.", "The man looks back towards the girl.", "The camera pans around the girl from left to right.", "The camera rotates to face the girl front on."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 311}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "708.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, a man and two women are chatting and drinking together. The two ladies are talking and smiling, while the gentleman picks up the bottle and takes a sip.", "text": "Three people are sitting at a table outdoors, with several beer bottles in front of them. The woman in the middle turns her head to look at the woman on her right, who is holding a green bottle. The woman in the middle then turns her head to look at the man on her left, who is holding a beer bottle. She starts talking and gesturing with her hand, pointing her finger. The man on the left takes a sip from his beer bottle. The woman in the middle continues to talk and gesture, while the woman on the right looks down and smiles.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 708, "extra_info": {"duration": 9.38, "events": ["A man and two women chat.", "Two women talk.", "Two women smile.", "The man picks up the bottle.", "The man takes a sip."], "n_events": 5, "n_shots": 1, "n_subjects": 3}, "vid": 708}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "118.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A fish and a mermaid were floating in the water. The fish swam upwards, only to find something covering the surface above. After swimming in a circle and coming back down, the fish spoke to the mermaid.", "text": "A woman in a pink top and blue mermaid tail is sitting on a large, spiky fish in an underwater environment with green seaweed in the background. The fish begins to rise, and the woman remains seated on it. The fish continues to ascend, moving away from the woman. The fish swims through a circular, patterned structure in the water. The fish turns its head and appears to be speaking. The fish continues to swim and speak, with its mouth open and eyes wide. The fish then turns its head to the side, still speaking.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 118, "extra_info": {"duration": 5.62, "events": ["A fish floated in water.", "A mermaid floated in water.", "The fish swam upwards.", "The fish found something covering the surface.", "The fish swam in a circle.", "The fish returned downwards.", "The fish spoke to the mermaid."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 118}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "132.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Snoopy opened his mouth wide, jumped down to the ground from the top of the red house, and then danced around with joy. The bald little boy came from the left carrying a bowl of food and placed it on the ground. Snoopy took a look and then started eating.", "text": "A dog is sitting on top of a red doghouse, looking around. The dog then jumps off the doghouse and lands on the ground. The dog rolls on the ground and starts to get up. A boy holding a bowl of food runs towards the dog. The boy places the bowl of food on the ground in front of the dog. The dog approaches the bowl and starts eating. The scene then switches back to the dog sitting on top of the doghouse, looking around.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 132, "extra_info": {"duration": 6.5, "events": ["Snoopy opened his mouth wide.", "Snoopy jumped down from the top of the red house.", "Snoopy danced around with joy.", "The bald little boy carried a bowl of food.", "The bald little boy placed the bowl on the ground.", "Snoopy took a look at the bowl.", "Snoopy started eating from the bowl."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 132}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "765.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A close-up shot of a person shaking hands with three people in turn.", "text": "The video shows a group of people standing in a circle, with the camera positioned above them. A woman extends her hand and shakes hands with a man in black clothing. Next, she shakes hands with a man in a light blue suit. Finally, she shakes hands with a man in a gray suit. The background is a plain floor, and the focus remains on the hands and the people's feet throughout the video.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 765, "extra_info": {"duration": 11.3, "events": ["Person initiates handshake with first individual.", "Person completes handshake with first individual.", "Person initiates handshake with second individual.", "Person completes handshake with second individual.", "Person initiates handshake with third individual.", "Person completes handshake with third individual."], "n_events": 6, "n_shots": 1, "n_subjects": 4}, "vid": 765}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "897.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video clip shows heavy traffic. A black car in the left lane is initially moving at a high speed, but suddenly decelerates. Subsequently, a gray car trailing the black car collides with it, causing the black car to flip over and hit the guardrail. It's unclear from the video if the gray car had accelerated or not.", "text": "The video shows a multi-lane highway with several cars driving in both directions under a clear blue sky. A black car in the right lane begins to veer towards the left lane, moving closer to the left lane divider. The black car continues to drift into the left lane, causing it to lose control and start to flip over. The black car flips completely, landing on its side and coming to a stop in the left lane. The other cars on the highway continue to drive normally, with no significant changes in their paths.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 897, "extra_info": {"duration": 9.01, "events": ["Heavy traffic flows.", "Black car moves at high speed.", "Black car suddenly decelerates.", "Gray car collides with black car.", "Black car flips over.", "Black car hits the guardrail."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 897}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "937.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video starts with a group of people walking along the beach. The scene then cuts to a long-haired child donned in a black hoodie and matching hat, taking a selfie on a beach by the sea. The child is seen reaching out with the left hand, while the right hand is holding the selfie device. The video's view initially elevates, circles the child, and finally lowers offering a sweeping view of the sea, sunset, lush greenery and the beach.", "text": "A group of people, including a man in a black tank top and shorts, a woman in a gray shirt, and two others, are walking on a sandy beach towards the ocean under a partly cloudy sky. The camera then shifts to a girl in a black hoodie running on the beach, with the camera capturing her from a low angle as she moves forward. The background shows the ocean and the setting sun, creating a serene and picturesque scene.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 937, "extra_info": {"duration": 5.0, "events": ["People walk along the beach", "Child in black hoodie takes a selfie", "Child reaches out with left hand", "Child holds selfie device with right hand", "View elevates around the child", "View circles the child", "View lowers to show the sea", "Sunset becomes visible", "Lush greenery becomes visible", "Beach comes into sweeping view"], "n_events": 10, "n_shots": 2, "n_subjects": 2}, "vid": 937}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "776.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two people reach out to grab a block-shaped object and place it on top of an already constructed object, then withdraw their hands.", "text": "A man and a girl are sitting on a bed, surrounded by scattered blocks of various colors, including pink, white, and blue. The girl, wearing a pink dress, is focused on stacking blocks on top of a larger block structure. The man, dressed in a light blue shirt and dark pants, is also engaged in stacking blocks, occasionally picking up blocks from the pile. The girl continues to add blocks to her structure, carefully placing them on top of the existing blocks. The man assists by picking up blocks and placing them on his own structure. The girl makes several adjustments to her stack, ensuring the blocks are stable. The man and the girl work together, with the girl adding more blocks to her structure while the man continues to build his own. The girl picks up a block from the pile and places it on top of her stack, while the man continues to add blocks to his structure. The girl makes final adjustments to her stack, ensuring it is stable.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 776, "extra_info": {"duration": 11.78, "events": ["Two people reach out to grab a block-shaped object", "They place the block-shaped object on top of another object", "They withdraw their hands from the block-shaped object"], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 776}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "584.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is an indoor volleyball court. A woman in a turquoise short-sleeved shirt steps forward with hands behind her and then leaps into the air, the camera follows her movement. Beside her, a woman in a yellow short-sleeved shirt jumps up and reaches out with both hands to hit the falling volleyball into the air. The woman in turquoise leaps, stretching out her right hand to hit the volleyball over the net to the opposite side, where a woman in a red short-sleeved shirt watches the approaching ball.", "text": "In an indoor volleyball court, a woman in a gray shirt and black shorts prepares to receive a volleyball, positioning herself with her hands raised. As the ball approaches, she jumps and extends her arms to hit it. The ball is directed towards the net, and a woman in a yellow shirt jumps to block it. The woman in gray continues her jump, reaching out to hit the ball over the net. The ball is successfully hit over the net, and the woman in gray lands back on the ground. In the background, other players and a coach are visible, observing the play.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 584, "extra_info": {"duration": 4.7, "events": ["Woman in turquoise shirt steps forward.", "Woman in turquoise shirt leaps into the air.", "Woman in yellow shirt jumps up.", "Woman in yellow shirt reaches out to hit volleyball.", "Woman in turquoise shirt stretches out right hand.", "Volleyball is hit over the net.", "Woman in red shirt watches approaching ball."], "n_events": 7, "n_shots": 1, "n_subjects": 3}, "vid": 584}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "219.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The screen keeps switching angles, the man scoops up metal shavings with his right hand, places them into a bottle in his left hand, and then sets the bottle on the table. His hands then start to fiddle with another object.", "text": "In a dimly lit workshop filled with various jars and tools, a man holds a jar and begins to pour its contents into another jar. He continues to pour the contents, ensuring all the material is transferred. After emptying the jar, he places it on the table and picks up another jar. He then pours the contents of the second jar into the first jar. Finally, he places the second jar back on the table.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 219, "extra_info": {"duration": 5.38, "events": ["The screen switches angles.", "The man scoops up metal shavings with his right hand.", "The man places the metal shavings into a bottle.", "The man sets the bottle on the table.", "The man starts to fiddle with another object."], "n_events": 5, "n_shots": 4, "n_subjects": 1}, "vid": 219}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "555.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A pair of hands scoop up a school of small fish from the water, their bodies wiggling as the stream flows on.", "text": "In a shallow stream with moss-covered rocks, a person reaches into the water and begins to scoop up several fish. The person carefully gathers the fish, which are swimming around in the stream, and lifts them out of the water. The fish are held in the person's hands, showing their scales and colors clearly. The person then releases the fish back into the stream, allowing them to swim away.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 555, "extra_info": {"duration": 8.1, "events": ["Hands scoop up a school of small fish.", "Small fish wiggle their bodies.", "The stream flows on."], "n_events": 3, "n_shots": 1, "n_subjects": 2}, "vid": 555}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "80.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A boy and a girl were on the rooftop of a building, both holding a kite as they ran to the edge of the roof and threw the kite out. The kite plunged straight down, and the two of them looked at the kite string in their hands with disappointment. Then suddenly, their eyes widened; the camera cuts to behind them, where the kite rose from below and spread all its arms.", "text": "A boy and a girl are running on a rooftop, holding a kite string. They stop at the edge of the rooftop and look up at the kite flying in the sky. The kite starts to descend, and the girl looks surprised while the boy continues to hold the string. The kite falls to the ground, and the girl looks down at it with a disappointed expression. The boy and girl stand together, looking at the fallen kite. The kite is seen on the ground, and the boy and girl look at it with concern.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 80, "extra_info": {"duration": 5.38, "events": ["A boy and a girl were holding a kite on a rooftop.", "They ran to the edge of the roof.", "The boy and the girl threw the kite out.", "The kite plunged straight down.", "The boy and the girl looked at the kite string.", "The boy and the girl showed disappointment.", "Their eyes widened.", "The kite rose from below.", "The kite spread all its arms."], "n_events": 9, "n_shots": 3, "n_subjects": 2}, "vid": 80}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "803.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video records that a sunny day, a person with black sunglasses, stud ears, a gray coat and a shaggy face, standing besides the motorcycle, threw a small cigarette case or box casually to the ground under the trees with left hand. Then this person raised the left hand to keep out the wind. The person's right hand turned on the lighter and lighted the cigarette. The green lighter and began to smoke on a road surrounded by thick green trees.", "text": "A man stands next to a motorcycle on a forest path, holding a cigarette in his hand. He throws the cigarette on the ground and picks up a lighter. He lights the cigarette and takes a puff. The man then exhales smoke and looks around the forest. He turns his head to the left and then back to the front, continuing to smoke.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 803, "extra_info": {"duration": 6.97, "events": ["Person throws cigarette case to ground", "Person raises left hand to block wind", "Person turns on lighter with right hand", "Person lights cigarette", "Person begins to smoke"], "n_events": 5, "n_shots": 5, "n_subjects": 1}, "vid": 803}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "89.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video shows a black screen, and towards the end, a blue cat enters the screen from the bottom right side.", "text": "The video starts with a completely black screen. Gradually, a blue cat appears from the right side of the screen, walking towards the left. The cat continues to walk steadily across the screen, maintaining a consistent pace. The cat's reflection is visible on the black background as it moves. The cat exits the frame to the left, leaving the screen empty again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 89, "extra_info": {"duration": 5.12, "events": ["A black screen appears.", "A blue cat enters from the bottom right."], "n_events": 2, "n_shots": 1, "n_subjects": 1}, "vid": 89}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "52.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A character wearing a mask walks forward, passing by a little white rabbit carrying a bucket. The camera angle switches to film from under a tree up towards a treehouse, from which the heads of two characters subsequently peek out.", "text": "A man wearing a mask and a white coat stands next to a rabbit dressed in an orange shirt with a 'GB' logo, who is holding a bucket filled with candy. The man places his hand on the rabbit's shoulder, and the rabbit turns its head to look at him. The man then moves closer to the rabbit, and they both look at the bucket. The scene transitions to a tree with a small house built into it, where a ghost is hanging from a rope. The ghost opens the door of the house, and three characters, including the man and the rabbit, appear inside the house.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 52, "extra_info": {"duration": 3.88, "events": ["A character walks forward.", "A little white rabbit carries a bucket.", "Two characters peek out from a treehouse."], "n_events": 3, "n_shots": 2, "n_subjects": 4}, "vid": 52}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "667.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "One person scoops the chopped cherry tomatoes and cucumber from the cutting board into a glass bowl, with other vegetables also in the bowl and on the table.", "text": "In a kitchen setting, a person is seen holding a wooden cutting board with sliced cucumbers and tomatoes in their left hand and a knife in their right hand. The person begins to transfer the sliced cucumbers and tomatoes from the cutting board into a glass bowl filled with leafy greens and other vegetables on the table. The table also has a plate with more tomatoes and a head of lettuce in the foreground. The person continues to scrape the remaining pieces of cucumber and tomato from the cutting board into the bowl. After transferring all the vegetables, the person moves the cutting board away from the bowl.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 667, "extra_info": {"duration": 5.94, "events": ["Person scoops chopped cherry tomatoes into glass bowl.", "Person scoops chopped cucumber into glass bowl.", "Other vegetables rest in the bowl.", "Vegetables lay on the table."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 667}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "522.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Three women are practicing rope gymnastics in a gym, moving in unison to extend their legs, then spinning in the air, followed by opening their legs and inverting their bodies.", "text": "In a spacious room with a wooden floor and a ceiling with square panels and lights, three women are suspended in the air using purple aerial silk hammocks. They start by holding onto the hammocks with one hand and one leg extended horizontally. Gradually, they begin to rotate their bodies while maintaining their grip on the hammocks. As they continue, they lift their bodies off the ground, performing a series of spins and flips. The women then transition into a handstand position, with their legs extended horizontally and their heads facing down. They hold this inverted position for a moment before starting to rotate back to their original positions.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 522, "extra_info": {"duration": 6.2, "events": ["Three women practice rope gymnastics in a gym.", "Women move in unison.", "Women extend their legs.", "Women spin in the air.", "Women open their legs.", "Women invert their bodies."], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 522}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "99.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the open ground in front of a building, a soldier in green clothes and a dog are playing. The dog runs towards the distant buildings, leaps up, then turns and runs back to the soldier. Both the person and the dog run around in circles.", "text": "In a town square with a church and tents in the background, a man in a green outfit stands and watches as a small dog runs towards him. The man bends down slightly, and the dog continues running. The man then starts running after the dog, which is now ahead of him. The man raises his hands in the air as he runs, and the dog continues to run away. The man stops and watches the dog as it runs further away.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 99, "extra_info": {"duration": 6.62, "events": ["A soldier plays with a dog.", "The dog runs towards buildings.", "The dog leaps up.", "The dog turns.", "The dog runs back to the soldier.", "The soldier and the dog run in circles."], "n_events": 6, "n_shots": 1, "n_subjects": 2}, "vid": 99}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "48.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The boy looked at the skeleton, then turned and ran to the water's edge. He then raised his hands high, kicked the water, and splashed around, while the skeleton watched from the side.", "text": "A boy in a red jacket and white shirt is seen talking to a skeleton with a cracked skull, who is wearing a purple outfit. The boy continues to speak while the skeleton listens attentively. The boy then turns around and starts running towards the water. He jumps into the water, creating a splash. The skeleton remains seated on the rocks, watching the boy as he jumps.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 48, "extra_info": {"duration": 3.88, "events": ["The boy looked at the skeleton.", "The boy turned.", "The boy ran to the water's edge.", "The boy raised his hands high.", "The boy kicked the water.", "The boy splashed around.", "The skeleton watched from the side."], "n_events": 7, "n_shots": 2, "n_subjects": 2}, "vid": 48}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "650.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A butterfly is fluttering its wings on a yellow flower. In the background, the blurred figure of a maroon butterfly first flies from left to right, then from right to left, and finally, the video ends with the butterfly flying away from the yellow flower.", "text": "A butterfly with white and black patterned wings is perched on a yellow flower, surrounded by green leaves. The butterfly flaps its wings while remaining on the flower. The background is a blurred green, indicating a natural outdoor setting.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 650, "extra_info": {"duration": 12.22, "events": ["Butterfly flutters wings on yellow flower", "Maroon butterfly flies from left to right", "Maroon butterfly flies from right to left", "Maroon butterfly flies away from yellow flower"], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 650}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "302.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Inside the first magnificent building, a man pressed the elevator button, then stepped back and waited for the elevator to arrive, before turning his head to look to the left.", "text": "A man in a dark suit is seen on a porch with a lantern hanging from the ceiling and a dark night sky in the background. He reaches out and closes the door in front of him. After closing the door, he steps back a few steps and looks at the door. He then extends his hand towards the door handle again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 302, "extra_info": {"duration": 5.5, "events": ["A man pressed the elevator button.", "A man stepped back.", "The elevator arrived.", "A man turned his head to the left."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 302}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "772.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The woman, holding the girl, sits on the sofa, waving together at someone in front of them. During this time, the woman turns her head to glance at the girl.", "text": "An elderly woman and a young girl are sitting on a brown sofa in a living room with two abstract paintings on the wall behind them. The elderly woman, dressed in a white shirt and white pants, and the young girl, wearing a white sleeveless top and dark pants, are both smiling and waving at the camera. The elderly woman occasionally looks at the young girl while continuing to wave. The young girl also waves and looks at the camera. The elderly woman continues to wave and talk, while the young girl maintains her wave and smile.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 772, "extra_info": {"duration": 6.38, "events": ["The woman sits on the sofa.", "The woman holds the girl.", "The woman and the girl wave together.", "The woman turns her head.", "The woman glances at the girl."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 772}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "985.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the video, a player dressed in a green t-shirt, black shorts and white shoes is seen holding a ping-pong ball, poised to serve it towards a black ping-pong table. To the right of the player, there is a purple table and a referee dressed in a blue shirt. The player in the blue shirt and shorts receives the ball and hits it once before landing a powerful hit that scores them a point. This player, visibly elated, lets out a shout, clenching their fist in celebration. A participant in a black shirt and others in grey cardigans mimic this celebratory gesture. The champion player is seen at the top of the game screen, aligned with the players in black off the sidelines.", "text": "A man in a green shirt is seen preparing to serve a table tennis ball, holding it in his right hand while leaning over the table. The camera then shows a wider view of the table tennis match, with the man in green serving the ball to his opponent in blue. The opponent in blue returns the serve, and the man in green moves to return the ball. The opponent in blue wins the point and celebrates by raising his arms. The camera then shifts to the audience, who are seen cheering and clapping in excitement. \nThe video ends with a close-up of the man in green preparing to serve again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 985, "extra_info": {"duration": 10.0, "events": ["Player holds a ping-pong ball.", "Player serves the ball towards the table.", "Opponent in blue shirt receives the ball.", "Opponent hits the ball once.", "Opponent lands a powerful hit.", "Opponent scores a point.", "Opponent in blue shirt shouts in celebration.", "Opponent clenches their fist.", "Participant in black shirt mimics the celebration.", "Others in grey cardigans mimic the celebration."], "n_events": 10, "n_shots": 4, "n_subjects": 3}, "vid": 985}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "278.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two people are wrestling on the mat; one person is forcefully pinning down the other. The referee is lying on the ground, observing them. The camera switches to a man in blue, who is intently watching the action. The camera switches back to the competition area where the person underneath is struggling fiercely. The referee waves his hand, pointing diagonally upwards. Many people in the background are watching the match.", "text": "A wrestler in a black and yellow uniform is pinning another wrestler in a blue headgear on the mat, while a referee in a black and white striped shirt closely observes the match. The referee leans in and appears to be speaking to the wrestlers. The camera then cuts to a man in a blue shirt standing in front of a 'Down' poster, who is watching the match intently. The camera switches back to the wrestling match, showing the referee raising his hand and then pointing towards the wrestlers. The background features a gymnasium with spectators in yellow shirts cheering and holding signs, including one that reads 'Huskies #1'.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 278, "extra_info": {"duration": 5.25, "events": ["Two people are wrestling on the mat.", "One person is pinning down the other.", "The referee is observing from the ground.", "The man in blue is watching the action.", "The person underneath is struggling fiercely.", "The referee waves his hand.", "People in the background are watching the match."], "n_events": 7, "n_shots": 3, "n_subjects": 5}, "vid": 278}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "54.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A character with blonde hair opened the door, and a little girl and an adult in green clothing stood at the entrance. The little girl was holding a hat with a green brim, with its opening facing upwards. The camera then switched to the interior, where many characters dressed in Halloween decorations were inside the house, eating and chatting.", "text": "A girl with red hair opens the door and holds a green plate, looking at a person with yellow hair and a green outfit with a red bow tie standing outside. The girl then looks forward and smiles. The scene shifts to the interior of a house where a person dressed in a cat costume, holding a plate of food, shakes hands with a character dressed as a sandwich. The camera pans left to show other characters, including a rabbit in a police uniform, a girl in a pink dress, and a character in a witch costume, all holding plates of food. The girl in the pink dress hands a piece of food to the character in the witch costume, who takes it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 54, "extra_info": {"duration": 5.38, "events": ["Blonde-haired character opened the door.", "Little girl stood at the entrance.", "Adult in green clothing stood at the entrance.", "Little girl held a hat with green brim.", "Characters dressed in Halloween decorations were inside the house.", "Characters ate.", "Characters chatted."], "n_events": 7, "n_shots": 2, "n_subjects": 4}, "vid": 54}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "209.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Ahead, a person walks forward with a gun raised, the camera cuts to three soldiers standing back to back, holding their guns and looking around.", "text": "The video begins with a first-person view through a scope, showing a dimly lit, debris-filled room with a window in the background. The scope's view moves forward, revealing more of the room's interior, including a curtain and some scattered objects. The camera then shifts to show a group of soldiers, all equipped with helmets and tactical gear, moving cautiously through the room. The soldiers are seen aiming their weapons as they advance, with one soldier in the foreground and others following closely behind. The video concludes with the scope's view returning to the room, showing the soldiers continuing their movement.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 209, "extra_info": {"duration": 5.88, "events": ["A person walks forward with a gun raised.", "Three soldiers stand back to back.", "Soldiers hold their guns.", "Soldiers look around."], "n_events": 4, "n_shots": 4, "n_subjects": 3}, "vid": 209}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "270.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "One person grabs the wire mesh with one hand, then picks up a box from the ground, and walks away with the child next to them.", "text": "A man in a suit stands by a chain-link fence, holding onto it with his right hand while looking outside. He then turns away from the fence and walks towards a boy who is standing nearby, holding a plastic bag and a box. The man bends down and picks up the box from the ground while the boy watches. The man and the boy then start walking together, with the man carrying the box and the boy holding the plastic bag. They walk away from the fence, moving towards the left side of the frame, with a cityscape featuring tall buildings and flags in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 270, "extra_info": {"duration": 5.75, "events": ["Person grabs the wire mesh.", "Person picks up a box from the ground.", "Person walks away.", "Child walks next to the person."], "n_events": 4, "n_shots": 2, "n_subjects": 2}, "vid": 270}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "668.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A leopard walks from right to left across an upside-down tree trunk, then turns around and walks to the right.", "text": "A leopard is seen walking on a fallen log in a lush, green forest environment. The leopard pauses and looks around, turning its head to survey its surroundings. It then continues to walk along the log, moving from one side to the other. The leopard's movements are deliberate and cautious as it navigates the log and the forest floor.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 668, "extra_info": {"duration": 8.98, "events": ["A leopard walks across an upside-down tree trunk.", "The leopard turns around on the tree trunk.", "The leopard walks to the right on the trunk."], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 668}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "485.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A lady is exercising inside a room, her hands gripping loops of rope, suspending herself upside down in the air. She continuously performs a variety of high-difficulty maneuvers in the air, and finally, she flips down to stand on the table below.", "text": "In a room with large windows and a wooden floor, a woman is performing aerial silk dance. She starts in an inverted position with her legs spread apart and her body hanging upside down. She then brings her legs together and begins to rotate her body while maintaining her grip on the silk. The woman continues to rotate her body, gradually bringing her legs closer together and her head closer to her feet. She completes the rotation and starts to lower her body towards the platform. Finally, she lands on the platform, still holding onto the silk, and returns to an inverted position.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 485, "extra_info": {"duration": 8.67, "events": ["A lady grips loops of rope.", "She suspends herself upside down.", "The lady performs high-difficulty maneuvers.", "She continuously moves in the air.", "Finally, the lady flips down.", "She stands on the table below."], "n_events": 6, "n_shots": 1, "n_subjects": 1, "old_duration": 8.67, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/85.mp4"}, "vid": 485}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "517.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A girl in a short skirt is skateboarding. She is not very skilled yet, as her footing and mounting on the board are a bit shaky. The girl slides forward, turns around, and continues to skate. The skateboard comes to a stop due to insufficient momentum. The girl sways her body forward slightly, propelling the skateboard forward once more.", "text": "A girl in a school uniform, consisting of a white shirt, plaid skirt, and white sneakers, is skateboarding on a paved area in front of a building with red tents and trees in the background. She initially pushes the skateboard forward with her foot and then steps onto it, balancing herself. She adjusts her hair with her hand while continuing to skate. The girl then turns around and skates towards the camera, maintaining her balance.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 517, "extra_info": {"duration": 9.0, "events": ["A girl mounts the skateboard.", "The girl exhibits shaky footing.", "The skateboard slides forward.", "The girl turns around on the skateboard.", "She continues to skate.", "The skateboard comes to a stop.", "The girl sways her body forward.", "The skateboard propels forward again."], "n_events": 8, "n_shots": 1, "n_subjects": 1}, "vid": 517}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "656.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a grassy field, four people are holding hands in a circle, with adult men and women facing each other, each holding the hands of two little girls, as they turn counterclockwise.", "text": "In a grassy field with trees in the background, a family of four, consisting of two adults and two children, holds hands and forms a circle. They begin to spin around in a clockwise direction, maintaining their circle formation. The family continues to spin, with the children in the middle and the adults on the outside, all smiling and enjoying the moment. The sunlight creates a warm glow around them as they keep spinning, creating a joyful and carefree atmosphere.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 656, "extra_info": {"duration": 8.13, "events": ["Four people hold hands in a circle.", "Adult men and women face each other.", "Each adult holds the hands of two little girls.", "The group turns counterclockwise."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 656}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "160.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a dirty and messy room, a person is holding a device and scanning the leftover food in a bowl. The camera then switches to another direction, where a bald man picks up a white towel from the sofa. Behind him, an agent is taking notes with a pen. The bald man turns the towel to face the camera, revealing a face pattern on the towel.", "text": "A man in a suit is kneeling on the floor next to a table, holding a yellow device in his left hand and a black object in his right hand, which he is moving over a bowl of chips and a can of soda on the table. The room is messy with orange liquid splattered on the walls, couch, and floor. The scene then shifts to another man in a suit standing in a different room, holding a white cloth with orange liquid and a face design on it. He unfolds the cloth and shows it to another man in a suit who is writing on a notepad by the window.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 160, "extra_info": {"duration": 5.12, "events": ["A person scans leftover food in a bowl.", "A bald man picks up a white towel from the sofa.", "An agent takes notes with a pen.", "The bald man turns the towel to face the camera.", "A face pattern is revealed on the towel."], "n_events": 5, "n_shots": 2, "n_subjects": 3}, "vid": 160}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "60.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "At an intersection, a child and an adult watch as a group of panicked people run from left to right. Someone has lost their hat in the rush. The camera then shifts to the direction from which the people are running, revealing a monster with its arms wide open, swaying as it moves forward, kicking aside a wooden barrel that blocks its path.", "text": "A group of people, including a man in a hat, a man in a vest, a man in a suit, a woman in a dress, and a child, run down a cobblestone street in front of a café. As they run, a woman in a dress throws her hat, which lands on the ground. The scene then shifts to a bear walking on its hind legs towards a clothesline with various garments hanging on it. The bear continues to walk forward, and a barrel on the ground rolls over.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 60, "extra_info": {"duration": 5.5, "events": ["A child and an adult watch a group run.", "People run from left to right.", "A person loses their hat.", "A group of people exhibit panic.", "A monster sways forward.", "The monster extends its arms wide.", "The monster kicks a wooden barrel aside.", "Panicked people rush at an intersection."], "n_events": 8, "n_shots": 2, "n_subjects": 4}, "vid": 60}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "504.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the table, there is a ball of dough. Someone used a tool to cut four even lines into the dough without actually slicing it through. Then, they used scissors to snip a flower shape in the center of the dough, placing a red date in the center of the flower. Next, they used a comb to work on the dough's surface. Finally, they used scissors to cut out a flower shape on the outer side of the dough.", "text": "A white ball of dough is placed on a wooden cutting board, surrounded by dried red chilies, a tomato, and some dried fruits. A person uses a knife to cut the dough in half. The person then makes additional cuts, creating a star-like pattern on the dough. The person continues to cut the dough into smaller sections, forming a flower-like shape. The person uses a small tool to add details to the flower pattern, making it more intricate. The person then uses a pair of scissors to make final adjustments to the flower pattern. Finally, the person cuts the dough into smaller pieces, completing the transformation.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 504, "extra_info": {"duration": 15.1, "events": ["Someone cut four lines into the dough.", "Someone snipped a flower shape in the dough's center.", "Someone placed a red date in the flower center.", "Someone combed the dough's surface.", "Someone cut out a flower shape on the dough's outer side."], "n_events": 5, "n_shots": 1, "n_subjects": 1}, "vid": 504}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "592.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is a path in a park. Two little monkeys are running forward. A person holding food appears in front of the camera. The little monkey wearing yellow clothes runs over, grabs the food with its mouth, and turns to leave. The little monkey in red clothes didn't get any and reaches out to grab the clothes of the other monkey, trying to snatch the food from its mouth. A person holding food appears again in front of the camera. They turn to face the camera. The little monkey wearing pink clothes successfully snatches the food and eats with it in its mouth, while the other monkey is eating food in front of the camera. It looks back and then the food is snatched away by another monkey beside it and eaten up.", "text": "Two monkeys are seen crawling on a gravel path towards a person holding a piece of food. The person offers the food to the monkeys, and they eagerly reach for it. The monkeys, one in a yellow shirt and the other in a pink shirt, grab the food from the person's hand. The person then offers more food, and the monkeys continue to reach for it. The person offers another piece of food, and the monkeys, now closer, start eating it. The person offers more food, and the monkeys continue to eat it. \nThe person offers another piece of food, and the monkeys, now sitting on the gravel, continue to eat it.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 592, "extra_info": {"duration": 12.8, "events": ["Two little monkeys run forward on a path.", "Monkey in yellow clothes grabs food with mouth.", "Monkey in yellow clothes turns to leave.", "Monkey in red clothes tries to snatch food.", "Person holding food reappears.", "Monkey in pink clothes snatches the food.", "Monkey in pink clothes eats the food.", "Monkey in front of camera eats food.", "Monkey looks back.", "Another monkey snatches and eats the food."], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 592}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "497.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A lady wearing a blue skirt, holding a wine bottle in one hand and a wine glass in the other, is surfing on a surfboard in the sea. While surfing, she pours a glass of wine from the bottle into the glass and then drinks it.", "text": "A woman in a blue dress and sunglasses is standing on a surfboard in the ocean, holding a bottle in her right hand and a glass in her left hand. She raises the bottle and begins to pour its contents into the glass. After pouring, she lowers the bottle and brings the glass to her mouth to drink. The background shows a clear blue sky with some clouds and waves crashing around her.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 497, "extra_info": {"duration": 4.73, "events": ["Lady holds wine bottle and glass while surfing.", "Lady pours wine from bottle into glass.", "Lady drinks wine from glass."], "n_events": 3, "n_shots": 1, "n_subjects": 1, "old_duration": 7.9, "old_video_file": "/mnt/bn/videonaslq/videos/test_videos/DREAM-1k_fixed/97_no_audio.mp4"}, "vid": 497}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "369.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "One person is wiping another person's knee with a cotton ball.", "text": "A person is seen holding a cotton pad and gently wiping a wound on another person's knee, which is covered by a piece of clothing. The hand holding the cotton pad moves in a circular motion, applying pressure to the wound. The background is blurred, focusing on the action of the hand and the wound.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 369, "extra_info": {"duration": 4.12, "events": ["One person wipes another person's knee.", "A cotton ball is used during the wiping."], "n_events": 2, "n_shots": 1, "n_subjects": 2}, "vid": 369}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "91.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The two large doors were open, and a soldier wearing a helmet charged in on horseback. Following him, another soldier dressed in blue also rode in through the doors. After entering, the blue-clad soldier immediately dismounted and turned to close the large doors.", "text": "The scene begins with a closed wooden door set in a stone archway. A person in green riding a white horse enters through the door. Another person in blue riding a brown horse follows, also entering through the door. The person in blue dismounts from the horse and walks towards the door. The person in blue closes the door behind them. The door is fully closed, and the scene returns to the initial view of the closed door. The person in green reappears on the horse, entering through the door again. The person in blue follows, dismounts, and closes the door behind them.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 91, "extra_info": {"duration": 5.25, "events": ["The doors opened", "A soldier charged in on horseback", "Another soldier rode in", "The blue-clad soldier dismounted", "The blue-clad soldier closed the doors"], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 91}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "11.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A knight in blue and gold armor duels with an opponent who leaps for an aerial attack.", "text": "In a desert setting with a clear sky, a man in a blue and red armored suit with a mustache and holding a sword stands in front of a woman in a red and black armored suit. The man begins to move forward, and the woman starts to crouch down. The man continues to advance, and the woman remains crouched. The man gets closer to the woman, who is still crouched. The man is now very close to the woman, who is still crouched. The man continues to move forward, and the woman remains in a crouched position. The man is almost touching the woman, who is still crouched. The man is now very close to the woman, who is still crouched. The man continues to move forward, and the woman remains crouched. The man is now very close to the woman, who is still crouched. The man continues to move forward, and the woman remains crouched. The man is now very close to the woman, who is still crouched. The man continues to move forward, and the woman remains crouched. The man is now very close to the woman, who is still crouched. The man continues to move forward, and the woman remains crouched. The man is now very close to the woman, who is still crouched. The man continues to move forward, and the woman remains crouched. The man is now very close to the woman, who is still crouched. The man continues to move forward, and the woman remains crouched. The man is now very close to the woman, who is still crouched. The man continues to move forward, and the woman remains crouched. The man is now very close to the woman, who is still crouched. The man continues to move forward, and the woman remains crouched. \nThe man is now very close to the woman, who is still crouched. The man continues to move forward, and the woman remains crouched. The man is now very close to the woman, who is still crouched. The man continues to move forward, and the woman remains crouched. The man is now very close to the woman, who is still crouched. The man continues to move forward, and the woman remains crouched. The man is now very close to the woman, who is still crouched. The man continues to move", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 11, "extra_info": {"duration": 1.29, "events": ["A knight in blue and gold armor duels.", "Opponent leaps for an aerial attack."], "n_events": 2, "n_shots": 1, "n_subjects": 2}, "vid": 11}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "441.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The yellow text on the screen disappears, and the woman demonstrates the technique of quickly peeling bamboo shoots with a sickle. She first cuts the shoot from top to bottom, then peels back the skin to reveal the middle part of the shoot, chops off the top of the shoot, continues to use a hoe to dig around the base of the shoot, and then sits down to scrape and chop off the excess parts of the bamboo shoot with the sickle. The camera remains fixed. The background is a forest.", "text": "In a forested area, a woman dressed in camouflage clothing is seen holding a large, yellowish plant with a thick stem. She begins by cutting the top part of the plant with a knife, removing the outer layers. She continues to peel off the outer layers, revealing the inner part of the plant. The woman then uses a tool to further strip the plant, removing more layers and exposing the inner core. She continues to peel the plant, ensuring all the outer layers are removed. The woman then uses a tool to cut the plant into smaller sections, preparing it for further processing.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 441, "extra_info": {"duration": 15.08, "events": ["Yellow text disappears from the screen", "Woman cuts bamboo shoot from top to bottom", "Woman peels back skin of bamboo shoot", "Woman chops off top of bamboo shoot", "Woman uses hoe to dig around bamboo shoot base", "Woman sits down", "Woman scrapes bamboo shoot with sickle", "Woman chops off excess parts of bamboo shoot"], "n_events": 8, "n_shots": 4, "n_subjects": 1}, "vid": 441}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "233.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A character wrapped in bandages and a group of children run forward. They are looking to the left as a person dressed as a rabbit greets them, who then runs ahead to them. The character wrapped in bandages and the group of children stand in place, looking towards the direction of the camera, while two other characters enter the frame from the right, performing some actions.", "text": "Three children, dressed in casual clothes, are running on a sidewalk decorated with Halloween decorations, including a scarecrow, tombstones, and a large white tree with black branches. They are being chased by a person in a mummy costume. The children notice a person dressed as a rabbit, holding a bag, standing in front of a house decorated with pumpkins and a large white tree. The rabbit-costumed person waves at the children and then starts running towards them. The children stop running and stand still, watching the rabbit-costumed person.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 233, "extra_info": {"duration": 4.88, "events": ["A character wrapped in bandages runs forward.", "A group of children runs forward.", "The group looks to the left.", "A person dressed as a rabbit greets the group.", "The person dressed as a rabbit runs ahead to the group.", "The character wrapped in bandages stands in place.", "The group of children stands in place.", "The group looks towards the camera's direction.", "Two other characters enter the frame from the right.", "Two characters perform some actions."], "n_events": 10, "n_shots": 3, "n_subjects": 3}, "vid": 233}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "145.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the camera's view, there is a towering tree. Someone falls from above and hits a branch, which sends them bouncing off in a different direction, then strikes another branch. The branch once again deflects the falling person. As the shot changes to show the person being launched by a branch, it becomes apparent there are two people, each being propelled in different directions. A girl in blue clothing hits a branch while lying prone and continues to fall. A blonde boy, head first, hits another branch and keeps tumbling down. The girl's head strikes yet another branch, and with her eyes wide open, she slides down.", "text": "In a forest with a large tree and a sunset in the background, a tree branch breaks, causing a girl and a boy to fall. The girl and boy tumble through the air, with the girl landing on the ground first, followed by the boy. The girl then climbs up the tree trunk, while the boy follows her. The girl looks down at the boy as he climbs up the tree.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 145, "extra_info": {"duration": 7.75, "events": ["Someone falls from the tree.", "The person hits a branch.", "The branch sends the person in a different direction.", "The person strikes another branch.", "The shot reveals two people.", "A girl in blue hits a branch while prone.", "The girl continues to fall.", "A blonde boy hits a branch head first.", "The boy keeps tumbling down.", "The girl's head strikes another branch."], "n_events": 10, "n_shots": 5, "n_subjects": 2}, "vid": 145}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "79.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video displays a scene underwater, where a person releases two grab handles with both hands. The camera then cuts to the face of a boy and subsequently switches to the boy's back as a statue of a Chinese dragon sinks into the depths of the water.", "text": "A pair of hands is seen gripping and releasing a golden, segmented object underwater. The camera then shifts to show a boy's face, who appears to be looking at the object. The object, which resembles a golden, segmented creature with a face, begins to float away from the boy. The object continues to drift further away, becoming smaller and less distinct as it moves into the distance.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 79, "extra_info": {"duration": 5.5, "events": ["A person releases two grab handles underwater.", "The statue of a Chinese dragon sinks into the water."], "n_events": 2, "n_shots": 3, "n_subjects": 1}, "vid": 79}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "446.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a farm, a strange creature covered in a whole cowhide charged at a lady dressed in a long robe with its horns. Frightened by the encounter, the lady ran inside the house. A man inside, armed with a wooden stick, came out. The odd creature then stood upright, lifted the cowhide, revealing that it was actually two people playing a prank. The man with the stick chased after them, and as they threw away the cowhide, they fled.", "text": "A woman in a pink dress is leading a brown cow through a straw-covered area with banana trees in the background. She approaches a small shop and enters it, closing the door behind her. A man inside the shop, holding a stick, steps out and starts chasing the woman and the cow. The woman and the cow run away, and the man continues to pursue them with the stick.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 446, "extra_info": {"duration": 4.0, "events": ["Creature covered in cowhide charged at lady.", "Lady ran inside the house.", "Man armed with stick came out of house.", "Creature stood upright.", "Creature lifted the cowhide.", "People inside creature revealed as pranksters.", "Man with stick chased the pranksters.", "Pranksters threw away the cowhide.", "Pranksters fled."], "n_events": 9, "n_shots": 1, "n_subjects": 4}, "vid": 446}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "149.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In the room, an old woman turns around to face the camera and scatters a row of gems forward. On the right side, a blonde woman, holding her shoulders, looks on. The camera zooms out to reveal the trail of gems extending to the feet of a man in red. The shot switches to the feet of the old woman, who steps on the first gem. The gem emits a glow, and then the gems ahead light up one by one. The radiance forms a beam of light that gradually extends towards the man in red.", "text": "In a room filled with various objects including a desk, plants, and framed pictures, a woman in a blue dress turns around and throws a ball. The ball flies towards a man in a purple outfit who is standing with his arms crossed. The ball hits the man's foot, causing a bright light to appear at the point of contact. The light intensifies, and the man reacts by holding his chest in pain.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 149, "extra_info": {"duration": 5.12, "events": ["Old woman turns around to face the camera.", "Old woman scatters a row of gems forward.", "Blonde woman looks on, holding her shoulders.", "Camera reveals trail of gems extending to man's feet.", "Old woman steps on the first gem.", "First gem emits a glow.", "Gems light up one by one.", "Beam of light forms.", "Radiance extends towards the man in red."], "n_events": 9, "n_shots": 2, "n_subjects": 3}, "vid": 149}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "355.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two men are looking at the ostrich in front of them. The ostrich moves forward and turns its head to look around. The camera cuts to the man on the left turning his head to look at the man on the right, and they both nod slightly and smile.", "text": "Two men are standing side by side, looking at an ostrich in front of them. The man on the left is wearing a yellow shirt with a hexagonal logo and a brown jacket, while the man on the right is wearing a blue jacket and a blue beanie. The ostrich moves its head closer to the men, and the man on the left starts to smile. The camera then shifts to show the men from behind, revealing a scenic background with mountains and a grassy field. The man on the left turns his head to look at the man on the right, who is also looking at the ostrich.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 355, "extra_info": {"duration": 11.12, "events": ["Two men look at the ostrich.", "The ostrich moves forward.", "The ostrich turns its head to look around.", "The man on the left turns his head.", "The two men nod slightly.", "The two men smile."], "n_events": 6, "n_shots": 2, "n_subjects": 3}, "vid": 355}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "645.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera shoots from below the surface of shallow water, capturing a person's foot in the water. A hand releases a fish into the water and, once it enters, the fish quickly twists its body and swims away. The hand that released the fish withdraws from the water and gradually moves away from the surface while a clump of yellow floating matter is stirred and tumbled by the movements of the fish.", "text": "A person is holding a large fish underwater, with the fish's body and fins clearly visible. The person begins to release the fish, and it starts to swim away. As the fish swims away, a small yellow object, possibly a piece of bait or a small fish, is released and starts to float upwards. The fish continues to swim away, and the yellow object continues to float upwards, moving further away from the person's hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 645, "extra_info": {"duration": 6.5600000000000005, "events": ["Camera captures a person's foot in water.", "Hand releases a fish into the water.", "Fish twists its body in the water.", "Fish swims away in the water.", "Hand withdraws from the water.", "Yellow floating matter is stirred.", "Yellow floating matter tumbles by fish's movement."], "n_events": 7, "n_shots": 1, "n_subjects": 2}, "vid": 645}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "859.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features individuals dressed in red and green dresses, black t-shirts, and blue hoodies, all of them smiling cheerfully and applauding. It also shows two distinctive individuals seated beside each other— the video's author, sporting a gray hoodie, holds a knife, and another person wearing glasses, clothed in a black T-shirt containing white script. The person in glasses, garbed in the black T-shirt, looks towards the table, while the individual in the gray hoodie extends the knife towards the cake, ready to cut it. All these individuals have congregated around the table to mark the birthday of the person clad in the gray hoodie, who is also adorned with earrings.", "text": "In a well-decorated room with paintings on the walls and a lit table lamp, a woman in a red dress claps her hands while sitting on a chair. The camera pans left to reveal two other women, one in a green dress and the other in a black top, also clapping. A woman in a blue hoodie stands behind the woman in black, clapping and smiling. The camera continues to move, showing more people in the room, including a woman in a gray hoodie and a man with glasses, all clapping and smiling.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 859, "extra_info": {"duration": 9.0, "events": ["People dressed in colorful attire applaud.", "The video's author holds a knife.", "A person in glasses looks towards the table.", "The individual in the gray hoodie extends the knife.", "Individuals congregate around the table.", "A cake is ready to be cut."], "n_events": 6, "n_shots": 1, "n_subjects": 3}, "vid": 859}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "835.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In this video clip, the person is riding a red motorcycle and a yellow bus with two front lights on is coming. To avoid hitting the yellow bus, the person riding a motorcycle chooeses to turn left. If not, there would be an accident. Some people including drivers or passengers would get hurt or be in danger. And the bus driver would be frightened.", "text": "A person is riding a red motorcycle with 'MOM LOVE DAD GIFT' written on the fuel tank, navigating through a busy street market. The street is filled with people, some walking and others standing near stalls, with various goods displayed. A cow is seen walking on the road, and the rider maneuvers the motorcycle to avoid it. A yellow bus approaches from the opposite direction, and the rider continues to navigate around the cow and the bus. The rider successfully passes the cow and the bus, continuing to navigate through the crowded street. The rider then turns the motorcycle to the left, heading towards the left side of the street.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 835, "extra_info": {"duration": 6.0, "events": ["Person rides red motorcycle", "Yellow bus approaches with lights on", "Motorcycle turns left", "Bus driver gets frightened"], "n_events": 4, "n_shots": 1, "n_subjects": 3}, "vid": 835}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "637.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On a baking pan, there are several pancakes, and a person is holding tongs, showing the movement of a few pancakes' positions, and then flips the pancakes one by one from left to right.", "text": "Several tortillas are being grilled on a flat-top grill, with some already showing signs of toasting. A person, wearing a blue shirt and gray pants, uses their hand to check the doneness of the tortillas by pressing on them. The person then uses a pair of tongs to flip one of the tortillas, lifting it off the grill and placing it back down.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 637, "extra_info": {"duration": 5.42, "events": ["A baking pan holds several pancakes.", "A person grasps tongs.", "The person shows movement of pancakes' positions.", "The person flips the pancakes one by one."], "n_events": 4, "n_shots": 1, "n_subjects": 1}, "vid": 637}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "582.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is an outdoor playground. A group of people are standing face to face on a suspension bridge, swaying it. A girl in black falls from the bridge. Afterwards, a man in a black coat jumps down to check on the girl's condition, followed by another man wearing a hat who also jumps down. The two men walk over to the girl, and the first man reaches out to pull the girl up, while the group on the suspension bridge watches them.", "text": "A group of people are gathered around a colorful inflatable structure, with some standing on the edge and others on the structure itself. A woman in a black outfit loses her balance and falls onto the inflatable structure. She rolls down the structure, and a man in black clothing jumps down to help her. The man in black clothing reaches the woman and helps her up. The woman stands up with the man's assistance, and they both move away from the inflatable structure.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 582, "extra_info": {"duration": 10.2, "events": ["Group of people sway the suspension bridge", "Girl in black falls from the bridge", "Man in black coat jumps down from bridge", "Man wearing hat jumps down from bridge", "Two men walk over to the fallen girl", "First man reaches out to pull the girl up", "Group on the bridge watches the situation"], "n_events": 7, "n_shots": 1, "n_subjects": 4}, "vid": 582}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "756.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A group of children run forward, among them a boy bends over to pick up a football and walks to the right, the other children look towards him.", "text": "A group of children are running across a grassy field towards the camera, with trees and a setting sun in the background. As they get closer, the children continue running, and one child in the front starts to bend down to pick up a soccer ball. The child successfully picks up the ball and begins to run with it, while the other children continue to run behind. The child holding the ball runs past the camera, and the other children follow closely behind.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 756, "extra_info": {"duration": 11.2, "events": ["Group of children run forward", "Boy bends over", "Boy picks up football", "Boy walks to the right", "Other children look towards boy"], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 756}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "610.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a room, a child scoops a shovel of soil from a white basin and pours it into the flower pot in front of them, then uses the shovel to level the soil.", "text": "A child, wearing a pink long-sleeve shirt, is seen using a small yellow shovel to scoop soil from a white paper pot. The child continues to scoop soil and transfer it into a brown plastic pot. The background includes other potted plants and gardening tools, indicating a gardening activity. The child repeatedly scoops soil from the white pot and places it into the brown pot, ensuring the brown pot is filled with soil.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 610, "extra_info": {"duration": 8.32, "events": ["Child scoops soil with a shovel", "Child pours soil into the flower pot", "Child uses shovel to level the soil"], "n_events": 3, "n_shots": 1, "n_subjects": 1}, "vid": 610}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "44.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A rabbit carrying a group of Smurfs runs forward, and the camera pans to the right to show the hills and the woods.", "text": "A rabbit with a blue and white outfit is seen running through a lush, colorful garden filled with vibrant flowers and greenery. The rabbit continues to run, moving from the left side of the frame to the right, passing by various plants and flowers. As the rabbit runs, the camera follows its movement, gradually panning to the right, revealing more of the scenic landscape. The background transitions from a close-up view of the garden to a wider view of a vast, green field with rolling hills and scattered trees in the distance.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 44, "extra_info": {"duration": 5.38, "events": ["A rabbit carries a group of Smurfs.", "The rabbit runs forward."], "n_events": 2, "n_shots": 1, "n_subjects": 3}, "vid": 44}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "405.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Within a garden, a girl opens her arms and falls backward. The camera then switches to a bungee jumping platform, where a staff member gently pushes the girl in front. Like before, she opens her arms and falls backward. The girl repeatedly falls through the air and then gets pulled back up by the rope, with the camera following her movements up and down.", "text": "A woman, dressed in a black jacket, gray pants, and white sneakers, stands on a paved pathway under a metal structure with greenery in the background, preparing to jump. She jumps off the platform, and a man in a green jacket assists her by pushing her off. The woman is seen mid-air, holding onto a pink cushion attached to a bungee cord, as she descends towards the water below. She swings back and forth over the water, with the bungee cord stretching and contracting. The woman continues to swing, reaching the lowest point of her descent and then bouncing back up.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 405, "extra_info": {"duration": 9.43, "events": ["A girl opens her arms in a garden.", "The girl falls backward.", "A bungee jumping platform appears.", "A staff member pushes the girl gently.", "The girl opens her arms on the platform.", "The girl falls backward off the platform.", "The girl repeatedly falls through the air.", "A rope pulls the girl back up.", "The camera follows the girl's movements.", "The girl moves up and down."], "n_events": 10, "n_shots": 2, "n_subjects": 3}, "vid": 405}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "461.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A handsome long-haired gentleman kneels on the ground while a lady dressed in green holds a smartphone in front of him to record. Two assistants stand motionless behind him. The gentleman shakes his head in front of the camera to perform. The lady in green reaches out to caress the gentleman's cheek, and he grabs her hand, pulling it toward his shoulder. The lady then retracts her hand and lifts the gentleman's tie. He first bows his head, then lifts it up, and awkwardly shakes his body a bit before the video ends.", "text": "A man in a black suit is kneeling on the floor, looking at a woman who is holding a phone and taking a selfie. The man reaches out and touches the woman's face, then grabs her arm and pulls it towards him. He then takes the tie from the woman's hand and starts to tie it around his neck. The man looks down at the tie and then looks up at the woman, smiling. In the background, there are people standing and watching the interaction.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 461, "extra_info": {"duration": 15.03, "events": ["The gentleman kneels on the ground", "The lady holds a smartphone to record", "Two assistants stand behind the gentleman", "The gentleman shakes his head", "The lady caresses the gentleman's cheek", "The gentleman grabs the lady's hand", "The gentleman pulls the lady's hand toward his shoulder", "The lady retracts her hand", "The lady lifts the gentleman's tie", "The gentleman shakes his body"], "n_events": 10, "n_shots": 1, "n_subjects": 3}, "vid": 461}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "422.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a corridor, a person is holding a pot of chrysanthemum water with one hand, preparing to water the plants. However, a cat is blocking the path ahead. Although the person stops in their tracks, the chrysanthemum water still spills onto the cat's head. The camera then cuts to the upstairs, where the same person is once again holding a pot of chrysanthemum water, preparing to descend the stairs. The video has subtitles that accompany the actions of the character.", "text": "A person walks down a hallway holding a basin filled with water and yellow flowers, approaching a cat sitting at the end of the hallway. The person says '妈呀妈呀' and '走咱们去洗澡' as they get closer to the cat. The cat, initially sitting and watching, starts to move towards the basin. The cat sniffs the flowers in the basin and then turns around, walking away from the basin. The cat sits down with flowers on its head, looking slightly displeased. The person then walks up a staircase, carrying the basin of flowers.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 422, "extra_info": {"duration": 7.08, "events": ["Person holds pot of chrysanthemum water.", "Person prepares to water plants.", "Cat blocks the person's path.", "Person stops walking.", "Chrysanthemum water spills onto cat's head.", "Person holds pot of chrysanthemum water upstairs.", "Person prepares to descend the stairs."], "n_events": 7, "n_shots": 6, "n_subjects": 2}, "vid": 422}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "477.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The background is an urban park, where a man is preparing to cross a small river in the park with a dog in tow. There are several stepping stones in the river for people to cross. The man lets the dog go ahead while he follows, holding the leash. Just as they are crossing a few of the stepping stones, the little dog accidentally falls into the river. After a few floundering moments in the water, it starts to swim backwards towards the shore. The man, holding the leash, helps the little dog back onto land. Other people in the park are watching them.", "text": "A man is walking a dog on a stone path by a pond, with a woman and a stroller in the background. The dog steps onto the first stone but loses its balance and falls into the water. The man quickly reaches out to help the dog, but the dog struggles to get out of the water. The man continues to assist the dog, which eventually manages to get back on the stone path. The man and the dog continue walking on the stone path, with the woman and the stroller still in the background.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 477, "extra_info": {"duration": 6.23, "events": ["A man prepares to cross the river.", "Dog goes ahead on stepping stones.", "Man follows holding the leash of the dog.", "The little dog falls into the river.", "The dog flounders in the water momentarily.", "Dog starts to swim backwards toward the shore.", "Man helps the dog back onto land.", "Other people in the park watch the incident."], "n_events": 8, "n_shots": 1, "n_subjects": 3}, "vid": 477}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "443.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The camera zooms in for a close-up on a person wearing brown gloves, drawing a straight line on marble with a pencil. The camera then pulls back, capturing the man from the front, with his back to the camera, using tools as he coordinates with both hands to repeatedly cut away the top layer of the stone. The camera zooms in again to show the details of the cut, then reveals a smooth, flat surface. The background features a room with a brown cabinet.", "text": "A person wearing gloves marks a line on a large white stone block using a pencil. The person then uses a chisel and hammer to break off a piece of the stone along the marked line. After removing the broken piece, the person uses a tool to smooth and clean the edge of the stone block. The person then marks another line on the stone block with a pencil.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 443, "extra_info": {"duration": 12.63, "events": ["Person wearing brown gloves draws a line on marble.", "Man uses tools to cut away the top layer of stone.", "Camera reveals a smooth, flat surface.", "Background features a room with a brown cabinet."], "n_events": 4, "n_shots": 7, "n_subjects": 1}, "vid": 443}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "876.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A car is observed moving forward on a road with a red car to its left. Both cars proceed along the road, with the red car moving at a faster pace. Suddenly, the red car swerves to the right, coming to a halt directly in front of the first car. The driver of the first car fails to brake in time, subsequently colliding with the red car. This impact results in a shattered window and a dented front door on the red car.", "text": "The video shows a clear day at an intersection with a modern building on the left and a few trees and utility poles in the background. A white van is seen approaching from the right side of the frame, while a red car appears from the left side and moves towards the center of the frame. The red car continues to move forward, and the white van gets closer to the intersection. The red car's windshield starts to show signs of damage as it gets closer to the intersection. The red car's windshield is visibly cracked, and the white van is now very close to the intersection.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 876, "extra_info": {"duration": 6.0, "events": ["A car moves forward on a road.", "A red car proceeds alongside to the left.", "The red car moves at a faster pace.", "The red car swerves to the right.", "The red car comes to a halt in front.", "The first car fails to brake in time.", "The first car collides with the red car.", "A window shatters on the red car.", "A front door dents on the red car."], "n_events": 9, "n_shots": 1, "n_subjects": 2}, "vid": 876}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "269.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person is making a phone call, looking at the call list on the paper, the pen moving upwards, all the way to the top, where he draws a circle around the name of the topmost person.", "text": "A man is holding a phone to his ear, speaking into it. He then uses a red pen to circle the name 'Walter Ribbon' on a list of names and phone numbers. The man continues to circle the name 'Walter Ribbon' multiple times. He then hangs up the phone and places the pen down. The man is seen again, still holding the phone to his ear.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 269, "extra_info": {"duration": 12.62, "events": ["A person makes a phone call.", "The person looks at the call list on the paper.", "The person's pen moves upwards.", "The person draws a circle around a name."], "n_events": 4, "n_shots": 7, "n_subjects": 1}, "vid": 269}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "225.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The blue robot turned around and looked down, a spherical object was rolling in the midst of a ruined city.", "text": "A blue robot is seen from a low angle, moving its arm and body against a backdrop of a partly cloudy sky and damaged buildings. The robot then steps on a spherical object, which begins to roll forward. The spherical object continues to roll, gaining momentum as it moves through the debris-strewn street. The robot remains stationary as the spherical object rolls further away, eventually becoming a small dot in the distance.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_live_action", "task": "video/caption", "idx": 225, "extra_info": {"duration": 4.5, "events": ["The blue robot turned around.", "The blue robot looked down.", "A spherical object was rolling.", "The spherical object moved amid ruins."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 225}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "104.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Three figures walked through the corridor from the end of the passage, the camera switched to the back view of the three figures, with the last person carrying a backpack, from which the head of a dog poked out.", "text": "In a dimly lit corridor with wooden floors and metal walls, two soldiers in green uniforms and caps are seen through a glass door. They turn and start running towards the door, with one soldier slightly ahead of the other. The soldiers continue running and exit through the door, leaving the corridor empty. The camera then shifts to show the back of one of the soldiers, revealing a small animal, possibly a raccoon, peeking out of a green backpack on his back. The soldier turns his head to look back, and the animal also turns its head to look in the same direction.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 104, "extra_info": {"duration": 5.0, "events": ["Three figures walked through the corridor.", "The last person carried a backpack.", "A dog's head poked out from the backpack."], "n_events": 3, "n_shots": 2, "n_subjects": 4}, "vid": 104}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "552.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The man runs forward, plunges into the water with his hands hugging his knees, causing water to splash in all directions. The camera follows the shot, then switches to beneath the surface, where a man is seen moving underwater while holding something.", "text": "A shirtless man in black shorts and orange shoes runs on a rocky path towards a cliff edge, with a rocky cliff in the background. He approaches another person who is taking a photo of him. The man jumps off the cliff, performing a dive into a greenish body of water below. The camera captures the splash as he enters the water. Underwater, the man is seen swimming towards the camera, holding a camera in his hand.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 552, "extra_info": {"duration": 6.28, "events": ["The man runs forward.", "The man plunges into the water.", "The man hugs his knees.", "Water splashes in all directions.", "The man moves underwater.", "The man holds something."], "n_events": 6, "n_shots": 2, "n_subjects": 2}, "vid": 552}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "82.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A gentleman in a suit handed a piece of blue clothing to the bespectacled man to his right, who in turn passed the clothing to the short-haired woman beside him. The short-haired woman pressed the clothing with an iron and then handed it to the short-bearded man on her right. The short-bearded man spritzed some perfume on the clothing and passed it to a hand beside him, which then took the clothing inside the house.", "text": "In a modern, stylish setting with a cityscape in the background, a man in a blue suit and bow tie gestures towards a blue suit on a hanger held by a man in a vest and glasses. The man in the vest takes the hanger with the blue suit and starts moving it to the right. A woman in a white top and green skirt, standing nearby, assists by holding the hanger. Another man in a vest and bow tie, holding a bottle of alcohol, follows the hanger with his hand. The man with the bottle continues to guide the hanger as they move towards a door. The man with the bottle then releases the hanger, and the door begins to close.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 82, "extra_info": {"duration": 4.12, "events": ["The gentleman handed a piece of blue clothing.", "The bespectacled man passed the clothing to the woman.", "The short-haired woman pressed the clothing with an iron.", "The woman handed the clothing to the short-bearded man.", "The short-bearded man spritzed perfume on the clothing.", "A hand took the clothing inside the house."], "n_events": 6, "n_shots": 1, "n_subjects": 5}, "vid": 82}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "779.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The audience cheered, some waving flags. Photographers on the field ran forward, the camera panned left, showing two teams in competition, players dressed in black passing the soccer ball between them. A referee stood nearby as players in red attempted to block. One of them fell down, and then players from both teams also tumbled to the ground.", "text": "A large crowd of spectators is seen waving flags and banners in the stands, with a group of police officers standing in formation on the field. The scene then shifts to a soccer match where players are actively engaged in the game. A player in a black jersey is seen dribbling the ball towards the goal, evading opponents in red and white jerseys. The player continues to advance, but is eventually tackled by a player in a red and white jersey, causing him to fall to the ground. The camera then cuts back to the crowd, showing the same scene of spectators waving flags and banners.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 779, "extra_info": {"duration": 18.55, "events": ["The audience cheered.", "Some audience members waved flags.", "Photographers ran forward on the field.", "Players dressed in black passed the soccer ball.", "The referee stood nearby.", "Players in red attempted to block.", "One player in red fell down.", "Players from both teams tumbled to the ground."], "n_events": 8, "n_shots": 1, "n_subjects": 4}, "vid": 779}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "196.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A person in a diving suit raises both hands, moves their body, turns around to pick up a book, and walks toward the left side of the frame. The camera then cuts to him walking forward, turning around to close the door.", "text": "In a dimly lit room filled with various objects and shelves, a person wearing a helmet with a green light on top turns around and raises their hands to adjust the helmet. They then bend down to pick up a book from a table. Holding the book, the person walks towards a green door, opens it, and exits the room.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 196, "extra_info": {"duration": 9.12, "events": ["Person raises both hands", "Person moves their body", "Person turns around", "Person picks up a book", "Person walks toward the left side", "Person walks forward", "Person turns around to close the door"], "n_events": 7, "n_shots": 2, "n_subjects": 1}, "vid": 196}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "980.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video displays a room with light gray walls. A white wardrobe stands at one end, with various toys scattered across the floor in front of it. Two white windows, adorned with gauzy white curtains, let in daylight. A circular, brown storage shelf is positioned under the window, with a white unicorn decoration on the upper right wall. Two framed pictures hang on the right wall. A toy box brimming with toys is tucked into the bottom right corner of the room, flanked by two light khaki polka-dotted pillows. Adjacent to the toy box, a white bed is furnished with two pink pillows and blankets in shades of pink, blue, and velvet. A pink quilt speckled with white polka dots covers miscellaneous items, hangers, and unwashed clothes. Beside the bed, a white dresser stands with six drawers. On top of it, a white lamp and other assorted items rest. A small round table, hosting a potted plant and random knick-knacks, is placed on the left of the dresser. Above the dresser, the wall features a round mirror framed by strings of tiny light bulbs. A small white chair is situated on the right of the dresser.", "text": "The video begins with a view of a closed door and a window with white curtains, with some toys scattered on the floor. As the camera moves to the right, a bed with a pink and blue blanket, various toys, and a blue tray on it comes into view. The bed is surrounded by more toys and a small circular object on the floor. The camera continues to pan right, revealing a wall decorated with string lights and a white dresser with a mirror, a lamp, and various items on it. 
The video ends with a view of the door and the window again.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 980, "extra_info": {"duration": 11.01, "events": [], "n_events": 0, "n_shots": 2, "n_subjects": 1}, "vid": 980}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "687.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "On the table, there are three hands, each holding a pen, with a notebook underneath. The hand on the right side of the screen is writing something in the notebook. Then, a person walks in from the left side of the frame and sits down facing the camera. The newcomer places a laptop on the table with both hands, then opens the laptop.", "text": "Two people are seated at a wooden table, each holding a pen and writing in notebooks. The person on the right places their pen on a piece of paper in front of them. Another person approaches from the left, places a laptop on the table, and begins to open it. The person who opened the laptop then sits down and starts using it. The person on the right continues writing in their notebook.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 687, "extra_info": {"duration": 5.82, "events": ["Three hands hold pens above a notebook.", "The right hand writes in the notebook.", "A person walks into the frame from the left.", "The person sits down facing the camera.", "The person places a laptop on the table.", "The person opens the laptop."], "n_events": 6, "n_shots": 1, "n_subjects": 4}, "vid": 687}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "155.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "At night, a green car smashed through the wooden fence ahead, then rolled forward and crashed into a house. Many people inside the house, holding guns, moved from other rooms to the one that was hit by the car and aimed their guns at the vehicle.", "text": "At night, a green car drives up and crashes into a house, flipping over and breaking through the roof. The car continues to flip and crash into the house, causing significant damage to the roof and walls. The scene then shifts to the interior of the house, where a group of people, including men in suits and a woman, enter the room holding guns. They approach the damaged car, which is now inside the house, and begin to inspect the scene.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 155, "extra_info": {"duration": 5.62, "events": ["Green car smashed through the wooden fence.", "Car rolled forward after breaking the fence.", "Vehicle crashed into a house.", "People inside the house moved to the impacted room.", "Occupants aimed their guns at the vehicle."], "n_events": 5, "n_shots": 2, "n_subjects": 2}, "vid": 155}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "878.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "The video features a person playing the game Pubg. After a brief moment of running, the gamer momentarily halts to open the M416 scope, targeting an adversary crouched behind a house, wearing a white helmet. There is a blue house situated on the player's right, and a land filled with trees on the left side. The player initiates an attack on the opponent, which results in a green smoke as the bullets make contact. Despite landing several shots accurately, the opponent is only knocked out, not completely eliminated. The player then proceeds to run towards the wounded opponent, across a white tarp.", "text": "The video begins with a first-person view inside a room with floral wallpaper and a framed picture on the wall, showing a gun aimed at an open door leading outside. The scene then transitions to an outdoor setting where the player, holding a gun, moves along the side of a weathered, blue wooden building with graffiti on it. The player continues to move forward, passing by a barrel and a pile of white cloth on the ground. As the player progresses, a message appears on the screen indicating a 'Restricting Play Area' and a 'You Knocked Out' notification. 
The player then turns and moves towards the open area, with the gun still aimed forward.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 878, "extra_info": {"duration": 5.0, "events": ["Person plays Pubg", "Gamer runs in-game", "Gamer opens M416 scope", "Gamer targets adversary behind house", "Adversary crouches wearing white helmet", "Gamer attacks opponent", "Bullets create green smoke", "Opponent gets knocked out", "Gamer runs towards wounded opponent", "Gamer crosses white tarp"], "n_events": 10, "n_shots": 1, "n_subjects": 1}, "vid": 878}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "476.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Inside a gym, two slim ladies are synchronously performing yoga. They start by standing up, then they stretch their bodies upward with their hands, followed by a forward bend and a back kick on one leg, and finally, they lift one foot to execute a challenging pose.", "text": "In a well-lit room with a wooden floor and a large window, two women stand on platforms with their hands on their hips. They then raise their arms and transition into a lunge position. The women move into a crawling position on the platforms, extending one leg backward while keeping the other leg bent. They continue to move in a crawling motion, alternating the position of their legs. Finally, they return to a standing position with their hands on their hips.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/vertical_video", "task": "video/caption", "idx": 476, "extra_info": {"duration": 15.05, "events": ["Two ladies start standing up inside a gym.", "Ladies stretch their bodies upward with their hands.", "They perform a forward bend.", "Each lady does a back kick on one leg.", "They lift one foot to execute a challenging pose."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 476}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "888.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "In a competitive game, two players engage in a shootout. One of the players alters their position to lie flat on the ground while shooting. Ultimately, the player who chose to lie down on the ground is the victorious one.", "text": "The player, holding a gun, moves forward in a futuristic indoor environment with blue and white walls and various equipment scattered around. The player encounters an enemy and engages in combat, taking damage as indicated by the health bar. The player continues to move forward, aiming and shooting at enemies, causing them to fall. The player then moves through a corridor, aiming and shooting at enemies as they appear. The player successfully takes down multiple enemies, as indicated by the 'X' marks and the 'New Player' notification on the screen. The player continues to move forward, aiming and shooting at enemies as they appear.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/youtube", "task": "video/caption", "idx": 888, "extra_info": {"duration": 6.0, "events": ["Two players engage in a shootout.", "One player alters their position.", "Player lies flat on the ground.", "Player shoots from the ground.", "The lying player wins the shootout."], "n_events": 5, "n_shots": 1, "n_subjects": 2}, "vid": 888}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "641.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "Two children are staring at the automatic dough mixer in front of them; the little girl wearing a red and white checked shirt lifts her head and says a few words, while the child next to her keeps watching the rotor turning inside the dough mixer.", "text": "A white stand mixer is mixing batter in a metal bowl, with the mixer's paddle rotating. A girl in a red checkered shirt is watching the mixer intently, resting her chin on her hand. She occasionally looks up and around, then back at the mixer. The background is decorated with string lights and star-shaped ornaments.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/videvo", "task": "video/caption", "idx": 641, "extra_info": {"duration": 20.12, "events": ["Two children stare at the automatic dough mixer.", "The little girl lifts her head.", "The little girl says a few words.", "The child next to the little girl watches the rotor."], "n_events": 4, "n_shots": 1, "n_subjects": 2}, "vid": 641}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "187.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "SpongeBob is running in circles, while the jellyfish sway their bodies up and down. The camera zooms in and switches to the scene of Squidward taking a bath, and then it switches again to Squidward talking while scrubbing himself.", "text": "SpongeBob runs along a path towards a large, blue, tower-like house with a pineapple-shaped house on the right and a red, dome-shaped house on the left. The camera zooms in on the blue house as SpongeBob continues running. The scene transitions to the inside of the house, showing a shower curtain with a silhouette of SpongeBob taking a bath. The camera zooms in on the silhouette, revealing a green character with a purple hat and a sponge in hand, washing themselves under the shower. The character continues to wash, smiling and enjoying the bath.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 187, "extra_info": {"duration": 17.38, "events": ["SpongeBob runs in circles", "Jellyfish sway bodies up and down", "Squidward takes a bath", "Squidward talks", "Squidward scrubs himself"], "n_events": 5, "n_shots": 3, "n_subjects": 3}, "vid": 187}
+{"messages": [{"content": [{"reference": null, "text": null, "type": "video", "video": {"video_file": "161.mp4"}}, {"reference": null, "text": "Describe the video in detail.", "type": "text", "video": null}], "role": "user"}, {"content": [{"reference": "A female astronaut grabs a control stick in front of her. The camera switches to the exterior of the spacecraft, where a robotic arm extends. After swaying side to side for a few times, the arm lifts high and flings the object at the end of the robotic arm into space.", "text": "A woman in a blue NASA uniform is seen inside a spacecraft, looking concerned and turning her head to the side. She then turns back and presses a button on the control panel. The camera shifts to show the spacecraft's exterior, where a mechanical arm begins to extend from the spacecraft. The arm continues to extend and then starts to retract. The camera then returns to the woman, who is still looking concerned.", "type": "text", "video": null}], "role": "assistant"}], "dataset": "DREAM/movie_animation", "task": "video/caption", "idx": 161, "extra_info": {"duration": 9.88, "events": ["A female astronaut grabs a control stick.", "A robotic arm extends from the spacecraft.", "The robotic arm sways side to side.", "The robotic arm lifts high.", "The robotic arm flings the object into space."], "n_events": 5, "n_shots": 2, "n_subjects": 1}, "vid": 161}
diff --git a/eval_scripts/DREAM-1K/eval_DREAM-1K.sh b/eval_scripts/DREAM-1K/eval_DREAM-1K.sh
new file mode 100644
index 0000000000000000000000000000000000000000..15e97be5ec2096d2d21ff6baf7f1da877dbe92c5
--- /dev/null
+++ b/eval_scripts/DREAM-1K/eval_DREAM-1K.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+MODEL_PATH="path_to_AVoCaDO" # TODO
+OUTPUT_DIR="$1"
+
+mkdir -p "$OUTPUT_DIR"
+
+python eval_scripts/DREAM-1K/generate_caption.py \
+ --model_path "$MODEL_PATH" \
+ --save_path "$OUTPUT_DIR/model_caption.jsonl"
+
+bash eval_scripts/DREAM-1K/tarsier/scripts/run_evaluation_only.sh "$OUTPUT_DIR/model_caption.jsonl"
+
diff --git a/eval_scripts/DREAM-1K/generate_caption.py b/eval_scripts/DREAM-1K/generate_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..b238b002970002ba7d1c2c71e16c9be1295e30af
--- /dev/null
+++ b/eval_scripts/DREAM-1K/generate_caption.py
@@ -0,0 +1,91 @@
+import os
+import torch
+from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
+from qwen_omni_utils import process_mm_info
+import argparse
+import json
+from tqdm import tqdm
+from pathlib import Path
+
+VIDEO_MAX_PIXELS = 401408 # 512*28*28
+VIDEO_TOTAL_PIXELS = 20070400 # 512*28*28*50
+USE_AUDIO_IN_VIDEO = False
+os.environ['VIDEO_TOTAL_PIXELS'] = str(VIDEO_TOTAL_PIXELS)  # cap total pixels across sampled frames; per-frame cap is passed via max_pixels
+script_dir = Path(__file__).resolve().parent
+example_path = script_dir / "dream_example.jsonl"
+video_dir = "" # TODO
+
+parser = argparse.ArgumentParser(description="Evaluate a model and save results.")
+parser.add_argument("--model_path", type=str, required=True, help="Path to the model checkpoint.")
+parser.add_argument("--save_path", type=str, required=True, help="Path to save the evaluation results.")
+args = parser.parse_args()
+
+model_path = args.model_path
+fout_path = args.save_path
+
+f_example = open(example_path, 'r', encoding='utf-8')
+fout = open(fout_path, 'w', encoding='utf-8')
+
+model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
+ model_path,
+ torch_dtype=torch.bfloat16,
+ device_map="auto",
+ attn_implementation="flash_attention_2",
+)
+model.disable_talker()
+processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
+
+def chat(data):
+ conversation = [
+ {
+ "role": "system",
+ "content": [
+ {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
+ ],
+ },
+ {
+ "role": "user",
+ "content": [
+ {
+ "type": "video",
+ "video": data["video_path"],
+ "max_pixels": VIDEO_MAX_PIXELS,
+ },
+ {
+ "type": "text",
+ "text": data["question"]
+ },
+ ],
+ },
+ ]
+
+ text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
+ audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+ inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+ inputs = inputs.to(model.device).to(model.dtype)
+
+ text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, do_sample=False, thinker_max_new_tokens=2048)
+
+ text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+ model_generation = text.split("\nassistant\n")[-1]
+
+ return model_generation
+
+
+for line in tqdm(f_example):
+ data = json.loads(line)
+ video_path = os.path.join(video_dir, data["messages"][0]["content"][0]["video"]["video_file"])
+ question = "Imagine the video from these frames and describe it in detail."
+
+ temp_data = {
+ "video_path": video_path,
+ "question": question,
+ }
+ with torch.inference_mode():
+ response = chat(temp_data)
+
+    out_data = data  # alias of data: the input record is updated in place before writing
+    out_data["messages"][0]["content"][1]["text"] = question
+    out_data["messages"][1]["content"][0]["text"] = response
+ fout.write(json.dumps(out_data, ensure_ascii=False) + '\n')
+ fout.flush()
diff --git a/eval_scripts/DREAM-1K/tarsier/LICENSE b/eval_scripts/DREAM-1K/tarsier/LICENSE
new file mode 100644
index 0000000000000000000000000000000000000000..f49a4e16e68b128803cc2dcea614603632b04eac
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/LICENSE
@@ -0,0 +1,201 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
\ No newline at end of file
diff --git a/eval_scripts/DREAM-1K/tarsier/configs/tarser2_default_config.yaml b/eval_scripts/DREAM-1K/tarsier/configs/tarser2_default_config.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..f2fe7d68beeeb0960c17b5b0de6ab601bce7088e
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/configs/tarser2_default_config.yaml
@@ -0,0 +1,14 @@
+max_n_frames: 256
+n_frames: 16
+max_pixels: 460800 # 1280 * 720 // 2
+min_pixels: 0
+max_seq_len: 16384
+is_training: false # Affects: 1. frame sampling differs between training and testing; 2. the response is ignored at test time.
+print_data_error: true
+do_image_padding: false
+do_image_crop: false
+do_image_resize: false
+video_sampling_strategy: {'video_sampler_version': 'v1', 'force_frames_n_divisible': 1, 'use_multi_images_for_video': true}
+prompt: ""
+train_task: sft
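The `video_sampler_version: 'v1'` strategy selected in this config corresponds to uniform frame sampling that always keeps the first and last frame (implemented in `dataset/custom_data_parsers/utils.py`). A minimal standalone sketch of that selection rule:

```python
def sample_frame_indices_v1(total_frames: int, n_frames: int) -> list:
    # Uniform sampling that always includes the first and last frame,
    # mirroring _sample_frame_indices_v1 in dataset/custom_data_parsers/utils.py.
    if n_frames == 1:
        return [0]  # default to the first frame
    if total_frames <= n_frames:
        return list(range(total_frames))
    return [round(i * (total_frames - 1) / (n_frames - 1)) for i in range(n_frames)]

print(sample_frame_indices_v1(100, 5))  # [0, 25, 50, 74, 99]
```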
diff --git a/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/multi_images_parser.py b/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/multi_images_parser.py
new file mode 100644
index 0000000000000000000000000000000000000000..0a4887facf9c10b5acf2504fe6c618a42920bff8
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/multi_images_parser.py
@@ -0,0 +1,199 @@
+from typing import Dict, List
+import random
+import re
+from PIL import Image
+
+from .utils import sample_video, read_image
+
+class MultiImagesParser:
+ def __init__(
+ self,
+ n_frames=8,
+ is_training=True,
+ ):
+ self.n_frames = n_frames
+ self.is_training = is_training
+ # fmt: off
+ self.data_temp = {
+ "text": [
+ [{
+ "prompt": "Describe the image in short.",
+ "response": "A rollerblader rides high in a full pipe while others watch"
+ }],
+ [{
+ "prompt": "Describe the image in short.",
+ "response": "A woman in winter clothes is on the sidewalk with a phone."
+ }]
+ ],
+ "image": [
+ {
+ "image_file": "/mnt/bn/videonaslq/images/flickr30k/images/3371533654.jpg"
+ },
+ {
+ "image_file": "/mnt/bn/videonaslq/images/coco/train2014/COCO_train2014_000000177950.jpg"
+ },
+ {
+ "video_file": "/mnt/bn/llmdatalq/jiangnan/video_generation/webvid_10M_download/20230609/videos/011851_011900/1047443473.mp4",
+ "frame_indices": [0, 85, 171, 256, 342, 427, 513, 598]
+ }
+ ],
+ "dataset": "coco",
+ "task": "multi_images",
+ "image_processing_config": {},
+ }
+ # fmt: on
+
+ def check_format(self, data_dict: Dict, image_processing_config: Dict):
+ assert data_dict['dataset'] in ['coco', 'sharegpt4v_cap100k', 'sharegpt4v_mix665k', 'webvid', 'movie'], data_dict
+
+        # Multi-image data should not contain coordinate annotations at present
+ if image_processing_config.get('has_coordinates', False):
+ raise ValueError(f'do_crop and has_coordinates cannot be True at the same time in MultiImagesParser!')
+
+        # Check whether any coordinates can be matched in the text
+ texts = data_dict['text']
+ for text in texts:
+ match = re.search(r'\[(\d+(\.\d+)?,\s*)+\d+(\.\d+)?\]', text['prompt'] + text['response'])
+ if match:
+                print(f'[Warning] Data possibly containing coordinates detected: {data_dict}')
+
+
+ def transform(self, data_dict: Dict, image_processing_config: Dict = None) -> Dict:
+ self.check_format(data_dict, image_processing_config)
+
+ # shuffle
+ texts = data_dict['text']
+ images = data_dict['image']
+ images = self.load_images(images)
+ idxs = list(range(len(texts)))
+ random.shuffle(idxs)
+ texts = [texts[i] for i in idxs]
+ images = [images[i] for i in idxs]
+
+ # sample n_frames
+ if isinstance(self.n_frames, int):
+ n_frames = random.choice(list(range(1, self.n_frames + 1)))
+ else:
+ n_frames = random.choice(self.n_frames)
+ texts = texts[: n_frames]
+ images = images[: n_frames]
+
+ dataset = data_dict['dataset']
+ if dataset in ['coco', 'sharegpt4v_cap100k', 'webvid', 'movie']:
+ prompt, response = self.transform_for_caption_task(texts, dataset, images)
+ else:
+ prompt, response = self.transform_for_qa_task(texts, dataset, images)
+
+ messages = [
+ {
+ "role": "user",
+ "content": [
+ *[{"type": "image", "image": img} for img in images],
+ {"type": "text", "text": prompt},
+ ]
+ },
+ {
+ "role": "assistant",
+ "content": [
+ {"type": "text", "text": response}
+ ]
+ }
+ ]
+
+ return messages
+
+ def transform_for_caption_task(self, texts, dataset, images):
+ idx = random.choice(list(range(len(texts))))
+
+ if dataset == 'coco':
+ if len(texts) == 1:
+ prompt = 'Describe the image in short.'
+ else:
+ prompt = f'Describe the images starting from frame {idx + 1} in short in order.'
+ elif dataset == 'sharegpt4v_cap100k':
+ if len(texts) == 1:
+ prompt = 'Describe the image in detail.'
+ else:
+ prompt = f'Describe the images starting from frame {idx + 1} in detail in order.'
+ else:
+ if len(texts) == 1:
+ prompt = 'Describe the image.'
+ else:
+ prompt = f'Describe the images starting from frame {idx + 1} in order.'
+ response = ''
+ for i, text in enumerate(texts):
+ if i < idx:
+ continue
+ if not isinstance(text, dict):
+ text = random.choice(text)
+ resp = text['response']
+ response += f'{resp}\n'
+ return prompt, response
+
+ def transform_for_qa_task(self, texts, dataset, images):
+ prompt, response = '', ''
+ for i, text in enumerate(texts):
+ if not isinstance(text, dict):
+ text = random.choice(text)
+ if len(texts) > 1:
+ prompt += f'Question for frame {i+1}:\n' + text['prompt'] + '\n'
+ response += f'Answer to question of frame {i+1}:\n' + text['response'] + '\n'
+ else:
+ prompt += text['prompt'] + '\n'
+ response += text['response'] + '\n'
+ return prompt, response
+
+
+ def load_images(self, image_items: List[Dict]) -> List[Image.Image]:
+ """
+ image_items: List[Dict]. each item like:
+ {"video_file": "path/to/video", "frame_indices": [1]}
+ or
+ {"image_file": "path/to/image"}
+ """
+ if image_items is None:
+ raise ValueError(f'image_items is None!')
+
+ if isinstance(image_items, dict):
+ image_items = [image_items]
+
+ images = []
+
+ for image_item in image_items:
+
+ if 'video_file' in image_item:
+ file_key = 'video_file'
+ elif 'image_file' in image_item:
+ file_key = 'image_file'
+ else:
+ raise KeyError(f'video_file or image_file not in {image_item}')
+
+ file_path = image_item[file_key]
+ if file_key == 'video_file':
+ frame_indices = image_item.get('frame_indices', None)
+ if frame_indices is None:
+ raise ValueError(f'read 0 frame: {image_item}')
+ if isinstance(frame_indices, int):
+ frame_indices = [frame_indices]
+ frames = sample_video(file_path, frame_indices = frame_indices)
+ images.extend(frames)
+ else:
+ if isinstance(file_path, str):
+ file_path = [file_path]
+ images.extend([read_image(f) for f in file_path])
+
+ return images
+
+if __name__ == '__main__':
+ # python3 -m xenon_generation.data.custom_data_parsers.multi_images_parser
+
+ from tqdm import tqdm
+ from tools.rw_utils import read_jsonlines
+
+ lines = read_jsonlines('/mnt/bn/videonaslq/VideoCaption/datasets_1009/sharegpt4v_cap100k/part_36.jsonl')
+ lines = lines[:10]
+ parser = MultiImagesParser(n_frames=8)
+ for i, l in tqdm(enumerate(lines)):
+ l_image_processing_config = l.get('image_processing_config', {})
+ messages = parser.transform(l, l_image_processing_config)
+ print(messages)
\ No newline at end of file
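For reference, the per-frame question/answer assembly in `transform_for_qa_task` above can be exercised standalone; a minimal sketch with hypothetical sample data:

```python
def build_qa_prompt(texts):
    # Mirrors MultiImagesParser.transform_for_qa_task: with multiple frames,
    # questions and answers are numbered per frame; with a single frame they
    # are passed through unchanged.
    prompt, response = '', ''
    for i, text in enumerate(texts):
        if len(texts) > 1:
            prompt += f'Question for frame {i+1}:\n' + text['prompt'] + '\n'
            response += f'Answer to question of frame {i+1}:\n' + text['response'] + '\n'
        else:
            prompt += text['prompt'] + '\n'
            response += text['response'] + '\n'
    return prompt, response

prompt, response = build_qa_prompt([
    {'prompt': 'What color is the car?', 'response': 'Red.'},
    {'prompt': 'How many people are visible?', 'response': 'Two.'},
])
print(prompt.splitlines()[0])  # Question for frame 1:
```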
diff --git a/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/object_tracking_parser.py b/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/object_tracking_parser.py
new file mode 100644
index 0000000000000000000000000000000000000000..7edee291d1762ca1c07e425b4e17e2e8d561fda8
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/object_tracking_parser.py
@@ -0,0 +1,160 @@
+from typing import Dict
+import random
+import re
+
+from torchvision import transforms
+
+from .utils import sample_video
+
+def return_same(x):
+ return x
+
+def _bbox_transform_for_padding(bbox, frame):
+ w1, h1, w2, h2 = bbox
+ width, height = frame.size
+ if width == height:
+ pass
+ elif width > height:
+ h1 += (width - height) // 2
+ h2 += (width - height) // 2
+ height = width
+ else:
+ w1 += (height - width) // 2
+ w2 += (height - width) // 2
+ width = height
+ new_bbox = [w1 / width, h1 / height, w2 / width, h2 / height]
+ new_bbox = [round(i, 2) for i in new_bbox]
+ return new_bbox
+
+def _bbox_transform_for_resize(bbox, frame):
+ w1, h1, w2, h2 = bbox
+ width, height = frame.size
+ new_bbox = [w1 / width, h1 / height, w2 / width, h2 / height]
+ new_bbox = [round(i, 2) for i in new_bbox]
+ return new_bbox
+
+class InAndOutCropAndResize(object):
+ """Crop and resize for in_and_out boxes data according to yuchen
+ Args:
+ size: tuple of (width, height)
+ """
+
+ def __init__(self, size):
+ self.size = size
+
+ def __call__(self, img):
+ """
+ Args:
+ img (PIL Image): PIL Image
+ Returns:
+ PIL Image: PIL image.
+ """
+ w = img.width
+ h = img.height
+ x0 = int(w * 0.5 - h * 0.375)
+ y0 = int(h * 0.125)
+ x1 = int(w * 0.5 + h * 0.375)
+ y1 = int(h * 0.875)
+ img = img.crop((x0, y0, x1, y1)).resize(self.size)
+ return img
+
+
+class ObjectTrackingParser:
+ def __init__(
+ self,
+ n_frames = 8,
+ max_objects = 3,
+ is_training=True,
+ ):
+ self.n_frames = n_frames
+ self.max_objects = max_objects
+ self.is_training = is_training
+ self.img_transform = self.get_img_transform()
+ # fmt: off
+ self.data_temp = {
+ "video_file": "/mnt/bn/llmdatalq/jiaxin/hdvila/20230926/saved/saved_video_clips/0076/lOjn__YCec4.624.1104.mp4",
+ "frame_indices": [154, 157, 160, 163, 166, 169, 172, 175, 178, 181, 184, 187, 190, 193, 196, 199, 202],
+ "objects": {
+ "0": {
+ "phrase": "person",
+ "all_frame_bounding_boxes": [[2, 0, 255, 250], [17, 0, 255, 251], [35, 0, 255, 253], [44, 0, 255, 255], [52, 0, 255, 255], [54, 0, 255, 255], [63, 0, 255, 255], [60, 0, 255, 255], [54, 0, 253, 255], [43, 0, 250, 255], [36, 1, 249, 255], [36, 0, 252, 254], [41, 0, 252, 254], [61, 0, 255, 253], [68, 4, 255, 255], [74, 8, 255, 255], [91, 3, 255, 255]]
+ }
+ },
+ "task": "object_tracking",
+ "dataset": "hdvila"
+ }
+ # fmt: on
+
+ def check_format(self, data_dict: Dict, image_processing_config: Dict):
+        # do_crop is not supported for box-tracking data!
+ if image_processing_config.get('do_crop', False):
+ raise ValueError(f'do_crop is not supported in ObjectTrackingParser!')
+
+ def transform(self, data_dict: Dict, image_processing_config: Dict = None) -> Dict:
+ self.check_format(data_dict, image_processing_config)
+
+ bbox_transform = _bbox_transform_for_padding if image_processing_config['do_padding'] else _bbox_transform_for_resize
+
+ # sample n_frames
+ if isinstance(self.n_frames, int):
+ n_frames = self.n_frames
+ else:
+ n_frames = random.choice(self.n_frames)
+ total_frames = list(range(len(data_dict['frame_indices'])))
+ idxs = random.sample(total_frames, min(n_frames, len(total_frames)))
+ idxs.sort()
+
+ frame_indices = [data_dict['frame_indices'][i] for i in idxs]
+ frames = sample_video(data_dict['video_file'], frame_indices=frame_indices)
+ img_transform = self.img_transform[data_dict['dataset']]
+ frames = [img_transform(f) for f in frames]
+
+ objects = []
+ for _, o in data_dict['objects'].items():
+ if o is None:
+ continue
+ all_frame_bounding_boxes = [o['all_frame_bounding_boxes'][i] for i in idxs]
+ all_frame_bounding_boxes_t = []
+ for bbox, frame in zip(all_frame_bounding_boxes, frames):
+ all_frame_bounding_boxes_t.append(bbox_transform(bbox, frame))
+ objects.append(all_frame_bounding_boxes_t)
+ if len(objects) >= self.max_objects:
+ break
+
+ prompt = "Given the bounding box coordinates of these objects in the first frame, output the bounding box coordinates in the following frames.\n{}"
+ response = ''
+
+ object_info = ''
+ for i, o in enumerate(objects):
+ object_info += f'object {i+1}: {o[0]}\n'
+ response += f'object {i+1}: {o[1:]}\n'
+ response = response.strip()
+ prompt = prompt.format(object_info)
+
+ messages = [
+ {
+ "role": "user",
+ "content": [
+ {"type": "video", "video": frames},
+ {"type": "text", "text": prompt}
+ ]
+ },
+ {
+ "role": "assistant",
+ "content": [
+ {"type": "text", "text": response}
+ ]
+ }
+ ]
+
+ return messages
+
+ def get_img_transform(self):
+ return {
+ 'webvid': return_same,
+ 'hdvila': transforms.Compose([
+ transforms.Resize(size=256),
+ transforms.CenterCrop(size=(256, 256))
+ ]),
+ 'hdvila_in_and_out_boxes': InAndOutCropAndResize(size=(256, 256))
+ }
diff --git a/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/standard_vision_parser.py b/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/standard_vision_parser.py
new file mode 100644
index 0000000000000000000000000000000000000000..e706189c4980da3a9c5e8d566ca5267bd09cb520
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/standard_vision_parser.py
@@ -0,0 +1,255 @@
+from typing import Dict, List
+from PIL import Image
+import random
+
+from .utils import sample_video, read_image, adjust_bbox, filter_ocr_polygon
+
+
+class VisionParser:
+ def __init__(
+ self,
+ n_frames=8,
+ max_n_frames=256,
+ is_training=True,
+ video_sampling_strategy={},
+ ):
+ self.n_frames = n_frames
+ self.max_n_frames = max_n_frames
+ self.is_training = is_training
+ self.video_sampling_strategy = video_sampling_strategy
+
+ # fmt: off
+ self.data_temp = {
+ "messages": [
+ {
+ "role": "user",
+ "content": [
+ {"type": "text", "text": "Describe the image and the video."},
+                    # Supported image formats:
+ {"type": "image", "image": {"image_file": "/path/to/image"}},
+ {"type": "image", "image": {"video_file": "/path/to/video", "frame_indices": 0}},
+                    # Supported video formats:
+ {"type": "video", "video": {"video_file": "/path/to/video"}},
+ {"type": "video", "video": {"video_file": "/path/to/video", "frame_indices": [0, 1, 2]}},
+ {"type": "video", "video": {"video_file": "/path/to/video", "start_frame": 0, "end_frame": 100}},
+ {"type": "video", "video": {"video_file": "/path/to/video", "time_indices": [0, 1, 2]}},
+ {"type": "video", "video": {"video_file": "/path/to/video", "start_time": 0, "end_time": 100}},
+ {"type": "video", "video": {"image_file": ["/path/to/image"]}, "frame_indices": [0, 1, 2]},
+ ]
+ },
+ {
+ "role": "assistant",
+ "content": [
+ {"type": "text","text": "xxx"}
+ ]
+ }
+ ],
+ "dataset": "LSMDC",
+ "task": "video/caption"
+ }
+ # fmt: on
+
+ def check_format(self, data_dict: Dict, image_processing_config: Dict):
+ if image_processing_config.get('do_crop', False) and image_processing_config.get('has_coordinates', False):
+ raise ValueError(f'do_crop and has_coordinates cannot be True at the same time!')
+
+    """
+    1. Replace the image/video entries in messages with the corresponding PIL.Image / List[PIL.Image]
+    2. Special handling of text: adjust bounding boxes; filter out OCR regions whose area is too small
+    """
+ def transform(self, data_dict: Dict, image_processing_config: Dict = None) -> Dict:
+ self.check_format(data_dict, image_processing_config)
+
+ self.set_n_frames(data_dict)
+
+        first_image = None  # ugly! Only image tasks need box adjustment / filtering of tiny OCR regions
+
+ for msg in data_dict['messages']:
+ if isinstance(msg['content'], dict):
+ msg['content'] = [msg['content']]
+ for content in msg['content']:
+
+ if content['type'] == 'image':
+ content['image'] = self.load_image_item(content['image'])
+ if first_image is None:
+ first_image = content['image']
+ elif content['type'] == 'video':
+ video = self.load_video_item(content['video'])
+ content['video'] = video.pop('frames')
+ if video:
+ data_dict['extra_info']['frame_disturb_info'] = video.pop('video_info', {})
+ elif content['type'] == 'text':
+ pass
+ else:
+ raise ValueError(f"content['type']={content['type']} MUST be one of ['image', 'video', 'text']")
+ for msg in data_dict['messages']:
+ for content in msg['content']:
+ if content['type'] == 'text':
+ self.postprocess_text(content, data_dict, image_processing_config, first_image)
+
+ return data_dict['messages']
+
+ # set n_frames for each vision item.
+ def set_n_frames(self, data_dict):
+
+ if isinstance(self.n_frames, int):
+ n_frames = self.n_frames
+ else:
+ n_frames = random.choice(self.n_frames)
+
+ assert n_frames <= self.max_n_frames
+
+ curr_n_frames = 0
+ has_dynamic = False
+ for msg in data_dict['messages']:
+ if isinstance(msg['content'], dict):
+ msg['content'] = [msg['content']]
+
+ for content in msg['content']:
+
+ if content['type'] == 'image':
+ curr_n_frames += 1
+ elif content['type'] == 'video':
+ if 'frame_indices' in content['video']:
+ curr_n_frames += len(content['video']['frame_indices'])
+ content['video']['n_frames'] = len(content['video']['frame_indices'])
+ elif 'time_indices' in content['video']:
+ curr_n_frames += len(content['video']['time_indices'])
+ content['video']['n_frames'] = len(content['video']['time_indices'])
+ elif 'min_n_frames' in content['video']:
+ content['video']['min_n_frames'] = int(content['video']['min_n_frames'])
+ curr_n_frames += content['video']['min_n_frames']
+ content['video']['n_frames'] = content['video']['min_n_frames']
+ has_dynamic = True
+ elif 'fps' in content['video']:
+ content['video']['n_frames'] = self.max_n_frames
+ curr_n_frames += self.max_n_frames
+ has_dynamic = True
+ else:
+ content['video']['n_frames'] = 0
+ has_dynamic = True
+
+ while curr_n_frames < n_frames and has_dynamic:
+ for msg in data_dict['messages']:
+ for content in msg['content']:
+ if content['type'] == 'video':
+ if 'frame_indices' in content['video']:
+ pass
+ elif 'time_indices' in content['video']:
+ pass
+ else:
+ if curr_n_frames < n_frames:
+ content['video']['n_frames'] += 1
+ curr_n_frames += 1
+
+ while curr_n_frames > self.max_n_frames and has_dynamic:
+ for msg in data_dict['messages']:
+ for content in msg['content']:
+ if content['type'] == 'video':
+ if 'frame_indices' in content['video']:
+ pass
+ elif 'time_indices' in content['video']:
+ pass
+ else:
+ if curr_n_frames > self.max_n_frames:
+ content['video']['n_frames'] -= 1
+ curr_n_frames -= 1
+
+
+ for msg in data_dict['messages']:
+ for content in msg['content']:
+ if content['type'] == 'video':
+ if 'frame_indices' in content['video']:
+ pass
+ elif 'time_indices' in content['video']:
+ pass
+ else:
+ n = self.video_sampling_strategy.get('force_frames_n_divisible', 1)
+ if n > 1 and content['video']['n_frames'] % n != 0:
+ content['video']['n_frames'] += n - content['video']['n_frames'] % n
+
+ def load_image_item(self, image_item) -> Image.Image:
+ """
+ image_item:
+ {"image_file": {"lq": "/path/to/image"}}
+ {"video_file": {"lq": "/path/to/video"}, "frame_indices": 0}
+ """
+
+ # check format
+ if ("image_file" not in image_item) and ("video_file" not in image_item):
+ raise KeyError(f"Key 'image_file' or 'video_file' not found in image_item")
+ if 'image_file' in image_item:
+ if not isinstance(image_item['image_file'], str):
+ raise ValueError(f"{image_item['image_file']} is not a str!")
+ if 'video_file' in image_item:
+ if not isinstance(image_item['frame_indices'], int):
+                raise ValueError(f"{image_item['frame_indices']} is not an int!")
+
+ if 'image_file' in image_item:
+ image = read_image(image_item['image_file'])
+ else:
+ frame_indices = [image_item['frame_indices']]
+ image = sample_video(image_item['video_file'], frame_indices = frame_indices)[0]
+
+ return image
+
+ def load_video_item(self, video_item) -> List[Image.Image]:
+ """
+ video_item:
+ {"video_file": {"lq": "/path/to/video"}, "n_frames": 8}
+ {"video_file": {"lq": "/path/to/video"}, "frame_indices": [0, 1, 2], "n_frames": 3}
+ {"video_file": {"lq": "/path/to/video"}, "start_frame": 0, "end_frame": 100, "n_frames": 8}
+ {"video_file": {"lq": "/path/to/video"}, "time_indices": [0, 1, 2], "n_frames": 3}
+ {"video_file": {"lq": "/path/to/video"}, "start_time": 0, "end_time": 100, "n_frames": 8}
+ {"image_file": {"lq": ["/path/to/image"]}, "frame_indices": [0, 1, 2], "n_frames": 3}
+ """
+
+ # check format
+ if ("image_file" not in video_item) and ("video_file" not in video_item):
+ raise KeyError(f"Key 'image_file' or 'video_file' not found in video_item")
+
+ video_path = video_item.get('video_file', video_item.get('image_file'))
+ n_frames = video_item.get('n_frames', None)
+ frame_indices = video_item.get('frame_indices', None)
+ start_frame = video_item.get('start_frame', None)
+ end_frame = video_item.get('end_frame', None)
+ time_indices = video_item.get('time_indices', None)
+ start_time = video_item.get('start_time', None)
+ end_time = video_item.get('end_time', None)
+ mask_boxes = video_item.get('mask_boxes', None)
+ fps = video_item.get('fps', None)
+
+ frames, frame_indices = sample_video(
+ video_path=video_path,
+ frame_indices=frame_indices,
+ start_frame=start_frame,
+ end_frame=end_frame,
+ n_frames=n_frames,
+ time_indices=time_indices,
+ start_time=start_time,
+ end_time=end_time,
+ sampling_fps=fps,
+ mask_boxes=mask_boxes,
+ is_training=self.is_training,
+ video_sampling_strategy=self.video_sampling_strategy,
+ return_frame_ids=True,
+ )
+
+ if self.video_sampling_strategy.get('use_multi_images_for_video', False):
+ new_frames = []
+ for f in frames:
+ new_frames.extend([f, f])
+ frames = new_frames
+
+ if isinstance(frame_indices, dict):
+ return {
+ 'frames': frames,
+ 'video_info': frame_indices
+ }
+ return {'frames': frames}
+
+ def postprocess_text(self, content, data_dict, image_processing_config, first_image):
+ if image_processing_config.get('has_coordinates') and image_processing_config.get('do_padding'):
+ content['text'] = adjust_bbox(content['text'], frame=first_image)
+ if data_dict.get('task') == 'image/OCR' and image_processing_config.get('has_coordinates'):
+ content['text'] = filter_ocr_polygon(content['text'])
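At the end of `set_n_frames` above, the `force_frames_n_divisible` strategy pads the frame count up to the next multiple of the divisor. The rounding rule in isolation:

```python
def round_up_frames(n_frames: int, divisor: int) -> int:
    # Mirrors the final loop of VisionParser.set_n_frames: when
    # force_frames_n_divisible > 1, pad n_frames up to the next multiple.
    if divisor > 1 and n_frames % divisor != 0:
        n_frames += divisor - n_frames % divisor
    return n_frames

print(round_up_frames(13, 4))  # 16
print(round_up_frames(16, 4))  # 16
```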
diff --git a/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/utils.py b/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..8e200ba056f3ffe8395d07afd20e96f28485c28e
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/utils.py
@@ -0,0 +1,452 @@
+from typing import List, Dict, Union
+import os
+import random
+import tempfile
+from PIL import Image, ImageSequence
+import base64
+import io
+import re
+import uuid
+import json
+import numpy as np
+import pyarrow.fs as pf
+import func_timeout
+from func_timeout import func_set_timeout
+import math
+
+# fmt: off
+import decord
+# fmt: on
+
+
+def denorm_box(points, height, width):
+ new_points = []
+ for p in points:
+ new_points.append((round(p[0] * width), round(p[1] * height)))
+ return new_points
+
+def process_image_for_tiktok(frames: List[Image.Image], mask_boxes):
+ mask_boxes = mask_boxes[:len(frames)]
+ frames = [np.array(f) for f in frames]
+ # assert len(mask_boxes) == len(frames)
+ height, width = frames[0].shape[:2]
+
+ new_frames = []
+ for boxes, frame in zip(mask_boxes, frames):
+ left, top, right, bottom = 0, 0, width, height
+ for box in boxes:
+ pts = np.array(denorm_box(box, height, width), np.int32)
+ upper_bound = max([p[1] for p in pts]) + 30
+ if bottom > upper_bound:
+ bottom = upper_bound
+ frame[pts[0][1]: pts[2][1], pts[0][0]: pts[1][0]] = 0
+
+ new_frames.append(Image.fromarray(frame[top: bottom, left: right]))
+ return new_frames
+
+# Split the video into n_frames segments. During training, sample a random frame
+# from each segment; during testing, sample the middle frame of each segment.
+def _sample_frame_indices_v2(
+ total_frames: int,
+ n_frames: int,
+ is_training=False,
+ video_sampling_strategy = {},
+ ):
+ total_frames_idxs = list(range(total_frames))
+ if total_frames <= n_frames:
+ return total_frames_idxs
+ k, m = divmod(total_frames, n_frames)
+ frame_splits = [total_frames_idxs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in list(range(n_frames))]
+ if is_training:
+ sample_ids = [random.choice(i) for i in frame_splits]
+ else:
+ sample_ids = [i[(len(i)+1)//2-1] for i in frame_splits]
+ return sample_ids
+
+# Sample frames uniformly, always including the first and last frame.
+def _sample_frame_indices_v1(total_frames: int, n_frames: int, is_training=False, video_sampling_strategy = {}):
+ if n_frames == 1:
+ return [0] # sample first frame in default
+ if total_frames <= n_frames:
+ return list(range(total_frames))
+ sample_ids = [round(i * (total_frames - 1) / (n_frames - 1)) for i in range(n_frames)]
+ return sample_ids
+
+def conduct_disturb_frame(frame_indices):
+ disturb_type = random.choice(['exchange', 'crop', 'reverse', 'discard'])
+ n_frames = len(frame_indices)
+ frame_indices_new = []
+ if disturb_type == 'exchange':
+        # Split evenly into 4 segments and randomly swap two of them
+ seg_len = math.ceil(n_frames / 4)
+ seg_idxs = list(range(0, n_frames, seg_len))
+ target_idxs = random.sample(range(0, 4), 2)
+ seg_idxs[target_idxs[0]], seg_idxs[target_idxs[1]] = seg_idxs[target_idxs[1]], seg_idxs[target_idxs[0]]
+ for idx in seg_idxs:
+ frame_indices_new += frame_indices[idx: idx+seg_len]
+ elif disturb_type == 'crop':
+        # Randomly crop a span covering 3/4 of the duration, then uniformly sample n_frames frames from it
+ crop_len = math.ceil(n_frames / 4)
+ idx_s = random.choice(range(0, crop_len+1))
+ idx_e = n_frames - 1 - (crop_len - idx_s)
+ frame_indices_new = np.linspace(frame_indices[idx_s], frame_indices[idx_e], n_frames, dtype=int).tolist()
+ elif disturb_type == 'reverse':
+        # Randomly pick a segment covering [1/2, 1] of the duration and reverse its order
+ reverse_len = math.ceil(random.uniform(0.5,1) * n_frames)
+ idx_s = random.choice(range(0, n_frames-reverse_len+1))
+ idx_e = idx_s + reverse_len - 1
+ frame_indices_new = frame_indices[:idx_s] + list(reversed(frame_indices[idx_s: idx_e+1])) + frame_indices[idx_e+1:]
+ elif disturb_type == 'discard':
+        # Randomly discard half of the frames
+ frame_indices_new = random.sample(frame_indices, n_frames//2)
+ frame_indices_new.sort()
+ return disturb_type, frame_indices_new
+
+@func_set_timeout(60)
+def _download_file(path):
+ if path.startswith("hdfs"):
+ local_path = os.path.join(tempfile.gettempdir(), f'{uuid.uuid4()}_' + os.path.basename(path))
+
+ fs = pf.HadoopFileSystem.from_uri(uri="hdfs://harunava")
+ hdfs_file = fs.open_input_file(path)
+ file_size = hdfs_file.size()
+ if file_size > 1024 * 1024 * 1024: # 1G
+ os.system(f"hadoop fs -get --ct 8 -c 512 '{path}' '{local_path}' > /dev/null 2>&1")
+ elif file_size > 1024 * 1024 * 100: # 100M
+ os.system(f"hadoop fs -get '{path}' '{local_path}' > /dev/null 2>&1")
+ else:
+ local_fs = pf.LocalFileSystem()
+ with local_fs.open_output_stream(local_path) as local_file:
+ while True:
+                    chunk = hdfs_file.read(1024 * 1024 * 100)  # Read in 100MB chunks; adjust as needed
+ if not chunk:
+ break
+ local_file.write(chunk)
+ else:
+ local_path = path
+
+ if not os.path.exists(local_path):
+ raise FileNotFoundError(f'{local_path}')
+
+ return local_path
+
+def download_file(path):
+ try:
+ # with timer(f'Download {path}'):
+ return _download_file(path)
+ except func_timeout.exceptions.FunctionTimedOut as e:
+ raise ValueError(e)
+
+class VideoReader:
+ def __init__(self, path: str) -> None:
+ self.path = path
+ self.local_path = self.preprocess()
+ self.vr = decord.VideoReader(self.local_path, num_threads=1, ctx=decord.cpu(0), fault_tol=1)
+ self.vr.seek(0)
+ self._length = len(self.vr)
+ self._fps = self.vr.get_avg_fps()
+
+ @property
+ def length(self):
+ return self._length
+
+ @property
+ def fps(self):
+ return self._fps
+
+ def sample(self, frame_indices) -> List[Image.Image]:
+ frames = self.vr.get_batch(frame_indices).asnumpy()
+ frames = [Image.fromarray(f).convert('RGB') for f in frames]
+ return frames
+
+ def preprocess(self):
+ return download_file(self.path)
+
+ def postprocess(self):
+ if self.path.startswith("hdfs"):
+ os.remove(self.local_path)
+
+class ImageSeqReader:
+ def __init__(self, path: List[str]) -> None:
+ self.path = path
+ self.local_path = self.preprocess()
+ self._length = len(self.local_path)
+ self._fps = None
+
+ @property
+ def length(self):
+ return self._length
+
+ @property
+ def fps(self):
+ return self._fps
+
+ def sample(self, frame_indices):
+ return [read_image(self.local_path[i]) for i in frame_indices]
+
+ def preprocess(self):
+ local_paths = []
+ for p in self.path:
+ local_paths.append(p)
+ return local_paths
+
+ def postprocess(self):
+ pass
+
+class GIFReader:
+ def __init__(self, path: str) -> None:
+ self.path = path
+ self.local_path = self.preprocess()
+ self.gif = Image.open(self.local_path)
+ self._length = self.gif.n_frames
+        duration = self.gif.info.get('duration', 0) / 1000  # convert milliseconds to seconds
+ if duration > 0:
+ self._fps = 1 / duration
+ else:
+ self._fps = None
+
+ @property
+ def length(self):
+ return self._length
+
+ @property
+ def fps(self):
+ return self._fps
+
+ def sample(self, frame_indices):
+ frames = []
+ i = 0
+ for frame in ImageSequence.Iterator(self.gif):
+ if i in frame_indices:
+ frames.append(frame.convert('RGB'))
+ i += 1
+ return frames
+
+ def preprocess(self):
+ return download_file(self.path)
+
+ def postprocess(self):
+ if self.path.startswith("hdfs"):
+ os.remove(self.local_path)
+
+def check_frame_indices(frame_indices, total_frames, video_path):
+ if frame_indices[-1] == total_frames:
+ frame_indices[-1] = total_frames - 1
+
+ valid_frame_indices = [i for i in frame_indices if i >= 0 and i < total_frames]
+
+ if len(valid_frame_indices) != len(frame_indices):
+ print(f'[Error] frame out of index. video_path={video_path}, frame_indices={frame_indices}, total_frames={total_frames}', flush=True)
+
+ return valid_frame_indices
+
+
+def sample_video(
+ video_path: Union[str, List[str]],
+ frame_indices: List[int] = None,
+ start_frame:int=None,
+ end_frame:int=None,
+ n_frames:int = None,
+ time_indices: List[float] = None,
+ start_time:int=None,
+ end_time:int=None,
+ sampling_fps:float=None,
+ mask_boxes=None,
+ is_training:bool=False,
+ video_sampling_strategy={'video_sampler_version': 'v1'},
+ return_frame_ids: bool=False,
+ ) -> List[Image.Image]:
+
+ do_frame_disturb = video_sampling_strategy.get('do_frame_disturb', False)
+
+ if isinstance(video_path, str):
+ if video_path.endswith('.gif'):
+ reader = GIFReader(video_path)
+ else:
+ reader = VideoReader(video_path)
+ else:
+ reader = ImageSeqReader(video_path)
+
+ total_frames = reader.length
+ fps = reader.fps
+
+    if sampling_fps is not None:
+        # Guard against a zero step when the source fps is below sampling_fps.
+        step = max(1, round(fps / sampling_fps))
+        frame_indices = list(range(0, total_frames, step))
+        if len(frame_indices) > n_frames:
+            frame_indices = None
+
+ if time_indices is not None:
+ frame_indices = [round(float(i) * fps) for i in time_indices]
+
+ if start_time is not None and end_time is not None:
+ start_frame = round(start_time * fps)
+ end_frame = round(end_time * fps)
+
+ if frame_indices is None:
+ start_frame = 0 if start_frame is None else round(start_frame)
+ end_frame = total_frames - 1 if end_frame is None else round(end_frame)
+
+ if end_frame == total_frames:
+ end_frame -= 1
+
+ if video_sampling_strategy['video_sampler_version'] == 'v1':
+            # Uniform sampling; always include the first and last frame.
+ frame_indices = _sample_frame_indices_v1(end_frame - start_frame + 1, n_frames, is_training, video_sampling_strategy)
+ elif video_sampling_strategy['video_sampler_version'] == 'v2':
+ frame_indices = _sample_frame_indices_v2(end_frame - start_frame + 1, n_frames, is_training, video_sampling_strategy)
+ else:
+ raise ValueError(f"video_sampler_version={video_sampling_strategy['video_sampler_version']} must be 'v1' or 'v2'")
+ frame_indices = [i + start_frame for i in frame_indices]
+
+ frame_indices = check_frame_indices(frame_indices, total_frames, video_path)
+
+ if do_frame_disturb:
+ frame_disturb_type, frame_indices_new = conduct_disturb_frame(frame_indices)
+ frame_indices_raw = frame_indices[:]
+ frame_indices = frame_indices_new
+
+ frames = reader.sample(frame_indices)
+ if mask_boxes is not None:
+ frames = process_image_for_tiktok(frames, mask_boxes)
+
+ n = video_sampling_strategy.get('force_frames_n_divisible', 1)
+ if n > 1 and len(frames) % n != 0:
+ new_n = n - len(frames) % n
+ frames.extend([Image.new(mode='RGB', size=frames[-1].size) for _ in range(new_n)])
+
+ reader.postprocess()
+
+ if do_frame_disturb:
+ return frames, {"frame_indices": frame_indices, "disturb_type": frame_disturb_type, "frame_indices_raw": frame_indices_raw}
+ if return_frame_ids:
+ return frames, frame_indices
+ return frames
+
+
+
+def load_image_from_base64String(img_path):
+ img = base64.b64decode(open(img_path, "rb").read())
+ buf = io.BytesIO(img)
+ img = Image.open(buf)
+ return img
+
+def read_image(image_path):
+ local_file = download_file(image_path)
+
+ if local_file.endswith('.dat'):
+ image = load_image_from_base64String(local_file)
+ else:
+ image = Image.open(local_file).convert('RGB')
+ if image_path.startswith("hdfs"):
+ os.remove(local_file)
+ return image
+
+
+def adjust_bbox(text, frame):
+
+ width, height = frame.size
+ new_text = []
+ start_idx = 0
+ for match in re.finditer(r'\[(\d+(\.\d+)?,\s*)+\d+(\.\d+)?\]', text):
+ coordinate_matches = re.findall(r"([0-9.]+)", match.group(0))
+ xys = [float(coord) for coord in coordinate_matches]
+
+ new_xys = []
+ for i in range(len(xys)):
+ p = xys[i]
+
+            if width == height:
+                pass  # square images: normalized coordinates need no adjustment
+
+ if width > height and i % 2 != 0:
+ p = xys[i] * height
+ p += (width - height) // 2
+ p = round(p / width, 2)
+
+ if height > width and i % 2 == 0:
+ p = xys[i] * width
+ p += (height - width) // 2
+ p = round(p / height, 2)
+
+ new_xys.append(p)
+
+ new_text.append(text[start_idx: match.span()[0]])
+ new_text.append(str(new_xys))
+ start_idx = match.span()[1]
+ new_text.append(text[start_idx: ])
+ text = ''.join(new_text)
+
+
+ return text
+
+def bbox_area(vertices, convert_format = True):
+ if convert_format:
+ vertices = list(zip(vertices[::2], vertices[1::2]))
+ x0, y0 = vertices[0]
+ x1, y1 = vertices[1]
+ return abs((x1 - x0) * (y1 - y0))
+
+def polygon_area(vertices, convert_format = True):
+ if convert_format:
+ vertices = list(zip(vertices[::2], vertices[1::2]))
+    n = len(vertices)  # number of polygon vertices
+ if n == 2:
+ return bbox_area(vertices, convert_format=False)
+ area = 0
+ for i in range(n):
+ x1, y1 = vertices[i]
+ x2, y2 = vertices[(i + 1) % n]
+ area += x1 * y2 - x2 * y1
+ return abs(area) / 2
+
+def get_text_len(text_line):
+ l = 0
+ for c in text_line:
+ if '\u4e00' <= c <= '\u9fff':
+ l += 1
+ else:
+ l += 0.5
+ return l
+
+def filter_ocr_polygon(response, area_threshold=0.0005):
+    try:
+        resp = json.loads(response)
+    except (json.JSONDecodeError, TypeError):
+        return response
+ new_resp = []
+ for coords, text_line in resp:
+ area = polygon_area(coords, convert_format=True)
+ text_len = get_text_len(text_line)
+ if text_len == 0:
+ continue
+ if area / text_len < area_threshold:
+ continue
+ new_resp.append([coords, text_line])
+ new_resp = json.dumps(new_resp, ensure_ascii=False)
+
+ return new_resp
+
+def put_pred_to_data_dict(prediction, data_dict):
+ msg = data_dict['messages'][-1]
+ if msg['role'] == 'assistant':
+ msg['content'][-1]['text'] = prediction
+ else:
+ data_dict['messages'].append({
+ "role": "assistant",
+ "content": [{"type": "text", "text": prediction}]
+ })
+
+def get_prompt_from_data_dict(data_dict):
+ prompt = ""
+ for msg in data_dict['messages']:
+ role = msg['role']
+ assert role in {'system', 'user', 'assistant'}
+ for content in msg['content']:
+ if content['type'] == 'text':
+ if content['text']:
+ prompt += f"[{role}]: {content['text']}"
+ elif content['type'] == 'image':
+ prompt += f"[{role}]: "
+ elif content['type'] == 'video':
+ prompt += f"[{role}]: "
+ prompt += '\n'
+ return prompt
diff --git a/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/utils_visualize.py b/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/utils_visualize.py
new file mode 100644
index 0000000000000000000000000000000000000000..8c75b898668a46ccb331f619d36670f5a39d5021
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/utils_visualize.py
@@ -0,0 +1,54 @@
+import re
+from typing import Dict, List, Optional
+from PIL import Image, ImageDraw, ImageFont
+
+
+def scale_polygon(polygon, w, h):
+ new_polygon = []
+ for (x, y) in polygon:
+ new_polygon.append((x * w, y * h))
+ return new_polygon
+
+def draw_polygon(image: Image.Image, points: List[List[int]], label: Optional[str] = None):
+ draw = ImageDraw.Draw(image)
+ if len(points) > 2:
+ draw.polygon(points, outline="red", width=3)
+ elif len(points) == 2:
+ draw.rectangle(points, outline="red", width=3)
+ else:
+ raise ValueError(f'points={points} only has one point!')
+
+ if label is not None:
+ font = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf', 20)
+ draw.text(points[0], label, font=font, fill=(0, 0, 255))
+ return image
+
+def visualize_image_bbox(data_dict, image_processing_config, processor):
+    if image_processing_config.get('has_coordinates') is not True:
+        return
+
+ messages = data_dict['messages']
+
+ polygons = []
+ first_image_content = None
+
+ for msg in messages:
+ for content in msg['content']:
+ if content['type'] == 'text':
+ for match in re.finditer(r'\[(\d+(\.\d+)?,\s*)+\d+(\.\d+)?\]', content["text"]):
+ coordinate_matches = re.findall(r"([0-9.]+)", match.group(0))
+ coords = [float(coord) for coord in coordinate_matches]
+ polygons.append(list(zip(coords[::2], coords[1::2])))
+ elif first_image_content is None and content['type'] == 'image':
+ first_image_content = content
+
+ first_image = first_image_content['image']
+ first_image = processor.preprocess_image(first_image, image_processing_config)
+ w, h = first_image.size
+
+ if len(polygons) > 0:
+ for i, polygon in enumerate(polygons):
+ polygon = scale_polygon(polygon, w, h)
+ first_image = draw_polygon(first_image, polygon, label=str(i))
+
+ first_image_content['image'] = first_image
\ No newline at end of file
diff --git a/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/video_permutation_parser.py b/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/video_permutation_parser.py
new file mode 100644
index 0000000000000000000000000000000000000000..ffdf1cf36f019e3ef90fb9965696c8daa5e93f22
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/video_permutation_parser.py
@@ -0,0 +1,137 @@
+from typing import Dict, List
+import random
+from PIL import Image, ImageDraw, ImageFont
+
+from .utils import sample_video
+
+
+class VideoPermutationParser:
+ def __init__(
+ self,
+ n_frames=8,
+ is_training=True,
+ frame_nums = list(range(8, 25)),
+ video_sampling_strategy={},
+ ):
+ self.n_frames = n_frames
+ self.is_training = is_training
+ self.frame_nums = frame_nums
+ self.video_sampling_strategy = video_sampling_strategy
+ # fmt: off
+ self.data_temp = {
+ "text": [{
+ "prompt": "",
+ "response": ""
+ }],
+ "video": [{
+ "video_file": {
+ "yg": "/mnt/bn/videonasyg/videos/webvid_10M_download/011851_011900/1047443473.mp4",
+ "lq": "/mnt/bn/llmdatalq/jiangnan/video_generation/webvid_10M_download/20230609/videos/011851_011900/1047443473.mp4"
+ },
+ "frame_indices": [0, 85, 171, 256, 342, 427, 513, 598]
+ }],
+ }
+ # fmt: on
+
+ def check_format(self, data_dict: Dict):
+ pass
+ # for k in self.data_temp.keys():
+ # assert k in data_dict
+
+ def transform(self, data_dict: Dict, image_processing_config: Dict = None) -> Dict:
+ self.check_format(data_dict)
+
+ frames = self.load_video_item(data_dict['video'][0])
+
+ # frames = self.add_text_to_frames(frames) # for debug
+
+ idxs = list(range(1, len(frames) + 1))
+ random.shuffle(idxs)
+
+ prefix_len = int(3/8*len(idxs))
+
+ shuffled_frames = [frames[i-1] for i in idxs]
+
+ prompt = f'Output the correct chronological order of scrambled video frames. The order of the first {prefix_len} ones are:\n'
+ prompt += '\n'.join([str(i) for i in idxs[: prefix_len]]) + '\nOutput the order of the following frames:'
+ response = '\n'.join([str(i) for i in idxs[prefix_len: ]])
+
+ messages = [
+ {
+ "role": "user",
+ "content": [
+ {"type": "video", "video": shuffled_frames},
+ {"type": "text", "text": prompt},
+ ]
+ },
+ {
+ "role": "assistant",
+ "content": [
+ {"type": "text", "text": response}
+ ]
+ }
+ ]
+
+ return messages
+
+
+ def load_video_item(self, video_item) -> List[Image.Image]:
+ """
+ video_item:
+ {"video_file": "/path/to/video", "n_frames": 8}
+ {"video_file": "/path/to/video", "frame_indices": [0, 1, 2], "n_frames": 3}
+ {"video_file": "/path/to/video", "start_frame": 0, "end_frame": 100, "n_frames": 8}
+ {"video_file": "/path/to/video", "time_indices": [0, 1, 2], "n_frames": 3}
+ {"video_file": "/path/to/video", "start_time": 0, "end_time": 100, "n_frames": 8}
+ {"image_file": ["/path/to/image"], "frame_indices": [0, 1, 2], "n_frames": 3}
+ """
+
+ # check format
+ if ("image_file" not in video_item) and ("video_file" not in video_item):
+            raise KeyError("Key 'image_file' or 'video_file' not found in video_item")
+
+ video_path = video_item.get('video_file', video_item.get('image_file'))
+ n_frames = video_item.get('n_frames', None)
+ frame_indices = video_item.get('frame_indices', None)
+ start_frame = video_item.get('start_frame', None)
+ end_frame = video_item.get('end_frame', None)
+ time_indices = video_item.get('time_indices', None)
+ start_time = video_item.get('start_time', None)
+ end_time = video_item.get('end_time', None)
+ mask_boxes = video_item.get('mask_boxes', None)
+
+ n_frames = random.choice(self.frame_nums)
+ n = self.video_sampling_strategy.get('force_frames_n_divisible', 1)
+ if n > 1 and n_frames % n != 0:
+ n_frames += n - n_frames % n
+
+ frames, frame_indices = sample_video(
+ video_path=video_path,
+ frame_indices=frame_indices,
+ start_frame=start_frame,
+ end_frame=end_frame,
+ n_frames=n_frames,
+ time_indices=time_indices,
+ start_time=start_time,
+ end_time=end_time,
+ mask_boxes=mask_boxes,
+ is_training=self.is_training,
+ video_sampling_strategy=self.video_sampling_strategy,
+ return_frame_ids=True,
+ )
+ return frames
+
+
+ def add_text_to_frames(self, frames: List[Image.Image]):
+ new_frames = []
+ for i, image in enumerate(frames):
+ draw = ImageDraw.Draw(image)
+
+ font = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf', 100)
+ text_position = (50, 50)
+ text_content = f'{i+1}'
+ text_color = (255, 0, 0)
+ draw.text(text_position, text_content, font=font, fill=text_color)
+ new_frames.append(image)
+ return new_frames
+
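The shuffle-and-reveal scheme in `VideoPermutationParser.transform` (shuffle 1-based frame ids, put the first 3/8 of the order in the prompt, ask the model for the rest) can be sketched in isolation. A minimal sketch under assumed names (`build_permutation_task` is not part of the repo):

```python
import random

def build_permutation_task(n_frames: int, seed: int = 0):
    """Shuffle 1-based frame ids, reveal the first 3/8 of the order
    (prompt side) and hold out the remainder (answer side)."""
    rng = random.Random(seed)  # seeded here only for reproducibility
    idxs = list(range(1, n_frames + 1))
    rng.shuffle(idxs)
    prefix_len = int(3 / 8 * len(idxs))
    return idxs[:prefix_len], idxs[prefix_len:]
```

With 8 frames this reveals 3 ids and holds out 5, and the two halves together always form the full permutation of 1..n.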
diff --git a/eval_scripts/DREAM-1K/tarsier/dataset/tarsier_datamodule.py b/eval_scripts/DREAM-1K/tarsier/dataset/tarsier_datamodule.py
new file mode 100644
index 0000000000000000000000000000000000000000..831958acdb630574c26c63bc60c4270b95b55ab5
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/dataset/tarsier_datamodule.py
@@ -0,0 +1,280 @@
+"""Datamodule for Llava Pretraining and Finetuning"""
+import os
+import re
+from PIL import Image
+import numpy as np
+import re
+import tempfile
+from typing import Dict, List, Union, Tuple
+import traceback
+import json
+
+import torch
+import torch.nn.functional as F
+from transformers import DataCollatorForSeq2Seq
+
+from tools.rw_utils import read_jsonlines
+from torch.utils.data import Dataset, DataLoader
+
+np_str_obj_array_pattern = re.compile(r"[SaUO]")
+
+default_collate_err_msg_format = (
+ "default_collate: batch must contain tensors, numpy arrays, numbers, "
+ "dicts or lists; found {}"
+)
+
+from .custom_data_parsers.standard_vision_parser import VisionParser
+from .custom_data_parsers.object_tracking_parser import ObjectTrackingParser
+from .custom_data_parsers.multi_images_parser import MultiImagesParser
+from .custom_data_parsers.video_permutation_parser import VideoPermutationParser
+from .custom_data_parsers.utils_visualize import visualize_image_bbox
+
+from .tarsier_processor import TarsierProcessor
+
+from tools.rw_utils import NumpyArrayEncoder
+from .utils import DictToObject
+
+class TarsierDataProcessor:
+ def __init__(
+ self,
+ processor: TarsierProcessor,
+ n_frames: Union[int, list],
+ max_n_frames=256,
+ max_pixels=int(1280 * 720 // 2),
+ min_pixels=0,
+ max_seq_len=None,
+        is_training=True,  # Affects: (1) frame sampling differs between training and evaluation; (2) the response is ignored at evaluation.
+ print_data_error=True,
+ do_image_padding=False,
+ do_image_crop=False,
+ do_image_resize=True,
+ video_sampling_strategy={},
+ prompt='',
+ train_task='sft',
+ **kwargs
+ ):
+ self.kwargs = kwargs
+
+ self.processor = processor
+ self.pad_collator = DataCollatorForSeq2Seq(processor.tokenizer, padding='longest')
+
+        self.processor.max_seq_len = processor.tokenizer.model_max_length if max_seq_len is None else max_seq_len
+
+ self.n_frames = n_frames
+ self.max_n_frames = max_n_frames
+ self.max_pixels = max_pixels
+ self.min_pixels = min_pixels
+
+ self.is_training = is_training
+ self.print_data_error = print_data_error
+ self.do_image_padding = do_image_padding
+ self.do_image_crop = do_image_crop
+ self.do_image_resize = do_image_resize
+ self.video_sampling_strategy = video_sampling_strategy
+ self.prompt = prompt
+ self.train_task = train_task
+
+ self.object_tracking_parser = ObjectTrackingParser(
+ n_frames=self.n_frames,
+ max_objects=4,
+ is_training=self.is_training,
+ )
+ self.multi_images_parser = MultiImagesParser(
+ n_frames=self.n_frames,
+ is_training=self.is_training,
+ )
+ self.video_permutation_parser = VideoPermutationParser(
+ n_frames=self.n_frames,
+ is_training=self.is_training,
+ video_sampling_strategy=self.video_sampling_strategy,
+ )
+ self.vision_parser = VisionParser(
+ n_frames=self.n_frames,
+ max_n_frames=self.max_n_frames,
+ is_training=self.is_training,
+ video_sampling_strategy=self.video_sampling_strategy
+ )
+
+ def select_parser(self, data_dict):
+ if data_dict.get('task', None) == 'video/object_tracking':
+ return self.object_tracking_parser
+ elif data_dict.get('task', None) == 'multi_images':
+ return self.multi_images_parser
+ elif data_dict.get('dataset', None) == 'video_permutation':
+ return self.video_permutation_parser
+ else:
+ return self.vision_parser
+
+ def parse_image_processing_config(self, data_dict):
+ image_processing_config=data_dict.get('image_processing_config', {})
+
+ do_padding = image_processing_config.get('do_padding', self.do_image_padding)
+ do_crop = image_processing_config.get('do_crop', self.do_image_crop)
+ do_resize = image_processing_config.get('do_resize', self.do_image_resize)
+ max_pixels = image_processing_config.get('max_pixels', self.max_pixels)
+ min_pixels = image_processing_config.get('min_pixels', self.min_pixels)
+
+ assert min_pixels <= max_pixels
+
+ image_processing_config['do_padding'] = do_padding
+ image_processing_config['do_crop'] = do_crop
+ image_processing_config['do_resize'] = do_resize
+ image_processing_config['max_pixels'] = max_pixels
+ image_processing_config['min_pixels'] = min_pixels
+
+ return image_processing_config
+
+
+ def _transform(self, raw_data_dict: Dict) -> Dict:
+ data_dict = json.loads(json.dumps(raw_data_dict, cls=NumpyArrayEncoder))
+ del raw_data_dict
+
+ if self.prompt:
+ for msg in data_dict['messages']:
+ if msg['role'] == 'user':
+ for content in msg['content']:
+ if content['type'] == 'text':
+ content['text'] = self.prompt
+
+ data_dict_copy = json.loads(json.dumps(data_dict, cls=NumpyArrayEncoder))
+
+ image_processing_config = self.parse_image_processing_config(data_dict)
+ parser = self.select_parser(data_dict)
+ messages = parser.transform(data_dict, image_processing_config)
+ data_dict_copy['extra_info'] = data_dict.pop('extra_info', {})
+
+ # visualize_image_bbox(data_dict, image_processing_config, self.processor)
+ outputs = self.processor(messages, image_processing_config, is_training=self.is_training)
+
+ # if not self.is_training:
+ outputs['raw_data_dict'] = data_dict_copy
+
+ return [outputs]
+
+ def _split_chosen_rejected(self, data_dict: Dict):
+ chosen_data_dict = data_dict
+ rejected_data_dict = json.loads(json.dumps(data_dict, cls=NumpyArrayEncoder))
+ for msg in chosen_data_dict['messages']:
+ if msg['role'] == 'assistant':
+ for content in msg['content']:
+ if content['type'] == 'text':
+ content['text'] = content['chosen']
+
+ for msg in rejected_data_dict['messages']:
+ if msg['role'] == 'assistant':
+ for content in msg['content']:
+ if content['type'] == 'text':
+ content['text'] = content['rejected']
+
+ return chosen_data_dict, rejected_data_dict
+
+ def transform(self, data_dict: Dict) -> Dict:
+ try:
+ if self.train_task == 'dpo':
+ chosen_data_dict, rejected_data_dict = self._split_chosen_rejected(data_dict)
+ return self._transform(chosen_data_dict) + self._transform(rejected_data_dict)
+ return self._transform(data_dict)
+ except Exception as e:
+ if self.print_data_error:
+ print(traceback.format_exc())
+ print(f'Error occurs when processing: \n{data_dict}')
+ return []
+
+ def batch_transform(self, batch_data: List[Dict]) -> Dict:
+ model_inputs = {}
+ # if not self.is_training:
+ raw_data_dict = [d.pop('raw_data_dict') for d in batch_data]
+ model_inputs['raw_data_dict'] = raw_data_dict
+
+ batch_pixel_values = [d.pop('pixel_values') for d in batch_data if 'pixel_values' in d]
+ batch_image_grid_thw = [d.pop('image_grid_thw') for d in batch_data if 'image_grid_thw' in d]
+ if len(batch_pixel_values) == 0:
+ vision_placeholder = self.get_vision_placeholder()
+ batch_pixel_values = [vision_placeholder.get('pixel_values')]
+ batch_image_grid_thw = [vision_placeholder.get('image_grid_thw')] if 'image_grid_thw' in vision_placeholder else []
+
+ model_inputs['pixel_values'] = torch.cat(batch_pixel_values, dim=0)
+ if len(batch_image_grid_thw) > 0:
+ model_inputs['image_grid_thw'] = torch.cat(batch_image_grid_thw, dim=0)
+
+ batch_num_images = [d.pop('num_images') for d in batch_data]
+ model_inputs['num_images'] = torch.tensor(batch_num_images)
+ model_inputs.update(self.pad_collator(batch_data))
+ return model_inputs
+
+ def __call__(self, batch_data: Union[Dict, List[Dict]]) -> Dict:
+ if isinstance(batch_data, dict):
+ batch_data = [batch_data]
+ batch = [self.transform(d)[0] for d in batch_data]
+ return self.batch_transform(batch)
+
+ def get_vision_placeholder(self):
+ messages = [{"role": "user", "content": [{"type": "image", "image": Image.new(mode='RGB', size=(336, 336))}]}]
+ image_processing_config = self.parse_image_processing_config({})
+ return self.processor(messages, image_processing_config)
+
+ def get_text_placeholder(self):
+ messages = [
+ {"role": "user", "content": [{"type": "text", "text": "Hello!"}]},
+ {"role": "assistant", "content": [{"type": "text", "text": "Thank you very much"}]},
+ ]
+ image_processing_config = self.parse_image_processing_config({})
+ return self.processor(messages, image_processing_config)
+
+def init_processor(processor: Union[TarsierProcessor, str]=None, config: Dict=None):
+ config = DictToObject(config) if isinstance(config, dict) else config
+ if isinstance(processor, str):
+ sub_processor = TarsierProcessor.from_pretrained(
+ processor,
+ padding_side='left',
+ trust_remote_code=True
+ )
+ else:
+ sub_processor = processor
+ processor = TarsierDataProcessor(
+ processor=sub_processor,
+ n_frames=config.n_frames,
+ max_n_frames=config.max_n_frames,
+ max_pixels=config.max_pixels,
+ min_pixels=config.min_pixels,
+ max_seq_len=config.max_seq_len,
+ is_training=config.is_training,
+ print_data_error=config.print_data_error,
+ do_image_padding=config.do_image_padding,
+ do_image_crop=config.do_image_crop,
+ do_image_resize=config.do_image_resize,
+ video_sampling_strategy=config.video_sampling_strategy,
+ prompt=config.prompt,
+ train_task=config.train_task
+ )
+ return processor
+
+class TarsierDataset(Dataset):
+ def __init__(self, ann_path="", anns=None, config: Dict=None, processor: Union[TarsierDataProcessor, TarsierProcessor, str]=None):
+ self.config = DictToObject(config) if isinstance(config, dict) else config
+ if not isinstance(processor, TarsierDataProcessor):
+ self.processor = init_processor(processor, config)
+ else:
+ self.processor = processor
+ if anns is None:
+ self.anns = []
+ if isinstance(ann_path, str):
+ ann_path = [ann_path]
+ for path in ann_path:
+ self.anns.extend(read_jsonlines(path))
+ else:
+ self.anns = anns
+
+ def __len__(self):
+ return len(self.anns)
+
+ def __getitem__(self, index):
+ if index < 0 or index >= len(self.anns):
+ raise IndexError("Index out of range")
+ try:
+ ann = self.anns[index]
+ model_inputs = self.processor(ann)
+ except Exception as e:
+ print(f"Load data error: {e}")
+ return ann, None
+ return ann, model_inputs
diff --git a/eval_scripts/DREAM-1K/tarsier/dataset/tarsier_processor.py b/eval_scripts/DREAM-1K/tarsier/dataset/tarsier_processor.py
new file mode 100644
index 0000000000000000000000000000000000000000..fd28d100d74b84475de716445230ad859e80e018
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/dataset/tarsier_processor.py
@@ -0,0 +1,240 @@
+from typing import List, Union
+from PIL import Image
+
+import torch
+
+from transformers.feature_extraction_utils import BatchFeature
+from transformers.image_utils import ImageInput, get_image_size, to_numpy_array
+from transformers.processing_utils import ProcessingKwargs, ProcessorMixin, Unpack, _validate_images_text_input_order
+from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
+from transformers.utils import logging
+from transformers import Qwen2VLImageProcessor
+from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize
+
+logger = logging.get_logger(__name__)
+
+
+class TarsierProcessorKwargs(ProcessingKwargs, total=False):
+ _defaults = {
+ "text_kwargs": {},
+ "images_kwargs": {},
+ }
+
+
+class TarsierProcessor(ProcessorMixin):
+
+ attributes = ["image_processor", "tokenizer"]
+ valid_kwargs = ["chat_template", "image_token", "patch_size", "merge_size", "temporal_patch_size", "max_seq_len"]
+ image_processor_class = "AutoImageProcessor"
+ tokenizer_class = "AutoTokenizer"
+
+ def __init__(
+ self,
+ image_processor=None,
+ tokenizer=None,
+ chat_template=None,
+ image_token="",
+ patch_size=None,
+ merge_size=1,
+ temporal_patch_size=1,
+ max_seq_len=8192,
+ **kwargs,
+ ) -> None:
+
+ self.image_token = image_token
+ self.patch_size = patch_size
+ self.merge_size = merge_size
+ self.temporal_patch_size = temporal_patch_size
+ self.max_seq_len = max_seq_len
+ self.max_pixels_per_sample = 128 * 384 * 384
+
+ super().__init__(image_processor, tokenizer, chat_template=chat_template)
+
+ def __call__(
+ self,
+ messages,
+ image_processing_config=None,
+ is_training=True,
+ ) -> torch.Tensor:
+
+ output_kwargs = self._merge_kwargs(
+ TarsierProcessorKwargs,
+ tokenizer_init_kwargs=self.tokenizer.init_kwargs,
+ )
+
+        # Image processing
+ pixel_values, image_grid_thw = [], []
+ num_images = 0
+ for msg in messages:
+ for content in msg['content']:
+ if content['type'] == 'image':
+ num_images += self.temporal_patch_size
+ elif content['type'] == 'video':
+ num_images += len(content['video'])
+ if num_images > 0 and self.max_pixels_per_sample // num_images < image_processing_config['max_pixels']:
+ image_processing_config['max_pixels'] = self.max_pixels_per_sample // num_images
+ image_processing_config['min_pixels'] = min(image_processing_config['min_pixels'], image_processing_config['max_pixels'])
+
+ for msg in messages:
+ for content in msg['content']:
+ if content['type'] == 'image':
+ content['image'] = self.preprocess_image(content['image'], image_processing_config)
+ content['image'] = self.image_processor(images = content['image'], **output_kwargs["images_kwargs"], return_tensors="pt")
+ content['num_vision_tokens'] = self.get_num_vision_tokens(content)
+ pixel_values.append(content['image']['pixel_values'])
+ if 'image_grid_thw' in content['image']:
+ image_grid_thw.extend(content['image']['image_grid_thw'])
+ elif content['type'] == 'video':
+ content['video'] = self.preprocess_image(content['video'], image_processing_config)
+ if isinstance(self.image_processor, Qwen2VLImageProcessor):
+ content['video'] = self.image_processor(images = None, videos = content['video'], **output_kwargs["images_kwargs"], return_tensors="pt")
+ pixel_values.append(content['video']['pixel_values_videos'])
+ else:
+ content['video'] = self.image_processor(images = content['video'], **output_kwargs["images_kwargs"], return_tensors="pt")
+ pixel_values.append(content['video']['pixel_values'])
+
+ if 'video_grid_thw' in content['video']:
+ image_grid_thw.extend(content['video']['video_grid_thw'])
+ content['num_vision_tokens'] = self.get_num_vision_tokens(content)
+
+        # Text processing
+ add_generation_prompt = (not is_training and messages[-1]['role'] != 'assistant')
+ strip_final_eos = (not is_training and messages[-1]['role'] == 'assistant')
+ text_inputs = self.tokenizer.apply_chat_template(
+ messages,
+ chat_template = self.chat_template,
+ tokenize=True,
+ tokenizer_kwargs = output_kwargs["text_kwargs"],
+ return_assistant_tokens_mask=True,
+ return_dict=True,
+ add_generation_prompt=add_generation_prompt,
+ strip_final_eos=strip_final_eos,
+ )
+ labels = [-100 if j == 0 else i for i, j in zip(text_inputs['input_ids'], text_inputs['assistant_masks'])]
+ labels = labels[:self.max_seq_len]
+ input_ids = text_inputs['input_ids'][:self.max_seq_len]
+
+ image_token_id = self.tokenizer.convert_tokens_to_ids(self.image_token)
+ if image_token_id in text_inputs['input_ids'][self.max_seq_len:]:
+ raise ValueError(f'Too long sequence! {len(text_inputs["input_ids"])}')
+
+ outputs = {
+ 'input_ids': input_ids,
+ 'labels': labels,
+ 'num_images': num_images,
+ }
+ if len(pixel_values) > 0:
+ outputs['pixel_values'] = torch.cat(pixel_values, dim=0)
+ if len(image_grid_thw) > 0:
+ outputs['image_grid_thw'] = torch.stack(image_grid_thw)
+ return outputs
+
+
+ def preprocess_image(self, pil_img: Union[Image.Image, List[Image.Image]], image_processing_config):
+ if image_processing_config is None:
+ return pil_img
+ images = pil_img
+ if isinstance(pil_img, Image.Image):
+ images = [images]
+ if image_processing_config['do_crop']:
+ images = [self.centralcrop(img, rate=[4, 3]) for img in images]
+ if image_processing_config['do_padding']:
+ images = [self.expand2square(
+ img,
+ # tuple(int(x * 255) for x in self.processor.image_processor.image_mean)
+ tuple(int(x * 255) for x in [0, 0, 0])
+ ) for img in images]
+ if image_processing_config['do_resize']:
+ images = [self.resize2square(img) for img in images]
+ if image_processing_config.get('max_pixels'):
+ images = [self.resize2pixels(
+ img,
+ int(image_processing_config['max_pixels']),
+ int(image_processing_config['min_pixels'])
+ ) for img in images]
+ if isinstance(pil_img, Image.Image):
+ images = images[0]
+ return images
+
+ def expand2square(self, pil_img, background_color):
+ width, height = pil_img.size
+ if width == height:
+ return pil_img
+ elif width > height:
+ result = Image.new(pil_img.mode, (width, width), background_color)
+ result.paste(pil_img, (0, (width - height) // 2))
+ return result
+ else:
+ result = Image.new(pil_img.mode, (height, height), background_color)
+ result.paste(pil_img, ((height - width) // 2, 0))
+ return result
+
+ def resize2square(self, pil_img: Image.Image):
+ width, height = pil_img.size
+ pil_img = pil_img.resize((max(width, height), max(width, height)))
+ return pil_img
+
+ def centralcrop(self, pil_img: Image.Image, rate=[4, 3]):
+ width, height = pil_img.size
+ size = (width, height)
+ min_len = min(size)
+ longer_side = 0 if width >= height else 1
+ center = (width/2, height/2)
+ box = [0, 0, size[0], size[1]]
+
+ # if longer_side == 0:
+ # box[0] = max(0, center[0] - 1/2*min_len/rate[1]*rate[0])
+ # box[2] = min(width, center[0] + 1/2*min_len/rate[1]*rate[0])
+ # else:
+ # box[1] = max(0, center[1] - 1/2*min_len/rate[1]*rate[0])
+ # box[3] = min(height, center[1] + 1/2*min_len/rate[1]*rate[0])
+ box[longer_side] = max(0, center[longer_side] - 1/2*min_len/rate[1]*rate[0])
+ box[2 + longer_side] = min(size[longer_side], center[longer_side] + 1/2*min_len/rate[1]*rate[0])
+
+ # box = (width/2-min_len/2, height/2-min_len/2, width/2+min_len/2, height/2+min_len/2)
+ pil_img = pil_img.crop(box)
+ return pil_img
+
+ def resize2pixels(self, pil_img: Image.Image, max_pixels=None, min_pixels=None):
+ width, height = pil_img.size
+ new_height, new_width = smart_resize(height, width, factor=1, max_pixels=max_pixels, min_pixels=min_pixels)
+ pil_img = pil_img.resize((new_width, new_height))
+ return pil_img
+
+ def get_num_vision_tokens(self, content):
+ if isinstance(self.image_processor, Qwen2VLImageProcessor):
+ merge_length = self.image_processor.merge_size**2
+ if content['type'] == 'image':
+ num_image_tokens = content['image']['image_grid_thw'].prod() // merge_length
+ else:
+ num_image_tokens = content['video']['video_grid_thw'].prod() // merge_length
+ return num_image_tokens
+ else:
+            # Other models: image tokens (-> 2x2 compressed), then append image_newline and image_new.
+            k = 'image' if content['type'] == 'image' else 'video'
+ pixel_values = content[k]['pixel_values'][0]
+ n_frames = len(content[k]['pixel_values'])
+
+ height, width = get_image_size(to_numpy_array(pixel_values))
+ num_image_tokens = (height // (self.patch_size * self.merge_size)) * (width // (self.patch_size * self.merge_size) + 1) + 1
+ return num_image_tokens * n_frames
+
+ def batch_decode(self, *args, **kwargs):
+ """
+ This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
+ refer to the docstring of this method for more information.
+ """
+ return self.tokenizer.batch_decode(*args, **kwargs)
+
+ def decode(self, *args, **kwargs):
+ """
+ This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
+ the docstring of this method for more information.
+ """
+ return self.tokenizer.decode(*args, **kwargs)
+
+ @property
+ def model_input_names(self):
+ tokenizer_input_names = self.tokenizer.model_input_names
+ image_processor_input_names = self.image_processor.model_input_names
+ return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
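The Qwen2-VL branch of `get_num_vision_tokens` above divides the grid volume by the spatial merge factor. A quick standalone check of that arithmetic, using plain integers with hypothetical grid sizes (the real code reads them from `image_grid_thw` / `video_grid_thw` tensors):

```python
# Token counting for the Qwen2-VL branch, sketched with hypothetical grid sizes.
merge_size = 2
merge_length = merge_size ** 2  # self.image_processor.merge_size**2

def num_vision_tokens(t, h, w):
    # grid_thw.prod() // merge_length
    return (t * h * w) // merge_length

print(num_vision_tokens(1, 32, 32))   # one image on a 32x32 patch grid -> 256
print(num_vision_tokens(16, 32, 32))  # 16 temporal grid steps -> 4096
```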
diff --git a/eval_scripts/DREAM-1K/tarsier/dataset/utils.py b/eval_scripts/DREAM-1K/tarsier/dataset/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..70a1b20a1cc0e0e19596e836c62c1eefb0717e93
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/dataset/utils.py
@@ -0,0 +1,186 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import List
+import os
+from PIL import Image, ImageSequence
+import decord
+
+VALID_DATA_FORMAT_STRING = "Input data must be {'.jpg', '.jpeg', '.png', '.tif'} for image; or {'.mp4', '.avi', '.webm', '.mov', '.mkv', '.wmv', '.gif'} for videos!"
+
+# Uniformly sample frames, always including the first and last frame.
+def sample_frame_indices(start_frame, total_frames: int, n_frames: int):
+ if n_frames == 1:
+ return [0] # sample first frame in default
+ sample_ids = [round(i * (total_frames - 1) / (n_frames - 1)) for i in range(n_frames)]
+ sample_ids = [i + start_frame for i in sample_ids]
+ return sample_ids
+
+def sample_video(
+ video_path: str,
+ n_frames: int = None,
+ start_time: int = 0,
+ end_time: int = -1
+ ) -> List[Image.Image]:
+
+ assert os.path.exists(video_path), f"File not found: {video_path}"
+ vr = decord.VideoReader(video_path, num_threads=1, ctx=decord.cpu(0))
+ vr.seek(0)
+ total_frames = len(vr)
+ fps = vr.get_avg_fps()
+
+ start_frame = 0
+ end_frame = total_frames - 1
+ if start_time > 0:
+ start_frame = min((total_frames-1), int(fps*start_time))
+ if end_time > 0:
+ end_frame = max(start_frame, int(fps*end_time))
+ end_frame = min(end_frame, (total_frames-1))
+ frame_indices = sample_frame_indices(
+ start_frame=start_frame,
+ total_frames=end_frame - start_frame + 1,
+ n_frames=n_frames,
+ )
+
+ frames = vr.get_batch(frame_indices).asnumpy()
+ frames = [Image.fromarray(f).convert('RGB') for f in frames]
+ return frames
+
+def sample_gif(
+ gif_path: str,
+ n_frames:int = None,
+ start_time: int = 0,
+ end_time: int = -1
+ ) -> List[Image.Image]:
+
+ assert os.path.exists(gif_path), f"File not found: {gif_path}"
+
+ gif_frames = Image.open(gif_path)
+
+ start_frame = 0
+ end_frame = gif_frames.n_frames - 1
+ frame_indices = sample_frame_indices(
+ start_frame=start_frame,
+ total_frames=end_frame - start_frame + 1,
+ n_frames=n_frames,
+ )
+
+ frames = []
+ i = 0
+ for frame in ImageSequence.Iterator(gif_frames):
+ if i in frame_indices:
+ frames.append(frame.convert('RGB'))
+ i += 1
+ return frames
+
+def sample_image(
+ image_path: str,
+ n_frames: int = None,
+ start_time: int = 0,
+ end_time: int = -1
+ ):
+ assert os.path.exists(image_path), f"File not found: {image_path}"
+ image = Image.open(image_path).convert('RGB')
+ return [image]
+
+def get_visual_type(input_file):
+    ext = os.path.splitext(input_file)[-1].lower()
+ if ext in {'.gif'}:
+ return 'gif'
+ elif ext in {'.mp4', '.avi', '.webm', '.mov', '.mkv', '.wmv'}:
+ return 'video'
+ elif ext in {'.jpg', '.jpeg', '.png', '.tif'}:
+ return 'image'
+ else:
+ print(f"{VALID_DATA_FORMAT_STRING} But found {ext}!")
+ return 'unk'
+
+def get_benchmarks(benchmarks):
+ final_benchmarks = []
+ type2bm = {
+ 'dream': ['dream'],
+ 'caption': ['msvd-caption', 'msr-vtt-caption', 'vatex-caption'],
+ 'mc_qa': ['next-qa', 'egoschema', 'mvbench', 'video-mme'],
+ 'oe_qa': ['msvd-qa', 'msr-vtt-qa', 'tgif-qa', 'anet-qa'],
+ }
+ for bm in benchmarks:
+ bm = bm.lower()
+ if bm in final_benchmarks:
+ continue
+ if bm == 'all':
+ for v in type2bm.values():
+ final_benchmarks.extend(v)
+ return final_benchmarks
+ if bm in type2bm:
+ final_benchmarks.extend(type2bm[bm])
+ else:
+ final_benchmarks.append(bm)
+ return final_benchmarks
+
+def check_data_format(data):
+ for msg in data['messages']:
+ if isinstance(msg['content'], dict):
+ msg['content'] = [msg['content']]
+ for content in msg['content']:
+ assert content['type'] in {'image', 'video', 'text'}, f"content['type']={content['type']} MUST be one of ['image', 'video', 'text']"
+ if content['type'] != "text":
+ media_path_key = f"{content['type']}_file"
+                media_paths = content[content['type']][media_path_key]
+                if isinstance(media_paths, str):
+                    media_paths = [media_paths]
+                for path in media_paths:
+ assert os.path.exists(path), f"File not found: {path}"
+
+def format_one_sample(media_file=None, prompt="Describe the video in detail."):
+ sample = {
+ "messages": []
+ }
+ user_content = {
+ "role": "user",
+ "content": []
+ }
+ if media_file is not None:
+ media_type = get_visual_type(media_file)
+ if media_type in ("video", "gif"):
+ media_type = "video"
+ media_path_key = f"{media_type}_file"
+ user_content["content"].append({
+ "type": media_type,
+ media_type: {
+ media_path_key: media_file,
+ }
+ })
+ user_content["content"].append({
+ "type": "text",
+ "text": prompt
+ })
+
+ assistant_content = {
+ "role": "assistant",
+ "content": []
+ }
+
+ sample["messages"].append(user_content)
+ sample["messages"].append(assistant_content)
+ if media_file is not None:
+ sample["task"] = f"{media_type}/QA"
+ else:
+ sample["task"] = 'text-only'
+ check_data_format(sample)
+ return sample
+
+
+class DictToObject(object):
+ def __init__(self, dictionary):
+ for key, value in dictionary.items():
+ setattr(self, key, value)
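The uniform sampler in this file always keeps the first and last frame of the selected window. The function is small enough to exercise standalone (body copied from `sample_frame_indices` above):

```python
def sample_frame_indices(start_frame, total_frames, n_frames):
    # Uniformly spaced indices over [0, total_frames - 1], shifted by start_frame;
    # both endpoints are always included when n_frames > 1.
    if n_frames == 1:
        return [0]
    sample_ids = [round(i * (total_frames - 1) / (n_frames - 1)) for i in range(n_frames)]
    return [i + start_frame for i in sample_ids]

print(sample_frame_indices(0, 10, 4))   # [0, 3, 6, 9]
print(sample_frame_indices(5, 10, 4))   # [5, 8, 11, 14]
print(sample_frame_indices(0, 100, 8))  # starts at 0, ends at 99
```

Note that the `n_frames == 1` branch returns index 0 of the whole video rather than `start_frame`, so a nonzero `start_time` is ignored in the single-frame case.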
diff --git a/eval_scripts/DREAM-1K/tarsier/evaluation/evaluate.py b/eval_scripts/DREAM-1K/tarsier/evaluation/evaluate.py
new file mode 100644
index 0000000000000000000000000000000000000000..e627524aeca6b21b11e3b7bd421ef9c22151af37
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/evaluation/evaluate.py
@@ -0,0 +1,177 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import os
+import random
+
+from .metrics import CIDErMetric, GPTMetric, DREAMGPTMetric, AccuracyMetric, VideoMMEAccuracyMetric
+import sys
+sys.path.append('eval_scripts/DREAM-1K/tarsier')
+from tools.rw_utils import read_jsonlines
+from tools.color import Color
+from dataset.utils import get_benchmarks
+
+def extract_item_for_eval(data_dict):
+ item = {}
+ prompt, prediction, reference = [], [], []
+ for msg in data_dict['messages']:
+ for content in msg['content']:
+ if content['type'] == 'text':
+ if msg['role'] == 'user':
+ prompt.append(content['text'])
+ elif msg['role'] == 'assistant':
+ if content.get('reference'):
+ reference.append(content['reference'])
+ prediction.append(content['text'])
+
+ item['prompt'] = ''.join(prompt)
+ item['prediction'] = ''.join(prediction)
+ item['response'] = ''.join(reference)
+
+ item['dataset'] = data_dict['dataset']
+ item['idx'] = f"{data_dict['dataset']}_{data_dict['idx']}"
+ extra_info = data_dict.get('extra_info', None)
+ vid = data_dict.get('vid', None)
+ if vid is not None:
+ item['vid'] = vid
+ if extra_info:
+ item['events'] = extra_info.get('events', None)
+ item['extra_info'] = extra_info
+ if 'is_hard' in data_dict:
+ item['is_hard'] = data_dict['is_hard']
+
+ return item
+
+def read_dataset(path, dataset_name):
+ if os.path.isdir(path):
+ lines = []
+ for f in os.listdir(path):
+ if f.endswith('.jsonl'):
+ lines.extend(read_jsonlines(os.path.join(path, f)))
+ else:
+ lines = read_jsonlines(path)
+ dataset = []
+ idxs = set()
+ for l in lines:
+ if l['dataset'].split('/')[0] != dataset_name:
+ continue
+ idx = f"{l['dataset']}_{l['idx']}"
+ if idx in idxs:
+ continue
+
+ idxs.add(idx)
+ item = extract_item_for_eval(l)
+ dataset.append(item)
+ return dataset
+
+METRIC_MAPPING = {
+ 'CIDErMetric': CIDErMetric,
+ 'GPTMetric': GPTMetric,
+ 'AccuracyMetric': AccuracyMetric,
+ 'DREAMGPTMetric': DREAMGPTMetric,
+ 'VideoMMEAccuracyMetric': VideoMMEAccuracyMetric
+}
+
+def evaluate(pred_file, METRIC, dataset_name, sample_num=-1, verbose = False):
+ dataset = read_dataset(pred_file, dataset_name)
+ if len(dataset) == 0:
+ return
+ if sample_num > 0:
+ dataset = random.sample(dataset, sample_num)
+ metric = METRIC(dataset_name=dataset_name, verbose=verbose)
+ metric.process(dataset)
+ metric.summarize_metric()
+ metric.save_results(pred_file)
+ if isinstance(metric, DREAMGPTMetric):
+ metric.save_eval_infos(pred_file)
+
+def evaluate_all(pred_file, METRIC2DATASET, sample_num=-1, verbose = False):
+ for METRIC, dataset_name in METRIC2DATASET:
+ if isinstance(METRIC, str):
+ METRIC = METRIC_MAPPING[METRIC]
+ print(f"### Start Evaluating on {dataset_name}")
+ evaluate(pred_file, METRIC, dataset_name, sample_num, verbose)
+ print(f"### Finish Evaluating on {dataset_name}")
+
+
+if __name__ == '__main__':
+ import argparse
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument('--pred_file', type=str)
+ parser.add_argument('--benchmarks', nargs='+', default=["all"], help="Default as 'all' to evaluate on all benchmarks; Also could be task types: ('dream', 'caption', 'mc_qa', 'oe_qa'); And specific benchmark names: ('dream', 'msvd-caption', 'msr-vtt-caption', 'vatex-caption', 'next-qa', 'egoschema', 'mvbench', 'tvbench', 'video-mme', 'msvd-qa', 'msr-vtt-qa', 'tgif-qa', 'anet-qa', 'favor-bench')")
+ parser.add_argument('--sample_num', type=int, default=-1)
+ parser.add_argument('--verbose', action='store_true')
+
+ args = parser.parse_args()
+
+ args.benchmarks = get_benchmarks(args.benchmarks)
+ print("### Selected Benchmarks:", args.benchmarks)
+
+ Benchmark2Metric = {
+        # Multiple-choice QA
+ 'next-qa': 'AccuracyMetric',
+ 'egoschema': 'AccuracyMetric',
+ 'mvbench': 'AccuracyMetric',
+ 'tvbench': 'AccuracyMetric',
+ 'video-mme': 'VideoMMEAccuracyMetric',
+ 'favor-bench': 'AccuracyMetric',
+
+ # Open-ended QA
+ 'msvd-qa': 'GPTMetric',
+ 'msr-vtt-qa': 'GPTMetric',
+ 'tgif-qa': 'GPTMetric',
+ 'anet-qa': 'GPTMetric',
+
+ # Caption DREAM
+ 'dream': 'DREAMGPTMetric',
+
+ # Caption CIDEr
+ 'msvd-caption': 'CIDErMetric',
+ 'msr-vtt-caption': 'CIDErMetric',
+ 'vatex-caption': 'CIDErMetric',
+ }
+
+ Benchmark2Dataset = {
+ 'dream': 'DREAM',
+
+ 'next-qa': 'Next-QA-val-multi_choice',
+ 'egoschema': 'EgoSchema',
+ 'mvbench': 'MVBench',
+ 'tvbench': 'TVBench',
+ 'video-mme': 'Video-MME',
+ 'favor-bench': 'FAVOR-Bench',
+
+ 'msvd-qa': 'MSVD-QA-val',
+ 'msr-vtt-qa': 'MSR-VTT-QA-val',
+ 'tgif-qa': 'TGIF-QA-test',
+ 'anet-qa': 'ActivityNet-QA-test',
+
+ 'msvd-caption': 'MSVD-Caption-test',
+ 'msr-vtt-caption': 'MSR-VTT-Caption-test',
+ 'vatex-caption': 'VATEX-test',
+ }
+
+ METRIC2DATASET = []
+
+ for bm in args.benchmarks:
+ if bm not in Benchmark2Metric:
+ print(Color.red(f"Unknown benchmark: {bm}"))
+ continue
+
+ METRIC2DATASET.append([Benchmark2Metric[bm], Benchmark2Dataset[bm]])
+
+ evaluate_all(args.pred_file, METRIC2DATASET, args.sample_num, args.verbose)
+
+ # python3 -m evaluation.evaluate --pred_file $pred_file --sample_num=100
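The `--benchmarks` argument accepts task-type aliases that `get_benchmarks` (from `dataset/utils.py`) expands into concrete benchmark names; unknown names pass through unchanged. A standalone sketch of that expansion (logic copied from the utils module):

```python
# Alias expansion as in dataset/utils.py get_benchmarks.
type2bm = {
    'dream': ['dream'],
    'caption': ['msvd-caption', 'msr-vtt-caption', 'vatex-caption'],
    'mc_qa': ['next-qa', 'egoschema', 'mvbench', 'video-mme'],
    'oe_qa': ['msvd-qa', 'msr-vtt-qa', 'tgif-qa', 'anet-qa'],
}

def get_benchmarks(benchmarks):
    final = []
    for bm in benchmarks:
        bm = bm.lower()
        if bm in final:
            continue
        if bm == 'all':  # 'all' short-circuits and returns every benchmark
            for v in type2bm.values():
                final.extend(v)
            return final
        if bm in type2bm:
            final.extend(type2bm[bm])
        else:
            final.append(bm)  # specific benchmark name, kept as-is
    return final

print(get_benchmarks(['caption', 'dream']))
# ['msvd-caption', 'msr-vtt-caption', 'vatex-caption', 'dream']
print(get_benchmarks(['tvbench']))  # ['tvbench']
```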
diff --git a/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/__init__.py b/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..58053a3597dab8519ea5b8b300bda9d7e20b734a
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/__init__.py
@@ -0,0 +1,5 @@
+from .evaluate_caption_cider import CIDErMetric
+from .evaluate_qa_oe_gpt import GPTMetric
+from .evaluate_qa_mc import AccuracyMetric
+from .evaluate_dream_gpt import DREAMGPTMetric
+from .evaluate_video_mme import VideoMMEAccuracyMetric
diff --git a/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_caption_cider.py b/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_caption_cider.py
new file mode 100644
index 0000000000000000000000000000000000000000..69ecfc6b164691cb0627fcabd4afe057bf8a4265
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_caption_cider.py
@@ -0,0 +1,82 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import json
+from typing import List, Dict
+import os
+from pycocoevalcap.cider.cider import Cider
+import sys
+sys.path.append('eval_scripts/DREAM-1K/tarsier')
+from tools.ptbtokenizer import PTBTokenizer
+
+from tools.color import Color
+
+class CIDErMetric:
+ def __init__(self, dataset_name, verbose=False) -> None:
+ self.dataset_name = dataset_name
+ self.tokenizer = PTBTokenizer()
+ self.scorer = Cider()
+ self.score = None
+ self.results = []
+ self.dataset = []
+ self.verbose = verbose
+
+ def add(self, data):
+ self.dataset.append(data)
+
+ def process(self, dataset: List[Dict]):
+ references, predictions = {}, {}
+ for i, data in enumerate(dataset):
+ ref = data['response']
+ pred = data['prediction']
+
+ if isinstance(ref, str):
+ ref = [ref]
+
+ references[i] = [{'caption': r.lower()} for r in ref]
+ predictions[i] = [{'caption': pred.lower()}]
+
+ references = self.tokenizer.tokenize(references)
+ predictions = self.tokenizer.tokenize(predictions)
+ score, scores = self.scorer.compute_score(references, predictions)
+ self.score = score
+ for data, s in zip(dataset, scores):
+ self.results.append({
+ 'score': s,
+ 'data': data,
+ })
+
+ def summarize_metric(self):
+ if self.verbose:
+ for result in self.results:
+ print(Color.blue(json.dumps(result['data'])))
+ print(Color.red(f"CIDEr score: {result['score']}"))
+ print(f'=====Evaluation Summary=====')
+ self.eval_records = [
+ f'Dataset: {self.dataset_name}\tMetric: CIDEr',
+ f'#Successful Results: {len(self.results)}',
+ f'CIDEr score: {round(self.score*100, 1)}'
+ ]
+ for info in self.eval_records:
+ print(info)
+
+ def save_results(self, pred_path):
+ if os.path.isdir(pred_path):
+ output_dir = os.path.join(pred_path, 'eval_records')
+ else:
+ output_dir = os.path.join(os.path.dirname(pred_path), 'eval_records')
+ os.makedirs(output_dir, exist_ok=True)
+ fout = open(os.path.join(output_dir, f'{self.dataset_name}_eval_result.txt'), 'w')
+ for info in self.eval_records:
+ fout.write(info+'\n')
+ fout.close()
diff --git a/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_dream_gpt.py b/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_dream_gpt.py
new file mode 100644
index 0000000000000000000000000000000000000000..825de27bc1018271837cdd4d16bbfec54db4aaf0
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_dream_gpt.py
@@ -0,0 +1,436 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import json
+import numpy as np
+import ast
+import time
+from typing import List, Dict
+from tqdm import tqdm
+from pathos.multiprocessing import ProcessingPool as Pool
+import func_timeout
+from func_timeout import func_set_timeout
+
+import sys
+sys.path.append('eval_scripts/DREAM-1K/tarsier')
+import re
+import os
+from copy import deepcopy
+from traceback import format_exc
+
+import openai
+
+try:
+    with open("apikey.txt", "r") as f:
+        api_key = f.read().strip()
+except OSError:
+    api_key = ''
+
+def call_gpt35(msg):
+ while True:
+ try:
+ response = openai.ChatCompletion.create(
+ model="gpt-3.5-turbo",
+ messages=msg,
+ api_key=api_key,
+ request_timeout=5)
+ break
+        except Exception:
+ print("Timeout, retrying...")
+ time.sleep(5)
+
+ output_text = response['choices'][0]['message']['content']
+ return output_text
+
+def count_f1(r, p):
+    if r + p == 0:
+        return 0.0
+    return 2 * r * p / (r + p)
+
+
+def call_azure_gpt_api(events, reference, prediction, model):
+ if len(events) == 0:
+ events = [reference.replace('\n', ' ')]
+ messages=[
+ {
+ "role": "user",
+ "content":
+ "Given a video description and a list of events. For each event, classify the relationship between the video description and the event into three classes: entailment, neutral, contradiction.\n"
+ "- \"entailment\" means that the video description entails the event.\n"
+ "- \"contradiction\" means that some detail in the video description contradicts with the event.\n"
+                "- \"neutral\" means that the relationship is neither \"entailment\" nor \"contradiction\".\n\n"
+ f"Video Description:\n{prediction}\n\n"
+ f"Events: {events}\n"
+
+ "Output a JSON formed as:\n"
+ "{\n"
+ " \"events\": [\n"
+ " {\"event\": \"copy an event here\", \"relationship\": \"put class name here\", \"reason\": \"give your reason here\"},\n"
+ " ...\n"
+ " ]\n"
+ "}\n\n"
+ "DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only output the JSON. Output:"
+ }
+ ]
+
+ completion = call_gpt35(messages)
+ return completion
+
+
+def call_azure_gpt_api_for_events(caption, model):
+ messages=[
+ {
+ "role": "user",
+ "content":
+            "Below is a description of a video clip:\n"
+ f"Video Description: {caption}\n\n"
+
+            "Extract at most 10 key events from the above video description paragraph. Requirements:\n"
+ "- An event must include an action, motion or movement (NOT STATIC INFORMATION). DON'T repeat same events.\n"
+ "- Every event is represented by a brief sentence within 10 words, with a subject, a predicate and optionally an object, avoid unnecessary appearance descriptions.\n"
+ "- Every event must be atomic, meaning that it cannot be further split into multiple events.\n"
+ "- Scene cuts and camera motions are NOT events.\n"
+ "- Substitute pronouns by the nouns they refer to.\n\n"
+ "Please generate the response in the form of a Python dictionary string with keys \"events\". The value of \"events\" is a List(str), of which each item is an event. "
+ "DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string. "
+ "For example, your response should look like this: {\"events\": [event1, event2, ...]}"
+ }
+ ]
+
+ completion = call_gpt35(messages)
+ return completion
+
+def try_call_api_for_eval(events, answer, prediction, model, verbose=False, max_retry=100):
+ for i in range(max_retry):
+ gpt_q = call_azure_gpt_api(events, answer, prediction, model)
+ if gpt_q is not None:
+ gpt_q = gpt_q.strip()
+ gpt_q = re.sub(r'\n+', '\n', gpt_q)
+ gpt_q = re.sub(r'\s+', ' ', gpt_q)
+
+ if gpt_q.startswith("```json"):
+ gpt_q = gpt_q.replace("```json", "").replace("```", "").strip()
+ elif gpt_q.startswith("```python"):
+ gpt_q = gpt_q.replace("```python", "").replace("```", "").strip()
+ if not gpt_q.startswith('{'):
+ gpt_q = '{' + gpt_q
+ if not gpt_q.endswith('}'):
+ gpt_q = gpt_q + '}'
+ gpt_q = gpt_q.replace("True", "true").replace("False", "false")
+ gpt_q = gpt_q.replace("} {", "}, {").replace("}{", "}, {")
+ gpt_q = gpt_q.replace(",\n}", "\n}").replace(", \n}", "\n}").replace(", }", "}").replace(",}", "}")
+ gpt_q = gpt_q.replace(",\n]", "\n]").replace(", \n]", "\n]").replace(", ]", "]").replace(",]", "]")
+ gpt_q = gpt_q.replace("[Placeholder]", "null")
+ gpt_q = gpt_q.replace("{Events:", "").strip()
+
+ return gpt_q, True
+
+ return f"Exceed max try: {max_retry}", False
+
+def try_call_api_for_events(caption, model, verbose=False):
+ for i in range(100):
+ gpt_q = call_azure_gpt_api_for_events(caption, model)
+ if gpt_q is not None:
+ if gpt_q.startswith("```json"):
+ gpt_q = gpt_q.replace("```json", "").replace("```", "").strip()
+ elif gpt_q.startswith("```python"):
+ gpt_q = gpt_q.replace("```python", "").replace("```", "").strip()
+ return gpt_q, True
+
+    return "Exceed max try: 100", False
+
+def extract_events(inputs, is_pred=False, max_retry=100):
+ data, model, verbose = inputs
+ if is_pred:
+ caption = data['prediction'].lower()
+ else:
+ caption = data['response'].lower()
+ caption = caption.replace("\"", "\'")
+ retry = 0
+    while retry < max_retry:
+        ...  # (remainder of extract_events truncated in this diff)
+
+
+class DREAMGPTMetric:
+    def __init__(self, dataset_name, verbose=False) -> None:
+ self.dataset_name = dataset_name
+ self.num_worker = 64
+ # self.model = 'gpt-35-turbo'
+ self.model = 'gpt-35-turbo-0125'
+ # self.model='gpt-4-1106-preview'
+ self.results = []
+ self.invalid_results = []
+ self.dataset = []
+ self.verbose = verbose
+ self.eval_infos = []
+        self.buckets = {
+            "subjects": {'<=1': [], '==2': [], '==3': [], '>=4': []},
+            "shots": {'<=1': [], '==2': [], '==3': [], '>=4': []},
+            "events": {'<=3': [], 'in [4, 5]': [], 'in [6, 7]': [], '>=8': []},
+        }
+
+ def add(self, data):
+ self.dataset.append(data)
+
+ def select_bucket(self, bucket_name, num):
+ for key in self.buckets[bucket_name]:
+ if eval(f"{num}{key}"):
+ return key
+ return ''
+
+ def add_to_bucket(self, bucket_name, data):
+ sub_bucket = self.select_bucket(bucket_name, data['result']['extra_info'][f'n_{bucket_name}'])
+ if sub_bucket:
+ self.buckets[bucket_name][sub_bucket].append(data)
+
+ def process(self, dataset: List[Dict]):
+ self._process_group_by_subtask(dataset)
+
+ def _process(self, dataset: List[Dict], subtask=None):
+        pool = Pool(processes=self.num_worker)
+ inputs = [(d, self.model, self.verbose) for d in dataset]
+ results = pool.uimap(process_one_sample, inputs, chunksize = 1)
+
+ for result in tqdm(results, total = len(dataset), desc=f'eval {subtask}'):
+ if subtask:
+ result['subtask'] = subtask
+ self.update_metric(result)
+ pool.close()
+ pool.join()
+ pool.clear() # MUST
+
+ def _process_group_by_subtask(self, dataset: List[Dict]):
+ def _group_by_subtask(dataset):
+ subtasks = {}
+ for data in dataset:
+ if data['dataset'] not in subtasks:
+ subtasks[data['dataset']] = []
+ subtasks[data['dataset']].append(data)
+ return subtasks
+ subtasks = _group_by_subtask(dataset)
+ for subtask, subdata in subtasks.items():
+ self._process(subdata, subtask)
+
+ def update_metric(self, result):
+ if result['success']:
+ self.results.append(result)
+ else:
+ self.invalid_results.append(result)
+
+ def summarize_metric(self):
+ self._summarize_metric_by_subtask()
+ self._summarize_metric_by_bucket()
+
+ def _summarize_metric_by_subtask(self):
+ from prettytable import PrettyTable
+ self.table = PrettyTable(['Task', 'F1 Score', 'Action Recall', 'Action Precision', 'Success', 'Failed'])
+ def _group_by_subtask():
+ sub_results = {}
+ sub_invalid_results = {}
+ for data in self.results:
+ if data['subtask'] not in sub_results:
+ sub_results[data['subtask']] = []
+ sub_results[data['subtask']].append(data)
+ for data in self.invalid_results:
+ if data['subtask'] not in sub_invalid_results:
+ sub_invalid_results[data['subtask']] = []
+ sub_invalid_results[data['subtask']].append(data)
+ return sub_results, sub_invalid_results
+ sub_results, sub_invalid_results = _group_by_subtask()
+ overall_avg_recall = []
+ overall_avg_precision = []
+ subtasks = list(sub_results.keys())
+ subtasks.sort()
+ for subtask in subtasks:
+ sub_rsts = sub_results[subtask]
+ sub_in_rsts = sub_invalid_results.get(subtask, [])
+ recalls = []
+ precisions = []
+ for result in sub_rsts:
+ r, p, infos = result['result']['score_r'], result['result']['score_p'], result['result']['eval_infos']
+ recalls.append(r)
+ precisions.append(p)
+ self.eval_infos.append(infos)
+ avg_recall = np.average(recalls)
+ avg_precision = np.average(precisions)
+ f1 = count_f1(avg_recall, avg_precision)
+ overall_avg_recall.append(avg_recall)
+ overall_avg_precision.append(avg_precision)
+ task_name = subtask
+ self.table.add_row([task_name, round(f1, 3), round(avg_recall, 3), round(avg_precision, 3), len(sub_rsts), len(sub_in_rsts)])
+ overall_recall = np.average(overall_avg_recall)
+ overall_precision = np.average(overall_avg_precision)
+ overall_f1 = count_f1(overall_recall, overall_precision)
+ self.table.add_row(['OVERALL', round(overall_f1, 3), round(overall_recall, 3), round(overall_precision, 3), len(self.results), len(self.invalid_results)])
+ print(f'=====DREAM Evaluation Summary=====')
+ print(self.table)
+
+
+ def _summarize_metric_by_bucket(self):
+ from prettytable import PrettyTable
+ self.bucket_tables = []
+ for bucket in self.buckets:
+ table = PrettyTable(['Score'] + list(self.buckets[bucket].keys()))
+ for data in self.results:
+ self.add_to_bucket(bucket_name=bucket, data=data)
+ bucket_result = {}
+ for sub_bucket in self.buckets[bucket]:
+ recalls = []
+ precisions = []
+ for result in self.buckets[bucket][sub_bucket]:
+ r, p = result['result']['score_r'], result['result']['score_p']
+ recalls.append(r)
+ precisions.append(p)
+ avg_recall = np.average(recalls)
+ avg_precision = np.average(precisions)
+ f1 = count_f1(avg_recall, avg_precision)
+ bucket_result[sub_bucket] = (avg_recall, avg_precision, f1)
+
+            scores = ['Recall', 'Precision', 'F1']
+            for i, score_name in enumerate(scores):
+                row = [score_name]
+                for sub_bucket in bucket_result:
+                    row.append(round(bucket_result[sub_bucket][i], 3))
+                table.add_row(row)
+ sample_num = ['Count']
+ for k in self.buckets[bucket]:
+ sample_num.append(len(self.buckets[bucket][k]))
+ table.add_row(sample_num)
+ bucket_info = f'\n=====DREAM Evaluation Split by Bucket #{bucket}====='
+ print(bucket_info)
+ print(table)
+ self.bucket_tables.append(bucket_info)
+ self.bucket_tables.append(deepcopy(table))
+
+ def save_results(self, pred_path):
+ if os.path.isdir(pred_path):
+ output_dir = os.path.join(pred_path, 'eval_records')
+ else:
+ output_dir = os.path.join(os.path.dirname(pred_path), 'eval_records')
+ os.makedirs(output_dir, exist_ok=True)
+ model_flag = os.path.basename(pred_path).split('.')[0]
+ fout = open(os.path.join(output_dir, f'{self.dataset_name}_{model_flag}_eval_result.txt'), 'w')
+ print(self.table, file=fout)
+ for bucket_info in self.bucket_tables:
+            print(bucket_info, file=fout)
+ fout.close()
+
+ def save_eval_infos(self, pred_path):
+ if os.path.isdir(pred_path):
+ output_dir = os.path.join(pred_path, 'eval_records')
+ else:
+ output_dir = os.path.join(os.path.dirname(pred_path), 'eval_records')
+ os.makedirs(output_dir, exist_ok=True)
+ model_flag = os.path.basename(pred_path).split('.')[0]
+        out_path = os.path.join(output_dir, f'DREAM_{model_flag}_eval_infos.jsonl')
+        fout = open(out_path, 'w')
+        for info in self.eval_infos:
+            fout.write(json.dumps(info) + '\n')
+        fout.close()
+        print(f"DREAM evaluation information saved in: {out_path}", flush=True)
+
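`try_call_api_for_eval` above defensively repairs common formatting glitches in GPT replies before JSON parsing. A reduced sketch of those string fixes applied to a typical malformed reply (the sample reply text is made up, and only a subset of the normalizations is shown):

```python
import json

def repair_gpt_json(raw):
    # Subset of the normalizations in try_call_api_for_eval:
    # strip code fences, wrap in braces, fix booleans, drop trailing commas.
    s = raw.strip()
    s = s.replace("```json", "").replace("```python", "").replace("```", "").strip()
    if not s.startswith('{'):
        s = '{' + s
    if not s.endswith('}'):
        s = s + '}'
    s = s.replace("True", "true").replace("False", "false")
    for bad, good in [(",\n]", "\n]"), (", ]", "]"), (",]", "]"),
                      (",\n}", "\n}"), (", }", "}"), (",}", "}")]:
        s = s.replace(bad, good)
    return s

raw = '```json\n{"events": [{"event": "a dog barks", "relationship": "entailment", "reason": "stated"}, ]}\n```'
parsed = json.loads(repair_gpt_json(raw))
print(parsed["events"][0]["relationship"])  # entailment
```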
diff --git a/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_qa_mc.py b/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_qa_mc.py
new file mode 100644
index 0000000000000000000000000000000000000000..6450b22216a3173b595b1639fe60bea09851fe0a
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_qa_mc.py
@@ -0,0 +1,159 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import json
+import numpy as np
+import os
+from typing import List, Dict
+import sys
+sys.path.append('eval_scripts/DREAM-1K/tarsier')
+from tools.color import Color
+
+
+class AccuracyMetric:
+ def __init__(self, dataset_name, verbose=False) -> None:
+ self.dataset_name = dataset_name
+ self.results = []
+ self.invalid_results = []
+ self.dataset = []
+ self.verbose = verbose
+
+ def add(self, data):
+ self.dataset.append(data)
+
+ def process(self, dataset: List[Dict]):
+ if self.dataset_name in ['MVBench', 'TVBench', 'FAVOR-Bench']:
+ return self._process_group_by_subtask(dataset)
+ else:
+ return self._process(dataset)
+
+ def _process(self, dataset: List[Dict], subtask=None):
+ for data in dataset:
+ prompt, response, prediction = data['prompt'], data['response'], data['prediction']
+ prediction = prediction.replace('(', '').replace(')', '').strip()
+ response = response.replace('(', '').replace(')', '').strip()[0]
+ if len(prediction) <= 0:
+ success = False
+ else:
+ prediction = prediction[0]
+ if '0'<=prediction<='5':
+ prediction = chr(int(prediction) + ord('A'))
+ success = prediction.isupper() and prediction.isalpha() and len(prediction) == 1
+ if success:
+ rst = {
+ 'success': success,
+ 'data': data,
+ 'result': {'acc': response == prediction}
+ }
+ if subtask:
+ rst['subtask'] = subtask
+ self.results.append(rst)
+ else:
+ rst = {
+ 'success': success,
+ 'data': data,
+ 'result': {'acc': response == prediction}
+ }
+ if subtask:
+ rst['subtask'] = subtask
+ self.invalid_results.append(rst)
+
+ def _process_group_by_subtask(self, dataset: List[Dict]):
+ def _group_by_subtask(dataset):
+ subtasks = {}
+ for data in dataset:
+ if data['dataset'] not in subtasks:
+ subtasks[data['dataset']] = []
+ subtasks[data['dataset']].append(data)
+ return subtasks
+ subtasks = _group_by_subtask(dataset)
+ for subtask, subdata in subtasks.items():
+ self._process(subdata, subtask)
+
+ def summarize_metric(self):
+ if self.dataset_name in ['MVBench', 'TVBench', 'FAVOR-Bench']:
+ return self._summarize_metric_by_subtask()
+ else:
+ return self._summarize_metric()
+
+ def _summarize_metric(self):
+ if self.verbose:
+ for result in self.results + self.invalid_results:
+ print(f"{Color.red('Success: ' + str(result['success']))}")
+ print(Color.blue(json.dumps(result['data'], ensure_ascii=False)))
+ print(f"{Color.green('Accuracy: ' + str(result['result']['acc']))}")
+
+ accs = []
+ for result in self.results:
+ acc = result['result']['acc']
+ accs.append(acc)
+ avg_acc = np.average(accs)
+
+ self.eval_records = [
+ f'=====Evaluation Summary=====',
+ f'Dataset: {self.dataset_name}\tMetric: Accuracy',
+ f'#Successful Results: {len(self.results)}\n#Failed Results: {len(self.invalid_results)}',
+ f'Accuracy: {round(avg_acc*100, 1)}',
+ ]
+ for info in self.eval_records:
+ print(info)
+
+ def _summarize_metric_by_subtask(self):
+ from prettytable import PrettyTable
+ self.table = PrettyTable(['Task','Accuracy','Success','Failed'])
+ def _group_by_subtask():
+ sub_results = {}
+ sub_invalid_results = {}
+ for data in self.results:
+ if data['subtask'] not in sub_results:
+ sub_results[data['subtask']] = []
+ sub_results[data['subtask']].append(data)
+ for data in self.invalid_results:
+ if data['subtask'] not in sub_invalid_results:
+ sub_invalid_results[data['subtask']] = []
+ sub_invalid_results[data['subtask']].append(data)
+ return sub_results, sub_invalid_results
+ sub_results, sub_invalid_results = _group_by_subtask()
+ oa_accs = []
+ subtasks = list(sub_results.keys())
+ # subtasks.sort(key=lambda x:f"{x.split('/')[-1].split(' ')[0][0]}{x.split('/')[-1].split(' ')[1][0]}")
+ subtasks.sort(key=lambda x:x.split('/')[-1])
+ for subtask in subtasks:
+ sub_rsts = sub_results[subtask]
+ sub_in_rsts = sub_invalid_results.get(subtask, [])
+ accs = []
+ for result in sub_rsts:
+ acc = result['result']['acc']
+ accs.append(acc)
+ oa_accs.append(acc)
+ avg_acc = np.average(accs)
+ # task_name = f"{subtask.split('/')[-1].split(' ')[0][0]}{subtask.split('/')[-1].split(' ')[1][0]}"
+ task_name = subtask.split('/')[-1]
+ self.table.add_row([task_name, round(avg_acc*100, 1), len(sub_rsts), len(sub_in_rsts)])
+ self.table.add_row(['OVERALL', round(np.average(oa_accs)*100, 1), len(self.results), len(self.invalid_results)])
+ print(f'=====Evaluation Summary=====')
+ print(self.table)
+
+ def save_results(self, pred_path):
+ if os.path.isdir(pred_path):
+ output_dir = os.path.join(pred_path, 'eval_records')
+ else:
+ output_dir = os.path.join(os.path.dirname(pred_path), 'eval_records')
+ os.makedirs(output_dir, exist_ok=True)
+        with open(os.path.join(output_dir, f'{self.dataset_name}_eval_result.txt'), 'w') as fout:
+            if self.dataset_name in ['MVBench', 'TVBench', 'FAVOR-Bench']:
+                print(self.table, file=fout)
+            else:
+                for info in self.eval_records:
+                    fout.write(info+'\n')
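The subtask summary above groups records by their `subtask` key, sorts groups by the tail of the subtask name, and reports per-subtask accuracy plus a micro-averaged overall row. A minimal standalone sketch of that aggregation, using toy records in the same shape (subtask names are illustrative):

```python
import numpy as np

# Toy records shaped like the entries stored in self.results:
# each carries a 'subtask' key and a boolean accuracy flag.
results = [
    {'subtask': 'MVBench/Action Count', 'result': {'acc': True}},
    {'subtask': 'MVBench/Action Count', 'result': {'acc': False}},
    {'subtask': 'MVBench/Object Shuffle', 'result': {'acc': True}},
]

# Group accuracies by subtask.
grouped = {}
for r in results:
    grouped.setdefault(r['subtask'], []).append(r['result']['acc'])

# Per-subtask accuracy, sorted by the tail of the subtask name.
for subtask in sorted(grouped, key=lambda x: x.split('/')[-1]):
    print(subtask, round(float(np.average(grouped[subtask])) * 100, 1))

# Micro-averaged overall accuracy over every individual sample.
overall = [acc for accs in grouped.values() for acc in accs]
print('OVERALL', round(float(np.average(overall)) * 100, 1))  # -> 66.7
```

Note that the OVERALL row is a per-sample (micro) average, so larger subtasks carry proportionally more weight than they would in a mean of per-subtask means.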
diff --git a/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_qa_oe_gpt.py b/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_qa_oe_gpt.py
new file mode 100644
index 0000000000000000000000000000000000000000..87111fa0c7f7e5077d8aaff5a62a42ecb7813740
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_qa_oe_gpt.py
@@ -0,0 +1,153 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import json
+import numpy as np
+import ast
+import time
+from typing import List, Dict
+from tqdm import tqdm
+from pathos.multiprocessing import ProcessingPool as Pool
+import func_timeout
+from func_timeout import func_set_timeout
+import os
+import sys
+sys.path.append('eval_scripts/DREAM-1K/tarsier')
+from tools.color import Color
+
+
+def call_azure_gpt_api(question, answer, prediction, model):
+
+ messages=[
+ {
+ "role": "system",
+ "content":
+ "You are an intelligent chatbot designed for evaluating the correctness of generative outputs for question-answer pairs. "
+ "Your task is to compare the predicted answer with the correct answer and determine if they match meaningfully. Here's how you can accomplish the task:"
+ "------"
+ "##INSTRUCTIONS: "
+ "- Focus on the meaningful match between the predicted answer and the correct answer.\n"
+ "- Consider synonyms or paraphrases as valid matches.\n"
+ "- Evaluate the correctness of the prediction compared to the answer."
+ },
+ {
+ "role": "user",
+ "content":
+ "Please evaluate the following video-based question-answer pair:\n\n"
+ f"Question: {question}\n"
+ f"Correct Answer: {answer}\n"
+ f"Predicted Answer: {prediction}\n\n"
+ "Provide your evaluation only as a yes/no and score where the score is an integer value between 0 and 5, with 5 indicating the highest meaningful match. "
+ "Please generate the response in the form of a Python dictionary string with keys 'pred' and 'score', where value of 'pred' is a string of 'yes' or 'no' and value of 'score' is in INTEGER, not STRING."
+ "DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string. "
+            "For example, your response should look like this: {'pred': 'yes', 'score': 4}"
+ }
+ ]
+ completion = call_gpt35(messages)
+ return completion
+
+def try_call_api(question, answer, prediction, model, verbose=False):
+    for _ in range(5):  # retry up to 5 times on empty API responses
+ gpt_q = call_azure_gpt_api(question, answer, prediction, model)
+ if gpt_q is not None:
+ return gpt_q, True
+
+ return None, False
+
+def process_one_sample(inputs):
+ data, model, verbose = inputs
+ prompt, response, prediction = data['question'], data['response'].lower(), data['prediction'].lower()
+ result = None
+ try:
+ result, success = try_call_api(prompt, response, prediction, model, verbose)
+ if not success:
+ raise ValueError(result)
+ result = ast.literal_eval(result)
+ pred, score = result['pred'], result['score']
+ # check pred
+ if pred not in ['yes', 'no']:
+ raise ValueError()
+ # check score
+        score = float(score)
+        result['score'] = score
+        if score < 0 or score > 5:
+            raise ValueError()
+ except Exception as e:
+ if verbose:
+ print(e)
+ print(f'invalid GPT response: {result}')
+ return {'success': False, 'result': result, 'data': data}
+ return {'success': True, 'result': result, 'data': data}
+
+class GPTMetric:
+ def __init__(self, dataset_name, verbose=False) -> None:
+ self.dataset_name = dataset_name
+ self.num_worker = 64
+ self.model = 'gpt-35-turbo-0125'
+ self.results = []
+ self.invalid_results = []
+ self.dataset = []
+ self.verbose = verbose
+
+ def add(self, data):
+ self.dataset.append(data)
+
+ def process(self, dataset: List[Dict]):
+        pool = Pool(processes=self.num_worker)
+        inputs = [(d, self.model, self.verbose) for d in dataset]
+        results = pool.uimap(process_one_sample, inputs, chunksize=1)
+
+        for result in tqdm(results, total=len(dataset)):
+ self.update_metric(result)
+ pool.close()
+ pool.join()
+ pool.clear() # MUST
+
+ def update_metric(self, result):
+ if result['success']:
+ self.results.append(result)
+ else:
+ self.invalid_results.append(result)
+
+ def summarize_metric(self):
+ if self.verbose:
+ for result in self.results + self.invalid_results:
+ print(f"Success: {Color.red(str(result['success']))}")
+ print(Color.blue(json.dumps(result['data'], ensure_ascii=False)))
+ print(Color.green(json.dumps(result['result'], ensure_ascii=False)))
+ preds, scores = [], []
+ for result in self.results:
+ pred, score = result['result']['pred'], result['result']['score']
+ preds.append(pred)
+ scores.append(score)
+ avg_score = np.average(scores)
+ acc = np.average([p == 'yes' for p in preds])
+ print(f'=====Evaluation Summary=====')
+ self.eval_records = [
+ f'Dataset: {self.dataset_name}\tMetric: GPT Accuracy',
+ f'#Successful Results: {len(self.results)}\n#Failed Results: {len(self.invalid_results)}',
+ f'Accuracy: {round(acc*100, 1)}',
+ f'Average Score: {round(avg_score, 3)}',
+ ]
+ for info in self.eval_records:
+ print(info)
+
+ def save_results(self, pred_path):
+ if os.path.isdir(pred_path):
+ output_dir = os.path.join(pred_path, 'eval_records')
+ else:
+ output_dir = os.path.join(os.path.dirname(pred_path), 'eval_records')
+ os.makedirs(output_dir, exist_ok=True)
+        with open(os.path.join(output_dir, f'{self.dataset_name}_eval_result.txt'), 'w') as fout:
+            for info in self.eval_records:
+                fout.write(info+'\n')
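The judge prompt above asks for a bare Python dict literal, and `process_one_sample` accepts a reply only after `ast.literal_eval` plus the pred/score range checks pass. A standalone sketch of that validation path (`parse_judge_response` is a hypothetical helper name, not part of the script):

```python
import ast

def parse_judge_response(text):
    # Parse a reply like "{'pred': 'yes', 'score': 4}" and apply the
    # same checks process_one_sample uses before accepting a result.
    result = ast.literal_eval(text)
    if result['pred'] not in ('yes', 'no'):
        raise ValueError(f"invalid pred: {result['pred']!r}")
    result['score'] = float(result['score'])
    if not 0 <= result['score'] <= 5:
        raise ValueError(f"score out of range: {result['score']}")
    return result

print(parse_judge_response("{'pred': 'yes', 'score': 4}"))
# -> {'pred': 'yes', 'score': 4.0}
```

A reply wrapped in extra prose fails `literal_eval`, which is what routes a sample into `invalid_results` once the retries are exhausted.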
diff --git a/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_video_mme.py b/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_video_mme.py
new file mode 100644
index 0000000000000000000000000000000000000000..1069b81050f6c88f21880a5dd648604532881bc1
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_video_mme.py
@@ -0,0 +1,358 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import json
+from typing import List, Dict
+from typing import Optional, List, Union
+import os
+import sys
+sys.path.append('eval_scripts/DREAM-1K/tarsier')
+from tools.color import Color
+
+CATEGORIES = [
+ "Knowledge",
+ "Film & Television",
+ "Sports Competition",
+ "Artistic Performance",
+ "Life Record",
+ "Multilingual"
+]
+
+SUB_CATEGORIES = [
+ "Humanity & History",
+ "Literature & Art",
+ "Biology & Medicine",
+ "Finance & Commerce",
+ "Astronomy",
+ "Geography",
+ "Law",
+ "Life Tip",
+ "Technology",
+ "Animation",
+ "Movie & TV Show",
+ "Documentary",
+ "News Report",
+ "Esports",
+ "Basketball",
+ "Football",
+ "Athletics",
+ "Other Sports",
+ "Stage Play",
+ "Magic Show",
+ "Variety Show",
+ "Acrobatics",
+ "Handicraft",
+ "Food",
+ "Fashion",
+ "Daily Life",
+ "Travel",
+ "Pet & Animal",
+ "Exercise",
+ "Multilingual"
+]
+
+TASK_CATEGORIES = [
+ "Temporal Perception",
+ "Spatial Perception",
+ "Attribute Perception",
+ "Action Recognition",
+ "Object Recognition",
+ "OCR Problems",
+ "Counting Problem",
+ "Temporal Reasoning",
+ "Spatial Reasoning",
+ "Action Reasoning",
+ "Object Reasoning",
+ "Information Synopsis",
+]
+
+
+class VideoMMEAccuracyMetric:
+ def __init__(self, dataset_name, verbose=False) -> None:
+ self.dataset_name = dataset_name
+ self.results = []
+ self.invalid_results = []
+ self.dataset = []
+ self.verbose = verbose
+
+ def add(self, data):
+ self.dataset.append(data)
+
+ def process(self, dataset: List[Dict]):
+ return self._process(dataset)
+
+ def _process(self, dataset: List[Dict]):
+ for data in dataset:
+ prompt, response, prediction = data['prompt'], data['response'], data['prediction']
+ extra_info = data['extra_info']
+ prediction = prediction.replace('(', '').replace(')', '').strip()
+ if len(prediction) <= 0:
+ success = False
+ else:
+ prediction = prediction[0]
+            if '1' <= prediction <= '5':
+                # Map a digit choice to its letter: '1' -> 'A', ..., '5' -> 'E'
+                prediction = chr(int(prediction) - 1 + ord('A'))
+ success = prediction.isupper() and prediction.isalpha() and len(prediction) == 1
+ if success:
+ rst = {
+ 'success': success,
+ 'data': data,
+ 'result': {'acc': response == prediction},
+ 'extra_info': extra_info,
+ 'missing': False
+ }
+ self.results.append(rst)
+ else:
+ rst = {
+ 'success': success,
+ 'data': data,
+ 'result': {'acc': False},
+ 'extra_info': extra_info,
+ 'missing': True
+ }
+ self.results.append(rst)
+ self.invalid_results.append(rst)
+
+ def summarize_metric(self):
+ if self.verbose:
+ for result in self.results + self.invalid_results:
+ print(f"{Color.red('Success: ' + str(result['success']))}")
+ print(Color.blue(json.dumps(result['data'], ensure_ascii=False)))
+ print(f"{Color.green('Accuracy: ' + str(result['result']['acc']))}")
+ print(f'=====Evaluation Summary=====')
+ print(f'Dataset: {self.dataset_name}\tMetric: Accuracy')
+ print(f'#Successful Results: {len(self.results) - len(self.invalid_results)}\n#Failed Results: {len(self.invalid_results)}')
+ self.eval_your_results(
+ video_types = ["short","medium","long"],
+ skip_missing = True,
+ return_categories_accuracy = True,
+ return_sub_categories_accuracy = False,
+ return_task_types_accuracy = False,
+ )
+
+ def merge_results(self):
+ results_merged_by_vid = {}
+ for result in self.results:
+ vid = result['extra_info']['vid']
+ if vid not in results_merged_by_vid:
+ results_merged_by_vid[vid] = {
+ 'video_id': vid,
+ "duration": result['extra_info']['duration'],
+ "domain": result['extra_info']['domain'],
+ "sub_category": result['extra_info']['sub_category'],
+ 'questions': [],
+ 'missing': False
+ }
+ if result['missing']:
+ results_merged_by_vid[vid]['missing'] = True
+ results_merged_by_vid[vid]['questions'].append({
+ 'qid': result['extra_info']['idx'],
+ 'task_type': result['extra_info']['task_type'],
+ 'acc': result['result']['acc']
+ }
+ )
+ return results_merged_by_vid
+
+ def eval_your_results(
+ self,
+ video_types: Optional[Union[List[str], str]] = None,
+ skip_missing: Optional[bool] = False,
+ return_categories_accuracy: Optional[bool] = True,
+ return_sub_categories_accuracy: Optional[bool] = False,
+ return_task_types_accuracy: Optional[bool] = False,
+ gt_answer_key: Optional[str] = "answer",
+ your_answer_key: Optional[str] = "response"
+    ):
+ """
+        Copied from https://github.com/thanku-all/parse_answer/blob/main/eval_your_results.py
+ Evaluate your results against the ground truth
+
+ Args:
+ - video_types (Optional[List[str], str]): List of video types to evaluate.
+ - skip_missing (Optional[bool]): If True, missing files will be skipped. If False, an error will be raised if there are missing files.
+ - return_categories_accuracy (Optional[bool]): If True, the accuracy for each video category will be returned.
+ - return_sub_categories_accuracy (Optional[bool]): If True, the accuracy for each video sub category will be returned.
+ - return_task_types_accuracy (Optional[bool]): If True, the accuracy for each task category will be returned.
+ - gt_answer_key (Optional[str]): Key to access the ground truth answer in the results file.
+ - your_answer_key (Optional[str]): Key to access your answer in the results file.
+ """
+
+ # Load your results
+ # with open(your_results_path, 'r') as f:
+ # your_results = json.load(f)
+ your_results = list(self.merge_results().values())
+ self.eval_records = []
+ if isinstance(video_types, str):
+ video_types = video_types.split(",")
+
+ q_type_dict = {}
+ v_type_dict = {}
+ v_sub_type_dict = {}
+
+
+ for video_type in video_types:
+
+ # Filter your results based on video types
+ your_results_video_type = [item for item in your_results if item['duration'] == video_type]
+
+ # Task Categories
+ q_type_dict[video_type] = {}
+ for q_type in TASK_CATEGORIES:
+ q_type_dict[video_type][q_type] = {"correct": 0, "answered": 0}
+
+ # Video categories
+ v_type_dict[video_type] = {}
+ for v_type in CATEGORIES:
+ v_type_dict[video_type][v_type] = {"correct": 0, "answered": 0}
+
+ v_sub_type_dict[video_type] = {}
+ for v_sub_type in SUB_CATEGORIES:
+ v_sub_type_dict[video_type][v_sub_type] = {"correct": 0, "answered": 0}
+
+ if not skip_missing:
+ # Check if the number of files in your results and ground truth are the same
+                assert len(your_results_video_type) == 300, f"Number of files in {video_type} is not 300. Check if there are missing files."
+
+ for item in your_results_video_type:
+
+ if skip_missing and item["missing"]:
+ continue
+
+ # Get the video category, sub category and question category
+ video_category = item["domain"]
+ video_sub_category = item["sub_category"]
+
+ questions = item["questions"]
+
+ for question in questions:
+ q_type = question["task_type"]
+
+ # Get the ground truth and your response
+ # gt_answer = question[gt_answer_key]
+ # response = question[your_answer_key]
+ acc = question['acc']
+
+ # Extract the answer from the response
+ # extration = extract_characters_regex(response)
+
+ if acc is not None:
+ q_type_dict[video_type][q_type]["answered"] += 1
+ q_type_dict[video_type][q_type]["correct"] += acc
+
+ v_type_dict[video_type][video_category]["answered"] += 1
+ v_type_dict[video_type][video_category]["correct"] += acc
+
+ v_sub_type_dict[video_type][video_sub_category]["answered"] += 1
+ v_sub_type_dict[video_type][video_sub_category]["correct"] += acc
+
+
+ # Print the results for each video type
+ for video_type in video_types:
+ info = f"=====================================\nEvaluation on video Type: {video_type}\n====================================="
+ self.eval_records.append(info)
+ print(info)
+ if return_categories_accuracy:
+ info = f"-------------------------------------\nVideo Categories\n-------------------------------------"
+ self.eval_records.append(info)
+ print(info)
+ for v_type in v_type_dict[video_type]:
+ info = f"{v_type}: {100 * v_type_dict[video_type][v_type]['correct'] / v_type_dict[video_type][v_type]['answered'] if v_type_dict[video_type][v_type]['answered'] > 0 else 0 : .1f}%"
+ self.eval_records.append(info)
+ print(info)
+ if return_sub_categories_accuracy:
+ info = f"-------------------------------------\nVideo Sub Categories\n-------------------------------------"
+                self.eval_records.append(info)
+                print(info)
+ for v_sub_type in v_sub_type_dict[video_type]:
+ info = f"{v_sub_type}: {100 * v_sub_type_dict[video_type][v_sub_type]['correct'] / v_sub_type_dict[video_type][v_sub_type]['answered'] if v_sub_type_dict[video_type][v_sub_type]['answered'] > 0 else 0 : .1f}%"
+ self.eval_records.append(info)
+ print(info)
+ if return_task_types_accuracy:
+ info = f"-------------------------------------\nTask Categories\n-------------------------------------"
+ self.eval_records.append(info)
+ print(info)
+ for q_type in q_type_dict[video_type]:
+ info = f"{q_type}: {100 * q_type_dict[video_type][q_type]['correct'] / q_type_dict[video_type][q_type]['answered'] if q_type_dict[video_type][q_type]['answered'] > 0 else 0 : .1f}%"
+ self.eval_records.append(info)
+ print(info)
+ info = f"-------------------------------------\nOverall Performance\n-------------------------------------"
+            self.eval_records.append(info)
+ print(info)
+ total_correct = sum([q_type_dict[video_type][q_type]["correct"] for q_type in TASK_CATEGORIES])
+ total_answered = sum([q_type_dict[video_type][q_type]["answered"] for q_type in TASK_CATEGORIES])
+ info = f"Overall: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%"
+ self.eval_records.append(info)
+ print(info+'\n')
+
+ # Print the results for the entire dataset
+ info = f"=====================================\nEvaluation on the entire dataset\n====================================="
+ self.eval_records.append(info)
+ print(info)
+
+ if return_categories_accuracy:
+ info = f"-------------------------------------\nVideo Categories\n-------------------------------------"
+ self.eval_records.append(info)
+ print(info)
+ for v_type in CATEGORIES:
+ total_correct = sum([v_type_dict[video_type][v_type]["correct"] for video_type in video_types])
+ total_answered = sum([v_type_dict[video_type][v_type]["answered"] for video_type in video_types])
+ info = f"{v_type}: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%"
+ self.eval_records.append(info)
+ print(info)
+
+
+ if return_sub_categories_accuracy:
+ info = f"-------------------------------------\nVideo Sub Categories\n-------------------------------------"
+ self.eval_records.append(info)
+ print(info)
+
+ for v_sub_type in SUB_CATEGORIES:
+ total_correct = sum([v_sub_type_dict[video_type][v_sub_type]["correct"] for video_type in video_types])
+ total_answered = sum([v_sub_type_dict[video_type][v_sub_type]["answered"] for video_type in video_types])
+ info = f"{v_sub_type}: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%"
+ self.eval_records.append(info)
+ print(info)
+
+
+ if return_task_types_accuracy:
+ info = f"-------------------------------------\nTask Categories\n-------------------------------------"
+ self.eval_records.append(info)
+ print(info)
+ for q_type in TASK_CATEGORIES:
+
+ total_correct = sum([q_type_dict[video_type][q_type]["correct"] for video_type in video_types])
+ total_answered = sum([q_type_dict[video_type][q_type]["answered"] for video_type in video_types])
+ info = f"{q_type}: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%"
+ self.eval_records.append(info)
+ print(info)
+
+ info = f"*************************************\nOverall Performance\n*************************************"
+ self.eval_records.append(info)
+ print(info)
+ total_correct = sum([sum([q_type_dict[video_type][q_type]["correct"] for q_type in TASK_CATEGORIES]) for video_type in video_types])
+ total_answered = sum([sum([q_type_dict[video_type][q_type]["answered"] for q_type in TASK_CATEGORIES]) for video_type in video_types])
+ info = f"Overall: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%"
+ self.eval_records.append(info)
+ print(info)
+
+ def save_results(self, pred_path):
+ if os.path.isdir(pred_path):
+ output_dir = os.path.join(pred_path, 'eval_records')
+ else:
+ output_dir = os.path.join(os.path.dirname(pred_path), 'eval_records')
+ os.makedirs(output_dir, exist_ok=True)
+        with open(os.path.join(output_dir, f'{self.dataset_name}_eval_result.txt'), 'w') as fout:
+            for info in self.eval_records:
+                fout.write(info+'\n')
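The Video-MME summary above keeps one `{'correct', 'answered'}` counter pair per bucket and guards every percentage against empty buckets. A self-contained sketch of that bookkeeping with toy per-question records (values are illustrative, in the shape produced by `merge_results`):

```python
# Toy per-question records: duration bucket, task type, correctness.
questions = [
    {'duration': 'short', 'task_type': 'OCR Problems', 'acc': True},
    {'duration': 'short', 'task_type': 'OCR Problems', 'acc': False},
    {'duration': 'long', 'task_type': 'Counting Problem', 'acc': True},
]

# One correct/answered counter pair per (duration, task_type) bucket.
buckets = {}
for q in questions:
    b = buckets.setdefault((q['duration'], q['task_type']),
                           {'correct': 0, 'answered': 0})
    b['answered'] += 1
    b['correct'] += q['acc']  # bool adds as 0/1

# Percentage per bucket, guarded against division by zero exactly
# like the conditional expressions in eval_your_results.
for key, b in sorted(buckets.items()):
    pct = 100 * b['correct'] / b['answered'] if b['answered'] > 0 else 0
    print(key, f'{pct:.1f}%')
```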
diff --git a/eval_scripts/DREAM-1K/tarsier/models/modeling_qwen2_vl_fast.py b/eval_scripts/DREAM-1K/tarsier/models/modeling_qwen2_vl_fast.py
new file mode 100644
index 0000000000000000000000000000000000000000..6902c56627dd6c4fc3858a8a2ee312a34e5689e0
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/models/modeling_qwen2_vl_fast.py
@@ -0,0 +1,1320 @@
+import os
+import math
+from dataclasses import dataclass
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.nn import LayerNorm
+
+from transformers.modeling_utils import PreTrainedModel
+from transformers.configuration_utils import PretrainedConfig
+from transformers.modeling_rope_utils import rope_config_validation, ROPE_INIT_FUNCTIONS
+from transformers.cache_utils import Cache, SlidingWindowCache, StaticCache
+from transformers.modeling_attn_mask_utils import AttentionMaskConverter
+from transformers.utils import (
+ add_start_docstrings,
+ add_start_docstrings_to_model_forward,
+ is_flash_attn_2_available,
+ is_flash_attn_greater_or_equal_2_10,
+ logging,
+ replace_return_docstrings,
+)
+from transformers.modeling_outputs import (
+ BaseModelOutputWithPast,
+ ModelOutput,
+)
+from transformers.activations import ACT2FN
+from transformers.generation import GenerationMixin
+
+if is_flash_attn_2_available():
+ from flash_attn import flash_attn_varlen_func
+
+ from transformers.modeling_flash_attention_utils import _flash_attention_forward
+else:
+ flash_attn_varlen_func = None
+
+# from apex.normalization.fused_layer_norm import fused_rms_norm_affine
+
+logger = logging.get_logger(__name__)
+
+@dataclass
+class Qwen2VLCausalLMOutputWithPast(ModelOutput):
+ """
+ Base class for Qwen2VL causal language model (or autoregressive) outputs.
+
+ Args:
+ loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
+ Language modeling loss (for next-token prediction).
+ logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
+ Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)`)
+
+ Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
+ `past_key_values` input) to speed up sequential decoding.
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
+
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
+ sequence_length)`.
+
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
+ heads.
+ rope_deltas (`torch.LongTensor` of shape `(batch_size, )`, *optional*):
+ The rope index difference between sequence length and multimodal rope.
+ """
+
+ loss: Optional[torch.FloatTensor] = None
+ logits: torch.FloatTensor = None
+ past_key_values: Optional[List[torch.FloatTensor]] = None
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+    attentions: Optional[Tuple[torch.FloatTensor]] = None
+    rope_deltas: Optional[torch.LongTensor] = None
+
+class Qwen2VLVisionConfig(PretrainedConfig):
+ model_type = "qwen2_vl"
+
+ def __init__(
+ self,
+ depth=32,
+ embed_dim=1280,
+ hidden_size=3584,
+ hidden_act="quick_gelu",
+ mlp_ratio=4,
+ num_heads=16,
+ in_channels=3,
+ patch_size=14,
+ spatial_merge_size=2,
+ temporal_patch_size=2,
+ attn_implementation='flash_attention_2',
+ **kwargs,
+ ):
+ super().__init__(**kwargs)
+
+ self.depth = depth
+ self.embed_dim = embed_dim
+ self.hidden_size = hidden_size
+ self.hidden_act = hidden_act
+ self.mlp_ratio = mlp_ratio
+ self.num_heads = num_heads
+ self.in_channels = in_channels
+ self.patch_size = patch_size
+ self.spatial_merge_size = spatial_merge_size
+ self.temporal_patch_size = temporal_patch_size
+ self.attn_implementation = attn_implementation
+
+ @classmethod
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
+ cls._set_token_in_kwargs(kwargs)
+
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
+
+ if config_dict.get("model_type") == "qwen2_vl":
+ config_dict = config_dict["vision_config"]
+
+ if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
+ logger.warning(
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
+ f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
+ )
+
+ return cls.from_dict(config_dict, **kwargs)
+
+
+class Qwen2VLConfig(PretrainedConfig):
+ r"""
+ This is the configuration class to store the configuration of a [`Qwen2VLModel`]. It is used to instantiate a
+ Qwen2-VL model according to the specified arguments, defining the model architecture. Instantiating a configuration
+ with the defaults will yield a similar configuration to that of
+ Qwen2-VL-7B-Instruct [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).
+
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+ documentation from [`PretrainedConfig`] for more information.
+
+
+ Args:
+ vocab_size (`int`, *optional*, defaults to 152064):
+ Vocabulary size of the Qwen2VL model. Defines the number of different tokens that can be represented by the
+ `inputs_ids` passed when calling [`Qwen2VLModel`]
+ hidden_size (`int`, *optional*, defaults to 8192):
+ Dimension of the hidden representations.
+ intermediate_size (`int`, *optional*, defaults to 29568):
+ Dimension of the MLP representations.
+ num_hidden_layers (`int`, *optional*, defaults to 80):
+ Number of hidden layers in the Transformer encoder.
+ num_attention_heads (`int`, *optional*, defaults to 64):
+ Number of attention heads for each attention layer in the Transformer encoder.
+ num_key_value_heads (`int`, *optional*, defaults to 8):
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+ `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+ by meanpooling all the original heads within that group. For more details checkout [this
+            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to `8`.
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+ The non-linear activation function (function or string) in the decoder.
+ max_position_embeddings (`int`, *optional*, defaults to 32768):
+ The maximum sequence length that this model might ever be used with.
+ initializer_range (`float`, *optional*, defaults to 0.02):
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+ rms_norm_eps (`float`, *optional*, defaults to 1e-05):
+ The epsilon used by the rms normalization layers.
+ use_cache (`bool`, *optional*, defaults to `True`):
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
+ relevant if `config.is_decoder=True`.
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+ Whether the model's input and output word embeddings should be tied.
+ rope_theta (`float`, *optional*, defaults to 1000000.0):
+ The base period of the RoPE embeddings.
+ use_sliding_window (`bool`, *optional*, defaults to `False`):
+ Whether to use sliding window attention.
+ sliding_window (`int`, *optional*, defaults to 4096):
+ Sliding window attention (SWA) window size. If not specified, will default to `4096`.
+ max_window_layers (`int`, *optional*, defaults to 80):
+ The number of layers that use SWA (Sliding Window Attention). The bottom layers use SWA while the top use full attention.
+ attention_dropout (`float`, *optional*, defaults to 0.0):
+ The dropout ratio for the attention probabilities.
+ vision_config (`Dict`, *optional*):
+ The config for the visual encoder initialization.
+ rope_scaling (`Dict`, *optional*):
+ Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
+ and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
+ accordingly.
+ Expected contents:
+ `rope_type` (`str`):
+ The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
+ 'llama3'], with 'default' being the original RoPE implementation.
+ `factor` (`float`, *optional*):
+ Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
+ most scaling types, a `factor` of x will enable the model to handle sequences of length x *
+ original maximum pre-trained length.
+ `original_max_position_embeddings` (`int`, *optional*):
+ Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
+ pretraining.
+ `attention_factor` (`float`, *optional*):
+ Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
+ computation. If unspecified, it defaults to value recommended by the implementation, using the
+ `factor` field to infer the suggested value.
+ `beta_fast` (`float`, *optional*):
+ Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
+ ramp function. If unspecified, it defaults to 32.
+ `beta_slow` (`float`, *optional*):
+ Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
+ ramp function. If unspecified, it defaults to 1.
+ `short_factor` (`List[float]`, *optional*):
+ Only used with 'longrope'. The scaling factor to be applied to short contexts (<
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
+ size divided by the number of attention heads divided by 2
+ `long_factor` (`List[float]`, *optional*):
+                Only used with 'longrope'. The scaling factor to be applied to long contexts (>
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
+ size divided by the number of attention heads divided by 2
+ `low_freq_factor` (`float`, *optional*):
+ Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
+ `high_freq_factor` (`float`, *optional*):
+ Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
+
+ ```python
+ >>> from transformers import Qwen2VLForConditionalGeneration, Qwen2VLConfig
+
+ >>> # Initializing a Qwen2VL style configuration
+ >>> configuration = Qwen2VLConfig()
+
+ >>> # Initializing a model from the Qwen2-VL-7B style configuration
+ >>> model = Qwen2VLForConditionalGeneration(configuration)
+
+ >>> # Accessing the model configuration
+ >>> configuration = model.config
+ ```"""
+
+ model_type = "qwen2_vl"
+ keys_to_ignore_at_inference = ["past_key_values"]
+
+ def __init__(
+ self,
+ vocab_size=152064,
+ hidden_size=8192,
+ intermediate_size=29568,
+ num_hidden_layers=80,
+ num_attention_heads=64,
+ num_key_value_heads=8,
+ hidden_act="silu",
+ max_position_embeddings=32768,
+ initializer_range=0.02,
+ rms_norm_eps=1e-05,
+ use_cache=True,
+ tie_word_embeddings=False,
+ rope_theta=1000000.0,
+ use_sliding_window=False,
+ sliding_window=4096,
+ max_window_layers=80,
+ attention_dropout=0.0,
+ rope_scaling=None,
+ spatial_merge_size=2,
+ attn_implementation='flash_attention_2',
+ **kwargs,
+ ):
+
+ self.vocab_size = vocab_size
+ self.max_position_embeddings = max_position_embeddings
+ self.hidden_size = hidden_size
+ self.intermediate_size = intermediate_size
+ self.num_hidden_layers = num_hidden_layers
+ self.num_attention_heads = num_attention_heads
+ self.use_sliding_window = use_sliding_window
+ self.sliding_window = sliding_window
+ self.max_window_layers = max_window_layers
+
+ # for backward compatibility
+ if num_key_value_heads is None:
+ num_key_value_heads = num_attention_heads
+
+ self.num_key_value_heads = num_key_value_heads
+ self.hidden_act = hidden_act
+ self.initializer_range = initializer_range
+ self.rms_norm_eps = rms_norm_eps
+ self.use_cache = use_cache
+ self.rope_theta = rope_theta
+ self.attention_dropout = attention_dropout
+ self.rope_scaling = rope_scaling
+ self.spatial_merge_size = spatial_merge_size
+ self.attn_implementation = attn_implementation
+
+ # Validate the correctness of rotary position embeddings parameters
+ # BC: if there is a 'type' field, copy it to 'rope_type',
+ # and change type from 'mrope' to 'default', because `mrope` does default RoPE calculations
+ # (one can set it to "linear"/"dynamic" etc. to get scaled RoPE).
+ # TODO: @raushan update config in the hub
+ if self.rope_scaling is not None and "type" in self.rope_scaling:
+ if self.rope_scaling["type"] == "mrope":
+ self.rope_scaling["type"] = "default"
+ self.rope_scaling["rope_type"] = self.rope_scaling["type"]
+ rope_config_validation(self, ignore_keys={"mrope_section"})
+
+ super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
+
+# Copied from transformers.models.llama.modeling_llama.rotate_half
+def rotate_half(x):
+ """Rotates half the hidden dims of the input."""
+ x1 = x[..., : x.shape[-1] // 2]
+ x2 = x[..., x.shape[-1] // 2 :]
+ return torch.cat((-x2, x1), dim=-1)
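+
+# Illustration (not part of the model): for a last dimension of 4,
+#   rotate_half(torch.tensor([1., 2., 3., 4.])) -> tensor([-3., -4., 1., 2.])
+# i.e. the second half is negated and moved in front of the first half.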
+
+def apply_multimodal_rotary_pos_emb(q, k, cos, sin, mrope_section, unsqueeze_dim=1):
+ """Applies Rotary Position Embedding with Multimodal Sections to the query and key tensors (https://qwenlm.github.io/blog/qwen2-vl/).
+
+ Explanation:
+ Multimodal 3D rotary position embedding is an extension of 1D rotary position embedding. The input embedding
+ sequence contains vision (image / video) embeddings and text embeddings, or just text embeddings. For the
+ vision embedding part, we apply rotary position embedding on the temporal, height and width dimensions separately.
+ Here we split the channel dimension into 3 chunks for the temporal, height and width rotary position embeddings.
+ For the text embedding part, we just apply 1D rotary position embedding. The three rotary position indices
+ (temporal, height and width) of a text embedding are always the same, so the text rotary position embedding is
+ no different from that of modern LLMs.
+
+ Args:
+ q (`torch.Tensor`): The query tensor.
+ k (`torch.Tensor`): The key tensor.
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
+ mrope_section(`List(int)`):
+ Multimodal rope section is for channel dimension of temporal, height and width in rope calculation.
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+ Returns:
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
+ """
+ mrope_section = mrope_section * 2
+ cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1).unsqueeze(
+ unsqueeze_dim
+ )
+ sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1).unsqueeze(
+ unsqueeze_dim
+ )
+
+ q_embed = (q * cos) + (rotate_half(q) * sin)
+ k_embed = (k * cos) + (rotate_half(k) * sin)
+ return q_embed, k_embed
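+
+# Shape sketch (assumed values for illustration): with head_dim = 128 and
+# mrope_section = [16, 24, 24] (sum = 64 = head_dim // 2), `mrope_section * 2`
+# gives [16, 24, 24, 16, 24, 24]; cos/sin of shape (3, bs, seq, 128) are split
+# along the last dim into 6 chunks, and chunk i is taken from plane i % 3
+# (0 = temporal, 1 = height, 2 = width), then concatenated back to
+# (bs, seq, 128) and unsqueezed for broadcasting over heads.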
+
+def apply_rotary_pos_emb_vision(tensor: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
+ orig_dtype = tensor.dtype
+ tensor = tensor.float()
+ cos = freqs.cos()
+ sin = freqs.sin()
+ cos = cos.unsqueeze(1).repeat(1, 1, 2).unsqueeze(0).float()
+ sin = sin.unsqueeze(1).repeat(1, 1, 2).unsqueeze(0).float()
+ output = (tensor * cos) + (rotate_half(tensor) * sin)
+ output = output.to(orig_dtype)
+ return output
+
+
+class VisionRotaryEmbedding(nn.Module):
+ def __init__(self, dim: int, theta: float = 10000.0) -> None:
+ super().__init__()
+ inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+ def forward(self, seqlen: int) -> torch.Tensor:
+ seq = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
+ freqs = torch.outer(seq, self.inv_freq)
+ return freqs
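+
+# Shape sketch (illustrative): with dim = 40, `inv_freq` has 40 // 2 = 20
+# entries, so forward(seqlen=5) returns an outer product of shape (5, 20).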
+
+class PatchEmbed(nn.Module):
+ def __init__(
+ self,
+ patch_size: int = 14,
+ temporal_patch_size: int = 2,
+ in_channels: int = 3,
+ embed_dim: int = 1152,
+ ) -> None:
+ super().__init__()
+ self.patch_size = patch_size
+ self.temporal_patch_size = temporal_patch_size
+ self.in_channels = in_channels
+ self.embed_dim = embed_dim
+
+ kernel_size = [temporal_patch_size, patch_size, patch_size]
+ self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=kernel_size, stride=kernel_size, bias=False)
+
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+ target_dtype = self.proj.weight.dtype
+ hidden_states = hidden_states.view(
+ -1, self.in_channels, self.temporal_patch_size, self.patch_size, self.patch_size
+ )
+ hidden_states = self.proj(hidden_states.to(dtype=target_dtype)).view(-1, self.embed_dim)
+ return hidden_states
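+
+# Shape sketch (illustrative, using the defaults above): an input of N flattened
+# patches is viewed as (N, 3, 2, 14, 14); since the Conv3d kernel equals its
+# stride, each patch collapses to a single position, yielding (N, 1152).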
+
+
+class PatchMerger(nn.Module):
+ def __init__(self, dim: int, context_dim: int, spatial_merge_size: int = 2) -> None:
+ super().__init__()
+ self.hidden_size = context_dim * (spatial_merge_size**2)
+ self.ln_q = LayerNorm(context_dim, eps=1e-6)
+ self.mlp = nn.Sequential(
+ nn.Linear(self.hidden_size, self.hidden_size),
+ nn.GELU(),
+ nn.Linear(self.hidden_size, dim),
+ )
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ x = self.mlp(self.ln_q(x).view(-1, self.hidden_size))
+ return x
+
+class VisionMlp(nn.Module):
+ def __init__(self, dim: int, hidden_dim: int, hidden_act: str) -> None:
+ super().__init__()
+ self.fc1 = nn.Linear(dim, hidden_dim)
+ self.act = ACT2FN[hidden_act]
+ self.fc2 = nn.Linear(hidden_dim, dim)
+
+ def forward(self, x) -> torch.Tensor:
+ return self.fc2(self.act(self.fc1(x)))
+
+
+class VisionAttention(nn.Module):
+ def __init__(self, dim: int, num_heads: int = 16) -> None:
+ super().__init__()
+ self.num_heads = num_heads
+ self.head_dim = dim // num_heads
+ self.qkv = nn.Linear(dim, dim * 3, bias=True)
+ self.proj = nn.Linear(dim, dim)
+
+ def forward(
+ self, hidden_states: torch.Tensor, cu_seqlens: torch.Tensor, rotary_pos_emb: torch.Tensor = None
+ ) -> torch.Tensor:
+ seq_length = hidden_states.shape[0]
+ q, k, v = self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
+ q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
+ k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)
+
+ attention_mask = torch.full(
+ [1, seq_length, seq_length], torch.finfo(q.dtype).min, device=q.device, dtype=q.dtype
+ )
+ for i in range(1, len(cu_seqlens)):
+ attention_mask[..., cu_seqlens[i - 1] : cu_seqlens[i], cu_seqlens[i - 1] : cu_seqlens[i]] = 0
+
+ q = q.transpose(0, 1)
+ k = k.transpose(0, 1)
+ v = v.transpose(0, 1)
+ attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.head_dim)
+ attn_weights = attn_weights + attention_mask
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(q.dtype)
+ attn_output = torch.matmul(attn_weights, v)
+ attn_output = attn_output.transpose(0, 1)
+ attn_output = attn_output.reshape(seq_length, -1)
+ attn_output = self.proj(attn_output)
+ return attn_output
+
+
+class VisionFlashAttention2(nn.Module):
+ def __init__(self, dim: int, num_heads: int = 16) -> None:
+ super().__init__()
+ self.num_heads = num_heads
+ self.qkv = nn.Linear(dim, dim * 3, bias=True)
+ self.proj = nn.Linear(dim, dim)
+
+ def forward(
+ self, hidden_states: torch.Tensor, cu_seqlens: torch.Tensor, rotary_pos_emb: torch.Tensor = None
+ ) -> torch.Tensor:
+ seq_length = hidden_states.shape[0]
+ q, k, v = self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
+ q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
+ k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)
+
+ max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()
+ attn_output = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens, max_seqlen, max_seqlen).reshape(
+ seq_length, -1
+ )
+ attn_output = self.proj(attn_output)
+ return attn_output
+
+QWEN2_VL_VISION_ATTENTION_CLASSES = {
+ "eager": VisionAttention,
+ "flash_attention_2": VisionFlashAttention2,
+}
+
+
+class Qwen2VLVisionBlock(nn.Module):
+ def __init__(self, config, attn_implementation: str = "sdpa") -> None:
+ super().__init__()
+ self.norm1 = LayerNorm(config.embed_dim, eps=1e-6)
+ self.norm2 = LayerNorm(config.embed_dim, eps=1e-6)
+ mlp_hidden_dim = int(config.embed_dim * config.mlp_ratio)
+
+ self.attn = QWEN2_VL_VISION_ATTENTION_CLASSES[attn_implementation](
+ config.embed_dim, num_heads=config.num_heads
+ )
+ self.mlp = VisionMlp(dim=config.embed_dim, hidden_dim=mlp_hidden_dim, hidden_act=config.hidden_act)
+
+ def forward(self, hidden_states, cu_seqlens, rotary_pos_emb) -> torch.Tensor:
+ hidden_states = hidden_states + self.attn(
+ self.norm1(hidden_states), cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb
+ )
+ hidden_states = hidden_states + self.mlp(self.norm2(hidden_states))
+ return hidden_states
+
+class Qwen2VLPreTrainedModel(PreTrainedModel):
+ config_class = Qwen2VLConfig
+ base_model_prefix = "model"
+ supports_gradient_checkpointing = True
+ _no_split_modules = ["Qwen2VLDecoderLayer", "Qwen2VLVisionBlock"]
+ _skip_keys_device_placement = "past_key_values"
+ _supports_flash_attn_2 = True
+ _supports_sdpa = False
+ _supports_cache_class = True
+ _supports_static_cache = True
+
+ def _init_weights(self, module):
+ std = self.config.initializer_range
+ if isinstance(module, (nn.Linear, nn.Conv3d)):
+ module.weight.data.normal_(mean=0.0, std=std)
+ if module.bias is not None:
+ module.bias.data.zero_()
+ elif isinstance(module, nn.Embedding):
+ module.weight.data.normal_(mean=0.0, std=std)
+ if module.padding_idx is not None:
+ module.weight.data[module.padding_idx].zero_()
+
+class Qwen2VisionTransformerPretrainedModel(Qwen2VLPreTrainedModel):
+ config_class = Qwen2VLVisionConfig
+ _no_split_modules = ["Qwen2VLVisionBlock"]
+
+ def __init__(self, config) -> None:
+ super().__init__(config)
+ self.spatial_merge_size = config.spatial_merge_size
+
+ self.patch_embed = PatchEmbed(
+ patch_size=config.patch_size,
+ temporal_patch_size=config.temporal_patch_size,
+ in_channels=config.in_channels,
+ embed_dim=config.embed_dim,
+ )
+
+ head_dim = config.embed_dim // config.num_heads
+ self.rotary_pos_emb = VisionRotaryEmbedding(head_dim // 2)
+
+ self.blocks = nn.ModuleList(
+ [Qwen2VLVisionBlock(config, config.attn_implementation) for _ in range(config.depth)]
+ )
+ self.merger = PatchMerger(
+ dim=config.hidden_size, context_dim=config.embed_dim, spatial_merge_size=config.spatial_merge_size
+ )
+ # Initialize weights and apply final processing
+ self.gradient_checkpointing = False
+ self.post_init()
+
+ def get_dtype(self) -> torch.dtype:
+ return self.blocks[0].mlp.fc2.weight.dtype
+
+ def get_device(self) -> torch.device:
+ return self.blocks[0].mlp.fc2.weight.device
+
+ def rot_pos_emb(self, grid_thw):
+ pos_ids = []
+ for t, h, w in grid_thw:
+ hpos_ids = torch.arange(h).unsqueeze(1).expand(-1, w)
+ hpos_ids = hpos_ids.reshape(
+ h // self.spatial_merge_size,
+ self.spatial_merge_size,
+ w // self.spatial_merge_size,
+ self.spatial_merge_size,
+ )
+ hpos_ids = hpos_ids.permute(0, 2, 1, 3)
+ hpos_ids = hpos_ids.flatten()
+
+ wpos_ids = torch.arange(w).unsqueeze(0).expand(h, -1)
+ wpos_ids = wpos_ids.reshape(
+ h // self.spatial_merge_size,
+ self.spatial_merge_size,
+ w // self.spatial_merge_size,
+ self.spatial_merge_size,
+ )
+ wpos_ids = wpos_ids.permute(0, 2, 1, 3)
+ wpos_ids = wpos_ids.flatten()
+ pos_ids.append(torch.stack([hpos_ids, wpos_ids], dim=-1).repeat(t, 1))
+ pos_ids = torch.cat(pos_ids, dim=0)
+ max_grid_size = grid_thw[:, 1:].max()
+ rotary_pos_emb_full = self.rotary_pos_emb(max_grid_size)
+ rotary_pos_emb = rotary_pos_emb_full[pos_ids].flatten(1)
+ return rotary_pos_emb
+
+ def forward(self, hidden_states: torch.Tensor, grid_thw: torch.Tensor) -> torch.Tensor:
+ hidden_states = self.patch_embed(hidden_states)
+ rotary_pos_emb = self.rot_pos_emb(grid_thw)
+
+ cu_seqlens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]).cumsum(
+ dim=0, dtype=torch.int32
+ )
+ cu_seqlens = F.pad(cu_seqlens, (1, 0), value=0)
+
+ for blk in self.blocks:
+ if self.gradient_checkpointing and self.training:
+ hidden_states = self._gradient_checkpointing_func(
+ blk.__call__,
+ hidden_states,
+ cu_seqlens,
+ rotary_pos_emb,
+ )
+ else:
+ hidden_states = blk(hidden_states, cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb)
+
+ return self.merger(hidden_states)
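+
+# cu_seqlens sketch (illustrative): for grid_thw = [[2, 4, 6]], each of the 2
+# temporal frames contributes 4 * 6 = 24 patch tokens, so cu_seqlens becomes
+# [0, 24, 48] and attention is restricted to within each frame.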
+
+ # Alternative fused implementation (relying on a fused_rms_norm_affine kernel,
+ # e.g. from NVIDIA Apex), kept commented out for reference:
+ # class Qwen2RMSNorm(nn.Module):
+# def __init__(self, hidden_size, eps=1e-6):
+# """
+# Qwen2RMSNorm is equivalent to T5LayerNorm
+# """
+# super().__init__()
+# self.weight = nn.Parameter(torch.ones(hidden_size))
+# self.variance_epsilon = eps
+# self.normalized_shape = torch.Size((hidden_size, ))
+
+# def forward(self, hidden_states):
+# return fused_rms_norm_affine(input=hidden_states,
+# weight=self.weight,
+# normalized_shape=self.normalized_shape,
+# eps=self.variance_epsilon,
+# memory_efficient=True)
+
+class Qwen2RMSNorm(nn.Module):
+ def __init__(self, hidden_size, eps=1e-6):
+ """
+ Qwen2RMSNorm is equivalent to T5LayerNorm
+ """
+ super().__init__()
+ self.weight = nn.Parameter(torch.ones(hidden_size))
+ self.variance_epsilon = eps
+
+ def forward(self, hidden_states):
+ input_dtype = hidden_states.dtype
+ hidden_states = hidden_states.to(torch.float32)
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+ return self.weight * hidden_states.to(input_dtype)
+
+ def extra_repr(self):
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
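+
+# Quick numeric check (illustrative, with weight = 1): for x = [3., 4.],
+# mean(x**2) = 12.5, so the output is x / sqrt(12.5 + eps) ≈ [0.8485, 1.1314].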
+
+class Qwen2VLRotaryEmbedding(nn.Module):
+ def __init__(
+ self,
+ dim=None,
+ max_position_embeddings=2048,
+ base=10000,
+ device=None,
+ scaling_factor=1.0,
+ rope_type="default",
+ config: Optional[Qwen2VLConfig] = None,
+ ):
+ super().__init__()
+ # TODO (joao): remove the `if` below, only used for BC
+ self.rope_kwargs = {}
+ if config is None:
+ logger.warning_once(
+ "`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the "
+ "`config` argument. All other arguments will be removed in v4.46"
+ )
+ self.rope_kwargs = {
+ "rope_type": rope_type,
+ "factor": scaling_factor,
+ "dim": dim,
+ "base": base,
+ "max_position_embeddings": max_position_embeddings,
+ }
+ self.rope_type = rope_type
+ self.max_seq_len_cached = max_position_embeddings
+ self.original_max_seq_len = max_position_embeddings
+ else:
+ # BC: "rope_type" was originally "type"
+ if config.rope_scaling is not None:
+ self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
+ else:
+ self.rope_type = "default"
+ self.max_seq_len_cached = config.max_position_embeddings
+ self.original_max_seq_len = config.max_position_embeddings
+
+ self.config = config
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+
+ inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, **self.rope_kwargs)
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
+ self.original_inv_freq = self.inv_freq
+
+ def _dynamic_frequency_update(self, position_ids, device):
+ """
+ dynamic RoPE layers should recompute `inv_freq` in the following situations:
+ 1 - growing beyond the cached sequence length (allow scaling)
+ 2 - the current sequence length is in the original scale (avoid losing precision with small sequences)
+ """
+ seq_len = torch.max(position_ids) + 1
+ if seq_len > self.max_seq_len_cached: # growth
+ inv_freq, self.attention_scaling = self.rope_init_fn(
+ self.config, device, seq_len=seq_len, **self.rope_kwargs
+ )
+ self.register_buffer("inv_freq", inv_freq, persistent=False) # TODO joao: may break with compilation
+ self.max_seq_len_cached = seq_len
+
+ if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len: # reset
+ self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
+ self.max_seq_len_cached = self.original_max_seq_len
+
+ @torch.no_grad()
+ def forward(self, x, position_ids):
+ position_ids = position_ids.permute(2, 0, 1)
+ if "dynamic" in self.rope_type:
+ self._dynamic_frequency_update(position_ids, device=x.device)
+
+ # Core RoPE block. In contrast to other models, Qwen2-VL has different position ids for the t/h/w grids,
+ # so we expand inv_freq to shape (3, ...)
+ inv_freq_expanded = self.inv_freq[None, None, :, None].float().expand(3, position_ids.shape[1], -1, 1)
+ position_ids_expanded = position_ids[:, :, None, :].float() # shape (3, bs, 1, positions)
+ # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
+ device_type = x.device.type
+ device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
+ with torch.autocast(device_type=device_type, enabled=False):
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(2, 3)
+ emb = torch.cat((freqs, freqs), dim=-1)
+ cos = emb.cos()
+ sin = emb.sin()
+
+ # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
+ cos = cos * self.attention_scaling
+ sin = sin * self.attention_scaling
+
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
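+
+# Output sketch (illustrative): position_ids arrives as (bs, seq, 3) and is
+# permuted to (3, bs, seq); the returned cos/sin therefore have shape
+# (3, bs, seq, head_dim), one plane each for temporal / height / width.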
+
+# Copied from transformers.models.qwen2.modeling_qwen2.Qwen2MLP
+class Qwen2MLP(nn.Module):
+ def __init__(self, config):
+ super().__init__()
+ self.hidden_size = config.hidden_size
+ self.intermediate_size = config.intermediate_size
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+ self.act_fn = ACT2FN[config.hidden_act]
+
+ def forward(self, hidden_state):
+ return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))
+
+# Copied from transformers.models.llama.modeling_llama.repeat_kv
+def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+ """
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+ """
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+ if n_rep == 1:
+ return hidden_states
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
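+
+# Shape sketch (hypothetical sizes): a key/value tensor of shape
+# (batch=2, num_key_value_heads=8, seqlen=16, head_dim=64) with n_rep=4
+# becomes (2, 32, 16, 64), matching num_attention_heads = 8 * 4.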
+
+class Qwen2VLAttention(nn.Module):
+ """
+ Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer
+ and "Generating Long Sequences with Sparse Transformers".
+ """
+
+ def __init__(self, config: Qwen2VLConfig, layer_idx: Optional[int] = None):
+ super().__init__()
+ self.config = config
+ self.layer_idx = layer_idx
+ if layer_idx is None:
+ logger.warning_once(
+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will lead "
+ "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` "
+ "when creating this class."
+ )
+
+ self.hidden_size = config.hidden_size
+ self.num_heads = config.num_attention_heads
+ self.head_dim = self.hidden_size // self.num_heads
+ self.num_key_value_heads = config.num_key_value_heads
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
+ self.max_position_embeddings = config.max_position_embeddings
+ self.rope_theta = config.rope_theta
+ self.is_causal = True
+ self.attention_dropout = config.attention_dropout
+ self.rope_scaling = config.rope_scaling
+
+ if (self.head_dim * self.num_heads) != self.hidden_size:
+ raise ValueError(
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
+ f" and `num_heads`: {self.num_heads})."
+ )
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True)
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
+
+
+class Qwen2VLFlashAttention2(Qwen2VLAttention):
+ """
+ Qwen2VL flash attention module, following the Qwen2VL attention module. This module inherits from `Qwen2VLAttention`
+ as the weights of the module stay untouched. The only required change is in the forward pass,
+ where it needs to correctly call the public API of flash attention and deal with padding tokens
+ in case the input contains any. Additionally, for sliding window attention, we apply SWA only to the bottom
+ config.max_window_layers layers.
+ """
+
+ def __init__(self, *args, **kwargs):
+ super().__init__(*args, **kwargs)
+
+ # TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1.
+ # flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignement, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
+
+ def forward(
+ self,
+ hidden_states: torch.Tensor,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_value: Optional[Cache] = None,
+ output_attentions: bool = False,
+ use_cache: bool = False,
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # will become mandatory in v4.46
+ use_rmpad: Optional[bool] = False,
+ cu_seqlens: Optional[torch.Tensor] = None,
+ ):
+ """
+ Train:
+ unpad: (bsz, q_len) = (1, acc_seqlen)
+ pad: (bsz, q_len) = (bsz, q_len)
+ Test:
+ first_iter: (bsz, q_len) = (bsz, q_len)
+ other: (bsz, q_len) = (bsz, 1)
+ """
+ bsz, q_len, _ = hidden_states.size()
+
+ query_states = self.q_proj(hidden_states)
+ key_states = self.k_proj(hidden_states)
+ value_states = self.v_proj(hidden_states)
+
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+
+ cos, sin = position_embeddings
+
+ query_states, key_states = apply_multimodal_rotary_pos_emb(
+ query_states, key_states, cos, sin, self.rope_scaling["mrope_section"]
+ )
+
+ if past_key_value is not None:
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+ # repeat k/v heads if n_kv_heads < n_heads
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
+ dropout_rate = 0.0 if not self.training else self.attention_dropout
+
+ # In PEFT, usually we cast the layer norms in float32 for training stability reasons
+ # therefore the input hidden states gets silently casted in float32. Hence, we need
+ # cast them back in float16 just to be sure everything works as expected.
+ input_dtype = query_states.dtype
+ if input_dtype == torch.float32:
+ if torch.is_autocast_enabled():
+ target_dtype = torch.get_autocast_gpu_dtype()
+ # Handle the case where the model is quantized
+ elif hasattr(self.config, "_pre_quantization_dtype"):
+ target_dtype = self.config._pre_quantization_dtype
+ else:
+ target_dtype = self.q_proj.weight.dtype
+
+ logger.warning_once(
+ f"The input hidden states seem to be silently cast to float32; this might be related to"
+ f" the fact that you have upcasted embedding or layer norm layers to float32. We will cast the input"
+ f" back to {target_dtype}."
+ )
+
+ query_states = query_states.to(target_dtype)
+ key_states = key_states.to(target_dtype)
+ value_states = value_states.to(target_dtype)
+
+ # Reshape to the expected shape for Flash Attention
+ query_states = query_states.transpose(1, 2)
+ key_states = key_states.transpose(1, 2)
+ value_states = value_states.transpose(1, 2)
+
+ if use_rmpad:
+ max_seqlen = torch.max(cu_seqlens[1:] - cu_seqlens[:-1]).item()
+ attn_output = flash_attn_varlen_func(
+ query_states.squeeze(0), key_states.squeeze(0), value_states.squeeze(0),
+ cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
+ max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
+ dropout_p=dropout_rate,
+ causal=self.is_causal, window_size=(-1, -1),
+ )
+ else:
+ attn_output = _flash_attention_forward(
+ query_states, key_states, value_states,
+ attention_mask,
+ q_len,
+ dropout=dropout_rate,
+ sliding_window=None,
+ is_causal=self.is_causal,
+ use_top_left_mask=self._flash_attn_uses_top_left_mask,
+ )
+
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
+ attn_output = self.o_proj(attn_output)
+
+ if not output_attentions:
+ attn_weights = None
+
+ return attn_output, attn_weights, past_key_value
+
+QWEN2_VL_ATTENTION_CLASSES = {
+ "flash_attention_2": Qwen2VLFlashAttention2,
+}
+
+class Qwen2VLDecoderLayer(nn.Module):
+ def __init__(self, config: Qwen2VLConfig, layer_idx: int):
+ super().__init__()
+ self.hidden_size = config.hidden_size
+
+ if config.attn_implementation != "flash_attention_2":
+ logger.error(
+ f"Only flash_attention_2 is supported! config.attn_implementation={config.attn_implementation}"
+ )
+ self.self_attn = QWEN2_VL_ATTENTION_CLASSES[config.attn_implementation](config, layer_idx)
+
+ self.mlp = Qwen2MLP(config)
+ self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+ self.post_attention_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+ def forward(
+ self,
+ hidden_states: torch.Tensor,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
+ output_attentions: Optional[bool] = False,
+ use_cache: Optional[bool] = False,
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # will become mandatory in v4.46
+ use_rmpad: Optional[bool] = False,
+ cu_seqlens: Optional[torch.Tensor] = None,
+ **kwargs,
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+ """
+ Args:
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
+ `(batch, sequence_length)` where padding elements are indicated by 0.
+ output_attentions (`bool`, *optional*):
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+ returned tensors for more detail.
+ use_cache (`bool`, *optional*):
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+ (see `past_key_values`).
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
+ cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
+ Indices depicting the position of the input sequence tokens in the sequence.
+ position_embeddings (`Tuple[torch.FloatTensor, torch.FloatTensor]`, *optional*):
+ Tuple containing the cosine and sine positional embeddings of shape `(batch_size, seq_len, head_dim)`,
+ with `head_dim` being the embedding dimension of each attention head.
+ kwargs (`dict`, *optional*):
+ Arbitrary kwargs to be ignored, used for FSDP and other methods that inject code
+ into the model
+ """
+
+ residual = hidden_states
+
+ hidden_states = self.input_layernorm(hidden_states)
+
+ # Self Attention
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
+ hidden_states=hidden_states,
+ attention_mask=attention_mask,
+ position_ids=position_ids,
+ past_key_value=past_key_value,
+ output_attentions=output_attentions,
+ use_cache=use_cache,
+ position_embeddings=position_embeddings,
+ use_rmpad=use_rmpad,
+ cu_seqlens=cu_seqlens,
+ )
+ hidden_states = residual + hidden_states
+
+ # Fully Connected
+ residual = hidden_states
+ hidden_states = self.post_attention_layernorm(hidden_states)
+ hidden_states = self.mlp(hidden_states)
+ hidden_states = residual + hidden_states
+
+ outputs = (hidden_states,)
+
+ if output_attentions:
+ outputs += (self_attn_weights,)
+
+ if use_cache:
+ outputs += (present_key_value,)
+
+ return outputs
+
+class Qwen2VLModel(Qwen2VLPreTrainedModel):
+ def __init__(self, config: Qwen2VLConfig):
+ super().__init__(config)
+ self.padding_idx = config.pad_token_id
+ self.vocab_size = config.vocab_size
+
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+ self.layers = nn.ModuleList([Qwen2VLDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
+ self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+ self.rotary_emb = Qwen2VLRotaryEmbedding(config=config)
+
+ self.gradient_checkpointing = False
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_input_embeddings(self):
+ return self.embed_tokens
+
+ def set_input_embeddings(self, value):
+ self.embed_tokens = value
+
+ def forward(
+ self,
+ input_ids: torch.LongTensor = None,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
+ inputs_embeds: Optional[torch.FloatTensor] = None,
+ use_cache: Optional[bool] = None,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ use_rmpad: Optional[bool] = False,
+ cu_seqlens: Optional[torch.Tensor] = None,
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ if (input_ids is None) ^ (inputs_embeds is not None):
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+
+ if self.gradient_checkpointing and self.training:
+ if use_cache:
+ logger.warning_once(
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+ )
+ use_cache = False
+
+
+ hidden_states = inputs_embeds
+
+ # create position embeddings to be shared across the decoder layers
+ position_embeddings = self.rotary_emb(hidden_states, position_ids)
+
+ # decoder layers
+ all_hidden_states = () if output_hidden_states else None
+ all_self_attns = () if output_attentions else None
+ next_decoder_cache = None
+
+ for decoder_layer in self.layers:
+ if output_hidden_states:
+ all_hidden_states += (hidden_states,)
+
+ if self.gradient_checkpointing and self.training:
+ layer_outputs = self._gradient_checkpointing_func(
+ decoder_layer.__call__,
+ hidden_states,
+ attention_mask,
+ position_ids,
+ past_key_values,
+ output_attentions,
+ use_cache,
+ position_embeddings,
+ use_rmpad,
+ cu_seqlens,
+ )
+ else:
+ layer_outputs = decoder_layer(
+ hidden_states,
+ attention_mask=attention_mask,
+ position_ids=position_ids,
+ past_key_value=past_key_values,
+ output_attentions=output_attentions,
+ use_cache=use_cache,
+ position_embeddings=position_embeddings,
+ use_rmpad=use_rmpad,
+ cu_seqlens=cu_seqlens,
+ )
+
+ hidden_states = layer_outputs[0]
+
+ if use_cache:
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
+
+ if output_attentions:
+ all_self_attns += (layer_outputs[1],)
+
+ hidden_states = self.norm(hidden_states)
+
+ # add hidden states from the last decoder layer
+ if output_hidden_states:
+ all_hidden_states += (hidden_states,)
+
+ next_cache = next_decoder_cache if use_cache else None
+
+ if not return_dict:
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
+ return BaseModelOutputWithPast(
+ last_hidden_state=hidden_states,
+ past_key_values=next_cache,
+ hidden_states=all_hidden_states,
+ attentions=all_self_attns,
+ )
+
+class Qwen2VLForCausalLM(Qwen2VLPreTrainedModel, GenerationMixin):
+ _tied_weights_keys = ["lm_head.weight"]
+
+ def __init__(self, config):
+ super().__init__(config)
+ self.model = Qwen2VLModel(config)
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+ self.padding_side = "left" # left by default; use the setter to change the padding side
+
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_input_embeddings(self):
+ return self.model.embed_tokens
+
+ def set_input_embeddings(self, value):
+ self.model.embed_tokens = value
+
+ def get_output_embeddings(self):
+ return self.lm_head
+
+ def set_output_embeddings(self, new_embeddings):
+ self.lm_head = new_embeddings
+
+ def set_decoder(self, decoder):
+ self.model = decoder
+
+ def get_decoder(self):
+ return self.model
+
+ def get_rope_index(
+ self,
+ input_ids: torch.LongTensor,
+ image_grid_thw: Optional[torch.LongTensor] = None,
+ attention_mask: Optional[torch.Tensor] = None,
+ ) -> torch.Tensor:
+ """
+ Calculate the 3D rope index based on image and video's temporal, height and width in LLM.
+
+ Explanation:
+ Each embedding sequence contains either both vision and text embeddings, or text embeddings only.
+
+ For a pure text embedding sequence, the rotary position embedding is no different from that of modern LLMs.
+ Examples:
+ input_ids: [T T T T T], here T is for text.
+ temporal position_ids: [0, 1, 2, 3, 4]
+ height position_ids: [0, 1, 2, 3, 4]
+ width position_ids: [0, 1, 2, 3, 4]
+
+ For a mixed vision and text embedding sequence, we calculate a 3D rotary position embedding for the vision part
+ and a 1D rotary position embedding for the text part.
+ Examples:
+ Assume we have a video input with 3 temporal patches, 2 height patches and 2 width patches.
+ input_ids: [V V V V V V V V V V V V T T T T T], here V is for vision.
+ vision temporal position_ids: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
+ vision height position_ids: [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
+ vision width position_ids: [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
+ text temporal position_ids: [3, 4, 5, 6, 7]
+ text height position_ids: [3, 4, 5, 6, 7]
+ text width position_ids: [3, 4, 5, 6, 7]
+ Here we calculate the text start position_ids as the max vision position_ids plus 1.
+
+ Args:
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
+ it.
+ image_grid_thw (`torch.LongTensor` of shape `(num_images, 3)`, *optional*):
+ The temporal, height and width of the feature grid of each visual input (image or video) in the LLM.
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+ - 1 for tokens that are **not masked**,
+ - 0 for tokens that are **masked**.
+
+ Returns:
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length, 3)`)
+ """
+ spatial_merge_size = self.config.spatial_merge_size
+ vision_token_id = self.config.image_token_id
+ vision_start_token_id = self.config.vision_start_token_id
+ assert image_grid_thw is not None # TODO: check that pure-text inputs do not hang
+ total_input_ids = input_ids
+ position_ids = torch.ones(
+ 3, input_ids.shape[0], input_ids.shape[1], dtype=input_ids.dtype, device=input_ids.device
+ )
+ vision_index = 0
+ for i, input_ids in enumerate(total_input_ids):
+ if attention_mask is not None:
+ input_ids = input_ids[attention_mask[i] == 1]
+ vision_start_indices = torch.argwhere(input_ids == vision_start_token_id).squeeze(1)
+ vision_num = (input_ids[vision_start_indices + 1] == vision_token_id).sum()
+ input_tokens = input_ids.tolist()
+ llm_pos_ids_list: list = []
+ st = 0
+ remain_vision_num = vision_num
+ for _ in range(vision_num):
+ if vision_token_id in input_tokens and remain_vision_num > 0:
+ ed_vision = input_tokens.index(vision_token_id, st)
+ else:
+ ed_vision = len(input_tokens) + 1
+
+ t, h, w = (
+ image_grid_thw[vision_index][0],
+ image_grid_thw[vision_index][1],
+ image_grid_thw[vision_index][2],
+ )
+ vision_index += 1
+ remain_vision_num -= 1
+ ed = ed_vision
+
+ llm_grid_t, llm_grid_h, llm_grid_w = (
+ t.item(),
+ h.item() // spatial_merge_size,
+ w.item() // spatial_merge_size,
+ )
+ text_len = ed - st
+
+ st_idx = llm_pos_ids_list[-1].max() + 1 if len(llm_pos_ids_list) > 0 else 0
+ llm_pos_ids_list.append(torch.arange(text_len).view(1, -1).expand(3, -1) + st_idx)
+
+ t_index = torch.arange(llm_grid_t).view(-1, 1).expand(-1, llm_grid_h * llm_grid_w).flatten()
+ h_index = torch.arange(llm_grid_h).view(1, -1, 1).expand(llm_grid_t, -1, llm_grid_w).flatten()
+ w_index = torch.arange(llm_grid_w).view(1, 1, -1).expand(llm_grid_t, llm_grid_h, -1).flatten()
+ llm_pos_ids_list.append(torch.stack([t_index, h_index, w_index]) + text_len + st_idx)
+ st = ed + llm_grid_t * llm_grid_h * llm_grid_w
+
+ if st < len(input_tokens):
+ st_idx = llm_pos_ids_list[-1].max() + 1 if len(llm_pos_ids_list) > 0 else 0
+ text_len = len(input_tokens) - st
+ llm_pos_ids_list.append(torch.arange(text_len).view(1, -1).expand(3, -1) + st_idx)
+
+ llm_positions = torch.cat(llm_pos_ids_list, dim=1).reshape(3, -1)
+ position_ids[..., i, attention_mask[i] == 1] = llm_positions.to(position_ids.device)
+ position_ids = position_ids.permute(1, 2, 0)
+ return position_ids
+
+ def forward(
+ self,
+ input_ids: torch.LongTensor = None,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
+ inputs_embeds: Optional[torch.FloatTensor] = None,
+ labels: Optional[torch.LongTensor] = None,
+ use_cache: Optional[bool] = None,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ use_rmpad: Optional[bool] = False,
+ cu_seqlens: Optional[torch.Tensor] = None,
+ ) -> Union[Tuple, Qwen2VLCausalLMOutputWithPast]:
+
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+
+ outputs = self.model(
+ input_ids=input_ids,
+ attention_mask=attention_mask,
+ position_ids=position_ids,
+ past_key_values=past_key_values,
+ inputs_embeds=inputs_embeds,
+ use_cache=use_cache,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ use_rmpad=use_rmpad,
+ cu_seqlens=cu_seqlens,
+ )
+
+ hidden_states = outputs[0]
+ logits = self.lm_head(hidden_states)
+
+ if not return_dict:
+ output = (logits,) + outputs[1:]
+ return output
+
+ return Qwen2VLCausalLMOutputWithPast(
+ logits=logits,
+ past_key_values=outputs.past_key_values,
+ hidden_states=outputs.hidden_states,
+ attentions=outputs.attentions,
+ )
+
+
\ No newline at end of file
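The M-RoPE index layout described in the `get_rope_index` docstring can be reproduced with a few lines of standalone torch. This is an illustrative sketch with toy shapes (a video occupying `t=3, h=2, w=2` patch positions followed by 5 text tokens), not part of the model's API:

```python
import torch

# Vision tokens get 3D (temporal, height, width) indices; the grid
# construction below mirrors the t/h/w index lines in get_rope_index.
t, h, w = 3, 2, 2
t_index = torch.arange(t).view(-1, 1).expand(-1, h * w).flatten()
h_index = torch.arange(h).view(1, -1, 1).expand(t, -1, w).flatten()
w_index = torch.arange(w).view(1, 1, -1).expand(t, h, -1).flatten()
vision_pos = torch.stack([t_index, h_index, w_index])  # shape (3, 12)

# Text tokens continue from the max vision position + 1, with the same
# index shared across all three axes.
text_start = vision_pos.max().item() + 1  # 3
text_pos = torch.arange(text_start, text_start + 5).view(1, -1).expand(3, -1)

print(vision_pos[0].tolist())  # temporal: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
print(vision_pos[1].tolist())  # height:   [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
print(vision_pos[2].tolist())  # width:    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(text_pos[0].tolist())    # [3, 4, 5, 6, 7]
```

The printed rows match the worked example in the docstring above.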
diff --git a/eval_scripts/DREAM-1K/tarsier/models/modeling_tarsier.py b/eval_scripts/DREAM-1K/tarsier/models/modeling_tarsier.py
new file mode 100644
index 0000000000000000000000000000000000000000..0db3482b040b9227bdc13c5f9f915e6250836252
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/models/modeling_tarsier.py
@@ -0,0 +1,502 @@
+from dataclasses import dataclass
+from typing import List, Optional, Tuple, Union, Dict, Any
+import math
+
+import torch.utils.checkpoint
+from torch import nn
+import torch.nn.functional as F
+
+from transformers import PreTrainedModel, AutoConfig, AutoModel
+from transformers.activations import ACT2FN
+from transformers.cache_utils import Cache
+from transformers.modeling_outputs import ModelOutput
+from transformers.utils import logging
+from transformers.configuration_utils import PretrainedConfig
+from transformers.dynamic_module_utils import get_class_from_dynamic_module
+from transformers.models.auto import AutoModel, AutoModelForCausalLM, CONFIG_MAPPING
+from transformers.generation import GenerationMixin
+
+from transformers import LlamaForCausalLM, Qwen2ForCausalLM
+# from models.modeling_qwen2 import Qwen2ForCausalLM
+from models.modeling_qwen2_vl_fast import Qwen2VLForCausalLM
+from models.utils import _pad_input, _unpad_input
+
+logger = logging.get_logger(__name__)
+
+
+class LlavaConfig(PretrainedConfig):
+
+ model_type = "llava"
+ is_composition = False
+
+ def __init__(
+ self,
+ vision_config=None,
+ text_config=None,
+ ignore_index=-100,
+ image_token_index=32000,
+ projector_hidden_act="gelu",
+ vision_feature_select_strategy="default",
+ vision_feature_layer=-2,
+ image_newline_idx=32002,
+ image_new_idx=32003,
+ projection_head="MLP",
+ **kwargs,
+ ):
+ self.ignore_index = ignore_index
+ self.image_token_index = image_token_index
+ self.projector_hidden_act = projector_hidden_act
+ self.vision_feature_select_strategy = vision_feature_select_strategy
+ self.vision_feature_layer = vision_feature_layer
+ self.image_newline_idx = image_newline_idx
+ self.image_new_idx = image_new_idx
+ self.projection_head = projection_head
+
+ self.vision_config = vision_config
+
+ if isinstance(self.vision_config, dict):
+ vision_config["model_type"] = (
+ vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
+ )
+ if 'auto_map' in vision_config:
+ repo_id, class_ref = vision_config['auto_map']['AutoConfig'].split("--")
+ config_class = get_class_from_dynamic_module(class_ref, repo_id, **kwargs)
+ self.vision_config = config_class(**vision_config)
+ elif vision_config["model_type"] in CONFIG_MAPPING:
+ self.vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
+ else:
+ raise ValueError(f'vision_config["model_type"] = {vision_config["model_type"]} not supported!')
+
+ self.text_config = text_config
+
+ if isinstance(self.text_config, dict):
+ text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
+ if 'auto_map' in text_config:
+ repo_id, class_ref = text_config['auto_map']['AutoConfig'].split("--")
+ config_class = get_class_from_dynamic_module(class_ref, repo_id, **kwargs)
+ self.text_config = config_class(**text_config)
+ elif text_config["model_type"] in CONFIG_MAPPING:
+ self.text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
+ else:
+ raise ValueError(f'text_config["model_type"] = {text_config["model_type"]} not supported!')
+
+
+ super().__init__(**kwargs)
+
+
+
+@dataclass
+# Copied from transformers.models.idefics.modeling_idefics.IdeficsCausalLMOutputWithPast with Idefics->Llava
+class LlavaCausalLMOutputWithPast(ModelOutput):
+
+ loss: Optional[torch.FloatTensor] = None
+ logits: torch.FloatTensor = None
+ past_key_values: Optional[List[torch.FloatTensor]] = None
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
+ position_ids: Optional[torch.LongTensor] = None
+
+def add_split_tokens(image_features, image_newline_embed, image_new_embed):
+ num_images, num_image_patches, embed_dim = image_features.shape
+ num_height_patches, num_width_patches = int(math.sqrt(num_image_patches)), int(math.sqrt(num_image_patches))
+
+ # add image_newline
+ image_features = image_features.view(num_images, num_height_patches, num_width_patches, embed_dim)
+ image_features = torch.cat([
+ image_features,
+ image_newline_embed.expand((num_images, num_height_patches, 1, embed_dim))
+ ], dim=2)
+ num_image_patches += num_height_patches
+ image_features = image_features.view(num_images, num_image_patches, embed_dim)
+
+ # add image_new
+ image_features = torch.cat([
+ image_features,
+ image_new_embed.expand((num_images, 1, embed_dim))
+ ], dim = 1)
+
+ return image_features
+
+
+class LlavaMultiModalProjector(nn.Module):
+ def __init__(self, config: LlavaConfig):
+ super().__init__()
+ self.config = config
+
+ self.linear_1 = nn.Linear(config.vision_config.hidden_size, config.text_config.hidden_size, bias=True)
+ self.act = ACT2FN[config.projector_hidden_act]
+ self.linear_2 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size, bias=True)
+
+ image_newline_idx = torch.tensor([config.image_newline_idx], dtype=torch.long)
+ image_new_idx = torch.tensor([config.image_new_idx], dtype=torch.long)
+ self.register_buffer('image_newline_idx', image_newline_idx, persistent=False)
+ self.register_buffer('image_new_idx', image_new_idx, persistent=False)
+
+
+ def forward(self, image_features, input_embeddings):
+
+ selected_image_feature = image_features[self.config.vision_feature_layer]
+
+ if self.config.vision_feature_select_strategy == "default":
+ selected_image_feature = selected_image_feature[:, 1:]
+ elif self.config.vision_feature_select_strategy == "full":
+ selected_image_feature = selected_image_feature
+ else:
+ raise ValueError(
+ f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}"
+ )
+
+ hidden_states = self.linear_1(selected_image_feature)
+ hidden_states = self.act(hidden_states)
+ hidden_states = self.linear_2(hidden_states)
+
+ image_newline_embed = input_embeddings(self.image_newline_idx).squeeze()
+ image_new_embed = input_embeddings(self.image_new_idx).squeeze()
+ hidden_states = add_split_tokens(hidden_states, image_newline_embed, image_new_embed)
+ return hidden_states
+
+class PixelShuffleMultiModalProjector(nn.Module):
+ def __init__(self, config: LlavaConfig):
+ super().__init__()
+ self.config = config
+
+ self.downsample_ratio = 0.5
+ vit_hidden_size = config.vision_config.hidden_size
+ llm_hidden_size = config.text_config.hidden_size
+
+ self.mlp = nn.Sequential(
+ nn.LayerNorm(vit_hidden_size * int(1 / self.downsample_ratio) ** 2),
+ nn.Linear(vit_hidden_size * int(1 / self.downsample_ratio) ** 2, llm_hidden_size),
+ nn.GELU(),
+ nn.Linear(llm_hidden_size, llm_hidden_size)
+ )
+
+ image_newline_idx = torch.tensor([config.image_newline_idx], dtype=torch.long)
+ image_new_idx = torch.tensor([config.image_new_idx], dtype=torch.long)
+ self.register_buffer('image_newline_idx', image_newline_idx, persistent=False)
+ self.register_buffer('image_new_idx', image_new_idx, persistent=False)
+
+ def forward(self, image_features, input_embeddings):
+ selected_image_feature = image_features[self.config.vision_feature_layer]
+
+ if self.config.vision_feature_select_strategy == "default":
+ selected_image_feature = selected_image_feature[:, 1:]
+ elif self.config.vision_feature_select_strategy == "full":
+ selected_image_feature = selected_image_feature
+ else:
+ raise ValueError(
+ f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}"
+ )
+
+ image_features = self.pixel_shuffle(selected_image_feature)
+ hidden_states = self.mlp(image_features)
+
+ image_newline_embed = input_embeddings(self.image_newline_idx).squeeze()
+ image_new_embed = input_embeddings(self.image_new_idx).squeeze()
+ hidden_states = add_split_tokens(hidden_states, image_newline_embed, image_new_embed)
+
+ return hidden_states
+
+ def pixel_shuffle(self, x, scale_factor=0.5):
+ if scale_factor == 1:
+ return x
+ n, wh, c = x.shape
+ h, w = int(math.sqrt(wh)), int(math.sqrt(wh))
+ x = x.view(n, h, w, c)
+
+ n, w, h, c = x.size()
+ # N, W, H, C --> N, W, H * scale, C // scale
+ x = x.view(n, w, int(h * scale_factor), int(c / scale_factor))
+ # N, W, H * scale, C // scale --> N, H * scale, W, C // scale
+ x = x.permute(0, 2, 1, 3).contiguous()
+ # N, H * scale, W, C // scale --> N, H * scale, W * scale, C // (scale ** 2)
+ x = x.view(n, int(h * scale_factor), int(w * scale_factor),
+ int(c / (scale_factor * scale_factor)))
+ x = x.permute(0, 2, 1, 3).contiguous()
+ x = x.view(x.shape[0], -1, x.shape[-1])
+ return x
+
+
+LLAVA_START_DOCSTRING = r"""
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+ library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
+ etc.)
+
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
+ and behavior.
+
+ Parameters:
+ config ([`LlavaConfig`] or [`LlavaVisionConfig`]):
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
+ load the weights associated with the model, only the configuration. Check out the
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+class TarsierPreTrainedModel(PreTrainedModel):
+ config_class = LlavaConfig
+ base_model_prefix = "llm"
+ supports_gradient_checkpointing = True # TODO: support latest gc
+ _skip_keys_device_placement = "past_key_values"
+ _supports_flash_attn_2 = True
+ _supports_sdpa = False
+ _supports_cache_class = True # TODO: support different cache
+ _supports_static_cache = True
+
+ def _init_weights(self, module):
+ std = (
+ self.config.initializer_range
+ if hasattr(self.config, "initializer_range")
+ else self.config.text_config.initializer_range
+ )
+
+ if hasattr(module, "class_embedding"):
+ module.class_embedding.data.normal_(mean=0.0, std=std)
+
+ if isinstance(module, (nn.Linear, nn.Conv2d, nn.Conv3d)):
+ module.weight.data.normal_(mean=0.0, std=std)
+ if module.bias is not None:
+ module.bias.data.zero_()
+ elif isinstance(module, nn.Embedding):
+ module.weight.data.normal_(mean=0.0, std=std)
+ if module.padding_idx is not None:
+ module.weight.data[module.padding_idx].zero_()
+ elif isinstance(module, nn.LayerNorm):
+ module.weight.data.fill_(1.0)
+ if module.bias is not None:
+ module.bias.data.zero_()
+ @property
+ def _no_split_modules(self):
+ return self.language_model._no_split_modules + self.vision_tower._no_split_modules
+
+
+class TarsierForConditionalGeneration(TarsierPreTrainedModel, GenerationMixin):
+ def __init__(self, config: LlavaConfig):
+ super().__init__(config)
+ self.vision_tower = AutoModel.from_config(config.vision_config, trust_remote_code=True)
+ if config.text_config.model_type == 'qwen2':
+ self.language_model = Qwen2ForCausalLM(config.text_config)
+ elif config.text_config.model_type == 'qwen2_vl':
+ self.language_model = Qwen2VLForCausalLM(config.text_config)
+ elif config.text_config.model_type == 'llama':
+ self.language_model = LlamaForCausalLM(config.text_config)
+ else:
+ raise ValueError(f'{config.text_config.model_type} not supported!')
+
+ if config.projection_head == 'Pixel_Shuffle':
+ self.multi_modal_projector = PixelShuffleMultiModalProjector(config)
+ elif config.projection_head == 'MLP':
+ self.multi_modal_projector = LlavaMultiModalProjector(config)
+ elif config.projection_head == 'auto_map':
+ repo_id, class_ref = config.auto_map['ProjectionLayer'].split("--")
+ model_class = get_class_from_dynamic_module(class_ref, repo_id)
+ self.multi_modal_projector = model_class(config)
+ elif config.projection_head is None:
+ self.multi_modal_projector = lambda x, *args, **kwargs: x
+ else:
+ raise ValueError(f'projection_head = {config.projection_head} not supported!')
+
+ self.post_init()
+
+ def get_input_embeddings(self):
+ return self.language_model.get_input_embeddings()
+
+ def set_input_embeddings(self, value):
+ self.language_model.set_input_embeddings(value)
+
+ def get_output_embeddings(self):
+ return self.language_model.get_output_embeddings()
+
+ def set_output_embeddings(self, new_embeddings):
+ self.language_model.set_output_embeddings(new_embeddings)
+
+ def set_decoder(self, decoder):
+ self.language_model.set_decoder(decoder)
+
+ def get_decoder(self):
+ return self.language_model.get_decoder()
+
+ def tie_weights(self):
+ return self.language_model.tie_weights()
+
+ def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
+ model_embeds = self.language_model.resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
+ # update vocab size
+ self.config.text_config.vocab_size = model_embeds.num_embeddings
+ return model_embeds
+
+ def forward(
+ self,
+ input_ids: torch.LongTensor = None,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ pixel_values: torch.FloatTensor = None,
+ image_grid_thw: Optional[torch.Tensor] = None,
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
+ labels: Optional[torch.LongTensor] = None,
+ num_images: Optional[torch.Tensor] = None,
+ use_cache: Optional[bool] = None,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ use_rmpad: Optional[bool] = False,
+ **kwargs,
+ ) -> Union[Tuple, LlavaCausalLMOutputWithPast]:
+
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+
+ if input_ids is None:
+ raise ValueError("You must specify input_ids")
+
+ bsz, max_seq_len = input_ids.shape[0], input_ids.shape[1]
+
+ if max_seq_len > 1:
+ special_image_mask = input_ids == self.config.image_token_index
+ print(f'[{input_ids.device}] num_images: {num_images.tolist()} num_image_tokens: {special_image_mask.sum(-1).tolist()}', flush=True)
+
+ if position_ids is None:
+ if 'Qwen2VLForCausalLM' in self.language_model.__class__.__name__:
+ position_ids = self.language_model.get_rope_index(input_ids, image_grid_thw, attention_mask) # [bsz, seqlen, 3]
+ else:
+ position_ids = attention_mask.long().cumsum(-1) - 1 # # [bsz, seqlen]
+ position_ids.masked_fill_(attention_mask == 0, 1)
+
+
+ if use_rmpad:
+ input_ids, input_ids_indices, cu_seqlens, _ = _unpad_input(input_ids, attention_mask) # [bsz, seqlen] -> [1, seqlen]
+ position_ids, _, _, _ = _unpad_input(position_ids, attention_mask)
+ input_ids, position_ids = input_ids.unsqueeze(0), position_ids.unsqueeze(0)
+ else:
+ input_ids_indices, cu_seqlens = None, None
+
+ inputs_embeds = self.get_input_embeddings()(input_ids) # [1, seqlen, dim]
+
+ image_features = None
+ if pixel_values is not None: # training / first step in generation
+ if 'Qwen2VLForCausalLM' in self.language_model.__class__.__name__:
+ pixel_values = pixel_values.type(self.vision_tower.get_dtype())
+ image_features = self.vision_tower(pixel_values, image_grid_thw)
+ else:
+ image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
+ image_features = self.multi_modal_projector(
+ image_outputs.hidden_states,
+ self.get_input_embeddings(),
+ )
+
+ special_image_mask = (input_ids == self.config.image_token_index).to(inputs_embeds.device)
+ if special_image_mask.sum() > 0:
+ image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
+ inputs_embeds = inputs_embeds.masked_scatter(
+ special_image_mask.unsqueeze(-1).expand_as(inputs_embeds),
+ image_features
+ )
+ else:
+ # no image tokens in this batch: add a zero-valued term so the vision tower still receives gradients
+ inputs_embeds = image_features.sum(dim=(0,1)) * 0. + inputs_embeds
+
+ outputs = self.language_model(
+ attention_mask=attention_mask,
+ position_ids=position_ids,
+ past_key_values=past_key_values,
+ inputs_embeds=inputs_embeds,
+ use_cache=use_cache,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ use_rmpad=use_rmpad,
+ cu_seqlens=cu_seqlens,
+ )
+
+ logits = outputs[0]
+
+ loss = None
+ if labels is not None:
+ loss_fct = nn.CrossEntropyLoss()
+ if use_rmpad:
+ labels = labels.view(-1)[input_ids_indices.long()]
+ shift_labels = torch.cat((labels[1:], labels.new_ones((1))*-100))
+ shift_labels.requires_grad = False
+ lbl_seq_lens = (cu_seqlens[1:]-1).long()
+ shift_labels[lbl_seq_lens] = -100
+ loss = loss_fct(logits.squeeze(0), shift_labels)
+ else:
+ # Shift so that tokens < n predict n
+ shift_logits = logits[..., :-1, :].contiguous()
+ shift_labels = labels[..., 1:].contiguous()
+ # Flatten the tokens
+ shift_logits = shift_logits.view(-1, self.config.text_config.vocab_size)
+ shift_labels = shift_labels.view(-1)
+ # Enable model parallelism
+ shift_labels = shift_labels.to(shift_logits.device)
+ loss = loss_fct(shift_logits, shift_labels)
+ elif use_rmpad: # during training we do not re-pad the logits, to save GPU memory
+ logits = _pad_input(logits.squeeze(0), input_ids_indices, bsz, max_seq_len)
+
+ if not return_dict:
+ output = (logits,) + outputs[1:]
+ return (loss,) + output if loss is not None else output
+
+ return LlavaCausalLMOutputWithPast(
+ loss=loss,
+ logits=logits,
+ past_key_values=outputs.past_key_values,
+ hidden_states=outputs.hidden_states,
+ attentions=outputs.attentions,
+ position_ids=position_ids,
+ )
+
+ def prepare_inputs_for_generation(
+ self,
+ input_ids,
+ attention_mask=None,
+ position_ids=None,
+ past_key_values=None,
+ cache_position=None,
+ use_cache=True,
+ pixel_values=None,
+ image_grid_thw=None,
+ **kwargs,
+ ):
+ if past_key_values is not None:
+ past_length = past_key_values.get_seq_length()
+ input_ids = input_ids[:, past_length:]
+
+ model_inputs = {
+ "input_ids": input_ids,
+ "attention_mask": attention_mask,
+ "past_key_values": past_key_values,
+ "use_cache": use_cache,
+ }
+ if kwargs.get('num_images') is not None:
+ model_inputs['num_images'] = kwargs['num_images']
+
+ if cache_position[0] == 0:
+ # Pixel values are only needed on the first (prefill) step. In the cached decoding stage the
+ # input ids no longer contain image tokens, so pixel values should be None.
+ model_inputs["pixel_values"] = pixel_values
+ model_inputs["image_grid_thw"] = image_grid_thw
+ else:
+ model_inputs['position_ids'] = position_ids[:, -1, ...].unsqueeze(1).to(device=input_ids.device) + 1
+ return model_inputs
+
+
+ def _update_model_kwargs_for_generation(
+ self,
+ outputs: ModelOutput,
+ model_kwargs: Dict[str, Any],
+ is_encoder_decoder: bool = False,
+ num_new_tokens: int = 1,
+ ) -> Dict[str, Any]:
+ model_kwargs = super()._update_model_kwargs_for_generation(
+ outputs=outputs,
+ model_kwargs=model_kwargs,
+ is_encoder_decoder=is_encoder_decoder,
+ num_new_tokens=num_new_tokens,
+ )
+
+ if getattr(outputs, "position_ids", None) is not None:
+ model_kwargs["position_ids"] = outputs.position_ids
+
+ return model_kwargs
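The image-token splice in `TarsierForConditionalGeneration.forward` relies on `masked_scatter`: embeddings at image-token positions are overwritten, in order, by the projected image features. A toy sketch (token id, shapes, and values are all made up for the demo):

```python
import torch

IMAGE_TOKEN = 9
input_ids = torch.tensor([[1, 9, 9, 2]])   # two image-token positions
inputs_embeds = torch.zeros(1, 4, 3)       # (bsz, seqlen, dim) text embeddings
image_features = torch.ones(2, 3)          # one feature row per image token

# Expand the boolean mask to the embedding dim, then scatter the image
# features into the masked positions (same pattern as in forward()).
mask = (input_ids == IMAGE_TOKEN)
inputs_embeds = inputs_embeds.masked_scatter(
    mask.unsqueeze(-1).expand_as(inputs_embeds), image_features
)
print(inputs_embeds[0, :, 0].tolist())  # [0.0, 1.0, 1.0, 0.0]
```

Only positions 1 and 2 (the image tokens) are replaced; the surrounding text embeddings are untouched.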
diff --git a/eval_scripts/DREAM-1K/tarsier/models/utils.py b/eval_scripts/DREAM-1K/tarsier/models/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..8aacc26689ae5eb3ca72ab680079d232f5ad3ae0
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/models/utils.py
@@ -0,0 +1,17 @@
+import torch
+import torch.nn.functional as F
+from einops import rearrange
+
+def _unpad_input(input_ids, attention_mask):
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
+ input_ids = rearrange(input_ids, 'b s ... -> (b s) ...')[indices]
+ return input_ids, indices, cu_seqlens, max_seqlen_in_batch
+
+def _pad_input(hidden_states, indices, batch, seqlen):
+ output = torch.zeros(batch * seqlen, *hidden_states.shape[1:], device=hidden_states.device,
+ dtype=hidden_states.dtype)
+ output[indices] = hidden_states
+ return rearrange(output, '(b s) ... -> b s ...', b=batch)
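The `_unpad_input`/`_pad_input` pair above can be exercised end to end with plain torch (no einops) on a toy batch; this sketch recreates the same indices/`cu_seqlens` bookkeeping to show the round trip:

```python
import torch
import torch.nn.functional as F

# Two sequences of lengths 2 and 3, right-padded to seqlen 3.
attention_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])
hidden = torch.arange(6).view(2, 3)  # stand-in for (bsz, seqlen) hidden states

seqlens = attention_mask.sum(-1, dtype=torch.int32)              # [2, 3]
indices = attention_mask.flatten().nonzero().flatten()           # [0, 1, 3, 4, 5]
cu_seqlens = F.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0)) # [0, 2, 5]

# Unpad: keep only real tokens, flattened into one packed row.
unpadded = hidden.flatten()[indices]

# Re-pad: scatter back into a zero-initialized (bsz * seqlen) buffer.
repadded = torch.zeros(6, dtype=hidden.dtype)
repadded[indices] = unpadded
print(unpadded.tolist())             # [0, 1, 3, 4, 5]
print(repadded.view(2, 3).tolist())  # [[0, 1, 0], [3, 4, 5]]
```

`cu_seqlens` holds the cumulative boundaries of the packed sequences, which is the convention varlen flash-attention kernels expect.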
diff --git a/eval_scripts/DREAM-1K/tarsier/scripts/run_demo_cli.sh b/eval_scripts/DREAM-1K/tarsier/scripts/run_demo_cli.sh
new file mode 100644
index 0000000000000000000000000000000000000000..50aaa8da7f3feeb94b8eddb9de232faf4954972d
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/scripts/run_demo_cli.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+model_path=$1
+n_frames=${2:-8}
+max_new_tokens=${3:-512}
+top_p=${4:-0.8}
+temperature=${5:-0}
+
+python3 -m tasks.demo_cli \
+ --model_name_or_path $model_path \
+ --config "configs/tarser2_default_config.yaml" \
+ --max_n_frames $n_frames \
+ --max_new_tokens $max_new_tokens \
+ --top_p $top_p \
+ --temperature $temperature
\ No newline at end of file
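The `use_rmpad` loss branch in `modeling_tarsier.py` shifts labels on the packed row, then masks the position at the end of each sequence so one sequence's last token is not trained to predict the next sequence's first token. A toy sketch of that masking (label values illustrative):

```python
import torch

labels = torch.tensor([10, 11, 20, 21, 22])  # two sequences packed into one row
cu_seqlens = torch.tensor([0, 2, 5])         # sequence boundaries: [0:2], [2:5]

# One-token shift, padding the tail with the ignore index -100,
# then mask the last position of every packed sequence.
shift_labels = torch.cat((labels[1:], labels.new_full((1,), -100)))
shift_labels[(cu_seqlens[1:] - 1).long()] = -100
print(shift_labels.tolist())  # [11, -100, 21, 22, -100]
```

Position 1 (end of the first sequence) and position 4 (end of the batch) are ignored by the cross-entropy loss, exactly as in the `use_rmpad` branch.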
diff --git a/eval_scripts/DREAM-1K/tarsier/scripts/run_demo_gradio.sh b/eval_scripts/DREAM-1K/tarsier/scripts/run_demo_gradio.sh
new file mode 100644
index 0000000000000000000000000000000000000000..27f243f664eb3419a9966dffa73f3389a4676928
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/scripts/run_demo_gradio.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+
+model_path=$1
+max_n_frames=${2:-8}
+
+export MODEL_PATH=$model_path
+export MAX_N_FRAMES=$max_n_frames
+
+python3 -m tasks.demo_gradio
\ No newline at end of file
diff --git a/eval_scripts/DREAM-1K/tarsier/scripts/run_evaluation_only.sh b/eval_scripts/DREAM-1K/tarsier/scripts/run_evaluation_only.sh
new file mode 100644
index 0000000000000000000000000000000000000000..90d8181e2367a3aeea0e4f24a7a1deb7dba82228
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/scripts/run_evaluation_only.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+export AZURE_ENDPOINT=...
+export OPENAI_API_KEY=...
+
+pred_file=$1
+# benchmarks=${@:2}
+benchmarks=dream
+benchmarks=${benchmarks:-"all"}
+
+python -m evaluation.evaluate \
+ --pred_file $pred_file \
+ --benchmarks $benchmarks
diff --git a/eval_scripts/DREAM-1K/tarsier/scripts/run_inference_benchmark.sh b/eval_scripts/DREAM-1K/tarsier/scripts/run_inference_benchmark.sh
new file mode 100644
index 0000000000000000000000000000000000000000..8a632036d906f06010f82d3b6c58ecc0b5eb5fb4
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/scripts/run_inference_benchmark.sh
@@ -0,0 +1,80 @@
+#!/bin/bash
+
+# Copy and Modified on: https://github.com/LLaVA-VL/LLaVA-NeXT/blob/video_inference/scripts/video/eval/video_detail_description_eval_shard.sh
+
+
+model_name_or_path=$1
+output_dir=$2
+benchmarks=${@:3}
+benchmarks=${benchmarks:-"all"}
+resume=True
+CHUNKS=8
+
+mkdir -p "$output_dir"
+
+echo "Using $CHUNKS GPUs"
+
+# Assuming GPULIST is a bash array containing your GPUs
+GPULIST=(0 1 2 3 4 5 6 7)
+# GPULIST=(0 1)
+
+# Get the number of GPUs
+NUM_GPUS=${#GPULIST[@]}
+
+# Calculate GPUs per chunk
+GPUS_PER_CHUNK=$((NUM_GPUS / CHUNKS))
+
+
+for IDX in $(seq 1 $CHUNKS); do
+ START=$(((IDX-1) * GPUS_PER_CHUNK))
+ LENGTH=$GPUS_PER_CHUNK # Length for slicing, not the end index
+
+ CHUNK_GPUS=(${GPULIST[@]:$START:$LENGTH})
+
+ # Convert the chunk GPUs array to a comma-separated string
+ CHUNK_GPUS_STR=$(IFS=,; echo "${CHUNK_GPUS[*]}")
+
+ ALL_GPUS_FREE=0
+ while [ $ALL_GPUS_FREE -eq 0 ]; do
+ ALL_GPUS_FREE=1 # Assume all GPUs are free initially
+
+ for GPU_ID in "${CHUNK_GPUS[@]}"; do
+ MEM_USAGE=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i $GPU_ID | tr -d '[:space:]')
+
+ # Assuming a GPU is considered free if its memory usage is less than 100 MiB
+ if [ "$MEM_USAGE" -ge 100 ]; then
+ ALL_GPUS_FREE=0
+ echo "GPU $GPU_ID is in use. Memory used: ${MEM_USAGE}MiB."
+ break # Exit the loop early as we found a GPU that is not free
+ fi
+ done
+
+ if [ $ALL_GPUS_FREE -eq 0 ]; then
+ echo "Not all GPUs in chunk are free. Checking again in 10 seconds..."
+ sleep 10
+ fi
+ done
+
+ echo "CUDA_VISIBLE_DEVICES=$CHUNK_GPUS_STR"
+ CUDA_VISIBLE_DEVICES=$CHUNK_GPUS_STR python3 -m tasks.inference_benchmark \
+ --model_name_or_path $model_name_or_path \
+ --config "configs/tarser2_default_config.yaml" \
+ --max_new_tokens 512 \
+ --top_p 1 \
+ --temperature 0 \
+ --output_dir $output_dir \
+ --output_name predictions \
+ --max_n_samples_per_benchmark -1 \
+ --benchmarks $benchmarks \
+ --resume $resume \
+ --num_chunks $CHUNKS \
+ --chunk_idx $(($IDX - 1)) > $output_dir/run_$IDX.log 2>&1 &
+
+done
+
+wait
+
+python3 -m evaluation.evaluate \
+ --pred_file $output_dir \
+ --benchmarks $benchmarks
diff --git a/eval_scripts/DREAM-1K/tarsier/scripts/run_inference_caption.sh b/eval_scripts/DREAM-1K/tarsier/scripts/run_inference_caption.sh
new file mode 100644
index 0000000000000000000000000000000000000000..3026eb59a4ee56fd09a240acfdc598d487c2ece7
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/scripts/run_inference_caption.sh
@@ -0,0 +1,79 @@
+#!/bin/bash
+
+# Copy and Modified on: https://github.com/LLaVA-VL/LLaVA-NeXT/blob/video_inference/scripts/video/eval/video_detail_description_eval_shard.sh
+
+
+model_name_or_path=$1
+input_file=$2
+output_dir=$3
+CHUNKS=1
+resume=True
+
+mkdir -p "$output_dir"
+
+echo "Using $CHUNKS GPUs"
+
+# Assuming GPULIST is a bash array containing your GPUs
+# GPULIST=(0 1 2 3 4 5 6 7)
+GPULIST=(0)
+
+# Get the number of GPUs
+NUM_GPUS=${#GPULIST[@]}
+
+# Calculate GPUs per chunk
+GPUS_PER_CHUNK=$((NUM_GPUS / CHUNKS))
+
+
+for IDX in $(seq 1 $CHUNKS); do
+ START=$(((IDX-1) * GPUS_PER_CHUNK))
+ LENGTH=$GPUS_PER_CHUNK # Length for slicing, not the end index
+
+ CHUNK_GPUS=(${GPULIST[@]:$START:$LENGTH})
+
+ # Convert the chunk GPUs array to a comma-separated string
+ CHUNK_GPUS_STR=$(IFS=,; echo "${CHUNK_GPUS[*]}")
+
+ ALL_GPUS_FREE=0
+ while [ $ALL_GPUS_FREE -eq 0 ]; do
+ ALL_GPUS_FREE=1 # Assume all GPUs are free initially
+
+ for GPU_ID in "${CHUNK_GPUS[@]}"; do
+ MEM_USAGE=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i $GPU_ID | tr -d '[:space:]')
+
+ # Assuming a GPU is considered free if its memory usage is less than 100 MiB
+ if [ "$MEM_USAGE" -ge 100 ]; then
+ ALL_GPUS_FREE=0
+ echo "GPU $GPU_ID is in use. Memory used: ${MEM_USAGE}MiB."
+ break # Exit the loop early as we found a GPU that is not free
+ fi
+ done
+
+ if [ $ALL_GPUS_FREE -eq 0 ]; then
+ echo "Not all GPUs in chunk are free. Checking again in 10 seconds..."
+ sleep 10
+ fi
+ done
+
+ echo "CUDA_VISIBLE_DEVICES=$CHUNK_GPUS_STR"
+ CUDA_VISIBLE_DEVICES=$CHUNK_GPUS_STR python3 -m tasks.inference_caption \
+ --model_name_or_path $model_name_or_path \
+ --config "configs/tarser2_default_config.yaml" \
+ --max_new_tokens 512 \
+ --top_p 1 \
+ --temperature 0 \
+ --input_file $input_file \
+ --output_dir $output_dir \
+ --output_name predictions \
+ --max_n_samples_per_benchmark -1 \
+ --resume $resume \
+ --num_chunks $CHUNKS \
+ --chunk_idx $(($IDX - 1)) > $output_dir/run_$IDX.log 2>&1 &
+
+done
+
+wait
+
+# python3 -m evaluation.evaluate \
+# --pred_file $output_dir \
+# --benchmarks $benchmarks
diff --git a/eval_scripts/DREAM-1K/tarsier/tasks/demo_cli.py b/eval_scripts/DREAM-1K/tarsier/tasks/demo_cli.py
new file mode 100644
index 0000000000000000000000000000000000000000..f36b132eaa98fcb2074591193939c1a1c1ea943d
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/tasks/demo_cli.py
@@ -0,0 +1,116 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import os
+import torch
+from copy import deepcopy
+from transformers import StoppingCriteriaList
+from tasks.utils import load_model_and_processor
+from dataset.utils import *
+from tools.conversation import Chat, conv_templates, StoppingCriteriaSub
+from transformers import TextStreamer
+from tools.color import Color
+
+
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+def main(args):
+ # Load Model
+ print(f"### Start loading model...")
+ model, processor = load_model_and_processor(args.model_name_or_path, args.config)
+ print(f"### Finish loading model.")
+ if 'tarsier2' in args.model_name_or_path.lower():
+ conv_type = 'tarsier2-7b'
+ else:
+ if '7b' in args.model_name_or_path.lower():
+ conv_type = 'tarsier-7b'
+ elif '13b' in args.model_name_or_path.lower():
+ conv_type = 'tarsier-13b'
+ elif '34b' in args.model_name_or_path.lower():
+ conv_type = 'tarsier-34b'
+ else:
+ raise ValueError(f"Unknown model: {args.model_name_or_path}")
+
+ chat = Chat(model, processor, device=device, debug = args.debug)
+ conv = deepcopy(conv_templates[conv_type])
+
+ img_path = ''
+ has_img = False
+ while True:
+ if not has_img:
+ try:
+ img_path = input(Color.green(f"{conv.roles[1]}: ") + "Input a file path of your image/video:")
+ img_path = img_path.strip()
+ if not (os.path.exists(img_path) and get_visual_type(img_path) in ['video', 'gif', 'image']):
+ continue
+ has_img = True
+ conv.messages.append([conv.roles[0], {"type": "video", "text": img_path}])
+ print(Color.green(f"{conv.roles[1]}: ") + "Received your file. Now let's start conversation! :)")
+ print(Color.red(f" "))
+ except Exception as e:
+ print(f"Error: {e}")
+ print("exit...")
+ exit()
+ inp = ""
+ while inp == "":
+ inp = input(Color.blue(f"{conv.roles[0]}: ")).strip()
+ if inp.strip() == 'exit':
+ print("exit...")
+ exit()
+ elif inp.strip() == "reset":
+ conv = deepcopy(conv_templates[conv_type])
+ img_path = ''
+ continue
+ conv = chat.ask(inp, conv)
+
+ stop_words_ids = [torch.tensor([processor.processor.tokenizer.eos_token_id]).to(device)]
+ stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])
+ streamer = TextStreamer(processor.processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
+
+ inputs, conv = chat.prepare_model_inputs(conv, args.max_n_frames)
+ print("conv:", conv)
+ print(Color.green(f"{conv.roles[1]}: "), end="")
+ with torch.inference_mode():
+ outputs = model.generate(
+ **inputs,
+ do_sample=True if args.temperature > 0 else False,
+ temperature=args.temperature,
+ top_p=args.top_p,
+ max_new_tokens=args.max_new_tokens,
+ streamer=streamer,
+ use_cache=True,
+ stopping_criteria=stopping_criteria)  # already a StoppingCriteriaList; do not wrap it in another list
+ outputs = processor.processor.tokenizer.decode(outputs[0][inputs['input_ids'][0].shape[0]:], skip_special_tokens=True)
+ conv.messages.append(
+ [conv.roles[1], {"text": outputs, "type": "text"}]
+ )
+
+ if args.debug:
+ print(f"Conversation state: {conv}")
+
+if __name__ == "__main__":
+ # python3 -m tasks.demo_cli --model_name_or_path /tmp/tarsier2-1226-dpo --config configs/tarser2_default_config.yaml
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument('--model_name_or_path', type=str)
+ parser.add_argument('--config', type=str, default="configs/tarser2_default_config.yaml")
+ parser.add_argument("--max_n_frames", type=int, default=16, help="Max number of frames to apply average sampling from the given video.")
+ parser.add_argument("--max_new_tokens", type=int, default=512, help="max number of generated tokens")
+ parser.add_argument("--top_p", type=float, default=1, help="Top_p sampling")
+ parser.add_argument("--temperature", type=float, default=0, help="Set temperature > 0 to enable sampling generation.")
+ parser.add_argument("--debug", action="store_true")
+ args = parser.parse_args()
+
+ main(args)
diff --git a/eval_scripts/DREAM-1K/tarsier/tasks/demo_gradio.py b/eval_scripts/DREAM-1K/tarsier/tasks/demo_gradio.py
new file mode 100644
index 0000000000000000000000000000000000000000..e2117c330d36ad5b007812c1da12c03c59a6c072
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/tasks/demo_gradio.py
@@ -0,0 +1,230 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# copy and modify from: https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/demo/demo.py
+
+# import spaces # for deploying on huggingface ZeroGPU
+from copy import deepcopy
+import gradio as gr
+from gradio.themes.utils import colors, fonts, sizes
+from tools.conversation import Chat, conv_templates
+from tasks.utils import load_model_and_processor, file_to_base64
+from dataset.tarsier_datamodule import init_processor
+import os
+import torch
+
+# huggingface-cli login
+
+model_path = os.getenv("MODEL_PATH", "omni-research/Tarsier2-7b")
+config_path = "configs/tarser2_default_config.yaml"
+max_n_frames = int(os.getenv("MAX_N_FRAMES", 16))
+debug = False
+device = 'cuda' if not debug else 'cpu'
+
+# ========================================
+# Model Initialization
+# ========================================
+def init_model():
+ print("Start Initialization...")
+ # if torch.cuda.is_available():
+ if not debug:
+ model, processor = load_model_and_processor(model_path, config_path)
+ else:
+ print("No valid GPU! Launching in debug mode!")
+ processor = init_processor(model_path, config_path)
+ model = None
+ chat = Chat(model, processor, device, debug)
+ print('Initialization Finished')
+ return chat
+
+
+# ========================================
+# Gradio Setting
+# ========================================
+def gradio_reset(chat_state, img_file):
+ if chat_state is not None:
+ chat_state.messages = []
+ img_file = None
+ return None, gr.update(value=None, interactive=True), gr.update(value=None, interactive=True), gr.update(value=None, interactive=True), gr.update(placeholder='Please upload your video first', interactive=False),gr.update(value="Upload & Start Chat", interactive=True), chat_state, img_file
+
+
+def upload_img(gr_img, gr_video, gr_gif, chat_state, num_frames):
+ print("video, image or gif:", gr_video, gr_img, gr_gif)
+ conv_type = ''
+ if 'tarsier2-7b' in model_path.lower():
+ conv_type = 'tarsier2-7b'
+ # elif '7b' in model_path.lower():
+ # conv_type = 'tarsier-7b'
+ # elif '13b' in model_path.lower():
+ # conv_type = 'tarsier-13b'
+ # elif '34b' in model_path.lower():
+ # conv_type = 'tarsier-34b'
+ else:
+ raise ValueError(f"Unknown model: {model_path}")
+ chat_state = deepcopy(conv_templates[conv_type])
+
+ if gr_img is None and gr_video is None and gr_gif is None:
+ return None, None, None, gr.update(interactive=True), gr.update(interactive=True, placeholder='Please upload video/image first!'), chat_state, None, None
+ if gr_video or gr_img or gr_gif:
+ for img_file in [gr_video, gr_img, gr_gif]:
+ if img_file is not None:
+ break
+ chat_state.messages.append([chat_state.roles[0], {"type": "video", "text": img_file}])
+ return gr.update(interactive=True), gr.update(interactive=True), gr.update(interactive=True), gr.update(interactive=True, placeholder='Type and press Enter'), gr.update(value="Start Chatting", interactive=False), chat_state, img_file
+
+
+def gradio_ask(user_message, chatbot, chat_state):
+ if len(user_message) == 0:
+ return gr.update(interactive=True, placeholder='Input should not be empty!'), chatbot, chat_state
+ chat_state = chat.ask(user_message, chat_state)
+ chatbot = chatbot + [[user_message, None]]
+ return '', chatbot, chat_state
+
+# @spaces.GPU(duration=120) # for deploying on huggingface ZeroGPU
+def gradio_answer(chatbot, chat_state, img_file, top_p, temperature, n_frames=None):
+ llm_message, chat_state = chat.answer(conv=chat_state, n_frames=n_frames, max_new_tokens=256, num_beams=1, temperature=temperature, top_p=top_p)
+ chatbot[-1][1] = llm_message
+ print(chat_state)
+ print(f"Answer: {llm_message}")
+ return chatbot, chat_state
+
+
+class OpenGVLab(gr.themes.base.Base):
+ def __init__(
+ self,
+ *,
+ primary_hue=colors.blue,
+ secondary_hue=colors.sky,
+ neutral_hue=colors.gray,
+ spacing_size=sizes.spacing_md,
+ radius_size=sizes.radius_sm,
+ text_size=sizes.text_md,
+ font=(
+ fonts.GoogleFont("Noto Sans"),
+ "ui-sans-serif",
+ "sans-serif",
+ ),
+ font_mono=(
+ fonts.GoogleFont("IBM Plex Mono"),
+ "ui-monospace",
+ "monospace",
+ ),
+ ):
+ super().__init__(
+ primary_hue=primary_hue,
+ secondary_hue=secondary_hue,
+ neutral_hue=neutral_hue,
+ spacing_size=spacing_size,
+ radius_size=radius_size,
+ text_size=text_size,
+ font=font,
+ font_mono=font_mono,
+ )
+ super().set(
+ body_background_fill="*neutral_50",
+ )
+
+
+gvlabtheme = OpenGVLab(primary_hue=colors.blue,
+ secondary_hue=colors.sky,
+ neutral_hue=colors.gray,
+ spacing_size=sizes.spacing_md,
+ radius_size=sizes.radius_sm,
+ text_size=sizes.text_md,
+ )
+
+logo_b64 = file_to_base64("assets/figures/tarsier_logo.jpg")
+title = f""" """
+description ="""
+"""
+
+
+with gr.Blocks(title="Tarsier",theme=gvlabtheme,css="#chatbot {overflow:auto; height:500px;} #InputVideo {overflow:visible; height:320px;} footer {visibility: none}") as demo:
+ gr.Markdown(title)
+ gr.Markdown(description)
+ with gr.Row():
+ with gr.Column(scale=0.5, visible=True) as video_upload:
+ with gr.Column(elem_id="image", scale=0.5) as img_part:
+ with gr.Tab("Video", elem_id='video_tab'):
+ up_video = gr.Video(interactive=True, include_audio=True, elem_id="video_upload", height=360)
+ with gr.Tab("Image", elem_id='image_tab'):
+ up_image = gr.Image(type="filepath", interactive=True, elem_id="image_upload", height=360)
+ with gr.Tab("GIF", elem_id='gif_tab'):
+ up_gif = gr.File(type="filepath", file_count="single", file_types=[".gif"], interactive=True, elem_id="gif_upload", height=360)
+ upload_button = gr.Button(value="Upload & Start Chat", interactive=True, variant="primary")
+ clear = gr.Button("Restart")
+
+ # num_beams = gr.Slider(
+ # minimum=1,
+ # maximum=10,
+ # value=1,
+ # step=1,
+ # interactive=True,
+ # label="beam search numbers)",
+ # )
+
+ temperature = gr.Slider(
+ minimum=0.0,
+ maximum=1.0,
+ value=0.0,
+ step=0.1,
+ interactive=True,
+ label="Temperature",
+ )
+
+ top_p = gr.Slider(
+ minimum=0.1,
+ maximum=1.0,
+ value=1.0,
+ step=0.1,
+ interactive=True,
+ label="Top_p",
+ )
+
+ num_frames = gr.Slider(
+ minimum=4,
+ maximum=16,
+ value=16,
+ step=2,
+ interactive=True,
+ label="#Frames",
+ )
+
+ with gr.Column(visible=True) as input_raws:
+ chat_state = gr.State()
+ img_file = gr.State()
+ chatbot = gr.Chatbot(elem_id="chatbot",label='VideoChat')
+ with gr.Row():
+ with gr.Column(scale=0.7):
+ text_input = gr.Textbox(show_label=False, placeholder='Please upload your video first', interactive=False, container=False)
+ with gr.Column(scale=0.15, min_width=0):
+ run = gr.Button("💭Send")
+ with gr.Column(scale=0.15, min_width=0):
+ clear = gr.Button("🔄Clear️")
+
+ chat = init_model()
+ upload_button.click(upload_img, [up_image, up_video, up_gif, chat_state, num_frames], [up_image, up_video, up_gif, text_input, upload_button, chat_state, img_file])
+
+ text_input.submit(gradio_ask, [text_input, chatbot, chat_state], [text_input, chatbot, chat_state]).then(
+ gradio_answer, [chatbot, chat_state, img_file, top_p, temperature, num_frames], [chatbot, chat_state]
+ )
+ run.click(gradio_ask, [text_input, chatbot, chat_state], [text_input, chatbot, chat_state]).then(
+ gradio_answer, [chatbot, chat_state, img_file, top_p, temperature, num_frames], [chatbot, chat_state]
+ )
+ run.click(lambda: "", None, text_input)
+ clear.click(gradio_reset, [chat_state, img_file], [chatbot, up_image, up_video, up_gif, text_input, upload_button, chat_state, img_file], queue=False)
+
+
+demo.launch()
+# demo.launch(server_name="0.0.0.0", server_port=11451)
\ No newline at end of file
diff --git a/eval_scripts/DREAM-1K/tarsier/tasks/inference_benchmark.py b/eval_scripts/DREAM-1K/tarsier/tasks/inference_benchmark.py
new file mode 100644
index 0000000000000000000000000000000000000000..8c5b7d7a6cd26f84038f6d30778e333e5797a71a
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/tasks/inference_benchmark.py
@@ -0,0 +1,197 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import torch
+from tasks.utils import load_model_and_processor
+# from dataset.mm_dataset import MMDataset
+from dataset.custom_data_parsers.utils import put_pred_to_data_dict, get_prompt_from_data_dict
+from dataset.tarsier_datamodule import TarsierDataset
+from dataset.utils import *
+
+import json
+import os
+import math
+from tqdm import tqdm
+import yaml
+
+ANN_ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) + '/../data/annotations'
+
+Benchmark2fname = {
+ 'dream': 'DREAM-1k.jsonl',
+
+ 'next-qa': 'Next-QA-val-multi_choice.jsonl',
+ 'egoschema': 'EgoSchema_subset.jsonl', # change to EgoSchema_fullset.jsonl if you test on the fullset
+ 'mvbench': 'MVBench.jsonl',
+ 'tvbench': 'TVBench.jsonl',
+ 'video-mme': 'Video-MME.jsonl',
+ 'favor-bench': 'FAVOR-Bench.jsonl',
+
+ 'msvd-qa': 'MSVD-QA-val.jsonl',
+ 'msr-vtt-qa': 'MSR-VTT-QA-val.jsonl',
+ 'tgif-qa': 'TGIF-QA-test.jsonl',
+ 'anet-qa': 'ActivityNet-QA-test.jsonl',
+
+ 'msvd-caption': 'MSVD-Caption-test.jsonl',
+ 'msr-vtt-caption': 'MSR-VTT-Caption-test.jsonl',
+ 'vatex-caption': 'VATEX-test.jsonl',
+
+ 'video_caption': "caption-test.jsonl", # custom for video caption test
+}
+
+def get_ann_file_path(benchmark):
+ ann_fpath = os.path.join(ANN_ROOT_DIR, Benchmark2fname[benchmark])
+ assert os.path.exists(ann_fpath), f"The annotation file for {benchmark} does not exist: {ann_fpath}"
+ return ann_fpath
+
+def split_list(lst, n):
+ """Split a list into n (roughly) equal-sized chunks"""
+ chunk_size = math.ceil(len(lst) / n) # ceiling division
+ return [lst[i : i + chunk_size] for i in range(0, len(lst), chunk_size)]
+
+
+def get_chunk(lst, n, k):
+ chunks = split_list(lst, n)
+ return chunks[k]
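As a quick illustration of the sharding helpers above: chunks are ceiling-sized, so the last chunk may be shorter, and for small lists fewer than `n` chunks can come back (so `chunk_idx` near `n-1` can be out of range). A minimal sketch:

```python
import math

def split_list(lst, n):
    """Split a list into n (roughly) equal-sized chunks."""
    chunk_size = math.ceil(len(lst) / n)  # ceiling division
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def get_chunk(lst, n, k):
    return split_list(lst, n)[k]

samples = list(range(10))
chunks = split_list(samples, 3)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
last = get_chunk(samples, 3, 2)   # [8, 9]
# Caveat: 9 items over 4 chunks yields only 3 chunks of size 3.
n_chunks = len(split_list(list(range(9)), 4))  # 3
```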
+
+
+def parse_args():
+ """
+ Parse command-line arguments.
+ """
+ parser = argparse.ArgumentParser()
+
+ # Define the command-line arguments
+
+ parser.add_argument('--model_name_or_path', type=str, required=True)
+ parser.add_argument('--config', type=str, default="configs/tarser2_default_config.yaml")
+ # parser.add_argument("--max_n_frames", type=int, default=8, help="Max number of frames to apply average sampling from the given video.")
+ parser.add_argument("--max_new_tokens", type=int, default=512, help="max number of generated tokens")
+ parser.add_argument("--top_p", type=float, default=1, help="Top_p sampling")
+ parser.add_argument("--temperature", type=float, default=0, help="Set temperature > 0 to enable sampling generation.")
+
+ parser.add_argument("--output_dir", type=str, help="Directory to save the model results", required=True)
+ parser.add_argument("--output_name", type=str, default="predictions", help="Name of the file for storing results")
+
+ parser.add_argument("--num_chunks", type=int, default=1)
+ parser.add_argument("--chunk_idx", type=int, default=0)
+
+ parser.add_argument("--max_n_samples_per_benchmark", type=int, default=-1, help="Set as a small number (like 100) to run as debug.")
+ parser.add_argument('--benchmarks', nargs='+', default=["all"], help="Default as 'all' to inference on all benchmarks; Also could be task types: ('dream', 'caption', 'mc_qa', 'oe_qa'); And specific benchmark names: ('dream', 'msvd-caption', 'msr-vtt-caption', 'vatex-caption', 'next-qa', 'egoschema', 'mvbench', 'video-mme', 'msvd-qa', 'msr-vtt-qa', 'tgif-qa', 'anet-qa')")
+
+ parser.add_argument("--resume", type=lambda x: (str(x).lower() == 'true'), default=True, help="Resume from existing inference results file or overwrite them.")
+
+ args = parser.parse_args()
+
+ args.benchmarks = get_benchmarks(args.benchmarks)
+ print("### Selected Benchmarks:", args.benchmarks)
+
+ return args
+
+
+def run_inference(args):
+ """
+ Run inference on selected benchmarks.
+
+ Args:
+ args: Command-line arguments.
+ """
+ # Initialize the model
+ # model, processor = load_model_and_processor(args.model_name_or_path, args.max_n_frames) # max_n_frames set in config_file
+ data_config = yaml.safe_load(open(args.config, 'r'))
+ model, processor = load_model_and_processor(args.model_name_or_path, data_config=data_config)
+
+ all_chunks = []
+ count = 0
+ print(f"Start loading dataset...")
+ for benchmark in args.benchmarks:
+ ann_fpath = get_ann_file_path(benchmark)
+ cur_anns = [json.loads(line) for line in open(ann_fpath)]
+ if args.max_n_samples_per_benchmark > 0:
+ cur_anns = cur_anns[:args.max_n_samples_per_benchmark]
+ count += len(cur_anns)
+ cur_chunk = get_chunk(cur_anns, args.num_chunks, args.chunk_idx)
+ all_chunks.extend(cur_chunk)
+ print(f"### [{benchmark}] Load chunk with {len(cur_chunk)} samples from {len(cur_anns)} samples.")
+ print(f"### Finish loading chunk with {len(all_chunks)} samples from {count} samples in total.")
+
+ # Create the output directory if it doesn't exist
+ if not os.path.exists(args.output_dir):
+ os.makedirs(args.output_dir)
+
+ if args.num_chunks > 1:
+ output_name = f"{args.output_name}_{args.num_chunks}_{args.chunk_idx}"
+ else:
+ output_name = args.output_name
+ answers_file = os.path.join(args.output_dir, f"{output_name}.jsonl")
+ if args.resume and os.path.exists(answers_file):
+ processed_data = [json.loads(line) for line in open(answers_file)]
+ processed_idxs = set([f"{d['dataset']}-{d['idx']}" for d in processed_data])
+ all_chunks = [d for d in all_chunks if f"{d['dataset']}-{d['idx']}" not in processed_idxs]
+ print(f"### Resume from {len(processed_idxs)} samples. {len(all_chunks)} samples to run.", flush=True)
+ ans_file = open(answers_file, "a")
+ else:
+ ans_file = open(answers_file, "w")
+
+ dataset = TarsierDataset(
+ anns=all_chunks, config=data_config, processor=processor
+ )
+
+ generate_kwargs = {
+ "do_sample": True if args.temperature > 0 else False,
+ "max_new_tokens": args.max_new_tokens,
+ "top_p": args.top_p,
+ "temperature": args.temperature,
+ "use_cache": True
+ }
+
+ if len(dataset) == 0:
+ return
+ for ann, inputs in tqdm(dataset, total=len(dataset)):
+ if inputs is not None:
+ if "prompt" in inputs:
+ prompt = get_prompt_from_data_dict(ann)
+ print(f"###Prompt:\n{prompt}", flush=True)
+ # print(f"Input: {processor.processor.tokenizer.decode(inputs['input_ids'][0]), skip_special_tokens=True}", flush=True)
+ try:
+ model_inputs = {}
+ for k, v in inputs.items():
+ if not isinstance(v, torch.Tensor):
+ continue
+ model_inputs[k] = v.to(model.device)
+ outputs = model.generate(
+ **model_inputs,
+ **generate_kwargs,
+ )
+ output_text = processor.processor.tokenizer.decode(outputs[0][model_inputs['input_ids'][0].shape[0]:], skip_special_tokens=True)
+ except Exception as e:
+ print(f"Error: {e}")
+ output_text = ""
+ print(f"###Prediction:\n{output_text}", flush=True)
+ answer = ann['messages'][-1]['content'][-1]['reference']
+ print(f"###Answer:\n{answer}", flush=True)
+ put_pred_to_data_dict(output_text, ann)
+ else:
+ put_pred_to_data_dict("", ann)
+ try:
+ ans_file.write(json.dumps(ann, ensure_ascii=False) + "\n")
+ except Exception:
+ ans_file.write(json.dumps(ann) + "\n")
+ ans_file.flush()
+
+ ans_file.close()
+
+
+if __name__ == "__main__":
+ args = parse_args()
+ run_inference(args)
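The `--resume` path in `run_inference` above skips any sample whose `dataset`-plus-`idx` key already appears in the predictions file. A minimal sketch of that dedup step, with hypothetical records standing in for real annotations:

```python
import json

# Hypothetical contents of an existing predictions .jsonl (one object per line).
processed_lines = [
    '{"dataset": "dream", "idx": 0, "prediction": "a"}',
    '{"dataset": "dream", "idx": 1, "prediction": "b"}',
]
processed_data = [json.loads(line) for line in processed_lines]
processed_idxs = {f"{d['dataset']}-{d['idx']}" for d in processed_data}

# Pending annotations: anything whose key was already written gets filtered out.
all_chunks = [
    {"dataset": "dream", "idx": 0},
    {"dataset": "dream", "idx": 1},
    {"dataset": "dream", "idx": 2},
]
todo = [d for d in all_chunks if f"{d['dataset']}-{d['idx']}" not in processed_idxs]
# Only idx 2 remains to run.
```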
diff --git a/eval_scripts/DREAM-1K/tarsier/tasks/inference_caption.py b/eval_scripts/DREAM-1K/tarsier/tasks/inference_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..c666076c0c8add97d8462fdb1b28c314523cacc1
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/tasks/inference_caption.py
@@ -0,0 +1,165 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import torch
+from tasks.utils import load_model_and_processor
+# from dataset.mm_dataset import MMDataset
+from dataset.custom_data_parsers.utils import put_pred_to_data_dict, get_prompt_from_data_dict
+from dataset.tarsier_datamodule import TarsierDataset
+from dataset.utils import *
+
+import json
+import os
+import math
+from tqdm import tqdm
+import yaml
+
+
+ANN_ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) + '/../data/annotations'
+
+
+def split_list(lst, n):
+ """Split a list into n (roughly) equal-sized chunks"""
+ chunk_size = math.ceil(len(lst) / n) # ceiling division
+ return [lst[i : i + chunk_size] for i in range(0, len(lst), chunk_size)]
+
+
+def get_chunk(lst, n, k):
+ chunks = split_list(lst, n)
+ return chunks[k]
+
+
+def parse_args():
+ """
+ Parse command-line arguments.
+ """
+ parser = argparse.ArgumentParser()
+
+ # Define the command-line arguments
+
+ parser.add_argument('--model_name_or_path', type=str, required=True)
+ parser.add_argument('--config', type=str, default="configs/tarser2_default_config.yaml")
+ # parser.add_argument("--max_n_frames", type=int, default=8, help="Max number of frames to apply average sampling from the given video.")
+ parser.add_argument("--max_new_tokens", type=int, default=512, help="max number of generated tokens")
+ parser.add_argument("--top_p", type=float, default=1, help="Top_p sampling")
+ parser.add_argument("--temperature", type=float, default=0, help="Set temperature > 0 to enable sampling generation.")
+
+ parser.add_argument("--input_file", type=str, help="Directory to input_file (jsonline)", required=True)
+ parser.add_argument("--output_dir", type=str, help="Directory to save the model results", required=True)
+ parser.add_argument("--output_name", type=str, default="predictions", help="Name of the file for storing results")
+
+ parser.add_argument("--num_chunks", type=int, default=1)
+ parser.add_argument("--chunk_idx", type=int, default=0)
+
+ parser.add_argument("--max_n_samples_per_benchmark", type=int, default=-1, help="Set as a small number (like 100) to run as debug.")
+ parser.add_argument("--resume", type=lambda x: (str(x).lower() == 'true'), default=True, help="Resume from existing inference results file or overwrite them.")
+
+ args = parser.parse_args()
+
+ return args
+
+
+def run_inference(args):
+ """
+ Run inference on selected benchmarks.
+
+ Args:
+ args: Command-line arguments.
+ """
+ # Initialize the model
+ # model, processor = load_model_and_processor(args.model_name_or_path, args.max_n_frames) # max_n_frames set in config_file
+ data_config = yaml.safe_load(open(args.config, 'r'))
+ model, processor = load_model_and_processor(args.model_name_or_path, data_config=data_config)
+
+ all_chunks = []
+ count = 0
+ print(f"Start loading dataset...")
+ ann_fpath = args.input_file
+ cur_anns = [json.loads(line) for line in open(ann_fpath)]
+ if args.max_n_samples_per_benchmark > 0:
+ cur_anns = cur_anns[:args.max_n_samples_per_benchmark]
+ count += len(cur_anns)
+ cur_chunk = get_chunk(cur_anns, args.num_chunks, args.chunk_idx)
+ all_chunks.extend(cur_chunk)
+ print(f"### Load chunk with {len(cur_chunk)} samples from {len(cur_anns)} samples.")
+ print(f"### Finish loading chunk with {len(all_chunks)} samples from {count} samples in total.")
+
+ # Create the output directory if it doesn't exist
+ if not os.path.exists(args.output_dir):
+ os.makedirs(args.output_dir)
+
+ if args.num_chunks > 1:
+ output_name = f"{args.output_name}_{args.num_chunks}_{args.chunk_idx}"
+ else:
+ output_name = args.output_name
+ answers_file = os.path.join(args.output_dir, f"{output_name}.jsonl")
+ if args.resume and os.path.exists(answers_file):
+ processed_data = [json.loads(line) for line in open(answers_file)]
+ processed_idxs = set([f"{d['dataset']}-{d['idx']}" for d in processed_data])
+ all_chunks = [d for d in all_chunks if f"{d['dataset']}-{d['idx']}" not in processed_idxs]
+ print(f"### Resume from {len(processed_idxs)} samples. {len(all_chunks)} samples to run.", flush=True)
+ ans_file = open(answers_file, "a")
+ else:
+ ans_file = open(answers_file, "w")
+
+ dataset = TarsierDataset(
+ anns=all_chunks, config=data_config, processor=processor
+ )
+
+ generate_kwargs = {
+ "do_sample": True if args.temperature > 0 else False,
+ "max_new_tokens": args.max_new_tokens,
+ "top_p": args.top_p,
+ "temperature": args.temperature,
+ "use_cache": True
+ }
+
+ if len(dataset) == 0:
+ return
+ for ann, inputs in tqdm(dataset):
+ if inputs is not None:
+ prompt = get_prompt_from_data_dict(ann)
+ print(f"###Prompt:\n{prompt}", flush=True)
+ # print(f"Input: {processor.processor.tokenizer.decode(inputs['input_ids'][0]), skip_special_tokens=True}", flush=True)
+ try:
+ model_inputs = {}
+ for k, v in inputs.items():
+ if not isinstance(v, torch.Tensor):
+ continue
+ model_inputs[k] = v.to(model.device)
+ outputs = model.generate(
+ **model_inputs,
+ **generate_kwargs,
+ )
+ output_text = processor.processor.tokenizer.decode(outputs[0][model_inputs['input_ids'][0].shape[0]:], skip_special_tokens=True)
+ except Exception as e:
+ print(f"Error: {e}")
+ output_text = ""
+ print(f"###Prediction:\n{output_text}", flush=True)
+ put_pred_to_data_dict(output_text, ann)
+ else:
+ put_pred_to_data_dict("", ann)
+        try:
+            ans_file.write(json.dumps(ann, ensure_ascii=False) + "\n")
+        except UnicodeEncodeError:
+            # Fall back to ASCII-escaped output if the stream cannot encode the raw text.
+            ans_file.write(json.dumps(ann) + "\n")
+ ans_file.flush()
+
+ ans_file.close()
+
+
+if __name__ == "__main__":
+ # python3 -m tasks.inference_caption --model_name_or_path /tmp/tarsier2-1226-dpo --config configs/tarser2_default_config.yaml --input_file data/annotations/caption-test-new.jsonl --output_dir tmp_outputs
+ args = parse_args()
+ run_inference(args)
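`get_chunk` is imported from elsewhere in the repo; a minimal sketch of the contiguous split it is assumed to perform (so that `--num_chunks`/`--chunk_idx` partition the annotations across parallel workers):

```python
import math

def get_chunk(samples, num_chunks, chunk_idx):
    # Split `samples` into `num_chunks` contiguous chunks and return chunk `chunk_idx`.
    # The last chunk may be shorter when the list does not divide evenly.
    chunk_size = math.ceil(len(samples) / num_chunks)
    return samples[chunk_idx * chunk_size : (chunk_idx + 1) * chunk_size]

parts = [get_chunk(list(range(10)), 3, i) for i in range(3)]
```

Every sample lands in exactly one chunk, so the per-chunk `.jsonl` outputs can simply be concatenated afterwards.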
diff --git a/eval_scripts/DREAM-1K/tarsier/tasks/inference_quick_start.py b/eval_scripts/DREAM-1K/tarsier/tasks/inference_quick_start.py
new file mode 100644
index 0000000000000000000000000000000000000000..907cad70d87e86f656e79c245423fc2fe595f537
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/tasks/inference_quick_start.py
@@ -0,0 +1,91 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from tasks.utils import load_model_and_processor
+from dataset.custom_data_parsers.utils import put_pred_to_data_dict, get_prompt_from_data_dict
+from dataset.utils import *
+
+import os
+import torch
+from tqdm import tqdm
+import yaml
+
+def process_one(model, processor, prompt, video_file, generate_kwargs):
+ # inputs = processor(prompt, video_file, edit_prompt=True, return_prompt=True)
+ sample = format_one_sample(video_file, prompt)
+ batch_data = processor(sample)
+ print(f"###Prompt:\n{get_prompt_from_data_dict(sample)}")
+ model_inputs = {}
+ for k, v in batch_data.items():
+ if not isinstance(v, torch.Tensor):
+ continue
+ model_inputs[k] = v.to(model.device)
+ outputs = model.generate(
+ **model_inputs,
+ **generate_kwargs,
+ )
+ # print(processor.processor.tokenizer.decode(outputs[0][:model_inputs['input_ids'][0].shape[0]], skip_special_tokens=True))
+ output_text = processor.processor.tokenizer.decode(outputs[0][model_inputs['input_ids'][0].shape[0]:], skip_special_tokens=True)
+ return output_text
+
+def run():
+ import argparse
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument('--model_name_or_path', type=str)
+ parser.add_argument('--config', type=str, default="configs/tarser2_default_config.yaml")
+ parser.add_argument('--instruction', type=str, default="Describe the video in detail.", help='Input prompt.')
+ parser.add_argument('--input_path', type=str, default="assets/examples", help='Path to video/image; or Dir to videos/images')
+ # parser.add_argument("--max_n_frames", type=int, default=16, help="Max number of frames to apply average sampling from the given video.")
+ parser.add_argument("--max_new_tokens", type=int, default=256, help="max number of generated tokens")
+ parser.add_argument("--top_p", type=float, default=1, help="Top_p sampling")
+ parser.add_argument("--temperature", type=float, default=0, help="Set temperature > 0 to enable sampling generation.")
+
+ args = parser.parse_args()
+
+ # model, processor = load_model_and_processor(args.model_name_or_path, max_n_frames=args.max_n_frames) # max_n_frames set in config_file
+ data_config = yaml.safe_load(open(args.config, 'r'))
+ model, processor = load_model_and_processor(args.model_name_or_path, data_config=data_config)
+
+ generate_kwargs = {
+ "do_sample": True if args.temperature > 0 else False,
+ "max_new_tokens": args.max_new_tokens,
+ "top_p": args.top_p,
+ "temperature": args.temperature,
+ "use_cache": True
+ }
+ assert os.path.exists(args.input_path), f"input_path not exist: {args.input_path}"
+    if os.path.isdir(args.input_path):
+        input_files = [os.path.join(args.input_path, fn) for fn in os.listdir(args.input_path) if get_visual_type(fn) in ['video', 'gif', 'image']]
+    elif get_visual_type(args.input_path) in ['video', 'gif', 'image']:
+        input_files = [args.input_path]
+    else:
+        input_files = []
+    assert len(input_files) > 0, f"No valid input file in: {args.input_path} {VALID_DATA_FORMAT_STRING}"
+
+ for input_file in tqdm(input_files, desc="Generating..."):
+ visual_type = get_visual_type(input_file)
+ if args.instruction:
+ prompt = args.instruction
+ else:
+ if visual_type == 'image':
+ prompt = "Describe the image in detail."
+ else:
+ prompt = "Describe the video in detail."
+
+ pred = process_one(model, processor, prompt, input_file, generate_kwargs)
+ print(f"###Prediction:\n{pred}")
+ print('-'*100)
+
+
+if __name__ == "__main__":
+ # python3 -m tasks.inference_quick_start --model_name_or_path /tmp/tarsier2-1226-dpo --config configs/tarser2_default_config.yaml --input_path /mnt/bn/videonasi18n/wangjw/workspace/tarsier/diving.mp4 --instruction "List the names of all sponsors on the background wall."
+ run()
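Both inference scripts decode only the newly generated tokens by slicing the output at the prompt length (`outputs[0][model_inputs['input_ids'][0].shape[0]:]`), since `generate` echoes the prompt tokens before the completion. The same slicing pattern on plain lists, for illustration:

```python
# Token IDs are illustrative; `generate`-style outputs contain the prompt first.
prompt_ids = [101, 7592, 2088]
full_output = prompt_ids + [2023, 2003, 1037, 102]  # prompt + newly generated tokens

# Drop the echoed prompt, keep only the completion.
new_tokens = full_output[len(prompt_ids):]
```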
diff --git a/eval_scripts/DREAM-1K/tarsier/tasks/utils.py b/eval_scripts/DREAM-1K/tarsier/tasks/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..832d44581e8c86c7681e610a004774d4d63eef4f
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/tasks/utils.py
@@ -0,0 +1,45 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from models.modeling_tarsier import TarsierForConditionalGeneration, LlavaConfig
+# from dataset.processor import Processor
+from dataset.tarsier_datamodule import init_processor
+import torch
+import base64
+from tools.color import Color
+import yaml
+
+def load_model_and_processor(model_name_or_path, data_config):
+ print(Color.red(f"Load model and processor from: {model_name_or_path}"), flush=True)
+ if isinstance(data_config, str):
+ data_config = yaml.safe_load(open(data_config, 'r'))
+ processor = init_processor(model_name_or_path, data_config)
+ model_config = LlavaConfig.from_pretrained(
+ model_name_or_path,
+ trust_remote_code=True,
+ )
+ model = TarsierForConditionalGeneration.from_pretrained(
+ model_name_or_path,
+ config=model_config,
+ device_map='auto',
+ torch_dtype=torch.bfloat16,
+ trust_remote_code=True
+ )
+ model.eval()
+ return model, processor
+
+def file_to_base64(img_path):
+ with open(img_path, 'rb') as video_file:
+ video_b64_str = base64.b64encode(video_file.read()).decode()
+ return video_b64_str
+
diff --git a/eval_scripts/DREAM-1K/tarsier/tools/color.py b/eval_scripts/DREAM-1K/tarsier/tools/color.py
new file mode 100644
index 0000000000000000000000000000000000000000..0bc28271a5cdb57983f9a5d264fa4a4e75712bf8
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/tools/color.py
@@ -0,0 +1,36 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+class Color:
+
+    @staticmethod
+    def red(x):
+        return '\033[31m' + x + '\033[0m'
+
+    @staticmethod
+    def green(x):
+        return '\033[32m' + x + '\033[0m'
+
+    @staticmethod
+    def yellow(x):
+        return '\033[33m' + x + '\033[0m'
+
+    @staticmethod
+    def blue(x):
+        return '\033[34m' + x + '\033[0m'
+
+    @staticmethod
+    def violet(x):
+        return '\033[35m' + x + '\033[0m'
+
+
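The `\033[..m` sequences above are standard ANSI SGR escape codes (ESC, then a color selector, then a reset). A quick self-contained check of what `Color.red` produces, with the helper re-stated inline so the snippet runs standalone:

```python
def red(x):
    # \033 is octal for ESC (0x1b); 31 selects red foreground, 0 resets styling.
    return '\033[31m' + x + '\033[0m'

s = red("error")
```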
diff --git a/eval_scripts/DREAM-1K/tarsier/tools/conversation.py b/eval_scripts/DREAM-1K/tarsier/tools/conversation.py
new file mode 100644
index 0000000000000000000000000000000000000000..77dcd4eefb179bf3787d82203bc16c47807643de
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/tools/conversation.py
@@ -0,0 +1,256 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# copied and modified from: https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/conversation.py
+from PIL import Image
+import torch
+from transformers import StoppingCriteria, StoppingCriteriaList
+from dataset.custom_data_parsers.utils import put_pred_to_data_dict, get_prompt_from_data_dict
+from dataset.tarsier_datamodule import TarsierDataProcessor
+from dataset.utils import *
+
+from enum import auto, Enum
+import os
+import re
+
+data_dict_tmp = {
+ "messages": [
+ {
+ "role": "user",
+ "content": [
+ {
+ "type": "video",
+ "video": {
+ "video_file": "/mnt/hdfs/vlm/videos/movies_aligned_0523/tt8266310/tt8266310_1.50.24-1.50.29.mp4"}
+ },
+ {
+ "type": "text",
+ "text": "Describe the video in detail."
+ }
+ ]
+ },
+ {
+ "role": "assistant",
+ "content": [
+ {
+ "type": "text",
+ "text": "A man in the driver's seat, wearing a black jacket with a maroon shirt, fastens his seatbelt while smiling at the man in the passenger seat, who is adjusting his position. The passenger, also wearing a black jacket with a maroon shirt, turns to look forward and smiles. The driver then leans forward to start the car and leans back in his seat. In the background, a beige car is visible through the window."
+ }]}
+ ],
+ "dataset": "video_caption",
+ "task": "video/caption",
+ "idx": 0,
+}
+
+
+IMAGE_TOKEN = "<image>"
+VIDEO_TOKEN = "<video>"
+
+class SeparatorStyle(Enum):
+ """Different separator style."""
+ SINGLE = auto()
+ TWO = auto()
+
+def get_data_dict(conv, max_n_frames=None):
+ data_dict = {
+ "messages": []
+ }
+ for i, (role, message) in enumerate(conv.messages):
+ if message:
+ text = message["text"]
+ content_type = message["type"]
+ content = {}
+ if content_type == "text":
+ content['type'] = 'text'
+ content['text'] = text
+ task = "text-only"
+ elif content_type == "video":
+ content['type'] = 'video'
+ content['video'] = {
+ "video_file": text
+ }
+ if max_n_frames is not None:
+ content['video']['n_frames'] = max_n_frames
+ task = "video/QA"
+ elif content_type == "image":
+ content['type'] = 'image'
+ content['image'] = {
+ "image_file": text
+ }
+ task = "image/QA"
+ else:
+ content['type'] = 'text'
+ content['text'] = text
+ task = "text-only"
+ if data_dict['messages'] and data_dict['messages'][-1]['role'] == role:
+ data_dict['messages'][-1]['content'].append(content)
+ else:
+ data_dict['messages'].append({
+ "role": role,
+ "content": [content]
+ })
+ data_dict['dataset'] = task
+ data_dict['task'] = task
+ check_data_format(data_dict)
+ return data_dict
+
+
+class StoppingCriteriaSub(StoppingCriteria):
+ def __init__(self, stops=[], encounters=1):
+ super().__init__()
+ self.stops = stops
+
+ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
+ for stop in self.stops:
+ if torch.all((stop == input_ids[0][-len(stop):])).item():
+ return True
+ return False
+
+
+class Chat:
+ def __init__(self, model, processor: TarsierDataProcessor, device='cuda', debug=False):
+ self.model = model
+ self.processor = processor
+ self.device = device
+ self.debug = debug
+ stop_words_ids = [torch.tensor([self.processor.processor.tokenizer.eos_token_id]).to(device)]
+ self.stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])
+
+    def ask(self, text, conv):
+ conv.messages.append([conv.roles[0], {"text": text, "type": "text"}])
+ return conv
+
+ def prepare_model_inputs(self, conv, n_frames=None):
+ # print(conv.messages)
+ data_dict = get_data_dict(conv, n_frames)
+ if self.debug:
+ # print(f"visual_data_file: {visual_data_file}", flush=True)
+ print(f"###Prompt:\n{get_prompt_from_data_dict(data_dict)}")
+
+ batch_data = self.processor(data_dict)
+ model_inputs = {}
+ for k, v in batch_data.items():
+ if not isinstance(v, torch.Tensor):
+ continue
+ model_inputs[k] = v.to(self.device)
+ return model_inputs, conv
+
+ def answer(self, conv, n_frames=None, max_new_tokens=256, num_beams=1, min_length=1, top_p=1.0,
+ repetition_penalty=1.0, length_penalty=1, temperature=0):
+ inputs, conv = self.prepare_model_inputs(conv, n_frames)
+ if self.model is not None:
+ outputs = self.model.generate(
+ **inputs,
+ max_new_tokens=max_new_tokens,
+ stopping_criteria=self.stopping_criteria,
+ num_beams=num_beams,
+ do_sample=True if temperature > 0 else False,
+ min_length=min_length,
+ top_p=top_p,
+ repetition_penalty=repetition_penalty,
+ length_penalty=length_penalty,
+ temperature=temperature,
+ )
+ output_text = self.processor.processor.tokenizer.decode(outputs[0][inputs['input_ids'][0].shape[0]:], skip_special_tokens=True)
+ else:
+            output_text = "Fake response as launched in debug mode!"
+ conv.messages.append(
+ [conv.roles[1], {"text": output_text, "type": "text"}]
+ )
+ return output_text, conv
+
+class EasyDict(dict):
+ """
+ Get attributes
+
+ >>> d = EasyDict({'foo':3})
+ >>> d['foo']
+ 3
+ >>> d.foo
+ 3
+ >>> d.bar
+ Traceback (most recent call last):
+ ...
+ AttributeError: 'EasyDict' object has no attribute 'bar'
+
+ Works recursively
+
+ >>> d = EasyDict({'foo':3, 'bar':{'x':1, 'y':2}})
+ >>> isinstance(d.bar, dict)
+ True
+ >>> d.bar.x
+ 1
+ """
+
+ def __init__(self, d=None, **kwargs):
+ if d is None:
+ d = {}
+ if kwargs:
+ d.update(**kwargs)
+ for k, v in d.items():
+ setattr(self, k, v)
+ # Class attributes
+ for k in self.__class__.__dict__.keys():
+ if not (k.startswith("__") and k.endswith("__")) and not k in ("update", "pop"):
+ setattr(self, k, getattr(self, k))
+
+ def __setattr__(self, name, value):
+ if isinstance(value, (list, tuple)):
+ value = [self.__class__(x) if isinstance(x, dict) else x for x in value]
+ elif isinstance(value, dict) and not isinstance(value, self.__class__):
+ value = self.__class__(value)
+ super(EasyDict, self).__setattr__(name, value)
+ super(EasyDict, self).__setitem__(name, value)
+
+ __setitem__ = __setattr__
+
+ def update(self, e=None, **f):
+ d = e or dict()
+ d.update(f)
+ for k in d:
+ setattr(self, k, d[k])
+
+ def pop(self, k, d=None):
+ if hasattr(self, k):
+ delattr(self, k)
+ return super(EasyDict, self).pop(k, d)
+
+conv_tarsier = EasyDict({
+ "system": "",
+ "roles": ("USER", "ASSISTANT"),
+ "messages": [],
+ "sep1": " ",
+    "sep2": "</s>",
+}
+)
+
+conv_tarsier_yi = EasyDict({
+ "system": "",
+ "roles": ("USER", "ASSISTANT"),
+ "messages": [],
+ "sep1": " ",
+ "sep2": "<|endoftext|>",
+}
+)
+
+conv_tarsier_qwen2_vl = EasyDict({
+ "system": "",
+ "roles": ("user", "assistant"),
+ "messages": [],
+}
+)
+
+conv_templates = {
+ "tarsier2-7b": conv_tarsier_qwen2_vl
+}
diff --git a/eval_scripts/DREAM-1K/tarsier/tools/ptbtokenizer.py b/eval_scripts/DREAM-1K/tarsier/tools/ptbtokenizer.py
new file mode 100644
index 0000000000000000000000000000000000000000..c1590d68b01f7c7178c1a0835b293c4a6b52ab50
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/tools/ptbtokenizer.py
@@ -0,0 +1,66 @@
+#!/usr/bin/env python
+#
+# File Name : ptbtokenizer.py
+#
+# Description : Do the PTB Tokenization and remove punctuations.
+#
+# Creation Date : 29-12-2014
+# Last Modified : Thu Mar 19 09:53:35 2015
+# Authors : Hao Fang and Tsung-Yi Lin
+import os
+import subprocess
+import tempfile
+
+# path to the stanford corenlp jar
+STANFORD_CORENLP_3_4_1_JAR = os.path.dirname(os.path.abspath(__file__)) + '/stanford-corenlp-3.4.1.jar'
+
+# punctuations to be removed from the sentences
+PUNCTUATIONS = ["''", "'", "``", "`", "-LRB-", "-RRB-", "-LCB-", "-RCB-", \
+ ".", "?", "!", ",", ":", "-", "--", "...", ";"]
+
+class PTBTokenizer:
+ """Python wrapper of Stanford PTBTokenizer"""
+
+ def tokenize(self, captions_for_image):
+        # NOTE: $JAVA_HOME is used directly as the command, so it must point at the
+        # `java` executable itself, not the JDK install directory.
+        cmd = [os.getenv("JAVA_HOME"), '-cp', STANFORD_CORENLP_3_4_1_JAR, \
+                'edu.stanford.nlp.process.PTBTokenizer', \
+                '-preserveLines', '-lowerCase']
+
+ # ======================================================
+ # prepare data for PTB Tokenizer
+ # ======================================================
+ final_tokenized_captions_for_image = {}
+ image_id = [k for k, v in captions_for_image.items() for _ in range(len(v))]
+ sentences = '\n'.join([c['caption'].replace('\n', ' ') for k, v in captions_for_image.items() for c in v])
+
+ # ======================================================
+ # save sentences to temporary file
+ # ======================================================
+ path_to_jar_dirname=os.path.dirname(os.path.abspath(__file__))
+ tmp_file = tempfile.NamedTemporaryFile(delete=False, dir=path_to_jar_dirname)
+ tmp_file.write(sentences.encode())
+ tmp_file.close()
+
+ # ======================================================
+ # tokenize sentence
+ # ======================================================
+ cmd.append(os.path.basename(tmp_file.name))
+ p_tokenizer = subprocess.Popen(cmd, cwd=path_to_jar_dirname, \
+ stdout=subprocess.PIPE)
+ token_lines = p_tokenizer.communicate(input=sentences.rstrip())[0]
+ token_lines = token_lines.decode()
+ lines = token_lines.split('\n')
+ # remove temp file
+ os.remove(tmp_file.name)
+
+ # ======================================================
+ # create dictionary for tokenized captions
+ # ======================================================
+ for k, line in zip(image_id, lines):
+ if not k in final_tokenized_captions_for_image:
+ final_tokenized_captions_for_image[k] = []
+ tokenized_caption = ' '.join([w for w in line.rstrip().split(' ') \
+ if w not in PUNCTUATIONS])
+ final_tokenized_captions_for_image[k].append(tokenized_caption)
+
+ return final_tokenized_captions_for_image
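Java tokenization aside, the post-processing step above is a plain punctuation filter over already-tokenized lines. A self-contained sketch of just that filter (assuming whitespace-tokenized, lower-cased input, which the real PTB tokenizer produces):

```python
# Punctuation tokens dropped from tokenized captions (same list as ptbtokenizer.py).
PUNCTUATIONS = ["''", "'", "``", "`", "-LRB-", "-RRB-", "-LCB-", "-RCB-",
                ".", "?", "!", ",", ":", "-", "--", "...", ";"]

def strip_punct(line):
    # Keep the word tokens of a pre-tokenized line, dropping punctuation tokens.
    return ' '.join(w for w in line.rstrip().split(' ') if w not in PUNCTUATIONS)

out = strip_punct("a man walks , then stops .")
```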
diff --git a/eval_scripts/DREAM-1K/tarsier/tools/rw_utils.py b/eval_scripts/DREAM-1K/tarsier/tools/rw_utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..b7234c309c2651c5c49900581ff4801bf0d40409
--- /dev/null
+++ b/eval_scripts/DREAM-1K/tarsier/tools/rw_utils.py
@@ -0,0 +1,64 @@
+# Copyright (2024) Bytedance Ltd. and/or its affiliates
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import json
+from json import JSONEncoder
+import numpy
+import pandas as pd
+
+class NumpyArrayEncoder(JSONEncoder):
+ def default(self, obj):
+ if isinstance(obj, numpy.ndarray):
+ return obj.tolist()
+ return JSONEncoder.default(self, obj)
+
+def write_txt(data, path):
+ with open(path, 'w', encoding='utf-8')as f:
+ for d in data:
+ f.write(f'{d}\n')
+
+def read_txt(path):
+ with open(path, 'r', encoding='utf-8', errors='ignore') as f:
+ lines = [l.strip('\n') for l in f.readlines()]
+ return lines
+
+def read_jsonlines(path):
+ objs = []
+ with open(path) as f:
+ for line in f:
+ line = json.loads(line)
+ objs.append(line)
+ return objs
+
+def write_jsonlines(data, path, cls=None, ensure_ascii=False):
+ with open(path, 'w') as f:
+ for d in data:
+ d = json.dumps(d, ensure_ascii=ensure_ascii, cls=cls)
+ f.write(d)
+ f.write('\n')
+
+def read_parquet(path):
+ data = pd.read_parquet(path)
+ return data.to_dict('records')
+
+def write_parquet(data, path):
+ data = pd.DataFrame(data)
+ data.to_parquet(path)
+
+def read_csv(path):
+ data = pd.read_csv(path)
+ return data.to_dict(orient='records')
+
+def write_csv(data, path):
+ data = pd.DataFrame(data)
+ data.to_csv(path, index=False, sep='\t')
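A quick round-trip check of the JSONL helpers, re-stated inline (with an explicit `encoding='utf-8'` added for portability) so the snippet runs standalone:

```python
import json
import os
import tempfile

def write_jsonlines(data, path, ensure_ascii=False):
    # One JSON object per line, mirroring rw_utils.write_jsonlines.
    with open(path, 'w', encoding='utf-8') as f:
        for d in data:
            f.write(json.dumps(d, ensure_ascii=ensure_ascii) + '\n')

def read_jsonlines(path):
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f]

records = [{"idx": 0, "caption": "早上好"}, {"idx": 1, "caption": "hello"}]
path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
write_jsonlines(records, path)
roundtrip = read_jsonlines(path)
```

With `ensure_ascii=False`, non-ASCII captions are stored verbatim rather than as `\uXXXX` escapes, which keeps the files human-readable.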
diff --git a/eval_scripts/Daily-Omni/Daily-Omni_pipeline.sh b/eval_scripts/Daily-Omni/Daily-Omni_pipeline.sh
new file mode 100644
index 0000000000000000000000000000000000000000..cc9d5705d9711305e805c7e3f549e709d8066a08
--- /dev/null
+++ b/eval_scripts/Daily-Omni/Daily-Omni_pipeline.sh
@@ -0,0 +1,62 @@
+#!/bin/bash
+MODEL_PATHS=(
+ "path_to_AVoCaDO"
+)
+RESULTS_DIR="$1"
+
+ORIGINAL_FILE="eval_scripts/Daily-Omni/grouped_data.json"
+MERGED_FILE="$RESULTS_DIR/captioned_results.json"
+if [ ! -f "$MERGED_FILE" ]; then
+ echo "MERGED_FILE not found. Creating from ORIGINAL_FILE..."
+ cp "$ORIGINAL_FILE" "$MERGED_FILE"
+fi
+CAPTION_FILES_TO_MERGE=()
+CAPTION_KEYS=()
+
+# Step 1: caption generation
+for model_path in "${MODEL_PATHS[@]}"; do
+ CLEAN_PATH="${model_path%/}"
+ model_name=$(basename "$CLEAN_PATH")
+
+ caption_file="$RESULTS_DIR/${model_name}_caption.jsonl"
+ echo "Output caption file will be: $caption_file"
+
+ python eval_scripts/Daily-Omni/generate_caption.py \
+ --model_path "$model_path" \
+ --fout_path "$caption_file"
+
+ if [ -f "$caption_file" ]; then
+ CAPTION_FILES_TO_MERGE+=("$caption_file")
+ CAPTION_KEYS+=("${model_name}_caption")
+ else
+ echo "Error: Caption file $caption_file not generated for model $model_path."
+ exit 1
+ fi
+done
+
+# Step 2: merge generated caption files
+echo "Merging all generated caption files..."
+python eval_scripts/Daily-Omni/merge_captions.py \
+ --original_file "$MERGED_FILE" \
+ --caption_files "${CAPTION_FILES_TO_MERGE[@]}" \
+ --merged_file "$MERGED_FILE"
+
+
+# Step 3: evaluation
+python eval_scripts/Daily-Omni/evaluation.py \
+ --merged_file "$MERGED_FILE" \
+ --caption_keys "${CAPTION_KEYS[@]}"
+
+# Step 4: analysis and save evaluation results
+for caption_key in "${CAPTION_KEYS[@]}"; do
+ echo "Running analysis for caption key: $caption_key"
+
+ result_file="$RESULTS_DIR/${caption_key}_result.jsonl"
+ answer_key="${caption_key//_caption/_resp}"
+
+ if [ -f "$result_file" ]; then
+ python eval_scripts/Daily-Omni/analysis.py --result_file_path "$result_file" --answer_key "$answer_key"
+ else
+ echo "Warning: Result file '$result_file' not found for analysis."
+ fi
+done
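Step 4 derives each answer key from its caption key with bash pattern substitution, where `${var//pat/repl}` replaces every occurrence of `pat`. A minimal demonstration (the key name is illustrative):

```shell
caption_key="AVoCaDO_caption"
# ${var//pat/repl} substitutes all matches of the pattern.
answer_key="${caption_key//_caption/_resp}"
echo "$answer_key"
```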
diff --git a/eval_scripts/Daily-Omni/analysis.py b/eval_scripts/Daily-Omni/analysis.py
new file mode 100644
index 0000000000000000000000000000000000000000..56a7e1c123acc5625b8b55ab40920009c7b3bd75
--- /dev/null
+++ b/eval_scripts/Daily-Omni/analysis.py
@@ -0,0 +1,18 @@
+import pandas as pd
+import argparse
+import os
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(description="Analyze the evaluation results.")
+ parser.add_argument("--result_file_path", type=str, required=True, help="Path to the result file (.jsonl).")
+ parser.add_argument("--answer_key", type=str, required=True, help="The key for the model's response in the result file.")
+
+ args = parser.parse_args()
+
+ data = pd.read_json(args.result_file_path, lines=True)
+
+ acc = (data['answer'].str.upper() == data[args.answer_key].str.upper()).mean()
+ print(f"Accuracy for {args.answer_key} is: {acc:.2%}")
+
+ with open(f"{os.path.dirname(args.result_file_path)}/{args.answer_key}.log", "w", encoding='utf-8') as fout:
+ fout.write(f"Accuracy for {args.answer_key} is: {acc:.2%}")
\ No newline at end of file
diff --git a/eval_scripts/Daily-Omni/evaluation.py b/eval_scripts/Daily-Omni/evaluation.py
new file mode 100644
index 0000000000000000000000000000000000000000..0ce209a60770ee3377b4171ace5daff19b8d3e68
--- /dev/null
+++ b/eval_scripts/Daily-Omni/evaluation.py
@@ -0,0 +1,225 @@
+###
+# Use an LLM to answer multiple-choice questions about each video, given only the generated caption.
+###
+import os
+import sys
+import time
+import json
+import traceback
+import multiprocessing
+import random
+import subprocess
+import argparse
+
+import numpy as np
+from google import genai
+from google.genai import types
+from google.genai.types import (
+    HarmBlockThreshold,
+    HarmCategory,
+    SafetySetting,
+)
+
+os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = ''
+LOCATION = "global"
+user_info_path = ''
+user_info = json.load(open(user_info_path))
+PROJECT_ID = user_info['project_id']
+MODEL = "gemini-2.5-pro"
+
+def set_seed(seed):
+    np.random.seed(seed)
+    random.seed(seed)
+
+SEED = 42
+set_seed(SEED)
+
+safety_settings = [
+    SafetySetting(category=HarmCategory.HARM_CATEGORY_HATE_SPEECH, threshold=HarmBlockThreshold.OFF),
+    SafetySetting(category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, threshold=HarmBlockThreshold.OFF),
+    SafetySetting(category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, threshold=HarmBlockThreshold.OFF),
+    SafetySetting(category=HarmCategory.HARM_CATEGORY_HARASSMENT, threshold=HarmBlockThreshold.OFF)
+]
+
+CONFIG = types.GenerateContentConfig(
+    temperature=0,
+    top_p=0.001,
+    thinking_config=types.ThinkingConfig(
+        include_thoughts=True,
+        thinking_budget=512
+    ),
+    safety_settings=safety_settings,
+    seed=SEED,
+    system_instruction='''
+    You are a precise QA assistant. Your task is to answer multiple-choice questions based ONLY on the video caption provided.
+    Do not use any outside knowledge or assumptions—your answer must strictly reflect information from the caption.
+    Always output only the capital letter corresponding to your choice (e.g., A, B, C, D).
+    If the caption does not provide enough information to answer the question, output "N/A" instead.
+    '''
+)
+client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
+
+def caption2json(json_path, caption_path):
+
+ with open(json_path, 'r', encoding='utf-8') as f:
+ json_data = json.load(f)
+ model = os.path.basename(caption_path).split("_")[0]
+
+ captions = {}
+ with open(caption_path, 'r', encoding='utf-8') as f:
+ for line in f:
+ if not line.strip():
+ continue
+ item = json.loads(line)
+ for vid, cap in item.items():
+ captions[vid] = cap
+
+ for entry in json_data:
+ vid = entry.get("video_id")
+ if vid in captions:
+ entry[f"{model}_caption"] = captions[vid]
+
+ with open(f"{model}_merge_data.json", 'w', encoding='utf-8') as f:
+ json.dump(json_data, f, ensure_ascii=False, indent=2)
+
+ print(f"merged successfully, the output file is {model}_merge_data.json")
+
+
+def generate(prompt):
+ contents = [prompt]
+
+ answer, thinking = None, None
+ max_retries = 10
+
+ for i in range(max_retries):
+ try:
+ response = client.models.generate_content(
+ model=MODEL,
+ contents=contents,
+ config=CONFIG
+ )
+
+ answer_parts, thought_parts = [], []
+ for part in response.candidates[0].content.parts:
+ if not getattr(part, "text", None):
+ continue
+ if getattr(part, "thought", False):
+ thought_parts.append(part.text)
+ else:
+ answer_parts.append(part.text)
+ answer = "\n".join(answer_parts).strip()
+ thinking = "\n".join(thought_parts).strip()
+ if answer:
+ break
+ else:
+ print(f"[WARN] Attempt {i+1}: empty answer, retrying ... ")
+ time.sleep(3)
+ except Exception as e:
+ print(f"[ERROR] Attempt {i+1} failed: {e}")
+ traceback.print_exc()
+ time.sleep(3)
+ if not answer:
+ return None, None
+ print(answer)
+ return answer, thinking
+
+def worker(task):
+ vid, video_duration, question, choices, answer, caption_key, answer_key, caption = task
+ choices_text = "\n".join([f"{c}" for c in choices])
+ prompt_filled = f'''
+Here is the video caption:
+"{caption}"
+
+Question: {question}
+Choices:
+ {choices_text}'''
+ try:
+ resp, _ = generate(prompt_filled)
+ return {
+ "video_id": vid,
+ "video_duration": video_duration,
+ "question": question,
+ "choices": choices,
+ "answer": answer,
+ caption_key: caption,
+ answer_key: resp
+ }
+ except Exception as e:
+ traceback.print_exc()
+ return {
+ "video_id": vid,
+ "video_duration": video_duration,
+ "question": question,
+ "choices": choices,
+ "answer": answer,
+ caption_key: caption,
+ answer_key: None
+ }
+
+def run_multiprocess_tasks(tasks, num_processes=None, fout_path=None):
+ if num_processes is None:
+ num_processes = multiprocessing.cpu_count()
+
+ with multiprocessing.Pool(processes=num_processes) as pool:
+ results = pool.map(worker, tasks)
+
+ if fout_path:
+ with open(fout_path, "w", encoding='utf-8') as f:
+ for item in results:
+ f.write(json.dumps(item, ensure_ascii=False) + '\n')
+ f.flush()
+ return results
+
+def eval_dailyomni_caption_qas(file_path, caption_keys=["omni_caption"]):
+ with open(file_path, 'r', encoding='utf-8') as f:
+ data = json.load(f)
+
+ all_results = []
+ for caption_key in caption_keys:
+ answer_key = caption_key.replace("_caption", "_resp")
+ fout_path = f"{os.path.dirname(file_path)}/{caption_key}_result.jsonl"
+
+ tasks = []
+ for video_info in data:
+ vid = video_info["video_id"]
+ video_duration = video_info["video_duration"]
+ caption = video_info[caption_key]
+ for q in video_info["questions"]:
+ task_item = (
+ vid,
+ video_duration,
+ q["Question"],
+ q["Choice"],
+ q["Answer"],
+ caption_key,
+ answer_key,
+ caption
+ )
+ tasks.append(task_item)
+
+ results = run_multiprocess_tasks(tasks, num_processes=20, fout_path=fout_path)
+ all_results.extend(results)
+
+ return all_results
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(description="Evaluate captions using Gemini.")
+ parser.add_argument("--merged_file", type=str, required=True, help="Path to the merged caption file.")
+ parser.add_argument(
+ "--caption_keys",
+ type=str,
+ nargs='+',
+ required=True,
+ help="A list of caption keys to evaluate"
+ )
+ args = parser.parse_args()
+
+ eval_dailyomni_caption_qas(args.merged_file, caption_keys=args.caption_keys)
\ No newline at end of file
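`generate` above wraps the API call in a retry loop: retry on exceptions or empty answers, sleep between attempts, give up after `max_retries`. The pattern can be sketched generically (names here are illustrative, not from the codebase):

```python
import time

def with_retries(fn, max_retries=10, delay=0):
    # Call `fn` until it returns a truthy result or retries run out,
    # sleeping `delay` seconds between attempts (the eval script uses 3s).
    for _ in range(max_retries):
        try:
            result = fn()
            if result:
                return result
        except Exception:
            pass  # the script logs the traceback, then retries
        time.sleep(delay)
    return None

calls = []
def flaky():
    # Succeeds on the third call, returning an empty answer before that.
    calls.append(1)
    return "ok" if len(calls) >= 3 else ""

out = with_retries(flaky, max_retries=5)
```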
diff --git a/eval_scripts/Daily-Omni/generate_caption.py b/eval_scripts/Daily-Omni/generate_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..73e26f08e53b7e96be249b318d16dc91526ef912
--- /dev/null
+++ b/eval_scripts/Daily-Omni/generate_caption.py
@@ -0,0 +1,142 @@
+import os
+import torch
+from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
+from qwen_omni_utils import process_mm_info
+import argparse
+import json
+from tqdm import tqdm
+from pathlib import Path
+import multiprocessing as mp
+import traceback
+import random
+import glob
+
+VIDEO_MAX_PIXELS = 401408 # 512*28*28
+VIDEO_TOTAL_PIXELS = 20070400 # 512*28*28*50
+USE_AUDIO_IN_VIDEO = True
+video_base_dir = "path_to_Daily-Omni_Videos"
+os.environ['VIDEO_MAX_PIXELS'] = str(VIDEO_TOTAL_PIXELS)
+
+def chat(file_path, prompt, model, processor, model_path, max_new_tokens=2048):
+
+ conversation = [
+ {
+ "role": "system",
+ "content": [
+ {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
+ ],
+ },
+ {
+ "role": "user",
+ "content": [
+ {
+ "type": "video",
+ "video": file_path,
+ "max_pixels": VIDEO_MAX_PIXELS,
+ "max_frames": 256
+ },
+ {
+ "type": "text",
+ "text": prompt
+ },
+ ],
+ },
+ ]
+
+ text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
+ audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+ inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+ inputs = inputs.to(model.device).to(model.dtype)
+
+ text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, do_sample=False, thinker_max_new_tokens=max_new_tokens)
+ text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+ model_generation = text.split("\nassistant\n")[-1]
+
+ return model_generation
+
+def worker_proc(rank, gpu_id, model_path, video_paths, prompt, out_path):
+ device_map = {"": f"cuda:{gpu_id}"}
+
+ model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
+ model_path,
+ torch_dtype=torch.bfloat16,
+ device_map=device_map,
+ attn_implementation="flash_attention_2",
+ )
+ model.disable_talker()
+ processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
+
+ fout = open(out_path, "w", encoding="utf-8")
+
+ for video_path in tqdm(video_paths, desc=f"Worker-{rank}[GPU-{gpu_id}]"):
+ try:
+ model_generation = chat(video_path, prompt, model, processor, model_path)
+
+ video_id = os.path.basename(video_path).split(".mp4")[0]
+
+ out_data = {
+ "video_id": video_id,
+ "caption": model_generation,
+ }
+
+ fout.write(json.dumps(out_data, ensure_ascii=False) + "\n")
+ fout.flush()
+ except Exception as e:
+ print(f"[Worker-{rank}] Error on {video_path}: {e}")
+ traceback.print_exc()
+
+ fout.close()
+ print(f"[Worker-{rank}] Done, wrote results to {out_path}")
+
+def run_multi_gpu(model_path, video_paths, prompt_list, final_out_path, num_gpus=8):
+ chunk_size = len(video_paths) // num_gpus + 1
+ chunks = [video_paths[i:i+chunk_size] for i in range(0, len(video_paths), chunk_size)]
+
+ processes = []
+ tmp_files = []
+
+ for rank, chunk in enumerate(chunks):
+ gpu_id = rank % num_gpus
+ tmp_out = final_out_path.replace(".jsonl", f".part{rank}.jsonl")
+ tmp_files.append(tmp_out)
+ prompt = random.choice(prompt_list)
+
+ p = mp.Process(
+ target=worker_proc,
+ args=(rank, gpu_id, model_path, chunk, prompt, tmp_out)
+ )
+ p.start()
+ processes.append(p)
+
+ for p in processes:
+ p.join()
+
+ with open(final_out_path, "w", encoding="utf-8") as fout:
+ for tmp in tmp_files:
+ with open(tmp, "r", encoding="utf-8") as fin:
+ for line in fin:
+ fout.write(line)
+ os.remove(tmp)
+
+ print(f"All results merged into {final_out_path}")
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(description="Evaluate a model and save results.")
+ parser.add_argument("--model_path", type=str, required=True, help="Path to the model checkpoint.")
+ parser.add_argument("--fout_path", type=str, required=True, help="Path to the output caption file")
+ args = parser.parse_args()
+ mp.set_start_method("spawn", force=True)
+
+ video_paths = glob.glob(os.path.join(video_base_dir, "**", "*.mp4"), recursive=True)
+
+ prompt_list = [
+ "Provide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned.",
+ "Thoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated.",
+ "Please describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions.",
+ "Offer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync.",
+ "Describe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized.",
+ "Please provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so.",
+ "Give a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ ]
+
+ run_multi_gpu(args.model_path, video_paths, prompt_list, args.fout_path, num_gpus=8)
diff --git a/eval_scripts/Daily-Omni/grouped_data.json b/eval_scripts/Daily-Omni/grouped_data.json
new file mode 100644
index 0000000000000000000000000000000000000000..26a4a3fb8065926cbd5c7b9f67093829adccb624
--- /dev/null
+++ b/eval_scripts/Daily-Omni/grouped_data.json
@@ -0,0 +1,17957 @@
+[
+ {
+ "video_id": "Ec_lQgZ9wlg",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What visual elements were displayed immediately after Dr. Rajani's 'BOTOX WITHOUT THE BOTOX' video concluded?",
+ "Choice": [
+ "A. Still product bottle → Price text overlay",
+ "B. Facial treatment demonstration → Presenter holding product while explaining",
+ "C. Presenter's torso shot → Secondary screen activation",
+ "D. Bookshelf backdrop → Close-up of string lights"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which visual event coincides with Clare stating 'Apply this to smooth wrinkles better than Botox' in her narration?",
+ "Choice": [
+ "A. Still image of Revox Line Relaxer bottle during price announcement",
+ "B. Bookshelf with string lights visible behind the presenter",
+ "C. Presenter holding a black device while Dr. Rajani's video is displayed",
+ "D. Clip of a woman receiving facial treatment on her face"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the primary distinction between Dr. Rajani's video analysis and Clare's presentation regarding the Revox product?",
+ "Choice": [
+ "A. One explains its mechanism as a Botox alternative, the other details cost and availability",
+ "B. One demonstrates application steps, the other chemical composition",
+ "C. One uses clinical before/after visuals, the other third-party product endorsements",
+ "D. One features medical professional testimonials, the other consumer demographics"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "XUWxQYmiBQY",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "At the moment the static shot of Obsidian Fury first appeared, which specific audio event was synchronized with it?",
+ "Choice": [
+ "A. Speaker's self-introduction and figure presentation",
+ "B. Reference to avoiding plot spoilers",
+ "C. Logo transition sound effect",
+ "D. Analysis of combat sequences"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What visual event coincided with the speaker's initial introduction of the Obsidian Fury figure?",
+ "Choice": [
+ "A. Rotating turntable showcasing 360-degree details",
+ "B. Chest emblem's golden highlights activation",
+ "C. Robot's stationary placement on mirrored platform",
+ "D. Gradient backdrop replacing monochrome environment"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "rieb8cBb2z8",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What visual detail was present when the audio mentioned discovering a blue tag in the forest?",
+ "Choice": [
+ "A. The man touching fungus while estimating its age",
+ "B. The man sitting on a log examining fungus",
+ "C. The man wearing a feathered hat and holding a pendant",
+ "D. The man wearing a navy jacket and speaking to the camera"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Why is the man urgently collecting the Agarikon fungus strain during this visit?",
+ "Choice": [
+ "A. Optimal spore release occurs at this growth stage",
+ "B. The forest is scheduled for clear-cutting next year",
+ "C. It's a newly discovered species requiring documentation",
+ "D. The host log's rapid decay would spoil the specimen"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "Me4W36_lUcI",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which audio statement occurred during the panoramic view of Earth with an orbiting satellite?",
+ "Choice": [
+ "A. 'All right Vegas, we're ready for procedure initiation'",
+ "B. 'Loop Bravo venting operations remain pending'",
+ "C. 'Loop Bravo integrity remains unchanged with persistent leaks'",
+ "D. 'Maintaining gradual system shutdown protocol'"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What specific visual perspective is shown when the speaker concludes the statement about Loop Bravo not being vented?",
+ "Choice": [
+ "A. Earth and satellite orbiting in space",
+ "B. High-angle view showing consoles with circular emblems",
+ "C. Elevated wide shot of mission control workers",
+ "D. Technicians examining technical schematics"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "During which visual context did the team discuss irreversible actions?",
+ "Choice": [
+ "A. Near consoles featuring circular emblems",
+ "B. In a densely packed control room section",
+ "C. During initial system checks with static noise",
+ "D. While observing Earth orbit procedures"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "G_VTkkb34gw",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which audio event occurred simultaneously with the two female police officers engaging in target practice at the indoor shooting range?",
+ "Choice": [
+ "A. Sharp aerodynamic swoosh",
+ "B. Foreign-language command phrases",
+ "C. Numerical countdown sequence",
+ "D. Sustained gunfire bursts"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which audio events correspond to the indoor shooting range activities shown in the video?",
+ "Choice": [
+ "A. Countdown sequence during target practice",
+ "B. Electronic music overlapping with gunshots",
+ "C. Hip-hop music with whoosh sounds",
+ "D. Wind sounds followed by human speech"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What is the primary purpose of the white car maneuvering through the water-filled obstacle course with orange cones?",
+ "Choice": [
+ "A. Testing acceleration performance on wet surfaces",
+ "B. Conducting routine vehicle safety inspections",
+ "C. Demonstrating vehicle water resistance capabilities",
+ "D. Simulating emergency response driving scenarios"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "Z0JDTBGh4D8",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which audio segment corresponds to the visual explanation of the Ring Hardwired Transformer Kit's technical diagram?",
+ "Choice": [
+ "A. '...to be compatible, your doorbell transformer must meet these specifications...'",
+ "B. '...if your current transformer doesn’t meet these technical requirements'",
+ "C. '...alternative wireless installation for the video doorbell'",
+ "D. '...triggering the internal chime when pressed'"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the purpose of the hand pointing to the doorbell in the demonstration?",
+ "Choice": [
+ "A. To demonstrate wireless connectivity",
+ "B. To show the doorbell meets voltage specifications",
+ "C. To indicate completion of physical mounting",
+ "D. To explain post-installation functionality"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What indicates a shift from discussing technical specifications to installation methods?",
+ "Choice": [
+ "A. Speaker's pitch rises while mentioning voltage",
+ "B. Background transitions from white to light blue",
+ "C. The Ring logo reappears with a click sound",
+ "D. Text requirements dissolve into circuit diagrams"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "eLPe7xp0jRw",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event directly coincided with the speaker stating 'another marketing illusion or a truly exclusive luxury brand'?",
+ "Choice": [
+ "A. Camera zooming on the bag's red handle stitching",
+ "B. Hands turning the Goyard bag inside out",
+ "C. Hands holding open the red Goyard bag's interior",
+ "D. Hands using a pointed tool to pry open the Goyard bag"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What visual element was prominently displayed when the speaker mentioned having 'a ton of inquiries to review'?",
+ "Choice": [
+ "A. Presentation of the bag on a gray cloth",
+ "B. PEGAI logo on the presenter's shirt",
+ "C. Hands examining the bag's construction",
+ "D. Close-up of the 'GOYARD PARIS' label"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "d6b4OmUFt7I",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which visual-audio pair occurs at the very beginning of the video?",
+ "Choice": [
+ "A. Construction site overview -- training program name reference",
+ "B. Detailed sign graphic -- geographical location mention",
+ "C. Instructor in PPE -- personal equipment experience anecdote",
+ "D. Excavator movement -- hands-on training phase description"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What immediate task required Sara to review documents in the office before equipment operation?",
+ "Choice": [
+ "A. To analyze soil composition charts",
+ "B. To finalize her course payment details",
+ "C. To prepare safety checklists for machinery",
+ "D. To complete classroom requirements stated in the audio"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which visual sequences correspond to the audio mentions of 'go on the equipment' versus 'familiar with the machines'?",
+ "Choice": [
+ "A. First shows an excavator lifting dirt, second shows a dump truck driving through mud",
+ "B. First shows binder review in office, second shows tractor parked on site",
+ "C. First shows safety vest demonstration, second shows mountain landscape",
+ "D. First shows a bulldozer moving earth, second shows a water tank in background"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What specific visual elements were present when Sara mentioned 'email Anderson'?",
+ "Choice": [
+ "A. Indoor office with binder and whiteboard",
+ "B. Dirt field with construction equipment and trees",
+ "C. Excavator actively lifting dirt",
+ "D. Yellow bulldozer parked in foreground"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "bswbQtOPk6E",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "What in-game visual appears immediately after the creator discusses Square Enix's sales results?",
+ "Choice": [
+ "A. Blonde woman analyzing sales charts",
+ "B. Character standing on a rocky landscape",
+ "C. Character engulfed in fiery glow while jumping",
+ "D. Sacred Peaks Refuge sign appearance"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What indicates the fiery propulsion is an intentional game feature rather than a contextual anomaly?",
+ "Choice": [
+ "A. Dynamic cape physics during aerial maneuvers",
+ "B. Tutorial pop-ups about traversal mechanics",
+ "C. Visible controller vibration feedback",
+ "D. Recurring chromatic synchronization with abilities"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "1TtcXC_u4r4",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "How does the background differ when Kerry discusses 'level two/three qualifications' versus 'after-school clubs'?",
+ "Choice": [
+ "A. Shelf toys change to educational books",
+ "B. No visible difference",
+ "C. Background becomes blurred",
+ "D. Presence of a blue identification banner"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What combination of auditory and visual evidence best characterizes Kerry's professional setting?",
+ "Choice": [
+ "A. A podcast studio with soundproofing panels",
+ "B. An academic lecture hall with student desks",
+ "C. An active daycare center during playtime",
+ "D. A vocational training environment with educational toys"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "sNf0WG57RtA",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "When does the presenter begin writing 'colourless salt' relative to the audio's discussion of this property?",
+ "Choice": [
+ "A. Writing commences exactly as the audio first says 'colourless salt'",
+ "B. Audio fully explains its preparation method before writing begins",
+ "C. Writing starts before the audio mentions 'colourless salt'",
+ "D. Audio describes the chemical formula while writing appears"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What is stated about the preparation method of the colorless salt in the presentation?",
+ "Choice": [
+ "A. Formed through protonalysis of sodium",
+ "B. Derived from sodium borohydride under inert conditions",
+ "C. Produced via acid-base neutralization",
+ "D. Requires catalytic hydrogenation"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What specific property of the salt is explained as 'colourless salt' is written?",
+ "Choice": [
+ "A. Preparation method involving protonolysis",
+ "B. Comparison with sodium borohydride",
+ "C. Breakdown of acetate groups in the formula",
+ "D. Role as a catalyst in reactions"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "How is the abbreviation 'STAB' explicitly referenced in the audio compared to its visual presentation?",
+ "Choice": [
+ "A. Audio provides alternative nomenclature while visual highlights functional groups",
+ "B. Visual pairs full name with abbreviation while audio uses only systematic names",
+ "C. Audio uses 'STAB' repeatedly while visual avoids abbreviation",
+ "D. Both modalities explain the abbreviation's derivation from chemical components"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Why did the presenter emphasize the '(CH3COO)₃' portion of the chemical formula while explaining the compound's name?",
+ "Choice": [
+ "A. To indicate the three acetoxy groups corresponding to the formula's subscript",
+ "B. To highlight the sodium (Na) component in the formula",
+ "C. To demonstrate the borohydride (BH) portion's role as the reducing agent",
+ "D. To connect the acetoxy groups to the 'triacetoxy' prefix in the chemical name"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "gPpJoyPjiPw",
+ "video_duration": "30s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What specific action at the supermarket provoked the woman's confrontation?",
+ "Choice": [
+ "A. The man publicly embraced another woman",
+ "B. The man forgot to pay for their groceries",
+ "C. The man argued loudly with a cashier",
+ "D. The man criticized her shopping choices"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "iYDcBb9C-Mc",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "During the discussion on lead generation strategies, why does a downward-pointing arrow appear above the envelope?",
+ "Choice": [
+ "A. To indicate a technical error in mass email campaigns",
+ "B. To suggest a dropdown menu for reply options",
+ "C. To demonstrate the physical motion required to open emails",
+ "D. To draw attention to the envelope's color scheme"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "7yY1OWjT4zA",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which event occurs FIRST in the video's timeline?",
+ "Choice": [
+ "A. White fabric unrolled in a room with ambient music",
+ "B. Fabric being unrolled on a reflective surface with music",
+ "C. Liz Rose pushing a cart through a workshop doorway",
+ "D. 'Conservation' text overlay with blurred hands"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What primary function does Liz Rose's workspace serve based on the activities observed?",
+ "Choice": [
+ "A. Archival storage of religious manuscripts",
+ "B. Precision handling of fragile materials",
+ "C. Temporary exhibit preparation",
+ "D. High-resolution textile documentation"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "zmL97ovp0kU",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "Why does the speaker mention armor limitations during the snail close-up?",
+ "Choice": [
+ "A. To showcase armor crafting methods",
+ "B. To emphasize the snail's unstoppable nature",
+ "C. To critique the snail's aesthetic design",
+ "D. To imply the snail has defensive weaknesses"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "zGVKSnlJFhY",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "When the auction item 'a life-size cow' is mentioned, which specific visual element appears simultaneously?",
+ "Choice": [
+ "A. Green sofa with Buckeyes merchandise",
+ "B. Farm animal sculptures near seating",
+ "C. Vintage livestock exhibition posters",
+ "D. Framed OSU athlete portraits"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "MXYFQArjCxg",
+ "video_duration": "30s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "What is the correct chronological sequence of visual events following Mick's statement 'Let's take a quick look'?",
+ "Choice": [
+ "A. Hand gestures toward van → Camera pans to driver's seat → Kitchen area displayed",
+ "B. Sliding door opens → Interior pan with cabinets → Close-up of striped seat covers",
+ "C. Door handle grasped → Gasoline container highlighted → Exterior sticker visible",
+ "D. Van door closes → Engine compartment exposed → JDM mention repeated"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What specific visual element is shown precisely when the car door closing sound occurs?",
+ "Choice": [
+ "A. Van's sliding door in motion",
+ "B. RIT sticker on exterior body",
+ "C. Exposed engine compartment",
+ "D. Striped seat covers in the interior"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "m4tk3TxWc1I",
+ "video_duration": "30s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What audio and visual elements were present as the man began speaking to the audience?",
+ "Choice": [
+ "A. Meditative music and closing title card",
+ "B. Classical strings and visible book titles",
+ "C. Meditative music and the poet's portrait",
+ "D. Wind instruments and bookshelf backdrop"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What is the man's primary purpose in discussing Hryhorii Skovoroda?",
+ "Choice": [
+ "A. Explaining technical aspects of poetic composition",
+ "B. Promoting cultural heritage through translated works",
+ "C. Comparing Ukrainian and Western literary traditions",
+ "D. Analyzing historical context of 18th-century literature"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "81CM8lLrgoM",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which sequence correctly orders the audio-visual events based on their occurrence in the video?",
+ "Choice": [
+ "A. Indistinct chatter → English class discussion → students receiving handouts",
+ "B. Speaker announcement → desk organization by last name → distribution of unsolicited facts",
+ "C. Student enters class → bulletin board observation → magazine reading session",
+ "D. Sales pitch during student movement → seating arrangement explanation → creepy facts handout"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which action explicitly defines Melissa's responsibility during the class session?",
+ "Choice": [
+ "A. Delivering a lecture using the whiteboard",
+ "B. Circulating unsolicited informational handouts",
+ "C. Enforcing seating arrangements alphabetically",
+ "D. Posting weekly announcements on the bulletin board"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "LttCU4RK-Zc",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What topic did the curly-haired woman introduce when the video returned to her after featuring the second speaker?",
+ "Choice": [
+ "A. Professional benefits of bold accessories",
+ "B. Career development resources at OAC",
+ "C. Symbolism of the microphone in the PICCADILLY TALK logo",
+ "D. How OAC helps make aspirations achievable"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "H3-6byL6FTs",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What action does Ahmad perform immediately after stating, 'I will be sharing with you a bit'?",
+ "Choice": [
+ "A. Holding up a looped climbing rope",
+ "B. Adjusting his position slightly",
+ "C. Pinching his fingers to indicate thickness",
+ "D. Stretching the rope straight out"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "aseK5-RpkSQ",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual element was displayed at the exact moment the speaker began saying 'Hi, what's up everyone?'?",
+ "Choice": [
+ "A. 'U8' superimposed on TV screen",
+ "B. Text 'BEST OLED TV UNDER 1000'",
+ "C. Hisense logo at bottom center",
+ "D. Colorful abstract pattern on television"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "vmK5xIPPSU8",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "In what sequence were the bubbles on the rainbow flower-shaped pop-it popped down during the magic trick demonstration?",
+ "Choice": [
+ "A. Purple, turquoise, dark green, yellow, orange",
+ "B. Purple, turquoise, dark green, orange, yellow",
+ "C. Turquoise, dark green, purple, yellow, orange",
+ "D. Orange, yellow, dark green, turquoise, purple"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "9PsvKIvzgUs",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What is the first visual event that occurs AFTER the child finishes feeding the llama?",
+ "Choice": [
+ "A. A dog becomes visible behind the fence",
+ "B. The child runs away while laughing",
+ "C. The llama spits on the man near the fence",
+ "D. Russian speech accompanies the main action"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "WkBkvlUoZpU",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What is the FIRST visual event that occurs immediately after Dane mentions 'my top 10 books of winter'?",
+ "Choice": [
+ "A. Camera panning to the 'BookTube' sign",
+ "B. Placing books on the pink donut pillow",
+ "C. Adjusting the black necklace",
+ "D. Gesturing hands near dog-print pillows"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What is the primary purpose of zooming in on the TV screen during the 0-10s segment?",
+ "Choice": [
+ "A. To foreshadow Dane's upcoming top 10 book recommendations",
+ "B. To showcase Dane's eclectic home decor",
+ "C. To highlight technical aspects of the BookTube sign",
+ "D. To emphasize Kyra Thomson's analysis of authors with April birthdays"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "How does the audio reference to '67 books' contrast with concurrent visual elements?",
+ "Choice": [
+ "A. Audio specifies a temporal range while visual displays a 'BookTube' sign",
+ "B. Audio mentions publication dates while visual shows necklace details",
+ "C. Audio states a count while visual highlights decorative pillows",
+ "D. Audio emphasizes a reading milestone while visual depicts relaxed body language"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "Alx0y8O6FKU",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What is the immediate visual and auditory response after the girl slams the book and declares 'I'm seriously done'?",
+ "Choice": [
+ "A. A cursor clicks the red 'Subscribe' button",
+ "B. She wears a blue dress while singing 'You shine everyday'",
+ "C. She mentions a 'Don't Talk to Silly' challenge",
+ "D. The text 'PLEASE SUBSCRIBE' flashes in red"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What specific factor contributes to the girl's apprehension during the 20-30s segment?",
+ "Choice": [
+ "A. Lack of prior experience with this exact challenge format",
+ "B. Anxiety about the challenge's late-night timing implied in its name",
+ "C. Uncertainty about operating the tablet visible on her desk",
+ "D. Pressure from maintaining a singing performance simultaneously"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "1F6g2NS_vGY",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What type of content is the video presenting when GEICO advertisements are shown?",
+ "Choice": [
+ "A. Countdown show analyzing NBA players",
+ "B. Live playoff game coverage",
+ "C. Behind-the-scenes team documentary",
+ "D. Historical player career retrospectives"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Why are most spectators wearing red during the discussion of Game Six in the NBA Finals?",
+ "Choice": [
+ "A. Matching GEICO advertisement's primary color",
+ "B. Red represents Miami Heat's team colors",
+ "C. Red symbolizes intensity of the final quarter",
+ "D. Audience supports Toronto Raptors"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What new contextual information appears on the screen when the Golden State Warriors' score reaches 100?",
+ "Choice": [
+ "A. Player #24 completing a dunk",
+ "B. 'NBA FINALS' graphic overlay",
+ "C. Game clock showing 1:26 remaining",
+ "D. Timeouts counter displaying '1' and '2'"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "FqaJYWyQQcE",
+ "video_duration": "30s",
+ "video_category": "Nonprofits & Activism",
+ "questions": [
+ {
+ "Question": "What justifies the news anchor's serious tone when discussing land subsidence?",
+ "Choice": [
+ "A. The subsidence was confined to coastal regions",
+ "B. Data shows historical records broken since 1978",
+ "C. Structural integrity checks are advised for homeowners",
+ "D. Over 50% of Japan's prefectures were affected"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "How does the 6,000 km² subsidence area compare specifically to the previous year's measurements?",
+ "Choice": [
+ "A. Matched the prior year's subsidence levels",
+ "B. Represents 50% of all recorded subsidence since 1978",
+ "C. Twice the area affected in the prior year",
+ "D. 1,000 times greater than the previous year's data"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "S8D4VF8Yl1g",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What is the man's primary objective when scanning the DVD barcode with his phone?",
+ "Choice": [
+ "A. To assess eBay market value",
+ "B. To check recent notifications",
+ "C. To authenticate the disc's legitimacy",
+ "D. To list the item instantly online"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What visual transition occurs right after the phone displays eBay prices for the DVD?",
+ "Choice": [
+ "A. Demonstration of scanning another DVD case",
+ "B. A close-up of him initiating direct address to viewers",
+ "C. The man begins discussing profit margins on collectibles",
+ "D. A cut to his home shelf with sports memorabilia"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "WpPgSEzdCns",
+ "video_duration": "30s",
+ "video_category": "Comedy",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with Wilson Gramling's admission 'I was on welfare'?",
+ "Choice": [
+ "A. The man in colonial attire standing behind a podium",
+ "B. The man in colonial attire gesturing emphatically at the podium",
+ "C. A Black man in a suit leaning forward to ask questions",
+ "D. Councilwoman Knope sitting in front of a desk"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What does the 'COUNCIL KNO' sign most directly indicate about the meeting's context?",
+ "Choice": [
+ "A. The meeting occurs in a temporary administrative setup",
+ "B. A cultural institution is hosting a special event",
+ "C. Proceedings are held in Councilwoman Knope's designated office",
+ "D. Pawnee Video Dome is implementing new policies"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What was the key focus of Councilwoman Knope's directives to the Pawnee Video Dome?",
+ "Choice": [
+ "A. To phase out subsidized food programs",
+ "B. To approve unrestricted financial allocations",
+ "C. To preserve a local cultural landmark",
+ "D. To ensure transparent fiscal accountability"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "G8ywbM85GVE",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What can be inferred about Phil's professional role based on the audiovisual context?",
+ "Choice": [
+ "A. He is employed in an industrial sector following his graduation",
+ "B. He is currently enrolled as a Monash engineering student",
+ "C. He oversees operations in a warehouse environment",
+ "D. He conducts academic research in mechanical engineering"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What professional sectors do Scott and Phil's respective settings primarily represent?",
+ "Choice": [
+ "A. Scott: Academic institution; Phil: Industrial engineering",
+ "B. Scott: Corporate office; Phil: Construction site",
+ "C. Scott: Research laboratory; Phil: Manufacturing plant",
+ "D. Scott: Government agency; Phil: Energy sector"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What contextual audio information precedes Phil's introduction in the video?",
+ "Choice": [
+ "A. An explanation of alumni-submitted career demonstrations",
+ "B. A tutorial on mechanical engineering equipment usage",
+ "C. A promotional segment for aerospace engineering courses",
+ "D. A discussion about internship application processes"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "7jeoNQXpJYM",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Which audio segment is heard as the man initiates his phone call while wearing a blue and yellow cap near the brick house?",
+ "Choice": [
+ "A. 'Coming home real soon, man. You know your boy...'",
+ "B. 'That right, too.'",
+ "C. 'been gone anyway, man. You know you, man...'",
+ "D. 'Yo.' (music playing with group rapping)"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "ev_45FHogv8",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What is the primary function of the white lines taped on the floor during the 20-30s segment?",
+ "Choice": [
+ "A. To mark the boundaries for safe practice",
+ "B. To outline the sequence of movements in the form",
+ "C. To repair cracks in the flooring",
+ "D. To provide visual guidance for foot positioning"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "8I2nIXrbFpo",
+ "video_duration": "30s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What visual event occurs immediately after the high-pitched sound effect in the 20-30s segment?",
+ "Choice": [
+ "A. Vocalizing 'Yahh!'",
+ "B. Grabbing the flour bag",
+ "C. Character standing upright",
+ "D. Pouring flour into the bowl"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "cyfur9LHDII",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What role does the initial electronic music serve in relation to its accompanying visual elements?",
+ "Choice": [
+ "A. To synchronize with the geometric patterns suggesting technological innovation",
+ "B. To foreshadow natural environments shown later in the video",
+ "C. To establish a meditative atmosphere for the product reveal",
+ "D. To create an auditory contrast with subsequent high-energy tracks"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which visual element is displayed when the audio instructs to 'click the circle for more info'?",
+ "Choice": [
+ "A. A finger interacting with glowing grid lines",
+ "B. 'Best Inflatable Kayak' in orange box",
+ "C. Yellow kayak labeled 'Number 1'",
+ "D. Three kayaks on a sandy surface"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "f3RcpzE1T9U",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What specific mental health challenge did the bald man face during the 1990s?",
+ "Choice": [
+ "A. He sought treatment for clinical depression",
+ "B. He advocated for neurodiversity accommodations",
+ "C. He managed severe anxiety in workplace settings",
+ "D. He experienced bipolar disease"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What functional difference existed between the hand gestures of the two men in blue blazers?",
+ "Choice": [
+ "A. One conveyed emotional states while the other controlled dialogue flow",
+ "B. One matched speech rhythm while the other marked transitions",
+ "C. One demonstrated authority while the other indicated uncertainty",
+ "D. One structured inquiries while the other illustrated personal narratives"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What key workplace challenge was being discussed when the bald man clasped his hands attentively?",
+ "Choice": [
+ "A. Adapting to multiple professional roles",
+ "B. Handling mental health challenges",
+ "C. Resolving interpersonal conflicts",
+ "D. Implementing dress code reforms"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What was the purpose of the man in the blue blazer placing his index finger on his chin?",
+ "Choice": [
+ "A. To prepare to invite a response from Peter",
+ "B. To acknowledge a listener's agreement with his statement",
+ "C. To highlight a key point during his explanation",
+ "D. To express uncertainty about the topic being discussed"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "17MofPDhLTQ",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which event occurs first in the video based on the speaker's narration?",
+ "Choice": [
+ "A. The woman previews upcoming renovation discussions",
+ "B. The woman focuses on tiled flooring patterns",
+ "C. The woman details replacing black tiling with marble flooring",
+ "D. The woman provides context about the apartment's age"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What key change was made to the entranceway's elements?",
+ "Choice": [
+ "A. The handle mechanism was modernized while retaining its appearance",
+ "B. The handle's color was updated to match modern trends",
+ "C. The flooring material was upgraded from tile to marble",
+ "D. Both elements were replaced for improved durability"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "E0K4HLE_0GI",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What context best explains the speakers' collaborative dialogue?",
+ "Choice": [
+ "A. Improvising a theater performance",
+ "B. Analyzing a planetary documentary",
+ "C. Troubleshooting a satellite interface",
+ "D. Editing a space opera screenplay"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "QFNwukuWJ4I",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event occurs immediately after the audio line 'His name is All Might'?",
+ "Choice": [
+ "A. Close-up of a bloodied character with green eyes",
+ "B. Title card announcing 'ANIME'S HOTTEST NEW HEROES'",
+ "C. Chaotic city street with overturned cars and smoke",
+ "D. Police officer shielding eyes near scattered money"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which audio event coincides with All Might flying over a city skyline while a large explosion occurs in the background?",
+ "Choice": [
+ "A. 'I want to see a society...' dialogue overlapping with the city explosion",
+ "B. 'I want to be the symbol of peace...' monologue during city street chaos",
+ "C. Whooshing sound during Toho logo transition",
+ "D. Explosion sound effect during a sudden dark frame"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "How does the audio contextualization of the two explosion sound effects differ in their respective scenes?",
+ "Choice": [
+ "A. One is a non-diegetic soundtrack element, while the other originates within the scene",
+ "B. The first explosion sound is truncated for dramatic effect, the second extends naturally",
+ "C. One occurs during a character's motivational monologue, the other during scene transition",
+ "D. The first emphasizes bass frequencies, while the second focuses on treble tones"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "0U_X5BIkYoA",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which social media logo was the first to appear chronologically in the video?",
+ "Choice": [
+ "A. Both Instagram and YouTube appeared simultaneously",
+ "B. YouTube",
+ "C. Instagram",
+ "D. Twitter"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Why does the man specifically mention using the notification bell when asking viewers to subscribe?",
+ "Choice": [
+ "A. To promote his Twitter and Instagram accounts",
+ "B. To acknowledge existing subscribers' loyalty",
+ "C. To ensure updates about whiskey-related content",
+ "D. To receive notifications for all channel activities"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What visual element appears when the speaker mentions 'live streams'?",
+ "Choice": [
+ "A. Instagram handle display",
+ "B. Background music visualization",
+ "C. Twitter logo display",
+ "D. Subscription graphics animation"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "uMkFfqmVr1M",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which event occurred FIRST chronologically?",
+ "Choice": [
+ "A. Team practicing in official uniforms during daylight",
+ "B. Player discussing a personal milestone in casual attire",
+ "C. Athlete interacting with supporters after an event",
+ "D. Female presenter announcing a groundbreaking sports contract"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which visual backdrops are present during Patrick Mahomes' discussion of home ownership versus the announcement of his historic contract?",
+ "Choice": [
+ "A. University hospital logos vs Chiefs training facility",
+ "B. Suburban neighborhood scene vs sports arena signage",
+ "C. Weightlifting gym vs autograph crowd",
+ "D. Contemporary living room vs purple bokeh backdrop"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "6l8n-VJxqmU",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual element is synchronized with the electronic music in the video?",
+ "Choice": [
+ "A. Printer and decorative plants on the desk",
+ "B. White background with 'AUTOMATE YOUR SUCCESS'",
+ "C. 'Facebook Marketplace Account Suspended' text",
+ "D. Man speaking into a microphone"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What key process does the video primarily demonstrate regarding Facebook Marketplace?",
+ "Choice": [
+ "A. Configuring audio equipment for clear recordings",
+ "B. Marketing decorative plants via social platforms",
+ "C. Restoring access to a suspended account",
+ "D. Automating sales processes using specialized tools"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which visual elements are prominently displayed when the speaker says 'What's up everybody'?",
+ "Choice": [
+ "A. Workspace setup with microphone, plants, and suspension title text",
+ "B. Close-up of speaker's face against soundproofing foam",
+ "C. Minimalist white screen with geometric brand logo",
+ "D. Grey notification box with restoration message"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "-jR3C_yA_G0",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual element was displayed precisely when the narrator stated 'Our Whirlpool Galaxy is a good model for this'?",
+ "Choice": [
+ "A. Whirling spiral structure with dynamic white/red stellar clusters",
+ "B. Spiral galaxy densely populated with red star-forming regions",
+ "C. Side-by-side comparison of galaxy image and rotation curve",
+ "D. Galaxy with superimposed text 'Spiral Arm Density Wave Theory'"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What visual transition occurs as the narrator finishes stating 'Our Whirlpool Galaxy is a good model for this'?",
+ "Choice": [
+ "A. Galaxy condenses into a square on the left with a dark right background",
+ "B. Galaxy's arms dissolve into scattered stars",
+ "C. Text about density wave theory disappears",
+ "D. Milky Way Rotation Curve graph appears"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "-l8djVj0FcU",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which two specific economic factors directly enabled Harry's financial success as a 'Pandemic's winner'?",
+ "Choice": [
+ "A. Tax incentives and healthcare industry growth",
+ "B. Government stimulus checks and property sales",
+ "C. Lower expenses and booming job market",
+ "D. Remote work policies and stock investments"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "In what sequence did the three property investment challenges first appear as interconnected white circles?",
+ "Choice": [
+ "A. Mortgage rates → Tax laws → Politicians",
+ "B. Tax laws → Politicians → Mortgage rates",
+ "C. Politicians → Mortgage rates → Tax laws",
+ "D. Tax laws → Mortgage rates → Politicians"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What two key factors directly contributed to Harry's accumulation of cash according to the video?",
+ "Choice": [
+ "A. Rising stock investments and tax rebates",
+ "B. Property appreciation and rental income growth",
+ "C. Pandemic stimulus checks and inheritance",
+ "D. Reduced expenditures and strong labor market conditions"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What key distinction is highlighted between the individuals shown on the left and right sides of the screen during the investment example?",
+ "Choice": [
+ "A. Depicts Harry versus the narrator",
+ "B. Utilized cash savings versus securing a mortgage",
+ "C. Focused on short-term versus long-term investments",
+ "D. Represented a historical example versus current trends"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "CwW8mqaHQmU",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Why was upbeat pop music used during the animated channel introduction sequence?",
+ "Choice": [
+ "A. To emphasize the mixing process of the featured beverages",
+ "B. To match the enthusiastic tone of the host's verbal greeting",
+ "C. To complement the whimsical art style of the animated logo",
+ "D. To contrast with the complex dessert preparation shown"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "lyX4hKwSyMk",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which sequence reflects the historical progression of concepts discussed, based on audio-visual synchronization?",
+ "Choice": [
+ "A. Genetic tools introduction → Darwin's theory with gestures → Mendel's principles",
+ "B. Darwin's hand gestures → Genetic tools explanation → Mendel's zoom-in section",
+ "C. Darwin's mention with clasped hands → Mendel's ideas during zoom-in → Genetic tools discussion",
+ "D. Mendel's mention while facing audience → Darwin's mechanism → Genetics as critical tool"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What distinct visual technique is used during the discussion of Mendel's contributions compared to Darwin's?",
+ "Choice": [
+ "A. Tight facial closeup during key explanation",
+ "B. Repeated cutting between camera angles",
+ "C. Full-body shot showing demonstrative pacing",
+ "D. Slow pan across the chalkboard"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Why did the camera zoom in precisely when discussing Mendel's contributions?",
+ "Choice": [
+ "A. To focus on facial expressions while introducing Mendel's concepts",
+ "B. To highlight a shift to discussing Darwin's contributions",
+ "C. To draw attention to a technical term displayed on screen",
+ "D. To emphasize his hand gestures during a demonstration"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "BbVjTgLJYVQ",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What audio-described factor contributed most to the Celtics' win being characterized as a 'gritty' performance in Motor City?",
+ "Choice": [
+ "A. Jason Tatum extending his 30+ point scoring streak",
+ "B. Missing three key rotation players due to injury",
+ "C. Achieving the highest offensive rating in NBA history",
+ "D. Overcoming a 15-point fourth quarter deficit"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What do the statistics presented in the video indicate about the Celtics' offensive and defensive ratings?",
+ "Choice": [
+ "A. Both ratings significantly fluctuated during Tatum's 30-point scoring streak",
+ "B. Offense holds a historically record-breaking rating while defense ranks below league average",
+ "C. Defense has shown steady improvement to surpass their offensive efficiency",
+ "D. Both offensive and defensive ratings rank in the top 10 league-wide"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What key factor contributed to Jason Tatum's scoring success as indicated by the synchronized audio-visual context?",
+ "Choice": [
+ "A. He excelled in one-on-one isolation plays",
+ "B. He compensated for key teammates' absence through increased shot volume",
+ "C. He targeted weaknesses in zone defensive schemes",
+ "D. He exploited defensive mismatches against inexperienced opponents"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "7-v3XPyyipk",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What on-screen information establishes Stu Vaughan's authority to discuss magnesium's benefits?",
+ "Choice": [
+ "A. A medical degree displayed next to his name",
+ "B. Text identifying him as a research scientist",
+ "C. The host introducing him as a certified nutritionist",
+ "D. His title as Head of Marketing and E-commerce"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What visual element is consistently present on screen when Stu explains magnesium's cardiovascular benefits?",
+ "Choice": [
+ "A. Framed picture of pink-leafed plants",
+ "B. 'The Café' logo",
+ "C. Glass jars containing supplements",
+ "D. Blueprint diagrams of human anatomy"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "bf7tZng_lEI",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What is the primary reason the group is dancing in the outdoor plaza based on audiovisual context?",
+ "Choice": [
+ "A. Expressing communal joy referenced in lyrics",
+ "B. Celebrating the national anthem's recognition",
+ "C. Filming a music video scene",
+ "D. Promoting a cultural festival"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which visual element paired with the 'I never knew when they would come' speech explicitly confirms the location?",
+ "Choice": [
+ "A. African geometric shirt pattern",
+ "B. Safety vest in outdoor plaza",
+ "C. Johannesburg skyline backdrop",
+ "D. 'SOUTH AFRICA' text overlay"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "Ggv1cJYtW8E",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What is the primary focus of the video’s narrative based on synchronized audio-visual elements?",
+ "Choice": [
+ "A. A dating platform’s role in modern relationships",
+ "B. A music video compilation",
+ "C. A documentary on internet history",
+ "D. Exploring geometric animation techniques"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "40VQROcd5sQ",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which audio events coincided with the visual demonstration of Morgan Turner applying makeup during her tutorial?",
+ "Choice": [
+ "A. Male singing layered with foundation application sounds",
+ "B. Upbeat electronic music with male vocals",
+ "C. Apologetic speech about video delays",
+ "D. Woman explaining eyeshadow palette rankings"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the correct chronological order of audio-visual events in the video?",
+ "Choice": [
+ "A. Upbeat electronic music plays as lipstick swatches are shown → The creator introduces monthly rankings while applying green eyeshadow → Explanation of 10 palettes with an apology as the cabinet is displayed",
+ "B. Electronic music plays during the initial tutorial → The creator welcomes viewers and discusses rankings with a thumbs up → Detailed explanation of 10 palettes and scheduling apology alongside the cabinet",
+ "C. Male singing accompanies the makeup tutorial's introduction → A thumbs up gesture occurs during a discussion of lipstick shades → The creator mentions being off-schedule while showcasing eyelash images",
+ "D. Green eyeshadow application synchronizes with electronic music → Lipstick swatches are presented during a rankings speech → The cabinet display coincides with the 10 palettes explanation"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "jj5j3jksbUM",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which audio segment occurred as LCpl Browning's name and title were displayed on screen in front of the 'Naval School of Music' flag?",
+ "Choice": [
+ "A. Country music performance with lyrics referencing headlights",
+ "B. A vocalist explaining audience engagement responsibilities in the Marine Corps",
+ "C. Formal self-identification including name and military branch",
+ "D. Orchestral music with superimposed text about historical vocalist milestones"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What was the chronological order of these audio events?",
+ "Choice": [
+ "A. Representation speech → Pop music transition → Country singing",
+ "B. Orchestral music → Name introduction → Country singing",
+ "C. Pop music transition → Representation speech → Country singing",
+ "D. Name introduction → Orchestral music → Representation speech"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What distinguished the visual backdrops between her two spoken segments discussing Marine Corps representation?",
+ "Choice": [
+ "A. First showed soundproof panels and Roland keyboard, second focused on combat uniform details",
+ "B. First segment used text overlay about vocalist history, second featured Naval School of Music flag with drums",
+ "C. First highlighted audio equipment brands, second emphasized filing cabinet organization",
+ "D. Both segments used identical microphone close-ups with neutral walls"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "decrvBAe_Rk",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What is the primary purpose of the batteries included in the Velux solar blind package?",
+ "Choice": [
+ "A. To provide backup power during switch malfunctions",
+ "B. To store excess solar energy for motorized adjustments",
+ "C. To enable wireless functionality in the remote control",
+ "D. To power the wall-mounted switch for manual operation"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "nbuHTaMV8ok",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which event occurs FIRST in the video timeline?",
+ "Choice": [
+ "A. Robotic hand close-up with ominous music",
+ "B. Suspenseful strings music during helmet reflection",
+ "C. Whooshing sound during ISS cloud backdrop",
+ "D. Suspenseful strings music during text appearance"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What operational aspect of the Robonaut does Nic Radford discuss in the video?",
+ "Choice": [
+ "A. ISS structural integrity assessments",
+ "B. Orbital checkout procedures",
+ "C. R2 unit software animation",
+ "D. Robotic hand calibration in microgravity"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "SZRxEbvKih0",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Why does the speaker emphasize 'two things are better than one'?",
+ "Choice": [
+ "A. To demonstrate the importance of home decor in a workspace",
+ "B. To highlight the device's dual functionality as a laptop and tablet",
+ "C. To advocate for purchasing a tablet instead of a laptop",
+ "D. To justify the multitasking habits of young adults"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What is the primary focus of the video's content?",
+ "Choice": [
+ "A. Tutorial on audio editing techniques",
+ "B. Analyzing home office design principles",
+ "C. Advocating for multi-functional tech gadgets",
+ "D. Evaluating smartphone gaming performance"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "_t2geYpykFo",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What was the initial visual event following Camille's display of her ponytail?",
+ "Choice": [
+ "A. Preview of upcoming root treatment",
+ "B. Explanation of orange-red gradient",
+ "C. Exposure of dark root growth",
+ "D. Close-up of pink-tipped curls"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What specific hair concern does the creator mention addressing in their next video?",
+ "Choice": [
+ "A. Curly hair washing with color-protecting products",
+ "B. Maintaining pastel-colored ends",
+ "C. Correcting dark roots with blonde highlights",
+ "D. Transitioning from ombre to solid color"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "HOa6A7GVvEc",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Based on the video's ranking of the top arms, which sequence correctly orders the bodybuilders from fifth to second place?",
+ "Choice": [
+ "A. Wesley Vissers (classic physique) → Keone Pearson (212 division) → Brandon Curry (open division) → Bald bodybuilder with veiny arms",
+ "B. Brandon Curry (SCITEC stage) → Keone Pearson (blue top) → Wesley Vissers (medal) → Bald bodybuilder",
+ "C. Arnold Schwarzenegger (classic reference) → Brandon Curry (gym close-up) → Bald bodybuilder → Keone Pearson (restroom pose)",
+ "D. Keone Pearson (thick arms) → Bald bodybuilder (smiling flex) → Wesley Vissers (brick wall) → Brandon Curry (black cap)"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What specific feature of Brandon Curry's arm is emphasized during the gym pull-down scene?",
+ "Choice": [
+ "A. Affiliation with a supplement brand",
+ "B. Vascularity enhanced by sweat during exertion",
+ "C. Medal-accentuated tricep definition",
+ "D. 'Heart-shaped bicep' referenced in narration"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "aYVMcCgScVE",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What is the correct sequence of audio-visual pairings in chronological order?",
+ "Choice": [
+ "A. Superheated gas narration with dark clouds → Bright light transformation → Galactic collision visuals with music",
+ "B. Swirling cloud formation → 'exploding stars' audio with nebula → 'superheated gas' description with light beams",
+ "C. Energy sphere creation with 'exploding stars' audio → Swirling jets during black hole description → Star cluster focus during galaxy collision narration",
+ "D. Bright light expansion with 'violent events' narration → Nebula formation with exploding stars audio → Galactic merger discussion"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "cFkYpIKWH5A",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What does the recurring wooden 'DC' logo most likely signify in the video?",
+ "Choice": [
+ "A. Indicates downloadable game content",
+ "B. Identifies the video's production company",
+ "C. Marks collaboration with Nintendo developers",
+ "D. Denotes the creator's channel branding"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What primary visual element was on screen when the speaker first mentioned 'underrated games'?",
+ "Choice": [
+ "A. Yoshi balloon hovering over a crowd",
+ "B. Wooden DC logo in corner",
+ "C. Notification bell with active cursor overlay",
+ "D. Game Boy Advance close-up"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which game's cover art was displayed when the narrator began discussing the fifth entry in the top Yoshi games list?",
+ "Choice": [
+ "A. Yoshi's Crafted World",
+ "B. Yoshi Topsy Turvy",
+ "C. The Legend of Zelda: Breath of the Wild",
+ "D. Yoshi's Island"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "4sBpUHylq9s",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What is the significance of the recurring username @HerBlackHand in the video?",
+ "Choice": [
+ "A. Poet's collaborative alias",
+ "B. Social media handle for audience engagement",
+ "C. Title of the spoken word series",
+ "D. Poet's personal Instagram account"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "6vKvjzuICxg",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What specific item is mentioned in the 20-30s audio instruction but absent from the 10-20s instruction?",
+ "Choice": [
+ "A. Liquid containers",
+ "B. Carry-on bags",
+ "C. Foil-wrapped items",
+ "D. Laptops"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Based on the airline branding and security protocols observed, which United Airlines hub's security checkpoint is depicted?",
+ "Choice": [
+ "A. Denver International Airport",
+ "B. Newark Airport",
+ "C. John F. Kennedy International Airport",
+ "D. Chicago O'Hare International Airport"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "hUQKQwAAncc",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What audio cue coincides with the transition to the structured debate phase marked by 'ROUND 1 CAST'?",
+ "Choice": [
+ "A. Cessation of upbeat electronic music",
+ "B. Persistent synthesized rhythm",
+ "C. Introduction of new dialogue about Marvel",
+ "D. Sudden silence before speaker commentary"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "a_tsqihza4E",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual elements correspond to the 'two awesome figures' mentioned in the audio?",
+ "Choice": [
+ "A. The turret's two blue laser components",
+ "B. The green soldier and yellow-skinned character on the box",
+ "C. The six figurines arranged in a circular pattern",
+ "D. Two Halo action figures placed next to the box"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "ckfRrzOlEuM",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What is the purpose of the persistent 'ESA/Hubble' credit in the video?",
+ "Choice": [
+ "A. To acknowledge research funding from the institution",
+ "B. To certify the visuals are licensed material",
+ "C. To indicate the footage is computer-simulated",
+ "D. To identify the narrator's institutional affiliation"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "6_Foh42T5XM",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "When the speaker states 'They start by copying their parents,' which visual element directly supports this statement?",
+ "Choice": [
+ "A. Crystal Neil speaking against a solid green backdrop",
+ "B. Woman making eye contact with baby",
+ "C. Parent playing with baby on a checkered blanket",
+ "D. Child grasping a red Elmo toy"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "pUqmv-KY0Q4",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What was the chronological order of these audio events?",
+ "Choice": [
+ "A. Introductory speech → Tearing sound → Speech about comfort",
+ "B. Introductory speech → Tearing sound → Crinkling plastic",
+ "C. Introductory speech → Speech about Z reviews → Tearing sound",
+ "D. Speech about comfort → Introductory speech → Tearing sound"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which audio event coincided precisely with the initial appearance of hands holding the wrapped package in the frame?",
+ "Choice": [
+ "A. Sound of scissors cutting through plastic",
+ "B. 'Yo, what's up? Today we're going to unbox this TAKTSTAR Pro 82...'",
+ "C. 'I've been wanting to get these headphones for a while now'",
+ "D. Sound of plastic being unfolded"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What specific tools were used to open the package during the tearing sound and plastic crinkling sound phases respectively?",
+ "Choice": [
+ "A. Scissors during both tearing and crinkling",
+ "B. Fingernails during tearing, Teeth during crinkling",
+ "C. Bare hands during tearing, Scissors during crinkling",
+ "D. Box cutter during tearing, Bare hands during crinkling"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "5bEu5s8VjVo",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What prompted the Maasai man to state there's 'something wrong in your family'?",
+ "Choice": [
+ "A. To emphasize the importance of family heritage",
+ "B. To clarify traditional divisions of labor",
+ "C. As a critique of Western cultural influences",
+ "D. In response to being asked about transgender individuals"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "3znDkU3P0sU",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which audio event occurs simultaneously with the initial blurred pink stripe and 'lipstickncurls' text display?",
+ "Choice": [
+ "A. Narration describing summer skincare goals",
+ "B. Male vocals with acoustic guitar in Christian music",
+ "C. Upbeat channel introduction sequence",
+ "D. Female explaining daily skincare steps"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What contextual clues indicate the video's intended viewership?",
+ "Choice": [
+ "A. Promotion of social media growth strategies",
+ "B. Featured soundtrack with spiritual themes",
+ "C. Professional makeup application techniques demonstrated",
+ "D. Detailed explanation of daily skincare regimens"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "bVXJPrL4zII",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which section of the book cover contained both the publisher's identifier and international pricing details?",
+ "Choice": [
+ "A. Within the sun's concentric circles",
+ "B. Below the author's byline",
+ "C. Centered under the book title",
+ "D. Upper right corner"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which audio event occurred when the 'Carpenter of the Sun' text was displayed against a yellow background with an abstract sun illustration?",
+ "Choice": [
+ "A. Harp music transitioning into a calming melody",
+ "B. The host introducing Serena Blaiz as the segment's presenter",
+ "C. The speaker introducing a poem from 'Carpenter of the Sun'",
+ "D. A narrator describing the concentric sun illustration"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "During which visual sequence was the background color present for both harp music and the introductory speech?",
+ "Choice": [
+ "A. White screen with cursive title animation",
+ "B. Yellow backdrop with concentric orange circles",
+ "C. Dark blue canvas displaying white handwritten text",
+ "D. Light gray panel featuring black serif typography"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "swRx8mQ3xAE",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What policy priority is questioned when discussing cartel strategies alongside the GOP's fiscal responsibility claims?",
+ "Choice": [
+ "A. Allocating funds for national security measures",
+ "B. Enhancing physical border infrastructure",
+ "C. Increasing transparency in political donations",
+ "D. Reforming healthcare to address drug addiction"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What visual cue specifically coincides with the anchor stating 'Vivek Ramaswamy back with us now, former businessman now running for President in 2024'?",
+ "Choice": [
+ "A. The Republican Party logo animates on-screen",
+ "B. Leland Vittert briefly glances downward",
+ "C. The screen splits to show a dual-shot interview",
+ "D. A 'VIVEK2024.com' graphic dominates the backdrop"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which visual elements most distinctly differentiate Leland's role as a news anchor from Vivek's identity as a political candidate?",
+ "Choice": [
+ "A. Leland's direct camera engagement vs. Vivek's conversational hand gestures",
+ "B. Leland's 'NewsNation' banner vs. Vivek's timestamp display",
+ "C. Leland's blurred cityscape backdrop vs. Vivek's patterned campaign branding",
+ "D. Leland's pink patterned tie vs. Vivek's American flag lapel pin"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "3cmBbNIcW_0",
+ "video_duration": "30s",
+ "video_category": "Comedy",
+ "questions": [
+ {
+ "Question": "What prompted the man to shut the dumpster lid immediately after saying 'Control C, Control V'?",
+ "Choice": [
+ "A. He concluded examining the object of interest",
+ "B. He located a rare electronic device inside",
+ "C. He noticed approaching footsteps",
+ "D. He needed to execute the keyboard shortcut mentioned"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "p_e4Pyo9Mqo",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual event coincided with the audio statement 'deliver clean, safe, reliable, and affordable energy'?",
+ "Choice": [
+ "A. Worker in a bucket truck inspecting power lines",
+ "B. Three women reviewing technical drawings near equipment",
+ "C. Lineman securing wires on a utility pole with harness",
+ "D. Technician monitoring computer screens in control room"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "AiZKoCH3n7U",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which audio-visual pair coincided when the speaker mentioned 'number two, best soup in Southeast Asia, and also number two'?",
+ "Choice": [
+ "A. Gulai with Indonesian flag",
+ "B. Sinigang with Philippine flag",
+ "C. Tom Yum with Thai flag",
+ "D. Tom Kha Gai with Thai flag"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which sequence reflects the descending order of soup rankings as visually displayed during their corresponding audio announcements?",
+ "Choice": [
+ "A. Gulai → Tom Kha Gai → Tom Yum → Sinigang",
+ "B. Gulai → Tom Yum → Tom Kha Gai → Sinigang",
+ "C. Tom Kha Gai → Gulai → Tom Yum → Sinigang",
+ "D. Sinigang → Tom Yum → Gulai → Tom Kha Gai"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which visual element is displayed on screen when the narrator says 'Part two for the number five'?",
+ "Choice": [
+ "A. Yellow box displaying 'TOP 5 AUGUST 2022'",
+ "B. Philippine flag animation",
+ "C. Close-up of Tom Kha Gai",
+ "D. Map with TasteAtlas text overlay"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "How is the #1 ranked soup presented in this video?",
+ "Choice": [
+ "A. It's visually identical to the number 3 soup",
+ "B. It wasn't shown in this video",
+ "C. It's featured in both the top 5 list and the map",
+ "D. Its country of origin is highlighted on the map"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "EZT03Wpo9jQ",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What is the correct sequence of visual events immediately following Julia's introductory speech?",
+ "Choice": [
+ "A. Black paint stroke → Infinity symbol → JS letters illumination",
+ "B. Blurred image → Turquoise eye → Raven perched with purple dots",
+ "C. Bookshelf close-up → Nirvana logo → Stephen King collection",
+ "D. Title card emergence → Ink droplets → Raven mid-flight"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "PxqzwhVuGyo",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "During which visual context does the speaker first reveal she has a solution for her condition?",
+ "Choice": [
+ "A. While displaying cracked fingertips in the first segment",
+ "B. During hand gestures about variable susceptibility in the third segment",
+ "C. When positioned near the 'hello' artwork in the third segment",
+ "D. When emphasizing nail pain severity in the second segment"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "pCTnL1odiJQ",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event coincided with the upbeat theme music during the opening sequence?",
+ "Choice": [
+ "A. Animated title card showing 'FOOD MANIA REVIEW' with a blue striped background",
+ "B. Animated title card with merged 'FOODMANIA REVIEW' on radiating stripes",
+ "C. Aubrey gesturing toward the DUOZ box",
+ "D. Amy gripping the 'Hot & Spicy' box"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "Wyg1s_IE5kg",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "During the bird's-eye view of old-fashioned sailing ships near a coastline, which audio statement is heard?",
+ "Choice": [
+ "A. 'Your empire on the cusp of immortality...'",
+ "B. 'The lands tamed, the seas charted.'",
+ "C. 'Triumphant drums echo across the battlefield.'",
+ "D. 'Immortal Empires require strategic planning for endgame scenarios.'"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which environmental feature is prominently displayed when the narrator questions the need for 'one final major hurdle'?",
+ "Choice": [
+ "A. Castle town beneath a yellow sky",
+ "B. Pyramid-like architectural structure",
+ "C. Coastal ships from an aerial view",
+ "D. Flaming rocky mountain terrain"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "09buIj5Z5lk",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What occurred right after the ball crossed the boundary fence?",
+ "Choice": [
+ "A. Ijaz Ahmed was seen walking holding his bat",
+ "B. The camera shifted to reveal the fully occupied stadium",
+ "C. The scoreboard changed to display 'PAK 109/1'",
+ "D. The camera showed Mohammad Azharuddin, the Indian captain"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "a1JOvtgdSk0",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual element is displayed precisely when Ron Smith's name is first spoken in the video?",
+ "Choice": [
+ "A. Bee on orange flower next to seed packets",
+ "B. Older man in NDSU sweatshirt standing outdoors",
+ "C. Four framed video clips of horticulture activities",
+ "D. Text 'Fall Tree Wrapping' with daisy-like flowers"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What type of environment is depicted in the background during Ron Smith's speech compared to the earlier title cards?",
+ "Choice": [
+ "A. Abstract digital animations",
+ "B. Symbolic floral and vine patterns",
+ "C. Residential outdoor setting",
+ "D. Demonstration of gardening techniques"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "7cGPesEX-UQ",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which visual element was prominently displayed alongside the laptop on the Vellum website during the audio statement 'incredibly intuitive and easy to use'?",
+ "Choice": [
+ "A. Stack of books next to the laptop",
+ "B. Purple download button labeled 'Download'",
+ "C. Purple background with laptop silhouette",
+ "D. Application interface showing 'The Secret Garden'"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "ZBN90mXIWko",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What key distinguishing factor between Terry Robinson and Trey Ross is explicitly supported by both audio and visual information?",
+ "Choice": [
+ "A. Terry's attire is described as purple, while Trey's is shown in blue",
+ "B. Terry has a longer reach due to his height advantage",
+ "C. Terry's fight preparation includes a pink towel, unlike Trey",
+ "D. Terry is from Fountaintown, while Trey is from Indianapolis"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "_-BAFzpKigw",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What was the initial focal point of the video's opening sequence?",
+ "Choice": [
+ "A. Certification text on a card's reverse side",
+ "B. David Robinson and Hakeem Olajuwon's card in a display case",
+ "C. Placement of a Tracy McGrady/Yao Ming dual-jersey card",
+ "D. Close-up of dual jersey fragments from both players"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What is the significance of the '2009' mentioned on the back of the basketball card?",
+ "Choice": [
+ "A. It denotes the season the jerseys were first used",
+ "B. It reflects the introduction of new jersey materials",
+ "C. It marks the retirement year of both featured players",
+ "D. It indicates the year the card's memorabilia was certified"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "jMD4D-usWqw",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What distinguishes the contexts of the exclamations 'My God, your hair is incredible' and 'Look at that man's head'?",
+ "Choice": [
+ "A. One expresses admiration while the other critiques technique",
+ "B. Both remarks target audience members",
+ "C. The first references hair length, the second head shape",
+ "D. They occur during distinct performance interactions"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "2ZZru7Sn848",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "In what order do these audio-visual events occur?",
+ "Choice": [
+ "A. Mention of Belgian press article → Display of school dropout statistics → Announcement of street interviews",
+ "B. 'Your concern, our focus' slogan → White table introduction → Light flare transition",
+ "C. Epic orchestra music plays → YNRE logo appears → Tablet placement on table shown",
+ "D. School segment context → Random interview mention → Three questions presented"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "PFU-MhA1ubc",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "While Rob Hamilton is facing away from the camera and holding up his phone in a baseball cap, which audio event is heard?",
+ "Choice": [
+ "A. An explanation involving micro-vibrations",
+ "B. A greeting to the audience",
+ "C. Repeated rhetorical questions",
+ "D. Motorcycle engine sounds"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "How does Rob's tone and focus differ between his appearances wearing a beanie versus a baseball cap?",
+ "Choice": [
+ "A. Beanie: Demonstrating installation steps / Cap: Reflecting on past experiences",
+ "B. Beanie: Analyzing data / Cap: Celebrating a success",
+ "C. Beanie: Detailed technical explanation / Cap: Expressive frustration",
+ "D. Beanie: Promoting a product / Cap: Sharing a personal story"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "d3pPQ3pnxgc",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What is the primary purpose of the 'SH CH' annotation appearing during the Arabic dialogue?",
+ "Choice": [
+ "A. Signals a switch between languages",
+ "B. Indicates the focus of upcoming examples",
+ "C. Highlights common pronunciation errors",
+ "D. Identifies the current grammatical topic"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Why was the word 'Shine' displayed during the Arabic explanation?",
+ "Choice": [
+ "A. To reinforce MS's branding identity",
+ "B. To provide an English example of the 'sh' sound",
+ "C. To signal a transition to technical features",
+ "D. To demonstrate Arabic pronunciation of the 'sh' sound"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "BfHSffcEydY",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What specific visual detail identifies the setting when the child requests the doodle pad?",
+ "Choice": [
+ "A. Area labeled 'Mom's Kitchin'",
+ "B. Dining area with a magnetic drawing board",
+ "C. Kitchen with a wooden table",
+ "D. Playroom with a labeled doodle pad"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "2u3tic9llpw",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What discrepancy exists between the audio commentary and the graphic regarding regional data on women executives?",
+ "Choice": [
+ "A. The graphic reveals Asia's 88% of companies with no women executives, which the audio omits entirely",
+ "B. The audio cites Asia's 88% explicitly, conflicting with the graphic's silence on Asia",
+ "C. Both sources emphasize the U.S. as having the worst representation at 11%",
+ "D. The audio compares Europe's 68% to Asia's 88%, while the graphic shows equal percentages"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What is the specific purpose of citing 'The Economist/20-first' in the 'WOMEN AT WORK' graphic?",
+ "Choice": [
+ "A. To credit the source of the executive committee statistics",
+ "B. To acknowledge The Economist's editorial oversight of the segment",
+ "C. To reference 20-first's analysis of global workplace trends",
+ "D. To attribute the graphic's visual design to The Economist"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What visual change occurred as the speaker cited 'the consultancy' data following the statement 'but they are, we are not necessarily making it to the top'?",
+ "Choice": [
+ "A. Hands holding documents entered the frame",
+ "B. Three-region comparative infographic appeared",
+ "C. Lower-third banner displayed '20-first' citation",
+ "D. Stock ticker updated from DOW/S&P to DAX 115.09"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What editorial priority does CNN signal by maintaining a 'LIVE' banner with stock tickers during a segment about workplace gender disparities?",
+ "Choice": [
+ "A. Sustaining real-time data streams across programming",
+ "B. Blending financial updates with social commentary",
+ "C. Prioritizing pre-recorded human-interest stories",
+ "D. Highlighting anchors' economic expertise"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "1AYdRPXT9m8",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which visual element appears when the speaker mentions enhancements made during winter training?",
+ "Choice": [
+ "A. Technician adjusting engine components in pit area",
+ "B. Champion standing on podium surrounded by flags",
+ "C. Motorbike racer wearing number 30 positioned at starting gate",
+ "D. Close-up of a racer in a white helmet looking sideways"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What prompted the racer to raise his left arm during the dirt track sequence?",
+ "Choice": [
+ "A. To signal a pit stop requirement",
+ "B. To wipe dust from his goggles",
+ "C. To acknowledge being declared world champion",
+ "D. To counterbalance during a sharp turn"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "C3qgLxmr9dI",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "When the boy named the colors while pointing at the cones, which sequence of adjacent colors did he articulate from right to left?",
+ "Choice": [
+ "A. Blue, Yellow, Green, Orange",
+ "B. Orange, Yellow, Green, Blue",
+ "C. Blue, Green, Yellow, Orange",
+ "D. Red, Blue, Green, Yellow"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What audio event occurred simultaneously with the purple cushion being thrown into the basket?",
+ "Choice": [
+ "A. The girl explaining 'We have bean bags, rings, and cones'",
+ "B. Sound of cushion impacting the basket",
+ "C. A child reciting 'Red, blue, green, yellow, orange'",
+ "D. The girl exclaiming 'Yay!'"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Based on the equipment shown, why did the child explicitly exclude pink from their game?",
+ "Choice": [
+ "A. The pink beanbags were misplaced during setup",
+ "B. The available equipment lacked any pink-colored items",
+ "C. A pink cone was damaged and removed from play",
+ "D. Pink conflicted with the game's scoring system"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "S0ZyCu3sUAA",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "Which element provides definitive confirmation of the speaker's identity as Luke?",
+ "Choice": [
+ "A. The SpookLuke logo prominently displaying his name",
+ "B. His visual introduction while verbally stating his name",
+ "C. The text 'FASTEST WAY' synchronized with his speech",
+ "D. The race car labeled 'YOU' during the tutorial context"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "aWyu0p-_UQU",
+ "video_duration": "30s",
+ "video_category": "Comedy",
+ "questions": [
+ {
+ "Question": "During which type of vocal delivery did the man briefly glance at a document while addressing the audience?",
+ "Choice": [
+ "A. Informal dialogue-style commentary",
+ "B. Technical microphone feedback",
+ "C. Comedic monologue by the presenter",
+ "D. Crowd's amused reactions"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "RpBnEJYMHLk",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "Which audio narration coincides with the visual display of the 'Grim Fandango' title card?",
+ "Choice": [
+ "A. 'Let's get to it'",
+ "B. 'Number 10, Grim Fandango'",
+ "C. 'How about taking a look at the'",
+ "D. 'We've put together a top 10'"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What does the speaker imply by stating 'many available on ScummVM' in the context of the visual content?",
+ "Choice": [
+ "A. Comparing 2D vs 3D interfaces",
+ "B. Highlighting a pre-selected library",
+ "C. Demonstrating cross-platform compatibility",
+ "D. Organizing games into thematic series"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "GsbaC1YiOYA",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "What two primary factors does the man cite for being unable to continue his previous practice?",
+ "Choice": [
+ "A. Surging viewer numbers and platform limitations",
+ "B. Chair ergonomics and outdated recording gear",
+ "C. Microphone failures and subscriber complaints",
+ "D. Discomfort with the plush toy and time constraints"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "8EexlONNnEE",
+ "video_duration": "30s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What action coincides with the man reciting 'how far I toil, still farther off from thee'?",
+ "Choice": [
+ "A. Sheet music is placed on the piano",
+ "B. His hand presses against his chest",
+ "C. He adjusts the microphone stand",
+ "D. His hands move in rhythmic gestures"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "vAvLt3QI904",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What factor most directly contributes to the engine knocking sound becoming prominent between 10-20 seconds?",
+ "Choice": [
+ "A. The microphone placement amplifies engine vibrations",
+ "B. The driver downshifts to navigate a sharp curve",
+ "C. The engine overheats due to prolonged operation",
+ "D. The terrain's loose rocks require increased torque"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which vehicle's engine characteristics and path navigation are most consistent with the auditory and visual environment?",
+ "Choice": [
+ "A. All-terrain vehicle (ATV)",
+ "B. Off-road motorcycle",
+ "C. Electric mountain bike",
+ "D. Compact car"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "GDbG4rvL1YA",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Why does Aaron garnish the scrambled eggs with herbs in the video?",
+ "Choice": [
+ "A. To demonstrate advanced plating techniques",
+ "B. To adhere to vegan cooking principles",
+ "C. To correct a cooking error in the recipe",
+ "D. To emphasize visual contrast in ingredient comparisons"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "7PgD-7Ld3iQ",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What action immediately followed the sketchbook being transported through the room?",
+ "Choice": [
+ "A. Presenter showcased Uni Chalk Markers",
+ "B. Presenter initiated opening black markers",
+ "C. On-screen text 'The Sketching' emerged",
+ "D. Art supplies were highlighted on the desk"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "4e3gNCp9dxY",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual element directly follows the Kaiser Permanente logo and 'MY PREGNANCY' text in the video sequence?",
+ "Choice": [
+ "A. FAQ hourglass graphic with a doctor",
+ "B. Title card for a certified nurse-midwife",
+ "C. Medical professional speaking in a hospital setting",
+ "D. Transition to animated ultrasound imagery"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "When factors affecting placental location are mentioned, which visual element is present?",
+ "Choice": [
+ "A. Seated physician with credentials overlay",
+ "B. Hourglass FAQ icon",
+ "C. Standing doctor with title card",
+ "D. Animated crib graphic"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What specific pregnancy-related topic do the healthcare professionals in the video primarily discuss?",
+ "Choice": [
+ "A. Postpartum recovery timelines",
+ "B. Nutritional requirements during gestation",
+ "C. Stages of labor and delivery",
+ "D. Patterns of fetal movement awareness"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "OB9OK-riGEg",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What was the sequence of topics discussed in the video?",
+ "Choice": [
+ "A. Wonderland mechanics → DIYOR resources → Spark background",
+ "B. Microphone adjustment → Book organization → T-shirt design",
+ "C. APY setup → Homework links → Channel statistics",
+ "D. APY tricks → Wonderland reliability → DIYOR resources"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "UtVXARD_8VE",
+ "video_duration": "30s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "What is the primary purpose of the text overlays visible during the man's explanation about the puppies?",
+ "Choice": [
+ "A. To transcribe the man's spoken words",
+ "B. To emphasize safety protocols",
+ "C. To showcase available training programs",
+ "D. To direct viewers to the business's online platforms"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "_D-zI4nX-2Y",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "During the segments featuring Indian music, which two elements are simultaneously visually present?",
+ "Choice": [
+ "A. Fruit arrangement and classroom setting",
+ "B. Ellipse diagram and acoustic guitar",
+ "C. IISER logo and mechanical cube",
+ "D. Wooden pear and toy blades"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What is the primary reason for displaying fruits and kitchen items during the Indian music segment in the workshop context?",
+ "Choice": [
+ "A. To create visually appealing arrangements",
+ "B. To utilize inexpensive, readily available materials for experiments",
+ "C. To reflect traditional cultural practices",
+ "D. To demonstrate culinary techniques"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "R2LE8Anj9-c",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which audio event coincided with the overhead shot of the flat lay containing white flowers and a lipstick?",
+ "Choice": [
+ "A. 'This is hard to open'",
+ "B. 'I love that name, Pearl Moonstone'",
+ "C. 'Swatch that's beautiful'",
+ "D. 'Let's dive into it'"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What specific visual element coincided with the electronic music playing?",
+ "Choice": [
+ "A. The makeup palette's color swatches were displayed",
+ "B. A flat lay of cosmetic items was initially shown",
+ "C. A rose gold text overlay revealed the creator's channel",
+ "D. She struggled to open the eyeshadow palette"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What differentiates the audio elements during the flat lay graphics from the final direct camera address?",
+ "Choice": [
+ "A. Narration describing eyeshadows vs channel subscription plea",
+ "B. Upbeat instrumental music vs verbal self-identification",
+ "C. Ambient white noise vs enthusiastic product commentary",
+ "D. Crinkling packaging sounds vs brush application noises"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What visual elements were present when the woman said 'Hi guys, if you're new here'?",
+ "Choice": [
+ "A. Closed palette in her hands",
+ "B. Rose gold text displaying 'MORGAN TURNER MAKEUP'",
+ "C. Large dangly earrings",
+ "D. Open palette with six eyeshadows"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "TOSWThdYTag",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What is the speaker's stated reason for not referring to the drink as 'tea'?",
+ "Choice": [
+ "A. It replicates store-bought wellness drinks",
+ "B. It is served cold rather than hot",
+ "C. It does not contain tea leaves",
+ "D. It is consumed as concentrated shots"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "3OsJ2f6e7U8",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "When is 'Project Gutenberg' audibly mentioned relative to its visual appearance?",
+ "Choice": [
+ "A. After the website fully loads with book listings",
+ "B. Before the LibriVox.org page is shown",
+ "C. While the LibriVox.org page remains on screen",
+ "D. During the visual transition to gutenberg.org"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "In what sequence do the websites appear both visually and auditorily after the YouTube introduction?",
+ "Choice": [
+ "A. LibriVox.org → Project Gutenberg → Manybooks.net → Archive.org",
+ "B. Gutenberg.org → Archive.org → BooksShouldBeFree.com → Project Gutenberg",
+ "C. Gutenberg.org → Manybooks.net → Archive.org → LibriVox.org → Project Gutenberg",
+ "D. Gutenberg.org → LibriVox.org → Archive.org → Manybooks.net"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which website's interface is displayed when the speaker mentions including links in the description?",
+ "Choice": [
+ "A. A grid of tutorial video thumbnails",
+ "B. The LibriVox audiobook player interface",
+ "C. The Project Gutenberg search interface",
+ "D. The 'TJ FREE' YouTube channel interface"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "8lxVnLHhlpw",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual element in the insulin discovery section reflects the researchers' patent decision, unlike the polio vaccine visuals?",
+ "Choice": [
+ "A. Illustration of currency exchange",
+ "B. Presence of national symbols",
+ "C. Timeline of medical breakthroughs",
+ "D. Group portraits with scientific instruments"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "D_ZFZ2En9uU",
+ "video_duration": "30s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "During the host's introductory address to viewers, which visual scenario is depicted?",
+ "Choice": [
+ "A. Close-up of overturned vehicle near large boulders",
+ "B. Wheel adjustment sequence in quarry environment",
+ "C. Vehicle navigating a muddy terrain with 'Scale RC Videos' overlay",
+ "D. RC car perched on a prominent rock formation with snow patches"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What audio context coincides with the RC car's wheel becoming upright?",
+ "Choice": [
+ "A. Mention of head injury concerns",
+ "B. Hip hop music with river ambiance",
+ "C. Reference to late November timing",
+ "D. Discussion of initial interest in RC cars"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "How does the vehicle's positional state differ between the 'Welcome' narration and the 'November' timestamp?",
+ "Choice": [
+ "A. Stable surface anchoring vs kinetic recovery sequence",
+ "B. Full submersion in mud vs 45-degree lateral tilt",
+ "C. Highway-speed traversal vs controlled braking action",
+ "D. Water fording maneuver vs quarry slope ascent"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "UzPXCcPslg0",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What does Frank's deliberate spelling of 'R-E-I-S-E-R' combined with visible laboratory credentials primarily indicate about his audience's intent?",
+ "Choice": [
+ "A. They seek professional training in mineral identification",
+ "B. They require academic validation of his research methods",
+ "C. They are likely unfamiliar with his work and require digital verification",
+ "D. They need physical access to his mineral specimens"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What was the chronological order of Frank's verbal references to his professional identity?",
+ "Choice": [
+ "A. Discussed YouTube search → Named business entity → Spelled surname",
+ "B. Discussed YouTube search → Spelled surname → Named business entity",
+ "C. Spelled surname → Named business entity → Discussed YouTube search",
+ "D. Named business entity → Discussed YouTube search → Spelled surname"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "fZro6N-WTr0",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which audio event coincided with the split-screen displaying monochrome and colored candle arrangements?",
+ "Choice": [
+ "A. Classical music featuring a prominent violin melody",
+ "B. Male vocals singing about overcoming fear",
+ "C. Ambient sounds of crackling candle flames",
+ "D. Classical music with string ensemble and piano"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "zYQl4HFjYP4",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What distinguishes the man's approach to handling the fish between the initial and subsequent retrieval phases shown?",
+ "Choice": [
+ "A. First: visual demonstration of catch / Second: deliberate fatigue management",
+ "B. First: stationary rod positioning / Second: active rod manipulation",
+ "C. First: shallow-water release / Second: deep-water preservation",
+ "D. First: rapid retrieval using the rod tip / Second: steady reeling with rod lowered"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "zaGU7wHPISI",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What visual coincides with the announcer's statement about Ryback being 'one of the most popular talents in the company between 2012 to 2013'?",
+ "Choice": [
+ "A. Crowd shown around wrestling ring",
+ "B. Close-up of Ryback's injured eye",
+ "C. Text 'RYBACK WAS OVER' appears",
+ "D. Ryback screaming against black backdrop"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "nbsGrFj-Z7o",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What key difference distinguishes the costume themes of the first stage group (camouflage/silver outfits) from the later medieval group?",
+ "Choice": [
+ "A. Their costumes represent contrasting genres within the sci-fi and fantasy spectrum",
+ "B. One group's costumes emphasize tactical gear, while the other features ceremonial attire",
+ "C. Only the first group incorporates neon-colored elements into their outfits",
+ "D. The medieval group's costumes are historically accurate, unlike the futuristic designs"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Which visual event occurred immediately after the stage backdrop with a full moon and stars was shown, while the speaker referenced YouTube?",
+ "Choice": [
+ "A. Group posing in fantasy character costumes",
+ "B. Display of 'Archon Footage courtesy Culvan' text",
+ "C. Two performers in futuristic outfits",
+ "D. Three medieval dancers holding hands"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which visual event directly coincided with Sheila stating that Archon is a 'sci-fi fantasy convention'?",
+ "Choice": [
+ "A. Sheila adjusting her glasses while speaking to the camera",
+ "B. A moonlit stage backdrop with blue curtains",
+ "C. A staged photograph of ten fantasy cosplayers",
+ "D. Three performers dancing in medieval attire on stage"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What visual element was prominently featured when the speaker referenced significant investments in production quality?",
+ "Choice": [
+ "A. Trio performing coordinated dance moves in medieval costumes",
+ "B. Text overlay showing archival footage credit",
+ "C. Close-up of a fantasy character's spiral lollipop",
+ "D. Elaborate starry night stage backdrop"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "e5ExXfd9KzI",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with the mention of 'plans for multiple upgrades' in the audio?",
+ "Choice": [
+ "A. Enrico Cardile speaking to camera",
+ "B. Two race car drivers walking side-by-side on a paved surface",
+ "C. Group photo under tent with 'AUS GP 2024' sign",
+ "D. Oracle logo on black background"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "u2gwhTt-B5M",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What evidence from the video confirms the charger's absence in the Samsung S21 box?",
+ "Choice": [
+ "A. The marble surface visibly contains an indentation for hidden accessories",
+ "B. The charger's omission aligns with environmental goals mentioned in the audio",
+ "C. The unboxing sequence shows only phone removal, not charger placement",
+ "D. The narrator directly states the charger's exclusion while demonstrating the empty box"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "A1UVeD9UB1I",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which audio segment played during the visual of positioning a heart-shaped gift box beside a Christmas tree and 'Peace' angel figurine?",
+ "Choice": [
+ "A. Suggestions for meaningful relationship gestures",
+ "B. An enumeration of classic romantic presents",
+ "C. A discussion about selecting Valentine's Day gifts",
+ "D. Commentary on gift symbolism over material content"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the most accurate description of the interaction between the individual in the white coat and the gift-giver?",
+ "Choice": [
+ "A. Friends participating in an early Christmas gift exchange",
+ "B. Retail associates demonstrating premium holiday merchandise",
+ "C. Spouses presenting Valentine's Day gifts to each other",
+ "D. Colleagues exchanging holiday gifts during a work event"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "P2J6TdJRr7M",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What visual contrast occurs between the segments where the speaker discusses 'necessary edits' and the 'spooly' technique?",
+ "Choice": [
+ "A. Headband transitions from maroon to mauve",
+ "B. Manual finger adjustments vs. tool usage",
+ "C. Circular brush motions vs. straight strokes",
+ "D. Close-up of brows vs. full-face framing"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "fMw8X_4BDU0",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "Which audio event coincided with the initial display of the 'Assassin's Creed Odyssey' title card?",
+ "Choice": [
+ "A. The mention of using Ikaros for exploration",
+ "B. The narrator greeting the audience",
+ "C. The energetic sound effect",
+ "D. The explanation about locating treasure chests"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "PDEJodGzar0",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What is the correct sequence of the first three marketing strategy tips presented?",
+ "Choice": [
+ "A. Sell benefits → Sales channels → Target audience",
+ "B. Analyze competitors → Target audience → Sales channels",
+ "C. Sell benefits → Target audience → Sales channels",
+ "D. Sales channels → Sell benefits → Target audience"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "When analyzing competitor products, which strategic action is recommended as part of the marketing plan?",
+ "Choice": [
+ "A. Running camera tests",
+ "B. Analyzing sales channels",
+ "C. Targeting social media audiences",
+ "D. Emphasizing product features"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "O5PPRB18hls",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual element was directly shown when the speaker mentioned 'judgment' in the context of human-technology systems?",
+ "Choice": [
+ "A. Audience illuminated by colored lighting",
+ "B. Behind-the-scenes monitors",
+ "C. Networking attendees holding drinks",
+ "D. Exhibition booth interaction"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "P6-I5fsCA9w",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What specific historical event prompted the mass celebrations shown in the footage?",
+ "Choice": [
+ "A. Nazi Germany's unconditional surrender",
+ "B. Conclusion of World War II",
+ "C. Successful nuclear weapon deployment",
+ "D. Annual New Year's Eve traditions"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "moaSYWzZmkQ",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which environmental factor most definitively indicates the cooking is taking place outdoors?",
+ "Choice": [
+ "A. Pile of kindling sticks nearby",
+ "B. Dry grass in the background",
+ "C. Use of a portable stove",
+ "D. Gray table surface"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "h7ljgWAgb0E",
+ "video_duration": "30s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "Which visual event occurred when the speaker mentioned taking care to identify components?",
+ "Choice": [
+ "A. Hands holding labeled plastic bags 'Rear IN' and 'Rear EX'",
+ "B. Camera tilting upwards to show cylinder heads",
+ "C. Hand pointing at engine's bore holes",
+ "D. Hands extracting metallic parts from labeled bags"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which audiovisual pairing demonstrates proper part identification during engine assembly?",
+ "Choice": [
+ "A. Labels match spoken intake/exhaust terms while handling bags",
+ "B. Warning about bore holes' irreplaceability while pointing at them",
+ "C. Speech introduces lifter installation while holding a metal cylinder",
+ "D. Discussion of component preparation while displaying internal mechanisms"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "Wv_UPNTBEwU",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What primary contextual element is visible when the man first introduces the 'LightChaser Pro'?",
+ "Choice": [
+ "A. Bridge with sailboats in background",
+ "B. Text overlay reading 'BENETTGRAEZER'",
+ "C. PolarPro website on computer",
+ "D. Studio lighting equipment"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "9IAFUfATScA",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What action followed the female speaker expressing gratitude to Dr. Johnson?",
+ "Choice": [
+ "A. Dr. Johnson started explaining educational methodologies",
+ "B. His academic degrees and professional titles were enumerated",
+ "C. The female speaker highlighted his role in Pan-Africanism",
+ "D. Dawson elaborated on Dr. Johnson's community impact"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "E4oseR4ANDo",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What primary visual event immediately follows the man in the dark suit saying 'Oh'?",
+ "Choice": [
+ "A. Woman in striped shirt reappears addressing viewers",
+ "B. Man in black suit writes in notepad",
+ "C. Text box titled 'Marketing Strategies' appears",
+ "D. Camera zooms on handshake between men"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "During the phrase 'Who do you have for us?', what visual context appears?",
+ "Choice": [
+ "A. Woman speaking with on-screen captions",
+ "B. Group discussing with a notebook",
+ "C. Hallway with suited men",
+ "D. Close-up of man holding a notepad"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "PItExCRipjM",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "In which chronological sequence does the audio narration describe technical haircutting phases that align with visual demonstrations?",
+ "Choice": [
+ "A. Parietal ridge separation → Vertical slicing → Layered blending",
+ "B. Base elevation → Curvature explanation → Central alignment",
+ "C. Section demarcation → Horseshoe formation → Parallel trimming",
+ "D. Wet combing → Clip placement → Angular cutting"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What distinct comb technique is demonstrated between the 10-20s and 20-30s segments?",
+ "Choice": [
+ "A. Creating zigzag vs straight partitions",
+ "B. Detangling curls vs smoothing straight hair",
+ "C. Lifting sections for layering vs holding hair taut for cutting",
+ "D. Adjusting comb angle for fading vs blending"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "C641T1T07t8",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What primary objective do the combined audio-visual elements serve during the 20-30s segment?",
+ "Choice": [
+ "A. Prompting immediate social media interaction from viewers",
+ "B. Showcasing YouTube interface design best practices",
+ "C. Soliciting ongoing support for an educational content series",
+ "D. Teaching geometric concepts through visual patterns"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which visual element appeared simultaneously when the speaker first said 'In this episode of Word Ideas'?",
+ "Choice": [
+ "A. The Patreon logo at screen bottom",
+ "B. The text 'Onomatopoeia' displayed",
+ "C. The text 'Pronounce It Right' in red script",
+ "D. The YouTube 'SUBSCRIBE' button being clicked"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "4iM6cxGaMp4",
+ "video_duration": "30s",
+ "video_category": "Comedy",
+ "questions": [
+ {
+ "Question": "Which two background elements were clearly visible in the frame when she mentioned placing a link in the description?",
+ "Choice": [
+ "A. Door and whiteboard",
+ "B. Patching compound and blurred text",
+ "C. Plaid shirt and text title",
+ "D. Shelf and choker necklace"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What occurred right AFTER the view became clear following the blurred sequence?",
+ "Choice": [
+ "A. She glanced downward to her left",
+ "B. She reinitiated her opening remarks",
+ "C. She touched her glasses frame",
+ "D. She referenced her boyfriend AJ's presence"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What visual cue is uniquely present during the 'sitting right here' audio moment compared to 'watch that one first'?",
+ "Choice": [
+ "A. Visible text overlay",
+ "B. Leftward hand gesture",
+ "C. Brief background blur effect",
+ "D. Close-up of facial features"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "nzGdRPSMZzw",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What occurs right as the white text 'In 1996 WCW Experimented...' vanishes?",
+ "Choice": [
+ "A. Ric Flair adjusts his suit jacket",
+ "B. The camera cuts to Bryan Diperstein speaking",
+ "C. The crowd erupts in loud cheers",
+ "D. Glacier executes a spinning kick"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "fz82FAmTYjU",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which audio statement coincided with the visual of people in traditional white robes walking alongside slowly moving vehicles under airport runway lights at night?",
+ "Choice": [
+ "A. 'who now have more forces in Afghanistan...'",
+ "B. 'The United States will begin our final withdrawal...'",
+ "C. 'We'll do it, we'll do it responsibly...'",
+ "D. Sound of a zipper"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What visual context contrasts with the audio's denial of a 'hasty rush to the exit'?",
+ "Choice": [
+ "A. Nighttime crowd dispersal",
+ "B. Military vehicle convoy",
+ "C. Disorganized staircase ascent",
+ "D. Orderly boarding process"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "Qe3mqZ6cI9E",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual corresponds to the audio discussing the need to select the appropriate category within a cable type classification?",
+ "Choice": [
+ "A. Title card displaying 'SINGLE MODE VS MULTIMODE FIBER'",
+ "B. Close-up of a network switch with multi-colored cables",
+ "C. A man typing on a laptop while holding a black cable",
+ "D. Server racks with technicians handling coiled white cables"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What technical specification is being contrasted when the server room technician handles the fiber optic coil?",
+ "Choice": [
+ "A. Signal attenuation rates in copper vs fiber",
+ "B. Light propagation methods in glass cores",
+ "C. Installation standards for rack-mounted cabling",
+ "D. Compliance requirements for cable jackets"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "ATWQNDujmPU",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What specific action does the promotional graphic in the final segment encourage viewers to take?",
+ "Choice": [
+ "A. Watch instructional content on YouTube",
+ "B. Subscribe to a premium study plan",
+ "C. Analyze sample question structures",
+ "D. Purchase timed practice exams"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What visual element replaced the clock and book icons in the upper portion of the screen immediately after the split?",
+ "Choice": [
+ "A. Full-frame return of woman gesturing",
+ "B. 'Matching Headings' text display",
+ "C. YouTube logo with red call-to-action",
+ "D. Blurred background with promotional graphic"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "AeS6jJ_e_CM",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual element appeared when the speaker said 'don't buy anything...'?",
+ "Choice": [
+ "A. A woman working beside boxes with product images",
+ "B. Social media logos connected to a cell phone",
+ "C. A smartphone displaying a shopping cart",
+ "D. A man contemplating numbered checkpoints"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the blond man contemplating as he scratches his head?",
+ "Choice": [
+ "A. Sequencing customer acquisition steps",
+ "B. Organizing content creation formats",
+ "C. Mastering bow tie techniques",
+ "D. Diagnosing network connectivity issues"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "XeKOyHOfsp4",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What auditory cue confirms the final conversation occurs indoors?",
+ "Choice": [
+ "A. Absence of weather sounds",
+ "B. Background office ambiance",
+ "C. Echo in vocal tones",
+ "D. Dialogue referencing shelter from rain"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Why does Andy mention 'the weather is against us' during the outdoor filming segment?",
+ "Choice": [
+ "A. Persistent rain forcing relocation of filming indoors",
+ "B. Rainwater compromising camera equipment functionality",
+ "C. Heavy precipitation delaying their scheduled meeting",
+ "D. Rainfall obscuring visibility for on-location shots"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "cHHghW4Pg5k",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which event occurred SECOND in the video's timeline?",
+ "Choice": [
+ "A. Airplane shown as a dark silhouette in the sky",
+ "B. Rhythmic clatter during blurred buildings",
+ "C. Airplane moving on runway with landing gear visible",
+ "D. Plane approaching runway for landing amidst green trees"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which audio event coincides with the rapid camera motion blurring urban structures?",
+ "Choice": [
+ "A. Metallic rhythmic pounding",
+ "B. Jet engine during ascent",
+ "C. Emergency sirens wailing",
+ "D. Aircraft taxiing on tarmac"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "05xCYOyY1bg",
+ "video_duration": "30s",
+ "video_category": "Comedy",
+ "questions": [
+ {
+ "Question": "How did the comedian's physical gestures contrast when discussing 'red flags' versus 'nervousness'?",
+ "Choice": [
+ "A. Scanned audience laterally vs. locked eye contact with front row",
+ "B. Adjusted microphone grip frequently vs. held it stationary",
+ "C. Used broad hand gestures vs. kept arms rigidly at sides",
+ "D. Leaned forward intensively vs. stood upright casually"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "SNqBM-ugm1A",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual action coincides with the speaker stating 'Oh no, it's another episode of Tesserac Devalues his collectibles'?",
+ "Choice": [
+ "A. The man sits on the floral patterned couch",
+ "B. The man holds up the NERF bow box",
+ "C. The man adjusts his position on the couch",
+ "D. The man grabs a flat rectangular object"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What motivated the man to acquire the Blazin' Bow?",
+ "Choice": [
+ "A. He discovered a new one available online",
+ "B. To showcase a common devaluation technique",
+ "C. To modify it using a specialized tool",
+ "D. It was a rare collectible item he needed"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What contextual cue links the man holding the Blazin' Bow and picking up the utility knife?",
+ "Choice": [
+ "A. Both immediately follow critical remarks about collectible preservation",
+ "B. Both precede direct addresses about viewer expectations",
+ "C. Both happen while demonstrating product unboxing techniques",
+ "D. Both occur during explanations of eCommerce habits"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What happens immediately after the man sets down the Blazin' Bow box?",
+ "Choice": [
+ "A. He explains finding the item on eBay",
+ "B. He smiles while sitting on the sofa",
+ "C. He faces the camera directly",
+ "D. He picks up a rectangular object resembling a knife"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "-6GoQCUpnSQ",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual event occurred as the speaker discussed the shedding of wool roving blankets?",
+ "Choice": [
+ "A. Focusing on the peach-colored knitted blanket during a close-up",
+ "B. Zooming in on the salmon-colored scarf during the introduction",
+ "C. Focusing on the teal-colored knitted blanket during a close-up",
+ "D. Displaying the 'TrulyMajestic.com' text in the upper left corner"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "kPpzeSfPoFM",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which audio event occurred during the close-up of the soldier's camouflaged uniform with an EOD patch?",
+ "Choice": [
+ "A. Description of catastrophic consequences from EOD errors",
+ "B. Commentary on modernization trends in military robotics",
+ "C. Application of robots in high-risk bomb disposal missions",
+ "D. Historical overview of iRobot's 1980s prototypes"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What distinguishes Gary Stair from other iRobot employees shown in the video?",
+ "Choice": [
+ "A. Discussing EOD robot integration",
+ "B. Robotics Program Manager title",
+ "C. Military camouflage uniform",
+ "D. US flag patch on baseball cap"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Which organization is developing technology to help military forces prevent disasters?",
+ "Choice": [
+ "A. iRobot",
+ "B. Defense Advanced Research Projects Agency (DARPA)",
+ "C. EOD Unit",
+ "D. Boston Dynamics"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which visual elements in Gary Stair's scene specifically emphasize his affiliation with military robotics development?",
+ "Choice": [
+ "A. Warehouse doors and presentation laptop",
+ "B. EOD technician's camouflage and remote control",
+ "C. Robotic ordnance manipulator and passing EOD technician",
+ "D. iRobot-logo apparel and overhead crane"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "VRxHvlS4GCI",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What is the primary purpose of displaying Greek text simultaneously with the spoken phrase 'interesting developments in evolutionary biology'?",
+ "Choice": [
+ "A. To create redundant reinforcement of the central thesis through dual modalities",
+ "B. To acknowledge contributions from Hellenic research institutions",
+ "C. To provide chronological context for the calendar graphics",
+ "D. To contrast ancient philosophical concepts with modern science"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "When 'HUMAN JOURNEY' appears visually, what is the primary focus of the concurrent audio discussion?",
+ "Choice": [
+ "A. Critiques of traditional ideological constructs",
+ "B. Empirical breakthroughs in evolutionary biology",
+ "C. Methodologies in developmental psychology",
+ "D. Temporal frameworks for research"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "YZME4lTWBHY",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with the speaker describing Richard Jefferson as a 'super productive bench player'?",
+ "Choice": [
+ "A. A player in a white jersey #24 completes a dunk",
+ "B. A scoreboard displays 89-99 during the fourth quarter",
+ "C. A player in a red uniform throws a basketball",
+ "D. Richard Jefferson appears in a Nets uniform"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "How does Richard Jefferson's role DIFFER between the first and last scenes of the video?",
+ "Choice": [
+ "A. From retired player to active contributor",
+ "B. From role player to high flyer",
+ "C. From confrontational teammate to bench player",
+ "D. From high flyer to role player"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "h_0DNygeF8A",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "When the speaker states 'the demise of several senior Iranian officials,' which visual is displayed?",
+ "Choice": [
+ "A. City skyline with \"Escalating Tensions\" text",
+ "B. Damaged buildings with smoke and Israeli flag",
+ "C. Title card with \"DEFENCE ON DEMAND\"",
+ "D. Large smoke plume near a mosque with green dome"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which event occurs first: the initial appearance of the Israeli flag on a damaged structure or the first explicit reference to Iran's diplomatic response?",
+ "Choice": [
+ "A. Neither is directly addressed in the content",
+ "B. Iran's response is mentioned during conflict analysis",
+ "C. Israeli flag appears during aftermath visuals",
+ "D. Both occur when discussing regional tensions"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What does the presence of the mosque with the green dome in the video's later scenes indicate about the conflict?",
+ "Choice": [
+ "A. It is located within Israeli borders",
+ "B. The conflict involves military actions in multiple countries",
+ "C. It represents a neutral religious sanctuary",
+ "D. It was used to coordinate Iranian retaliatory strikes"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "6hs_n2fqFfM",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which audio line is heard as Eggman collapses in a panicked state?",
+ "Choice": [
+ "A. 'This is an intervention'",
+ "B. 'Listen, we're doing this...'",
+ "C. 'You did, you pissed on it'",
+ "D. 'I woke up and'"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "k9jQL7bvkbs",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What audio event coincided with the initial nighttime exterior shot of the elegant house?",
+ "Choice": [
+ "A. Animated description of the family's movements",
+ "B. Festive ukulele music",
+ "C. Third-person narration establishing the story",
+ "D. Suspenseful singing"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "gleNOAoDG4U",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What was the chronological order of key promotional messages in the audio after Vijay's appearance?",
+ "Choice": [
+ "A. Marketing summit highlights → Panel discussion proposal → Date/location announcement",
+ "B. Brand identity introduction → Marketing strategy analysis → Event participation invitation",
+ "C. Panel discussion invitation → Chennai headquarters mention → Registration reminder",
+ "D. January 24th announcement → Industry networking emphasis → Marketing concepts"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What key objective does Vijay Kapoor emphasize for the event discussed in the video?",
+ "Choice": [
+ "A. Introduce Derby's latest sustainable fashion line",
+ "B. Promote a marketing summit featuring Vijay Kapoor as a speaker",
+ "C. Highlight the importance of marketing strategies in brand growth",
+ "D. Facilitate networking among industry professionals"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "AFKgWXM1aqM",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Which audio-visual pairing most directly conveys the singer's internal emotional struggle?",
+ "Choice": [
+ "A. Lyrics 'would you call me the crazy one' with meadow backdrop",
+ "B. Transition from spoken confession to sung verses",
+ "C. Synchronized 'confused' text and solitary field walking",
+ "D. Text 'just forget about it' with path departure"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What leads to the man's sigh in the opening scene?",
+ "Choice": [
+ "A. Physical exhaustion from recent activity",
+ "B. Technical difficulties with recording equipment",
+ "C. Disappointment with environmental conditions",
+ "D. Reluctance to pursue a romantic admission"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "VY7X-oKLLXg",
+ "video_duration": "30s",
+ "video_category": "Comedy",
+ "questions": [
+ {
+ "Question": "Why did Sheriff Taylor react with concern to Deputy Fife's story?",
+ "Choice": [
+ "A. He perceived negligence in the jail's maintenance",
+ "B. He related to the frustration of losing the snow cone's syrup",
+ "C. He was irritated by the mention of raspberry flavor",
+ "D. He disapproved of Deputy Fife's daily snow cone habit"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Through which contrasting methods do Deputy Fife and Sheriff Taylor convey concern during the snow cone incident?",
+ "Choice": [
+ "A. Using animated facial expressions vs. explicit verbal acknowledgment",
+ "B. Through symbolic background jail bars",
+ "C. Via shared uniform insignia details",
+ "D. Through emphatic gestures vs. shifts in vocal pitch"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "06b2PxvBoNU",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which element in the hospital photo specifically indicates its historical context rather than modern annotations?",
+ "Choice": [
+ "A. White encircling marker around Hitler",
+ "B. Nazi armbands on attendees",
+ "C. Military-style hats on attendees",
+ "D. Colorization inconsistencies"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "da8qBzoWKVY",
+ "video_duration": "30s",
+ "video_category": "Comedy",
+ "questions": [
+ {
+ "Question": "What is the correct chronological order of these events?",
+ "Choice": [
+ "A. 'Are you kidding me' → Shopping cart confrontation → 'You can't be bothered'",
+ "B. 'What if someone wants to park' → Camera shows multiple cars → SUV door ajar",
+ "C. SUV door closes → Camera focuses on navy Jeep → 'No, my job is to bust'",
+ "D. Tahoe door opens → 'You're making your pops...' → Man points to white vehicle"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "-T2WkBc0y_c",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What primary factor caused the 9°C temperature contrast discussed in the forecast?",
+ "Choice": [
+ "A. Diurnal temperature variation between morning and afternoon",
+ "B. Presence of cloud cover blocking solar heating in some regions",
+ "C. Proximity to large bodies of water moderating coastal climates",
+ "D. Temperature disparities between the UK and continental Europe"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "zI4QBIVHeIY",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which audio segment was heard as the camera first shifted to Matthew G. Whitaker preparing to speak?",
+ "Choice": [
+ "A. 'Special Counsel' spoken",
+ "B. 'Roger Stone's indictment' spoken",
+ "C. 'Congressman' spoken",
+ "D. 'important question' spoken"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "mQL-GdboNFk",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which visual elements appeared in sequence as the female speaker explained wattage calculation and encouraged viewer engagement?",
+ "Choice": [
+ "A. ERD logo → Website URL display → Shop button appears",
+ "B. Notification bell rings → Like button appears → Zoom on woman",
+ "C. Subscribe button changes to 'Subscribed' → Bell rings → Infrared panel label appears",
+ "D. Like button → Subscribe button → Notification bell appear simultaneously"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What primary visual difference distinguishes the speaker's introductory segments at 0-10s and 20-30s?",
+ "Choice": [
+ "A. Transition from brand identity graphics to product demonstration",
+ "B. Shift between full-body framing and close-up shots",
+ "C. Prominent website elements vs. YouTube engagement tools",
+ "D. Contrasting wardrobe choices between segments"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "noHdd2KBfhM",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What is the stated reason Caitlyn uses canned lemon juice instead of fresh in the recipe?",
+ "Choice": [
+ "A. To better emulsify with frozen ingredients",
+ "B. The recipe mandates commercially processed citrus",
+ "C. She lacked fresh lemons at the time",
+ "D. To showcase using non-perishable pantry items"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "When Caitlyn mentions needing 'three to four' in the audio, what specific visual component is emphasized?",
+ "Choice": [
+ "A. Frozen banana chunks being measured into a white bowl",
+ "B. A graphic overlay reading 'Blueberry Smoothie Bowl'",
+ "C. A pink smoothie bowl with strawberries and coconut flakes",
+ "D. Ashley demonstrating a preparation step"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What is the correct sequence of ingredient preparation shown while Caitlyn lists 'frozen bananas, blueberries, lemon juice'?",
+ "Choice": [
+ "A. Bananas positioned → lemon juice squeezed → blueberries placed",
+ "B. All ingredients appear simultaneously",
+ "C. Lemon juice squeezed → bananas placed → blueberries added",
+ "D. Bananas positioned → blueberries placed → lemon juice squeezed"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "OeGAnO04gug",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "Which sequence correctly orders the simultaneous audio-visual events based on their occurrence?",
+ "Choice": [
+ "A. Lectra City introduction → Easter egg explanation → 'One Punch' spawning mention",
+ "B. Teal corridor scene → Easter egg explanation → Lectra City transition",
+ "C. Lectra City introduction → Initial Ninja Turtles reference → Player using sniper rifle",
+ "D. 'One Punch' spawning mention → Downtown Square navigation → Easter egg discovery"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which additional interface element is prominently displayed when the speaker describes the button-pressing mechanic to spawn the enemy?",
+ "Choice": [
+ "A. A KV faction emblem in the upper corner",
+ "B. Enemy health bars above targets",
+ "C. The '@killersix' social media handle",
+ "D. Active objective waypoint markers"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "q8YamD2cKys",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual element is displayed when the speaker initially poses the hypothetical scenario about nuking Backrooms Level Fun?",
+ "Choice": [
+ "A. A charred party room with melted plastic cups",
+ "B. A digital rendering of a missile hovering above the party room",
+ "C. A black detonator remote with an antenna",
+ "D. A fiery explosion with red text stating 'THIS LEVEL'"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which visual element is immediately shown following the mention of bringing a detonator into the scene?",
+ "Choice": [
+ "A. Text overlay 'AROUND ALL'",
+ "B. Black remote control with red button",
+ "C. Close-up of cake with raspberries",
+ "D. Blue thumbs-up icon appearing"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "UQVEW8EuDFw",
+ "video_duration": "30s",
+ "video_category": "Comedy",
+ "questions": [
+ {
+ "Question": "Which audio event coincided with the appearance of the Winter & Construction logo?",
+ "Choice": [
+ "A. Instrumental music",
+ "B. A man speaking nostalgically",
+ "C. Theatrical music",
+ "D. Music ends"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What audio cue directly triggers Trent Trendsetter's expression of discomfort?",
+ "Choice": [
+ "A. A woman sarcastically calling him 'Hollywood's Punching Bag'",
+ "B. Child actor's name being announced",
+ "C. Self-deprecating mention of 'man boobs'",
+ "D. Dramatic stingers in the soundtrack"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What visual event occurred immediately after Trent Trendsetter said 'I was on top of the world'?",
+ "Choice": [
+ "A. Close-up of blonde girl",
+ "B. Double exposure of Trent and a girl",
+ "C. Trent looking downward with discomfort",
+ "D. Woman's silhouette against red"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which visual element is present when the line 'Zachary Lavoy' is spoken?",
+ "Choice": [
+ "A. Trent looking downward with discomfort",
+ "B. Blonde girl close-up",
+ "C. Blurry red background",
+ "D. Double exposure of Trent"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which two phrases were spoken during the woman's silhouette appearance and a subsequent visual event?",
+ "Choice": [
+ "A. 'biggest deuce' and 'Who am I kidding'",
+ "B. 'Who am I kidding' and 'came crashing down'",
+ "C. 'Zachary Lavoy' and theatrical music",
+ "D. 'biggest deuce' and 'on top of the world'"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "FaK3oS1QuaU",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which audio statement is made during the initial explanation when the instructor's finger touches the concave mirror?",
+ "Choice": [
+ "A. 'An easy way to keep concave and convex mirrors straight...'",
+ "B. 'This configuration demonstrates a virtual image formation'",
+ "C. 'The mirror you see in this picture is a concave'",
+ "D. 'Right now, let's remove my image'"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What occurs right after the instructor states, 'Right now, let's remove my image'?",
+ "Choice": [
+ "A. The mirror type is labeled as concave on the diagram",
+ "B. The instructor's image is removed from the upper corner",
+ "C. A dashed line appears connecting the mirror and object",
+ "D. The pointing finger and orange object vanish from view"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "nDFO_EDadEg",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with the lyrics 'Moving pallets, moving pallets, pick them up, put them down'?",
+ "Choice": [
+ "A. Two men in black 'KOO' shirts dance as the camera zooms out",
+ "B. Two men in black 'KOO' shirts dance as the camera zooms in",
+ "C. A vintage television displays 'CC MOVIE OF THE WEEK'",
+ "D. A poster of 'The Matrix' is visible on storage shelves"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "How do the dancers in the 'KOO' shirts (10-20s) differ contextually from the four dancers (20-30s)?",
+ "Choice": [
+ "A. Their lyrics focus on moving pallets rather than warehouse navigation",
+ "B. They appear before the forklift driver's 'woo' is heard",
+ "C. They perform near stacked DVDs instead of a neon 'Open' sign",
+ "D. Their shirts display pink text instead of safety stripes"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "E5kd8aA6518",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What factor contributes most to the natural light reflecting on the ceiling mentioned by the speaker?",
+ "Choice": [
+ "A. The presence of a large picture window",
+ "B. Light-colored walls enhancing ambient diffusion",
+ "C. Snow outside creating ambient light reflection",
+ "D. Overhead lighting fixtures being activated"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "1k214WV3Ndw",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual detail corroborates the audio mention of the spacecraft's launch site?",
+ "Choice": [
+ "A. Morning atmospheric haze in launch sequence",
+ "B. Earth's cloud patterns matching Florida's coastline",
+ "C. NASA insignia visible on the probe body",
+ "D. Nighttime starry sky during launch phase"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "PvdnisI1EwI",
+ "video_duration": "30s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Which visual event occurs immediately following the conclusion of the musical interlude in the 20-30s segment?",
+ "Choice": [
+ "A. Young man approaching a soccer net",
+ "B. Text 'UPIN & IPIN' with Disney channel labels",
+ "C. Group trophy celebration scene",
+ "D. Disney Channel logo appearance"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "AC_qdIJR870",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "At which specific point in Kathleen's dialogue does the '3 FREE VIDEOS' text first appear?",
+ "Choice": [
+ "A. During her declaration of a new commitment",
+ "B. While explaining bestseller achievement strategies",
+ "C. As she expresses appreciation for someone's presence",
+ "D. When she waves during her introduction"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "VhJEi_k8HMI",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which visual element is displayed simultaneously with the audio reference to 'press adverts in local or national newspapers'?",
+ "Choice": [
+ "A. Red television displaying 'PROMO MEDIA'",
+ "B. Billboard with rainbow gradient against a cityscape",
+ "C. Newspaper page with black/white stripes and a green square in the upper section",
+ "D. Rotated newspaper page on a green background with colored squares"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which visual immediately follows the green circle with a red checkmark chronologically?",
+ "Choice": [
+ "A. Newspaper page drawing",
+ "B. Rotated newspaper on green background",
+ "C. Muted red background",
+ "D. Light blue cityscape"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "vBrspDt3ib0",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What is the functional purpose of mentioning 'yellow yarn' when the camera zooms out on the green crocheted item?",
+ "Choice": [
+ "A. To signal completion of a chain foundation",
+ "B. To showcase yarn texture variation",
+ "C. To emphasize color contrast for visual clarity",
+ "D. To demonstrate securing technique for stitches"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "6OCyOsSRl1Y",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What happened immediately AFTER the girl said 'I have to stretch just a little bit more'?",
+ "Choice": [
+ "A. The man explained they were not playing a real game",
+ "B. The man questioned her stretching purpose",
+ "C. The girl asked if they were still playing",
+ "D. The girl mentioned playing basketball"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which visual event coincided with the man pleading, 'come on, can you come on'?",
+ "Choice": [
+ "A. Man holding football with Santa hat design shirt",
+ "B. Girl holding football threateningly",
+ "C. Girl raising arms while wearing pink shirt",
+ "D. Man holding football with Grinch face"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which statement correctly identified the game they were playing?",
+ "Choice": [
+ "A. 'It's called football'",
+ "B. 'This is not basketball'",
+ "C. 'We're just throwing the ball back and forth'",
+ "D. 'Girls take years to get ready'"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "In which type of location does the entire interaction between the characters take place?",
+ "Choice": [
+ "A. Backyard with pine trees",
+ "B. Indoor basketball court",
+ "C. Grassy park area",
+ "D. Football field"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What emotion did the girl primarily convey through her tone and body language when saying 'Tyrone! I wasn't ready'?",
+ "Choice": [
+ "A. Resentful indignation",
+ "B. Annoyed frustration",
+ "C. Anxious uncertainty",
+ "D. Playful exasperation"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "Dal9o3hxp0c",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which audio event started at the exact moment the 'Tea of the Day' text first appeared on screen?",
+ "Choice": [
+ "A. String instruments playing classical music",
+ "B. Crinkling sounds from handling a tea container",
+ "C. A male voice saying 'OF THE'",
+ "D. Introduction of Remembering Rosemary tea"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "9M_SiEaqnCU",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What was the chronological sequence of key events following the composite image of two men posing in different eras?",
+ "Choice": [
+ "A. 'You ready?' query → Girl shouts 'Daddy!' → Cross necklace becomes visible",
+ "B. Children embrace → Piano plays → Woman adjusts glasses",
+ "C. Girl points upward → Audience cheers → 'Current glory' speech",
+ "D. 'Take part in the current glory' statement → Girl in leopard shirt asks 'You ready?' → Audience cheers"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What prompted the girl in the leopard-print shirt to excitedly yell 'Daddy!' during the performance?",
+ "Choice": [
+ "A. She was responding to the audience's initial cheering",
+ "B. She noticed the woman with glasses entering the stage",
+ "C. She reacted to Omarion's call to 'take part in the current glory'",
+ "D. She recognized Omarion's reference to 'past glory' in his speech"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which audio event directly corresponds to the visual of two men in white suit jackets standing together, one wearing a fuzzy vest and the other sporting silver cross necklaces?",
+ "Choice": [
+ "A. 'Also take part in the current glory'",
+ "B. 'You ready?'",
+ "C. 'Has to be an unselfish start'",
+ "D. 'To not just hear about the past glory'"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "uysGRiIkxu8",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "How does the audio content differ between the first 10 seconds (0-10s) and the segment from 20-30s?",
+ "Choice": [
+ "A. Studio-produced music shifts to environmental kitchen sounds",
+ "B. Instrumental music transitions to speech with background music",
+ "C. Continuous jazz music is replaced by instructional speech",
+ "D. Solo narration contrasts with overlapping dialogue"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What contextual audio-visual cue directly prompts the man's 'shout out' acknowledgment?",
+ "Choice": [
+ "A. Mentioned learning technique from others",
+ "B. OD5 Fishing branding in title cards",
+ "C. Mid-Atlantic region text reference",
+ "D. Presence of a silver knife during cutting"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "I8ZmEpj1yps",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What does the VIP pass on the speaker's lap indicate about the nature of this engagement?",
+ "Choice": [
+ "A. A podcast recording",
+ "B. An exclusive mentorship session",
+ "C. A real estate investment summit",
+ "D. A corporate keynote presentation"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What appears FIRST in the video after the text 'AND THIS IS WHAT MAKES PEOPLE RICH' disappears?",
+ "Choice": [
+ "A. Text 'ANOTHER FLOW'",
+ "B. Images of apartment buildings",
+ "C. The VIP pass on his lap",
+ "D. The man gesturing with both hands"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "DTA5AbpHwEQ",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What key evidence directly linked Jake England to the Tulsa shooting spree?",
+ "Choice": [
+ "A. The news anchor's emotional delivery of the suspect's background",
+ "B. Vehicle registration documents for the white pickup truck",
+ "C. Live broadcast synchronization of Jake England's Facebook post content",
+ "D. Correlation between Twitter's legal response and arrest timing"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "pfsytPukzuQ",
+ "video_duration": "30s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "What visual element is displayed when the mountain snowflake symbol's technical qualification is explained?",
+ "Choice": [
+ "A. Close-up of the 3-Peak Mountain Snowflake symbol",
+ "B. Blue and green tag on a tire sidewall",
+ "C. Winter-rated certification stamp",
+ "D. Text about rubber pliability below 40°F"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "SJxMDLvg2jg",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which event directly followed the completion of the docket submission instructions?",
+ "Choice": [
+ "A. Revealing the full name of the Northwest 200 prize",
+ "B. Offering a pillion passenger ride experience",
+ "C. Announcing the competition winner",
+ "D. Detailing the judging criteria for entries"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "DFzcI7FxxxY",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which event occurs IMMEDIATELY after the man finishes saying 'We just talked about Porzingis, Marcus Smart, Tyus Jones'?",
+ "Choice": [
+ "A. The screen switches to Shams Charania's Twitter feed",
+ "B. The screen displays Chris Paul's trade contract details",
+ "C. Twitter's 'What's happening' section appears",
+ "D. The man highlights the '2030 first-round pick' text on screen"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "YkROTtD2vog",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What occurred immediately after the woman showed the tray of small plants?",
+ "Choice": [
+ "A. Person displayed 'TRUE SEED' packets",
+ "B. Person displayed sunflower, tomato, and carrot seed packets",
+ "C. Man demonstrated a seed between fingers",
+ "D. Camera showed the shop interior with stocked shelves"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "6yDJfe_9naU",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Which visual sequence directly follows the owl perched beside the American flag as the text 'UNPOPULAR' appears?",
+ "Choice": [
+ "A. An aerial shot of a helicopter crossing a river",
+ "B. A distant skyline featuring a recognizable monument",
+ "C. A low-angle view of skyscrapers with foreground flags",
+ "D. A street-level perspective of flags lining building facades"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "lNAjUDhUiUQ",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What visual event coincides with the audio statement '40% slower for...' in the first segment?",
+ "Choice": [
+ "A. She displays a laptop collage",
+ "B. She gestures about prosthetic advancements",
+ "C. She holds a microphone and prosthetic leg",
+ "D. She flips through a notebook"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "akXn1JUJDjU",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What immediately followed the on-screen text 'WEATHER PROBLEMS FOR WRESTLEMANIA?'?",
+ "Choice": [
+ "A. Facebook engagement prompt",
+ "B. Brock Lesnar career retrospective",
+ "C. YouTube subscription prompt",
+ "D. Wrestlelamia XL channel close-up"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "eDZ_nDR1BYk",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Why is a solid blue background maintained throughout the video?",
+ "Choice": [
+ "A. To create visual continuity with astronomy software interfaces",
+ "B. To simulate the appearance of the daytime sky during solar observation",
+ "C. To evoke the darkness of space for thematic relevance",
+ "D. To eliminate competing elements that might divert attention from content"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What is the first topic addressed after specifying the target audience as 'beginning amateur astronomer' through both audio and visual channels?",
+ "Choice": [
+ "A. Demonstration of budget-friendly equipment",
+ "B. Enumeration of solar observation risks",
+ "C. Reiteration of excluding advanced filters",
+ "D. Overview of observer certification processes"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "fqRSLvhdUII",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What is the primary purpose of the '*REGULAR LEVEL 2*' text displayed at 20-30s?",
+ "Choice": [
+ "A. It introduces a sublevel not previously mentioned",
+ "B. It contrasts the environment described in the audio",
+ "C. It synchronizes with the narrator's current dialogue",
+ "D. It highlights a safe zone within the level"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "ZIptSTddEco",
+ "video_duration": "30s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What visual and auditory elements indicate the video's format?",
+ "Choice": [
+ "A. Realistic sets and unedited candidate speeches",
+ "B. Cartoon figures with sarcastic debate dialogue",
+ "C. Patriotic motifs and campaign rally cheers",
+ "D. Historical footage with narrator voiceover"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What visual detail is present when the speaker mentions 'nine cruise missiles'?",
+ "Choice": [
+ "A. Close-up of a man in yellow tie with glasses",
+ "B. Red and white striped floor visible",
+ "C. Wide shot of all candidates at podiums",
+ "D. Speaker wearing a gold tie"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "LfZh_obXfLs",
+ "video_duration": "30s",
+ "video_category": "Nonprofits & Activism",
+ "questions": [
+ {
+ "Question": "Which visual event was first displayed during the plucked string instrument music in the opening 10 seconds?",
+ "Choice": [
+ "A. The TEDxFultonStreet logo on a white background",
+ "B. A green chrysalis hanging from a branch against a black background",
+ "C. Blurry view of a woman entering the Michael Schimmel Center",
+ "D. An empty stage with large scrolls on the floor"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What visual event immediately followed the appearance of 'A GATHERING' text?",
+ "Choice": [
+ "A. A close-up of brochures being distributed",
+ "B. An audience settling into red auditorium seats",
+ "C. A person in a top hat demonstrating antique audio equipment",
+ "D. A blurry view of a woman entering a building with visible architectural text"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "SlFS0VJ20tU",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What is the primary purpose of the woman's close-up shots highlighting facial imperfections?",
+ "Choice": [
+ "A. To establish credibility as a skincare professional",
+ "B. To demonstrate makeup coverage techniques",
+ "C. To directly respond to TikTok comment requests",
+ "D. To showcase contrasting nail and facial aesthetics"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What visual evidence in the video supports her authority on skincare?",
+ "Choice": [
+ "A. The Ed Hardy logo on her shirt",
+ "B. Repeated attempts to film the video",
+ "C. Visible skin imperfections during explanations",
+ "D. Displayed TikTok comments about skincare"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which visual overlays are used when discussing viral TikToks versus skincare inquiries?",
+ "Choice": [
+ "A. TikTok video grid vs. close-up product demonstration",
+ "B. Hashtag trends vs. before-after images",
+ "C. App interface vs. text comment collage",
+ "D. Profile page vs. live chat animation"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "C_cCskjpcIE",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What prompted Ken's curiosity about tactile paving during his travels?",
+ "Choice": [
+ "A. He noticed it's linked to aiding visually impaired pedestrians",
+ "B. He designs pedestrian infrastructure",
+ "C. He encountered it frequently in urban areas",
+ "D. His country lacks tactile paving systems"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "of59JPrX1yk",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which visual event is first shown when the upbeat background music begins during the introduction of the Big Machine Vodka Spiked Coolers series?",
+ "Choice": [
+ "A. White drum labeled 'BIG MACHINE DISTILLERY'",
+ "B. Two men standing next to barrels labeled 'BB' in the distillery",
+ "C. Aerial shot of Nashville's bridge over the river",
+ "D. Two men exchanging a high-five near the event advertisement"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What connection is highlighted between the racing scenes and the distillery scenes in this video?",
+ "Choice": [
+ "A. To contrast automotive and beverage industries",
+ "B. To showcase Nashville's tourism attractions",
+ "C. Both are sponsored by the same alcohol brand",
+ "D. To demonstrate product manufacturing processes"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What establishes the specific location context when the wooden barrel close-up is shown?",
+ "Choice": [
+ "A. Prominent display of brewery signage",
+ "B. Previous focus on the Grand Prix advertisement",
+ "C. Nashville city banners in the background",
+ "D. Visible production codes on the barrel"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which event occurred FIRST chronologically in the video?",
+ "Choice": [
+ "A. Two men high-fiving near a racing advertisement",
+ "B. Aerial panorama of Nashville with a river bridge",
+ "C. Close-up of a branded whiskey barrel",
+ "D. Cars accelerating on a red-and-white racetrack"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "0xenfbrSw8U",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which audio event occurred when the man first picked up the blue iPhone from its box and held it up for initial inspection?",
+ "Choice": [
+ "A. Whooshing sound effect",
+ "B. Punch sound effect",
+ "C. Male speaker discussing camera features",
+ "D. Electronic dance music playing"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the speaker's primary objective in this video based on explicit content cues?",
+ "Choice": [
+ "A. Promoting the channel's content",
+ "B. Reviewing smartphone features",
+ "C. Comparing two iPhone models",
+ "D. Unboxing new tech products"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "IWL9AJhQCtQ",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "What causes the abrupt halt in the heavy metal music near the 15-second mark?",
+ "Choice": [
+ "A. The guitarist switches to a secondary amplifier",
+ "B. The video highlights a venue name ('ST. JAMES')",
+ "C. The guitarist briefly stops playing to address the audience",
+ "D. A technical issue requires pedal adjustments"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What context best explains Nick's stage guitar performance in the clip?",
+ "Choice": [
+ "A. A Night Verses music video",
+ "B. A Headcave live concert",
+ "C. A guitar technique tutorial",
+ "D. A Blackstar amplifier demo"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What visual element is shown precisely when the audio states 'He is a form of life'?",
+ "Choice": [
+ "A. Live guitarist with dynamic background projections",
+ "B. Split-screen pedalboard close-up",
+ "C. Two vertically stacked Blackstar amplifiers",
+ "D. Text reading 'ST. JAMES'"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "acpqlVjjIGg",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What distinguished the RC car's second aerial flip (20-30s) from its first (10-20s) in terms of operational challenges?",
+ "Choice": [
+ "A. Utilization of terrain features for momentum",
+ "B. Driver's vocal indication of power loss",
+ "C. Evasion of nearby parked vehicles",
+ "D. Occurrence of static interference during ascent"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "MeSA4VUjGpE",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What occurs right after the child says 'Tomato'?",
+ "Choice": [
+ "A. A wooden cutting board with spices comes into view",
+ "B. The girl offers the red object to the man",
+ "C. Hands begin mixing ingredients in a bowl",
+ "D. The man repeats 'Tomato' affirmatively"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "oW-9LSa-beI",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What visual event directly follows the conclusion of the male speaker's promotional dialogue in the 10-20s segment?",
+ "Choice": [
+ "A. Man in white shirt gestures toward the building",
+ "B. Text 'THE BUILD UP' overlays the boxing ring",
+ "C. Third man in red shirt walks behind the speakers",
+ "D. Nate Campbell's championship photo appears on screen"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "fyS1H_Zs4po",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What key aspect differentiates the man's visual presence from the woman's active role in the video?",
+ "Choice": [
+ "A. The man is the only one referencing urban planning strategies",
+ "B. The woman is shown interacting with physical objects during her speech",
+ "C. The woman is the sole audible speaker throughout the presentation",
+ "D. Both individuals share equal screen time across multiple segments"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "SRqr72NAIP8",
+ "video_duration": "30s",
+ "video_category": "Travel & Events",
+ "questions": [
+ {
+ "Question": "Which audio element is heard when the exterior of the RV with attached bicycles is first shown?",
+ "Choice": [
+ "A. Upbeat movie soundtrack music",
+ "B. Woman narrating location details",
+ "C. Birds chirping and rustling leaves",
+ "D. Classical string music"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "Sq643A_t6e8",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What visual event coincided with the mention of 'off the planet'?",
+ "Choice": [
+ "A. An astronaut in a dark suit was being assisted",
+ "B. The ISS interior displayed various control screens",
+ "C. A rocket ascended with a sea backdrop",
+ "D. An astronaut in a black top adjusted equipment panels"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "D9PuHd9cQRI",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "Which audio event coincided with the appearance of cartoon sharks surrounding characters?",
+ "Choice": [
+ "A. 'Pinkfong' spoken",
+ "B. 'He dies for a new adventure. Baby shark do do do' sung",
+ "C. Ominous synthesizer music",
+ "D. 'The Legend of the Brawlers and El Primo do do do do do do' sung"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "K8B6LsAl8ds",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "How does Ryan's on-screen framing differ when discussing guidelines versus presenting the full-screen document?",
+ "Choice": [
+ "A. Circular overlay vs. full-body view",
+ "B. Lower-third graphic vs. upper-right corner placement",
+ "C. Circular overlay vs. lower-left inset",
+ "D. Split-screen layout vs. solo presentation"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "wpZfQAKkVVo",
+ "video_duration": "30s",
+ "video_category": "Nonprofits & Activism",
+ "questions": [
+ {
+ "Question": "Which event occurred LAST chronologically in the video?",
+ "Choice": [
+ "A. Casual explanation about Anglican-to-Methodist transition",
+ "B. Hand gestures during the statement 'Well, guess what?'",
+ "C. Mocking tone discussing salvation security",
+ "D. Hand gestures while stating 'he taught free'"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "4U5u7OEcHFs",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What does the woman's mobile phone usage primarily illustrate about modern consumers?",
+ "Choice": [
+ "A. The necessity of constant connectivity for consumer satisfaction",
+ "B. The accelerated decision-making process enabled by digital tools",
+ "C. The preference for price comparison over in-store experiences",
+ "D. The importance of participating in real-time market research surveys"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which visual sequence coincides with the narrator stating: 'Today's consumers are in the driver's seat... They move fast and make decisions even faster'?",
+ "Choice": [
+ "A. Animated woman looking at phone in cityscape, browsing clothes, then walking toward purple door",
+ "B. SUZY logo on monitor transitioning to pandemic shopping survey",
+ "C. Woman walking toward purple door followed by four people using devices",
+ "D. Purple timeline symbols under people using tablets during 'Empathy' text"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "When the words 'Empathy, Agility, Insight' appear on screen, which audio statement explicitly justifies their necessity in consumer research?",
+ "Choice": [
+ "A. 'Conducting rapid and robust consumer research'",
+ "B. 'Better human understanding leads to better decisions'",
+ "C. 'An always-on insights platform for modern needs'",
+ "D. 'Understanding this modern consumer requires empathy, agility, and insights'"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What visual shift occurs between the 0-10s and 10-20s segments to reflect the audio's emphasis on 'empathy, agility, insights'?",
+ "Choice": [
+ "A. Abstract color transitions vs concrete human activities",
+ "B. Solo consumer journey vs team-based data analysis",
+ "C. Physical retail interactions vs digital interface close-ups",
+ "D. First-person perspective vs third-person observation"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "QxrYJf48s-4",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What infraction did the commentator anticipate when stating 'take the DQ' regarding Hill Taylor?",
+ "Choice": [
+ "A. Violating lane boundaries by crossing into another lane",
+ "B. Starting the race before the official signal",
+ "C. Failing to touch the wall properly during the turn",
+ "D. Exceeding the permitted underwater distance limit"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What is the most likely reason for the commentator's excited tone when stating 'Hill Taylor touching in 23'?",
+ "Choice": [
+ "A. Sensor timing malfunction required manual verification",
+ "B. Multiple swimmers were nearly simultaneous at the wall",
+ "C. Hill Taylor's turn technique caused a disqualification",
+ "D. Hill Taylor set a new championship record"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which event occurred IMMEDIATELY after the commentator finished introducing all swimmers?",
+ "Choice": [
+ "A. Jason Bergstrom being highlighted in lane 8",
+ "B. Discussion about potential disqualification",
+ "C. Hill Taylor's wall touch celebration",
+ "D. Swimmers creating splashes while racing"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "LScZ6CXS5M8",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "During which visual event could viewers hear an engine sound?",
+ "Choice": [
+ "A. Man adjusting camera angle while discussing system dynamics",
+ "B. Close-up of bicycle chain with superimposed 'Machine Component Design' and 'Torsional Impact' text",
+ "C. Laptop displays '4.3 Torsional Impact' with a date stamp",
+ "D. Presenter's shirt displaying 'UNIVERSITY OF WISCONSIN STOUT' in close-up"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What visual element appeared precisely after the speaker began discussing 'linear translation'?",
+ "Choice": [
+ "A. OneNote page titled '4.3 Torsional Impact' on laptop screen",
+ "B. University logo on shirt with pale gray wall background",
+ "C. Man in navy shirt gesturing against light gray wall with dark door",
+ "D. Close-up of bicycle gears with 'Torsional Impact' text overlay"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "FF-QJWuEgbc",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What specific tool was used in the visual event immediately following the 'RixCanDoIt' title card?",
+ "Choice": [
+ "A. A mechanical pencil sharpener",
+ "B. An eraser to adjust a Leonardo DiCaprio portrait",
+ "C. A blue eraser cleaner box on a table",
+ "D. Charcoal sticks for hair detailing"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What is the primary purpose of Rix's video?",
+ "Choice": [
+ "A. To highlight his professional art studio setup",
+ "B. To address his recent inactivity in video uploads",
+ "C. To introduce his new charcoal drawing technique",
+ "D. To preview an upcoming collaborative art project"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What is the primary contextual purpose of the 'So Can You!' text overlay when paired with the soundtrack?",
+ "Choice": [
+ "A. To highlight the artist's unique technical skills",
+ "B. To encourage viewers to try drawing",
+ "C. To complement the energetic tone of the brass-heavy soundtrack",
+ "D. To demonstrate proper charcoal stroke techniques"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "adtOQyNWGrg",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event follows the lightsaber sound effect in the video?",
+ "Choice": [
+ "A. Shelves filled with Star Wars/Jurassic Park items are shown",
+ "B. A glowing lightsaber graphic appears",
+ "C. The camera shows Star Wars spaceship models",
+ "D. Dan-O waves in front of dinosaur-themed decorations"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which audio event references the viewer's motivation to click on the video while showcasing shelves with Star Wars and Jurassic Park collectibles?",
+ "Choice": [
+ "A. 'It's my office 2.0, I guess...'",
+ "B. 'This is my entire lightsaber collection... Look at this.'",
+ "C. 'There's a lot of bright lights in my face... Let's get on to it.'",
+ "D. 'A brief tour of the new house.'"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "k48QCmAh31E",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "When the rapper says 'Money on my mind,' which visual element is synchronized with this lyric?",
+ "Choice": [
+ "A. Hundred-dollar bills scattered on the ground",
+ "B. Green litter box with a blue scoop",
+ "C. Close-up of a gray tabby cat in a litter box",
+ "D. Gun in holster with studded belt"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "c-RzXw4GMF0",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What significant shift occurs simultaneously with the appearance of 'BEING REAL IN A FAKE WORLD'?",
+ "Choice": [
+ "A. Animated explosion effect emerges",
+ "B. Speaker switches to formal Arabic greeting",
+ "C. Vocal delivery adopts an irritated tone",
+ "D. Lightning overlay appears on screen"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "qF_bybWcUoE",
+ "video_duration": "30s",
+ "video_category": "Nonprofits & Activism",
+ "questions": [
+ {
+ "Question": "What is the primary reason the man on the left is recording with his smartphone?",
+ "Choice": [
+ "A. To showcase his personal style and accessories",
+ "B. To produce content for a community reform initiative",
+ "C. To document architectural details for preservation",
+ "D. To capture a firsthand account of probation challenges"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "ZxHCE5DRZEA",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "Which visual element was present when the first spoken dialogue occurred in the battle sequence?",
+ "Choice": [
+ "A. Close-up of Bowser roaring",
+ "B. Split-screen showing both combatants",
+ "C. Full-screen view of Ness's projectile attack",
+ "D. Character selection menu overlay"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "fZXAplvuYJE",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Why was C.W. Post's portrait displayed during the discussion of the Battle Creek sanitarium?",
+ "Choice": [
+ "A. He pioneered cereal-based therapies at the facility",
+ "B. His research informed the sanitarium's methods",
+ "C. He underwent medical treatment there",
+ "D. He oversaw operations at the sanitarium"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What is the primary implication of the recurring bookshelf backdrop throughout the video segments?",
+ "Choice": [
+ "A. Establishing the speaker’s expertise",
+ "B. Reinforcing retro visual themes",
+ "C. Highlighting interior design trends",
+ "D. Emphasizing historical context of the discussion"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which visual element is displayed when John Harvey Kellogg's management of the Battle Creek sanitarium is mentioned?",
+ "Choice": [
+ "A. Rocket animation sequence",
+ "B. Vintage starburst patterns",
+ "C. Breakfast cereal product placement",
+ "D. Portrait of C.W. Post"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "pxR4ffElpjM",
+ "video_duration": "30s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "Which visual components are shown when the speaker first mentions 'Hyundai Sonata'?",
+ "Choice": [
+ "A. C-clamps compressing caliper and rusty rotor edges",
+ "B. Brake assembly close-up and compressed piston",
+ "C. C-clamps and silver rotor",
+ "D. Mudguard and circular hole on assembly"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "fZKagJ-Q9fg",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which pair of performance metrics in the visualization are represented by a text label and a graph icon, respectively?",
+ "Choice": [
+ "A. Area under the ROC curve and Accuracy",
+ "B. Accuracy and Precision",
+ "C. Recall and Area under the ROC curve",
+ "D. Precision and Recall"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "When the speaker states 'chest X-ray images and gives the probability of the image being from a COVID patient,' which concurrent visual elements validate this process?",
+ "Choice": [
+ "A. Four X-ray images with 'P(covid)' probabilities",
+ "B. Binary classifier interface and metric icons",
+ "C. Covid probability values shown",
+ "D. Stacked X-rays with ROC curve annotations"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which event occurred immediately AFTER the display of performance metric icons (bullseye, gear, etc.)?",
+ "Choice": [
+ "A. December 2023 timestamp appears",
+ "B. Binary classifier text appears",
+ "C. ROC curve graph is shown",
+ "D. List of numbered squares appears"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "Bk7FG6XUlLA",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which visual element first appeared when the speaker mentioned enhancing Wi-Fi signal strength?",
+ "Choice": [
+ "A. Muscular arm graphic in white overlay",
+ "B. Blue and green curtain visible in background",
+ "C. Router equipment being adjusted",
+ "D. Close-up shot of speaker's face"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What specific technical aspect does 'might even improve something' most precisely reference in the final segment?",
+ "Choice": [
+ "A. Signal propagation resistance",
+ "B. Latency reduction",
+ "C. Data throughput rates",
+ "D. Spatial coverage efficiency"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "1qJvkpqtBlk",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "What is the correct sequence of instruments played by the man between 10-20s when the lyrics 'rhymes with Rusty' and 'Protect our bike chains...' are sung?",
+ "Choice": [
+ "A. Keyboard → Drums → Guitar",
+ "B. Guitar → Keyboard → Drums",
+ "C. Drums → Guitar → Keyboard",
+ "D. Drums → Keyboard → Guitar"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "TwsSrD8g3HA",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What was the chronological order of these events: (1) Fish splash, (2) '4 pounder' spoken, (3) 'good times' spoken, (4) Boat movement?",
+ "Choice": [
+ "A. 2 → 4 → 3 → 1",
+ "B. 4 → 3 → 2 → 1",
+ "C. 4 → 2 → 1 → 3",
+ "D. 4 → 2 → 3 → 1"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What were the respective causes of the two consecutive splash sounds heard around 15-20s in the video?",
+ "Choice": [
+ "A. First: line casting | Second: fish release",
+ "B. Both resulted from the fishing rod's action",
+ "C. First: boat movement | Second: fish release",
+ "D. Both originated from the boat's wake"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "qGLfM8hOlac",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "In what chronological order did these narration audio events occur relative to their corresponding visual preparation stages?",
+ "Choice": [
+ "A. Wedding dedication speech (jewelry assistance) → Sheer love speech (headpiece placement) → Glamour narration (hairspray application)",
+ "B. Electronic music pulses (logo display) → Pop music (flower arrangement) → Processed voiceover (bracelet adjustment)",
+ "C. Glamour narration (headpiece placement) → Sheer love speech (flower arrangement) → Wedding dedication speech (jewelry assistance)",
+ "D. Sheer love speech (bracelet adjustment) → Wedding dedication speech (hairspray) → Glamour narration (logo display)"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "wHLOqhMB-i8",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What did Becca do immediately after specifying her 'haul' was for Dollar Tree?",
+ "Choice": [
+ "A. Introduced her art supply series",
+ "B. Compared prices with Target products",
+ "C. Displayed Walmart-branded items",
+ "D. Walked toward a shelf of home decor"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "4ZbAVRbSncU",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What visual information appears on screen when the narrator mentions the nebula being 'visible with binoculars'?",
+ "Choice": [
+ "A. Constellation of the Unicorn reference",
+ "B. Gas filament evacuation process",
+ "C. Mid-January imaging timestamp",
+ "D. Copyright and credit details"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "LiRmrXSGKPY",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What led the speaker to determine the flaw was a superficial scratch rather than a structural die crack?",
+ "Choice": [
+ "A. The shield's central positioning concealed imperfections",
+ "B. The emu's preserved details ruled out significant damage",
+ "C. The magnified view displayed ambiguous feature definition",
+ "D. The kangaroo's intact outline indicated minor surface wear"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "i9x_JohBBe4",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What was visually shown before the conclusion of Molly's segment on the international whaling ban?",
+ "Choice": [
+ "A. The whaling vessel sailing in open waters",
+ "B. A caution sign in Japanese on the dock",
+ "C. Workers using rods to measure a whale",
+ "D. Workers in safety gear ascending the ship's gangway"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "dhLPwi9blMo",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What was the chronological sequence of key elements in Matt's astrophotography preparation?",
+ "Choice": [
+ "A. Introduction of imaging focus → Identification of celestial target → Orion constellation visualization",
+ "B. Identification of celestial target → Orion constellation visualization → Technical workflow discussion",
+ "C. Orion constellation visualization → Introduction of imaging focus → Identification of celestial target",
+ "D. Technical workflow discussion → Orion constellation visualization → Identification of celestial target"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "AaaTLFxIQ6o",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Why might the music have stopped when the Liberty.me logo appeared?",
+ "Choice": [
+ "A. To coincide with the shift to an indoor setting",
+ "B. To highlight Kyle Platt's initial appearance on screen",
+ "C. To transition from intro graphics to human speech",
+ "D. To synchronize with the fade-out of the Liberty.me logo"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What visual context is present when Kyle states 'me here with Marianne Coppenhaver'?",
+ "Choice": [
+ "A. Both subjects in an indoor atrium",
+ "B. Close-up of Marianne's pearl necklace",
+ "C. Crowd waving in the lobby below",
+ "D. Marianne seated beside Kyle on a balcony"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "ChbZLHnush8",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event coincided with the audio statement 'I will be teaching you how to speedrun level zero through level one thousand'?",
+ "Choice": [
+ "A. YouTube video titled 'How to Speedrun Backrooms Level 0-100'",
+ "B. Stick figure running through a yellow hallway",
+ "C. Cartoon man in a suit addresses the camera",
+ "D. Thumbnail with a stick figure getting shot by a red beam"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "AYW_zgrLc6U",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What audio event occurred immediately after the man criticized the match as 'an absolute shambolic display'?",
+ "Choice": [
+ "A. A parking sign became visible in the background",
+ "B. Matt Holland introduced the show indoors",
+ "C. The analyst reflected on two quick goals",
+ "D. Rock music played with the logo reappearing"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "jsCsIW8zFvQ",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which audio event occurred simultaneously with the initial appearance of the ESPN logo?",
+ "Choice": [
+ "A. Sound of a engine",
+ "B. 'Robbie, come over here, you do it for me'",
+ "C. 'Thanks, Robbie. Appreciate it'",
+ "D. Upbeat rock music"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What was the primary purpose of summoning Robbie during the shop presentation?",
+ "Choice": [
+ "A. To diagnose a customer's mechanical complaint",
+ "B. To showcase specialized BMX stunt gear",
+ "C. To co-narrate the shop's infrastructure and services",
+ "D. To authenticate liability waiver compliance"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "At the precise moment the upbeat rock music began, which visual element was displayed?",
+ "Choice": [
+ "A. Customer signing waiver at check-in table",
+ "B. Mechanic adjusting handlebars in workshop",
+ "C. Close-up of two cyclists performing mid-air stunts",
+ "D. Exterior shot of ED'S bike shop showing crowd and pennants"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What type of event is being hosted at ED'S bike shop based on the activities shown?",
+ "Choice": [
+ "A. Community group ride",
+ "B. BMX skills exhibition",
+ "C. Grand opening celebration",
+ "D. Professional racing competition"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "H9rk0j3zuog",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual element was displayed simultaneously with the narrator's discussion of 'enormous megalithic blocks' in the 10-20s segment?",
+ "Choice": [
+ "A. The red arrow highlighting desert geology",
+ "B. The 'Hidden Chambers All Over The Great Sphinx?' video overlay",
+ "C. The composite image of pyramids/Sphinx/Anubis",
+ "D. The YouTube window analyzing basalt floor tool marks"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What specific visual element does the narrator's mention of 'highly advanced tool marks' directly support during the 20-30s segment?",
+ "Choice": [
+ "A. The close-up examination of the concrete texture",
+ "B. The composite image comparing pyramids and the Sphinx",
+ "C. The red arrow pointing to Giza's basalt floor location",
+ "D. The YouTube video title about Anubis symbolism"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "ZtE20CoGkus",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which event occurs LAST between these options during the first 20 seconds?",
+ "Choice": [
+ "A. The man exits a black-and-white police car",
+ "B. The cameraman asks 'You drink your own bullshit?'",
+ "C. The man says 'walk away' while facing forward",
+ "D. The man turns around and says 'walk away' after exiting the door"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What prompted the man to demand that someone 'take a step back'?",
+ "Choice": [
+ "A. He was addressing a derogatory comment about his beverage",
+ "B. The cameraman repeatedly questioned his authority",
+ "C. Someone attempted to bypass a restricted area",
+ "D. The cameraman was obstructing his path into the building"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "When the cameraman first asks 'You drink your own bavvas or what?', what is the man in the Lakers jersey doing in the video?",
+ "Choice": [
+ "A. Turning his head to address someone",
+ "B. Scanning the exterior signs",
+ "C. Walking toward the building entrance",
+ "D. Exiting the police car"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "HAMVD9S_6wY",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual is shown simultaneously with Dr. Jake's statement about one-third of Americans being under heat advisories?",
+ "Choice": [
+ "A. Heat map displaying U.S. cities with triple-digit temperatures",
+ "B. National weather map with concentrated heat zones in the Southwest",
+ "C. Multilingual safety notices beneath a red stop sign",
+ "D. Illustrated sun graphic with historical temperature records"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "jqJNYUb22Xk",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which audio event coincided with the visual of the man clapping his hands and yelling in the airplane aisle while wearing a black baseball cap?",
+ "Choice": [
+ "A. 'There is more fallout today...' with the news anchor speaking",
+ "B. 'Donald Trump, baby!' yelled aggressively",
+ "C. 'right, this man knows what's up' spoken with a casual and aggressive tone",
+ "D. 'like it? Too bad' spoken mockingly"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "axUesvtufOg",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What specific interaction caused dust displacement near the target area during the initial shooting sequence?",
+ "Choice": [
+ "A. Spent brass casings impacting the terrain",
+ "B. Projectile collisions with ground-mounted steel plates",
+ "C. Rounds penetrating moist earth behind the target",
+ "D. Firearm slide reciprocation disturbing loose soil"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "9P_DH8YInpw",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Why is it necessary to 'store this data' according to the speaker?",
+ "Choice": [
+ "A. To archive raw podcast footage for editing",
+ "B. For preserving historical records of celebrity chess matches",
+ "C. To retain numerous AI-generated video drafts for evaluation",
+ "D. To meet cloud storage platform quotas"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "b7LiW66cSM4",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which specific environmental data sources are cited in the text overlays for the Italy and India segments respectively?",
+ "Choice": [
+ "A. Italy: NASA satellite imagery; India: Central Pollution Control Board",
+ "B. Both: World Economic Forum climate reports",
+ "C. Italy: Reuters nitrogen dioxide analysis; India: NASA aerosol measurements",
+ "D. Italy: CREA emission models; India: Reuters traffic indices"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "-oC3FVOx62g",
+ "video_duration": "30s",
+ "video_category": "Comedy",
+ "questions": [
+ {
+ "Question": "Which of the following sequences of visual and audio events accurately reflects the video?",
+ "Choice": [
+ "A. 'Alright, let's' said -> Pandora app shown -> Song starts -> 'What is this?'",
+ "B. People laying down -> 'are sleep' said -> Hand taps screen -> Bell ringing.",
+ "C. Person waking up -> Door closing -> Song starts -> Distressed speech.",
+ "D. Bell ringing -> iPad screen shown -> 'It's time to get to work' -> Black screen."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What is the likely context surrounding the question 'What is this?' in the video?",
+ "Choice": [
+ "A. It's a question about the closing door sound.",
+ "B. It's a reaction to the song that started playing after the app was opened.",
+ "C. It's a question about the person waking up abruptly.",
+ "D. It's a question about the image of the two young people on the iPad."
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What action and sound occurred around the same time the iPad screen was shown?",
+ "Choice": [
+ "A. Someone saying 'Alright, let's' and the start of a song.",
+ "B. A door closing and someone saying 'Alright, let's'.",
+ "C. A bell ringing and someone touching the screen.",
+ "D. Someone waking up and a door closing."
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "-q6f1XIGuL4",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What is the likely setting or environment based on the combined audio and visual elements?",
+ "Choice": [
+ "A. A fishing trip on a boat in the ocean.",
+ "B. A construction site where gravel is being moved.",
+ "C. A quarry where rocks are being extracted.",
+ "D. A mountain trail where someone has been walking for a long time."
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "-rb6uia1wSo",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What is the likely context of the man's speech in the middle(10-20s) of the video?",
+ "Choice": [
+ "A. He is describing the dark purple curtains in the background.",
+ "B. He is responding to someone off-screen about the beep sound.",
+ "C. He is providing instructions for playing the trombone.",
+ "D. He is concluding a video segment for his audience."
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Why did the character stop playing the clarinet around the middle of the video?",
+ "Choice": [
+ "A. Due to a problem with the white lamp.",
+ "B. Because the musical instrument is broken",
+ "C. To express his gratitude and bid farewell.",
+ "D. Because the melancholic tune was too difficult."
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "-spqqACI6UQ",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Compare the audio during the initial skateboard demonstration with the audio during the outdoor skatepark scene.",
+ "Choice": [
+ "A. Both sections feature human speech, but the indoor segment includes background music, unlike the outdoor segment.",
+ "B. The first section has speech about skateboarding, whereas the outdoor section has speech about pogo sticks.",
+ "C. The first section has sounds related to a different activity being explained, while the outdoor segment features music.",
+ "D. Both sections have instructional speech, but the outdoor segment also includes sounds of skateboarding."
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Based on the audio and visual content between 20-30 seconds, what is the likely skill level of the person being addressed by the speaker?",
+ "Choice": [
+ "A. Someone already familiar with dropping in but seeking confirmation on their approach.",
+ "B. Highly skilled and experienced with advanced tricks, needing only verbal cues.",
+ "C. A beginner who requires detailed step-by-step instructions and visual demonstrations.",
+ "D. At an intermediate level, needing reminders on specific techniques like foot placement."
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Why did the scene switch to the outdoor skatepark around the 10-second mark?",
+ "Choice": [
+ "A. To transition to a more advanced skateboarding segment, coinciding with the introduction of electronic dance music.",
+ "B. To showcase the same skateboard trick being performed, accompanied by a change in background music.",
+ "C. Because the indoor demonstration of basic skateboard handling was complete, indicated by the end of the pogo stick explanation.",
+ "D. Due to the person finishing their skateboarding instructions indoors, signaled by the sound of a door closing."
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What is the primary focus of the video segment between 20-30 seconds?",
+ "Choice": [
+ "A. Showcasing a compilation of different skateboarding tricks performed by various individuals.",
+ "B. Demonstrating the basic movements involved in riding a skateboard on flat ground.",
+ "C. Providing instructions on how to perform a 'drop in' on a skateboard.",
+ "D. Highlighting the transition from indoor to outdoor skateboarding environments."
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What happened visually before the electronic dance music started playing?",
+ "Choice": [
+ "A. The camera focused on the skateboard wheels.",
+ "B. The person's feet were shown standing on the skateboard.",
+ "C. A door was heard slamming shut.",
+ "D. The person landed a skateboard trick outdoors."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "-w12aSkQ9No",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What action and audio event occur concurrently between 0-10 seconds?",
+ "Choice": [
+ "A. Woman singing and orchestral music playing.",
+ "B. Woman gesturing and the music ending.",
+ "C. Woman standing and a soaring vocal tone.",
+ "D. Woman sitting and a male voice singing."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the order of events involving the woman and the music around the 10-20 second mark?",
+ "Choice": [
+ "A. Another person sings, then the woman starts singing.",
+ "B. Orchestral music starts, then the woman starts singing.",
+ "C. Woman sings alone, then another person joins singing.",
+ "D. Woman finishes singing, then orchestral music starts."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "-uaTK8sa5Ms",
+ "video_duration": "30s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What was occurring visually when the fast-paced bluegrass music with a prominent banjo begins to play?",
+ "Choice": [
+ "A. People were arranging a woman in a pink wig on set.",
+ "B. A man was explaining the production process to the camera.",
+ "C. Men were striking poses on a green mat in front of a backdrop.",
+ "D. A group of women in pink dresses were dancing on a set."
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "-wQ8El8J_dE",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "While the garage door was in motion, what auditory cue confirmed a specific visual detail about the opener?",
+ "Choice": [
+ "A. The sound of the motor starting aligned with the initial movement of the door.",
+ "B. The speaker's statement about the door going back up aligned with the blinking light.",
+ "C. The speaker mentioning the sensors coincided with the door fully opening.",
+ "D. The sound of the door panels moving matched the light on the opener blinking."
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "-wymN80CiYU",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Why might the text 'Said it's too late to apologize' transition to 'Too late...'?",
+ "Choice": [
+ "A. There was a shift in the accompanying instrumental music from keyboard to synthesizer.",
+ "B. The black circle with the tree was replaced by a musical note.",
+ "C. The male vocalist started singing with a more emotional tone.",
+ "D. The music changed from a pop song to Indian folk music."
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Considering both the visual and audio elements, what might be the overall mood or theme of the video?",
+ "Choice": [
+ "A. Regretful and reflective, suggested by the apologetic text and melancholic vocals.",
+ "B. Mysterious and suspenseful, hinted at by the disappearing text and instrumental music.",
+ "C. Celebratory and energetic, indicated by the upbeat music and floral borders.",
+ "D. Peaceful and serene, represented by the tree silhouette and calming synth sounds."
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "-xfgovG6-KU",
+ "video_duration": "30s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "What was the man out of view saying while the man in the white shirt and black sleeves was leaning into the car?",
+ "Choice": [
+ "A. He described a small LED light near the door handle.",
+ "B. He talked about the car's acceleration capabilities.",
+ "C. He commented on the presence of a web browser and Google partnership.",
+ "D. He mentioned the car's self-retracting door handles."
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which action immediately followed the camera zooming in on the car's interior and showing the touch screen?",
+ "Choice": [
+ "A. The touch screen displayed a car graphic before turning black.",
+ "B. The door handle was shown in a protruding state.",
+ "C. The man in the blue shirt and green jacket pointed at the door handle.",
+ "D. The man mentioned the self-retracting feature of the door handle."
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "-xQbyezhw_k",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Considering both the visual and audio information, what is the likely setting when the phrase 'Starting at Hooton Avenue and going all the way to State Avenue' is spoken?",
+ "Choice": [
+ "A. An outdoor scene near the mentioned avenues.",
+ "B. A live news broadcast in a studio.",
+ "C. A casual conversation between friends at home.",
+ "D. An noisy office with a map displayed."
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "-z4OZ7ls5Bo",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Considering both the audio and visual elements, what is the most likely purpose of the text appearing on the screen?",
+ "Choice": [
+ "A. To indicate a change in the ukulele's chords.",
+ "B. To show the lyrics of the song being played on the ukulele.",
+ "C. To display the tuning of the ukulele being played.",
+ "D. To label the different parts of the ukulele."
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "-zI7FN8cMIE",
+ "video_duration": "30s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "Which of the following sequences of events accurately combines both visual actions and audio cues?",
+ "Choice": [
+ "A. Speech about battery life -> Showing the steering wheel -> Car driving.",
+ "B. Viewing the dashboard -> Car door closing -> Speech about ventilated seats.",
+ "C. Showing the heated seat buttons -> Car horn sound -> Opening the armrest.",
+ "D. Car horn sound -> Opening the center console -> Engine running."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Considering both the visual and audio information, what is the primary activity occurring in the first 10 seconds of the video?",
+ "Choice": [
+ "A. Testing the car's horn and interior.",
+ "B. Starting the car and preparing to drive.",
+ "C. Demonstrating the storage capacity of the center console.",
+ "D. Inspecting the car's engine."
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "0-2Onbywljo",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What was the immediate context surrounding the first instance of laughter heard in the audio?",
+ "Choice": [
+ "A. The pigs were walking beside a yellow house.",
+ "B. The pigs were already playing outside.",
+ "C. Peppa was throwing the ball to George.",
+ "D. The pigs were sitting at the table."
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "After the laughter in the kitchen and the instruction to go play, what visual event immediately followed according to the video?",
+ "Choice": [
+ "A. The pig in the blue shirt was walking on a grassy hill.",
+ "B. The two pigs were rushing down a green hill.",
+ "C. The pig in the red dress held a spotted ball on a green field.",
+ "D. Peppa threw the ball and it bounced."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "00FBAdjlF4g",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Which audio event occurred while a person was visibly touching their chin?",
+ "Choice": [
+ "A. The male speaker talking about sounds.",
+ "B. The sound of someone sniffing.",
+ "C. The sound of someone sighing.",
+ "D. The electronic sound effect."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What happened visually after the male speaker finished his sentence about having 'a good problem to have'?",
+ "Choice": [
+ "A. The silver line-based triangle emblem appeared.",
+ "B. A black background was displayed.",
+ "C. The scene showed three people in a recording studio.",
+ "D. The scene transitioned to a man with a beard in a studio."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "00iiNF4tn4o",
+ "video_duration": "30s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "What followed immediately after the man stated 'Pretty cool, huh?'?",
+ "Choice": [
+ "A. The pipes made a rattling sound.",
+ "B. He raised a finger and said 'Now I'm back ...'",
+ "C. He exited the frame to the right.",
+ "D. The background shelves became cluttered."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What likely prompted the initial metallic rattling sound?",
+ "Choice": [
+ "A. Someone off-screen moved the shelves.",
+ "B. The man bumped into the blue frames.",
+ "C. The man spoke loudly.",
+ "D. The man interacted with the brown pipes by looking down."
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "02-0m4RbDUI",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What is the overall context of the man's actions and speech throughout the described video?",
+ "Choice": [
+ "A. He is providing a tutorial on basic home repairs.",
+ "B. He is demonstrating how to maintain kitchen appliances.",
+ "C. He is comparing the quality of different cabinet brands.",
+ "D. He is reviewing different types of cabinet hardware."
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "02UvvE1oA1I",
+ "video_duration": "30s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Following the appearance of a snowy brick house with icicles, what auditory event happened next?",
+ "Choice": [
+ "A. Cheerful instrumental music begins.",
+ "B. A distinct coughing sound.",
+ "C. A sheep bleating sound.",
+ "D. Upbeat flamenco guitar music continues."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Considering the audio and visual content in the 20-30s section, what might be the overall mood or atmosphere intended?",
+ "Choice": [
+ "A. Somber and reflective.",
+ "B. Calm and peaceful.",
+ "C. Playful and slightly humorous.",
+ "D. Mysterious and suspenseful."
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Based on the audio and visual changes around the 10-second mark, what shift in the video's setting or style can be inferred?",
+ "Choice": [
+ "A. A change in the type of animal being featured.",
+ "B. A move from a musical segment to a dialogue-driven one.",
+ "C. A shift from a realistic depiction to a more abstract one.",
+ "D. A transition from a summery scene to a wintery one."
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "02XbmweOzOY",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Compare the sounds present during the initial workshop view with those when the man is working on the car wheel.",
+ "Choice": [
+ "A. Both sections primarily feature human speech.",
+ "B. The workshop has engine sounds and speech, while the car repair has engine sounds.",
+ "C. The workshop has engine sounds, while the car repair has engine sounds and speech.",
+ "D. Both sections feature loud machinery sounds."
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What is the context surrounding the statement 'Release the break' considering both audio and visual information?",
+ "Choice": [
+ "A. It is a statement about the man's next action during the process of removing the car's wheel.",
+ "B. It is a command given during the process of removing the car's wheel.",
+ "C. It is a statement made while inspecting tools on the shelf.",
+ "D. It is an instruction given while showcasing the air compressor's features."
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Considering the audio and visual content, what is the likely setting of the 'very tiny leak' discussion?",
+ "Choice": [
+ "A. Inside the workshop with the red air compressor.",
+ "B. Near a plumbing system inside the garage.",
+ "C. Inside the car's passenger compartment.",
+ "D. Outdoors, near the white car being repaired."
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "03aJ_RcnBko",
+ "video_duration": "30s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "What action was being performed when the narrator said, 'We got the control arm...'?",
+ "Choice": [
+ "A. Heating a bolt with a blowtorch.",
+ "B. Tapping a bolt with a hammer.",
+ "C. Opening a drawer to get tools.",
+ "D. Focusing on the car's suspension."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the immediate context of the narrator saying 'and we're going to come straight off and onto my cooler...'?",
+ "Choice": [
+ "A. After removing a nut from the tire hub with a power tool.",
+ "B. After tapping a bolt with a hammer on the suspension system.",
+ "C. After opening a drawer and heating a bolt.",
+ "D. After a loud noise and the metallic part with studs appearing on the floor."
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What happened visually before the narrator mentioned having a '13 mm on this hub'?",
+ "Choice": [
+ "A. The car was lifted by a jack, and the tie rod end was mentioned as loose.",
+ "B. The blowtorch was used, and a drawer opened.",
+ "C. The suspension system was focused on, and a hammer was used.",
+ "D. A metallic part with studs was shown, and a loud noise was heard."
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "03IZHwOh2eQ",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which action and speech occurred simultaneously around the middle(10-20s) of the video?",
+ "Choice": [
+ "A. Dropping ice cream and asking 'Is it just this ice cream?'",
+ "B. Broken ice cream and saying 'maybe a production fail, I don't know'.",
+ "C. Picking up ice cream and asking 'What is it? A waffle?'",
+ "D. Rotating ice cream and saying 'Um'."
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Why might the person have said 'maybe a production fail' ?",
+ "Choice": [
+ "A. Because the wrapper was missing and they were talking about a waffle.",
+ "B. Because the ice cream looked perfectly fine and they were silent.",
+ "C. Because the ice cream was broken into unusual pieces and they were actively chewing.",
+ "D. Because the ice cream was completely melted and there was no sound."
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "03zeVlBWhdY",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What sound event aligns with the visual of the man in the white jersey being escorted away by security?",
+ "Choice": [
+ "A. A high-pitched cheering sound.",
+ "B. The crowd chanting 'Jedinstvo! Jedinstvo!'",
+ "C. A male voice shouting 'Hit him down!'",
+ "D. The crowd cheering loudly."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "04ChQ0fqzxQ",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What action is being performed while the male speaker says, 'Instead, I want to let your card find me'?",
+ "Choice": [
+ "A. The cards are being placed on the table.",
+ "B. The cards are being spread out.",
+ "C. The cards are being shuffled.",
+ "D. A card is being revealed."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "04xLK7R5wNA",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What is the likely context of the phrase 'there we go' spoken by the child?",
+ "Choice": [
+ "A. After successfully opening a difficult package.",
+ "B. After putting the lid back on the blue can.",
+ "C. After finishing the description of a toy.",
+ "D. After picking up a dropped toy during arrangement."
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "07rwFurzpw8",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which of the following happened immediately before the cheering sound ends?",
+ "Choice": [
+ "A. The woman in pink did a flip.",
+ "B. The xylophone music played.",
+ "C. The man raised his arms.",
+ "D. The promotional speech finished."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "097K1uMIPyQ",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Compare the tone of the male speech when the keyboard is shown versus when the bellows are shown.",
+ "Choice": [
+ "A. Both are tense.",
+ "B. Both are delivered in a whispering tone.",
+ "C. Both are cheerful and inviting.",
+ "D. The former is cheerful, the latter is normal."
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "0ARh-9Amrv0",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Why might the black pony have looked annoyed when the fluffy creature appeared in the doorway around the time of the knocking sound?",
+ "Choice": [
+ "A. Because the fluffy creature had food on its head, as commented by the 'hotdog unibrow' remark.",
+ "B. Because the black pony was expecting someone else, indicated by the surprised 'oh' sound later.",
+ "C. Because the black pony was trying to be quiet, disrupted by the knocking sound.",
+ "D. Because the door was already closed, indicated by the earlier 'door closing' sound."
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "0aaLWD7TUSU",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What is the likely context of the phrase 'round three going to get you around' spoken while the rowing team is visible?",
+ "Choice": [
+ "A. It's a warning about the narrow boats docked on the side of the river.",
+ "B. It's a comment on the team's overall performance in a competition.",
+ "C. It's an instruction to the rowers on how to navigate a specific turn.",
+ "D. It's a remark about the number of people on the bicycle following them."
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "0AkRWunFZPk",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which of the following describes the order of actions combined from both audio and visual information in the first 10 seconds?",
+ "Choice": [
+ "A. Speaking, clicker sound, holding fabric, rubbing fabric.",
+ "B. Holding fabric, speaking, clicker sound, rubbing fabric.",
+ "C. Speaking, holding fabric, rubbing fabric, clicker sound.",
+ "D. Holding fabric, clicker sound, speaking, rubbing fabric."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Based on the sounds and actions in the 10-20 second mark, what is the likely purpose of the 'blade' sound following the speaker's words?",
+ "Choice": [
+ "A. To further distress the zipper area.",
+ "B. To separate a piece of fabric for removal.",
+ "C. To sharpen the yellow tool being used.",
+ "D. To cut a loose thread on the jeans."
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "0bP2MH3LqvI",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What happens visually after the narrator says 'and propellers of many shapes and sizes proliferated'?",
+ "Choice": [
+ "A. The biplane is parked on a grass field.",
+ "B. The biplane takes off into the air.",
+ "C. Three uniformed men stand around the front of the biplane.",
+ "D. A wider shot of the biplane and ground crew is shown."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "0bRUkLsttto",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Based on the audio and visual content, what is the likely genre of the video?",
+ "Choice": [
+ "A. A promotional video for tourism in Times Square.",
+ "B. A documentary about urban development.",
+ "C. A short film depicting nightlife in a city.",
+ "D. A music video with hip-hop as its main genre."
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What visual event immediately precedes the close-up shot of a man holding a green bill?",
+ "Choice": [
+ "A. Three people standing on a dark platform with a city skyline.",
+ "B. Three men raising their index fingers on a street corner.",
+ "C. A man in a green jacket with yellow text.",
+ "D. A man in a blue jacket moving his arm to the music."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "0crG6jzAUpE",
+ "video_duration": "30s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Why does the figure in the black tuxedo appear surprised while looking at the screen?",
+ "Choice": [
+ "A. Because the Bat-Signal flashes and he hears a bell ringing.",
+ "B. Because another character in red and green enters the scene.",
+ "C. Because a siren sound goes off inside the ice base.",
+ "D. Because he hears a triumphant voice saying 'Defeated at last'."
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Following the sound of a ringing bell, what visual event immediately occurs?",
+ "Choice": [
+ "A. The red and green suited figure enters on a platform.",
+ "B. The close-up of the yellow computer screen with the Bat-Signal is visible.",
+ "C. The camera shows the full set of the toy base.",
+ "D. The tuxedo figure walks up to the flashing Bat-Signal screen."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What visual element is present when the first high-pitched whistle is heard?",
+ "Choice": [
+ "A. The yellow console with the light blue screen.",
+ "B. A close-up of the blue and tan toy figure with a cape.",
+ "C. The black toy figure holding a blue gun inside the ice base.",
+ "D. The full set of the toy base with cotton ball snow."
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Compare the context of the two instances of siren sounds in the video.",
+ "Choice": [
+ "A. Both sirens occur after a voice says 'better call for some special backup'.",
+ "B. The first siren is associated with the ice base, while the second is linked to the yellow console.",
+ "C. The first siren happens during the close-up of Robin, and the second when Batman is visible.",
+ "D. Both sirens occur while the figures are on the balcony."
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "0cZQm65sZjc",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What is the difference in the narrator's speech between showing the first key working and presenting the two keys?",
+ "Choice": [
+ "A. The first instance includes sound of keys clicking, the second includes music.",
+ "B. The first instance focuses on programming, the second on gratitude.",
+ "C. The tone changes from thankful to neutral.",
+ "D. The first instance is quieter than the second instance."
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "0EvpBtracsk",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What happened visually before the male voice started rapping?",
+ "Choice": [
+ "A. The 'Download it now!' text appeared",
+ "B. A blurred image with a round shape was shown",
+ "C. The product box was shown",
+ "D. The old online radio interface was displayed"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What is the context of the male voice rapping 'have you ever wanted something so bad?'?",
+ "Choice": [
+ "A. Highlighting a technical difficulty with the software.",
+ "B. Emphasizing the user's desire for a specific radio station.",
+ "C. Demonstrating a user reporting a problem.",
+ "D. Illustrating the variety of radio stations available."
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Compare the audio accompanying the initial product box with the audio accompanying the 'Download it now!' text.",
+ "Choice": [
+ "A. Both feature male vocals and guitars.",
+ "B. The first is energetic with percussion, the second is a sad melody.",
+ "C. The first is calmer instrumental, the second is more energetic with male vocals.",
+ "D. Both are upbeat pop songs with female vocals."
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "0F04c_rY4aw",
+ "video_duration": "30s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "What is the likely environment where the events in the video are taking place?",
+ "Choice": [
+ "A. A dog training facility with multiple trainers.",
+ "B. A quiet park with a single dog and owner.",
+ "C. A home environment with dogs and human.",
+ "D. A veterinary clinic during an examination."
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "0FjrveIENzI",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Considering the audio and visual cues, what is the likely purpose of the individuals handling the white wired headphones?",
+ "Choice": [
+ "A. They are inspecting the headphones for damage.",
+ "B. They are preparing to listen to music.",
+ "C. They are putting the headphones away for storage.",
+ "D. They are demonstrating the convenience of the headphones."
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "0g-KosNgMYA",
+ "video_duration": "30s",
+ "video_category": "Travel & Events",
+ "questions": [
+ {
+ "Question": "What is the likely setting when someone says, 'Here we are at Calway, let's go'?",
+ "Choice": [
+ "A. By a lake with paddle boats.",
+ "B. Inside a candy store.",
+ "C. Approaching a roller coaster.",
+ "D. On a vintage-style train."
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "0GFtqzOWnWg",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "After someone said 'This might be our most solid board', what happened next visually?",
+ "Choice": [
+ "A. Placing the book on the ground.",
+ "B. Drilling the skateboard.",
+ "C. Changing the drill battery.",
+ "D. Inspecting the skateboard."
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Based on the audio and visual, what were they likely doing with the drill and the book titled 'Harry Potter'?",
+ "Choice": [
+ "A. Creating a decorative hole.",
+ "B. Assembling a new skateboard.",
+ "C. Securing the book to the board.",
+ "D. Testing the drill's power."
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Why did the person express panic and say 'The battery's dying'?",
+ "Choice": [
+ "A. They were standing and holding a book.",
+ "B. They finished assembling the skateboard.",
+ "C. They were drilling into a book.",
+ "D. They were inspecting the skateboard."
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Considering the audio and visual information, what is the likely relationship between the person assembling the skateboard and the person drilling the book?",
+ "Choice": [
+ "A. They are unaware of each other's actions.",
+ "B. They are showcasing different hobbies.",
+ "C. They are the same person.",
+ "D. They are working on the same project."
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "0Gxn9FtaJFc",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "What visual action was occurring when the male voice began singing the line 'As the day comes to a close and night'?",
+ "Choice": [
+ "A. The woman was taking a selfie with her phone.",
+ "B. The man was sitting outdoors looking at his phone.",
+ "C. The man was wearing a blue shirt and glasses.",
+ "D. The man was grimacing with his eyes closed."
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0GNNFBrRz1E",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What happened visually after the sound of someone saying 'No damage' and before the sound of someone saying 'Coming up'?",
+ "Choice": [
+ "A. The man in the brown hoodie put down the hammer and picked up putty knives.",
+ "B. The person in the purple shirt started scraping the tiles.",
+ "C. The man in the black jacket started watching the countertop removal.",
+ "D. The person in the brown hoodie continued to detach the countertop edge with a hammer."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "0hUft9scxzY",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which of the following sequences of actions and sounds is accurate according to the video?",
+ "Choice": [
+ "A. Spreading cards - Sound of water - Showing a box.",
+ "B. Saying 'electric energy' - Opening a box - Sound of rustling.",
+ "C. Counting numbers - Opening a foil packet - Saying 'potion'.",
+ "D. Tearing a package - Saying 'stop at nine' - Counting numbers."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What is the context surrounding the act of a hand spreading out four cards on the table?",
+ "Choice": [
+ "A. It happens during the reading of card names like 'Machete' and the sound of rustling.",
+ "B. It is followed by the tearing of a yellow package and the sound of water.",
+ "C. It happens while someone is saying 'stop at nine' and showing a plant character.",
+ "D. It occurs after a hand pulls a pack from a box and while someone is counting."
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "0HzbpwB3xDk",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Which of the following sequences of events accurately reflects the order presented in the video?",
+ "Choice": [
+ "A. The close-up of the red machine part happened before the \"piccolo flute\" comment.",
+ "B. The sound of an object being moved occurred, then the lathe was shown working, followed by the man in brown speaking.",
+ "C. The drill drilled the cylinder, then the man in beige gestured, followed by the speaker mentioning \"Shargle Factory\".",
+ "D. The man in beige spoke, then the drill retracted, followed by the machine sound."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What action and sound event were occurring simultaneously in the initial part of the video?",
+ "Choice": [
+ "A. Metal shavings were visible on the lathe while a machine sound was present.",
+ "B. A golden cylinder was being drilled while a man said \"Okay\".",
+ "C. The bald man held a metal piece while saying \"Right here at the Shargle Factory\".",
+ "D. A person in red was near the machine while mentioned a \"piccolo flute\"."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0izHOfrwPn4",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Why does the man look excited and point at the toy in the 10-20s section after the 'pop' sound?",
+ "Choice": [
+ "A. Because the man clearing his throat was a signal for the pig toy to pop.",
+ "B. Because the counting down 'one, two, three, go' led to the pig toy to pop.",
+ "C. Because the toy car sound indicated an unexpected event.",
+ "D. Because the laughter indicated someone else caused the pig toy to pop."
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What visual element is prominently present when the playful male voice says 'Let's roll!'?",
+ "Choice": [
+ "A. The toy pig with a chef's hat and red bib.",
+ "B. The word 'NEVER!' displayed over the pig toy.",
+ "C. A man with dark hair and a blue t-shirt.",
+ "D. A collection of red, yellow, green, and purple circular pieces."
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0JbGxIR8JTk",
+ "video_duration": "30s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "While viewing the multi-story corner building with several storefronts and a yellow traffic light, what was a notable characteristic of the audio at that time?",
+ "Choice": [
+ "A. The distinct sound of car horns.",
+ "B. A male singer performing lyrics in a non-English language.",
+ "C. Upbeat instrumental music with acoustic guitar and percussion.",
+ "D. Silence, indicating a break in the music."
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What specific audio element was present when the camera initially focused on the graffiti art on the gray building?",
+ "Choice": [
+ "A. Sound of construction",
+ "B. Male vocalist singing in a foreign language",
+ "C. No prominent audio",
+ "D. Rhythmic drumming"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0J3wJhQVOII",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What item mentioned in the conversation at the beginning is visually present in the shopping cart?",
+ "Choice": [
+ "A. Canned peaches",
+ "B. Pure Cane Sugar",
+ "C. Packaged noodles",
+ "D. Almond milk"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "In the third segment, the man's questions about 'sewing and school' are likely related to:",
+ "Choice": [
+ "A. A debate competition they are preparing for.",
+ "B. Their children's activities and their own teaching profession.",
+ "C. A class the women are taking at the grocery store.",
+ "D. An interview about their hobbies and professions."
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "0K0pXiZrUJA",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "Considering the audio and visual information, what is the most likely activity the players are engaged in?",
+ "Choice": [
+ "A. Observing the different types of eggs to study dinosaur behavior.",
+ "B. Exploring the environment to find resources like prime meat from trikes.",
+ "C. Farming the green grass-like plants for crafting purposes.",
+ "D. Building a base near the water where they can easily access fish."
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What is the context surrounding the discussion about item counts ('I've got 30', 'I've got 17')?",
+ "Choice": [
+ "A. They are strategizing how to approach the 'brontos' mentioned earlier.",
+ "B. They are checking their resources, likely in preparation for an activity involving tranquilizer darts.",
+ "C. They are planning to hunt the plesiosaur seen attacking the fish underwater.",
+ "D. They are deciding who will collect the 'prime' from the 'trike' they spotted."
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Why might the player be looking at the shoreline while holding a spear, immediately before discussing item counts?",
+ "Choice": [
+ "A. Because they are anticipating a potential threat from the dinosaurs they see there, as discussed.",
+ "B. Because they are preparing to hunt the fish they saw earlier, as discussed.",
+ "C. Because they are looking for more of the green grass-like plants they saw initially.",
+ "D. Because they are trying to find the trike egg mentioned in the previous audio."
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "0LE6Ll1rVlg",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "What action and sound event are occurring simultaneously around the beginning of the video?",
+ "Choice": [
+ "A. A man is singing a rock song inside a car.",
+ "B. A car is driving by while classical music is playing.",
+ "C. A woman is dancing silently in a dark room.",
+ "D. A man is speaking calmly while playing the drums."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0jnvb2H25_Q",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Why does the young man appear confused between 0-10 seconds?",
+ "Choice": [
+ "A. Because the woman is giggling after they both look up.",
+ "B. Because he does not recognize the song being hummed.",
+ "C. Because the woman starts singing loudly.",
+ "D. Because the woman is speaking in a questioning tone."
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "0lJyHuTVMNY",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which action and statement occurred concurrently?",
+ "Choice": [
+ "A. The woman eats food while saying 'Mmm'.",
+ "B. The woman tastes food while saying 'Next we have this tray'.",
+ "C. The foil package is opened while someone says 'Maybe it's meat?'.",
+ "D. A can is opened while someone says 'Reminds me of blood sausage'."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0loP4WNnL2k",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "What sound event is occurring when the company logo with the cityscape first appears?",
+ "Choice": [
+ "A. Melancholic electronic music and a passing train.",
+ "B. A male speaker singing 'You know I love you baby' and the sound of a passing truck.",
+ "C. Only the sound of a passing truck.",
+ "D. Upbeat hip-hop music and a male speaker singing 'It's just that I trust no one'."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What happens visually immediately after the black and white company logo disappears around the 10-20s mark?",
+ "Choice": [
+ "A. The young man is standing with another person on a city street as melancholic music plays.",
+ "B. The young man is walking alone in an outdoor environment while music plays.",
+ "C. The young man is standing in front of a large arch while a male speaker sings 'It's just that I trust no one'.",
+ "D. The young man is sitting on a wooden bench while melancholic electronic music continues."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "0Mba3BS1oRM",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What action is being verbally described while a red and gray gear-shaped tool is first being shown?",
+ "Choice": [
+ "A. Describing the function of a black pipe while showing a hammer.",
+ "B. Explaining the need for an oil flapper while picking up the tool",
+ "C. Discussing JB weld application while displaying a bucket of nuts and bolts.",
+ "D. Mentioning a small hole while pointing at a metallic stick-like tool."
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0m9-5BkL4Mc",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Why might the visual shift from a close-up of the child's face to an outdoor audience overlay?",
+ "Choice": [
+ "A. To emphasize the child's emotional reaction to the audience.",
+ "B. To signify a transition between different parts of the performance.",
+ "C. To provide the audience's reaction to the ongoing children's song.",
+ "D. To indicate a change of location for the performance."
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "During the initial segment of the video, what musical genre is being played while the child is holding a patterned instrument?",
+ "Choice": [
+ "A. Electronic music",
+ "B. Pop music",
+ "C. Classical music",
+ "D. Country music"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0LTmV_dOmmo",
+ "video_duration": "30s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "What happens immediately after the sound of a door being moved or placed down?",
+ "Choice": [
+ "A. The woman begins removing the door from its hinges.",
+ "B. The dotted lines appear on the dog silhouette.",
+ "C. The close-up of the metal door hinge is shown.",
+ "D. The labels 'Top' and 'Bottom' appear on the door."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "0mZQ6Q-viPw",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What action was being performed when the heavy breathing sound was first heard?",
+ "Choice": [
+ "A. The dog's mouth was being opened.",
+ "B. The camera was panning across the dog's body.",
+ "C. Text was appearing on the screen.",
+ "D. A hand was petting the dog."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0NW8rsCj6xQ",
+ "video_duration": "30s",
+ "video_category": "Comedy",
+ "questions": [
+ {
+ "Question": "Which of the following sequences accurately reflects the order of events involving the person in the black shirt?",
+ "Choice": [
+ "A. Puts hands on head, hits ball, laughs.",
+ "B. Moves away from table, puts hands on head, hits ball.",
+ "C. Hits ball, laughs, puts hands on head.",
+ "D. Laughs, hits ball, moves away from table."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Why does the person in the black shirt begin laughing?",
+ "Choice": [
+ "A. Because he won the ping pong match.",
+ "B. Because the ping pong ball hit the ceiling.",
+ "C. Because the other player complimented his shirt.",
+ "D. Because the other player hit the paddle into the air."
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "0onKcFY1Tu0",
+ "video_duration": "30s",
+ "video_category": "Nonprofits & Activism",
+ "questions": [
+ {
+ "Question": "Why might the men be observing the frog with such focus?",
+ "Choice": [
+ "A. They are scientists studying its unique vocal patterns.",
+ "B. Its vibrant colors are distracting them from their discussion.",
+ "C. Its presence reinforces their conversation about conservation.",
+ "D. They are trying to determine if it's poisonous or not."
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What happens immediately after the man on the left finishes speaking?",
+ "Choice": [
+ "A. Both men simultaneously uncross their arms.",
+ "B. The outdoor scene through the doorway becomes clearer.",
+ "C. The frog makes a distinct croaking sound.",
+ "D. The man on the right starts talking about the rainforest."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Based on their attire and the setting, what is a likely context for this interaction?",
+ "Choice": [
+ "A. They are zookeepers discussing the frog's habitat needs.",
+ "B. They are colleagues casually observing a found animal.",
+ "C. They are environmental activists at a fundraising event.",
+ "D. They are park rangers on a routine wildlife inspection."
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "0oudYrPGNN8",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What is the last tool mentioned verbally before the camera focuses on the bag of screws?",
+ "Choice": [
+ "A. The drill bit for concrete.",
+ "B. The small orange level.",
+ "C. The security bit for the screwdriver.",
+ "D. The Philips head screwdriver."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "0QdEYUt4LL8",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual element was the focus when the speaker mentioned something going 'right back in the water'?",
+ "Choice": [
+ "A. The black box with red lines.",
+ "B. The flame being directed onto the wall.",
+ "C. The metallic pipe emitting a pink-red flame.",
+ "D. The transparent container filling with liquid."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "0QqQ1j1vBgw",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual happened immediately before the audio about NASA's Lewis Research Center?",
+ "Choice": [
+ "A. A close-up of a propeller with red and blue stripes.",
+ "B. A plane parked on an unpaved surface.",
+ "C. An aerial view of a multi-bladed propeller.",
+ "D. A man in a beige shirt adjusting a dial."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What visual element is being shown around the time the speaker mentions 'NASA asked'?",
+ "Choice": [
+ "A. A close-up of a rotating propeller.",
+ "B. Workers in a control room.",
+ "C. A man hand-starting an airplane propeller.",
+ "D. An aerial view of a white plane."
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0QYedLfOwcI",
+ "video_duration": "30s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "What happens visually right after the child says 'Are you farting?'?",
+ "Choice": [
+ "A. The dog stares directly at the camera.",
+ "B. The dog bends down suddenly.",
+ "C. The camera pans across the room.",
+ "D. The dog runs away and sits under the table."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What might be the reason for the dog's perplexed expression at the beginning of the video?",
+ "Choice": [
+ "A. The dog heard someone sniffing and saying 'Huh?'",
+ "B. The dog was looking at the red ornaments.",
+ "C. The dog was focused on its reflection.",
+ "D. The dog saw a Christmas stocking."
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "0qZ54ovyEWQ",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Why does the person tap the edge of the card deck on the green table while the speaker is talking about showing 'their card'?",
+ "Choice": [
+ "A. To create a distraction before performing a move.",
+ "B. To demonstrate the color of the table.",
+ "C. To emphasize the sound of the cards.",
+ "D. To visually highlight the process of show the card."
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "0rzB4A1PjnA",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "What is the overall context of the video based on the combined audio and visual information?",
+ "Choice": [
+ "A. An overview of the different panels and features within FL Studio.",
+ "B. A troubleshooting guide for an audio output issue in FL Studio.",
+ "C. A tutorial on creating music using a specific set of drum samples.",
+ "D. A demonstration of the visual interface of the FL Studio software."
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "0SLv6CmZas8",
+ "video_duration": "30s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What was being displayed visually when the male voice stated, 'We're committed to investing in wind energy'?",
+ "Choice": [
+ "A. A woman moving her head.",
+ "B. A child holding a black device.",
+ "C. Wind turbines at sea.",
+ "D. The ScottishPower logo."
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0tgNg0MHgxc",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What sound occurred when the woman with long wavy blonde hair clapped the top of the log?",
+ "Choice": [
+ "A. A loud crash sound.",
+ "B. A sound of a heavy object falling or dropping.",
+ "C. A female speaker saying 'Das wäre geschafft'.",
+ "D. A smaller crash sound."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0rz3h0ghprk",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "Considering both the visual and audio information in the 20-30 second segment, what can be inferred about the player's current objective?",
+ "Choice": [
+ "A. The player is cautious and moving away, as suggested by the speech and movement away.",
+ "B. The player is preparing to fight the animals, as suggested by the gun and the 'REC' overlay.",
+ "C. The player is hunting the pigs, indicated by the gun and calm speech.",
+ "D. The player intends to interact with the sheep based on the calm tone."
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "0ylr6p_m4p0",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What action coincides with the speaker saying 'And if you don't want to put it on your face, but'?",
+ "Choice": [
+ "A. The woman is touching her temple with her finger.",
+ "B. The woman is opening the blue and white package.",
+ "C. The woman is holding the closed package towards the camera.",
+ "D. The woman is displaying the package with both hands."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0YTieIiZNN4",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What action is the first man performing when he says 'Are you ready to rock?'?",
+ "Choice": [
+ "A. He is looking directly at the camera.",
+ "B. He is actively shooting the rifle.",
+ "C. He is pointing the rifle upwards.",
+ "D. He is reloading the black rifle."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Why might there be a clicking sound in the second segment (10-20s) before the sound of a gunshot?",
+ "Choice": [
+ "A. It's the sound of the magazine being inserted.",
+ "B. It's the sound of the safety being engaged.",
+ "C. It's the sound of the bullet hitting the target.",
+ "D. It's the sound of adjusting the scope on the rifle."
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "0zgffMTkGtU",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "What is the likely setting or event taking place based on the combination of visual and audio cues in the first 10 seconds?",
+ "Choice": [
+ "A. An interview with a band member on a brightly lit set.",
+ "B. A recording session for a podcast with background music.",
+ "C. A live musical performance with stage instructions.",
+ "D. A rehearsal for a play with improvised dialogue."
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "0ZN2HKsFg4A",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What is the overall context of the video based on the combined visual and audio information?",
+ "Choice": [
+ "A. It's an instructional video on how to hold a cricket bat.",
+ "B. It's an advertisement for a cricket coaching website.",
+ "C. It's a recording of a live cricket match.",
+ "D. It's a video about the history of cricket equipment."
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What visual element is present when the speaker mentions checking for 'Vs'?",
+ "Choice": [
+ "A. Text on a black screen saying 'KEY POINTS'.",
+ "B. Close-up view of hands holding a cricket bat.",
+ "C. Three men standing on a green field.",
+ "D. A red cricket ball with white text."
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0yxJzCQW5zI",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What is the woman doing while stating 'so we're gonna get started here'?",
+ "Choice": [
+ "A. Picking up a container with both hands.",
+ "B. Looking down at a container of paintbrushes.",
+ "C. Holding a blue hair clip.",
+ "D. Looking directly at the camera."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0_1BfYtffzg",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What action was occurring visually when the speaker said, 'I don't care which one it is'?",
+ "Choice": [
+ "A. The person was opening a translucent plastic bag.",
+ "B. The person was taking a pink item out of a bag.",
+ "C. The person was manipulating a tube-shaped object.",
+ "D. The person was holding a bag of dried food."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "10lWpHyN0Ok",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Based on the audio and visual cues, what might 'Number 93 has five today' refer to?",
+ "Choice": [
+ "A. The number of tricks the man with shoulder-length hair performed.",
+ "B. The number of times the cat meowed that day.",
+ "C. The number of glasses the man with shoulder-length hair broke.",
+ "D. The number of glasses the man with chin-length hair shattered."
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "10DM6WINIS8",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What happened visually and audibly after the camera zoomed in on the sticker sheet focusing on the 'Monster Jam' text?",
+ "Choice": [
+ "A. The sound of keys jingling and a green monster truck being placed.",
+ "B. The man saying \"I see the eyes of the dragon\" and the camera zooming out.",
+ "C. The sound of someone handling stickers and a child saying \"A stone crusher\".",
+ "D. The man saying \"I see the eyes of the dragon\" and the camera zooming in."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "2fi0m8ei_B4",
+ "video_duration": "30s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "What sound event occurred around the same time the train doors opened to reveal the station platform with blue support beams?",
+ "Choice": [
+ "A. The distinct sound of a car horn.",
+ "B. An automated voice announcing the next stop.",
+ "C. Two notes of a wind instrument.",
+ "D. The sound of a car door closing."
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which of the following sequences accurately reflects the order of events involving train doors and distinct sounds?",
+ "Choice": [
+ "A. Train doors opened (man on platform) -> Car door closing -> Train doors opened (UNION STATION).",
+ "B. Playful music -> Train doors closed (light rail) -> Automated voice announcement.",
+ "C. Wind instrument -> Train doors closed (blue beams) -> Playful music.",
+ "D. Train doors opened (black pillar) -> Car horn -> Train doors closed."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "2sGnD9sd1xM",
+ "video_duration": "30s",
+ "video_category": "Comedy",
+ "questions": [
+ {
+ "Question": "Which event occurred immediately after the man said 'Pull it up. It's down to that motor.'?",
+ "Choice": [
+ "A. The sound of paper being crumpled was heard.",
+ "B. The fan was thrown into the air.",
+ "C. A brown, cylindrical object fell out.",
+ "D. The person displayed the broken outer casing."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What can be inferred about the person's emotional state at the end of the video?",
+ "Choice": [
+ "A. He was relieved that the task of disassembling the fan was over.",
+ "B. He found the act of throwing the fan and its result amusing.",
+ "C. He was nervously laughing after almost cutting himself on the fan.",
+ "D. He was frustrated with the difficulty of disassembling the fan."
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "2jxgoUnZ8AI",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What is the order in which the person interacted with the items on the counter while speaking about necessary ingredients?",
+ "Choice": [
+ "A. Picks up toothpaste, picks up shampoo, stirs slime.",
+ "B. Picks up shampoo, stirs slime, picks up toothpaste.",
+ "C. Stirs slime, picks up shampoo, picks up toothpaste.",
+ "D. Picks up toothpaste, stirs slime, picks up shampoo."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What is the difference in the actions involving the sticks in the first (0-10s) and third (20-30s) sections of the video?",
+ "Choice": [
+ "A. In both sections, the stick is used for pouring liquids while the speaker talks about containers.",
+ "B. In the first section, the stick is used for stirring while in the third section, it's only used for tapping.",
+ "C. In both sections, the stick is primarily used for tapping on the container while the speaker mentions specific product names.",
+ "D. In the first section, the stick is used for stirring while in the third section, it's used for both stirring and tapping, as the speaker mentions toothpaste."
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "34vzrZ2SlQo",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Why does the mouse exclaim 'I can't see a thing'?",
+ "Choice": [
+ "A. Because the two sheep are too tall in front of him.",
+ "B. Because the knight helmet's visor is obstructing its vision.",
+ "C. Because the small white cat is blocking its view.",
+ "D. Because the room behind the green door is too dark to see."
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Compare the expressions of the first mouse near the door (0-10s) and when saying 'I can't see a thing' (20-30s).",
+ "Choice": [
+ "A. The first expression is shocked, while the second is panicked.",
+ "B. Both expressions are happy and cheerful.",
+ "C. The first expression is angry, while the second is surprised.",
+ "D. Both expressions show confusion and uncertainty."
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What is the context of the statement 'You may be seated, Sir Binu'?",
+ "Choice": [
+ "A. It is said when the mouse is standing in front of an empty blue background.",
+ "B. It is said when the first mouse is surprised by something behind the green door.",
+ "C. It is said when the mouse is offering the purple mug to another character.",
+ "D. It is said when the main mouse is preparing to sit on top of two sheep."
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "2m9rVrVNa6A",
+ "video_duration": "30s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "What action was being performed when the speaker mentioned aiming for a 'flush' result?",
+ "Choice": [
+ "A. Scattering tools on the working table.",
+ "B. Sanding the edge of a rectangular model part.",
+ "C. Holding the corner of the model's side.",
+ "D. Pressing a flat plastic piece onto the toy's side."
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "2pHfgh_G1yo",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What is the likely context surrounding the police officer holding a white paddle with a blue circle?",
+ "Choice": [
+ "A. He is using it as a toy to distract the woman while another officer approaches.",
+ "B. He is using it to direct traffic while casually chatting with bystanders.",
+ "C. He is using it as a tool during a tense interaction with a potentially armed individual.",
+ "D. He is demonstrating how to play ping pong to the other Lego figures."
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What was the order of events involving the woman with brown hair and the police?",
+ "Choice": [
+ "A. She held a phone, then she held up two fingers, then the police arrived.",
+ "B. She held a sword, then the police arrived, then she raised her hands.",
+ "C. She raised her hands, then the police arrived, then she held a sword.",
+ "D. The police arrived, then she held a phone, then she held up two fingers."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "3oxbegouR_A",
+ "video_duration": "30s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "Which action happens directly after the speaker says 'Hello, this is just a little bit higher'?",
+ "Choice": [
+ "A. The person rubs the sanding tool back and forth.",
+ "B. The person starts wiping the metal with a cloth.",
+ "C. The person picks up a green sanding tool.",
+ "D. The person touches the metal surface with their bare hand."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "3fnVQT2TnJs",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Why did the man mention the 'night before you got trash' while the freezer lid was open?",
+ "Choice": [
+ "A. Because the freezer appeared full of trash.",
+ "B. To suggest an optimal time to defrost before emptying.",
+ "C. Because the sound of trash trucks was audible.",
+ "D. To emphasize the inconvenience of cleaning."
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What is the state of the inside of the white chest freezer when the man makes the statement about trash day?",
+ "Choice": [
+ "A. Containing a visible christmas decoration.",
+ "B. Partially filled with wrapped bags.",
+ "C. Empty except for a defrosting device.",
+ "D. Showing signs of recent scrubbing."
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "2tXE7snh6Ak",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What is the correct sequence of events involving the cat in the box during the first 20 seconds?",
+ "Choice": [
+ "A. Cat sits calmly, then rolls around, then is pulled by the leash.",
+ "B. Cat is facing the camera, then rolls around, then bats at a red leash.",
+ "C. Cat is pulled by the leash, then bats at a red leash, then sits calmly.",
+ "D. Cat is facing the camera, then bats at a red leash, then rolls around."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "2__T6Q8rCCA",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which of the following sequences accurately reflects the order of visual and audio events?",
+ "Choice": [
+ "A. Woman examining honeycomb -> Greek speech about royal jelly -> Buzzing sound.",
+ "B. Greek speech about royal jelly -> Buzzing sound -> Woman examining honeycomb.",
+ "C. Wind sound -> Person scraping honeycomb -> Greek speech about royal jelly.",
+ "D. Buzzing sound -> Woman examining honeycomb -> Greek speech about royal jelly."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "3dUo62RsD00",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Compare the visual focus when only instrumental rock music is playing to when vocals are introduced.",
+ "Choice": [
+ "A. The focus shifts from the drummer to the bassist.",
+ "B. The focus remains solely on the stage lighting.",
+ "C. The initial focus is on the woman, then includes her singing.",
+ "D. The focus shifts from a single guitarist to a full band shot."
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "3Ab0sgelCEw",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Based on the visual and audio cues, what can be inferred about the location of the video?",
+ "Choice": [
+ "A. It is likely an outdoor area with trains passing by.",
+ "B. It is likely a home with multiple fire alarms.",
+ "C. It is likely an office with a door closing.",
+ "D. It is likely a mechanical room inside a building."
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "3c-vZ4GpIKs",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which of the following sequences accurately reflects the order of events presented in the video?",
+ "Choice": [
+ "A. Describing wing tip features, kite laid on grass for the first time, then man showing the flying kite.",
+ "B. Kite laid on grass for the first time, man showing the flying kite, then describing wing tip features.",
+ "C. Showing a flying kite, describing wing tip features, then talking about kite durability.",
+ "D. Man talking about kite durability, then showing a flying kite, finally kite laid on grass for the first time."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "2aO0HKIP3vI",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What action was visually aligned with the instruction to breathe deeply?",
+ "Choice": [
+ "A. The doctor moved the stethoscope across the patient's chest.",
+ "B. The doctor placed the stethoscope in the aortic area.",
+ "C. The patient remained still while the sound faded out.",
+ "D. The camera zoomed in on the stethoscope's circular shape."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Based on the combined audio and visual information, what can be inferred about the patient's position during the 'A2' and 'P2' explanation?",
+ "Choice": [
+ "A. The patient was most likely lying down on the examination table.",
+ "B. The patient was likely sitting upright during this explanation.",
+ "C. The patient was probably standing for a breathing capacity test.",
+ "D. The patient was perhaps walking around while the doctor listened."
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Compare the focus of the audio in the first (stethoscope on the left) and second (stethoscope on the right) segments.",
+ "Choice": [
+ "A. The first segment focuses on heartbeat sounds, the second on breathing instructions.",
+ "B. Both segments equally emphasize both breathing instructions and heart component details.",
+ "C. The first segment includes heartbeat sounds and breathing instructions, the second on specific heart components.",
+ "D. Both segments primarily focus on detailed medical explanations."
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "2q4_wFFFG_U",
+ "video_duration": "30s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "During the initial news broadcast, what was being discussed when the monitor screen showed the phrase '午後4時頃の'?",
+ "Choice": [
+ "A. A male speaker was describing different types of underwater faults.",
+ "B. A male speaker was explaining the danger of being near the water.",
+ "C. A female speaker was saying they are prepared for any situation.",
+ "D. A male speaker was stating the impossibility of predicting outcomes."
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What happened visually after the male speaker mentioned that a crescent-shaped beach would be the worst place?",
+ "Choice": [
+ "A. A man in a striped shirt gestured to indicate a wide area.",
+ "B. Two female news anchors were looking straight ahead at their desk.",
+ "C. The camera focused on a male news anchor in a striped shirt.",
+ "D. A man was shown against a background of the ocean and sky."
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "2mw1UMcc_Ys",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What happens visually around the same time the audio mentions the baby 'trying to learn how right now try to how to climb up'?",
+ "Choice": [
+ "A. The woman is helping the baby stand briefly on the mat.",
+ "B. The woman is describing the baby's rolling abilities.",
+ "C. The baby is reaching out its arms and making sounds.",
+ "D. The baby is flailing its arms in the woman's lap."
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "2XR54m3kvn8",
+ "video_duration": "30s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "What sound event was heard while a dark-colored animal was observed moving within the translucent mesh?",
+ "Choice": [
+ "A. A male speaker commenting on the animal trying to escape.",
+ "B. The sound of bamboo stalks being adjusted or moved.",
+ "C. A male speaker discussing the clarity of the water.",
+ "D. The distinct sound of water splashing forcefully."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "3AFdUihWPCk",
+ "video_duration": "30s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "What happens when the man says, \"Look at all the crud that's coming off.\"?",
+ "Choice": [
+ "A. He wipes the windshield with a blue cloth.",
+ "B. He sprays the windshield with a bottle.",
+ "C. He wipes the black wiper blade.",
+ "D. He holds up a cloth with dark spots."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What is the immediate context surrounding the statement, \"pick the blades up and clean ...\"?",
+ "Choice": [
+ "A. The man is demonstrating how to wipe the wiper blade.",
+ "B. The man is explaining the importance of cleaning the blades to prevent damage.",
+ "C. The man is holding up a dirty cloth and showing it to the camera.",
+ "D. The man is spraying cleaner onto the windshield and wiping it."
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "3IW7578PicA",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Based on the visual actions and the audio of someone saying 'go ahead and add it in', what can be inferred about the 'it' being added?",
+ "Choice": [
+ "A. It is likely more spices being added to the vegetables.",
+ "B. It is likely water being added to help the food cook.",
+ "C. It is likely cooked eggs being poured into a pan with browned food.",
+ "D. It is likely the whisked eggs being added to the initial mixture."
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "3CnwUQZ6q78",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What was the man doing visually around the time he declared the metal 'unbreakable'?",
+ "Choice": [
+ "A. He was initially bending the metal.",
+ "B. He was removing the metal from the vise.",
+ "C. He was actively hammering the metal piece.",
+ "D. He was displaying the metal to the camera."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "3posAjJWEi4",
+ "video_duration": "30s",
+ "video_category": "Travel & Events",
+ "questions": [
+ {
+ "Question": "As someone says 'Yeah, that was, well, one more', what visual element is primarily visible?",
+ "Choice": [
+ "A. Raindrops are visible on the camera lens.",
+ "B. A small vehicle near a tunnel exit.",
+ "C. Lush green vegetation on the sides of a narrow channel.",
+ "D. Smooth, orange-brown sculpted walls."
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "2zdYIRTUQYE",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which of the following visual and audio events happened simultaneously?",
+ "Choice": [
+ "A. The black train looks left and the sound of a crash.",
+ "B. The red hook dropped and the sound of speech.",
+ "C. The gray train has a sad face when the steam whistle sound occurs.",
+ "D. The yellow construction vehicle speaks and the 'Sorry boss' line is said."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the sequence of audio and visual events?",
+ "Choice": [
+ "A. 'Sorry boss' speech -> train yard with various trains -> steam whistle sound -> crash sound",
+ "B. Red train with happy face speaks -> 'Is it a good whistle' speech -> brown train with confused expression -> 'Right a little' speech.",
+ "C. 'Right a little' speech -> crash sound -> 'Sorry boss' speech -> 'James chuffed' speech",
+ "D. 'James chuffed' speech -> 'It's the best' speech -> 'Right a little' speech -> yellow construction vehicle speaks"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "2t4CkOvERAI",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What occurs visually and audibly before the purple-haired character's face is shown on a black panel with the text \"Yes, I was.\"?",
+ "Choice": [
+ "A. Visually, the scene shows a desert landscape, and audibly, someone says \"He continues to be a jerk.\".",
+ "B. Visually, the dark-haired character asks about defeating Frieza, and audibly, a man discusses time travel.",
+ "C. Visually, the black-haired character talks about Kakarot's kid, and audibly, someone cheerfully announces someone's arrival.",
+ "D. Visually, new characters appear, and audibly, someone asks \"Who could it be?\"."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Considering the audio and visual content, what can be inferred about the relationship between the speaker in the 0-10s audio and the video game being shown?",
+ "Choice": [
+ "A. The speaker is probably a player experiencing technical difficulties or reflecting on game mechanics, much like the in-game dialogue.",
+ "B. The speaker is likely someone unrelated to the game, simply making observations about time travel in general.",
+ "C. The speaker is definitively a character within the game, as evidenced by the direct quotes displayed in text boxes.",
+ "D. The speaker is likely a commentator providing background information on the game's lore, similar to the characters explaining Saiyan history."
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Compare the tone of the speech in the 0-10s audio segment with the visual expressions of the characters during that time.",
+ "Choice": [
+ "A. The cheerful tone in the audio matches the characters' friendly interactions.",
+ "B. The annoyed tone in the audio reflects the characters' calm and collected demeanor.",
+ "C. The conversational tone contrasts with the serious dialogue displayed in text boxes.",
+ "D. The conversational tone aligns with the angry expressions as both indicate conflict."
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "30HtOudq2v8",
+ "video_duration": "30s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "What is the order of events when the person says, 'Strikes'?",
+ "Choice": [
+ "A. The dog looks at the camera, then the person speaks, then the dog barks loudly.",
+ "B. The dog sits up, then bites the covers, then a squeak is heard.",
+ "C. The person speaks, then the dog bites the covers, then no distinct sound is explicitly heard.",
+ "D. A squeak is heard, then the dog bites the covers, then the person speaks."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "36E3XXqab40",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What happens immediately after the chicken is rinsed under running water?",
+ "Choice": [
+ "A. The woman stuffs a spring onion and ginger into the chicken.",
+ "B. The woman rubs salt on the chicken's skin.",
+ "C. The woman puts the chicken into a metal pot.",
+ "D. The woman stuffs herbs and spices inside the chicken."
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Why are spring onion and ginger inserted into the chicken, according to the audio and visual context?",
+ "Choice": [
+ "A. To enhance the flavor of the chicken as it cooks.",
+ "B. To tenderize the meat before cooking in the pot.",
+ "C. To visually inspect the cavity of the chicken before adding water.",
+ "D. To eliminate any remaining poultry smell during the cooking process."
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "3GtvLj8K1F4",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "What action was the person doing when the video starts?",
+ "Choice": [
+ "A. Holding a book near the window.",
+ "B. Sitting silently on the chair.",
+ "C. Playing a harmonica.",
+ "D. Playing with his hand."
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Based on the audio and visual information, what is a likely subject of the music being played?",
+ "Choice": [
+ "A. A specific section of a video game.",
+ "B. A lively street parade happening outside the window.",
+ "C. A sad memory the person is reflecting on.",
+ "D. Background music for a cooking show."
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "3RZlNwEv3XI",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Considering both the visuals and audio, what could be inferred about the person speaking in the initial seconds?",
+ "Choice": [
+ "A. They are operating a boat and navigating through the muddy waters.",
+ "B. They are likely alone and observing the environment from a distance.",
+ "C. They are probably fishing and anticipating catching something in the river.",
+ "D. They are near a source of flowing water, possibly doing laundry."
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "3NwKXCGLjAk",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Considering the audio and visual content of the initial segment, what can be inferred about the sound in 10-20s?",
+ "Choice": [
+ "A. It was likely a part of the car, as the setting is a parking garage.",
+ "B. It could have been related to the dance demonstration, perhaps a shoe hitting the concrete.",
+ "C. It was the person's detailed explanation of the slow motion demonstration.",
+ "D. It was likely a microphone being adjusted by the speaker."
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "2t8rYnARZuI",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What is being described by the woman as water flows through it?",
+ "Choice": [
+ "A. Copper containers gaining patina.",
+ "B. Wooden barrels storing rainwater.",
+ "C. A silver chain reflecting sunlight.",
+ "D. A plastic pipe collecting water."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "3klIBnvMYhg",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "After the sound of running water stops, what does the man say he will do, and what visual action does he perform immediately after speaking?",
+ "Choice": [
+ "A. He says he will agitate the water, then he shakes the glass.",
+ "B. He says he will turn the machine on, then he presses a button.",
+ "C. He says he will do a test, then he picks up a bottle.",
+ "D. He says he will measure the pH, then he puts drops in a glass."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "36v-AEu7uAU",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What action coincides with the speaker mentioning a specific smell?",
+ "Choice": [
+ "A. The camera shows a close-up of the beekeeper's gloved hand.",
+ "B. The camera focuses on bees flying around his mask.",
+ "C. The camera is directed towards the sky showing trees.",
+ "D. The camera pans across a field of flowers."
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which visual element immediately precedes the speaker comparing a smell to burnt maple syrup?",
+ "Choice": [
+ "A. The camera focusing on the base of a building.",
+ "B. A view of trees against the sky.",
+ "C. A wide shot of the beekeeper in the suit.",
+ "D. A close-up of the bees on the beekeeper's face."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "How does the speaker's tone when describing the smell of \"yellow\" compare to the visual of the bees?",
+ "Choice": [
+ "A. The serious tone aligns with the static visual of the person.",
+ "B. The excited tone complements the colorful appearance of the bees.",
+ "C. The sarcastic tone matches the aggressive buzzing of the bees.",
+ "D. The conversational tone contrasts with the visual of many active insects."
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "2qZqgRg5E6o",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "What action was the game character performing visually when the distinct sound of a gun cocking was heard?",
+ "Choice": [
+ "A. Speaking the phrase 'there's two of them'.",
+ "B. Exploring a gray building through doorways.",
+ "C. Standing behind a broken barricade.",
+ "D. Standing in front of a wooden barricade."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "2hxIN0MTKbs",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Why might the music change to a more mysterious tone around the time the man with thinning brown hair appears?",
+ "Choice": [
+ "A. Because the man is speaking about audio and video.",
+ "B. To match the abstract blue face in the background.",
+ "C. To create a distinct artistic vibe.",
+ "D. To emphasize the surprise on the man's face."
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "3lsHs4MnL5w",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "In what order do the following events occur within the video?",
+ "Choice": [
+ "A. The screen shows recent calls, then the male speaker laughs, then the webpage not available error shows.",
+ "B. The male speaker laughs, then the screen shows recent calls, then the webpage not available error shows.",
+ "C. The webpage not available error shows, then the male speaker laughs, then the screen shows recent calls.",
+ "D. The screen shows recent calls, then the webpage not available error shows, then the male speaker laughs."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "3gwZdkjpPds",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Considering the attire and the vocal style, what is being presented?",
+ "Choice": [
+ "A. A modern pop concert.",
+ "B. A promotional segment for an opera.",
+ "C. A rehearsal for a play.",
+ "D. A documentary about stage costumes."
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What is the relationship between the appearance of the mask and the ongoing audio?",
+ "Choice": [
+ "A. The mask appears as the singing stops.",
+ "B. The mask appears while powerful operatic singing is present.",
+ "C. The mask appearance is synchronized with a change in the song's tempo.",
+ "D. The mask appears after the spoken dialogue begins."
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "3AJuHcjE31w",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What is the likely context of the cheering sound heard during the segment featuring a person near a drum set?",
+ "Choice": [
+ "A. The cheering is a reaction to the person speaking or singing on stage.",
+ "B. The person near the drum set has just finished a drum solo.",
+ "C. The cheering is directed towards someone off-stage making an announcement.",
+ "D. The cheering is due to a technical malfunction being resolved."
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "3EE5WKQMBek",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "In what context does the sound of a chicken clucking first occur while a speaker is talking?",
+ "Choice": [
+ "A. When the speaker is gesturing towards the camera and describing his past.",
+ "B. After the speaker finishes describing the construction of the chicken coop.",
+ "C. As the speaker mentions his company being bought out by a larger one.",
+ "D. While the speaker is explaining his experience with grain and feed percentages."
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which event happens immediately before the speaker says \"At least on paper\"?",
+ "Choice": [
+ "A. A duck is seen inside the coop.",
+ "B. He points towards the chicken coop.",
+ "C. He describes working at a feed store.",
+ "D. A rooster crows in the background."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "3Qli5msTyH8",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Why might the audio mention 'don't forget about it'?",
+ "Choice": [
+ "A. To remind the viewer to add a specific ingredient later.",
+ "B. To highlight the importance of using a metal utensil.",
+ "C. To ensure the onions cook evenly and prevent sticking.",
+ "D. To emphasize the need for constant high heat application."
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "2YW7WBtqzzQ",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Based on the audio and visual cues in the first 10 seconds, what is the likely theme or genre of the video?",
+ "Choice": [
+ "A. A magic trick demonstration with a focus on card manipulation.",
+ "B. A fitness guide with energetic background sounds.",
+ "C. A cooking tutorial with a playful animal.",
+ "D. A dramatic movie introduction with suspenseful music."
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What action is the speaker performing while mentioning that he wishes the audience could mix the cards?",
+ "Choice": [
+ "A. Fanning the cards out on the table.",
+ "B. Displaying the Ace of Spades.",
+ "C. Shuffling the deck in his hands.",
+ "D. Pointing towards the camera."
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "2ZHLMmZB-HQ",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "What sound event is occurring when shimmering particles in the sky visually form a question mark?",
+ "Choice": [
+ "A. Loud electronic music with a fast tempo begins.",
+ "B. A low-pitched, repetitive vocal harmony is present.",
+ "C. A male choir is singing with a gentle tone.",
+ "D. Upbeat synth music with a driving beat is playing."
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "2tBRn4xKmzM",
+ "video_duration": "30s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "What action is the dog performing after a male voice says 'Okay, you got a mouthful of fur there'?",
+ "Choice": [
+ "A. The dog is walking towards a pile of dried branches on the snow.",
+ "B. The dog is sniffing the ground and following ski poles.",
+ "C. The dog is running with its tongue hanging out.",
+ "D. The dog is lying down on the snow, looking at the raccoon."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "2wIrHCaxBNU",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which event happened immediately after the player built a bridge of dirt blocks?",
+ "Choice": [
+ "A. The player gasped and then commented on safety.",
+ "B. The player said 'Nope, nope' in a low tone.",
+ "C. The player noted they wanted to see what was that way.",
+ "D. The player expressed excitement by saying 'I do'."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Based on the audio and visual information, what was the player likely trying to avoid when navigating the cavern in the last segment (20-30s)?",
+ "Choice": [
+ "A. Falling into the dark, unexplored portion of the cave.",
+ "B. A potential encounter with a sheep.",
+ "C. The river of bright orange lava.",
+ "D. The flowing bright blue water stream."
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "3dgg5f9CyP4",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What was the primary action of the main dancer when the lyrics 'About a little tin box' were being sung?",
+ "Choice": [
+ "A. He was making synchronized hand gestures with the other dancers.",
+ "B. He was singing and dancing at the same time.",
+ "C. He was energetically dancing across the stage with a smile.",
+ "D. He was standing still with the other dancers on the boxes."
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "2rdTrQnl4c8",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What action is being performed while the speaker advises cleaning?",
+ "Choice": [
+ "A. Showing the cylinder.",
+ "B. Adjusting a wired coil.",
+ "C. Connecting an air compressor.",
+ "D. Pointing at the piston."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Before the cylinder is shown individually, what visual action takes place while the speaker mentions getting a new gasket?",
+ "Choice": [
+ "A. The air compressor is connected to the hose.",
+ "B. The disassembled chainsaw is shown in a wide shot.",
+ "C. Hands point at the gasket surrounding the crankcase.",
+ "D. The piston is moved away from the chainsaw."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "2VjbDttFK44",
+ "video_duration": "30s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "During the initial part of the video, which action coincides with the woman's statement about someone not being able to wait?",
+ "Choice": [
+ "A. Both individuals are laughing.",
+ "B. The man is gesturing with his hands.",
+ "C. The man covers his mouth with his hand.",
+ "D. The woman points towards the air."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "2o1p83UjJFA",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Considering both the visual appearance and the audio content in the final segment, what is the most likely mood being conveyed?",
+ "Choice": [
+ "A. Energetic and celebratory, with bright lights and upbeat music.",
+ "B. Somber and reflective, with dim lighting and melancholic music.",
+ "C. Tense and dramatic, with flashing lights and loud, distorted sounds.",
+ "D. Playful and lighthearted, with colorful visuals and cheerful singing."
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "3Oa4oE3bQTY",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What was happening visually when the woman said, 'the teachers are also continually working on themselves, and'?",
+ "Choice": [
+ "A. A close-up of a woman with short blond hair looking down was shown.",
+ "B. A woman with blonde curly hair was gesturing while sitting down.",
+ "C. A smiling woman was shown in the middle of a group of people.",
+ "D. A sunset was visible with the sun reflecting on the water while a woman gestured."
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Why might the audio transition to a male voice speaking about the 'ideal format for the school'?",
+ "Choice": [
+ "A. Because the previous audio involved a woman talking about aligning intentions.",
+ "B. Because the visuals transitioned to show the scene of teaching in the school.",
+ "C. Because the scene visually shifted to a sunset over a body of water.",
+ "D. Because the preceding audio featured a woman mentioning activities lasting an entire lifetime."
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "2o4-lEtFb2k",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What action was the person performing when they said 'project is pretty much done'?",
+ "Choice": [
+ "A. They were showing the back of the canvas.",
+ "B. They were displaying the finished artwork.",
+ "C. They were carving a name into wax.",
+ "D. They were applying heat to the crayons."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "2zKyMjBnL0Y",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What action is Jin performing when he says '待っていろ(Wait for it)'?",
+ "Choice": [
+ "A. He is standing among ruins with 'WINS: 7' displayed.",
+ "B. He is performing a fighting stance with 'WINS: 14' displayed.",
+ "C. He is facing away from the viewer with 'WINS: 2' displayed.",
+ "D. He is making motions with a tiger statue in the background and 'WINS: 4'."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "3g_GUoCbzTc",
+ "video_duration": "30s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which action aligns with the speaker saying 'open your mouth'?",
+ "Choice": [
+ "A. The speaker is putting his hands on his face.",
+ "B. The speaker is touching his chest.",
+ "C. The speaker is gesturing with both hands open towards the camera.",
+ "D. The speaker is bringing his hand down to his mouth, cupping it."
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the sequence of actions and speech in the first 10 seconds?",
+ "Choice": [
+ "A. The speaker spreads his hands, then touches his chest, and then makes breathing sounds.",
+ "B. The speaker makes breathing sounds, then touches his chest, and then spreads his hands.",
+ "C. The speaker touches his chest, then spreads his hands, and then makes breathing sounds.",
+ "D. The speaker touches his chest, then makes breathing sounds, and then spreads his hands."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "2TprtMP1RDo",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What action is the man with a light green shirt performing when the audio plays 'exhale'?",
+ "Choice": [
+ "A. He's looking at the camera and has his hands raised.",
+ "B. He's lowering his hands and looking at the camera.",
+ "C. He's looking at the camera with a tilt of his head.",
+ "D. He's gesturing with his hands while talking."
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "32Wp1Qo1MaY",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Based on the audio and visual cues, what could be inferred about the man's demeanor before he walks through the door?",
+ "Choice": [
+ "A. He was angry and stormed out of the room.",
+ "B. He was presenting information confidently and then leaving.",
+ "C. He was sad and walked away silently.",
+ "D. He was nervous and trying to avoid eye contact."
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "2e0ZRe0x_nE",
+ "video_duration": "30s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What visual element is present when the initial greeting 'hello' is spoken?",
+ "Choice": [
+ "A. A man wearing a gray shirt with red sleeves.",
+ "B. A gray landline phone with a flashing screen.",
+ "C. A still photo of a man in a black hat.",
+ "D. A woman with a high ponytail and blue dress."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "3Ev_k__U3Io",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What action was occurring visually at the same time the speaker mentioned 'nice big deflection'?",
+ "Choice": [
+ "A. A diagram showing charge symbols appearing.",
+ "B. A syringe resting on a green tray.",
+ "C. A hand holding a balloon over a water stream.",
+ "D. Water starts to drip from a thin tube."
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is happening in the video when the speaker mentions 'the effect does seem weaker'?",
+ "Choice": [
+ "A. The close-up shows the water stream and a balloon.",
+ "B. The balloon is being moved closer to the water.",
+ "C. The diagram is showing an increase in charge.",
+ "D. The water stream is becoming more turbulent."
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "3UEM541IenE",
+ "video_duration": "30s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "According to the descriptions, which audio event occurred immediately after the hippopotamus bellowed while walking on the dirt surface?",
+ "Choice": [
+ "A. A person coughing.",
+ "B. Indistinct children's speech.",
+ "C. The sound of farting.",
+ "D. A child saying 'Okay, I think we're done playing'."
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Compare the vocalization of the hippopotamus in the first segment (0-10s) to the human speech present in the same segment.",
+ "Choice": [
+ "A. The hippo's sound is clear, while the speech is indistinct.",
+ "B. Both are equally loud and noticeable.",
+ "C. Both are distinct and clearly understandable.",
+ "D. The human speech is prominent, overshadowing the hippo's sound."
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "2rjbtsX7twc",
+ "video_duration": "30s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Given the combination of visual text and audio content, what is the likely theme or topic being explored?",
+ "Choice": [
+ "A. A sports analysis with background singing.",
+ "B. A scientific lecture on quantum physics illustrated with vocal performances.",
+ "C. A cooking tutorial accompanied by classical music.",
+ "D. A lighthearted pop song with abstract visual elements."
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "2zO-nlygkFo",
+ "video_duration": "30s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What action and sound event occurred simultaneously after the initial tap dancing performance concluded?",
+ "Choice": [
+ "A. The dancers bowed and an electronic beep was heard.",
+ "B. A man approached and the sound of clapping ceased.",
+ "C. A woman walked forward and someone mentioned 'the air'.",
+ "D. The dancers bowed and someone said 'nice job'."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "3h8PzeGp2RQ",
+ "video_duration": "30s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What action is being described in the audio while someone is working with a dead duck inside the wooden shed?",
+ "Choice": [
+ "A. Preparing the duck for cooking.",
+ "B. Detaching the wings from the duck.",
+ "C. Removing feathers from the duck's body.",
+ "D. Placing the duck inside a container."
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "2vSF2Yh6o_k",
+ "video_duration": "30s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "Which event occurred first in the video?",
+ "Choice": [
+ "A. The porcupine eating pepper and someone saying 'Yeah'.",
+ "B. A person speaking cheerfully and chewing sounds.",
+ "C. A dog panting and a green stalk being held.",
+ "D. Rustling sounds and the porcupine wearing earmuffs."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Why might the person call the porcupine 'a good boy'?",
+ "Choice": [
+ "A. Because the porcupine is eating its vegetables silently.",
+ "B. Because the porcupine is wearing earmuffs.",
+ "C. Because the porcupine is looking at the camera.",
+ "D. Because the porcupine is eating its vegetables and whimpering loudly."
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "37UUg0fwzsw",
+ "video_duration": "30s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Based on their attire and the audio, what can be inferred about the individuals shown in the 13-20s segment?",
+ "Choice": [
+ "A. They are likely rebel soldiers.",
+ "B. They are concerned about a rescue mission.",
+ "C. They are preparing for an immediate attack.",
+ "D. They are in a position of authority."
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "In which visual setting does the dialogue 'Oh, crap. Gear up and let's go' occur?",
+ "Choice": [
+ "A. On a green platform with a grey structure behind.",
+ "B. Inside a white and grey Lego vehicle.",
+ "C. Amidst a battle scene with explosions.",
+ "D. Inside the undercarriage of a large spaceship."
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "2_kLD3IbF2c",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "What sound event was playing when the three animated square icons with stick figures were visible?",
+ "Choice": [
+ "A. Solemn a cappella choir.",
+ "B. A loud fart sound.",
+ "C. Suspenseful brass music.",
+ "D. Dramatic orchestral music."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "3UWDo636wes",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Why might someone mention the risk of fire?",
+ "Choice": [
+ "A. Because the audio of a door closing might indicate someone leaving a dangerous situation.",
+ "B. Because the lights are being placed near flammable objects visible in the background.",
+ "C. Because the lights are being twisted forcefully, potentially damaging the wires.",
+ "D. Because someone is concerned about the voltage of the lights being used."
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "2zAUcAhX5wQ",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which event happened immediately after the person mentioned turning down the lights?",
+ "Choice": [
+ "A. The person held a brown finger puppet.",
+ "B. The green spiky ball started to glow.",
+ "C. The background showed a collection of toy eggs.",
+ "D. The person cracked open a yellow egg."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Why did the speaker mention turning down the lights?",
+ "Choice": [
+ "A. Because the pokey bouncy ball started to flash.",
+ "B. Because the yellow egg was difficult to see.",
+ "C. Because the horse head puppet was too bright.",
+ "D. Because the toy dogs in the background were making noise."
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Based on the audio and visual information, what can be inferred about the person's interaction with the yellow-green ball?",
+ "Choice": [
+ "A. They were surprised by its texture and bouncing ability.",
+ "B. They were initially unsure if it would bounce.",
+ "C. They expected it to make a loud noise.",
+ "D. They found it less interesting than the squeaky ball."
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "2ofK2gAQue4",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What is the correct sequence of events combining both visual light changes on the rectangular device and audio announcements?",
+ "Choice": [
+ "A. Blue light, 'carbon monoxide', red light, alarm",
+ "B. Red light, 'carbon monoxide', blue light, alarm",
+ "C. Red light, alarm, blue light, 'Finishing up, just a moment'",
+ "D. Blue light, alarm, red light, 'Finishing up, just a moment'"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "3V-q5MJSzAk",
+ "video_duration": "30s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "Why might the man be separating the rooster's tail feathers after lifting them up, according to the audio and visuals?",
+ "Choice": [
+ "A. Because he wants to check the length and condition of each individual feather.",
+ "B. Because he is preparing the rooster for a walk and demonstrating its plumage.",
+ "C. Because he is making sure the feathers are fluffy before showing them to the camera.",
+ "D. Because the shuffling sound indicates he is preparing to pull them through the wall."
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "2ff27NC3YUc",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "What was being placed on the street when the speaker mentioned creating a 'gap of two'?",
+ "Choice": [
+ "A. Black coal blocks.",
+ "B. White quartz slabs.",
+ "C. White quartz stairs.",
+ "D. Black wool blocks."
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Why were specific blocks mentioned as being placed 'upside down'?",
+ "Choice": [
+ "A. To form the base of the structure.",
+ "B. To add detail to the sides.",
+ "C. To create the front of the vehicle.",
+ "D. To complete the back of a structure."
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What is the difference in the material being placed when the speaker says 'gap of two' compared to when they say 'and on the back three upside down stairs'?",
+ "Choice": [
+ "A. Both are quartz.",
+ "B. Both are coal.",
+ "C. From coal to quartz.",
+ "D. From wool to coal."
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What is the likely activity being performed when the speaker mentions needing 'some black wool'?",
+ "Choice": [
+ "A. Detailing the side of a vehicle.",
+ "B. Laying the foundation of a building.",
+ "C. Creating the roof of a structure.",
+ "D. Building a pathway on the street."
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "4AyEtCf3ND0",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which sequence accurately reflects the order of locations the girl cycles through along with her accompanying speech?",
+ "Choice": [
+ "A. Sidewalk with speech about Instagram -> Asphalt road with speech about Leah's bike ride -> Asphalt road with initial hill comment.",
+ "B. Asphalt road with initial hill comment -> Sidewalk with no described speech -> Asphalt road with speech about hoping viewers enjoyed the video.",
+ "C. Sidewalk with no described speech -> Asphalt road with speech about Leah's bike ride -> Asphalt road with like and subscribe request.",
+ "D. Asphalt road with speech about Leah's bike ride -> Sidewalk with initial hill comment -> Asphalt road with bye to viewers."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "33fqw0k_qkM",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What happened immediately after the statement 'show that by replacing the old resistor with a new one' was spoken?",
+ "Choice": [
+ "A. The AC button was pressed.",
+ "B. The airflow direction was changed.",
+ "C. The fan speed was set to high.",
+ "D. An adjustment to the AC system was made."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Based on the audio and visual cues, what was the likely purpose of manipulating the fan and AC controls?",
+ "Choice": [
+ "A. To familiarize oneself with the car's features.",
+ "B. To test different sound levels in the car.",
+ "C. To adjust the temperature for passenger comfort.",
+ "D. To demonstrate the functionality after a repair."
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "2VuKP6zZDDc",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which action immediately followed the child spreading the pink slime on the table?",
+ "Choice": [
+ "A. The woman commenting on the baby.",
+ "B. The child trying to poke other slimes with the green slime.",
+ "C. The child saying the word 'bad book'.",
+ "D. The baby giggling in the high chair."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "4-DcTjFR4qw",
+ "video_duration": "30s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "Which of the following sequences accurately reflects the order of events in the video?",
+ "Choice": [
+ "A. Woman drives, engine sound, car turns.",
+ "B. Car turns, man speaks, woman drives.",
+ "C. Car passes, leaves kick up, woman speaks.",
+ "D. Engine sound, car passes, leaves kick up."
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Why might the initial speaker sound uncertain before the excited commentary?",
+ "Choice": [
+ "A. The car's speed was unexpectedly rapid.",
+ "B. The road conditions appeared dangerous.",
+ "C. The car's speed was still moderate then.",
+ "D. The car model was initially unidentifiable."
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Based on the audio and visual information, what can be inferred about the car's journey?",
+ "Choice": [
+ "A. It was being evaluated for its performance.",
+ "B. It was driving through a busy urban area.",
+ "C. It was participating in a racing event.",
+ "D. It was being driven for daily commuting."
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "4DsPpi-2yTM",
+ "video_duration": "30s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "What sound occurred immediately before the player in the first-person perspective fell off the stairs in the initial part of the video?",
+ "Choice": [
+ "A. A male speaker saying numbers.",
+ "B. An explosion sound effect.",
+ "C. A male speaker expressing frustration.",
+ "D. The sound of a gunshot."
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "In the 10-20 second segment, what happened visually immediately after the player respawned with bloody hands, and what sound occurred around the same time?",
+ "Choice": [
+ "A. The player walked through an outdoor alley with graffiti, followed by a male speaker expressing frustration.",
+ "B. The player entered a grey warehouse, followed by the sound of an explosion.",
+ "C. The player got killed by a thrown knife, and there was the sound of a gunshot.",
+ "D. The player approached a train track and white train car, while a calm male speaker talked about numbers."
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "4m5khG0emPE",
+ "video_duration": "30s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Why did the speaker mention needing to take money out after a receipt printed?",
+ "Choice": [
+ "A. To demonstrate how to close the drawer.",
+ "B. Because the receipt showed a cash sale.",
+ "C. To show the customer the printed details.",
+ "D. Because the transaction was a refund."
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Based on the audio and visual cues, what is the primary function being demonstrated on the cash register?",
+ "Choice": [
+ "A. Printing promotional discount coupons.",
+ "B. Processing a cash refund transaction.",
+ "C. Processing a standard cash sale.",
+ "D. Calculating the daily sales total."
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "4Rym4DqrhHM",
+ "video_duration": "30s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "Compare the clarity of speech in the second and third audio segments.",
+ "Choice": [
+ "A. There is no speech in the third segment",
+ "B. Speech is equally clear in both segments",
+ "C. Speech is clearer in the third segment",
+ "D. Speech is clearer in the second segment"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What is the likely occupation or industry related to the website mentioned?",
+ "Choice": [
+ "A. Software development",
+ "B. Animal husbandry and farming",
+ "C. Retail sales of clothing",
+ "D. Construction and engineering"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "3v93yDHV-KY",
+ "video_duration": "30s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What followed the action of someone running their hands through short dark hair and a speaker mentioning going 'right down to the roots'?",
+ "Choice": [
+ "A. The hair was being tousled to distribute product.",
+ "B. The hair was messed up and pushed to the front.",
+ "C. Scissors were used to cut the dark brown hair.",
+ "D. The hair was combed forward after being trimmed."
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "4qaJW0mvsoY",
+ "video_duration": "30s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "What is the dog doing when the male voice says 'what are you saying about'?",
+ "Choice": [
+ "A. The dog is wagging its tail and holding a bone.",
+ "B. The dog is looking away from the camera.",
+ "C. The dog is barking at something off-screen.",
+ "D. The dog is lying down on the floor."
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "4KftZggyMsk",
+ "video_duration": "30s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Immediately before the sound of someone saying \"Pick this,\" what visual action was being performed with the cards?",
+ "Choice": [
+ "A. The person was shuffling a single deck of cards.",
+ "B. The person was fanning out a deck of cards.",
+ "C. The person was dealing cards onto a blue surface.",
+ "D. The person was holding two separate decks of cards."
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "026dzf-vc5g",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "During which two specific actions did the audience applaud?",
+ "Choice": [
+ "A. Handstand sequence and audience interaction",
+ "B. Initial music cue and final pose",
+ "C. Scorpion pose and back walkover",
+ "D. Cartwheel completion and stage exit"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "1NcGHbFWBFA",
+ "video_duration": "60s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "Which event occurs FIRST chronologically?",
+ "Choice": [
+ "A. Raven mimics human speech",
+ "B. Cat jumps onto lying dog",
+ "C. Elephant takes hat off woman's head",
+ "D. Hamster wheel noise begins"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Why does the woman laugh while standing in the grassy field?",
+ "Choice": [
+ "A. The raccoon performs repetitive carpet rolls",
+ "B. The elephant steals her hat with its trunk",
+ "C. The woman reacts to dogs chasing a ball",
+ "D. The dog's face is comically distorted by wind"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which audio event occurs simultaneously with the raccoon's repetitive rolling motion on the floral carpet?",
+ "Choice": [
+ "A. 'Like you're doing it again' spoken commentary",
+ "B. Upbeat pop-R&B music with bassline",
+ "C. Continuous low-frequency rumbling",
+ "D. Rapid mechanical whirring"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which scene occurs during upbeat music and shows a human participant's full body in view?",
+ "Choice": [
+ "A. Woman throwing a ball to active golden retrievers",
+ "B. Raccoon rolling near partially visible legs",
+ "C. Woman seated cross-legged with golden retrievers",
+ "D. Woman laughing as an elephant takes her hat"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Which event occurs immediately AFTER the cat disengages from wrestling with the dog?",
+ "Choice": [
+ "A. A raven mimics human speech patterns",
+ "B. The raccoon initiates repetitive rolling motions",
+ "C. A golden retriever sprints to catch a thrown ball",
+ "D. The woman laughs while an elephant takes her hat"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "-yWE13LbFd4",
+ "video_duration": "60s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "What distinguishes the visual context of the 'ETIHAD' player projection from the Manchester City lineup's introduction?",
+ "Choice": [
+ "A. The projection occurs during scoreboard graphics, while the lineup follows pyrotechnics activation",
+ "B. The projection overlays a live match scene, while the lineup is presented in trophy-centered framing",
+ "C. The projection appears during crowd close-ups, while the lineup shows wide stadium views",
+ "D. The projection uses augmented reality effects, while the lineup employs slow-motion cinematography"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Which audio event coincided with the close-up of the silver FA Cup reappearing on the screen?",
+ "Choice": [
+ "A. 'At long last, the wait is over. A warm welcome to Wembley'",
+ "B. 'It's Manchester City, they face Manchester United, and it's live on EA TV'",
+ "C. 'It is a special occasion and always one to be cherished'",
+ "D. 'It is Manchester City versus Manchester United'"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What visual event coincided with the initial crowd cheering heard in the 0-20s segment?",
+ "Choice": [
+ "A. A player in light blue celebrated scoring a goal",
+ "B. The Emirates FA Cup graphic appeared on screen",
+ "C. The announcer referenced the EA TV broadcast",
+ "D. A man in a blue blazer was shown cheering"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What does the placement of the FA Cup trophy between the teams indicate about the current match phase?",
+ "Choice": [
+ "A. The trophy presentation is underway",
+ "B. The teams are completing pre-match media obligations",
+ "C. The match has concluded with a victory ceremony",
+ "D. The match is about to commence"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "1evyOuQz-jM",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What is the FIRST visual element displayed alongside the on-screen text 'most worn JEWELLERY PIECES'?",
+ "Choice": [
+ "A. Woman's wrist adorned with multiple bracelets",
+ "B. Gold necklace with an intricate pendant held in a hand",
+ "C. Open jewelry box revealing a silver necklace and pendant",
+ "D. Woman wearing a gold choker and layered silver necklaces"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What key distinction in content delivery exists between the 0-20s jewelry segment and the 40-60s segment?",
+ "Choice": [
+ "A. Visual jewelry close-ups vs. coffee mug product placement",
+ "B. Jewelry material analysis vs. handbag maintenance tutorial",
+ "C. Curated jewelry display vs. conversational handbag discussion",
+ "D. Morning accessory selection vs. travel outfit coordination"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "08km9Yqbt-A",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which astronomical event is the first to be both described in the audio narration and visually depicted in the video timeline?",
+ "Choice": [
+ "A. Moon and planetary alignment on June 29",
+ "B. Harvest Moon Eclipse on Sept 17",
+ "C. Jupiter opposition on Dec 7",
+ "D. Saturn opposition on Sept 8"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What astronomical characteristic defines the September 17th lunar eclipse as the 'Harvest Moon'?",
+ "Choice": [
+ "A. It displays a distinct copper hue during totality",
+ "B. It coincides with a super moon, enhancing its apparent size",
+ "C. Its timing relative to the September equinox",
+ "D. It occurs during peak agricultural harvest periods"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What visual element is uniquely present during Saturn's opposition compared to Jupiter's in the video?",
+ "Choice": [
+ "A. Tree silhouettes framing the planet",
+ "B. A depiction of the Sun in the background",
+ "C. An eclipse shadow on the planet",
+ "D. Planetary bands in multiple colors"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "0Y9-MQ44MdU",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Why does the content creator reference COVID-19 in his introduction?",
+ "Choice": [
+ "A. To clarify the hiatus in producing game review content",
+ "B. To introduce the visual of pandemic-related icons during the segment",
+ "C. To rationalize the unconventional location of the basketball game",
+ "D. To emphasize the challenges of filming during health restrictions"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "0w4OTD4L0GQ",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What visual distinction exists between the presentation of Earth's orbital velocity and the Solar System's speed?",
+ "Choice": [
+ "A. Varied Earth perspectives (clouds, crescent, night) vs combined Earth-sun view",
+ "B. Solar System speed displayed with animated trajectory lines",
+ "C. Velocity magnitudes emphasized through contrasting font sizes",
+ "D. Text overlays appear before celestial imagery in both cases"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Which elements in the opening sequence primarily establish the video's scientific context?",
+ "Choice": [
+ "A. Absence of narration with static-filled screens",
+ "B. Textual data about planetary motion",
+ "C. Ethereal synthesizer tones layered under spiral galaxy visuals",
+ "D. Corporate emblem transitioning into starfield visuals"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "0ay2Qy3wBe8",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Which communication method immediately followed the telegraph's demonstration in both audio cues and visual sequence?",
+ "Choice": [
+ "A. Vintage radio broadcast",
+ "B. Semaphore flag signal transmission",
+ "C. Computer-based email composition",
+ "D. Candlestick telephone conversation"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "0eJvnKwGThw",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which audio segment coincides with the display of the 'VALENTINO The Beginning' title card against Rome's skyline?",
+ "Choice": [
+ "A. 'But if I could just flow' spoken",
+ "B. 'me and I feel all this love' spoken",
+ "C. 'and that's what happens when I feel connected' spoken",
+ "D. 'it feels like everything beautiful is' spoken"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "0rCbfsuKdYw",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which specific components does the speaker refer to as 'ancient equipment' during the PlayStation 1 demonstration?",
+ "Choice": [
+ "A. The wired controller and console",
+ "B. The PlayStation 1 console and all included accessories",
+ "C. The video adapter and AC adapter",
+ "D. The 1994 television setup"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "0Jbc3Ah4EIc",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Why did the speaker expose his midsection when stating, 'If you finally want to get lean'?",
+ "Choice": [
+ "A. To establish the baseline for the fat loss journey",
+ "B. To demonstrate the effectiveness of his fitness method",
+ "C. To illustrate the problem of excess body fat",
+ "D. To engage viewers with a personal anecdote"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "1NH5dJ9VRvU",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which audio event occurred simultaneously with the initial display of the '#CBCNEWS' hashtag?",
+ "Choice": [
+ "A. Music playing with a dramatic tone",
+ "B. 'People can't afford to eat...' spoken",
+ "C. 'Gaza's largest hospital now in ruins...' spoken",
+ "D. Sound of a bell"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which audio cue is directly synchronized with the appearance of the 'AXE THE TAX' painted truck?",
+ "Choice": [
+ "A. The phrase 'an emergency meeting' is spoken",
+ "B. The news anchor discusses industrial emissions",
+ "C. A boxer expresses concerns about brain trauma",
+ "D. A report on a hospital's destruction is delivered"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which visual event occurred immediately after the sound of the bell?",
+ "Choice": [
+ "A. Blonde woman in a dark blue suit speaking",
+ "B. Man driving a vehicle",
+ "C. Gas station price sign showing '151.9'",
+ "D. Black screen with '#CBCNEWS' hashtag"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "0bwyCF0HteE",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which audio event was playing when Léa Kyle first unveiled the segmented mirror structure during her AGT performance?",
+ "Choice": [
+ "A. Sustained audience applause",
+ "B. Rhythmic whooshing effects",
+ "C. Upbeat acoustic music with fast tempo",
+ "D. Deep male vocal commentary"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What was the contestant's attire right before male voice first exclaimed 'Whoa' during her performance?",
+ "Choice": [
+ "A. Red flower prop in hand",
+ "B. Short strapless black dress",
+ "C. Judges' tables shaking",
+ "D. Silver dress with partition wall"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "1sTNqJVrqx8",
+ "video_duration": "60s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "What key audio-visual contrast exists between Kennedy's outdoor family basketball scene and indoor couch celebration?",
+ "Choice": [
+ "A. Dialogue about instability vs reaction to musical performance",
+ "B. Athletic competition sounds vs piano accompaniment",
+ "C. Heartfelt personal narration vs collective emotional outburst",
+ "D. Daylight nature sounds vs evening indoor ambiance"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "-qTAeVGl_e8",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What distinguishes the primary vocal audio elements during the first two successful jumps compared to the final victorious jump?",
+ "Choice": [
+ "A. Crowd narration vs. solo performer vocals",
+ "B. Ambient noise dominance vs. music dominance",
+ "C. Drum-accompanied speech vs. synth-backing chants",
+ "D. English commentary vs. Swahili singing"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "-dfvdKf-KR0",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "During the discussion involving Gerald Tate, which location is visually depicted?",
+ "Choice": [
+ "A. A private meeting room",
+ "B. A tech workstation",
+ "C. A corporate hallway",
+ "D. An executive lounge"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "-y1U0qk1vDs",
+ "video_duration": "60s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "What visual element coincided with the commentator stating 'Hey, if it ain't broke, don't fix it'?",
+ "Choice": [
+ "A. Aurelion Sol banned by Team Liquid",
+ "B. Annie highlighted in green on the central display",
+ "C. Spectators raising a 'GO FLYQUEST' banner",
+ "D. Hand-drawn 'General UmTi #1' sign shown"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What strategic adjustment did FlyQuest implement when spectators displayed the 'YOU LOOK CUTE TODAY' sign?",
+ "Choice": [
+ "A. Maintain previous game's strategy",
+ "B. Support role champion rotation",
+ "C. Jungle role focus shift",
+ "D. Mid-lane priority pick repetition"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What gameplay distinction does the commentary emphasize between Annie and Ari?",
+ "Choice": [
+ "A. Annie has higher ban frequency in recent matches",
+ "B. Annie provides area damage while Ari focuses single targets",
+ "C. Ari's effectiveness depends on jungle synergy",
+ "D. Ari requires more team coordination for engagements"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "0k2ey_okQ4E",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What is the primary purpose of the narrator referencing the 'state of the art laboratory' during the aerial view of the facility?",
+ "Choice": [
+ "A. To showcase medical technicians operating diagnostic equipment",
+ "B. To highlight the advanced technology provided by Castrol EDGE used in the lab",
+ "C. To establish the geographic location of Ronaldo's training regimen",
+ "D. Because sports scientists are conducting exclusive research on his athletic performance"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "20gUIdXGuAs",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which visual event was displayed on screen when the speaker stated: 'It's impossible to train leading AI models without using copyrighted materials'?",
+ "Choice": [
+ "A. The Google logo appears beneath a headline about French fines",
+ "B. A UN representative presenting a resolution appears",
+ "C. The AP News logo briefly flashes on screen",
+ "D. The article titled 'Here's Proof You Can Train an AI Model Without Slurping Copyrighted Content' is shown"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What was the sequence of audio events in the first 20 seconds?",
+ "Choice": [
+ "A. Man discusses New York headline → Upbeat music → Formal welcome",
+ "B. Man introduces weekly segment → Upbeat music → Formal welcome to AI decoded",
+ "C. Upbeat music → Man introduces weekly segment → Man discusses New York headline",
+ "D. Formal welcome to AI decoded → Upbeat music → News headline narration"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which news organization's logo was displayed as the speaker concluded the discussion on ethically sourced AI training data?",
+ "Choice": [
+ "A. The BBC",
+ "B. The Associated Press",
+ "C. The New York Times",
+ "D. The Wall Street Journal"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "1uqupftxFOM",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What caused the audience to laugh between 40-60 seconds?",
+ "Choice": [
+ "A. Yu Hojin mentioned his father's disapproval of his magic career",
+ "B. Howie Mandel joked about Yu Hojin's secret bathroom magic",
+ "C. Yu Hojin admitted practicing magic secretly in the bathroom",
+ "D. The judges reacted to Yu Hojin's military service"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which specific action occurred directly following Yu Hojin's mention of military service?",
+ "Choice": [
+ "A. Yu Hojin declared wanting the world to remember his name",
+ "B. Simon Cowell examined the covered object",
+ "C. Howie Mandel inquired if magic was his specialty",
+ "D. The audience and judges began applauding"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "2LDriAWltwM",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with Mia stating 'I'm showing you guys my updated workout split'?",
+ "Choice": [
+ "A. Mia in her car announcing a full week of workouts",
+ "B. Mia performing deadlifts in a gym with superimposed workout text",
+ "C. Mia in her car wearing a 'SUPER ACTIVE' hoodie",
+ "D. Mia wearing gray leggings adjusting gym equipment"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which event occurs FIRST in the 20-40s segment?",
+ "Choice": [
+ "A. A man in a gray hoodie uses a leg press",
+ "B. Mia in a green tank top does crunches",
+ "C. Mia in white workout clothes lifts dumbbells",
+ "D. Mia in gray leggings performs pull-ups"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What visual context is shown when Mia states 'also has been highly requested'?",
+ "Choice": [
+ "A. Close-up of Mia performing deadlifts with barbell",
+ "B. Black leather car seats with brick building visible through window",
+ "C. Gym equipment displaying superimposed 'Full week of WORKOUTS' text",
+ "D. Festive title card with Christmas tree and gold ornaments"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "2Gg4OQo7-zA",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which specific sequence of goal events determined the match outcome?",
+ "Choice": [
+ "A. Karembeu's opener with Djorkaeff sealing the win",
+ "B. Ronaldo's equalizer canceled by Zidane's penalty",
+ "C. Two Barthez saves preserving a 1-0 lead",
+ "D. Zidane's header followed by a saved Ronaldo attempt"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "2QTiAmvygC4",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event occurred simultaneously with the male announcer stating 'Grammy and Oscar winner Jennifer Hudson' during the introduction?",
+ "Choice": [
+ "A. A woman in a yellow dress singing beside a piano player",
+ "B. Two men positioned behind an ornate desk in a studio setting",
+ "C. A group dancing behind a couch during a vocal harmony",
+ "D. A close-up of geometric wallpaper in the kitchen"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "How does the number of backup dancers actively performing differ between the 'Get up that hill' and 'We're moving on up' segments?",
+ "Choice": [
+ "A. Four ensemble members vs. one supporting dancer",
+ "B. Two background performers vs. complete absence",
+ "C. Three dancers in coordinated outfits vs. solo performance",
+ "D. Both segments feature synchronized group formations"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What prompted Jennifer Hudson's entrance from the right side of the set?",
+ "Choice": [
+ "A. The need to position herself near the piano for the performance",
+ "B. The conclusion of the male host's verbal introduction",
+ "C. Instructions to avoid obstructing the studio's camera angles",
+ "D. Synchronization with the gospel music's upbeat tempo"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "0JjgoicpYkU",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What visual elements are visible when the woman first speaks to the camera?",
+ "Choice": [
+ "A. Red shirt and outdoor view",
+ "B. Xylophone and nightstand",
+ "C. Open curtains and red mug",
+ "D. Pink bedding and alarm clock"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "0ncLRRLvdfk",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual element is synchronized with the narrator's statement 'evaluate each country individually'?",
+ "Choice": [
+ "A. The word 'AND' on a grid background",
+ "B. A woman holding two pieces of paper",
+ "C. 'MAP 1: GLOBAL VARIATIONS IN FINANCIAL LITERACY'",
+ "D. A downward-sloping NASDAQ-100 graph"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "How does the contextual basis for the 40% and 80% financial illiteracy statistics differ in the video?",
+ "Choice": [
+ "A. One is presented before visual aids, the other after",
+ "B. One uses animated graphics, the other static text",
+ "C. One is derived from surveys, the other from expert estimates",
+ "D. One reflects global averages, the other developing nations"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "3m29MQ-qPfg",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What contextual clue specifically indicates the timing of the student's shirt selection from the closet?",
+ "Choice": [
+ "A. '7:00 AM' text overlay",
+ "B. '8:00 AM' text overlay",
+ "C. Gospel music playing",
+ "D. Text overlay reading 'STUDENT'"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "068rdc75mHM",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What distinct visual elements are present during Einstein's argument compared to the 'quantum playground' statement?",
+ "Choice": [
+ "A. Abstract network diagrams versus optical instruments",
+ "B. Nighttime observatories versus daytime cityscapes",
+ "C. Historical figures versus modern research teams",
+ "D. Textual documents versus active laboratory apparatus"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Which visual context coincides with the statement: 'the experiment is just beautiful'?",
+ "Choice": [
+ "A. Man working on complex lab apparatus",
+ "B. Green laser beam optical setup",
+ "C. Scientist adjusting red laser equipment",
+ "D. Server rack with colorful wiring"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "44G3dPME330",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What is the initial event following Penny's entry into the apartment?",
+ "Choice": [
+ "A. Leonard clarifies Sheldon's whereabouts at the comic store",
+ "B. Penny transfers her laptop to Leonard",
+ "C. The group views Sheldon's pre-recorded message",
+ "D. Howard addresses a connectivity issue with the Wi-Fi"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "2za5RwplXdI",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Which audio event begins simultaneously with the initial wide shot of sauropod dinosaurs walking in the prehistoric landscape?",
+ "Choice": [
+ "A. Loud dinosaur footsteps",
+ "B. Faint background melody with percussions",
+ "C. Soft, melancholic instrumental music",
+ "D. Bird vocalizations near a nest"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "2iHmlBmLKUE",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What occurs directly after Emma Watson twists the refill bottle 'like an on/off switch' during the perfume demonstration?",
+ "Choice": [
+ "A. The fragrance initiates flow into the bottle",
+ "B. She inverts the refill bottle to align it",
+ "C. She unscrews the perfume bottle's original lid",
+ "D. She secures the cap after disengaging the refill"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "0kRsiSdDFYg",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What visual event coincides with the audio segment 'no more bruv. Alright, alright'?",
+ "Choice": [
+ "A. Two men in black shirts take positions on the field",
+ "B. A coach interacts with seated players",
+ "C. A man in camouflage headgear gestures toward the lens",
+ "D. Athletes initiate training exercises with cones"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "-O6mJ0VBTc4",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which category is presented THIRD in both the numbered visuals and the speaker's explanation?",
+ "Choice": [
+ "A. Labor law regulations",
+ "B. Compensation requirements",
+ "C. Equal employment opportunity",
+ "D. Employee safety and health laws"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "3E4Zca1p64s",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Why did the person rush inside immediately after falling into the pool?",
+ "Choice": [
+ "A. To prepare toasted bread for a snack",
+ "B. To ensure the phone is charged for an upcoming appointment",
+ "C. To change into dry clothes after the fall",
+ "D. To charge the phone that got wet in the pool"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "2ka5AleJ48I",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What visual occurred as the speaker announced 'Let's get to heat one'?",
+ "Choice": [
+ "A. Athletes on a green field raising their hands",
+ "B. Competitor flexing muscles during introductions",
+ "C. Overhead perspective of athletes aligning for heat one",
+ "D. Elevated shot of runners mid-race during qualifiers"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "40BlVzjxu-I",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which visual element appears simultaneously with the distinct metallic clinking noise during the opening scene?",
+ "Choice": [
+ "A. Foam settling in a pint glass",
+ "B. Patrons raising beer mugs",
+ "C. Musicians in kilts parading",
+ "D. Saint Patrick depicted in confinement"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which visual event occurs immediately after the depiction of Saint Patrick's imprisonment?",
+ "Choice": [
+ "A. A festive table with traditional Irish dishes",
+ "B. Saint Patrick in a non-confined setting",
+ "C. A serpent appearing in an illustration",
+ "D. Saint Patrick baptizing converts"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What symbolic purpose does the snake illustration serve when juxtaposed with the narrator's factual claim about Ireland's snake-free history?",
+ "Choice": [
+ "A. Visualizing the mythological narrative contradicting reality",
+ "B. Highlighting Ireland's extinct serpent population",
+ "C. Representing the eradication of pagan cultural practices",
+ "D. Documenting Saint Patrick's literal expulsion of reptiles"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "-gnfUTPmnNU",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which product was displayed at the exact moment the music transitioned from calm electronic to upbeat dance-pop(with percussions)?",
+ "Choice": [
+ "A. Pillow cases",
+ "B. Keychains",
+ "C. Wooden signs",
+ "D. Bookmarks"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "2yGaTOzaGIA",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Why is Doom holding a Belgium flag after winning the race?",
+ "Choice": [
+ "A. To celebrate breaking a world record",
+ "B. To represent his national team affiliation",
+ "C. To commemorate his European Champion title from 2019",
+ "D. To acknowledge his national record for Belgium"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "3MJNKd10kxM",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What is the primary reason the creator mentions using the heart-shaped wooden pieces with red ribbons?",
+ "Choice": [
+ "A. They are essential for Valentine's Day-themed projects",
+ "B. They are sold pre-painted for immediate use",
+ "C. They serve as substitutes when Crafter Square hearts are unavailable",
+ "D. They were purchased exclusively from Crafter Square"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which visual transition coincides with the speaker's invitation to begin the project after expressing gratitude for the viewer's presence?",
+ "Choice": [
+ "A. Close-up of unpainted wooden heart from Dollar Tree",
+ "B. Pink background with two rabbits at lower corner and 'Spring Pocket DIY' text",
+ "C. Red circle activating notification bell beside social icons",
+ "D. Overhead shot of floral supplies on blue cutting mat"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "3rTsETO3s9U",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What visual event occurs immediately AFTER 'call me fighter' sung?",
+ "Choice": [
+ "A. Dark screen transition",
+ "B. Puppet-controlled white-haired girl on yellow background",
+ "C. Animated girl standing in front of a wall with a cartoon drawing",
+ "D. School uniform character holding books"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "4VniCsSusJs",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What was the first event that occurred as the men began peeling their bananas in the 0-20s segment?",
+ "Choice": [
+ "A. The gray-shirted man demonstrated peeling from the bottom",
+ "B. The man in the red shirt struggled to peel his banana",
+ "C. The men began eating the peeled bananas",
+ "D. They threw bananas in the air and caught them"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "4iC9Qi3y9q8",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual element accompanies the narrator's statement 'each with their own planets and stories' following the initial display of '400,000,000,000'?",
+ "Choice": [
+ "A. A comet traversing a star cluster",
+ "B. A desolate exoplanet with violet atmospheric haze",
+ "C. A spiral galaxy with distinct arms",
+ "D. A comparative size visualization of Earth and the Sun"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What visual contrast exists between the 'each with their own planets' sequence and the Milky Way depiction?",
+ "Choice": [
+ "A. One emphasizes numerical scale while the other focuses on motion paths",
+ "B. One uses abstract auroras while the other shows photorealistic galaxies",
+ "C. One employs monochromatic tones while the other uses vibrant colors",
+ "D. One isolates a singular celestial body while the other displays interconnected structures"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Which visual element coincides precisely with the narrator's utterance of 'unimaginably vast'?",
+ "Choice": [
+ "A. A galaxy cluster appearing as a hazy circle",
+ "B. The number '400,000,000,000' displayed against a starry backdrop",
+ "C. Earth and sun positioned in contrasting light",
+ "D. A dark starry background with no additional elements"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "5-4Od1qn53s",
+ "video_duration": "60s",
+ "video_category": "Travel & Events",
+ "questions": [
+ {
+ "Question": "Which temple is highlighted immediately after the narrator describes Wat Pho's reclining Buddha?",
+ "Choice": [
+ "A. Grand Palace's golden spires",
+ "B. Wat Arun during sunset",
+ "C. Wat Arun's intricate white details",
+ "D. Wat Pho's courtyard with tourists"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "4ZK-m01XSQ8",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What does the number '13.7' most likely represent in the race commentary?",
+ "Choice": [
+ "A. Wind speed in meters per second",
+ "B. Hurdle height in centimeters",
+ "C. Time to clear the fourth hurdle in seconds",
+ "D. Total race time in seconds"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which audio event occurred simultaneously with the athletes clearing the hurdles?",
+ "Choice": [
+ "A. '13,7' spoken",
+ "B. Sound of bell ringing",
+ "C. Sound of animal",
+ "D. 'Но теперь всё чисто...' spoken"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "2gFxZmb0dMo",
+ "video_duration": "60s",
+ "video_category": "Travel & Events",
+ "questions": [
+ {
+ "Question": "Which audio event occurs while the tour guide is seated on a bench addressing the camera?",
+ "Choice": [
+ "A. 'A train wreck with Amy Schumer' spoken",
+ "B. 'Nicknames. Has anyone seen Home Alone 2...' spoken",
+ "C. 'I think every single tourist...' spoken",
+ "D. 'Okay, so you're standing on a spot they filmed...' spoken"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "4ME5ZfDg2aw",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What distinct visual change occurs to Greenland's depiction between the first and second world map appearances?",
+ "Choice": [
+ "A. Transition from yellow to white highlighting",
+ "B. Text shifts from discovery narrative to statistical emptiness",
+ "C. Shift from labeled geography to abstract representation",
+ "D. Inclusion of an animated human figure versus isolated focus"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Which audio event directly aligned with the initial appearance of the world map in the video, before any geographic highlights were added?",
+ "Choice": [
+ "A. Text narration: 'THE EMPTIEST COUNTRY IN THE WORLD'",
+ "B. 'There's nothing around us. Nothing!'",
+ "C. 'So I looked at the world map'",
+ "D. Mysterious drum music fading in"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "5KlS-p5eYH8",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which visual event occurs simultaneously with the alarm sound in the video?",
+ "Choice": [
+ "A. Woman sleeping with houndstooth pillow at 09:00",
+ "B. Woman running on treadmill with green walls",
+ "C. Woman measuring her backside in pink room",
+ "D. Woman sleeping as 'режим' text appears"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "540LkURTR7g",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What specific advantage is highlighted by demonstrating the addition of powder to water when discussing drink mixes?",
+ "Choice": [
+ "A. Requires minimal preparation time",
+ "B. Utilizes natural fruit extracts",
+ "C. Enhances hydration with electrolytes",
+ "D. Provides a calorie-free flavor option"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What is the FIRST event occurring AFTER Ellen reappears discussing water's role in weight loss?",
+ "Choice": [
+ "A. Close-up of glass filling from faucet",
+ "B. Announcement of hydration strategies",
+ "C. Recommendation for unsweetened tea",
+ "D. Display of limes in yellow bowl"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "3fQxvz4ENDg",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which visual element was displayed when the male speaker stated 'I think today is again an historic day on our long path towards regulation of AI'?",
+ "Choice": [
+ "A. Man in suit with EU flag",
+ "B. Man in suit with EU parliament logo",
+ "C. EU parliament hall view",
+ "D. Female presenter with EU flag and AI robot"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "In what order did these audio-visual events occur? 1) 'report tells you all about...' 2) Male parliament statement 3) 'it's called the EU AI Act'",
+ "Choice": [
+ "A. 1→2→3 with visuals of presenter→logo→hall",
+ "B. 3→1→2 with visuals of hall→presenter→logo",
+ "C. 2→1→3 with visuals of presenter→logo→hall",
+ "D. 1→2→3 with visuals of presenter→hall→logo"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which AI technologies are explicitly referenced as examples requiring regulation under the EU AI Act based on the video?",
+ "Choice": [
+ "A. Medical imaging analysis",
+ "B. Chatbots and deepfake systems",
+ "C. Public facial recognition networks",
+ "D. Industrial robots and predictive maintenance"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which visual context is displayed precisely when the phrase 'it's called the EU AI Act' is spoken?",
+ "Choice": [
+ "A. EU parliament logo on a podium",
+ "B. EU parliament hall interior",
+ "C. Female presenter with EU flag and AI robot in background",
+ "D. Close-up of a legal document titled 'EU AI Act'"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "55Jas5HrzcQ",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual element was shown precisely when the narrator declared 'the United States launched its first astronaut, Alan Shepard'?",
+ "Choice": [
+ "A. Flames erupting during rocket ignition",
+ "B. Close-up of Alan Shepard in a spacecraft cockpit",
+ "C. Mercury-Redstone 3 rocket on the launchpad",
+ "D. Newspaper headline about Soviet space achievements"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What visual element was displayed during President Kennedy's declaration of the moon landing goal?",
+ "Choice": [
+ "A. Joint Session of Congress audience",
+ "B. Simulated 1960s television broadcast",
+ "C. Astronaut suit technical diagrams",
+ "D. Mercury-Redstone rocket launch sequence"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "4od5mdtEp-E",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which specific visual element is shown directly following the aerial 'Las Vegas Day 1' text?",
+ "Choice": [
+ "A. A montage of weight-lifting sessions and jersey-clad players",
+ "B. A close-up of cash piles surrounding the presenter",
+ "C. An 'IGNITE' banner with player headshots overlay",
+ "D. The presenter gesturing toward the 'Henderson Multigenerational Center' sign"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which visual element is displayed when the speaker says 'Practice is about to start' to establish the location?",
+ "Choice": [
+ "A. Basketball court markings",
+ "B. Henderson Center exterior sign",
+ "C. 'Las Vegas Day 1' text overlay",
+ "D. SANTA CRUZ WARRIORS banner"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Why is the presenter determined to make the Warriors regret their decision regarding him?",
+ "Choice": [
+ "A. They questioned his commitment to professional training",
+ "B. They passed on drafting him despite his college performance",
+ "C. They cut him following his tryout with the team",
+ "D. They rejected his request for a coaching role"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "4V6G0qYVoBg",
+ "video_duration": "60s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "Which environmental elements were prominently illuminated when 'Rock solid' was spoken?",
+ "Choice": [
+ "A. Stone pathway and torches",
+ "B. Bridge and lava stream",
+ "C. Tower structure only",
+ "D. Explosions and grassy areas"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "3uV8XZcIBbk",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What visual event occurs immediately after the camera zooms in and upward on the initial heart model?",
+ "Choice": [
+ "A. Scientists working on a complex machine",
+ "B. A mechanical heart apparatus in translucent container",
+ "C. Various artificial heart prototypes on a dark tan surface",
+ "D. Doctors performing surgery in an operating room"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "4jmMWohs1XM",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What historical event is directly associated with the appearance of the year 1863 in the video?",
+ "Choice": [
+ "A. The establishment of its current museum",
+ "B. The year of its archaeological rediscovery",
+ "C. The construction of its depicted location",
+ "D. The statue's original creation date"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "54H8ppxnp8I",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What is the FIRST major visual transition occurring after the narrator asks 'Why is this happening?' in the video?",
+ "Choice": [
+ "A. A man in a black hoodie appears with 'Fresh&Fit' text",
+ "B. A shirtless man with sunglasses speaking into a microphone",
+ "C. Man in black puffer jacket in parking garage",
+ "D. Split-screen showing two podcasters"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "4aPxxE345OM",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual change occurs during the audio statement 'Griffin Cooper is by far the coolest' and its immediate aftermath?",
+ "Choice": [
+ "A. Woman shifts attention from newspaper to conversation",
+ "B. Front door opening to reveal Wyatt entering",
+ "C. Close-up transition showing the man wearing a gray zip-up sweater",
+ "D. Camera pans to reveal Manny leaning over the shelf"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "How do the verbal components of Manny's greetings to Wyatt and Reuben fundamentally differ?",
+ "Choice": [
+ "A. Formal salutation versus cultural-themed instructions",
+ "B. Use of honorific titles versus familial nicknames",
+ "C. Only one references a planned activity during the greeting",
+ "D. Both incorporate archaic vocabulary and modern slang"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "3EQzRz-V7uE",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which visual element in the first 20 seconds most explicitly reveals the video's thematic focus?",
+ "Choice": [
+ "A. Repetitive hand movements forming bundles",
+ "B. Intermittent appearance of subscription prompts",
+ "C. Close-up of contrasting yarn colors",
+ "D. Textual title overlay on the right side"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "4H8hcvNeWtg",
+ "video_duration": "60s",
+ "video_category": "Comedy",
+ "questions": [
+ {
+ "Question": "When the first sound of whispering occurs in the 0-20s section, which visual event is simultaneously happening?",
+ "Choice": [
+ "A. Chris Hemsworth whispers to Jeremy Renner in the black suit",
+ "B. Chris Evans whispers to Chris Hemsworth in the black suit",
+ "C. Jimmy Kimmel whispers to Jeremy Renner in the black suit",
+ "D. Jimmy Kimmel whispers to Chris Evans in the blue suit"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "5Knkqo-lYF0",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which audio event occurs during Melissa's interview about her daughter's performance?",
+ "Choice": [
+ "A. Audience members cheering enthusiastically",
+ "B. Enthusiastic applause with audience laughter",
+ "C. A melodic vocal performance",
+ "D. A mother's voice stating her child's age"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "6dmTmP4Dsw8",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What visual event occurs directly after the initial instance of the actresses spinning together?",
+ "Choice": [
+ "A. Performers initiate another circular spin",
+ "B. Anna replicates Elsa's hand motion",
+ "C. Elsa generates magical patterns on the walls",
+ "D. Snow begins falling across the stage"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What specific character interaction occurs visually when Elsa asks 'Do you want to build a snowman?'",
+ "Choice": [
+ "A. Sisters clasping hands with hesitant body language",
+ "B. Snowflakes forming on the stage doors",
+ "C. Anna dancing around Elsa",
+ "D. Cool blue lighting illuminating tiered platforms"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "7MFKy7DJsCY",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What distinguishes the content of the curly-haired woman's two on-screen statements?",
+ "Choice": [
+ "A. Both discuss technical limitations of canopy penetration",
+ "B. Both analyze architectural patterns in 3D models",
+ "C. First: Details the plane's role in LiDAR operation; Second: Explains color-coding in terrain maps",
+ "D. First: Uses an X-ray vision metaphor for LiDAR; Second: Highlights data from undocumented regions"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Which technological application is shown immediately after the aircraft's appearance in the video?",
+ "Choice": [
+ "A. Scientist demonstrating X-ray soil penetration",
+ "B. Laser device generating topographic maps",
+ "C. LiDAR-equipped plane surveying the canopy",
+ "D. Circular Maya relic analysis using 3D imaging"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "7Hk9jct2ozY",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which structural components are visible adjacent to the 'Nucleus gateway' text when it first appears?",
+ "Choice": [
+ "A. Blue cytoplasmic regions and light green organelle elements",
+ "B. Green tower-shaped structures and surrounding purple formations",
+ "C. Turquoise circular membranes and brown fibrous strands",
+ "D. Purple rope-like structures and green clover-shaped objects"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What magnification levels are shown when the nucleus exterior and the final DNA strand are displayed?",
+ "Choice": [
+ "A. 500,000x vs 10,000,000x",
+ "B. 500,000x vs 1,000,000x",
+ "C. 10,000x vs 10,000,000x",
+ "D. 500,000x vs unspecified"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "6DO8yOVYXr0",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Based on the commentator's narration and visual events, which player dived first to save the ball during the rally?",
+ "Choice": [
+ "A. Donnelly, followed by Connor after a long rally",
+ "B. Donnelly made two consecutive saves",
+ "C. Connor, followed by Donnelly",
+ "D. Both players dove simultaneously after a block"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "74TEQfw6L60",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which audio event coincided with the visual of white circles forming the word 'Spark'?",
+ "Choice": [
+ "A. Rock cracking sounds",
+ "B. Ambient music with dramatic and majestic tones",
+ "C. Dramatic orchestral music playing",
+ "D. Narration describing the mountains as majestic"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which visual evidence is presented immediately as the narrator inquires about marine fossils in mountain peaks?",
+ "Choice": [
+ "A. Microscopic mineral structure examination",
+ "B. Satellite imagery of continental division",
+ "C. Field geologist collecting rock samples",
+ "D. Panoramic aerial mountain vistas"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "84EpEwIVFdU",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which event occurs FIRST chronologically?",
+ "Choice": [
+ "A. Crowd cheering during over-the-shoulder shot",
+ "B. Ronnie discusses compensating for mistakes with good breaks",
+ "C. Interviewer mentions Ronnie's apology to Barry Hawkins",
+ "D. Scoreboard displays 'LIVE' text during interview setup"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "5WIdIs3A9Ok",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual context coincides with the concluding breathing sounds?",
+ "Choice": [
+ "A. Caymans swimming underwater",
+ "B. Scarface stalking through shadows",
+ "C. Scarface camouflaged near a tree",
+ "D. Broad river view with alligator"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What factor primarily establishes Scarface's threat superiority over caimans in this ecosystem?",
+ "Choice": [
+ "A. Exclusive control of transitional territories",
+ "B. His physical size compared to caimans",
+ "C. Caimans' restricted aquatic hunting patterns",
+ "D. Seasonal migration patterns of prey"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "7R1eNHvfspk",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which synchronized audio-visual event marks the official commencement of the basketball challenge?",
+ "Choice": [
+ "A. Pink basketball release synchronized with 'Let's get it' rap",
+ "B. Javier saying 'tara, game tayo dito' while shaking hands",
+ "C. The TikTok logo appearing with record scratch sound",
+ "D. Crowd cheering during hip-hop lyrics about 'keys opening doors'"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What is the primary reason for the crowd's enthusiastic reaction during the 20-40s segment?",
+ "Choice": [
+ "A. Because Javier successfully makes a critical shot",
+ "B. Synchronized with motivational lyrics during an intense scoring sequence",
+ "C. To acknowledge the conclusion of the basketball challenge",
+ "D. Due to a competitor's TikTok profile being displayed"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which visual event coincides with Javier's spoken challenge, 'Think you can beat me?'?",
+ "Choice": [
+ "A. Text displaying '#GameTayoDito' appears",
+ "B. Woman in black shirt shaking hands",
+ "C. Crowd celebrating near basketball machines",
+ "D. Javier in a gray shirt speaking to the camera"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which event occurs FIRST chronologically?",
+ "Choice": [
+ "A. TikTok logo appears with record scratching sound",
+ "B. Crowd cheers with electronic music and record scratching effect",
+ "C. Hip-hop music plays during alternating pink basketball throws",
+ "D. Javier shakes hands to start competition"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "8-aI8Fp2bPU",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What is the primary technical purpose of Shane finishing punch combinations with knees during shadowboxing?",
+ "Choice": [
+ "A. To allow exercise modification for advanced practitioners",
+ "B. To strictly follow the countdown timer synchronization",
+ "C. To reinforce a foundational movement pattern for novices",
+ "D. To prioritize cardiovascular intensity over striking form"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What specific change occurs in Shane's attire between explaining the workout structure and demonstrating the first exercise?",
+ "Choice": [
+ "A. His footwear transitions from black shoes to running shoes",
+ "B. He removes a wristwatch that was previously visible",
+ "C. His hairstyle changes from tied-back to dreadlocks",
+ "D. The background shifts from text graphics to live action"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "7iXM5aq53Ts",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What was the correct sequence of cleaning tools applied to the MacBook's keyboard in the first 40 seconds?",
+ "Choice": [
+ "A. Yellow cloth → Vacuum → Yellow cloth",
+ "B. Yellow cloth → Compressed air → Yellow cloth",
+ "C. Vacuum → Yellow cloth → Keyboard Cleaner app",
+ "D. Vacuum → Orange microfiber → Yellow cloth"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What was the primary purpose of placing the cat on the keyboard during the demonstration?",
+ "Choice": [
+ "A. To demonstrate how pets locate food particles on electronic devices",
+ "B. To highlight the need for the Keyboard Cleaner app's input-blocking feature",
+ "C. To simulate a common scenario requiring keyboard maintenance",
+ "D. To showcase the MacBook's sensor activation by animal contact"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What audio information is presented simultaneously with the visual display of 'Keyboard Cleaner' text in the browser?",
+ "Choice": [
+ "A. Demonstration of screen suspension activation",
+ "B. Explanation of accidental input prevention",
+ "C. Tutorial for removing crumbs with a vacuum",
+ "D. Discussion about App Store approval rejection"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "7TydWUguPRU",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "During the performer's split with an upward-extended leg, which audiovisual elements define the scene's context?",
+ "Choice": [
+ "A. Pink spotlight and percussive beats",
+ "B. Crystal-like structures and chime sounds",
+ "C. Blue and purple lighting and ambient music",
+ "D. Reflective geometric shapes and ambient music"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "5xi1iD4fkhc",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Why does the speaker explicitly mention pre-cooking the organ meats in the video?",
+ "Choice": [
+ "A. To shield viewers from the unsightly nature of uncooked offal",
+ "B. Because the SPCA recipe explicitly prohibits raw meat preparation",
+ "C. To comply with the cats' documented preference for cooked textures",
+ "D. To demonstrate proper food safety protocols for feline diets"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What audio cue immediately precedes Rothko's meow?",
+ "Choice": [
+ "A. The speaker describing him as 'the loud one'",
+ "B. The speaker describing Dali as 'the bully'",
+ "C. Mention of the SPCA recipe source",
+ "D. A person discussing salt quantity in gravy"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "6NVr0cNiHPM",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What primarily distinguishes the application technique between rose gold and ladies night shades?",
+ "Choice": [
+ "A. Specific brush type required",
+ "B. Underlying skin preparation needed",
+ "C. Area of the face addressed",
+ "D. Layering technique employed"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "7D-gxaie6UI",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual element is displayed precisely when the speaker begins stating, 'This is the story of the deadliest infectious disease of all time'?",
+ "Choice": [
+ "A. A man wearing glasses and a blue blazer looking directly at the camera",
+ "B. A star-shaped blot expanding outward",
+ "C. The same man standing in front of a blackboard",
+ "D. Microscopic bacteria transitioning from a blot"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which visual element is synchronized with the audio mention of '3 million years'?",
+ "Choice": [
+ "A. Medical objects reflected in symmetrical ink patterns",
+ "B. Star-shaped blot transforming into bacteria imagery",
+ "C. A sepia-toned Sphinx image with visual effects",
+ "D. Another expanding star-shaped blot"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "80p80ynsZ78",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What difficulty level remained active when the car initiated the race sequence after stability control was enabled?",
+ "Choice": [
+ "A. Custom Assists",
+ "B. Professional",
+ "C. Simulation Mode",
+ "D. Unbeatable"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "5y5_VyEwZhc",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which event occurs immediately AFTER the dark feathers transform into live stage performers?",
+ "Choice": [
+ "A. Performers in earthy tones form a human pyramid with a hoop",
+ "B. The ballerina in a blue dress appears against white clouds",
+ "C. Text 'CIRQUE DU SOLEIL ALEGRÍA' appears in the stage corner",
+ "D. A woman demonstrates gymnastic poses inside a suspended metal ring"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "8TNPeimqOO0",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What visual event coincided with Laura Taylor's description of navigating through mist towards the iceberg?",
+ "Choice": [
+ "A. Orcas swimming near the iceberg in choppy waters",
+ "B. Scientists in orange jackets observing from the ship's deck",
+ "C. A satellite image showing the iceberg's trajectory",
+ "D. Text displaying 'EU/Copernicus Sentinel-3' credit"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which visual event coincided with the audio statement 'three times as big as New York City'?",
+ "Choice": [
+ "A. Wide ocean panorama with text 'World's largest iceberg is on the move'",
+ "B. Satellite imagery showing cloud-covered iceberg contours",
+ "C. Aerial shot of iceberg with 'Credit | EU/Copernicus' text overlay",
+ "D. Animated map tracing iceberg's path with yellow line"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which visual evidence confirms the iceberg's resumed movement after being stationary for decades?",
+ "Choice": [
+ "A. Increased marine life around the iceberg",
+ "B. Yellow trajectory line on an Antarctic map",
+ "C. Satellite imagery obscured by cloud cover",
+ "D. Text credit for Copernicus Sentinel-3"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What visual element was paired with Derek Mueller's discussion about the iceberg's ecological impact?",
+ "Choice": [
+ "A. Orcas hunting near glacial formations",
+ "B. Satellite image of iceberg under cloud cover",
+ "C. Office setting with red/black plaid shirt",
+ "D. Antarctic map with migration trajectory"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "7fnVcXfPe5c",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual elements appeared simultaneously with the speaker's first mention of filtering, threading, and pratfall effect?",
+ "Choice": [
+ "A. Red textured words and sketched figures",
+ "B. Speech bubble and frozen posture animation",
+ "C. Arrow movement and crowd formation",
+ "D. Sequential text emergence below moving figure"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What unique visual feature distinguishes the presentation of 'Hey' from other phrases during the filter demonstration?",
+ "Choice": [
+ "A. Displayed in green beneath the funnel",
+ "B. Presented simultaneously with the filter label",
+ "C. Surrounded by an animated highlight",
+ "D. Appears in a larger, bolded font"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "5fPtlxR3s3c",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Why did the announcer specifically question the decision behind Gibbs' successful shot?",
+ "Choice": [
+ "A. She launched a contested fadeaway jumper",
+ "B. The shot occurred during a defensive breakdown",
+ "C. The attempt was made with ample shot clock remaining",
+ "D. She shot immediately after securing a 50-50 ball"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "5ksVshqVuiM",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What do the contrasting trends of the blue line (1997-2000) and black line (2000-2002) on the chart explicitly reflect?",
+ "Choice": [
+ "A. Ratio of tech startups founded versus those surviving post-2002",
+ "B. Aggregate stock price trajectory during boom and bust phases",
+ "C. Market capitalization of Pets.com vs. Amazon.com's share price",
+ "D. Investor sentiment shifting from optimism to pessimism"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "7E6i3E-fsj4",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which statement did Davontay make during his initial on-camera appearance wearing a black T-shirt?",
+ "Choice": [
+ "A. 'Episode four, we got a good one... Jason Rivera.'",
+ "B. 'I got a lot of respect for them... I'm obviously trying to use that to my advantage.'",
+ "C. 'I'm like that, but I'm D3, so now.'",
+ "D. 'fast pace organized basketball that I might not know. So, I think we're just in my realm today.'"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What was the first action taken by the man in the 'We Are Richmond' hoodie after Jason Rivera declared readiness to play?",
+ "Choice": [
+ "A. The scoreboard updated to 17:58",
+ "B. Davontay contrasted his Division 3 status",
+ "C. He directly prompted Davontay for pre-game remarks",
+ "D. He solicited commentary from another player regarding matchup dynamics"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "8np5YKYx3sU",
+ "video_duration": "60s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Which visual element directly synchronized with the 'Bang bang into the room' lyrics to enhance their impact?",
+ "Choice": [
+ "A. Grid of yellow squares on dynamic backdrop",
+ "B. Illuminated 'BANG' text during choreographed spins",
+ "C. Blue spotlight patterns illuminating reflective spheres",
+ "D. Bright yellow 'BANG BANG' text on digital screen"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What simultaneous audio-visual events occurred when Ariana first held the white microphone?",
+ "Choice": [
+ "A. Male dancers in tank tops + bassline emphasis",
+ "B. Yellow grid backdrop + sustained high note",
+ "C. Reflective sphere backdrop + audience cheering",
+ "D. Stockings + layered audience claps"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "5i5Be73ESUc",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which audio event accompanied the initial display of the CHANEL EMM logo animation?",
+ "Choice": [
+ "A. 'You're my sunshine and my friend when it's hard' lyrics in soulful R&B",
+ "B. 'So yeah, I don't know...' spoken dialogue",
+ "C. Hip Hop music with synth bassline",
+ "D. 'You're the coffee that I need in the morning' lyrics in soulful R&B"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "6DbsOZU8mBM",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "When the presenter states, 'This thing right here is a giant Himalayan pink salt slab,' which visual event occurs simultaneously?",
+ "Choice": [
+ "A. The man gestures with empty hands while comparing Wagyu and Kobe beef",
+ "B. The man holds up the salt slab to display its texture",
+ "C. The man wears black gloves while handling the salt block",
+ "D. The camera zooms in on the table of meats and proteins"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What occurs right after the presenter expresses gratitude for viewer likes and explains its impact?",
+ "Choice": [
+ "A. Introducing a hamster named Pesto while requesting likes",
+ "B. Presenting an assortment of meats to be cooked",
+ "C. Adjusting the placement of salmon on the salt slab",
+ "D. Demonstrating the process of heating the salt slab"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "5iA7wZfxglE",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual element is displayed when the narrator first mentions 'unsolved mysteries and unanswered questions' in space?",
+ "Choice": [
+ "A. Animated globe with blue oceans and green land masses",
+ "B. DARK MATTER text enclosed in a puzzle piece",
+ "C. Moon in foreground with Earth in the background",
+ "D. International Space Station orbiting Earth with aurora borealis"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What visual element is shown immediately after the scene of the person writing at the desk?",
+ "Choice": [
+ "A. Animated globe with green land masses",
+ "B. Moon with Earth in the background",
+ "C. Starry cosmic background with white dots",
+ "D. Puzzle piece enclosing 'DARK MATTER'"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which visual sequence coincides with the narrator's discussion of 'gravity' and introduction of 'dark matter'?",
+ "Choice": [
+ "A. Moon with 'GRAVITY' text → 'DARK MATTER' text → animated globe",
+ "B. 'DARK MATTER' puzzle piece → moon with 'GRAVITY' text → star flicker",
+ "C. Moon with 'GRAVITY' text → star flicker → 'DARK MATTER' text",
+ "D. Moon with 'GRAVITY' text → puzzle piece outline → 'DARK MATTER' text"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "6ksaVHtfBG8",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What was shown immediately following the display of the Malaysia team lineup graphic?",
+ "Choice": [
+ "A. Supporters waving Malaysian flags and wearing black/yellow scarves",
+ "B. Players from both teams shaking hands with referees",
+ "C. Goalkeeper diving to block a shot from Son Heung-min",
+ "D. Red team players discussing tactics with yellow team players"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "7Ni4gcl4tpE",
+ "video_duration": "60s",
+ "video_category": "Travel & Events",
+ "questions": [
+ {
+ "Question": "What specific topic was discussed immediately after the aerial footage showing Kiribati's predicted submersion?",
+ "Choice": [
+ "A. Educational reforms for schoolchildren",
+ "B. The seawall's structural deterioration due to wave action",
+ "C. Population density statistics of South Tarawa",
+ "D. Government timelines for resident relocation"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Why was the seawall described as small in the video?",
+ "Choice": [
+ "A. Rising sea levels undermined its foundation",
+ "B. Lack of reinforcement against storm surges",
+ "C. Budget constraints limited its initial construction",
+ "D. Erosion from frequent wave action"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which visual event occurred as the speaker stated, 'it's a different story' while contrasting aerial and ground perspectives of Kiribati?",
+ "Choice": [
+ "A. Close-up of waves eroding a concrete sea wall",
+ "B. Man in black cap and sunglasses gesturing on a beach",
+ "C. Aerial transition from darkness to light showing island topography",
+ "D. Children laughing while wading in shallow water"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "How does the narrator's factual statement about '3 meters altitude' contrast with the concurrent visual scene?",
+ "Choice": [
+ "A. Demographic statistics vs children's educational setting",
+ "B. Geological measurement vs residential infrastructure",
+ "C. Vulnerability indicator vs normal maritime activity",
+ "D. Calm narration vs distressed environmental sounds"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What does the conversation about 'remains in the sand' suggest about former residents?",
+ "Choice": [
+ "A. They were forced to relocate abruptly due to natural disasters",
+ "B. Coastal erosion from wave action destroyed their homes",
+ "C. They abandoned the area due to soil degradation",
+ "D. Rising sea levels gradually inundated their properties"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "990ci2H9BBc",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which visual element coincided with the narrator stating 'Temperatures may also drop depending on a couple of fact'?",
+ "Choice": [
+ "A. Animated Earth and Moon against black background",
+ "B. White board with shadow bands text",
+ "C. Thermometer with red liquid in background",
+ "D. Total solar eclipse framed by tree branches"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "9QXXL96y3A0",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which visual event coincided with the audio statement: 'South Korea already had the world's lowest birth rate, but this week new data showed it had fallen further'?",
+ "Choice": [
+ "A. Man in gray hoodie speaking in front of a whiteboard",
+ "B. Yoshimasa Hayashi speaking at a podium with a Japanese flag",
+ "C. Ultrasound image of a pregnant woman with 'FALLING BIRTH RATES' text",
+ "D. Healthcare worker cradling a newborn in pink scrubs"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "AkHbB9GuCkE",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What is implied by the partial reveal of Jamie Dimon's childhood photo alongside the audio narration?",
+ "Choice": [
+ "A. His relocation to New York",
+ "B. His preference for privacy",
+ "C. His familial ties to finance",
+ "D. His early retirement plans"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which audio event coincided with the visual of a man in a black suit speaking against a blue background?",
+ "Choice": [
+ "A. 'Bank in America, J.P. Morgan Chase'",
+ "B. Dramatic orchestral music",
+ "C. 'Sitting on a throne of this financial empire...'",
+ "D. Suspenseful orchestral music"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which visual event directly follows the aerial night shot of the sprawling city?",
+ "Choice": [
+ "A. The J.P. Morgan Chase building name reveal",
+ "B. A speaker addressing an audience in a hall with totem poles",
+ "C. The text 'JAMIE DIMON' against a blue background",
+ "D. Framed photos of Jamie Dimon from childhood to adulthood"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "9Y-YJEtxHeo",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which visual is displayed when the narrator cites the statistic about Japanese workers not using their entitled vacation days?",
+ "Choice": [
+ "A. A blue screen with the Japanese flag and statistical graphics",
+ "B. Employees working overtime at computer-filled desks",
+ "C. Uptin holding a beer in a restaurant with colleagues",
+ "D. A solitary figure crossing a brightly-lit nighttime crosswalk"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "During the audio segment discussing a fatal consequence of Japan's work culture, which visual is displayed?",
+ "Choice": [
+ "A. Uptin holding a beer near izakaya-style restaurant decor",
+ "B. Employees working in a glass-walled office with visible clock displays",
+ "C. A crowded Shibuya-style pedestrian crossing during rush hour",
+ "D. Uptin in a white shirt speaking on a bustling Tokyo night street"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which visual element directly coincides with the audio discussion of unpaid overtime exceeding 80 hours per month?",
+ "Choice": [
+ "A. Employees working at desks in a busy office",
+ "B. A crowded public transportation hub",
+ "C. Salarymen walking on a platform in dark suits",
+ "D. A blue screen displaying vacation day statistics"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "8IZvEF4Ui10",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What prompted the man to redo his channel introduction?",
+ "Choice": [
+ "A. Component incompatibility in initial setup",
+ "B. Interruption by a colleague's correction",
+ "C. Technical issues with microphone feedback",
+ "D. Budget discrepancy correction"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What key contextual difference accompanied the AMD graphics card reveal compared to the initial NVIDIA introduction?",
+ "Choice": [
+ "A. Budget efficiency rationale",
+ "B. Side-by-side competitor comparison",
+ "C. Live overclocking demonstration",
+ "D. Unboxing ceremony with audience reaction"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "8e05W-wi38g",
+ "video_duration": "60s",
+ "video_category": "Travel & Events",
+ "questions": [
+ {
+ "Question": "What activity is the man in the black wetsuit most likely preparing for based on his equipment?",
+ "Choice": [
+ "A. Windsurfing",
+ "B. Scuba diving",
+ "C. Surfing",
+ "D. Deep-sea fishing"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "AF8d72mA41M",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What key misconception about LED casings does the transparent red LED demonstration disprove?",
+ "Choice": [
+ "A. They are essential for structural integrity",
+ "B. They determine the wavelength of emitted light",
+ "C. They filter emitted light to produce colors",
+ "D. They increase electrical conductivity"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "According to the video, which sequence correctly pairs LED color development with their respective inventors/companies in chronological order?",
+ "Choice": [
+ "A. Blue → Red (Holonyak/GE) → Green (Monsanto)",
+ "B. Red (Holonyak/GE) → Green (Monsanto) → Blue",
+ "C. Red (Monsanto) → Green (Holonyak/GE) → Blue",
+ "D. Green (Holonyak/GE) → Red (Monsanto) → Blue"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What was the primary technical limitation preventing LEDs from being used in general lighting before the development of blue LEDs?",
+ "Choice": [
+ "A. The production costs of blue LEDs were prohibitively high",
+ "B. Consumer electronics exclusively required monochromatic displays",
+ "C. Existing semiconductor materials could not emit blue wavelengths",
+ "D. Red and green LEDs could not combine to create white light"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which visual is displayed when the speaker states, 'But blue was almost impossible to make'?",
+ "Choice": [
+ "A. A diagram showing color mixing to create white light",
+ "B. A blue LED placed next to red and green ones",
+ "C. The speaker gesturing emphatically",
+ "D. A green LED lighting up"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "5wLv3pCqZ9o",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which audio element accompanied the display of the 'Ocean in a Bottle' text?",
+ "Choice": [
+ "A. Ethereal glass harp during oil filling",
+ "B. Somber piano music with melancholic tone",
+ "C. Calm keyboard music during water pouring",
+ "D. Somber piano music while displaying the 'Miniature Bottle' text"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "B2zhLYz4pYo",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with the conclusion of the narration about García Márquez's return home to begin writing?",
+ "Choice": [
+ "A. Text 'One Hundred Years of Solitude' in a green rectangle",
+ "B. Colonel Aureliano Buendia's father pointing to ice crystals",
+ "C. Silhouette of a building against an orange background",
+ "D. García Márquez holding a glowing object under a starry sky"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which audio-described event corresponds with the animated calendar flipping to March?",
+ "Choice": [
+ "A. The chronological setting of the novel's opening scene",
+ "B. The announcement of García Márquez's Nobel Prize",
+ "C. The 18-month period referenced in the phrase 'Over the next 18 months'",
+ "D. The timeframe for managing family finances mentioned in narration"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What distinct audio-visual technique differentiates the firing squad sequence from the ice discovery scene?",
+ "Choice": [
+ "A. Rapid scene transitions matching musical crescendos",
+ "B. Superimposed text panels versus purely environmental storytelling",
+ "C. Synchronized character dialogue contrasting with narration",
+ "D. Monochromatic visuals opposing chromatic symbolism"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "A30IuIjQYYg",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which audio event coincided with the screen fading to black before resolving into a blurry, light-filled shot?",
+ "Choice": [
+ "A. 'I finally got the chance to see Thailand for myself' spoken",
+ "B. 'I've always wanted to visit this country...' spoken",
+ "C. Background boat sounds fading",
+ "D. 'Standing right beside...' sung"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which visual event is synchronized with the lyrics 'Nobody, Can’t hold us down'?",
+ "Choice": [
+ "A. Golden reclining Buddha statue",
+ "B. Farmer in red shirt working the land",
+ "C. Traditional boat with colorful ribbons",
+ "D. Stone structure with circular patterns"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "9dQXiqS56Y4",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which audio event coincided with the close-up of Maxi Rodriguez wearing jersey number 18?",
+ "Choice": [
+ "A. Referee's whistle blowing",
+ "B. Crowd clapping and cheering sound",
+ "C. Announcer discussing Messi's historic World Cup debut",
+ "D. Background music playing"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What occurred as the announcer stated, 'Lionel Messi becomes the youngest Argentine player to play at a World Cup since 1934'?",
+ "Choice": [
+ "A. A fan celebrated by raising arms",
+ "B. Messi adjusted his shorts",
+ "C. Maradona appeared in the crowd",
+ "D. Messi received a hug from Rodriguez"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "9jjTGpWmc5U",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Which synchronized audio-visual cues indicate the initial moment the boys perceive a threat?",
+ "Choice": [
+ "A. Binocular usage coinciding with upward-floating furniture",
+ "B. Playful banter during ballgame against cityscape backdrop",
+ "C. Anxious vocal inflection describing lunar movement paired with celestial magnification",
+ "D. Excited bird observation with sudden explosive sounds"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "9tBsMSDoDqk",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with Kevin Li stating, 'I do magic'?",
+ "Choice": [
+ "A. Kevin gestures near a board with eight photos",
+ "B. A judge in a denim jacket asks a question",
+ "C. A man in a black suit points upwards",
+ "D. The audience cheers and claps"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Why does Kevin Li reference his family history during his AGT introduction?",
+ "Choice": [
+ "A. To evoke shared family viewing experiences with the audience",
+ "B. To complement his outfit's color scheme with stage lighting",
+ "C. To introduce the psychological concept behind his magic trick",
+ "D. To acknowledge the judges' impact on his magical journey"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "A9fQPzZ1-hg",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual element coincided with the narrator stating, 'began tearing up streets and erecting barriers throughout the city of Berlin'?",
+ "Choice": [
+ "A. A woman holding a baby gazes out of a window",
+ "B. A countdown timer showing the number 3",
+ "C. A crane depicted against the night sky",
+ "D. Soldiers shown in military uniforms"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "AnWKOvjFl8k",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What critical role does the green fluid in the dart play in the villain's scheme?",
+ "Choice": [
+ "A. To incapacitate specific targets through direct injection",
+ "B. To maintain his signature clown appearance",
+ "C. To chemically alter Gotham's water distribution",
+ "D. To fuel propulsion systems in aquatic vehicles"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "AFFZjzuAHMg",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What was the chronological order of major visual events in the first 20 seconds, excluding opening logos?",
+ "Choice": [
+ "A. Protestors with signs → Shouting KKK members → Lamp-focused office scene",
+ "B. Confederate flag bearers → Office bookshelf scene → 'Racist Violence' sign holders",
+ "C. Robed KKK members with Confederate flags → Diverse protestors with 'Free Bernie' signs → Suited man in office",
+ "D. Legal Eagle logo transition → Suited man speaking → Burning cross imagery"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What contextual cue prompts the speaker's mention of 'Ku Klux Klan members show up to your trial'?",
+ "Choice": [
+ "A. Simultaneous playback of KKK rally footage in the picture-in-picture",
+ "B. Protest signs criticizing systemic racism in the background",
+ "C. Transition graphics featuring Legal Eagle's winged logo",
+ "D. Confederate flags visible in the courtroom scene"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "8UxGzDeRIJk",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What is the most accurate interpretation of 'They feature the world leader' in the context of the race?",
+ "Choice": [
+ "A. The athlete leading at the halfway mark",
+ "B. The host nation's top contender",
+ "C. The current world record holder",
+ "D. The highest-ranked competitor globally"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What immediately followed the announcement '25 laps of the temporary stadium'?",
+ "Choice": [
+ "A. Camera panned out to show full track/stadium",
+ "B. Athletes began running around the track",
+ "C. Display board listed the top runners",
+ "D. A Japanese runner was highlighted on screen"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What visual element coincided with the audience's applause when the announcer mentioned 'Huge group of national champions'?",
+ "Choice": [
+ "A. Runners waving to spectators",
+ "B. Runner names displayed on screen",
+ "C. Children forming a guard of honor",
+ "D. Stadium overview with red seating"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "Atf_Af1q_5w",
+ "video_duration": "60s",
+ "video_category": "Travel & Events",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with the audio statement 'Extraordinary part of the world'?",
+ "Choice": [
+ "A. Ornately carved white structure against green trees",
+ "B. Narrow misty mountain road with driving car",
+ "C. Aerial view of lighthouse overlooking rocky coastline",
+ "D. Close-up of Māori building's red roof against trees"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What visual transition occurs immediately AFTER the couple raises their arms in triumph on the cliff?",
+ "Choice": [
+ "A. Couple sits by the sea",
+ "B. Aerial view of a white lighthouse",
+ "C. Cave-framed beach scene emerges",
+ "D. Text 'They would say...Welcome' appears"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "According to the narrator's commentary, why are the people screaming while wearing life vests on the boat?",
+ "Choice": [
+ "A. They unexpectedly encounter a pod of dolphins",
+ "B. They experience an adventurous initiation described by the narrator",
+ "C. They approach a geothermal area mentioned as a natural wonder",
+ "D. The boat encounters turbulent waves shown splashing onboard"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What visual event immediately follows the dolphin spotting and precedes the river boat scene?",
+ "Choice": [
+ "A. Overhead shot of a geothermal hot spring with wooden walkways",
+ "B. Close-up aerial view of multicolored geothermal pools",
+ "C. Tourists experiencing whitewater rapids in life vests",
+ "D. Diverse group assembling on deck displaying cultural inclusivity text"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "9aGgBzToUz4",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which audio statement coincides with the woman initially raising the Taylor Swift-themed cup during her New Year greeting?",
+ "Choice": [
+ "A. 'Happy New Year, guys...'",
+ "B. 'me getting ready to research every outfit'",
+ "C. 'So I know how many of these bad boys to make'",
+ "D. 'Can you believe it? It is 2024...'"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What action occurs right after the woman finishes saying 'I finally watched Ted Lasso' in the first 20 seconds?",
+ "Choice": [
+ "A. Adjusting her hair to reveal a gold watch",
+ "B. Gesturing toward the camera with her index finger",
+ "C. Briefly touching her necklace spelling 'R A S'",
+ "D. Holding up a Taylor Swift: The Eras Tour cup"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "B6tQyCH5hQM",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which audio event coincided with the initial display of ceramic plates decorated with pink flowers?",
+ "Choice": [
+ "A. Plastic bags being opened",
+ "B. Classical piano music",
+ "C. Sound of a drawer opening",
+ "D. Sound of pottery tools being placed"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the chronological order of distinct audio events heard in the initial 40s of the video?",
+ "Choice": [
+ "A. Bags opened → Drawer opens → Piano music begins",
+ "B. Tools placed → Bags opened → Drawer closes",
+ "C. Drawer opens → Tools placed → Bags opened",
+ "D. Piano music begins → Drawer opens → Tools placed"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "When the 'Hi!' graphic appears, what newly placed items become visible on the work surface?",
+ "Choice": [
+ "A. Ceramic plate with floral design",
+ "B. Specialized clay shaping tools",
+ "C. Powder-filled bags and empty jars",
+ "D. Powder-filled jars and bags"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "9HPFkUhOp1Q",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "Which narrated magic effect coincided with the magician clapping hands over the green table?",
+ "Choice": [
+ "A. 'The coin jumps from hand to hand'",
+ "B. 'The box disappears'",
+ "C. 'The ring jumps onto the magician's finger'",
+ "D. 'The queen turns to a card'"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What contrasting elements defined the backgrounds during the musical interlude versus Oscar Owen's introduction?",
+ "Choice": [
+ "A. Solid black backdrop vs warm lighting",
+ "B. Performance table vs editing software display",
+ "C. Card illustrations vs audio equipment",
+ "D. Text overlay vs patterned wall"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "9v9NHoTXc-Q",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What is the primary purpose of synchronizing ceremonial brass music with three identical men in a peach-colored room?",
+ "Choice": [
+ "A. Highlight multiple character backstories",
+ "B. Emphasize a time loop narrative",
+ "C. Symbolize artistic collaboration",
+ "D. Reinforce brand identity through surrealism"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "BjmPvovl-V4",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which event occurred FIRST chronologically?",
+ "Choice": [
+ "A. Cosmic view of the Milky Way galaxy",
+ "B. Animated electric blue streaks display",
+ "C. Two women walking past 'Speed Limit 20' sign",
+ "D. Kmele Foster and Christof Koch conversing indoors"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What is the man's primary reason for entering the New York Academy of Medicine building?",
+ "Choice": [
+ "A. To present findings on urban bike lane safety",
+ "B. To attend a lecture on galaxy formation",
+ "C. To discuss the scientific study of consciousness",
+ "D. To access medical research archives"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which visual event coincided with the audio question 'But what exactly is consciousness?' being spoken?",
+ "Choice": [
+ "A. Man opening doors with stone pillars",
+ "B. Turtles sunbathing on a log",
+ "C. Empty benches in Central Park",
+ "D. American flag in conversation scene"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "9edWX7TTsLw",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with the audio statement 'We need chips, we need energy, we need food' in the documentary?",
+ "Choice": [
+ "A. Complex automated machine assembling components",
+ "B. Sprinkler systems irrigating agricultural field",
+ "C. Man in dark blazer speaking near water body",
+ "D. Grey-haired man gesturing in light purple shirt"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "B9He5DVePvk",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "How does the visual of a woman using an old washing machine align with the audio's reference to 'refusal of technological advancements' in Amish communities?",
+ "Choice": [
+ "A. To highlight generational differences in technology usage",
+ "B. To contrast their lifestyle with 21st-century industrial practices",
+ "C. To demonstrate pragmatic adoption of energy-efficient appliances",
+ "D. To reinforce doctrinal prohibitions against mechanical innovations"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "BoBapnk6TB8",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which event occurred IMMEDIATELY after the diver submerged underwater?",
+ "Choice": [
+ "A. Documenting observations on a submerged whiteboard",
+ "B. Submersible vehicle scanning the seafloor topography",
+ "C. Animated depiction of ancient lighthouse beacons",
+ "D. 3D reconstruction of a historical shipwreck appearing"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "BkiTScinYOQ",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What visual event directly coincided with the initial crowd cheering during the 2018 Portugal vs Spain match?",
+ "Choice": [
+ "A. Scoreboard updated to Portugal 3 - Spain 3",
+ "B. Ball struck the net behind goalkeeper in blue",
+ "C. Team formed huddle near corner flag",
+ "D. Ronaldo raised arm while exiting field"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "ACvLYz9nHvg",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Why did the blocked shot fail to prevent the basket?",
+ "Choice": [
+ "A. The defender committed a reach-in foul during the attempt",
+ "B. The shooter released the ball before defensive contact",
+ "C. The block was ruled as goaltending",
+ "D. The ball caromed off the defender's hand into the hoop"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "B7-6lRo2m4Y",
+ "video_duration": "60s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "Why does the penguin chick initially seek shelter under adult penguins' legs?",
+ "Choice": [
+ "A. To escape the blizzard's wind exposure",
+ "B. Its down coat provides insufficient warmth",
+ "C. To absorb heat from adult huddling behavior",
+ "D. The disappearance of its biological mother"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "How does the mother penguin's vocalization contrast with the narrator's description of other mothers' behavior?",
+ "Choice": [
+ "A. It occurs during environmental chaos vs controlled conditions",
+ "B. It features frequency modulation vs monotonal narration",
+ "C. It demonstrates persistent maternal effort vs collective neglect",
+ "D. It aligns with chick movements vs adult stillness"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What occurs visually in the colony precisely when the mother penguin's final call fades from the audio?",
+ "Choice": [
+ "A. Collective chick vocalizations draw maternal responses",
+ "B. The chick nestles securely under an adult's plumage",
+ "C. Adult penguins synchronize their protective huddling",
+ "D. A solitary chick initiates independent movement through snow"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "CHlJdMVLV2s",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What visual context was present when the female commentator praised the somersault technique?",
+ "Choice": [
+ "A. Two divers in yellow-trimmed trunks performing twists",
+ "B. Graphic overlay identifying Stephan Feck",
+ "C. Spectators in stands during platform preparation",
+ "D. Overhead view of synchronized water entry"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "C6aeL83z_9Y",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What legal and functional contrast is highlighted by the Uber car door and unicorn mask split-screens?",
+ "Choice": [
+ "A. One exposes data privacy violations; the other questions blockchain scalability",
+ "B. One demonstrates corporate compliance; the other advocates NFT environmental impact",
+ "C. One illustrates circumventing labor laws; the other dismisses aspirational solutions to NFT utility",
+ "D. Both criticize deceptive marketing tactics in tech industries"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Which event occurs first chronologically in the video?",
+ "Choice": [
+ "A. A split screen shows the Uber car door and a woman with a surgical mask",
+ "B. The car flies off the cliff with smoke",
+ "C. A car drives past a sign reading 'BORING REGULATIONS'",
+ "D. The Uber car door with app stickers is shown"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "BxrolAZXZFg",
+ "video_duration": "60s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "What visual event occurred immediately after Tom gestured with his finger while mentioning a 'secret weapon'?",
+ "Choice": [
+ "A. FOX SQUARE news anchors discussing Adopt a Shelter Pet Day",
+ "B. Close-up of Peedy’s blue harness and leash",
+ "C. Exterior shot of TOM DAVIS INTERNATIONAL UPSTATE CANINE ACADEMY",
+ "D. Title card 'HOW TO SAFELY... TRIM YOUR DOG'S NAILS'"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Why does Tom mention that *'a lot of dogs don’t like their feet being touched'* while interacting with Peedy?",
+ "Choice": [
+ "A. To warn about common grooming injuries",
+ "B. To outline the primary cause of canine resistance to nail care",
+ "C. To showcase Peedy's tolerance for handling",
+ "D. To mask the sound of the Dremel tool"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "CKYw_RG7i7w",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What distinguishes the metaphors used by the female speaker in the red leather jacket from those of the woman in the black jacket during their respective speeches?",
+ "Choice": [
+ "A. She employs corporate jargon instead of emotional language",
+ "B. She uses automotive comparisons rather than relational conflict",
+ "C. Her analogies emphasize spatial design rather than material quality",
+ "D. Her references focus on seasonal motifs instead of historical events"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Which audio quote coincides with the GUCCI storefront display shown in the video?",
+ "Choice": [
+ "A. 'There are a lot of rumors around'",
+ "B. 'It's better to cry'",
+ "C. 'The story at the heart of Gucci is extraordinary'",
+ "D. 'Sometimes things happen in life that change your life. This was really success beyond my wildest'"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "Bgh-2z6P0ao",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What was the sequence of score changes following Fan Zhendong's initial serve as shown in the 0-20s segment, considering subsequent gameplay developments?",
+ "Choice": [
+ "A. 1-0 → 1-1 → 2-1",
+ "B. 1-0 → 2-0 → 2-1",
+ "C. 0-1 → 1-0 → 2-0",
+ "D. 1-0 → 2-0 → 1-1"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What specific relationship between Fan Zhendong and Wang Chuqin is explicitly mentioned during the match commentary?",
+ "Choice": [
+ "A. Rivals in previous tournaments",
+ "B. Former coach and student",
+ "C. Doubles partners",
+ "D. Teammates on the national team"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "B7Hh0PY1kks",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with the man's statement 'I got no control of what he eats'?",
+ "Choice": [
+ "A. The first child consumes processed meat",
+ "B. The first baby drinks from a cup",
+ "C. A baby eats cereal in a highchair",
+ "D. The first baby eats fries with a fork"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which visual example supports the woman's statement about 'tooth decay, iron deficiency anemia'?",
+ "Choice": [
+ "A. A child drinks from a purple bottle as a man discusses soda intake",
+ "B. A refrigerator stocked with beer and wrapped meat",
+ "C. A toddler grabs fried chicken from a patterned floor",
+ "D. A baby eats cereal in a highchair"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "ByUONqA1IJY",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What is the primary reason Luis Suarez is seen lying on the field with his face obscured in the 40-60s segment?",
+ "Choice": [
+ "A. Recovering from a physical collision with an opponent",
+ "B. Overcome with relief after a high-pressure goal",
+ "C. Emotionally processing the goal's impact on the match",
+ "D. Experiencing muscle cramps from earlier play"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which national team's supporters are visually identified by the crowd's blue/white colors during the 'finish them!' audio cue?",
+ "Choice": [
+ "A. England",
+ "B. Argentina",
+ "C. Uruguay",
+ "D. Brazil"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which commentator statement coincides with Luis Suarez scoring into the white goal?",
+ "Choice": [
+ "A. \"Suarez in behind England\"",
+ "B. \"Suarez wins it surely\"",
+ "C. \"One touch to control it and bang\"",
+ "D. \"Uruguay take advantage of more woeful defending\""
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "BiJtPU9uP0c",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "How do the initial structural folds differ between the fox and dog origami?",
+ "Choice": [
+ "A. The dog uses vertical creases for ear definition, while the fox uses horizontal",
+ "B. Both begin with diagonal folds, but the dog adds lateral folds to create a diamond base",
+ "C. The fox requires unfolding after the diagonal fold to refine the head shape",
+ "D. Only the dog skips the corner-alignment step used in the fox's ears"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "CC9VeHrI3Es",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What mechanism enables the stabilization of processed food prices despite daily corn price volatility?",
+ "Choice": [
+ "A. Farmers directly negotiate retail pricing with consumers",
+ "B. Futures markets mediate price agreements between parties",
+ "C. Retail competition dynamically adjusts consumer prices",
+ "D. Pre-negotiated contracts between producers and users fix costs"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What is the primary symbolic meaning of the green stars surrounding the futures market illustration?",
+ "Choice": [
+ "A. Environmental sustainability efforts",
+ "B. Effective risk mitigation strategies",
+ "C. Long-term agricultural contracts",
+ "D. Price fluctuation patterns"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which visual element was actively drawn when the speaker explained the cereal company's strategy to purchase corn?",
+ "Choice": [
+ "A. Factory with chimneys",
+ "B. Downward arrow dollar sign",
+ "C. Upward arrow dollar sign",
+ "D. Farmer illustration"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "BVJGvKaFG7U",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event directly coincides with the lyric 'friendship ceases from now on' in the song?",
+ "Choice": [
+ "A. Rhinos walking together",
+ "B. Baby giraffe nuzzling adult",
+ "C. Giraffe herd moving between enclosures",
+ "D. Baby rhino lying on its back"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "913JhnW28U4",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What contextual clue connects the audio mention of 'Dressel' and 'Chalmers' to the visual scene?",
+ "Choice": [
+ "A. They are shown in adjacent lanes as Pan wins the race",
+ "B. They are presenters at the award ceremony handing out mascots",
+ "C. Their names are displayed on the live results board during the race",
+ "D. The commentator contrasts them as top competitors absent from the current medal ceremony"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "B6fvT2LKEDI",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What auditory response occurred right after Sophie mentioned her father's handmade items?",
+ "Choice": [
+ "A. A judge praising her family's creativity",
+ "B. A judge inquiring about her contortion training",
+ "C. Collective laughter from the judges' table",
+ "D. Extended audience cheering and whistling"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "8y2Hj7GDcwI",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which city is hosting the championship event based on synchronized audio-visual cues?",
+ "Choice": [
+ "A. Auckland, New Zealand",
+ "B. Glasgow, Scotland",
+ "C. Belgrade, Serbia",
+ "D. Birmingham, England"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "BEIVOKz4zXw",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What role did the dark-haired woman primarily fulfill by holding the microphone across multiple scenes?",
+ "Choice": [
+ "A. Coordinator for audience interactions",
+ "B. Lead vocalist for female performers",
+ "C. Technician resolving audio issues",
+ "D. Backup singer for male performers"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What narrative element is emphasized by the alternating shots of male and female groups during the line 'Summer loving had me a blast'?",
+ "Choice": [
+ "A. A call-and-response musical structure",
+ "B. Preparation for solo performances later in the act",
+ "C. Romantic tension between the groups",
+ "D. Technical limitations in filming group choreography"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "BrrDT_uYQsg",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "During which visual event is the audio line 'What an inspired idea to bring them over to Rio' spoken?",
+ "Choice": [
+ "A. When one gymnast steps forward to wave",
+ "B. As the group first enters the stage",
+ "C. While the audience stands for applause",
+ "D. During a synchronized backflip formation"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "E-YnmCB6yjo",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Based on the report, why was the pregnant woman denied an emergency C-section at the hospital?",
+ "Choice": [
+ "A. Medical staff identified discrepancies in her prenatal test results",
+ "B. The hospital's policy required upfront payment from uninsured patients",
+ "C. The hospital lacked the necessary neonatal intensive care facilities",
+ "D. Her undocumented immigration status disqualified her from receiving care"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What visual event immediately followed the sound of rocks tumbling and debris falling?",
+ "Choice": [
+ "A. A Canadian couple smiling on a beach",
+ "B. CBC News investigation title card",
+ "C. Medical personnel attending to a newborn in a bassinet",
+ "D. Injured couple in hospital beds"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What can be inferred about the Canadian couple following the Taiwan earthquake?",
+ "Choice": [
+ "A. They were residents of Toronto",
+ "B. They required emergency medical treatment",
+ "C. They were documenting the disaster for CBC News",
+ "D. They had recently adopted a newborn"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "D61jenC5oBA",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which coordinated movement occurs during the lyric 'I want you to hold out the palm of your hand'?",
+ "Choice": [
+ "A. Solo trampoline performance in black-and-white",
+ "B. Woman stands upright on partner's feet",
+ "C. Partners clasp hands in a bridge pose",
+ "D. Male partner dismounts from aerial split"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which audio event occurs as the woman in a bright yellow sports bra is lowered into a split position while balancing the man overhead on a purple mat?",
+ "Choice": [
+ "A. Energetic rock music with guitar riffs and bassline",
+ "B. 'Come on Harry, we wanna say good night to you' spoken during partner acrobatics",
+ "C. 'Holding me back, gravity's holding me back' with synthesizers and drums",
+ "D. 'The one who stayed. In this world it's just us' with synth pads and keyboard"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "EP2rD6lr2Xk",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What event occurs immediately AFTER the aerial shot of swimmers competing in the lake?",
+ "Choice": [
+ "A. Arno embraces his friend near the finish area",
+ "B. Arno discusses his 8-month training journey",
+ "C. Swimmers begin wading through shallow water",
+ "D. A cyclist with a red heart emblem passes crowds"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "CwzjlmBLfrQ",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What distinct musical characteristic emerged in the 40-60s segment that was absent in the 20-40s performance?",
+ "Choice": [
+ "A. Improvisational violin solos",
+ "B. Digital audio manipulation effects",
+ "C. Pulsating rhythmic foundation",
+ "D. Prominent brass fanfare sections"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Which visual event occurs simultaneously with the tense orchestral music featuring a driving beat?",
+ "Choice": [
+ "A. Backdrop illuminated with blue and purple lights",
+ "B. Conductor's initial walk through crowd",
+ "C. Mr. Bean playing a keyboard",
+ "D. Violinists performing with translucent stadium overlay"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "How does the integration of audio-visual elements in the stadium sequence demonstrate production sophistication?",
+ "Choice": [
+ "A. Comedic character actions are precisely timed to musical accents",
+ "B. Live audience reactions were overdubbed in post-production",
+ "C. Lighting transitions mirror dynamic shifts in musical tempo",
+ "D. Musicians' sheet music displays visible last-minute revisions"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which event occurred FIRST after the orchestra began playing?",
+ "Choice": [
+ "A. Mr. Bean leading the orchestra",
+ "B. Rowan Atkinson's close-up appears",
+ "C. Two violinists play with translucent overlay",
+ "D. Stadium pans out to show full audience"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which visual event coincided with the female announcer stating 'Ladies and gentlemen, please welcome Sir Simon Rattle'?",
+ "Choice": [
+ "A. Conductor bowing to applauding audience",
+ "B. Violinists performing against translucent backdrop",
+ "C. Sir Simon Rattle walking through the crowd smiling",
+ "D. Aerial shot of stadium transitioning between colored lights"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "EavQ0LtEICY",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "According to the host's stated rules, what specific action caused elimination in the first round?",
+ "Choice": [
+ "A. Committing three traveling violations",
+ "B. Touching the ball with both feet out of bounds",
+ "C. Failing to pass the ball within five seconds",
+ "D. Being the last player to successfully make a basket"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What prize is explicitly stated as the reward for the winner in the game explanation?",
+ "Choice": [
+ "A. A car from the '4 Million Road' segment",
+ "B. A signed basketball jersey",
+ "C. A trip to space shown in special effects",
+ "D. A new pair of shoes"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "EeTmrsVW8qE",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What key factor does Willen attribute to the Balkan region being 'ripe with talent' based on synchronized audio-visual context?",
+ "Choice": [
+ "A. Early demonstration of technical mastery",
+ "B. Genetic predisposition for exceptional height",
+ "C. Rapid construction of modern basketball facilities",
+ "D. Strategic international coaching partnerships"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+          "Question": "When the narrator describes the Kalemegdan Fortress courts' construction with bleachers, which visual arrangement appears?",
+ "Choice": [
+ "A. Construction site photo with workers laying foundation",
+ "B. Serbian coat of arms beside a faded basketball hoop",
+ "C. Vintage document showing Red Star and Partizan logos",
+ "D. Network of interconnected basketball team photos with white lines"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "DK6bKtXUE1c",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Why does the commentator exclaim 'Greek fans love that' immediately after Tentoglou's jump?",
+ "Choice": [
+ "A. He achieved a new personal best in the competition",
+ "B. His uniform prominently displayed the Greek national colors",
+ "C. He maintained perfect balance during the landing phase",
+ "D. He rose to his feet following a decisive performance"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What audio commentary is heard as the Jamaican athlete hugs Tentoglou?",
+ "Choice": [
+ "A. The sound of multiple athletes clapping",
+ "B. 'Another global title to Greece'",
+ "C. 'Miltiadis Tentoglou has done it again'",
+ "D. 'Mattia Furlani, a 19-year-old'"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "DF_J3vCcbBA",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Why must researchers collect lava samples rapidly?",
+ "Choice": [
+ "A. Rapid collection avoids heat exposure through gloves",
+ "B. To minimize discomfort from ambient heat",
+ "C. To prevent the sampling tool from melting",
+ "D. To ensure the sample remains uncontaminated"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "CoPfYGgzCkM",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual element directly illustrates the historical context mentioned when the narrator discusses intentional violations of protected status in warfare?",
+ "Choice": [
+ "A. Outdoor military tents with soldiers in ceremonial uniforms",
+ "B. A formal dining room with aristocratic decor and a seated figure",
+ "C. Soldiers in gray uniforms aiming rifles at surrendering medics",
+ "D. Medics in green uniforms collapsing with visible blood"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "DfZIv2FKWj4",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What did the England team's body language indicate during the Samoa Siva Tau performance?",
+ "Choice": [
+ "A. They engaged in tactical discussions",
+ "B. They maintained still attentiveness",
+ "C. They synchronized movements with the chant",
+ "D. They exhibited tension through closed postures"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What was the sequence of audio events from when England players emerged until Samoa's pre-game ritual began?",
+ "Choice": [
+ "A. Speech about preparation → Crowd cheering → Samoan chant",
+ "B. Stadium music → Speech → Crowd cheering",
+ "C. Stadium music → Crowd cheering → Samoan chant",
+ "D. Samoan chant → Stadium music → Speech"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "EQ-67udZEeg",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What visual event occurred immediately AFTER Jessica's bar graph with '??' votes was shown?",
+ "Choice": [
+ "A. A green '500' number appeared",
+ "B. Percentage sign with students appeared",
+ "C. Three index card shapes appeared",
+ "D. A '100' number appeared next to an upward arrow"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What concept does the magnifying glass sketch visually reinforce when paired with the audio explanation?",
+ "Choice": [
+ "A. Evolution of stock market indices",
+ "B. Overview of market performance trends",
+ "C. In-depth individual stock evaluation",
+ "D. Statistical sampling methodology"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What fundamental characteristic makes Jessica's polling method analogous to a stock index?",
+ "Choice": [
+ "A. Both measure longitudinal trends",
+ "B. Both employ weighted calculations",
+ "C. Both generate probabilistic snapshots",
+ "D. Both utilize comprehensive data collection"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "DI6SemRT2iY",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What visual event occurred FIRST after the underwater shot with blue light shafts?",
+ "Choice": [
+ "A. Animation of tectonic plate boundaries",
+ "B. Microscopic view of green microbial chains",
+ "C. Coastal landscape with pebble shore and stormy sky",
+ "D. Graphic of Earth's cross-section highlighting core layers"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Based on the graphical data and geological cross-sections, how did the Mesoproterozoic tropics' temperature compare to the Earth's mantle during that era?",
+ "Choice": [
+ "A. Tropics were significantly hotter due to solar radiation",
+ "B. Mantle's heat made tropics relatively cooler",
+ "C. Both regions exhibited similar thermal stability",
+ "D. Mantle temperatures were less variable than the tropics"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "F4v0BZtrSbY",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What was the immediate reaction following Koji Iwasaki's response to the judge's question about his act?",
+ "Choice": [
+ "A. Simon Cowell appeared unamused",
+ "B. The judges erupted in laughter",
+ "C. Koji began explaining his dream to perform",
+ "D. A judge inquired about prize money plans"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "D52rTzibFRc",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which sequence of visual events occurs immediately after the logo disappears, based on audio-visual synchronization?",
+ "Choice": [
+ "A. Black screen with text → molecular view → smoke visualization",
+ "B. Black screen with text → sunny field → Earth from space",
+ "C. Purple mountains → ice cubes added → red dots shown",
+ "D. Earth from space → thermometer demo → aquarium experiment"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "How does the Earth-from-space scene DIFFER from the aquarium experiment in explaining wind formation?",
+ "Choice": [
+ "A. One uses digital simulations, the other physical models",
+ "B. One visualizes molecular interactions, the other planetary systems",
+ "C. One explains atmospheric pressure principles, the other demonstrates their application",
+ "D. One measures temperature changes, the other manipulates temperature extremes"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What visual demonstration occurs when the narrator cautions 'Kids, don't do'?",
+ "Choice": [
+ "A. Lamp continuously heating one end",
+ "B. Black shield placed atop the tank",
+ "C. Ice cubes deposited on opposite side",
+ "D. Smoke revealing air circulation patterns"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "EVBCLAaflaw",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which audio event coincided with the visual display of Jerry R. Ehman's annotated 'Wow!' note alongside the SETI acronym and its full name?",
+ "Choice": [
+ "A. Radio static heard over flickering horizontal lines",
+ "B. Narrator describing the Wow signal's intensity and 72-second duration",
+ "C. Male speaker explaining the origin of the 'Wow!' name and SETI's establishment by NASA",
+ "D. Male voice stating 'On August 15th, 19'"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which visual element appears immediately after the narrator mentions the Wow signal's 72-second duration in the audio track?",
+ "Choice": [
+ "A. Rotating Earth visualization with continental outlines",
+ "B. Page of alphanumeric data with red-circled values",
+ "C. Close-up of 'Wow!' annotation in red ink",
+ "D. Aerial view of Big Ear telescope's rectangular receiver"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "CxBbw8eT624",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual element appeared simultaneously with the mechanical noise following the transparent skull rotation?",
+ "Choice": [
+ "A. Neuralink logo over red building",
+ "B. 'Patient Zero' text overlay",
+ "C. Battery-powered computer device graphic",
+ "D. Tesla Space logo and web address"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What type of audio is featured when the Tesla Space logo is first displayed?",
+ "Choice": [
+ "A. Mechanical whirring and clanking",
+ "B. Medical monitor beeps",
+ "C. Phone notification sounds",
+ "D. Calm ambient electronic music"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "E83XxGvEPtI",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What is the correct chronological order of actions occurring in the first 20 seconds after the alarm stops?",
+ "Choice": [
+ "A. Adjusts phone holder → Checks screen → Turns off alarm → YouTube plaque shown",
+ "B. Plugs in cord → Adjusts phone holder → Says 'Perfect' → Checks screen",
+ "C. Checks screen → Plugs in cord → Adjusts phone holder → Takes black box",
+ "D. Turns off alarm → Plugs in cord → Adjusts phone holder → Checks screen"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which audio event occurs during the man's first interaction with electrical equipment before checking his appearance?",
+ "Choice": [
+ "A. Male voice exclaiming 'Perfect'",
+ "B. Male speaker explaining his daily routine",
+ "C. Ongoing electronic music with heavy bass",
+ "D. Alarm clock ringing"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "EuBhksmj0sQ",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which audio event occurred when the stage backdrop first displayed palm trees against a colorful sky?",
+ "Choice": [
+ "A. Repetitive synthetic drum machine patterns",
+ "B. Harsh electronic synths layered with basslines",
+ "C. Film score with orchestral drums and bass building intensity",
+ "D. Percussive beats using traditional instruments"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "DoFmXBjRGmk",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What is the man's primary occupation based on his morning activities?",
+ "Choice": [
+ "A. Professional video producer",
+ "B. Software development educator",
+ "C. Ride-sharing service driver",
+ "D. University admissions officer"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "CezlmUwMXNo",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What visual event immediately followed the woman's statement 'I'm going to have an abortion'?",
+ "Choice": [
+ "A. Transition to a courtroom with 'Northern District Texas' signage",
+ "B. Display of 'Roe v. Wade' text with Supreme Court citation",
+ "C. Woman in blue dress responding about Texas law",
+ "D. A judge holding a gavel in a black robe"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "ECgVLceb_LE",
+ "video_duration": "60s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "When does the glowing 'XPENG' text on the car's rear become visible in the video?",
+ "Choice": [
+ "A. When the steering wheel is shown in close-up",
+ "B. As the car transitions from silver to blue",
+ "C. During the demonstration of propeller functionality",
+ "D. When the electronic music begins playing"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "Ei-TcECJVXU",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What specific engineering problem does the woman address when mentioning 'tiny little can' systems on the ISS?",
+ "Choice": [
+ "A. Reducing structural weight for orbital deployment",
+ "B. Replicating Earth's life-sustaining processes",
+ "C. Managing radiation exposure during spacewalks",
+ "D. Optimizing solar energy collection efficiency"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "Fibq4aA-GVc",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which audio-visual pair occurs when the man in the red baseball cap first addresses the driver?",
+ "Choice": [
+ "A. 'Good thing' spoken - Driver gestures while replying",
+ "B. 'Is that bad?' spoken - Man leans toward driver's seat",
+ "C. 'What?' spoken - Red cap turns head sharply",
+ "D. Crash sound effect - Paper tape graphic displays 'extended'"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "GWNzRtKzIRA",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which object visible in the man's environment provides the strongest evidence for his professional field?",
+ "Choice": [
+ "A. Clapperboard on the shelving unit",
+ "B. Vintage typewriter with engineering components",
+ "C. Custom stamp featuring the typewriter",
+ "D. Books about mechanical design"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "GDzuMTvHnJE",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which visual event directly follows the audience's applause in the video timeline?",
+ "Choice": [
+ "A. Yard view with blue trash bin",
+ "B. Line of people outside 'Job Plus'",
+ "C. Politician speaking at light-blue backdrop",
+ "D. People entering a building"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "FQd5bo9nIZs",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What narrative purpose does the low-frequency rumble around 30s serve?",
+ "Choice": [
+ "A. To emphasize the whimsical nature of smartphone app development",
+ "B. To emphasize the shift into historical exposition",
+ "C. To highlight technical precision in server initialization",
+ "D. To amplify tension around data privacy vulnerabilities"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What contextual audio information is presented when the iPhone app screen is displayed?",
+ "Choice": [
+ "A. Discussion of smartphone app monetization strategies",
+ "B. Details about Brian Acton's rejected job application",
+ "C. Narration about global adoption of smartphone apps",
+ "D. Explanation of the iPhone's hardware specifications"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which visual element appears on screen precisely when the narrator states, 'But to make sense of all this, we need to travel back in time'?",
+ "Choice": [
+ "A. Black screen with no visual elements",
+ "B. Clock tower encircled by a clock face",
+ "C. Green grid with targeting reticle",
+ "D. iPhone displaying 'SMARTPHONE APPS'"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "GLW9omJfAdk",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Which audio event plays simultaneously with the initial display of the content warning text?",
+ "Choice": [
+ "A. Calming guitar music accompanying the painting's reveal",
+ "B. A biographical introduction to Francisco Goya",
+ "C. A narrator stating 'This video contains images which some viewers may find distressing'",
+ "D. Tense strings during the fade to black"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What primary factor led to Goya's isolation during his final years?",
+ "Choice": [
+ "A. Official banishment from aristocratic circles",
+ "B. Loss of immediate family and close companions",
+ "C. Self-imposed seclusion to pursue grotesque artistry",
+ "D. Progressive hearing loss limiting social engagement"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "Gh_GtYtQoVI",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which visual element appeared when the speaker said 'Businesses generally need money or capital to start'?",
+ "Choice": [
+ "A. A dollar sign transforming into the word 'CAPITAL'",
+ "B. A hand writing 'BUSINESS' in red marker below a bank sketch",
+ "C. A red X over a building labeled 'TATA'",
+ "D. A hand drawing three stick figures labeled 'SAVERS'"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What was drawn IMMEDIATELY after the light bulb labeled 'IDEA' appeared?",
+ "Choice": [
+ "A. A checkmarked list of financial institutions",
+ "B. An arrow connecting savers to capital",
+ "C. A stick figure holding a crossed-out money bag",
+ "D. A building labeled 'TATA' with a red X"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Why was a red X displayed over the illustration labeled 'TATA'?",
+ "Choice": [
+ "A. To indicate a prohibited investment sector",
+ "B. To show disapproval of direct investment",
+ "C. To emphasize the necessity of intermediaries",
+ "D. To warn against personal financial risk"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What does the inclusion of checkboxes next to financial institutions imply about their function in investment processes?",
+ "Choice": [
+ "A. They represent mandatory regulatory steps",
+ "B. They require physical documentation",
+ "C. They enable direct business transactions",
+ "D. They indicate optional intermediary channels"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What key distinction is highlighted by the red X over the 'TATA' building compared to the red X over the stick figure?",
+ "Choice": [
+ "A. One signifies a failed investment while the other warns of fraud",
+ "B. Both emphasize the necessity of intermediaries for transactions",
+ "C. One prohibits direct funding to corporations and the other to individuals",
+ "D. They reflect opposing strategies for capital allocation"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "FkZn5u3MIiY",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+                "Question": "Which visual element didn't appear synchronously with the Princess's explicit mention of her abdominal surgery?",
+ "Choice": [
+ "A. 'BBC NEWS' logo on studio stage",
+ "B. Red banner with 'BBC NEWS BREAKING' header",
+ "C. Red banner displaying cancer treatment details",
+ "D. Timecode '18:00' in corner"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What technical production element indicates the Princess's message was pre-recorded rather than broadcast live?",
+ "Choice": [
+ "A. Synchronized breaking news ticker updates",
+ "B. Seamless continuity of floral arrangements",
+ "C. Unchanging natural lighting throughout",
+ "D. Post-production voiceover introducing the segment"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "How does the on-screen text differ between when the Princess discusses chemotherapy versus the initial surgery announcement?",
+ "Choice": [
+ "A. Text specifies preventative chemotherapy",
+ "B. Change of text in the backdrop",
+ "C. Timecode format switches to 12-hour clock",
+ "D. BBC NEWS banner disappears completely"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "G-mmtUxSt5k",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What visual event coincides with the commentator stating, 'Such good form recently as well'?",
+ "Choice": [
+ "A. Player in blue executing a cross-court drop",
+ "B. Player adjusting wristband during a pause",
+ "C. Audience member waving a national flag",
+ "D. White-shirted player engaged in a rapid exchange"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What commentator instruction was given as Tai Tzu Ying began walking off the court?",
+ "Choice": [
+ "A. Direct shout: 'Smash that'",
+ "B. Praise for recent tournament performance",
+ "C. Analyst discussing strategic recovery",
+ "D. Sharp impact sound from a cross-court smash"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "GV5CuB4zPTY",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What is the primary purpose of the announcer mentioning 'another little intro'?",
+ "Choice": [
+ "A. To introduce a special guest commentator",
+ "B. To humorously explain the reason for the intro music mistake",
+ "C. To highlight the referee's role in maintaining game integrity",
+ "D. The tournament has progressed to the quarter-final stage"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What occurs immediately following the announcer's opening line 'Good morning and welcome to the Predator World 8-Ball Championships...' in the convention center scene?",
+ "Choice": [
+ "A. Competitors exchange pre-match formalities at the table",
+ "B. A referee verifies ball positions on the table",
+ "C. Tight framing captures a player's shot execution",
+ "D. Tournament title graphics overlay audience footage"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "FcP0mzWFCQU",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which visual element directly aligns with the speaker's assertion that fast food 'brings very much of happiness to people'?",
+ "Choice": [
+ "A. A text overlay showing '€19b' as a man opens a McDonald's door",
+ "B. A woman with glasses seated at a table covered with Five Guys wrappers and a drink cup",
+ "C. Two people laughing while eating chicken in a fast-food restaurant",
+ "D. A McDonald's exterior decorated with festive lights as customers enter"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "Fqptx5fNXR4",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "How does the 'Circus Vargas' trailer visually contrast with the text 'THAT WE'RE NOT A COMMUNITY THAT'S WELL ORGANIZED'?",
+ "Choice": [
+ "A. The trailer is stationary; the text appears during dynamic performances",
+ "B. The trailer uses vibrant colors; the text appears in monochrome",
+ "C. The trailer features a clown; the text references chaotic organization",
+ "D. The trailer promotes a theatrical event; the text disputes community structure"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What is the primary reason the woman points toward the circus tent while standing near the snack stall?",
+ "Choice": [
+ "A. To highlight the 'The big one is back!' tagline on the Circus Vargas trailer",
+ "B. To guide viewers toward the construction workers assembling the tent",
+ "C. To reference the location of her imminent performance stated in her speech",
+ "D. To dismiss the chaotic organization mentioned in the narrator's dialogue"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "FkrIcmXvv18",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What external factor heightens the blonde woman's distress during her soliloquy?",
+ "Choice": [
+ "A. Physical constraints of her elevated platform",
+ "B. A technical error in the starry backdrop",
+ "C. Onstage observers documenting her with devices",
+ "D. A dissonant shift in the musical score"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "GQEYGAJIzAM",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "During the sponsorship acknowledgment mentioning 'King Liechtenstein,' which visual event occurs simultaneously?",
+ "Choice": [
+ "A. A man with graying hair gestures upward in a black shirt",
+ "B. A man wears a headlamp while smiling in a blue jacket",
+ "C. An orange-banded planet rotates against a black background",
+ "D. A webpage displays astrophotography equipment"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which statement directly supports the purpose of the rooftop telescope setup?",
+ "Choice": [
+ "A. Mention of clear London night skies",
+ "B. Promotion of high-quality astrophotography prints",
+ "C. '50 million miles away' astronomical claim",
+ "D. Observation of Mars setting behind the moon"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "FYqSATdr62Q",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What necessitates the black ball's repositioning during the 20-40s segment?",
+ "Choice": [
+ "A. A foul committed by O'Sullivan",
+ "B. Incorrect placement after a color pot",
+ "C. Strategic positioning for subsequent shots",
+ "D. Standard procedure following a legally potted black"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which visual event directly coincided with the commentator stating 'Nine' during the match?",
+ "Choice": [
+ "A. Black ball being repositioned on its spot",
+ "B. White cue ball being replaced in original position",
+ "C. O'Sullivan potting a red ball",
+ "D. O'Sullivan potting a black ball"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "FtIF8VafVZY",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which visual event directly follows the skateboarding scene in the video?",
+ "Choice": [
+ "A. Weighted squats in a gym with other people visible",
+ "B. Person waking up in a black MIT t-shirt",
+ "C. Alarm screen with 'Snooze' option on a mobile phone",
+ "D. Scooter POV shot showing handlebars on a paved road"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "GPOv72Awo68",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What scene directly follows the animated lemonade stand sequence in the first 20 seconds?",
+ "Choice": [
+ "A. Business loan application scene",
+ "B. 'ECONOMICS CRASH COURSE' title card",
+ "C. Factory production line scene",
+ "D. NERDFIGHTERIA CITY COUNCIL meeting"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which props are visibly present behind Adrian Hill when she provides an explanation of mortgages?",
+ "Choice": [
+ "A. Burning barrel graphic and crossbows",
+ "B. Cash register and a white figurine",
+ "C. Lemonade bottles and loan application documents",
+ "D. Factory machinery and white lab coats"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "GsYi-l5PEEs",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which visual event occurred simultaneously with the performer shouting 'Left hand! Left hand!'?",
+ "Choice": [
+ "A. Acrobat balances on his hands on a table",
+ "B. Acrobat grabs a long pipe",
+ "C. Acrobat performs a handstand split on the stage",
+ "D. Acrobat jumps from the table"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Why did the crowd cheer loudly during the first 20 seconds?",
+ "Choice": [
+ "A. The acrobat executed a dangerous jump from a height",
+ "B. The carousel began its rapid spinning motion",
+ "C. The acrobat followed a shouted instruction",
+ "D. The performer successfully maintained an inverted split"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "HK13nVxBYxI",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What visual element appeared behind the male anchor immediately after he referenced Bitcoin's November 2021 peak during the discussion of market fluctuations?",
+ "Choice": [
+ "A. US dollar bills with golden Bitcoin",
+ "B. Breaking news banner in red text",
+ "C. Stock market candlestick charts",
+ "D. 'BTC MARKET' in blue letters"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "HGv87QeBYnc",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What led to goalkeeper A. Becker's substitution during the match?",
+ "Choice": [
+ "A. He was replaced as a precaution due to prior fitness concerns mentioned in commentary",
+ "B. He received a second yellow card, resulting in ejection",
+ "C. The team opted for a tactical adjustment to reinforce the midfield",
+ "D. He sustained an injury visible on the field"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "When did the newspaper headline appear relative to the trophy being lifted?",
+ "Choice": [
+ "A. Before the fireworks explosion",
+ "B. While commentators discussed the goalkeeper injury",
+ "C. During the team's posed celebration on the field",
+                    "D. After confetti surrounded a celebrating athlete"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What commentary accompanied the goalkeeper's substitution?",
+ "Choice": [
+ "A. Analysis of the injury's long-term impact",
+ "B. Mention of a concern for the team's defense",
+ "C. Praise for the substitute's reaction speed",
+ "D. Reference to his flawless league attendance"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "HmnKbBmYii8",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with the narrator stating, 'At last, the waiting is over'?",
+ "Choice": [
+ "A. A seal swimming upward in a blue-and-dim underwater environment",
+ "B. A dark seal silhouette ascending in turquoise waters",
+ "C. The mother bear grooming her fur",
+ "D. A cub licking snow off its paws"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "In what chronological order do the following key events occur? 1. Seal ascending towards an ice hole underwater, 2. Polar bear cub actively forming snowballs, 3. Seal's head emerging abruptly from the ice",
+ "Choice": [
+ "A. 2 → 1 → 3",
+ "B. 3 → 2 → 1",
+ "C. 1 → 2 → 3",
+ "D. 1 → 3 → 2"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "HProiNnmGwI",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event occurs simultaneously with the narrator's statement 'In the history of conflict between East and West'?",
+ "Choice": [
+ "A. Men on horses clashing amid battle chaos",
+ "B. A castle wall with white flags blowing",
+ "C. Silhouetted figures holding flags and crosses",
+ "D. A group of horses and riders charging quickly"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "EFD9BLgMVK8",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which combination of elements specifically distinguishes this as a choreographed halftime show rather than an actual basketball game?",
+ "Choice": [
+ "A. Crowd presence + upbeat pop music",
+ "B. Court boundary lines + corporate branding",
+ "C. Audience-directed narration + superhero attire",
+ "D. Athletic stunts + extended halftime clock"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "HIjX8OPuf-w",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which audio-visual pairing occurred when the Mongol warrior cutout was first placed on the paper?",
+ "Choice": [
+ "A. 'Germans immigrated to the US' spoken / Ship and Statue of Liberty placement",
+ "B. 'Hamburger history, briefly' spoken / Title text being written on paper",
+ "C. 'The Mongols of the 13th century. Apparently...' spoken / Mountain range drawn and warrior cutout placed",
+ "D. 'Tenderizing their dinner' spoken / Warrior cutout wiggling over hamburger-covered mountains"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "HdBsD5VspVM",
+ "video_duration": "60s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "What was the chronological sequence of distinct visual occurrences in the first 20 seconds?",
+ "Choice": [
+ "A. Panel division → Critical damage emphasis → Twitter post appearance",
+ "B. Panel division → Twitter post appearance → Critical damage emphasis",
+ "C. Twitter post appearance → Critical damage emphasis → Panel division",
+ "D. Twitter post appearance → Panel division → Critical damage emphasis"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "G267g0DpCVg",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "What caused the person to laugh during the electronic drum music sequence?",
+ "Choice": [
+ "A. Recognition of the container's pagoda artwork",
+ "B. The noodles were excessively spicy",
+ "C. A clump of noodles stuck together unexpectedly",
+ "D. Frustration with chopstick technique"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "What specific action immediately preceded the person's close inspection of the chopsticks during the Latin Jazz sequence?",
+ "Choice": [
+ "A. Comparing chopstick length with the spoon",
+ "B. Repeatedly dropping noodles while eating",
+ "C. Successfully picking up a large noodle cluster",
+ "D. Reading text on the container lid"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What is the FIRST action performed after the person places the bags/container on the table?",
+ "Choice": [
+ "A. Throwing a fork into a trash bag",
+ "B. Removing chopsticks from a wrapper",
+ "C. Removing a plastic fork from the paper bag",
+ "D. Tearing open the lid of the white container"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "FlcVBKTSjRs",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What visual element accompanies the final vocal repetition of 'alright' in the song?",
+ "Choice": [
+ "A. Production credits with composer details",
+ "B. Text displaying 'SPOON CANDLE HOLDER'",
+ "C. Close-up of crafting tools",
+ "D. Multiple lit candles"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "HooZ84rpovQ",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "What was the sequence of visual events immediately following the highlighted Menorca map?",
+ "Choice": [
+ "A. Bone comparison → Size comparison graphic → Scientist explaining",
+ "B. Scientist explaining → Bone comparison → 'NURALAGUS REX' text",
+ "C. Bone comparison → Owl image → Pebble beach animation",
+ "D. Bone comparison → Scientist explaining → Beach-walking animation"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+                "Question": "Which anatomical features of Nuralagus rex contrasted most distinctly with modern rabbits, as explained in the video?",
+ "Choice": [
+ "A. Short hindlimbs and rigid backbone",
+ "B. Longer ears and agile locomotion",
+ "C. Stiff spinal column and widely spaced toes",
+ "D. Flexible spine and compact toe structure"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+                "Question": "When the text 'NURALAGUS REX' appears on screen, what specific biological concept is being explained in the audio?",
+ "Choice": [
+ "A. Evolutionary processes causing increased body size",
+ "B. Its temporal dominance during the Pliocene",
+ "C. Comparative locomotor anatomy with modern rabbits",
+ "D. The fossil discovery methodology in 2011"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "GFTKKyYSCKs",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What specific economic concepts do the yellow question mark and snow globe respectively represent in the context of monetary systems?",
+ "Choice": [
+ "A. They illustrate cryptocurrency vs traditional banking systems",
+ "B. One signifies fiscal stimulus while the other indicates public debt",
+ "C. One denotes monetary inquiry while the other symbolizes supply regulation",
+ "D. They represent government taxation vs. corporate investment"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What visual element appeared immediately after the TED Ed logo disappeared?",
+ "Choice": [
+ "A. White-background quote about money creation",
+ "B. Bank building architectural illustration",
+ "C. Animated '2020' with pandemic imagery",
+ "D. Spinning globe with currency symbols"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "HLxvq_M4218",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What scientific phenomenon allowed distant stars to become visible during the 1919 eclipse depicted in the 20-40s segment?",
+ "Choice": [
+ "A. Atmospheric scattering reduction",
+ "B. Corona's magnetic field interactions",
+ "C. Total lunar obstruction creating darkness",
+ "D. Gravitational lensing bending starlight"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which visual element is displayed as the narrator states, 'aren't where they're supposed to be'?",
+ "Choice": [
+ "A. A still image of a solar eclipse",
+ "B. Animation illustrating gravitational light bending",
+ "C. Newton's portrait with highlighted text from *Principia*",
+ "D. A black-and-white photo of Albert Einstein and colleagues"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which visual event coincides with the narrator stating 'this one is special' in the 20-40s segment?",
+ "Choice": [
+ "A. Animation of light bending around an object",
+ "B. Einstein's photo with colleagues",
+ "C. Diagram of a half moon with white crosses",
+ "D. 1919 Sobral eclipse image"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "H6Y2KRfpRVY",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Why did the audience cheer during Mark's initial introduction?",
+ "Choice": [
+ "A. He recognized his wife in the audience",
+ "B. He confirmed his occupation as a magician",
+ "C. He disclosed his North London origins",
+ "D. He demonstrated a magic trick with a Rubik's Cube"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "In what order does Mark provide the requested personal details when first questioned?",
+ "Choice": [
+ "A. Occupation → Daughter's age → Wife's presence",
+ "B. Age → Name → London landmarks",
+ "C. Name → Location → Age",
+ "D. Age → Occupation → Family status"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Why isn't Mark's daughter shown on camera when mentioned?",
+ "Choice": [
+ "A. She's too young to appear on camera",
+ "B. She was not present in the audience",
+ "C. Technical issues prevented her appearance",
+ "D. The camera focused on his wife in the audience"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "HvjgQqNOq9A",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What distinguishes the primary focus of the 20-40s audio segment from the 40-60s segment?",
+ "Choice": [
+ "A. Uniform symbolism vs. equipment usage",
+ "B. Balance dynamics vs. weight distribution",
+ "C. Movement philosophy vs. precise foot placement",
+ "D. Speed variation vs. consistent pacing"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "When Oliver elaborates on 'shifting your weight slowly' during stealth walking, which visual element specifically highlights the mechanics of this technique?",
+ "Choice": [
+ "A. Wooden stand holding spears in background",
+ "B. Black belt knot visible during torso movement",
+ "C. Grey mat texture during toe-to-heel foot placement",
+ "D. Close-up of knee lifting during step placement"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "HZJXPm0nhgY",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What was the artist's last step in the video clip in completing the artwork?",
+ "Choice": [
+ "A. Finalizing a black paint contour",
+ "B. Blending colors with a brush",
+ "C. Applying yellow paint as an accent color",
+ "D. Adding titanium white for highlights"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "HeEyJo838PA",
+ "video_duration": "60s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "Which visual event coincided with the narrator stating, 'These animals are perfectly adapted to their environment with keen eyesight and acute sense of smell'?",
+ "Choice": [
+ "A. Mother bear walking with cubs on rocky shore",
+ "B. Individual bear running into water",
+ "C. Two bears positioned at a waterfall",
+ "D. Bear catching a salmon in its mouth"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which visual event directly coincides with the narrator's statement about having the 'opportunity to travel to Alaska'?",
+ "Choice": [
+ "A. Aerial shot of river systems in green terrain",
+ "B. Two bears standing upright in confrontation",
+ "C. Floatplane flying above a lake",
+ "D. Close-up of salmon-filled pink river"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "HiJb_2dvuHc",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event occurs FIRST in the performance timeline?",
+ "Choice": [
+ "A. Band members synchronize pointing upward",
+ "B. Close-up of red-haired drummer smiling",
+ "C. Group dancing with coordinated arm movements",
+ "D. Pink-haired musician solo spotlight during lyrics"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "I0xrbzYT4Sc",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What specific branding is visible in the background when the speaker remarks, 'Look at how he just lawn darts himself'?",
+ "Choice": [
+ "A. Crowd visible in the background",
+ "B. 'TEXACO' on a building",
+ "C. 'NITRO WORLD GAMES' on a blue ramp",
+ "D. 'Red Bull' on the dirt track"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "HnS87Tw4amM",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "When the speaker describes mold growth locations as 'the soap drawer, inside the drum,' which specific component is shown in a magnified view?",
+ "Choice": [
+ "A. Interior structure of the detergent compartment",
+ "B. Door seal with black mold accumulation",
+ "C. Mold colonization in drum crevices",
+ "D. Scale buildup on soap drawer jets"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the primary reason the speaker attributes to mold accumulation in washing machines?",
+ "Choice": [
+ "A. The inherent biological nature of mold as a fungus",
+ "B. Infrequent cleaning cycles allowing organic buildup",
+ "C. Mineral deposits from hard water affecting components",
+ "D. Residual moisture retention in the drum and pipes"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which visual indicator during the explanation of mineral deposit accumulation substantiates the presenter's professional authority?",
+ "Choice": [
+ "A. Enlarged cross-sectional diagrams",
+ "B. Proximity of chemical treatment products",
+ "C. Corporate insignia integrated into attire",
+ "D. Curated array of appliance models"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "GqeRnxSuLFI",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Why do some sellers eventually go out of business?",
+ "Choice": [
+ "A. Demand plateaus as new sellers enter the market",
+ "B. Market saturation drives prices below sustainable levels",
+ "C. Inexperienced sellers fail to attract customers",
+ "D. Supply growth outpaces static consumer demand"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which visual cue signifies Sarah's decision to enter the banana market?",
+ "Choice": [
+ "A. She appears next to Peter with a bag, smiling",
+ "B. Sarah's thought cloud shows 'I SHOULD START SELLING'",
+ "C. Three people stand below a banana star",
+ "D. Sarah hands a white bag to another woman in a quadrant"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What fundamental economic shift occurs in Peter's banana sales strategy from initial to later stages?",
+ "Choice": [
+ "A. Change from fixed location to mobile distribution",
+ "B. Movement from monopoly market to competitive market",
+ "C. Shift from product diversification to banana specialization",
+ "D. Transition from premium pricing to cost leadership"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "HwnB8aCn8yE",
+ "video_duration": "60s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "Based on the synchronization of audio transitions and visual sequences, what is the correct chronological order of hominid appearances in the video?",
+ "Choice": [
+ "A. Dryopithecus → Ramapithecus → Anamensis → Ardipithecus → Afarensis → Robustus → Boisei",
+ "B. Dryopithecus → Ramapithecus → Ardipithecus → Anamensis → Afarensis → Boisei → Robustus",
+ "C. Dryopithecus → Ramapithecus → Ardipithecus → Anamensis → Afarensis → Robustus → Boisei",
+ "D. Dryopithecus → Ramapithecus → Ardipithecus → Afarensis → Anamensis → Robustus → Boisei"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "HP_J8x9lsUg",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which event occurs FIRST chronologically?",
+ "Choice": [
+ "A. Woman says 'He's starting to remember'",
+ "B. 'Binge Now' text appears with futuristic structures",
+ "C. Green aura scream with dramatic sound effect",
+ "D. Man stands on gravel path as male voice says 'His return is only the beginning'"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which visual event occurs simultaneously with the audio line 'How is that possible?'?",
+ "Choice": [
+ "A. Two figures grappling in bathroom",
+ "B. Aerial view of car driving toward house",
+ "C. Young man gasping in bathroom covered in liquid",
+ "D. Young man screaming with outstretched hand surrounded by green aura"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "Huj-zXv4DEw",
+ "video_duration": "60s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "What distinguishes the musical accompaniment between the white/orange cat's emergence and the corgis' snow walk?",
+ "Choice": [
+ "A. Orchestral strings vs. Rock percussion",
+ "B. High-tempo beats vs. Lyrical vocals",
+ "C. Synthesizer textures vs. Diegetic environmental sounds",
+ "D. Disco with synthesizer vs. Pop featuring keyboard"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What text appears on screen when the monkey slaps the woman?",
+ "Choice": [
+ "A. 'best of the year so far'",
+ "B. 'Let's take you on an adventure!'",
+ "C. 'for a PAWsome experience!'",
+ "D. 'I am getting good at hiding'"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What does the black and white cat do right after descending the wooden ladder?",
+ "Choice": [
+ "A. Jumps onto a brown dresser",
+ "B. Begins licking its paws",
+ "C. Steps onto another wooden shelf",
+ "D. Approaches a white refrigerator"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "IKtnfFHjERg",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What is the primary purpose of unfolding the paper immediately after the horizontal fold in the 0-20s segment?",
+ "Choice": [
+ "A. To verify alignment with the diagonal crease",
+ "B. To inspect paper durability under tension",
+ "C. To establish guide creases for geometric accuracy",
+ "D. To prepare for vertical edge reinforcement"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "H8fbAFOMTp4",
+ "video_duration": "60s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "What was the cat's final action before the video ended in the last segment?",
+ "Choice": [
+ "A. Scratching its head mid-exploration",
+ "B. Biting a tan work glove",
+ "C. Investigating a yellow floor marker",
+ "D. Retreating to a folded blanket"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "I5cFBi02O34",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What occurs immediately after the dragon transports the princess to the fiery castle?",
+ "Choice": [
+ "A. The king mourns by the princess's empty bed",
+ "B. The dragon confronts the knight on the cliff",
+ "C. The kingdom posts 'Hero Needed' notices",
+ "D. The knight decapitates the dragon"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What event directly prompted the display of 'Hero Needed' posters in the cat kingdom?",
+ "Choice": [
+ "A. Recruitment for the royal guard's expansion",
+ "B. To commemorate the knight's defeat of the dragon",
+ "C. The king's urgent quest to find his daughter's rescuer",
+ "D. To notify citizens about the princess's abduction"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "HqAfzcJurMM",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with the initial 'thud' sound in the video?",
+ "Choice": [
+ "A. Glossy spherical objects displayed on red surface",
+ "B. Man carrying tools along a mountain path",
+ "C. 'BUCK THE CUBICLE' and 'Bruce Gardner, Hikaru Dorodango' text transition",
+ "D. Shovel resting in dirt beside a silver bucket"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Where is Bruce located when he mixes rocks with water in the 40-60s segment?",
+ "Choice": [
+ "A. Workshop sifting station",
+ "B. Workshop storage area",
+ "C. Workshop near a window",
+ "D. Outdoor dirt path"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "How do the Dorodango spheres shown between 20-40 seconds differ from the one Bruce holds in the later segment?",
+ "Choice": [
+ "A. Earlier ones feature mottled surfaces; later one is uniformly smooth",
+ "B. Earlier ones were formed with tools; later one shaped barehanded",
+ "C. Earlier ones have a glossy finish; later one appears rough",
+ "D. Earlier ones contain light-colored elements; later one is solid dark brown"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Which event occurs at the very beginning of the video?",
+ "Choice": [
+ "A. Gardening tools lying beside a bucket on dirt ground",
+ "B. Close-up of Bruce's hands shaping a glossy sphere",
+ "C. Man carrying a silver bucket and shovel walking on a mountain path",
+ "D. Text 'BUCK THE CUBICLE' appearing with a thud sound"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "I4yye8mUzWg",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Which audio event coincides with Cyborg's initial finger-pointing gesture?",
+ "Choice": [
+ "A. 'One, two, three, four!' spoken",
+ "B. Music begins playing",
+ "C. 'The night begins to shine' sung",
+ "D. 'Bro!' spoken"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What differentiates the context of the first 'Cyborg!' utterance from the second?",
+ "Choice": [
+ "A. First occurs during a solo dance, second during a group argument",
+ "B. First overlaps with instrumental music, second with vocal harmonies",
+ "C. First expresses approval, second expresses frustration",
+ "D. First is directed at Robin, second at the entire group"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "Which event occurs FIRST in the 20-40s segment?",
+ "Choice": [
+ "A. Cyborg dances and points joyfully in the room",
+ "B. Cyborg hugs Robin on the pink sofa",
+ "C. Cyborg dances with blue spotlights behind him",
+ "D. Cyborg picks up Robin"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "HT2iYl6E_Vo",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What distinguishes the audio commentary between Nexa's P2000 kill and MONESY's kill?",
+ "Choice": [
+ "A. Both occur during commentary addressing player fatigue",
+ "B. Only MONESY's kill is followed by discussions about map control",
+ "C. Only Nexa's kill includes weapon-specific commentary",
+ "D. Both kills are accompanied by discussions about map control"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "HvKpnaXYUPU",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual event coincided with the audio statement about the Siberian helicopter crew's mysterious discovery?",
+ "Choice": [
+ "A. Sweeping aerial shot of icy mountain ranges",
+ "B. Helicopter mid-flight against cloud coverage",
+ "C. A hand adjusts the helicopter's controls",
+ "D. View from helicopter cockpit showing green terrain"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "IGyx5UEwgtA",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Immediately after the figurine drops to the ground, which visual element is introduced?",
+ "Choice": [
+ "A. A graphic displaying the staff hierarchy with Craig at the top",
+ "B. A classroom scene featuring women operating adding machines",
+ "C. An animated explanation of the U.S. court system",
+ "D. A close-up of an eagle figurine on the table"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which audio event occurred before Craig knocking his fist on the table during his explanation of hierarchy?",
+ "Choice": [
+ "A. 'Turtles all the way down, except at the bottom' spoken",
+ "B. 'When you realize that there are 50 states...' spoken",
+ "C. Fast-paced rock music playing",
+ "D. Knocking sound effect"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "IQcTd4xZ44o",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which athlete, wearing a white and blue tank top, was shown on screen during the commentator's initial reference to 'the men's 1500 meters' as the blue ribbon event?",
+ "Choice": [
+ "A. Lewandowski (Poland)",
+ "B. Josh Kerr (Great Britain)",
+ "C. Neil Gourley (Great Britain)",
+ "D. Cheruiyot (Kenya)"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What specific reason did the commentator associate with supporters waving Kenyan flags during Cheruiyot's introduction?",
+ "Choice": [
+ "A. To commemorate his recent African Championship victory",
+ "B. To recognize Makhloufi's tactical brilliance",
+ "C. To show national pride for Cheruiyot's participation",
+ "D. In anticipation of Cheruiyot winning gold"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "Hw5qjcnhrW4",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Why does Halep touch her face immediately after winning the point?",
+ "Choice": [
+ "A. Following a pre-serve routine",
+ "B. To wipe away sweat",
+ "C. Checking for equipment issues",
+ "D. Reacting to the commentator's remark"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "HwpzzAefx9M",
+ "video_duration": "60s",
+ "video_category": "Nonprofits & Activism",
+ "questions": [
+ {
+ "Question": "What audio event coincides with the depiction of the military jeep and helicopter in the illustrated book?",
+ "Choice": [
+ "A. Discussion of the Geneva Convention",
+ "B. Speech about the origins of violence",
+ "C. Sound of a jeep engine",
+ "D. Sound of an airplane propeller"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which visual event occurs simultaneously with the second explosion sound in the 20-40s segment?",
+ "Choice": [
+ "A. Military jeep driving past a helicopter",
+ "B. Men in uniform carrying a stretcher",
+ "C. Round table with six people drawn",
+ "D. Civilian man approaching an open book"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "IPevXMMEVQ0",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which specific rule violation would have nullified Teemu Hartikainen's shootout goal had it been successful?",
+ "Choice": [
+ "A. Failure to maintain continuous forward motion",
+ "B. Puck deflected off an official's skate",
+ "C. Illegal stick manipulation before release",
+ "D. Goaltender interference during the attempt"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "Hel5dqvImSw",
+ "video_duration": "60s",
+ "video_category": "Nonprofits & Activism",
+ "questions": [
+ {
+ "Question": "What does the dotted line with a moving airplane on the map primarily represent in the context of Nina's story?",
+ "Choice": [
+ "A. Legal clinic outreach routes",
+ "B. Sponsorship application timelines",
+ "C. Domestic violence reporting paths",
+ "D. Nina's immigration sponsorship process"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "IPqb_oDlGBU",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event occurs after the speaker references the origin of her fascination with murder mysteries by mentioning 'Murder on the Orient Express'?",
+ "Choice": [
+ "A. Hand clutching a note with cut-out text",
+ "B. Derailed train in snowy terrain",
+ "C. Mustached man analyzing documents in snow",
+ "D. Man lying down reading a book"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "During the tunnel table scene, what contrasting relationship exists between the visual setting and the concurrent audio commentary?",
+ "Choice": [
+ "A. Snowy backdrop vs critique of cinematic techniques",
+ "B. Suspenseful interior vs cheerful musical theme",
+ "C. Tranquil environment vs analysis of the story's twist ending",
+ "D. Crowded gathering vs anticipation for a film adaptation"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "N1cdUjctpG8",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which visual element is shown when the audio mentions the discovery of writing tools like ink stones and brush pens in Prince Jin Duan's tomb?",
+ "Choice": [
+ "A. A green bowl and broken chairs in a deteriorated interior",
+ "B. A wooden surface with intricate peacock and floral inlays",
+ "C. A labeled photo of a Ming Dynasty tomb in Shanxi province",
+ "D. Two coffins positioned side by side in a brick chamber"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which item explicitly listed in the audio narration is absent from all visual depictions of the tomb?",
+ "Choice": [
+ "A. Chinese calligraphy brush pens",
+ "B. Wooden sacrificial tables",
+ "C. Ceramic dishware sets",
+ "D. Textured brick surfaces"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "LqaSTy0swNE",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "When the commentator mentions 'tremendous support' for Japan, what visual element directly aligns with this context?",
+ "Choice": [
+ "A. Japanese athlete in foreground with identification code D01",
+ "B. Crowd partially visible behind neon event graphics",
+ "C. Athlete with 'JAPAN' on uniform in mid-race",
+ "D. Chart displaying relay qualification criteria"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "LFGE_8SoYVA",
+ "video_duration": "60s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "What is the correct chronological sequence of events between 20-40 seconds?",
+ "Choice": [
+ "A. Sea lion lunges → Camel grabs hair → Elephant charges vehicle → Parrot interacts with mirror",
+ "B. Sea lion lunges → Elephant charges vehicle → Camel grabs hair → Parrot interacts with mirror",
+ "C. Parrot interacts with mirror → Camel grabs hair → Sea lion lunges → Elephant charges vehicle",
+ "D. Camel grabs hair → Elephant charges vehicle → Sea lion lunges → Parrot interacts with mirror"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which audio event occurs while the chipmunk interacts with the stuffed animal?",
+ "Choice": [
+ "A. High-pitched mammalian vocalizations",
+ "B. Circus-style instrumental music",
+ "C. Panicked human shouting",
+ "D. Female exclaiming 'Ow! Wow! Oh!'"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "LCfBYE97rFk",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What does the metaphor 'my crack' emphasize about the speaker's relationship with typography?",
+ "Choice": [
+ "A. It provides an irreplaceable creative stimulation",
+ "B. She critiques its inconsistent urban application",
+ "C. She feels compelled to perfect poorly designed examples",
+ "D. It's an obsessive compulsion she cannot ignore"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "LVRcD_-ht3g",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What was the sequence of audio events during the first 20 seconds?",
+ "Choice": [
+ "A. Promotional speech → cat safety narration → dramatic music",
+ "B. Promotional speech → guard dog narration → dramatic music",
+ "C. Dramatic music → promotional speech → guard dog narration",
+ "D. Guard dog narration → dramatic music → promotional speech"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What motivated the cat to extend its paw toward the toddler's hand during the balcony interaction?",
+ "Choice": [
+ "A. The cat detected a possible fall risk despite stable footing",
+ "B. The toddler offered food to the cat",
+ "C. The toddler provoked the cat with sudden movements",
+ "D. The cat sought physical contact for social bonding"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "IaTaaNinolU",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which character's first appearance follows the initial depiction of the throne room in the introduction?",
+ "Choice": [
+ "A. Horatio appears near castle gates",
+ "B. Hamlet emerges wielding a sword",
+ "C. Gertrude arrives wearing a crown",
+ "D. Claudius materializes in regal attire"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "During the audio discussing the appearance of Hamlet's father's ghost, what does the silhouette at the castle entrance symbolize?",
+ "Choice": [
+ "A. Claudius surveying his domain",
+ "B. Horatio witnessing a supernatural event",
+ "C. The ghost of King Hamlet",
+ "D. A sentry patrolling the castle grounds"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "MrRvX5I8PyY",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual element is displayed when the speaker begins articulating 'Za' after the transition sound in the initial 20 seconds?",
+ "Choice": [
+ "A. Modern blue-and-white building exterior",
+ "B. Curved white interior architecture with ceiling openings",
+ "C. Aerial perspective of a futuristic stadium",
+ "D. The B1M channel logo"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "NCNeFSJexSY",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual event occurred when the speaker said 'Did he? He reviewed reviews on products'?",
+ "Choice": [
+ "A. Man in gray jacket wiping his face",
+ "B. Mark Zuckerberg holding a VR headset",
+ "C. Person in olive green hoodie speaking",
+ "D. Person in light green hoodie discussing Vision Pro"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What directly prompted the transition to Mark Zuckerberg's clip?",
+ "Choice": [
+ "A. Commentary about vlogging style",
+ "B. Explicit cue to play Quest 3 commentary",
+ "C. Announcement of mixed reality demonstration",
+ "D. Discussion of Vision Pro's market threat"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "Pf1hNFndi60",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with the man's statement, 'I like it. There's not a light batter'?",
+ "Choice": [
+ "A. A man uses a knife to clean a fish on a car hood",
+ "B. A chef prepares noodles in the restaurant",
+ "C. A man uses chopsticks to eat intestines",
+ "D. A person uses a fork and knife to cut into an ornate silver dish"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "When a man remarks, 'A rubber hose pipe. Nice. Poached in Napa,' what specific visual setting is shown?",
+ "Choice": [
+ "A. A sandy beach with an off-road vehicle",
+ "B. A restaurant interior with red lanterns and diners",
+ "C. A chef pulling noodles in a kitchen",
+ "D. A car hood stained near the letter 'GB'"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "K7YNoRZMv2k",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What user action directly followed the speaker's mention of 'create signature products'?",
+ "Choice": [
+ "A. The user filtered media by 'audio' type",
+ "B. The cursor hovered over the Membership tab",
+ "C. The user clicked on 'The History of Prada' episode",
+ "D. The vintage video of people in chairs appeared"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What visual action occurred immediately before the 'THE HISTORY OF PRADA' title card appeared?",
+ "Choice": [
+ "A. Virgil Abloh's quote visibility",
+ "B. Audio player displaying '#1 - How to Develop...'",
+ "C. Clicking the Patreon episode play button",
+ "D. Membership tab navigation"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "IqejFRAcys8",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What is positioned adjacent to the yarn ball on either side during the cutting process in the 20-40s segment?",
+ "Choice": [
+ "A. Uncut yarn pile and translucent container",
+ "B. Scissors and loose threads",
+ "C. Glue bottle and cardboard",
+ "D. Pile of cut yarn and translucent container"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "N1ZRYx46ye4",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "When does the woman first mention her previous workout videos' success?",
+ "Choice": [
+ "A. While demonstrating the lat pulldown exercise",
+ "B. While addressing the camera in her bedroom",
+ "C. During the black screen with introductory music",
+ "D. Over text listing 'PUSH DAY, LEG DAY, PULL DAY'"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "When does the superimposed text listing specific workout splits (PUSH DAY, LEG DAY, PULL DAY) first appear?",
+ "Choice": [
+ "A. During the electronic music with heavy bass in the gym montage",
+ "B. When she says 'happy New Year' in her bedroom introduction",
+ "C. As she verbally lists the split components during her explanation",
+ "D. While describing her microphone setup for gym instructions"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What specific aspect of fitness programming does this video emphasize through its visual and verbal content?",
+ "Choice": [
+ "A. Equipment brand partnerships",
+ "B. Music tempo synchronization",
+ "C. Split training periodization",
+ "D. Advanced hypertrophy techniques"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Why does the woman wear distinct outfits in the gym workout compilation?",
+ "Choice": [
+ "A. To differentiate between days in the 4-day training split",
+ "B. To match musical transitions in the background score",
+ "C. To highlight seasonal activewear collections",
+ "D. To reflect filming across separate workout sessions"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "NAahAX62MZ0",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "During the visual sequence showing mozzarella cheese being sliced, which musical combination was present in the audio track?",
+ "Choice": [
+ "A. Keyboard-focused pop music with dance rhythm",
+ "B. Music featuring drums, bass and guitar",
+ "C. Orchestral music with string crescendos",
+ "D. Upbeat electronic music with synthesized percussion layers"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "JwxALozuKfY",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with the speaker's first audible breath sounds during his workout?",
+ "Choice": [
+ "A. Man executing controlled overhead dumbbell presses",
+ "B. Man completing barbell deadlift exercises",
+ "C. Man engaged in intense pull-up repetitions",
+ "D. Man performing dumbbell curls with focused expression"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What key detail does the man emphasize during his rooftop monologue in the 40-60s segment?",
+ "Choice": [
+ "A. Viewer inquiries about his meal plan",
+ "B. The intensity of his midday training sessions",
+ "C. The difficulty of early morning gym commutes",
+ "D. His partnership with a supplement brand"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "PSt_op3fQck",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual corresponds to the audio question 'And what happens with different cultures in the workplace?'",
+ "Choice": [
+ "A. Airline counter with AB AIRLINE employees",
+ "B. Passengers disembarking via mobile staircase",
+ "C. Crowded airport check-in area with 'QUEUE HERE' sign",
+ "D. Diversity dashboard showing 'DEAL ORIENTATED' gauge"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which visual event immediately follows the airplane's entry into the frame?",
+ "Choice": [
+ "A. Five diverse figures stand beneath an 'International Departures' sign with an airplane icon",
+ "B. A quote bubble appears above a figure with standard facial features",
+ "C. Two stick figures appear with a dashed line connecting their eyes",
+ "D. A turbaned figure displays a surprised expression"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "N62mrplIygc",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which personality trait is most emphasized through the symbolic elements in Baylie's introductory graphic?",
+ "Choice": [
+ "A. Social media influence",
+ "B. Focus on vanity and luxury",
+ "C. Commitment to physical fitness",
+ "D. Intellectual curiosity"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which event marks the beginning of Baylie's story chronologically?",
+ "Choice": [
+ "A. Baylie's introductory graphic with high heels appears",
+ "B. Narrator establishes Baylie's demanding personality",
+ "C. Father predicts her shock with strict South African parents",
+ "D. Christine wishes Baylie well at airport departure"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What core parental dynamic makes Baylie's adaptation challenging with South African host parents?",
+ "Choice": [
+ "A. Host families legally enforce corporal punishment",
+ "B. Airport authorities identified behavioral deficiencies",
+ "C. She vehemently opposes basic household responsibilities",
+ "D. Her biological parents lack consistent rule enforcement"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "P6Q7bQTnrjk",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "During which activity was Jodi Hildebrandt positioned in front of organizational branding at the retreat?",
+ "Choice": [
+ "A. Conducting an outdoor team-building exercise",
+ "B. Delivering a keynote address on stage",
+ "C. Filming a promotional segment with children",
+ "D. Demonstrating therapeutic techniques in a workshop"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which visual event is paired with the narrator's statement about the 911 call involving a YouTuber?",
+ "Choice": [
+ "A. A man in red shorts standing on a porch during bodycam footage",
+ "B. The YouTuber speaking on a stage with a CONNEXIONS banner",
+ "C. Children holding cylindrical containers in a retail setting",
+ "D. A group posing in front of shelves stocked with markers and notebooks"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which individual introduced herself at the second annual women's retreat and was later shown wearing a blue and grey striped prison uniform?",
+ "Choice": [
+ "A. The blonde woman with a silver necklace",
+ "B. The woman wearing furry earmuffs",
+ "C. The man in the checkered shirt",
+ "D. The YouTuber seen mixing ingredients in a metal bowl"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "PB06auioy0Y",
+ "video_duration": "60s",
+ "video_category": "Music",
+ "questions": [
+ {
+ "Question": "What occurs FIRST after the hypnotherapist instructs the viewer to 'open your eyes, Nicolas'?",
+ "Choice": [
+ "A. Nicolas is shown playing guitar on stage",
+ "B. The conductor directs the orchestra in darkness",
+ "C. The keyboardist performs under purple light",
+ "D. The screen displays a landscape scene"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "KIg4rprmO9Y",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What concurrent development is shown alongside the team's expansion in numbers?",
+ "Choice": [
+ "A. Escalation of intergalactic metahuman warfare",
+ "B. Discovery of a new meta-gene testing facility",
+ "C. The League's victory over the Reach",
+ "D. New leadership restructuring within the League"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "According to the report, what is the explicit purpose of testing metahumans for the metagene?",
+ "Choice": [
+ "A. To integrate them into the League's leadership program",
+ "B. To rehabilitate those affected by previous experiments",
+ "C. To identify candidates for extraterrestrial biological weaponization",
+ "D. To develop countermeasures against their abilities"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "M_lwC5zPkyo",
+ "video_duration": "60s",
+ "video_category": "Travel & Events",
+ "questions": [
+ {
+ "Question": "What is the primary purpose of showing the man seated in a park during his narration about travel plans?",
+ "Choice": [
+ "A. To visually corroborate his verbal declaration of current location",
+ "B. To visually show upcoming destinations in his itinerary",
+ "C. To highlight traditional Chinese architecture visible in the background",
+ "D. To demonstrate video production techniques for travel vlogging"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which visual element was prominently featured when the man said 'Alright. Welcome to China'?",
+ "Choice": [
+ "A. Aerial view of the Great Wall winding through mountains",
+ "B. Camouflage-hatted man surrounded by streetlight-mounted red lanterns",
+ "C. Traditional stone courtyard with ornate roof details",
+ "D. Same man in baseball cap seated in lotus position"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the chronological order of destinations in the man's explicitly stated travel itinerary?",
+ "Choice": [
+ "A. Shanghai → Kunming → Xi'an → Beijing",
+ "B. Shanghai → Beijing → Kunming → Xi'an",
+ "C. Shanghai → Xi'an → Beijing → Kunming",
+ "D. Shanghai → Xi'an → Kunming → Beijing"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "PYOyMK2Jw7g",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which sequence of player actions is explicitly mentioned in the audio commentary during the first 20 seconds?",
+ "Choice": [
+ "A. Kelleher deflects first, then Rashford protests",
+ "B. Hernandez dribbles first, then Garnacho dribbles",
+ "C. Garnacho dribbles first, then Hernandez passes",
+ "D. Garnacho shoots first, then Kelleher saves"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What specific play in the final 20s directly triggered the crowd's prolonged cheering?",
+ "Choice": [
+ "A. Kelleher's deflection of a close-range header",
+ "B. Klopp's sideline reaction to an offside call",
+ "C. Salah's left-footed curler from the edge of the box",
+ "D. McTominay's first-time finish after a defensive block"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "Ij-FYOrklFE",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Which event marks Pavitr Prabhakar's FIRST appearance in the 0-20s segment?",
+ "Choice": [
+ "A. Dialogue: 'Okay, no, I'm not'",
+ "B. Visual text: 'धड़ाक' (Crash)",
+ "C. Gwen Stacy debating with another Spider-Man",
+ "D. Aerial shot over Mumbai skyline"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "PHe0bXAIuk0",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What visual element appears in sync with the narrator stating 'made up of a few simple parts'?",
+ "Choice": [
+ "A. Three interlocking gears labeled 'CORN', 'OIL', 'STEEL'",
+ "B. Stick figures forming a circular arrangement",
+ "C. A graph with scribbled annotations",
+ "D. Multiple cogwheels converging into 'THE ECONOMY'"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which visual event coincided with the speaker saying, ' Though the economy might seem complex",
+ "Choice": [
+ "A. A hand draws a white line with stylized clouds appearing",
+ "B. Stick figures forming a circle are displayed",
+ "C. Three gears labeled 'CORN', 'OIL', and 'STEEL' interlock and turn",
+ "D. A hand draws scribbles on a graph"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "MQEcUPeAdFs",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which challenge was designated as the first official event immediately after the speaker mentioned 'six challenges'?",
+ "Choice": [
+ "A. Deadlift",
+ "B. Arm wrestling",
+ "C. Tug of war",
+ "D. Carrying an inflatable raft"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What directly triggered the group's flexing and cheering during the exclamation 'Let's go! Let's go!'?",
+ "Choice": [
+ "A. Commencing the first competition",
+ "B. Deciding the challenge order",
+ "C. A reference to their fitness routine",
+ "D. Successfully lifting the maximum weight"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What contextual clue doesn't confirms the deadlift challenge occurs at the beach?",
+ "Choice": [
+ "A. Sand visible under the barbell",
+ "B. Visible body of water behind competitors",
+ "C. Ambient sounds of ocean winds and waves",
+ "D. Competitors wearing athletic attire"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "How was the order of participation determined between the teams for the deadlift challenge?",
+ "Choice": [
+ "A. Both teams used Rock Paper Scissors",
+ "B. Navy SEALs volunteered first while opponents drew lots",
+ "C. Both teams let their strongest member decide",
+ "D. Navy SEALs chose directly while opponents used Rock Paper Scissors"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "KD1yxRyFAZQ",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What occurs right after the woman expresses gratitude to the receptionist with 'Lovely, thank you'?",
+ "Choice": [
+ "A. She immediately enters the dance studio doorway",
+ "B. She ascends the staircase to reach the designated studio",
+ "C. A foghorn sound echoes through the building",
+ "D. The receptionist begins organizing paperwork at the desk"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "NEFG7YcIDcI",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What was the FIRST map transition shown after the meteorologist discussed US landfall hurricanes?",
+ "Choice": [
+ "A. Atlantic Ocean hurricane growth",
+ "B. Simulated landmass with low pressure system",
+ "C. West Coast of Africa storm formation",
+ "D. Space view of hurricane eye"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What atmospheric process is visualized by the blue swirl graphic during the hurricane's depiction from space?",
+ "Choice": [
+ "A. Heat transfer energizing storm development",
+ "B. Rotational airflow expelling from the storm system",
+ "C. Moisture convergence fueling cloud growth",
+ "D. The Coriolis effect initiating cyclonic spin"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "LCtOpCi5r2s",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Which visual element is displayed when the speaker states that treaties cannot alter peremptory norms like the prohibition of genocide?",
+ "Choice": [
+ "A. Intersecting pink/blue circles labeled GENOCIDE and TORTURE",
+ "B. Flexed muscular arms surrounding a document labeled 'Customs'",
+ "C. Pink clouds with a sad face labeled 'Opinio Juris'",
+ "D. A white triangle labeled 'JUS COGENS'"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What visual appears immediately after the documents with red flags in the first segment?",
+ "Choice": [
+ "A. Muscular arms flexing with 'Customs' label",
+ "B. World map with traffic lights on continents",
+ "C. Triangle hierarchy of legal sources",
+ "D. Overlapping circles labeled 'General, Uniform, Consistent'"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "IzQ2siryQrM",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What concept do the alcohol bottles visually represent in the video?",
+ "Choice": [
+ "A. Risks associated with accidents",
+ "B. Impairment similar to intoxication",
+ "C. Consequences of sleep aids",
+ "D. Methods of stress relief"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which event occurs FIRST after the brain cleaning scene with brushes and water?",
+ "Choice": [
+ "A. Tan-skinned man sleeping in red bed",
+ "B. Half-empty battery above tired office worker",
+ "C. Black man sleeping with '7-9 HOURS' thinking bubble",
+ "D. Fair-skinned man with decreasing battery icon"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "IXjkbgrdAkw",
+ "video_duration": "60s",
+ "video_category": "Travel & Events",
+ "questions": [
+ {
+ "Question": "Which food item is Mark Wiens describing when he says 'so soggy and so delicious in my life'?",
+ "Choice": [
+ "A. Saucy Italian beef sandwich",
+ "B. Chicago-style hot dog",
+ "C. Overcooked chicken wings",
+ "D. Barbecue ribs"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "K3ksKkCOgTw",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What visual event occurs immediately AFTER the cascade of white sugar falls into a pile in the first 20s?",
+ "Choice": [
+ "A. The word 'DANGERS' written in sugar appears",
+ "B. The text 'CANCER' appears in sugar",
+ "C. A woman in a lab coat is seen speaking",
+ "D. The text 'ALZHEIMER'S' written in sugar appears"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "Kn10Jf1x24Q",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What caused the baby rabbit's distressed squeaking sounds in the video?",
+ "Choice": [
+ "A. It was lying on sharp gravel",
+ "B. It was avoiding the parked vehicle",
+ "C. It was vocalizing for its mother",
+ "D. It was being restrained in a human hand"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "Lyv_2usFQA0",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What was the primary reason the speaker's video recording process extended over a week?",
+ "Choice": [
+ "A. Repeatedly exceeded ChatGPT's usage limits",
+ "B. Awaiting Team plan's higher credit allocation",
+ "C. Needed to compile and verify the GPT directory",
+ "D. Time required to thoroughly evaluate over 100 GPTs"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Why is the Team plan visually highlighted when discussing credit limitations?",
+ "Choice": [
+ "A. To emphasize it provided unlimited credits for recording",
+ "B. To indicate it was the subscription tier causing credit depletion",
+ "C. To demonstrate an upgrade attempt that didn't resolve credit issues",
+ "D. To compare its credit allocation with the Plus plan"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "LeBG88YT3rU",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "When the man states 'You said Jack of Clubs and then 23, right?', what primary visual element dominates the screen?",
+ "Choice": [
+ "A. The man sitting on a bed with abstract red wall art",
+ "B. Jen holding the Jack of Clubs card toward the camera",
+ "C. A silver logo prominently displayed on the man's baseball cap",
+ "D. Jen's hands shuffling cards in close-up"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What unexpected result of the card revelation causes Jen's distressed reaction?",
+ "Choice": [
+ "A. The exposed card matched both her suit declaration and number",
+ "B. The trick exposed a card from a marked deck",
+ "C. The man demonstrated prior knowledge of her card",
+ "D. The card position shifted during her count"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "MiPw-RZMHCQ",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which spoken instruction coincided with the shirtless man and woman continuing their push-up exercises?",
+ "Choice": [
+ "A. 'All right. Helen, easy'",
+ "B. 'Okay, next, two'",
+ "C. 'Got some catching up to do, Mark'",
+ "D. 'That's right'"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What caused the shirtless man to lose balance during his push-up exercise?",
+ "Choice": [
+ "A. Muscle fatigue from prolonged strenuous effort",
+ "B. Distraction due to sudden background laughter",
+ "C. Instability from improperly aligned kettlebells",
+ "D. Overexertion triggered by a coach's competitive remark"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What visual event occurs immediately following the labored breathing during the push-up exercise?",
+ "Choice": [
+ "A. The woman stands up",
+ "B. A new pair begins the exercise",
+ "C. The man picks up a kettlebell",
+ "D. The man loses balance and falls"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "NeLj-gLWgug",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What visual event occurred immediately after the scraping sound stopped in the first section?",
+ "Choice": [
+ "A. Man makes his bed",
+ "B. Man brushes his teeth",
+ "C. Man adjusts the window blinds",
+ "D. Man enters the bathroom"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "PAI-gQ19pbs",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What visual event occurred simultaneously with the announcer's comment about 'Joey Anderson finding Seth Jones'?",
+ "Choice": [
+ "A. Players jostling for position in front of the net",
+ "B. Goalie freezing the puck after a glove save",
+ "C. Winger taking a long shot from the blue line",
+ "D. Defenseman intercepting a centering pass"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "K0mjUgHKfJo",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What visual element appears immediately following the narrator's statement about 'air temperature being colder than warm water'?",
+ "Choice": [
+ "A. Blue dots at polar regions",
+ "B. Sun emitting arrows toward equator",
+ "C. White dashed ocean current paths",
+ "D. Energy transfer diagram with thermometers"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "OCTekmo_szs",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What is the chronological order of glitter additions shown in the video?",
+ "Choice": [
+ "A. Gold → Star-shaped → Silver",
+ "B. Star-shaped → Silver → Gold",
+ "C. Gold → Silver → Star-shaped",
+ "D. Silver → Gold → Star-shaped"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "K759eXmaMTY",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What visual event immediately follows the man's statement about resurrecting the place in the debris-filled area?",
+ "Choice": [
+ "A. An adult woman demonstrating ballet moves",
+ "B. President Erdoğan interacting with children",
+ "C. Close-up of two girls in pink ballet outfits holding hands",
+ "D. A group of girls forming a circle in ballet attire"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "MIYDzEYfI6U",
+ "video_duration": "60s",
+ "video_category": "Travel & Events",
+ "questions": [
+ {
+ "Question": "Which visual elements are displayed when Lana says 'I started this journey as a solo traveler' versus 'People are kinder than you think'?",
+ "Choice": [
+ "A. Illuminated wildlife attraction and factory production",
+ "B. Oversized luggage and patterned home decor",
+ "C. Adventure pathway and cultural celebration",
+ "D. Sunglasses reflection and aircraft structure"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "KEy2iCzpce4",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What was the sequence of distinct audio events between 0-40 seconds?",
+ "Choice": [
+ "A. Pop music → Speech about tape → Applause",
+ "B. Pop music → Speech about scars → Applause",
+ "C. Cheering → Pop music → Applause",
+ "D. Cheering → Pop music → Speech about tape"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What visual event coincided with Brooke's mention of the stage tape during her chin stand?",
+ "Choice": [
+ "A. Brooke executes a standing split",
+ "B. Brooke performs backward flips on stage",
+ "C. Brooke points to her chin",
+ "D. Brooke touches her neck with one hand"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "NrtwBO_nyFA",
+ "video_duration": "60s",
+ "video_category": "Autos & Vehicles",
+ "questions": [
+ {
+ "Question": "What stage immediately follows the wire cutting saw operation shown in the quarry?",
+ "Choice": [
+ "A. Marble block transportation",
+ "B. Detachment using an excavator",
+ "C. Terrace level reinforcement",
+ "D. Initial drilling preparation"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "K946mptgzTU",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What is the primary purpose of the robed individuals traversing the desert?",
+ "Choice": [
+ "A. Participating in a religious pilgrimage",
+ "B. Retracing ancient trade routes",
+ "C. Conducting survival training in extreme conditions",
+ "D. Filming a documentary on desert ecosystems"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "O7zcPvQDIMU",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What visual element is shown as the narrator states, 'Within six months, two of the most powerful monarchs...'?",
+ "Choice": [
+ "A. Two flags swaying",
+ "B. A war scene with clashing soldiers",
+ "C. A helmet and sword",
+ "D. A piece of medieval embroidery with people"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "Which visual element appears immediately following the foreground trees as the narrator describes the comet's historical interpretation?",
+ "Choice": [
+ "A. Armor and weaponry arranged ceremoniously",
+ "B. Heraldic banners fluttering against a plain backdrop",
+ "C. An embroidered tapestry depicting medieval figures",
+ "D. Combatants engaged in close-quarters battle"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "S8w44Tl5atk",
+ "video_duration": "60s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "What mechanic allows Spectral Cutlass to maintain high durability throughout the game?",
+ "Choice": [
+ "A. Summoning Pirates triggers durability gains",
+ "B. Its Lifesteal effect restores durability",
+ "C. Playing cards from other classes increases durability",
+ "D. Weapon buffs like Deadly Poison reinforce durability"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What key mechanical difference determines how Spectral Cutlass and Bloodsail Raider gain their respective stat boosts?",
+ "Choice": [
+ "A. One gains Lifesteal from weapon hits, the other gains attack from weapon buffs",
+ "B. One triggers on minion summon, the other on card draw",
+ "C. One scales with current weapon attack, the other with number of different-class cards played",
+ "D. One scales with weapon durability, the other with number of Pirates played"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What key mechanic of Shoplifter Goldbeard's ability allows it to directly damage the opponent's hero?",
+ "Choice": [
+ "A. Summons gain Charge",
+ "B. Triggers weapon durability effects",
+ "C. Attacks can bypass enemy minions",
+ "D. Synergizes with lifesteal mechanics"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "QdHBJ-NiaHM",
+ "video_duration": "60s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "Which effect is triggered by playing the Mini Factory Assemblybot?",
+ "Choice": [
+ "A. A Divine Shield minion is summoned from your deck",
+ "B. A minion is drawn and transformed into a 5/5",
+ "C. A Deathrattle swaps this minion with one in your hand",
+ "D. A 6/7 Bot attacks a random enemy at turn's end"
+ ],
+ "Answer": "D",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which event occurs IMMEDIATELY after the explanation of Mini Factory Assemblybot's effect concludes?",
+ "Choice": [
+ "A. Uther declares 'I will fight with honor!'",
+ "B. Annoy-o-Troupe is revealed in the initial hand",
+ "C. Garrosh declares 'I will show no mercy!'",
+ "D. The Hearthstone match between Uther and Garrosh begins"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What audio-visual synchronization confirms Garrosh's dialogue occurs during active gameplay?",
+ "Choice": [
+ "A. Uther's portrait glows during his response",
+ "B. Assemblybot's attack animation triggers",
+ "C. Match music plays while heroes face off on board",
+ "D. Annoy-o-Troupe appears in the initial hand"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "NagvWwLvRik",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What is the primary reason the video includes a scene of an excavator crushing cars?",
+ "Choice": [
+ "A. Demonstrating heavy machinery operation skills",
+ "B. Documenting a planned activity from their itinerary",
+ "C. Preparing a cleared area for their campsite",
+ "D. Highlighting a destruction derby competition"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which event occurred FIRST chronologically?",
+ "Choice": [
+ "A. Ax hitting wood with logo reveal",
+ "B. Truck approaches fenced desert area",
+ "C. Family announces departure from Salt Lake City",
+ "D. Explanation about fossilized footprints"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which planned activity directly followed the family's highway journey to Moab?",
+ "Choice": [
+ "A. Searching for fossilized dinosaur footprints",
+ "B. Off-roading through the desert",
+ "C. Discovering ancient ruins",
+ "D. Destroying cars with excavators"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "J5Npf2xJpag",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What historical context is provided when the man's portrait is prominently displayed?",
+ "Choice": [
+ "A. Intense dramatic music crescendo",
+ "B. 1886 clothing business success",
+ "C. Discussion of risk management strategies",
+ "D. 1869 establishment of the firm"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "PYqmI0Hiaho",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which event initiates the sequence leading to the goal in the 0-20s segment?",
+ "Choice": [
+ "A. De Bruyne scores with a powerful kick",
+ "B. Crowd erupts in celebration post-goal",
+ "C. A defender closes in on De Bruyne",
+ "D. The commentator notes Lukaku evading a defender"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What does the announcer's commentary specifically indicate about the game's status during the 20-40s segment?",
+ "Choice": [
+ "A. The game was in extra time",
+ "B. Lukaku provided an assist for the goal",
+ "C. Belgium extended their lead to 2-0",
+ "D. The match was part of the group stage"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ },
+ {
+ "Question": "During the 20-40s segment, which commentary line is heard as De Bruyne dribbles past defenders and scores?",
+ "Choice": [
+ "A. 'De Bruyne! 2-0. Belgium!' spoken",
+ "B. 'Is that the moment...' spoken",
+ "C. 'Kevin De Bruyne with a breakthrough...' spoken",
+ "D. 'Belgium have the goal!' spoken"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What event directly triggered the loudest crowd reaction in the 0-20s segment?",
+ "Choice": [
+ "A. A controversial offside call being overturned",
+ "B. Lukaku's successful dribble past two defenders",
+ "C. A defensive error creating an open net",
+ "D. De Bruyne's powerful strike into the net"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "Lo-ewzf-Q1k",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Why does the woman gesture toward her under-eye region during her skincare explanation?",
+ "Choice": [
+ "A. To draw attention to a genetic facial feature",
+ "B. To critique the bathroom lighting conditions",
+ "C. To demonstrate proper sunscreen application",
+ "D. To showcase her morning skincare products"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which audio event directly coincided with the bathroom sink's running water being shown?",
+ "Choice": [
+ "A. Pop music with male vocals",
+ "B. Male voice singing 'I've been having her in my dreams'",
+ "C. Discussion about sunscreen application",
+ "D. Sound of cotton swab container being opened"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "S_MAesZsnMk",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which statement did the young man make while explaining the correct pronunciation of his name?",
+ "Choice": [
+ "A. 'Price out of paradise.'",
+ "B. 'You pronounce a T is like a D in the Talking culture.'",
+ "C. 'My name is Im Togi and I'm from Kahuku Hawaii.'",
+ "D. 'Hi'"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What reason does Iam Tongi explicitly state for relocating from Hawaii to Seattle?",
+ "Choice": [
+ "A. To focus on launching his music career",
+ "B. Economic pressures referenced as 'price out of paradise'",
+ "C. Preference for Seattle's metropolitan lifestyle",
+ "D. To accompany his mother's job transfer"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "RP1AL2DU6vQ",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual event occurs as Fred states, 'I walk because my school's only in the center of the town'?",
+ "Choice": [
+ "A. Fred stands in the kitchen holding books",
+ "B. Fred exits his yard through green bushes",
+ "C. Fred descends stairs while packing his bag",
+ "D. Fred walks toward a red brick building"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What discrepancy exists between the visual breakfast presentation and the audio description?",
+ "Choice": [
+ "A. Cereal appears with milk rather than described juice",
+ "B. Toast preparation occurs before descending stairs",
+ "C. Breakfast occurs in a dining room absent from narration",
+ "D. Orange juice is shown in a mug during cereal consumption"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "QnGwAUTM1fY",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What distinct auditory elements were present when the dark-skinned man in the 'BRUNO STRONG' shirt became fully visible for the first time?",
+ "Choice": [
+ "A. Authoritative speech declaring game rules",
+ "B. Overlapping crowd chatter and shoe scuffs",
+ "C. Fast-paced music with layered electronic percussion",
+ "D. Basketball dribbling synchronized with sneaker squeaks"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "QTA8j5wSTx4",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Why does Dion position one leg on the chair in the 40-60s segment?",
+ "Choice": [
+ "A. To reach an object on the nearby radiator",
+ "B. To imitate Napoleon's iconic pose",
+ "C. To mimic the hand-in-coat gesture from historical portraits",
+ "D. To stabilize the chair during disassembly"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "UGMDV82m3Ro",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which visual event directly coincided with the initial squawking sound after entering the Patel's living room?",
+ "Choice": [
+ "A. The parrot preens its feathers in the cage",
+ "B. The group first notices the parrot cage",
+ "C. Another black-haired woman turns toward the source of the sound",
+ "D. The marron-haired woman ducks down suddenly"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What was Stacey's primary purpose for kneeling while discussing plans for the parrot?",
+ "Choice": [
+ "A. To prevent startling the parrot during interaction",
+ "B. To inspect the cage's locking mechanism",
+ "C. To collaborate on cage design improvements",
+ "D. To evaluate the parrot's vocal behavior patterns"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "UeOy1WKeTRY",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What emotional undertone does James convey by looking down during the 0-20s segment?",
+ "Choice": [
+ "A. To convey contemplation about the assistant's query",
+ "B. To express physical discomfort from prolonged sitting",
+ "C. To foreshadow criticism of streaming platforms",
+ "D. To emphasize resigned acceptance of endless content"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which audio event coincided with James Corden speaking while seated on a light-colored sofa before receiving a reply about his queue?",
+ "Choice": [
+ "A. Gentle melancholic music",
+ "B. 'Play my queue' spoken",
+ "C. Exhaling sound",
+ "D. 'How many shows left on my queue?' spoken"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "MPfuSfiZ96A",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "What contrast does the narrator's commentary emphasize when juxtaposed with the nighttime cityscape visuals?",
+ "Choice": [
+ "A. Modern skyscrapers vs traditional stilted housing",
+ "B. Urban electricity vs rural reliance on self-made lamps",
+ "C. Electronic music vs acoustic folk instruments",
+ "D. Globalized infrastructure vs localized resource gathering"
+ ],
+ "Answer": "B",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "TPS22HRRY1k",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What visual element signifies the completion of Will's regulatory registration process with authorities?",
+ "Choice": [
+ "A. Document appears showing next steps",
+ "B. Question marks over Will's head",
+ "C. Lightbulb appears over Will",
+ "D. The words 'MUTUAL FUND' reappear"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which specific visual attribute differentiates the 'MUTUAL FUND' text from the 'WILL'S FUND' text when they first appear in 40-60s segment?",
+ "Choice": [
+ "A. Font weight (bold vs light)",
+ "B. Background texture (solid vs gradient)",
+ "C. Accompanying graphic (securities list vs investor group)",
+ "D. Outline/border color (blue vs green)"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ },
+ {
+ "Question": "What key aspect of Will's mutual fund is visualized by the green pie chart as it fills up during the narration?",
+ "Choice": [
+ "A. Total capital raised from investors",
+ "B. Proportional ownership of each investor",
+ "C. Performance growth of the fund",
+ "D. SEC regulatory compliance"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ },
+ {
+ "Question": "What narrative purpose do the question marks above Will's head serve when they first appear?",
+ "Choice": [
+ "A. To show Will's confusion about SEC regulations",
+ "B. To visualize Will analyzing market trends",
+ "C. To demonstrate investor demand for his expertise",
+ "D. To indicate SEC scrutiny of his fund proposal"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "TtJVJ9UP7M0",
+ "video_duration": "60s",
+ "video_category": "Nonprofits & Activism",
+ "questions": [
+ {
+ "Question": "What key information is conveyed through both visual and textual elements in the opening title cards?",
+ "Choice": [
+ "A. Statistics about juvenile sentencing rates",
+ "B. Production credit for legal organizations",
+ "C. Interactive features for courtroom navigation",
+ "D. Dress code variations across court levels"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "Pfyx5bEZK80",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Which audio event specifically coincided with the initial visual breach of two gray whales at the water's surface?",
+ "Choice": [
+ "A. Dramatic musical score begins playing",
+ "B. 'Up to 12,000 make the epic journey' spoken",
+ "C. 'The Hawaiians call these 40-ton mammals Kohola' spoken",
+ "D. 'Along the coast of Hawaii Volcanoes National Park...' spoken"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the primary reason humpback whales migrate to Hawaiian waters as indicated by the audio-visual cues?",
+ "Choice": [
+ "A. To reach warmer environments essential for their survival",
+ "B. To evade oceanic noise pollution",
+ "C. To reach their breeding grounds (Kohola)",
+ "D. To access abundant feeding areas"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What does the narration specifically emphasize about the 'epic journey' referenced?",
+ "Choice": [
+ "A. It includes a massive population of whales",
+ "B. It represents an annual migration to warmer waters",
+ "C. It involves complex vertical diving patterns",
+ "D. It takes place solely near breeding grounds"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "QopYbLq-zIQ",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Why does Gordon Ramsay specifically highlight Owen as facing the most pressure in the competition?",
+ "Choice": [
+ "A. His station suffered equipment malfunctions",
+ "B. He holds a Michelin-starred culinary status",
+ "C. His established culinary reputation",
+ "D. He's experimenting with a molecular gastronomy technique"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What visual event occurred immediately after Gordon Ramsay sounded the air horn?",
+ "Choice": [
+ "A. Contestant pouring white liquid into glass",
+ "B. Albert removing items from container",
+ "C. Salt Hank chopping red onion",
+ "D. Owen walking to ingredients table"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "Tn70NxIMk2Q",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What immediate auditory cue prompts Dante to cease dribbling in the 20-40s segment?",
+ "Choice": [
+ "A. Televised crowd cheers during a dunk",
+ "B. An off-camera imperative to stop dribbling",
+ "C. Audience members applauding a Skills Camp play",
+ "D. A coach's 'Gel' command during a drill"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which visual event occurs simultaneously with the woman asking 'Dante, what are you looking at?'?",
+ "Choice": [
+ "A. A muscular man dribbling in front of a 'SKILLS CAMP' audience",
+ "B. Dante signing documents at a 'Duke' table",
+ "C. Dante standing on a city street holding a basketball",
+ "D. An enclosed basketball arena filled with spectators"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What contextual element accompanies the statement 'It was a joke' in the 40-60s segment?",
+ "Choice": [
+ "A. Dante reviewing a critical sports article about himself",
+ "B. A player blocking Dante's shot during the game",
+ "C. Dante signing a contract with the Duke team",
+ "D. Spectators chanting after the 'game over' call"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "PU-XOFIJMlg",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Which event occurred FIRST chronologically in the video?",
+ "Choice": [
+ "A. Water bending against traditional Asian buildings",
+ "B. Punch sound synchronized with Aang flying forward",
+ "C. Aang levitating rocks with a glowing blue ball",
+ "D. Aang creating a swirling circle of flames around himself"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "Ry2dJuJ-9UE",
+ "video_duration": "60s",
+ "video_category": "Travel & Events",
+ "questions": [
+ {
+ "Question": "Which audio-visual pairing occurs when the narrator defines glaciovolcanoes?",
+ "Choice": [
+ "A. Ominous brass music during the animated title sequence",
+ "B. 'Union of ice and fire' narration over erupting magma and glacial ice",
+ "C. 'Glaciovolcanoes' explanation with simultaneous views of Mount Edziza and Icelandic volcanoes",
+ "D. Male choir chanting during Mt. Eyja's nighttime eruption"
+ ],
+ "Answer": "C",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "QTyzyP2Afys",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual element appeared on screen right after the woman concluded her speech on freedom and silent protest?",
+ "Choice": [
+ "A. Bitcoin value chart with '+0.40%' fluctuation",
+ "B. Close-up of Euro banknotes being printed",
+ "C. Man in a black jacket gesturing emphatically",
+ "D. 'Auftragsvorschau' order preview interface"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "QI9VIulqTCA",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with the fading water sounds?",
+ "Choice": [
+ "A. Rapid zoom into a dark icy passage",
+ "B. Distant iceberg collapse aftermath",
+ "C. Low-angle view of anthropomorphic icebergs",
+ "D. ECV logo transition"
+ ],
+ "Answer": "B",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "U9-Qy3H_UBY",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which audio segment plays during the shot of the 'DYERSBURG' water tower?",
+ "Choice": [
+ "A. 'We at the spot, feel me? About to put my stuff up. Teammates.'",
+ "B. 'Dyersburg, Tennessee. There isn't much going on.'",
+ "C. 'Face insecurities, come with maturity.'",
+ "D. 'ask you questions, you cry out confessions...'"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "What is the correct sequence of events involving the young man?",
+ "Choice": [
+ "A. Green outline appears → Checks phone → Emerges from SUV → Spins basketball",
+ "B. Checks phone → Green outline appears → Spins basketball → Emerges from SUV",
+ "C. Spins basketball → Green outline appears → Emerges from SUV and drinks → Checks phone",
+ "D. Emerges from SUV → Spins basketball → Green outline appears → Checks phone"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "QtKT3q7xB4M",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What audio event occurred as Ed Moses exited the transparent cylindrical tunnel?",
+ "Choice": [
+ "A. A commentator shouted 'Going, he's got...'",
+ "B. Crowd chants of 'USA' resumed",
+ "C. The score increased from 140 to 160",
+ "D. A Portuguese speaker remarked about being lost"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which audio event occurred when Ed Moses first positioned himself on his knees again after standing up on the rotating platform?",
+ "Choice": [
+ "A. 'Tá bem perdido, hein? Como se tivesse bebido muitas garrafas de whisky.'",
+ "B. 'Touché'",
+ "C. 'I'm with you baby!'",
+ "D. 'USA! USA! USA!'"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "TF9I1GxNdJQ",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Which visual event occurs right after the keyboard typing sound stops in the 0-20s segment?",
+ "Choice": [
+ "A. Flower-headed figure animation playback",
+ "B. Full setup reveal on glass table",
+ "C. Tablet sketch refinement with stylus",
+ "D. Digital artist adding character speech bubbles"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What occurred immediately after the artist paused refining the initial sketch in the animation program?",
+ "Choice": [
+ "A. A drawer opening sound accompanied the stick figure's introduction",
+ "B. The text 'Animator vs Animation V' appeared",
+ "C. The stick figure remained visible on the canvas",
+ "D. Circular elements were added to the flower-headed figure's legs"
+ ],
+ "Answer": "C",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What is the orange stick figure's primary intent when responding with 'yeah...' in the collaboration context?",
+ "Choice": [
+ "A. Willingness to demonstrate a method",
+ "B. Confirmation of current work",
+ "C. Disapproval of artistic style",
+ "D. Uncertainty in creative direction"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Which auditory cue coincides with the orange stick figure's declaration 'Can do' during the leg detail enhancement?",
+ "Choice": [
+ "A. Metallic click upon completion",
+ "B. Pop sound effect",
+ "C. Low-frequency hum during speech",
+ "D. Scratch effect during limb modification"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "PSyedQl9IiQ",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "Why did the man dismiss the children's justification of their spelling as 'nonsense'?",
+ "Choice": [
+ "A. The spelling followed unconventional phonetic rules",
+ "B. He opposed Ms. Honey's pedagogical principles",
+ "C. The students deliberately altered the spelling",
+ "D. The children showed disrespect during the assessment"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "SB4NLADmT_k",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Why was electronic music specifically chosen to accompany Pablo Torrijos' entrance?",
+ "Choice": [
+ "A. To align with Spain's traditional musical motifs",
+ "B. To draw attention to the electronic drum composition",
+ "C. To synchronize with the vivid color-themed visual spectacle",
+ "D. To reinforce the retro aesthetic of the illuminated signage"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ },
+ {
+ "Question": "Why was a player's absence notable in the men's triple jump final?",
+ "Choice": [
+ "A. He was a top performer in the previous outdoor season",
+ "B. He was scheduled to introduce a new jump style",
+ "C. His withdrawal led to a rule change in the event",
+ "D. His absence reduced the competitive intensity"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "RBTiTcHm_ac",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "Which visual element coincided with the explanation introducing the category of pictorial logos?",
+ "Choice": [
+ "A. John Hancock's signature on historical document",
+ "B. FedEx logo in purple and orange",
+ "C. Animation of mountains and sun in white-lined frame",
+ "D. Bright blue background with white circle and number '2'"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Why was John Hancock's signature highlighted in the context of logo design principles?",
+ "Choice": [
+ "A. As an example of a rebus in corporate identity",
+ "B. To showcase branding strategies in political campaigns",
+ "C. To illustrate the concept of a word mark",
+ "D. To demonstrate symbolic integration within typography"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "According to the video, what key design philosophy differentiates Google's logo from Budweiser's?",
+ "Choice": [
+ "A. Google employs symbolic abstraction while Budweiser uses literal representation",
+ "B. Google prioritizes digital adaptability while Budweiser focuses on physical reproduction",
+ "C. Google uses geometric shapes while Budweiser uses organic curves",
+ "D. Google embodies progressive minimalism while Budweiser reflects historical tradition"
+ ],
+ "Answer": "D",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "Q0B5dLHDQ2w",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Which audio event occurs simultaneously with the dragon holding a gun?",
+ "Choice": [
+ "A. Gunshot",
+ "B. Heavy metal instrumental",
+ "C. Crowd cheering",
+ "D. 'I got a gun...' sung"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "RF86oSB6_bc",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What does the buzzer sound during the washing machine demonstration indicate?",
+ "Choice": [
+ "A. The shoes are already spotless",
+ "B. The cleaning method is inappropriate for the material",
+ "C. The washing machine is malfunctioning",
+ "D. Necessary preparation steps were skipped"
+ ],
+ "Answer": "B",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "QrmxMNBhH2Q",
+ "video_duration": "60s",
+ "video_category": "Film & Animation",
+ "questions": [
+ {
+ "Question": "Which visual event is paired with the audio discussing the aspiration to entertain people?",
+ "Choice": [
+ "A. A boy holding a medal",
+ "B. Video of students' performance",
+ "C. 'Welcome to The City of Wolverhampton' sign",
+ "D. A vampire costume photo"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which visual element appears FIRST in the video's timeline?",
+ "Choice": [
+ "A. 'Holmes Chapel' sign",
+ "B. Stage with pharaoh backdrop",
+ "C. Baby in a white stroller",
+ "D. Cityscape with high-rise buildings"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "UIvP4xyw9H8",
+ "video_duration": "60s",
+ "video_category": "Gaming",
+ "questions": [
+ {
+ "Question": "What contextual element confirms the bomb was planted at an outdoor site in the first 30s?",
+ "Choice": [
+ "A. Player switching to a pistol after bomb placement",
+ "B. Palm trees visible during explosion",
+ "C. Clear blue sky visible during 'Bomb has been planted' audio",
+ "D. Indoor shelving units visible in post-planting navigation"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "UyaKyS2wTQk",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What was the primary purpose of displaying the spinal X-ray when discussing potential pain reduction?",
+ "Choice": [
+ "A. To showcase post-operative results",
+ "B. To compare surgical approaches",
+ "C. To corroborate clinical assessment",
+ "D. To identify the root pathology"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "Which visual event occurs right after the audio of a man groaning in pain ceases?",
+ "Choice": [
+ "A. Close-up of a surgical light in an operating room",
+ "B. Bare-chested man with back tattoos sitting on bed",
+ "C. Elderly man with blue T-shirt sitting on a bed",
+ "D. Man with goatee discussing desperate solutions"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "UZvydHZKyww",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What distinct visual settings are paired with the mentions of $16 and $20 minimum wage increases respectively?",
+ "Choice": [
+ "A. Wholesale store checkout lanes vs. fast-food counter",
+ "B. Hospital environment vs. restaurant tablet usage",
+ "C. Legislative graphic overlay vs. dessert shop transactions",
+ "D. News anchor desk vs. cashier counting money"
+ ],
+ "Answer": "A",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "RwdvZLz-aSk",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which visual was shown when discussing the challenges José Mourinho faced early in his career?",
+ "Choice": [
+ "A. Two men speaking to reporters with microphones",
+ "B. Young Mourinho facing a silver-haired man",
+ "C. Faded close-up of Félix Mourinho with text",
+ "D. Graphic with word 'cruel' over interviewed men"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "TbaCxIJ_VP4",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What occurred immediately following the visualization of viral spike proteins binding to the host cell membrane?",
+ "Choice": [
+ "A. Microinjection of vaccine components into the cellular matrix",
+ "B. Initiation of viral genome replication",
+ "C. Intracellular production of COVID-19 spike proteins",
+ "D. Activation of adaptive immune cell recruitment"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which visual event occurs simultaneously with the audio stating 'quickly detect a coronavirus infection'?",
+ "Choice": [
+ "A. Antibodies responding to viral entry",
+ "B. Virus replicating through host cell attachment",
+ "C. COVID spikes transferring information to immune cells",
+ "D. mRNA directing protein synthesis in cells"
+ ],
+ "Answer": "C",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "VBygMK8jRoI",
+ "video_duration": "60s",
+ "video_category": "Howto & Style",
+ "questions": [
+ {
+ "Question": "What is the primary purpose of wearing socks inside out as suggested in the video?",
+ "Choice": [
+ "A. To minimize friction from the skate's interior stitching",
+ "B. To avoid purchasing specialized skating socks",
+ "C. To enhance foot flexibility during complex maneuvers",
+ "D. To prevent blisters caused by rigid skate materials"
+ ],
+ "Answer": "A",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "QTK_bC00ilg",
+ "video_duration": "60s",
+ "video_category": "Science & Technology",
+ "questions": [
+ {
+ "Question": "Which visual appears immediately AFTER the speaker states 'cross the line and things suddenly change'?",
+ "Choice": [
+ "A. A 3D map labeling the Wallace Line",
+ "B. Australian marsupials and reptiles like kangaroos and Komodo dragons",
+ "C. Asian mammals including tigers and rhinos",
+ "D. A framed map displayed during a scientist's explanation"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "Which visual element is specifically highlighted when the speaker describes the Wallace Line as 'the sharpest and most iconic of all'?",
+ "Choice": [
+ "A. A presenter pointing toward Southeast Asian islands",
+ "B. A text annotation with an arrow identifying the boundary",
+ "C. A framed historical map showing the red demarcation",
+ "D. Kangaroos and cockatoos replacing Asian mammals"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "TGom0uiW130",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What historical inaccuracy regarding aviation pioneers motivates the man in the light blue shirt's frustration?",
+ "Choice": [
+ "A. Suppression of Gustave Whitehead's technical innovations",
+ "B. Fabricated accounts of the Wright brothers' achievements",
+ "C. Misrepresentation of Australian aviation contributions",
+ "D. Flaws in early 20th-century aircraft engineering"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "S1nUMsPC1-0",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What is the primary reason the person is misting the iguana with water?",
+ "Choice": [
+ "A. To encourage social interaction",
+ "B. To promote hydration through drinking",
+ "C. To remove debris from its skin",
+ "D. To maintain proper skin moisture levels"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "UIX6f0xMQ2U",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "Which event occurred FIRST during the competition?",
+ "Choice": [
+ "A. Commentator exclaiming 'That's huge' during Echevarría's 8.50m jump",
+ "B. Euro-house music during Echevarría's subsequent jump",
+ "C. Manyonga's 8.51m appearance in London",
+ "D. Reggae music during Manyonga's 8.53m jump"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "QPzmsQ86_HM",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which visual event coincides with Devonte's statement, 'Listen... I'll be honest' in the video?",
+ "Choice": [
+ "A. Devonte checking his phone on a leather couch",
+ "B. Devonte adjusting headphones while at his desk",
+ "C. A woman unloading pizza boxes from a car trunk",
+ "D. Devonte brushing his teeth in front of a mirror"
+ ],
+ "Answer": "D",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "SfZrnoo1GPM",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual coincides with the gentle music playing after the courtroom shouting discussion?",
+ "Choice": [
+ "A. Two men analyzing movie clips on a couch",
+ "B. Animated man with 'Real Lawyer Reacts!' clapperboard",
+ "C. Spencer in military uniform photo",
+ "D. Naval officer holding red liquid tray"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "SFmocFLXg0Y",
+ "video_duration": "60s",
+ "video_category": "People & Blogs",
+ "questions": [
+ {
+ "Question": "Which visual event occurs when the speaker says 'balance training into your lower body strength training'?",
+ "Choice": [
+ "A. Woman stands on stone stairs with hands on hips",
+ "B. Woman lowers herself on park bench doing push-ups",
+ "C. Woman demonstrates squats using a wooden bench",
+ "D. Woman performs planks on a wooden bench"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which exercise is demonstrated while the speaker begins discussing incorporating balance training into lower body workouts?",
+ "Choice": [
+ "A. Plank position using a bench incline",
+ "B. Single-leg balance on flat stone stairs",
+ "C. Incline push-ups with feet elevated",
+ "D. Step-ups on uneven stone steps"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What activity is the woman engaged in while stating the guideline 'aim for 10 to 15 reps for one to three sets'?",
+ "Choice": [
+ "A. Explaining set recommendations near a pond",
+ "B. Performing balance exercises on stone stairs",
+ "C. Demonstrating incline push-ups on a bench",
+ "D. Discussing plank progression techniques on grass"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "QdlP8ai8trw",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "What was the original intended architectural context for Michelangelo's David versus its current exhibition setting?",
+ "Choice": [
+ "A. Intended for the cathedral's interior, now displayed in a museum",
+ "B. Intended for an elevated niche, now at eye level in a gallery",
+ "C. Intended for the cathedral's exterior, now housed in a museum",
+ "D. Intended for a papal commission in Rome, now located in Florence"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "PaC3CEkCD6k",
+ "video_duration": "60s",
+ "video_category": "Travel & Events",
+ "questions": [
+ {
+ "Question": "Which event occurred FIRST chronologically in the video?",
+ "Choice": [
+ "A. Camera shows cruise ship docked",
+ "B. Woman walks through glass-enclosed walkway",
+ "C. Musicians perform under purple lights",
+ "D. Woman interacts with a pinball machine"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "V_skpmEXebM",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "During which specific action does the man first appear in front of the red curtain?",
+ "Choice": [
+ "A. During the closing subscription request",
+ "B. While folding the dollar bill into a fifty",
+ "C. When demonstrating the bill transformation into a hundred",
+ "D. After proposing a business collaboration with Marlon"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "URUJD5NEXC8",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which sequence accurately matches the chronological appearance of labeled components during the cell structure explanation?",
+ "Choice": [
+ "A. Cytoplasm → Cell membrane → Organelles → Nucleus",
+ "B. Cell membrane → Cytoplasm → DNA → Organelles",
+ "C. Cell membrane → Cytoplasm → Nucleus → DNA",
+ "D. Cell membrane → DNA → Cytoplasm → Organelles"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "XkFulQe9EVE",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual event occurred simultaneously with the speaker stating 'Alright, hold on cause we're going to move fast'?",
+ "Choice": [
+ "A. Aerial shot of campus auditorium with students",
+ "B. Detailed stonework and emblem on a Harvard building",
+ "C. Person in front of a Harvard banner",
+ "D. Skateboard navigating green bike lanes"
+ ],
+ "Answer": "B",
+ "Type": "AV Event Alignment"
+ },
+ {
+ "Question": "Which specific urban obstacle does the skateboarder explicitly mention needing to avoid during his commute?",
+ "Choice": [
+ "A. Pedestrians near Berklee College",
+ "B. Bike lane closures",
+ "C. Giant potholes in Boston streets",
+ "D. Traffic congestion in Southie"
+ ],
+ "Answer": "C",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "U8WCRz0Yh4Q",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What key tactical difference distinguishes the 'drag and step over' from the 'inside outside stepover'?",
+ "Choice": [
+ "A. One requires precise inside foot control, the other emphasizes outside foot use",
+ "B. One is initiated from a stationary position, the other while kneeling",
+ "C. One incorporates a fake directional cut-back, the other exploits wing-forward momentum",
+ "D. One uses synth music during demonstration, the other a drum machine"
+ ],
+ "Answer": "C",
+ "Type": "Comparative"
+ }
+ ]
+ },
+ {
+ "video_id": "VQKpMmBDtZo",
+ "video_duration": "60s",
+ "video_category": "News & Politics",
+ "questions": [
+ {
+ "Question": "What visual element appeared WHEN the speaker mentioned 'feats of the age'?",
+ "Choice": [
+ "A. Map of Earth with orbital routes",
+ "B. Timeline graph of space object accumulation",
+ "C. Satellite deployment in geostationary orbit",
+ "D. 3D Earth depicting orbital collision effects"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "UTzl0lLoAJE",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What was the correct sequence of items manipulated by Dante Barnes' tongue during his record demonstration?",
+ "Choice": [
+ "A. Yellow lollipop → White ball → Red popsicle",
+ "B. White ball → Yellow lollipop → Red popsicle",
+ "C. White ball → Red popsicle → Yellow lollipop",
+ "D. Red popsicle → Yellow lollipop → White ball"
+ ],
+ "Answer": "B",
+ "Type": "Event Sequence"
+ },
+ {
+ "Question": "What visual event directly occurred before the woman in the gold dress walked into the studio?",
+ "Choice": [
+ "A. Dante holding up his Guinness certificate",
+ "B. Makeup application for long fingernails",
+ "C. Dante retracting his tongue",
+ "D. Dante placing a red popsicle in his mouth"
+ ],
+ "Answer": "A",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "WNYi8LUGIZc",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What was Richard Jones's primary reason for approaching the judges' table during his audition?",
+ "Choice": [
+ "A. To deliver a gold-wrapped gift for a judge",
+ "B. To better showcase a magic trick",
+ "C. To request to join the judges at their table",
+ "D. To discuss autographs from celebrities"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "X6UqTmRAApw",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What was the woman's primary reason for checking her watch immediately after her call?",
+ "Choice": [
+ "A. To verify if the newsstand's closing time affected her plans",
+ "B. To determine if her call conflicted with another appointment",
+ "C. To confirm she had time to resolve an issue with Tom",
+ "D. To assess her schedule alignment with Lindsay's morning meeting"
+ ],
+ "Answer": "D",
+ "Type": "Reasoning"
+ },
+ {
+ "Question": "What action immediately followed the vendor saying 'Buy something'?",
+ "Choice": [
+ "A. The woman checked her watch",
+ "B. The windshield shattered violently",
+ "C. She walked away while still on the phone",
+ "D. She purchased mints"
+ ],
+ "Answer": "D",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "VoJ-Ey6q8uM",
+ "video_duration": "60s",
+ "video_category": "Pets & Animals",
+ "questions": [
+ {
+ "Question": "What environmental feature was present when the orange tabby cat pawed the mirror, as indicated by the concurrent tapping sound?",
+ "Choice": [
+ "A. Sliding door mirror",
+ "B. Brown wall with white electrical outlet",
+ "C. Blue beanbag and wooden chair",
+ "D. Bathroom setting"
+ ],
+ "Answer": "D",
+ "Type": "Context understanding"
+ }
+ ]
+ },
+ {
+ "video_id": "VFntoBRGF1A",
+ "video_duration": "60s",
+ "video_category": "Travel & Events",
+ "questions": [
+ {
+ "Question": "According to both the narrator's explanation and textual overlays, what makes this summer the definitive deadline for the group's road trip?",
+ "Choice": [
+ "A. Upcoming relocation of members for graduate studies",
+ "B. Completion of their national park itinerary before seasonal closures",
+ "C. Geographical separation due to adult responsibilities",
+ "D. Expiration of their vehicle's extended rental period"
+ ],
+ "Answer": "C",
+ "Type": "Reasoning"
+ }
+ ]
+ },
+ {
+ "video_id": "XOfeqRgaF1A",
+ "video_duration": "60s",
+ "video_category": "Education",
+ "questions": [
+ {
+ "Question": "Which visual event coincided with the audio statement: 'Level of trustworthiness that you perceive in another person's face [...] can predict criminal sentencing'?",
+ "Choice": [
+ "A. Young professional analyzing facial images on a computer",
+ "B. Elderly man adjusting glasses while facing the camera",
+ "C. Middle-aged man in navy shirt smiling slightly",
+ "D. Researcher with wavy hair discussing brain scans"
+ ],
+ "Answer": "A",
+ "Type": "AV Event Alignment"
+ }
+ ]
+ },
+ {
+ "video_id": "TIK01MpwWGg",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "What occurs right after Ross states he left the whitening gel on longer than directed?",
+ "Choice": [
+ "A. Monica inquires, 'How much longer?'",
+ "B. Ross replies, 'A day'",
+ "C. Chandler quips, 'Hillary's blind, right?'",
+ "D. Monica reminds Ross about his date with Hillary"
+ ],
+ "Answer": "A",
+ "Type": "Event Sequence"
+ }
+ ]
+ },
+ {
+ "video_id": "UcL6WlcyBcI",
+ "video_duration": "60s",
+ "video_category": "Sports",
+ "questions": [
+ {
+ "Question": "What primary role do the orange cones serve during the reaction time game?",
+ "Choice": [
+ "A. Indicate reaction targets",
+ "B. Demarcate exercise starting points",
+ "C. Track individual performance scores",
+ "D. Guide instructor demonstrations"
+ ],
+ "Answer": "A",
+ "Type": "Inference"
+ }
+ ]
+ },
+ {
+ "video_id": "RqA3LAo75kw",
+ "video_duration": "60s",
+ "video_category": "Entertainment",
+ "questions": [
+ {
+ "Question": "During the drawing game, what prompted Scarlett Johansson's remark about the hairstyle?",
+ "Choice": [
+ "A. The sketch combined features from multiple Avengers",
+ "B. The drawn hair bore a striking similarity to her own",
+ "C. Chris Evans sarcastically imitated a teammate's haircut",
+ "D. The stick figure was meant to represent Thor's iconic appearance"
+ ],
+ "Answer": "B",
+ "Type": "Reasoning"
+ }
+ ]
+ }
+]
\ No newline at end of file
diff --git a/eval_scripts/Daily-Omni/merge_captions.py b/eval_scripts/Daily-Omni/merge_captions.py
new file mode 100644
index 0000000000000000000000000000000000000000..77b5a1b5ab6ec3208b121e67d929f22a584b9d49
--- /dev/null
+++ b/eval_scripts/Daily-Omni/merge_captions.py
@@ -0,0 +1,40 @@
+import json
+import os
+import argparse
+
+def merge_data(original_file_path, caption_file_path, merged_file_path):
+
+    print(f"Now processing {caption_file_path}")
+
+ file_key = os.path.splitext(os.path.basename(caption_file_path))[0]
+ jsonl_data = {}
+ with open(caption_file_path, 'r', encoding='utf-8') as f:
+ for line in f:
+ item = json.loads(line.strip())
+ vid = item["video_id"].replace("_video", "")
+ jsonl_data[vid] = item["caption"]
+
+ with open(original_file_path, 'r', encoding='utf-8') as f:
+ data = json.load(f)
+
+ for entry in data:
+ vid = entry['video_id']
+ if vid in jsonl_data:
+ entry[file_key] = jsonl_data[vid]
+
+ with open(merged_file_path, 'w', encoding='utf-8') as f:
+ json.dump(data, f, indent=2, ensure_ascii=False)
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(description="Merge caption files into a single JSON file.")
+ parser.add_argument("--original_file", type=str, required=True, help="Path to the original JSON file.")
+ parser.add_argument("--caption_files", type=str, nargs='+', required=True, help="List of caption files to merge.")
+ parser.add_argument("--merged_file", type=str, required=True, help="Path to save the merged JSON file.")
+
+    args = parser.parse_args()
+
+    # Merge cumulatively: after the first caption file, read from the merged
+    # file, otherwise each pass would re-read the original and overwrite the
+    # captions merged in earlier passes.
+    source_file = args.original_file
+    for cf in args.caption_files:
+        merge_data(source_file, cf, args.merged_file)
+        source_file = args.merged_file
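+
+# Usage sketch (hypothetical file names): each JSONL's captions are added to
+# the original QA JSON under a key named after the JSONL file (basename
+# without extension), so entries gain e.g. a "model_a" field:
+#
+#   python eval_scripts/Daily-Omni/merge_captions.py \
+#       --original_file qa.json \
+#       --caption_files model_a.jsonl model_b.jsonl \
+#       --merged_file merged.json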
diff --git a/eval_scripts/UGC-VideoCap/eval_UGC-VideoCap.sh b/eval_scripts/UGC-VideoCap/eval_UGC-VideoCap.sh
new file mode 100644
index 0000000000000000000000000000000000000000..7239a1068b21d099fc33a2b8ab2fb425e397c07b
--- /dev/null
+++ b/eval_scripts/UGC-VideoCap/eval_UGC-VideoCap.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+MODEL_PATH="path_to_AVoCaDO" # TODO
+OUTPUT_DIR="$1"
+
+mkdir -p "$OUTPUT_DIR"
+
+python eval_scripts/UGC-VideoCap/generate_caption.py \
+ --model_path "$MODEL_PATH" \
+ --save_path "$OUTPUT_DIR/model_caption.jsonl"
+
+python eval_scripts/UGC-VideoCap/evaluation.py "${OUTPUT_DIR}/model_caption.jsonl" "${OUTPUT_DIR}/eval_results.json"
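+
+# Usage sketch: the output directory is the single positional argument ($1).
+# Note: evaluation.py reads the OpenAI API key from apikey.txt in the
+# working directory, and MODEL_PATH above must be set first.
+#
+#   bash eval_scripts/UGC-VideoCap/eval_UGC-VideoCap.sh <output_dir>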
diff --git a/eval_scripts/UGC-VideoCap/evaluation.py b/eval_scripts/UGC-VideoCap/evaluation.py
new file mode 100644
index 0000000000000000000000000000000000000000..2c21514682e7476d1b9784afe859ee38c3e0e67c
--- /dev/null
+++ b/eval_scripts/UGC-VideoCap/evaluation.py
@@ -0,0 +1,200 @@
+import os
+import re
+import json
+import time
+import openai
+from datetime import datetime
+from tqdm import tqdm
+import multiprocessing
+import sys
+from pathlib import Path
+
+try:
+    with open("apikey.txt", "r") as f:
+        api_key = f.read().strip()
+except OSError:
+    api_key = ''
+
+def call_gpt4o(msg):
+ while True:
+ try:
+ response = openai.ChatCompletion.create(
+ model="gpt-4o",
+ messages=msg,
+ api_key=api_key,
+ request_timeout=5)
+ break
+        except Exception as e:
+            print(f"Request failed ({e}), retrying...")
+            time.sleep(5)
+
+ output_text = response['choices'][0]['message']['content']
+ return output_text
+
+
+INPUT_JSONL = sys.argv[1]
+OUTPUT_JSON = sys.argv[2]
+script_dir = Path(__file__).resolve().parent
+GT_PATH = script_dir / "final_caption_qa.json"
+NUM_PROCESSES = 10
+
+def safe_parse_evaluation(response_str):
+ try:
+ match = re.search(r"\{.*\}", response_str, re.DOTALL)
+ if match:
+ return json.loads(match.group(0).replace("'", '"'))
+ except Exception as e:
+ print(f"Error parsing evaluation output: {e}")
+ return {}
+
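+# Illustrative example: the parser pulls the first {...} span out of the model
+# reply and converts single quotes to double quotes before json.loads, e.g.
+#
+#   safe_parse_evaluation("Scores: {'visual': 4, 'audio': 3, 'details': 5}")
+#   returns {"visual": 4, "audio": 3, "details": 5}
+#
+# Note the naive quote replacement fails if the reply contains apostrophes;
+# callers fall back to retrying when parsing or the int() casts below fail.
+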
+def evaluate_caption(sample_id, pred_caption, true_caption):
+
+ system_msg = (
+ "You are an assistant that compares a ground truth video description and a predicted video description. "
+ "Evaluate the predicted description against the ground truth on the following three dimensions:\n"
+ "1. **visual**: the accuracy and completeness of visual content including the scene setting, background, characters or objects, their actions or interactions, and any OCR text.\n"
+ "2. **audio**: how well it captures voices, background music, sound effects, and their emotional tone.\n"
+ "3. **details**: the completeness, thematic consistency, purpose, coherence, and integration of multimodal content.\n\n"
+
+ "For each dimension, assign an integer score from 1 to 5, following these detailed grading criteria:\n\n"
+
+ "**Score 1:** The description is mostly irrelevant or misleading. It misrepresents or omits most key information. "
+ "At least 3 important elements are missing or incorrect. Severe hallucinations may be present.\n\n"
+
+ "**Score 2:** The description captures a few elements (1-2 aspects) but is vague or inaccurate for the rest. "
+ "It is poorly structured or confusing, with major omissions or incorrect details.\n\n"
+
+ "**Score 3:** The description aligns with the video on most elements (3 or more), but lacks depth or specificity. "
+ "Some key details are missing, or minor factual errors exist. It's generally correct but too generic or incomplete.\n\n"
+
+ "**Score 4:** A mostly accurate and complete description. Captures nearly all key information (4+ aspects), "
+ "with clear structure and appropriate level of detail. Minor omissions or simplifications are acceptable.\n\n"
+
+ "**Score 5:** Exceptionally accurate and detailed. Covers all relevant aspects thoroughly, with well-integrated information. "
+ "Captures subtle nuances (e.g., emotion, scene dynamics, audio-visual interplay) and reads like it was written by a domain expert.\n\n"
+
+
+ "Respond only with a valid Python dictionary in this format:\n"
+ "{'visual': int, 'audio': int, 'details': int}"
+ )
+
+ user_msg = (
+ f"Sample ID: {sample_id}\n"
+ f"Predicted Description: {pred_caption}\n"
+ f"Ground Truth Description: {true_caption}\n"
+ )
+
+ for i in range(20):
+ try:
+ messages=[
+ {"role": "system", "content": system_msg},
+ {"role": "user", "content": user_msg}
+ ]
+ content = call_gpt4o(messages)
+ print(content)
+ eval_dict = safe_parse_evaluation(content)
+
+ v = int(eval_dict["visual"])
+ a = int(eval_dict["audio"])
+ d = int(eval_dict["details"])
+ return {'visual': v, 'audio': a, 'details': d}
+ except Exception as e:
+ print(f"Error evaluating sample {sample_id}: {e}")
+
+ v = a = d = 0
+ return {'visual': v, 'audio': a, 'details': d}
+
+def process_sample(args):
+ video_id, pred_caption, true_caption = args
+ scores = evaluate_caption(video_id, pred_caption, true_caption)
+ avg_score = sum(scores.values()) / len(scores)
+
+ result_data = {
+ 'visual_score': scores['visual'],
+ 'audio_score': scores['audio'],
+ 'details_score': scores['details'],
+ 'average_score': avg_score
+ }
+ return video_id, result_data
+
+def run_evaluation():
+ try:
+ with open(GT_PATH, 'r', encoding='utf-8') as f_gt:
+ gt_anno = json.load(f_gt)["samples"]
+ gt = {anno["video_id"]: anno["answer"] for anno in gt_anno}
+ except Exception as e:
+ print(f"Error loading ground truth file: {e}")
+ return
+
+ results = {}
+ if os.path.exists(OUTPUT_JSON):
+ try:
+ with open(OUTPUT_JSON, 'r', encoding='utf-8') as f:
+ existing = json.load(f)
+ results = existing.get('evaluations', {})
+ except Exception as e:
+ print(f"Warning: Failed to read previous results: {e}")
+ results = {}
+
+ tasks = []
+ with open(INPUT_JSONL, 'r', encoding='utf-8') as f:
+ for line in f:
+ try:
+ sample = json.loads(line)
+ video_id = sample.get('video_id')
+ pred_caption = sample.get('caption', '').strip()
+
+ if not video_id or not pred_caption:
+ continue
+
+ if video_id in results:
+ continue
+                if video_id not in gt:  # e.g. PI-0abb79ae-18ae-40d9-a66f-3ad7744a6095 has no GT
+                    continue
+
+                true_caption = gt[video_id]
+ tasks.append((video_id, pred_caption, true_caption))
+ except json.JSONDecodeError:
+ continue
+
+ if not tasks:
+ print("No new samples to evaluate.")
+ return
+
+ print(f"Found {len(tasks)} new samples to evaluate.")
+
+ with multiprocessing.Pool(processes=NUM_PROCESSES) as pool:
+ with tqdm(total=len(tasks), desc="Evaluating samples") as pbar:
+ for video_id, result_data in pool.imap_unordered(process_sample, tasks):
+ if result_data:
+ results[video_id] = result_data
+
+ all_scores = list(results.values())
+ count = len(all_scores)
+ sum_visual = sum(s['visual_score'] for s in all_scores)
+ sum_audio = sum(s['audio_score'] for s in all_scores)
+ sum_details = sum(s['details_score'] for s in all_scores)
+ sum_avg = sum(s['average_score'] for s in all_scores)
+
+ current_output = {
+ 'evaluations': results,
+ 'overall_visual_average': round(sum_visual / count / 5.0 * 100, 2),
+ 'overall_audio_average': round(sum_audio / count / 5.0 * 100, 2),
+ 'overall_details_average': round(sum_details / count / 5.0 * 100, 2),
+ 'overall_average_percent': round((sum_avg / count) / 5.0 * 100, 2)
+ }
+
+ with open(OUTPUT_JSON, 'w', encoding='utf-8') as out_f:
+ json.dump(current_output, out_f, indent=2, ensure_ascii=False)
+
+ pbar.update(1)
+
+ print("\nEvaluation complete.")
+ if os.path.exists(OUTPUT_JSON):
+ with open(OUTPUT_JSON, 'r', encoding='utf-8') as f:
+ final_output = json.load(f)
+ print(f"Overall visual avg: {final_output.get('overall_visual_average', 0):.2f}, "
+ f"audio avg: {final_output.get('overall_audio_average', 0):.2f}, "
+ f"details avg: {final_output.get('overall_details_average', 0):.2f}, "
+ f"overall%: {final_output.get('overall_average_percent', 0):.2f}")
+
+if __name__ == '__main__':
+ run_evaluation()
\ No newline at end of file
diff --git a/eval_scripts/UGC-VideoCap/final_caption_qa.json b/eval_scripts/UGC-VideoCap/final_caption_qa.json
new file mode 100644
index 0000000000000000000000000000000000000000..58a34ed66cad67833dc181e578781ebd8206b804
--- /dev/null
+++ b/eval_scripts/UGC-VideoCap/final_caption_qa.json
@@ -0,0 +1,5001 @@
+{
+ "question": "You are given a short video with both audio and visual content. Write a detailed and coherent paragraph that naturally integrates all modalities. Your description should include: (1) the primary scene and background setting; (2) key characters or objects and their actions or interactions; (3) significant audio cues such as voices, background music, sound effects, and their emotional tone; (4) any on-screen text (OCR) and its role in the video context; and (5) the overall theme or purpose of the video. Ensure the output is a fluent and objective paragraph, not a bullet-point list, and captures the video's content in a human-like, narrative style. ",
+ "valid_count": 999,
+ "samples": [
+ {
+ "video_id": "PI-b08991f9-66e2-4bc2-9de5-9228da3bc35d",
+ "path": "/workspace/benchmark/video/PI-b08991f9-66e2-4bc2-9de5-9228da3bc35d.mp4",
+ "answer": "The video captures a formal hearing or debate within a governmental or legislative context, featuring a serious and analytical discussion on health care as a human right. The scene is divided into two sections: the upper part shows an individual with white hair, identified by a nameplate as 'MR. SANDERS,' dressed in a dark suit and light blue shirt, speaking into a microphone. The background is crowded with other individuals, emphasizing the formal setting. The lower section features another speaker with gray hair, also in a dark suit and white shirt with a dark tie, seated and responding to questions. The audio content reveals a male speaker with a neutral tone discussing the concept of health care as a human right, comparing it to free speech and highlighting financial implications, such as long-term smoking leading to cancer. Another male speaker, with an uncertain emotional tone, questions whether the United States should guarantee health care for all as a human right. The overall atmosphere is intense and professional, capturing the formal exchange between the two main speakers and the gravity of the proceedings."
+ },
+ {
+ "video_id": "PI-cc500b00-549b-4d75-b7f9-eb29e3333d64",
+ "path": "/workspace/benchmark/video/PI-cc500b00-549b-4d75-b7f9-eb29e3333d64.mp4",
+ "answer": "The video features two women seated at a wooden table in a room with a white door in the background. The woman on the left, dressed in a vibrant orange and green patterned shirt, a green headband, and a smartwatch, initially rests her chin on her hand. The woman on the right wears a light-colored \"PARK\" t-shirt with her arms crossed. The woman on the left responds to a comment about her daughter's name, explaining that her daughter's real name is Mary April, but her nickname is Apple. She defends this choice with humor, referencing positive associations like apple pie, iPhones, and the phrase \"apple of my eye,\" even mentioning \"lechon\" and suggesting a Google search for \"Apple.\" Her tone shifts from lighthearted to assertive as she expresses strong protective feelings, emphasizing that she is \"very defensive and territorial\" about her offspring. She concludes firmly, stating that no one else has the right to insult her daughter because only she can do so. Throughout, the woman on the right listens attentively with her arms crossed."
+ },
+ {
+ "video_id": "PI-4ccdae5a-fa78-4cd4-887d-0ef30d87c446",
+ "path": "/workspace/benchmark/video/PI-4ccdae5a-fa78-4cd4-887d-0ef30d87c446.mp4",
+ "answer": "In this combined audio-video experience, the audio component sets a somewhat mysterious backdrop. A male voice with an unknown emotional tone is heard, as if the male is discussing a situation centered on financial constraints, possibly in a context similar to a concert dealing with financial issues, conveying a neutral or uncertain emotional expression. A brief mention of a non-English language adds an extra layer of mystery without further elaboration in the dialogue, while ambient music with a steady rhythm maintains a neutral atmosphere throughout. At the same time, the video presents a lively and engaging content review and reaction session. In the beginning, male reviewers are sitting at their desks, ready to respond to a comment suggesting they check out BBNO $. With the beginning of music, the male critic's body undergoes a remarkable transformation. From a motionless sitting position, he transforms into an energetic dancer, swaying to the rhythm and increasing enthusiasm. His hands become a medium of expression, making dynamic gestures perfectly synchronized with the energy of the music. There are moments of pause where he seems to be lost in thought, perhaps dealing with the melody, and then plunging back into the energetic action. At one point, he pointed upwards, clearly indicating his approval. Eventually, he pressed the green button and gave the music a high score of 8/10. The video then transitions to a YouTube channel overview, giving us a glimpse into the reviewer's online presence and the broader content to which he belongs. Together, audio and video create a unique blend of mystery and vibrant enjoyment."
+ },
+ {
+ "video_id": "PI-734e020c-a92b-4ce5-914d-97c97a412773",
+ "path": "/workspace/benchmark/video/PI-734e020c-a92b-4ce5-914d-97c97a412773.mp4",
+ "answer": "The video features a person wearing a black hoodie humorously imitating a Spanish teacher's emphasis on pronouns. The scene begins with the individual pointing to a piece of paper taped to a beige wall, displaying the word \"Rusty\" written in black marker, with a text overlay that reads, \"How the Spanish teacher did it for no reason,\" setting a comedic tone. The person then writes \"ustedes\" on another piece of paper and \"usted\" on a third, highlighting the repetitive nature of formal pronoun usage in Spanish classes. The individual continues to add more papers with \"ustedes\" and \"vosotros,\" each time looking back at the camera with exaggerated expressions, humorously emphasizing the perceived overemphasis on these pronouns. Meanwhile, the audio track features a male voice with a humorous and slightly mocking tone, imitating a Spanish teacher by repeatedly asking for \"usted\" and warning against using informal terms like \"vosotros.\" This achieves a comedic effect by exaggerating the teaching style. The video maintains a consistent theme of playful mimicry, teasing the idea that Spanish teachers might constantly focus on both formal and informal pronouns, with a simple background of beige walls to keep the focus on the humorous action of sticking pronoun papers with tape."
+ },
+ {
+ "video_id": "PI-5b8e3c6f-4cdc-44f5-beec-596c42b6d5a9",
+ "path": "/workspace/benchmark/video/PI-5b8e3c6f-4cdc-44f5-beec-596c42b6d5a9.mp4",
+ "answer": "While the audio and video content seem to explore different themes at first glance, they both evoke strong emotional responses, albeit in distinct ways. The audio features a male and a female speaker engaging in a heartfelt exchange, expressing longing and an emotional connection. The male speaker reflects on the significance of travel and the desire to be remembered, while the female speaker emphasizes spiritual presence and the promise of a reunion despite physical separation, creating a reflective and sentimental atmosphere enhanced by slow-paced ambient music. In contrast, the video begins with a poignant nighttime scene of a gravel driveway, wooden gate, and brick pathway, where a thin stray Xoloitzcuintli (Xolo) dog is discovered sniffing around, coinciding with the passing of the viewer's own dog. The camera captures the Xolo's exploration and its approach to a porch with two bowls, possibly in search of food or water, while a person's feet and a hand petting the dog indicate observation and interaction. The video concludes with the Xolo sitting on the porch, looking at the bowls, possibly waiting for sustenance, and the festive doormat and black coat hanging in the background add to the emotional context. Although the audio delves into human emotions and relationships, the video taps into the emotional bond between humans and animals, particularly in times of loss and unexpected companionship. Together, they offer a rich tapestry of emotional experiences, from the yearning for human connection to the comfort found in unexpected animal companionship."
+ },
+ {
+ "video_id": "PI-c1ad03fe-fcf7-4656-a28c-e9221bfa0aef",
+ "path": "/workspace/benchmark/video/PI-c1ad03fe-fcf7-4656-a28c-e9221bfa0aef.mp4",
+ "answer": "The video captures a delightful and playful scene of a child with blonde hair being styled with colorful Velcro rollers in shades of pink, green, and yellow. The child, dressed in a dark blue dress with vibrant floral embroidery and patterned leggings, is seated in a festive setting adorned with a Christmas tree decorated with white ornaments and a white cabinet. The child moves slightly, showcasing the rollers from various angles, adding a dynamic element to the scene. Later, the child is seen sitting at a table, engrossed in drawing on a large piece of paper with a pen, while the playful rollers remain in their hair, enhancing the humorous and endearing atmosphere. This light-hearted and cheerful moment is complemented by a pop music track in the background, featuring female vocals with a cheerful and upbeat tone. The song's catchy melody and repetitive lyrics create a lively and engaging ambiance, perfectly aligning with the youthful energy and fun depicted in the video. Together, the visual and auditory elements create a cohesive and joyful experience, capturing a moment of creativity and festive spirit."
+ },
+ {
+ "video_id": "PI-320d3f80-5f46-451f-bdaf-7c5dbfc55864",
+ "path": "/workspace/benchmark/video/PI-320d3f80-5f46-451f-bdaf-7c5dbfc55864.mp4",
+ "answer": "The audio, filled with indiscernible noise, evokes a sense of fear and unease among listeners, as the specific content remains unclear. In stark contrast, the video begins with a scene from a video game displayed on a television screen, featuring three armed characters in tactical gear inside a room with glass walls, accompanied by the text, \"Makarov: Remember - no Russian,\" suggesting a tense and potentially violent narrative. However, the scene quickly transitions to a real-life living room, where a woman, dressed in a brown sweater and dark pajama pants adorned with constellation patterns, sits comfortably on a couch, sipping coffee from a white cup. The cozy setting, complete with a cat tree near the window, a wicker chair, and a glass coffee table, contrasts sharply with the initial video game scene. The woman appears calm and relaxed, her demeanor unaffected by the earlier tension, as she continues to watch the television, occasionally glancing around the room. The video concludes with a closer view of the woman, emphasizing her serene state as she enjoys her coffee and the tranquility of her living room. Together, the audio and video create a juxtaposition of fear and calm, with the indiscernible noise of the audio contrasting with the peaceful and relaxed atmosphere of the video."
+ },
+ {
+ "video_id": "PI-f7725055-28ce-4105-b31f-aafbae4afdb2",
+ "path": "/workspace/benchmark/video/PI-f7725055-28ce-4105-b31f-aafbae4afdb2.mp4",
+      "answer": "This short and light-hearted video, shot from a high camera angle, shows the interior of an elevator with wooden plank walls. A white-haired man, wearing an orange shirt, a blue-and-white tie, and dark trousers, is dancing merrily to hip-hop music by himself. Moments later, a female police chief in uniform enters the elevator. After a brief pause, she also joins the man's dance. A sign on the back wall of the elevator reads \"El Paso County Sheriff's Office\". The video captures a spontaneous and happy moment between colleagues as they turn a simple elevator ride into a short dance party."
+ },
+ {
+ "video_id": "PI-8cb472dc-0bf3-4132-aaa2-96e8121e47dd",
+ "path": "/workspace/benchmark/video/PI-8cb472dc-0bf3-4132-aaa2-96e8121e47dd.mp4",
+      "answer": "The video unfolds in a quiet, minimally decorated room with light-colored wooden floors, white carpets, a floor lamp, and a wheeled cart. The atmosphere is calm, with natural light shining in through windows framed by white curtains. The central figure is a person wearing a light-colored shirt and black trousers. At first, they are seen lying face down on the carpet, partially covered by a white blanket, next to an unmade bed. The text overlay, \"When you sneak back to your wife's house and catch her in her actions,\" sets a humorous or unexpected tone. As the scene progresses, the person begins to move, pushing themselves up with their arms and then pulling themselves to a standing position with the help of a small wooden stool. They then sit on the stool, visibly showing surprise or shock, and cover their face with their hands. A grey dress lies on the floor nearby, and a white air purifier stands beside the wall-mounted TV. The person remains seated, occasionally glancing around the room as if overwhelmed by the situation. Throughout the video, the text overlay remains unchanged, reinforcing the theme of unexpectedly catching the spouse in an unspecified behavior. The absence of dialogue or background music keeps the focus on the visual narrative and the person's reactions. The video humorously captures the surprise and immediate aftermath of such an encounter, leaving the exact nature of the \"behavior\" to the audience's imagination."
+ },
+ {
+ "video_id": "PI-30dab627-391f-4158-9988-efda7d37c769",
+ "path": "/workspace/benchmark/video/PI-30dab627-391f-4158-9988-efda7d37c769.mp4",
+      "answer": "The video shows a woman in a light-filled modern room wearing a white knee-length short-sleeved dress with a fitted bodice. The dress flares out like a maxi dress as she gracefully twirls, showing off its flowy texture, and she pairs it with white heels with decorative elements. The minimalist decor of the room includes a bed with the \"WESTWOOD LOS ANGELES\" logo, a tall floor lamp with spherical white lights, and a plant that adds greenery. The woman poses gracefully, adjusting her hair and dress, highlighting the sophistication of the outfit and her poised, confident demeanor. Accompanying the visuals is a song featuring an excited boy's voice, and soothing, cheerful background music enhances the happy and joyful atmosphere of the scene."
+ },
+ {
+ "video_id": "PI-d05780b1-718b-49ae-878c-17917bccaf50",
+ "path": "/workspace/benchmark/video/PI-d05780b1-718b-49ae-878c-17917bccaf50.mp4",
+ "answer": "In the audio, a man's friendly voice introduces various items, each with a specific name, conveying his joy as he presents them. This audio element complements the video, which begins with the same man standing in front of a black car at night, dressed in a beige outfit and sandals. The text \"pov: your dad has a car obsession\" appears on the screen as he gestures and moves around the car, pointing out different features against a backdrop of a brick wall and dark sky. The scene transitions to him in front of a white car, continuing his explanations. He then moves inside a garage to polish the rear of a black vintage car with a \"Felix Chevrolet\" license plate, using a yellow cloth on a red carpet under bright garage lights. The video progresses with the man standing next to a modern silver Mercedes-Benz, pointing at its front while holding a cloth, then moving to a red Jeep to polish its front, and finally wiping down a white Mercedes-Benz. The background reveals a residential area with houses and palm trees. The video concludes with the man next to a silver Rolls-Royce, still pointing and talking, emphasizing his points about the car, with a residential area and string lights in the background, encapsulating his passion for automobiles.\n"
+ },
+ {
+ "video_id": "PI-c1bc8b4e-fd98-47d5-b6a6-eefe940de552",
+ "path": "/workspace/benchmark/video/PI-c1bc8b4e-fd98-47d5-b6a6-eefe940de552.mp4",
+ "answer": "The video features a person standing against a plain, light-colored background, with the text \"My Truth Cards\" prominently displayed above their head. Three pink cards with yellow starburst designs are aligned horizontally below the title. Initially, all cards are face down, but as the video progresses, each card flips to reveal a message: first \"You are Lazy,\" then \"You are Mean,\" and finally \"You need to take a Bath,\" each in bold, pink letters on a white background. The person remains mostly stationary, keeping the focus on the cards and their messages. Meanwhile, the accompanying audio clip features a man speaking in a neutral or uncertain tone, mentioning the \"truth card\" and expressing satisfaction with his situation by saying \"OK, perfect.\" The ambient music has a stable rhythm, creating a neutral or soft atmosphere that complements the video's simple and clean aesthetic. The overall approach is playful yet direct in delivering the messages on the cards."
+ },
+ {
+ "video_id": "PI-1e40cc8e-792a-47cb-96a4-3fdf1c884eea",
+ "path": "/workspace/benchmark/video/PI-1e40cc8e-792a-47cb-96a4-3fdf1c884eea.mp4",
+      "answer": "In the video, a big fluffy cat sits comfortably on a beige sofa in a cozy living room. The cat is positioned in the center of the sofa, with its front paws stretched out in front and its hind paws tucked under its body. The sofa is decorated with various pillows, including a round yellow pillow and a checkered beige pillow, along with some colorful blankets. In the background, a small shelf with decorative items and plants is visible, as well as curtains covering the windows. Throughout the video, the cat remains mostly still, occasionally shifting its gaze and slightly moving its head. The camera angle stays fixed, allowing a clear view of the cat and its surroundings. The audio overlay features a person expressing doubt and curiosity about the cat, saying, \"Guys, sometimes this is too disturbing. It seems that he is not as genuine as Sylvia. What's wrong? Are you comfortable? I... So... You, what are you doing? Did I interrupt anything? What's going on here? Hello.\" This voice adds a humorous and perplexed tone to the scene, emphasizing the cat's calmness and almost unreal composure. Occasionally, the cat blinks, slightly adjusts its posture, and maintains a relaxed and somewhat curious expression. The consistent background and the cat's serene behavior create a peaceful yet slightly amusing atmosphere, highlighting the mundane charm of daily life."
+ },
+ {
+ "video_id": "PI-39fcf2aa-e931-4fd2-9e2d-78459bf86bc2",
+ "path": "/workspace/benchmark/video/PI-39fcf2aa-e931-4fd2-9e2d-78459bf86bc2.mp4",
+ "answer": "The video presents a rustic farm scene centered around a rooster with a striking red comb and wattles, adorned with a mix of black and brown feathers. Positioned initially on a green cloth over muddy ground, the rooster moves off the cloth as the video progresses, standing on the muddy earth with a wire fence and sparse grass in the background. The overcast lighting enhances the rural ambiance. The rooster's minimal movements, including slight shifts in position and head turns, are captured, showcasing its plumage from various angles. After post-production editing, the audio features a male voice with a playful and dramatic tone, repeatedly shouting \"Victoria!\" in an exaggerated and enthusiastic manner. The text \"Victoria!\" appears prominently in yellow across the screen, followed by \"Oooooh Victoria!\" and \"Viiictooriiaaa!\", amplifying the humorous and light-hearted atmosphere. The text \"Farm Jokester\" at the bottom further emphasizes the comedic theme, creating an entertaining effect that complements the visual presentation of farm life."
+ },
+ {
+ "video_id": "PI-21f25f99-386c-4e26-9cce-7c4d5c21bb77",
+ "path": "/workspace/benchmark/video/PI-21f25f99-386c-4e26-9cce-7c4d5c21bb77.mp4",
+ "answer": "The audio features light and cheerful background music, setting a playful and comedic tone, possibly guided by a humorous and mischievous male narrator who invites the audience to imagine the scene through the music, highlighting the humorous reactions to a clown's commands. The video complements this mood with a humorous narrative, starting with a person in a white T-shirt and camouflage pants standing at the door of a carpeted room, which contains a bed with a black blanket and a brown plush toy, and a white wall with an electric light switch. The text \"POV: When a person tells my sister 'Go find your brother'\" appears on the screen, adding a comedic touch. The person in the white T-shirt walks into the room and looks at the camera. The scene then shifts to another person, wearing a black shirt and camouflage pants, standing outside with a hesitant and confused expression, against a backdrop of a window with black bars and some green plants outside, further emphasizing the comedic setting. The video concludes with a close-up of the person in the black shirt, who still appears hesitant and slightly confused, perfectly aligning with the audio's playful and humorous tone. Together, the audio and video create a lighthearted and entertaining experience, blending music and visuals to evoke laughter and amusement."
+ },
+ {
+ "video_id": "PI-8c3da229-5fb7-4f17-9ad2-df3103439dd2",
+ "path": "/workspace/benchmark/video/PI-8c3da229-5fb7-4f17-9ad2-df3103439dd2.mp4",
+ "answer": "In a recent interview, there emerged an interesting discussion about the adaptation of accents in films. The interviewee shared some interesting things about him on the set. He was asked about the pronunciation of the word \"water\" that the character Kelvin he played had to say. When he pronounced it as \"water\", a well-informed and intelligent dialect coach intervened. The coach explained that Aaron - referring to the actor - came from \"round The way\", alluding to a certain dialect. However, the actor emphasized that they would not use this pronunciation because the team particularly wanted Kelvin to say \"water\". After clarification, they never asked him to adjust his pronunciation again. The actor said pensively that perhaps the team was not yet ready for this special version of Scar, which hinted at deeper and possibly unexplored aspects of the character. The video ends in a humorous tone, featuring the roar of a lion and the actor's laughter in the studio, adding a light-hearted touch to the profound narrative."
+ },
+ {
+ "video_id": "PI-d02b1a53-8224-461e-a671-a07f1a50f20b",
+ "path": "/workspace/benchmark/video/PI-d02b1a53-8224-461e-a671-a07f1a50f20b.mp4",
+ "answer": "This video captures a vibrant scene in the baseball field, where a hot dog vendor, wearing a bright neon yellow shirt and a hat, becomes the focus of attention. Standing near the audience seats, the vendors interacted energetically with the crowd. Most of them were wearing light blue jerseys with red and white borders. The background is a bustling stadium, filled with passionate fans. On the field, one can see a baseball player in a blue uniform. The vendor was strapped with a metal container and seemed to be Shouting loudly, as if advertising his hot dogs. The text \"This guy selling hot dogs\" on the video highlights the entertainment value of this vendor. The laughter and reactions of the crowd indicated that the vendor's comical actions were very popular and added a lively atmosphere to the competition. The entire scene encompasses the joy of live sports events and the spirit of community, and even the vendors have contributed to the entertainment."
+ },
+ {
+ "video_id": "PI-57fae5fe-6601-40cb-819b-a4e46d1f4f2a",
+ "path": "/workspace/benchmark/video/PI-57fae5fe-6601-40cb-819b-a4e46d1f4f2a.mp4",
+ "answer": "The video humorously depicts a man preparing for the \"2026 Cave Diving Olympics\" through a series of movements surrounding cracks in concrete. The video begins with a close-up of the crack and the shadow of a hand holding a fork, accompanied by the text \"Training for the 2026 Cave Diving Olympics\". The man examines and expands the crack, symbolizing his training effort. The video shows the man standing on a concrete surface, wearing knee pads and sneakers, looking ready and focused. He concludes with a close-up of the frayed sneaker soles, underscoring his humorous wit. The track is a background music, the music is sung by a male, and the content is passionate. Combined with the video content, it shows a funny video with humor as the main story."
+ },
+ {
+ "video_id": "PI-4e882321-72e4-4797-8b87-0fffbf9f3151",
+ "path": "/workspace/benchmark/video/PI-4e882321-72e4-4797-8b87-0fffbf9f3151.mp4",
+ "answer": "The video commences with a person in a white sleeveless top, covering their mouth in a gesture of shock or disbelief, as the text overlay reveals, \"When a homeless guy in front of a grocery store offered you his dog and you spontaneously said yes.\" This scene perfectly mirrors the audio's opening, where a woman's voice initially expresses distress or regret over an unexpected and emotionally charged situation involving a homeless person offering a dog, leading to a spontaneous and seemingly regretful decision. The person's body language and facial expression in the video convey a state of disbelief, while the background, featuring a neutral-colored wall and ceiling, sets an indoor scene. As the video transitions to a close-up of a small black and white dog sitting calmly on a chair, with its front paws resting on the edge and a desk with a computer monitor, keyboard, and green soda can visible behind, the focus shifts from the person's initial reaction to the dog that was unexpectedly offered. Meanwhile, the audio captures the woman's emotional journey from initial regret to eventual laughter, highlighting the complexity of human reactions to unexpected kindness, a theme that subtly resonates with the video's portrayal of a sudden change in circumstances and the subsequent, perhaps humorous, aftermath."
+ },
+ {
+ "video_id": "PI-351896d3-c4a3-4f21-99e2-9809dde857cd",
+ "path": "/workspace/benchmark/video/PI-351896d3-c4a3-4f21-99e2-9809dde857cd.mp4",
+ "answer": "The video features two individuals in dark hoodies standing in front of a white door, set against an indoor backdrop adorned with a string of lights along the door frame, creating a warm and inviting ambiance. The video showcases a series of creative and intricate handshakes between the two, beginning with a sequence of claps and finger snaps that highlight their coordination and rhythm. This is followed by a synchronized wrist spin, a finger lock that demonstrates their dexterity, and culminates in a complex \"Jutsu master\" handshake inspired by martial arts, adding a dynamic and playful element to their interaction. Throughout the video, the consistent background of the white door and string lights provides a simple yet effective setting for their performance. Accompanying the visual content is a background music (BGM) track that is purely instrumental with a modern and upbeat tone, enhancing the engaging and dynamic atmosphere of the scene through its rhythmic and melodic elements. The combination of the creative handshakes and the lively background music effectively conveys the camaraderie and energy between the two individuals."
+ },
+ {
+ "video_id": "PI-e549e625-5b03-4bb2-a52d-7bcd348b01ac",
+ "path": "/workspace/benchmark/video/PI-e549e625-5b03-4bb2-a52d-7bcd348b01ac.mp4",
+ "answer": "The video presents a sequence of domestic scenes, complemented by subtle audio cues. It begins in a beige-carpeted hallway where a gray cat sits near the wall, facing away from the camera. A small white and brown dog is seen running towards a pink tunnel toy at the hallway's end. The scene then shifts to a living room featuring a light-gray sofa. On the sofa, a white cat lies on the backrest, overseeing a stuffed toy with a red and green pattern and a light-blue blanket, with the same small dog exploring these items. After a brief period, the dog jumps off the sofa and runs back towards the hallway, with the camera following its movement. The gray cat remains in its spot as the dog dashes past it towards the pink tunnel toy. The video concludes by returning to the living-room sofa, where the small dog is now partially under the light-blue blanket, while the white cat stays on the sofa's backrest. Throughout the video, faint sounds resembling running footsteps can be heard, suggesting movement and activity within the scene."
+ },
+ {
+ "video_id": "PI-87ccbaee-93b9-48e0-bf1b-7a9b02d6b94c",
+ "path": "/workspace/benchmark/video/PI-87ccbaee-93b9-48e0-bf1b-7a9b02d6b94c.mp4",
+ "answer": "In the video, an old woman wearing a soft pink robe is speaking directly to the camera in a dimly lit room. Her light-colored hair is neatly combed at the back of her head, and she wears a ring on her left hand. Occasionally, she uses gestures to emphasize her point of view. The background remains largely dark, with some blurry objects faintly visible, creating an intimate and concentrated atmosphere. Throughout the video, the lady shared a sincere message about the brevity of life. She began to say, \"Can I tell you something that I want you to remember?\" Then it continues to emphasize that \"People come and go and the situation is changeable.\" She reminded the audience that most things are temporary and their current situation or wishes are not their ultimate destination. Her tone was gentle yet firm, giving people a sense of wisdom and reassurance. When she speaks, her expressions and gestures reflect her sincere desire to impart this wisdom. She paused occasionally to make her words resonate with the audience. At the end of the video, she said, \"Have some.\" This is... And me... Left. \"Thank you,\" leaving the audience with a comforting and reflective message about the impermanence of life's challenges and the importance of perspective."
+ },
+ {
+ "video_id": "PI-ec337199-c707-4cac-a179-18f6834e33b0",
+ "path": "/workspace/benchmark/video/PI-ec337199-c707-4cac-a179-18f6834e33b0.mp4",
+ "answer": "The video shows two bird-like cartoon characters in the forest, accompanied by an audio track of a man speaking in a neutral tone, possibly looking for someone. The video begins with one character hanging upside down from a tree branch while another character stands on the ground next to the tent. The standing character with the big yellow eyes reacts to the hanging character, and after a brief exchange, the standing character throws in the towel. He then discusses playing hide-and-seek in the woods, but the standing character expresses concern, saying, \"susans gonna kill me\" The standing character appears alone and looks worried. The scene cuts to both characters sitting in a green car with the standing character driving. The driver asked, \"what do you mean honey our daughters always looked like that\" The branch next to the character, was fitted with eyes, and kept silent. The video ends with the text \"daughterless sobbing\" as the car drives away, while the audio content is neutral and may involve someone being sought, complementing the visual narrative without background music."
+ },
+ {
+ "video_id": "PI-a0de8f20-5330-48c7-b755-ec825a59e116",
+ "path": "/workspace/benchmark/video/PI-a0de8f20-5330-48c7-b755-ec825a59e116.mp4",
+ "answer": "The video unfolds in a festive room adorned with pink drapes and a 'WE STICK TOGETHER' banner, where four women in pink outfits toast with pink glasses, radiating joy. Their cheerful voices, as described in the audio, fill the air until three energetic men in blue and white football jerseys burst in, dancing and causing surprise and amusement. The audio captures the initial cheerfulness turning to shock as one man accidentally shatters a pink chandelier. The women's expressions shift dramatically, and the man in blue looks distressed. A close-up shows a concerned woman in a pink off-shoulder dress, reflecting the sudden tension. The men, now with black markings, worry over a heart-shaped cake. The audio's tone shifts to sadness and resistance as the woman in the sparkly dress smashes the cake into a man's face, embodying the conflict. The video concludes with her holding a blue emblem, the man in distress, and a concerned principal in the background, capturing the transition from celebration to conflict."
+ },
+ {
+ "video_id": "PI-2c168670-274e-4a4d-93da-25c5750f9bb8",
+ "path": "/workspace/benchmark/video/PI-2c168670-274e-4a4d-93da-25c5750f9bb8.mp4",
+ "answer": "The video features a person in a red hoodie with a blurred face, set against a simple indoor backdrop that includes a small red figurine on a shelf. Below this, an animated scene unfolds, showing SpongeBob SquarePants looking distressed as he holds a pickaxe next to a coin with a face, while a pink character stands ahead on the road. The pink character attempts to hit the coin with the pickaxe, causing the coin's expressions to change and a yellow warning sign to appear. Eventually, an explosion effect with 'POW!' and 'BIF!' text occurs, and SpongeBob falls, accompanied by the text 'You Lose!' The video underscores the difficulty of the challenge, emphasized by the repeated text 'Why is this so hard' and the increasing counter from 1/120 to 68/120. Meanwhile, the audio is a song sung by a child, full of pleasure and relaxation, which serves as the background music, creating a contrast with the on-screen struggle and enhancing the overall lighthearted yet challenging atmosphere of the scene."
+ },
+ {
+ "video_id": "PI-3fa5b346-6ca9-416c-8133-2532b21176a2",
+ "path": "/workspace/benchmark/video/PI-3fa5b346-6ca9-416c-8133-2532b21176a2.mp4",
+ "answer": "The video opens with a tranquil scene of a baby lying peacefully on a wooden deck, wrapped in a camouflage-patterned blanket and covered with a light blue cloth, while a worried face emoji and the text overlay \"Almost had to go up against a momma deer today\" hint at a potential encounter. As the camera pans, a deer is revealed standing in the grassy area beyond the deck, its calm demeanor contrasting with the text's sense of impending interaction. Meanwhile, the audio captures a woman's gentle and slightly amused voice, humorously suggesting that the deer might mistake the baby for its own fawn, adding a layer of whimsy to the scene. The deer walks closer and then further away, maintaining a peaceful coexistence with the human world, while the lush, green environment with trees and bushes provides a serene backdrop. The video emphasizes the gentle curiosity and quiet observation between the human and animal, highlighting a tender moment of connection with nature, perfectly complemented by the woman's humorous commentary in the audio."
+ },
+ {
+ "video_id": "PI-b3ec1904-f621-4298-9ab6-8121920cfe35",
+ "path": "/workspace/benchmark/video/PI-b3ec1904-f621-4298-9ab6-8121920cfe35.mp4",
+ "answer": "This video captures a heartwarming scene in a cozy living room decorated for Christmas. The central figure of the photo is a child wearing holiday pajamas printed with festion-themed patterns. The child stood on the ground, holding a jar and interacting with the crumpled wrapping paper on the small table. An adult in a red shirt and a green velvet coat stood beside, gently observing the child. The background is a beautiful Christmas tree, surrounded by wrapped gifts, enhancing the festive atmosphere. There is a sofa covered with blankets and pillows in the room, as well as all kinds of wall decorations. The child looked very excited and happy. When holding the jar, he/she was very active and even raised his/her arm happily. Occasionally, adults bend down to interact with children or adjust items on the floor, showing a caring and involved attitude. The scene was filled with laughter and the joy of the festival, capturing the precious moments of family interaction during the Christmas celebration. This video perfectly summarizes the joy and warmth of the festival, highlighting the bond between children and adults."
+ },
+ {
+ "video_id": "PI-9d52afc4-947f-433f-abc4-d9782a3acf4f",
+ "path": "/workspace/benchmark/video/PI-9d52afc4-947f-433f-abc4-d9782a3acf4f.mp4",
+ "answer": "This video captures a vibrant scene of a formal outdoor event, possibly a red carpet event, with lush green plants and a building with a balcony in the background. The main focus is on a woman with curly long golden hair and wearing a dark ruffled robe. When she participates in the conversation and moves around during the activity, she is very conspicuous. There were other participants around her, including a man in a black tailcoat who was concentrating on playing with his mobile phone, and another person holding a blue microphone with \"SK\" written on it. The atmosphere is very lively and people are communicating together. The audio shows a brief interview clip in which a lady was asked, \"Jennifer, as a woman, as a 20-year-old woman, what excites you the most?\" This question seemed to elicit a deliberate or enthusiastic response from her, although her exact answer was not fully reflected in the provided text. The presence of the microphone indicates that she is being interviewed, adding a layer of media interaction to the event. The core theme of this video revolves around the excitement and dynamism of participating in a highly anticipated event, highlighting the interaction and involvement of the attendees, with a particular focus on the perspective of women as women in such occasions. The combination of visual and auditory elements creates a typical snapshot of the charm and social interaction of a red carpet event."
+ },
+ {
+ "video_id": "PI-1eb689cb-cd95-4d24-9eca-d7b0334a1818",
+ "path": "/workspace/benchmark/video/PI-1eb689cb-cd95-4d24-9eca-d7b0334a1818.mp4",
+ "answer": "In the video, a family of six adults and one child gathers outdoors in a park, surrounded by trees and fallen leaves, signaling the arrival of autumn, as they prepare to take a family photo. The atmosphere is relaxed and pleasant, with family members dressed in casual and warm clothes, including sweaters and jeans, suitable for the cool weather. The person in the middle, wearing a beige long cardigan, seems poised to share a significant announcement, as indicated by the text at the top of the video box: \"Tell the family in the 'Family Photo' that I'm pregnant (failed)\", suggesting that the pregnancy statement didn't go as smoothly as planned. However, in the audio clip, a woman's cheerful and celebratory voice enthusiastically tells her family \"I'm pregnant\" and repeats the phrase excitedly, expressing her love and desire to capture the moment in a photo. Although the video captures the candid and heartfelt interactions among family members, highlighting their close relationship and shared excitement, it seems there was an initial misstep in the actual announcement, which contrasts with the joyous tone of the audio. Nonetheless, both the audio and video together paint a picture of a family looking forward to new beginnings and the joy that comes with them."
+ },
+ {
+ "video_id": "PI-ecc2aa2e-ff64-418f-bb01-b7f3c2c9ef12",
+ "path": "/workspace/benchmark/video/PI-ecc2aa2e-ff64-418f-bb01-b7f3c2c9ef12.mp4",
+ "answer": "The video is shot from a first-person perspective in a hotel corridor. A man in a black jacket and dark pants approaches another man wearing a gray-green button-up shirt and black pants.The audio captures a conversation between two people discussing a choice between receiving one dollar and a mystery gift. The man chooses the mystery gift, unveils it to find one dollar inside, and ultimately appears delighted.The man holding a gift and a one-dollar bill, asking another man to choose one. The second man initially reaches for the dollar but retracts his hand, opting instead for the gift box. He opens it to reveal a small box inside. Upon opening the smaller box, he finds another one-dollar bill. Finally, holding both the dollar and the gift box, he gives an amused thumbs-up and feels very happy."
+ },
+ {
+ "video_id": "PI-ed693799-3f52-4849-a26b-bfeed4eda6a5",
+ "path": "/workspace/benchmark/video/PI-ed693799-3f52-4849-a26b-bfeed4eda6a5.mp4",
+ "answer": "The video captures a vibrant scene at a “Cobra Kai” after-party held in a brightly lit indoor venue. The main focus is on a man labeled “Kreese” on screen, dressed in a dark suit, dancing energetically with dynamic movements including deep squats and varied arm gestures, surrounded by numerous attendees. The atmosphere is lively, with vivid red and orange lights illuminating the wooden floor and surrounding area. Prominently displayed above is a screen featuring the “Netflix” and “Cobra Kai” logos, highlighting the event’s theme. The video is accompanied by upbeat dance music and cheers, with occasional applause audible. Other participants in the background are either dancing or recording the scene on their phones. Towards the end, a brief shot flips directly to a young man, indicating he is the videographer. The combination of visual and audio elements conveys a joyful atmosphere of celebration and camaraderie among the attendees.\n"
+ },
+ {
+ "video_id": "PI-1e29c55c-998f-495e-b0f9-74dad98f02d3",
+ "path": "/workspace/benchmark/video/PI-1e29c55c-998f-495e-b0f9-74dad98f02d3.mp4",
+ "answer": "The video was shot on a street. There were few people on the street, but there were many cars on it. There were all kinds of buildings and lights on both sides of the street. Not far away, one can still see traffic lights flashing. The camera turns to a beetle that has been crawling on the ground all along. Accompanied by cheerful music, this beetle keeps crawling forward, looking very relaxed, symbolizing a positive and sunny attitude towards life. There is no obvious text in the video, but from the background music, it can be heard that the overall tone is relaxed and cheerful, indicating that the author is also in a pleasant mood when shooting the video."
+ },
+ {
+ "video_id": "PI-8568c80a-e149-49f3-a7b3-327722b4ac90",
+ "path": "/workspace/benchmark/video/PI-8568c80a-e149-49f3-a7b3-327722b4ac90.mp4",
+ "answer": "The video features two women in a modern indoor setting with glass railings and a staircase in the background, engaging in a friendly and casual conversation. One woman, wearing a light pink sleeveless top with a heart-shaped cutout near the chest, has long, dark hair and is adorned with multiple rings and bracelets. The other woman, dressed in a black sleeveless top and black pants with heart-shaped cutouts, has long, light brown hair and is also wearing rings and bracelets. A text overlay appears, reading 'Reply to _H22208Y's comment: Piper is joining the bop house???' The woman in the pink top gestures with her hand, showing her rings, while the woman in the black top has her arm around her shoulder. They discuss the comment, with the woman in the pink top looking at the camera and the woman in the black top looking at her, both smiling and engaging in light-hearted conversation. The interaction continues with the woman in the pink top facing the camera more directly, speaking and gesturing with her hand, while the woman in the black top stands slightly behind her, resting her hand near her face and listening attentively. The overall atmosphere remains friendly and casual, with both women enjoying the conversation. The audio captures their mild surprise and amusement at the comment about Piper joining the 'Bobbhouse,' reflecting a lighthearted and informal tone."
+ },
+ {
+ "video_id": "PI-28a5d5e1-40b0-4def-887d-3a5cb426a762",
+ "path": "/workspace/benchmark/video/PI-28a5d5e1-40b0-4def-887d-3a5cb426a762.mp4",
+ "answer": "The audio is a heartfelt personal diary entry featuring a woman's warm and excited voice, sharing an emotional and joyful narrative about taking a dog named Asia to meet a potential adoptive family. She recounts Asia's initial nervousness, the process of the meeting, and the ultimate adoption, expressing happiness and relief that Asia, after spending 495 days in the shelter, finally found a loving home and even received a little dress as a gift. The video complements this narrative by visually documenting Asia's journey, showing two people dressed in casual attire traveling with her in a vehicle, setting off with the goal of finding her a home. Upon arrival at the shelter, Asia, wearing an \"ADOPT ME\" harness, is gently guided inside, where she is shown calm and relaxed. The video then depicts Asia in a new environment, sitting on a leash near blue-lit steps, and later walking energetically in a colorful striped shirt, symbolizing her move to a new location. The video concludes with Asia sitting excitedly in a car, being petted affectionately, a touching moment that symbolizes her successful adoption and the start of a loving new chapter in her life. Together, the audio and video create a compelling and heartwarming story of hope, joy, and fulfillment, highlighting the successful adoption and the bright future that awaits Asia with her new family."
+ },
+ {
+ "video_id": "PI-a6fd6705-888f-48ea-af78-768a5b2f60f5",
+ "path": "/workspace/benchmark/video/PI-a6fd6705-888f-48ea-af78-768a5b2f60f5.mp4",
+ "answer": "The video unfolds in a vibrant, purple-lit room with multiple doorways, immediately setting a dynamic and somewhat mysterious tone. The text '2 Players!' appears, signaling a multiplayer game environment. Two characters are introduced: one with long black hair and another sporting a leopard-print hat. The latter moves towards a doorway, with the former in pursuit. They soon encounter a third character, distinguished by brown hair and a brown outfit. The trio converges in a corner, with the leopard-hatted character at the center. Tension escalates as the brown-haired character grabs the leopard-hatted one, seemingly restraining them, while the long-haired character lies on the floor in the background. The scene culminates with the leopard-hatted character standing alone near a doorway, accompanied by the text 'I'm safe,' conveying relief. Overlaying this visual narrative, a woman's voice with an ambiguous emotional tone delivers fragmented English phrases, hinting at an invitation or encouragement to join, though the message remains unclear. The ambient music, with its rhythmic beats, adds a layer of neutrality or uncertainty, enhancing the overall enigmatic atmosphere of the scene."
+ },
+ {
+ "video_id": "PI-f0c8a135-9a04-48ad-aa07-20e9eb71aba1",
+ "path": "/workspace/benchmark/video/PI-f0c8a135-9a04-48ad-aa07-20e9eb71aba1.mp4",
+ "answer": "This video captures an interaction between a young man and a store employee within a brightly lit retail environment. The employee, visibly frustrated, confronts the man about loud screaming noises emanating from his vicinity. The man repeatedly denies being the source of the screams, asserting that he has lost his voice from attending a concert. Despite his claims, further loud screams are heard, seemingly originating from him, leading to increased exasperation from the employee. She warns him that if the screaming continues, he will be asked to leave and that she may involve the police. The visual setting is a well-stocked craft or home goods store, featuring numerous aisles with various merchandise. The camera closely follows the man as he engages in this exchange, providing a clear view of both individuals and the surrounding store interior."
+ },
+ {
+ "video_id": "PI-7340580b-0d60-4575-940d-d3709e8b81eb",
+ "path": "/workspace/benchmark/video/PI-7340580b-0d60-4575-940d-d3709e8b81eb.mp4",
+ "answer": "This video took place at raise Cane's fried Chicken Stick restaurant. The staff were wearing red uniforms and white hats with the raise Cane logo. At the beginning, an employee named Jana Craig was standing behind the counter. The scene transitions to a close-up of the promotional image of the \"set meal\", which includes 4 chicken fillets, 1 potato chip, 1 sugarcane jam, 1 Texas toast and 22 ounces of fountain drink/tea, surrounded by the words \"Our favorite\" and \"ONE LOVE\", and then the video returns to Jana. Another employee, Kenny Rodriguez, joined Jana behind the counter and they had a conversation. It can be heard in the audio: \"OK, baby, it's time to make a decision.\" Will you cover up with the favorite combination of your fans? Are you ready to look back? \" This video presents an interesting and persuasive conversation about choosing the \"box combo\" meal, emphasizing loyalty and the popularity of the combo. The video ends with the logo of raise Cane on a red background, strengthening the brand's identity."
+ },
+ {
+ "video_id": "PI-b5428b2f-3b04-4eb6-99a7-48b3e255dfaa",
+ "path": "/workspace/benchmark/video/PI-b5428b2f-3b04-4eb6-99a7-48b3e255dfaa.mp4",
+ "answer": "The video unfolds in a muddy outdoor environment, with small twigs and leaves scattered around, creating a natural and simple background. The main character in the scene is a person, and his hand can be seen throughout the video. The core action begins like this: A person holds a crushed canned red Dr. Pepper and pours the remaining liquid onto the muddy ground, creating a small puddle. After throwing away the empty cans, this person stirred the turbid water with his fingers, creating ripples and foam. Then, in this scene, a yellow plastic bottle with a green lid appeared and was placed beside the red jar. This person picked up the bottle, showed it to the camera, and then continued to interact with the murky water. The characteristic of audio overlay is a repetitive phrase, \"What test some catfish are coming\", suggesting anticipation or encouragement for the appearance of catfish, which may indicate fishing or bait activities. Then, this person picked up a black, flexible tool with a coiled rope on it, showing its length and flexibility. This might be used for detection or fishing. The video ends with a hand-held tool, a yellow bottle and a red can lying on the ground, emphasizing the continuous interaction with the muddy environment. The core theme revolves around preparing or displaying tools and substances in the natural environment, possibly for the purpose of attracting or catching catfish."
+ },
+ {
+ "video_id": "PI-74cf2e37-20a0-4b4e-b7a3-43730e901714",
+ "path": "/workspace/benchmark/video/PI-74cf2e37-20a0-4b4e-b7a3-43730e901714.mp4",
+ "answer": "At the beginning of the video, a woman in a shiny purple dress is standing on the stage, holding a microphone and speaking to the audience. There is a large sign in the background with partially visible \"GLOBES\" on it, indicating that this might be an award ceremony. The screen was covered with a message, \"Please don't involve God/Jesus in your jokes,\" which indicated that respectful language was required during the event. Then, the scene switches to a man in a light blue fine-striped suit and a red tie, who is standing in front of the city background at dusk. A text overlay reads: \"Emergency: A new bushfire suddenly breaks out near the Hollywood Sign\", indicating an emergency news update. The subsequent video shows that there is a big fire burning on the hillside and the words \"N Curson Ave\" can be seen, which can determine the location of the fire. The man in the suit continued to speak, and the words \"Bushfires broke out in Hollywood Hills\" appeared on the screen, providing further background for this breaking news. This video documents a shift in emergency news coverage from an award ceremony to a bushfire near the Hollywood sign. Then the story transitions to a black screen, on which is written \"One day later...\" The time jump is indicated by bold yellow letters against the background of the colorful pattern. The words \"Bush fire broke out in Hollywood Hills\" were prominently displayed on the screen, emphasizing the ongoing emergency. The man seemed to be doing news reporting and gesturing to emphasize the urgency of the situation. The video continues to focus on bushfires, emphasizing their severity and impact on the Hollywood Hills area."
+ },
+ {
+ "video_id": "PI-651feaaa-d800-42b5-8dc6-5ae62cd99413",
+ "path": "/workspace/benchmark/video/PI-651feaaa-d800-42b5-8dc6-5ae62cd99413.mp4",
+ "answer": "In the video, a person is seen standing in a room adorned with a black shelving unit filled with books, shoes, and a colorful poster on the wall, setting a vibrant and creative atmosphere. They are initially holding a beige rice hat and dressed in a white jacket, black pants with white 'X' designs, and black and white sneakers, with the text \"Styling a Rice Hat With a Shiesty\" appearing on the screen to introduce the theme. The person, with a friendly and enthusiastic tone as heard in the audio, begins to demonstrate various styling techniques. They remove the jacket to reveal a black shirt underneath, then hold up a pair of white jeans labeled \"white,\" followed by black boots described as \"strapped.\" Next, they display a black sleeveless top, referred to as a \"beater,\" and flex their arms to showcase it. The video progresses with the person explaining a \"pants trick,\" creatively transforming the white jeans into a top by tying them around their waist. They also mention using an old boxing wrap, which they proceed to wrap around their hands, adopting a playful stance and mimicking a fighter's pose, humorously referencing \"Lebron James.\" The person continues to style themselves, putting on the rice hat and adjusting it while making playful gestures. The video concludes with the person fully dressed in the new outfit, including the rice hat, and making a final lighthearted gesture towards the camera, with the text \"chuckles\" appearing on the screen. The audio complements the visual content by emphasizing creative fashion styling, encouraging the audience to experiment with accessories like rice hats to enhance their daily image, and highlighting the joy and playfulness in fashion experimentation."
+ },
+ {
+ "video_id": "PI-1e67757b-fb62-46ba-b9fe-a20764bd84e7",
+ "path": "/workspace/benchmark/video/PI-1e67757b-fb62-46ba-b9fe-a20764bd84e7.mp4",
+ "answer": "In the video, a person dressed in a black hoodie and hat sits comfortably indoors, surrounded by a plant and a shelf, directly addressing the camera to seek movie recommendations. They express a deep longing for thrillers, mysteries, or horror movies with plots so captivating that they remain vividly etched in the viewer's mind weeks after watching. This sentiment is mirrored in the audio, where a man's voice articulates the same desire for unforgettable films that keep him thinking long after the credits roll, even if it means staying awake at night. Both the visual and auditory elements emphasize the quest for a movie with a shocking and attractive plot, one that is both suspenseful and thought-provoking. The individual in the video leans back in their seat, creating a casual and relaxed atmosphere, as they invite suggestions from the audience, hoping to find that one film that will leave an indelible mark on their memory."
+ },
+ {
+ "video_id": "PI-14ffb970-e1a1-4fcb-82da-632cfebed1e8",
+ "path": "/workspace/benchmark/video/PI-14ffb970-e1a1-4fcb-82da-632cfebed1e8.mp4",
+ "answer": "In the video, there are three people standing under the porch of a suburban community, where there are bare trees and houses, suggesting that it might be winter. The person in the front is wearing a black shirt and a brown hat, while the person in the middle is wearing a white off-the-shoulder sweater and has long curly hair. The person in the background is wearing a green sweatshirt with white words and graphics on it, including the words \"GRAND\" and \"MUSEUM\", as well as an image of a red car. This conversation is light-hearted and humorous. The person sitting at the front mentioned, \"Do you know that I almost got divorced at Christmas?\" Yes, yes. The person in the middle went on to say, \"I know that's not my wife, but you should also be careful.\" This is a joking warning. The person in the front row replied, \"I don't know what I'm thinking.\" Then he said again, \"What on earth is going on?\" At the end of the video, the person in the foreground said, \"This is so strange. I don't want to look at the skin. Turn it off.\" This adds the last touch of humor and confusion to the interaction. The entire atmosphere was casual and friendly, and everyone enjoyed each other's company very much."
+ },
+ {
+ "video_id": "PI-599b8f62-050f-439d-bcf9-b6eaffe196ea",
+ "path": "/workspace/benchmark/video/PI-599b8f62-050f-439d-bcf9-b6eaffe196ea.mp4",
+ "answer": "The video features a person wearing a gray hoodie with white drawstrings and a small logo on the left side of the chest, set against a backdrop of a beige wall and a partially open door revealing some clothes hanging inside. The person appears to be speaking or expressing something, as indicated by their hand gestures and body language. The text 'i can't stop reposting' is prominently displayed in the middle of the screen, suggesting a theme of obsession or compulsion with reposting content. This theme is further emphasized by a yellow circular icon with two arrows forming a loop and the word 'Repost' below it, highlighting the act of reposting. The video returns to the person in the gray hoodie, who continues to express themselves, with the same text 'i can't stop reposting' reappearing on the screen. The person places their hands on their head, possibly indicating frustration or exasperation. Meanwhile, the audio is sung in an energetic and somewhat obsessed tone, with repetitive melodies and catchy tunes that match the visual theme. The singer expresses an inability to stop forwarding certain content, indicating a sense of compulsion or addiction. The video concludes with multiple instances of the text 'You reposted' appearing as overlays with a thorn emoji, mimicking repost notifications, further highlighting the theme of compulsive reposting."
+ },
+ {
+ "video_id": "PI-fb6bdf82-12fb-4272-a8f9-e87df8a3753a",
+ "path": "/workspace/benchmark/video/PI-fb6bdf82-12fb-4272-a8f9-e87df8a3753a.mp4",
+ "answer": "The video captures an outdoor gathering where a group of adults are playing a lively game of 'water kickball.' The setup includes a large grassy field, a slip 'n slide made of black plastic sheeting, and several inflatable blue wading pools serving as bases. In the background, there's a large inflatable bounce house or obstacle course, some trees, a house, and a few tents. A man kicks a green ball, and then he and a woman run along the slippery plastic, navigating the water-filled pools. The atmosphere is energetic and fun, with people cheering and participating. One man successfully slides through the plastic and splashes into one of the blue pools. The camera is handheld and moves to follow the action."
+ },
+ {
+ "video_id": "PI-58511a78-788f-4029-b92c-abde228bef81",
+ "path": "/workspace/benchmark/video/PI-58511a78-788f-4029-b92c-abde228bef81.mp4",
+ "answer": "The video begins with a person sitting in a car, holding an open jewelry box that contains a ring and a necklace, both featuring a heart-shaped red gemstone surrounded by smaller clear stones, with a dagger design incorporated into the setting. The person, dressed in a white long-sleeve shirt and blue jeans, carefully examines the jewelry before trying on the ring. The scene transitions to a close-up of the person's hand as they adjust the ring, ensuring it fits snugly and securely. The ring's intricate design and the sparkle of the gemstone are highlighted, emphasizing its elegance and craftsmanship. Meanwhile, the audio features a soft and warm voice narrating about a daughter named Aurora, who was named after the dawn because she brings sunshine to her parents' lives. The heartfelt narration tells of the joy and light that Aurora has brought to her family, adding an emotional layer to the visual display of the jewelry. The video wraps up with a steady focus on the ring, showcasing its design and the way it catches the light, while the audio evokes a sense of warmth and affection."
+ },
+ {
+ "video_id": "PI-39f77dbf-5909-4770-95a5-bedc3106ce66",
+ "path": "/workspace/benchmark/video/PI-39f77dbf-5909-4770-95a5-bedc3106ce66.mp4",
+ "answer": "This video is set against the backdrop of a scene that seems like a zoo or a wildlife reserve, presenting a serene and warm setting. The main focus is on a large kapibala, known for its gentle manner. This Kapibala was sitting calmly beside a water body, with a smaller animal, possibly a little Kapibala, perching on its back. The background is decorated with a mural depicting a peaceful landscape with mountains and green plants, enhancing the serene atmosphere. Kapibala remained basically still, blinking occasionally and slightly moving its head, while Kapibala seemed to be combing or lying comfortably on its back. The fence is decorated with some leaves and branches, providing a natural environment for the animals. Throughout the video, the interaction between Kapibala is the central theme, highlighting the unexpected moments of friendship between species. Kapibala's calm and gentle nature has created a touching and harmonious scene. This video captures a brief moment in the lives of these animals, allowing the audience to get a glimpse of how they coexist peacefully. The overall mood of the video is soothing and warm, emphasizing the beauty of the relationships among species in a controlled natural environment."
+ },
+ {
+ "video_id": "PI-4de20ca0-062a-44ab-9c47-4c5e572af4d9",
+ "path": "/workspace/benchmark/video/PI-4de20ca0-062a-44ab-9c47-4c5e572af4d9.mp4",
+ "answer": "The video features a woman sitting at a table in a simple, neutral-colored room, wearing a black spaghetti strap top with her hair tied back. The narrative unfolds through a series of transitions marked by text indicating different periods in her life, starting with '1 year younger', followed by '2 years younger', and '3+ years younger'. During these transitions, she is seen holding a white cylindrical object and engaging with a piece of paper, maintaining a consistent posture and expression that suggests a reflective journey through her past. The background music complements this reflective mood with a cheerful, rhythmic tone, featuring faint, repetitive phrases like 'approved' and 'excellent argument approved' that blend into the music, creating a subtle, inspiring atmosphere. As the video progresses, the text 'We're both grown' appears, signaling a return to the present and emphasizing themes of growth and personal development. The woman's continued interaction with the paper and the cylindrical object underscores a contemplative reflection on the passage of time and the changes she has experienced, wrapping up the video with a sense of continuity and self-awareness."
+ },
+ {
+ "video_id": "PI-8e2edbc8-b72d-41f8-827a-eadb10dd7c1f",
+ "path": "/workspace/benchmark/video/PI-8e2edbc8-b72d-41f8-827a-eadb10dd7c1f.mp4",
+ "answer": "The video captures a lively scene in a hallway with blue walls and a white door, where a person in a vibrant tie-dye dress featuring red, yellow, and blue hues, along with a leopard print bonnet, dances energetically. Their movements are dynamic, involving raised arms, side-to-side sways, and expressive gestures, conveying a joyful or celebratory mood under bright lighting that casts distinct shadows. Meanwhile, the audio features a male speaker delivering a rhythmic and casual statement with an indeterminate emotional tone, mentioning phrases like 'Big GL O in a GL' and repeating '<|withitn|>', suggesting a cryptic or stylized message in a conversational or performative context. The ambient music, with its steady tempo, complements the scene by creating a neutral or subdued atmosphere, blending the energetic visuals with a laid-back auditory experience."
+ },
+ {
+ "video_id": "PI-b72dd3e2-de4c-4d24-a49c-fa8b3081a322",
+ "path": "/workspace/benchmark/video/PI-b72dd3e2-de4c-4d24-a49c-fa8b3081a322.mp4",
+ "answer": "The video features a woman seated at a round table draped in deep red velvet, creating an elegant and dramatic ambiance, complemented by a background of two arched doorways covered with matching deep red curtains. She is dressed in a black lace top and has long dark hair. On the table in front of her lies a black box adorned with a red design on the lid, alongside several cards featuring the same red design. The woman picks up one of the cards, displays it to the camera, and proceeds to sign it with a black marker. The camera zooms in for a close-up of her hands, revealing a detailed image of a person with red hair on the card. As she continues to sign multiple cards, the scene is underscored by a BGM track featuring a female voice with an ambiguous emotional tone. The voice repeatedly asks, \"What do you mean by your tears?\" conveying a sense of confusion and a quest for understanding the meaning behind someone's tears. The video concludes with a close-up of the signed cards laid out on the table, showcasing her handiwork, while the audio highlights a moment of emotional inquiry."
+ },
+ {
+ "video_id": "PI-cf844add-3c9b-41f8-b3bc-d62127a392ef",
+ "path": "/workspace/benchmark/video/PI-cf844add-3c9b-41f8-b3bc-d62127a392ef.mp4",
+ "answer": "The video presents a man in a black sleeveless top and black shorts, situated in a room characterized by a white wall and a mounted television, with a guitar subtly leaning against the wall in the background. The man's initial gesture of covering his mouth with his hand suggests a moment of surprise or deep contemplation. As the video unfolds, he moves his hand away and casts his gaze around the room, as if attempting to comprehend his surroundings or reacting to an unexpected occurrence. His slight bend downwards might indicate an effort to adjust something out of view or to gain a better perspective. Leaning forward with his hands on his knees, he adopts a posture that conveys curiosity or concern. The progression of the video sees him leaning even further forward before straightening up to place his hands on his head, a gesture commonly associated with confusion or disbelief. Another bend downwards suggests he might be inspecting something on the floor or picking something up, followed by a return to covering his mouth, reinforcing the theme of surprise or bewilderment. Complementing this visual narrative, the audio features a male speaker expressing confusion and surprise about an unclear event or object, using phrases like \"What was this?\" and \"It was this.\" His bewildered tone conveys uncertainty and curiosity, with the core message revolving around his attempt to make sense of something perplexing. Together, the video and audio create a cohesive portrayal of a man grappling with confusion and curiosity in an environment filled with subtle, intriguing elements."
+ },
+ {
+ "video_id": "PI-0920e5c4-4275-4a34-a6e5-43c22220f843",
+ "path": "/workspace/benchmark/video/PI-0920e5c4-4275-4a34-a6e5-43c22220f843.mp4",
+ "answer": "The video presents a comfortable and versatile bedroom makeover, beginning with a man sitting on a bed beside a small side table adorned with a lamp and a record player, accompanied by the caption \"Don't Let the Girl Know This.\" The room, decorated with light-colored walls, framed pictures, and a clock, exudes a warm and inviting atmosphere. As the video unfolds, the man expands the sleeping area by pulling out a mattress, where the woman lies down, holding a teddy bear and looking relaxed and content. Later, the woman sits up and picks up the phone, while the scene transitions to transforming the mattress into a plush sofa by adding a back and cushions. They adjust the cushions and enjoy the newly created seating area, showcasing the furniture's adaptability. A close-up of the sofa highlights its comfortable and stylish design, accompanied by the text \"Click to buy now,\" encouraging viewers to make a purchase. Complementing the visuals, the audio features a piece of music sung by a woman, with a cheerful and pleasant tune that enhances the overall upbeat and inviting mood of the video, making the bedroom makeover even more appealing."
+ },
+ {
+ "video_id": "PI-bb00eb2a-f319-4642-806b-0e9652708cc3",
+ "path": "/workspace/benchmark/video/PI-bb00eb2a-f319-4642-806b-0e9652708cc3.mp4",
+ "answer": "This video captures an important moment at the Grammy Awards ceremony when Shakira was awarded the Best Latin Pop Album. There was a large screen on the stage, on which was a close-up of Shakira. She was wearing a striking orange and black dress. When she was holding the trophy, the audience could be seen backstage. Some people stood up and clapped enthusiastically. Shakira's speech is the focus of the video. She first expressed her wish to dedicate this award to her \"immigrant siblings in this country\", emphasizing that they are \"loved\" and \"worthy\", and promised \"to fight with you forever\". The text on the screen reinforced her message, showing key phrases such as \"All my immigrant brothers and sisters in this country\" and \"You are loved\". You deserve to have it. The video features the words \"Shakira wins Grammy for Best Latin Pop Album\" and a polite notice from CBS/ Recording Academy, highlighting the prestige of the event. The core theme of the video is Shakira's heartfelt gratitude and recognition for the struggles and contributions of immigrants and hardworking women, making her Grammy Award a moment of shared celebration and empowerment."
+ },
+ {
+ "video_id": "PI-421c3724-35b4-45e1-bfa6-418591fb5502",
+ "path": "/workspace/benchmark/video/PI-421c3724-35b4-45e1-bfa6-418591fb5502.mp4",
+ "answer": "The video features an enthusiastic cooking demonstration where the speaker shares their experience with a lemon garlic chicken recipe, emphasizing its nutritional benefits of 660 calories and 33 grams of protein. The demonstration includes the preparation of a delicious chicken sandwich topped with cucumber slices, shredded cabbage, and a drizzle of sauce, alongside a refreshing cucumber salad dressed with avocado oil, lemon juice, and mayo. The speaker excitedly shares their personal journey of breaking through a weight plateau, attributing their success to the consumption of healthy fats and the avoidance of seed oils. They highly recommend products like Chosen Foods Classic Mayo and Avocado Oil for their natural ingredients and positive impact on health. The speaker encourages viewers to try the recipe, with detailed instructions available in the description, maintaining an upbeat and energetic tone throughout the video."
+ },
+ {
+ "video_id": "PI-7a0b1eb6-46e8-4647-b0a6-13f9db0a8a79",
+ "path": "/workspace/benchmark/video/PI-7a0b1eb6-46e8-4647-b0a6-13f9db0a8a79.mp4",
+ "answer": "The video features a woman wearing a blue shirt, a black headband, and long curly hair, set against a plain light-colored wall with a vent at the top. She makes various gestures and expressions, often pointing to her face and hair, while text overlays express insecurities such as 'I hate this picture,' 'Bet you didn't notice my hair wasn't brushed,' 'My shirt is wrinkled,' and 'My headband is crooked.' These elements suggest a humorous or self-deprecating take on common insecurities, highlighting minor imperfections in a light-hearted way. Meanwhile, the audio introduces a relaxed and slightly whimsical tone, with a female speaker, expressing confusion or curiosity. The speaker mentions peculiar observations like 'buttery teeth' and 'unbalanced eyes,' adding to the playful criticism of physical features. The combination of the woman's candid behavior and the whimsical audio creates a relatable moment of self-reflection, blending humor with a touch of self-awareness."
+ },
+ {
+ "video_id": "PI-e99bb25b-bc22-4bf6-8c91-851b1164265f",
+ "path": "/workspace/benchmark/video/PI-e99bb25b-bc22-4bf6-8c91-851b1164265f.mp4",
+ "answer": "The video is set against an upbeat and energetic background music track, characterized by its fast-paced rhythm, electronic beats, and vibrant melodies, creating a lively and dynamic atmosphere perfect for creative endeavors. It opens with a close-up of a person holding an orange piping bag, meticulously dripping orange frosting onto a grid pattern, forming neat dots with precision. The scene then widens to show the individual, dressed in a black hoodie, adding cyan frosting between the orange dots to create a striking pattern. As the camera angle shifts for a better view, the person enhances the design by incorporating blue water droplets, followed by purple droplets from a purple piping bag, further enriching the vibrant colors. The process continues with a green piping bag, adding green droplets to the intricate pattern. Finally, another person, wearing a light-colored shirt with blue patterns, joins in, using a blue piping bag to add more droplets to the already colorful and densely patterned surface. The video concludes with a wide-angle shot of the entire decorative surface, showcasing a visually striking array of colorful droplets."
+ },
+ {
+ "video_id": "PI-a028eedc-e2df-4612-adbd-80230153ab1d",
+ "path": "/workspace/benchmark/video/PI-a028eedc-e2df-4612-adbd-80230153ab1d.mp4",
+ "answer": "The video unfolds in a cozy restaurant booth where a couple is seated at a small table, each with drinks and condiments in front of them. Initially, the couple, with the man in a black jacket, light gray sweatpants, and a white beanie, and the woman in a dark hoodie and black pants, sit across from each other, engaged in conversation but maintaining a certain distance, emphasized by the text 'Couples NOT in love' at the top of the screen. As the scene progresses, the atmosphere shifts with the text changing to 'Couples in love', and the couple begins to display more affection; the woman leans in towards the man, resting her head on his shoulder, and he leans in to kiss her on the cheek, showcasing a tender moment. This transformation in their interaction is beautifully complemented by the background music track featuring a male vocalist with a soft and soothing tone, delivering a melodic and gentle song that evokes a sense of calm and tenderness, mirroring the couple's journey from detachment to deep affection. The video concludes with the woman getting up and walking away, leaving the man sitting alone in the booth, looking down at the table, while the BGM's heartfelt expression of love and longing encapsulates the core message of deep affection and emotional connection that the couple experiences."
+ },
+ {
+ "video_id": "PI-4cb37ce2-0304-42b6-b2ee-adfa64a2303c",
+ "path": "/workspace/benchmark/video/PI-4cb37ce2-0304-42b6-b2ee-adfa64a2303c.mp4",
+ "answer": "The video takes place in a well-lit corridor, where a grey carpet is laid and framed photos are hung on the white walls. There are fluorescent lights on the ceiling and an \"Exit\" sign at the end of the corridor. A man in a light grey coat and a hat walked confidently along the corridor, making expressive gestures and looking in a good mood. As he made progress, he began to dance energetically and move his arms and legs rhythmically. The corridor was bustling with activity. There were several people in the background, some sitting and some standing. The man's dance movements became more complicated. He interacted with the crowd, and the crowd's reaction was excitement and entertainment, clapping and cheering for him. The text in the video shows: \"Someone has come in through the door.\" We let some unhappy people in, we let some unhappy people in, we let some unhappy people in, \"add a humorous or fun background to the scene. The man formed a heart shape with both hands, adding a playful element to his performance. Then, he raised one hand high in the air to show his approval of the enthusiasm of the crowd. The background is consistent with the photo frame on the wall and the \"Exit\" sign visible at the end of the corridor. The atmosphere on the spot was lively and cheerful. The man's dance steps became more and more exquisite, and the audience's reaction became more and more enthusiastic. At the end of the video, the man turned and walked towards the \"exit\" sign, still dancing and interacting with the crowd, leaving an impression of vitality and positivity."
+ },
+ {
+ "video_id": "PI-e3043bf9-e84c-4cfc-8a98-f27656918c5a",
+ "path": "/workspace/benchmark/video/PI-e3043bf9-e84c-4cfc-8a98-f27656918c5a.mp4",
+ "answer": "The video captures a casual and laid-back moment in a classroom setting, where a person in a gray hoodie and black cap is seated at a desk, leaning back with their head tilted upwards, holding a pencil and notebook in a relaxed or distracted manner. The classroom walls are decorated with colorful posters, including a creative take on the periodic table featuring cartoon faces. In the background, another individual in a pink hoodie is engaged in their own activities, contributing to the unstructured, everyday school atmosphere. Complementing this visual, the audio track features a human voice expressing a humorous or teasing mood with a smile, indicating a light-hearted and pleasant tone. The combination of the visual and audio elements creates a cohesive scene that highlights the轻松 (relaxed) and humorous nature of the classroom environment."
+ },
+ {
+ "video_id": "PI-45f874d8-441d-4395-b078-2686bac3d64c",
+ "path": "/workspace/benchmark/video/PI-45f874d8-441d-4395-b078-2686bac3d64c.mp4",
+ "answer": "The video shows the fun and festive interaction of two girls in a cozy holiday-decorated living room, which is consistent with the audio track of two girls singing. The girl on the left wears a cream knitted sweater, black leggings and white sneakers, gently touching the face and neck of the person on the right, and the girl on the right wears a green and cream zipper sweater, black leggings and white socks with black text. In the background is a fireplace with a red screen TV, a beige sofa with decorative pillows, a glass side table with gold legs, and small artificial Christmas tree. The interaction is relaxed, and the girl on the right responds with various gestures and movements, creating a warm and joyful atmosphere. The two girls seem to be practicing singing as they methodically repeat a lyric, adding a harmonious and cheerful element to the festive atmosphere."
+ },
+ {
+ "video_id": "PI-60da77a7-c4c0-4a05-8efb-8cc59c009b3d",
+ "path": "/workspace/benchmark/video/PI-60da77a7-c4c0-4a05-8efb-8cc59c009b3d.mp4",
+ "answer": "The video features two individuals in a cozy, Christmas-decorated living room, complete with a concrete ceiling, a large mirror, and a beautifully lit Christmas tree adorned with ornaments and lights. Both dancers, dressed in black t-shirts and loose, camouflage-patterned pants, start by standing still before launching into a synchronized dance routine. Their energetic and coordinated movements include high kicks, spins, and hand gestures, all performed in perfect unison, showcasing their rhythm and chemistry. Meanwhile, the upbeat BGM, devoid of any vocals or narration, complements the performance, enhancing the fun and engaging atmosphere. The consistent background of the Christmas tree and mirror adds a festive and homely touch, capturing the joy and energy of the dance and highlighting the dancers' enthusiasm and synchronization.\n"
+ },
+ {
+ "video_id": "PI-35b12860-f5c0-49be-97f7-95610849b1d4",
+ "path": "/workspace/benchmark/video/PI-35b12860-f5c0-49be-97f7-95610849b1d4.mp4",
+ "answer": "The video shows a brown dog in a cozy living room, initially sitting intently on a light gray rug. In the background is a kitchen area with white cabinets, stainless steel refrigerator, microwave, wooden dining table with black chairs, and a dark grey sofa upholstered with striped pillows. The dog stayed still most of the time, blinking occasionally and moving his head slightly, creating a calm and welcoming atmosphere. As the video progresses, the dog stands up and walks towards the camera, getting closer with each frame until its face fills the screen, highlighting its curious demeanor. Accompanying the visuals, a female singer's song expresses joy in a positive and cheerful tone, while the rhythmic ambient music enhances the overall pleasant and inviting atmosphere, providing the perfect backdrop for a serene and warm scene."
+ },
+ {
+ "video_id": "PI-1ff7d858-ba60-43d2-82fc-781b0fb0e04e",
+ "path": "/workspace/benchmark/video/PI-1ff7d858-ba60-43d2-82fc-781b0fb0e04e.mp4",
+ "answer": "The video opens with a person donning a dark blue hoodie featuring a white logo, seated in a room adorned with a gaming chair boasting LED lights, setting a playful scene. Text at the top of the screen, \"Who let the brother find the dance editor of 24KARATMEDIA,\" hints at a humorous discovery. Transitioning outdoors, a dancer in a light grey down jacket and beige trousers performs in the snow, with red text like \"sh ttt\" and \"SLOW DOWN\" appearing beside their hands, suggesting a reaction to the dance moves. The dancer continues with various movements, accompanied by words such as \"trygna,\" \"A** FALL OFF,\" and \"smash,\" emphasizing the dance's intensity. Surprise and excitement are palpable with \"OMG\" appearing in red at the top of the screen. The dancer raises hands in a stop gesture, followed by the green word \"yes\" at their feet, indicating a willingness to dance. The scene expands to more dancers in similar attire, all performing energetically in the snow, with \"Yes\" in green at their feet and \"OMG!\" recurring, reinforcing the passionate and humorous tone. Meanwhile, the audio captures the voices of two excited males, expressing surprise at the dance movements, encouraging participation with phrases like \"IT's DANCE TIME!\" and \"Yes!\", perfectly complementing the video's vibrant and enthusiastic atmosphere."
+ },
+ {
+ "video_id": "PI-9084dd13-2176-4845-bdde-db2a391ab94c",
+ "path": "/workspace/benchmark/video/PI-9084dd13-2176-4845-bdde-db2a391ab94c.mp4",
+ "answer": "The video features a man with blonde hair, dressed in a dark suit, white shirt, and red striped tie, seated in front of an American flag. The background includes a window with multiple panes and a decorative object on the left side, suggesting a formal or official environment. The man appears to be speaking, as indicated by his mouth movements, and the 'LIVE NOW' logo is visible in the top right corner of the frames. Overlaid on the video is a text box with the message: 'PRESIDENT TRUMP: 'LET HELL BREAK OUT' IF ALL HOSTAGES ARE NOT RELEASED,' in bold, black letters on a white background. Meanwhile, the audio features a male speaker exhibiting a mix of determination and frustration, discussing a situation involving hostages. He emphasizes a firm deadline of Saturday at 12 o'clock for their complete return, strongly opposing partial or staggered releases. The speaker warns of severe consequences if the deadline is not met, expressing a sense of urgency and impending action. The tone is serious and resolute, with no background music or additional ambient elements mentioned, creating a tense and formal atmosphere."
+ },
+ {
+ "video_id": "PI-ec80a7b0-d1a8-4634-a2e7-7a9778b7ce1e",
+ "path": "/workspace/benchmark/video/PI-ec80a7b0-d1a8-4634-a2e7-7a9778b7ce1e.mp4",
+ "answer": "This video captures a young woman with long dark hair and neck tattoos speaking directly and angrily into the camera. Filmed in a selfie style from a low angle, she is inside a modern home with high ceilings and a visible second-floor landing. The woman is confronting an unseen individual, accusing them of illegally recording their phone calls. She brings up the difference in recording laws between California and Florida, admitting she recorded the other person first in Florida. Her speech is filled with insults and threats of sending the other person to jail. She also confesses to deliberately lying during their conversations to create a false sense of friendship. She reveals that this was a manipulative tactic designed to trick the person into meeting up with her so that she could physically assault them. The video ends with a loud, exasperated sound from the woman."
+ },
+ {
+ "video_id": "PI-a9374e6e-c6f3-4867-89b8-3f6c27905b62",
+ "path": "/workspace/benchmark/video/PI-a9374e6e-c6f3-4867-89b8-3f6c27905b62.mp4",
+ "answer": "The video features a person in a dark blazer and light-colored shirt, seated among a group of people, with some in the background wearing red clothing and appearing to clap or react to something. The central figure makes expressive hand gestures, pointing upwards with both index fingers and later flexing their hands, exuding an energetic and assertive demeanor. Overlaid on the video are song lyrics that read: 'hold up hold up hold up,' 'let me catch my breath,' 'let me count these checks,' 'flex on my ex,' 'I don't got no respect,' 'break up in a text,' and 'pull up in a G.' These lyrics suggest themes of confidence, financial success, and moving on from past relationships. Complementing this visual narrative, the audio features a male speaker with a neutral tone discussing tasks like counting checks and managing eggs, indicating a focus on financial or logistical matters. The background music has a steady tempo, creating a focused and ambient atmosphere that underscores the scene's overall mood of triumph and self-assurance."
+ },
+ {
+ "video_id": "PI-d199e019-f355-45c0-a2aa-218b1ef1cc8a",
+ "path": "/workspace/benchmark/video/PI-d199e019-f355-45c0-a2aa-218b1ef1cc8a.mp4",
+ "answer": "This integrated piece captures a beautifully heartwarming scene, both in audio and visual forms. The audio reveals a conversation among adults and children, likely all women, showcasing a loving and joyful interaction within a family. Complementing this, the video vividly depicts one woman with long dark hair and two children sharing a tender moment outdoors. The woman embraces the children, creating an atmosphere of closeness and affection. Set against a backdrop of lush greenery and a house with large windows, the scene exudes serenity and peace. The woman is dressed in a light-colored top, while the children wear colorful, playfully patterned shirts. The overall mood is joyful and intimate, with the group clearly relishing each other's company. The camera captures their smiles and the warmth of their interactions, emphasizing the deep bond they share. The video concludes with a close-up of their faces, highlighting expressions of happiness and love, perfectly mirroring the affectionate tone described in the audio. Together, they convey a powerful message of family, love, joy, and togetherness."
+ },
+ {
+ "video_id": "PI-26611a3c-b3fe-497c-8597-1d99d5341eb7",
+ "path": "/workspace/benchmark/video/PI-26611a3c-b3fe-497c-8597-1d99d5341eb7.mp4",
+ "answer": "The video features a lively and humorous sequence of events centered around video game announcements.It starts with a person standing in front of a television screen displaying a large, glowing purple 'VI' symbol. The text on the screen reads 'Rockstar getting ready to release GTA 6.' The person, dressed in a dark shirt and pants, dances energetically, moving their arms and body rhythmically to the news.The background includes a wall adorned with blue LED lights and a framed picture, adding a vibrant atmosphere to the scene.As the person continues to dance enthusiastically, they raise their arms and make expressive gestures, clearly excited about the announcement.The scene then transitions to an outdoor setting where the same person is seen wearing headphones.The text on the screen changes to 'Notch announcing Minecraft 2,' and the person reacts with surprise and excitement, adding a humorous touch to the unexpected news.Next, the video shifts to another outdoor scene featuring a different person, wearing a white shirt.The text 'Titanfall 3' appears on the screen, and the person appears to be speaking or reacting to something off-camera, possibly expressing their thoughts on the announcement.The final scene shows the same person from the previous outdoor scene, now wearing a light blue shirt.The text 'Valve finally releasing Half Life 3 (It'll never happen..)' is displayed on the screen.The person raises their hand in a gesture that suggests disbelief or sarcasm, humorously commenting on the long-awaited and often-rumored game that has become a meme within the gaming community.Throughout the video, the audio includes the person's excited voice and background music, enhancing the playful and humorous tone.The video wraps up with a series of energetic and sarcastic reactions to various video game announcements and rumors, capturing the excitement and skepticism of gamers worldwide."
+ },
+ {
+ "video_id": "PI-af6571aa-8ca5-4c2d-b8c0-744be81ed43d",
+ "path": "/workspace/benchmark/video/PI-af6571aa-8ca5-4c2d-b8c0-744be81ed43d.mp4",
+ "answer": "The video begins with a serene view of a forested area in Louisiana, showcasing tall, slender trees with exposed roots near a body of water, a ground covered in dry leaves and twigs, and sunlight filtering through the trees to create a peaceful atmosphere. As the camera pans slightly to the right, a large log comes into view, lying horizontally on the forest floor, behind which a brown animal, possibly a bear, is seen sitting and partially obscured by the log and surrounding vegetation. Meanwhile, the audio track features a male speaker with a curious and slightly skeptical tone, expressing confusion and curiosity with phrases like 'What the heck is that?' and 'What is that?' as he observes the scene. The camera continues to pan, revealing more of the forest and the animal's position, while the speaker comments, 'Looks like a dog on grizzly bear,' and decides, 'Guess I'll go find out.' The camera then shifts to show a wooden structure in the background, adding a human element to the wild setting. The video wraps up with a close-up of a man wearing a cap with 'Louisiana' written on it, standing in the forested area, as the speaker concludes with disbelief, 'Yeah I'm not gonna believe what that was,' highlighting the intriguing and mysterious nature of the sighting in the Louisiana outdoors."
+ },
+ {
+ "video_id": "PI-e26332c8-ffa1-492d-93ce-2099059f6cbe",
+ "path": "/workspace/benchmark/video/PI-e26332c8-ffa1-492d-93ce-2099059f6cbe.mp4",
+ "answer": "This video captures a lighthearted prank in an airport terminal, identified as DFW (Dallas/Fort Worth) by a logo on a luggage cart. Two soldiers in camouflage uniforms are seen sleeping soundly on a row of waiting chairs, with their military-style bags and a luggage cart nearby. A person holding the camera quietly approaches them. After whispering, the person filming shouts 'At ease!' causing the soldiers to wake up abruptly. One soldier, who had his cap over his face, sits up with a smile, while the other looks momentarily startled before realizing it's a joke. "
+ },
+ {
+ "video_id": "PI-0471925b-5647-44e2-809a-30fcd2862430",
+ "path": "/workspace/benchmark/video/PI-0471925b-5647-44e2-809a-30fcd2862430.mp4",
+ "answer": "The video presents a heartwarming yet bittersweet scene of a golden retriever indoors, its golden coat and expressive eyes capturing the viewer's attention. The dog, looking excited and eager, seems to be begging for food from its owner, its mouth occasionally opening to reveal its tongue. The cozy setting, complete with a wooden floor, an orange hollow round stool, furniture, and a pet bowl with food in the background, adds to the warm atmosphere. However, as the video progresses, the dog's longing gaze, evident even when it sits down, contrasts sharply with the melancholic BGM track featuring lyrics that express a profound sense of sorrow and longing. The core message of the audio, emphasizing deep sadness and yearning, mirrors the emotions conveyed in the video's final frames, where the dog leaves the frame, and the camera focuses on the empty floor, accompanied by the subtitle \"Miss you so much Bailey girl,\" expressing the owner's profound feelings of missing the beloved pet. The integration of the video's tender visuals and the audio's poignant melody creates a deeply emotional and memorable experience."
+ },
+ {
+ "video_id": "PI-1f148d77-820e-4c4e-9fb2-b3b2e258271b",
+ "path": "/workspace/benchmark/video/PI-1f148d77-820e-4c4e-9fb2-b3b2e258271b.mp4",
+ "answer": "In the video, there are two people standing on a beige wall, creating a simple and tidy background. The person on the left is wearing a black T-shirt with large white letters \"United Nations\" printed on it. There are tattoos on their arms, adding a personal and artistic touch to their appearance. The person on the right is wearing a black T-shirt decorated with intricate gold and grey patterns, paired with dark trousers and multiple gold chains, exuding an air of fashion and trendiness. The person on the left will make all kinds of gestures, including pointing at the chin and covering the mouth, suggesting that they are speaking or emphasizing an point. The person on the right is standing, with his hands clasped together, seemingly listening attentively. This audio clip captures a segment of a conversation, the content of which is \"You want to know how I...\" This indicates that the person on the left might share a personal story or insight. The entire atmosphere is casual and relaxed, and individuals seem to be engaged in friendly and informal communication."
+ },
+ {
+ "video_id": "PI-007dc12f-c342-4c8c-b378-bf2c4f76d4e4",
+ "path": "/workspace/benchmark/video/PI-007dc12f-c342-4c8c-b378-bf2c4f76d4e4.mp4",
+ "answer": "The audio, which serves as the emotional backbone of this heart - warming narrative, predominantly features a mother's tender and emotional voice as she interacts lovingly with her baby. Through her words, she expresses surprise, affection, and a deep - seated emotional connection, vividly highlighting the sweet bond that exists between them. This audio component is seamlessly integrated into the video, which is set in a dimly - lit room. The video opens with a close - up of a baby's face, the infant wearing a hearing aid, immediately setting the context for an extraordinary moment. At the center of the top of the screen, the theme sentence \"Deaf baby hears her mother say 'I LOVE YOU' for the first time ever\" is displayed, emphasizing the significance of the scene. A label reading \"HIDDEN CORE\" in the middle of the lower part of the screen adds an air of mystery or depth to the video's presentation. Accompanied by warm background music that enhances the emotional atmosphere, the mother, though not visible in the frame, gently speaks to her baby. The baby, in response, shows a cute expression and makes some sounds, a reaction that is a testament to the magic of hearing her mother's love for the first time. In essence, the audio and video together primarily tell the touching story of a deaf baby's first - ever experience of hearing her mother say \"I love you,\" a moment filled with pure emotion and a profound deepening of the bond between mother and child."
+ },
+ {
+ "video_id": "PI-1320564d-a846-4b80-bdf6-03276a7d218a",
+ "path": "/workspace/benchmark/video/PI-1320564d-a846-4b80-bdf6-03276a7d218a.mp4",
+ "answer": "The video opens with a charming close-up of a large red heart-shaped box placed on a grey sofa, accompanied by an audio track featuring a playful and affectionate piece of music that poses the question, \"Would you like to be my lover?\" in a gentle and lovely voice, evoking a light and pleasant expression of love. The subtitle \"Would you like to be my lover?\" appears in red font at the bottom of the screen, setting a romantic and playful tone. As the box is opened, a little white goat emerges, adding an element of surprise and cuteness to the scene. The goat crawls out curiously, raising its head as if sniffing the air, and eventually stands on the sofa exploring its surroundings, while the affectionate message remains on the screen throughout. The video continues to focus on the little white goat standing near the big red heart-shaped box on the grey sofa, reinforcing the sweet and whimsical Valentine's Day proposal theme with the combination of the visual surprise of the goat and the playful romantic message conveyed by both the audio and the visuals."
+ },
+ {
+ "video_id": "PI-ccf05e36-8060-4898-a65f-d698e0bd75df",
+ "path": "/workspace/benchmark/video/PI-ccf05e36-8060-4898-a65f-d698e0bd75df.mp4",
+ "answer": "The video begins with a black screen, then transitions to a dance studio where a group of people, dressed in casual athletic wear, are practicing synchronized dance moves in a space characterized by its dark, minimalist aesthetic with black walls and a polished floor. While a neutral female voice in the audio briefly mentions the category of dance with a hint of concern, the video shows the dancers appearing focused and energetic, moving in unison. As the video progresses, the dancers continue their routine with dynamic movements, and some are highlighted in close-up shots to emphasize their expressions and physical exertion. The latter part of the video introduces a person in a white, angelic costume with large wings, standing out against the casual attire of the others. The group incorporates this winged performer into their routine, adding a unique visual element. The video wraps up with a focus on the winged performer moving gracefully among the dancers, creating an atmosphere of creativity and artistic expression as the dancers remain fully immersed in their performance."
+ },
+ {
+ "video_id": "PI-9614f8be-c80d-4ba4-b6e6-8be5c7758935",
+ "path": "/workspace/benchmark/video/PI-9614f8be-c80d-4ba4-b6e6-8be5c7758935.mp4",
+ "answer": "The video takes place on a bustling city street, with classic buildings in the background and a stop sign visible. The main characters are two people standing on the sidewalk, each holding a microphone. The person on the left is wearing a green jacket, a hood lined with fur and a red scarf, while the person on the right is wearing a black jacket. It seems that they are having a lively conversation or interview. Throughout the video, the person on the left holds a microphone, showing images of various people to prompt the person on the right to recognize them. The dialogue includes jokes and guesses, and \"Who is this?\" will appear on the screen. The words such as \"Raise your YA YA YA\" and \"hahaha\" are superimposed to emphasize the critical moment. The person on the right correctly recognized some people, such as Rachel Cenot and Fatima, but had conflicts with others, resulting in humorous communication. A woman with bright hair also appeared in the video, accompanied by the caption \"All the pleasures!\" The words \"Just enjoy\" imply a moment of reflection or excitement. The overall theme seems to be an interesting interactive quiz or interview that combines humor with popular culture. The urban environment and the casual clothing of the participants create a relaxed and charming atmosphere."
+ },
+ {
+ "video_id": "PI-8d787ab1-4559-42db-986e-9f5b59645fdd",
+ "path": "/workspace/benchmark/video/PI-8d787ab1-4559-42db-986e-9f5b59645fdd.mp4",
+ "answer": "The video takes place in a well-lit store aisle filled with various household items, where a man in a gray t-shirt and black pants is seen holding a whistle and playing, adding a playful yet confusing element to the scene. The narrative then shifts to another individual wearing a denim jacket and jeans, walking down a different aisle filled with greeting cards and other items. The camera follows this person as they move through the store, passing shelves stocked with diverse products. The text overlay reads 'Backup Screamer Method not patched' with crying and praying emojis, suggesting a humorous or frustrated context. Meanwhile, the audio features multiple people in conversation, mostly involving a male speaker who sounds curious and a little confused, repeatedly expressing his intention to figure something out. Another male speaker briefly says something unclear, and a speaker notices a child's voice, indicating an attempt to identify or understand a particular situation. The overall tone is curious and a bit chaotic, with the man in the denim jacket continuing to walk, occasionally looking around as if searching for something or someone, adding a sense of curiosity and urgency to the scene."
+ },
+ {
+ "video_id": "PI-1a6aa994-e483-4a30-96a1-89d8e4095b8f",
+ "path": "/workspace/benchmark/video/PI-1a6aa994-e483-4a30-96a1-89d8e4095b8f.mp4",
+ "answer": "The audio features a humorous and confident man explaining a playful method of pretending to have visited any place in the world by simply selecting a channel and taking a photo in front of it, allowing one to convince friends of having actually been there. The video complements this narrative by featuring the same man as the central figure, initially seated indoors in a room adorned with a desk, TV, and window, before transitioning to an outdoor snowy environment complete with a ski lift and a cabin. Dressed in a vibrant yellow jacket, scarf, and hat, the man engages with the camera through smiles, gestures, and lively dialogue, conveying a sense of joy and excitement. The video captures the photographer's movements, including picking up a remote control, moving right, advancing forward, shifting left, and retreating backward, all while the man shares his experiences against the backdrop of ambient snowy outdoor sounds like wind and distant activity. Key objects in the indoor scene, such as a computer monitor, keyboard, mouse, and a red item on the desk, contrast with the picturesque snowy landscape outside, creating a visual narrative that revolves around winter and skiing. The emotional atmosphere is characterized by adventure and enjoyment, perfectly aligning with the man's playful advice in the audio, where he encourages viewers to embrace creativity and fun in their experiences, whether real or imagined."
+ },
+ {
+ "video_id": "PI-f73c5bd7-40d5-4f44-9257-147bd166cc39",
+ "path": "/workspace/benchmark/video/PI-f73c5bd7-40d5-4f44-9257-147bd166cc39.mp4",
+ "answer": "The video provides a detailed introduction to a delicate rose bracelet, featuring interlocking four-leaf clover designs adorned with soft pink and sparkling clear stones. The bracelet is initially displayed in a black velvet box, placed on a cream-colored card that features phrases like \"Paris, France\" and \"I encountered the withering tree of spring,\" suggesting a thoughtful or romantic theme. Its intricate details and adjustable chain mechanism can be clearly understood. The scene then shifts to the bracelet worn on the wrist, showcasing its aesthetic appeal in the real world. Close-up shots highlight the delicate craftsmanship of each clover motif and the subtle sparkle of the stones. The visual narrative is complemented by vibrant hip-hop music, adding a modern rhythm to the calm and objective display of the jewelry. The video emphasizes the bracelet as a refined accessory."
+ },
+ {
+ "video_id": "PI-e65b882a-541b-4210-b88a-217a04b371ed",
+ "path": "/workspace/benchmark/video/PI-e65b882a-541b-4210-b88a-217a04b371ed.mp4",
+ "answer": "In the video, a man wearing glasses and a black jacket stands confidently in front of green curtains, his hands placed on his chest as he speaks, occasionally turning his head to the right, exuding an air of engagement. This scene transitions smoothly to a woman with a radiant smile, her wavy, light brown hair framing her face. She is stylishly dressed in a leopard print fur coat, a black sequined top, and a gold necklace, adding a touch of glamour. As she smiles and tidies her hair with one hand, the green screen background remains consistent, creating a seamless visual experience. Her continued smiling and poised posing, with occasional hair touches, perfectly complement the warm and harmonious atmosphere of the video. Echoing this positive vibe, the audio features a male speaker expressing admiration and humor, describing someone as the most interesting, talented, and best-dressed person he knows, which aligns with the woman's confident and stylish demeanor, thereby enhancing the overall pleasant and praising mood of the scene."
+ },
+ {
+ "video_id": "PI-cec20dc9-8b6a-4ee8-abda-d27e6ac0ae54",
+ "path": "/workspace/benchmark/video/PI-cec20dc9-8b6a-4ee8-abda-d27e6ac0ae54.mp4",
+ "answer": "The audio features a human voice recounting a fragmented narrative that weaves together the struggles of growing up in poverty, the contradictions of self-identity, and unquenched dreams, with complex and profound emotions surging between the lines. Meanwhile, the video depicts a woman with long orange hair positioned in front of a microphone, likely singing or speaking, against a backdrop resembling a recording studio. The walls are illuminated by green lights, and there’s a green light sign with obscured English text behind her. The camera work transitions smoothly from the left front of the subject to the front, then to the right front, and finally back to the front, showcasing a fashionable and trendy atmosphere that enhances the visual appeal, creating a dynamic and engaging scene that complements the emotional depth hinted at in the audio."
+ },
+ {
+ "video_id": "PI-61aceee0-5496-425d-9d85-13302cf27320",
+ "path": "/workspace/benchmark/video/PI-61aceee0-5496-425d-9d85-13302cf27320.mp4",
+ "answer": "The video unfolds in the 14th Street subway station platform, where a woman dressed in a black leather jacket, black pants, and black shoes takes center stage with an energetic dance performance. Her dynamic moves, including pointing, arm gestures, spins, and rhythmic steps, bring a lively and vibrant energy to the otherwise typical subway environment. Her hair flows with her movements, enhancing the expressive nature of her dance. In the background, a police officer in a dark uniform stands near a yellow pillar, observing the scene. The platform is well-lit with fluorescent lights, and signs reading '14 Street' and 'No Exit' are clearly visible. As the video progresses, the woman reveals a black lace top underneath her jacket, adding to the visual interest of her performance. Meanwhile, a male speaker's nonsensical vocalizations, \"Oh di di la,\" delivered in a casual tone, accompany the upbeat ambient music, contributing to the neutral yet engaging atmosphere of the scene. The combination of the woman's dynamic dance and the consistent subway setting creates a captivating and energetic visual narrative."
+ },
+ {
+ "video_id": "PI-641a4b34-6e4c-462d-916b-b5de4c171ab2",
+ "path": "/workspace/benchmark/video/PI-641a4b34-6e4c-462d-916b-b5de4c171ab2.mp4",
+ "answer": "The video takes place on a football field at night, creating a festive and entertaining atmosphere. A woman in a red outfit, holding a microphone with a 'N' logo, stands between two men dressed in festive red and white outfits with white fur trim. The text overlay humorously states, 'LAMAR AND DERRICK HENRY DIDN'T WANNA EAT THE CAKE,' setting a playful tone. In front of them are three footballs labeled 'CHRISTMAS GAMEDAY.' The man on the right, wearing a white football jersey with the number 8 and a white hood, picks up one of the footballs and reveals it to be a red velvet cake, much to the woman's surprise and amusement. She jokes about the cake, hinting at its red velvet flavor and teasing about eating it, while also playfully mentioning a chicken breast. The background shows a stadium with lights and a crowd, enhancing the festive mood. The interaction remains light-hearted and entertaining, with the woman and the men joking and enjoying the moment, creating an engaging and fun experience for the viewers."
+ },
+ {
+ "video_id": "PI-99a5bb4a-e85b-456f-b90d-074a101ae85d",
+ "path": "/workspace/benchmark/video/PI-99a5bb4a-e85b-456f-b90d-074a101ae85d.mp4",
+ "answer": "In the video, a woman is sitting in a room with yellow walls and a black dressing table. She was wearing a green vest, with multiple layers of gold necklaces and several gold bracelets on her wrists. Her hair was tied at the back of her head and she wore a nose ring. The room's layout includes a small photo frame on top of the cabinet, adding a personal touch to the background. Throughout the video, the woman actively getes with her hands while speaking, indicating that she is engaged in an engaging conversation or monologue. The text \"Let me be straightforward\" appears at the bottom of the screen, indicating that she is clarifying or emphasizing a point. Her expression and hand movements conveyed a sense of enthusiasm and emphasis. She stroked her hair, adjusted her necklace, and used her hands to express her point of view. The core theme of the video revolves around this lady's expressive storytelling or explanation. Her vivid gestures and the repeated words on the screen highlighted her desire to make her message clear and influential. The entire atmosphere is lively and personalized, contributing a warm and inviting setting to the woman's clothing and the room's decoration. This video effectively captures a passionate moment of communication, where the woman's movements and visual elements jointly convey her message."
+ },
+ {
+ "video_id": "PI-709bcc2a-423d-4a8a-bd22-faf7c4faf1ac",
+ "path": "/workspace/benchmark/video/PI-709bcc2a-423d-4a8a-bd22-faf7c4faf1ac.mp4",
+ "answer": "The video shows a woman sitting in a car wearing a gray sweater and a necklace with a small pendant. The text \"My Brain Buffer\" is displayed at the top of the frame, humorously indicating moments of confusion or mental processing. The woman is holding a McDonald's Clover Shake, a light green milkshake, and appears to be discussing or reacting to it. With a neutral and slightly uncertain tone, the female speaker repeatedly tries to articulate and clarify the pronunciation of \"clover milkshake\", expressing slight confusion and self-correction throughout. She humorously calls it \"Shamrakeshock\" and other variations, suggesting she struggles to recall the correct name. The person makes various gestures, such as pointing and holding close to the camera shaking, adding to the casual and relaxed atmosphere. The scene seems to involve the individual focusing on expressing or recalling the correct terminology, with a tone suggesting casual contemplation or slight frustration. There's no sign of background music or extra ambient sounds, emphasizing the humorous and relevant moments of mental buffer while trying to remember the name of Shamrock Milkshake."
+ },
+ {
+ "video_id": "PI-f50459d5-a41e-41b9-ae82-9b946b58dff0",
+ "path": "/workspace/benchmark/video/PI-f50459d5-a41e-41b9-ae82-9b946b58dff0.mp4",
+ "answer": "This first-person video captures a lively and heartwarming moment during a walk with a dog and a child. The camera slowly moves along a residential sidewalk while a brown dog is led on a leash. A young girl in a black coat and gray pants playfully jumps and runs around the dog, brimming with energy and excitement. The dog seems to match her enthusiasm, wagging its tail and moving about spiritedly. The girl’s cheerful giggles and laughter can be heard, along with the gentle encouragement and laughter of the woman filming. Their interaction creates a delightful playtime dynamic between the child and the pet. The phrase \"My two babies\" in the video also metaphorically suggests that the girl is the filmer’s daughter."
+ },
+ {
+ "video_id": "PI-6f3e65c9-1838-4360-9f9a-4d2ce31cc3df",
+ "path": "/workspace/benchmark/video/PI-6f3e65c9-1838-4360-9f9a-4d2ce31cc3df.mp4",
+ "answer": "The video features a reaction track to the song \"Denial Is A River\" by Doechii and Issa Rae, where a person in a red shirt and headphones is seen sitting in a room filled with various items, including a stroller and some furniture. The text 'Kai Reacts To Doechii & Issa Rae 'Denial Is A River'' is displayed at the top of the screen, setting the context for the reaction video, while 'Rainbet' is prominently shown at the bottom. Two individuals, one in a light blue shirt, brown vest, and beige pants, and the other in a checkered suit, stand behind microphones, appearing to perform or speak. The scene transitions to a closer view of the performers, with a neon 'OPEN' sign adding a vibrant backdrop. The reactor, animated and engaged, responds to the song with surprise and excitement, especially to the lyrics about an ex-boyfriend crashing into a place and destroying property. Animated text like 'O kay!' appears around the reactor, highlighting their lively expressions and gestures. The video alternates between the reactor's enthusiastic responses and the performers' dialogue or performance, with the reactor eventually standing up and dancing energetically. The video wraps up with the reactor continuing to move animatedly around the room, while the two individuals remain focused on their performance or conversation, creating a dynamic and entertaining viewing experience that vividly captures the reactor's emotional responses to the song's narrative."
+ },
+ {
+ "video_id": "PI-f10c0e85-e876-4925-9efc-a4cb2a5193b7",
+ "path": "/workspace/benchmark/video/PI-f10c0e85-e876-4925-9efc-a4cb2a5193b7.mp4",
+ "answer": "This video presents five distinct moments, each accompanied by an excited exclamation of \"Bomboclat.\" The scenes transition rapidly, beginning with a man attempting to open a coconut using a machete outdoors. The perspective then shifts to the interior and exterior of a luxury car, where a person is observed in the driver's seat, highlighting the car's features. Subsequently, the footage moves to an elevated view of a soccer match, capturing the action on the field and the reactions of spectators. The fourth segment features a close-up of a plate of scrambled eggs and bacon on a kitchen counter, accompanied by brief dialogue. The compilation concludes with an abstract visual of bright, purple and white light, resembling a lightning strike or an energetic discharge. Throughout these varied scenarios, the consistent element is the vocal expression of surprise or emphasis, contributing to the overall theme of the video.\n"
+ },
+ {
+ "video_id": "PI-15b9768b-f55b-44d8-b2fb-a3997d345179",
+ "path": "/workspace/benchmark/video/PI-15b9768b-f55b-44d8-b2fb-a3997d345179.mp4",
+ "answer": "The video captures a lively and interactive scene within a workspace or storage area, specifically a warehouse filled with folded shirts, where a group of people engage in playful banter as they deliberate over their lunch plans. The video opens with a girl donning a black long-sleeved shirt, and text prompts the audience to \"Remember her `What's for lunch´ videos?\" while encouraging them to \"Follow IG, YT and FB @ summershirtproject,\" which remains on screen throughout. The woman is seen initiating conversations about lunch, walking around the warehouse, and interacting with various individuals, including a woman with red nail polish and a man in a black hoodie. The audio complements the visuals by featuring a female voice leading the conversation about lunch, followed by laughter and playful exchanges among the group. A male speaker joins in, asking questions and contributing comments, while the female continues to engage with others, such as someone named Jordan, and individuals wearing a dark blue shirt with the message \"Be kind to others,\" as well as those seated at a table with laptops. The overall atmosphere is light-hearted and humorous, with the group clearly enjoying their interaction as they discuss their lunch options, maintaining a lively and engaging environment within the warehouse until the video concludes."
+ },
+ {
+ "video_id": "PI-e0681c0f-f244-4694-b205-8014fa2fdbde",
+ "path": "/workspace/benchmark/video/PI-e0681c0f-f244-4694-b205-8014fa2fdbde.mp4",
+ "answer": "The video features a young male recording a selfie indoors, with a light-colored wall and an overhead ceiling fan in the background. He is dressed in a black hoodie and a black bandana, accessorized with a gold chain. His face is adorned with facial hair, including a mustache and a goatee, and there is a noticeable shaved slit in his eyebrow. Throughout the short clip, the individual displays subtle movements, such as turning his head and slight body swaying, while holding the camera, resulting in minor shifts in perspective. On the screen, two prominent text overlays appear: the top one asks, \"Would you date the male version of yourself?\" and the bottom one provides a confident reply, \"I'm so handsome, of course, I would!!\" These texts form the main narrative of the video, implying that the protagonist is a female. The background is accompanied by lively hip-hop music, and the lighting is even, primarily coming from an overhead light source. The video serves as a brief personal statement or response to a common social media topic.\n"
+ },
+ {
+ "video_id": "PI-9ac55f7d-b5d3-4627-8e08-75810f60f5fe",
+ "path": "/workspace/benchmark/video/PI-9ac55f7d-b5d3-4627-8e08-75810f60f5fe.mp4",
+ "answer": "In a brightly lit bedroom, a grey cat with white markings on its chest and paws sits calmly on a bed. The bed is made with white sheets and a rust-colored blanket. A person's hand, adorned with painted nails and a wrist tattoo, enters the frame and moves to pet the cat. As the hand gets close, the cat lifts its front paw and firmly places it against the hand, preventing the touch. After this brief interaction, the cat holds its paw suspended in the air. The scene is accompanied by a rap song, with lyrics that humorously align with the cat's action of setting a boundary and refusing to be touched"
+ },
+ {
+ "video_id": "PI-9a6b9f8f-2498-4c58-9adf-0aacfacd00dc",
+ "path": "/workspace/benchmark/video/PI-9a6b9f8f-2498-4c58-9adf-0aacfacd00dc.mp4",
+ "answer": "The video takes place in a well-lit indoor environment. The protagonist is a lady wearing a striped off-the-shoulder top and gold earrings. She is using a green teeth whitening strip called \"InoPro Green Teeth Whitening\". When she was holding the product packaging, the OCR text \"InoPro GREEN Teeth Whitening\" appeared on the screen. The lady expressed her initial concerns about using the product, saying, \"These are the new whitening patches I bought from InoPro. Obviously, they should make your teeth very white, but now I'm a little scared, so I think we should go and have a look together.\" Then, she applied this toothpaste to her teeth, and the teeth turned obviously green. Her first reaction was surprise and worry, just as her exclamation indicated: \"What?\", \"Oh, my goodness!\" and \"Immediate regret.\" She tried to remove the strap and asked, \"How did you take it off?\" She decided to brush her teeth to remove the green on them, so she picked up the toothbrush and began to brush her teeth. After brushing her teeth, she smiled at the camera, revealing her white teeth. She said, \"Guys, I'm in love.\" She examined her teeth carefully while touching them. At the end of the video, she smiled and stroked her chin, seemingly satisfied with the result. The core theme of the video is the lady's experience of using InoPro teeth whitening patches, highlighting her initial fears, the usage process, and her final satisfaction with the product's effect."
+ },
+ {
+ "video_id": "PI-e7154e28-199c-4446-ad0f-de2ab5052070",
+ "path": "/workspace/benchmark/video/PI-e7154e28-199c-4446-ad0f-de2ab5052070.mp4",
+        "answer": "This video captures a visit to a cozy pizzeria nestled in a foggy forest. The journey begins from the perspective of a driver navigating a misty road, with the destination displayed on the car's navigation screen. Upon arrival at the restaurant, which features a rustic log cabin design, the video showcases its charming exterior lit by warm string lights under a gray, overcast sky. Various signs, including a large illuminated \"PIZZA\" sign and another for \"LouEddie's,\" identify the location. The surrounding tall trees are shrouded in fog, creating a serene and secluded atmosphere. Inside, the camera offers a glimpse of the kitchen area before focusing on the food: a delicious-looking pizza laden with meat toppings, followed by a dessert of two cream-filled cannolis. The video concludes with more exterior shots, reinforcing the restaurant's quaint and inviting character amidst the woods. The entire video features vocal music, with a woman singing in an ethereal voice accompanied by soft melodies."
+ },
+ {
+ "video_id": "PI-f9885d21-3272-4ae1-9e6a-ae56a250cc75",
+ "path": "/workspace/benchmark/video/PI-f9885d21-3272-4ae1-9e6a-ae56a250cc75.mp4",
+        "answer": "This video captures a lighthearted moment featuring a group of football players. Initially, they are seen in a large, carpeted room, diligently rehearsing a synchronized dance routine. The on-screen text identifies this as \"the pregame rehearsal.\" The players, dressed in casual attire, practice their moves, laugh, and discuss the choreography. The video then transitions to a split-screen format for a direct comparison. The top half shows scenes from the Disney movie \"Camp Rock,\" where the movie's cast performs the very same dance. The bottom half displays footage of the football players during an actual game, in full uniform, perfectly executing the rehearsed dance as a celebration on the field. The song \"Camp Rock\" provides the soundtrack for the comparison, highlighting the humorous and well-executed homage to the movie."
+ },
+ {
+ "video_id": "PI-e885ae46-7677-44b0-be50-f17010ee82a8",
+ "path": "/workspace/benchmark/video/PI-e885ae46-7677-44b0-be50-f17010ee82a8.mp4",
+ "answer": "The video features a playful pizza delivery roleplay scenario. It begins with a man in a white tank top holding an orange pizza box, standing in front of a white door. The text overlay humorously notes, \"POV: Tried to do role play, but we were too immature.\" The man, maintaining a neutral demeanor as if on a delivery, knocks on the door. A woman with long black hair, dressed in a white graphic t-shirt and gray shorts, answers. She expresses happiness and surprise upon receiving the pizza. The scene then cuts back to the man, who makes a playful gesture pointing to himself. The woman covers her face with her hands, appearing amused, and then points to her bag, indicating she has no money to pay. Throughout this interaction, background music with a moderate tempo and a neutral atmosphere sets the tone. The video concludes with the man looking confused while the woman smiles, highlighting the lighthearted and immature nature of their roleplay attempt."
+ },
+ {
+ "video_id": "PI-b4df549d-6b83-48ae-9862-7e6d7bf6f7d7",
+ "path": "/workspace/benchmark/video/PI-b4df549d-6b83-48ae-9862-7e6d7bf6f7d7.mp4",
+ "answer": "The video showcases a dynamic and engaging cooking process, complemented by energetic background music that enhances the overall atmosphere. It begins with a close-up of a person's hands, adorned with light-colored nail polish and a visible tattoo, as they prepare ingredients on a white plate. The person opens a red packet labeled \"Buldak Sauce\" and squeezes the dark red sauce onto the plate. Next, they pour a light yellow powder from a pink packet beside the sauce, followed by adding a dollop of mayonnaise from a \"Kewpie Mayonnaise\" bottle. A raw egg is then cracked onto the plate, completing the ingredient assembly. The scene transitions to mixing these components into a smooth, orange-colored sauce. The video then shifts to a bowl of steaming noodles generously coated with the prepared sauce. The person lifts a portion of the noodles with a fork, highlighting their texture and the rich sauce. A cooked, halved egg is placed on top, and the person mixes it in, ensuring the egg is well-coated. The video concludes with a close-up of the final dish, emphasizing the creamy sauce and appetizing appearance, all while the energetic background music creates a lively and dynamic ambiance."
+ },
+ {
+ "video_id": "PI-4c26389d-8ef4-4ed6-9c8d-c0aedcff3601",
+ "path": "/workspace/benchmark/video/PI-4c26389d-8ef4-4ed6-9c8d-c0aedcff3601.mp4",
+        "answer": "The video shows a woman holding a panda-faced lollipop at a crosswalk, with bare trees, an American flag, and purple banners in the background. A variety of cars, including grey, white, blue, red, and black vehicles as well as SUVs, drive by, adding a dynamic element to the scene. Superimposed text humorously comments on the reality of asking a man for help. The woman adjusts the lollipop and points it, possibly to focus the camera, adding an interesting element to the scene. In the audio, a neutral and slightly confused female voice asks someone to tap the phone screen to focus on a specific area, initially mentioning a panda. The man simply replies, \"Great.\" The interaction suggests a casual setting in which someone attempts to adjust the phone's focus, in a neutral to somewhat confused emotional tone, possibly for photos or videos. The absence of background music emphasizes the casual and spontaneous moment."
+ },
+ {
+ "video_id": "PI-49d89e11-fd6a-4290-bcf8-97dea375cb1f",
+ "path": "/workspace/benchmark/video/PI-49d89e11-fd6a-4290-bcf8-97dea375cb1f.mp4",
+ "answer": "The video takes place at a basketball event, likely an All-Star Celebrity Game, as indicated by the jerseys worn by the participants. A woman in a white outfit is standing at a podium, holding a microphone with the ESPN logo, and appears to be announcing the MVP award. On the podium, there is a large, shiny trophy. The text overlay on the video reads 'DRUSKI THOUGHT HE WON MVP' with a crying emoji, suggesting a humorous or unexpected moment. In the background, several people are standing, some wearing blue and white jerseys with the text 'ALL-STAR CELEBRITY GAME' on them. The scene then shifts to a man in a white jersey with the number 0, who is holding the trophy and smiling, surrounded by others, including a man in a blue jersey with the number 1. The atmosphere is celebratory, with people clapping and smiling, capturing the moment of the MVP award announcement and the recipient's reaction. Meanwhile, the audio features a female speaker with a neutral emotional tone, briefly mentioning 'MVP' and referencing 'Rome Flynn,' apologizing and expressing a sense of heaviness but encouraging resilience. The interaction suggests a conversational exchange where the female speaker acknowledges a challenging situation and offers support, adding a layer of depth to the celebratory scene."
+ },
+ {
+ "video_id": "PI-27c44871-23db-4dc4-aa60-e8e862d692f6",
+ "path": "/workspace/benchmark/video/PI-27c44871-23db-4dc4-aa60-e8e862d692f6.mp4",
+        "answer": "In the video, a person is in a room. At the beginning, he is wearing black clothes and multiple rings. The words \"Choose my dress for the Grammys\" are displayed at the top of the screen, setting out the theme of the video. He signals the start of the try-on session with a gesture. The background includes a clothes rack filled with various outfits and a kitchen area with white cabinets, providing a casual, fitting environment. He first tries on a bright red dress labeled \"OPTION 1\", which has a unique cut and a flowing skirt; he poses and adjusts the dress to show off its features. Next, he tries on a beige strapless dress labeled \"Option 2\", simple and elegant in design, rotating to showcase it from different angles. The scene then switches to him wearing a long, black, oversized dress labeled \"OPTION 3\"; he walks away to show the back. He then tries on a light pink fitted dress marked \"Option 4\", with a high neck and side cutout details, adjusting and posing to highlight its design. At the end of the video, he holds up a deep red velvet fabric, possibly considering another option for the Grammys dress. Throughout the video, the audio features a funny, humorous question, \"How long does it take to roast chicken wings in a microwave oven? How long? How long?\" followed by laughter, adding a relaxed atmosphere to the process of choosing a dress."
+ },
+ {
+ "video_id": "PI-67249447-31ae-4019-932d-f26383ae8c6a",
+ "path": "/workspace/benchmark/video/PI-67249447-31ae-4019-932d-f26383ae8c6a.mp4",
+ "answer": "The video features a woman with long, wavy blonde hair sitting in the backseat of a car, wearing a white sleeveless top and a delicate necklace with a small cross pendant. The car's interior is brown, and a sticker on the ceiling reads \"AIR BAG WARNING FLIP VISOR OVER COUSSIN GONFLABLE AVERTISSEMENT REPLIER LE PARESOLEIL\" with a no-smoking symbol. The background outside the car window shows a street scene with some greenery and buildings. The woman is speaking and using expressive hand gestures to emphasize her points as she searches for a versatile, comfortable, and stylish nude-colored block heel platform shoe with an open toe, suitable for both a wedding and a casual beach dinner. She specifically mentions that she doesn't want a shoe with a large buckle or a thick toe box, and she is adamant about avoiding cheap-looking options from Amazon. She asks viewers not to suggest ugly shoes or inexpensive items that cost twenty bucks on Amazon, as she desires something cute and comfortable to wear for hours, highlighting her preference for quality and style."
+ },
+ {
+ "video_id": "PI-2344d15b-7d82-42e6-9edc-4049dfa3594b",
+ "path": "/workspace/benchmark/video/PI-2344d15b-7d82-42e6-9edc-4049dfa3594b.mp4",
+        "answer": "The video shows a woman wearing a light gray tank top and a white layered skirt, adorned with multiple necklaces, standing in a room with beige walls and light switches. The text \"No ✂️ until it reaches my knees\" appears on the screen, emphasizing her commitment to keeping her hair knee-length. The woman raises her arms, turns around slightly, then gathers her hair into a bun and releases it, letting it fall smoothly to her knees. The background reveals a doorway to another room, decorated with various photographs and a ceiling fan, while the shot emphasizes the length and quality of her hair. Accompanying this visual narrative is a soothing song sung by the woman, which serves as background music and enhances the harmonious and serene atmosphere of the scene. The combination of visual and auditory elements creates a calming and cohesive experience that draws attention to the beauty and patience of achieving long, healthy hair."
+ },
+ {
+ "video_id": "PI-c94b32b3-ba2d-40c2-9c6f-bfc119323187",
+ "path": "/workspace/benchmark/video/PI-c94b32b3-ba2d-40c2-9c6f-bfc119323187.mp4",
+ "answer": "The video and audio together depict a casual, humorous scene centered around a young man named Jake in a warm, festive room decorated for Christmas, complete with a staircase adorned with stockings and lights, a Christmas tree, and various holiday decorations. Jake, dressed in a green and white hoodie with a logo and gray sweatpants, is seen reusing a blue gift bag with a Santa Claus design for multiple gifts. The audio captures a conversation among people commenting on Jake's act, with remarks about the bag's appearance and Jake's actions. In the video, Jake hands the bag to a woman in a green sweatshirt, who opens it to reveal a gift inside. He then retrieves the bag, walks towards the staircase, places it on a step, climbs the stairs, and walks back down, hinting at its reuse. Later, he hands the same bag to another person, who opens it to reveal a white Stanley tumbler. Jake takes the bag back and repeats the process of walking towards the staircase, placing it on a step, climbing the stairs, and walking back down. Finally, he is seen holding the bag again, walking towards the Christmas tree, and placing it on the floor next to the tree. The video concludes with a shot of multiple gift bags, including the blue one with the Santa Claus design, filled with tissue paper, and Jake sitting on a beige recliner with his hands behind his head, looking relaxed, while the audio continues to reflect the humorous tone of the conversation about Jake's resourceful act."
+ },
+ {
+ "video_id": "PI-b90d0f51-0edd-407a-8465-aad286fa7756",
+ "path": "/workspace/benchmark/video/PI-b90d0f51-0edd-407a-8465-aad286fa7756.mp4",
+ "answer": "The video unfolds in a well-stocked store aisle, where four individuals are initially seen conversing. Among them are three women: one with long blonde hair wearing a brown hoodie and blue jeans, another sporting glasses, a pink top, and a black jacket, and the third with long brown hair in a black jacket and blue jeans. The fourth person is a man dressed in a black t-shirt, black pants, white sneakers, and a light green beanie. The man, with a cheerful and persuasive demeanor, engages the group by offering twenty dollars to anyone willing to scream as loudly as possible. He explains that he works at the store and cannot participate himself due to the proximity of managers. After some hesitation and discussion within the group, he emphasizes the need for the scream to be exceptionally loud while holding out the twenty-dollar bill. This humorous exchange creates a light-hearted atmosphere, capturing an entertaining moment within the store aisle."
+ },
+ {
+ "video_id": "PI-9c3fd2c6-58ec-4504-84ac-cd05ee3e72b0",
+ "path": "/workspace/benchmark/video/PI-9c3fd2c6-58ec-4504-84ac-cd05ee3e72b0.mp4",
+ "answer": "The video unfolds in a classroom where two individuals are seated at their desks. The person in the foreground, dressed in a black beanie, black t-shirt, and black pants, holds a pencil and initially appears relaxed. The scene is overlaid with the humorous text, \"Nobody: That one friend that tell on you during a test,\" suggesting a comedic take on a friend's betrayal during an exam. Suddenly, the foreground individual bends over dramatically, as if reacting to something unexpected, while the background person remains focused on writing. A third person, wearing a white shirt over a black t-shirt and jeans, enters holding a wooden stick and approaches the desks. The foreground individual looks surprised or confused as the newcomer taps their desk with the stick, causing them to lean back in their chair. Meanwhile, the audio captures a male speaker expressing frustration and aggression, with some people laughing, indicating a tense yet possibly humorous interaction. The video concludes with the foreground individual standing up and moving away, while the background observer continues to watch. The combination of visual and audio elements creates a scene of dynamic, albeit slightly confrontational, classroom interaction."
+ },
+ {
+ "video_id": "PI-75b43ad8-1c3e-4e48-835c-113fe0239eea",
+ "path": "/workspace/benchmark/video/PI-75b43ad8-1c3e-4e48-835c-113fe0239eea.mp4",
+        "answer": "In the video, a woman can be seen sitting at a wooden table in the dining room, wearing a black jacket. On the table lie some food wrappers and a set of keys with a small camera attached. The woman picks up the keychain camera, shows it to the viewer, and presses a button to turn it on and display the screen. A text overlay appears saying \"He stole it after I told him the last part\". The woman continues to manipulate the camera, showing its features while expressing frustration. The backdrop features wooden paneling and some dining room decor, creating a cozy atmosphere. Meanwhile, in the audio, a female speaker enthusiastically discusses the keychain camera, expressing nostalgia and excitement about the vintage feel of its photos. A male speaker questions the practicality of using the camera instead of a mobile phone, while female speakers comment on the pleasure of using the device. The conversation remains light and entertaining, with the speakers sharing a common interest in the novelty of the camera. The video captures the interaction with the camera, while the audio provides a deeper insight into how the speakers think and feel about the device."
+ },
+ {
+ "video_id": "PI-74acc9bc-78b9-4266-bf51-39d427f4f98f",
+ "path": "/workspace/benchmark/video/PI-74acc9bc-78b9-4266-bf51-39d427f4f98f.mp4",
+        "answer": "This video captures a heartwarming surprise moment at a 60th birthday celebration. The restaurant buzzes with activity as guests mingle and toast, filled with a joyous party atmosphere.\nThe camera follows an elderly woman - captions reveal she's the guest of honor's best friend who traveled specially for this occasion - as she makes her way through the crowd toward the birthday celebrant, who is currently embracing someone else.\nThe focus then shifts to the birthday woman. When she suddenly recognizes her long-distance best friend standing before her, her composed expression transforms into visible excitement. After a brief exchange, the two share a tight embrace, their faces radiant with happiness. This carefully planned surprise unfolds perfectly."
+ },
+ {
+ "video_id": "PI-c81bebc5-53de-49fc-96cb-991794912945",
+ "path": "/workspace/benchmark/video/PI-c81bebc5-53de-49fc-96cb-991794912945.mp4",
+        "answer": "The video takes place in a modern kitchen, featuring stainless steel appliances and white cabinets that create a clean, modern background. The protagonist stands there comically dressed in a black sweatshirt, a black shirt, blue jeans, white socks, an upside-down blue hat, and black gloves, with his pants sagging down to his ankles. He holds a microphone, suggesting that he is performing or speaking. The text \"Nobody: KENDRICK's Pants\" appears on the screen, adding a comedic element by suggesting that a person named KENDRICK chose to wear sagging pants. The person appears to be rapping or reciting lyrics, as seen from his rhythmic movements and the microphone in his hand. The rhythmic music in the audio, with lyrics including \"I'm tripping, I'm sliding, I'm riding a horse\" and \"First Time catching up\", complements the visual performance. The overall theme of the video is humor and lightheartedness, emphasizing exaggerated fashion choices and playful performance in a home setting."
+ },
+ {
+ "video_id": "PI-6563e02b-7dd2-48ad-9afe-24d079b9a3dd",
+ "path": "/workspace/benchmark/video/PI-6563e02b-7dd2-48ad-9afe-24d079b9a3dd.mp4",
+ "answer": "The video features a person wearing a white knit hat and a colorful striped sweater, sitting at a table in a cozy, dimly lit restaurant or bar, with a large red bowl of food and a black spoon in front of them. The background includes a chalkboard with some writing and a shelf with several bottles, possibly beer bottles. The text overlay reads: 'Catching up with my besties from college and laughing with a straight face to see how they react.' The person appears to be engaged in a lighthearted and humorous conversation, likely sharing funny stories or memories with friends. Meanwhile, the audio captures a group of individuals, predominantly female, enjoying a casual and playful interaction. One person recounts an amusing incident where someone peed on the floor during Halloween at a place called McKey's, eliciting reactions from others. Another speaker reiterates the event and questions what someone is doing, while a fourth individual expresses frustration and urges the group to stop. The atmosphere is jovial and slightly chaotic, with occasional laughter, suggesting a friendly and relaxed gathering where friends are enjoying each other's company and sharing humorous anecdotes."
+ },
+ {
+ "video_id": "PI-8b7c42b1-602a-43de-a018-d8828658792a",
+ "path": "/workspace/benchmark/video/PI-8b7c42b1-602a-43de-a018-d8828658792a.mp4",
+        "answer": "The video shows a peaceful and elaborately decorated room, with green as the dominant theme, creating a serene and natural atmosphere. The scene begins in a cozy corner of the room. Natural light shines through the window decorated with green curtains, illuminating the various plants on the windowsill. A hand holding a wooden mug of black liquid, which might be coffee or tea, is a prominent feature of the foreground, adding a personal touch to the scene. As the camera pans, a bed can be seen. It has a green headboard and a brown frame, partially covered with blankets, suggesting a peaceful space. The focus then shifts to a wooden table with intricate carvings at the center of the room. The table is decorated with a series of green-themed ornaments, including small plants, small statues, and a framed peaceful landscape. A small, green-lit display cabinet showcases a piano setting, adding a modern and harmonious element to the traditional decoration. A soothing voice in the audio says, \"I hope you can find some peace of mind here,\" which complements the peaceful visual elements. Without any other dialogue or significant sound effects, the audience can focus on the serene atmosphere and the detailed decorations. The video contains no text and focuses on the visual and auditory elements. The core theme of the video is to create a peaceful and harmonious environment through thoughtful decoration and a nature-inspired color scheme."
+ },
+ {
+ "video_id": "PI-be5cf1fc-b3ff-43be-ac5a-06ae82eaef71",
+ "path": "/workspace/benchmark/video/PI-be5cf1fc-b3ff-43be-ac5a-06ae82eaef71.mp4",
+ "answer": "The video beautifully captures a moment of thoughtful preparation for a Valentine's Day gift, complementing the sentiments expressed in the audio narrative. It shows a person sitting on a curb, meticulously spraying a handmade card with a blue bottle, likely to preserve it or add a fragrance. The card, adorned with a flower drawing surrounded by pink lipstick kisses and the text \"To: Noe\" in red, reflects the care and personal touch described in the audio. The individual, wearing a colorful beaded bracelet and dark pants, is on a quiet street with a fence and greenery in the background. After ensuring the card is thoroughly coated, the person flips it to reveal \"From: Bella\" written in red, emphasizing the personal connection and effort put into creating this meaningful gift. The scene conveys a sense of love and thoughtfulness, aligning with the calm and affectionate tone of the audio, where the speaker highlights the care and thoughtfulness involved in preparing such a heartfelt present."
+ },
+ {
+ "video_id": "PI-dcf50841-794d-4322-904c-951a6f142e1d",
+ "path": "/workspace/benchmark/video/PI-dcf50841-794d-4322-904c-951a6f142e1d.mp4",
+        "answer": "The video captures an unexpected audience interaction during a stand-up comedy performance at a venue featuring red curtains and a neon sign. A male comedian, dressed in a red jacket and a graphic t-shirt, is on stage holding a microphone. He is verbally engaged with an off-screen individual. The interaction escalates when a man in a bright red puffer jacket walks onto the stage and approaches the comedian, followed by a woman and another man. The man in the puffer jacket verbally confronts the comedian, who initially maintains a composed and slightly amused expression. The situation appears tense but does not result in a physical altercation, with the comedian continuing to engage with the situation humorously. The audience's reactions, including laughter and shouts, contribute to the dynamic atmosphere of the event."
+ },
+ {
+ "video_id": "PI-3fefd4fc-1a40-47c1-935c-3afb03efcb48",
+ "path": "/workspace/benchmark/video/PI-3fefd4fc-1a40-47c1-935c-3afb03efcb48.mp4",
+ "answer": "The video begins with a view of a cozy, pink-themed bed adorned with a quilted bedspread and matching sheets, set against a backdrop of a green metal frame and a neutral-colored carpeted floor. Various stuffed animals and toys scattered around the bed enhance the playful and inviting atmosphere of the room. A child, dressed in white pajamas with pink patterns and a matching pink bonnet, enters the frame and climbs onto the bed. The child moves around, adjusting the blankets and eventually sitting down while holding a black remote control, exploring and interacting with it, which creates a sense of curiosity and playfulness. The scene continues with the child sitting comfortably on the bed, still engaged with the remote, pressing buttons and looking at it with curiosity. The consistent background of pink bedding and scattered toys maintains the warm and playful setting. Overlaying this visual narrative is the audio from a woman speaking in a gentle and affectionate tone, expressing her love by saying, 'I think I like this little life. I think I like this little eye,' which conveys a sense of intimacy and tenderness. The core message of the video is the blend of the child's exploration and discovery with the speaker's deep affection for small but precious moments, creating a heartwarming and endearing portrayal of childhood curiosity and parental love."
+ },
+ {
+ "video_id": "PI-b3b043a8-7081-4f33-ad47-5a2df34276d2",
+ "path": "/workspace/benchmark/video/PI-b3b043a8-7081-4f33-ad47-5a2df34276d2.mp4",
+ "answer": "The video presents a vibrant and extensive collection of lip balms, scattered in a colorful array that immediately captures attention. The assortment features a multitude of shapes, sizes, and designs, with bright hues and playful packaging that include character-themed containers. In the foreground, a hand with pink nail polish prominently displays a unique, boot-shaped lip balm with a fluffy white top, resembling a miniature Ugg boot, directly responding to a viewer's challenge about its existence in her collection. The background remains a bustling mosaic of lip balms, from cylindrical to spherical and novelty shapes, emphasizing the collector's dedication and the creativity within the lip balm designs. Accompanying this visual feast is a playful and confident voice-over from a female speaker, who addresses the skepticism about owning the Ugg lip balm with defiance, asserting her possession of it and inviting further engagement by challenging viewers to find something she lacks. This interaction adds a dynamic layer to the display, highlighting both the diversity of the collection and the speaker's enthusiastic personality."
+ },
+ {
+ "video_id": "PI-a2f3318b-1bd1-4349-873f-9f0a5642125f",
+ "path": "/workspace/benchmark/video/PI-a2f3318b-1bd1-4349-873f-9f0a5642125f.mp4",
+ "answer": "The video showcases a creative, homemade Minecraft-themed setup, beginning with a close-up of a cardboard structure featuring a large, hexagonal screen displaying the Minecraft game interface in a blue hue. The background includes a Christmas tree, a lamp, and a guitar, creating a cozy indoor atmosphere. The camera then reveals a child inside the structure, wearing a blue shirt and patterned pants, holding a game controller and appearing deeply engaged with the game. The interior is decorated with cardboard walls, a snack shelf, and an 'ENCYCLOPEDIA MINECRAFTERS' poster, with dim lighting enhanced by blue lights. The child looks around, immersed in the experience, while the camera pans to show details like a small window with a blue spiral design. Complementing this visual, a young voice, likely a child's, expresses excitement and curiosity, reflecting his enthusiasm and immersion in the Minecraft environment. The video concludes with a wider shot of the entire setup, highlighting the creativity and effort put into building this Minecraft-themed space, blending the visual and auditory elements to convey a sense of wonder and engagement."
+ },
+ {
+ "video_id": "PI-7f392659-3b25-4107-801a-80f92badb5ab",
+ "path": "/workspace/benchmark/video/PI-7f392659-3b25-4107-801a-80f92badb5ab.mp4",
+ "answer": "The video features a person with long, curly hair wearing a white, knitted sweater and a delicate necklace with a small, turquoise pendant. The setting is a modern indoor space with large windows that allow natural light to flood the room, offering a view of the outdoors with some greenery and a structure that appears to be part of a house or building. The individual remains mostly stationary, with slight movements of the head and hair, creating a calm and serene atmosphere. Meanwhile, the background music has a soft and emotional female voice conveying a sense of tenacity and warning, with lyrics that state, \"I'm not a killer, but don't push me.\" This hints at themes of self-defense and a warning against provocation, enhancing the overall cozy and inviting ambiance with a message of resilience and a caution not to be pushed too far."
+ },
+ {
+ "video_id": "PI-fcbfa09c-0bdc-452e-9390-0233ba3a4a2c",
+ "path": "/workspace/benchmark/video/PI-fcbfa09c-0bdc-452e-9390-0233ba3a4a2c.mp4",
+ "answer": "This video captures a moment of impressive skill at a lively roller skating rink. The central focus is a person who performs a unique trick, detaching from one of their roller skates and allowing it to roll across the floor on its own while they continue to balance perfectly on the other. The lone skate glides smoothly over the wood-patterned floor before the skater retrieves it. The surrounding environment is a typical roller rink, with a main blue skating area and a checkered ceiling. Colorful blue and purple lights illuminate the scene, creating a vibrant atmosphere. Other people can be seen enjoying themselves, some skating in the background and others watching from red booths along the side."
+ },
+ {
+ "video_id": "PI-1c92c187-6388-42d5-94da-0494c59437ab",
+ "path": "/workspace/benchmark/video/PI-1c92c187-6388-42d5-94da-0494c59437ab.mp4",
+        "answer": "In the video, a person stands in a snowy environment in front of a white sports car, most likely a Lamborghini, parked in a sheltered area. The person is wearing a light-colored, oversized sweater and loose grey trousers, paired with brown boots. Snow falls around them, creating a peaceful and cold atmosphere. The background is dark, indicating that the scene takes place at night. The covered area is illuminated by multiple ceiling lights, contrasting with the falling snow and the dark environment. The person seems to be moving slightly, perhaps adjusting their posture or getting ready to walk. The audio contains explicit song lyrics, including the lines: \"Oh my god, slow down and catch the other girl as if you wanted your butt to fall off! I want to smash them now. Hurry up!\" These lyrics add provocative and vibrant elements to the otherwise calm and quiet scene. The serene snow scene combined with the bold, explicit audio forms a sharp contrast, enhancing the overall effect of the video. The video seems to capture a static moment against a vibrant and bold soundtrack, blending the serenity of snowfall with the intensity of the music."
+ },
+ {
+ "video_id": "PI-b4c96b9d-7335-42dd-8635-2d6d3e300860",
+ "path": "/workspace/benchmark/video/PI-b4c96b9d-7335-42dd-8635-2d6d3e300860.mp4",
+        "answer": "The video unfolds against the backdrop of a kitchen, with the main focus on a frying pan on the stove. The scene captures the cooking process: tortillas frying in hot oil alongside chicken and shredded pork. A hand uses a spatula to place and adjust the tortillas in the pan, ensuring they cook evenly, and flips them once they turn golden brown so both sides crisp. Over the visuals runs an audio narrative in which the speaker recounts a conversation with their \"second pimp\", who insisted on teaching them how to cook. The speaker initially questioned the necessity of learning to cook, saying they were a \"hoe\"; however, the pimp's response emphasized that they were \"women first\", highlighting the importance of self-sufficiency and traditional skills. The end of the video shows the cooked tortillas, now golden and crispy, placed on a plate and topped with fresh diced onions, cilantro, and tomatoes, with a lemon wedge on the side adding vitality. The scene is set on a table, with a water bottle and a bowl of salad in the background, suggesting a complete meal. The core theme revolves around unexpected life lessons learned from non-traditional sources, emphasizing the value of cooking skills as a fundamental aspect of womanhood and independence."
+ },
+ {
+ "video_id": "PI-4fb947b2-62f8-4790-9777-9d3df4431cb2",
+ "path": "/workspace/benchmark/video/PI-4fb947b2-62f8-4790-9777-9d3df4431cb2.mp4",
+ "answer": "The video captures a joyful and playful scene in an outdoor setting on a sunny day, featuring a large inflatable play area adorned with vibrant green, yellow, and black colors. The lively atmosphere is enhanced by the absence of human voices in the audio, replaced instead by a sound that resembles a man making playful noises, contributing to the overall cheerful ambiance. The text overlay at the top of the frames humorously reads, \"Tried to get a video of my kid playing...got this instead! *little girl is fine,\" followed by three crying emojis, adding a touch of humor and indicating that while the video may not have captured the intended moment, it still showcases a delightful and entertaining scene. The young children are seen energetically running, jumping, and crawling across the inflatable surface, exploring various barriers and slides labeled \"JUMP\" and \"DUCK,\" embodying the essence of carefree play and childhood joy."
+ },
+ {
+ "video_id": "PI-2f13d937-56a7-4c52-a287-1e2939837bd2",
+ "path": "/workspace/benchmark/video/PI-2f13d937-56a7-4c52-a287-1e2939837bd2.mp4",
+ "answer": "This video takes place in a wedding dress store where a person is trying on a white wedding dress. She was holding a pink heart-shaped sign which read: \"I've found my Mr. Right!\" The words \"@davidbridal\" are written on it, expressing a sense of achievement and joy. Another person in black is helping them, perhaps adjusting the clothes or providing feedback, emphasizing the moment of support and guidance. The background shows others trying on gowns and various wedding dresses hanging on the hangers, creating a typical lively atmosphere of a wedding dress store. The text on the video reads: \"How many halo points did I lose when I had to accept a piece of clothing that I hated, because it was their only piece of clothing in the size zone.\" I cried in the hotel that night. It expresses the emotional struggle and disappointment that a person feels when they have to choose a piece of clothing they don't like due to the limited choices in the large size area. There is a voice in the audio saying, \"Please give me a little courtesy. I want to ask,\" which adds to the emotional weight of this situation and implies a request for understanding and respect. This video captures the complex emotions related to buying a wedding dress, especially for plus-size people, highlighting the impact of challenges and limited choices on self-esteem and emotional health."
+ },
+ {
+ "video_id": "PI-34ad73c6-33ba-4737-92e7-4ae8bdef0a28",
+ "path": "/workspace/benchmark/video/PI-34ad73c6-33ba-4737-92e7-4ae8bdef0a28.mp4",
+ "answer": "A neutral male voice enthusiastically shares his positive experiences with a reusable magic writing book designed for children, emphasizing its unique vanishing ink feature that makes writing practice both cost-effective and enjoyable. The video demonstrates the use of a blue pen with a grip, showcasing its precision and versatility as it writes the word 'onion' on lined paper, traces the letter 'J,' and draws an elephant outline. The product's reusable nature is highlighted, with the ink disappearing after twenty minutes, encouraging children to practice independently without parental prompting. A set of six black pens and colorful pencil grips is displayed, along with various writing practice books tailored for different learning needs, underscoring their educational value. The video concludes with the pen writing 'stand' on lined paper, reinforcing how these books make writing practice engaging and enjoyable for children, while also being a cost-effective solution for parents looking to improve their children's writing skills."
+ },
+ {
+ "video_id": "PI-22c8a87c-763f-4f6b-9a32-21429efce2a3",
+ "path": "/workspace/benchmark/video/PI-22c8a87c-763f-4f6b-9a32-21429efce2a3.mp4",
+ "answer": "The audio track features strong and energetic background music, setting a powerful and vibrant tone that emphasizes a confident and determined demeanor, perfectly mirroring the essence of the shadowy figure depicted in the video. The video begins with the outline of a large red figure, labeled \"rekuraix,\" transitioning to a detailed view of the character playing the guitar, predominantly in red and black with white gloves and yellow cuffs against a black and red backdrop. The character is showcased from various angles, holding the guitar with dynamic visual effects, before reappearing in a powerful posture, pointing forward with a determined expression, dressed in black with red accents and white gloves, against a grid-like red and black background. A close-up of the character's face reveals a confident and slightly smug expression, with red eyes, black fur, white accents on the chest, and red accents on the ears, set against a dark purple and red background. The video then presents a pixelated version of the character in a dynamic pose with minor glitches and a multicolored background, before concluding with another intense and focused close-up of the character's face, this time against a vibrant purple and pink backdrop. Together, the audio and video create a compelling narrative of strength, confidence, and determination, blending powerful music with striking visuals to leave a lasting impression."
+ },
+ {
+ "video_id": "PI-18be107b-1237-4f4f-8a6d-d24c22b8e3ba",
+ "path": "/workspace/benchmark/video/PI-18be107b-1237-4f4f-8a6d-d24c22b8e3ba.mp4",
+ "answer": "The video captures a gripping and suspenseful scene from a horror game, where a player, wearing a white hat and headphones, sits in a room with pink walls and various devices in the background, as the text \"Never give your brother guns in horror games\" humorously sets the stage. The scene then shifts to a first-person perspective within the game, where the player, armed with a gun, navigates through a dimly lit, eerie environment filled with brick walls and several ghosts wrapped in white sheets. The audio, primarily composed of the male player's voices, conveys intense tension and fear as he repeatedly asks who else is present and claims to have spotted someone, showcasing his anxiety and determination to survive. The player moves cautiously, scanning the area, and expresses confusion and fear through voice-over, while shooting at the ghosts, some of which disappear or react to the attacks. As the player continues to traverse the haunted setting, encountering more ghostly figures and engaging in intense confrontations, the tension escalates, culminating in a high-stakes battle that keeps viewers on the edge of their seats. The integration of the audio's expressions of dread and the video's chilling visuals creates a compelling and immersive experience, perfectly capturing the essence of a horror game."
+ },
+ {
+ "video_id": "PI-1d183ddd-ade6-43f0-85cf-881da69d5c27",
+ "path": "/workspace/benchmark/video/PI-1d183ddd-ade6-43f0-85cf-881da69d5c27.mp4",
+ "answer": "The video and audio present a stark contrast in tone and content. In the audio, a woman's voice is clearly frustrated and annoyed as she scolds her daughter, who has just vomited, questioning why she lied and expressing a desire to discuss the matter. However, the video unfolds in a completely different atmosphere. Set in a beige-walled bedroom with a small bedside table and a string of small lights, a woman in a black satin dress enters through a door leading to a corridor illuminated by purple lights. The text overlay humorously suggests that \"Ash doesn't care that his daughter just threw up,\" setting a light-hearted and satirical tone. The woman, seemingly in a good mood or carefree, raises her arms, walks around the room, and engages in playful gestures, such as crossing her arms in front of her chest and stroking her hair. She even has a small animated character with a heart and the words \"Join my team\" on her head at one point, adding a whimsical filter effect. Despite the text overlay's mention of the daughter's vomiting, the woman's demeanor remains playful and happy throughout the video, creating an overall atmosphere that is light-hearted and pleasant. This juxtaposition between the audio's frustration and the video's humor highlights the differing perspectives and tones between the two mediums."
+ },
+ {
+ "video_id": "PI-a1a9dfd1-6e05-45db-be93-527918c7c155",
+ "path": "/workspace/benchmark/video/PI-a1a9dfd1-6e05-45db-be93-527918c7c155.mp4",
+ "answer": "The video seamlessly integrates dynamic background music with a captivating visual performance, creating an engaging and energetic atmosphere. The audio, characterized by its rhythmic and melodic elements, complements the visual narrative without overpowering it. The video features a woman dressed in a traditional Indian saree, predominantly red with gold accents and a green border, adorned with elaborate gold jewelry. Set in a modern living room with a white marble floor, a green chair, and a beige sofa with patterned cushions, the scene presents a striking contrast between traditional and contemporary elements. The woman begins her dance in a standing position, transitioning gracefully into a squatting pose, showcasing her balance and elegance. Her fluid and expressive movements, intricate hand gestures, and rhythmic steps highlight the grace and cultural significance of the dance form. As the performance progresses, she continues to execute dynamic movements, maintaining smooth transitions between poses. The consistent background of the modern living room enhances the visual appeal, emphasizing the beauty and intricacy of the traditional dance within a contemporary setting. The combination of energetic music and graceful dance creates a harmonious and captivating experience, celebrating cultural artistry.\n"
+ },
+ {
+ "video_id": "PI-a7ea3b61-b5d9-4ed9-a404-ef7029895597",
+ "path": "/workspace/benchmark/video/PI-a7ea3b61-b5d9-4ed9-a404-ef7029895597.mp4",
+ "answer": "The video features a comfortable indoor environment, mainly focusing on a bed covered with a fluffy grey quilt. The background is dim lighting, creating a relaxed and intimate atmosphere. A person wearing a light grey T-shirt and a black headscarf is sitting on the sofa. He seems to be in a casual family setting, with plain walls visible behind him. An individual interacts with the audience through gestures, pointing to his chest and raising his hand to indicate. There is a catchy and cheerful tune in the audio. The lyrics are \"Raise your yeah, yeah\", and this person seems to be responding or emphasizing this point with gestures. The overall atmosphere is laid-back and personalized, allowing one to focus on their expressions and movements in a comfortable setting. This video conveys a sense of relaxation and familiarity. The person seems to be enjoying the moment, perhaps responding to music or a conversation. The core theme is to relax and interact casually in a warm environment."
+ },
+ {
+ "video_id": "PI-168b9346-2ed1-43e4-ab90-4e7bdeae0a82",
+ "path": "/workspace/benchmark/video/PI-168b9346-2ed1-43e4-ab90-4e7bdeae0a82.mp4",
+ "answer": "The video depicts a scene in a fast food restaurant, characterized by tiled floors and various kitchen appliances, where a person, likely an employee, stands wearing a purple cardigan over a light-colored shirt and a black apron, sharing their views on being asked to sing for colleagues during work hours. This individual, whose voice is also featured in the audio, humorously expresses reluctance and exhaustion at the request, particularly noting their fatigue at 5 a.m., and emphasizes their lack of enthusiasm for singing. The audio, composed of this woman's voice, conveys a relaxed counterattack against her colleague's performance, highlighting her weariness and the amusing yet tiresome nature of the situation. The background and clothing in the video hint at a casual, typical behind-the-scenes working day in a fast food setting, perfectly complementing the audio's narrative of a lighthearted yet relatable workplace anecdote. Together, they capture the essence of a humorous yet exhausting encounter in a fast-paced work environment."
+ },
+ {
+ "video_id": "PI-8ef743f7-c4df-4950-ba1b-f7c58fa84f69",
+ "path": "/workspace/benchmark/video/PI-8ef743f7-c4df-4950-ba1b-f7c58fa84f69.mp4",
+ "answer": "The video begins with a bald man in tactical gear walking through a set of clear plastic curtains in an industrial setting, followed by another in similar gear. As he advanced, more people joined him, forming a coordinated team of six. The group confidently moves along concrete corridors, past industrial equipment and machinery, with a huge industrial door and various equipment in the background, indicating a warehouse or factory environment. As the group stayed in formation, bright fluorescent lights overhead lit the scene. The audio, meanwhile, features a male speaker discussing various topics in a casual and somewhat fragmented manner in a vaguely emotional tone. He reflects on life's decisions and choices, metaphorically references basketball player John Stockton, and shares observations on social interactions, including moments of confidence and camaraderie. The ambient music has a steady rhythm and creates a neutral, unobtrusive atmosphere that complements the visual narrative of the coordinated movement of the tactical team."
+ },
+ {
+ "video_id": "PI-19b5ad8d-3c86-4937-9c45-5d130ebe8d3e",
+ "path": "/workspace/benchmark/video/PI-19b5ad8d-3c86-4937-9c45-5d130ebe8d3e.mp4",
+ "answer": "The video captures a whimsical and lighthearted scene in a forest setting, where a man dressed in a black t-shirt adorned with a green leaf pattern and a black hat stands amidst oil drums and scattered wood. The man, with visible tattoos on his arm, places a big, round grey cheese on a wooden surface and proceeds to cut through it with a large knife, creating rhythmic movements that add to the playful atmosphere. As he takes a thick slice of cheese out of the bowl and showcases its texture, the audio complements the scene with a light and cheerful tune sung by a lady, suggesting the lightheartedness of the moment. The man then takes a bite of the cheese and makes celebratory gestures, including flexing his muscles and forming peace signs, indicating his satisfaction or success in this quirky endeavor. The combination of the video's visuals and the audio's upbeat music creates a delightful and entertaining experience for viewers."
+ },
+ {
+ "video_id": "PI-b43ae9fe-d4b7-4505-a6de-fae59951fad2",
+ "path": "/workspace/benchmark/video/PI-b43ae9fe-d4b7-4505-a6de-fae59951fad2.mp4",
+ "answer": "In a scene that blends emotional intrigue with rustic charm, a female speaker, her tone laced with ambiguous emotion, begins an audio segment by expressing a desire to share something significant, hinting at romantic interest with a direct and emotional statement, all set against a backdrop of calm and stable music that creates a neutral yet slightly intimate atmosphere. Meanwhile, the accompanying video showcases a couple standing hand in hand in a rustic wooden barn, illuminated by sunlight streaming through wooden beams, casting long shadows on the floor. The woman, adorned in a sleek, black, form-fitting dress with a deep V-neckline and cascading wavy hair, stands beside a man in a classic black suit, their interaction suggesting a posed photo or video shoot. The text overlay, humorously revealing, \"POV: I was pregnant here and I had no idea BUT THE INTERNET DID,\" adds a layer of surprise to the warm and intimate setting, enhanced by the barn's rustic charm and soft lighting, creating a visual narrative that complements the emotional undertones of the audio."
+ },
+ {
+ "video_id": "PI-02e97a8d-3259-4d14-b271-6a530c86a0c5",
+ "path": "/workspace/benchmark/video/PI-02e97a8d-3259-4d14-b271-6a530c86a0c5.mp4",
+ "answer": "In a dynamic and engaging video - audio combination, the energy of the moment is palpable. The audio features a male's voice with a cheerful and inviting tone. The speaker is clearly in the spirit of promotion, encouraging viewers to check out his exclusive content, setting an upbeat and welcoming mood for the whole experience. The video complements the audio perfectly. It shows a man dressed in a black top, a gold necklace, and headphones, sitting at a console adorned with green and red buttons. The background is filled with interesting elements, including a glowing Earth symbol, the word “OUY”, and a refrigerator with a green drink. At the start of the video, a comment “DO WIZTHEMC - SHOW ME LOVE” appears on the screen, to which the man seems to be responding. He initially shows confusion, asking “what is that? ” before diving into an analysis of the beat. He mentions “Afrobeat? ” and “lowkey sounds like it... with that accent,” indicating his keen interest in the music. He then confirms that it's a special song and lets out excited shouts like “YOOO” and “NIGERIAN,” showing that he's already caught up in the vibe of the chat. Throughout the video, the man repeatedly expresses his excitement and anticipation. Phrases like “LET'S GO! ” and “EARS = BLESSED” are followed by his final confirmation “YES. THAT'S THE ONE! ” and a triumphant cheer of “HEEEEY!!!” The entire video has a lively atmosphere, with the man's expressions and movements clearly conveying his love and excitement for the music, while the cheerful and inviting audio of the male voice further enhances the overall enthusiasm of the scene, urging viewers to be part of this musical journey."
+ },
+ {
+ "video_id": "PI-5eecec47-100c-4da7-979e-3e4920a33409",
+ "path": "/workspace/benchmark/video/PI-5eecec47-100c-4da7-979e-3e4920a33409.mp4",
+ "answer": "The audio and video content work in harmony to effectively promote a pair of black gloves, emphasizing their key features and benefits to potential buyers. In the audio, a male voice with a friendly and enthusiastic tone introduces the gloves, highlighting their waterproof, heated, double-insulated, and cold-weather-suitable qualities. The core message of the audio is to convey the gloves' warmth, functionality, and affordability, aiming to persuade listeners to consider purchasing them. The video complements the audio by visually demonstrating the gloves' features. It begins with a close-up of a hand wearing one of the gloves placed over a container of ice cubes, emphasizing the gloves' effectiveness in cold conditions. The text overlay reinforces this point. The video then transitions to a scene where a person pours water onto the glove, showcasing its waterproof feature. Additionally, the person rubs snow onto the glove, further proving its waterproof and insulating properties. The text overlay also mentions the affordability of the gloves, offering two pairs for $10, which aligns with the audio's emphasis on affordability. Together, the audio and video content create a compelling promotional message for the gloves. The audio provides a detailed description of the gloves' features and benefits, while the video visually demonstrates these features, making the message more engaging and persuasive. The combination of friendly and enthusiastic tone in the audio with the visual demonstration in the video effectively encourages viewers to consider purchasing the gloves."
+ },
+ {
+ "video_id": "PI-32b03876-aa09-4893-a262-ae3131bdaea4",
+ "path": "/workspace/benchmark/video/PI-32b03876-aa09-4893-a262-ae3131bdaea4.mp4",
+ "answer": "In the video, there is a person with curly hair, wearing a black shirt and a silver necklace, sitting in a room with a wooden headboard and a green plant in the background. Throughout the video, the words \"A pill that's hard to swallow\" are prominently displayed at the top of the screen, creating a contemplative yet somewhat melancholic tone. On the left side of the screen, three colorful circles appeared, containing text. Each circle conveyed different but interrelated information. The pink circle reads, \"Sometimes, you are the problem.\" The yellow circle reads, \"Kindness can still cause harm.\" The blue circle reads, \"The dream job still feels like a job.\" This person seems to be speaking, and the text seems to be emphasizing incomprehensible truths or realizations. The video maintains a consistent visual theme, focusing on characters and text, with minimal changes in the background or character positions. The core themes revolve around the harsh realities of life, self-reflection, and even the most positive intentions and aspirations, which are often accompanied by unacknowledged challenges. This video encourages the audience to confront these cruel facts and urges people to reflect more profoundly and accept the complexity of human experience."
+ },
+ {
+ "video_id": "PI-9ce71b5f-3602-45ea-9aab-c348d0a529a8",
+ "path": "/workspace/benchmark/video/PI-9ce71b5f-3602-45ea-9aab-c348d0a529a8.mp4",
+ "answer": "The video captures a vibrant scene in a bar or pub, where a group of people are gathered, socializing, and enjoying themselves in an energetic atmosphere with dim lighting. In the foreground, a person in a light-colored sweatshirt and green cap holds a bottle, engaging with others, while a nearby individual in a black hoodie with white text holds a can. The background reveals more people, some raising their arms in celebration, against plaid wallpaper and large screens showing a television program. This festive mood of communal enjoyment is contrasted by an audio overlay featuring a male speaker discussing a situation with urgency. He mentions Don, described as white and possibly on a call, and refers to a corner locking down a number one, suggesting a strategic context, perhaps in sports or security. The phrases \"Dgene is CB1\" and \"Said is wide as can run\" imply specific roles or positions. The ambient music, with its steady tempo, enhances the tense atmosphere of the discussion, creating a striking juxtaposition with the lively visuals."
+ },
+ {
+ "video_id": "PI-f5019a3d-3102-4526-ab9f-e21d42f44e64",
+ "path": "/workspace/benchmark/video/PI-f5019a3d-3102-4526-ab9f-e21d42f44e64.mp4",
+ "answer": "A young woman dances energetically to upbeat male pop music in a sleek, modern kitchen setting. She wears an oversized blue hoodie, black-framed glasses, and fitted black jeans that allow freedom of movement. The camera remains stationary throughout the shot, clearly capturing how the cheerful soundtrack perfectly complements her fluid dance performance. With impressive precision, she executes a series of choreographed movements including intricate arm sequences, rhythmic body sways, and seamless turns. Her expression stays consistently relaxed and joyful, maintaining a warm smile from beginning to end.\nThe background showcases a spotless, brightly lit kitchen with contemporary appliances and ample natural light. This video radiates an effortlessly positive and carefree atmosphere, masterfully conveying the woman's genuine passion for dance. Every gesture and step demonstrates how thoroughly she embraces the music and loses herself in the moment. Viewers can tangibly feel her authentic enjoyment and love for movement, making the scene both uplifting and relatable. The combination of her skillful execution and radiant enthusiasm creates an inspiring snapshot of pure, unadulterated joy in everyday surroundings."
+ },
+ {
+ "video_id": "PI-5fd62136-b000-4140-8508-e5a759178fbe",
+ "path": "/workspace/benchmark/video/PI-5fd62136-b000-4140-8508-e5a759178fbe.mp4",
+ "answer": "The video and audio content blend harmoniously to create a captivating narrative that highlights the thoughtful design features of a Volvo car while evoking an emotional and aesthetic appeal. The video begins with an interior view of the Volvo, showcasing its white leather seats and sunroof, setting the stage for a discussion on design elements that cater to women. As the video progresses, it transitions through various features, such as the steering wheel with the Volvo logo, the dashboard with a large touchscreen display, and a built-in trash compartment, which is humorously highlighted with the text \"A BUILT IN trash can HELLO????\". The practicality of a purse holder attached to the side of the console is also emphasized, along with the luxurious crystal PRNDL gear selector. The video concludes with a serene view of the sunroof and moonroof, offering a glimpse of the clear blue sky and leafless trees outside, reinforcing the luxurious and thoughtful design of the car's interior. Meanwhile, the audio, a BGM track with a female voice singing in a soft, dreamy, and emotional tone, complements the visual content by evoking a sense of beauty and introspection, enhancing the overall aesthetic and emotional appeal of the video. Together, the audio and video content effectively showcase the Volvo car's interior design features while creating an engaging and emotionally resonant viewing experience."
+ },
+ {
+ "video_id": "PI-2a4b1881-33c2-43f9-986c-ad877f1c2bef",
+ "path": "/workspace/benchmark/video/PI-2a4b1881-33c2-43f9-986c-ad877f1c2bef.mp4",
+ "answer": "The audio features a dialogue between an adult man and an underage boy, characterized by the man's proud tone, suggesting a father-son learning experience. This conversation sets a heartwarming backdrop for the video, which unfolds in a cozy wooden cabin where a boy is engrossed in ice fishing. The boy, dressed in a black hoodie with \"Minnesota\" written on it, light gray sweatpants, and black rubber boots, sits comfortably in a black leather recliner, holding a fishing rod attached to a hole in the icy wooden floor. Nearby, a blue bucket labeled 'Napa' and various fishing tools accompany him. The video captures the boy's excitement as he reels in the fishing line, culminating in the triumphant moment of catching a fish. The scene exudes warmth and rustic charm, perfectly mirroring the pride and joy expressed in the audio dialogue, as the father presumably watches over his son, sharing in the experience and imparting valuable lessons. Together, the audio and video create a touching narrative of family bonding and the simple pleasures of outdoor activities."
+ },
+ {
+ "video_id": "PI-b9cb44fb-8a69-49ef-a584-2c6be328cf09",
+ "path": "/workspace/benchmark/video/PI-b9cb44fb-8a69-49ef-a584-2c6be328cf09.mp4",
+ "answer": "The video features a woman in a bathroom adorned with bicycle-patterned wallpaper, creating a unique and reflective setting. She begins by drying her wet, curly hair with a white towel, dressed in a simple white tank top and accessorized with gold bracelets and a heart-shaped necklace. The scene is imbued with a sense of contemplation, mirrored by the audio where a female speaker expresses a desire to communicate certain feelings without causing harm, indicating restraint. As she applies a yellow liquid from a small bottle onto her palm and rubs it through her hair, followed by a creamy substance from a white bottle, likely conditioner or a styling product, the steady rhythm of the ambient music enhances the atmosphere of introspection. The female speaker briefly inserts an English word, \"ah,\" suggesting agreement or acknowledgement, which aligns with the methodical application of the hair products. After ensuring even coverage, she ties her hair into a high ponytail, transitioning smoothly into her makeup routine. She starts with a concealer stick under her eyes and on her forehead, blending it with a brush, then applies foundation evenly across her face. The video progresses with her using a pink brush to apply blush to her cheeks, followed by contouring her face with a brown stick, which she blends thoroughly. Highlighter is then applied to her cheekbones, and she completes her look by outlining her lips with a lip liner and filling them in with nude lipstick. The entire process is conducted in a calm and deliberate manner, reflecting the audio's theme of thoughtful communication and self-care."
+ },
+ {
+ "video_id": "PI-d3d31914-b178-408c-93b1-b170c8c7c3cb",
+ "path": "/workspace/benchmark/video/PI-d3d31914-b178-408c-93b1-b170c8c7c3cb.mp4",
+ "answer": "In the video, a young woman is wearing a black T-shirt with pink and white patterns printed on it. She is indoors, perhaps in the bedroom, with an ordinary wall in the background. This woman is holding a pink smart phone in her left hand. There are long and pointed red nails and several rings on the phone. It seems that she is taking selfies or recording videos while tidying up her hair with her right hand. The text on the video reads: \"Don't try to act like a Spaniard. Her daughters are so funny,\" and it is accompanied by a crying and laughing emoji. This indicates that the audio here, which is not fully transcribed but implied by the text, contains humorous comments about someone's attempt to act in a certain way, possibly related to cultural or linguistic identity, and the speaker finds the daughter's reaction amusing. The core theme of the video seems to revolve around humor and family interaction, highlighting the comedic aspects of attempting to adopt different characters and the interesting reactions it elicits from family members. The woman's expressions and body language convey a sense of entertainment and relaxation, which is consistent with the humorous tone implied by the superimposed text."
+ },
+ {
+ "video_id": "PI-293c50e8-7c84-4044-ab23-66def6109846",
+ "path": "/workspace/benchmark/video/PI-293c50e8-7c84-4044-ab23-66def6109846.mp4",
+ "answer": "The video depicts a person holding a baby dressed in a pink, fluffy bunny costume, complete with a pacifier. The individual is adorned in a vibrant pink scarf and a black jacket, set against an outdoor backdrop featuring trees and a cloudy sky. The person shares that they received a message from the nanny detailing the baby's nap schedule, revealing that the baby only took one 15-minute nap and one one-hour nap during an eight-hour period. They express amusement at this information and mention receiving a video of the baby doing something endearing, inquiring if it's a new behavior. The person then outlines their evening plans, which include attending a Zoom call, playing a card game called 'Onesie' on Zoom, and watching the Four Nations hockey tournament with the baby's father. However, they note that the father won't be home in time, so the baby will need to accompany them. The video captures a warm and engaging conversation between the person and the baby, filled with anticipation for their upcoming activities. Meanwhile, the audio recording complements the video by highlighting that the woman is sharing information received from the nanny about the children."
+ },
+ {
+ "video_id": "PI-9ed4e463-5c11-444d-9757-f77e4d2c0e54",
+ "path": "/workspace/benchmark/video/PI-9ed4e463-5c11-444d-9757-f77e4d2c0e54.mp4",
+ "answer": "The video captures the electrifying atmosphere of a red carpet event, likely the Golden Globes, as evidenced by the microphone labeled \"GOLDEN GLOBES\" held by the interviewer. Set against a backdrop of a well-lit venue adorned with a luxurious red carpet and floral arrangements, the scene buzzes with excitement, drawing a crowd of attendees, photographers, and event staff all vying for a glimpse of the stars. Amidst this glamorous setting, the interaction between the interviewer and the interviewee stands out as a highlight. The conversation, casual yet humorous, revolves around the interviewee's remarkable talent for impersonations. The interviewer expresses admiration, describing the interviewee as fearless in his performances and even suggesting that his impersonations surpass the originals, specifically referencing Jeff Goldblum. This lighthearted banter and mutual appreciation contribute to the informal and lively ambiance, perfectly encapsulating the spirit of celebration and star-spotting that defines such red carpet events."
+ },
+ {
+ "video_id": "PI-6cd69c08-acff-49b8-95d8-032c641ffb01",
+ "path": "/workspace/benchmark/video/PI-6cd69c08-acff-49b8-95d8-032c641ffb01.mp4",
+ "answer": "The video begins with a close-up shot. One hand is holding a pink notebook with the words \"STROBRIGHTS\" written on it, and the cover has a colorful design. The video also marked the words \"Squid Game Chicken Fighting Tutorial\". This means that the video will offer tutorials on making traditional paper toys to beat chickens. The scene transitions to a pair of hands painted with light blue nail polish, folding a piece of blue paper lengthwise on a wooden surface. Then spread out the paper with both hands and cut it in half along the creases to form two rectangles. Continue to fold each piece of paper into a smaller rectangle, with both sides facing the middle, to form a shape similar to an open book. Repeat this process on two pieces of paper. Then spread the paper slightly with both hands to form a square in the middle, which is crucial for the following steps. Fold the upper corner of the paper to the lower corner with both hands to form a triangle, and then fold the triangle towards the middle. At the end of the video, the hand completed the folding process, forming a compact square pancake. The audio contains some scattered phrases, such as \"Squid game.\" \"com\" I take a piece of paper, fold it in half, and place it in the middle. Place one piece of paper on top of another and overlap them. Fold both sides towards the middle, and then do it again with another piece of paper. I unfold it into the skin and the things on the middle finger, with everyone else in the middle. When this person recounted the process of making Darakji, he used a casual and humorous tone."
+ },
+ {
+ "video_id": "PI-137c2f50-e8b1-410d-bb2f-1f6b5a5fa77a",
+ "path": "/workspace/benchmark/video/PI-137c2f50-e8b1-410d-bb2f-1f6b5a5fa77a.mp4",
+ "answer": "The video depicts three boys sitting at a table in a classroom or studio, having a humorous conversation. The middle boy wore a dark hoodie with the words \"Champion\" printed in various colors, while the other two boys wore light gray and red sweatshirts. The background features cabinets and a pink sign in a foreign language, adding to the casual and fun vibe. A text overlay, \"car for nobody 98.99% accurate\" introduces an element of irony or humor into the scene. The audio, meanwhile, captures the lively conversation between the boys, all joking and laughing. A boy humorously said, \"We're going to buy a car too, you want a car, you want a car with no one around, you f***g think I am,\" indicating a sarcastic tone between friends. The video's core message is a relaxed and friendly communication about buying a car, blending visual and auditory elements to create an engaging and entertaining narrative."
+ },
+ {
+ "video_id": "PI-2bd67fb3-85c1-4740-8f11-0bef939934ad",
+ "path": "/workspace/benchmark/video/PI-2bd67fb3-85c1-4740-8f11-0bef939934ad.mp4",
+ "answer": "The audio features a woman's smiling voice, which imbues the scene with a light-hearted and humorous tone, complemented by bursts of laughter that highlight the spontaneous and interesting nature of the situation. The video, set in the aisle of a store, depicts a person pushing a shopping cart brimming with various items, including a prominent bag of Kettle brand potato chips. The text overlay on the video reveals a humorous backstory: someone was persuaded to hop into the shopping cart to \"purchase a TikTok\", only to be subsequently pushed towards the staff section of the store. The background, filled with store shelves and a red door possibly leading to the staff area, adds to the ambiance. The individual pushing the cart, dressed in a dark sweater with their hair tied up, adds to the whimsical and mischievous atmosphere. Together, the audio and video capture a delightful moment of fun and mischief, seamlessly blending the store's setting with the playful actions of the participants, creating a memorable and entertaining experience."
+ },
+ {
+ "video_id": "PI-8a7bec9d-856c-4393-9032-8e42e1945631",
+ "path": "/workspace/benchmark/video/PI-8a7bec9d-856c-4393-9032-8e42e1945631.mp4",
+ "answer": "The video takes place in a large aquarium tank filled with vibrant marine life, where a massive whale shark, adorned with distinctive white spots and stripes, swims gracefully through the water, accompanied by several smaller fish. Above it, a large manta ray glides effortlessly, creating a mesmerizing underwater spectacle. The aquarium is surrounded by a crowd of visitors, some sitting and others standing, captivated by the scene. The water is a clear, deep blue, and sunlight filters through, enhancing the serene atmosphere. The text overlay reads 'He did a trick for us' with a thinking face emoji, suggesting that the marine creatures are performing an impressive display for the audience. Amidst this awe-inspiring setting, a female speaker expresses surprise and excitement, saying 'Oh my gosh, I got,' in an energetic tone, indicating a moment of personal achievement or discovery. A male speaker responds briefly in Korean with an unknown emotional tone, providing a neutral or unclear reaction. The interaction captures the blend of wonder from the visitors and the subtle, personal reactions of individuals within the crowd."
+ },
+ {
+ "video_id": "PI-44934598-0a2b-4697-8dc5-e82e4dbfbff5",
+ "path": "/workspace/benchmark/video/PI-44934598-0a2b-4697-8dc5-e82e4dbfbff5.mp4",
+ "answer": "This video features two distinct scenes, capturing different aspects of urban life and the interesting spontaneity. The story begins with a person walking on the sidewalk of the city, wearing a unique cotton-padded jacket, just like a quilted mattress. This jacket is beige and printed with various letters, including \"N\", \"H\", \"G\", \"F\", \"E\" and \"D\". This person is also wearing blue jeans and yellow shoes, adding a touch of trendy color to their overall look. The background features bustling streets and shops with red and white striped awnings, creating a lively urban atmosphere. Then, the scene transitions to an outdoor area with a wooden fence as the background, where a person in black can be seen dancing energetically. Superimposed on this scene is a series of random letters: \"X\", \"dlaskdfjasldk\", \"fjoiq\", \"woiunvpqwerujo\", \"ouwerlnzmneiqasdf\" and \"klj\", adding abstract and interesting elements to the video. There is a very casual sentence in this audio, \"I have a movie night\", suggesting a relaxed and informal tone. The video effectively contrasts the structured urban environment with spontaneous dance, highlighting the diversity and vitality of urban life."
+ },
+ {
+ "video_id": "PI-d6fe2163-4fc5-4d05-8198-1f78bec120b6",
+ "path": "/workspace/benchmark/video/PI-d6fe2163-4fc5-4d05-8198-1f78bec120b6.mp4",
+ "answer": "This video showcases a live musical performance by singer-songwriter Hozier and his band on a festively decorated stage. The artist, a man with long dark hair and a beard, stands center stage, passionately singing into a microphone while playing a red electric guitar. The setting is warm and atmospheric, with a large, ornate chandelier hanging above and Christmas-themed decorations like wreaths and garlands adorning the stage. The camera alternates between intimate close-ups of Hozier's face and wider angles that reveal the entire band, including a drummer and several backup singers who sway to the music. The performance is for \"Saturday Night Live,\" as indicated by the \"SNL 50\" logo that appears on screen. The band delivers a powerful and soulful rendition of their song."
+ },
+ {
+ "video_id": "PI-20dc3ca4-f2b7-4ec6-ad1e-e636d8f2600b",
+ "path": "/workspace/benchmark/video/PI-20dc3ca4-f2b7-4ec6-ad1e-e636d8f2600b.mp4",
+ "answer": "The video captures a moving rescue operation involving two adorable kittens, accompanied by the voice of a woman who adds a personal and interactive feel to the scene. Initially, the camera focuses on wooden floors and chairs, and the text expresses confusion at the cries heard in the background. As the camera pans, the camera reveals a black leather chair, a pair of shoes and a cardboard box. A spotted kitten named Poppy is found trapped under the box while another kitten, Pudding, tries hard to save her by chewing on the box. The woman's voice can be heard, possibly expressing a sense of concern and urgency, interacting with the cat, adding to the emotional depth of the scene. A hand intervenes to release the poppy, and the video ends with two kittens being found and exploring the area. The on-screen text describes them as \"Sweetest kitties ever,\" highlighting their playful and affectionate relationship. The holistic audio-visual narrative showcases a heartwarming moment of rescue and the strong bond between the kittens."
+ },
+ {
+ "video_id": "PI-c0edeb8d-9c14-4712-a249-24ccb763d8d6",
+ "path": "/workspace/benchmark/video/PI-c0edeb8d-9c14-4712-a249-24ccb763d8d6.mp4",
+ "answer": "The video begins with a serene yet contemplative scene of a gray car parked in a lot, with raindrops speckling the window. The camera zeroes in on the driver's side window, where a person clad in a dark jacket with the hood up is seated. The backdrop of other vehicles clearly places the setting in a parking lot. The individual inside the car seems to be in a state of observation or waiting, adding a layer of quiet introspection to the scene. This visual narrative is complemented by an audio track featuring a casual conversation between two male speakers, both exuding a relaxed yet thoughtful demeanor. The audio delves into musings about stress and understanding, mirroring the visual theme of contemplation. As the video progresses, the person in the car makes a peace sign with their right hand, a gesture that subtly underscores the laid-back atmosphere. The raindrops on the window and the parking lot setting remain constant, providing a cohesive backdrop. The person then lowers their hand and resumes looking around, maintaining the scene's static yet reflective quality. The video wraps up with the individual still seated in the car, encapsulating a moment of calm introspection amidst the rainy weather and parking lot surroundings. The integration of the audio and video elements creates a harmonious portrayal of relaxed contemplation and understanding."
+ },
+ {
+ "video_id": "PI-0ac54f33-f5fb-4de3-9348-acf6501b3ab7",
+ "path": "/workspace/benchmark/video/PI-0ac54f33-f5fb-4de3-9348-acf6501b3ab7.mp4",
+ "answer": "The video showcases an exhilarating basketball game, with the question \"Who is the best player of the season so far?\" prominently displayed at the bottom center of the screen, sparking discussion among viewers. Throughout the match, players from the blue and white teams are locked in a close contest, constantly changing positions to secure their shots. The male commentators provide passionate commentary, highlighting each player's exciting moments, such as dribbling, passing, shooting, and layups, which adds to the intensity of the game. The crowd's loud cheers after successful shots underscore the excitement and energy in the arena. Although the audio mainly consists of crowd noise, which effectively conveys the game's fervor without obvious dialog or background music, it complements the visual content perfectly. The video primarily focuses on key moments from the games of eight players, allowing viewers to witness their skills and performances firsthand. The combination of the commentators' passionate delivery and the crowd's enthusiastic reactions creates an immersive experience, making it easy for viewers to get caught up in the thrill of the basketball game and join in the discussion about the season's best player."
+ },
+ {
+ "video_id": "PI-c12e9844-ddd4-4f78-b0b9-6b1ba1112e7d",
+ "path": "/workspace/benchmark/video/PI-c12e9844-ddd4-4f78-b0b9-6b1ba1112e7d.mp4",
+ "answer": " In this lively scene, a family conversation unfolds among a man, a woman, and their pregnant daughter, with an additional male voice joining later. The man speaks in a calm yet slightly exasperated tone, suggesting they simply ignore the daughter's emotional outburst, believing she'll eventually tire herself out. Meanwhile, the woman responds with a concerned tone, acknowledging the daughter's evident distress. The pregnant daughter, full of energy and insistence, declares she's wide awake and craves something \"really rank\" to eat. As the dialogue progresses, another man enters the conversation in a polite yet humorous tone, engaging with a pizza vendor. He humorously criticizes the pizza's quality—its sauce, bread, toppings, cooking method, the people making it, and even how it tastes like the box it comes in, going as far as to say the butter dipping sauce is terrible—while placing an order to satisfy the daughter's unusual craving. The scene captures a blend of familial concern, lighthearted humor, and the emotional dynamics surrounding a pregnant woman's food cravings, all without any mention of music or ambient sound."
+ },
+ {
+ "video_id": "PI-8e423df7-1c3e-4a45-a305-0576ace69732",
+ "path": "/workspace/benchmark/video/PI-8e423df7-1c3e-4a45-a305-0576ace69732.mp4",
+ "answer": "The video begins with a man in a modern living room, dressed in a mustard yellow polo sweater, light blue jeans, and beige Chelsea boots, with a text overlay advising viewers 'do NOT let your man dress like this pt. 7.' As the male speaker in the audio discusses upgrading men's casual outfits for a more flattering and balanced look, the man in the video critiques his own outfit before moving towards a hanger with new light blue athletic straight jeans. The video shows him examining and trying on the new jeans, then swapping his beige Chelsea boots for black leather round-toe boots, and finally changing into a beige polo sweater. The audio emphasizes replacing tight or overly bright clothing with relaxed yet stylish alternatives, which is mirrored in the video as the man adjusts his new outfit, showcasing the improved proportions and polished appearance. The video concludes with the man confidently posing in his upgraded look, highlighting the more put-together style, while the audio reflects a happy and satisfied tone, underscoring the transformation's success."
+ },
+ {
+ "video_id": "PI-4513b53f-ccac-42d6-898d-216e3d341cd1",
+ "path": "/workspace/benchmark/video/PI-4513b53f-ccac-42d6-898d-216e3d341cd1.mp4",
+ "answer": "The video features two children in a cozy living room setting, both dressed in red shirts and festive pajama pants, creating a warm holiday atmosphere. The child in the foreground, with blonde hair tied up, is holding a blue toy gun and appears to be aiming it, with the back of their shirt reading 'BELIEVE' alongside a Christmas tree design and the text 'in the magic of Christmas.' The other child, also in a red shirt, stands near a staircase in the background, observing the scene. The room is decorated with a large mirror, a pet crate, and a couch, enhancing the homely feel. The camera then shifts to a close-up of the blonde child's face, capturing their cheerful and playful demeanor as they turn towards the camera. Accompanying the visuals, the audio features the excited and cheerful voice of a child, greeting and making interesting or ambiguous statements about someone's knowledge or behavior, which adds a layer of joy and curiosity to the scene. The video wraps up with a continued close-up of the child's face, emphasizing their joyful and engaging presence, perfectly complementing the festive and warm setting."
+ },
+ {
+ "video_id": "PI-08f9eb1d-2712-4bef-bac3-f49812833f91",
+ "path": "/workspace/benchmark/video/PI-08f9eb1d-2712-4bef-bac3-f49812833f91.mp4",
+ "answer": "The video paints a vivid picture of a heartwarming family gathering in a cozy home, complete with a wooden door, a woven light fixture, and a tiled floor. It captures the moment when a woman in a red plaid dress and a man in a white shirt, accompanied by a small black and white dog, return home after getting engaged. As they enter, they are greeted with great excitement by family members, such as a person in a black shirt and another in a gray shirt holding a bottle, filling the room with laughter and cheers. The text overlay \"Got engaged then came home to my whole entire family surprising me\" reveals the context of the celebration. The audio, which features two people returning home to celebrate with their families after their engagement, perfectly complements the video by capturing the moments of love, celebration, and family reunion. The video continues to show the couple embracing near the door, with the dog moving around and family members in the background, while another woman in a black top and striped pants and a person in a gray shirt holding a black bag join the celebration. The family members raise bottles and cheer, creating a lively and joyful atmosphere that encapsulates the essence of love, celebration, and family unity."
+ },
+ {
+ "video_id": "PI-b0c6d05a-7605-48ae-8ed3-1178682d41ec",
+ "path": "/workspace/benchmark/video/PI-b0c6d05a-7605-48ae-8ed3-1178682d41ec.mp4",
+ "answer": "The video shows a woman standing behind a table covered with a patterned tablecloth, wearing a light blue zip-up jacket and dark trousers. In front of her was a pink bag. The text overlay at the top of the frame reads, \"take my card & go get some yogurt and whatever else we might need\" indicating that she is giving instructions, possibly to someone off-camera. She bent down to pick up a box of yogurt from the pink bag. As she continues to interact with the bag, two furry white kittens appear on the table, adding a playful element to the scene. The lady picked one up from the bag and showed it to the camera, while the kittens walked around curiously, one of them being very close to the camera and filling the frame with its face. The lady continued to lift one from the bag and put it back on the table. Meanwhile, the audio features a cheerful person, possibly female, expressing optimism and confidence that everything will be fine and that no one can interfere. The scene conveys a positive and uplifting mood, accompanied by steady-rhythmic ambient music and a bright, encouraging atmosphere that captures moments of domestic life, combining the instructions of the woman with the playful presence of the kitten."
+ },
+ {
+ "video_id": "PI-3cff6c7a-bfd7-40e9-b611-89ade8527202",
+ "path": "/workspace/benchmark/video/PI-3cff6c7a-bfd7-40e9-b611-89ade8527202.mp4",
+ "answer": "The video features a person standing in a room with a vibrant purple-themed background, which includes a textured purple wall and a neon light strip emitting a soft purple glow. The individual is dressed in a purple pajama set adorned with a seashell and starfish pattern, complemented by long, straight blonde hair and multiple gold bracelets and a necklace. They strike various playful poses, placing their hands on their hips, pointing, and gesturing enthusiastically. Overlaid on the video is the text 'little hint for the next Glow House member. she's the SWEETESTTTT & always so positive,' indicating that the person is hinting at a new member joining the 'Glow House' and describing them as exceptionally sweet and positive. This visual presentation aligns with the audio from a cheerful and enthusiastic female speaker who also hinted at introducing a new member to the 'Glow House,' emphasizing their sweet and positive nature to generate excitement about their arrival. The overall atmosphere of the video is lively and colorful, dominated by the purple color scheme."
+ },
+ {
+ "video_id": "PI-c3afed6e-6b1c-42c1-8be1-66c2ed86367b",
+ "path": "/workspace/benchmark/video/PI-c3afed6e-6b1c-42c1-8be1-66c2ed86367b.mp4",
+ "answer": "This video takes the dark starry sky as its background, creating a mysterious and futuristic atmosphere. In the middle, a person with long black hair, wearing a set of black, smooth clothes and long gloves, is sitting confidently on a metal chair. This central figure soon joined four other people, each with a different style: one was wearing a white dress and holding a cat, another was wearing a red coat and had a Bob cut, the third was wearing a pink coat and colorful shorts, and the last one was wearing a light blue suit. Together, they formed a striking collective pose, exuding a sense of unity and rebellion. The track named \"ALTEREGO\" includes songs such as \"Rebirth\" and \"doja cat and raye\", as well as solo versions of \"FXCK UP The WORLD\" and \"rapunzel\". These titles imply themes of empowerment, individuality and bold self-expression. The video concludes with the release date \"02.28.2025\" and the website \"lalisaofficial\", indicating the upcoming event or album release. The overall aesthetic and audio elements convey a strong, avant-garde statement, celebrating uniqueness and contempt."
+ },
+ {
+ "video_id": "PI-7cebf766-934c-4759-b1d2-ca0377717b57",
+ "path": "/workspace/benchmark/video/PI-7cebf766-934c-4759-b1d2-ca0377717b57.mp4",
+ "answer": "The video begins with a visual display of the components necessary to assemble an iPhone, including a rose gold iPhone back, a white iPhone front, a black battery, and various small parts laid out on a white surface. The text \"BUILDING AN IPHONE\" is prominently displayed in a red box above the components, setting the stage for the project. A person wearing a white shirt gestures with their hands while speaking, introducing the concept of building an iPhone from individual parts. The video then develops through a series of screenshots from an AliExpress shopping cart, listing various iPhone components and their prices, such as a \"1set Full Set Metal + Screw for iPhone 13 Pro Max 14 Plus 1...\" and other essential parts. The narrator, with a calm and informative tone, discusses the idea of ordering these individual parts from AliExpress to build an iPhone, comparing the total cost of $324.82 with the prices of iPhones from retailers like Best Buy or eBay, where similar models are significantly more expensive. The core message of the video is the potential cost savings of building an iPhone from scratch using parts sourced online, emphasizing the practicality and economic benefits of this approach."
+ },
+ {
+ "video_id": "PI-f31cca0f-308e-4294-8388-cc95cba387bd",
+ "path": "/workspace/benchmark/video/PI-f31cca0f-308e-4294-8388-cc95cba387bd.mp4",
+ "answer": "This video captures a gameplay session from a Roblox game designed to emulate the \"Red Light, Green Light\" game from the series 'Squid Game'. The scene is set in a sandy, desert-like arena with a long, covered walkway. The camera follows the perspective of a player character dressed as a guard in a red jumpsuit. Another player, in a blue tracksuit with the number '019' and an apple on their head, repeatedly pushes the guard character. The game mechanics are indicated by audio cues and screen filters: a child's voice says \"Green light\" as the screen turns green, allowing movement, and \"Red light\" as the screen turns red, signaling players to stop. The blue player ignores the rules, continuing to push the guard even during the \"Red light\" phases. After being pushed multiple times, the guard character pulls out a handgun and shoots the blue player, who then falls to the ground."
+ },
+ {
+ "video_id": "PI-2e6862f0-9889-48f5-abfd-ebdb956cb8dc",
+ "path": "/workspace/benchmark/video/PI-2e6862f0-9889-48f5-abfd-ebdb956cb8dc.mp4",
+ "answer": "In the video, a person is interacting with a bright orange plush bubble sofa in a beautifully decorated modern room. There is a bed with beige bedding and a white headboard in the room, a floor lamp with a beige lampshade, and several framed abstract artworks hanging on the wall. This person was wearing a grey long-sleeved top, light grey sports pants and white slippers. At first, he was sitting on the sofa, holding a mobile phone in his hand. The words \"TEMU Bubble Sofa\" appeared on the screen, showing the brand and type of the sofa. Then this person stood up and jumped onto the sofa to show its softness and elasticity. She continued to interact with the sofa, sitting, lying down and walking around it, highlighting the comfort and unique design of the sofa. This person was also sitting on the sofa reading a magazine, wearing white headphones, further emphasizing the comfort and alluring nature of the sofa. At the end of the video, there is a close-up of the textured surface of the sofa and a wider shot of the sofa in the room. The text \"Click to Get\" appears on the screen, encouraging the audience to purchase the sofa. In this audio, a humorous and astonished female voice shares her husband's initial suspicion when he bought the sofa from Temu, thinking it was junk. However, she was pleasantly surprised by its quality and comfort, calling it \"executive-level\". She emphasized that the sofa is not only fashionable but also very comfortable, and it is a major highlight of their home."
+ },
+ {
+ "video_id": "PI-4014fd8b-7677-4381-8010-f9cb3d4985bb",
+ "path": "/workspace/benchmark/video/PI-4014fd8b-7677-4381-8010-f9cb3d4985bb.mp4",
+ "answer": "This video takes place in a small convenience store, much like a mini supermarket, which is filled with all kinds of snacks and beverages. The shelves are filled with all kinds of products, including potato chips, candies and canned foods, creating a typical atmosphere of a convenience store. The main focus is on two people: one is looking from behind and is wearing a black shirt, and the other is wearing a light blue shirt and jeans. The person in the black shirt approached the employee holding snacks and began to talk. Throughout the entire video, the caption reads: \"I just met benny blanco at the mini supermarket and tried to sing at his wedding with Selena Gomez.\" This indicates that the person in the black shirt met a famous music producer and expressed interest in performing with a renowned singer at his wedding. The background music included lyrics such as \"Who says you're not perfect? Who says you're not worth it?\", adding an inspiring and optimistic tone to the scene. This video captured the moment when people spontaneously interacted in a local convenience store, highlighting a potential opportunity related to a high-profile event. The entire theme revolves around an unexpected and exciting encounter, which may bring about an important opportunity."
+ },
+ {
+ "video_id": "PI-c7c1198a-5834-415f-80bc-0e06abbca7b4",
+ "path": "/workspace/benchmark/video/PI-c7c1198a-5834-415f-80bc-0e06abbca7b4.mp4",
+ "answer": "The video offers a vibrant glimpse into the lively atmosphere of a fast-food restaurant, both from the exterior and within the bustling kitchen. It begins with a dark-colored car parked near the drive-thru window at night, highlighting the restaurant's brick wall and a yellow bollard, while the text \"100k special\" appears on the screen, hinting at a promotional event. As the camera moves closer to the drive-thru window, the interior kitchen comes into view, revealing several employees in black uniforms actively engaged in their tasks amidst stainless steel counters, cooking appliances, and stocked shelves. The reddish-tiled floor adds to the busy ambiance. The employees are seen preparing food, walking around, and interacting with each other, showcasing their teamwork and camaraderie. This visual narrative is complemented by the ambient sound track, which captures the casual and playful banter of the young staff, with one person repeatedly mentioning the \"100k special\" and others joining in light-hearted conversation and laughter. Together, the video and audio effectively convey the dynamic and collaborative spirit of the fast-food kitchen during a work shift."
+ },
+ {
+ "video_id": "PI-f445781e-6464-497e-b25f-f4ebc910c20e",
+ "path": "/workspace/benchmark/video/PI-f445781e-6464-497e-b25f-f4ebc910c20e.mp4",
+ "answer": "In the living room, a woman is doing the push - up challenge. She is wearing a grey sports top and blue shorts and is practicing in time to the guiding music. In the background, a man is sitting comfortably in a grey armchair, laughing as he watches her and pretending to conduct her up - and - down movements.The video provides a contextual title on the screen that reads: \"In the living room, a woman is doing the \"Sally Up, Sally Down\" push - up challenge. She is wearing a grey sports top and blue shorts and is practicing in time to the guiding music. In the background, a man is sitting comfortably in a grey armchair, laughing as he watches her and pretending to conduct her up - and - down movements. A white surfboard is leaning against the nearby wall. The video provides a contextual title on the screen that reads: \"Ocean Friends PT 2 My Sally Up Sally Down Rating\", indicating that her friend is evaluating her performance. Throughout the clip, the woman remains focused, lowering and raising her body according to the commands of the song, while the man uses his fingers to conduct behind her.\", indicating that her friend is evaluating her performance. Throughout the clip, the woman remains focused, lowering and raising her body according to the commands of the song, while the man uses his fingers to conduct behind her."
+ },
+ {
+ "video_id": "PI-2e817eda-e963-4d6c-8f31-4dbf8161524a",
+ "path": "/workspace/benchmark/video/PI-2e817eda-e963-4d6c-8f31-4dbf8161524a.mp4",
+ "answer": "In the video, a little black dog is sitting on a grey textured mat indoors, wearing a colorful raincoat with a hood. This raincoat is mainly turquoise in color, with yellow and pink accents, adding a touch of playfulness to the dog's appearance. On the left side of the dog, there is a pair of bright yellow rain boots, enhancing the theme of rainy days. The background includes a white wall and a window. The light flowing in indicates that it is daytime. Throughout the video, English words appeared on the screen: \"If you don't want to defecate in the rain, raise your PAWS\" and \"If the ground is really wet\". Most of the time, the dog remains still. Occasionally, it looks around and moves its head slightly. Then the text changed to \"You don't want to use the toilet\", maintaining a humorous tone. This video humorously explains the reasons why dogs are reluctant to urinate in the rain or on damp ground. The playful text combined with the dog's cute appearance creates a light-hearted and fun story that tells of the challenge of training to use the toilet in bad weather."
+ },
+ {
+ "video_id": "PI-c98cd651-9a77-45fa-b05d-15e1b4189ec7",
+ "path": "/workspace/benchmark/video/PI-c98cd651-9a77-45fa-b05d-15e1b4189ec7.mp4",
+ "answer": "The video unfolds in two different environments: one is an indoor room, and the other is an outdoor snow environment at night. At first, a person wearing a white T-shirt and blue sweatpants opened the door of a room. The walls of the room were light in color and there was a shoe rack beside it. The scene switches to a close-up of a red Domino's pizza box with the words \"Emergency Pizza\" written on it, stating that free pizza is provided for those who want to eat pizza. That person seemed very expectant, but the pizza box on the doormat suddenly disappeared, confusing them. The audio shows that this person ordered a pizza but was told it was poisonous, which made them realize that without pizza, they were going through a period of schizophrenia. The scene transitions to the snow outside. The same person, now wearing a white vest and grey shorts, stands in the snow, holding his head in confusion. The text \"Guys with schizophrenia\" shows a red arrow pointing to them. This video profoundly captures the confusion and pain during a schizophrenia episode, integrating visual and auditory elements to convey the chaos between reality and perception."
+ },
+ {
+ "video_id": "PI-39bd7d75-fcfc-451c-8233-5e5bf43ccf26",
+ "path": "/workspace/benchmark/video/PI-39bd7d75-fcfc-451c-8233-5e5bf43ccf26.mp4",
+ "answer": "The video showcases a foldable sofa bed from Tmo, highlighting its transformation from a flat bed to a comfortable seated configuration. A person in black sets up and adjusts the sofa bed, while another individual in white joins and relaxes on it. They both enjoy the sofa bed, using pillows, blankets, and a laptop. The video concludes with the person in black folding the sofa bed back into its flat form, with the text 'CLICK TO GET IT NOW' encouraging viewers to purchase. The audio features a male speaker expressing satisfaction with the affordable and comfortable sofa bed, inviting listeners to learn more."
+ },
+ {
+ "video_id": "PI-fa334430-b64c-43c4-a6f4-0013fddfe45e",
+ "path": "/workspace/benchmark/video/PI-fa334430-b64c-43c4-a6f4-0013fddfe45e.mp4",
+ "answer": "This video captures a heartwarming compilation of two young adopted brothers experiencing Christmas for the first time. The scenes document their pure joy and excitement as they celebrate with their new family. Dressed in festive pajamas, the boys are shown running with glee towards their presents. Their reactions are full of wonder as they unwrap and play with new toys, including large monster trucks and toy lightsabers. A particularly joyful moment shows them being surprised with a large, orange ride-on toy car in the driveway, which they eagerly run towards and climb into. Back inside, they continue opening gifts, revealing a large building set and personalized armchairs. Throughout the video, their delighted screams and laughter create a lively and cheerful atmosphere, as their family lovingly watches on."
+ },
+ {
+ "video_id": "PI-cb30ec39-116e-453e-b1ce-8d22f98e1622",
+ "path": "/workspace/benchmark/video/PI-cb30ec39-116e-453e-b1ce-8d22f98e1622.mp4",
+ "answer": "The video captures a vibrant and joyful scene on a sunny day in an upscale urban shopping area, characterized by high-end stores such as Burberry and US Bank in the background. In the foreground, three individuals are seen interacting playfully. One person, dressed in a purple shirt, black and purple jacket, blue jeans, and purple shoes, holds a large purple balloon shaped like the number \"1\". Beside them, another individual in a dark blazer, light blue shirt, and gray pants holds a large pink balloon shaped like the number \"8\". Both appear to be dancing or moving energetically on a cobblestone pavement. A third person, wearing a dark polka-dotted top and a long green skirt, joins in, adding to the lively atmosphere with their energetic movements. The background features a modern building with large glass windows, palm trees, and a few passing cars, while pedestrians in the distance contribute to the bustling yet sophisticated ambiance. The audio complements this cheerful scene, capturing a light-hearted moment where the speakers are clearly enjoying themselves. The male speaker repeatedly says, \"I'm tripping,\" suggesting a sense of surprise or playful exaggeration, which enhances the overall mood of celebration and joy. Together, the visual and auditory elements create a cohesive and uplifting portrayal of friends reveling in a stylish urban setting."
+ },
+ {
+ "video_id": "PI-4d860115-e561-4c88-873d-ab492e1b8458",
+ "path": "/workspace/benchmark/video/PI-4d860115-e561-4c88-873d-ab492e1b8458.mp4",
+ "answer": "The video features an adorable orange cat sitting on a windowsill inside an office, creating a warm and inviting atmosphere. The cat, adorned with colorful, patterned glasses and a matching collar, adds a playful and whimsical touch to the scene. The office setting is filled with various office supplies, including papers, boxes, and colorful containers, with a pair of glasses hanging on the wall and a laptop visible on a desk in the background. Initially, the cat sits calmly, looking around curiously, before a hand appears from the right side of the frame and gives the cat a high five. The cat responds enthusiastically by raising its paw to meet the hand, showcasing a moment of interaction and connection. The hand then waves, prompting the cat to wave its paw back, as if saying hello, accompanied by a friendly and enthusiastic female voice that adds to the cheerful ambiance. The video wraps up with the cat continuing to wave, creating a charming and engaging scene that feels like a friendly farewell. Overall, the video captures a delightful interaction between the cat and an unseen person, highlighting the cat's playful and responsive nature in a cozy office environment."
+ },
+ {
+ "video_id": "PI-8ebac9f7-fdaa-4710-8157-99237f59d78d",
+ "path": "/workspace/benchmark/video/PI-8ebac9f7-fdaa-4710-8157-99237f59d78d.mp4",
+ "answer": "The video features a performer on stage, dressed in a white t-shirt, light-colored pants, and a black vest with a prominent 'V' design, holding a pink, fuzzy object that resembles a plush toy or a piece of clothing. The stage is dimly lit with small lights visible in the background, creating a dramatic atmosphere. The performer appears to be addressing the audience, possibly reacting to the object in their hands, with a text overlay reading, 'Someone tossed this at Drake on stage ,' indicating a humorous moment where an audience member threw the item onto the stage during a performance. The performer's body language suggests a mix of surprise and amusement. Meanwhile, the audio features a male speaker with a neutral emotional tone, contemplating the possibility of encountering evidence of alien existence. He expresses a sense of uncertainty and reluctance about the implications of such a discovery, thinking with curiosity and concern about the underlying reality of alien life. This combination of visual and auditory elements creates a unique narrative that blends humor with existential contemplation."
+ },
+ {
+ "video_id": "PI-508ca91e-aaaa-44b0-abc0-8b2c2cd75551",
+ "path": "/workspace/benchmark/video/PI-508ca91e-aaaa-44b0-abc0-8b2c2cd75551.mp4",
+ "answer": "The combined audio and video content paints a vivid and lively picture of an energetic street performance. The audio features two male singers with an exciting emotional tone, describing a spectacular crash witnessed by people at a specific location, evoking a chaotic yet festive atmosphere akin to a party, all underscored by upbeat ambient music that enhances the dynamic mood. The video complements this by showcasing a dynamic street performance on a bustling sidewalk, where a singer in a light green suit and black shirt passionately performs alongside a guitarist in a teal suit playing a red electric guitar. The scene is filled with enthusiastic onlookers, including individuals in yellow hoodies and black outfits with cross necklaces, who engage with the performer. Text overlays like \"From the Pretending I can't sing and then surprising everyone\" set the theme, while phrases such as \"sound like a circus,\" \"suite for 2,\" and \"hotel party\" hint at the song's narrative. The performer in the black outfit interacts with the audience, sharing anecdotes and encouraging participation with lines like \"everyone follow me to the top floor\" and \"put the song on repeat.\" The video concludes with the performer stating \"it ain't a party,\" as fans eagerly request the Spotify link, fostering a sense of connection and excitement among the crowd. Together, the audio and video create a rich, immersive experience that captures the essence of a lively, chaotic, and engaging street performance."
+ },
+ {
+ "video_id": "PI-1e80935a-cc92-438a-8872-f65d83a04eda",
+ "path": "/workspace/benchmark/video/PI-1e80935a-cc92-438a-8872-f65d83a04eda.mp4",
+ "answer": "The audio clip captures a joyous moment as a woman enthusiastically tells her family \"I'm pregnant,\" expressing her love and excitement while suggesting they take a photo to commemorate the occasion, with an overall cheerful and celebratory tone. In contrast, the video presents a different scene set in a room adorned with red lighting and LED strip lights on the ceiling. Here, two individuals are featured: one in the foreground wearing a maroon hoodie with a yellow outline of Texas and a cross necklace, and the other in the background in a beige hoodie, also wearing a cross necklace. The person in the maroon hoodie is animated, making a variety of facial expressions and gestures, while the person in the beige hoodie remains mostly seated, observing with minimal movements, as indicated by the text \"i make the face, you make the sound\" at the top of the screen. Despite the contrasting scenarios—one filled with the excitement of a pregnancy announcement and the other showcasing a more subdued interaction through facial expressions and gestures—both pieces of content offer unique insights into human emotions and interactions, albeit in vastly different contexts."
+ },
+ {
+ "video_id": "PI-4cbd8e0e-9669-40b4-82d0-2003778312b5",
+ "path": "/workspace/benchmark/video/PI-4cbd8e0e-9669-40b4-82d0-2003778312b5.mp4",
+ "answer": "The video features a character in a black outfit with a red laser beam standing in a vast, tiled area under a blue sky, with \"BLOCK\" and \"ABILITY\" options displayed at the bottom. The character moves forward, eliminating \"LASTSTOMP\", and approaches \"mareena\", who has a white, glowing aura and extends a red rose. An emote wheel appears, showing \"Rose Gift\". Suddenly, a green-glowing character attacks, causing an explosion, though the character in black remains standing. In the final scene, the character uses their ability, emitting a burst of red energy, and the green-glowing character attacks again, resulting in another explosion. The character in black stands alone as \"Spectate\" appears with names \"Messithegoat090\" and \"Serpent's\" before the screen fades to black. The upbeat and dynamic background music enhances the action-packed scenes, while playful and flirtatious voice lines from a female speaker, such as 'My god,' 'Do you want to be my e boy, please p,' and 'Stop it,' add a humorous and lighthearted element to the intense gameplay."
+ },
+ {
+ "video_id": "PI-918ab7e3-4892-49f1-bb85-3912a53338f6",
+ "path": "/workspace/benchmark/video/PI-918ab7e3-4892-49f1-bb85-3912a53338f6.mp4",
+ "answer": "This video shows a woman in a close-up shot managing a low blood sugar episode. Wearing glasses and a black shirt, she holds a small glass bottle of green Christmas sprinkles. According to the on-screen text, she has diabetes and her blood sugar level is 50, indicating hypoglycemia. To quickly raise her blood sugar, she pours the sprinkles directly from the bottle into her mouth. The scene captures a candid moment of daily life with diabetes, where an individual uses an unconventional but readily available source of sugar to treat a low.The clip serves as an educational moment, illustrating one of the many ways people with diabetes handle their health."
+ },
+ {
+ "video_id": "PI-929a3580-f9bf-41a4-b67b-434b38f8ab45",
+ "path": "/workspace/benchmark/video/PI-929a3580-f9bf-41a4-b67b-434b38f8ab45.mp4",
+ "answer": "The video unfolds in a hallway with light wooden flooring, where a butternut squash is placed near the base of a slightly ajar white door. Initially, the camera remains static, focused on the squash. Soon, a Siamese cat with cream-colored fur, darker points, and striking blue eyes enters from the right, approaching the squash with curiosity. As the cat stares at the squash, the camera begins to pan left, revealing more of the hallway and a kitchen area adorned with white cabinets, a stainless steel dishwasher, and a green wreath on the wall. A wicker basket is partially visible in the distance. Meanwhile, the audio features a playful male voice teasing someone named Merv about a Christmas gift, humorously suggesting it's a \"squwanch\" from Charals or Santa Bagwawa, and joking about the butternut squash's origin. The cat follows the camera's movement towards the kitchen, its tail slightly raised, before jumping off to the left and exiting the frame. The video concludes with the hallway and kitchen in view, the squash still near the door, while the audio playfully concludes with the cat being accused of lying and a warning not to hurt mama, ending in the wrong room."
+ },
+ {
+ "video_id": "PI-56f875ba-eedf-4409-a534-22f0ee5fc9e0",
+ "path": "/workspace/benchmark/video/PI-56f875ba-eedf-4409-a534-22f0ee5fc9e0.mp4",
+ "answer": "In the video, a lady with brown curly hair is wearing a white top and an exquisite pendant necklace. The scene seems to be indoors, perhaps in a room with soft lighting. The walls are white and there is a TV in the background. Throughout the video, she interacted with the camera and made various gestures, such as pointing and touching her hair. Her expressions and actions indicate that she is communicating directly with the audience, possibly sharing a personal story or information. The warm and inviting lighting creates a comfortable and intimate atmosphere. The audio in the video includes the voice of this woman, seemingly conveying a heartfelt or inspiring message. Although no exact wording was provided, the tone indicated that it was a positive and inspiring conversation, drawing attention to her language and expressions. The core theme of the video seems to revolve around personal connection and communication. The lady's direct contact with the camera and her expressive gestures indicate that she intends to establish a connection with the audience. The overall atmosphere is warm and sincere, aiming to inspire or invigorate the audience."
+ },
+ {
+ "video_id": "PI-8e8696e6-fdad-417f-a594-09b4f14b7e8f",
+ "path": "/workspace/benchmark/video/PI-8e8696e6-fdad-417f-a594-09b4f14b7e8f.mp4",
+ "answer": "This NBC News clip, recorded at the White House on March 6, 2025, features a man seated in a leather chair, dressed in a dark suit, white shirt, and bright pink tie, with an American flag pin on his lapel. The backdrop includes part of a golden statue and an American flag. The video's text overlay highlights that the speaker, discussing the topic of daylight savings time, considers getting rid of it a \"50-50 issue,\" reflecting the divided public opinion on the matter. The speaker's reflective and neutral tone indicates he is weighing the pros and cons, as heard in the audio. He elaborates on the reasons behind this division, such as concerns about children going to school in the dark and the varying preferences of the population. The speaker concludes by acknowledging the balanced nature of public opinion and expresses a willingness to address the issue despite the divided views. Both the visual and auditory elements convey the complexity and even split in public sentiment regarding daylight savings time."
+ },
+ {
+ "video_id": "PI-a3fadafd-0cc8-4a3d-a872-a00ad7b45982",
+ "path": "/workspace/benchmark/video/PI-a3fadafd-0cc8-4a3d-a872-a00ad7b45982.mp4",
+ "answer": "The video features two individuals lying down closely together, creating an intimate and cozy atmosphere. The person on the left, with pinkish hair, is partially covered by a leopard-print blanket, adding warmth and comfort to the scene, while the person on the right, with blonde hair, is positioned close by, suggesting closeness and companionship. Behind them, a collection of stuffed toys, including a blue character with large ears and a pink unicorn, enhances the playful and affectionate setting, contributing to a whimsical and tender ambiance. Complementing this visual, the background music (BGM) is performed by a female singer with a playful and mischievous tone, featuring lyrics like 'haha, you like that, you don't' and 'Now look at you now, loo,' which suggest a teasing or flirtatious mood. The BGM adds a light-hearted and humorous touch, perfectly aligning with the overall scene that conveys relaxation, affection, and mutual comfort between the two individuals."
+ },
+ {
+ "video_id": "PI-fdcc5d8e-336a-496e-8484-1c4c6afb5497",
+ "path": "/workspace/benchmark/video/PI-fdcc5d8e-336a-496e-8484-1c4c6afb5497.mp4",
+ "answer": "News correspondent Nancy Loo reports on the potential return of mask mandates in various parts of the United States. This consideration comes as a response to a significant increase in hospitalization rates. Experts attribute this surge to what they are calling a \"quaddemic,\" a concurrent outbreak of four viruses: the flu, COVID-19, RSV, and norovirus, particularly following the holiday season. The report notes that major cities like Los Angeles, New York, and Chicago are seeing these effects. A contributing factor to the severity of these outbreaks is a decrease in the overall number of people getting vaccinated, a phenomenon some are terming a \"COVID hangover.\" While some hospitals are reinstating mask mandates to manage the situation, the reporter clarifies that for the general public, wearing a mask in public spaces largely remains a personal decision. The report emphasizes that preventative measures like limited exposure and good hygiene are crucial for staying healthy"
+ },
+ {
+ "video_id": "PI-a0f77afa-d0ee-44bb-9d62-f8db59f303a6",
+ "path": "/workspace/benchmark/video/PI-a0f77afa-d0ee-44bb-9d62-f8db59f303a6.mp4",
+ "answer": "The video captures a lively scene in a restaurant characterized by its warm and inviting atmosphere, enhanced by dim lighting, wooden furniture, and multiple televisions displaying what appears to be a sports game. In the foreground, two individuals are seated at a table laden with burgers, fries, and drinks. The person on the left, wearing a black hoodie with \"MELROS FOOTBALL\" emblazoned on it, is engaged in conversation or a gesture, holding a napkin, while the person on the right, in a light-colored hoodie, is focused on enjoying their meal. The audio complements the visual with a man's voice, accompanied by laughter, set against a backdrop of beautiful music. This combination creates an ambiance reminiscent of a relaxed party, where everyone seems to be in high spirits, contributing to the overall sense of a casual, enjoyable dining experience during a sports event."
+ },
+ {
+ "video_id": "PI-104a3b88-9119-4563-a630-e4ecc4747774",
+ "path": "/workspace/benchmark/video/PI-104a3b88-9119-4563-a630-e4ecc4747774.mp4",
+ "answer": "The video captures a casual scene outside a store, where a man exits a convenience store clutching two large water bottles, surrounded by a display of various products. After leaving the store, he walks to a black car parked at a gas station, where several boxes and additional products are visible near the entrance. He opens the driver's side door, places the water bottles inside, and then sits in the driver's seat, appearing ready to drive, with the car's interior, including the steering wheel and dashboard, clearly visible. The footage highlights the actions of individuals, all wearing black t-shirts, as they interact with their cars and water bottles, offering glimpses into typical moments at the store. Complementing the visuals, the audio features a conversation between two men, characterized by a neutral tone without obvious emotional expression, though the second half includes a man's laughter, suggesting a relaxed and easygoing exchange. The combination of the video's mundane yet relatable setting and the audio's laid-back conversation creates a snapshot of everyday life, where simple tasks like buying water and preparing to drive are intertwined with casual interactions."
+ },
+ {
+ "video_id": "PI-bdc99456-9c28-4855-9b88-be401245f68a",
+ "path": "/workspace/benchmark/video/PI-bdc99456-9c28-4855-9b88-be401245f68a.mp4",
+ "answer": "This video takes place in a cozy living room with wooden floors, a guitar leaning against the wall, and a dining area can be seen in the background. The protagonist is a woman with long golden hair. At first, she wore a black long-sleeved top and white leggings, but later changed to olive green leggings. She actively demonstrated the fit and features of the skinny pants, turned around to show the back view, and used expressive gestures to highlight specific design elements. Throughout the video, the lady discussed the hip-lifting leggings designed by Laura, highlighting their built-in straps and diamond-shaped stitching, which help hide certain areas. She mentioned that leggings made her feel confident and they were the staple food in her closet. This audio clip includes her passionate comments on the benefits of skinny pants, such as how they improve her figure and provide comfort. The key text appeared on the screen, which read, \"New users can enjoy a maximum discount of 30%.\" The core theme of this video is to promote and showcase the features of hip-lifting skinny pants and encourage viewers to take advantage of the discount offers. The lady's dynamic movements and detailed explanations are designed to convince potential customers of the effect and style of this leggings."
+ },
+ {
+ "video_id": "PI-27c02f56-cab3-4c63-a810-e2d53b253e32",
+ "path": "/workspace/benchmark/video/PI-27c02f56-cab3-4c63-a810-e2d53b253e32.mp4",
+ "answer": "In the video, Vice President Harris is standing in front of a wooden table, with a man in a suit standing behind him. The background is decorated with a chandelier and a flag printed with the seal of the Vice President of the United States. The text above reads, \"Vice President Harris signed on her desk drawer.\" In her speech, Vice President Harris reviewed her role as the 49th Vice president of the United States and emphasized the long-standing tradition of vice presidents signing their desks. She expressed great honor in continuing this tradition, acknowledged the significance of their work, and pointed out that it was meaningful and had influenced people they might have never met before, who might never know their names but would always be grateful to them. She thanked everyone for their contributions and excitedly announced that she was going to take out a marker pen to sign on the table. The text on the screen captured her speech, emphasizing the importance of tradition and the impact of their work. Then she sat down, took out the marker pen and shouted loudly, \"Woo - woo - woo!\" Say excitedly and then sign on the desk drawer. At the end of the video, she held a marker pen and wrote, \"OK, sign.\" It marked the end of the signing ceremony. The overall theme of the video is pride, tradition and gratitude for the meaningful work accomplished."
+ },
+ {
+ "video_id": "PI-1ca40334-4d5c-4ea9-b81b-c892344c2d58",
+ "path": "/workspace/benchmark/video/PI-1ca40334-4d5c-4ea9-b81b-c892344c2d58.mp4",
+ "answer": "The video features a conversation between two individuals in a room with a red and white background, discussing how to say and write the Japanese word for \"power.\" The first person, dressed in a black suit with golden buttons, initiates the discussion by asking, \"How do you say 'power' in Japanese?\" The second person, wearing a red hat and a black shirt, responds with \"Chikara.\" The conversation then shifts to how to write the word, with the first person starting to draw on a whiteboard. The core focus of their interaction, as echoed in the audio where one person asks questions and the other answers, is on ensuring the correct pronunciation and writing of the Japanese term. Throughout the video, they engage in repeated discussions to confirm accuracy, demonstrating an educational communication centered around learning Japanese. The audio captures the essence of this exchange, with one person asking, \"How do you write it?\" and correcting pronunciations like \"No, that's ka\" and \"No, that's chikara,\" highlighting the collaborative effort to master the language."
+ },
+ {
+ "video_id": "PI-1eda0369-e413-463f-b089-b5703a864c4c",
+ "path": "/workspace/benchmark/video/PI-1eda0369-e413-463f-b089-b5703a864c4c.mp4",
+ "answer": "The video features a woman with long, wavy black hair, adorned with large hoop earrings and a white top, set against a backdrop of shelves filled with various beauty or skincare products. Initially, she is seen adjusting her hair and touching her face, drawing attention to her skin concerns, such as redness and hyperpigmentation. She then introduces the ERBORIAN CC Cream, holding the tube up to the camera to highlight its packaging. Demonstrating its application, she shows how the cream, which starts white, blends seamlessly into her skin, adapting to her skin tone and providing a natural finish. Throughout the video, she enthusiastically shares her skincare and makeup routine, emphasizing the product's effectiveness in covering redness, acne, and hyperpigmentation while maintaining a natural look. She praises the CC cream for its soothing properties, SPF 25, and lightweight feel, expressing her satisfaction and confidence in using it. The video focuses on the transformation of her skin, showcasing the product's ability to achieve a flawless, natural look, aligning perfectly with her enthusiastic voiceover."
+ },
+ {
+ "video_id": "PI-892ee280-929f-4b24-9e35-aeef36a0fe03",
+ "path": "/workspace/benchmark/video/PI-892ee280-929f-4b24-9e35-aeef36a0fe03.mp4",
+ "answer": "This first-person video shows a playful interaction in a kitchen. On a black and white granite countertop, a small, round robot toy moves around, pushing spice jars toward the person filming. The on-screen text reads, 'My niece helping me cook,' humorously personifying the device. The person's hand, clad in a gray sleeve, reaches out to take the spices provided by the robot. In the background, various kitchen items are visible, including a large package of ramen noodles and other seasoning containers. As the video progresses, a blue bowl containing raw, cubed meat is placed on the counter, suggesting meal preparation is underway. The scene highlights a fun and modern way of interacting with technology during a daily task like cooking."
+ },
+ {
+ "video_id": "PI-f91f7f0e-2f7f-4219-a0d6-ade5c47128de",
+ "path": "/workspace/benchmark/video/PI-f91f7f0e-2f7f-4219-a0d6-ade5c47128de.mp4",
+ "answer": "The video opens with a brief, first-person perspective of someone running along an asphalt road. The scene then transitions to a wide, scenic view of a long, empty highway stretching into the distance. To the right of the road, a dense forest displays a mix of evergreen trees and others with vibrant yellow and orange autumn leaves. Towering over the landscape is a massive, rocky mountain range, its slopes illuminated by sunlight and its peaks lightly dusted with snow. The sky above is blue with a few scattered white clouds. Accompanying the visuals is a gentle, melodic song with male vocals, creating a peaceful and contemplative atmosphere that enhances the majestic natural scenery"
+ },
+ {
+ "video_id": "PI-a7a77074-dfdb-4789-9ac4-071d147c21a8",
+ "path": "/workspace/benchmark/video/PI-a7a77074-dfdb-4789-9ac4-071d147c21a8.mp4",
+ "answer": "The video features a woman standing in a modern, clean, and well-organized kitchen, equipped with white cabinets, a stainless steel microwave, and a stove. She is dressed in a stylish black and white dress with a diagonal color split and a patterned headscarf, creating a welcoming atmosphere. The countertop holds a bottle of olive oil and a few other small items, while a decorative sign on the wall reads 'KITCHEN' in a cursive font, adding a touch of charm. The woman is passionately speaking, using expressive hand gestures to emphasize her points. Her speech, captured in the audio, includes powerful statements such as, 'Attention everyone, this is Mufasaa. What about Jesus Christ? That is the Messiah. Jehovah God, Provider. You are busy looking at Mufasa. How could God walk faster than you? You are focusing on the Lion King instead of the Lion of Judah. You are trying hard to press back Send, but you are busy looking at Simba and...' The core message of her speech emphasizes the importance of focusing on spiritual beliefs rather than being distracted by secular entertainment, creating a compelling and thought-provoking narrative."
+ },
+ {
+ "video_id": "PI-b609b388-c662-4667-9a0f-fd0408716cda",
+ "path": "/workspace/benchmark/video/PI-b609b388-c662-4667-9a0f-fd0408716cda.mp4",
+ "answer": "The video takes place in a modern restaurant, where a person wearing a light-colored sweater is sitting at a table. There is a plate with leftovers on the table, including some vegetables and sauce, as well as a bottle and a fork. This person seems to be smiling or responding cheerfully, occasionally covering his mouth with his hand. The background shows a high chair and other interacting people, suggesting a casual family or friendly gathering. There was a voice in the audio saying, \"Marco, you ate on the charger. Please don't know what happened there.\" This seems to be addressed to the person sitting at the table, pointing out their mistake of eating from the charger plate. The text on the video further emphasizes this point, which reads \"She is eating on the charger plate\", accompanied by a smiling emoji. The overall atmosphere is light-hearted and humorous, highlighting the interesting mistakes made by the people in the video. The core theme revolves around a humorous dining incident where a person accidentally mistures a charging pad for a dinner plate, causing laughter and banter from others."
+ },
+ {
+ "video_id": "PI-000360ac-8c43-4115-9f4c-332899bbdbb3",
+ "path": "/workspace/benchmark/video/PI-000360ac-8c43-4115-9f4c-332899bbdbb3.mp4",
+ "answer": "This video was shot in a sports shoe store, with the main focus being a pair of custom-made sports shoes on display. This pair of sports shoes features a unique design in olive green, white and black, paired with pink LACES. The conversation centered around the origin of these sneakers. The person wearing the \"Flint Hill\" sweatshirt explained that these sneakers were custom-made by their cousin who collaborated with Bape. The discussion also involved the customization of other brands such as New Balance. The text on the screen provides the background of the conversation, such as \"TRAVIS SCOTT BAPESTAS WHERE\", \"My cousin likes his work at BAPE and he customized them\", \"Anything like this is sweet\", emphasizing the appreciation of the unique design. Although they appreciate the design of the sports shoes, people wearing sweatshirts say they are reluctant to buy them, just like the words \"I just won't buy this.\" Overall, the video explores the theme of custom-made sports shoes, highlighting the appreciation of unique designs and the integration of personal preferences in sports shoe culture."
+ },
+ {
+ "video_id": "PI-e53b36dc-6a56-479a-802b-0020c8c129e7",
+ "path": "/workspace/benchmark/video/PI-e53b36dc-6a56-479a-802b-0020c8c129e7.mp4",
+ "answer": "The video depicts a person undergoing the process of getting a small cross tattoo on their cheek. The individual is seen lying down, dressed in a light gray Nike hoodie, while a tattoo artist, wearing gloves, meticulously works on the design using a tattoo machine. The artist periodically wipes the area with a tissue to keep it clean. The camera provides a close-up view of the tattoo being created, showcasing the fine lines of the cross. Throughout the procedure, the person remains still. The background is filled with tattoo equipment and supplies. Meanwhile, the accompanying audio features a woman speaking in English, though her emotional tone and the content of her speech remain unclear, consisting of fragmented sounds or expressions. The ambient music in the audio has an uncertain rhythm and atmosphere, contributing to a neutral or ambiguous backdrop. After the tattoo is completed, the video shows a close-up of the finished cross on the person's cheek. The individual is later seen wearing a red sweater, with the tattoo still visible, as the video concludes with another close-up of the person's face, emphasizing the tattoo."
+ },
+ {
+ "video_id": "PI-35aaa19d-e82f-4209-ad2f-140f1ee9e04a",
+ "path": "/workspace/benchmark/video/PI-35aaa19d-e82f-4209-ad2f-140f1ee9e04a.mp4",
+ "answer": "This video takes place in a restaurant with three anthropomorphic animal characters: a red and white character wearing a black bow tie, a yellow character also wearing a black bow tie, and a brown character. The red and white character explains to the yellow character that being a waiter is \"the easiest job in the world\" because all they have to do is \"fill the drinks\". As the red and white character speaks, the text appears on the screen. The yellow character looks very confused and responds with \"pshhhhhhh\", indicating disbelief or confusion. The red and white character then asks, \"Would you mind reexplaining it?\" and further, \"Which part do you think you don't understand?\" Deciding to demonstrate, the red and white figure says, \"You know what? Let me show you the evidence. Now you have a try.\" As the roles are reversed, the text is displayed on the screen. The yellow character agrees to have a try and says, \"OK.\" The focus shifts to the brown figure sitting at the table. The yellow character attempts to fill the brown character's drink, and the word \"Hydration\" appears on the screen. The yellow character comments, \"Look, this isn't difficult.\" The red and white character responds, \"Oh, yes. Go.\" At the end of the video, the red and white character says, \"OK, OK\", and the corresponding text appears. The video humorously depicts the challenge of a simple task, emphasizing the importance of clear instructions and the possibility of misunderstanding. The playful interactions and mishaps among the characters add a layer of comedy, underscoring that even the simplest tasks can have unexpected complexities."
+ },
+ {
+ "video_id": "PI-bdeed0d4-e881-4666-8fea-89bd64ab491c",
+ "path": "/workspace/benchmark/video/PI-bdeed0d4-e881-4666-8fea-89bd64ab491c.mp4",
+ "answer": "The video unfolds in an elegantly furnished living room, complete with a modern ceiling fan, plush light-colored sofas adorned with cushions, a round glass coffee table with a gold stand, and a stylish patterned rug laid over a polished wooden floor. Near the coffee table, a small, fluffy dog with white and brown fur sits attentively, its eyes fixed on the central figure. A woman, dressed in a sleek black short-sleeve top, light blue jeans, and crisp white sneakers, commands the center of the room. Her ensemble is accessorized with multiple bracelets on both wrists and a bold, chunky necklace, while her hair is neatly tied back in a ponytail. She engages in a series of dynamic and expressive hand and body gestures—pointing downward with both index fingers, clapping her hands, placing her hands on her hips, and making various fluid movements with her arms—all contributing to a playful and confident routine. This vibrant scene is perfectly complemented by a hip-hop track playing in the background. The song, delivered with a confident and playful tone, features lyrics that exude self-assurance and charm in a frivolous and relaxed manner. The track's energetic rhythm and the singer's charismatic delivery enhance the woman's lively performance, creating a harmonious blend of visual and auditory elements. Together, they encapsulate a moment of personal expression and allure, set against the backdrop of a comfortable and stylish home environment. The overall atmosphere is one of casual sophistication and joyful self-confidence, capturing the essence of modern, expressive living."
+ },
+ {
+ "video_id": "PI-2471926b-c613-4aa8-afd5-256be61ccc8f",
+ "path": "/workspace/benchmark/video/PI-2471926b-c613-4aa8-afd5-256be61ccc8f.mp4",
+ "answer": "The video was shot in a classroom. A group of students sit at their desks, creating a lively and charming atmosphere. The room is lit by fluorescent lights, and there is a clock on the wall. The on-screen text reads \"When a room full of people sang our arrangement 'Humanity' to us\" accompanied by heart-eye expressions, marking a moment of enjoying and appreciating music together. Some students hold water bottles and notebooks, and they appear to be singing or taking part in a music activity. The atmosphere is full of vitality and enthusiasm, with many students smiling and actively participating. The camera moves around the room, capturing students from different groups, all of whom are part of this collective musical experience. The background music is Michael Jackson's \"Human Nature\", and the students sing along with an arrangement that may have been taught or played by someone in the classroom. The theme of the video is the joy and friendship that music brings: the students gather to sing \"Humanity\", creating an unforgettable and warm moment. The video emphasizes the power of music to bring people together and create a sense of unity and excitement."
+ },
+ {
+ "video_id": "PI-e8d8e9f7-8c95-4214-9315-9d2677aed935",
+ "path": "/workspace/benchmark/video/PI-e8d8e9f7-8c95-4214-9315-9d2677aed935.mp4",
+ "answer": "The video captures a lively and festive night scene, featuring a decorated vehicle—likely a truck or trailer—covered in colorful string lights that glow in shades of blue, green, yellow, and red. This dazzling display creates a vibrant atmosphere, enhanced by the dark background which accentuates the brightness of the lights. Seated at the back of the vehicle is a person dressed in a Santa Claus costume, further contributing to the celebratory mood. The vehicle moves slowly, suggesting its participation in a parade or holiday event, embodying the cheerful spirit of the festivities. Accompanying this visual spectacle is a piece of background music characterized by a moderate rhythm and a slightly tense atmosphere, adding a subtle layer of intensity to the overall joyful scene."
+ },
+ {
+ "video_id": "PI-f90fbf54-dddb-4cb4-9f50-a0200f8586d2",
+ "path": "/workspace/benchmark/video/PI-f90fbf54-dddb-4cb4-9f50-a0200f8586d2.mp4",
+ "answer": "The video opens with an uplifting background song sung by a female voice, without narration. A woman holds a large, heart-shaped pink cake in a dimly lit, decorated room. The cake has a single lit candle and a decorative miniature crown on top. The woman, who is wearing a white sweater with black bows, smiles as she takes the crown off the cake and places it on her head. In the background, a sign reads 'Happy birthday' next to balloons. She then leans in to blow out the candle. As she blows, the video transitions to a slow-motion, stylized shot with a heavy overlay of sparkling glitter and colorful bokeh lights. The final shot shows the woman smiling after the candle has been extinguished, with the glitter effect still present."
+ },
+ {
+ "video_id": "PI-a117f55b-1235-43fe-add1-4b764fc407ac",
+ "path": "/workspace/benchmark/video/PI-a117f55b-1235-43fe-add1-4b764fc407ac.mp4",
+ "answer": "The video captures a vibrant scene of two young women dancing energetically in a modern, well-lit living room. The woman in the foreground, dressed in a black short-sleeved top, black shorts, and black socks, moves in sync with her companion in the background, who wears a white hoodie, black shorts, and black socks. Their synchronized dance moves, characterized by rhythmic hand raises and dynamic movements, create a lively and cheerful atmosphere. The living room's aesthetic is enhanced by a tiled floor, a staircase with white railings, and a chandelier, adding a touch of elegance. A white pillow on a metal chair near the staircase and a potted plant contribute to the cozy ambiance. Meanwhile, the audio complements the visual with energetic background music, full of rhythmic and melodic elements that amplify the dynamic and engaging atmosphere, making the entire scene feel like a joyful celebration."
+ },
+ {
+ "video_id": "PI-b3add5ca-ed86-4b56-91d8-0a4a4a8d2587",
+ "path": "/workspace/benchmark/video/PI-b3add5ca-ed86-4b56-91d8-0a4a4a8d2587.mp4",
+ "answer": "The video unfolds in a vivid montage of dynamic scenes, with a consistently dark background that emphasizes the themes within it. It begins with a long-haired woman in a black hat and clothes, moving vigorously and setting the tone for vivid movement. The scene then moves to another woman, wearing a black and green sweatshirt emblazoned with the word \"JADED\" and the number \"13\", making expressive gestures to the camera and adding a layer of personal interaction. This woman also wears a black beanie with white text, maintaining the visual theme. Next, a woman in a patterned jacket and dark pants dances rhythmically, contributing to the dynamic energy of the video. The story progresses to a group of three women in a dimly lit environment, participating and moving together, each dressed differently: one wearing a white beanie, another in a grey hoodie, and the third in a black jacket, demonstrating a sense of unity and cooperation. The final scene features three women posing with their hands up, exuding confidence and camaraderie. Overlaid on this visual tapestry, the audio features emotionally uncertain female vocals, accompanied by ambient music with a neutral rhythm and atmosphere. The vocals occasionally produce fragmented sounds or utter \"F\", lacking clear semantic content and adding a layer of mystery to the overall experience."
+ },
+ {
+ "video_id": "PI-6a7e3f7f-abc8-4ad1-bcbb-b69b01235f90",
+ "path": "/workspace/benchmark/video/PI-6a7e3f7f-abc8-4ad1-bcbb-b69b01235f90.mp4",
+ "answer": "The video captures a moment of surprise and excitement in a modern home, enhanced by a diverse and dynamic audio backdrop. It begins in the kitchen, where a man in a white and beige sweater and light blue jeans enters with another person, carrying a large black and white polka-dotted cloth. The scene transitions to the living room, where a couple sits on a black leather couch, eagerly anticipating the reveal. As the cloth is unfurled, it uncovers a massive TCL 4K UHD Smart TV box, measuring 98 inches. The couple reacts with shock and disbelief, exclaiming in awe at the size of the TV, which is as big as their wall. The video wraps up with the man still marveling at the enormous TV, highlighting a moment of pure excitement and surprise. Meanwhile, the audio features multiple speakers with varying emotions and languages, adding to the vibrant atmosphere. A male speaker expresses an unknown emotion while mentioning 'Christmas tree' and 'This' in English, accompanied by ambient music with a neutral tempo. Another male speaker, speaking in Korean, conveys happiness, while additional male speakers in Korean and Chinese express unknown emotions with brief utterances. The combination of visual surprise and multilingual audio creates a rich, engaging experience."
+ },
+ {
+ "video_id": "PI-cc0084a3-2566-4c17-a8d6-46d66505eee0",
+ "path": "/workspace/benchmark/video/PI-cc0084a3-2566-4c17-a8d6-46d66505eee0.mp4",
+ "answer": "In a harmonious blend of audio and video content, the combined narrative vividly depicts a series of light-hearted and playful interactions. The track features multiple female speakers filled with joyful, if ambiguous, emotions. They engage in fun and mysterious exchanges, humorously commenting on someone's appearance and sharing quirky self-identities like \"black panther\" and \"wizard\", while mentioning pronouns in a relaxed tone. Another female speaker joins in briefly, but her contribution is unclear, adding to the casual, humorous atmosphere of the scene. Meanwhile, the video takes place in the bedroom where the girls are talking. A girl in a short-sleeved black top stands in front of the camera and acts in a quirky way: she grabs her neck with her hands and then starts dancing, which makes the girls behind her laugh so hard they can barely control themselves, completely immersed in the moment. A row of subtitles at the top of the video reads \"try not to cringe but we deleted everyone but allies parts\". Overall, both the audio and video help create a relaxed and enjoyable atmosphere that gives viewers a real sense of the joy of everyone involved."
+ },
+ {
+ "video_id": "PI-e5767457-5882-4417-a291-aa3816d91dd8",
+ "path": "/workspace/benchmark/video/PI-e5767457-5882-4417-a291-aa3816d91dd8.mp4",
+ "answer": "The video depicts a man in a casual, relaxed setting, seated on a gray sofa in a living room with wooden flooring. A small table in front of him holds a plate of food and a water bottle, and the text \"People that immediately fall asleep after eating:\" is displayed at the top of the screen, setting a humorous tone. The man, dressed in a black t-shirt, black shorts with white stripes, and a brown cap, begins to lean back on the sofa, embodying the stereotype of post-meal lethargy. In the background audio, a male speaker, with a neutral or slightly tired emotional tone, describes a casual family scene where he is advised by a female speaker to grab a pillow and sleep on the sofa after eating, as he tends to fall asleep every time he eats. The atmosphere is relaxed and informal, with no background music. As the video progresses, the man grabs a pillow and a gray blanket, pulling the blanket over himself. He adjusts it to cover his body and rests his head on the pillow, appearing to drift off to sleep, perfectly illustrating the humorous text on the screen. The background includes a wall clock and two wooden wall shelves with small plants, adding to the cozy, homey ambiance."
+ },
+ {
+ "video_id": "PI-4baabea5-45aa-4389-be4c-19e07b348c45",
+ "path": "/workspace/benchmark/video/PI-4baabea5-45aa-4389-be4c-19e07b348c45.mp4",
+ "answer": "The video features a person sitting on a bed in a room with a neutral-colored wall and a ceiling light, wearing a beige hoodie with 'ESSENTIALS' printed in black and a black balaclava. They examine a light green, perforated mask with a unique design, rotating it to show different angles before putting it on and adjusting it for a proper fit. Next, they pick up a red fabric with a black web pattern, resembling a Spider-Man mask, and carefully place it over the light green mask, ensuring the eye holes and web patterns are correctly aligned. Meanwhile, the audio clip plays a musical segment with two voices carrying a contemplative and sincere tone, reflecting on vulnerability and mutual support. The voices question whether others will offer help when needed and express a willingness to reciprocate, emphasizing the importance of trust and reciprocity in interpersonal relationships. The video concludes with the person fully wearing the red Spider-Man mask over the light green mask, making final adjustments to ensure a proper fit."
+ },
+ {
+ "video_id": "PI-0e9a8476-00e7-4fae-802d-a84f839b081b",
+ "path": "/workspace/benchmark/video/PI-0e9a8476-00e7-4fae-802d-a84f839b081b.mp4",
+ "answer": "The audio features a neutral narration of a conversation between two women discussing where to go, maintaining an unemotional tone throughout. In contrast, the video unfolds in a cozy home setting, where a black and white dog lounges in a wicker basket, becoming the focal point of the scene. The background alternates between plain white walls and a tiled wooden floor, while interior furniture like cabinets and sofas contribute to the homely ambiance. The dog's relaxed and curious demeanor, craning its head to look up at the camera, adds a playful and engaging element. Soft, warm lighting enhances the inviting atmosphere, and although no significant English text appears, the sounds of voices from the TV and a woman teasing her dog create a lighthearted and affectionate mood. While the audio remains emotionally neutral, the video playfully emphasizes the bond between the dog and its owner, crafting a warm and relaxed environment that contrasts yet complements the audio's straightforward conversation."
+ },
+ {
+ "video_id": "PI-abd4ffa4-0955-496b-b261-5bef8d19b513",
+ "path": "/workspace/benchmark/video/PI-abd4ffa4-0955-496b-b261-5bef8d19b513.mp4",
+ "answer": "This video captures the devastating wildfire in Malibu, California, presenting a series of tense and tragic scenes. From a bird's-eye view through a plane window, one can see the vast city below under the fierce glow of wildfires raging in the suburbs. On the ground, a building is swallowed by flames, which consume its structure with intense heat. The scene then switches to a group of firefighters wearing yellow protective gear, carrying firefighting tools and walking towards the fire through the smoky, barren environment. The video also shows a convoy of emergency vehicles, including ambulances and fire engines, speeding along a road with flashing lights, highlighting the coordinated emergency response. Amid the chaos, a helicopter can be seen flying over the mountainous area, dropping water or flame retardant on the burning ground below. Throughout the video, the lack of a direct water supply is a significant challenge, highlighting the difficulties faced by the firefighters. The key OCR text identifies the location as Malibu, California, and attributes the footage to \"anuru Corson via Storyful\" and \"ABC NEWS\". The core theme of the video is the relentless battle against wildfires, showcasing the bravery of the firefighters and the extensive damage caused by the fires."
+ },
+ {
+ "video_id": "PI-96b8738b-dc9a-4d76-93f6-8a2763f56c4c",
+ "path": "/workspace/benchmark/video/PI-96b8738b-dc9a-4d76-93f6-8a2763f56c4c.mp4",
+ "answer": "In this short video, a man performs a rhythmic dance in a bathroom setting. The background features a unique blue and white shower curtain and a white door, typical of an indoor environment. The man is wearing a long-sleeve yellow shirt with a matching reflective safety vest, as well as a black beanie and blue pants. His movements are dynamic, involving various arm gestures and body sways that suggest performance. A prominent text overlay reads 'Name 1 thing you brought to school by accident,' indicating a common interactive or meme-like content often seen on social media platforms. The video is accompanied by a lively and upbeat music track that drives the rhythm of the dance, creating an energetic and relaxed atmosphere. The audio primarily consists of cheerful background music, with no evident speech or dialogue. Overall, the presentation is clear and focused on the individual's performance and the overlaid text prompt."
+ },
+ {
+ "video_id": "PI-bd0e980b-a6e7-411c-a2df-bf9150650a51",
+ "path": "/workspace/benchmark/video/PI-bd0e980b-a6e7-411c-a2df-bf9150650a51.mp4",
+ "answer": "The video unfolds in a modern, minimalist clothing store, where a person in a white hoodie featuring a large black heart design on the back stands with their back to the camera, paired with olive green cargo pants. Another individual, clad in a black and white patterned sweater, engages playfully with the hoodie, seemingly drawing or writing on it, adding a creative twist to the design. Overlaying the scene, the text \"bro really thought that he cooked\" (suggesting a humorous or impressed reaction to the hoodie's design) is prominently displayed, enhancing the light-hearted atmosphere. In the background, other shoppers browse the racks, and a large poster of a person in a blue outfit adorns the wall. This casual and playful moment is complemented by an audio track featuring a male singer with a playful and confident tone. The lyrics convey a sense of pride and satisfaction, likely about a creative or mischievous act, delivered with humor and amusement. Together, the video and audio create a cohesive narrative of creative expression and accomplishment, capturing a moment of joy and camaraderie in a retail setting."
+ },
+ {
+ "video_id": "PI-7ab618a8-4719-4781-a8b4-758e2644da14",
+ "path": "/workspace/benchmark/video/PI-7ab618a8-4719-4781-a8b4-758e2644da14.mp4",
+ "answer": "A woman with long, wavy black hair and blue eyes is shown in a close-up shot, sitting inside a car. She is wearing a white, low-cut top, gold hoop earrings, and several gold necklaces. Her makeup is pronounced, and she has long, dark red nails. Throughout the video, she lip-syncs to a fast-paced hip-hop song with a female vocalist. She uses expressive hand gestures, pointing at the camera, touching her chest, and covering her mouth as she laughs. Her facial expressions change rapidly, conveying surprise, confidence, and amusement. The background consists of the car's interior, with a glimpse of trees visible through the side window. An on-screen text caption reads, 'how cute.. u thought a lawsuit could stop me from being a part of the bop house'."
+ },
+ {
+ "video_id": "PI-c57f355a-2eaf-43e6-8c7d-7bad8d768343",
+ "path": "/workspace/benchmark/video/PI-c57f355a-2eaf-43e6-8c7d-7bad8d768343.mp4",
+ "answer": "The video features a person in a car, wearing a beige cap and black shirt, ranking items to wipe their butt with. The list starts with 'WET WIPES' as number one. 'PAPER TOWEL' is placed at number three, with the person noting it doesn't feel great. 'UNDERWEAR' is discussed next, with the person sharing outdoor experiences where sacrificing underwear was necessary. 'SOCKS' follow, with similar sentiments.\nA 'HOUSE CAT' is humorously included, with the person admitting they prefer dogs. 'LEAF' is mentioned as a woodland alternative. 'TOILET PAPER' is ranked number two, described as perfect. 'SHIRT' is seen as another sacrifice. Finally, 'NOTHING' is ranked number ten, with the person questioning who would use their bare hands.\nThroughout, the person's gestures and expressions add humor, creating an engaging experience. The screen displays a numbered list with images for each item."
+ },
+ {
+ "video_id": "PI-7629796c-cc0b-4d52-9495-e812d0e322ff",
+ "path": "/workspace/benchmark/video/PI-7629796c-cc0b-4d52-9495-e812d0e322ff.mp4",
+ "answer": "The video begins with a close-up of a white shopping bag being opened, revealing an iPad box inside, with the text 'iPad App Recs' appearing on the screen. The iPad Air and its accessories are laid out on a bed, and upon turning it on, the home screen displays various apps. A cheerful female speaker enthusiastically shares her top app recommendations for iPad users, starting with 'Stardew Valley' for cozy gaming, where a hand is seen building a farm and exploring the town. She then highlights 'Procreate' for advanced drawing, showcasing the creation of a colorful blueberry pattern and a digital painting of a bouquet of flowers, as well as designing wallpapers. The video also features handmade bookmarks. The scene transitions to a digital planner on the iPad, showing a detailed template for tracking daily habits, goals, and tasks, with a cozy setting of a lit candle. The video concludes with various apps displayed on the iPad home screen, emphasizing creative, organizational, and leisure apps like 'GoodNotes' for digital planning, 'Libby' for free eBook access, and 'Notion' for life organization, all presented with excitement and positivity. The ambient music remains neutral with a steady tempo, creating a calm and engaging atmosphere."
+ },
+ {
+ "video_id": "PI-95f4ec87-20bd-49b3-a542-6db5d47a1c55",
+ "path": "/workspace/benchmark/video/PI-95f4ec87-20bd-49b3-a542-6db5d47a1c55.mp4",
+ "answer": "The video and audio together vividly capture the energetic atmosphere of an indoor basketball game. The video shows two young men in black t-shirts and gray sweatpants engaged in a competitive one-on-one match on a polished wooden court with red accents. The offensive player dribbles the ball, makes a move towards the basket, and successfully completes a layup despite the defender's attempt to block the shot. The crowd of spectators lining the walls reacts with cheers and applause, highlighting their engagement and enjoyment. The game continues with the same players, the offensive player making another successful layup, prompting further cheers. The audio complements this by capturing the ambient sounds of the game, such as balls bouncing and shoes squeaking on the court, along with occasional human reactions like laughter and cheers. The speakers, likely male, express emotions ranging from amusement to excitement, emphasizing the lively and engaging atmosphere. Together, they convey the dynamic nature of the basketball game and the enthusiastic responses of the spectators, creating a comprehensive depiction of the event."
+ },
+ {
+ "video_id": "PI-dbb7c6d6-221a-48ce-a90a-57bacc985326",
+ "path": "/workspace/benchmark/video/PI-dbb7c6d6-221a-48ce-a90a-57bacc985326.mp4",
+ "answer": "The video offers an interior tour of a Mazda car set against the backdrop of a rainy day. The initial scene features an individual seated in the passenger seat, observing the external environment through the wet window. The visual narrative then transitions to a comprehensive display of the car's cockpit, detailing the steering wheel and driver’s instruments while highlighting the car's modern design elements. The camera continues to showcase the door panels, presenting the integrated audio system speakers, then focuses on the central console, which includes the gear selector and infotainment controller. The tour concludes with a wider shot that covers the passenger seat area, clearly showing the vehicle's sunroof and emphasizing the spaciousness and amenities of the car's interior. Throughout the video, the rain on the window cleverly frames the vehicle's interior as a comfortable and protected space."
+ },
+ {
+ "video_id": "PI-8ffc4915-58b7-43af-aa92-f0656352d314",
+ "path": "/workspace/benchmark/video/PI-8ffc4915-58b7-43af-aa92-f0656352d314.mp4",
+ "answer": "This video showcases a vibrant and energetic performance by four clowns in an urban setting, possibly an open square or street area, with a tall modern building in the background. The clowns are dressed in colorful, eclectic costumes, including bright shirts, vests and exaggerated trousers, complemented by clown makeup, which adds a mischievous and whimsical touch to their appearance. Their energetic dance movements demonstrate their coordination and enthusiasm. The audio consists mainly of the lively sounds of the clowns' performance. Although no specific dialogue is clearly distinguishable, the overall atmosphere is festive and charming, possibly accompanied by cheerful music that complements their dance. Overlaid text at the bottom of the screen reads \"We just performed at the Grammy Awards Ceremony\", indicating that this performance was either part of the celebration or a retrospective of their appearance at the Grammy Awards ceremony. The core theme of the video is the joyous and enthusiastic celebration of the clowns' performance at the ceremony. Overall, the video captures a vibrant and artistic moment, celebrating the clowns' participation in a highly anticipated event."
+ },
+ {
+ "video_id": "PI-1bd5eb7e-811d-4954-9a46-9769a8621d3d",
+ "path": "/workspace/benchmark/video/PI-1bd5eb7e-811d-4954-9a46-9769a8621d3d.mp4",
+ "answer": "The audio features a discussion about skin color involving multiple people, including women and children, setting the stage for an exploration of diversity and perception. The video, set in a classroom adorned with a blue wall, educational posters, and a green bulletin board, provides a visual context for this conversation. A woman with braided hair, dressed in a blue shirt and patterned cardigan, sits at a desk engaging with a preschooler. The text overlay highlights their dialogue about skin color differences, as the child compares their skin tones by pointing at the woman's hand and their own. The woman acknowledges her brown skin and playfully asks the child what color they think she is, to which the child responds with \"white\" and then \"yellow.\" The woman gently corrects the child, who curiously rubs their arm to see if the color rubs off, a moment described as \"Too cute\" in the text overlay. Throughout the interaction, the woman remains warm and engaging, fostering an open and educational discussion about skin color differences, which aligns with the broader theme of the audio content."
+ },
+ {
+ "video_id": "PI-02c6abc1-6713-4be5-a8e2-935ffb86a3d8",
+ "path": "/workspace/benchmark/video/PI-02c6abc1-6713-4be5-a8e2-935ffb86a3d8.mp4",
+ "answer": "In a captivating blend of audio and video, a puzzling pricing conundrum is brought to light. The audio features a male voice-over with a distinct tone of surprise and confusion. The speaker questions the seemingly illogical pricing of cooked chicken, which is priced at $4.99, and compares it to a two-pack of whole chickens that costs $16.97, expressing bewilderment at how the cooked option is cheaper than the raw one. The video complements the audio by providing a visual representation of the pricing disparity. It takes viewers through the deli and fresh food sections of a supermarket. In the deli section, a package of Kirkland brand rotisserie chicken is clearly marked at $4.99. The package boasts that it contains USDA Grade 'A' chicken and is gluten-free, MSG-free, and free of artificial flavors and colors. The camera then pans over to the fresh produce section, where a whole chicken is displayed in a refrigerated case. A two-pack of whole chickens is marked at $16.97. The narrator in the video echoes the surprise heard in the audio, openly questioning why cooked chicken is priced lower than raw chicken. The footage also offers a glimpse of the clean and bright supermarket environment, showing supermarket shelves and other shoppers. Key OCR text, such as the price tag \"$4.99\" and the product description for the cooked chicken, as well as the price tag \"$16.97\" and the product label details for the raw chicken, further highlights the confusion caused by the price difference, making for an intriguing exploration of supermarket pricing logic."
+ },
+ {
+ "video_id": "PI-8d63b8c8-941a-4f8e-99e2-84889bcca5d0",
+ "path": "/workspace/benchmark/video/PI-8d63b8c8-941a-4f8e-99e2-84889bcca5d0.mp4",
+ "answer": "The video unfolds in a family kitchen, centered on a stainless steel refrigerator decorated with colorful letter magnets. The protagonist's hand is visible as he opens the refrigerator door, revealing a bottle of Coffee Mate French Vanilla Cold Foam Creamer sitting on a shelf. The hand reaches in and retrieves the creamer bottle, bringing it into clearer view. When the camera pans down, a black cat sits on the wooden floor, adding a subtle element of family life to the scene. The words \"After seeing that Coffee-Mate commercial\" are prominently displayed at the top of the screen, indicating the inspiration behind the character's behavior. The character then walks towards a trash can and throws away the creamer bottle. At the end of the video, the lid of the trash can is closed, emphasizing the act of discarding the product. The core theme of the video revolves around the character's reaction to the Coffee-Mate advertisement, leading to the purchase and subsequent disposal of the product, which may imply dissatisfaction or a change of mind. The lack of dialogue or background music lets viewers focus on the visual narrative and the character's actions."
+ },
+ {
+ "video_id": "PI-583f742e-543a-49be-9135-25144b1c16dc",
+ "path": "/workspace/benchmark/video/PI-583f742e-543a-49be-9135-25144b1c16dc.mp4",
+ "answer": "The audio features a male speaker, his voice laden with concern and urgency, as he stands beside a wheelbarrow overflowing with vape pens, highlighting the vast quantity gathered from young people nationwide since Christmas. He passionately advocates for increased awareness and decisive action against the vaping epidemic, urgently calling upon parents and society at large to confront this pressing issue head-on. The video complements this narrative by showcasing the same man, dressed in a gray shirt and cap, standing behind a yellow wheelbarrow filled to the brim with colorful vape pens. The backdrop includes a trailer bearing the prominent text \"AMERICA VS ADDICTION,\" along with logos for Roku, Apple TV, and Fire TV Stick, indicating a broader campaign against addiction. The trailer also announces a TV launch on June 30 and displays the website \"http://www.americavsa.com.\" The man initially stands upright, then leans forward slightly, and eventually bends down to pick up a few vape pens from the wheelbarrow, holding them up to the camera to emphasize the sheer volume collected from youth across the country. Throughout the video, he continues to speak and gesture earnestly, underscoring the gravity of the vaping problem among young people and urging America to wake up to the issue. The visual focus remains squarely on the man and the wheelbarrow filled with vape pens, effectively driving home the message about the vaping epidemic and the need for immediate action."
+ },
+ {
+ "video_id": "PI-56e93267-2111-4d3b-a61f-cf084061d749",
+ "path": "/workspace/benchmark/video/PI-56e93267-2111-4d3b-a61f-cf084061d749.mp4",
+ "answer": "The audio track serves as a lively and energetic backdrop, featuring male vocals that exude high-energy excitement, complemented by rapid, rhythmic beats and catchy, repetitive phrases such as \"Wow WOW that's fucking movie my gosh\" and \"The type of shit I'm on, you wouldn't understand,\" effectively creating a vibrant and intense atmosphere designed to energize and captivate the audience. In contrast, the video content showcases a person in black gloves operating a Kuvings juicer within a modern kitchen environment, beginning with the placement of a spout cap onto the juicer and the subsequent pouring of various thick, colorful liquids into clear glasses and containers. The sequence progresses with the pouring of red and orange liquids, followed by a red liquid with visible pulp, and then another red liquid into a glass, all surrounded by fresh ingredients like cranberries, a lime, and a pineapple. The scene further transitions to the pouring of green and orange liquids, separating the juice from the pulp, while on-screen text reads 'This is your sign to start juicing in 2025! ' and 'click the link in our bio to get started,' encouraging viewers to embrace juicing in the new year and providing a clear call to action for more information. Together, the audio's exhilarating energy and the video's engaging demonstration of juicing create a dynamic and motivating experience for the audience."
+ },
+ {
+ "video_id": "PI-cc3b1b60-d248-4069-8b40-8a7a7b22d54a",
+ "path": "/workspace/benchmark/video/PI-cc3b1b60-d248-4069-8b40-8a7a7b22d54a.mp4",
+ "answer": "This video shows a special offer on an Android tablet with a keyboard. At the beginning of the advertisement, a person holds a tablet computer close to the fountain, highlighting its fashionable design. The scene switches to a tablet used with a stylus, followed by a close-up of the date and time displayed on the screen. Then, a keyboard accessory was displayed on the tablet to emphasize its function. The video highlights the outstanding 10.1-inch IPS display and the latest Android 13 system of this tablet computer. In addition, additional accessories such as protective covers, wireless mice and keyboards were also displayed, presenting the entire set of products. The powerful performance and speed of this tablet computer are touted as an ideal choice for both work and leisure. A person is using a tablet computer with a keyboard to browse various applications and products on the screen. This video encourages viewers to order this tablet at home and instructs them to click on the link below the video to find it on Temu. Then, this person sat on the sofa, holding a tablet computer, and seemed very satisfied with this device. The end of the video shows a person lying down and enjoying a tablet, reinforcing the message of clicking on the video link to make a purchase. From the audio of this video, it can be heard that the entire audio is about introducing the relevant content of the tablet, allowing the audience to better understand its performance."
+ },
+ {
+ "video_id": "PI-e6aa57c2-d8e1-43c8-bc8a-03d97d83fa4b",
+ "path": "/workspace/benchmark/video/PI-e6aa57c2-d8e1-43c8-bc8a-03d97d83fa4b.mp4",
+ "answer": "The video features a scene with two individuals standing by a pool at a modern, upscale house characterized by large glass windows and multiple floors. While the audio includes a male speaker producing unclear phrases and a female speaker briefly chiming in, with no discernible semantic content or emotional tone, the visual elements tell a more vivid story. Both individuals are dressed in sports jerseys—one in yellow and blue, the other in blue and white—and are engaged in playful activities, covering their faces with pink powder and making expressive hand gestures. The person on the left adds a whimsical touch with yellow Pikachu-themed slippers. Overlaying the scene is a text that reads, \"Our last TikTok at the bop house because it's actually getting banned,\" suggesting a farewell to the location. The background reveals a well-lit interior with contemporary furnishings and a visible staircase, contributing to the upscale ambiance. Despite the lack of clear communication in the audio, the overall atmosphere in the video is light-hearted and playful, capturing the individuals enjoying their final moments creating content at this distinctive location."
+ },
+ {
+ "video_id": "PI-235df21c-1a70-4170-916d-b946211da23a",
+ "path": "/workspace/benchmark/video/PI-235df21c-1a70-4170-916d-b946211da23a.mp4",
+ "answer": "The video shows a Crispi Ninja air fryer in operation, placed on a wooden surface, with its transparent cooking chamber providing a clear view of the food cooked inside. The exterior of the air fryer features a stylish black and white design with the brand names \"CRISPI\" and \"NINJA\" highlighted at the top. The humorous text overlay, \"It feels illegal to observe the inside of an air fryer,\" adds an interesting element to the visual narrative. The lens focuses on the air fryer to capture the cooking process from different angles, highlighting the modern design and functionality of the device. The audio track, meanwhile, features a smooth dialogue between two girls, with one asking questions in a conversational tone, suggesting that an engaging exchange may revolve around inquiries, related to air fryers or cooking in general. The combination of visual and audit elements creates an engaging and informative experience that blends the air fryer's functional demonstration with a relaxed conversation."
+ },
+ {
+ "video_id": "PI-a34613e3-7cce-41b9-8d68-f3d0ba77e558",
+ "path": "/workspace/benchmark/video/PI-a34613e3-7cce-41b9-8d68-f3d0ba77e558.mp4",
+ "answer": "The video begins with a close-up of a person in a suit, setting a reflective tone that aligns with the audio's message about relationships. It transitions to a silhouette of a couple holding hands, symbolizing the connection and companionship that the speaker values. The scene then shifts to a man in a leather jacket holding white roses as a woman approaches, suggesting anticipation or reconciliation—a visual metaphor for being cherished. Following this, a person in a black hoodie walks away, evoking solitude, contrasting with the speaker's emphasis on not being taken for granted. The serene night scene with a full moon and city lights creates a peaceful yet melancholic atmosphere, reflecting the introspective nature of the audio. A view of a Ferris wheel under a cloudy sky emphasizes grandeur, but the audio reminds us that true value lies in everyday reassurances, not grand gestures. Finally, bustling city streets and highways highlight constant motion, yet the audio conveys that what matters most is being valued and appreciated in the small, daily moments."
+ },
+ {
+ "video_id": "PI-735dbdf5-ee65-40e7-b4ee-f70c3b5d0486",
+ "path": "/workspace/benchmark/video/PI-735dbdf5-ee65-40e7-b4ee-f70c3b5d0486.mp4",
+ "answer": "The video takes place in an indoor setting with a modern, industrial aesthetic, featuring a dark ceiling with exposed beams and numerous hanging light bulbs, complemented by string lights that create a warm and inviting atmosphere. A person in a bright red puffer jacket, green sweater with a graphic design, camouflage pants, and yellow boots stands in the background, while another person in a beige trench coat, white top, and light blue and white plaid scarf moves energetically in the foreground. The person in the foreground makes various hand gestures and movements, appearing to dance or express themselves joyfully, while the person in the red jacket remains mostly stationary, occasionally shifting their posture slightly. This interaction suggests a playful and lively dynamic, with the energetic movements of the person in the foreground being the primary focus. The overall mood is cheerful and vibrant, enhanced by the warm lighting. Meanwhile, the BGM track features a male singer with a warm and engaging tone, delivering a heartfelt expression that possibly revolves around love or connection, as hinted by the phrase 'You know what?' The core message of the song conveys a sense of intimacy and personal reflection, complementing the lively and joyful atmosphere of the video."
+ },
+ {
+ "video_id": "PI-57021679-5fb5-4b8b-9145-71bbd3a032b0",
+ "path": "/workspace/benchmark/video/PI-57021679-5fb5-4b8b-9145-71bbd3a032b0.mp4",
+ "answer": "The video shows a woman standing on the steps in front of a white door marked with the number \"55\", wearing a brown leather jacket, a cropped white hoodie and baggy blue jeans, complete with multiple gold necklaces, a belt with a chain and a white turban. Holding a brown hat, she strikes a variety of dynamic poses, including tilting the hat and stretching out her arms, exuding a confident demeanor and stylish presence in an urban environment. The background is red brick walls and peeling paint on the doors, adding a gritty urban feel to the scene. Meanwhile, the voice of a woman in the audio speaks in a tone that reflects anxiety and unease, expressing the feeling of being watched, which she attributes to her own anxiety. The ambient music is presented in a moderate rhythm, creating a tense atmosphere that contrasts with the confident visual presentation and adds a layer of psychological depth to the overall narrative."
+ },
+ {
+ "video_id": "PI-65a7495b-578e-4ca6-88d1-6ce7276c63d7",
+ "path": "/workspace/benchmark/video/PI-65a7495b-578e-4ca6-88d1-6ce7276c63d7.mp4",
+ "answer": "This video tells of a humorous and candid moment of a grandmother and her grandson in a comfortable family environment. The scene unfolds in a warmly decorated living room, with the staircase and various personal items visible in the background, creating a familiar and intimate atmosphere. This grandmother, wearing brightly colored and patterned clothes, holding stacks of finished clothes in her hand, approached the camera with a mischievous and amusing expression. The text on the video reads: \"Every time I get my grandma ready:\", laying the foundation for a relaxed interaction. When the old lady approached the camera, she joked, \"You say...\" I said you dressed like a hoe, but I used to dress like a hoe too, but not as sacred as you guys. Her tone was full of deep affection and humor, indicating that her relationship with her grandson was close and comfortable. This conversation is like a talk about fashion choices. The grandmother recalls her past style while gently teasing her grandson about what he looks like now. This video captures the love and humor of the family, highlighting the grandmother's lively personality and the warmth of their interaction. The entire theme revolves around the mischievous and loving dynamics between the grandmother and her grandson."
+ },
+ {
+ "video_id": "PI-fbe2cbe7-8413-4305-859c-930ec17cb12f",
+ "path": "/workspace/benchmark/video/PI-fbe2cbe7-8413-4305-859c-930ec17cb12f.mp4",
+ "answer": "This short 2D animation depicts a humorous and melodramatic conflict over food in a restaurant. The scene opens with a pink waiter serving a salad to a blue, cone-shaped character named Boggo, and a plate of fries to an orange, round character named Boe. When Boe refuses to share their fries, Boggo, who had the chance to order their own, decides to take them by force. Boe's pleas turn to desperation as they attempt to protect their meal, going to extreme lengths to build a metal barricade complete with barbed wire. However, Boggo's long purple arm effortlessly breaks through the defense. In a swift move, Boggo grabs the entire plate of fries and consumes them in a single, large gulp. The animation concludes with a wide shot of the restaurant as Boe lets out a prolonged, agonized scream of \"NOOOOOOOOOO!!!!\""
+ },
+ {
+ "video_id": "PI-6f9928a0-6607-4a38-89d7-8f7ed92d509a",
+ "path": "/workspace/benchmark/video/PI-6f9928a0-6607-4a38-89d7-8f7ed92d509a.mp4",
+ "answer": "In this heartfelt video, we witness a poignant visit to a stepmother with dementia in a hospital room. The scene opens with a view of the room, featuring a checkered floor in muted tones and a hospital bed where the elderly woman lies, covered with a white blanket and wearing a patterned hospital gown. The text overlay sets a touching context: \"POV: You visit your stepmom with dementia who raised you since 2 years old because you're having a 'I need my mom' kind of day thinking she won't be able to tell you what you need to hear but then this happens... 🥺💕🥺\" This indicates the visitor's deep emotional need and the expectation of limited communication due to her condition. The room's warm, soft ambiance is enhanced by a peach-colored curtain and a white pillow, with the woman's gray hair spread out on the pillow as she rests in a state of semi-consciousness, her eyes occasionally opening and closing. Meanwhile, the audio captures a gentle and affectionate voice, likely the visitor's, repeatedly calling out \"Mom\" and expressing love and care, urging her mother to lie down and keep warm. Towards the end of the video, a hand reaches out to gently touch the woman's hand, adding a layer of tenderness and highlighting the emotional bond between them. The overall atmosphere remains calm and intimate, emphasizing the quiet, loving moment shared in the hospital room, despite the stepmother's condition."
+ },
+ {
+ "video_id": "PI-4d2b161b-194f-42b1-ae12-dde7ace0886a",
+ "path": "/workspace/benchmark/video/PI-4d2b161b-194f-42b1-ae12-dde7ace0886a.mp4",
+ "answer": "This video was shot in a brightly lit cafe or bakery, which is characterized by a clean and modern interior. The setup includes a display cabinet filled with various pastries and baked goods, a cash register, as well as some plants and decorations, adding a cozy atmosphere. The text on the video reads \"pov: You work next to the hospital\", indicating that this cafe is located near the hospital and may serve hospital staff and visitors. A woman with long dark hair, wearing a black top and grey trousers, is working behind the counter and interacting with a customer. The background shows a man in a dark uniform, who might be a security guard or a hospital staff member, making a phone call near the entrance. This woman seemed to be processing an order or making a payment. At one point, she touched her hair, perhaps expressing her thoughts or frustration. Another woman in a white top and blue jeans is working backstage, possibly preparing food or drinks. This video documents a typical day of a person working in a cafe near a hospital, highlighting the interaction and environment of the workplace. Dear Lord, when I arrive in heaven, let me... My face makes me want to die. This sentence appears at the bottom of the picture, suggesting a humorous or exaggerated mood, namely the challenge of working in such an environment."
+ },
+ {
+ "video_id": "PI-a08ab2c9-6bfd-412e-a4b3-982265821f1f",
+ "path": "/workspace/benchmark/video/PI-a08ab2c9-6bfd-412e-a4b3-982265821f1f.mp4",
+ "answer": "The video and audio together present a poignant and somewhat disappointing scene inside a home. The video depicts an elderly woman in light-colored pajamas seated on a step at the bottom of a staircase, clutching a colorful blanket, with several wrapped gifts beside her, hinting at a recent or ongoing celebration. A man in a dark long-sleeve shirt and plaid pants stands next to her, leaning in closely to offer comfort and support, his posture radiating concern. A small dog stands on the step below, adding a touch of companionship. As the camera pans upwards, a noticeable hole in the ceiling is revealed, indicating some form of damage or incident. The focus then shifts to a close-up of the elderly woman, who appears to be speaking with a wry smile, likely sharing her thoughts or feelings about the situation. The audio captures this moment of disappointment or resignation, conveying that something anticipated or attempted has ultimately failed or not succeeded, adding an emotional layer to the visual narrative."
+ },
+ {
+ "video_id": "PI-868a201b-d6c2-4cd1-bf4c-5ceb41e7044a",
+ "path": "/workspace/benchmark/video/PI-868a201b-d6c2-4cd1-bf4c-5ceb41e7044a.mp4",
+ "answer": "In the video, two people, Xander and Getty, have an interesting and competitive cup stacking game in a modern living room with wooden floors and dark brown cabinets. At the beginning of the competition, the two contestants stood at the two ends of two rows of parallel cups respectively - the blue one represented Zander and the red one represented Getty. A transparent container with a yellow edge is placed at the starting point to collect cups. Their goal is to be the first one to remove this row of cups and put them into the container. After the game started, Zander and Getty quickly took out cups from their respective teams and put them into containers. The competition was extremely intense, and both contestants demonstrated impressive speed and dexterity. Getty started from the far end and Xander started from the near end. They were evenly matched in their efforts to stack cups into containers. The race continued. Both players maintained a fast pace, but when they approached the finish line, Getty gained a slight edge. The atmosphere became tense as they raced against time to put the last cup into the container. Finally, Getty successfully got the last cup, ensuring the victory of the game. The video captured the excitement and competitive spirit of the cup-stacking challenge, highlighting the participants' quick responses and determination."
+ },
+ {
+ "video_id": "PI-8e8bbb6c-17cd-41eb-8c75-ceb94efdc870",
+ "path": "/workspace/benchmark/video/PI-8e8bbb6c-17cd-41eb-8c75-ceb94efdc870.mp4",
+ "answer": "The video captures a group of five individuals gathered around a wooden table in a cozy indoor setting, with a wreath adorning the wall in the background. They are intently focused on a laptop screen, as indicated by the text overlay, 'POV: the US logging back into TikTok this morning,' suggesting they are experiencing the return of TikTok access in the US. The atmosphere is initially filled with anticipation and curiosity. As the video progresses, the group's reactions become increasingly animated; a person in a red top leans in closely, displaying excitement or surprise, while others raise their arms in celebration or astonishment. The overall mood shifts from anticipation to joy and excitement. Meanwhile, the background music features a stable rhythm and a neutral atmosphere, providing a subtle backdrop that complements the scene without conveying any specific message. Together, the video and audio create a cohesive narrative of shared experience and collective emotion, highlighting the group's engagement and delight in the moment."
+ },
+ {
+ "video_id": "PI-5ad4a989-b6e2-4539-9f54-a2c0f21efbe1",
+ "path": "/workspace/benchmark/video/PI-5ad4a989-b6e2-4539-9f54-a2c0f21efbe1.mp4",
+ "answer": "The audio and video content present a contrasting yet interconnected narrative. The audio tells a lighthearted and relaxed story about the Senate's request to the chairman for more time to speak, creating an atmosphere that invites people to join in the fun. However, the video depicts a formal hearing setting in Washington, D.C., on January 30, 2025, where Mr. Chairman, identified as Sen. Chuck Grassley (R-I.A.), is seen holding a piece of paper and speaking into a microphone during the Patel confirmation hearing. The text overlay explicitly states that a senator is asking for more time for questions, aligning with the audio's mention of the Senate's request but in a much more serious and formal context. Mr. Chairman, dressed in a dark suit with a red vest and a striped tie, is seated at a wooden desk with a nameplate, surrounded by other individuals in suits, some standing and some seated, reinforcing the gravity of the occasion. Despite the audio's humorous and relaxed tone, the video maintains a formal and serious atmosphere throughout, with Mr. Chairman's statements, such as \"before we make,\" \"I'm just telling you,\" \"I'm going to abuse it,\" \"Well, I'm not going,\" and \"to let you abuse it,\" appearing at the bottom of the frame, further underscoring the seriousness of the hearing. The contrast between the audio's playful narrative and the video's formal setting highlights the complexity of political proceedings, where even requests for more time can carry significant weight and implications."
+ },
+ {
+ "video_id": "PI-dd8300e4-7ce3-4a22-a203-f964962b010a",
+ "path": "/workspace/benchmark/video/PI-dd8300e4-7ce3-4a22-a203-f964962b010a.mp4",
+ "answer": "The video begins with a siblings standing close together in a hallway, creating a personal and intimate scene. The man is dressed in a light blue t-shirt, while the woman wears a black dress with white polka dots. They hold hands, and the woman has her arm around the man's waist. Text overlays reveal a humorous yet awkward context: 'Question I asked a lot because I am dating my brother,' followed by 'Do your parents know? Yes but they aren't supportive :)'. This light-hearted moment contrasts with the subsequent professional setting. The scene transitions to a workshop where a man, wearing a dark jacket with a logo, stands with the text 'ivan@lcsign.com' displayed at the top. He holds a large, colorful sign reading 'POWER SMOKE' in bold yellow letters with a flame design. As he gestures towards the sign, explaining its features, the background reveals another person working in the distance, emphasizing the workshop environment. The audio complements this shift, featuring a male speaker with an unknown emotional tone briefly mentioning a phrase in an unclear context, followed by another male speaker with a neutral tone discussing business. The second speaker clarifies that their company has been a fabricator for fourteen years, contrary to any misconceptions, and details customizable product features including font, style, and color. He references a purchase process involving a quote and an LC sign, focusing on product specifications and transactional details. The video effectively juxtaposes personal humor with professional expertise, highlighting both the couple's playful dynamic and the intricate details of the sign fabrication business."
+ },
+ {
+ "video_id": "PI-9fc8c830-b3ed-4c23-8aee-7151f4b5749b",
+ "path": "/workspace/benchmark/video/PI-9fc8c830-b3ed-4c23-8aee-7151f4b5749b.mp4",
+ "answer": "The video features two women enjoying a leisurely moment at an outdoor table under a clear blue sky, with a stunning mountainous landscape in the background. The first woman, clad in a brown fur vest over a knitted sweater, holds a glass of yellow beverage and has a plate with a sandwich and roasted potatoes in front of her. The second woman, with long blonde hair and dressed in a dark outfit with a fur collar, also holds a similar drink and has a plate of roasted potatoes. The scene is vibrant, with other patrons visible in the background, soaking up the sunny day. The women are engaged in a conversation, with the first woman gesturing and the second listening intently, capturing a relaxed and casual atmosphere of social interaction. Meanwhile, the background music (BGM) features a female vocalist with a soft, soothing, and contemplative tone, delivering gentle and reflective lyrics that evoke a sense of emotional introspection and longing. This subtle and melancholic ambiance perfectly complements the serene and intimate setting of the video, enhancing the overall mood of leisure and connection."
+ },
+ {
+ "video_id": "PI-6716d5e6-147a-45ec-aa4b-b5444beb7c6a",
+ "path": "/workspace/benchmark/video/PI-6716d5e6-147a-45ec-aa4b-b5444beb7c6a.mp4",
+ "answer": "The video was shot in a bustling restaurant, decorated with green garlands, colorful lights and balloons for the festival, creating a joyous and celebrative atmosphere. The main focus was a group of people, especially one dressed in black, who was dancing and walking enthusiastically in the restaurant, holding a tray with drinks on it in his hand. This person is performing various dance moves, including squats and spins, which adds to the festive atmosphere. In the background, others are sitting at the table, enjoying the performance. Some people captured this moment with their mobile phones, while others cheered and raised their hands to celebrate. The diners were dressed in casual clothes. Some of them also wore festive ornaments such as reindeer antlers and glowing glasses, adding to the festive atmosphere. The audio in the video features cheerful music, complementing the energetic dance performance. The casual chat and laughter of the diners added to the lively and cheerful atmosphere. This video captures the essence of the festival gathering, highlighting the joy and excitement shared by this group of people. Overall, this video showcases a vibrant and cheerful restaurant event, centered around an energetic dance performance, bringing smiles and cheers from the audience."
+ },
+ {
+ "video_id": "PI-192f908c-f740-44b9-8c49-1477b5e5f768",
+ "path": "/workspace/benchmark/video/PI-192f908c-f740-44b9-8c49-1477b5e5f768.mp4",
+ "answer": "The video showcases a close-up view of a person's lower body, highlighting their legs and feet adorned with dark pants and brown shoes featuring a decorative pattern along the seam, while a light gray knitted sweater is also visible in the foreground. Set against a backdrop of a wooden floor with a yellow line running vertically through the center, indicative of a gymnasium or sports facility, the scene maintains a consistent focus on the lower body. Overlaid with the static text, 'what's a haircut that automatically makes a girl attractive? ', the video prompts viewers to contemplate attractive hairstyles for girls, creating an intriguing visual and intellectual contrast. Meanwhile, the audio complements the video with soft and ambient background music (BGM) that sets a reflective and contemplative mood, allowing viewers to ponder the question at hand in a calm and introspective atmosphere. The integration of the video's visual elements and the audio's serene BGM effectively enhances the overall experience, encouraging viewers to engage in thoughtful reflection on the topic presented."
+ },
+ {
+ "video_id": "PI-c4d06814-a085-4190-8608-613ef06d5c74",
+ "path": "/workspace/benchmark/video/PI-c4d06814-a085-4190-8608-613ef06d5c74.mp4",
+ "answer": "This video shows a striking scene where a person walks down the large brick staircase of an elegant brick house. The front of the house is decorated with a large arched window above the entrance door, with white columns on both sides, illuminated by black lantern-like lamps. This person was wearing a long red and black coat, light blue flared trousers and a grey wide-brimmed hat, and walked down the stairs gracefully. They adjusted their coats as they walked, exuding confidence and a sense of fashion. The background music ingeniously enhances the atmosphere, and the gentle melody complements the peaceful night environment. The lit lights and the warm glow emanating from inside the house added a welcoming atmosphere to the scene. The well-considered actions of the characters and the elegant environment create a cohesive visual narrative. The core theme of the video revolves around a serene and elegant moment, capturing the essence of a peaceful night in a complex environment. There is no text in the video. Visual and auditory elements occupy the central stage, emphasizing a serene and fashionable atmosphere."
+ },
+ {
+ "video_id": "PI-f025619c-476b-4f61-b5b1-027c2b250292",
+ "path": "/workspace/benchmark/video/PI-f025619c-476b-4f61-b5b1-027c2b250292.mp4",
+ "answer": "In a workshop setting, a vocal coach provides hands-on instruction to a young female student. The coach, dressed in a black shirt and jeans, stands beside the student, who wears a light blue sweater and a lanyard identifying her as part of the 'Texas AllState Thespians.' The lesson focuses on a specific vocal exercise, with the coach encouraging the student to 'raise your ya-ya-ya,' as indicated by the text overlay on the screen. The coach places her hand on the student's abdomen and uses hand gestures to guide her in projecting her voice upwards. The student follows along, singing the 'ya-ya-ya' phrase in a high, operatic tone while raising her own hand to visualize the ascending pitch. Her expression shifts from concentration to surprise at the high notes she produces. "
+ },
+ {
+ "video_id": "PI-c1cb93b0-65b0-4c00-8d39-d1b2812c1e85",
+ "path": "/workspace/benchmark/video/PI-c1cb93b0-65b0-4c00-8d39-d1b2812c1e85.mp4",
+ "answer": "The video showcases a person fully immersed in the art of DJing, wearing a brown and white checkered shirt and a blonde wig, while operating a Pioneer DJ controller. With focused hand movements, they adjust various knobs and sliders, occasionally raising their hands in expressive gestures to emphasize the rhythm or transition between tracks. This scene, set indoors against a plain wall and doorway, conveys a strong passion for music and a dedication to crafting a seamless mix. Complementing the visual, the audio features energetic background music with vibrant singing, creating a dynamic atmosphere. Together, the rhythmic and melodic elements of the audio enhance the overall experience, reflecting a commitment to engaging the audience through creative and dynamic musical expression."
+ },
+ {
+ "video_id": "PI-6d92f172-04b2-46bb-a9b0-f96b364d0ef0",
+ "path": "/workspace/benchmark/video/PI-6d92f172-04b2-46bb-a9b0-f96b364d0ef0.mp4",
+ "answer": "In the video, a person wearing a red hoodie and ripped jeans is washing a grey car smeared with blue soapy water. The man is scrubbing the side of the car, which seems to be parked in a car wash or garage. The text on the video reads: \"I'm glad I became popular, but I will never do such a thing again.\" This indicates the person's complex feelings about this situation. This indicates that they are also expressing a sense of relief because they no longer have to repeat this task. Then the camera switches to a close-up of the car door, showing the words \"Damn it, leopard print spray paint\" written in soap foam. This playful and provocative message implies that people are considering or hoping to give the vehicle a leopard print paint job, adding a humorous turning scene. This video captures the contrast between the laborious task of car washing and the creative and rebellious idea of customizing the car with eye-catching paint. There is a fragmented phrase in the audio, \"I'm on red wrong member baby BA\". This sentence might be part of a song or a casual comment, adding an informal atmosphere to the video."
+ },
+ {
+ "video_id": "PI-2ec0a2d0-9ba9-43ce-b6b0-6152012b2b1a",
+ "path": "/workspace/benchmark/video/PI-2ec0a2d0-9ba9-43ce-b6b0-6152012b2b1a.mp4",
+ "answer": "The video captures a press conference at the University of Connecticut (UCONN), featuring a backdrop with the UCONN and Dunkin' logos. The scene centers around a discussion about Paige Bueckers' new hairstyle, with a person with dreadlocks wearing a dark blue UCONN jacket speaking into a microphone. The text overlay on the video reads, \"I like it, y'all like it? yea\" and \"IM CRYING\" with crying emojis, indicating a humorous or touching moment. A water bottle is placed on the table in front of the speaker. Another individual with blonde hair tied back, also in a dark blue UCONN jacket, listens attentively with their chin resting on their hand. The interaction between the two suggests a light-hearted and emotional exchange, possibly in response to a question or comment about Paige's new look. The person with dreadlocks covers their face with their hand, possibly in laughter or emotional overwhelm, while the person with blonde hair continues to listen and react. The overall atmosphere is one of shared excitement and emotional connection among the participants, capturing a moment of camaraderie and humor during the press conference. The audio content reflects the emotional and excited reactions of the speakers, with one noting, \"Did you recognize Paige when she came out with no braid, yeah, it was crazy to see. I thought I slipped back and I was like, wait, she's missing something but she still look good. I like it.\" The video effectively combines these emotional exchanges with visual cues to create a vivid depiction of the event."
+ },
+ {
+ "video_id": "PI-09f981ee-6142-46ab-8b3d-7dc31da1d20f",
+ "path": "/workspace/benchmark/video/PI-09f981ee-6142-46ab-8b3d-7dc31da1d20f.mp4",
+ "answer": "The video unfolds in a rehabilitation or medical facility, capturing the ongoing effort and dedication in the rehabilitation process. A person, dressed in a gray and white hooded jacket, gray pants, and bright yellow socks, is undergoing physical therapy with the assistance of a therapist wearing a dark gray shirt and blue pants. The therapist holds the walker and provides guidance, adjusting it as needed and closely monitoring the person's safety as they move forward using the walker for support. The background features a hallway filled with medical equipment, cabinets, and another person sitting in a chair, also donning yellow socks, while a green strip runs down the middle of the floor, likely serving as a guide for walking exercises. The scene repeats, with the same individual and therapist continuing their work, the person making steady progress with the therapist's unwavering support. Meanwhile, the background music on the audio track effectively conveys the determination and ambition of those involved. The combination of the visual elements, showcasing the patient's progress and the therapist's support, and the audio, expressing a strong sense of resolve, creates a compelling narrative that highlights the challenges and triumphs of the rehabilitation journey."
+ },
+ {
+ "video_id": "PI-e27d9cfa-c07c-4690-9931-887d11956130",
+ "path": "/workspace/benchmark/video/PI-e27d9cfa-c07c-4690-9931-887d11956130.mp4",
+ "answer": "The video captures a performance on a slippery boardwalk, with palm trees faintly visible in the misty background. A woman, wearing a blue jacket and a black bucket hat, performs a series of dance moves, demonstrating fluidity and precise body control. Around her, other people are seen walking along the path, some stopping to watch the dancer while others continue with their activities. Along the path, there is a building with a light turquoise facade and graffiti on the walls, adding vibrancy to the urban coastal setting. The camera remains focused on the dancer, and with each step, she gradually moves closer to the camera, eventually capturing her entire face in a close-up shot. The scene unfolds with a calm and natural atmosphere, showcasing the interaction between the individual’s performance and the public environment. The background music perfectly complements the scene, with the dancer’s movements syncing with the rhythm of the music."
+ },
+ {
+ "video_id": "PI-0af294b8-775e-4461-9709-3b821426783c",
+ "path": "/workspace/benchmark/video/PI-0af294b8-775e-4461-9709-3b821426783c.mp4",
+ "answer": "The video captures a political event featuring Trump speaking at a podium in a suit and red tie, with the background shifting from a dimly lit stage with red and blue lighting to a brightly illuminated one displaying a large screen with the word \"TikTok\" in white letters. The text overlay \"Trump going to save TikTok\" hints that the speech addresses the future of the social media app or related policies, creating a serious and focused atmosphere around communication, technology, and political decision-making. Meanwhile, the audio features a male speaker discussing the success of their TikTok presence with a mix of neutral and happy emotions, highlighting billions of views and a record-breaking response. He ultimately suggests that the platform might be worth maintaining due to its positive impact, which, though seemingly unrelated to the video's political context at first glance, could imply a broader discussion about the value and influence of social media platforms in today's society, possibly intertwining with the political themes presented in the video. The integration of these two elements, the political speech and the TikTok success story, creates an intriguing narrative that touches on both technology and politics, inviting viewers to consider the multifaceted impact of social media."
+ },
+ {
+ "video_id": "PI-57c360bd-07e1-42d6-ad28-a1fd7b75fb2f",
+ "path": "/workspace/benchmark/video/PI-57c360bd-07e1-42d6-ad28-a1fd7b75fb2f.mp4",
+ "answer": "The audio features a cheerful female speaker enthusiastically narrating her process of growing and hatching sea monkeys and aqua dragons, detailing the steps involved such as preparing water with a purifier for the sea monkeys, adding eggs for both creatures, and observing their development over a few days. She highlights that the aqua dragons required feeding earlier than the sea monkeys and notes visible progress in their growth. The video visually complements this narrative, starting with a close-up shot of a hand pouring a small bag marked \"Sea Monkey Instant Live Egg\" into a glass of water on a blue base, with on-screen text reading \"Growing and hatching sea monkeys and water dragons!!\" The scene then switches to another hand pouring a small bag marked \"Tropical Egg\" into a larger transparent plastic water tank filled with water and decorated with purple and blue patterns at the bottom. A woman in a floral dress is seen comparing the sizes and designs of two containers before opening a blue box marked \"Aqua Dragons\" and assembling the small water tank correctly. She fills it with bottled water, adds a water purifier pack, and pours the \"Sea Monkey Instant Live Egg\" into a glass of water, stirring it to form bubbles. The camera captures close-ups of the water tanks filled with small bubbles, indicating the incubation process has begun for both the sea monkeys and aqua dragons. The video concludes with a side-by-side comparison of the two tanks on a white countertop, showcasing the initial stages of the incubation process, with the larger, decoratively based tank for the sea monkeys and the smaller, blue-based tank for the aqua dragons."
+ },
+ {
+ "video_id": "PI-39cc5b68-d13b-4342-b7bd-8105c541c87d",
+ "path": "/workspace/benchmark/video/PI-39cc5b68-d13b-4342-b7bd-8105c541c87d.mp4",
+ "answer": "In the video, two individuals are seen standing in a kitchen, creating a relaxed and humorous atmosphere. The person on the left, dressed in a purple hooded sweatshirt with white text, black sweatpants, and brown slippers, initially makes gestures and moves their arm, becoming increasingly lively as the video progresses. They wave their arms exaggeratedly, place their hands on their chest, and raise their arms energetically. Meanwhile, the person on the right, wearing a black T-shirt with white text, camouflage pants, and green socks, stands relatively still, occasionally adjusting their posture and touching their face. The kitchen is equipped with white cabinets, countertops filled with various items, and sideboards with glass doors, providing a domestic backdrop. An audio overlay features a song in a mischievous yet confident male voice, whose lyrics describe someone good-looking with a beautiful face; the singer believes a girl likes him because she keeps rubbing his belly. In the later stage of the video, the person on the left leans to the left, and the person on the right places a hand on their shoulder, engaging in an interesting interaction. The person on the left continues to make lively movements, while the one on the right participates with a smile, capturing a relaxed and pleasant moment between the two in the family kitchen."
+ },
+ {
+ "video_id": "PI-bf64233f-0be3-4933-ac30-53c1b783135a",
+ "path": "/workspace/benchmark/video/PI-bf64233f-0be3-4933-ac30-53c1b783135a.mp4",
+ "answer": "This video captures a relatable and emotional moment of a young boy struggling with his homework. Seated at a wooden table, the boy, who has blonde hair and is wearing a grey shirt, is visibly upset. He cries, covers his face with his hands, and throws his head back in frustration. Beside him, a man, presumably his father, attempts to help. The man, wearing a black sweatshirt and a baseball cap, patiently leans over the boy, guiding his hand as he holds a pencil to a worksheet. The scene is set in what appears to be a dining room, with warm, indoor lighting. "
+ },
+ {
+ "video_id": "PI-08f60841-2846-4c19-a4df-bb7777642ba9",
+ "path": "/workspace/benchmark/video/PI-08f60841-2846-4c19-a4df-bb7777642ba9.mp4",
+ "answer": "The video presents a conversation between two men in a modern, well-lit room with large windows and glass partitions, discussing the possibility of 17-year-old Piper becoming the next member of their group. The first man, dressed in white and sitting on a beige couch, holds a microphone as he speaks, while the second man, in light pink clothing and seated in a gray chair, also holds a microphone to respond. They talk about Piper's cuteness and her TikTok content, with the man in white expressing appreciation for her but emphasizing the age limit: no one under 18 can join. The man in pink agrees that Piper is cute and mentions having seen her TikTok videos. Meanwhile, the audio features two female speakers having a similar discussion about the possibility of new members joining, highlighting that being under 18 is a barrier, while also expressing love and appreciation for the person in question (presumably Piper in this context). In the video, the man in blue then suggests other opportunities for Piper, such as Disney Channel auditions, Fortnite, or streaming, pointing out that while they like her dancing and humor, her age makes her a poor fit to join their team at the moment. The video concludes with both men expressing their love and appreciation for Piper, acknowledging her talent and humor, but stressing that the timing isn't right for her to join their group. The audio and video content together convey the dilemma of wanting to welcome someone talented but being restricted by age requirements."
+ },
+ {
+ "video_id": "PI-8711ae92-caa4-4f7d-99e6-94df78d928ad",
+ "path": "/workspace/benchmark/video/PI-8711ae92-caa4-4f7d-99e6-94df78d928ad.mp4",
+ "answer": "The video features a musical performance with a warm, brownish background and floral decorations, where a group engages in a lighthearted discussion through song. A man in a dark suit and red tie sings lyrics like 'AND WE KNEW,' 'BREAK,' and 'FROM KELSEY,' followed by another man with 'JUST KIDDING' and 'KELS.' A woman in a pink dress joins with 'WAIT,' 'WHAT EVEN IS,' and 'THIS SONG?' The narrative continues with two men singing 'NOW WE'RE,' and a man singing 'TEEING UP BOMBS.' The woman returns with 'SLAMMING DOWN SHOTS,' and the men sing 'RONALDO,' followed by 'TAKE A GOLF CART' and 'TO THE CONDO.' A man sings 'MATT,' and 'INVITES RENALDO,' with the men singing 'WE GO TO BED,' and the woman singing 'HE'S ON A WALK' and 'WITH RENALDO.' The audio captures a casual conversation with ambient music, where a male humorously references Ronaldo and a trip, expressing frustration with girls after prison. A female reassures that Ronaldo is in her past. The video maintains a cohesive visual theme, depicting a range of emotional tones among a mixed-gender group."
+ },
+ {
+ "video_id": "PI-9d588e7f-6619-43f8-8280-458b0b13fb4a",
+ "path": "/workspace/benchmark/video/PI-9d588e7f-6619-43f8-8280-458b0b13fb4a.mp4",
+ "answer": "The video captures a dynamic natural phenomenon, beginning with a wide shot of a vast landscape under a partially cloudy sky. The initial footage, taken from a moving vehicle, showcases a massive white cumulus cloud towering over expansive green fields and a distant road on the horizon. As the scene progresses, the lighting dramatically shifts, transforming the clouds into vivid shades of orange and red, signaling the arrival of sunset. The storm clouds evolve from pale hues to fiery red tones, with visible lightning flashes in the distance. The final shots stabilize, featuring a camera mounted on a tripod in the foreground, focused on capturing the changing sky. Accompanied by increasingly intense music, the subtitles read, “little did I know I was about to witness the most majestic sunset ever,” conveying the narrator's awe at the spectacle."
+ },
+ {
+ "video_id": "PI-877d6942-9947-4860-8856-ad12d6c3a495",
+ "path": "/workspace/benchmark/video/PI-877d6942-9947-4860-8856-ad12d6c3a495.mp4",
+ "answer": "The video depicts a tense and confrontational scenario where a man, having sneaked into a car and rummaged through it, finds himself trapped when the doors refuse to open despite his attempts to escape by kicking the glass. In a dramatic turn, the car's in-car computer rings, and the man answers, only to find himself in a dialogue with the car owner, who had modified the vehicle to punish the thief. The audio track captures this intense exchange, with the trapped individual begging for help while the owner issues warnings and asserts control. As the confrontation unfolds, the video shows the car owner remotely starting the vehicle and driving it onto the road, escalating the thief's distress. The situation culminates with an elderly person appearing and informing the thief, \"Hello, you're in my car,\" highlighting the car owner's calculated plan to teach the intruder a lesson."
+ },
+ {
+ "video_id": "PI-a4973944-70a8-4204-951f-a034cfe1cbf2",
+ "path": "/workspace/benchmark/video/PI-a4973944-70a8-4204-951f-a034cfe1cbf2.mp4",
+ "answer": "The video and audio together create an engaging culinary experience. The video begins with a close-up of a hand pouring fried rice from a bowl onto a white plate with a red rim, forming a mound. It then transitions to a hibachi grill where a chef is actively cooking, adding pieces of grilled meat and shrimp to the plate of fried rice, followed by a portion of mixed vegetables like broccoli and zucchini. The final frames showcase the restaurant's interior, featuring recessed ceiling lighting and a large mural of palm trees, with the completed plate of vibrant hibachi food prominently displayed. Meanwhile, the audio complements this visual journey with energetic background music, full of rhythmic and melodic elements, enhancing the dynamic atmosphere of the scene. The combination of the visual feast and the lively music creates an immersive experience, highlighting the artistry of hibachi cooking in a vibrant setting."
+ },
+ {
+ "video_id": "PI-559205d0-a5d2-4464-ae74-39c6d83dff74",
+ "path": "/workspace/benchmark/video/PI-559205d0-a5d2-4464-ae74-39c6d83dff74.mp4",
+ "answer": "This video sets up a vibrant and engaging scene in a virtual environment, possibly from a game like Roblox. The background depicts a hazy urban scene with a red building, adding a touch of realism to the digital world. The character on the left is a real person with long curly hair, wearing a green crop top with the white word \"Entertainment\" written on it, light blue jeans, and white sports shoes. The character on the right is a digital avatar with long pink wavy hair, wearing a white coat with red accents and white fluffy boots. A small red heart floats above the avatar's head, adding an interesting and whimsical element to the scene. Bold text overlaid on the image, in a mixture of white and red, states \"Each ROBLOXIANS FYP is now...\" This text indicates that the video represents a hot topic in the Roblox community and has attracted the attention of viewers familiar with the game. The combination of a real-life person and a digital avatar creates an engaging contrast, highlighting the integration of reality and virtuality that characterizes the Roblox experience. The overall mood of the video is lively and energetic, reflecting the creativity and diversity of the Roblox platform. The video effectively captures the essence of current popular trends among Roblox players."
+ },
+ {
+ "video_id": "PI-761d33c8-eff3-4a5b-afb2-d8ea6da93e4e",
+ "path": "/workspace/benchmark/video/PI-761d33c8-eff3-4a5b-afb2-d8ea6da93e4e.mp4",
+ "answer": "The video captures a large public gathering or protest in an urban setting, where the crowd is dense and participants are dressed in winter clothing, indicating cold weather. The atmosphere is one of solidarity and collective action, with many individuals holding signs and standing closely together, some taking photos or videos of the event. The background features modern tall buildings, suggesting the event is taking place in a city center, and the ground is partially covered with patches of snow and dry leaves, emphasizing the winter season. Meanwhile, the audio is a speech piece with a passionate and resolute tone, discussing the threats faced by transgender people and the efforts to force them back into hiding. It also mentions a national maritime ban that might affect individual decisions. The core message of the speech emphasizes the importance of resisting these oppressive measures and supporting the rights of transgender people, aligning with the sense of community engagement and collective action depicted in the video."
+ },
+ {
+ "video_id": "PI-6d5cb4f3-12de-4adb-bf76-80a8b519cd95",
+ "path": "/workspace/benchmark/video/PI-6d5cb4f3-12de-4adb-bf76-80a8b519cd95.mp4",
+ "answer": "A female speaker, brimming with happiness and enthusiasm, shares her remarkable experience with Marvis Whitening Toothpaste, a product initially designed for smokers. She enthusiastically explains that, despite not being a smoker herself, the toothpaste has proven incredibly effective at removing various types of stains, including those from coffee and wine. She emphasizes that the results are not only quick but also improve with continued use, making it a versatile solution for pesky stains. Meanwhile, the video visually demonstrates the toothpaste's effectiveness on a stained dental model, where the teeth, initially yellow and brown, are placed in an orange liquid-filled container. The toothpaste is applied and scrubbed onto the teeth, gradually lifting the stains and revealing a noticeably whiter surface. This process is repeated to underscore the product's consistent cleaning and whitening capabilities. The video also showcases the sleek design of the Marvis toothpaste tube and its packaging, adding a touch of elegance to the demonstration. Overall, the combination of the speaker's upbeat and confident tone and the visual evidence effectively highlights the toothpaste's superior stain-removing properties."
+ },
+ {
+ "video_id": "PI-f14575c7-64f2-4e58-ae43-2ec84aeb18f6",
+ "path": "/workspace/benchmark/video/PI-f14575c7-64f2-4e58-ae43-2ec84aeb18f6.mp4",
+ "answer": "This video clip captures a highlight from a basketball game, showcasing a one-on-one interaction between players Kyrie Irving and Victor Wembanyama. The game takes place in a large arena filled with spectators, on a vibrant yellow and blue court. Irving, in a light blue jersey, controls the ball and faces off against Wembanyama, who is in a red jersey. With impressive skill, Irving executes a quick crossover dribble to get past Wembanyama. He then drives towards the basket and throws a perfect alley-oop pass to a teammate. The teammate, also in red, jumps high to catch the ball and finishes the play with a powerful dunk. The commentators' excited voices and the roar of the crowd underscore the impressive nature of the play. On-screen text reads \"Kyrie vs. Wemby.\""
+ },
+ {
+ "video_id": "PI-864b7065-d4b0-48bf-ae88-2156e8e537ba",
+ "path": "/workspace/benchmark/video/PI-864b7065-d4b0-48bf-ae88-2156e8e537ba.mp4",
+ "answer": "The video features a person in a kitchen setting, wearing a black shirt with a graphic design and displaying tattoos on their arms, speaking directly to the camera. The background includes grey cabinets, a stainless steel refrigerator, and various items on the countertop, creating a casual, homey atmosphere. Text overlaying the video reads: 'Keith Lee is being KIND not posting all of his food reviews in the UK. You served the man cold, unseasoned, CANNED tuna on canned beans bffr,' suggesting a humorous or critical commentary on a food review situation and implying that the food served was subpar. The person gestures with their hands while speaking, emphasizing their points and adding a dynamic element to the scene. Meanwhile, the audio features a male singer expressing profound emotional exhaustion and the collapse of his world, conveying a deep sense of tiredness and despair. The overall tone of the video is informal and conversational, likely addressing an audience familiar with Keith Lee and his food reviews, while the audio adds a layer of emotional depth to the visual narrative."
+ },
+ {
+ "video_id": "PI-2a7d3acc-fb98-4d17-8b95-bd8209c08f81",
+ "path": "/workspace/benchmark/video/PI-2a7d3acc-fb98-4d17-8b95-bd8209c08f81.mp4",
+ "answer": "In the video, a person is sitting on a blue and black gaming chair, playing a virtual filtering game displayed on the screen. The game interface presents a funnel-shaped structure with a purple outline and a blue interior, designed to filter different types of balls. At the bottom of the screen, there are three pink boxes with icons and numbers on them, representing the types and quantities of balls: a yellow tennis ball with the number 15, a volleyball with the number 3, and a basketball with the number 5. The words \"Impossible Filter\" are prominently displayed below, indicating the difficulty of the challenge. As the game progresses, the numbers in the boxes change, reflecting the filtering process: the number of tennis balls decreases from 15 to 6, the number of volleyballs from 3 to 1, and the number of basketballs from 5 to 0, then back to 1. The funnel is filled with basketballs, tennis balls, and volleyballs, with the basketballs mainly located at the top. Finally, the game reaches the \"Game Over\" screen, signaling that the challenge has ended. The audio includes the voice of a person (possibly male) expressing a wish or intention, beginning with \"I've always wanted to play it.\" This tone implies curiosity or interest, adding a personal touch to the gaming experience. The video captures the dynamic process of the filtering game, highlighting the changes in the number of balls and the final result, while the audio provides narrative elements for the visual content."
+ },
+ {
+ "video_id": "PI-665041ea-e48b-4ea4-8f6a-758ada067004",
+ "path": "/workspace/benchmark/video/PI-665041ea-e48b-4ea4-8f6a-758ada067004.mp4",
+ "answer": "In the video, a person wearing a knitted cap and a fur coat is seen holding a brown chick with a red comb in a setting characterized by a simple wooden ceiling and corrugated metal plates. The text overlay asks, \"Has anyone seen this before?!\" The individual interacts with the chicken, which appears to be making unusual sounds or movements, prompting reactions of surprise and concern from the holder, who repeatedly states they have never witnessed such behavior before. The person continues to hold the chick, showing deep concern and care for its health. Meanwhile, the audio features a female speaker comforting a dying individual with tender and supportive words, expressing deep love, sorrow, and encouragement, and acknowledging the individual's strength and pain. The ambient music is gentle and somber, creating a reflective and emotional atmosphere that complements the visual narrative of concern and care for the chick's well-being."
+ },
+ {
+ "video_id": "PI-80094038-fd5f-4226-8518-ce5e1ca0789c",
+ "path": "/workspace/benchmark/video/PI-80094038-fd5f-4226-8518-ce5e1ca0789c.mp4",
+ "answer": "The video begins with a close-up of a hand opening a glass door, revealing a black car parked outside. The camera then shifts to show a person sitting in the driver's seat of the car. The interaction starts with a polite inquiry from a female speaker asking, 'Hi, how are you? Good. Have you been here before?' The driver responds affirmatively. However, the tone quickly escalates as a male speaker, displaying a tone of frustration and disapproval, firmly states, 'Yeah. Sir You need to leave unacceptable,' indicating that the driver's presence is not permitted. The text 'Unacceptable. Call 911' appears on the screen, emphasizing the seriousness of the situation. The atmosphere becomes tense and confrontational, hinting at a formal or service-related setting where the driver's behavior is being flagged as inappropriate. The video concludes with the black car driving away from the parking lot, with palm trees and buildings visible in the background, suggesting the location might be in a warmer, possibly urban area."
+ },
+ {
+ "video_id": "PI-21364679-c823-477d-ab85-40f121a0b9f2",
+ "path": "/workspace/benchmark/video/PI-21364679-c823-477d-ab85-40f121a0b9f2.mp4",
+ "answer": "The video is an important reminder of the potential dangers of using a car wash in cold weather. The video begins with a close-up shot of a hard blue car wash brush, where a man can be seen touching and pulling the bristles with one hand to show their rigidity. The accompanying audio features a man's voice highlighting the risk, using phrases like \"look there\" and \"you can hear it cracking,\" which underscores the warnings issued by the car wash workers. The scene then transitions to a car wash facility where a white car is undergoing the cleaning process. High-pressure water sprays and stiff brushes are at work, highlighting how cold weather can cause brushes to become stiff and potentially damage vehicles. This blend of visual and audio elements effectively emphasizes the importance of caution when washing your car in cold conditions."
+ },
+ {
+ "video_id": "PI-851f1603-a853-4e03-8fe5-9be77d058756",
+ "path": "/workspace/benchmark/video/PI-851f1603-a853-4e03-8fe5-9be77d058756.mp4",
+ "answer": "In the video, two people discuss the topic of marriage, with one sentence repeatedly appearing: \"Marriage is terrifying because if he...\" The first person wears a black coat and a red checked shirt, speaking in an indoor environment with windows and curtains in the background. The second person sits on a sofa, wearing a grey shirt and orange trousers, with an orange pillow placed beside him. The background includes a wooden table and a framed painting on the wall. The two take turns speaking: the first makes expressive gestures, and the second listens attentively, occasionally touching his hair. As the conversation goes on, the first person continues to speak enthusiastically, using gestures to emphasize his point of view, while the second responds, also using gestures to convey his thoughts. The text overlay remains consistent, highlighting the themes of fear and uncertainty in marriage. The repeated text in the video emphasizes the core theme of the discussion: fears and uncertainties related to marriage, especially regarding the potential behavior of a partner."
+ },
+ {
+ "video_id": "PI-54c735b1-8a55-4731-a78c-611a24d7bab3",
+ "path": "/workspace/benchmark/video/PI-54c735b1-8a55-4731-a78c-611a24d7bab3.mp4",
+ "answer": "The video showcases the step-by-step process of making a \"teddy graham latte,\" beginning with a close-up of the finished drink in a clear plastic cup, complete with a lid and black straw, and the text \"teddy graham latte\" accompanied by a teddy bear emoji displayed on the screen. The preparation starts with a person in a gray hoodie adding a spoonful of brown sugar and a drizzle of honey to a small glass pitcher, followed by coffee poured from a machine, blending the ingredients together. Next, milk is poured over ice cubes in a clear plastic cup on a granite countertop, creating a creamy base. The person then uses a handheld milk frother to ensure the coffee mixture's smooth consistency before pouring it into the cup, resulting in a layered effect. The latte is stirred, mixing the layers, and topped with a lid and straw. Meanwhile, the background music features a female vocalist singing in a soft, melancholic tone, with lyrics suggesting themes of introspection and the passage of time, such as \"the years you can't see.\" This reflective and somewhat somber mood enhances the emotional depth of the visual content, culminating in a close-up of the finished teddy graham latte, ready to be enjoyed."
+ },
+ {
+ "video_id": "PI-11498cc0-d3d0-4b07-b88c-b94d70555d83",
+ "path": "/workspace/benchmark/video/PI-11498cc0-d3d0-4b07-b88c-b94d70555d83.mp4",
+ "answer": "The video begins with a lively scene in a bustling city square (most likely Times Square), where a group of people are posing for photos. The scene switches to a close-up of the camera screen, showing the same group captured in the photo. The camera interface displays settings such as \"M\" mode, \"F1.2\" aperture, \"1/200\" shutter speed, and \"ISO160\". The focus then shifts to a person wearing a dark grey sweatshirt with colorful skull patterns and the blue letters \"Godspeed\" on it, along with a thick chain necklace and several rings. The costumed people in the background remain lively. A small white portable photo printer is then held up in one hand, with the busy city square in the background, and it begins printing the group photo taken earlier. The printed photo emerges from the printer, showing the group dressed in colorful clothes. The video continues in the same bustling square, with vibrant billboards illuminating the background. The scene transitions to a close-up of a printed photo, now hung on a keychain and held up against the backdrop of the brightly lit square. The video ends with a close-up shot of a man holding the keychain with the earlier group photo printed on it, showing it to the camera, the vibrant city square at night behind him."
+ },
+ {
+ "video_id": "PI-f722fbb2-5f77-4309-af23-52b573894c3b",
+ "path": "/workspace/benchmark/video/PI-f722fbb2-5f77-4309-af23-52b573894c3b.mp4",
+ "answer": "This video presents a man explaining a numerological concept related to birthdays. Seated in an ornate chair against a backdrop of golden, mountain-like carvings, he asserts that the last digit of one's birth date determines their wealth and life path. He speaks in a calm, deliberate tone, using hand gestures for emphasis. Overlaid text in both English and Chinese reinforces his points. The speaker provides specific interpretations: a '7' signifies a destiny as a successful entrepreneur; an '8' suggests a life of dramatic ups and downs, from glorious highs to lonely lows; and a '9' indicates that one will always have supportive people and financial stability, though not necessarily instant riches. Throughout his explanation, the lower part of the screen displays various illustrative scenes, such as a successful businessman, happy crowds, a solitary individual, and hands counting money. A prominent Yin and Yang symbol is visible in the top left corner. The video concludes with the speaker offering personalized life path readings to viewers who share their birthdays with him."
+ },
+ {
+ "video_id": "PI-5d543d15-fd5d-4e0d-9445-ba4eda476020",
+ "path": "/workspace/benchmark/video/PI-5d543d15-fd5d-4e0d-9445-ba4eda476020.mp4",
+ "answer": "The video employs an overhead shot to capture a delicately textured snowscape during active snowfall, with pristine white snow blanketing the ground. White subtitles at the top display the text \"nature is beautiful\" accompanied by a [folded hands] emoji.\nA human hand enters from the right side of the frame, feeling through the snow layer. Subsequently, the hand extracts a red Dr. Pepper can, partially dusted with snow. As the can is lifted and gently rotated, its branding details, including the \"Dr Pepper\" logo, the number \"23,\" and the volume indication \"12 FL OZ (355 mL)\", become clearly visible.\nThe interaction between the hand and the icy layer creates subtle disturbances. The entire scene is underscored by an instrumental rock-style soundtrack that maintains a calm, observational tone throughout this brief sequence. The subtitle playfully implies the soda can is \"a gift from nature.\""
+ },
+ {
+ "video_id": "PI-0ff895d4-e94e-45e5-a1cd-1f1a293c12c2",
+ "path": "/workspace/benchmark/video/PI-0ff895d4-e94e-45e5-a1cd-1f1a293c12c2.mp4",
+ "answer": "The video primarily takes place indoors, transitioning between rooms with white walls, windows, and a door, subtly altering the background while maintaining focus on a blonde-haired woman in a pink top or white tank top. The woman is the central figure, engaging in actions such as holding her hand to her mouth and speaking about a personal experience involving a medical or wellness procedure. She describes taking a picture of the procedure, which was painful and required medication afterward. The audio content centers on her narrative, reflecting a mix of calm explanation and emotional reactions, with no mention of background music or other significant sounds. Key English text includes her statement, \"I'm sorry, I'm not sure what you're asking for,\" and her discussion of the procedure. The video conveys a theme of personal vulnerability and discovery, with an emotional shift from calmness to surprise and shock, particularly when the woman reacts to an unexpected image of a man."
+ },
+ {
+ "video_id": "PI-416008ca-bd5b-48ce-ab0b-e52dfeb75bc7",
+ "path": "/workspace/benchmark/video/PI-416008ca-bd5b-48ce-ab0b-e52dfeb75bc7.mp4",
+ "answer": "The beginning of the video is a poster named \"4TOWN The World Tour\", where four boys in orange costumes are set against the backdrop of colorful cartoon characters and green walls, suggesting a vibrant and energetic tour. The scene transitions to the performer on the stage, dressed in white and with huge glowing wings. Surrounded by the backdrop of urban landscapes, bright lights and starburst effects create a magical and vibrant atmosphere. The performers struck various dynamic poses to emphasize the magnificence of the performance. The words \"4TOWN\" appeared in large glowing characters at the bottom of the screen, reinforcing the brand image of this trip. Then, the scene switches to a group of lively children among the audience, who look up in awe and excitement. One child is wearing a green checked shirt, another is wearing a pink shirt printed with a cat pattern, and the others are dressed in all kinds of colorful clothes. A close-up shot shows a child in a blue shirt excitedly stretching out his hand, followed by a close-up shot of a performer in white clothing with large glowing wings flying over the heads of the audience. Looking at the performers from below, the hands of the audience reach out to them, highlighting the connection between the performers and the fans. The video ends with a close-up of a child's hand reaching out to the performer, followed by a close-up of the performer's face, emphasizing the emotional connection and the intensity of the performance. The overall theme of the video is the excitement and joy of a live concert, with a focus on the interaction between the performers and the audience."
+ },
+ {
+ "video_id": "PI-36165c54-e5fa-4f17-b56a-e8e5c10acd40",
+ "path": "/workspace/benchmark/video/PI-36165c54-e5fa-4f17-b56a-e8e5c10acd40.mp4",
+ "answer": "In the video, a cheerful female speaker enthusiastically describes the innovative features of the PowerXL multi-cooker, setting a promotional and informative tone. The demonstration showcases the appliance's versatility and efficiency in meal preparation, with a focus on its ability to replace multiple pots and pans. The video highlights the multi-cooker's capacity to cook meals 60% faster and its hands-free stirring capability. Viewers are guided through the process of sautéing chopped green bell peppers and onions, adding and browning ground meat, and incorporating yellow corn, black beans, and diced tomatoes to create a vibrant and well-mixed dish. The video also emphasizes the multi-cooker's seven-quart capacity, built-in paddles for stirring, and nine pre-programmed settings such as 'StirMax' and 'Slow Cook.' Finally, the cooked mixture is served in a bowl alongside cornbread, with the removable pot's dishwasher-safe feature highlighted for easy cleanup, underscoring the convenience and efficiency of the PowerXL multi-cooker in simplifying meal preparation."
+ },
+ {
+ "video_id": "PI-8dc7bde4-46d4-4a2d-9f69-bf1bce77af88",
+ "path": "/workspace/benchmark/video/PI-8dc7bde4-46d4-4a2d-9f69-bf1bce77af88.mp4",
+ "answer": "At the beginning of the video, three men are standing outdoors near a tree and some growers. The man in the light-colored shirt in the middle introduced his friends Adam and Bumar to each other. Then the camera switches to three people walking into the room. Now they are wearing grey and blue-green jumpsuits and helmets, possibly preparing for an indoor skydiving activity. Inside the facility, two of the men parachuted in the wind tunnel for an exercise, while the third man watched from the ground. The skydivers seemed to have a great time, and one of them gave a thumbs-up. Then the video switches to the outdoor desert scene, where a white car can be seen driving on the sand dunes. As it rises, the sand is kicked up. Subsequently, two men sat in the back seat of the car. Both of them wore seat belts and looked very happy. The last scene shows three people sitting on a red roller coaster indoors, with a high ceiling and some green plants in the background. They were all wearing seat belts and seemed to be having a great time. One of them even raised his hand excitedly. At the end of the video, the two men enjoyed the journey very much and looked around with smiles. The core theme of the video is a group of friends enjoying various adventure activities together, ranging from indoor skydiving to desert driving and roller coasters."
+ },
+ {
+ "video_id": "PI-13e6ee34-d1c5-400f-a664-c2c262c4ba1b",
+ "path": "/workspace/benchmark/video/PI-13e6ee34-d1c5-400f-a664-c2c262c4ba1b.mp4",
+ "answer": "The background of the video is set in a hair salon. The background transitions from a blurry image to a clear one, and the stylist is doing hair for a client. The hairstylist, wearing a black shirt, uses an electric splint to create hairstyles for customers, emphasizing precision and skill. When distributing and combing the hair, the hairstylist moves the electric splints on the customer's head to achieve a specific effect. The key English texts that appear in the video include \"STOP\", \"MOVES UP\" and \"TRY THIS INSTEAD\", emphasizing the instructions or hints related to hairstyles. The audio is mainly characterized by the sound of electric splints, accompanied by occasional conversations between the stylist and the client discussing the styling process, as well as a calm background atmosphere. This video highlights the meticulous process of hair styling and the collaborative efforts between the stylist and the client to achieve a refined and confident effect. The atmosphere is relaxed and focused, emphasizing the importance of technology and attention to detail."
+ },
+ {
+ "video_id": "PI-08c0795d-637f-4c8e-8b5c-da8e37d29f86",
+ "path": "/workspace/benchmark/video/PI-08c0795d-637f-4c8e-8b5c-da8e37d29f86.mp4",
+ "answer": "The audio track serves as an educational tool, explaining the phenomenon of why mirrors and photos often make us look different, aiming to assist viewers in better understanding and improving how they perceive themselves in photographs. The video expands on this concept by delving into the common experience of discrepancies between mirror reflections and photographic images due to lens distortion. It starts with a close - up through a camera lens, drawing attention to the text 'looked' in yellow. A person in a white tank top takes a mirror selfie, and then the video presents various individuals and accompanying text to elucidate the idea. It highlights the use of a 50mm lens filter to get a more accurate representation of one's real - life appearance or suggests downloading the Moggr app to scan one's face and receive a score from 1 to 10, along with tips for enhancing facial features. The video also includes demonstrations of lens compression effects at different focal lengths and a screenshot of the Moggr app interface displaying a 'Golden Ratio Score' of 9.3. Additionally, it features a social media post about the 'inverted and zoomed filter' and a person clarifying that this filter doesn't reflect how others see us in real life. Together, the audio and video content focus on helping individuals understand and enhance their appearance in photos by identifying which facial features to pay attention to."
+ },
+ {
+ "video_id": "PI-c9394fc1-c54c-402f-9cd9-91b210b19adc",
+ "path": "/workspace/benchmark/video/PI-c9394fc1-c54c-402f-9cd9-91b210b19adc.mp4",
+ "answer": "In the audio, the voices of three individuals—two men and one woman—are heard, with the woman enthusiastically reciting lyrics from a song that include phrases like \"The light of a thousand spotlights,\" \"All the stars we stole from the night sky,\" and \"Never enough.\" The men occasionally join in, particularly during the chorus of \"Never Enough,\" reinforcing the song's theme of insufficiency and longing, highlighting that material wealth and fame cannot fully satiate one's inner desires. Expanding on this, the video depicts the same trio in a modern kitchen setting, where they engage in a lively conversation or performance. The person in the middle, animated and expressive, sits on a white counter in a blue and white striped dress, using hand gestures as she speaks or sings, while the other two, dressed in casual attire, stand beside her, occasionally responding or reacting. The video is enhanced with text overlays echoing the song's lyrics, such as \"all the shine of a thousand spotlights,\" \"will never be enough,\" and \"towers of gold are still too little,\" creating an energetic and engaging atmosphere. The scene concludes with the person on the left exiting the frame, leaving the other two standing in the kitchen, encapsulating the song's message of enduring longing and the inadequacy of external achievements."
+ },
+ {
+ "video_id": "PI-3f26b90a-db8c-424c-b48e-086758fad9c8",
+ "path": "/workspace/benchmark/video/PI-3f26b90a-db8c-424c-b48e-086758fad9c8.mp4",
+ "answer": "The video unfolds in a well-lit, modern living room where a person in a black hoodie and beige pants stands near a table, announcing in a casual tone, \"Cardo your dinners ready,\" a phrase that echoes the audio track's friendly notification, albeit with the video's speaker not relying on a screen's 99% accuracy display as in the audio's playful or technological context. The background, featuring a red couch with a colorful crocheted blanket, a decorated Christmas tree, and a TV mounted on a stone wall, sets a cozy scene. As the video progresses, another individual in a red jacket over a white shirt joins, and they share a simple meal consisting of cookie dough and chocolate cake. The seated person enjoys the cookie dough while the standing one takes a bite of the cake, later offering a water bottle, which is accepted with a sip. The video concludes with the seated person continuing to indulge in the cookie dough, as the other stands nearby, bottle in hand, encapsulating a casual and friendly atmosphere centered around sharing a meal. The audio, with its light-hearted dinner announcement, perfectly complements the video's portrayal of a relaxed, everyday interaction."
+ },
+ {
+ "video_id": "PI-65ea4851-4e77-41d7-9733-b5f33ed8aaf6",
+ "path": "/workspace/benchmark/video/PI-65ea4851-4e77-41d7-9733-b5f33ed8aaf6.mp4",
+ "answer": "The video features a moving vehicle driving along a picturesque coastal road under a clear blue sky, with the text 'Turn off the GPS I know exactly where I am' prominently displayed at the top of the screen, highlighting the driver's confidence in their navigation skills. As the vehicle approaches an intersection marked by traffic lights and a 'California Incline' sign, it continues straight towards a bridge, flanked by palm trees on the left and a concrete wall on the right. The journey progresses with the vehicle crossing the bridge, passing a 'SANTA MONICA' sign that confirms the location, and moving forward with the serene backdrop of palm trees and clear skies. The accompanying BGM is an instrumental track characterized by a steady, upbeat, and slightly adventurous melody, which enhances the confident and relaxed atmosphere of the scene. The video wraps up with the vehicle still moving forward on the bridge, reinforcing the tranquil and self-assured mood established throughout the journey."
+ },
+ {
+ "video_id": "PI-355ec7bf-2020-49a6-a067-2de67ab317e8",
+ "path": "/workspace/benchmark/video/PI-355ec7bf-2020-49a6-a067-2de67ab317e8.mp4",
+ "answer": "In the video, a person is wearing a light grey hoodie and a white lace-up, sitting in a plain room with light-colored walls. This person spoke directly to the camera. The text above the picture was covered with: \"I have never seen a beautiful Trump supporter\"... Correct. This person seems to be responding to some comments or videos claiming that there are no attractive Trump supporters. The video briefly switches to a blurry frame and then returns to the person in the grey hoodie. In an interesting twist, this person appeared wearing a golden wig, still in the same environment, continuing to speak, challenging the notion that Trump has no attractive supporters. Text overwriting remains unchanged all the time. When this person is speaking, they make gestures with their hands and focus their attention on their upper body and face. This video seems to be a humorous and confident response to the stereotypes of Trump's supporters, emphasizing that such a generalization is incorrect. This person jokingly questioned how the audience would call them if they were not good-looking, and finally said, \"You are not the exception. You are the rule.\" This strengthened the message that attractive Trump supporters do exist and are not rare."
+ },
+ {
+ "video_id": "PI-d1460d1c-14c1-4bde-b7cb-660496d494e8",
+ "path": "/workspace/benchmark/video/PI-d1460d1c-14c1-4bde-b7cb-660496d494e8.mp4",
+ "answer": "The video was shot on a stage where there were many lights and performers playing Musical Instruments. The singing woman was wearing a yellow short skirt and holding a microphone. She was singing passionately on the stage. During the singing process, she twisted her body and interacted with the audience below. The audience below also responded actively to her interaction. It can be seen from the video that the woman in the yellow dress is enjoying this joyous moment. Behind her, there is also a woman in white dancing, who should be her backup dancer. The drummer behind is also beating hard, creating a lively atmosphere. From the audio in the video, it can be heard that the atmosphere on the stage is very lively. Everyone is happily singing along with the accompaniment, which makes people feel physically and mentally relaxed."
+ },
+ {
+ "video_id": "PI-1406e89a-e3d8-444e-b1c5-c48f1c8fbb7c",
+ "path": "/workspace/benchmark/video/PI-1406e89a-e3d8-444e-b1c5-c48f1c8fbb7c.mp4",
+ "answer": "The video opens with a scene of two individuals in green jackets walking down a hallway, pulling luggage carts labeled 'DFW', setting a casual and perhaps mundane backdrop. However, the atmosphere quickly shifts to a humorous one as the text 'I got em ' appears at the bottom of the screen, hinting at a playful interaction to come. The camera then pans to reveal two soldiers in camouflage uniforms sitting on black benches, one with his legs stretched out and the other leaning back with his hands behind his head, appearing relaxed. A green bag with the initials 'PJ' rests on the floor nearby. The audio track from the video features human voices with no background music, where one person, sounding amused or triumphant, repeatedly says, \"I got em,\" indicating a lighthearted and humorous interaction. Another voice, possibly that of a soldier named 'Turner,' reacts to this, further contributing to the playful ambiance. The video continues to focus on the soldiers, with one adjusting his position while the other remains at ease, all the while the text 'I got em ' remains on the screen, emphasizing the humorous and lighthearted nature of the moment captured on camera."
+ },
+ {
+ "video_id": "PI-44cf3a9f-7fbe-4ffc-b12c-f692b2384c07",
+ "path": "/workspace/benchmark/video/PI-44cf3a9f-7fbe-4ffc-b12c-f692b2384c07.mp4",
+ "answer": "The video captures the moment of two people enjoying the event in a large indoor arena as a light-hearted conversation unfolds between them. One man wore a black down jacket with red and white patches and a black beanie, while the other woman wore a black leather jacket and sunglasses. The arena is bustling with multiple seating levels and bright overhead lights that create a vibrant backdrop. During their interaction, the woman expressed admiration for the man's appearance, describing him as tall, strong, and handsome with white teeth. The man responds modestly, and the woman offers his gratitude and honor for his honesty. The blonde girl raised her sunglasses and playfully pointed at the camera, adding a cheerful and relaxed atmosphere to the scene. The interaction was casual and warm, with the girls showing enthusiasm and the boys staying calm, conveying a sense of enjoyment and camaraderie in the lively arena."
+ },
+ {
+ "video_id": "PI-87a0fa43-e7f2-4866-bc6b-9a82740b0551",
+ "path": "/workspace/benchmark/video/PI-87a0fa43-e7f2-4866-bc6b-9a82740b0551.mp4",
+ "answer": "The video presents a close-up, static scene of a person with voluminous, curly hair, wearing clear-framed glasses and a light-colored jacket with a fluffy collar over a light blue shirt. The background features a ceiling with a ventilation grate, suggesting an indoor setting with plain walls and soft, even artificial lighting. The camera angle is slightly tilted upwards, focusing on the person's face and upper body, with no significant movement or changes in the background. Complementing the visual, the audio track features a female speaker with a playful and slightly exasperated tone. She recounts an incident where someone told her, \"You're not a dog, stop rewarding yourself with food,\" and humorously confirms that she does indeed reward herself with food, offering a lighthearted take on self-reward habits."
+ },
+ {
+ "video_id": "PI-cb5c6627-19af-45a0-948a-0f2b733d668d",
+ "path": "/workspace/benchmark/video/PI-cb5c6627-19af-45a0-948a-0f2b733d668d.mp4",
+ "answer": "The video initiates with a person in a car, offering a farewell gesture, before seamlessly transitioning to a bustling factory interior. Within this industrious setting, multiple workers are diligently engaged at their workstations, surrounded by equipment and materials. A man, clad in a white shirt adorned with a logo, stands out as he enthusiastically speaks while holding a large, white, three-dimensional letter \"B.\" His dialogue, which aligns with the audio content, passionately promotes the benefits of sourcing directly from a factory. He emphasizes the cost savings, quality assurance, and customization options available, particularly advantageous for budget-conscious business owners. The man skillfully uses the letter \"B\" as a visual prop to underscore his points, all while the factory's productive ambiance, complete with focused workers and overhead lighting, serves as a dynamic backdrop. The video maintains a steady focus on both the speaker and the vibrant factory environment, effectively conveying the message of direct sourcing advantages."
+ },
+ {
+ "video_id": "PI-88e812b9-2c29-44d3-ab1c-33c31929d60d",
+ "path": "/workspace/benchmark/video/PI-88e812b9-2c29-44d3-ab1c-33c31929d60d.mp4",
+ "answer": "The video took place in a large indoor basketball court, which was crowded with people. Some were holding basketballs while others were watching from the sidelines. The court is well-lit, with banners and posters of basketball players hanging on the walls, creating a vibrant and competitive atmosphere. The main focus is on a man wearing a white T-shirt and black shorts, who is actively participating in a basketball game. He dribbled the ball, shot, and had impressive performances, including jump shots and layups. The words \"Don't make me hot\" appeared at the top of the screen, hinting at a warning to the opponent that his confidence and skills were on the rise. The background is a basketball circle and a large banner that reads \"League Today\". Tomorrow's NBA. As the game went on, the man in the white T-shirt continued to demonstrate his skills, made a successful shot and rushed towards the basket. The score displayed on the scoreboard is 18 points. The brand \"Tissot\" can be seen, indicating a professional or semi-professional setting. This video captures the dynamics and competitiveness of the game. The players are moving rapidly on the field and the audience are watching attentively. The core theme of the video is the excitement and intensity of basketball, highlighting the players' talents and the vibrant atmosphere of the indoor court."
+ },
+ {
+ "video_id": "PI-6452ba28-5aa9-4c49-b68f-7a4d9d4f7339",
+ "path": "/workspace/benchmark/video/PI-6452ba28-5aa9-4c49-b68f-7a4d9d4f7339.mp4",
+ "answer": "This video presents a rapid-fire montage of scenes from the film \"Trainspotting.\" Accompanied by an upbeat yet slightly melancholic synth-pop track, the clips showcase the main characters navigating a variety of settings. The montage opens with the protagonist, Renton, running frantically, a recurring motif throughout the video. Scenes shift quickly between gritty urban landscapes, including apartment blocks and city streets, and the vast, open countryside of the Scottish Highlands. The characters are shown in different states: running from unseen pursuers, socializing in a dimly lit club, standing contemplatively by a train track, and experiencing moments of drug-induced euphoria and lethargy. The quick cuts and dynamic camera work create a feeling of restless energy and chaos, reflecting the turbulent lives of the young men portrayed."
+ },
+ {
+ "video_id": "PI-f20460c9-46a1-4bec-bf13-26520dd18a55",
+ "path": "/workspace/benchmark/video/PI-f20460c9-46a1-4bec-bf13-26520dd18a55.mp4",
+ "answer": "In a cozy living room, a young boy kneels beside a brown leather sofa, lying prone on a blue dinosaur-patterned blanket where his infant sibling rests beneath the same blanket. With hands clasped in prayer, he leans affectionately on the blanket beside the baby. He explains he’s sharing his affirmation ritual—a custom taught by his mother—reciting a series of positive declarations: \"I am thankful, I am helpful, I am a child of God,\" while expressing gratitude for each day. His mother, audible off-camera, enthusiastically participates by responding \"Amen\" when he prompts her. The scene portrays profound family harmony."
+ },
+ {
+ "video_id": "PI-a685e6d1-f6ba-4f9c-92c9-716265efac4d",
+ "path": "/workspace/benchmark/video/PI-a685e6d1-f6ba-4f9c-92c9-716265efac4d.mp4",
+ "answer": "The beginning of the video is a dark background with tiny colorful dots scattered around, just like the stars in the night sky. The word \"HEY\" appears among the glowing and colorful letters, giving people a vibrant feeling. An animated character with a pink outline and a crown floats in the field of vision, showing a surprised expression. This character, with its simple yet expressive design, becomes a focal point as it moves through the scene. The consistent background enhances the cosmic atmosphere. The scene transitions to a huge, glowing pink circle, gradually revealing itself as a stylized Earth, with the continents outlined in a slightly darker pink. Among them are lyrics like \"The ramen even dropped like heat.\" If this world were mine, I would turn your dreams into the greatest supply. If this world is mine, I will put it in front of me. Music adds a layer of vitality to the visual experience. When the Earth rotates slowly, the picture shrinks and multiple luminous circles of different colors - orange, pink, yellow and purple - can be seen floating in the dark space. The scene switches to a huge, glowing pink heart that beats gently, creating a soothing effect. The smooth and shiny surface of the heart reflects light, adding serenity. At the end of the video are the glowing colored letters \"LUTHER\", followed by the smaller white letters \"ANIMATION by KING SCIENCE\". The last frame lists in white the contributions of sponsors, supporters, and animation coverage, animations, and other contributions. The core theme of the video is a vibrant and dynamic exploration of cosmic elements, combining energetic music and expressive animations to create an engaging visual and auditory experience."
+ },
+ {
+ "video_id": "PI-a90fddef-10b8-4320-b828-b68bd02a7807",
+ "path": "/workspace/benchmark/video/PI-a90fddef-10b8-4320-b828-b68bd02a7807.mp4",
+ "answer": "The video and audio together humorously depict the scenario of feeling underdressed or out of place in a VIP setting. The video begins with two blonde women standing side by side in a brightly lit, ornate room adorned with white walls, pink flowers, and a 'VIP' sign with a crown. The woman on the left wears a red dress, while the one on the right is in a black skirt with a pink bow in her hair. The text overlay reads, 'HOW IT FEELS TO BE THE FRIEND WHO DOESN'T HAVE VIP IN DRESS TO IMPRESS...' Initially posing together, the woman on the right soon turns and walks away, leaving the other standing alone. The camera follows her departure, while the remaining woman looks distressed, placing her hands on her face, with the text 'UNNIE....' (a Korean term for older sister) appearing on the screen. The audio complements this visual narrative with a female speaker expressing regret or frustration, saying, \"I know, I'm so,\" after using the term 'Unnie.' A male speaker, sounding concerned or disapproving, advises against a certain action with the phrase, \"No, don't do this.\" The combination of the visual and audio elements effectively conveys the humorous yet relatable feeling of being the less impressively dressed friend in a VIP context."
+ },
+ {
+ "video_id": "PI-689177fb-09d8-4067-bca0-2b8b1a9bb810",
+ "path": "/workspace/benchmark/video/PI-689177fb-09d8-4067-bca0-2b8b1a9bb810.mp4",
+ "answer": "The video shows a woman in a dark jacket sitting in the back seat of a car with a gray dog with white patches on her chest. The dog is initially lying down, resting his head on the person's shoulder, creating a calm and affectionate scene. A text overlay reads, \"Reply to Phyllis's comment: I love him so much. Was he raised by turkeys???\".The dog occasionally moves and sticks out his tongue to look around, adding a touch of playfulness. The interior of the car and a glimpse of the house outside can be seen through the window. A female speaker, meanwhile, initially in a neutral tone, turned to a happy tone, humorously discussing how other people perceive dog sounds, comparing them to animals such as turkeys, dolphins and birds of prey. The female speaker, with a neutral and comical tone, responds briefly and laughs during the communication. The dialogue is light and entertaining, with the woman having casual, humorous conversations about the dog's vocalization, enhancing the warm and endearing vibe of the video."
+ },
+ {
+ "video_id": "PI-9d1337b6-5080-4de4-b901-d0902f68eb2b",
+ "path": "/workspace/benchmark/video/PI-9d1337b6-5080-4de4-b901-d0902f68eb2b.mp4",
+ "answer": "The video humorously captures a playful scenario where a person in a black shirt, seated against a white headboard adorned with red decorations, becomes the unwitting subject of a makeover. Another individual, whose face remains unseen, drapes a pink and white striped sweater over the first person's head, accompanied by the text 'POV: You're a girl dad' that sets a light-hearted tone. The scene escalates as the unseen person crowns the seated individual with a tiara and proceeds to apply makeup, including eyeliner and lipstick. Throughout this amusing transformation, the person in the black shirt appears relaxed, occasionally closing their eyes. Meanwhile, the audio features a female speaker describing a hot evening at 95 degrees with an ambiguous emotion, expressing her intention to show love, and referencing someone nearby, all while ambient music with a steady tempo maintains a neutral backdrop. The video concludes with the completed makeup look, reinforcing the comedic theme of a father indulging in a typically feminine activity."
+ },
+ {
+ "video_id": "PI-40b9810e-b648-4617-beb2-4f39f2643b80",
+ "path": "/workspace/benchmark/video/PI-40b9810e-b648-4617-beb2-4f39f2643b80.mp4",
+ "answer": "The video shows a vibrant 15th birthday celebration, beginning with a close-up of a beautifully decorated birthday cake featuring a crown and the words \"Happy 15th Birthday\" in gold on a woven table. A woman in a white off-the-shoulder dress stands behind the cake, clapping and raising her arms as the crown candles light up, creating a warm and joyful atmosphere. The scene transitions to colorful confetti and sparkles that enhance the festive atmosphere. Meanwhile, the audio features a song performed by a male vocalist, with the lyrics \"Not like us, but me\" conveying a reflective and individualistic mood, suggesting a personal, perhaps lonely, perspective. Ambient music has a steady rhythm and neutral atmosphere that contrasts with the vivid visuals. As confetti continues to fall, the woman behind the cake forms a heart with her hands, adding a touch of affection to the joyful occasion, while the contemplative tone of the audio provides a subtle layer of introspection to the celebratory scene."
+ },
+ {
+ "video_id": "PI-07b2c73e-8bba-4d9e-8aaa-5c5e54e94744",
+ "path": "/workspace/benchmark/video/PI-07b2c73e-8bba-4d9e-8aaa-5c5e54e94744.mp4",
+ "answer": "In an engaging blend of audio and video, a computer hardware - focused presentation unfolds. The audio is the track of a computer hardware - related video, featuring a male's voiceover. The speaker exudes a sense of curiosity and excitement as he shares his thrilling discovery of an unfamiliar NVIDIA graphics card. He identifies it as an RTX 8000 from the Quadro series, which he stumbled upon at a local garage sale, adding an element of surprise and adventure to the audio narrative. The video, on the other hand, has its own distinct visual elements. The main scene revolves around a white computer chassis and a black computer. The camera movements are carefully choreographed to guide the viewer's attention. From 0 to 7 seconds, a man points his finger at the white chassis, and the camera moves forward accordingly, drawing the viewer in. From 11 to 12 seconds, the camera moves backward, creating a sense of perspective. From 16 to 18 seconds, it moves downward, adding to the dynamic visual flow. The audio is accompanied by a relaxed electronic soundtrack, with a \"beep\" sound at the 14 - second mark when the man presses the remote control, enhancing the overall atmosphere. The key text information appearing in the video is the \"Republic of Gamers\" logo, and the core theme is to showcase computer equipment related to this brand. While the audio focuses on the speaker's exciting discovery of the NVIDIA graphics card, the video uses the man's actions and the camera's movements to highlight the Republic of Gamers - related computer equipment, creating a comprehensive and engaging experience for the viewer."
+ },
+ {
+ "video_id": "PI-8b2d3d96-a6bc-4901-abcb-c58e5b51a56f",
+ "path": "/workspace/benchmark/video/PI-8b2d3d96-a6bc-4901-abcb-c58e5b51a56f.mp4",
+ "answer": "This video is set against the backdrop of a casual conversation where two women engage in a lively discussion. The first woman was wearing green clothes, sitting on a beige sofa, holding a microphone, with one leg tucked under her body, looking very relaxed. The second lady in a white vest and light-colored trousers was also holding a microphone and actively participated in the conversation. The background shows a room with various items and some furniture, creating a comfortable and informal atmosphere. At the beginning of the conversation, the first woman expressed disbelief in a man named Brad buying a car. The second lady confirmed that Brad, who was described as considerate, generously gave her a tip. The conversation continued. The first woman asked them if they talked every day, and the second woman answered affirmatively. Then the first woman asked Brad if he was cute. The second woman agreed and added that he looked very cute. As the discussion went on, the first woman jokingly asked Brad for a Ferrari, wanting to know where her own Brad was. The second woman humorously said that Ferrari was pink. Throughout the video, the atmosphere remained relaxed and friendly, and both ladies enjoyed the conversation very much. The key text includes names such as \"Julia\" and \"Brad\", as well as income details, suggesting the background of online interaction or streaming media. The core theme revolves around the playful and affectionate dynamics between the two women, highlighting their friendship and the humorous nature of their discussion."
+ },
+ {
+ "video_id": "PI-31a64248-aaa7-4673-b55f-3bd654ee31b9",
+ "path": "/workspace/benchmark/video/PI-31a64248-aaa7-4673-b55f-3bd654ee31b9.mp4",
+ "answer": "The video captures the lively atmosphere of a live sporting event at a crowded stadium, where the energy is palpable and the excitement is contagious. The focal point was an enthusiastic crowd in the stands, all dressed in casual, athletic attire. One was wearing a white t-shirt adorned with multiple gold chains, while the other was wearing a yellow jacket with a white pattern and several necklaces. The third person was dressed in black and wore a green beaded necklace. The background was packed with fans, many wearing green jerseys to show their support for a team. The crowd was lively, cheering, clapping, and holding drinks, creating a vibrant, exciting environment. The track complements the scene with an enthusiastic chorus of many people, suggesting the liveliness and communal spirit of the sporting event, which enhances the spiritual atmosphere of the event."
+ },
+ {
+ "video_id": "PI-27038d18-69ce-4626-9d09-b05110caacb3",
+ "path": "/workspace/benchmark/video/PI-27038d18-69ce-4626-9d09-b05110caacb3.mp4",
+ "answer": "The video captures moments of joy in a vibrant indoor play area called \"HaloPlay\", which is decorated with colorful balloons and a large bold sign, creating a festive and family-friendly atmosphere. A woman in a white t-shirt and black jeans holds a child in a white shirt and dark overalls, while a chained man beside her, also in a white t-shirt and dark jeans, raises his arms enthusiastically. The woman pointed up, the man took the child from her and lifted him up, and then both began dancing vigorously, moving their arms and swaying to the rhythm. The backdrop is buzzing with others, including a child in a pink costume and an adult in a plaid shirt, enhancing the celebratory atmosphere. The light-hearted tune of the song, performed by a man, served as the background music and complemented the cheerful and lively atmosphere of the scene. The video exudes energy and joy, highlighting moments of family fun and celebration."
+ },
+ {
+ "video_id": "PI-3726e8f0-ad24-4cdf-b6f2-da1e93a3f3a0",
+ "path": "/workspace/benchmark/video/PI-3726e8f0-ad24-4cdf-b6f2-da1e93a3f3a0.mp4",
+ "answer": "The video shows a furry little penguin in a snowy environment near a glass window, capturing viewers'hearts with its cute behavior. With a distinctive black-and-white color, a round body, large eyes and a small beak, this little penguin initially peeks out from behind a vertical structure and then gradually stands forward on the snow. It tilts its head curiously and occasionally opens its mouth slightly, as if making small noises or calls, conveying a sense of curiosity and innocence. Meanwhile, the audio features a child female speaker expressing happy emotions while discussing shared desires and needs in relationships, emphasizing emotional connection and shared feelings. The cheerful rhythm of the background music adds to a lively and charming atmosphere and enhances the overall warm and pleasant experience of the video."
+ },
+ {
+ "video_id": "PI-59306a36-635b-4fd5-9eb7-ceb137f08da0",
+ "path": "/workspace/benchmark/video/PI-59306a36-635b-4fd5-9eb7-ceb137f08da0.mp4",
+ "answer": "The soundtrack of the video serves as a tense and rhythmic backdrop, imbued with a rock music feel that is emotionally exciting and positive, expressing a passion and love for life that contrasts sharply with the dramatic events unfolding on screen. The video begins with a close-up of a snow-covered \"Ski-Doo\" snowmobile, where a person, clad in bright green gloves, is confronted with a motorcycle problem that quickly escalates into a fire in the engine compartment. The scene unfolds with the person desperately attempting to put out the flames amidst a flurry of smoke and snow kicked up by their efforts. The camera captures the intensity of the fire and the person's urgent, focused movements as they strive to control the blaze in the harsh, cold environment. Multiple close-ups of the engine compartment flames highlight the perilous situation, while the person's persistence in repeatedly trying to extinguish the flames underscores their determination despite the difficult circumstances. The audio's energetic and passionate tone provides a stark contrast to the video's tense and challenging scenario, yet both elements work together to create a compelling narrative of struggle and resilience in the face of adversity."
+ },
+ {
+ "video_id": "PI-15df2380-f2ce-4ea4-a99c-0fa9121ec9a0",
+ "path": "/workspace/benchmark/video/PI-15df2380-f2ce-4ea4-a99c-0fa9121ec9a0.mp4",
+ "answer": "The video captures a chaotic and destructive scene where a person, clad in yellow gloves, holds a stack of white plates and begins hurling them against a wall adorned with a graffiti of a yellow smiling face, divided into two halves with an \"X\" marking one side as eyes. The sound of plates breaking, which is the focal point of the accompanying audio, fills the air, creating a cacophony that underscores the act of destruction. As the plates shatter upon impact, their fragments scatter across the floor, transforming the once-clean surface into a chaotic mess. This repetitive and slightly cathartic act of destruction is starkly contrasted with the pristine appearance of the plates before they were broken, emphasizing the transformative and perhaps even therapeutic nature of releasing energy through such a visceral act. The audio, devoid of any human voices or voiceovers, solely relying on the sound effect of breaking plates, effectively complements the video's visual narrative, together conveying a core message centered around the act of destruction as a means of energy release."
+ },
+ {
+ "video_id": "PI-42b05d5c-a714-4bf6-9316-a89fb4d1809a",
+ "path": "/workspace/benchmark/video/PI-42b05d5c-a714-4bf6-9316-a89fb4d1809a.mp4",
+ "answer": "In the video, a person is lying lazily on a brown leather sofa. Sunlight streams in through the window with blinds, creating a warm and comfortable atmosphere. This person was wearing a grey hoodie and his hair was tied up in a messy bun, exuding a relaxed and casual atmosphere. The text on the video is written like this: \"I give my boyfriend two choices: stay at home with me for the New Year and make a vision board, or go to the party he was invited to...\" He is on his way to the party... This article offers a personal story about a person who asked her boyfriend to spend the New Year at home, engaging in creative activities such as making vision boards or attending the parties he was invited to. The boyfriend chose to attend the party, and the text message humorously or with a hint of disappointment mentioned his decision. This video captures a moment of self-reflection or slight depression as this person reflects on her boyfriend's choice. The sunlight and the relaxed postures of the people enhance the contemplative mood of the video, highlighting the contrast between the comfortable and intimate family environment and the allure of social activities."
+ },
+ {
+ "video_id": "PI-07c24267-397d-4a58-b354-3bf6ac444968",
+ "path": "/workspace/benchmark/video/PI-07c24267-397d-4a58-b354-3bf6ac444968.mp4",
+ "answer": "The video unfolds in a dimly lit bedroom and bathroom, where an individual engages in personal care and introspection. The setting subtly shifts as the person moves around, revealing elements like beds, televisions, mirrors, candles, and laptops in hues of white, black, and brown. The serene atmosphere is enhanced by peaceful background music, which complements the calm mood of the scene. While there is no significant dialogue, the video features fleeting English texts such as \"get of everything,\" \"got me,\" and \"everything,\" which underscore themes of self-reflection and letting go. The soft lighting and intimate environment evoke a sense of relaxation and personal care, creating a tranquil space for the individual to reflect and rejuvenate. Meanwhile, the audio track, characterized by a female's singing voice and background music, adds an additional layer of emotion with lyrics like \"I keep it undercover, my life, isn't it like you?\", suggesting a theme of keeping certain aspects of life hidden or private. Together, the video and audio create a cohesive narrative of seeking peace and introspection in a private, comforting space."
+ },
+ {
+ "video_id": "PI-c4933ce4-70ec-4984-9ecc-b9aed5f578a2",
+ "path": "/workspace/benchmark/video/PI-c4933ce4-70ec-4984-9ecc-b9aed5f578a2.mp4",
+ "answer": "The video features a person, dressed in a white cardigan over a black top and adorned with several bracelets, styling their hair into a messy bun against a vibrant purple background with sparkling lights. While the visual focus is on the hair-styling process—gathering hair into a high ponytail, twisting it into a bun, and adjusting strands for an effortlessly cute look—the accompanying audio adds a contrasting layer. In the audio, although a man can be heard singing, the primary narrative comes from a woman expressing frustration with her hair and an unrelated annoyance about receiving a wrong phone number. The video concludes with the person satisfied with their hairstyle, yet the audio leaves a lingering sense of irritation, blending the themes of personal grooming and everyday annoyances."
+ },
+ {
+ "video_id": "PI-469aace6-5039-40d8-9251-5c80f0e4d97a",
+ "path": "/workspace/benchmark/video/PI-469aace6-5039-40d8-9251-5c80f0e4d97a.mp4",
+ "answer": "The video features Donald Trump, identified as President-elect, delivering a stern message at a podium with a purple sign that has white text, partially visible as 'TRU...'. Dressed in a dark suit, white shirt, and purple tie, he stands in front of an American flag with a stone wall and part of another person in a beige suit in the background. Trump's tone alternates between assertive and composed as he discusses the urgency of resolving a hostage situation in the Middle East. He emphasizes that failure to release the hostages before his assumption of office will lead to severe consequences, warning that 'all hell will break out,' which would negatively impact Hamas and others involved. He acknowledges the efforts of a skilled negotiator, expressing respect for their capabilities while stressing the importance of completing the deal promptly. The text overlay and ABC News logo with the date 'Jan. 7, 2025' are visible, underscoring the urgency and gravity of his statement."
+ },
+ {
+ "video_id": "PI-bcf9f179-852a-42ed-8661-fc8f4d767024",
+ "path": "/workspace/benchmark/video/PI-bcf9f179-852a-42ed-8661-fc8f4d767024.mp4",
+ "answer": "This video presents a detailed manicure art course, with a focus on a close-up of a person's hand that is elaborately decorated with nails. The nails are already decorated with complex patterns, including red and white heart-shaped patterns and exquisite floral patterns. The middle nail was coated with a layer of shiny green nail polish, becoming the focus of the video. Using a thin metal tool, this person carefully placed a small, twisted metal wire structure on the shiny green nail and adjusted it to ensure perfect adhesion. The background remains a simple white surface, drawing the audience's attention to the precise and creative nail art process. The background audio features a soft and melodious tune. The lyrics are: \"What I'm truly interested in is love and...\" You might see me indulging in love, I think, adding a contemplative and artistic atmosphere to the scene. This combination of visual and auditory elements highlights the artistry and meticulousness of nail care, emphasizing the beauty and creativity involved in creating unique nail designs. The video effectively conveys the theme of artistic expression through meticulous manicure art."
+ },
+ {
+ "video_id": "PI-dcd6b6a6-383d-4e8e-9e38-8a2987301070",
+ "path": "/workspace/benchmark/video/PI-dcd6b6a6-383d-4e8e-9e38-8a2987301070.mp4",
+ "answer": "The video features a rock band, identified as 'Pink Slip' from their drum kit, rehearsing energetically in a garage. The performance is characterized by fast-paced guitar riffs and passionate female vocals. The lead singer has curly dark hair with red streaks, and the guitarist is a young woman with long blonde hair. The scene intercuts with a middle-aged woman in a kitchen who appears displeased by the noise. She walks over to a fuse box and flips the switch corresponding to the garage, cutting the power. The music stops suddenly, plunging the band into darkness. The guitarist expresses her frustration, while the woman who cut the power announces that their time is up because it is after six o'clock."
+ },
+ {
+ "video_id": "PI-02d5bd99-8099-4e0c-b528-003f938b117b",
+ "path": "/workspace/benchmark/video/PI-02d5bd99-8099-4e0c-b528-003f938b117b.mp4",
+ "answer": "In a thought - provoking video and audio combination, a poignant narrative about modern - day communication issues is presented. The video depicts an elderly couple dining in a restaurant, seated at a table covered with a red and white checkered tablecloth. The restaurant is filled with other customers, and decorative paintings on the walls create a cozy ambiance. However, an unsettling silence pervades the couple's interaction. The man is constantly engrossed in his phone, while the woman occasionally zones out, her gaze wandering aimlessly. The cameraperson remains stationary, capturing this rather somber scene. The audio, delivered by a female narrator in a flat tone, adds a layer of contemplation. She says, \"In another life, this might have been different,\" which subtly implies a reflection on how things could have unfolded in a more positive or engaging way in another life or situation. The subtitle \"They did not speak the whole dinner and he was on his phone the whole time while she stared off\" appears in the middle of the screen, further emphasizing the lack of communication. This scenario serves as a stark reflection of the widespread issue of \"phone addiction\" in modern society. People often become so engrossed in their electronic devices that they neglect communication with those around them, even in intimate relationships. This not only undermines the quality of interpersonal relationships but also causes them to miss out on precious moments with their loved ones, making the narrator's contemplation all the more poignant."
+ },
+ {
+ "video_id": "PI-534192c4-9289-4413-9ed9-3e687d799287",
+ "path": "/workspace/benchmark/video/PI-534192c4-9289-4413-9ed9-3e687d799287.mp4",
+ "answer": "This video captured a moment during a basketball game. A player in a dark blue jersey, identified as Kyrie Irving, accidentally hit a young fan with the ball. The scene takes place on the football field. The players are wearing white and green jerseys, with the dark blue jerseys clearly visible. The text above reads: \"Owen apologized to young fans after the ball hit them,\" and is accompanied by an emoji of a praying hand, highlighting the gesture of concern from the players. This young fan, wearing a yellow shirt and a blue hat, was sitting among the audience who reacted to the incident. Some viewers captured this moment with their mobile phones, while others watched worriedly. Kyrie Irving approached the fans, demonstrating sportsmanship and concern. Another player, No. 9 \"white\", wearing a white jersey, stood by observing the situation. This video emphasizes the sincere apology of the players and the collective reaction of the audience, presenting a heartwarming interaction in the competitive atmosphere of the game. The background was crowded with spectators, some wearing blue jackets and some casual clothes. Everyone was paying attention to the interaction between Irving and this young fan. This video effectively captures the essence of empathy and sportsmanship in the sports environment."
+ },
+ {
+ "video_id": "PI-e020fd6a-7a5b-4768-8750-66234b4c1917",
+ "path": "/workspace/benchmark/video/PI-e020fd6a-7a5b-4768-8750-66234b4c1917.mp4",
+ "answer": "The video unfolds on a tranquil wooden bridge under a clear blue sky, beginning with a person seated on a bench, clad in a green sweatshirt, pink pants, and brown shoes. The text overlay, \"I'm having so much fun,\" captures their relaxed demeanor as they wave playfully. As the scene shifts, another individual appears on the same bench, dressed in a black hoodie and light gray sweatpants, hair styled in a bun. The text changes to, \"we're just sitting on a bench?\" reflecting their curious glance around. Returning to the first person, the overlay now reads, \"but I'm with you!\" highlighting the theme of companionship. Both individuals are then shown together, sharing a moment of joy against the bridge's scenic backdrop. Accompanying this visual narrative, the BGM features light percussion and a bright, upbeat melody, creating a cheerful and relaxed atmosphere that enhances the sense of happiness and togetherness conveyed throughout the video."
+ },
+ {
+ "video_id": "PI-8b477377-3900-4e61-8c3d-d62b1799d5f3",
+ "path": "/workspace/benchmark/video/PI-8b477377-3900-4e61-8c3d-d62b1799d5f3.mp4",
+ "answer": "The video begins with a man in a bright blue suit and tie standing outdoors, providing a detailed guide on identifying fake police officers. The neutral-toned male speaker emphasizes the importance of observing attire, such as stained or wrinkled clothes, the absence of a police hat, and the lack of a radio, which are red flags. The scene transitions to another man in a white t-shirt and red heart-patterned shorts, then to a man in a dark blue uniform shirt with a 'COP' patch, highlighting the discrepancies in uniform details through close-ups, including the radio. A police officer in full uniform, including a hat and sunglasses, is shown standing on a street, illustrating the correct appearance. The video then compares real police cars, which have reflective decals and are well-maintained, to fake ones, which lack decals and are dirty. The final scenes show interactions between individuals and police officers, stressing the importance of checking identification and behavior. The second speaker, with an unknown emotional tone, briefly addresses officers, adding a procedural element. The male speaker continues with a mix of neutral and happy tones, humorously illustrating how to verify a real officer's credentials, creating an engaging and slightly playful instructional scenario, all while upbeat background music creates a light and informative atmosphere."
+ },
+ {
+ "video_id": "PI-fbf1783f-3932-4447-97ca-0462c7aab72a",
+ "path": "/workspace/benchmark/video/PI-fbf1783f-3932-4447-97ca-0462c7aab72a.mp4",
+ "answer": " In a professional football match, the player Lamine Yamal, wearing a fluorescent green jersey, demonstrated extraordinary abilities. He skillfully dribbled the ball along the pitch, evading several defenders from the opposing team wearing red and white striped jerseys. His forward charge was halted by a tackle from the defender Alex Berenguer, which caused him to fall to the ground. The referee quickly intervened and showed Berenguer a yellow card for the foul. The video showed the attack from multiple angles, including a replay that highlighted Yamal's skillful footwork before the foul occurred. The game was played in a large stadium filled to capacity, and the cheers of the audience could be heard in the background."
+ },
+ {
+ "video_id": "PI-046f05af-0528-4e7b-962b-980172b5c0f4",
+ "path": "/workspace/benchmark/video/PI-046f05af-0528-4e7b-962b-980172b5c0f4.mp4",
+ "answer": "In the video, a person is sitting in the driver's seat of a car, wearing a bright red sweatshirt printed with the letters \"C-F-M\". This person has long dark hair and is wearing a seat belt. The interior of the car is black. From the Windows, one can see a building with a green facade and multiple Windows. This person looks very uncomfortable and is covering his chest with one hand. Then, they picked up a white glass with a straw and took a sip. After drinking, this person looked more relaxed and began to gesture with his hands. The words \"I thought I was going to die\" appeared on the screen, followed by a smiling emoji, indicating that they had made a humorous or exaggerated response to the previous discomfort. This person continued to speak and gesture, seemingly explaining or narrating something. The scene switches to a person holding a piece of paper. They unfold and read. At the end of the video, that person was still sitting in the driver's seat, looking more relaxed and having a more engaged conversation. In the audio, this person said, \"OK, OK. What happened?\" Am I going to call my mother? Then came a cough, adding a humorous tone to the video."
+ },
+ {
+ "video_id": "PI-4da92aa8-478e-47fe-bcc7-e535df38390f",
+ "path": "/workspace/benchmark/video/PI-4da92aa8-478e-47fe-bcc7-e535df38390f.mp4",
+ "answer": "The beginning of the video is a close-up shot. One hand is holding a red Cheetos Flamin 'Hot chip, with obvious red powder on the fingers, creating a humorous tone. The scene switches to a shoe store, where a person is wearing a chestnut shirt and a black apron with the words \"Moo.shoe\" written on it. It says \"Cleaner\", which is a pink substance used to clean white sports shoes. It is very likely to be a cleaning liquid. The background is a shelf filled with shoe boxes and various cleaning supplies, indicating a professional shoe cleaning environment. Then, the story shifts to a person who is wearing a grey sweatshirt, a black hat with \"LA\" written on it, and holding a bag of Chiduo XXTRA Flamin Hot chips in his hand. This person made gestures with hands smeared with red powder, indicating that they had been eating potato chips. The camera switches to the shoe store. Another person in a brown hooded sweatshirt is holding a blue and white sports shoe, carefully inspecting it. The video returns to the person in the grey sweatshirt. He continues to gesture with his hands smeared with red powder and now holds a white sports shoe that is obviously stained with red powder in his hand. At the end of the video, the person in the maroon shirt continues to clean the soiled sports shoes, highlighting the humorous predicament of accidentally getting Chido fire powder on the sports shoes. Text message: \"He can't have just finished cleaning. What kind of brother is he?\" It doesn't matter. He isn't. We have lava because they are in Los Angeles. They are not alone. It adds a layer of humor and relevance to this scene."
+ },
+ {
+ "video_id": "PI-c65ca5c9-87e4-410f-8439-8213d785d3ac",
+ "path": "/workspace/benchmark/video/PI-c65ca5c9-87e4-410f-8439-8213d785d3ac.mp4",
+ "answer": "The video showcases scenes from within the teaching building, highlighting a large number of student activities in the main public areas. The space is illuminated by overhead fluorescent lights and features light-colored walls, sections of brick, and prominent glass exit doors. A unique gold statue on a pedestal serves as the focal point in this bustling environment. Students are constantly in motion, heading toward the exits, with many carrying backpacks, creating a vibrant and lively atmosphere. The camera follows the flow of the crowd, capturing the collective movement. The background sound is dominated by an upbeat hip-hop style, accompanied by the sounds of a large, interactive crowd, including general chatter and expressions of excitement. This combination of visual and auditory elements conveys that the school is primarily in a period of active student movement."
+ },
+ {
+ "video_id": "PI-d34c7467-b7f9-48aa-b552-af1e6d41cf7e",
+ "path": "/workspace/benchmark/video/PI-d34c7467-b7f9-48aa-b552-af1e6d41cf7e.mp4",
+ "answer": "The video features a confident and assertive female speaker who recounts waking up in Colorado after arriving on a private jet, expressing anger and confidence in response to being challenged to a boxing match by someone named Miss Grawny Bki Bosinki. She mocks her challenger's physical weakness, asserting that she or any girl from the Bobhouse could easily win. This confrontational narrative is delivered by a woman with long black hair, seated at a wooden table in a rustic indoor setting. The cozy atmosphere is enhanced by wooden beams on the ceiling, hanging light fixtures, and large windows that allow natural light to flood in. She wears a pink and white plaid jacket over a white top, with several rings adorning her fingers and gold and black designs on her nails. Throughout the video, she uses expressive hand gestures, such as pointing to her eye and raising her hand, while her sunglasses rest on her head. The intense and fast-paced background music complements her aggressive and defiant tone, creating a compelling and dynamic presentation."
+ },
+ {
+ "video_id": "PI-833c6044-fb30-4820-a9b6-15f93a59d01b",
+ "path": "/workspace/benchmark/video/PI-833c6044-fb30-4820-a9b6-15f93a59d01b.mp4",
+ "answer": "In the video, an old woman is sitting comfortably on the bed, wrapped in a light blue blanket. She shared her personal experiences and insights on the importance of understanding politics. She grew up under martial law in the Philippines for 20 years. She emphasized that there was only one president during that period. She described the lack of freedom of speech and the widespread atmosphere of fear during that period, emphasizing the revolution and curfew at that time. This lady emphasized the importance of understanding politics, although the common view is contrary to this. She believes that understanding the situation is crucial for preventing history from repeating itself. Her passionate message encourages the audience to stay clear-headed and open-minded, making sure they know what is happening in their government and the world. She insists that being informed is not only a responsibility but also a necessary condition for a healthy society. Throughout the video, the lady uses gestures to emphasize her point of view, establishing a more engaging personal connection with the audience. Her tone was both thought-provoking and urgent, urging the audience to learn from the past and actively participate in their present and future. The video ends with a strong call for action, encouraging the audience to understand the situation and realize the need to prevent past mistakes from happening again."
+ },
+ {
+ "video_id": "PI-a23bd840-554c-4334-b8e4-c1a8c1512e59",
+ "path": "/workspace/benchmark/video/PI-a23bd840-554c-4334-b8e4-c1a8c1512e59.mp4",
+ "answer": "The video provides a comprehensive visual and educational explanation of the derma rolling process. It begins with a close-up view of a bald head featuring a mohawk hairstyle against a blue tiled background, introducing the context. A light blue derma roller is then shown near the neck area, transitioning the focus to a detailed close-up on an orange textured surface, highlighting the tool's structure and tiny needles. The narrative continues with a close-up of the derma roller's head, emphasizing the arrangement of the microneedles. The video then shifts to a highly magnified view of human skin, showcasing its intricate texture and pores, effectively illustrating how the derma roller creates tiny punctures. Accompanying the visual journey, the male audio track, delivered in a calm and informative tone, explains that these micropunctures are perceived by the body as wounds, activating its natural healing process. This combination of detailed visuals and educational audio provides a clear and thorough explanation of the microneedling procedure, from the tool's appearance to its interaction with the skin's surface."
+ },
+ {
+ "video_id": "PI-eea0fdb2-aa84-4a90-b5c2-7f6d20a9857c",
+ "path": "/workspace/benchmark/video/PI-eea0fdb2-aa84-4a90-b5c2-7f6d20a9857c.mp4",
+ "answer": "This video documents a personal journey that takes an unexpected turn. It begins from a first-person perspective as an individual, wearing a silver jacket and backpack, rides a scooter along a road. The setting appears to be suburban or rural, characterized by bare trees and hills. An overlaid text reads: \"pov : i was scooting to my bfs house & ended up getting hit by a car.\"The accompanying audio features a resolute female voice saying something to the effect of \"I'm going up this mountain to save my marriage.\" The scene then transitions to a static shot of an accident aftermath, showing a white \"Jackson County\" ambulance with flashing lights parked on the roadside. A scooter or bicycle is seen lying on the ground, and a North Carolina driver's license is visible in the foreground. The juxtaposition of the determined audio about \"saving a marriage\" with the visually evident scooter accident creates a darkly humorous or absurd tone, suggesting that this significant, yet unfortunate, event ultimately ended in failure."
+ },
+ {
+ "video_id": "PI-9f4d1ae9-72bc-4679-8c91-55d98c8d043b",
+ "path": "/workspace/benchmark/video/PI-9f4d1ae9-72bc-4679-8c91-55d98c8d043b.mp4",
+ "answer": "The video features a person in a black winter outfit, including a hooded jacket, black pants, and black gloves, sitting on a snowboard in deep snow. They wear reflective goggles with a colorful tint, and their hair is visible beneath the hood. The snowboard, mostly covered in snow, shows a yellow and white design at the front. The person adjusts their position, leaning forward and lying down on the snowboard, preparing for a descent. The camera shifts to a first-person perspective, showing the person's feet in black snow boots strapped into the bindings. They lie face down, pushing off with their hands to start moving down the slope. The scene transitions to the person kneeling at the top of a snowy hill, with a vast mountainous landscape in the background, covered in snow and trees. They adjust their hair and look around, taking in the scenic view before sitting back on the snowboard, ready to continue. Meanwhile, the background music (BGM) features a female singer with an energetic and catchy vocal style, delivering repetitive and upbeat lyrics that create a lively and engaging atmosphere, perfectly complementing the dynamic and adventurous spirit of the winter sports scene. The overall experience combines serene beauty with thrilling action, capturing the essence of a snowboarding adventure."
+ },
+ {
+ "video_id": "PI-b621c401-368e-4077-b6ba-e64228b35343",
+ "path": "/workspace/benchmark/video/PI-b621c401-368e-4077-b6ba-e64228b35343.mp4",
+ "answer": "The video captures a poignant family moment where a young child, dressed in a beige sweatshirt and blue jeans, sits on a grey armchair, coming to terms with the arrival of a new sibling. Beside her, a woman with long blonde hair, wearing a pink shirt, holds a newborn baby dressed in a white jumpsuit with 'Little Brother' written on it. The scene is minimalist, with a grey textured blanket in the background, emphasizing the intimate setting. The text overlay reads, 'POV: When you realize you are no longer an only child,' highlighting the emotional transformation the child is experiencing. As the woman bends down to interact with both children, the young girl's expression shifts from curiosity to obvious sadness, eventually leading to her crying openly. Meanwhile, the audio features a female speaker hesitating to reply 'no,' followed by another speaker, possibly a woman, letting out a brief 'ah,' expressing uncertain emotions. The ambient music maintains a neutral rhythm, creating an unobtrusive atmosphere that complements the raw and genuine emotions displayed in the scene."
+ },
+ {
+ "video_id": "PI-c266e0cb-d4cf-4622-b644-6446119d176f",
+ "path": "/workspace/benchmark/video/PI-c266e0cb-d4cf-4622-b644-6446119d176f.mp4",
+ "answer": "In the video, three people are dancing passionately in a spacious and bright room, with a chandelier and a staircase in the background, suggesting a modern open-plan interior. The central figure is wearing a white T-shirt decorated with a vibrant mouth and the words \"Moral Chameleon\", paired with eye-catching pink, blue, yellow and black tie-dyed trousers. Beside this person, there are two other people: one is wearing a pure beige T-shirt and khaki pants, and the other is wearing a black and white striped T-shirt and beige pants. They performed synchronized dance movements, moving their arms and legs along with the cheerful rhythm. The audio includes a lively background music. \"I ain't finished it abi\" implies an interesting or casual background. The overall atmosphere was cheerful and energetic, and the dancers were completely immersed in their dance. The core theme of the video is the happy and spontaneous dance among friends, capturing the moments of joy and friendship."
+ },
+ {
+ "video_id": "PI-4cf5e861-1b58-42e7-87c6-be815eb5ac72",
+ "path": "/workspace/benchmark/video/PI-4cf5e861-1b58-42e7-87c6-be815eb5ac72.mp4",
+ "answer": "The video takes place in a park, with a modern urban landscape as the background. Under the clear blue sky, there are tall, uniquely shaped buildings. At the beginning of the scene, two women stand on either side of a small, curly-haired brown dog. A woman wearing a beige sweater, a blue skirt, black skinny pants and black boots stands on the left, while another woman wearing a black and white striped shirt and white pants stands on the right, holding the dog's leash. The text on the video reads: \"Can I touch your dog?\", a question often asked by passers-by. The scene then switches to a close-up in which the woman in the beige sweater holds the dog. The text overlay changes to \"What is his race?\", followed by \"Race is used for humans\" in yellow letters and \"For example, I'm white\" in white letters. The video emphasizes the difference between \"race\" and \"breed\": the former usually describes human populations, while the latter describes dogs. The final text overlay shows \"Breed is used for dogs\" in yellow letters and \"He is a Yorkipoo\" in white letters, clarifying the correct terms and giving an example of a dog breed. The video aims to educate the audience on using appropriate language when discussing a dog's lineage, emphasizing that the correct word for dogs is \"breed\" rather than \"race\"."
+ },
+ {
+ "video_id": "PI-c2b8b91a-cee7-4977-8958-591a34f69521",
+ "path": "/workspace/benchmark/video/PI-c2b8b91a-cee7-4977-8958-591a34f69521.mp4",
+ "answer": "This video depicts a scene of a person and a cat in a modern interior. The person wears a light grey jacket, a black hat and a gold necklace. At first, he is seen sitting against a plain background. The camera then switches to a close-up of a fluffy light brown cat with its blue eyes half-closed, seemingly in a state of annoyance or contemplation. A line in the audio, \"Your silence is my deal,\" seems to be a humorous or satirical statement, perhaps in response to the cat's indifferent behavior. The video then returns to the person, who is now squatting beside the cat sitting on the wooden floor. He places one hand on the cat's back, but the cat maintains the same posture and still looks unhappy. The background shows a modern interior with recessed lighting. At the end of the video, the person and the cat remain in the same positions. The overall theme of the video is the humorous interaction between humans and cats, highlighting the cat's seemingly indifferent response to human presence."
+ },
+ {
+ "video_id": "PI-c7b0831f-75b5-415c-a25d-9cb20176cc24",
+ "path": "/workspace/benchmark/video/PI-c7b0831f-75b5-415c-a25d-9cb20176cc24.mp4",
+ "answer": "The video and audio combined present an interesting narrative. The audio features a casual, conversational clip where two male speakers, maintaining a calm and informative tone, share what seems to be personal anecdotes or expertise. Meanwhile, the video visually depicts a character in a green outfit and pink hat running along a path marked with pink and white stripes. This path leads towards an orange tunnel bearing a sign that reads \"ONLY I LS REAL\" in red digital letters. The character, moving with urgency or determination, approaches the tunnel but then slows and stops just before entering. A close-up reveals a blue beanie and mustache, adding a comical touch. After a brief pause, the character turns and runs back down the path, away from the tunnel, with the movement being smooth and continuous. The video maintains a consistent dark background, emphasizing the path and tunnel, and concludes with the character still running away, keeping the same visual elements intact. Together, the audio and video create a layered experience, blending conversational storytelling with a visually dynamic scene."
+ },
+ {
+ "video_id": "PI-49ed2fc4-b44a-4aa9-bdce-cab6049113f3",
+ "path": "/workspace/benchmark/video/PI-49ed2fc4-b44a-4aa9-bdce-cab6049113f3.mp4",
+ "answer": "In this outdoor scene, a young woman with long brown hair enjoys a meal under a bright blue sky. She is seated at a table, eating a large steak topped with melted cheese, served on a wooden board. While cutting and eating her food, she shares a personal anecdote with the viewer. She explains that her mother recently gave her some advice regarding a previous video where she discussed not being wealthy. Her mother suggested that she should have proudly declared herself 'super rich,' defining richness by the valuable relationships she has with friends and family. Amused by this perspective, the woman playfully announces that she has changed her mind and now considers herself super rich. She continues to savor her meal, expressing great satisfaction and commenting that the steak is so delicious it is destined to become one of her favorite, go-to dishes."
+ },
+ {
+ "video_id": "PI-2260a43a-534a-4a3e-8aa6-32c102e704c9",
+ "path": "/workspace/benchmark/video/PI-2260a43a-534a-4a3e-8aa6-32c102e704c9.mp4",
+ "answer": "The video showcases a heartwarming interaction between two women, one in a long white dress with small patterns and the other in a tight pink textured dress, standing back to back outside a house with white walls and glass doors. The text \"looking for a better tall/ short best friend duo\" humorously highlights their height difference as they compare their statures before turning to face each other. The scene is set to a relaxed and pleasant song performed by a female singer, likely serving as background music, which perfectly complements the cozy and friendly atmosphere. The women engage in lively interactions, with the one in white stroking her hair and the other in pink gesturing animatedly, both laughing and spinning around in joyful camaraderie. The backdrop, featuring a hanging plant and a patterned rug, enhances the warm and inviting ambiance, making the video a delightful depiction of friendship and shared enjoyment, all while the soothing melody plays in the background."
+ },
+ {
+ "video_id": "PI-d671dae7-363c-4780-975a-4ae35f7d82d4",
+ "path": "/workspace/benchmark/video/PI-d671dae7-363c-4780-975a-4ae35f7d82d4.mp4",
+ "answer": "In this brief, close-up video, a Black person with dreadlocks speaks directly to the camera. They are positioned in the center of the frame against a simple background of a white wall and a gray ceiling. The speaker, wearing a black tank top, a necklace, and a nose ring, calmly shares a humorous and provocative thought. They state their desire to adopt a white child and raise them within Black culture. To emphasize the point humorously, they proceed to list a series of traditionally Black American names they would give the child, including 'Javonte Durell Johnson Jackson Davis Lamar.' The video is shot in a selfie style with a stable camera, focusing entirely on the speaker's expressive delivery."
+ },
+ {
+ "video_id": "PI-18d08ec6-b577-44b5-9466-03ed9acf1a6b",
+ "path": "/workspace/benchmark/video/PI-18d08ec6-b577-44b5-9466-03ed9acf1a6b.mp4",
+ "answer": "The video captures a dynamic and energetic scene in a gym, initially focusing on the fitness equipment before shifting to a man dressed in a black and white skirt, showcasing his strong figure and muscles as he works out. The man's impressive physique and perfect body curve are on full display, with the gym's spacious area and numerous fitness equipment, where others are also exercising, providing a backdrop to his routine. After completing his workout, the man jumps off the equipment and begins to dance joyfully on the ground, adding a touch of humor to the video. Meanwhile, the accompanying audio features a song sung by a girl, with an upbeat tune and energetic, uplifting lyrics that perfectly complement the video's lively atmosphere. The combination of the man's energetic workout and playful dance, along with the song's invigorating melody and motivational words, creates a vibrant and inspiring viewing experience."
+ },
+ {
+ "video_id": "PI-fb0767f8-2af8-4b17-86b8-5fd8da2b0a12",
+ "path": "/workspace/benchmark/video/PI-fb0767f8-2af8-4b17-86b8-5fd8da2b0a12.mp4",
+ "answer": "This short, humorous video presents a story about a product invention. It features a woman demonstrating a light blue silicone ice tray designed to create cylindrical ice cubes that fit perfectly into a large tumbler. A text overlay claims that the brand Stanley offered $3.5 million for the design. The woman proceeds to place the custom-shaped ice into a purple and black tumbler. The narrative takes a turn as a subsequent caption explains they declined the offer because they \"know what it's really worth.\" This statement is immediately contrasted by a screenshot of an e-commerce dashboard showing minimal success: just $611.66 in revenue from 34 sales in a 30-day period. A man in the foreground reacts to this story with a pensive look. The video concludes with the sarcastic caption, \"They definitely missed out,\" highlighting the vast difference between the claimed value and the actual sales."
+ },
+ {
+ "video_id": "PI-7af5d8e1-c614-47cb-beaf-a4db12047370",
+ "path": "/workspace/benchmark/video/PI-7af5d8e1-c614-47cb-beaf-a4db12047370.mp4",
+ "answer": "The video features a woman in a modern office setting with wooden floors, glass walls, and overhead lighting, delivering a detailed explanation about glute workouts. She is dressed in a black long-sleeve top, green leggings, and white sneakers, standing in front of a whiteboard that lists glute exercises: \"Thrusts, Squats, Hinge, Abduction,\" along with green marker drawings of glute muscles. Using a pointer, she emphasizes different parts of the whiteboard as she explains the importance of incorporating all four exercise variations into a workout routine to achieve well-rounded glute development. She details how thrust movements and glute bridges target the upper glutes, squats focus on the lower glutes and quads, hinging movements benefit the lower glutes and hamstrings, and abduction exercises target the top part of the glutes. Her neutral-toned explanation shifts to a happy tone as she mentions her pre-planned glute program for 2025, encouraging viewers to check her pinned video or access her full glute program via a link in her bio for more detailed guidance."
+ },
+ {
+ "video_id": "PI-9d84ee90-40ef-4c33-b666-4a1c32d21ca4",
+ "path": "/workspace/benchmark/video/PI-9d84ee90-40ef-4c33-b666-4a1c32d21ca4.mp4",
+ "answer": "The video features three individuals dancing energetically in a modern, well-lit room with a zebra-patterned rug, a clothing rack filled with garments, and a white couch in the background. The dancers, dressed in a black tank top with gray jeans, a beige top with gray jeans, and a shimmering silver top with black pants, perform synchronized moves including arm swings, hip movements, and dynamic steps. Their focused and enthusiastic expressions contribute to the lively atmosphere. Meanwhile, the audio involves a male speaker with an ambiguous emotional tone, posing a question or inquiry, though the exact meaning is unclear due to recognition issues. The scene suggests an attempt at communication, but the emotions conveyed remain uncertain, adding a layer of intrigue to the vibrant visual setting. The dancers maintain their energy and synchronization throughout, enhancing the modern aesthetic of the room with its white ceiling and recessed lighting."
+ },
+ {
+ "video_id": "PI-50a703e9-7ab1-45bc-af15-da69a2bdd7ab",
+ "path": "/workspace/benchmark/video/PI-50a703e9-7ab1-45bc-af15-da69a2bdd7ab.mp4",
+ "answer": "In the video, a person sits alone in a restaurant, creating an empathetic scene for anyone dining alone. The person wears a black shirt decorated with white patterns, including crosses and skulls, exuding a unique and avant-garde air. They also wear large black headphones, suggesting they might be listening to music or a podcast to entertain themselves. The person drinks from a large cup of dark soda through a straw, taking an occasional small sip. The background showcases a simple, aged ceiling and a window that lets in natural light, enhancing the atmosphere of the setting. Throughout the video, the person makes a series of gestures, such as pointing at the cup, giving a thumbs-up and touching their own face. These gestures, along with their overall manner, convey a sense of self-entertainment and comfort. The text on the video reads \"When you go to have lunch by yourself\", which clearly sets the theme. It humorously captures the essence of dining alone, highlighting its simplicity and tranquility. The video ingeniously makes the point that dining alone can be a very pleasant and liberating experience."
+ },
+ {
+ "video_id": "PI-311a5a4f-abfe-4459-8ed0-55b8b2f260a4",
+ "path": "/workspace/benchmark/video/PI-311a5a4f-abfe-4459-8ed0-55b8b2f260a4.mp4",
+ "answer": "The video depicts a woman in a green hoodie sitting at a table in a room whose beige walls are adorned with hanging green vines and ceiling lights. The woman initially clasps her hands together, then lays them flat on the table, and finally makes a fist with one hand. The text \"Crying but then realizing I can raise my ya ya ya\" appears at the bottom, indicating a shift from emotional distress to a sense of empowerment or fulfillment. The consistent backdrop, including a bed on the left, provides a calm and personal environment for this introspective moment. Accompanying the visuals, the audio track features a girl speaking and making babbling sounds, an ordinary spoken scene that complements the visual narrative of introspection and emotional transition."
+ },
+ {
+ "video_id": "PI-6ddc7202-b812-4ac1-8832-8fb8ff4c090a",
+ "path": "/workspace/benchmark/video/PI-6ddc7202-b812-4ac1-8832-8fb8ff4c090a.mp4",
+ "answer": "The video features a cooking series titled 'Cookin' in Rice Cooker Series,' introduced by a person wearing a green sweatshirt with 'GRACELAND' printed on it. The series is aimed at college students or those without a kitchen, demonstrating the versatility of a small rice cooker in preparing various meals. The video begins with the person sitting at a table with a tray of sliced mushrooms, a measuring cup with a brown liquid, and a small bowl of chopped green onions, while holding up a small white rice cooker adorned with cartoon characters. The accompanying audio is a narration track by a female narrator with a friendly and informative tone, who introduces and promotes the remastered cooking series. She emphasizes the practicality of the rice cooker recipes and highlights the new content in the series, conveying enthusiasm and engagement. The video showcases a variety of dishes prepared using the rice cooker, illustrating how it can help create delicious and convenient meals, making it an ideal resource for those looking to cook in limited spaces."
+ },
+ {
+ "video_id": "PI-5596da01-8d56-42b6-90db-ef8b88f0cd77",
+ "path": "/workspace/benchmark/video/PI-5596da01-8d56-42b6-90db-ef8b88f0cd77.mp4",
+ "answer": "The video features a lively scene outside a restaurant or bar, where a group of people dressed in black uniforms are energetically dancing and holding glowing light sticks, creating a vibrant and festive atmosphere. The entrance to the establishment is adorned with glass doors and neon signs that read 'TRUST ME, YOU CAN DANCE - Tequila,' encouraging patrons to dance and enjoy tequila. One person in the foreground holds a tray with a decorative item resembling a fish, while others in the background dance and cheer. A person wearing a blue and beige jacket with 'RANGER' written on it in red letters enters the scene, dancing energetically and holding a microphone, adding to the dynamic environment. Meanwhile, the audio features an intermittent discussion between a man and a woman about the cultural gap, influence, and personal action. The man emphasizes the need to take decisive steps, while the woman highlights the broader significance of influence that goes beyond music. The dialogue transitions to a more abstract and rhythmic tone, accompanied by medium-speed ambient music, enhancing the overall energetic and joyful mood of the video."
+ },
+ {
+ "video_id": "PI-e93cbecc-3c11-4167-a2ed-fd65e0a513c3",
+ "path": "/workspace/benchmark/video/PI-e93cbecc-3c11-4167-a2ed-fd65e0a513c3.mp4",
+ "answer": "The video presents a striking and dramatic visual of a person engulfed in intense flames, which surround their head and upper body. The flames, dominated by bright orange and yellow hues, create a chaotic and powerful scene. At various points, beams of light emanate from the person's eyes, introducing a surreal element to the imagery. The individual's dreadlocked hair is partially visible through the fire, and the background remains obscured, drawing focus to the central figure. Complementing the visual intensity, the audio features two male speakers maintaining an emotionally neutral tone. One man poses a straightforward question to the other about the nature of his relationship with his father. This simple inquiry lacks apparent emotional intensity or background music, creating a stark contrast to the dynamic and fiery visuals. The interplay of the calm audio and the vibrant, shifting flames enhances the overall mystical and powerful atmosphere of the video."
+ },
+ {
+ "video_id": "PI-96302124-0686-40da-ab48-2c20d6de48f2",
+ "path": "/workspace/benchmark/video/PI-96302124-0686-40da-ab48-2c20d6de48f2.mp4",
+ "answer": "The video humorously depicts a man who finds himself locked out of his room, a situation highlighted by the text overlay 'cave divers when they accidentally forget their keys:'. The man, dressed in a blue beanie, dark jacket, black pants, and carrying a beige shoulder bag, is seen in a hallway facing a white door labeled '2011'. He frantically searches for his keys, patting his pockets and looking around before resorting to using his beanie to cover his eyes in a comedic attempt to 'see' better. The scene then shifts to a close-up of the door's lock, where the man, now wearing gloves, uses a tool to pick the lock while shining a flashlight for better visibility. The background music (BGM) enhances the comedic tone with its light-hearted and humorous instrumental tune. After some effort, he successfully unlocks the door and enters a cluttered living space filled with various items, including a couch, a table with boxes, and a kitchen area. The video concludes with a pair of white sneakers hanging on the door handle as the man walks further into the room, emphasizing the relatable absurdity of forgetting one's keys and the lengths one might go to regain entry."
+ },
+ {
+ "video_id": "PI-83ce7cc1-76c1-47e3-b453-3db0ce4749c6",
+ "path": "/workspace/benchmark/video/PI-83ce7cc1-76c1-47e3-b453-3db0ce4749c6.mp4",
+ "answer": "The video features a person in a room with beige walls and carpet, wearing a light gray oversized hoodie and matching sweatpants. The text 'trying the dallas cowboys cheerleaders kickline...' is displayed at the top of the frames, setting the theme for the video. The individual begins by standing with arms outstretched and then launches into a series of dynamic movements, including spins, kicks, and balancing on one leg. The routine is energetic, with various leg lifts and arm movements that showcase an enthusiastic attempt to replicate the cheerleaders' dance moves. The background remains consistent, featuring a white door and a textured ceiling, ensuring the focus stays on the person's spirited performance. Accompanying the video is a modern and upbeat instrumental BGM track, which enhances the dynamic atmosphere with its rhythmic and melodic elements. The overall atmosphere is one of enthusiasm and dedication to mimicking the iconic cheerleading routine, creating an engaging and lively viewing experience."
+ },
+ {
+ "video_id": "PI-d762f25c-4d1d-43ef-a320-d3bb0c7e38f3",
+ "path": "/workspace/benchmark/video/PI-d762f25c-4d1d-43ef-a320-d3bb0c7e38f3.mp4",
+ "answer": "The video features a person with a short pink buzzcut, styled with dramatic black winged eyeliner and dark lipstick. She wears several facial piercings, including a septum ring, bridge piercing, and lower lip rings, along with large silver hoop earrings and a necklace featuring a colorful pendant. The individual is positioned centrally, facing the camera directly, against a plain, light-colored wall. A doorway is visible in the background, suggesting an indoor setting. Text overlays appear on screen, initially displaying the statement \"there's no such thing as a bad gothic style, every style just doesn't fit everyone,\" which then transitions to the phrase \"TRIBAL GOTH.\" The audio accompaniment includes spoken commentary about individuality and style, followed by an abrupt electronic sound effect and vocalizations."
+ },
+ {
+ "video_id": "PI-7c8ae675-478b-4e05-bddf-b8eb0cd171ac",
+ "path": "/workspace/benchmark/video/PI-7c8ae675-478b-4e05-bddf-b8eb0cd171ac.mp4",
+ "answer": "The video features a man in a blue hoodie humorously discussing the intelligence of his orange female cat, Tetra, compared to an orange male cat. He uses Tetra to test the theory that orange female cats are smarter, but finds the experiment challenging because her role model is Nanaki, an unresponsive orange male cat. The man notes that Nanaki's lack of awareness and balance has rubbed off on Tetra, who now exhibits poor balance and clumsiness, often falling asleep in unstable positions and mimicking Nanaki's dangerous behaviors such as perching on stair railings. The video shows Tetra playing on a cat tree, sleeping on a couch and walking on a railing, highlighting the man's concerns for her safety. He expresses a preference for Tetra to mimic Jojo, another cat described as cute and safe, rather than Nanaki. The backdrop includes traditional artwork and furniture, adding a cozy and welcoming atmosphere to the scene."
+ },
+ {
+ "video_id": "PI-149ae1d6-b0e0-4c01-b968-ae48c9d6b73d",
+ "path": "/workspace/benchmark/video/PI-149ae1d6-b0e0-4c01-b968-ae48c9d6b73d.mp4",
+ "answer": "The video captures a light-hearted and amusing interaction between two individuals seated on a patterned sofa against a light-colored backdrop. The person on the left, dressed in a grey sweatshirt and adorned with several necklaces, engages in animated conversation, making gestures as they discuss which actors should play them in a hypothetical film, with the words \"Dead or alive\" appearing on screen, setting a playful tone. The person on the right, wearing a beige fur coat, black stockings, and holding a piece of paper, initially appears engrossed but joins in the humor, with the conversation escalating to the point where the text \"or...\" appears, adding to the comical atmosphere. The left individual seems preoccupied with their imaginative casting choices, while the right one eventually covers their face with the paper, providing a whimsical and interesting conclusion to their banter. Meanwhile, the actual audio accompanying this video, though lacking obvious dialogue, features human activity sounds and, in its latter half, elderly exclamations that contribute to an overall pleasant and relaxed atmosphere, enhancing the charm of the scene where two people affectionately imagine themselves as characters in a movie, showcasing their creativity and camaraderie."
+ },
+ {
+ "video_id": "PI-0726e126-6e8e-42f9-ab0d-f97e6177513c",
+ "path": "/workspace/benchmark/video/PI-0726e126-6e8e-42f9-ab0d-f97e6177513c.mp4",
+ "answer": "The video begins with a woman sitting in a red convertible, wearing a light purple flower-patterned top and blue jeans, as the text \"Decorate my... POLLY\" appears on the screen, indicating her intention to personalize a toy-like car. She has long red hair and is holding a microphone and a black handbag while sitting in the driver's seat. Later, she is seen in a car with pink seats and a pink steering wheel, driving away, suggesting a change in her car's decoration. The video then transitions to a parking lot scene where she meets Nick from High Street Packaging. They discuss car decoration options, with Nick showing various color choices and explaining the pros and cons of different paint jobs. Moving into a room, Nick uses a computer to display car models in different colors, helping the woman visualize and narrow down her options. As Nick points out the characteristics of each color, the woman looks intently at the computer screen, clearly excited and engaged. A close-up shot shows a hand holding a color sample booklet with a bright pink color labeled \"Glossy Bubble Gum Powder,\" as they continue discussing the color options. The woman's enthusiasm is evident as she points at the screen and discusses the choices with Nick. The video concludes with the woman shaking hands with Nick, indicating that a decision has been made. Complementing the visuals, the audio features a passionate and charming female voice describing her experience of decorating a Polly Pocket car, initially red but later deciding to change the color. She mentions meeting Nick, who guided her through the color selection process, explaining the pros and cons of each option. Originally planning to use purple, she switched to pink after seeing the available choices, highlighting her positive experience with High Street Packaging Company."
+ },
+ {
+ "video_id": "PI-f8e1c99b-9074-44bb-aae3-5c2c288cecef",
+ "path": "/workspace/benchmark/video/PI-f8e1c99b-9074-44bb-aae3-5c2c288cecef.mp4",
+ "answer": "The video features a close-up shot of a young blonde woman lying on a bed, staring directly into the camera. The background is a simple bedroom setting with a grey headboard. A male speaker with an indeterminate emotional tone mentions \"January, February, March, April, May, June, July, August, September, October, November, December\" in a neutral context, accompanied by ambient music with a steady tempo and an unobtrusive atmosphere. Twelve colorful circles – each labeled with a month – appear centered on the screen, overlaid with the subtitle \"i'm sorry\" at the center. Within each corresponding monthly circle, a series of emojis sequentially pop up: Heart, Medal, Peach, Expressionless Face, Pensive Face, Lying Face, Checkered Flag, Tongue, Pill, Baby, and finally Smiling Face with Smiling Eyes for November. The woman's expression shifts from neutral to shock and disbelief, particularly reacting to November's emoji. Ultimately, she stares open-mouthed in utter incredulity."
+ },
+ {
+ "video_id": "PI-88fa71fb-bd02-4b46-9b15-6608f25974c3",
+ "path": "/workspace/benchmark/video/PI-88fa71fb-bd02-4b46-9b15-6608f25974c3.mp4",
+ "answer": "In the video, a woman in a modern living room recalls enjoying onion rings in her father's restaurant as a teenager. Now, as an adult, she has invented a low-carbohydrate alternative: crispy onion rings made from two ingredients. At the beginning of the video, she holds up a golden-brown crispy onion ring, showing its texture and color, then dips it in a cream sauce, highlighting its crispness. The scene transitions to a close-up of a plate filled with the onion rings and a small bowl of dipping sauce. The woman then demonstrates the preparation process. First, she slices an onion with a mandoline slicer. She spreads shredded cheese on a baking tray, places the onion slices on top, and gently brushes them with olive oil. She seasons the onions with spices to make sure they are well coated, then puts the tray into the oven and bakes the onions until they are golden and crispy. At the end of the video, the woman places the finished onion rings on a blue plate, encouraging the audience to enjoy these healthier, lower-carbohydrate rings as a delicious snack or side dish. The overall theme emphasizes a delicious, simple recipe that brings back nostalgic memories."
+ },
+ {
+ "video_id": "PI-11bc1e63-159b-43b1-b43f-f5141187d760",
+ "path": "/workspace/benchmark/video/PI-11bc1e63-159b-43b1-b43f-f5141187d760.mp4",
+ "answer": "The audio track features a female singer expressing a range of emotions, including feeling down, lazy, bored, scared, and ignored, effectively conveying a sense of emotional vulnerability and relatable human experiences. In contrast, the video content humorously captures a person in a white zippered hoodie sitting at a table, enjoying a meal that includes a piece of fried food, while multitasking by holding a phone in their left hand and eating with their right. The background, decorated with items like a \"Hello\" logo and hanging ornaments, creates a comfortable and home-like atmosphere, with stairs and indoor plants further enhancing the setting. The text on the screen playfully suggests that the person is making an important call, referring to them as \"my second emergency contact.\" Despite the audio's somber tone, the video presents a lighthearted scene of multitasking during a casual dinner, juxtaposing the emotional depth of the song with the mundane yet humorous reality of everyday life. This contrast effectively highlights the versatility of human experiences, where emotions can range from profound vulnerability to the simple joys of a casual meal and an important conversation."
+ },
+ {
+ "video_id": "PI-f44f45cd-31c4-4a7d-bffc-c48b8b3c02fc",
+ "path": "/workspace/benchmark/video/PI-f44f45cd-31c4-4a7d-bffc-c48b8b3c02fc.mp4",
+ "answer": "This video captures an intimate moment of a man becoming emotional in a large crowd. The footage focuses on a man with braided hair, wearing a white tank top and a red cardigan, who appears deeply affected by the music playing. As the lyrics of a sad R&B song about regret and lost love appear on the screen, his expression turns to one of profound sadness. He looks down, and at one point, covers his face with his hands as if he is crying. The setting appears to be an outdoor event, possibly a concert, filled with other people. "
+ },
+ {
+ "video_id": "PI-2b4ffc10-b72f-4089-9f16-766e60807030",
+ "path": "/workspace/benchmark/video/PI-2b4ffc10-b72f-4089-9f16-766e60807030.mp4",
+ "answer": "The audio captures the palpable excitement and admiration of two men as they witness a red truck driving on water, with the first speaker expressing awe at the remarkable feat, his voice brimming with disbelief and wonder. This auditory enthusiasm is perfectly mirrored in the video, where a man donning a cowboy hat and a blue life jacket sits in a boat, observing the red and white truck as it defies conventional expectations by gliding smoothly across the calm water under a cloudy sky. The truck, creating large sprays and wakes, showcases its impressive ability to navigate the surface, a sight that leaves both men in the boat - the other wearing a cap - visibly captivated. The boat, adorned with various stickers and an orange fish-shaped float hanging from the ceiling, provides a cozy vantage point for the men to marvel at the spectacle. The scene seamlessly alternates between the exterior view of the truck's aquatic journey and the interior of the boat, where the men's animated conversation underscores their shared amazement. Together, the audio and video create a vivid and engaging narrative, highlighting the unique and captivating moment of the truck's water navigation, and the men's genuine awe at this extraordinary sight."
+ },
+ {
+ "video_id": "PI-6913706e-135b-4b14-97cc-be80c00efe4d",
+ "path": "/workspace/benchmark/video/PI-6913706e-135b-4b14-97cc-be80c00efe4d.mp4",
+ "answer": "The video takes place in a modern kitchen with light-colored cabinets and wooden floors, creating a contemporary and cozy atmosphere. A person with curly hair, dressed in a light blue sweatshirt and grey trousers, stands beside the counter, while another person in a grey sweatshirt and grey sweatpants is seen drinking water from a black cup. The text 'Me everytime she touches me:' is displayed at the top of the screen, hinting at a humorous or exaggerated reaction to being touched. The person in the light blue sweatshirt quickly walks away towards the end of the kitchen, prompting the person in the grey hoodie to remove it, revealing a blue T-shirt with a white pattern on the back. This person then follows the other into the corridor, still holding the grey hoodie in hand. The kitchen counter is cluttered with various items, including a white mug, a bottle of hand sanitizer, and some boxes, adding to the lived-in feel of the space. A large portrait of a woman hangs on the wall above the counter, providing a personal touch. Meanwhile, the audio features a male speaker with an unknown emotional tone, accompanied by background music with a steady tempo and a neutral atmosphere, though the content of the audio is not clearly discernible from the provided text."
+ },
+ {
+ "video_id": "PI-45344642-eb53-4172-b1a7-dd9e9947ce01",
+ "path": "/workspace/benchmark/video/PI-45344642-eb53-4172-b1a7-dd9e9947ce01.mp4",
+ "answer": "There is a warm and intimate moment in the video. A 103-year-old grandma is lying in bed, covered with a soft pink blanket. The background includes a pillow with floral patterns, adding a touch of warmth and comfort to the environment. The text at the top of the video is covered with \"My 103-year-old grandma's reaction to French fries\", which indicates that the video will show grandma's reaction to French fries. The audio recorded a gentle conversation. Someone asked the grandmother, \"Grandma, do you want French fries?\" Her answer was very simple: \"Yes.\" The conversation continued and the person asked, \"Do you like French fries?\" Grandmother affirmed her love for them. Our conversation delved deeply into her preferences. For instance, \"When I hold the map plug, which kind of French fries do you like?\" Grandma expressed her love for all kinds of French fries. She said, \"You like all kinds of fries. I like French fries.\" This video ended with a light-hearted laugh of \"hahaha\", adding a touch of joy and humor to the scene. The overall theme of the video is warmth, love and simple joys in life, highlighting grandma's lasting passion for French fries and the deep interaction between her and her family."
+ },
+ {
+ "video_id": "PI-d68fdff4-a629-49f7-8f4d-ad854c967e1d",
+ "path": "/workspace/benchmark/video/PI-d68fdff4-a629-49f7-8f4d-ad854c967e1d.mp4",
+ "answer": "The video features two people performing a dance in an indoor setting. The background consists of light-colored walls with horizontal black and blue stripes. Both dancers wear coordinated outfits primarily made up of bright pink and reflective silver elements, including mini skirts, tops, and knee-high boots. Their movements are synchronized and fluid, characterized by a series of steps, arm gestures, and body rotations. The camera maintains a static eye-level perspective throughout the video, allowing viewers to appreciate the dancers' performance without any cuts or angle changes. The accompanying audio is an upbeat pop song featuring a clear female voice, providing a vibrant background to the visual content. The lighting is consistent and bright, uniformly illuminating the scene. As they dance, the focus remains on their movements and expressions, creating a cohesive performance both visually and in terms of dynamics."
+ },
+ {
+ "video_id": "PI-5fbee7fb-4339-426a-93ea-e95adf397320",
+ "path": "/workspace/benchmark/video/PI-5fbee7fb-4339-426a-93ea-e95adf397320.mp4",
+ "answer": "In the video, two people are sitting side by side in a well-lit modern room decorated with plants and with a window in the background. Above them, there is a series of circular numbers and emoji ICONS, with \"My 2025\" marked at the top. These ICONS represent different months of the year, and each month has a specific emoji: January, February, March, April, May, June, July, August, September, October, November and December. The text \"Play tricks on my boyfriend with this filter\" is displayed at the bottom of the screen, indicating that this video is a prank and uses a filter that predicts the future. These people seemed to have responded to the predictions of the filter, which indicated that from engagement to marriage, to multiple pregnancies, and ultimately death. The video continues with the same setting and background, maintaining a humorous and prank-oriented theme. Individuals' responses to filter predictions add an extra layer of entertainment as they participate in the interesting and exaggerated timelines presented by the filters. This video effectively captures the essence of a light-hearted prank, showcasing the couple's entertainment and filter's sense of humor in predicting the future."
+ },
+ {
+ "video_id": "PI-ae8dfc44-f1f2-4dc6-97a1-2bda2334d502",
+ "path": "/workspace/benchmark/video/PI-ae8dfc44-f1f2-4dc6-97a1-2bda2334d502.mp4",
+ "answer": "The video takes place at a glamorous event, likely the Grammys, featuring a red carpet under a white tent. The scene opens with a man in a dark suit standing on the red carpet, surrounded by other attendees and event staff. The text overlay reads, 'Kai is what the Grammys need,' suggesting a humorous context. The man in the suit then performs an unexpected and energetic dance move, bending forward and moving his arms and legs in a lively manner. He eventually falls to the ground, continuing to dance and roll on the red carpet, much to the amusement of the onlookers. The background shows other attendees, some in formal attire, watching and reacting to the performance. Meanwhile, the audio features a male speaker with a lighthearted and amused tone, engaging in casual conversation and reflecting on a situation where he ended up amidst a crowd. The atmosphere suggests a relaxed and jovial interaction, accompanied by laughter, with no background music or additional auditory elements. The video captures a moment of spontaneous fun and entertainment at the event, highlighting the unexpected and joyful nature of the man's actions."
+ },
+ {
+ "video_id": "PI-3a137762-91a6-46c1-8dd9-7a6ba55f111f",
+ "path": "/workspace/benchmark/video/PI-3a137762-91a6-46c1-8dd9-7a6ba55f111f.mp4",
+ "answer": "The video is set in a classroom or lecture hall, where a large projection screen displays a video of a person with long blonde hair singing into a microphone, accompanied by a guitarist. The text overlay on the video reads, 'college is so serious' okay... ', with an ellipsis indicating a sarcastic or ironic tone. In the foreground, a person wearing blue jeans is seated, watching the projection, while to the right, an instructor or presenter is seated behind a wooden podium, covering their face with their hands, possibly indicating laughter or disbelief. This aligns with the humorous undertone of the text overlay. The room features a modern design with a patterned carpet and a small table with an orange chair in front of the screen. The background music track has a humorous undertone, likely containing laughter or a playful tone, reflecting a light-hearted and humorous mood. The video highlights the contrast between the serious atmosphere of college and the lighthearted or entertaining content being shown, creating a humorous juxtaposition."
+ },
+ {
+ "video_id": "PI-c63884a4-cfe6-43d6-8623-a84057b1e8df",
+ "path": "/workspace/benchmark/video/PI-c63884a4-cfe6-43d6-8623-a84057b1e8df.mp4",
+ "answer": "The video takes place in a modern, well-lit kitchen, which features white cabinets, ceiling fans and a kitchen island with bar stools. The protagonists are a woman with blonde hair wearing a light grey T-shirt and leggings, and a man wearing white clothes and a black hat. The woman was holding a black bowl filled with food and eating with a fork. The text on the video shows, \"Teasing my husband with 'Cooking with kya'\", indicating that this is a humorous prank. In the background, at first, the man can be seen sitting on the sofa, concentrating on his mobile phone. Suddenly, he playfully slipped behind the woman, picked her up from the ground and made her spin around. The lady looked very surprised but also very happy. She was smiling while holding the bowl. The man continued to walk around the kitchen with her, and the two shared a relaxing and pleasant moment. There are some witty jokes in the audio. The man said, \"Cook with Maria, cook with Maria.\" Then, while spinning her around, he said, \"You can't take the bus today. No way.\" This video captures the cheerful and amusing interactions between the couple, highlighting their pleasant relationship and humorous pranks."
+ },
+ {
+ "video_id": "PI-5dc273d2-cdd5-4bdd-9b37-06e1d7a261bf",
+ "path": "/workspace/benchmark/video/PI-5dc273d2-cdd5-4bdd-9b37-06e1d7a261bf.mp4",
+ "answer": "The audio and video content complement each other to create a captivating and mysterious scene. In the audio, a male speaker, speaking in a calm and curious tone, describes his observation of a semi-circle of lights flickering on and off while working on a job and looking out at the ocean from the deck. His calm demeanor contrasts with the intriguing phenomenon he witnesses, sparking a sense of curiosity in the listener. The video then visually captures this moment, beginning with a nighttime view of a dark sky over a beach, framed by two thick ropes hanging vertically on either side of the frame. The ocean is visible in the background with gentle waves rolling in, while the beach remains mostly dark, with some faint lights in the distance. The focal point of the video is the semi-circular pattern of flickering lights in the sky, which seems to be moving or flickering on and off, mirroring the speaker's description in the audio. These lights create a mysterious and intriguing atmosphere, enhancing the sense of curiosity and wonder. The video maintains the same framing throughout, with the ocean and its gentle waves consistently visible in the background, while the semi-circular pattern of flickering lights continues to be a focal point, adding to the enigmatic ambiance. Together, the audio and video content effectively convey the speaker's observation and the mysterious atmosphere of the nighttime beach scene, inviting viewers to share in his sense of curiosity and wonder."
+ },
+ {
+ "video_id": "PI-3502ef25-4004-4d88-957c-2141c68c0532",
+ "path": "/workspace/benchmark/video/PI-3502ef25-4004-4d88-957c-2141c68c0532.mp4",
+ "answer": "In the video, a person with pink curly hair is sitting at a table. The walls of the room are beige and there are blinds on the Windows. This person is holding a black and white cat. The cat is wearing black and white clothes and has a black ribbon around its neck. This person gently held the cat's head with both hands and slightly adjusted their grip throughout the video. This cat looks directly at the camera with its big green eyes, appearing calm and composed. This person's nails are painted light and she wears rings on her fingers. The interaction between humans and cats is gentle and affectionate. Occasionally, humans will stroke the cat's head. The overall atmosphere of the video is serene and gentle, focusing on the connection between humans and pets. This person seems to be speaking or singing softly, as her mouth moves slightly in some scenes, conveying a sense of warmth and connection. In the audio of the video, there is a soothing voice-over: \"Little cat, you are full of wonder and awe, not about how you are now or in the past, but about how you will be in the future.\" You are strong, tenacious and indomitable. Your mind is filled with consistency and patience, and you firmly hold on to the promise of a beautiful tomorrow. This information enhances the theme of hope and resilience, making the video not only showcase the bond between humans and pets, but also an inspiring and encouraging story."
+ },
+ {
+ "video_id": "PI-86f62975-f818-4f7e-ac8c-b18dfb6bb8ec",
+ "path": "/workspace/benchmark/video/PI-86f62975-f818-4f7e-ac8c-b18dfb6bb8ec.mp4",
+ "answer": "The video features a person sitting at a table in a cozy, festive room, wearing a gray hoodie over a light-colored shirt, with a large stack of US dollar bills in front of them. The background includes a television displaying a scene with a person in a pink outfit and a brightly decorated Christmas tree adorned with colorful lights and ornaments, creating a warm, holiday atmosphere. The individual gestures towards the camera with a middle finger before proceeding to count and display the money, fanning out the bills to emphasize their amount. Meanwhile, the audio is a rap track featuring a male artist with a confident and assertive tone, delivering lyrics about resilience, survival, and street life challenges. The overall scene combines casual defiance with a display of wealth, set against a backdrop of holiday cheer, mirroring the themes of resilience and survival highlighted in the rap lyrics."
+ },
+ {
+ "video_id": "PI-30b8a93c-9882-48d6-b828-5809601b5125",
+ "path": "/workspace/benchmark/video/PI-30b8a93c-9882-48d6-b828-5809601b5125.mp4",
+ "answer": "This video is called \"Things We Should Know...\" Part 16, focusing on repairing the damaged skirting board. The scene is set in a room with light-colored walls and light-colored wooden floors. This tutorial begins with a close-up of a hand scooping out grey multi-purpose putty from a can with a wooden stick. \"Things We Should Know...\" Then there is the red \"Part 16\", and \"Repair the damaged skirting board...\" The scene transitions to a corner of a damaged skirting board. A hand with dark red nail polish applied putty to the damaged area with a putty knife to fill the gaps and scratches. The narrator explained the process, emphasizing the need for a sand block, multi-purpose putty and a two-inch putty knife. Then hold a sand block in hand, ready to smooth the putty after it dries. Mix the putty on the cardboard, apply it to the skirting board and cover the damaged areas. Use sand blocks by hand to smooth out the putty, focusing on the curved parts of the skirting board to achieve a seamless completion. At the end of the video, the skirting board looked repaired and smooth, ready for cleaning and repainting. The last frame encourages viewers to comment on what they want to report next and continue to learn more, and displays the text \"Find me on IG@therenegadehome\". The overall tone is guiding and beneficial, leading the audience through the maintenance process."
+ },
+ {
+ "video_id": "PI-27b1643f-b485-4831-b6f1-5d9aaa2e314b",
+ "path": "/workspace/benchmark/video/PI-27b1643f-b485-4831-b6f1-5d9aaa2e314b.mp4",
+ "answer": "The video shows a man in a green long-sleeved shirt standing in a room with a neutral-colored wall and a door in the background, with the text \"you have a girl best friend\" displayed at the top of the frame. Initially, the man was seen eating from the container with a spoon. As the video progresses, he stops eating and begins vividly gesturing with his hands, moving slightly around the room, touching his hair and looking around, suggesting that he is having a humorous conversation or reacting to a playful tease. The audio, meanwhile, reveals a conversation between three people-a lady and two boys-in a neutral tone and without obvious emotional color, suggesting an ordinary conversation. The whole scene is consistent with minimal changes in the background, focusing on the expressive response of the person, which may correspond to the dialogue heard in the audio."
+ },
+ {
+ "video_id": "PI-d39f53e4-e73b-4e49-b09e-e1748c9298af",
+ "path": "/workspace/benchmark/video/PI-d39f53e4-e73b-4e49-b09e-e1748c9298af.mp4",
+ "answer": "This video shows a close-up shot of a refrigerator, which contains several boxes of cream-colored yellow ice cream. The main focus is on a specific container, where a hand holding a metal ice cream spoon appears and begins to scoop in thick and smooth ice cream. The spoon moves in a circular motion, gradually collecting a portion and revealing some small black dots in the cream texture. The background is consistent with the frosted surface of the entire visible freezer. In the audio, a voice mentioned, \"When I put this heifer into the frame, the farmer told me that her front foot hurt very much but has healed, but this seems to indicate that this is not the case.\" Welcome to Nate with hooves. This conversation seems to have nothing to do with visual content and might indicate a separate context or a podcast playing in the background. The video highlights the consistency of the cream in the ice cream. Due to its thickness, it was a bit difficult to scoop up. Eventually, a portion was successfully scooped into the spoon and it was ready. The core theme of the video is the detailed process of scooping ice cream, highlighting its texture and the effort required to extract a portion."
+ },
+ {
+ "video_id": "PI-985a55a8-b905-47fd-a23f-9b746da271b5",
+ "path": "/workspace/benchmark/video/PI-985a55a8-b905-47fd-a23f-9b746da271b5.mp4",
+ "answer": "The video opens with a woman standing in a modern, well-lit room featuring exposed ceiling pipes and a chandelier, with the text 'January 2024' displayed at the top. She has long, wavy purple hair and is dressed in a green crop top and pink shorts. The woman strikes various poses, moving her arms and shifting her stance, against a backdrop that includes a green sofa, a coffee table with books and a plant, and large windows with dark curtains. The scene then transitions to a different setting marked by the text 'January 2025' at the top. Here, the woman has long, wavy blonde hair and is wearing a blue long-sleeved top and a white skirt. She poses confidently, placing one hand on her hip and making a peace sign with the other. The backdrop features large windows offering an outdoor view, a zebra-patterned rug, and a TV stand with decorative items. Meanwhile, the audio features a male speaker conveying a reflective and emotional tone, expressing anticipation and hope for a future reunion. He speaks about imagining the moment of being seen again, evoking a sense of longing and sentimentality. The ambient music has a gentle tempo, creating a calm and introspective atmosphere that complements the visual narrative."
+ },
+ {
+ "video_id": "PI-18fe92eb-e878-440b-9396-cb57e461a525",
+ "path": "/workspace/benchmark/video/PI-18fe92eb-e878-440b-9396-cb57e461a525.mp4",
+ "answer": "The video shows a spectacular performance on a high-tech stage. The story begins with a large, fascinated crowd looking up at a futuristic doll figure with a metallic body and red accent, suspended in the air and surrounded by numerous cables, performing acrobatics. Viewers record fascinating displays on their mobile phones. As the performance progresses, the figure continues its aerial dance to engage the audience. The climax features characters stretching huge fiery wings, creating a dramatic and intense visual effect. The metal body shimmers in red, enhancing the fiery effect. The video tied up above the cheering crowd with shadow discs rotating, the flames intensifying and illuminating the entire stage and auditorium, leaving a lasting impression."
+ },
+ {
+ "video_id": "PI-9bd97456-4b5f-4595-832d-0007be621efa",
+ "path": "/workspace/benchmark/video/PI-9bd97456-4b5f-4595-832d-0007be621efa.mp4",
+ "answer": "The video shows a close-up of a person's hand typing on a unique transparent keyboard with a glittering silver background. The original keyboard was transparent. When people typed, the blue backlight would illuminate the keyboard. The text overlay occurs as follows: Why did you give me this keyboard? Refund me immediately! But you won for zero dollars. How did I win this crystal keyboard? Because you clicked on the card of this video! Wow, it's so simple! Together with the downward arrow icon, encourage the audience to click for more information. The overall appearance is charming and eye-catching, emphasizing the unique design of the keyboard and its lighting function. Meanwhile,A man and a woman engage in a conversation where the man expresses surprise and curiosity about receiving a crystal keyboard, while the woman explains that he won it for free by interacting with a video card. The man reacts with amazement, and both speakers exhibit neutral to mildly positive emotions. The dialogue reflects a casual and lighthearted exchange."
+ },
+ {
+ "video_id": "PI-213eb9ab-ca52-4845-866c-102fcd671ea7",
+ "path": "/workspace/benchmark/video/PI-213eb9ab-ca52-4845-866c-102fcd671ea7.mp4",
+ "answer": "The video shows a woman on a ski lift, wearing a blue jacket, black gloves, and a blue helmet with reflective goggles, against a picturesque landscape of snow-capped mountains dotted with trees under a cloudy sky. In the video, the lady's voice conveys a confused and frustrated tone, telling her confusing experience while skiing. She described receiving strange glances from others who warned her of the wind at the top of the mountain, suggesting that her short stature might have caused her to fly away from the mountain. They even advised her to stick to beginner skiing, although she assured she had skied in windy conditions before. The central message of her narrative is her frustration of being undervalued and undervalued because of her size, highlighting the challenges small skiers face in windy environments."
+ },
+ {
+ "video_id": "PI-73da758e-c477-404c-b859-de5f35697e4c",
+ "path": "/workspace/benchmark/video/PI-73da758e-c477-404c-b859-de5f35697e4c.mp4",
+ "answer": "The video features two individuals in a cozy and casual room, creating a warm and inviting atmosphere. The person on the left is dressed in a white t-shirt with a large red tongue and lips graphic, reminiscent of the Rolling Stones logo, paired with pink plaid pajama pants, while the person on the right wears a beige hoodie and light-colored sweatpants. The room is simply decorated with a wooden dresser on the left side, holding various items including a water bottle and some containers, and a black coat hanging on the door. A poster on the door reads 'Sweet Child' and 'MOM'S DAY' with colorful drawings, adding a personal touch to the setting. The ceiling light casts a rainbow-like reflection on the wall, enhancing the vibrant feel of the scene. Initially, both individuals stand still, but soon the person on the left bends down and exits the frame, leaving the person on the right standing alone. The person on the left quickly re-enters, and they begin a synchronized dance routine, moving their arms and legs rhythmically. The dance continues with the person on the left making more expressive hand gestures, while the person on the right follows along with coordinated movements, capturing a playful and energetic interaction between the two. Meanwhile, the BGM track features a male singer with an energetic and upbeat tone, complementing the lively rhythm of the dance and enhancing the overall joyful and enthusiastic mood of the video."
+ },
+ {
+ "video_id": "PI-30145fc8-3c83-4a22-972a-b0aa123f98f8",
+ "path": "/workspace/benchmark/video/PI-30145fc8-3c83-4a22-972a-b0aa123f98f8.mp4",
+ "answer": "The video begins with a modern store, on which there is a sign saying \"JILL PLATNER\" on the brick wall. A woman with long brown hair, wearing a green coat, white trousers and a black jacket, is sitting on a wooden stool. Next to her is a man wearing a checked shirt and khaki pants. One hand held a chocolate-filled pastry and reached out to the lady, who looked at it with interest. There is a voice in the audio saying \"traffic movement\". Considering the context, this seems to be an interesting or ironic comment. When the pastry approached her face, the lady seemed to be immersed in a relaxing moment. Then, the man in the checked shirt stood up and walked away, leaving the woman sitting alone on the bench. The scene switches to an outdoor setting at night. A man in a checked shirt is walking on the sidewalk with a white napkin in his hand. The background music includes Sam Smith's song \"Unholy\", adding a contemporary and lively atmosphere to the scene. Another man, wearing a white T-shirt with the word \"Thailand\" written on the back, dark trousers and a hat, was walking in front. At the end of the video, a woman in a white coat walked towards the camera, while a man in a white T-shirt continued to walk along the sidewalk. The overall theme of the video seems to capture the casual and humorous interactions among friends, highlighting the light-hearted and comical moments in both indoor and urban environments."
+ },
+ {
+ "video_id": "PI-268ddf93-057d-4067-b651-1d669e6e0c0e",
+ "path": "/workspace/benchmark/video/PI-268ddf93-057d-4067-b651-1d669e6e0c0e.mp4",
+ "answer": "The audio content features a male speaker delving into themes of regret, self-reflection, and emotional turmoil, particularly focusing on the complexities of moving on from past relationships, with lyrics mentioning covering a song by The Arctic Monkeys, crawling back to someone, and contemplating calling after a few drinks. The video complements this narrative by showcasing a person performing the song, with lyrics displayed on the screen. It begins playfully with the text 'Hozier covering The Arctic Monkeys??? I BET IT'S MID' at the top, suggesting a critical or humorous take on the cover. As the performance unfolds, the artist transitions from a light blue hoodie to a dark brown sweater and blue jeans, performing in a room with changing lighting that shifts from purple to greenish hues, creating varying moods. The lyrics, such as 'crawlin' back to you' and 'ever thought of calling when you've had a few? ', are emphasized with dramatic gestures, while moments like the 'pain interlude' and 'maybe I'm too... ' suggest introspection and contemplation. The performer's expressions range from dramatic flair to quiet reflection, culminating in a resolution indicated by the text 'now, I've thought it through,' as they place a hand on their chest. Together, the audio and video capture the intricate dance of emotions and the struggle to let go of past loves, blending music, lyrics, and visual storytelling to convey a poignant message."
+ },
+ {
+ "video_id": "PI-1d9355d2-f635-4d95-8937-053280cbb9a8",
+ "path": "/workspace/benchmark/video/PI-1d9355d2-f635-4d95-8937-053280cbb9a8.mp4",
+ "answer": "The video features a person sitting in a room with a patterned curtain in the background, wearing a black t-shirt with a graphic design. The text 'When the accent accidently slips out' is displayed, highlighting a humorous or relatable moment where the person's accent unintentionally becomes prominent. Initially, the person sits with their arms resting on a surface, occasionally moving their hands and adjusting their posture. As the video progresses, the person covers their face with both hands, possibly reacting in embarrassment or surprise, and then lowers their hands slightly while still covering part of their face, adding a touch of humor to the everyday scenario."
+ },
+ {
+ "video_id": "PI-7bc59470-2759-4f00-8151-8fe58babca9a",
+ "path": "/workspace/benchmark/video/PI-7bc59470-2759-4f00-8151-8fe58babca9a.mp4",
+ "answer": "The video features a man in a dark suit and yellow tie speaking at a podium adorned with the seal of the President of the United States, set against a white, ornate wall. The text on the screen initially reads: 'Trump spars with Maine Gov. Janet Mills over enforcement of trans athlete order. The governor of Maine?' The man discusses compliance with federal law, emphasizing that non-compliance will result in the withdrawal of federal funding. He states firmly, 'You better do it. You better do it because you're not going to get any federal funding at all if you don't. And by the way, your population...' The scene transitions to a wider shot of a room filled with people seated at round tables, with the American flag and another flag in the background. The man continues, noting that despite the state's somewhat liberal nature, the population does not support men participating in women's sports, and thus, compliance is necessary to avoid losing federal funding. The text on the screen reads: 'Even though it's somewhat liberal, although I did very well there, your population doesn't want men playing in women's sports, so you better comply because otherwise you're not getting any federal funding.' He reiterates the consequence of non-compliance, stating, 'you better comply because otherwise you're not getting any, any federal funding.' The text changes to 'Every state - good.' The man then asserts, 'I'll see you in court. I look forward to that,' and concludes with a smile and the playful remark, 'That should be a really easy one, huh?' repeated multiple times."
+ },
+ {
+ "video_id": "PI-c27a0de1-1bda-48b2-b8b7-1a0a1dba0659",
+ "path": "/workspace/benchmark/video/PI-c27a0de1-1bda-48b2-b8b7-1a0a1dba0659.mp4",
+ "answer": "This short comedic video depicts a common social media trend. It opens with a close-up of a young woman confidently addressing the camera about trying an 'inverted filter.' She states that she doesn't understand the big deal and expects to look the same. The text overlay reads, 'pov: u try the inverted filter for the first time.' Suddenly, a filter is applied that comically distorts her face, making her nose and mouth appear disproportionately large. The scene then cuts to a wider shot of her standing with her back to the camera, looking out her bedroom window in apparent despair."
+ },
+ {
+ "video_id": "PI-2fb11ebf-d12b-4023-91e7-c1b826fd4242",
+ "path": "/workspace/benchmark/video/PI-2fb11ebf-d12b-4023-91e7-c1b826fd4242.mp4",
+ "answer": "The video takes place in a classroom setting, beginning with a close-up of a person holding a large, red gift wrapped in festive Santa Claus-themed paper, setting a warm and celebratory tone. The text 'Be nice and remember that's someone baby at home' is overlaid, emphasizing a message of kindness and thoughtfulness. The camera pans around the classroom, revealing blue chairs, desks, and a tiled floor with blue squares, creating a familiar and cozy environment. The person with the gift approaches another individual sitting at a desk, who is wearing a black shirt and jeans. The gift is placed on the desk, and the seated person starts to unwrap it, revealing various items, including a plush toy and other small objects. In the background, a gentle and soothing music plays, enhancing the heartfelt atmosphere. A dialogue between a man and a woman can be heard, with the woman expressing gratitude and excitement, saying 'thank you' and 'I like it' among other words, indicating her appreciation for the gift. The video concludes with the seated person holding up a colorful box with a yellow and blue plush toy on top, showing it to the camera, while the overlaid text remains a constant reminder of the importance of being kind, making the entire scene both touching and meaningful."
+ },
+ {
+ "video_id": "PI-0d6db6db-2918-4bb9-9e54-ed15d68a38f2",
+ "path": "/workspace/benchmark/video/PI-0d6db6db-2918-4bb9-9e54-ed15d68a38f2.mp4",
+ "answer": "While the audio track presents a song by a female singer, brimming with happiness and affection, reminiscing about a cherished night and expressing a longing for closeness to someone, the accompanying video offers a contrasting yet complementary narrative. The video unfolds in three scenes, primarily focusing on close-ups of a bathroom and living room, where the protagonist, a meticulously clean and tidy individual, uses a razor to remove hair from her clothes, accompanied only by the gentle sounds of flowing water and background music. Key text such as \"Easy to Use Portable Lint Roller\" and the recurring green elements highlight themes of health, cleanliness, and a connection with nature, evoking an atmosphere of calmness and mindfulness. The green visuals symbolize vitality and freshness, underscoring the significance of a simple daily self-care routine, thus creating a harmonious blend where the audio's emotional warmth complements the video's serene and refreshing imagery, albeit in different narrative dimensions."
+ },
+ {
+ "video_id": "PI-21f1ab98-78ff-4763-9ae8-6da1191c9214",
+ "path": "/workspace/benchmark/video/PI-21f1ab98-78ff-4763-9ae8-6da1191c9214.mp4",
+ "answer": "This video shows the decoration process of a creative cookie, starting with a close-up shot of an ordinary, undecorated bell-shaped cookie placed on a white surface. The text \"Decorate the cookies as something different\" appeared at the bottom of the screen, setting the theme of turning ordinary cookies into imagined characters. This biscuit is very simple. Its surface is smooth, it is light brown in color, in the classic bell shape, with a small knob on the top and a bell shape at the bottom. The scene remains static, focused on the cookies, emphasizing its simplicity. The word \"Bell\" is displayed in yellow at the bottom of the screen, indicating the shape of the biscuit. The background music is light and lively, enhancing the whimsical tone of the video. Then the video switches to the same bell-shaped cookie, which is now intricately decorated to resemble the character Lord Farquad. There is a red hat on the biscuit, with a yellow round button on it. The black frosting Outlines the character's hair and facial features, and the blue frosting is drawn on the eyes. The red text \"Lord Farquaad\" at the bottom of the screen shows the character that the cookie has been transformed into. The overall atmosphere of the video is interesting and imaginative, highlighting the creativity of the cookie decorations."
+ },
+ {
+ "video_id": "PI-6f548877-0c53-4245-a64f-3b4f94b8950d",
+ "path": "/workspace/benchmark/video/PI-6f548877-0c53-4245-a64f-3b4f94b8950d.mp4",
+ "answer": "The video humorously depicts a conversation between two animated characters in a kitchen. A blue cat-like character with large eyes and a wide grin, sitting on a wooden chair and reading a newspaper, initiates the exchange by asking if the other character can cook, to which the orange blob-like character, sitting at a table with a yellow and white checkered tablecloth, responds with a simple \"No.\" The blue character then inquires if the orange character can drive, receiving another \"No\" in response. The blue character, showing increasing frustration, proceeds to ask if the orange character can speak Chinese. This question causes the orange character's expression to shift to one of confusion and concern. The orange character then explains, \"I don't really know how to speak Chinese because it is really hard to speak it and I only know a little Chinese so I can't say I speak it,\" looking increasingly distressed. The background features a kitchen with wooden cabinets, a window with frosted glass, and a bowl of fruit on the counter. Meanwhile, a male speaker with a neutral or uncertain tone echoes the questions about driving and speaking Chinese, adding a layer of straightforwardness to the humorous exchange."
+ },
+ {
+ "video_id": "PI-2cb22ee9-c4f1-40b1-a326-e11b28ea4b78",
+ "path": "/workspace/benchmark/video/PI-2cb22ee9-c4f1-40b1-a326-e11b28ea4b78.mp4",
+ "answer": "The audio captures a humorous conversation between a woman and two men, centered around a comical misunderstanding and the tension that arises from the woman's unexpected entry into the house. The video vividly illustrates this scenario, beginning in a well-lit dining area where two individuals, one in a green Nike hoodie and the other in a black and purple tracksuit, are engaged in a conversation. The situation escalates when a third person, dressed in a black puffer jacket and carrying a white bag, enters, prompting the person in the tracksuit to exclaim in surprise, 'You don't just walk into people's houses!' and demand to know who they know there. The person in the green hoodie looks confused as the person in the tracksuit reveals, 'She knows our names!', adding another layer of humor to the situation. The tension continues to build as the person in the tracksuit threatens to call the police and fetch a wiffle bat before exiting, leaving the person in the green hoodie looking puzzled. The focus then shifts to the individual in the black puffer jacket near the front door, looking back into the house, while humorous text on the screen reads, '911?' and 'There's a giant eyelash in my house,' providing a comical twist to the otherwise tense encounter. Together, the audio and video create a lighthearted and entertaining narrative that highlights the absurdity and humor of the situation."
+ },
+ {
+ "video_id": "PI-90def5b0-db61-47c9-8100-2821d72dc99e",
+ "path": "/workspace/benchmark/video/PI-90def5b0-db61-47c9-8100-2821d72dc99e.mp4",
+ "answer": "This heartwarming video captures a tender moment between a mother and her newborn baby, Archer. It opens with a top-down view of the baby, snug in a white knit outfit, resting in a woven bassinet personalized with his name. A woman's hand, featuring a gold ring and decorated nails, gently touches his chest. The scene then shifts to a sunlit room where the mother, wearing a white cardigan over a floral dress, cradles Archer in an armchair. The baby now wears a dark green beanie, also with his name on it in white yarn. "
+ },
+ {
+ "video_id": "PI-2d7500d7-89c7-4227-8f3f-1db70c2fe0e2",
+ "path": "/workspace/benchmark/video/PI-2d7500d7-89c7-4227-8f3f-1db70c2fe0e2.mp4",
+ "answer": "The audio clip captures a dramatic and confused woman's end of a phone call, where she repeatedly asks, 'Who are you?' and 'Who the fuck are you?', expressing surprise and frustration as she reacts to an unexpected or unfamiliar caller during what appears to be a prank. The video complements this audio by showing a person holding a smartphone in front of a laptop keyboard with green backlighting, the screen displaying a FaceTime call with a contact named \"MOTHER.\" The person on the other end, presumably the mother, is seen reacting with surprise or confusion, exclaiming in response to the prank. The text overlay at the bottom of the frames, \"POV we did the FaceTime prank on my dramatic mother,\" confirms that this is indeed a prank being played on the mother. The video captures the mother's dramatic reactions during the call, with the FaceTime interface elements such as the \"Speaker,\" \"Camera,\" \"Mute,\" and \"End\" buttons clearly visible on the smartphone screen, along with the time displayed as 7:35. The blurred background suggests an indoor setting, adding to the intimacy and realism of the prank. Together, the audio and video create a humorous and engaging narrative that highlights the mother's bewilderment and irritation during the unexpected call."
+ },
+ {
+ "video_id": "PI-78f098ea-c0c2-4f78-8d64-6c20d12d6a54",
+ "path": "/workspace/benchmark/video/PI-78f098ea-c0c2-4f78-8d64-6c20d12d6a54.mp4",
+ "answer": "The video begins with a person peeling and chopping ginger on a wooden cutting board, guided by text instructions. The process continues with chopping turmeric and slicing lemons, followed by juicing the lemons. The chopped ginger, turmeric, and lemon juice are then added to a blender along with raw honey and water. The blender mixes the ingredients, and the mixture is strained into an ice cube tray and frozen overnight. The next scene shows the frozen yellow cubes, identified as morning digestion cubes by on-screen text. A hand demonstrates the texture by removing one cube from the tray. The cubes are placed in a bowl, and one is added to a mug with hot water, where it dissolves. The text highlights this concoction as the perfect morning drink. The video wraps up with the person sipping the drink, emphasizing the simple and beneficial morning routine. Meanwhile, a female voice explains the recipe, expressing enthusiasm for the health benefits of these morning digestion cubes, which are made by mixing peeled ginger, crushed turmeric, lemon juice, raw honey, and water, then filtering and freezing the mixture. The ambient music has a stable rhythm, creating a calm and uplifting atmosphere that complements the speaker's pleasant delivery."
+ },
+ {
+ "video_id": "PI-43b770e6-d6dd-4198-adbe-34dcfa634abf",
+ "path": "/workspace/benchmark/video/PI-43b770e6-d6dd-4198-adbe-34dcfa634abf.mp4",
+ "answer": "The video documents the collapse of a 150-foot section of the Santa Cruz Wharf into the water during a large swell on Monday afternoon at 12:45. A female speaker reported the incident in a neutral emotional tone, describing how approximately 150 feet of the dock, including bathrooms, restaurants, construction equipment, piles, and piers, fell into the water. The footage shows the damaged wharf from an aerial perspective, highlighting the broken structure and the surrounding ocean. A small building, likely a restroom, is seen tilting into the water, with waves crashing against it. Onlookers gather on the cliffside, observing the damage. The text 'AUTHORITIES SAY 150-FOOT SECTION OF WHARF WENT INTO WATER' provides context. The debris created by the collapse poses a deadly threat when washed ashore. The speaker also mentioned that three engineers and inspectors who were surveying the dock at the time fell overboard but were rescued with only minor injuries. The video emphasizes the severity of the collapse and the potential dangers of the debris."
+ },
+ {
+ "video_id": "PI-c87af6d2-531a-4d9d-b307-83b026a81f75",
+ "path": "/workspace/benchmark/video/PI-c87af6d2-531a-4d9d-b307-83b026a81f75.mp4",
+ "answer": "The video was shot in an outdoor vegetable garden. A woman first showed a brown basket to the camera today, and then picked different vegetables in the garden and put them in the basket. This is a large outdoor vegetable garden, which might be grown by herself or by someone else. There are all kinds of different vegetables in this vegetable garden. It includes vegetables such as cauliflower and eggplant. The heart of the cauliflower is purple and looks very nice. There is also a white building beside this vegetable garden. The weather was very nice that day. From the video, it can be seen that it should be daytime. There are no subtitles in the video. It can be heard from the audio that this woman was picking vegetables while introducing different categories to everyone."
+ },
+ {
+ "video_id": "PI-526e748c-456f-494d-97f2-a54665e85832",
+ "path": "/workspace/benchmark/video/PI-526e748c-456f-494d-97f2-a54665e85832.mp4",
+ "answer": "In the video, a person is wearing a white shirt and headphones, sitting on a game chair and concentrating on playing the game \"Stardew Valley\". The room is soft-lit and equipped with blue LED lights, creating a comfortable gaming atmosphere. What is displayed on the game screen is a small fishing game. Players can throw a fishing rod into the glittering blue water surrounded by lush green leaves. The text \"Finding The KRAKEN In Stardew\" is displayed at the top of the screen, suggesting that the player's task is to catch a rare fish. Throughout the video, the player has been concentrating on the fishing mini-game. When they try to catch fish, the fishing rod fluctuates. The player's reactions can be heard, such as \"Relax, relax, relax\", \"Stop, stop, stop\", \"No, come on, as long as he has a mobile phone\", which indicates that the player has both anticipation and frustration. Finally, the player successfully caught a fish. The player's excitement was obvious because he reacted to the rare catch and was clearly excited about the discovery. This video captures the essence of the gaming experience, highlighting the dedication of players and the excitement of achieving rare gaming milestones. The combination of players' intense attention and vibrant game graphics creates an engaging and immersive narrative."
+ },
+ {
+ "video_id": "PI-f933f342-7652-4a7a-be04-d76ec3c3ac4b",
+ "path": "/workspace/benchmark/video/PI-f933f342-7652-4a7a-be04-d76ec3c3ac4b.mp4",
+ "answer": "This short video presents a relatable scenario about music taste. It features a man with long, dark, wavy hair, a mustache, and a nose ring, sitting in a room with wood-paneled walls. Colorful lighting casts a rainbow-like glow on the wall and ceiling behind him. An on-screen caption reads, \"how my music starts sounding when i'm not the only one listening.\" Throughout the clip, the man looks around and makes subtle facial expressions, like pursing his lips, as if reacting to the music. The accompanying audio is an upbeat, anthemic pop-rock song with generic, positive lyrics like \"We are young\" and \"we go big or go home.\""
+ },
+ {
+ "video_id": "PI-997fb09d-339a-4aa8-a859-711f01a23bde",
+ "path": "/workspace/benchmark/video/PI-997fb09d-339a-4aa8-a859-711f01a23bde.mp4",
+ "answer": "The video humorously contrasts the idealized vision of owning a dog like \"Air Bud\" with the reality of dog ownership. It starts with a man in a black shirt and gray pants in a backyard, holding a ball, with the text overlay \"What I thought having Air Bud would be like.\" A small dog approaches as he throws the ball, transitioning to a golden retriever catching it mid-air. The scene then shifts to reality with the text \"What it's really like,\" showing a man in a gray long-sleeve shirt near a fence, throwing a soccer ball for two dogs. One dog watches while the golden retriever jumps to catch it. Subsequent scenes depict the man comforting a dog on a trampoline, playing with the dogs, and throwing a ball for them to catch. The video continues with the man in a gray shirt throwing a ball for a golden retriever, who catches it mid-air and brings it back. The final scene transitions to a snowy backyard, where the man, now in a black jacket and beanie, stands in the snow as the golden retriever walks through it. The accompanying voice-over, delivered in a humorous and light-hearted male tone, emphasizes the gap between the fantasy of owning a talented dog and the actual, more mundane or challenging experiences of dog ownership."
+ },
+ {
+ "video_id": "PI-47118c62-63ec-490c-8246-c23dd8e6b224",
+ "path": "/workspace/benchmark/video/PI-47118c62-63ec-490c-8246-c23dd8e6b224.mp4",
+ "answer": "The video begins with a person sitting in a car, holding an orange container with a salad, and the text 'What is Amish Amoxicillin?' appears on the screen, introducing the topic of natural remedies. The scene transitions to a close-up of a book page titled 'Immune System,' listing various remedies, including 'Amish Amoxicillin.' A hand with a ring points to the remedy, highlighting its importance. The book is then opened to a detailed page about 'Nature's Amoxicillin,' complete with instructions, ingredients, and images of the remedy in a glass, emphasizing the practical and comprehensive nature of the information provided. The video continues to focus on the book, showcasing detailed instructions and ingredients for 'Nature's Amoxicillin,' with the hand pointing to the dosage and instructions to underscore the clarity and usefulness of the content. The scene transitions to a broader view of the book, revealing a section titled 'Herbal Remedies and Recipes' with a variety of remedies listed, demonstrating the extensive range of natural remedies available. The video wraps up with a page titled 'About the Author,' introducing Dr. Nicole Apelian, who has a PhD and has lived with indigenous people to learn their healing skills, adding credibility and depth to the book's content. The final frames display the book's cover, 'Forgotten Home Apothecary: 250 Powerful Remedies at Your Fingertips,' with the hand pointing to the title, underscoring the book's value as a comprehensive guide to natural remedies. Meanwhile, the audio features a female speaker with an uncertain emotional tone mentioning efforts to suppress the book, while a neutral-toned female speaker enthusiastically describes the book as a comprehensive home apothecary guide, detailing its structure, remedies, and practical use for common ailments. 
She highlights the author's expertise, including a PhD and knowledge gained from indigenous practices, and emphasizes the book's value for every household, creating a focused and informative atmosphere."
+ },
+ {
+ "video_id": "PI-70ebbc00-af48-4360-ab45-8ba40d54f088",
+ "path": "/workspace/benchmark/video/PI-70ebbc00-af48-4360-ab45-8ba40d54f088.mp4",
+ "answer": "The video humorously recounts a chaotic kitchen incident involving a grease fire, as described by a man in a red shirt covered in flour, standing anxiously in his kitchen. He begins by expressing his fear that his wife will be furious, then explains how he tried to extinguish a small grease fire with water, which only made it worse. The scene cuts to a pot on the stove, covered in flour, with a sign that reads, 'Greasy's Firey Fast Kitchen Remodels! Just add water!' The man continues, recounting how he attempted to smother the fire with flour, but a firefighter intervened, knocking the flour out of his hand to prevent further escalation. The man, frustrated, admits to hitting the firefighter in retaliation, who then hit him back. Eventually, the man managed to put out the fire with a pot lid. The firefighter, listening intently, concludes with a dry remark, 'You're right, your wife's going to kill you,' which he repeats at the end of the video, emphasizing the gravity of the situation. The background features wooden cabinets and a black countertop, consistent throughout the video. The conversation is a mix of frustration and humor, highlighting common mistakes during kitchen fires and the importance of proper fire safety knowledge."
+ },
+ {
+ "video_id": "PI-df5209a8-0bc6-4150-802f-4947723589d8",
+ "path": "/workspace/benchmark/video/PI-df5209a8-0bc6-4150-802f-4947723589d8.mp4",
+ "answer": "This video captures the process of baking a large chocolate chip panettone, a type of sweet bread, and preparing a rich hot chocolate to accompany it. The baker begins by revealing a large, well-risen dough studded with chocolate chips in a glass bowl. She enthusiastically punches it down on a wooden surface, cuts it in half, and then stretches and shapes it before placing it into a tall, round paper baking mold. For the topping, she pipes a swirl of batter onto the dough and adds pearl sugar sprinkles. The panettone is then placed in the oven to bake. While it's baking, she prepares hot chocolate from scratch by heating milk, adding chopped dark chocolate and sugar, and whisking until smooth. She also whips fresh cream. Once the panettone is baked to a golden brown, she removes it from the oven and presents the finished product. She then pours the thick hot chocolate into a cup, tops it with a dollop of whipped cream, and dusts it with cocoa powder. Finally, she cuts a large, airy slice of the panettone, dips it into the hot chocolate, and takes a satisfying bite."
+ },
+ {
+ "video_id": "PI-8e721b23-a96b-43d0-bcf0-afa70529248a",
+ "path": "/workspace/benchmark/video/PI-8e721b23-a96b-43d0-bcf0-afa70529248a.mp4",
+ "answer": "A young woman appears in a room that resembles a dormitory or bedroom. She is seated in front of the camera, with a lofted bed-over-desk structure in the background, along with a computer, various personal items, and wall decorations such as posters and paper cutouts.\nShe wears light-colored over-ear headphones adorned with black and white checkered ribbons, a dark tank top, and blue jeans. Her outfit is complemented by multiple gold-toned accessories, including necklaces, rings, and bracelets.\nWith a gentle smile, she makes subtle movements—adjusting her headphones and occasionally gesturing with her hands. Text overlays appear on screen, starting with: “you’re so funny,” followed by her response: “thanks, i went viral on pinterest at 13 years old wearing alt outfits and my photos still float around to this day.”\nThe lighting in the room is steady and evenly illuminates her and her surroundings. The video captures a personal moment, reflecting on her early internet experience and how it continues to resonate today."
+ },
+ {
+ "video_id": "PI-594c7499-97fb-42da-9d21-f333568d5637",
+ "path": "/workspace/benchmark/video/PI-594c7499-97fb-42da-9d21-f333568d5637.mp4",
+ "answer": "In the video, a person is lying down with his head resting on a pillow decorated with cherry patterns. The background includes a beige headboard with decorative nails and white blinds covering the Windows, creating a comfortable and intimate environment. This person is wearing a light-colored knitted sweater, adding a warm and comfortable atmosphere. Such words appeared on the screen: \"I listen at three times the speed.\" \"I'm very scared,\" suggesting that this person has been accelerating to listen to something and now feels uneasy or scared. This passage remains consistent throughout the video, emphasizing the mental state of the characters. The overall mood of the video is a quiet self-reflection, with a hint of tension or anxiety. There are no significant actions or additional visual elements that focus people's attention on their emotional state and textual coverage, intensifying the theme of unease. This video seems to capture an individual's reflection or reaction to something heard at an exceptionally fast speed, highlighting the impact of accelerated audio on the listener's emotions."
+ },
+ {
+ "video_id": "PI-d2f0d27c-ba72-4048-b30f-ca7c145da552",
+ "path": "/workspace/benchmark/video/PI-d2f0d27c-ba72-4048-b30f-ca7c145da552.mp4",
+ "answer": "The video opens in an indoor hobby room, showcasing a man interacting with two different remote-controlled (RC) aircraft. Initially, he is seen placing a large grey RC B-2 Spirit bomber onto a black table. The scene transitions to him handling a blue, grey, and white RC F-16 Falcon jet, moving it across the same table. Text overlays emphasize the action, questioning the dropping of planes and announcing the \"Sights & sounds of a Brand NEW RC F-16.\" The video then shifts focus to a large hangar environment, where a full-scale F-16 Falcon aircraft is being meticulously lowered by what appears to be an overhead crane system."
+ },
+ {
+ "video_id": "PI-62ddb361-61fa-4ebf-b0ff-cbd990ee6f46",
+ "path": "/workspace/benchmark/video/PI-62ddb361-61fa-4ebf-b0ff-cbd990ee6f46.mp4",
+ "answer": "This video captures a moment of frustration and humor involving a vending machine. A person is seen running towards a vending machine, reacting to snacks seemingly stuck inside. The individual then forcefully impacts the side of the machine, after which they bend down to retrieve items from the dispensing slot. The interaction results in apparent success, accompanied by sounds of surprise and laughter from an observer. The setting appears to be a common area, possibly a recreational room, with a ping-pong table and a pool table visible in the background, along with natural light from a window.The focus remains on the person's determined effort to dislodge the snacks and the subsequent reaction."
+ },
+ {
+ "video_id": "PI-93d43b76-b2b0-45b9-ae1b-d9dbae713d0c",
+ "path": "/workspace/benchmark/video/PI-93d43b76-b2b0-45b9-ae1b-d9dbae713d0c.mp4",
+ "answer": "The video begins with two men walking down a city street, one wearing a black vest and light pants, the other wearing a green long-sleeved shirt and white pants, accompanied by the text 'SO TALL? \"On screen. The scene transitions to a bearded man sitting at a kitchen table eating, with the text\" I'm a giant baby \"followed by\" 60CM (2 FEET) \"and\" 17 YEAR OLD \"as he shares his personal experience of being unusually tall since childhood. Now wearing a dark blue shirt, he continues to eat, while the text\"2.1M (6'10FT)\"and\" THEY DID A LOT\"appears, suggesting that extensive medical research has been done on him to rule out gigantism or genetic abnormalities. The story develops as the man discusses the medical investigation, mentions \"GIANTISM\" and shows contrasting images of the man's neck. The video then shows a close-up of a hand with the words \"BIG FINGERS\" before transitioning to the man at the dining table, now another person in a white shirt and hat. As the man continues to speak, the texts \"IN PROPORTIONS\" and \"THE CONCLUSION\" appear, emphasizing that he is proportional and healthy, eventually being labeled a \"natural giant\". The atmosphere remains conversational, the second male voice briefly agrees in a neutral tone, and the ambient music is steady and calm, creating a relaxed and introspective environment."
+ },
+ {
+ "video_id": "PI-36a41ad0-092b-4712-9334-326a7e9c666a",
+ "path": "/workspace/benchmark/video/PI-36a41ad0-092b-4712-9334-326a7e9c666a.mp4",
+ "answer": "The video shows a series of fun interactions of cats in a home environment, creating a light-hearted and cute story. The story begins with a light-colored long-haired cat and a brown short-haired cat engaging in a friendly scuffle on a tile floor, with the brown cat gently stroking the light-colored cat's face with its paws. Then the scene shifts to an orange and white cat sitting alone in the hallway, staring out at the doorway. A striped cat quickly enters and approaches the orange cat, resulting in a short playful encounter where the orange cat pounces on the striped cat. The video ends with the orange and white cat walking away, leaving the striped cat relaxed on the floor, highlighting the playfulness and peace of feline interaction. The audio, meanwhile, features an explanation by a man livestreamed from a live game, who briefly answers a firm \"no\" in the context of the conversation, suggesting a direct interaction with no extra exposition or emotional intensity. This audio element contrasts with the interesting video content and adds an unexpected layer to the overall narrative."
+ },
+ {
+ "video_id": "PI-1236233d-7f8f-48e7-b014-4cca03d693b8",
+ "path": "/workspace/benchmark/video/PI-1236233d-7f8f-48e7-b014-4cca03d693b8.mp4",
+ "answer": "The video captures a person standing in a room with white-tiled walls and a dark door in the background, dressed in a large dark sweatshirt and light blue jeans. Initially motionless, the individual soon bursts into a lively and expressive dance, characterized by vigorous arm swings, hip movements, and hair flips, clearly reveling in the moment. This spontaneous display of joy and freedom of expression is set against a minimalist backdrop, emphasizing the purity of the individual's exuberance. Accompanying this visual is a BGM track featuring a male voice singing in a joyful and energetic tone, with lyrics that convey happiness and enthusiasm. The audio's core message of spreading positive and cheerful emotions through music aligns perfectly with the video's portrayal of unbridled joy, creating a harmonious blend where the lively dance and upbeat music collectively evoke a sense of pure, unadulterated happiness."
+ },
+ {
+ "video_id": "PI-69ca282c-853a-4c13-b3c1-61ed846ae77b",
+ "path": "/workspace/benchmark/video/PI-69ca282c-853a-4c13-b3c1-61ed846ae77b.mp4",
+ "answer": "The video begins with a person wearing pink pajamas adorned with white patterns and a green and red elf hat, sitting in a modern living room decorated with a bookshelf filled with books and various decorative items. Another individual, dressed in gray pajamas with white patterns, is lying on the floor, appearing to be asleep or resting. The person in pink pajamas gestures with their hands, possibly speaking or explaining something. The scene transitions to a different part of the house, where another person, wearing a black and white patterned onesie with reindeer antlers, is sitting at a kitchen counter. The person in pink pajamas is now standing on the kitchen counter, dancing energetically, while the person in the reindeer onesie watches and then starts to gesture with their hands, possibly responding to the dancing. The video continues in the modern kitchen area, where the person in pink pajamas is now sitting on a chair, gesturing with their hands, possibly speaking or explaining something. In the background, the person in the black and white patterned onesie with reindeer antlers is dancing energetically on the kitchen counter. The person in pink pajamas then stands up and joins the dancing, moving their arms and body rhythmically. Both individuals dance together, creating a lively and playful atmosphere. The accompanying BGM is an upbeat and lively instrumental track, likely intended to enhance the playful and energetic mood of the video. The video wraps up with both people continuing their energetic dance, showcasing a sense of fun and camaraderie, perfectly complemented by the rhythmic and cheerful melodies of the background music."
+ },
+ {
+ "video_id": "PI-ec7d9ef7-1c6f-4c51-8331-353373c2e80e",
+ "path": "/workspace/benchmark/video/PI-ec7d9ef7-1c6f-4c51-8331-353373c2e80e.mp4",
+ "answer": "The video showcases a CPR training session featuring the 'LUCAS 3' device, placed on a table, with its gray and white body and yellow accents, attached to a beige mannequin torso. A person in a dark blue long-sleeve shirt interacts with the device, adjusting its components and demonstrating its functionality by pressing buttons and manipulating the mannequin. The background reveals a room with a wooden door, a wall clock, and a white table holding a red cup, while the text 'POV: PINK PONY CLUB IS YOUR NEW CPR TEMPO' appears at the bottom. Simultaneously, an instrumental BGM plays in the background, characterized by a stable rhythm and a neutral yet charming tone, suggesting the presence of many voices harmonizing. This music creates a calm and harmonious atmosphere, enhancing the viewing experience by providing a charming backdrop without detracting from the main content of the CPR training demonstration, ultimately emphasizing the device's features and ease of use."
+ },
+ {
+ "video_id": "PI-8d583b4f-518a-4f3e-9818-fadc548f93db",
+ "path": "/workspace/benchmark/video/PI-8d583b4f-518a-4f3e-9818-fadc548f93db.mp4",
+ "answer": "The video begins with a text overlay stating 'Surprising my roommates That I'm going out,' setting the stage for a playful scenario. It transitions to a blonde individual in a dark strapless top and light blue jeans descending a staircase and moving through a dimly lit living room. Upon entering a well-lit room adorned with a pink lip-shaped sign reading 'DANCE WORKS,' she checks her phone. The audio captures a female voice with a neutral tone saying, 'And then we can do it,' suggesting a plan is in motion. The individual then starts dancing energetically, her movements full of joy. Another person, clad in a black jacket and ripped jeans, joins in, and together they engage in a lively dance. The remaining audio contains unrecognizable speech, but the focus remains on the energetic interaction between the two, embodying a spontaneous and fun moment among roommates."
+ },
+ {
+ "video_id": "PI-0c73b12b-8514-4655-8285-af502b6777f4",
+ "path": "/workspace/benchmark/video/PI-0c73b12b-8514-4655-8285-af502b6777f4.mp4",
+ "answer": "The video is set in a rural, outdoor environment during sunset, transitioning from daylight to a darker setting. The main scene features a man wearing a brown hoodie and jeans leading a white cow with brown spots into a building, possibly a barn or stable. The cow is guided calmly by a rope, and its horns are shown in detail. Two other individuals appear in the background, one in a black jacket and the other in a white shirt, but they are not the focus. The BGM expresses admiration for someone sitting in the passenger seat and mentions avoiding taking a fast route, suggesting a casual or reflective conversation. No significant English text is visible in the footage. The video conveys a theme of trust and harmony, highlighting the gentle interaction between humans and animals in an agricultural setting. The atmosphere is peaceful, emphasizing respect and care in a simple, everyday moment."
+ },
+ {
+ "video_id": "PI-b614178e-1901-4cec-93de-8bb44aee718a",
+ "path": "/workspace/benchmark/video/PI-b614178e-1901-4cec-93de-8bb44aee718a.mp4",
+ "answer": "In a positive and instructive atmosphere, a neutral-tone male speaker from the audio discusses the issue of violent engine shaking and suggests cleaning the oxygen sensor as a solution. The accompanying video vividly illustrates this process. It begins with a person in a gray jacket leaning into a car's open hood, holding a small container of yellow liquid, and the engine is seen shaking violently. The person then points to the oxygen sensor, unscrews it, revealing it's covered in grime. Using a small container of blue liquid, the person pours it over the sensor, submerges it, and wipes it clean with a white cloth, effectively removing the dirt and grime. After cleaning, the sensor is reinserted back into its original position in the engine bay. The scene then transitions to the car's interior, where the person turns the ignition key, and the dashboard lights up, indicating a smooth start. The video concludes with the engine running smoothly, and the person standing next to the open hood, gesturing and speaking, with encouraging texts such as 'The engine no longer shakes!', 'Have you learned it?', and 'Follow me and learn about cars every day' appearing on the screen. The background, featuring an outdoor setting with buildings and trees, emphasizes the successful resolution of the engine issue, aligning with the speaker's advice and encouraging ongoing learning about cars."
+ },
+ {
+ "video_id": "PI-ee6f3623-5f83-4723-aaa7-707a2d9ad882",
+ "path": "/workspace/benchmark/video/PI-ee6f3623-5f83-4723-aaa7-707a2d9ad882.mp4",
+ "answer": "This video captures a staged interaction between two women in a locker room. The scene begins with one woman sitting on a bench, tying her shoe, as another woman approaches her. The standing woman offers the seated woman a small pink container, explaining that it is a probiotic product. She then bluntly tells the other woman that she is offering it because she noticed her body odor in class. The seated woman reacts with visible shock and disgust to this direct comment. After a moment, she accepts the product and asks where it was purchased. The first woman replies that she bought it on the 'TikTok shop.'"
+ },
+ {
+ "video_id": "PI-21ac569a-469a-4ba6-b41c-f05253062f7f",
+ "path": "/workspace/benchmark/video/PI-21ac569a-469a-4ba6-b41c-f05253062f7f.mp4",
+ "answer": "The feature of the video is that two people are dancing passionately in a vibrant and colorful room. Both of them were wearing Hello Kitty pajama pants. One was wearing a white lace top and the other was wearing a black crop top. The background includes a neon-decorated staircase and a mannequin wearing grey sweatpants and white boots, contributing to a lively and eclectic atmosphere. The text above reads, \"Sophie Rain: There are no more chairs.\" This is a humorous or spontaneous moment when the dancers jokingly admit that there are no more chairs. The background music is cheerful and lively, enhancing the energetic mood. The dancers performed various interesting movements, including gesturing, spinning and sitting cross-legged on the floor. The overall mood of the video is light and cheerful, capturing the spontaneous moments of enjoyment and friendship between the two people. At the end of the video, the dancers continued their lively and energetic dance, embodying a carefree joy and friendship."
+ },
+ {
+ "video_id": "PI-f225406b-dc94-4aaa-8530-d4c00e5f33d8",
+ "path": "/workspace/benchmark/video/PI-f225406b-dc94-4aaa-8530-d4c00e5f33d8.mp4",
+ "answer": "This video captures a three-year-old boy excitedly bursting into his sister's bedroom on Christmas Day. His sister is still asleep in the dimly lit room partially illuminated by blue light. The little boy eagerly approaches the bed, calling out \"sissy\" to wake her up and announces that Santa brought all the presents—while subtitles reveal these were actually gifts his sister had wrapped the night before. The boy then leaves the room at his mother's suggestion to go wake up the pets."
+ },
+ {
+ "video_id": "PI-b0fcdd6c-db0e-4b28-9640-0072003e4af0",
+ "path": "/workspace/benchmark/video/PI-b0fcdd6c-db0e-4b28-9640-0072003e4af0.mp4",
+ "answer": "The video presents a heartwarming scene of a small, hairless dog with white, wispy fur on its head and ears, being gently cradled in a person's hands. The dog's pink tongue is slightly sticking out, and it appears calm and content. The person holding the dog is adorned with a ring on their finger, and part of their clothing, which has some text on it, is visible. The background reveals a cozy indoor setting, specifically a dining area with a wooden table and chairs, and several framed pictures on the wall, adding a personal touch. The warm lighting enhances the inviting atmosphere. Throughout the video, the dog remains mostly still, with occasional slight head movements, while the person makes subtle adjustments to ensure the dog's comfort. Meanwhile, the audio features background music with a female singer whose soft and tender voice expresses love and affection through gentle lyrics, creating a warm and intimate ambiance that perfectly complements the tender moment between the person and their pet."
+ },
+ {
+ "video_id": "PI-09b6b5e5-9975-4cf2-b689-59a8f48d63da",
+ "path": "/workspace/benchmark/video/PI-09b6b5e5-9975-4cf2-b689-59a8f48d63da.mp4",
+ "answer": "The video presents a somber and reflective scene, featuring a person with long, brown hair wearing a blue and white striped sweater, sitting in a softly lit room with a window on the left. The window, with its white blinds and a white curtain tied in a knot, along with a few items like a red box on the windowsill, and the wall behind adorned with small photos or cards in a grid pattern, sets a contemplative mood. Text overlays in the video, such as 'the cut that always bleeds' and 'but even though you're killing me, yeah,' suggest a metaphorical or emotional wound and a sense of enduring pain. The narrative progresses with lines like 'i need, i need you more than me,' 'i need you more than anything,' and repeated pleas of 'please,' emphasizing a desperate need. The audio, a singing track where a woman is singing deeply, complements this visual narrative. The gist of the lyrics, that even if someone hurts her, she still needs them very much, echoes the sentiments expressed in the text overlays. The melancholic or soulful melody of the audio, combined with the visual elements of the video, creates a poignant and immersive experience, effectively drawing the viewer into the emotional journey of the person depicted, highlighting themes of enduring pain and a profound need for another."
+ },
+ {
+ "video_id": "PI-7619ef35-649c-44f2-a783-2a8523745726",
+ "path": "/workspace/benchmark/video/PI-7619ef35-649c-44f2-a783-2a8523745726.mp4",
+ "answer": "The video features a person wearing a red hoodie and headphones, sitting in a room illuminated by blue LED lights, with a colorful spinning wheel visible on the left side of the screen. The video alternates between the person speaking and gameplay footage from Fortnite, where a character in a blue outfit is seen building structures and engaging in combat. The speaker, a male with an engaging and somewhat challenging tone, discusses the return of the Renegade Raider skin in Fortnite and addresses viewers who claim to be 'OG' (original) players. He questions the authenticity of these claims, suggesting that many started playing the game late yet still refuse to admit it. Text overlays appear throughout the video, reinforcing the theme of questioning the 'OG' status of players. The video encourages viewers to join the stream to see who the real 'OG' players are, maintaining a consistent narrative that challenges viewers to prove their long-standing involvement in Fortnite. The gameplay footage includes scenes of the character building and fighting inside a wooden structure, as well as running through a grassy area outside, further emphasizing the interactive and competitive nature of the game."
+ },
+ {
+ "video_id": "PI-6a62a0e6-2aaf-4be7-b444-9408960ee08a",
+ "path": "/workspace/benchmark/video/PI-6a62a0e6-2aaf-4be7-b444-9408960ee08a.mp4",
+ "answer": "The video depicts a touching and emotional scene between two elderly women in a hospital room, which is further enriched by the audio capturing their heartfelt conversation. One woman, with white hair, is seated on the hospital bed, wearing a white cardigan over a blue top and black pants, while the other woman, with short blonde hair, sits beside her in a purple cardigan and white pants, holding a cane. Their interaction is filled with warmth and affection as they hold hands and lean in close to each other. The background features a bed with white sheets, a wooden cabinet, and a bulletin board on the wall, with an overlay text stating, \"Knowing it's the last time you will see your best friend at 95,\" adding a poignant context to their interaction. The audio complements this scene with a conversation brimming with emotion and a touch of sadness, where one woman expresses gratitude and affection, saying, \"Thank you. See you every day. I love you, and I love you too. Great! Great!\" The other responds with loving words, and amid laughter and shared love, the first woman exclaims, \"Oh my god, oh my God. Stop this.\" A third voice remarks, \"We are just shooting your videos because you are so cute,\" highlighting the profound bond between the two women and the bittersweet realization of their final farewell."
+ },
+ {
+ "video_id": "PI-41a11b49-af69-48b5-920b-a8d3cc98a396",
+ "path": "/workspace/benchmark/video/PI-41a11b49-af69-48b5-920b-a8d3cc98a396.mp4",
+ "answer": "The video begins with a nostalgic scene of cheerleaders performing on a football field, dressed in classic white and light blue uniforms, with a large, enthusiastic crowd in the background. The text 'My mom as an NFL Cheerleader...' sets the stage for a heartfelt generational comparison. The cheerleaders execute synchronized routines with precision, and one cheerleader in the foreground is prominently featured, showcasing dynamic moves such as raising her pom-poms high and kicking her legs with vigor. The scene then transitions to a modern football field, where a single cheerleader in a sleek teal and black uniform with white boots performs energetically, accompanied by the text 'Me, 30 years later, cheering for the same team.' The background reveals a contemporary football field with spectators and staff members, capturing the continuity of the cheerleading spirit. Meanwhile, the audio humorously contrasts past and present cheerleading teams through lively background music and the subtle buzz of crowd noise, creating a relaxed yet engaging atmosphere. The core message of the audio is to highlight the continuity and enduring spirit of cheerleading over more than 30 years, emphasizing how the passion and dedication of cheerleaders remain unchanged despite the passage of time. The video continues to focus on the modern cheerleader, capturing her from various angles as she performs, including close-ups and wider shots that show her interacting with other cheerleaders, wrapping up with a vibrant celebration of the timeless and energetic spirit of cheerleading."
+ },
+ {
+ "video_id": "PI-3b6487e3-78de-4ce9-8e55-23efea4cd979",
+ "path": "/workspace/benchmark/video/PI-3b6487e3-78de-4ce9-8e55-23efea4cd979.mp4",
+ "answer": "The video begins with a close-up of a person wearing a red T-shirt, accompanied by the white words \"Have you ever\" and the pink words \"a DREAM\", creating a contemplative atmosphere. This scene transitions to a black screen, on which the pink word \"that\" is displayed, followed by the white and pink \"shhh that that\", suggesting a quiet moment of reflection. The story then switches to the clouds seen from the plane window, with the pink words \"you'd\" appearing in the upper left corner, evoking a sense of longing and possibility. Words of different colors such as \"you'd\", \"you\", \"would\", \"could\", \"do\", \"want\", \"shhh\", etc. are scattered on the cloud map, symbolizing the fragmentation and richness of dreams and thoughts. Similar clouds seen from the plane window also appeared in the video, showing words such as \"you'd\", \"you\", \"would\", \"could\", \"do\", \"want\" and \"shhh\" in different colors, emphasizing the theme of dreams and aspirations. At the end of the video is a close-up shot of the same person wearing a red T-shirt. This final frame reinforces the theme of unlimited potential and the idea that one can achieve any dream, which is related to the initial question raised at the beginning of the video. The overall narrative encourages the audience to reflect on their own dreams, aspirations and beliefs in their potential."
+ },
+ {
+ "video_id": "PI-fc9cb121-37e7-4a13-bced-2c3264e11942",
+ "path": "/workspace/benchmark/video/PI-fc9cb121-37e7-4a13-bced-2c3264e11942.mp4",
+ "answer": "The video features a young woman with dark hair and tattoos in a modern indoor setting, possibly a store. She wears a camouflage hoodie, striped shorts, and platform sandals. With a female speaker who is expressing happiness and encouragement. She is praising someone named Kylie and another person, possibly named He, for their performance, using phrases like \"Good job\" and \"You're doing amazing sweetie.\" The scene appears to involve a supportive and positive interaction, likely in a setting where achievements or efforts are being acknowledged. The background music is upbeat and creates a cheerful atmosphere. She first kneels on a couch in the foreground, smiling, then gets up and walks towards the back of the room. The background includes a staircase, a clothing rack, and some wall art and signs, one of which reads \"Lockers Bathrooms.\" The overall aesthetic is casual and candid. A text overlay across the video asks viewers to \"guess the new piercing location.\""
+ },
+ {
+ "video_id": "PI-5fbefa42-b8ba-484b-a91f-d14abae414ff",
+ "path": "/workspace/benchmark/video/PI-5fbefa42-b8ba-484b-a91f-d14abae414ff.mp4",
+ "answer": "The video features a person lying in bed, covered with a light-colored, fluffy blanket, with a beige headboard and a light-colored wall in the background. The individual, who has long, light brown hair, remains mostly still, with only slight movements of the mouth and head suggesting they might be speaking or singing. The soft and warm lighting enhances the cozy and relaxed atmosphere, contributing to an overall mood of calm and introspection. Meanwhile, the audio track includes a female speaker with a humorous and critical tone, discussing how people often mispronounce or improperly emphasize certain words or phrases, especially in songs or speeches. She uses the phrase 'that's not my name quarterback' as an example to illustrate her point. Together, the video and audio create a contemplative setting while humorously critiquing the delivery of certain expressions, suggesting they lack authenticity and proper emphasis."
+ },
+ {
+ "video_id": "PI-51849eb7-feba-4923-a567-6a346ef1aaef",
+ "path": "/workspace/benchmark/video/PI-51849eb7-feba-4923-a567-6a346ef1aaef.mp4",
+ "answer": "The video takes place in a bustling cafeteria crowded with students, creating an active social atmosphere. The main focus is a young man wearing a blue \"Delaware\" sweatshirt, sitting at a table piled high with a black bag, some crumpled napkins and food wrappers. He emphasized his point of view with expressive gestures. He raised his hand and counted on his fingers, suggesting that he was listing reasons or arguments. This audio records a humorous and self-mocking monologue. The young man Outlines \"the three major reasons why I don't be a bad boyfriend\". He made jokes about his appearance at the very beginning, saying, \"First, I'm so ugly. I'm so ugly.\" He went on to give an even cuter reason, \"Secondly, I like hugging. I just love hugging.\" Finally, he mentioned, \"Thirdly, I'll call you baby,\" further emphasizing his mischievous and affectionate nature. This video humorously conveys the idea of this young man trying to present himself as a good friend, blending self-mockery with sincere affection. The casual chats in the background and the actions of other students contributed to the environment of the school cafeteria, making his candid and interesting remarks even more prominent."
+ },
+ {
+ "video_id": "PI-17632f7e-603b-479e-a934-ea32b5b68114",
+ "path": "/workspace/benchmark/video/PI-17632f7e-603b-479e-a934-ea32b5b68114.mp4",
+ "answer": "The audio is a track from a video featuring a conversation between a man and a woman in a subway station. The audio includes human voices with no background music. The woman, wearing a blue beanie and brown leather coat, speaks first, and the man, wearing a green jacket and black cap, responds. They discuss the unrealistic nature of male/female friendships in romantic comedies, with the woman making a point and the man responding with humor, mentioning a date with a cousin and family reunions. The conversation continues with the woman holding two donuts, one red and one white with orange sprinkles, and the man taking a bite of the red donut. They continue their light-hearted banter, with the man complimenting the woman and suggesting brunch in Williamsburg. The core information is that the two are engaging in a playful and humorous conversation about relationships and friendships, enjoying each other's company and sharing donuts."
+ },
+ {
+ "video_id": "PI-bed8d469-f807-4199-a166-6eff63b6d560",
+ "path": "/workspace/benchmark/video/PI-bed8d469-f807-4199-a166-6eff63b6d560.mp4",
+ "answer": "The video captures a dynamic protest march in an urban setting, where a diverse group of people walks under a concrete structure, likely a bridge or overpass. Participants, clad in jackets and backpacks, carry Mexican flags prominently, signaling a collective cause. The atmosphere is charged with unity and determination, reflected in their coordinated movement. Above them, a sign reads '12 FT 7 IN', indicating the height clearance of the structure. The camera follows the group as they move with purpose through the city, surrounded by trees and buildings, emphasizing the urban context. This scene is complemented by a dynamic protest chant track playing in the background. The vocals are energetic and collective, embodying the spirit of the protesters united in their cause. The chant's repetitive and rhythmic nature builds solidarity and momentum among the participants, enhancing the overall sense of collective action and unity. Together, the video and audio create a powerful depiction of a community coming together to voice their concerns, navigating the urban landscape with resolve and shared purpose."
+ },
+ {
+ "video_id": "PI-cf50f0e2-940b-4385-af50-46bccaf1a474",
+ "path": "/workspace/benchmark/video/PI-cf50f0e2-940b-4385-af50-46bccaf1a474.mp4",
+ "answer": "The video presents a multifaceted portrayal of a woman through a series of diverse scenes, each capturing her in unique settings and outfits. It opens with her in a light blue long-sleeve top, sitting indoors with a window and a red object in the background, exuding a soft and relaxed demeanor. The narrative then shifts outdoors, where she stands near a bright pink car, clad in a loose white jacket, her hair flowing freely. She is later seen in a pink top with a dramatic red hood, posing confidently with one hand near her face. A mysterious atmosphere envelops her as she walks in a dimly lit environment, dressed in a black dress. The video then brightens with an outdoor scene where she stands alongside another woman; one in a white checkered outfit, the other in a plaid coat and cap, both radiating cheerfulness. In a snowy setting, she is viewed from behind, wearing a fluffy white jacket against a backdrop of traditional wooden architecture. Indoors again, she leans towards someone, wearing a blue top and large hoop earrings, her expression warm and engaging. Complementing this visual journey, the audio features the same woman curiously questioning someone who labeled her a \"porn star.\" She challenges their knowledge of her online activity, asking if they are a subscriber or fan, seeking clarity on the accuser's familiarity with her content and their potential fandom. Together, the video and audio create a rich tapestry of her persona, blending visual diversity with a probing inquiry into her online identity."
+ },
+ {
+ "video_id": "PI-430a77b2-decc-49cd-9893-6506a6fb9353",
+ "path": "/workspace/benchmark/video/PI-430a77b2-decc-49cd-9893-6506a6fb9353.mp4",
+ "answer": "The video features a person with long, dark hair wearing a black shirt with white text and multiple necklaces, including one with a star pendant and another with a red gem. The background shows an indoor setting with a beige wall and a doorway leading to another room with warm lighting. The text \"how some of you sound lowkey\" is displayed at the top of the frames, suggesting a commentary on certain behaviors or speech patterns. The person is seen adjusting their hair and touching their face, appearing to be in a casual, relaxed state, creating a laid-back atmosphere throughout the video. Meanwhile, the audio features a woman speaking with an excited and passionate tone about her upcoming visit to South Korea to see her Korean boyfriend. She plans to visit a Korean convenience store with him, enjoy the food there, and drink Korean beverages, expressing excitement and enthusiasm throughout her speech. Together, the video and audio combine to portray a relaxed indoor setting with a personal narrative of anticipation and joy about a visit to Korea."
+ },
+ {
+ "video_id": "PI-b2afa3a2-a732-47a4-856b-33536664c605",
+ "path": "/workspace/benchmark/video/PI-b2afa3a2-a732-47a4-856b-33536664c605.mp4",
+ "answer": "The video unfolds at night in a dimly lit outdoor area, likely a back alley or side street, where a person in a white t-shirt with black text and dark pants attempts to climb over a black metal fence. Behind the fence, a blue car is parked next to a brick building with a few illuminated windows, and the wet ground suggests recent rain. The person bends down, places their hands on the ground, and starts to climb, but loses balance and falls to the other side. They then get up, brush themselves off, and walk towards a better-lit area with houses in the background. Overlaid text reads 'Yoooo How my bro got over the fence', conveying surprise or disbelief. Amidst this scene, a male speaker with a neutral or unclear emotional tone briefly comments, 'Yeah, that way,' suggesting agreement or direction in the conversational context of the interaction, with no background music or additional auditory elements."
+ },
+ {
+ "video_id": "PI-5ddc54bd-53b5-4352-aa9d-3d12da6ac70f",
+ "path": "/workspace/benchmark/video/PI-5ddc54bd-53b5-4352-aa9d-3d12da6ac70f.mp4",
+ "answer": "In the audio, a girl with a pleasant tone reflects on her appearance, expressing self-admiration as she looks in the mirror, feeling as beautiful and cool as before, and admires her fluffy hair, all set against a backdrop of cheerful and lively music that enhances the joyful mood. Meanwhile, the video features a plush unicorn toy on a floral-patterned fabric background, which becomes the object of playful admiration. The video begins with text highlighting the unicorn's aesthetic appeal, such as \"Looked in the mirror as beautiful as art,\" and proceeds to celebrate the unicorn's various hairstyles with humorous and affectionate text overlays like \"God DAMN i love my side part,\" \"Looked in the mirror... sharp as a bullet,\" and \"God damn i love my fluffy hair!!!!!\". The camera angles shift to provide different perspectives of the unicorn, emphasizing its unique features and hairstyles, from a 'mullet' to a 'buzz cut' and a 'low taper fade', all contributing to a whimsical and playful tone. While the audio focuses on a girl's self-reflection and appreciation, the video humorously extends this sentiment to a plush unicorn, both creating a cheerful and positive atmosphere that celebrates self-love and appreciation in their own unique ways."
+ },
+ {
+ "video_id": "PI-e54c15cb-2aa7-4e63-9c6d-4b43e309568f",
+ "path": "/workspace/benchmark/video/PI-e54c15cb-2aa7-4e63-9c6d-4b43e309568f.mp4",
+ "answer": "The video presents a heartwarming indoor scene that beautifully complements the audio's profound declaration of love. In the visual, a fluffy, light-colored dog is comfortably seated on a gray couch, while a person with blonde hair, dressed in a black vest over a gray long-sleeve shirt and light blue jeans, sits beside them. The person extends their arms in a tender embrace, and the dog leans into this affectionate gesture, creating a touching moment. The background features a living room with dark wooden cabinets and a stainless steel refrigerator, adding a homely atmosphere. A gray pillow rests on the couch next to the dog, enhancing the cozy setting. The audio reinforces this scene with a female voice expressing deep affection, saying, \"Do you know how much I love you? I love you more than everything. Whatever it takes.\" Together, the video and audio convey a powerful message of love and devotion, highlighting the strong bond between the individual and their beloved pet."
+ },
+ {
+ "video_id": "PI-3697d4f8-8e6a-4938-86a3-de66aaf166ea",
+ "path": "/workspace/benchmark/video/PI-3697d4f8-8e6a-4938-86a3-de66aaf166ea.mp4",
+ "answer": "The video features a significant moment in an official setting, likely the Oval Office, where a man in a dark suit is seated at a large wooden desk. The background is adorned with American flags and various official emblems, setting a formal tone. Another man in a suit stands nearby, holding a folder. The seated man is presented with an executive order, which he examines closely. The text on the screen reveals that the order pertains to the declassification of files related to the assassinations of President John F. Kennedy, Senator Robert F. Kennedy, and Reverend Dr. Martin Luther King Jr. The man acknowledges the gravity of the situation, noting that many people have been waiting for this information for years, even decades. He emphasizes that everything will be revealed, indicating the importance of transparency. He then signs the document, formalizing the executive order. After signing, he hands the pen to the standing man and gestures towards the signed document, indicating its significance. The video captures a moment of historical importance, highlighting the decision to declassify long-awaited files and the commitment to transparency."
+ },
+ {
+ "video_id": "PI-522cf90d-ebe1-40f5-9178-bb924da8df24",
+ "path": "/workspace/benchmark/video/PI-522cf90d-ebe1-40f5-9178-bb924da8df24.mp4",
+ "answer": "In the video, a man in a blue suit and tie is standing outdoors at night, reporting the urgent news that TikTok might be shut down. The background shows a dim urban area with some lights in the distance. The man spoke directly into the camera, emphasizing the severe situation TikTok faces in the future in the United States. He reported that TikTok could still be shut down by Sunday and emphasized the urgency of their petition, which has already collected more than 914,000 signatures. Their goal is to collect one million signatures before midnight and send a powerful message to President Biden. He emphasized the importance of a clear statement from the White House to ensure that the servers can continue to host the TikTok app. The man's tone was serious and firm, reflecting the seriousness of the situation. He called on the audience to support the cause by signing a petition to prevent Douyin from \"darkening\" on Sunday night. The video emphasizes the need for immediate action and public participation to influence decision-makers."
+ },
+ {
+ "video_id": "PI-a290191f-5752-4dab-9c37-5a2bcf9ee283",
+ "path": "/workspace/benchmark/video/PI-a290191f-5752-4dab-9c37-5a2bcf9ee283.mp4",
+ "answer": "The video depicts a casual and relaxed scene in a well-lit room with white walls and a ceiling fan, where two individuals are engaged in an interaction. The person in the foreground, wearing a red cap and a dark gray t-shirt, sits at a desk equipped with a black keyboard and a green-lit mouse, actively gesturing with their hands as they speak. In the background, another individual, dressed in a light-colored hoodie and dark pants, sits on a couch, observing the scene with a subtle smile. Natural light streams in from a window or an open door, illuminating the room, while a large mirror on the wall reflects part of the setting, adding depth. Complementing this visual, the background music features a female vocalist singing with a soft, emotive, and slightly melancholic tone, expressing feelings of love and longing. The lyrics convey deep affection and a sense of yearning, enhancing the overall atmosphere of the video with themes of love, emotional vulnerability, and connection. Together, the audio and video create a cohesive experience, blending the visual interaction with the emotional depth of the music."
+ },
+ {
+ "video_id": "PI-bbfc20c9-caa1-4684-9373-7be6d7ac377f",
+ "path": "/workspace/benchmark/video/PI-bbfc20c9-caa1-4684-9373-7be6d7ac377f.mp4",
+ "answer": "The video presents a monochromatic, minimalist scene with abstract geometric shapes in the background, setting a surreal and contemplative mood. A person, wrapped in a black blanket and seated on the ground, embodies a melancholic and introspective state. As the video progresses, text overlays emerge above the person's head, narrating a journey of emotional pain and uncertainty with phrases like \"i'm hurting,\" \"i'm stuck in,\" and \"i don't know where to go,\" leading to a resigned decision to stay in place. These texts convey themes of sorrow and helplessness, aligning with the soulful and melancholic tone of the musical track, which expresses feelings of being hurt, stuck, and uncertain. The person's subtle shifts in posture and gaze enhance the introspective atmosphere. Towards the end, the text \"i love\" appears in red, signaling a shift in emotion. The person stands and moves towards a red door, with the text \"i love i love i love\" reinforcing a desperate yet affirming pursuit of affection. The video concludes with a black screen displaying \"malcom todd chest pains (i love) [Short Jakeneutron Cover],\" tying the visual narrative to the song's themes of sorrow, helplessness, and an overwhelming desire for love. The overall visual style is stark and evocative, effectively complementing the emotional depth of the musical track."
+ },
+ {
+ "video_id": "PI-011c29c2-242e-49f1-9a78-a6265b6fd1a1",
+ "path": "/workspace/benchmark/video/PI-011c29c2-242e-49f1-9a78-a6265b6fd1a1.mp4",
+ "answer": "In a delightful synergy of audio and visuals, this video and its accompanying track create an atmosphere of pure merriment. The audio is a track full of laughter, which effortlessly expresses a relaxed, cheerful, and highly interesting mood. It sets the tone for what's to come, building anticipation for a humorous experience. The video itself is a hilarious spectacle. It depicts a rather unconventional scene where a man is carrying a person on his back. As if that wasn't amusing enough, he then proceeds to wrap the person on his back in a patterned blanket before heading into the cinema. He makes his way towards the female cashier, and this absurd sight causes everyone around to burst into laughter. The entire video is imbued with a relaxed and joyful atmosphere, perfectly complementing the laughter - filled audio. The combination of the man's antics and the infectious laughter of the onlookers makes for a truly entertaining and light - hearted viewing experience, where the audio and video work in harmony to convey a sense of unadulterated fun."
+ },
+ {
+ "video_id": "PI-2bdd2fee-a899-47eb-ab0c-b0a104f6b63b",
+ "path": "/workspace/benchmark/video/PI-2bdd2fee-a899-47eb-ab0c-b0a104f6b63b.mp4",
+ "answer": "The audio sets the stage for a high-profile event with its upbeat and lively background music, perfectly fitting the glamorous atmosphere that permeates the occasion. The focus on the music and the general ambiance conveys a sense of excitement and celebration typical of such events. The video complements this auditory backdrop by showcasing a person standing on a red carpet at the MusiCares Fire Relief event, as evidenced by the prominent backdrop displaying the event's name and a call to action to donate at GRAMMY.com/FireRelief. The individual, dressed in a striking black leather jacket, black sheer underwear, and black platform boots, exudes a bold and edgy look, further enhanced by bright yellow gloves with white cuffs that add a pop of color to their ensemble. The person poses confidently, adjusting their jacket and gloves, and striking various poses to showcase their unique outfit, all while the backdrop emphasizes the charitable nature of the event. The overall atmosphere is one of glamour and purpose, focused on raising awareness and funds for fire relief efforts, creating a powerful and memorable experience for attendees and viewers alike."
+ },
+ {
+ "video_id": "PI-6ee9aacd-a2a0-44c5-9113-3b2ce577faa7",
+ "path": "/workspace/benchmark/video/PI-6ee9aacd-a2a0-44c5-9113-3b2ce577faa7.mp4",
+ "answer": "The video begins with a heartwarming request for a Duolingo-themed bedtime video, aimed at convincing a 19-month-old fan that the Duolingo owl, Duo, goes to bed too. It starts with a person in a yellow shirt and text explaining the situation, where the parent is trying to ease their child into bedtime by showing that Duo also goes to bed. The scene then shifts to a green owl plush toy on a bed, where a hand adjusts the toy, places a smaller owl plush beside it, and covers them with a gray blanket, creating a comforting bedtime scene. Meanwhile, a baby's voice can be heard, adding a touch of innocence and charm to the interaction, which appears casual and short-lived, with no obvious emotional intensity. The ambient music, characterized by a steady rhythm, creates a calm and unobtrusive atmosphere, enhancing the overall soothing effect of the video. The video effectively conveys a parent's effort to use the beloved Duolingo character to make bedtime a more comforting and familiar routine for their child."
+ },
+ {
+ "video_id": "PI-65b31253-05b3-46e1-bfee-ebcf127791bd",
+ "path": "/workspace/benchmark/video/PI-65b31253-05b3-46e1-bfee-ebcf127791bd.mp4",
+ "answer": "The video shows the highs and lows of medieval-style swordsmanship, beginning with a man in a dark jacket and headphones sitting in a room with a dark blue wall decorated with framed pictures. The text \"THE HIGHS AND LOWS OF SWORDFIGHTING\" sets the theme. The scene transitions to a swordsmanship arena where characters in historical costumes fight. The first battle ends with the character in a quilted tunic and metal helmet defeating an opponent in red pants and is accompanied by the text \"You won! Hold G to con\". The next battle is more intense, but the same character wins again. However, the story takes a tense twist and the speaker, a male, expresses surprise and concern, describing the involvement of a man with a long beak hook. He experiences a moment of agony when his weapon is knocked out of his hand, reacting to this unexpected event. During the ensuing battle, the character faces defeat, blood is spattered, and the text \"You are hurt! Hold G to g\". The character falls down, showing an illustration of a dancing skeleton with the words\" YOU DIED \"written in red. The video ends with the man in the room taking a sip from the blue jar, and\" To Menu \"appears, indicating the end of the game, encapsulating the unpredictability and challenging nature of swordsmanship."
+ },
+ {
+ "video_id": "PI-cd48ced7-86a5-45ef-b19a-00828d99ee38",
+ "path": "/workspace/benchmark/video/PI-cd48ced7-86a5-45ef-b19a-00828d99ee38.mp4",
+ "answer": "The video captures a close-up view of a gold necklace being prepared and displayed on a black mannequin bust. A pair of hands meticulously handles the necklace, which features a striking red gemstone pendant. Initially shown as separate pieces, the necklace is skillfully assembled, showcasing its versatile design before being elegantly draped around the mannequin’s neck. The background reveals other jewelry items and additional black mannequin busts, indicating a carefully arranged display setting, possibly a retail space or private collection. Red velvet fabric and black surfaces add rich texture to the scene, complementing the gold jewelry. The camera remains stationary, maintaining a clear focus on the intricate details of the necklace and the hand movements, highlighting the craftsmanship and beauty of the piece. Upbeat background music accompanies the scene, enhancing the atmosphere without distracting from the main subject."
+ },
+ {
+ "video_id": "PI-f3f2299c-9bad-4ed0-a88c-e4fc703cc2a1",
+ "path": "/workspace/benchmark/video/PI-f3f2299c-9bad-4ed0-a88c-e4fc703cc2a1.mp4",
+ "answer": "This split-screen video captures a moment of intense reaction during a video game session. The top panel shows two men, one in a plaid shirt and the other in an orange hoodie and headphones, watching with anxious expressions. The bottom panel displays gameplay from a soccer video game, where the score is tied 4-4 in the 75th minute. As a player in a blue jersey advances with the ball, he is slide-tackled by an opponent in a yellow jersey. The tackle is aggressive, sending the blue-jersey player to the ground. Simultaneously, in the top panel, the man in the plaid shirt reacts dramatically. He grabs the man in the orange hoodie by the collar and forcefully pulls him backward, causing both of them to fall out of their chairs and off-screen."
+ },
+ {
+ "video_id": "PI-be2a4dc2-e51f-48f6-a9b3-663a751d15f4",
+ "path": "/workspace/benchmark/video/PI-be2a4dc2-e51f-48f6-a9b3-663a751d15f4.mp4",
+ "answer": "This video took place in a square, possibly during a festival parade. The background is a large parade float, designed like a huge open book with gingerbread theme decorations, moving along the streets full of audiences. The appearance of characters dressed in colorful costumes made the atmosphere more lively and cheerful. The protagonist in the video is a young woman. She is wearing a green sweatshirt with a gingerbread man pattern and a hair band with gingerbread man ears on her head. She was holding gingerbread in her hand, eating gingerbread shaped like a gingerbread man and enjoying the holiday feast. Her behavior and dress indicate that she is immersed in the festive atmosphere. As the parade progressed, characters dressed in gingerbread man costumes and other chef costumes designed with gingerbread themes interacted with the crowd. The gingerbread man, dressed in a brown and pink outfit, was playing with two chefs in white costumes decorated with red and green heart shapes. The chefs seemed to be chatting and gesturing enthusiastically, adding to the lively atmosphere. The background is a gingerbread themed booth and a group of spectators, some of whom are taking photos or watching the parade. This video captures the essence of the festival parade, highlighting the joyful interaction between the characters and the audience. The core theme revolves around festival joy, and the gingerbread theme runs through the entire event. The parade floats, the figures and the joy of the young women all constitute a coherent and happy story."
+ },
+ {
+ "video_id": "PI-347193bd-9b4c-4f8c-afb0-ca6f0b309212",
+ "path": "/workspace/benchmark/video/PI-347193bd-9b4c-4f8c-afb0-ca6f0b309212.mp4",
+ "answer": "The video begins with a close-up shot of a light blue big foot with a pink outline stepping into a pink crater. With the pressure beneath our feet, the crater became deeper and wider. The scene switches to a large, round, light blue creature with a pink outline, lying on its back, with a contented expression and its arms crossed over its belly. Then, this creature appeared in the library, standing upright and raising its arms, causing the books on the bookshelf to fall off. People were surprised by this. The creature looked around, seemingly not noticing the chaos it caused. A close-up shot shows that the creature peeked from behind the bookshelf, still wearing a contented expression. Then, this creature opened its mouth wide, revealing its black interior. The scene switches to a large, light blue, turtle-like creature with a pink outline standing in a room with a lattice ceiling. The two cannon-like appendages on its back burst into bubbles, and several people ran away in surprise. Then, this creature sat in the cinema, eating popcorn with a contented expression and watching its own silhouette projected onto the movie screen. The audience cheered and raised their hands excitedly. Next, people saw this creature floating in a depression like a crater, surrounded by several short people. Water began to gush out from the appendage on its back that was like a cannon, forming a flood. Then, this creature was placed on the top of a tall building in the urban landscape, and steam or smoke emerged from its cannon-like appendage. Then one can see this creature hanging on the banner, looking down with a mischievous expression. The end of the video is a city landscape, indicating that a catastrophic event has occurred."
+ },
+ {
+ "video_id": "PI-ea169897-c571-4b03-9dab-6219778b7af1",
+ "path": "/workspace/benchmark/video/PI-ea169897-c571-4b03-9dab-6219778b7af1.mp4",
+ "answer": "In a lively classroom setting, a male teacher sits in a chair facing the camera while his students participate in a fun recognition game. A line of students forms behind him, and one by one, they walk past, offering a simple greeting like \"Hi Mr. B.\" With a smile, the teacher correctly identifies each student by their name immediately after they speak. The students, both male and female, appear cheerful, with some laughing as they wait for their turn or watch their classmates. In the background, other students are visible, some standing by a whiteboard and one actively writing on it. The atmosphere is warm and friendly, highlighting a positive rapport between the teacher and his class."
+ },
+ {
+ "video_id": "PI-20e0df36-2939-435a-a249-f4fb96b7d3d5",
+ "path": "/workspace/benchmark/video/PI-20e0df36-2939-435a-a249-f4fb96b7d3d5.mp4",
+ "answer": "This video is a heartfelt tribute to Lisa \"Rose\" Gray, capturing the enduring bond and shared adventure among the three women. At the beginning, they stood on the rocky shore by the sea, with the impact of the waves in the background. The text \"In Loving Memory\" with pigeon emojis appeared, giving people a sense of reflection. The camera switches to the woman in the swimming pool. The words \"Reatha\", Rose \"Grey\" and the rose emoji indicate the dedication of the video. Then, the story shifts to a lively moment inside the bus. Two women are surrounded by passengers who are clapping and cheering. It seems to be a celebration. They posed in front of a small plane with the words \"Let's travel around the world together\" and a globe emoji on it, emphasizing their shared journey. The video continues to show that they are standing in front of a building marked \"Avila Beach Hotel est.1949\", and then standing by the indoor swimming pool at sunset. The words \"We will always cherish these memories\" and the heart-shaped emoji express deep nostalgia and lasting friendship. The background music is soft and melancholy, enhancing the mood of reflection. This video ends with a touching eulogy, paying tribute to the memories and experiences shared by the three women and celebrating the bond between them and the legacy of Risa \"Rose\" Gray."
+ },
+ {
+ "video_id": "PI-f32e4773-7010-4db7-aff0-f2143ba88a73",
+ "path": "/workspace/benchmark/video/PI-f32e4773-7010-4db7-aff0-f2143ba88a73.mp4",
+ "answer": "The video depicts a scene where Santa Claus sits on a chair beside a signboard printed with the slogan \"Let it snow\". With the The classic Christmas song 'You're a Mean One, Mr. Grinch' plays throughout the video. The song is sung by a male vocalist and serves as the primary audio, setting a thematic backdrop for the events involving the Grinch character and the girl. A little girl dashes across the Christmas gathering and snatches the Santa hat off a person dressed as the Grinch, who is attempting to exit through the doorway. This happens because the Grinch had taken the girl's bag. The Grinch then extends his hand, requesting the return of his hat. The girl hands it back to him, after which the Grinch gently pats her head and returns her bag."
+ },
+ {
+ "video_id": "PI-36cd5865-a6f8-41ed-9633-a13b1add0f76",
+ "path": "/workspace/benchmark/video/PI-36cd5865-a6f8-41ed-9633-a13b1add0f76.mp4",
+ "answer": "In the video, a girl is seen in a workshop, wearing a white sweatshirt with a bunny design and pink overalls, as she playfully stretches and manipulates a large container of pink slime.The background is filled with shelves stocked with various containers and supplies.An overlay comment from @molliemyson5969 appears, stating \"Girl you are so ugly 😒🤮\".Unfazed, the girl holds up a bottle of purple liquid labeled \"violet\" and continues to interact with the slime.She then adds green decorative pieces to the slime and playfully places a small white object on top.The scene transitions to her outside near a dumpster, still holding the container of slime.She throws the slime into the dumpster and walks away. Meanwhile, the audio features the girl responding to the comment, thanking Molly for the comment and humorously deciding to make a slime in a \"poop color brown\" as a playful response. She expresses her appreciation for positive comments and wishes everyone a great new year.The video wraps up with her waving, showing her gratitude and positive spirit despite the initial negative comment."
+ },
+ {
+ "video_id": "PI-c445fafc-e188-4df9-972b-13307b5f5363",
+ "path": "/workspace/benchmark/video/PI-c445fafc-e188-4df9-972b-13307b5f5363.mp4",
+ "answer": "The video takes place in an indoor trampoline park, where an individual dressed in a red long-sleeve shirt, black pants, and orange shoes is seen attempting a backflip. The setting is equipped with blue and black padded walls, ropes, and aerial silks in blue and orange. Overlaid text reading \"Getting sh*t in movies Vs Real life\" suggests a comparison between cinematic stunts and real-life attempts. As the person tries the backflip, they lose balance and fall onto the trampoline mat, initially lying spread-eagled before curling up in apparent pain. They sit up, holding their head or neck, then cautiously stand, showing unsteadiness. After walking a few steps, they sit down again near the ropes and aerial silks, continuing to display discomfort by leaning over and holding their side or back. Eventually, they walk slowly, favoring one side, and sit down once more, indicating possible injury from the fall. Over this scene, the background music features male vocals in a humorous and sarcastic tone, with lyrics like \"On a scale from one to ten, my friend, you're fucked,\" adding a layer of ironic commentary to the mishap. The BGM's playful and mocking lyrics humorously criticize or tease, enhancing the video's entertaining and amusing quality."
+ },
+ {
+ "video_id": "PI-32cdb119-9eed-447c-88fe-1eec6a4b87e4",
+ "path": "/workspace/benchmark/video/PI-32cdb119-9eed-447c-88fe-1eec6a4b87e4.mp4",
+ "answer": "The video presents a captivating and mysterious scene on the open sea, where a giant unknown creature is observed 40 miles offshore. It begins with a view of the ocean surface, revealing gentle waves and a faint turquoise shape lurking beneath the water. As the footage unfolds, the creature's form gradually becomes more distinct, showcasing a large, elongated shape swimming just below the surface. The camera angle shifts to offer various perspectives of the creature's movement through the water. Meanwhile, the accompanying audio captures a male speaker's curious and slightly confused reactions, expressing uncertainty about what he witnessed and suggesting to turn on the camera again for another look. The scene transitions to a view from a boat, displaying fishing rods and the vast expanse of the ocean, with the consistent text overlay \"GIANT UNKNOWN CREATURE 40 MILES OFFSHORE\" emphasizing the mystery and location of the sighting. The camera pans across the shimmering sunlight and the endless horizon before returning to focus on the turquoise shape beneath the waves. The video concludes with a closer view of the creature, highlighting its immense size and the way it moves gracefully through the water, leaving viewers intrigued and curious about the nature of this enigmatic giant in the deep sea."
+ },
+ {
+ "video_id": "PI-83078783-c975-4b3c-9159-453810373b50",
+ "path": "/workspace/benchmark/video/PI-83078783-c975-4b3c-9159-453810373b50.mp4",
+ "answer": "The video features a person sitting on a couch, wearing a gray t-shirt and dark pants, holding a large, knitted garment with a perplexed expression. The text overlay reads, 'I was folding my girls clothes can someone explain? ' as the person examines the garment, which has two armholes but no obvious headhole, expressing confusion about its design and purpose. The background shows a cozy room with a lamp and some decorations, indicating a home setting. The person continues to manipulate the garment, trying to understand how it should be worn or folded. In the audio, the man speaks in a neutral tone with occasional curiosity, commenting on the unusual design and questioning its practicality, while a woman humorously refers to the item as a 'necklace shirt.' The video wraps up with the person still puzzled, highlighting the unconventional design of women's clothing and the lighthearted confusion it causes."
+ },
+ {
+ "video_id": "PI-7aae3999-0397-4827-8926-f5a3f88d8abe",
+ "path": "/workspace/benchmark/video/PI-7aae3999-0397-4827-8926-f5a3f88d8abe.mp4",
+ "answer": "The video begins with a close-up of a hand holding a small, crumpled package wrapped in bubble wrap, accompanied by the text 'Plz... no Hate' with a pleading face emoji, indicating a request for understanding or leniency. The package is opened to reveal a black, fingerless glove with a small device attached to the back. The glove is put on, and the device is adjusted to fit snugly on the back of the hand. The person wearing the glove uses it to press buttons on the car's dashboard, demonstrating its functionality. The scene transitions to the engine compartment of a car, where the person uses the glove to manipulate components under the hood, with the device's lights illuminating the area. Meanwhile, the audio features a speaker, likely male, whose emotional state is not clearly discernible. The speaker mentions 'Oh' in a casual tone, and the ambient music has an unknown tempo and atmosphere, contributing to the overall mood. The video showcases the practical use of the glove in both the car's interior and engine compartment, blending a demonstration of functionality with an undercurrent of emotional ambiguity from the audio."
+ },
+ {
+ "video_id": "PI-e2d494bf-8fbd-4e0c-ad93-1b2735a4e92c",
+ "path": "/workspace/benchmark/video/PI-e2d494bf-8fbd-4e0c-ad93-1b2735a4e92c.mp4",
+ "answer": "The video begins with a young woman in a dimly lit, industrial-style room, dressed in a school uniform featuring a white shirt, black pinafore, and a black-and-white striped tie. The gritty urban backdrop, complete with exposed pipes, wires, and brick walls, sets a distinct tone. As she speaks, text overlays appear on the screen, stating, 'She already has 12 other guys telling her she's pretty,' which aligns with the audio's mention of a gender-neutral lady noting that someone already had 12 other guests promoting her as very charming. The scene then transitions to a bright, clean martial arts studio where a man, clad in a black long-sleeve shirt with white designs, light gray cargo pants, and black sneakers, practices with a long staff. The studio's well-lit environment, featuring fluorescent lights and stacked black punching bags, contrasts sharply with the previous setting. The man demonstrates focused and skilled movements, showcasing precision and control. The audio briefly includes a man singing a few lines, suggesting a dialogue scene without obvious emotional intensity or background music, which complements the visual transition from a gritty urban scene to a disciplined martial arts practice."
+ },
+ {
+ "video_id": "PI-621c45a2-b3b9-4476-91c8-5a5a36f2f3de",
+ "path": "/workspace/benchmark/video/PI-621c45a2-b3b9-4476-91c8-5a5a36f2f3de.mp4",
+ "answer": "The audio and video content together present an engaging and dynamic narrative centered around a misunderstanding and a subsequent argument, blending elements of humor and seriousness. The audio track features a dialogue between three characters: a female speaker who expresses frustration, surprise, and assertiveness, and two male speakers who appear skeptical, defensive, and slightly humorous. The female questions the males' presence and actions, while they attempt to explain a misunderstanding and justify themselves. Meanwhile, the video begins with a police officer standing outside a partially open door, transitioning to a festive room decorated with a Christmas tree, where a man in a light purple shirt and dark vest is dancing with a woman in a pink shirt. Playful banter ensues, with the woman slapping the man and asking why he's home early, to which he mistakenly addresses her as Mike. The scene takes a humorous turn as the man falls, knocking over the Christmas tree, and the police officer enters, grabbing the man and explaining that it was a surprise. The woman, confused and concerned, asks what's happening, highlighting the contrast between the festive setting and the serious undertones of the dialogue. The video effectively captures the confusion, humor, and tension of the situation, complementing the audio's dialogue-driven narrative to create a compelling and entertaining story."
+ },
+ {
+ "video_id": "PI-c541f488-0697-4c28-906c-f1c693984a8b",
+ "path": "/workspace/benchmark/video/PI-c541f488-0697-4c28-906c-f1c693984a8b.mp4",
+ "answer": "This video features a split-screen scene. On the left is the news studio and on the right is the representative of LC Sign. In the studio, a woman dressed in bright pink is standing in front of a large screen, on which digital graphics and the Vietnamese flag are displayed, and the sign \"Good News for Los Angeles\" can be seen on it. The focus shifted to Tony, a Chinese marketer from LC Sign who became popular for his humorous social media videos. He was in a showroom filled with neon lights, showcasing various customizable products. Tony emphasized an \"infinite logo\" and a logo of an astronaut playing the guitar, highlighting that any shape and size can be customized. He mentioned that customers can purchase customized products at the best value for money, without a minimum order quantity (MOQ) or global shipping. Tony encouraged the audience to customize the logo on their website or send their logo to get a quote, urging them to follow the LC logo. The background is decorated with colorful neon lights, one of which reads \"Order Now\", enhancing the lively atmosphere of the exhibition hall. This video effectively combines visual appeal with information content, demonstrating the versatility and affordability of LC Sign products."
+ },
+ {
+ "video_id": "PI-79f3890b-70e2-4f2a-bc5d-29822c53eb4e",
+ "path": "/workspace/benchmark/video/PI-79f3890b-70e2-4f2a-bc5d-29822c53eb4e.mp4",
+ "answer": "The video offers a calm and informative exploration of a common misconception about vision loss, presented by a neutral-toned male speaker. It effectively debunks the myth that completely removing one's eyeballs would result in seeing darkness. Instead, the video explains, there would be a complete absence of visual input, akin to trying to see with an elbow, which lacks the necessary sensory organs for vision. The video guides viewers through this thought experiment using a series of text overlays and visual transitions, starting with close-ups of an ear and head, moving to an outdoor setting, and finally using a glowing blue hand and a close-up of an elbow to reinforce the point. Accompanied by ambient music with a steady tempo and a neutral mood, the video successfully educates viewers about the fundamental role of the eyes in processing light and forming images, leading to a state of nothingness in terms of visual perception when the eyes are absent."
+ },
+ {
+ "video_id": "PI-1df3a26f-7a96-4f9c-a247-f00999b20256",
+ "path": "/workspace/benchmark/video/PI-1df3a26f-7a96-4f9c-a247-f00999b20256.mp4",
+ "answer": "The video portrays a scene where a person with long golden hair, adorned with multiple tattoos on their arm and rings on their fingers, sits in front of a granite stage, surrounded by a TV with neon lights and a room with white walls and ceiling fans. The individual, seemingly expressing frustration or disappointment through hand gestures, is depicted as speaking about a personal grievance, as indicated by the text on the video: \"Ash complains because her next 23-year-old victim doesn't want to have sex.\" This text implies that the person, likely Ash, is unhappy because a 23-year-old potential partner is unwilling to engage in sexual relations before marriage due to religious beliefs, echoing the core message of the audio where a woman's slightly helpless and complaining voice expresses discontent over similar situations where target partners are unwilling to be intimate before marriage. The audio and video together highlight the frustration and disappointment faced by individuals in such circumstances, with the video providing a visual representation of the personal complaint or roar at this situation, while the audio delves deeper into the emotional aspect of the complaint."
+ },
+ {
+ "video_id": "PI-fad18f1d-a901-410f-afed-b2e13b8e5a45",
+ "path": "/workspace/benchmark/video/PI-fad18f1d-a901-410f-afed-b2e13b8e5a45.mp4",
+ "answer": "In this short, selfie-style video, a young woman shares her thoughts from her bedroom. Positioned in front of a large, decoratively framed mirror, she speaks directly to the camera about a topic that seems to amuse and exasperate her. She critiques the fashion choice of a public figure, specifically focusing on a particular set of pajamas that she believes are worn too frequently. The speaker uses expressive gestures and a conversational tone to make her point, questioning why the individual doesn't vary their sleepwear. The background of the video shows parts of her room, including a bed with pink and silver pillows."
+ },
+ {
+ "video_id": "PI-a0cda524-2165-4704-9171-87e8be278cbe",
+ "path": "/workspace/benchmark/video/PI-a0cda524-2165-4704-9171-87e8be278cbe.mp4",
+ "answer": "The video features an individual clad in a maroon hoodie and clear glasses, standing centrally against a plain, light-colored backdrop that ensures the focus remains on the speaker. The camera zeroes in on their upper body, capturing various hand gestures such as pointing towards the camera and making an 'OK' hand sign, which add a layer of expressiveness to the delivery. Overlaid at the bottom of the screen is the text 'How to tell you're a freak pt 2:', signaling that this is part of a continuing series. The speaker, with a casual and slightly humorous tone, delivers advice that if searching for 'clean dishes' on TikTok yields nothing but thirst traps and videos of people not cleaning dishes, one should immediately clear their search history. This humorous take is enhanced by the minimalistic setting, free from distractions, allowing the audience to fully engage with the satirical content being presented."
+ },
+ {
+ "video_id": "PI-1f1ec4f3-9589-46de-84ff-786fcaa8dee9",
+ "path": "/workspace/benchmark/video/PI-1f1ec4f3-9589-46de-84ff-786fcaa8dee9.mp4",
+ "answer": "The audio features a female singer's casual and perhaps humorous voice, singing the phrase \"ID. Haha, you're here,\" with the line \"you came along\" suggesting a sense of relief or approval of someone's presence, setting a light-hearted tone. In stark contrast, the video presents a desolate and somber scene, beginning with a close-up of a weathered warning sign on a black metal fence, reading: \"Warning: This property is protected by a well-trained dragon. Not responsible for injury or possible death.\" The backdrop is a charred landscape with bare trees, indicating a recent fire, as the camera pans to reveal more damaged areas, including a garden hose on the ground and a small fire still burning in the distance. The scene transitions to a wider street, showing burned-out vehicles and houses, with the sky filled with smoke, creating an atmosphere of destruction and ruin. The video continues to depict the aftermath of the fire, with damaged houses and vehicles lining both sides of the street, still shrouded in smoke, highlighting the extensive damage caused by the blaze. The contrast between the audio's casual and humorous tone and the video's desolate and somber imagery creates a poignant juxtaposition, emphasizing the stark differences in mood and content between the two."
+ },
+ {
+ "video_id": "PI-4d708aed-1be7-48ea-83a9-e663382e6a85",
+ "path": "/workspace/benchmark/video/PI-4d708aed-1be7-48ea-83a9-e663382e6a85.mp4",
+ "answer": "The video showcases a man and a woman walking down a deserted road under a clear blue sky, their attire - she in a red, form-fitting outfit with black knee-high boots and he in a white blazer, black shirt, and black pants - adding to the visual allure. Initially embracing each other, they soon transition to walking side by side, occasionally adjusting their clothing, before bursting into an energetic dance. Their synchronized moves, filled with lively arm and leg gestures, including the woman occasionally lifting one leg and the man mirroring her, along with their coordinated poses, highlight their dynamic interaction. Meanwhile, the audio, a background music track from the video, features a female singer with a confident and alluring tone, emphasizing a theme of risk and excitement through lyrics like \"The game is danger\" and \"Dangerous.\" The song's content revolves around the thrill and danger of a game, likely in a romantic or seductive context, perfectly complementing the video's portrayal of freedom and joy as the couple dances against the backdrop of a vast, open landscape, enhancing the overall atmosphere with a sense of danger and allure."
+ },
+ {
+ "video_id": "PI-2e4ec821-1756-4d7a-802d-9c5cf33b74b7",
+ "path": "/workspace/benchmark/video/PI-2e4ec821-1756-4d7a-802d-9c5cf33b74b7.mp4",
+ "answer": "The video begins with a close-up view of a wooden floor, where several yellow tortilla chips are arranged in a slightly curved line.The camera then pans to the right, revealing more of the wooden floor and additional tortilla chips scattered around.As the camera continues to move, a golden object, possibly a piece of furniture, comes into view.The scene transitions to a tiled floor with a geometric pattern, where the tortilla chips are now arranged to spell out \"I ❤️ YOU\" in a creative display.The camera pans further to reveal a bathroom setting, featuring a white bathtub filled with orange liquid, likely cheese sauce.The tortilla chips are seen both on the tiled floor forming the message and scattered around the bathtub.The camera focuses on the bathtub, showing the orange liquid in detail.A hand then appears, picking up a tortilla chip and dipping it into the orange liquid, coating it thoroughly.The video includes an audio overlay with a voice saying, \"Because I'm not scared of loving you, I'm just scared of losing you,\" followed by a question, \"Why does the sun shine?\" , and some fragmented phrases that are not fully clear.The video concludes with a person holding the cheese-covered tortilla chip near their mouth, preparing to take a bite."
+ },
+ {
+ "video_id": "PI-a1707f1e-7c41-4956-8c2e-d49eefb6b036",
+ "path": "/workspace/benchmark/video/PI-a1707f1e-7c41-4956-8c2e-d49eefb6b036.mp4",
+ "answer": "The video presents a close-up view of an individual wearing a gray hoodie, pulled up over their head, with the frame focusing on the upper part of their body. The background is plain and neutral, enhancing the simplicity and ensuring the viewer's attention remains on the person. Soft, even lighting minimizes shadows, highlighting the texture of the hoodie and suggesting an indoor setting. Complementing this visual, the audio features a neutral-toned male voice instructing the listener to use the search bar in an application and type in 'Cincinnati, Ohio.' The voice provides clear, straightforward guidance without any emotional undertones or background music, aligning with the video's minimalist approach. Together, the audio and video create a cohesive, instructional experience, focusing on delivering a simple, direct message while maintaining a calm and uncluttered aesthetic."
+ },
+ {
+ "video_id": "PI-f6424d3e-bc9e-432d-9efd-20fec27330b0",
+ "path": "/workspace/benchmark/video/PI-f6424d3e-bc9e-432d-9efd-20fec27330b0.mp4",
+ "answer": "This video captures a woman performing an energetic dance routine in a brightly lit bedroom. Dressed in a black crop top and grey sweatpants, she moves with fluidity and expression on a light-colored carpet. In the background, the room is tastefully decorated with a large bed featuring white linens, a wooden frame, and an abstract painting with a yellow square on the wall.A woman hums along to the lyrics of the BGM while dancing.The BGM is an upbeat electronic pop song with a female vocalist. She sways her arms up and down, left and right, posing, then twists her body in sync, repeating this sequence over and over."
+ },
+ {
+ "video_id": "PI-18f168ce-9596-4073-b2b7-173cb5fe87a3",
+ "path": "/workspace/benchmark/video/PI-18f168ce-9596-4073-b2b7-173cb5fe87a3.mp4",
+ "answer": "The video commences with a screenshot of a text message conversation on a smartphone, featuring a black background with blurred elements hinting at other messages or notifications, before transitioning to the interior of an aircraft. The camera pans across the airplane seats, overhead compartments, and other passengers, capturing the typical flight conditions with ceiling lights illuminating the cabin. The video concludes with a close-up shot of a person seated in an airplane, surrounded by the plane's interior, including seats and windows, suggesting they are on the same flight as a boyfriend mentioned in the preceding text message. Meanwhile, the audio complements the video with soft and ambient background music that creates a calm and intimate atmosphere. While faint, indistinct voices can be heard in the background, they are not the main focus; instead, the BGM enhances the emotional tone, emphasizing a sense of connection and the surprise of being on the same flight, thereby perfectly aligning with the video's narrative of a shared journey."
+ },
+ {
+ "video_id": "PI-11d483c8-558b-4e7f-935f-68737427b19e",
+ "path": "/workspace/benchmark/video/PI-11d483c8-558b-4e7f-935f-68737427b19e.mp4",
+ "answer": "The audio captures the raw emotions of two men through their screams, with one man's screams exuding excitement and the other's displaying anger. The video provides a vivid context for these emotional outbursts, depicting the two men watching a football match. As the live broadcast unfolds, with players actively participating on the field, the live scoreboard displays key information such as '90:00, :21 +2', the score 'JUV 4,6 SLB', and live numbers 'SAMYMAX7, SAMYMAX7, SAMYMAX7'. In the latter half of the video, the emotional intensity peaks as the man on the left reacts excitedly, moving off the screen, while the man on the right, visibly angered, stands up and even kicks the man on the left. This scene effectively conveys the profound emotional impact that winning or losing a football match can have on fans, aligning perfectly with the audio's portrayal of contrasting emotions through the men's screams."
+ },
+ {
+ "video_id": "PI-d748509e-a204-41f4-b7fb-4f14bf1289d9",
+ "path": "/workspace/benchmark/video/PI-d748509e-a204-41f4-b7fb-4f14bf1289d9.mp4",
+ "answer": "In a dimly lit bar, a man teaches a woman how to play pool. He stands behind her, guiding her hands as she aims the cue stick. They share a brief, affectionate kiss before he takes the shot. The camera follows the cue ball as it breaks the rack, sending colored balls scattering across the black felt of the table. Afterward, the couple embraces warmly, both smiling. The scene is set to a background of upbeat hip-hop music, with neon signs for 'Bud Light' and 'PBR' glowing in the background."
+ },
+ {
+ "video_id": "PI-94ee6d92-0c67-4deb-8df3-b439ba6cd378",
+ "path": "/workspace/benchmark/video/PI-94ee6d92-0c67-4deb-8df3-b439ba6cd378.mp4",
+ "answer": "The video and audio together capture the essence of a first-time visitor's experience in New York City. The video shows a man walking on a city sidewalk, dressed in a black puffer jacket, dark pants, and tan shoes, with a black pickup truck and a 'BROOKLYN PIZZA' storefront in the background. Text on the screen reads, 'How dudes be when they first land in New York,' setting the context for his behavior. The man stops to adjust his shoes or pick something up, then stands up to take in his surroundings, shielding his eyes to get a better view. The camera pans to a statue of a robed figure holding a torch, part of a storefront display, as he continues to explore. The audio complements this visual narrative with a male voice expressing surprise and excitement, recounting his fast-beating heart and shock at seeing his Timberland boots in the new environment. He uses expletives to emphasize his astonishment and describes the experience as 'lit.' Together, they convey the overwhelming and exhilarating feeling of experiencing New York City for the first time, capturing both the external observations and internal emotions of a newcomer."
+ },
+ {
+ "video_id": "PI-5661da41-980d-4be2-8715-476c5c260750",
+ "path": "/workspace/benchmark/video/PI-5661da41-980d-4be2-8715-476c5c260750.mp4",
+ "answer": "The video shows a woman in a bathroom wearing a leopard-print bathrobe with a red belt and a green curling iron on her head, against a background of light green walls and a white towel hanging from a hook. The woman energetically moves her arms, possibly dancing or expressing excitement, leaning back and looking up, then lowering her head and continuing to move enthusiastically, creating a vivid and lively atmosphere that captures moments of joy and self-expression. Meanwhile, the audio includes a male speaker saying in an uncertain emotional tone, \"That's a baby,\" and a female speaker discussing various abstract and fragmented ideas such as hats, beaches, jellies, and tails. The two engage in disjointed and seemingly unrelated conversations, accompanied by ambient music with a neutral rhythm, adding a layer of complexity to the whole scene."
+ },
+ {
+ "video_id": "PI-90cf97dd-8024-4f28-9578-e0fc4472dcc0",
+ "path": "/workspace/benchmark/video/PI-90cf97dd-8024-4f28-9578-e0fc4472dcc0.mp4",
+ "answer": "The video was shot in a studio where many people, likely journalists or photographers, have gathered amid numerous lights and cameras. A large red carpet covers the floor. The core focus of the video is a young lady wearing a long dress with red sequins; she has long golden hair and is very tall. She emerges from the back of the studio and hugs a man, presumably the on-site manager, then walks to the center of the red carpet. The man directs her positioning while chatting with her, making small talk and noting issues to watch during the shoot. The photographers take pictures of the lady from different directions as she strikes different poses. Everyone around watches her with apparent envy and admiration, and the lady is immersed in the shoot, making the atmosphere on the spot cheerful and lively. Overall, the entire video and audio are cheerful, giving people a pleasant mood."
+ },
+ {
+ "video_id": "PI-9183c33e-4cc0-40d6-acd1-8595b534a90f",
+ "path": "/workspace/benchmark/video/PI-9183c33e-4cc0-40d6-acd1-8595b534a90f.mp4",
+ "answer": "The video begins with a view of a computer monitor displaying a scenic wallpaper, a white keyboard, and a mouse on a cardboard surface, accompanied by the text 'Carter please' at the top. The camera then shifts to an open cardboard box containing a computer's internal components, including a graphics card and power supply. The scene transitions to a person with curly hair wearing a black shirt, speaking to the camera in a room with a dark-colored wall, gesturing with their hands. The video then returns to the cardboard box with the computer parts, now with the text 'Jo - 6d ago Please i need it #carterpcs' at the bottom. Meanwhile, the audio features a male speaker expressing frustration over spending excessively on computer components without a proper case, requesting one. Another male speaker humorously critiques the unconventional cardboard setup, suggesting it is functional but not ideal, and proposes sending a case and desk based on social media engagement. His tone shifts from initially ambiguous to neutral and playful, creating a lighthearted atmosphere. The ambient music remains steady and neutral, enhancing the casual vibe of the video."
+ },
+ {
+ "video_id": "PI-e43c7c8e-77d5-4cc5-b54c-895cb8ab637c",
+ "path": "/workspace/benchmark/video/PI-e43c7c8e-77d5-4cc5-b54c-895cb8ab637c.mp4",
+ "answer": "The video showcases a close-up view of a hand with long, decorated nails interacting with a black mechanical keyboard placed on a soft, fluffy gray surface. The hand presses various keys, and as it types, the keyboard's backlighting dynamically changes colors, displaying a vibrant array of pink, purple, red, orange, yellow, green, and blue lights. This backlighting forms a web-like pattern that highlights the keyboard's intricate design, creating a visually striking effect. Complementing the visual elements, the background music is a dynamic and rhythmic instrumental track, enhancing the overall atmosphere with its lively and engaging qualities. The combination of the hand's movement, the colorful backlighting, and the energetic music emphasizes both the aesthetic appeal and functionality of the keyboard, creating a cohesive and captivating experience."
+ },
+ {
+ "video_id": "PI-db20d7b4-ce67-4075-a689-fc7a19ba92b6",
+ "path": "/workspace/benchmark/video/PI-db20d7b4-ce67-4075-a689-fc7a19ba92b6.mp4",
+ "answer": "The video captures a light-hearted \"Don't spray it\" challenge involving a mother and her two children. The mother tries to explain a series of contemporary internet slang terms, often referred to as \"brain rot words,\" while holding water in her mouth. Her children, positioned on either side of her, observe and react, leading to the mother entertainingly struggling not to spray water everywhere. As the challenge progresses, the sequence highlights their playful interactions, resulting in sprayed water and shared laughter. The setting is a relaxed family environment, with a dining or kitchen counter as the focus of their activities. The video ends with the participants laughing together."
+ },
+ {
+ "video_id": "PI-7a97b475-0a2c-4c24-accb-7f405b3db547",
+ "path": "/workspace/benchmark/video/PI-7a97b475-0a2c-4c24-accb-7f405b3db547.mp4",
+ "answer": "The video begins with a close-up shot of a red-haired woman accidentally spilling juice on a bride's wedding dress, leaving the bride looking distressed. A man in a black suit and bow tie observes the situation as the red-haired woman tries to help, but the bride remains unhappy. The red-haired woman then secretly takes the man's phone to take a selfie, with the phone screen showing a countdown of 1:13. She makes a call to 'Alan,' while the bride, still in her wedding dress and veil, appears increasingly dejected as she talks to the man. He picks up the phone, which displays 'Kate Friend,' and seems to explain something before throwing the phone to the ground and embracing the bride, offering a moment of comfort. Meanwhile, the red-haired woman, holding her phone, turns away looking distressed. The video concludes with the bride resting her head on the man in the black suit, hinting at a moment of reconciliation. Accompanying this visual narrative is a BGM track featuring a female voice singing softly and emotionally, conveying a sense of melancholy and longing. The lyrics reflect themes of giving up and losing happiness, underscoring the emotional struggle and heartache depicted in the video."
+ },
+ {
+ "video_id": "PI-8ee675b4-6ed9-4edf-9cb4-2bc06a33dfb4",
+ "path": "/workspace/benchmark/video/PI-8ee675b4-6ed9-4edf-9cb4-2bc06a33dfb4.mp4",
+ "answer": "The video centers on a pile of Christmas gifts, presenting a festive scene. Each gift is wrapped in various festive-themed papers, and the background of comfortable golden-yellow fabric, perhaps a sofa or curtains, adds a warm holiday atmosphere. The presents include wrappers featuring characters from How the Grinch Stole Christmas and traditional red paper decorated with green Christmas-tree patterns. The most eye-catching gift is wrapped in light brown paper and tied with a large green ribbon, on which is written \"Nikki and Rosie\". When the camera moves over the gifts, the overlay text \"My boyfriend has his own special wrapping paper\" appears, highlighting the humorous or unique nature of the packaging choice and drawing attention to one gift wrapped in paper that looks like a pattern of roasted beans, adding a comical twist to the scene. The audio and visuals complement each other: the audio expresses a desire to relax and enjoy the night, but is interrupted by an \"excited shrill sound\", implying a contrast between the desired peaceful atmosphere and an unexpected, irritating distraction, which adds a layer of humor or frustration to the festive mood. Overall, the video captures a light-hearted and cheerful moment during the holiday season, focusing on the unique and humorous aspects of gift-giving, while the audio introduces an unexpected interruption that enhances the relevance and comedy of the scene."
+ },
+ {
+ "video_id": "PI-e7285ec5-2bb1-4184-82e4-72f9a7982035",
+ "path": "/workspace/benchmark/video/PI-e7285ec5-2bb1-4184-82e4-72f9a7982035.mp4",
+ "answer": "In the video, a person stands in a room featuring a ceiling fan and an open door in the background, wearing a dark blue T-shirt with a small logo on the left side of the chest. The screen prominently displays the text \"Valentine's Day DISS\" along with three heartbroken emojis, setting a critical tone from the start. Throughout the video, the individual makes various gestures, including pointing down, giving a thumbs-up, and clapping hands together, all while seemingly speaking or rapping to express their contempt for Valentine's Day. This aligns perfectly with the audio content, where a male speaker adopts a dismissive and slightly critical attitude towards the holiday, describing it as fake and firmly stating his disinterest in receiving chocolates or candy. Both the video and audio convey a straightforward and somewhat skeptical emotional tone, focusing on the speaker's personal and critical views on Valentine's Day in a casual conversational or performance setting."
+ },
+ {
+ "video_id": "PI-d9956c92-ee0b-4c9a-86c9-93487642ce2d",
+ "path": "/workspace/benchmark/video/PI-d9956c92-ee0b-4c9a-86c9-93487642ce2d.mp4",
+ "answer": "This video presents a warm and inviting indoor scene at night, capturing a moment of simple joy and comfort in a home environment. The room features a large window offering a view of a cityscape illuminated by numerous lights, creating a serene backdrop. Inside, a round, shallow ball pit filled with pastel-colored balls is placed on a striped rug, where a small dog is energetically playing, moving around and interacting with the balls. The room is furnished with a wooden cabinet topped with a potted plant, and a television displaying a news channel. Two framed pictures adorn the wall above the cabinet, adding a personal touch to the space. The audio accompanying the video is subtle, with faint sounds resembling the rustling of objects, as if someone is quietly searching for something, which complements the tranquil yet lively atmosphere of the scene."
+ },
+ {
+ "video_id": "PI-d7117940-4fc2-431c-beef-fbef230388a2",
+ "path": "/workspace/benchmark/video/PI-d7117940-4fc2-431c-beef-fbef230388a2.mp4",
+ "answer": "The video begins in a modern kitchen setting, where a woman is leisurely peeling an orange while conversing with a young girl. Meanwhile, the girl is focused on her homework at the table, and the atmosphere is calm and warm. Their conversation seems mundane at first, until the girl casually mentions that she has reported the woman to the IRS, which leaves the woman visibly shocked. Suddenly, the scene shifts dramatically. An individual in full tactical gear swiftly enters through the window, apprehends the woman, and handcuffs her. As the woman is escorted away, the girl remains calm, continuing to write her homework and bidding farewell in a composed manner. The video ends with a sharp electronic tone, emphasizing the sudden and surreal nature of the events. The stark contrast between everyday life and these extraordinary, unexpected occurrences creates a unique and surprising narrative.\n"
+ },
+ {
+ "video_id": "PI-0de8cdec-d158-484b-a9f0-30054ee117de",
+ "path": "/workspace/benchmark/video/PI-0de8cdec-d158-484b-a9f0-30054ee117de.mp4",
+ "answer": "The audio captures a dynamic exchange where a male confidently pitches a viral idea for a song, only to be met with a humorous or sarcastic response from a female, showcasing their playful banter. The accompanying video amplifies this interaction, beginning with a man in a black shirt standing outside a door, gesturing towards the camera with the text \"I have an idea that can make your song popular\" displayed on the screen. He sings and moves towards the room, while the camera then shifts focus to a woman in a beige hoodie, lounging by a window on a sunny day, singing and shaking her head, possibly in jest or disagreement. The audio snippet \"You made me kiss the ground\" hints at the musical context, and the screen text \"My wife's thoughts\" adds a personal touch, suggesting the woman's perspective. The video concludes with the woman's text \"Ordinary FEB 7,\" humorously contrasting the grandiose idea of making a song popular with more mundane or everyday suggestions, thus creating a lighthearted and engaging narrative that mirrors the playful dynamic in the audio."
+ },
+ {
+ "video_id": "PI-d627b249-e7e3-458b-8b52-029dc215e92e",
+ "path": "/workspace/benchmark/video/PI-d627b249-e7e3-458b-8b52-029dc215e92e.mp4",
+ "answer": "The video features a man in a beige t-shirt and headphones, seated at a desk with two large buttons—green and red—in front of him. The background includes shelves with various items and a refrigerator filled with drinks, all under blue lighting. The video begins with the text \"Most requested\" and \"song of the week\" appearing on the screen. A viewer's comment from \"b3astybOy_bb\" requests a \"DO CHESS TYPE BEAT.\" As the man listens to the beat, on-screen text like \"listening\" and \"this is giving serotonin\" appears, indicating his engagement and enjoyment. He reacts with visible excitement, exclaiming \"YOOOO!\" and gesturing animatedly. His positive reaction continues with comments such as \"vibes? Immaculate\" and \"my mood? Elevated.\" He further expresses his approval by saying \"THAT'S ACTUALLY,\" \"A VIBE,\" \"PROCESSING,\" and finally, \"I LIKE IT.\" Meanwhile, in the audio, a male speaker with an indeterminate emotional tone mentions something similar to \"The song of the Wick,\" accompanied by ambient music with an undefined tempo and atmosphere, before expressing approval with \"I like it.\" The video concludes with the man still enthusiastically reacting to the beat."
+ },
+ {
+ "video_id": "PI-40852f3b-b912-473a-9f2a-4d206b415897",
+ "path": "/workspace/benchmark/video/PI-40852f3b-b912-473a-9f2a-4d206b415897.mp4",
+ "answer": "At the beginning of the video, a person wearing a brown jacket and carrying a black backpack stands near a glass door bearing the text: \"Go to and from a class and play Fiti Wap on JBL.\" This sets the scene, indicating that this person is on their way to class while playing music loudly on a JBL speaker. The camera switches to a corridor where another person, wearing a beige sleeveless shirt and khaki shorts, rides a small black bicycle while carrying a huge black speaker on his shoulder and continuing to blast the music. The same text overlay remains on screen, reinforcing the idea of blasting Fetty Wap on JBL speakers. The corridor is crowded with other students, some sitting and some walking, creating a lively and dynamic school environment. The cyclist continues along the corridor, passing more students and entering a larger room with a very high ceiling, exposed beams, fluorescent lights, and a carpeted floor. In the background, several students sit at tables, some chatting and others playing with their mobile phones. At the end of the video, the cyclist approaches the door, preparing to leave the room, letting the audience feel the vibrant and bustling atmosphere of the school environment."
+ },
+ {
+ "video_id": "PI-033e0865-2c2e-49c0-b765-3c2cc397a3b0",
+ "path": "/workspace/benchmark/video/PI-033e0865-2c2e-49c0-b765-3c2cc397a3b0.mp4",
+ "answer": "This video features a peaceful and comfortable indoor environment with wooden floors and a piece of furniture in the background. A tabby cat with distinct stripes is the central focus, lying comfortably on the lap of a person wearing black trousers and beige shoes. The cat looks very relaxed: its eyes are closed and its paws rest gently on the person's leg. The scene is accompanied by a humorous voice-over in which a person, perhaps a woman, recounts a prank the cat pulled. The voice-over begins: \"I want you to tell me what you just did.\" Through the voice-over, the cat \"admits\" to opening the car door for the delivery person, running out, jumping into the open car door, entering the back seat where a child was strapped into a car seat, and then taking a piece of chicken from the child. The voice-over conveys a mixture of amusement and slight annoyance, highlighting the cat's brazen and opportunistic behavior. At the end of the video, the cat is still lying contentedly on its owner's lap with its eyes closed, seemingly indifferent to the account of its comical behavior. The overall theme is the cat's humor and unexpected behavior, showcasing its naughty and mischievous nature; the cat's relaxed demeanor combined with the amusing voice-over creates a light-hearted and pleasant story."
+ },
+ {
+ "video_id": "PI-c3cde03b-6930-48a3-b009-2a10c3b671bd",
+ "path": "/workspace/benchmark/video/PI-c3cde03b-6930-48a3-b009-2a10c3b671bd.mp4",
+ "answer": "The video presents a vibrant and glamorous makeup tutorial. It begins with a close-up view of a person applying glittery gold eyeshadow to their eyelid using a small brush. The individual, with long dark hair and red-painted nails, is adorned with a sparkling choker necklace, enhancing the overall glamorous aesthetic. As the scene progresses, the person holds up a small white tube of product with a gold-tipped applicator, followed by displaying five small makeup sticks in various shades of glitter, including gold, copper, and pink, held vertically in front of their face. Meanwhile, the audio features a woman singing with cheerful background music, creating a lively atmosphere that complements the visual tutorial. The video concludes with the person showcasing the five glittery makeup sticks once more, emphasizing the variety of shades available, while their red-painted nails and sparkling choker remain visible, adding a consistent touch of elegance throughout the tutorial."
+ },
+ {
+ "video_id": "PI-eb82cb24-75ca-4fc8-b2d7-a535c7f93878",
+ "path": "/workspace/benchmark/video/PI-eb82cb24-75ca-4fc8-b2d7-a535c7f93878.mp4",
+ "answer": "In the video, Michael C. Hall announces the commencement of production for \"Dexter: Resurrection\" with a blend of neutral and happy emotions. The video captures the first official day of shooting, set in an indoor environment where other crew members and equipment are visible, along with white windows, doors, and several paintings adorning the walls, offering a glimpse behind the scenes. Hall expresses his excitement about the project, stating his eagerness for another thrilling journey. He extends an invitation to the audience, asking if they would like to join them on this adventure. The atmosphere is enhanced by ambient music with a steady tempo, creating a neutral yet engaging backdrop that complements the overall tone of the announcement."
+ },
+ {
+ "video_id": "PI-b1de6e59-e367-4c14-9219-c6704b02c622",
+ "path": "/workspace/benchmark/video/PI-b1de6e59-e367-4c14-9219-c6704b02c622.mp4",
+ "answer": "The video opens with a view down a long, empty hallway, featuring a patterned carpet in shades of gray and beige with yellow lines running down the center, wooden-paneled walls with glass doors, and a white ceiling with recessed lighting. At the far end, a figure is seen walking away before starting to run towards the camera. Dressed in a black hoodie, black pants, and white shoes, the person runs closer, eventually leaping into the air. The scene then shifts to the same individual crouching on the carpet, interacting with something possibly picked up from the floor. They then crawl briefly before quickly rising and sprinting down the hallway. Meanwhile, the audio features a man speaking in an excited tone, as if running to warn someone that something has fallen, adding a sense of urgency and excitement to the scene. The video concludes with the person running away from the camera towards the light at the far end of the hallway, encapsulating a dynamic and energetic moment."
+ },
+ {
+ "video_id": "PI-73a02519-7499-471b-95d2-a2be5019264f",
+ "path": "/workspace/benchmark/video/PI-73a02519-7499-471b-95d2-a2be5019264f.mp4",
+ "answer": "The video begins with a person standing in front of a brick wall, wearing a light-colored top and gold hoop earrings, creating an intimate and engaging atmosphere. The text overlay, 'laugh with every stranger, the stranger the better,' sets a thematic tone encouraging viewers to embrace interactions with unfamiliar people. The person appears to be speaking or singing, adding a personal touch to the scene. The narrative then transitions to an outdoor nighttime setting, where a sidewalk lined with trees and parked cars is shown. A person dressed in a long, dark coat walks away from the camera, introducing an element of mystery and anonymity. This shift from an intimate indoor setting to a more enigmatic outdoor scene underscores the theme of connecting with strangers. The video then returns to the initial setting, with the person in the light-colored top and gold hoop earrings standing once again in front of the brick wall, bringing the narrative full circle. Accompanying the visuals is a modern and upbeat instrumental BGM track, which enhances the dynamic and engaging atmosphere, reinforcing the message of finding joy in unexpected interactions."
+ },
+ {
+ "video_id": "PI-ed86891f-ee21-40c6-b2c8-370485f16103",
+ "path": "/workspace/benchmark/video/PI-ed86891f-ee21-40c6-b2c8-370485f16103.mp4",
+ "answer": "An individual is shown indoors, positioned in front of white shelves stocked with books and various items. The person is holding a smartphone, which they periodically bring closer to the camera to reveal its display. The phone's screen displays a visual of two hands forming a heart shape, accompanied by text. Overlaying the video is text prompting the viewer to search for \"love\" and note the first result as a self-description. Throughout the clip, a calm and sentimental musical track plays in the background. The music features a male vocalist repeating \"I love,\" contributing to the reflective and gentle atmosphere of the video."
+ },
+ {
+ "video_id": "PI-d668efb3-d209-4142-b61b-da8a48947f25",
+ "path": "/workspace/benchmark/video/PI-d668efb3-d209-4142-b61b-da8a48947f25.mp4",
+ "answer": "In a gym setting with green artificial turf flooring and dark walls, a woman with curly blonde hair performs a leg exercise. She is lying on her side on a padded workout bench, dressed in a white sports bra and light grey leggings. A black kettlebell is attached to her left foot, adding resistance as she lifts and rotates her leg in a controlled, repetitive motion. This workout appears to target the hip and glute muscles. In the background, other pieces of gym equipment are visible, including an exercise bike, a medicine ball, and a box with the text \"EPIC FIT\"."
+ },
+ {
+ "video_id": "PI-25873e45-939f-4f81-8e7f-4b7669859e1a",
+ "path": "/workspace/benchmark/video/PI-25873e45-939f-4f81-8e7f-4b7669859e1a.mp4",
+ "answer": "The video shows a black and white dog sitting on a robotic vacuum cleaner in a cozy room with wooden floors, creating a fascinating and entertaining scene. The dog rides the moving vacuum cleaner, occasionally scanning its surroundings and adjusting its posture to keep its balance, showing a curious and slightly cautious demeanor. A shelf in the room holds various objects and a plant that enhances the warm atmosphere. The blue light of the vacuum cleaner indicates its operation as it navigates the space. Meanwhile, the soundtrack is a cheerful, spirited song with female vocals, providing a light and uplifting backdrop that complements the playful interaction between pet and modern technology. The combination of visual and auditory elements creates an enjoyable and engaging experience."
+ },
+ {
+ "video_id": "PI-d6321b33-40b5-4e41-a540-2cb3cca9daad",
+ "path": "/workspace/benchmark/video/PI-d6321b33-40b5-4e41-a540-2cb3cca9daad.mp4",
+ "answer": "The video captures a quiet moment at home, focusing on a couple engaged in a shared activity. The man, dressed casually and wearing a baseball cap, is kneeling on one knee, fully absorbed in installing the flooring. Various tools are laid out beside him. Opposite him, the woman sits calmly on the light wood-patterned floor, conversing with the man. Her presence conveys a sense of companionship and support. The background features simple, light-colored walls, evoking a homely atmosphere—possibly a new home or one undergoing renovation. Soft acoustic music plays gently in the background, enhancing the calm and intimate mood. Text overlays express deep affection and gratitude, highlighting the woman’s feelings of safety, joy, and emotional connection while being by her partner’s side.\n"
+ },
+ {
+ "video_id": "PI-25860259-1cbf-404a-81e3-94811735ce7d",
+ "path": "/workspace/benchmark/video/PI-25860259-1cbf-404a-81e3-94811735ce7d.mp4",
+ "answer": "This video depicts a lively and charming scene in a bar, where a group of people sit around a wooden table playing what is called \"the best bar game\". The text overlay on the video emphasizes the fun and popularity of the game. In the frame, the players' index fingers all point at a small glass in the center of the table, filled with amber liquid. The game seems to involve coordination and precision, as players must carefully position their fingers to keep the small glass balanced. The atmosphere is full of vitality and social interaction, with all the participants fully engaged. The wooden table adds a warm and simple touch, enhancing the overall ambience of the bar. The audio contains a counting sequence, \"1,2,3,5,4,3\", indicating that the game involves some form of counting or coordination among players. The counting is delivered in a casual, playful tone, suggesting a relaxed and pleasant atmosphere. This video captures the essence of a fun, interactive bar game, highlighting the camaraderie and joy among players."
+ },
+ {
+ "video_id": "PI-fe28b7ec-1a3b-4b80-831b-2e7c16b65c0d",
+ "path": "/workspace/benchmark/video/PI-fe28b7ec-1a3b-4b80-831b-2e7c16b65c0d.mp4",
+ "answer": "This video captures an interesting safety update at a ski resort. From a first-person perspective, a snowboarder is waiting in the chairlift line on a bright, snowy day. A resort employee approaches and asks if the snowboarder has a leash for their step-on bindings. The employee explains that the resort now requires leashes for this specific type of Burton binding because they have experienced many instances of boards becoming detached and creating a 'runaway' hazard on the slopes. The snowboarder, seemingly taken aback, accepts the provided leash. Once on the chairlift, the snowboarder shares his thoughts on the matter. He expresses his astonishment at the new rule, questioning the fundamental reliability of expensive step-on bindings if they are prone to such failures. He muses that for a premium price, a product should be dependable enough not to require a simple leash for safety, suggesting a potential design flaw that could be dangerous for riders."
+ },
+ {
+ "video_id": "PI-feea7dc7-1030-4c0c-868f-73052aa044e8",
+ "path": "/workspace/benchmark/video/PI-feea7dc7-1030-4c0c-868f-73052aa044e8.mp4",
+ "answer": "The video unfolds in a contemporary open-concept kitchen-living area, featuring a male performer dressed in a blue bomber jacket, black tee, and light-wash oversized denim—a deliberate sartorial nod to Kendrick Lamar’s iconic streetwear aesthetic. Holding a microphone-like prop, he executes rhythmic dance-rap movements while the exaggerated 36-inch inseam jeans (2.5x standard length) pool and drag around his ankles, creating a dynamic fabric distortion effect."
+ },
+ {
+ "video_id": "PI-b6a219f8-35df-47f0-b113-602d8d3554a8",
+ "path": "/workspace/benchmark/video/PI-b6a219f8-35df-47f0-b113-602d8d3554a8.mp4",
+ "answer": "The video takes place in a well-lit, modern family interior. The white walls are decorated with framed photos, and one room has a stone wall. The protagonist is a young woman with long dark hair, wearing a dark blue long-sleeved shirt and light blue jeans. She walks through the corridor into a room that looks like a storage room, containing a table, shelves filled with books and other items, and a drum set. Her core movements include gesturing with her hands while speaking and pointing at the drums and other objects in the room. The key audio content is her conversation, which revolves around the Grammy Awards. She mentions, \"All the discussions about the Grammy Awards remind me that my dad did have three Grammy Awards and I could reach for them at any time.\" This sentence is also displayed as text superimposed on the video. The central theme is the woman recalling her father's achievements, especially his three Grammy Awards, which are kept in her home. As she interacts with the space and objects tied to her father's achievements, she expresses a sense of pride and nostalgia."
+ },
+ {
+ "video_id": "PI-7140ff82-3bdd-4900-843e-fb19c6613591",
+ "path": "/workspace/benchmark/video/PI-7140ff82-3bdd-4900-843e-fb19c6613591.mp4",
+ "answer": "The video depicts a snowy backyard scene in Maryland, where a dog, dressed in a white winter coat, is lying comfortably in the snow. The text 'Neighbors called the cops on us for dog neglect' is displayed throughout, highlighting the irony of the situation. The speaker, likely male, recounts his frustration and mild annoyance as police and animal control visited his home due to an anonymous report about his dog being left outside in cold weather. The authorities acknowledged the visit was unnecessary and apologized, recognizing it as a waste of time. Despite the cold weather, the dog appears content and well-cared for, with the camera panning across the snowy yard, showing trees and a stone wall in the background. The video emphasizes that the dog is not neglected but is well-protected against the cold, making the neighbors' concern seem unnecessary and the authorities' visit an overreaction."
+ },
+ {
+ "video_id": "PI-b12c4f4b-3ff1-4e1d-b5e7-d4186953b996",
+ "path": "/workspace/benchmark/video/PI-b12c4f4b-3ff1-4e1d-b5e7-d4186953b996.mp4",
+ "answer": "The video begins with a close-up of a hand holding a red can of Dr Pepper soda next to a tall glass filled with ice on a snowy surface, with a black metal railing and snow-covered street in the background, setting a cold, urban scene. The hand tilts the can, pouring the dark brown soda into the glass, creating a layered effect with the ice and gradually filling it until foam forms at the top. The scene then transitions to a person in a black hooded jacket, holding the glass and taking a sip, with a building and window in the background, suggesting an outdoor residential area. Meanwhile, the audio features a male speaker with a neutral tone, reflecting on the moment as \"blessed by an eski moment,\" implying a sense of significance or reverence tied to the experience. The video wraps up with the person continuing to drink from the glass, emphasizing the refreshing nature of the beverage in the cold setting, while the audio captures the individual recounting this meaningful event with a calm demeanor."
+ },
+ {
+ "video_id": "PI-b5af0737-d7dd-4c8f-904f-deb79201e240",
+ "path": "/workspace/benchmark/video/PI-b5af0737-d7dd-4c8f-904f-deb79201e240.mp4",
+ "answer": "The video presents an intimate scene of a man seated at a piano, set against a backdrop of a brick wall that adds a rustic charm to the environment. Dressed in a light gray t-shirt, his tattoos subtly visible on his arms, he embodies a sense of personal artistry. The warm lighting envelops the space, creating a cozy atmosphere that could be a home studio or a small, inviting venue. As the video unfolds, the man is deeply engrossed in playing the piano, his hands gliding over the keys with a natural grace. Soon, he begins to sing, his voice carrying a contemplative and emotional tone that resonates with themes of jealousy, love, and spirituality. The audio delves into the singer's inner turmoil and longing, expressing a profound sense of personal conflict and a heartfelt plea for divine intervention. The camera maintains a consistent angle, capturing him from the side and slightly above, which accentuates his connection to the piano and the music. This perspective, combined with the evocative audio, crafts an overall mood that is both reflective and soulful, underscoring the man's deep immersion in the song he performs. The synergy of visual and auditory elements creates a compelling narrative of emotional depth and artistic expression."
+ },
+ {
+ "video_id": "PI-0775adf3-7692-4488-a13c-1dfc4456cc22",
+ "path": "/workspace/benchmark/video/PI-0775adf3-7692-4488-a13c-1dfc4456cc22.mp4",
+ "answer": "The video captures a heartwarming moment in a bathroom, featuring a playful interaction between a man and a young child, who are revealed to be an uncle and his young nephew through the on-screen text \"POV: Me as an uncle.\" The man, dressed in a navy blue tracksuit with a white logo on the chest, holds a spray bottle with a pink nozzle, while the child, wearing a dark-colored shirt with a Batman logo, stands in front of a sink. The child moves around energetically, and the uncle watches with a warm and playful demeanor, occasionally spraying something as part of their game. The child reaches out towards the uncle, who responds with affectionate and playful gestures, creating a warm and engaging atmosphere. The interaction continues with the uncle making animated gestures and examining the child's arm and hair, further highlighting their close bond and joyful interaction. Complementing the visuals, the audio features human voices, including the uncle's warm and playful voice and the child's energetic and excited tone, capturing the affectionate and playful dynamic between them. The background, which includes a shower with a glass door and a wooden door, adds to the bathroom setting, enhancing the overall charm of the scene."
+ },
+ {
+ "video_id": "PI-cc5157f0-15d7-4f88-a945-bc9e9da400da",
+ "path": "/workspace/benchmark/video/PI-cc5157f0-15d7-4f88-a945-bc9e9da400da.mp4",
+ "answer": "The video captures an intense basketball game, where a player in a purple No. 23 jersey showcases his skills on the court. Amidst the energetic atmosphere, the player makes a decisive pass, assisting his teammate in scoring and eliciting enthusiastic cheers from the audience. Following the successful shot, he raises three fingers in celebration, symbolizing a well-executed three-pointer, and shares high-fives with teammates. The scene then shifts to the audience, highlighting a particularly excited spectator in a yellow coat who waves his arms in delight. Other audience members join in the applause, adding to the vibrant ambiance. Reflecting on his earlier move, the No. 23 player looks at his hand before exchanging a high-five and words with a teammate in a No. 28 jersey, demonstrating teamwork. Meanwhile, in a casual discussion, one speaker with a neutral tone briefly mentions something unclear, while another, more enthusiastic speaker references basketball player Stephen Curry, expressing excitement about making three-point shots himself. This blend of on-court action and off-court enthusiasm vividly portrays the excitement of the game and the shared passion for basketball."
+ },
+ {
+ "video_id": "PI-1f67fd8a-eb83-4623-9003-665f589919a9",
+ "path": "/workspace/benchmark/video/PI-1f67fd8a-eb83-4623-9003-665f589919a9.mp4",
+ "answer": "In the video, a man wearing a patterned sleeveless shirt stands beside a car outdoors at night. The background includes buildings and palm trees, with cars parked on both sides of the street and illuminated buildings behind them. The text \"Your feminine way of speaking\" is overlaid on the video, indicating that the focus is on this person's manner of speaking. The man has long curly hair and wears an exquisite necklace. The camera captures the person and the environment from different angles, emphasizing his appearance and the text overlay. The audio is characterized by a single voice, most likely female judging from the intonation, expressing frustration or disappointment. The speaker says, \"Why are you urging me to go out? I wanted to give you a hug, but you didn't give me a sweet hug,\" conveying a sense of haste and a missed warm gesture. The video combines the visual elements of the nighttime setting with the emotional tone of the audio to create a narrative centered on communication and emotional connection."
+ },
+ {
+ "video_id": "PI-c7e95e63-5dc9-4578-9baa-37092cbe0abb",
+ "path": "/workspace/benchmark/video/PI-c7e95e63-5dc9-4578-9baa-37092cbe0abb.mp4",
+ "answer": "The video and audio together create a humorous and static narrative. The audio features a conversation in which speakers declare their dislike of popular sodas, such as Coke, Root Beer, Sprite, and Dr Pepper, humorously conveying the reactions that sharing such unpopular preferences might trigger. Meanwhile, the video shows a bald man with a beard, dressed in an orange t-shirt, seated calmly in a kitchen. The kitchen setting includes white cabinets, a refrigerator covered with various magnets and notes, and a window with a decorative curtain featuring a skull design and vibrant colors. The ceiling has a textured surface with a noticeable dark stain. The man is positioned at a wooden table holding a few items, including a mug and small decorative objects, under the warm glow of a ceiling light fixture. Throughout the video, the man remains mostly stationary, looking slightly to his left with minimal movement, while the background remains consistent, emphasizing the calm and static nature of the scene. The combination of the audio and video elements results in a cohesive and amusing portrayal of expressing unpopular soda preferences in a serene kitchen environment."
+ },
+ {
+ "video_id": "PI-a333d929-ad9d-44d6-8867-260eb99b1fd5",
+ "path": "/workspace/benchmark/video/PI-a333d929-ad9d-44d6-8867-260eb99b1fd5.mp4",
+ "answer": "The video presents a minimalist yet compelling visual narrative that complements the audio's theme of personal confession and honesty. It features a man in a dark blue shirt seated against a plain, light-colored wall with a slanted ceiling and a visible light fixture, initially looking directly at the camera. This direct gaze establishes a connection with the viewer, mirroring the reflective tone of the audio where the speaker confesses to faking a New York accent. The scene then transitions to a close-up of another man wearing a black cap and a light-colored shirt, whose slightly averted gaze adds a layer of introspection. The background remains consistent, emphasizing the simplicity of the setting. As the video cuts back to the first man, his continued direct gaze reinforces the theme of self-revelation, while the second man's close-up maintains the visual consistency. The video's focused style effectively highlights the speakers' expressions and the minimalistic environment, enhancing the audio's core message of honesty and personal disclosure."
+ },
+ {
+ "video_id": "PI-f1e04c07-50d5-4a51-adfe-0ce0e64228e6",
+ "path": "/workspace/benchmark/video/PI-f1e04c07-50d5-4a51-adfe-0ce0e64228e6.mp4",
+ "answer": "A man and a woman are standing in a hallway.\nThe woman wears a fitted black dress with clear padding to make her hips and stomach look very big.\nHer makeup looks weird: thick eyeshadow on her eyebrows and dirty marks on her cheeks. Her blonde hair is messy.\nShe's holding a doll and a bag of snacks.\nThe man beside her is dressed all in black: a hoodie with \"cK\" printed on it, shorts, and a baseball cap.\nBoth are mouthing the words to a rap song that is playing, occasionally flailing their arms.\nThe camera doesn't move during the whole video."
+ },
+ {
+ "video_id": "PI-b5dc86fa-665a-4f7e-980f-9973eeea633e",
+ "path": "/workspace/benchmark/video/PI-b5dc86fa-665a-4f7e-980f-9973eeea633e.mp4",
+ "answer": "The video presents a compelling split-screen format that intertwines the process of voice acting with the animated character it brings to life. On one side, we see Knuckles the Echidna from the Sonic the Hedgehog movie series, positioned in a vibrant, neon-lit urban environment, exuding determination and readiness for action. On the other side, a man in a black shirt, wearing headphones, stands in a recording studio with a solid red background, actively voicing the character. His expressive gestures, aided by large white gloves, mirror Knuckles' intensity, creating a vivid connection between actor and character. Meanwhile, the audio features a conversation with a tense and heated atmosphere, where a speaker confidently asserts superiority over a past adversary, reflecting a self-assured and formidable nature. This combination of visual and auditory elements effectively highlights the dynamic process of animation production and the powerful personality of the character being voiced."
+ },
+ {
+ "video_id": "PI-5a057ec8-979e-4159-b5be-ea4fcd1f53dd",
+ "path": "/workspace/benchmark/video/PI-5a057ec8-979e-4159-b5be-ea4fcd1f53dd.mp4",
+ "answer": "There is a close-up shot of a person in the video. She has long black hair and is wearing an exquisite gold necklace and a ring on her finger. The lighting is dim, creating an intimate yet somewhat mysterious atmosphere. She places her hand near her face with her fingers slightly bent, heightening her contemplative mood. The following text appears on the screen: \"Brothers, could you give me a few magnificent songs for me to post on ig?\", meaning that she is asking for song recommendations to share on Instagram. This text remains on screen throughout the video, emphasizing the person's request for music advice. The overall mood of the video is cheerful, focusing on the individual's request and the subtle details of her appearance and surroundings. The dim lighting and close-range framing enhance the sense of intimacy. The video effectively captures an individual's emotional expression and invites the audience to participate by sharing their favorite songs."
+ },
+ {
+ "video_id": "PI-802870d8-1efe-4744-be21-2c9d657aa274",
+ "path": "/workspace/benchmark/video/PI-802870d8-1efe-4744-be21-2c9d657aa274.mp4",
+ "answer": "This video was shot in a festive indoor setting, such as a celebration or party, decorated with red lanterns and ornaments that create a vibrant and cheerful atmosphere. The centerpiece of the event is a vibrant red and gold lion dance costume, which moves vigorously through the room. A person in a red top and black trousers guides the lion dance costume, ensuring its smooth movement and interaction with the audience. The text \"Oh dear, I see!\" appears on the screen, adding a playful and encouraging tone to the scene. In the background, several people sit around a table covered with a white tablecloth; food and drinks appear to be on the table, indicating a social gathering. The onlookers watch and enjoy the lion dance performance in a warm, festive atmosphere. The video captures the vibrant and joyous spirit of the event, highlighting the cultural significance and entertainment value of the lion dance. The phrase \"I got cash in my pocket,\" though not displayed on screen, signals a willingness to participate or contribute and may allude to the tradition of giving red envelopes or tips at such performances, further enriching the cultural context of the celebration."
+ },
+ {
+ "video_id": "PI-53a47b0a-37a0-44ab-885a-f16079d0c42d",
+ "path": "/workspace/benchmark/video/PI-53a47b0a-37a0-44ab-885a-f16079d0c42d.mp4",
+ "answer": "The video features a person with long, wavy purple hair sitting in a cozy, creatively decorated room adorned with colorful posters and string lights along the ceiling. Dressed in a black top, the person holds a fuzzy, pink puppet with large, expressive eyes, a red nose, and sharp teeth, accented with pink and purple details including paws and ears. The puppet's mouth moves as if speaking, and the person manipulates it to display a range of emotions through dynamic movements and expressions, such as opening its mouth wide and moving its head. Meanwhile, the audio clip includes a playful and expressive voice saying, \"I'm not a killer, but don't push me,\" in a tone that is somewhat threatening yet amusing, followed by \"Let's go,\" hinting at readiness to take action. The combination of the puppet's dramatic and humorous dialogues adds an entertaining element to the video, enhancing the overall engaging and lively atmosphere."
+ },
+ {
+ "video_id": "PI-cfe173d2-7562-4d55-b658-8972ec5b707e",
+ "path": "/workspace/benchmark/video/PI-cfe173d2-7562-4d55-b658-8972ec5b707e.mp4",
+ "answer": "The video presents a serene scene of a woman with long, curly hair seated against a white brick wall, exuding a calm and contemplative atmosphere. She is casually dressed in a black short-sleeved top and blue jeans, accessorized with a gold necklace and bracelet, adding a touch of elegance to her relaxed posture. In her left hand, she holds a brown bag labeled \"BROWNIE BRITTLE,\" while her right hand rests on a pink laptop beside her on a gray surface, suggesting she might be in a lounge or study area. The setting is minimalistic, with the neutral backdrop of the white brick wall keeping the focus on her. Complementing this visual tranquility, the audio features a female speaker reflecting on her suggestion to a Hispanic friend to understand Ash Trevino's version of Spanish. The speaker conveys a subtle humorous or critical nuance about Trevino's language use, adding a layer of intellectual or critical engagement to the otherwise calm scene. Together, the video and audio create a cohesive narrative of contemplation, blending a peaceful visual with a thought-provoking audio commentary."
+ },
+ {
+ "video_id": "PI-f1417eef-7711-4e89-a7a4-4871a65e5bce",
+ "path": "/workspace/benchmark/video/PI-f1417eef-7711-4e89-a7a4-4871a65e5bce.mp4",
+ "answer": "The combined audio and video content presents an engaging, lighthearted discussion centered on a playful question about Ariana Grande's (referred to as Arianna in the audio) role in relation to \"The Wizard of Oz\" within a TV series. The audio features a group conversation with mixed-gender participants expressing a range of emotions from happiness and humor to neutrality. They engage in cheerful banter over which \"Wizard of Oz\" character Arianna portrayed in the show \"Victorious,\" with one person humorously suggesting Toto, the puppy, prompting laughter and brief remarks. The atmosphere is casual and relaxed. The video, for its part, opens by posing the question about Ariana Grande's role as the \"Wizard of Oz\" in the TV series \"Victory Song\" (likely a misnamed rendering of \"Victorious\"). It then adds an element of suspense by mentioning the special rules on the set of \"Song of Victory\" (again, likely a misreference to \"Victorious\"), where no external items are allowed, before revealing the same unexpected and humorous answer: Toto. Overall, the combined content creates a short, entertaining narrative with a relaxed and pleasant communicative atmosphere, hinting at the possibly humorous side of Ariana Grande during her time on the show."
+ },
+ {
+ "video_id": "PI-73418974-737e-4b4a-8b4f-7ed320f47366",
+ "path": "/workspace/benchmark/video/PI-73418974-737e-4b4a-8b4f-7ed320f47366.mp4",
+ "answer": "This combined summary presents an intriguing and rebellious narrative, blending elements from both an audio and a video clip. The audio content appears to be an impromptu rap or manifesto, delivered with a tone of provocation and confidence by someone claiming to have hacked into a music media account. Central to this message are the numbers \"65\" and \"67,\" along with a defiant declaration that \"You can take away my things, but you cannot take away my glory.\" Meanwhile, the video complements this narrative with a mysterious and provoking scene: a black man donning black-framed glasses sits indoors with an angry expression, asserting through subtitles that he has invaded the music media account, now claiming it as his own, and issuing a warning to the audience. He repeatedly emphasizes that \"65\" is taking over, while reiterating his unyielding stance that \"you can't take away my glory.\" The number \"67\" features prominently in the video, and there's a clear disdain for \"62, 63 or 65,\" potentially hinting at a coded message, faction, or specific objective. Furthermore, mentions of \"Huron\" and \"highway check off,\" along with \"sitting in the driveway,\" add geographical and operational layers to the event, enhancing the overall sense of intrigue. Together, the audio and video create a tense and enigmatic atmosphere, suggesting a potential power shift in the digital realm or some clandestine operation, with the audio content providing a more rhythmic and personal expression, while the video adds visual and contextual depth to the unfolding drama."
+ },
+ {
+ "video_id": "PI-4dfedbc2-11ad-4e4a-ba20-116cb2bd56ac",
+ "path": "/workspace/benchmark/video/PI-4dfedbc2-11ad-4e4a-ba20-116cb2bd56ac.mp4",
+ "answer": "The audio captures a male voice filled with confusion and frustration as he reacts to a concept from a video he watched, specifically the idea of punching someone with both hands. His tone oscillates between perplexity and exasperation, interspersed with moments of laughter, as he questions the practicality and meaning behind the described action. This sentiment is mirrored in the video, where a person donning a black basketball jersey with blue trim and the number '0' stands indoors, surrounded by a whiteboard, a string of red LED lights along the ceiling, and various posters and certificates on the wall. The individual's hand gestures and body language suggest they are speaking or reacting to something, much like the male in the audio. At one moment, they raise their fist to their ear, embodying confusion or disbelief, and later, they energetically punch the air with both fists, making exaggerated movements that seem to demonstrate or mock the very concept that perplexed the male in the audio. The scene in the video encapsulates a blend of confusion, frustration, and a humorous reaction, with the person expressing their thoughts through vivid physical gestures and expressions, perfectly complementing the audio's narrative of a baffled and amused response to an unusual concept."
+ },
+ {
+ "video_id": "PI-543d7359-3b26-4ffe-b407-16cabbca8d38",
+ "path": "/workspace/benchmark/video/PI-543d7359-3b26-4ffe-b407-16cabbca8d38.mp4",
+ "answer": "The video unfolds in a small, well-lit room. The grey walls and yellow door create a tense, closed-in atmosphere. A group of people, including several police officers in black uniforms, surround a table. A person in a red coat sits at the table, covering their mouth with their hand and looking deeply distressed. The text \"IM SCREAMINGGGGGG\" displayed at the top of the screen, along with multiple crying emojis, emphasizes the emotional intensity of the person in red. At the bottom of the video, the text \"I will also leave the room\" with more crying emojis suggests a shared sense of discomfort among the participants. The camera angle shifts slightly, bringing the police officers closer into view. One of them holds a notebook and pen, possibly taking notes, indicating that a serious discussion or interrogation is underway. The atmosphere remains tense between the distressed person and the serious demeanor of the police. At the end of the video, another police officer stands near the yellow door and appears to be opening it, suggesting an exit or a shift to another scene. The video's theme revolves around a high-pressure interaction in a confined space, with the participants' emotional states as the central focus."
+ },
+ {
+ "video_id": "PI-e8d9c6a0-39c4-4bb6-aedc-7b1ab65d3e37",
+ "path": "/workspace/benchmark/video/PI-e8d9c6a0-39c4-4bb6-aedc-7b1ab65d3e37.mp4",
+ "answer": "The video depicts a serene yet chilly winter night scene, where a person stands outdoors amidst falling snow. Dressed in a black cowboy hat and a patterned jacket adorned with geometric shapes and animal motifs, the individual is set against a backdrop of snow-covered trees and a brightly lit streetlamp. The snowy landscape and visible snowflakes enhance the wintry atmosphere. The person appears to be engaged in conversation or gesturing, with noticeable hand movements, occasionally adjusting their stance and looking around. Meanwhile, the audio features a male speaker with an ambiguous emotional tone, describing a scenario of someone sitting in the passenger seat, hinting at companionship. He then alludes to driving at high speeds, suggesting adventure. The upbeat ambient music complements the dynamic yet cold setting, creating an energetic contrast to the tranquil visuals."
+ },
+ {
+ "video_id": "PI-99e3385f-e2c2-4494-ba33-c42d8319932e",
+ "path": "/workspace/benchmark/video/PI-99e3385f-e2c2-4494-ba33-c42d8319932e.mp4",
+ "answer": "The video captures an intense and skillful moment in a baseball game, where a pitcher in a vibrant pink uniform with the number 20 on the back delivers a pitch to a batter dressed in yellow. The batter swings but misses, and the catcher, also in pink gear, catches the ball. The scene then shifts to a wider view, showing the pitcher moving off the mound and the batter stepping out of the batter's box. The video highlights the pitcher's subsequent actions as he walks towards the back of the mound, adjusting his cap and holding a blue glove. The background features a well-maintained field with green grass, a dirt infield, and a crowd of spectators. The audio complements this with a male speaker expressing a happy and energetic tone, describing the play as impressive and skillful, involving a behind-the-back throw and a double trick play executed with finesse. The upbeat and dynamic ambient music enhances the lively atmosphere, conveying excitement and admiration for the players' exceptional techniques. The video concludes with a close-up of the pitcher continuing his walk towards the first base side, showcasing his pink uniform and turquoise cleats."
+ },
+ {
+ "video_id": "PI-6922d22c-b4d0-472c-a725-090757ca52dc",
+ "path": "/workspace/benchmark/video/PI-6922d22c-b4d0-472c-a725-090757ca52dc.mp4",
+ "answer": "The video features a woman with long, wavy, reddish-brown hair, dressed in a black strapless crop top, a black sequined mini skirt, and black knee-high boots, standing in front of an escalator in a shopping mall. The text 'The mall was closed' appears at the top of the screen, indicating the mall's closure, and as she begins to ascend the escalator, the text changes to 'so I had to make a video,' suggesting she is creating content due to the mall being closed. The background includes various mall signage and decorations, with a prominent 'MAXFUN' sign visible on the right side, and the bright lighting typical of a shopping mall environment. As she continues her ascent, she playfully adjusts her hair and poses, adding a dynamic element to the scene. The video continues with her maintaining these playful poses, and the background remains consistent, featuring the 'MAXFUN' sign and other mall signage, including a sign that reads 'BLACK' with an arrow pointing upwards. As she nears the top of the escalator, the text 'the result is so cool' appears at the bottom of the screen, expressing her satisfaction with the video. Meanwhile, the audio features a conversation between two individuals, a male and a female, with neutral emotional tones, both stating 'The result is so,' accompanied by ambient music with a steady tempo, creating a calm and neutral atmosphere."
+ },
+ {
+ "video_id": "PI-e8e9dc00-6cbc-4e2a-8d70-2aa5eacf014d",
+ "path": "/workspace/benchmark/video/PI-e8e9dc00-6cbc-4e2a-8d70-2aa5eacf014d.mp4",
+ "answer": "This video documents a person's response to an online challenge. The challenge, shown as a user comment, asks the creator to display fidget toys that correspond to a specific set of emojis: lettuce, a bowl of rice, an onion, a laughing face, poop, a toilet, and a clown. The video starts outdoors with the creator scooping fresh snow into a blue tray. It then moves inside to a room featuring a large, white plastic storage unit with many transparent drawers. These drawers are filled with a vast and colorful assortment of fidget toys. The creator proceeds to search through the drawers, successfully finding a toy for each emoji mentioned. They present a lettuce-shaped squishy, a rice bowl squishy, a purple onion squishy, a yellow laughing emoji toy, a brown poop-shaped toy, a glow-in-the-dark toilet-shaped toy, and finally, a frightening clown figurine, thereby completing the challenge."
+ },
+ {
+ "video_id": "PI-b39e040a-c1f9-44fc-b689-a885702681e6",
+ "path": "/workspace/benchmark/video/PI-b39e040a-c1f9-44fc-b689-a885702681e6.mp4",
+ "answer": "The video and its accompanying audio together paint a vivid picture of a warm and intimate moment inside a car. The video shows a person, clad in a black jacket and a black beanie with a pom-pom, seated in the driver's seat and cradling an orange cat close to their chest. One hand gently pets the cat while the other rests on the steering wheel. The car's interior, including the back seat and a suitcase on the floor behind the driver's seat, is visible. Meanwhile, the audio captures the speaker's amused and affectionate tone as she interacts with her feline companion. She playfully expresses surprise at the cat's paw movements, suggesting it has thumbs, which adds a layer of humor to the scene. This interaction highlights the deep bond and affection shared between the person and their cat, creating a heartwarming atmosphere within the vehicle."
+ },
+ {
+ "video_id": "PI-b39cbde6-2892-46c0-b0c7-c7793d450327",
+ "path": "/workspace/benchmark/video/PI-b39cbde6-2892-46c0-b0c7-c7793d450327.mp4",
+ "answer": "The video features a person with long blonde hair sitting in a charmingly decorated room, complete with string lights and plush toys, including a unicorn and a pink fluffy toy on a pink chair. Dressed in a light pink top and a delicate necklace, the person presents a virtual dresser interface with a pink, stylized theme, showcasing various outfits and accessories such as a pink dress with a bow, a pink handbag, black high-heeled boots, a pearl necklace, and a white bunny outfit. As the video progresses, different virtual outfits are displayed on a headless mannequin in a pink-themed room, including a beige school uniform with a red bow, black shorts with pink accents, a black crop top, a white dress, a black outfit with a leather jacket, and an orange mermaid tail with a matching top. Meanwhile, a cheerful female speaker enthusiastically discusses various fashion themes and characters, including dress codes and updates leading up to Valentine's Day. She mentions names like Dolly, Megan, and others, listing a variety of fashion personas and styles such as mermaid tech, Moto Princess, and Ashley Bunny, while expressing excitement and positivity. The ambient music is upbeat and lively, creating a vibrant and engaging atmosphere that complements the stylish and pink aesthetic of the video."
+ },
+ {
+ "video_id": "PI-eb9cc983-0f1b-4e60-a35a-990889dc1a1f",
+ "path": "/workspace/benchmark/video/PI-eb9cc983-0f1b-4e60-a35a-990889dc1a1f.mp4",
+ "answer": "This short, comedic video features a young man lip-syncing to a dialogue. Standing in a room with light green walls, the man acts out a scene in which a character threatens to call someone's mother, while another part of the audio track features high-pitched voices shouting in protest. The man then lip-syncs the character's line, 'It works every time,' with a satisfied smile. The scene takes a turn when a female voice asks, 'Call me for what?' The man responds to her with even greater delight and tenderness."
+ },
+ {
+ "video_id": "PI-b26b6112-1c48-4f7a-9d2b-586d90618478",
+ "path": "/workspace/benchmark/video/PI-b26b6112-1c48-4f7a-9d2b-586d90618478.mp4",
+ "answer": "In a scene combining audio and visual elements, a woman's words unfold against a backdrop that shifts from a calm, neutral auditory environment to a vibrant, jewel-filled visual one. The audio features a female speaker expressing frustration and slight confusion as she struggles to find something, mentioning that she is wearing multiple items and sounding notably troubled. The ambient music in the audio is steady and neutral, creating a calm atmosphere. The video, meanwhile, is set in a pink jewelry store decorated with various items, including earrings and eye patches. A woman dressed in white, with long blond hair, pulls at the rubber band on her arm and chews something; then, upon finding the earrings, a look of surprise and delight crosses her face. Her love for the earrings is palpable as she holds them, and the text \"Me in 2003\" at the top of the video adds temporal context to the scene. Together, the audio and video create a distinctive narrative in which a frustrated search for something unseen is contrasted with a woman's joyful discovery in a jewelry store, all set against calming music and a visually stimulating environment."
+ },
+ {
+ "video_id": "PI-aae468a3-bf3c-4e70-ad0d-3dd21eba309e",
+ "path": "/workspace/benchmark/video/PI-aae468a3-bf3c-4e70-ad0d-3dd21eba309e.mp4",
+ "answer": "The video begins from a first-person perspective inside the vehicle, focusing on the central food item, which is a baked potato presented in a disposable container. The potato is paired with pork, a red barbecue-style sauce, and a dollop of white sour cream. Melted cheese is also visible. A black plastic fork is placed on the other side of the disposable container. The camera provides a close-up view of the food, highlighting its various details. The visual presentation emphasizes the substance of the dish and its rich toppings, providing a clear understanding of the food's texture and color. The background consists of the car's interior, including part of the dashboard and steering wheel, indicating the environment in which the meal is being eaten."
+ },
+ {
+ "video_id": "PI-976899a2-94bd-48e8-bf96-59513fc5dec4",
+ "path": "/workspace/benchmark/video/PI-976899a2-94bd-48e8-bf96-59513fc5dec4.mp4",
+ "answer": "This video clip presents two young women performing a synchronized dance routine in an indoor setting, likely a bedroom or similar private space. The background features light gray walls, a wooden floor, a white door, and an open white closet containing various garments. Both individuals are dressed in relaxed, casual attire; one wears a gray long-sleeved top and dark blue sweatpants, while the other is in a large navy blue sweatshirt and light gray sweatpants. Their movements involve raising their arms, pointing, and swaying in coordination, with visible expressions of cheerfulness. The camera remains static, providing a clear and steady view of their performance. An overlaid text graphic states, 'us because our friendship didn't end when we didn't get any classes together,' indicating the thematic content of the video. The audio accompaniment is an upbeat disco-funk song with a male vocalist and a strong rhythmic section, contributing to the energetic and positive mood of the scene. The overall atmosphere is lighthearted and positive, highlighting the fun and connection between the two friends."
+ },
+ {
+ "video_id": "PI-1c6ebe48-8d3b-482f-8ddf-3017c9f30f24",
+ "path": "/workspace/benchmark/video/PI-1c6ebe48-8d3b-482f-8ddf-3017c9f30f24.mp4",
+ "answer": "The video depicts a man wearing a black shirt, large headphones, and a gold chain, sitting at a table with two prominent buttons—one green and one red—against a backdrop of shelves displaying various items, including YouTube play button awards and a refrigerator filled with drinks. As user comments requesting the song \"Money\" by Nigerian artist Teni appear on the screen, the man listens to the music with evident passion, mirroring the energetic and passionate tone of the audio track. He reacts enthusiastically, nodding, raising his hands, clapping, and making various expressive gestures. Text overlays such as \"listen,\" \"HEEEEEEEY!\", \"MAJED FEEL GOOD MODE ACTIVATED,\" and \"BEST VIBES\" further emphasize his positive and joyful response to the music. The combination of the man's animated reactions and the passionate music creates a vibrant and engaging atmosphere, suggesting that he is fully immersed in the listening experience. The audio, featuring a passionate music tune and a man's equally passionate voice, perfectly complements the visuals, enhancing the overall energy and enthusiasm of the scene."
+ },
+ {
+ "video_id": "PI-59cdf126-f29b-4e2d-a1a8-de5f346a107d",
+ "path": "/workspace/benchmark/video/PI-59cdf126-f29b-4e2d-a1a8-de5f346a107d.mp4",
+ "answer": "A young woman with dark, braided hair and prominent makeup performs a dance routine in a well-lit room. She is dressed in a unique costume consisting of a green mesh crop top with a crisscrossing vest and loose, patterned orange pants secured by a wide belt. The room appears to be a backstage or break area, furnished with a brown couch, a small bar cart, a microwave, and a refrigerator. Pink LED strip lights run along the ceiling, adding to the ambiance. At the beginning of the video, another person in a costume can be seen sitting on the couch. The woman dances energetically to an upbeat dancehall track, smiling and interacting with the camera. Her movements are fluid and match the rhythm of the music, creating a lively and engaging performance."
+ },
+ {
+ "video_id": "PI-4e64b17e-e2c5-44b1-a7d8-66be076edc1f",
+ "path": "/workspace/benchmark/video/PI-4e64b17e-e2c5-44b1-a7d8-66be076edc1f.mp4",
+ "answer": "The video is set in a gym and follows a person dressed in black athletic wear, capturing the aftermath of an intense workout. The scene opens with the person walking away from the camera towards the back of the gym, which is equipped with various exercise machines, weights, and benches. The person appears exhausted and collapses onto an inclined bench, lying face down with arms outstretched. The camera remains stationary, emphasizing the person's fatigue. The word \"aftermath\" is overlaid, highlighting the theme of post-workout recovery. As the video progresses, the person begins to recover, slowly sitting up, adjusting their hair, and catching their breath. The gym environment remains consistent, with additional equipment and a digital display visible in the background. Eventually, the person stands up, stretches, and appears fully recovered, standing next to the bench with hands on hips. Meanwhile, the audio features a mix of background music and a male voice, possibly a meme or humorous commentary, saying things like \"let him cook\" and \"my God,\" which adds a lighthearted tone to the scene, suggesting a humorous take on the person's post-workout state. The video effectively captures the transition from exhaustion to recovery, underscoring the relatable experience of post-workout fatigue and regaining energy."
+ },
+ {
+ "video_id": "PI-c91eb9ca-752a-426f-b35b-d00f05df3513",
+ "path": "/workspace/benchmark/video/PI-c91eb9ca-752a-426f-b35b-d00f05df3513.mp4",
+ "answer": "In this short video, a young woman uses a social media filter to predict when she might get married. The filter, displayed on the left, is a vertical gauge with years ascending from 2026 to 'NEVER' at the top, accompanied by an icon of a diamond ring. The video captures her real-time reaction as a blue indicator bar climbs up the scale. Initially, she watches with a calm, neutral expression. However, as the indicator passes the earlier years and continues upward, her demeanor changes to one of mock horror and disappointment. She raises her hand to cover her mouth and eyes in a dramatic gesture, feigning distress at the filter's unfavorable prediction."
+ },
+ {
+ "video_id": "PI-99fa1a5e-9984-4464-b0d7-bb2d07ba3508",
+ "path": "/workspace/benchmark/video/PI-99fa1a5e-9984-4464-b0d7-bb2d07ba3508.mp4",
+ "answer": "In the video, a person with thick, curly reddish-brown hair, wearing a white vest, is sitting in a comfortable indoor environment. The background is a light-colored wall decorated with a framed painting, and the grey part of a sofa is visible. The person appears to be having a conversation or delivering a monologue, making various gestures, such as raising a hand and moving it in front of the face, or placing a hand on the chest. The audio contains lyrics such as \"I thought you had to be a real man. Seize it. If I give you a chance, take it.\" These lyrics imply a reflective or emotional theme, which might relate to interpersonal relationships or personal challenges. The person's expressive gestures and the accompanying lyrics create a sense of intimacy and personal expression. The overall atmosphere of the video is casual and relaxed, with the person sharing their own thoughts or stories. The combination of visual and auditory elements conveys themes of self-expression and emotional vulnerability."
+ },
+ {
+ "video_id": "PI-4b52cf10-8af7-4c55-add8-eb9f3561bc41",
+ "path": "/workspace/benchmark/video/PI-4b52cf10-8af7-4c55-add8-eb9f3561bc41.mp4",
+ "answer": "This video depicts a confrontation between two men in a subway carriage. A man in a black leather jacket and a white shirt points at the other person and says, \"What the hell are you looking at?\" A man in a dark suit with a white shirt and a dark tie replies in a calm tone, \"Didn't you hear what I said?\" The tension escalates when the man in the leather jacket says, \"Get out of my car,\" and the man in the suit asks, \"Why?\" The man in the leather jacket explains, \"I didn't know that was your car. I had a great time here.\" The standoff continues. The man in the suit approaches and says, \"Get out of my car!\" The man in the leather jacket looks very surprised and exclaims, \"Damn it! Mary Serra, your wife?\" The camera then switches to a close-up of the man in the suit, who says, \"Now... they are dividing a 'Big Mac' in the kitchen. Then, in your own bed, he will scold her... right on your specially made green bedspread.\" He suggests to the man in the leather jacket, \"Do yourself a favor... Put the knife where it should be!\" At the end of the video, the man in the leather jacket is shocked. The man in the suit says calmly, \"There is a train coming from the other direction. You will catch it. You will thank me tomorrow morning.\" The man in the leather jacket hesitates, and the man in the suit comforts him, saying, \"You're wrong, man... Oh, I'm right. You will see.\" Meanwhile, in the audio, a man expresses anger and frustration, and the music's steady rhythm creates a slightly tense atmosphere. The video captures the intensity and emotional weight of the confrontation."
+ },
+ {
+ "video_id": "PI-0f76fd6e-0e1c-4dc6-9fd0-4e2c9f4c5200",
+ "path": "/workspace/benchmark/video/PI-0f76fd6e-0e1c-4dc6-9fd0-4e2c9f4c5200.mp4",
+ "answer": "The audio track captures the electric atmosphere of a sports competition, with the audience's continuous cheers creating a palpable sense of tension and excitement, as if the listener were right there in the stadium. The accompanying video vividly depicts an indoor basketball game, where the lively crowd adds to the vibrant energy. The protagonist, a man in a white jersey, showcases his basketball skills with several remarkable moments of shooting, layups, and dribbling, all captured on others' mobile phones. Each time he sinks a basket, the audience erupts into thunderous cheers, amplifying the excitement. After scoring his first two goals, he even playfully taunts the camera, adding a touch of confidence and flair to his performance. The video effectively encapsulates the thrilling moments of the man in white's basketball game, while the roaring cheers of the audience and his own self-assured demeanor further underscore the intense and exhilarating atmosphere of the match."
+ },
+ {
+ "video_id": "PI-53010bab-7fea-49e1-9eaa-5771598d50e2",
+ "path": "/workspace/benchmark/video/PI-53010bab-7fea-49e1-9eaa-5771598d50e2.mp4",
+ "answer": "The audio features a female speaker with a curious and slightly puzzled tone as she observes a tabby cat interacting with a pet water fountain, expressing uncertainty about its functionality while also explaining how it works by pointing out where the cat should press to get water. Her amusement and curiosity are palpable as she watches the cat paw at the water and drink from it. The video perfectly complements this narrative, showing the tabby cat on a colorful, patterned rug interacting with a white and gray pet water fountain labeled 'els pet'. The text overlay at the top of the frames initially reads 'i don't know if this is working... ' with a crying emoji, mirroring the speaker's uncertainty. The cat is seen dipping its paw into the water and then licking it, while a hand, presumably the speaker's, points at different areas of the fountain, possibly elucidating its features. The text 'hmm right here right' appears, further emphasizing the confusion or curiosity present in both the audio and video. The video concludes with the text 'no' and 'ha ha', suggesting a moment of light-heartedness as the cat continues to explore the water with its paw, encapsulating the speaker's amusement and the cat's endearing interaction with the fountain."
+ },
+ {
+ "video_id": "PI-ea0a61b4-ea06-4d7a-b8c6-908e8780279c",
+ "path": "/workspace/benchmark/video/PI-ea0a61b4-ea06-4d7a-b8c6-908e8780279c.mp4",
+ "answer": "The video presents a sequence from a bowling activity within a modern, neon-lit bowling alley. A young woman, central to the scene, is first observed holding a distinct red bowling ball adorned with a subtle rose pattern. She then proceeds to move towards the lane, executing a deliberate throw. The bowling ball is shown rolling down the lane, which is illuminated by vibrant blue and purple neon lights, complemented by red accents. In the background, multiple digital screens display various colorful animations and imagery, contributing to the lively ambiance of the venue. The camera work is dynamic, tracking the woman's actions and the ball's movement, providing a close-up perspective of the activity. The overall visual tone is energetic, reflecting the recreational nature of the environment. Accompanying the visuals is an upbeat hip-hop or rap track, which provides a driving rhythm to the scene. Ambient sounds characteristic of a bowling alley, such as the ball rolling and the distant impact of pins, are also present, grounding the experience in reality. The combination of dynamic visuals, a lively soundscape, and the depiction of leisure activity creates an engaging and representative portrayal of a contemporary bowling outing."
+ },
+ {
+ "video_id": "PI-6bce63a8-3bcd-4128-8cee-283740e9a904",
+ "path": "/workspace/benchmark/video/PI-6bce63a8-3bcd-4128-8cee-283740e9a904.mp4",
+ "answer": "The video features a caring and attentive male speaker who describes his efforts to take care of his girlfriend during her period. It begins with him standing beside a bed where his girlfriend lies, covered with a white polka-dotted blanket. The text 'Taking care of my girlfriend on her period' sets the theme. He places a wooden tray over the bed and hands her a bowl to eat from. They then pick up light green sweatpants and a pink garment before walking away, holding hands, and entering a bathroom where he helps her put on a pink headscarf. The scene continues with him adjusting the headscarf and her wrapping a light pink towel around her body, followed by her taking out a HoneyPot pad for comfort. Fully dressed in a green sweatshirt and pink pajama pants, she stands in the kitchen as he cooks salmon. The cooked salmon is brought to her on a gray couch, where she eats while he cuddles and rubs her stomach. The video wraps up with both lying on the couch, appearing comfortable and relaxed, with the kitchen's well-lit background of white cabinets and a stainless steel refrigerator. The core message is the thoughtful and supportive actions he takes to ensure her comfort and well-being."
+ },
+ {
+ "video_id": "PI-e7c15057-e643-4925-b79d-b372749b31f1",
+ "path": "/workspace/benchmark/video/PI-e7c15057-e643-4925-b79d-b372749b31f1.mp4",
+ "answer": "The video and audio together weave a narrative of romantic inquiry and heartfelt expression. It begins with a close-up of a person with blonde hair tied back, wearing a black long-sleeve shirt over a beige tank top, covering their mouth with a ringed finger, set against a white brick wall and wooden structure. Overlaid text poses the poignant question, \"how do you know he's the one?\" which is echoed in the accompanying audio where a female speaker, her voice tinged with emotion, asks the same. The scene transitions to a kitchen floor adorned with red rose petals and dog food pellets spelling \"BE OUR,\" where a white dog on a red leash stands beside a white cabinet, gazing at the camera. A male figure, clad in a white t-shirt and light-colored pants, holding a bouquet of red roses, stands near a door with the dog. The audio reveals a calm, gentle male voice, presumably responding to the earlier question, as part of a dialogue that suggests a romantic proposal or declaration of love. The video concludes with the man and dog in the same spot, the dog sniffing the floor, encapsulating a moment of tender anticipation."
+ },
+ {
+ "video_id": "PI-d17fb048-7fc5-4c26-b0fb-c63b8e6f88c2",
+ "path": "/workspace/benchmark/video/PI-d17fb048-7fc5-4c26-b0fb-c63b8e6f88c2.mp4",
+ "answer": "This video captures a lighthearted, behind-the-scenes moment featuring a woman in a grand, classically designed hall. In the center of the frame is a large, white sign with the word \"LOVESHELL\" and a playful illustration of a lobster in a tuxedo. Initially, only the woman's hands are seen, but she soon pops up from behind the sign with a smile. She is wearing a simple black top, and her long, dark hair is styled in loose waves. As pop music plays, she sings along, gesturing and interacting with the camera in a candid manner. The background setting includes prominent architectural columns, arched doorways, and a vase of yellow and white flowers on a table. The camera remains stationary throughout, focusing on her playful performance. "
+ },
+ {
+ "video_id": "PI-c0b1e029-7afc-479d-b2b1-8543aa0144e9",
+ "path": "/workspace/benchmark/video/PI-c0b1e029-7afc-479d-b2b1-8543aa0144e9.mp4",
+ "answer": "In a hospital examination room, the video shows a patient’s legs resting on an examination bed, with a key visibly embedded near the sole of the right foot. Medical staff are present, including a man wearing teal scrubs and a woman dressed in a black shirt. Voices of the staff can be heard expressing surprise and discussing how such a rare injury might have occurred. The on-screen text overlay, “when u are the unluckiest person on the planet,” corresponds to the situation, emphasizing the unusual nature of the incident. The environment is typical of a medical facility, with walls featuring medical equipment such as an AED. The video focuses closely on the injury details and the immediate reactions of the staff. The overall atmosphere conveys a calm assessment of an unexpected event, with personnel observing the situation quietly."
+ },
+ {
+ "video_id": "PI-474fba7b-a332-47b1-aef1-21c29d3b4a16",
+ "path": "/workspace/benchmark/video/PI-474fba7b-a332-47b1-aef1-21c29d3b4a16.mp4",
+ "answer": "The video portrays a formal outdoor event, likely a ceremonial occasion marked by the presence of military personnel and an American flag, suggesting an official or governmental setting. The scene unfolds with a person in a dark coat and a black hat with a white band standing before a Marine in a dark uniform holding the flag, while a white building and a tree form the backdrop. As the video progresses, another individual in a dark coat and purple tie joins, and the Marine adjusts his grip on the flag, with the person in the hat raising their hand, perhaps to adjust their hat or pay respects. The sequence concludes with the person in the hat looking towards the Marine, who now holds the flag more securely, maintaining an overall atmosphere of formality and respect. Meanwhile, the audio, a background music track from a song performed by a male singer with a reflective and somewhat melancholic tone, complements the video's narrative by evoking emotions of nostalgia and respect through lyrics like \"There goes my hero, watch him as he go,\" which revolve around the theme of a hero departing. The audio's poignant and contemplative mood enhances the video's portrayal of honoring someone's departure, creating a harmonious blend of visual and auditory elements."
+ },
+ {
+ "video_id": "PI-884a491b-93f3-4264-83c6-34a1a5dd3be5",
+ "path": "/workspace/benchmark/video/PI-884a491b-93f3-4264-83c6-34a1a5dd3be5.mp4",
+ "answer": "In the video, a person with long blonde hair in a black T-shirt is sitting indoors. The high ceiling and exposed structural elements suggest a spacious room or studio environment. The person has several tattoos on their arms; one looks like a trumpet and another contains words, adding a personal style to their appearance. Throughout the video, their expressive gestures and constantly changing facial expressions show that they are engaged in a monologue or conversation. She often touches her face, tidies her hair, and gives a thumbs-up, appearing energetic and passionate. In the audio, the person talks about listening to many Lady Gaga songs, especially her older ones. She expresses appreciation for songs such as \"Adre\", \"Poker Face\", and \"Born This Way\", emphasizing the quality and influence of these songs. Her reference to \"Old Lady Gaga\" implies nostalgia or a retrospective appreciation of the artist's early works. The theme of the video revolves around an individual's personal enjoyment and appreciation of Lady Gaga's music, shared in a relaxed and informal environment. The combination of visual and audio elements creates an engaging and relatable narrative, focusing on reliving the joy of beloved music."
+ },
+ {
+ "video_id": "PI-52a24162-ed2d-4eb8-a3d8-6fdc7ad24498",
+ "path": "/workspace/benchmark/video/PI-52a24162-ed2d-4eb8-a3d8-6fdc7ad24498.mp4",
+ "answer": "The video captures a heartwarming moment between a large black dog and a person in green work uniform and colorful white shoes. The dog, wearing a blue collar, stands on its hind legs and leans against the person, wrapping its front paws around them in a warm embrace. The person reciprocates the hug, holding the dog tightly, while the dog's tail wags vigorously, showing its excitement and affection. This scene continues, highlighting the strong bond between the dog and the person, possibly during a reunion or farewell. Meanwhile, a male speaker in the audio reflects on the challenges of facing intense situations, which could relate to the emotional dynamics between humans and their pets. The ambient music, with its steady tempo and neutral atmosphere, complements the introspective tone of the speaker and the emotional depth of the scene, underscoring the profound connection and mutual affection shared between the dog and its owner."
+ },
+ {
+ "video_id": "PI-1bf3fd7a-e0b1-49ae-b939-d071299bdce4",
+ "path": "/workspace/benchmark/video/PI-1bf3fd7a-e0b1-49ae-b939-d071299bdce4.mp4",
+ "answer": "The video features a DJ in a vibrant, neon-lit recording studio, dressed in a black shirt and a gold chain with \"MAJED\" inscribed on it. The background includes a shelf with various items and a sign that reads \"ounz.\" The DJ receives a request through an on-screen comment saying, \"Brother, please check NINAO-GIMS.\" His reactions of surprise and entertainment are captured with text overlays such as \"EGH\" and \"EGHH.\" As he experiences what he calls \"instant head movement activation,\" his love for the song becomes evident. Continued listening prompts another surprised reaction, highlighted by the text, \"Wait... Is this a bit like a slap in the face?\" The DJ becomes completely immersed in the music, emphasized by the words \"INTO IT\" appearing on the screen. Complementing the visuals, the audio features the DJ's vocals and DJ background music, creating a lively interaction with the audience and showcasing his eagerness to meet their demands. This combination of visual reactions and auditory elements builds an engaging and dynamic narrative of the DJ's experience with the requested song, fostering an exciting and pleasant atmosphere."
+ },
+ {
+ "video_id": "PI-2c3edee8-436a-4217-b95d-a8ad22511613",
+ "path": "/workspace/benchmark/video/PI-2c3edee8-436a-4217-b95d-a8ad22511613.mp4",
+ "answer": "The audio features a piece of pleasant background music that soothes both body and mind, creating a cheerful and festive atmosphere that perfectly complements the visual narrative. In the video, a golden retriever sits gracefully on a red velvet armchair in a cozy living room, its appearance enhanced by two red ribbons tied to its ears and a red heart-shaped mark on its face, adding a touch of cuteness and festivity. The dog, looking extremely happy and content, with its mouth slightly open and tongue sticking out, glances around and gently moves its head, embodying pure joy. The background, a beige wall adorned with a group of people's photos, suggests a private and warm environment, further emphasizing the intimate and loving atmosphere. The warm lighting enhances the alluring charm of the scene, making it even more inviting. Together, the audio and video create a harmonious blend of cheerfulness and warmth, capturing moments of pure joy and emotion that resonate with viewers."
+ },
+ {
+ "video_id": "PI-c26277e1-9629-4239-8ed5-353165c2f64f",
+ "path": "/workspace/benchmark/video/PI-c26277e1-9629-4239-8ed5-353165c2f64f.mp4",
+ "answer": "The video is set in a studio and features an intense discussion within a CNN panel. The main figures are a male host in a dark suit and a red tie, and a female guest in a black top. The host is clearly very agitated and repeatedly uses phrases like \"This is this\", while the text on the screen highlights the controversy: \"The CNN team erupted in tributes to the Nazis.\" The discussion focuses on the anger caused by a salute reminiscent of Nazi gestures, and the host questions the normality of this behavior, saying, \"Do you think this is very normal... If you think this is a normal way of greeting, do it on TV right now.\" The female guest seems very engaged and responds to the host's provocation. The lower third of the screen shows news tickers mentioning \"Musk told the Germans...\" and \"Between the Congolese army and the rebels...\", indicating other ongoing news reports. The core theme revolves around the group's strong reaction to the use of a controversial salute, questioning its appropriateness in modern society."
+ },
+ {
+ "video_id": "PI-34ca52a4-eddb-4986-aad5-d4a27bc5be47",
+ "path": "/workspace/benchmark/video/PI-34ca52a4-eddb-4986-aad5-d4a27bc5be47.mp4",
+ "answer": "The video presents a front-facing view of a woman with braided hair styled into two buns and bangs. She is wearing an off-the-shoulder black top with a structured, corset-like midsection. Her makeup includes glitter application around her eyes, dark eyeliner, and light pink lipstick. She has long nails with a distinct black and white or silver abstract design. Throughout the clip, she maintains eye contact with the camera, exhibiting subtle head movements, shifts in facial expression, and hand gestures. In one instance, she places a hand over her chest, showcasing her detailed nail art. The background is a simple, light-colored interior, possibly a room or studio, with a wall-mounted light fixture visible to the left. Upbeat pop music plays continuously in the background, providing a consistent sonic backdrop to the visual presentation. The lighting appears soft, illuminating the subject evenly. The video captures a series of brief, posed moments, highlighting her appearance and stylistic choices."
+ },
+ {
+ "video_id": "PI-e3633a23-9c26-434c-a1db-405a37b69eb9",
+ "path": "/workspace/benchmark/video/PI-e3633a23-9c26-434c-a1db-405a37b69eb9.mp4",
+ "answer": "This video clip captures a segment from a sports broadcast, where three presenters, two men and one woman, are analyzing football league standings displayed on a large screen in a blue and purple lit studio. The screen presents rankings for the 'ROUND OF 16' and 'KNOCKOUT ROUND PLAYOFFS' after Matchday 6, showing team names, goal differences, and points. The male presenter on the right gestures towards the screen, discussing various French teams like Lille and Monaco. He then humorously emphasizes 'Brest' as his preferred team, prompting a lighthearted reaction and laughter from his colleagues. The camera employs slight pans and cuts to follow the presenters as they move and interact, with text overlays appearing on screen to highlight key spoken words. The overall atmosphere is professional yet infused with a casual and engaging dynamic among the hosts."
+ },
+ {
+ "video_id": "PI-c386974a-78a9-45ec-8931-a9ebac3a260c",
+ "path": "/workspace/benchmark/video/PI-c386974a-78a9-45ec-8931-a9ebac3a260c.mp4",
+ "answer": "The video takes place in a bustling sports studio, with a live football match in the background, visible through the field and the crowd. Three people sit at a table marked \"ESPN College Game Day\", indicating a sports discussion program. The central figure, wearing a black coat and a black hat, has an animated conversation with the man on his right, who wears a jacket printed with a clover pattern and a hat. The conversation revolves around allegations of cheating, particularly regarding payments made by the SEC to players. The man in the jacket clarifies, \"I was just joking. I don't think the SEC has ever paid the players. That's all.\" The conversation then turns to the integrity of college football, and the central figure emphasizes, \"I believe in integrity.\" He elaborates on how the program offers players better chances of success and points out that they earn more money in the NFL than players from any other school. He proudly mentions that 61 players from their league have entered the NFL, attributing this success to the development of the players rather than cheating. Key text overlays include phrases such as \"You previously called him a liar\", \"I was just joking\", \"I don't think the SEC pays the players\", and \"Alabama Jones is very serious\", which highlight the key points of the discussion. The core theme revolves around defending the integrity of college football, opposing cheating allegations, and emphasizing the development and success of players in professional leagues. The atmosphere remains full of passion, reflecting the impassioned nature of sports commentary."
+ },
+ {
+ "video_id": "PI-0fff81b7-5a99-4272-af2f-5f8fba0a2e20",
+ "path": "/workspace/benchmark/video/PI-0fff81b7-5a99-4272-af2f-5f8fba0a2e20.mp4",
+ "answer": "The audio track features a female singer who passionately encourages embracing authenticity and taking risks, evoking the exhilaration of a \"teenage dream\" and emphasizing self-expression and living in the moment. The accompanying video, set to a cheerful mood BGM, showcases the luxurious cruise ship \"Icon of the Seas\" against a vibrant and dynamic backdrop that transitions from a clear daytime sky to a nighttime scene illuminated by the ship's lights. Passengers are seen engaging in various leisure activities such as swimming, sliding on water slides, relaxing in lounge areas, and socializing, highlighting the ship's numerous entertainment facilities, including a water slide, a pool, a rock climbing wall, a basketball court, and a mini-golf course. The text \"ICON OF THE SEAS\" prominently appears, reinforcing the ship's identity and its appeal as a vacation destination. The video effectively conveys a theme of fun, adventure, excitement, and relaxation, encapsulating the essence of a carefree and enjoyable cruise experience, which, while not explicitly linked to the audio's message of authenticity and risk-taking, complements it by showcasing an environment where passengers can freely express themselves and live in the moment, indulging in the lively enjoyment and leisure offered by the cruise."
+ },
+ {
+ "video_id": "PI-79a6d2a4-1501-418a-a183-084b9517ae30",
+ "path": "/workspace/benchmark/video/PI-79a6d2a4-1501-418a-a183-084b9517ae30.mp4",
+ "answer": "The video features a white parrot perched on a wooden structure with a beige, bowl-like top, set against a backdrop of a wooden door frame and a light-colored wall. The parrot, with its predominantly white feathers and subtle variations in shade, moves its head and body gently, occasionally looking around and adjusting its position on the perch. Its light-colored beak and dark eyes add to its serene appearance. The wooden structure has a natural, rustic look with visible grain and texture, and the warm lighting in the room creates a cozy atmosphere. Meanwhile, the audio introduces a contrasting emotional tone, with a female speaker expressing a sense of fear. The main content of the audio revolves around this feeling of being scared, with a mix of English and Japanese, where the Japanese portion translates to 'horrible'. The juxtaposition of the calm and curious nature of the parrot with the speaker's fearful emotions creates an intriguing dynamic, highlighting the serene exploration of the parrot against the backdrop of human vulnerability."
+ },
+ {
+ "video_id": "PI-bd3abc53-92bd-439c-9d36-8f62abc9a691",
+ "path": "/workspace/benchmark/video/PI-bd3abc53-92bd-439c-9d36-8f62abc9a691.mp4",
+ "answer": "The video takes place in a well-lit corridor with tiled floors and fluorescent lights installed on the ceiling. The background features a door at the far end and a notice board on the right wall, adding what seems to be a corridor in a school or office. The main characters are two people wearing black hoodies, black leggings and white socks. The person on the left is wearing white sandals, while the person on the right is wearing black slippers. They are performing a synchronized dance move, full of energy and in step. Their movements include raising their arms, clapping their hands, jumping and spinning together. Once, they held hands and raised their hands together, demonstrating their coordination and comradeship. The lyrics of the background music \"I want you to stay\" complement the cheerful dance atmosphere. This video captures a moment of friendship and joy, emphasizing the themes of unity and sharing happiness in a simple yet vibrant environment."
+ },
+ {
+ "video_id": "PI-b8d89471-48f8-4f87-8037-3b983375318b",
+ "path": "/workspace/benchmark/video/PI-b8d89471-48f8-4f87-8037-3b983375318b.mp4",
+ "answer": "This video takes place in a car, where a man in a light grey suit and a woman with long blonde hair in a light grey hoodie are sitting together. The woman looked very depressed, waving her hands and talking excitedly, while the man watched by her side and responded occasionally. The text at the top of the screen reads: \"I just found out that my boyfriend betrayed me\", setting the background for the conversation. This person began to explain an incident that happened shortly after Christmas. He mentioned going out to play with that woman's father and he was considering buying something \"wrapped in a hair dryer\". The man expressed doubts about the use of this item. He said, \"I thought at that time that she wouldn't use this.\" When will you use this? He advised not to buy it. The woman's reaction was disbelief and frustration. She put her hand on her face and made a contemptuous gesture. She seemed to be on the verge of crying, emphasizing her emotional state. Throughout the video, the interior of the car is still clearly visible, and the scenery outside passes through the Windows. A woman's emotional response is obvious. She continues to express her feelings through gestures and facial expressions. The man sitting beside her seemed to be trying to explain or defend his actions, but the woman's reaction indicated that she was deeply hurt and sad by the betrayal. This video captures the moments of emotional confrontation and pain within the limited space of the vehicle."
+ },
+ {
+ "video_id": "PI-3ff4cea2-b697-458f-9aa3-48d2e7dcde78",
+ "path": "/workspace/benchmark/video/PI-3ff4cea2-b697-458f-9aa3-48d2e7dcde78.mp4",
+ "answer": "This video pays tribute to Duolingo's mascot \"Owl Duo\" in a humorous and serious imitation way. Two people, dressed in colorful big clothes, are standing in front of a building with a sign that reads \"Doingo\". The person on the left is wearing pink and orange clothes with a tear drop on the cheek, while the person on the right is wearing purple and black clothes. They were holding a black bag with the words \"Rest in peace\" written on it. The text on the video reads: \"guys duo is dead. By the way, I'm deaf, so I hope this is a sad song.\" This sets an interesting and satirical tone for the mascot's \"death\". The scene switches to the back of a pickup truck. A person dressed in bright green is lying in a box like a coffin, with an \"X\" drawn on his eyes, surrounded by red fabric and rose petals, adding dramatic and somewhat terrifying elements to the scene. Then The video switches to a black and white commemorative card, which reads: \"In memory of Duo Keyshauna Renee Lingo 'Duo The Owl', 1000 BC - 2025 AD.\" Please don't send flowers. Come and take a Duingo class instead. The overall tone of the video is a blend of humor and mockery, using visual elements and text to satirize and pay tribute to the mascot of Duolingo. The video effectively combines visual narrative with text overlay, conveying humorous messages and making it both interesting and memorable."
+ },
+ {
+ "video_id": "PI-98478288-a59c-4585-8ca7-d76e356e5842",
+ "path": "/workspace/benchmark/video/PI-98478288-a59c-4585-8ca7-d76e356e5842.mp4",
+ "answer": "The video takes place in the bathroom, which is characterized by a shower curtain, a towel rack with dark towels and a small shelf holding various items. The protagonist is a person. He is wearing a light-colored T-shirt with pumpkin and \"autumn\" patterns on it, paired with grey shorts and black and white sports shoes. His hair was tied into a messy bun with blue and white checkered hair ties. This person displayed a series of gestures and postures, starting with making a peace gesture with both hands, then crossing his arms, adjusting his shirt and changing his posture. These actions imply a sense of self-awareness or contemplation. There is very little audio in the video, mainly focusing on the subtle movements and expressions of the characters. The key text reads as follows: \"POV: A 'popular' girl that no low-key person likes.\" This sentence sets the tone for the video, suggesting a narrative about the often unspoken tensions in social dynamics and interpersonal relationships. This person's actions, including his adjustments to his clothes and occasional scans of the room, all reinforce this theme, creating a character who may be considered very popular but also subject to subtle opposition or discomfort from others. Overall, this video captures a moment of self-reflection or social embarrassment, using the bathroom as a private space to reflect on the complexity of social status and acceptance."
+ },
+ {
+ "video_id": "PI-cbcc3c9a-b4f9-4118-8bb7-f7bdc8bede99",
+ "path": "/workspace/benchmark/video/PI-cbcc3c9a-b4f9-4118-8bb7-f7bdc8bede99.mp4",
+ "answer": "The video is characterized by a vibrant and captivating scene, with a bright blue background decorated with white cloud-like shapes, creating a cheerful and interesting atmosphere. The main characters are two animated characters: One has dark skin, black hair, and is wearing a green shirt and blue shorts printed with dinosaur patterns; The other one has a light complexion, a small tuft of yellow hair, and is wearing a white shirt with blue sleeves and a bear pattern on it, paired with yellow trousers. The dark-skinned character peeked through the yellow structure with a red circular light on the top, similar to a vehicle or a game room, and then stood beside the light-skinned character. Both characters displayed vivid expressions and gestures, indicating that their interaction was very active. The audio content includes cheerful background music, which enhances the fun of the game. In addition, the characters will have conversations. Characters with dark skin will make enthusiastic gestures, while those with light skin will respond with vivid expressions. The core theme of the video revolves around friendly interaction and interesting participation, aiming to entertain young audiences. The absence of key text indicates that the focus of information transmission still lies on visual and auditory elements."
+ },
+ {
+ "video_id": "PI-6a05f5d7-c31c-44ee-b124-23afebc4d15e",
+ "path": "/workspace/benchmark/video/PI-6a05f5d7-c31c-44ee-b124-23afebc4d15e.mp4",
+ "answer": "The video features a person in a dimly lit room filled with musical equipment, playing an electric guitar while wearing a black t-shirt with a white graphic design and black pants. The room is equipped with various musical instruments, including guitars hanging on the wall, a drum set, and shelves filled with audio equipment and cables. The video begins with the text 'YO DUDE' appearing at the bottom of the screen, setting a casual and friendly tone. As the person plays, the text 'THAT SONG THAT NANANA' appears in a red box, suggesting a request for a specific song. The person responds with 'YUH,' indicating acknowledgment. The text 'THAT ONE THAT GOES' and 'ONE THAT GOES' follow, emphasizing the request. The person then plays a different riff, and the text 'NO NO' appears, indicating a correction or change in the requested song. The text 'YEAH YEAH' follows, showing agreement or approval. The video wraps up with the text 'YEAH' and 'YES' appearing, confirming the final choice of song. This casual and humorous interaction is complemented by an audio track featuring a male speaker in a light-hearted mood, engaging in a playful conversation about playing songs, using phrases like 'Yo dude, can you play that song that goes...' followed by nonsensical 'na na' refrains and affirmative 'yeah' responses. The overall atmosphere is one of playful interaction and the speaker's attempt to recall or request specific songs, adding a layer of context to the guitar playing."
+ },
+ {
+ "video_id": "PI-f7ca5103-dda5-4eed-92f4-1e09db2180fe",
+ "path": "/workspace/benchmark/video/PI-f7ca5103-dda5-4eed-92f4-1e09db2180fe.mp4",
+ "answer": "This short, humorous video captures a relatable family moment. It begins with a close-up of a young woman looking shocked, her hand covering her mouth as she reacts to something off-screen. On-screen text sets the scene: \"when your mom is out of town so your dad is in charge of dinner.\" Her shaky camera movements and wide-eyed expression convey her disbelief. The video then cuts to reveal the source of her reaction: a large salad bowl on a dining table. The salad, which contains greens, tomatoes, and red onions, is unusually topped with a large quantity of sliced bananas. "
+ },
+ {
+ "video_id": "PI-6a5b4bcc-2a14-4231-b918-c3f785bc8d1a",
+ "path": "/workspace/benchmark/video/PI-6a5b4bcc-2a14-4231-b918-c3f785bc8d1a.mp4",
+ "answer": "The video shows a close-up of flipping through a handcrafted booklet, presenting a series of abstract facial paintings. This booklet is placed on a maroon, textured surface, perhaps a towel or cloth, providing a rich and contrasting background for the white pages. The original painting was minimalist, outlining the facial contours of different features with simple black ink, such as large round eyes, small round eyes, large black circles on the cheeks and the curves of the mouth. As the video progressed, the artist's experiments with colors became obvious. A painting with red as the main tone, painted on the top of the head and one cheek, adds vitality to the originally monochromatic artwork. Next comes another change. The face is set against a background of rainbow stripes, forming a sharp color contrast with the black outline of the face. This video highlights the artist's creative journey, highlighting the evolution from simple black-and-white sketches to more colorful dynamic interpretations. There is no dialogue at all, allowing the audience to focus completely on the visual art and the transformation of the abstract facial paintings throughout the booklet."
+ },
+ {
+ "video_id": "PI-ce37283b-0427-4525-8a66-72de7aec563d",
+ "path": "/workspace/benchmark/video/PI-ce37283b-0427-4525-8a66-72de7aec563d.mp4",
+ "answer": "The video takes place in a bright, modern kitchen where a woman, dressed in a gray tank top and a white skirt, stands with an amused expression. The on-screen text reads, \"POV you ask your husband to take the lead in this trend and this is what he did.\" She gestures animatedly as she speaks, and soon, a man wearing a dark t-shirt and light-colored patterned pants enters the scene. Playfully pushed forward by the woman, he begins to mimic a ballet pose, placing one hand on his stomach and extending the other arm. The woman joins in, adopting a similar ballet stance beside him. Together, they move in a ballet-like manner, with the woman occasionally adjusting her stance and the man following along. This playful interaction is set against the backdrop of a calming BGM track featuring a male voice singing gently and rhythmically. Lyrics such as \"Say a little prayer for you,\" \"A moment I wake up,\" \"Before I put on my,\" and \"And one new address\" suggest a theme of reflection or preparation, possibly for a new day or chapter, enhancing the video's lighthearted yet contemplative mood."
+ },
+ {
+ "video_id": "PI-105a2dd8-e480-4e0e-9502-367a68440d7d",
+ "path": "/workspace/benchmark/video/PI-105a2dd8-e480-4e0e-9502-367a68440d7d.mp4",
+ "answer": "In the video, a woman, dressed in a white T-shirt and adorned with a gold necklace featuring various pendants, speaks directly to the camera from a room with ceiling lights and doors, creating a casual and personal atmosphere. She shares her personal experiences and preferences regarding bikini thread hair removal, advocating for the use of depilatory cream as an excellent alternative to traditional methods such as shaving, waxing, laser hair removal, or sugar scrubbing. Her tone is casual, mirroring the relaxed setting, as she emphasizes her preference for depilatory cream, holding up a pink bottle labeled with \"OYO\" and \"bye bye bush,\" indicating the specific product she uses and recommends. This video serves as both a personal testimony and a product recommendation, focusing on the effectiveness and convenience of the depilatory cream for bikini thread hair removal. The audio content aligns seamlessly with the video, with the woman's casual tone and core message advocating for depilatory cream as a superior choice for those seeking a hassle-free hair removal solution. Together, they provide a compelling case for the benefits of using depilatory cream in one's personal grooming routine."
+ },
+ {
+ "video_id": "PI-84fc45b7-2a63-4984-9aa9-6e49e6662d96",
+ "path": "/workspace/benchmark/video/PI-84fc45b7-2a63-4984-9aa9-6e49e6662d96.mp4",
+ "answer": "The video features a dog sitting on a tiled kitchen floor, wearing a horse mask that covers its entire head, creating a humorous and whimsical appearance. The dog's light brown body and white paws contrast with the playful mask, and it appears relaxed, sitting and looking downwards. The background includes a black trash bin, a wooden shelf with various items, and a window with blinds, all well-lit to suggest it is daytime. The scene captures a lighthearted moment, likely intended to entertain viewers. Meanwhile, the audio features a male speaker in a contemplative and melancholic tone, reflecting on the emotional trade-off between experiencing brief happiness that ultimately leads to sadness versus avoiding the experience altogether. The slow and somber ambient music enhances the reflective atmosphere, adding a layer of introspection to the otherwise playful scene. This juxtaposition prompts viewers to ponder the balance between fleeting joy and enduring sadness, making the video both entertaining and thought-provoking."
+ },
+ {
+ "video_id": "PI-92fd83de-2801-4fcd-95b0-6d945c167551",
+ "path": "/workspace/benchmark/video/PI-92fd83de-2801-4fcd-95b0-6d945c167551.mp4",
+ "answer": "The video shows a man wearing a brown hoodie and earrings sitting on a plain light-colored wall. Another woman, visible in the left part of the frame, dressed in black, interacts with him. The text overlay at the bottom of the screen reads \"Having my bf react to Bhad bhabie Diss song\", indicating that the man is reacting to a specific song. The man in the brown hoodie displays a variety of reactions, such as looking to the side and making gestures that suggest he's involved in the song. Audio, meanwhile, includes female speakers with an unknown emotional tone to convey short and unclear statements in a conversational setting. The content appears fragmented, with phrases suggesting attempts to understand and express gratitude, and including non-English language elements that are not detailed. There was no sign of ambient music or additional participants. The overall setting is simple and interior, focusing on the reaction and interaction between two people."
+ },
+ {
+ "video_id": "PI-5d98f1d2-de2c-4fd1-8e7a-c6014c475103",
+ "path": "/workspace/benchmark/video/PI-5d98f1d2-de2c-4fd1-8e7a-c6014c475103.mp4",
+ "answer": "The video features a golden retriever sitting calmly on a wooden floor in a room with light-colored walls and a white door in the background, complemented by a wooden cabinet to the left. The dog, with its golden coat and subtle red markings on its face—likely from a filter or editing effect—remains mostly stationary, occasionally blinking and slightly moving its head. Its gentle and attentive expression is directed towards the camera, which captures the serene scene from a consistent angle. The soft, natural lighting enhances the tranquil atmosphere, emphasizing the dog's calm demeanor and friendly appearance. Accompanying this peaceful visual is a soft and soothing background track performed by a woman singer with a warm and gentle tone. Her lyrics convey a message of friendship and support, expressing a willingness to be a friend, which together with the serene visuals, creates a comforting and harmonious ambiance."
+ },
+ {
+ "video_id": "PI-1c5afb3b-7531-4025-bf7d-04e50258070b",
+ "path": "/workspace/benchmark/video/PI-1c5afb3b-7531-4025-bf7d-04e50258070b.mp4",
+ "answer": "This video captures an intimate and tender moment between two people lying in bed, covered with a textured white blanket. The scene was dimly lit, creating a comfortable and private atmosphere. Such words appeared on the screen: \"POV: When the hug feels too perfect to end, but you have to do it because you are still children.\" This passage sets the background for the scene, suggesting that individuals are enjoying moments of intimacy and love, but they realize they need to stop because they are still young. There are some gentle and affectionate words in this audio, such as \"My baby, my baby\". \"You are my baby,\" this intensifies the moment of tenderness and love. The theme of the entire video revolves around the bittersweet nature of young love and the brief but precious moments of connection that follow. The combination of visual and auditory elements creates a warm and emotional narrative that resonates with the experience of cherishing intimate moments."
+ },
+ {
+ "video_id": "PI-055b5d3c-e856-4b8c-849b-ff9016380e8b",
+ "path": "/workspace/benchmark/video/PI-055b5d3c-e856-4b8c-849b-ff9016380e8b.mp4",
+ "answer": "The video begins with a close-up shot of a hand placing a printed photo on the back of a smartphone, which rests on a white surface alongside a small car model and a keyboard. This scene transitions to a person holding a smartphone encased in a grey box featuring a transparent window that displays photos. The on-screen text reads: \"Code: dqd7945 bought on TEMU for $0,\" accompanied by a lightning emoji, highlighting a promotional offer. The video then shows users navigating through a smartphone application, selecting various photos from their gallery to be displayed on the phone case, including images of a couple embracing and a cute animal illustration, demonstrating the product's versatility. The phone case is then placed on a stand, clearly showcasing the selected photos through the transparent window. Complementing the visuals, the audio features a male voice-over in a friendly and enthusiastic tone, explaining that this unique phone case, available for zero dollars on TEMU, can display photos from the app or gallery on the back. He emphasizes the ability to switch countless phone cases effortlessly and suggests placing a special photo, such as one of a wife or lover, to make them smile. The core message of both the audio and video is to encourage viewers to hurry to TEMU and purchase one of these innovative phone cases."
+ },
+ {
+ "video_id": "PI-ccc8d4e8-9ac6-47d8-9aa5-6d6c65c0eb04",
+ "path": "/workspace/benchmark/video/PI-ccc8d4e8-9ac6-47d8-9aa5-6d6c65c0eb04.mp4",
+ "answer": " The video presents a close-up exploration of a person's facial features, primarily focusing on their lips, nose, and mouth. It begins in a dimly lit setting, where the lips are slightly parted, revealing some teeth. As the camera gradually moves upward, more of the person's face is revealed, including their eyes and part of their hair, with a necklace featuring large, round beads becoming visible. The person appears relaxed, occasionally showing a slight smile with their tongue visible inside their mouth. The camera continues to shift angles slightly, providing varied perspectives of the mouth and lips, while the soft lighting casts gentle shadows on the face. Throughout the video, the focus remains on the intricate details of the lips, teeth, and tongue, with the background staying dark to emphasize these features. There is no dialogue or audio content to accompany these visual elements, allowing the viewer to focus solely on the subtle movements and expressions of the person's face."
+ },
+ {
+ "video_id": "PI-80a1893b-ea51-4dfc-8d5f-b2440d15f2fb",
+ "path": "/workspace/benchmark/video/PI-80a1893b-ea51-4dfc-8d5f-b2440d15f2fb.mp4",
+ "answer": "The video begins with a close-up of a person wearing glasses and a black headband, creating a private and intimate tone. The words \"You can't possibly forget this grapevine\" appeared at the top of the screen, accompanied by a crying emoji, immediately evoking a sense of nostalgia and shared cultural memory. This passage indicates that the video refers to a popular Vine video, inviting viewers to recall familiar and humorous moments in the past. Then, the scene switches to a person wearing a red hooded sweatshirt and blue checked pants, energetically jumping onto a bunk bed in a room with a wooden ladder and a bed covered with a blue blanket. The same words, \"You can't forget this grapevine,\" are still on the screen, intensifying the nostalgic theme. This person's energetic movements and casual, home-like environment, along with the beige walls and some household items, capture the playful and humorous essence of the original Vine content. This video seems to be a reproduction or tribute to a famous one on Vine, aiming to evoke the memories of those viewers who remember the original video and make them laugh heartily."
+ },
+ {
+ "video_id": "PI-338b03ef-3e0f-4a53-a011-459ac08bc828",
+ "path": "/workspace/benchmark/video/PI-338b03ef-3e0f-4a53-a011-459ac08bc828.mp4",
+ "answer": "The video features a person standing outdoors, surrounded by greenery and a wooden bridge, holding a synthetic skull in one hand and an icicle in the other. Dressed in a white t-shirt, black gloves, and a beige cap, the individual explains an experiment to test if a falling icicle can penetrate the synthetic skull, which is designed to replicate human bone and tissue. They discuss creating a guide tube to ensure the icicle hits the skull accurately, noting the difficulty of making icicles in their location. The person describes the icicle as pointy, sharp, and likely to have a good chance of penetrating the skull, and asks the audience if icicles are common where they live. The scene transitions to a close-up of the guide tube setup, with a countdown from 3 to 1. The icicle is dropped through the guide tube and strikes the synthetic skull. The video concludes with a close-up of the skull, showing a significant hole where the icicle impacted, demonstrating the force and sharpness of the icicle. The audio includes the person's explanations and questions, adding context and engagement to the experiment."
+ },
+ {
+ "video_id": "PI-eacc6b40-8fe4-4910-877e-2524b4fe73c6",
+ "path": "/workspace/benchmark/video/PI-eacc6b40-8fe4-4910-877e-2524b4fe73c6.mp4",
+ "answer": "The video presents a relaxed, sunny scene at a resort pool, where two women are swimming and conversing. One woman, wearing a dark bikini and a camouflage hat, holds onto a black pool noodle, while the other, in a blue bikini, moves through the water. The background features a building with arched windows, turquoise umbrellas, and deck chairs, enhancing the resort atmosphere. Overlaying this visual is a humorous text: \"POV you're trying to have a conversation but you're in a hot tub,\" setting a casual tone. The audio, however, introduces a contrasting emotional layer. One female speaker expresses frustration and disappointment over not receiving a response after writing to someone, while the other sounds dismissive and slightly annoyed, advising her to stop reaching out. This dialogue, devoid of background music, highlights a casual yet emotionally charged interaction, with one feeling neglected and the other indifferent, adding an intriguing dynamic to the serene setting."
+ },
+ {
+ "video_id": "PI-80c4c73d-d258-45a3-976c-4d2d0dc0e5de",
+ "path": "/workspace/benchmark/video/PI-80c4c73d-d258-45a3-976c-4d2d0dc0e5de.mp4",
+ "answer": "The video takes place in a subway station, where two individuals engage in a lively game of Ddakji, creating a playful and competitive atmosphere. One person, dressed in a gray fur coat and black pants, faces off against another in a black suit. The text 'POV: you're playing Ddakji at the same station they filmed Squid Game' sets a humorous tone, drawing a parallel to the popular show. The game unfolds with humorous moments, such as the text 'totally not embarrassed rn' appearing after a missed attempt, and 'Let's try again' as the person in the gray coat prepares for another turn. The person in the suit nearly succeeds in flipping the Ddakji, prompting the text 'ok it almost flipped'. The person in the gray coat declares 'my time to shine' before another attempt, while the suit's throw leads to the text 'THIS IS RIGGED'. After several attempts, the person in the gray coat finally succeeds, celebrated with the text 'FINALLY'. The person in the suit opens a briefcase filled with money, and the text 'show me the money' appears as the person in the gray coat receives a blue envelope and some money, marking their victory. The background features a purple wall with '양재시민의 숲 (매헌)' (Yangjae Citizen's Forest (Maeheon)) and 'D09', indicating the subway station location. The video captures the excitement and humor of the game, set in a bustling subway environment, while the audio, though fragmented and lacking clear semantic meaning, subtly underscores the playful and dynamic interaction between the players."
+ },
+ {
+ "video_id": "PI-62c8a54f-1af3-461b-8540-3a0c2aabf56c",
+ "path": "/workspace/benchmark/video/PI-62c8a54f-1af3-461b-8540-3a0c2aabf56c.mp4",
+ "answer": "In the video, a person wearing a red top is speaking to the camera, expressing concern for a girl standing in the rain who looks very cold. The text on the screen reads like this: \"Guys, it's raining outside. There's a girl outside. She looks very cold. I think she might be a fan, but I don't know. But she seems to need help.\" What should I do? The camera transitions from one glass door to another, and a blonde girl is seen standing outside. She is wearing a black top, grey trousers and green shoes. The person in the red coat invited her in. The scene switched to indoors and the text on the screen was: \"This girl really came all the way from the UK.\" At the end of the video, the two are sitting together. The person in the red coat thinks how sweet it is for the girl to sit outside in the cold just to take a photo. The conversation between the two people in the audio reflects the care between people and conveys a friendly emotion."
+ },
+ {
+ "video_id": "PI-cf4f69a7-a73a-4d62-90f2-9ee71dd604dc",
+ "path": "/workspace/benchmark/video/PI-cf4f69a7-a73a-4d62-90f2-9ee71dd604dc.mp4",
+ "answer": "The video presents a detailed close-up time-lapse of the healing process of two eyebrow piercings over six days, with each day clearly labeled. It begins on Day 1, showing fresh piercings with slight redness and minor bruising, and progresses through Day 2 where bruising becomes more pronounced with purple and blue discoloration. By Day 3, the bruising spreads and changes to yellow and green hues, indicating the onset of healing, though swelling persists. Day 4 shows further evolution with more yellow tones and reduced swelling, while Day 5 reveals lighter discoloration and significant recovery. By Day 6, the bruising is much lighter, with discoloration almost gone, and the skin appears calmer. Complementing this visual journey, the audio features a melancholic and emotional song performed by both male and female singers. The lyrics delve into themes of personal struggles, feeling deported (likely metaphorical for emotional or life challenges), burnt (hurt), and a yearning for renewal. Together, the video and audio create a poignant narrative of pain, hardship, and the gradual healing process, both physically and emotionally."
+ },
+ {
+ "video_id": "PI-4690fe84-9c47-4e88-b15f-3fbdd35d9b02",
+ "path": "/workspace/benchmark/video/PI-4690fe84-9c47-4e88-b15f-3fbdd35d9b02.mp4",
+ "answer": "The video features a person wearing a colorful, floral-patterned jacket, set against a backdrop of a window with white blinds, overlaid with animated characters from the movie 'Moana,' including Moana herself and other characters like Maui, all set against a blue, water-like background. Moana is depicted in her iconic red and white outfit with long black hair flowing, and as the video progresses, the animation changes to show different characters and scenes from the movie, such as Moana with her arms crossed and other characters in various poses. Text appears on the screen, including 'TARGET' and 'Game Over' with percentages like '75%', suggesting a game or interactive element. Accompanying the visuals, the audio features a female voice with a casual tone and a hint of confusion, mentioning 'Moana' and expressing uncertainty or confusion about a certain detail, perhaps regarding color or appearance, as indicated by the reference to 'dark red.' The video maintains a consistent visual style, blending live-action footage with animated elements from 'Moana,' creating an engaging and interactive experience that combines the speaker's casual commentary with the vibrant animation."
+ },
+ {
+ "video_id": "PI-5934f414-1826-42b3-a026-7c8223959045",
+ "path": "/workspace/benchmark/video/PI-5934f414-1826-42b3-a026-7c8223959045.mp4",
+ "answer": "The video features a red carpet interview with a woman in a white suit and gold clutch, conducted by an interviewer also dressed in white, set against a green hedge wall backdrop. The interview delves into her knowledge of her NBA boyfriend, covering specifics such as his college (UCLA), jersey number (11), draft year (2023), height (6'7\"), position (guard), and career high (31 points), with the interviewer commending her accurate responses. The audio captures the lively and engaging conversation between the participants, predominantly female voices, in a cheerful and upbeat tone. The atmosphere is relaxed and enthusiastic, with participants expressing camaraderie and occasionally affirming each other's reactions. The ambient music is energetic, enhancing the joyful and interactive atmosphere. Towards the end, the conversation shifts to a brief debate about who is more likely to win a race, with the woman confidently claiming she would win, while a man in a light blue suit, who joins later, mentions he ran track, adding an interesting dynamic to the discussion."
+ },
+ {
+ "video_id": "PI-aa7fb8e0-c5b8-4fdf-a5f8-e664fb230dbb",
+ "path": "/workspace/benchmark/video/PI-aa7fb8e0-c5b8-4fdf-a5f8-e664fb230dbb.mp4",
+ "answer": "The video captures a humorous and immersive scene in a dimly lit room with green lighting, where two individuals are experiencing virtual reality (VR) for the first time. The room's geometric neon outlines on the walls create a futuristic ambiance, enhancing the high-tech setting. Both participants are seated on specialized VR exercise equipment, wearing headsets and holding controllers. The person in the foreground is highly active, swinging their arms and shifting their body energetically, deeply engaged in the dynamic VR simulation. The person in the background, though less pronounced in movement, is also participating. The text overlay 'Handed bro the VR Set for the first time' underscores the novelty and comedic context of the situation. Meanwhile, the audio features a male speaker with an unknown emotional tone, briefly communicating in Cantonese and English, expressing a casual acknowledgment with 'Yeah' in English. The audio lacks clear semantic content beyond these utterances, suggesting a solitary individual speaking without discernible context or interaction, and no ambient music or additional auditory elements are present. The video emphasizes the excitement and novelty of the VR experience, with both participants fully immersed in the virtual environment, while the audio remains minimal and context-free."
+ },
+ {
+ "video_id": "PI-f597492e-f673-43c2-be6f-6a7a7f0f7e2e",
+ "path": "/workspace/benchmark/video/PI-f597492e-f673-43c2-be6f-6a7a7f0f7e2e.mp4",
+ "answer": "This video captures a tense incident in a Los Angeles neighborhood where residents intervened to stop a man from allegedly starting a fire. The footage begins with a distant view of a confrontation, with people shouting \"Put that down!\" and \"He's starting a fire!\" as they attempt to de-escalate the situation. The scene then shifts to a view from a car driving past, revealing the arrival of the police. According to text overlays, neighbors had called the authorities after noticing the man's suspicious activity. When police arrived, they reportedly found the suspect in possession of fire starters and blow torches. After questioning him, they arrested him immediately. A conversation between two witnesses in the car suggests that authorities believe this individual may also be responsible for the recent West Hills fire."
+ },
+ {
+ "video_id": "PI-a4443251-f37c-41b4-b3d5-8af271e7ee2b",
+ "path": "/workspace/benchmark/video/PI-a4443251-f37c-41b4-b3d5-8af271e7ee2b.mp4",
+ "answer": "The video features a person with long, dark hair wearing a red short-sleeved top and multiple bracelets on their wrist, set against a room with a purple-hued wall and a glass shelf adorned with various items, including a plush toy. The individual is actively engaged in making hand gestures, such as holding up their hand with fingers spread, and appears to be applying makeup or demonstrating something on their face, using their fingers to touch and gesture near their face, possibly explaining or showcasing a product. Complementing this visual, the audio clip features a female speaker with a passionate and playful tone, saying, 'Is your name Black Pink, Black Pink b?' This phrase seems creative and interesting, possibly referring to the South Korean pop band Blackpink, and conveys a relaxed and fun atmosphere. Together, the video and audio create an engaging and interactive experience, aiming to attract the audience in an interesting and entertaining manner."
+ },
+ {
+ "video_id": "PI-b53c82cd-988c-4a24-90b8-3a6435068dda",
+ "path": "/workspace/benchmark/video/PI-b53c82cd-988c-4a24-90b8-3a6435068dda.mp4",
+ "answer": "The video captures a joyful moment in a parking garage where four men are gathered. The central figure, dressed in black, holds up a film strip, seemingly announcing a significant piece of news. The text overlay reveals that this is a special moment: \"my husband telling his best friends we're pregnant is my new favorite video.\" The audio complements this visual, with a male speaker expressing excitement and happiness, declaring, \"Well, everyone can say I'm going to be an uncle.\" Although the other speakers' words are somewhat unclear, their tones are upbeat and filled with positive emotions. The men react with exuberance: one in a white shirt and beige pants points and smiles, another in a brown shirt and blue jeans gives a thumbs-up, and the fourth, in a light-colored shirt and jeans, stands with his arms crossed, grinning. They celebrate by hugging, jumping, and playfully interacting, sharing the happy news. The background features parked cars and a modern building structure with horizontal slats and overhead lighting, adding to the contemporary feel of the scene. The atmosphere is lively and filled with camaraderie as they revel in the moment, making it a memorable announcement."
+ },
+ {
+ "video_id": "PI-0ea6dd37-fe26-4505-b69f-a632a8f80e9d",
+ "path": "/workspace/benchmark/video/PI-0ea6dd37-fe26-4505-b69f-a632a8f80e9d.mp4",
+ "answer": "The video takes place on a wooden floor, featuring a person interacting with a cat. The main characters are the person, whose hand and fingers are visible, and the cat, which is grey and white. The person is wearing a white shirt and a ring on their left hand. The core action involves the person petting the cat, with the cat responding by following the hand's movements. The background alternates between darker and lighter areas as the cat moves. The audio is a female voice, with a cheerful emotional tone. The audio content centers around affectionate interactions with cats, which might indicate that training is underway. The video conveys a theme of affection and playfulness, highlighting the bond between the person and their pet."
+ },
+ {
+ "video_id": "PI-6323ac7a-49eb-407e-a811-9cfab6e83e9b",
+ "path": "/workspace/benchmark/video/PI-6323ac7a-49eb-407e-a811-9cfab6e83e9b.mp4",
+ "answer": "The audio and video content seamlessly combine to create an uplifting and visually appealing demonstration of crafting a delightful Blizzard-style dessert, accompanied by a cheerful and energetic vocal track. The video opens with a close-up of a person scooping vanilla ice cream into a large metal mixing bowl, setting the stage for the creative process inspired by a TikTok trend, as indicated by the text overlay. As the person adds Oreo cookies and a variety of colorful candies, including Reese's Pieces, to the bowl, the anticipation builds for the final product. The mixture is then blended in a stand mixer, transforming into a thick and creamy concoction that is expertly scooped into a clear plastic cup adorned with red patterns. The person, dressed in a light blue sweatshirt, proudly holds up the filled cup, showcasing the delicious creation. The video concludes with a close-up of the cup, now adorned with whipped cream, cherry pieces, chocolate syrup, and additional candies, highlighting the visually stunning and mouthwatering dessert. Meanwhile, the male singer's cheerful and energetic vocal track underscores the lighthearted and joyful attitude of enjoying the process of creating something fun and delicious, perfectly complementing the video's engaging demonstration."
+ },
+ {
+ "video_id": "PI-0ac92d6f-0685-49de-b2bc-69c731982918",
+ "path": "/workspace/benchmark/video/PI-0ac92d6f-0685-49de-b2bc-69c731982918.mp4",
+ "answer": "This is the audio track of an animated video, featuring no narration; all dialogue is delivered directly by the characters. In the audio, the male character's voice clearly conveys nervousness and eagerness as he musters the courage to extend a sincere invitation to a female character, hoping she will become his girlfriend. As the story unfolds, this female character actively supports the male character, and through a series of events or interactions, his image is significantly enhanced in the eyes of others. However, the plot takes an unexpected turn when, after his transformation and increased attractiveness, another female character extends an invitation to him. Surprisingly, faced with this new temptation, the male character unhesitatingly abandons the first female character, who had initially reached out to help him and shared emotional moments with him, choosing instead to be with the second female character. This plot development profoundly highlights the complex and ever-changing dynamics between the characters, as well as some nuances of human nature."
+ },
+ {
+ "video_id": "PI-3209f986-e0d7-4a39-85f0-d4fcc9ad414f",
+ "path": "/workspace/benchmark/video/PI-3209f986-e0d7-4a39-85f0-d4fcc9ad414f.mp4",
+ "answer": "In the video, a person is sitting in a car, wearing a black fur coat and white headphones. The interior of the car is visible, with the ceiling and some windows showing. This person is responding to a comment from a user named @elijah7, who asked, \"Can you sing in English?\" The poster JIANNE replied, \"If your comment gets 500 likes.\" As the video plays, the person begins to sing and the lyrics appear on the screen, including phrases such as \"Round Round\", \"Round and Round\", \"Let's circle\", \"Let's dance together\", \"Pat our paa\", \"Sing this song\", \"~ Lala Lala ~\", \"happy\" and \"we're ll daaAaAnCE\". The person seems to be thoroughly enjoying the moment, perhaps getting ready to go out for New Year's Eve, as mentioned in the song. The video maintains a lively and festive atmosphere as the person keeps singing and the lyrics are displayed on the screen. The lyrics \"Slap Our paa\" are accompanied by an angry emoji and a hand emoji, while \"Sing this Song\" is accompanied by an emoji of a dancing person. Furthermore, a black-and-white image of people dancing appears at the top of the screen, covered with the words \"ringa linga\" and \"LINGGGGG\", enhancing the festive atmosphere. The video captures the essence of celebration and joy, with people immersed in the spirit of music and the festival."
+ },
+ {
+ "video_id": "PI-bda2c2d2-a3d0-41b3-8e9f-8c53e9d8df1d",
+ "path": "/workspace/benchmark/video/PI-bda2c2d2-a3d0-41b3-8e9f-8c53e9d8df1d.mp4",
+ "answer": "The video unfolds aboard a dimly lit boat, where the soft glow of blue light illuminates the interior, casting a serene and slightly enigmatic atmosphere. A screen displaying sonar or fish-finding data with numerical readings and wave-like lines subtly hints at the boat's purpose. As the camera pans, it reveals the expansive ocean under a breathtaking sunset, with the sky ablaze in shades of purple and orange. A silhouetted figure in a cap stands in contemplation, gazing at the horizon, embodying a moment of quiet reflection. This visual narrative is accompanied by a narrative audio clip featuring a female speaker, whose calm and contemplative tone, enhanced by warm music, deepens the emotional resonance of the scene. The video beautifully captures the tranquil interplay between humans and nature, emphasizing a peaceful coexistence in this idyllic setting."
+ },
+ {
+ "video_id": "PI-401707ce-6239-4b9d-b2b4-129c7bbdb5c8",
+ "path": "/workspace/benchmark/video/PI-401707ce-6239-4b9d-b2b4-129c7bbdb5c8.mp4",
+ "answer": "The video depicts a woman lying in bed, covered with a gray blanket adorned with white patterns and resting her head on a pink pillow with a floral design. The room is bathed in a purple hue, creating a cozy and intimate atmosphere. Throughout the video, the text 'PLS help me find this song' is displayed at the bottom of the screen, emphasizing her urgent plea for assistance. In the accompanying audio, the female speaker expresses her frustration over a song that has been stuck in her head all day. She describes not knowing any of the lyrics but can hum a small instrumental part of the song. Despite trying Shazam and Google, she has been unsuccessful in identifying it. She believes the song is from a TikTok video, likely featuring an aesthetic beach scene with kids skiing in Bali or a similar setting. The video and audio together highlight her earnest attempt to find the elusive song by describing its context and attempting to hum the melody."
+ },
+ {
+ "video_id": "PI-799e755f-f78f-4245-a774-ff8a784f03ce",
+ "path": "/workspace/benchmark/video/PI-799e755f-f78f-4245-a774-ff8a784f03ce.mp4",
+ "answer": "The video features a close-up of a blue snack container placed on a granite countertop, with multiple compartments, one filled with small, twisted pretzels. A hand reaches into the container and begins to pick up pretzels, linking them together to form a chain, continuing to add more pretzels to create a longer and more intricate structure. The playful and nostalgic tone is set by the accompanying audio, where a female speaker asks if anyone else still makes chains out of pretzels, expressing surprise and amusement at this simple activity. She invites listeners to relate to her by sharing this childhood pastime. The hand moves the pretzel chain around, displaying its length and complexity, and then places it around the wrist, mimicking a bracelet, showcasing the playful and creative use of the snack. The video concludes with the hand holding the pretzel chain above the container, ready to add more pretzels or to display the chain once more, evoking a sense of nostalgia and connection through this shared, lighthearted experience."
+ },
+ {
+ "video_id": "PI-0f627fcb-cea1-4802-ab1c-41804152f431",
+ "path": "/workspace/benchmark/video/PI-0f627fcb-cea1-4802-ab1c-41804152f431.mp4",
+ "answer": "The video takes place in a workshop where a person wearing a welding helmet adorned with colorful decorations is welding. Passionate background music underscores the energetic mood of the scene. The person is wearing a dark hoodie with a colorful print on the front and welding gloves. The welding process produces visible sparks and smoke, creating an energetic and tense scene. Another person wearing a gray hoodie and welding helmet approaches the welder and raises his hands in surrender or apology. The background is a typical workshop setting with industrial lighting and various equipment. The video humorously captures a moment of misunderstanding during the welding process, and the text emphasizes the comedic nature of the scene."
+ },
+ {
+ "video_id": "PI-b3e846e4-a6dd-45b3-84f3-860c1163b792",
+ "path": "/workspace/benchmark/video/PI-b3e846e4-a6dd-45b3-84f3-860c1163b792.mp4",
+ "answer": "The video presents a unique split-screen format that masterfully intertwines an actor's humorous recounting of his film experience with actual movie footage. On the top half, a man in a black jacket is seated in what appears to be an aircraft, gesturing animatedly as he shares an amusing anecdote about recording a specific line for a film. His tone is light-hearted, highlighting the unexpected and comical aspects of the process. Simultaneously, the bottom half of the screen showcases scenes from the movie, featuring a superhero character in various settings—from a battle-damaged environment to a determined stance against a backdrop of a large, glowing sun. The video alternates between the man's narrative and the superhero scenes, creating a seamless narrative connection that offers a behind-the-scenes glimpse into the film's production. This innovative format effectively combines the actor's personal commentary with the dramatic visuals of the movie, providing viewers with an engaging and multifaceted experience."
+ },
+ {
+ "video_id": "PI-acff10ef-f65b-4748-8436-f457254914e6",
+ "path": "/workspace/benchmark/video/PI-acff10ef-f65b-4748-8436-f457254914e6.mp4",
+ "answer": "The video unfolds in an upscale restaurant, characterized by warm ambient lighting, red booth seats and pristine white tablecloths. The main focus of the scene is a lady in a smooth black dress, wearing a striking large necklace, who is sitting at the table. There is still leftover food on her plate, indicating that she is almost done eating. A waiter in a white shirt and a black tie comes towards her, collects her plate, which still contains some food, and then gives her a small plate, possibly for dessert or a palate cleanser. The woman accepts the offer gratefully. As the interaction continues, the waiter demonstrates outstanding service, and the lady, evidently very satisfied, gives a thumbs-up. The video concludes with the golden letters \"VIP\" set with diamonds on a black background, emphasizing the exclusivity and high-end nature of the dining experience. The core theme of the video is the outstanding service provided in a luxurious restaurant environment, highlighting the attention to detail and the high-quality experience offered to guests."
+ },
+ {
+ "video_id": "PI-0e0c7855-5428-4d01-8481-415c0193f7ef",
+ "path": "/workspace/benchmark/video/PI-0e0c7855-5428-4d01-8481-415c0193f7ef.mp4",
+ "answer": "The video depicts a serene cityscape in the morning with palm trees silhouetted against a deep orange sky. In the background, the city skyline is bathed in warm golden light, with the low sun casting a tranquil glow over the entire scene. The main “characters” are the palm trees, which vary in size and color, enhancing the visual sense of serenity. The camera keeps moving forward, indicating that the filmmaker is also moving forward. The audio features a female singer, conveying a contemplative and serene experience with a hint of introspection. She describes being in the trees, feeling the breeze, hearing footsteps, and seeing her reflection in her surroundings. The main English text in the video, “LA this morning,” appears prominently, establishing the time and place. The overarching theme conveys a sense of peace, relaxation, and the beauty of nature in an urban environment, evoking a calm emotional resonance."
+ },
+ {
+ "video_id": "PI-10a43c0c-eda3-42b9-b558-dcbb1f0cbd11",
+ "path": "/workspace/benchmark/video/PI-10a43c0c-eda3-42b9-b558-dcbb1f0cbd11.mp4",
+ "answer": "The audio track captures a man's enthusiastic commentary on a song, where after a female voice in the background music emerges, he excitedly expresses his recognition and fondness for the track. The video, set in a recording studio or casual workplace, features a man in a white shirt, wearing headphones and a gold chain, sitting on a chair facing the camera. The background includes a table with red and green buttons, a shelf, and a refrigerator, suggesting he might be a radio host. The video begins with a comment on the screen: \"Reply to tytus154's comment: 'plz react to Say My Name - Remix',\" prompting the man to play the song. After a brief pause, where he shouts to stop because he wasn't ready, he continues playing the song, waving his hands along to the rhythm and heaping praise on it. Finally, he presses the green button and enthusiastically suggests that viewers add the song to their playlists. The video effectively showcases the man's positive evaluation of the song as requested by netizens, with his genuine enthusiasm and high praise aligning perfectly with the audio's depiction of his recognition and fondness for the track."
+ },
+ {
+ "video_id": "PI-ebf14605-f167-4635-856e-cb58f0ebfa20",
+ "path": "/workspace/benchmark/video/PI-ebf14605-f167-4635-856e-cb58f0ebfa20.mp4",
+ "answer": "In a lively family setting, a game is underway around a long table laden with prizes. A young boy takes his turn, carefully rolling a green can down the surface. The prizes, including cash and lottery tickets, are arranged in lanes, with a coveted Nintendo Switch box positioned at the very end. Family members gather around, watching with bated breath and cheering him on. The camera follows the can's journey as it steadily rolls past the smaller prizes. As it nears the end of the table, the excitement in the room visibly mounts. In a moment of pure luck, the can comes to a perfect stop right in front of the Nintendo Switch. The boy erupts in a massive shout of pure joy, a reaction mirrored by his family who break into loud, celebratory cheers. He is then seen triumphantly holding his new game console, beaming with an ecstatic expression while receiving a high-five."
+ },
+ {
+ "video_id": "PI-38736e2d-9efe-4c09-bb74-8723e3cf57f7",
+ "path": "/workspace/benchmark/video/PI-38736e2d-9efe-4c09-bb74-8723e3cf57f7.mp4",
+ "answer": "The video shows a comfortable and elaborately decorated gaming and living space where a couple spends their evening. The room features two monitors set on a white table, with one person seated on each side. The walls are decorated with floral wallpaper, and the bookshelves are filled with books and ornaments, creating a warm and inviting atmosphere. A text cover humorously reads: \"pov: You moved your husband's play space next to yours, so now your night looks like this,\" highlighting the couple's common interests and the intentional design of their space. The couple is deeply immersed in their gaming activities; one wears headphones while the other occasionally interacts with items on the table, such as a pink box and a lit candle. The table is piled high with various gaming accessories, small containers and a microphone, indicating a gaming or streaming setup. The person on the left initially wears a pink sweater and later changes into a blue one, while the person on the right remains concentrated on the game. Finally, the video returns to the gaming area, where a person wearing a black hoodie and green headphones is watching something on the big screen, possibly a movie or a program. The room is dimly lit, with the screen providing the main light source. The video ends on a scene from the screen, adding a touch of everyday life to the comfortable environment. The background audio includes a faint, repetitive chanting, \"Hello, Alabama, Arkansas, Arkansas, Alabama. Love you,\" which adds a whimsical and light-hearted feeling to the entire atmosphere."
+ },
+ {
+ "video_id": "PI-08dc7604-9b66-4623-a6e6-a0471c3c4897",
+ "path": "/workspace/benchmark/video/PI-08dc7604-9b66-4623-a6e6-a0471c3c4897.mp4",
+ "answer": "The video features two individuals sitting closely together in a room adorned with various posters on the wall, including one with a vinyl record and another with the text 'LOVE' and 'HOLLYWOOD.' Both are dressed in matching pink pajamas with red bow patterns. The text overlay at the top of the frames reads 'pov: you are sisters on christmas' (point of view: you are sisters on Christmas), setting a playful and festive tone. Initially, they are holding blue and white patterned blankets. The scene transitions to them holding up various items, including a black package labeled 'Adore Me,' a pink box, and a blue package, suggesting they are showing off Christmas gifts. The video then cuts back to them without the items, still in their matching pajamas. The final frames show them holding white rectangular objects, possibly soap or another small gift, continuing the theme of sharing and showing off their Christmas presents. The overall atmosphere is warm and festive, capturing the joy of sibling bonding during the holiday season. The audio includes a voiceover that adds a humorous and competitive layer to the scene. The voice says, 'I got married in Italy. Am I saying you copied me by getting married in Italy? Who performed at my wedding? Andrea Bocelli. Who performed at Courtney's wedding? Andrea Bocelli.' This dialogue suggests a playful accusation of one sister copying the other's wedding choices, adding a humorous twist to their interaction."
+ },
+ {
+ "video_id": "PI-c3ec3491-dcbd-422e-9bb9-4bd8826846b0",
+ "path": "/workspace/benchmark/video/PI-c3ec3491-dcbd-422e-9bb9-4bd8826846b0.mp4",
+ "answer": "This video captures a lively scene at a fish market stall. A vendor in an orange hoodie stands behind a counter filled with all kinds of fish. A conspicuous sign hanging on the stall reads, \"High-quality and low-priced fish are in the city!\", emphasizing the stall's commitment to offering first-class seafood at competitive prices. Several customers gather around the stall to conduct transactions. A woman wearing a yellow headscarf and a dark coat stands with a camera on her back, holding a blue bag in her hand, seemingly interacting with the vendor. Not far away, a man in a grey jacket and a grey hat carries a bag and observes the scene. In the foreground, a man in a black leather jacket gestures actively, perhaps having a conversation or emphasizing a point. The phrase \"Very, very Steve a pound of fish, my God\" keeps recurring in the background audio, seemingly a promotional slogan or a customer's exclamation about the affordability of the fish. The words \"1 pound fish\" and dancing emojis appear on the screen, enhancing the promotional aspect. The overall atmosphere is bustling and vibrant, capturing the essence of a typical market environment, with a focus on the fish stall and its customers. The core theme of the video revolves around the vivid interactions and transactions at the fish market, highlighting the products at the stall and its charming atmosphere."
+ },
+ {
+ "video_id": "PI-4f77fe79-3797-4064-a279-e115a0505c67",
+ "path": "/workspace/benchmark/video/PI-4f77fe79-3797-4064-a279-e115a0505c67.mp4",
+ "answer": "The video captures a whimsical and humorous scene where a person crafts luxurious desserts inside a car, blending Oreo mint ice cream, Oreo cake ice cream, chocolate pretzels, rainbow-sprinkled lollipop cake, caramel sauce, chocolate syrup, and fresh cherry syrup into a glass bowl, all while playfully emphasizing the \"healthy\" aspect of the cherries. The production process is intertwined with the use of the Cal AI application to calculate the dessert's calories, revealing a staggering 1362 calories and a health score of 2/10, a fact that doesn't deter the creator from savoring the concoction with a wooden spoon. The video's humor is further amplified by the joke about \"calories not counted in birthdays,\" and it concludes with another blonde person indulging in the rainbow candy cake within the dessert. Meanwhile, the audio features a female narrator with a curious tone, expressing amusement and skepticism towards the absurdity of the situation, much like one might feel about a claimed punching technique in a different context, though here it's cleverly redirected to the comical excess of the dessert's calorie count, perfectly complementing the video's lighthearted and playful narrative."
+ },
+ {
+ "video_id": "PI-1b270291-7967-4733-8e4d-bf33873622b0",
+ "path": "/workspace/benchmark/video/PI-1b270291-7967-4733-8e4d-bf33873622b0.mp4",
+ "answer": "The audio content features background music lyrics that vividly depict renewal and presence, with the singer commanding attention to a space being cleansed and organized, emphasizing the creation of a new, distinct identity or impression. The video content complements this theme by showcasing a person engaging in personal care and self-reflection within a dimly lit bedroom and bathroom. The person is seen walking around the room, with the background changing to reveal items such as beds, televisions, mirrors, candles, and laptops in white, black, and brown tones. The peaceful background music in the audio enhances the calm atmosphere of the video, which lacks significant dialogue but includes important English texts like \"get of everything,\" \"got me,\" and \"everything\" that underscore the theme of self-reflection. The soft lighting and intimate setting of the video convey a mood of relaxation, self-reflection, and personal care, aligning perfectly with the audio's message of renewal and the pursuit of a fresh identity or impression in one's space."
+ },
+ {
+ "video_id": "PI-4823b76e-bd16-476e-bf97-29327882a4a0",
+ "path": "/workspace/benchmark/video/PI-4823b76e-bd16-476e-bf97-29327882a4a0.mp4",
+ "answer": "The video begins with an intense soccer match where a player in a red jersey is seen skillfully dribbling the ball down the field, closely pursued by two players in dark jerseys. Demonstrating remarkable agility, the player in red evades the defenders and advances towards the goal area. The scene transitions to a close-up of the player in red sprinting with the ball, accompanied by a timer overlay starting at 0:00 and counting up in seconds, indicating the duration of the play. As the player approaches the penalty area, a goalkeeper in a green jersey rushes out to intercept. The player in red deftly maneuvers past the goalkeeper and takes a decisive shot at the goal, successfully scoring. The timer shows 0:06 as the player in red celebrates the goal with visible excitement. The video continues with a close-up of the player in the red jersey celebrating, with the text 'GOAL' prominently displayed in large white letters and '0:06' in green below it, emphasizing the time taken to score. The background reveals a blurred crowd of spectators, adding to the celebratory atmosphere, while the NWSL (National Women's Soccer League) logo is visible in the top left corner, indicating the league in which the match is taking place. Meanwhile, the audio features a female speaker with a neutral or slightly excited emotional tone, discussing an opportunity and mentioning someone who is progressing in a nuclear-related endeavor, expressing a sense of accomplishment and affirmation about the individual's success. The video wraps up with the player continuing to celebrate, capturing the joy and triumph of the moment, while the audio conveys a focused and positive atmosphere."
+ },
+ {
+ "video_id": "PI-b123a80d-f09d-4273-951f-00a7f852c437",
+ "path": "/workspace/benchmark/video/PI-b123a80d-f09d-4273-951f-00a7f852c437.mp4",
+ "answer": "The video features a close-up of a hand with exquisitely trimmed nails, a shiny bracelet, and a ring on the ring finger. The background is soft and blurred, suggesting an outdoor environment with ambient lighting, possibly near a window or a well-lit area. The main focus of the video is the display of the hand and various rings, with the hand moving gently to highlight each ring from different angles. Each ring is displayed in sequence, with a number appearing in the upper left corner indicating the order from 1 to 7. The designs of these rings vary, including a diamond-cut chain ring, a cushion-cut ring with a halo, and a three-stone ring with a large diamond in the center. Throughout the video, the following text appears on the screen: \"Boyfriend test tag him to see if he knows which ring you will choose.\" This playful challenge invites the audience to guess, together with their partners, which ring style they would pick. The background music is a catchy tune with lyrics along the lines of: \"Sorry, you seem to love me very much. You seem to want... Baby, I don't blame you for looking me up and down.\" The tone of the video is relaxed and flirtatious. The core theme is exploring different styles of engagement rings in a fun and interactive way, encouraging couples to connect over their preferences and choices."
+ },
+ {
+ "video_id": "PI-dd298539-77ad-40c1-910f-c3b9d0d47479",
+ "path": "/workspace/benchmark/video/PI-dd298539-77ad-40c1-910f-c3b9d0d47479.mp4",
+ "answer": "The video takes place in a vibrant garage environment, accompanied by an instrumental background score characterized by a fast-tempo electronic rhythm and dynamic synthesizer melodies. The energetic and intense musical tone effectively conveys a sense of urgency and excitement, reinforcing the high-energy ambiance of the scene. The opening shot features a close-up of the car's interior, focusing on the dashboard, which includes a radio system, various buttons and knobs, and a smartphone mounted on an air vent clip displaying a music application. A person wearing yellow gloves is seen manipulating the gear shift and reaching toward the dashboard, seemingly making adjustments. The scene then transitions to an exterior view, where the same individual, now holding a yellow-colored tool, moves around the vehicle, which is parked in a graffiti-covered garage prominently marked with the phrase \"RAGE ZONE.\" The camera captures multiple perspectives of the car’s exterior while also offering glimpses into the interior through the windows. As the individual continues inspecting or working on the vehicle, the camera pans to reveal more elements of the garage setting, including a door and additional graffiti artwork. The video concludes with a shot showcasing the front and side of the car, highlighting further graffiti such as the word \"CHEATER\" and a heart symbol, while the person crouches near the front, likely conducting additional inspection or interaction."
+ },
+ {
+ "video_id": "PI-7f2f7686-886f-42a8-a51c-383ac8b40582",
+ "path": "/workspace/benchmark/video/PI-7f2f7686-886f-42a8-a51c-383ac8b40582.mp4",
+ "answer": "The video begins in a car, where a man sits in the driver's seat, wearing a black leather jacket and grey trousers. The interior of the car is black and the sunroof is visible. Beside him, a lady wearing a white sweater with pink details and grey trousers holds a bunch of red and pink flowers wrapped in pink paper, as well as a large pink heart-shaped pillow with the words \"Will you be mine?\" written on it in red. The scene switches to an exterior view of the car, showing the woman standing beside the driver's side door and looking at the car. She then opens the driver's side door and gets back into the car, still holding the flowers and the pillow. When she hands him the flowers and the pillow, the man looks at her with a surprised expression. The woman smiles as if explaining something, while the man, still holding the gifts, looks at her in disbelief, both surprised and amused, gesturing with his hand as if trying to understand or respond. The woman continues to smile and talk while the man listens attentively, the flowers and pillow still in his hands. At the end of the video, the woman leans back in her seat, still smiling, while the man looks at her as if understanding what she is saying. The core theme of the video is a touching moment in which a woman gives a man a thoughtful gift, eliciting a surprised and happy response from him."
+ },
+ {
+ "video_id": "PI-469d7af5-d7bc-4d88-b8b8-7ede6d746085",
+ "path": "/workspace/benchmark/video/PI-469d7af5-d7bc-4d88-b8b8-7ede6d746085.mp4",
+ "answer": "In this compelling integration of audio and video elements, a narrative unfolds that interweaves emotional depth with visual storytelling. The audio features songs performed by a man and a woman, both conveying a mysterious emotion. The male singer eagerly calls out someone to speak, while the female voice echoes a longing for connection, suggesting moments of emotional tension or longing between two people. This emotional background is reflected in the video, which sees a woman helping a man with a facial injury walk across the street. The screen displays the text \"Say Something\" in the upper left corner, emphasizing the unspoken words between them. Surrounded by houses and lush greenery, the man's difficult walking posture, set against the headphones he wears, hints at his inner struggles. As the scene shifts, the woman leads him up the stairs, revealing a vibrant streetscape with many houses, a car and a thriving green belt, symbolizing the continuity of life despite personal difficulties. The man's expression mixes pain and courage, suggesting a deep inner struggle, possibly contemplating giving up but finding resilience in the face of adversity."
+ },
+ {
+ "video_id": "PI-64f5c9d6-246c-46d6-8ee0-5a3ac8571f56",
+ "path": "/workspace/benchmark/video/PI-64f5c9d6-246c-46d6-8ee0-5a3ac8571f56.mp4",
+ "answer": "The video begins with an animated scene where a green elf wearing a blue headscarf and a golden accent stands next to a bald man in a black and white costume, the elf emitting a glowing blue aura. The man made a gesture as if he was making a wish, and the words \"taller\" appeared to indicate that he wanted to be taller. The whimsical moment transformed into a real-life basketball game, with the men in the video wearing various jerseys, including white, red, black and yellow, making impressive dunks. During one of the dunks, the words' Magic'and a little bunny holding a bat appeared in the animation, adding a touch of playfulness. The video highlights the athleticism of the players, particularly highlighting one player in a yellow jersey and another in a red jersey with \"James\" and number 6. The audio, meanwhile, features an unclear emotional tone of one male speaker and another male speaker expressing joy, reciting lines about wishing for various things like growing taller, being a player, having a rabbit in his hat. The male speaker's presentation is accompanied by cheerful and lively ambient music, creating a playful and imaginative atmosphere that complements the visual elements of the video."
+ },
+ {
+ "video_id": "PI-3f0f1c7a-1914-452f-884e-7ffff74dc28e",
+ "path": "/workspace/benchmark/video/PI-3f0f1c7a-1914-452f-884e-7ffff74dc28e.mp4",
+ "answer": "This dramatic video captures the terrifying moment a massive landslide tears down a steep hillside. The scene begins with a small section of earth collapsing near a set of stairs, but it rapidly escalates into a powerful deluge of soil and debris. The landslide surges down the slope with incredible force, completely engulfing and destroying several staircases that were built into the hill. At the base of the slope, several houses stand in the path of the potential destruction, including one with a prominent red roof. In the foreground, a group of people are seen running for their lives, trying to escape the imminent danger. The sheer scale and speed of the landslide are staggering, as it consumes a large portion of the landscape and creates a massive cloud of brown dust. "
+ },
+ {
+ "video_id": "PI-6d3e2937-62ad-43d0-bff8-edecba7a7ad2",
+ "path": "/workspace/benchmark/video/PI-6d3e2937-62ad-43d0-bff8-edecba7a7ad2.mp4",
+ "answer": "In this brief yet impactful piece, a neutral-toned male speaker delves into the likelihood of government actions, suggesting that the government possesses the capability to undertake certain measures. While the audio content maintains a calm and neutral atmosphere with steady background music, the video interpretation elevates the message to a profound warning. It urges the audience, particularly the younger generation referred to as \"children,\" not to blindly accept claims that the government won't engage in specific actions. The core idea emphasizes the importance of skepticism and vigilance towards governmental conduct, highlighting that when someone defends the government's actions by asserting it won't take certain steps, it's entirely plausible that it might. This perspective reflects a deep-seated distrust towards authority, encouraging independent thinking, questioning official narratives, and maintaining a critical eye on power dynamics. Despite its brevity, the message is potent, aiming to awaken awareness about potential governmental overreach or misconduct, and advocating for a clear-minded and questioning spirit. Its value lies in reminding the public, especially in an era of information overload, that critical thinking is paramount for comprehending social and political realities."
+ },
+ {
+ "video_id": "PI-e9df1ac8-ca22-4d40-9f9a-28fb4b0e1ee2",
+ "path": "/workspace/benchmark/video/PI-e9df1ac8-ca22-4d40-9f9a-28fb4b0e1ee2.mp4",
+ "answer": "This video captures two individuals performing music in a residential kitchen. On the left, a man wearing a black t-shirt, black shorts, and a black baseball cap with an orange SF logo plays a brown ukulele. He is focused on the instrument, accompanying the person next to him, gently swaying to the rhythm. On the right, another man dressed in a black and white striped t-shirt, distressed blue jeans, a white belt, and gold accessories sings along to the music. He gestures with his hands and displays animated expressions during the performance. The kitchen features white cabinets and countertops, with a light-colored door and wooden floor in the background. Throughout the recording, the camera remains static, providing a clear and steady view of the performance. The overall atmosphere is light-hearted, joyful, and full of energy.\n"
+ },
+ {
+ "video_id": "PI-8ce0bc7c-fa27-4f4c-ba28-2749d7acc8d6",
+ "path": "/workspace/benchmark/video/PI-8ce0bc7c-fa27-4f4c-ba28-2749d7acc8d6.mp4",
+ "answer": "The video features a playful scene in a cozy, modern room with wooden walls and a tiled floor, where a white dog with a black collar lies calmly on a beige rug, observing a black monkey wearing a pink diaper. The monkey is the focal point, energetically climbing and swinging on a rope attached to a pole, showcasing its agility and playful nature. The room's glass door allows natural light to flood in, enhancing the lively atmosphere. Meanwhile, the audio captures a female speaker expressing admiration with a confident and enthusiastic tone, describing someone as a legend and emphasizing their significance in the present moment. Her words, 'She's an icon, she's a legend, and she is the moment come on now,' are delivered with a celebratory acknowledgment, accompanied by ambient music with a lively and uplifting tempo. This combination of visual and auditory elements creates a vibrant and engaging scene, highlighting the interspecies interaction and the speaker's enthusiastic praise."
+ },
+ {
+ "video_id": "PI-62844a26-fb98-434d-be85-16cad5457140",
+ "path": "/workspace/benchmark/video/PI-62844a26-fb98-434d-be85-16cad5457140.mp4",
+ "answer": "The video shows a woman wearing a light gray sweatshirt and a light blue headband standing inside a room with white walls and white-framed windows. The woman has long black hair and speaks with gestures and slight movements, giving off a casual and humorous tone. Soft natural light indicates that it is daylight. The female speaker humorously criticized the film's unrealistic portrayal of post-intimate moments and contrasted it with her own actual experience. She describes behaviors such as calling for towels or rushing into the bathroom to avoid discomfort, highlighting the need for hygiene in cuddling and sleeping scenes often portrayed in films. Her relaxed and frank monologue highlights concerns about urinary tract infections, adding a relevant and interesting perspective to the discussion."
+ },
+ {
+ "video_id": "PI-ce5b885a-1661-4575-aa61-eebca8745e83",
+ "path": "/workspace/benchmark/video/PI-ce5b885a-1661-4575-aa61-eebca8745e83.mp4",
+ "answer": "The video depicts a person seated inside a car, clad in a gray jacket with a white fur-lined hood, their hair styled in a braid, and sunglasses perched atop their head. The car's interior details, such as the seats, window, and door panel, are clearly visible. As the individual speaks, they employ hand gestures for emphasis, occasionally glancing around and pointing towards the car door and window. Outside the vehicle, a brick building and a parked white car are noticeable, with daylight illuminating the scene through the windows, suggesting a parking lot or street setting adjacent to commercial buildings. Concurrently, the audio features a female speaker in a contemplative and slightly bewildered tone, pondering the nature of space and physical dimensions. He draws a comparison between a car and a bus, questioning how both can occupy the same road space despite their differing internal configurations. This introspective monologue conveys a blend of curiosity and mild frustration, delving into the perception and allocation of space without any accompanying music or ambient sounds."
+ },
+ {
+ "video_id": "PI-08282db3-c39d-422b-a91d-75af3433b99b",
+ "path": "/workspace/benchmark/video/PI-08282db3-c39d-422b-a91d-75af3433b99b.mp4",
+ "answer": "In the video, a person is wearing a black sweatshirt with red text and red patterns on it, standing in a room with a wooden floor and a white door as the background. Text \"If you were born in...\" The words appear at the top of the screen, followed by the orange font \"June\". This person uses gestures when speaking. The audio says, \"If you were born in January, February, March, April, May, June, July, August, September, October, November, December or Pagemian, then this is for you.\" Then the scene switches to the same person kneeling on the floor, facing an orange cat. Text message: \"Well, this is for you...\" It appears on the screen. This person placed one hand on the floor. The cat stood in front of them, looking up. The background shows a room undergoing some form of construction or renovation, with ladders and some building materials visible. The end of this audio is, \"Only in this way can I breathe.\" This video seems to be a personal or informal Vlog-style clip. Considering the unusual word \"Pagumiene\" and the casual background, it might have a playful or humorous intention."
+ },
+ {
+ "video_id": "PI-32f64cc1-2203-4fea-980d-3daceade7cf4",
+ "path": "/workspace/benchmark/video/PI-32f64cc1-2203-4fea-980d-3daceade7cf4.mp4",
+ "answer": "The video introduces a privacy case for smartphones with a cheerful and enthusiastic female voice guiding the audience through its features. It starts with a person expressing amazement while holding a phone with a green case, then shifts focus to a gold-colored phone before presenting the black privacy case. The text 'Meet the Privacy Case!' appears, highlighting the case's key feature of making the screen look black from the sides, ensuring privacy with the text 'Now no one can read your texts'. The video demonstrates the case's magnetic clipping mechanism, emphasizing its convenience. Additionally, it highlights the anti-yellowing feature and camera protection, enhancing the product's appeal. The video wraps up with a display of the case in various colors and a call to action, 'CLICK TO GET IT NOW', encouraging viewers to purchase the product, all while upbeat ambient music creates a lively and engaging atmosphere."
+ },
+ {
+ "video_id": "PI-4d195ab7-1c7a-4519-9181-54f76c2f476d",
+ "path": "/workspace/benchmark/video/PI-4d195ab7-1c7a-4519-9181-54f76c2f476d.mp4",
+ "answer": "The video captures a heartwarming moment in a cozy living room where a woman, dressed in a dark blue t-shirt and matching pants, is sitting on a light-colored couch, holding a baby dressed in a white onesie with light-colored patterns. The woman is seen interacting affectionately with the baby, who is holding a small object, possibly a toy or a spoon. The background features a wooden coffee table with a pink baby bottle, a bed with light blue bedding, a wall-mounted TV, and some decorative items including a painting and small plants, creating a relaxed and homely atmosphere. The audio complements the visuals with the woman's cheerful voice, commenting on the baby's reflexes, saying, \"Your reflex is okay,\" and playfully suggesting, \"One, we do it again,\" followed by hearty laughter. This interaction highlights the joyful bonding and care between the woman and the baby, making the scene both tender and delightful."
+ },
+ {
+ "video_id": "PI-7c06f89e-9ab9-4cb2-a722-d7465daf2d65",
+ "path": "/workspace/benchmark/video/PI-7c06f89e-9ab9-4cb2-a722-d7465daf2d65.mp4",
+ "answer": "At the beginning of the video, a person is taking a selfie in front of the mirror with a light pink phone case. They are wearing a black shirt with white patterns. The text on the screen reads \"Some of my random interests\", indicating that the video will showcase various hobbies or collections. The camera switches to a room decorated with goods of \"Freddy's Five Nights at Freddy's\". There is a large poster hanging on the wall, which reads, \"Can you survive the five nights in Freddy?\" It reads \"Celebrate!\" It is prominently displayed. Below the poster, there are two plush toys similar to game characters. On the adjacent wall, there are red handprints, adding a sense of horror. The next scene is the vibrant display of the \"Magic Ladybug\" product. The room was bathed in purple light, and the shelves were filled with character dolls, plush toys and box sets related to the animation series. Posters of the characters in the play decorate the walls, and a huge image of the characters in the play is prominently displayed. Then the video switches to the series with the theme of Hello kitty. The pink shelves are filled with Hello Kitty plush toys, small statues and other goods. Two large Hello Kitty plush toys are placed on the wall. There are large Hello Kitty patterns on the wall, surrounded by smaller Hello Kitty items, including a container and other plush toys. The last scene showcases a wide variety of plush toys, including characters from popular media and original designs. Toys are densely piled up on the shelves, presenting a wide variety of colors and styles. This video effectively highlights the different interests of this person through their extensive collection of \"Five Nights at Freddy's\", \"miracle Ladybug\", Hello Kitty and other plush toys."
+ },
+ {
+ "video_id": "PI-fb295a75-c314-4b4a-9d70-af8a938c813c",
+ "path": "/workspace/benchmark/video/PI-fb295a75-c314-4b4a-9d70-af8a938c813c.mp4",
+ "answer": "This video, shot from a high angle, captures a moment between a pet owner and their three pigeons in a carpeted room. The owner can be heard speaking to the pigeons in a mock-annoyed tone, repeatedly telling them to \"go.\" There are three pigeons in total: one brown and white, one mostly white, and one with uniquely pink feathers. They calmly wander across the gray carpet, seemingly oblivious to the instructions. The owner's hand occasionally appears in the frame, making shooing gestures. The owner expresses their own helplessness, stating the pigeons \"know how ridiculous their behavior is better than anyone.\" In the background, a pet enclosure decorated with artificial flowers is visible. The scene is accompanied by light, playful music, creating a humorous contrast between the owner's exasperation and the pigeons' indifference"
+ },
+ {
+ "video_id": "PI-271ff680-b8b2-42d5-849b-15bf6757ebbe",
+ "path": "/workspace/benchmark/video/PI-271ff680-b8b2-42d5-849b-15bf6757ebbe.mp4",
+ "answer": "In the video, a person is demonstrating how to use a pink laundry apron with a red border and is marked as \"Find Laundry Apron on Amazon Home\". This person is wearing a black hat, a black shirt and blue shorts, and is standing in front of the washing machine. This scene begins with a person lifting an apron, then putting it on and adjusting it in the appropriate way. The background includes a white door and a tiled floor, creating a clean and simple environment. Then, this person bent down to open the washing machine and began to put the clothes into the apron, which was like a portable laundry basket. The apron was piled up with all kinds of clothes, some white and some grey. There is a close-up shot in the video. This person is holding a white shirt hanging on the hanger and is about to put it into the apron. There is also a room in the background with white walls, a door and a white folding chair with black straps. There was a casual and friendly voice in the audio saying, \"What kind of brother is that, honey? Do you know?\" A relaxed and conversational tone was added to the demonstration. The theme of the entire video is to demonstrate the practicality and convenience of the laundry apron, which is used for collecting and transporting clothes."
+ },
+ {
+ "video_id": "PI-1e1222e2-4be2-4314-a0ba-6c44f3eb22ff",
+ "path": "/workspace/benchmark/video/PI-1e1222e2-4be2-4314-a0ba-6c44f3eb22ff.mp4",
+ "answer": "The protagonist of this video is a person with long dark hair, wearing a grey shirt with a colorful logo on it. She is in a room with a white door in the background. This person is carefully combing their hair with a black comb, with a focus on the specific part near the parting line. Their nails are very long and painted in light colors, perhaps white or cream. There are some tattoos on her fingers. This person seems to be trying to remove something from her hair because she regrets her actions, perhaps because it is close to the dividing line. They mentioned that this was the only comb they had on them. Throughout the video, this person continued to comb his hair firmly and occasionally adjusted the comb. She doubted whether the comb would work and said, \"Shoot!\" I don't think this will work. Damn it! Maybe. \"God\" A moment that may indicate hope or acceptance. Ultimately, this person concluded that this might not work and decided to stop, saying, \"No.\" Ok. I have to stop. The entire scene evokes a moment of grooming or solving a hair problem, with this person showing a mixture of determination and frustration."
+ },
+ {
+ "video_id": "PI-fa7225c9-ad39-41e3-90fe-c6ea0f1b1de7",
+ "path": "/workspace/benchmark/video/PI-fa7225c9-ad39-41e3-90fe-c6ea0f1b1de7.mp4",
+ "answer": "The video opens on a close-up of a bespectacled man with curly brown hair.A female speaker with an indeterminate emotional tone is speaking in a setting accompanied by ambient music with a neutral tempo and atmosphere. Singing\"Let me show you how proud I am to be yours.This dress lies haphazardly thrown on the ground,But in your eyes, it's still beautiful.Beautiful just for you.\" As he stares directly into the camera, he takes a deliberate sip from a glass of waterHe begins silently mouthing the lyrics to the background song with increasing intensity, his expression shifting from neutral to impassioned. Subtitles display \"let me show you how proud I am to be yours\" has to be the hottest thing anyone has ever said in a song. \" during this sequence. The camera then pulls back to reveal a wide shot of the hallway. He shakes his head slowly before dropping to one knee in a posture of awe, visibly stricken by the music's emotional power."
+ },
+ {
+ "video_id": "PI-495cf2a1-d5b3-470f-80b2-f299bce43708",
+ "path": "/workspace/benchmark/video/PI-495cf2a1-d5b3-470f-80b2-f299bce43708.mp4",
+ "answer": "The video features a person sitting inside a car, wearing a light gray cap with the word 'NAVY' written on it and a light gray zip-up jacket. The background consistently shows the interior of the car, including the headrests and part of the window. The person appears to be speaking throughout the video, using hand gestures to emphasize their points, indicating they are engaged in a conversation or monologue, possibly expressing strong opinions or emotions. The lighting is natural, suggesting it is daytime. Meanwhile, the audio features a male speaker expressing anger and frustration, criticizing the use of military time and mocking individuals who use it as pretentious. He sarcastically ridicules someone for acting superior, referencing their attire and past experiences, including being in ROTC with poor performance. The tone throughout is confrontational and dismissive, with no background music present. The video maintains a consistent setting and tone, focusing on the individual's expressive delivery and the intensity of their speech."
+ },
+ {
+ "video_id": "PI-de38a476-6ab0-4c2c-80bc-f99095b54080",
+ "path": "/workspace/benchmark/video/PI-de38a476-6ab0-4c2c-80bc-f99095b54080.mp4",
+ "answer": "This video compilation captures moments of actress Millie Bobby Brown and her partner, Jake Bongiovi, at various red carpet events. It begins with an interaction where a photographer refers to Jake as her boyfriend. Millie calmly corrects him, saying, \"He's not my boyfriend,\" suggesting a more significant relationship. This is followed by a montage of the couple looking affectionate at different premieres and award shows. They are shown holding hands, embracing, and kissing for the cameras. Throughout the clips, they appear in different formal outfits; Millie wears a variety of elegant gowns, including a shimmering gold dress and a sleek black one, while Jake is consistently dressed in black suits and tuxedos. The backgrounds change, showing press walls for events like the premieres of \"The Electric State\" and \"Damsel,\" highlighting their numerous public appearances together."
+ },
+ {
+ "video_id": "PI-91314f3b-7c8d-424a-a5ae-acc2c22abee8",
+ "path": "/workspace/benchmark/video/PI-91314f3b-7c8d-424a-a5ae-acc2c22abee8.mp4",
+ "answer": "The video captures a lively scene in a gymnasium during a basketball game, where the camera pans across a row of spectators seated on bleachers. The audience is diverse, including people of various ages, from children to adults. A notable moment features a young child with blonde hair, wearing a gray sweater, sitting between two older individuals. The child appears intently focused on the game, occasionally glancing around, embodying the text overlay that humorously highlights, 'Think we found the world's youngest basketball fan.' The background reveals other spectators, some in red and gray clothing, and a partially visible scoreboard, contributing to the bustling atmosphere. While the audio content is unclear, with indistinguishable English speech and no discernible emotional tone or background music, the video effectively conveys a candid moment of community and sports fandom. The focus on the child amidst the lively environment underscores the endearing and humorous essence of the scene, capturing the essence of shared excitement and engagement in the event."
+ },
+ {
+ "video_id": "PI-c05e8574-a762-474d-bc29-db666f8b2d37",
+ "path": "/workspace/benchmark/video/PI-c05e8574-a762-474d-bc29-db666f8b2d37.mp4",
+ "answer": "The video begins in a cozy, well-lit bedroom where a person, dressed in a black t-shirt and black shorts, is seen sitting on a bed holding a pair of jeans. The room is neatly arranged with a white ceiling fan, a white nightstand with a lamp, and a bed covered with a plaid comforter. The person then gets up and walks towards a bathroom, which features a white shower curtain and a tiled wall. In front of the mirror, they pull up their black t-shirt to reveal their midriff, seemingly checking their appearance. The scene transitions to a spacious living room where the person starts exercising in front of a TV displaying a workout video. The living room is warmly decorated with a stone fireplace, wooden cabinets, a large rug, and a coffee table with various items. The person engages in stretches, performs push-ups on the rug, and stands while raising their arms, now wearing wrist weights. Throughout these scenes, the audio plays a vital role in setting the tone; it features energetic background music with rhythmic and melodic elements that create an engaging and dynamic atmosphere, complementing the person's active workout routine. The video wraps up with the person continuing their exercise regimen in the living room, embodying a sense of dedication and vitality."
+ },
+ {
+ "video_id": "PI-7d11c8d3-894d-4fa4-a4da-e57e8c63700e",
+ "path": "/workspace/benchmark/video/PI-7d11c8d3-894d-4fa4-a4da-e57e8c63700e.mp4",
+ "answer": "This video offers a unique, behind-the-scenes perspective at a Super Bowl celebration. Positioned on the edge of the football field, a staff member is shown continuously loading a large, black confetti cannon. He repeatedly takes handfuls of teal-colored confetti and stuffs them into the machine's opening. The cannon, powered by adjacent gas canisters, is actively firing, creating a loud hissing sound as it propels the confetti high into the air. The small, rectangular pieces of paper then float down, blanketing the field and the cheering crowd in the stadium. In the background, other event personnel, including photographers and staff in purple vests, can be seen on the field, while the stands are packed with spectators."
+ },
+ {
+ "video_id": "PI-877ead98-b389-4cad-bbf2-21ed41413859",
+ "path": "/workspace/benchmark/video/PI-877ead98-b389-4cad-bbf2-21ed41413859.mp4",
+ "answer": "The video and audio together depict a conflict scene in a supermarket aisle. The video shows a curly-haired individual in a black sweatshirt and grey sweatpants pushing a shopping cart and suddenly covering his face, seemingly startled. The audio reveals a heated exchange between this customer and a female store employee. The customer, sounding frustrated and defensive, denies making a loud noise, suggests checking surveillance cameras, and demands an apology. The employee, assertive and exasperated, insists the customer leave if not responsible and eventually offers a reluctant apology. The video visually captures this argument through character interactions and text overlays, while the audio provides the dialogue, together presenting a vivid portrayal of the conflict and its resolution through (albeit reluctant) reconciliation."
+ },
+ {
+ "video_id": "PI-6a710e4d-93e3-4425-8513-700a2e396bcc",
+ "path": "/workspace/benchmark/video/PI-6a710e4d-93e3-4425-8513-700a2e396bcc.mp4",
+ "answer": "The video begins with a person standing next to a large, dark-colored truck with an RV attached to it, dressed in a black long-sleeve shirt with 'OUTDOOR ADVENTURES' written on the back and black pants. They place one hand on the truck's tire and the other on their hip before walking towards the RV. The text 'Extreme RV' appears in a yellow box at the top of the screen as the person approaches a large black door and opens it, revealing a modern interior with beige seats, a kitchen area, and a driver's seat. The person climbs a set of steps and enters the RV, showcasing its features. The video concludes with the person lying on a large bed in the RV, demonstrating the spacious and comfortable sleeping area. The accompanying BGM is a purely instrumental track with a modern and upbeat tone, enhancing the engaging and dynamic atmosphere of the video through its rhythmic and melodic elements, perfectly complementing the visual tour of the RV."
+ },
+ {
+ "video_id": "PI-464f40ac-2942-49cd-bda4-846294028d5b",
+ "path": "/workspace/benchmark/video/PI-464f40ac-2942-49cd-bda4-846294028d5b.mp4",
+ "answer": "The video features a baby lying on a bed, wrapped snugly in a white blanket adorned with small red floral patterns, with a pacifier shaped like a cat in their mouth. The background includes a patterned pillow with a mix of floral and geometric designs in warm colors, creating a cozy and endearing atmosphere. The text 'if you didn't have baby fever, now you do' appears at the top of the screen, suggesting that the sight of the baby will evoke a strong desire for having a baby in viewers. The baby remains mostly still, with slight movements, and the camera angle remains consistent, focusing on the baby's upper body and face, emphasizing their cuteness. Meanwhile, the audio features an unidentified speaker with an unclear gender and emotional state, expressing fragmented and unintelligible speech that lacks coherence and clarity, making the intent or context difficult to discern. The overall experience combines a visually calming scene with an audio track that remains ambiguous and disconnected from the visual content."
+ },
+ {
+ "video_id": "PI-f8c701d0-0d9a-4e5d-b57e-21bf6f4ed195",
+ "path": "/workspace/benchmark/video/PI-f8c701d0-0d9a-4e5d-b57e-21bf6f4ed195.mp4",
+ "answer": "A woman films herself outdoors on a snowy day, standing in front of a building. Snowflakes visibly dust her two-toned hair and green vest as she speaks directly to the camera. The video's on-screen text identifies her intended audience as \"Dear White People.\" She poses a question, asking that if the video has reached a white person who does not have a problem with people who look like her, whether they can be friends. Her delivery is calm and direct. As she speaks, animated words corresponding to her speech appear at the bottom of the screen. In the background, the scene is wintry, with snow covering the ground, and another individual is briefly seen walking into the building behind her. She concludes by reiterating that she is simply trying to find out if such friendships are possible."
+ },
+ {
+ "video_id": "PI-357a9c47-5c99-40ea-9aa5-73d73936d2b1",
+ "path": "/workspace/benchmark/video/PI-357a9c47-5c99-40ea-9aa5-73d73936d2b1.mp4",
+ "answer": "The video features a person wearing a blue and green striped shirt, sitting outdoors under a wooden structure with a clear sky and some greenery visible in the background.The individual has long, straight hair and is wearing small, dangling earrings.They are holding a lip liner pencil near their lips, seemingly applying or demonstrating it.The person has visible tattoos on their fingers and hand, including a small cross and some other designs.The text overlay on the video reads, \"Did Hailey Bieber Just Confirm Rhode Lipliners??? 👀💋\", suggesting a discussion or speculation about Hailey Bieber possibly confirming or endorsing Rhode lip liners.The audio content includes a soft, indistinct background conversation or music, but the primary focus is on the visual demonstration of the lip liner.The overall mood of the video appears casual and relaxed, with the person occasionally looking at the camera and adjusting their hair."
+ },
+ {
+ "video_id": "PI-aab8e526-05b8-401a-8027-ec760403af60",
+ "path": "/workspace/benchmark/video/PI-aab8e526-05b8-401a-8027-ec760403af60.mp4",
+ "answer": "The video features a woman with long, straight black hair, dressed in a black zip-up top and a gold necklace with a blue pendant, standing in a well-lit room with a lamp and a mirror visible in the background. She holds a contour stick in her right hand, showcasing her bright red nail polish, and the product's packaging in her left hand. The scene is clean and minimalistic, focusing on her demonstration of the contour product as she holds it close to her face to display the shade. Meanwhile, the audio track features a female speaker expressing frustration and disappointment, critiquing the contour product from a brand established since 1997. She emphasizes her dissatisfaction with the product's quality, stating, 'Honestly has literally been around since nineteen ninety seven twenty something years... Twenty eight years in the game, and you mean to tell me this is all you could give. Wow get in that lab and do it quickly.' Her assertive and critical tone underscores the urgency for the brand to innovate and improve, creating a cohesive narrative that highlights both the product demonstration and the speaker's call for better quality."
+ },
+ {
+ "video_id": "PI-e14cd183-429a-4165-9bc1-d9486930755d",
+ "path": "/workspace/benchmark/video/PI-e14cd183-429a-4165-9bc1-d9486930755d.mp4",
+ "answer": "In the video, we observe a sequence that begins with a close-up of a black card placed on a wooden surface. The card poses an intriguing question related to the Harry Potter series, reading, \"After ______ Hermione knew she had to choose Ron,\" with a blank space underlined to indicate a missing word or phrase. As the scene develops, it transitions to a different setting on the floor where four white cards are laid out, each bearing the title \"Cards Against Muggles.\" A gray cat is seen curiously sniffing or inspecting these cards. During this time, audio captures a female speaker with an indeterminate emotional tone briefly uttering the word \"She's,\" contributing to a sense of incomplete or interrupted speech. The ambient sound features a neutral tempo, creating a subtle and undisturbed atmosphere that complements the visual. A hand appears, interacting with the cat and the cards, pointing at them before picking one up. The following scene shows several white cards spread on the wooden floor, with one prominently reading \"Harry's lesbian haircut in movie 5,\" accompanied by a small icon of a person with glasses. Other cards are partially visible, displaying texts that are not fully readable, while the gray cat continues to observe the scene, adding a touch of everyday curiosity to the unfolding interaction."
+ },
+ {
+ "video_id": "PI-d605563e-c766-4041-95b4-47d1384e4adb",
+ "path": "/workspace/benchmark/video/PI-d605563e-c766-4041-95b4-47d1384e4adb.mp4",
+ "answer": "At the beginning of the video, a person wearing a pink shirt and denim overalls waves to the camera. The background is colorful decorations and a visible logo of the letters \"ABC\". Text message: \"Hi, friends!\" \"@becky_weiss_\" appears at the bottom of the screen. Then the scene switches to another person in the factory, wearing a black jacket with a logo on the chest. The text \"Not interested, Sir\" is displayed in yellow at the top of the screen. This person is holding a sign with a dog in a suit on it, on which is written \"Midas Team\". The background is other workers and factory equipment. This person continued to hold the sign, emphasizing its quality. At the end of the video, that person was still holding the sign. The words \"Contact Us\" and \"whatsapp Number\" were yellow, and the words \"Send an email there\" following them were also yellow. The LCsign logo is visible in the lower left corner of the entire clip. Overall, this video is discussing some detailed issues or points that need attention regarding the factory's production of this product."
+ },
+ {
+ "video_id": "PI-717419b7-84ba-4d01-b8a7-127868321f16",
+ "path": "/workspace/benchmark/video/PI-717419b7-84ba-4d01-b8a7-127868321f16.mp4",
+ "answer": "The video features a person with vibrant red hair and black sunglasses resting on their head, wearing a gray shirt, sitting against a plain, light-colored wall with a white pillow in the background. The text 'Rant' is displayed at the top of the screen, setting the tone for the individual's expressive opinions or frustrations. Initially, the person remains mostly stationary, creating a focused and intimate setting. As the video progresses, they begin to gesture with their hands, moving them closer to the camera to emphasize their points. The audio is a rant track featuring a male speaker expressing frustration about people who follow thousands of others indiscriminately and then suggest friendship. The speaker criticizes such superficial social media interactions, highlighting that such behavior makes it seem like they are friends with everyone, yet they have many followers but few following. The consistent background and the text 'Rant' remain throughout, maintaining the theme of the video and creating an atmosphere of direct and personal communication."
+ },
+ {
+ "video_id": "PI-2b75e203-2d1f-45cb-8816-131055a51343",
+ "path": "/workspace/benchmark/video/PI-2b75e203-2d1f-45cb-8816-131055a51343.mp4",
+ "answer": "The audio features a male speaker offering a humorous take on how to identify a \"freak,\" suggesting that if someone searches for \"glass cleaners\" on TikTok and instead encounters only thirst traps, they might have a more significant issue at hand. This playful advice is echoed in the video, where a man dressed in a chestnut hoodie speaks directly to the camera in an ordinary, light-colored indoor setting. The text overlay on the video clearly states, \"How to tell you are a freak,\" as the speaker uses gestures to emphasize his point, maintaining a light-hearted and playful tone throughout. The video's focus remains squarely on the speaker, with minimal changes to the background, fostering an intimate and direct connection with the audience. Together, the audio and video create a humorous and engaging narrative around a whimsical topic, using relatable scenarios and gestures to convey the speaker's message effectively."
+ },
+ {
+ "video_id": "PI-76b03703-aa31-4fd4-8842-cf26fc39f6ca",
+ "path": "/workspace/benchmark/video/PI-76b03703-aa31-4fd4-8842-cf26fc39f6ca.mp4",
+ "answer": "The video takes place in a room with warm lighting, elegant decoration and yellow walls and wooden floors. There is a white bed sheet, a patterned sofa, a table with various items and a lamp in the room, creating a comfortable and inviting atmosphere. The protagonist is wearing a black leather jacket, a black and white striped shirt and light-colored trousers, and is holding a metal object similar to a trophy or a prize in his hand. They examined it carefully and then began to dance energetically in the room. The dance movements are dynamic and expressive. The individual rhythmically swings his arms and legs, and occasionally adjusts his hair and coat to be comfortable. Although the background music was not explicitly mentioned, it can be inferred that it is cheerful and lively, complementing the energetic dance performance. The video captured the joy and freedom of the dancers as they moved around the room, showcasing moments of celebration and self-expression. Without any dialogue, the audience's attention is focused on the visual and auditory elements, emphasizing the theme of unrestrained joy and the simple pleasure of dancing in a comfortable private space."
+ },
+ {
+ "video_id": "PI-77f9cac4-86f8-4901-b4de-0d086c2e761e",
+ "path": "/workspace/benchmark/video/PI-77f9cac4-86f8-4901-b4de-0d086c2e761e.mp4",
+ "answer": "The video unfolds in a warmly lit hallway adorned with wooden flooring, scattered red rose petals, and red heart-shaped balloons, creating a romantic and celebratory ambiance. A woman in a pink sweater and blue jeans is kneeling on the floor, holding a small child dressed in pink, near a wall featuring a large abstract painting. The text overlay reveals that the woman is holding the child back from running towards someone who is about to enter, as the child usually greets this person enthusiastically upon their return from work. In the background, a person dressed in dark clothing approaches, holding a bouquet of pink flowers, heightening the anticipation of a special moment or surprise. The audio captures the voices of a child and an adult, adding to the warmth and kinship of the scene. The video beautifully encapsulates the emotional and heartfelt atmosphere of this family moment, emphasizing the love and connection shared among them."
+ },
+ {
+ "video_id": "PI-ca64f59d-6ae7-44b6-b2db-0a7c46f39fb1",
+ "path": "/workspace/benchmark/video/PI-ca64f59d-6ae7-44b6-b2db-0a7c46f39fb1.mp4",
+ "answer": "The video centers on a young woman with long, straight blonde hair, dressed in a dark hoodie featuring a prominent light blue design. She is positioned in a simple indoor environment with light-colored walls and a visible ceiling vent, creating an intimate and casual atmosphere, likely a bedroom or personal space. Throughout the video, a persistent text overlay in the upper left corner displays a viewer's comment: \"Reply to J's comment: why did camilla leave bro they unfollowed her n everything,\" which serves as the focal point of the woman's address.In the accompanying audio, the woman's voice is calm and reassuring as she clarifies the situation regarding Camilla. She explicitly states that Camilla has not left and remains in the house. The audio delves deeper into the reasons behind the lack of updates, explaining that legal restrictions prevent them from posting or discussing the matter extensively. However, she emphasizes that there are no tensions or \"beef\" among the house members, assuring viewers that the situation is amicable. The woman conveys that everything is under control, and they merely need to maintain a low profile for a short period. She concludes her explanation with a promise that Camilla will return soon, instilling a sense of reassurance and anticipation.The video wraps up with the woman still in frame, continuing her heartfelt explanation. Her occasional hand gestures and hair touches, combined with the soft, even lighting, enhance the personal and sincere tone of the video, making the viewer feel directly engaged in the conversation."
+ },
+ {
+ "video_id": "PI-1dd286a9-7666-44c1-9dcf-86146ab10487",
+ "path": "/workspace/benchmark/video/PI-1dd286a9-7666-44c1-9dcf-86146ab10487.mp4",
+ "answer": "The video features a woman who has dedicated her year to rescuing ten uniquely named puppies: Eve, Jelly Bean, Einstein, Sunny, Olive, Dasher, Bucket, Vixen, Comet, and Blitzen, showcasing her compassion and unwavering commitment to their care. Through various settings, including a cozy indoor scene with a Christmas tree, an outdoor visit to a cemetery, a playful day at the beach, and moments in a modern home, the woman is seen holding, cradling, and interacting with the puppies, with close-ups highlighting their distinct features and playful antics, such as puppies bounding up stairs and Comet strutting along a sandy beach. Meanwhile, the audio track complements this heartwarming visual narrative with a girl speaking in a positive and pleasant tone, likely introducing her friends—possibly a metaphorical or direct reference to the puppies—by name, reinforcing the theme of love, care, and the joyful bonds formed through rescue and companionship. Together, the video and audio create a touching portrayal of the woman's dedication to these puppies and the happiness they bring into her life."
+ },
+ {
+ "video_id": "PI-57b9c608-355b-4539-ab7b-13125af59a27",
+ "path": "/workspace/benchmark/video/PI-57b9c608-355b-4539-ab7b-13125af59a27.mp4",
+ "answer": "The scene in the video seems to be a formal event or gathering, with a visible urban landscape in the background. It is very likely that it was shot in an urban environment. The main focus of the photo is a lady who is wearing a striking red dress with hollowed-out details and a thick gold necklace. Her long golden curly hair is a striking feature, prompting the text and conversations on the screen to comment that her hairstyle is similar to the style in the TV show \"Gossip Girl\". Text message \"Hello??!?\" What's the hairstyle of Gossip Girl? This theme is prominently displayed, emphasizing people's humorous or surprised reactions to her hairstyle. There were also ladies in green dresses and men in black suits and other attendees at the scene, adding a formal atmosphere. The background is filled with a mixture of people, suggesting that this is a social or celebration occasion. The dialogues in the play indicate that the audience is either surprised or amused by the fashion choices in the play, especially those hairstyles that remind people of \"Gossip Girl\". Overall, this video captures a moment of social interaction when fashion and hairstyles become the focus of the conversation, highlighting the influence of popular culture on personal style in formal events."
+ },
+ {
+ "video_id": "PI-5ee8c5f6-2232-4ef2-af1a-33d407735f60",
+ "path": "/workspace/benchmark/video/PI-5ee8c5f6-2232-4ef2-af1a-33d407735f60.mp4",
+ "answer": "The video features two individuals standing outdoors on a paved area surrounded by trees and greenery, creating a serene and natural setting. The person on the left, labeled 'Grandson 2005,' is dressed in a brown and green quilted jacket, a beige hoodie, light blue jeans, and white sneakers, while the person on the right, labeled 'Nana 1924,' wears a leopard print blouse, dark blue jeans, and black shoes. Initially, they stand side by side, facing forward, but soon turn to face each other and engage in a playful interaction, raising their hands as if to high-five or play a hand game. Their synchronized hand movements and smiles suggest a joyful game or dance, capturing a heartwarming moment between the grandson and his grandmother. The background music complements this scene with its warm and nostalgic instrumental tones, sung by male voices, expressing passion and love for life. Together, the visual and auditory elements highlight the deep bond and shared joy between the grandson and his grandmother, creating an endearing and memorable portrayal of family connection."
+ },
+ {
+ "video_id": "PI-835e0894-6f06-429a-850c-c056665f7f9f",
+ "path": "/workspace/benchmark/video/PI-835e0894-6f06-429a-850c-c056665f7f9f.mp4",
+ "answer": "The video features a heartwarming and joyful narrative centered around a golden retriever's affectionate routine at a local bodega. It begins with a woman in a black jacket with a red hood standing outside at night, setting the scene with the text, 'POV: my golden retriever cannot pass our local bodega w/out saying hi to his gf.' Inside the well-lit bodega, the golden retriever, on a leash, eagerly enters and explores, eventually sitting on a green 'DELI' mat and looking towards the shelves. The woman kneels to hug the dog affectionately, with the text 'me ordering him a burger in the background' highlighting her care. The dog, excited and engaged, stands on its hind legs, reaching up towards the counter in anticipation. The woman, now behind the counter, interacts with the dog, likely handing over the burger. The video wraps up with the woman outside at night, holding a white bag, and the text 'he's so happy 😂😂' appearing on the screen as the golden retriever follows her, clearly delighted. The lively BGM enhances the cheerful and warm atmosphere, perfectly complementing the woman's amusing anecdote about her dog's affectionate visits to the bodega."
+ },
+ {
+ "video_id": "PI-069807fd-4f1a-4a8b-942c-b6bf1603db0a",
+ "path": "/workspace/benchmark/video/PI-069807fd-4f1a-4a8b-942c-b6bf1603db0a.mp4",
+ "answer": "The video depicts an interior decoration scene where the camera initially focuses on debris and trash scattered on the ground. As the camera pulls back, it reveals a man dressed in a black hoodie, khaki trousers, and black gloves standing amidst unfinished walls and a cluttered room. The room features an old brown sofa with various items piled on it, red curtains adorning the window, and a vintage-style chandelier hanging from the ceiling. The man makes various gestures toward the camera, seemingly explaining or commenting on the state of the room, with one speaker adopting a casual tone tinged with sarcasm or humor. The scene then shifts to another man wearing a fluorescent green hoodie, black bib, and yellow beanie, who opens a white door leading to an outdoor area with a double-trough sink and some outdoor scenery. Upon opening the door, he expresses surprise, which is echoed by the other male voice in the audio exclaiming, \"What the hell?\" This combination of visual and auditory elements creates a vivid portrayal of an unexpected and somewhat chaotic interior decoration scenario."
+ },
+ {
+ "video_id": "PI-9fcf113b-e22d-4302-9d49-81226fa96df1",
+ "path": "/workspace/benchmark/video/PI-9fcf113b-e22d-4302-9d49-81226fa96df1.mp4",
+ "answer": "The video features a man standing on a sidewalk at a street corner under a clear blue sky, dressed in a dark suit, white shirt, and white sneakers. He holds a sign that reads, \"SHE'S NOT DRESSED LIKE A SLUT, YOU JUST THINK LIKE A RAPIST,\" making a powerful statement against victim-blaming and sexual violence. The man occasionally gestures with his hand, emphasizing his message. In the background, a hospital sign (a blue 'H' with an upward arrow) and a political sign on the ground reading \"FDT\" are visible. The street is relatively quiet, with a few vehicles passing by, including a white SUV and a black pickup truck, under an urban setting with a highway overpass visible. Meanwhile, the background music (BGM) features a male vocalist with a neutral and calm tone, delivering a spoken-word style message that addresses social issues with a cautionary and advisory emotion. The track's content is a direct and serious admonition against objectification and sexual violence, encapsulating a core message of respect and personal responsibility, which complements the man's protest in the video. The overall scene conveys a strong stance on social justice and personal accountability."
+ },
+ {
+ "video_id": "PI-83e01b7e-de52-40ca-ac9b-8add8337fc62",
+ "path": "/workspace/benchmark/video/PI-83e01b7e-de52-40ca-ac9b-8add8337fc62.mp4",
+ "answer": "The audio and video content converge to create a compelling narrative centered around a casual request for creative collaboration, set in a laid-back and informal atmosphere. The audio features a male speaker expressing a desire for someone to remix a track using a specific beat, possibly hinting at a New Orleans influence, and later mentioning plans to visit a store for a drink, emphasizing his eagerness for assistance and enjoyment. Meanwhile, the video showcases a person wearing an orange shirt and a gray cap, sitting in front of a microphone in a recording studio, adorned with a large, shiny pendant necklace. The text overlay, \"Can yall PLEASE remix this!\", clearly conveys the individual's passionate plea for someone to create a remix of a song or audio clip. The simple background of a plain wall and microphone setup underscores the focused recording environment, while the person's expression and the context reinforce the urgency and desire for creative reinterpretation. Together, the audio and video content effectively communicate the speaker's casual request for a remix and his enthusiasm for the creative process, inviting viewers to participate in the collaboration."
+ },
+ {
+ "video_id": "PI-008cb09e-fa4a-4881-8cf4-a2b76eff1371",
+ "path": "/workspace/benchmark/video/PI-008cb09e-fa4a-4881-8cf4-a2b76eff1371.mp4",
+ "answer": "In a well - crafted short video, the audio serves as the primary narrative thread, with a female blogger's voice taking center stage. This audio is from the video where she humorously pretends to rank her friends, known as the \"Bob House girls.\" She starts off in a seemingly serious tone, only to quickly reveal that it's all a joke. Through her words, she emphasizes that all her friends are beautiful and shouldn't be pitted against each other in a ranking system. The video itself is set in an indoor environment adorned with chandeliers and floral arrangements, creating an elegant backdrop. The main character is a woman with black hair, dressed in a white shirt and accessorized with gold jewelry such as earrings, a necklace, and bracelets. She carries a black bag with gold accents over her shoulder. Initially, she begins to rank the \"Bob House girls,\" but then changes her tune, stating that such rankings are toxic. She reiterates that all the girls are beautiful and shouldn't be ranked against one another. During the course of the video, the camera makes a few brief switches. Overall, the core theme of the video, as conveyed through both the audio and visuals, is the woman with black hair's perspective on the futility and unfairness of ranking the \"Bob House girls,\" instead highlighting the value of friendship and the fact that beauty is not something to be measured or compared."
+ },
+ {
+ "video_id": "PI-9c136173-1d50-42d9-bfce-707e6cd31c22",
+ "path": "/workspace/benchmark/video/PI-9c136173-1d50-42d9-bfce-707e6cd31c22.mp4",
+ "answer": "The video presents a close-up view of a person with blonde hair, dressed in a black hoodie with partially visible text, lying on a light-colored, fluffy surface, likely a bed or couch indoors. The individual displays a playful and relaxed demeanor, making various facial expressions, moving her head slightly, and occasionally twirling her hair around her fingers. A tattoo on her hand becomes visible as she brings her fingers near her face. The background features a neutral-colored wall, emphasizing the focus on her upper body and face. Accompanying this visual scene, the audio includes a male speaker with an indeterminate emotional tone delivering fragmented and unclear speech, which briefly includes the English word \"This\" and an unidentified Korean utterance. Ambient music with an undetermined tempo and atmosphere underscores the audio, contributing to the overall neutral or ambiguous emotional state of the scene."
+ },
+ {
+ "video_id": "PI-481b2aa4-61fd-41a4-be33-ab95a33b53b9",
+ "path": "/workspace/benchmark/video/PI-481b2aa4-61fd-41a4-be33-ab95a33b53b9.mp4",
+ "answer": "In this emotional video, a female speaker conveys a message full of joy and emotion. Her urging someone not to apologize to her but to feel remorse for themselves highlights the shift from seeking external validation to internal reflection. The slow, melancholy ambient music emphasizes her words and enhances the emotional resonance. Visually, the video opens with a woman in crying, wiping her face with a tissue, accompanied by text overlays, reinforcing the theme of self-reflection rather than seeking apologies. The scene transitions to a car decorated with Valentine's Day gifts, symbolizing loving and hard-working relationships of the past. The textual overlay here shifts to convey a sense of detachment and the speaker's newfound belief that she will be of great value to others. The video ends with a man leaving the car, symbolizing the beginning of a new chapter, leaving the past firmly behind. This amalgamation of audio-visual elements creates a poignant narrative of heartbreak, self-empowerment, and moving forward."
+ },
+ {
+ "video_id": "PI-46f66e85-00a8-452b-9c2c-a5103d959fd2",
+ "path": "/workspace/benchmark/video/PI-46f66e85-00a8-452b-9c2c-a5103d959fd2.mp4",
+ "answer": "The video takes place at the Planet Fitness Center, which is characterized by its unique purple and yellow color scheme. At the beginning, a person was walking on a treadmill, wearing a purple long-sleeved top, chestnut leggings and white sports shoes, with a long red braid tied into a bun. The background is a locker and a slogan on a wall, which reads \"Planet Fitness\" and \"2024 Vs...\" Add \"You did a great job today\" to set an inspiring tone. The scene switches to the same person, now wearing a black long-sleeved top and black leggings, still on the treadmill, maintaining the same hairstyle. The environment of the gym remains consistent, with multiple treadmills and other fitness equipment visible. Then the video switches to another person running on a treadmill. She is wearing blue clothes, black leggings and beige sports shoes. \"February 2025!\" It appears at the top of the framework, indicating a future date and suggesting forward-looking viewpoints or goals. The background continues to show the setup of the gym, with rows of treadmills and fitness equipment. This audio clip includes some fragmented phrases, such as \"This damn, I'm going to do something big next year\" and \"get my chain is he\", which might imply personal achievements or future aspirations. This video captures the progress of fitness activities in a gym environment and highlights themes such as motivation, personal growth, and future goals."
+ },
+ {
+ "video_id": "PI-4ab30bdd-eb25-4803-a90d-c028d2fbca6d",
+ "path": "/workspace/benchmark/video/PI-4ab30bdd-eb25-4803-a90d-c028d2fbca6d.mp4",
+ "answer": "The video begins with two men in suits standing by a waterfront, one in a yellow shirt and the other in an orange shirt. They are dancing, creating a relaxed and exciting atmosphere. The absence of dialogue is complemented by uplifting background music that enhances the enjoyable and lively mood of the scene. The men's movements are synchronized and energetic, set against the backdrop of a serene waterfront, which adds a touch of tranquility to the vibrant setting. The combination of their dynamic dance and the soothing yet upbeat music creates a captivating visual and auditory experience, making the overall atmosphere both engaging and pleasant."
+ },
+ {
+ "video_id": "PI-efd4ed56-1cc8-4c20-878b-33517f7dc831",
+ "path": "/workspace/benchmark/video/PI-efd4ed56-1cc8-4c20-878b-33517f7dc831.mp4",
+ "answer": "The video unfolds on a picturesque street lined with white houses and green trees, with a few cars parked along the roadside. At the start, a woman is seen running towards the camera, engaging in what appears to be a lively conversation, her smile never fading. As the camera pans left, another building comes into view, mirroring the charm of the first. The woman continues to speak animatedly, tapping her phone screen, making gestures, and occasionally touching her face, all while maintaining a cheerful demeanor. Meanwhile, a person dressed in black walks to the left, adding a subtle dynamic to the scene. This harmonious and pleasant atmosphere mirrors the audio content, where a conversation among multiple individuals, including both male and female voices, unfolds with casual and slightly playful emotions. In the audio, a woman asks for someone's phone number, leading to a lighthearted exchange about having two phones, one for work. When another person inquires about her profession, she responds enigmatically, hinting at her identity as an actress, much like the mysterious and engaging presence of the woman in the video. The dialogue in the audio carries a mysterious and teasing tone, with participants expressing curiosity and mild nervousness, paralleling the engaging and lively interaction depicted in the video."
+ },
+ {
+ "video_id": "PI-cd3fd401-7ca4-47c1-ae6b-35d38ea7b9a3",
+ "path": "/workspace/benchmark/video/PI-cd3fd401-7ca4-47c1-ae6b-35d38ea7b9a3.mp4",
+ "answer": "The protagonist of the video is a blonde and blue-eyed person, wearing a dark green polka dot shirt and a silver necklace. The background is a soft and neutral indoor setting. As the video progresses, it can be seen that this person moves slightly, causing her hair to move gently and then turn into a white top with a small green thing pinned on it. The background includes an elegant floral arrangement with red and white flowers, adding a touch of complexity to the setup. The audio overlay includes phrases such as \"She, you love me, what did you mess up\" and \"I know\", which seem to be part of a conversation or song and help create a casual conversational atmosphere. The lighting remains soft and consistent, enhancing the atmosphere of calmness and tranquility. The core theme of the video seems to revolve around a personal or self-reflective moment, possibly reflecting a mistake or realization, as suggested by the audio content. The overall composition is simple and beautiful, with the focus on the upper body of the person and the surrounding decorations."
+ },
+ {
+ "video_id": "PI-d88b14a1-640d-4a06-b695-62a345b51147",
+ "path": "/workspace/benchmark/video/PI-d88b14a1-640d-4a06-b695-62a345b51147.mp4",
+ "answer": "The video features a woman in a white terry cloth bathrobe recording herself in the bathroom mirror using a smartphone. The backdrop includes raw white marble walls and bright lighting, creating a clean environment. A blue and white striped towel can be seen hanging on a rack in the background, adding a subtle touch of color. The woman's hair is styled in a neat bun. Throughout the recording, she exhibits a range of facial expressions, including playful faces, closing her eyes as if immersed in thought or emotion, and directly gazing into the smartphone camera. The lyrical content of the playing pop music aligns with themes of self-worth and confidence. She holds the smartphone in her right hand, which is distinctly clear. At one point, she rotates the phone, changing the angle of the shot and providing a slightly different perspective of the reflection. The video maintains a stable focus on the woman and her actions in the well-equipped bathroom space, with the camera remaining relatively static, primarily capturing her movements and expressions."
+ },
+ {
+ "video_id": "PI-9341b8f8-e705-4a21-b733-740a9184d664",
+ "path": "/workspace/benchmark/video/PI-9341b8f8-e705-4a21-b733-740a9184d664.mp4",
+ "answer": "The video features a man in a dark blue shirt and black apron demonstrating an effective shoe cleaning process. He begins by sitting at a table with a large pile of sandy mud, using a tool to reveal its texture. The scene transitions to a close-up of a mud-covered sneaker, which he scrubs with a brush to showcase the initial cleaning phase. Holding a green bottle labeled 'SHOE CLEANER CONDITIONER,' he sprays the cleaner onto the sneaker, creating white foam that lifts the dirt as he continues to scrub. The man then rinses the sneaker under running water, revealing a cleaner surface, and removes the shoelaces to soak them in soapy water. Finally, he holds the now clean and shiny sneaker against a backdrop of yellow flowers, highlighting the dramatic transformation. Meanwhile, the audio features a male speaker discussing shoe cleaning with a neutral or slightly enthusiastic tone. He shares his experience and advice, emphasizing the effectiveness of the non-toxic shoe cleaner, which is safe for all materials and colors. The speaker describes the cleaning process, expressing satisfaction with the results and noting the cost savings achieved. The video effectively illustrates the shoe cleaner's ability to restore the sneaker's appearance, aligning with the speaker's endorsement of the product."
+ },
+ {
+ "video_id": "PI-d1b64a86-619a-463d-8f0f-02cb684c7674",
+ "path": "/workspace/benchmark/video/PI-d1b64a86-619a-463d-8f0f-02cb684c7674.mp4",
+ "answer": "The video shows a typical interview setup, with many pedestrians, buildings, and vehicles on the street. \nEveryone in the video is laughing, and subtitles reveal that two girls are playing a word game called 'Finish the Line.' A picture-in-picture below the video shows them playing the game by pausing it. One girl is holding a microphone and wearing headphones, while the other holds a microphone and looks at the camera, acting as the interviewer. The video shows that they played about four or five games, and the final scene ends with the two women smiling. The entire video creates a relaxed and humorous atmosphere. The audio in the video indicates that it is a conversation among multiple people, similar to a game called 'Finish the Line,' filled with laughter, and everyone is having a great time."
+ },
+ {
+ "video_id": "PI-6b5a32ef-3ce8-4ab8-bccb-26929594983a",
+ "path": "/workspace/benchmark/video/PI-6b5a32ef-3ce8-4ab8-bccb-26929594983a.mp4",
+ "answer": "The video unfolds in a cozy bedroom, where a person is wearing a vibrant striped vest with horizontal stripes in pink, blue, yellow and green. They sat on the bed, which was covered with a light pink plush blanket, creating a warm and inviting atmosphere. In the background, a dressing table can be seen, piled high with all kinds of bottles and containers, which might be cosmetics or personal care products. There is also a mirror and dark curtains covering the Windows. Another person wearing a long-sleeved red and pink shirt was sitting behind the first person, holding a small thing in his hand, which might be a telephone or a remote control. The room has soft lighting, enhancing the intimate atmosphere. This scene is basically static. The movements of both people are very small, suggesting a quiet interaction or a moment of relaxation. This audio includes some fragmentary phrases, such as \"Period, I'm at...\" The entire bedroom was blown away by the wind. He said he was leaving... This implies a casual and potentially interrupted conversation, adding a layer of spontaneity to the scene. This video captures a peaceful and private moment in the bedroom, emphasizing the serene and comfortable environment shared by two people."
+ },
+ {
+ "video_id": "PI-8c4798b1-b5a4-48d8-9ac9-430882281acb",
+ "path": "/workspace/benchmark/video/PI-8c4798b1-b5a4-48d8-9ac9-430882281acb.mp4",
+ "answer": "The video features a woman with long brown hair documenting her journey towards a nose job, capturing her anticipation and excitement through a countdown displayed in text overlays: '2 months until my nose job,' '1 month,' '3 weeks,' '2 weeks,' '1 week,' '6 days,' '4 days,' '2 days,' '6 hours,' and '5 minutes until my nose job!!!' She is seen sitting in a car, wearing a black top and later a maroon shirt with a logo, expressing her eagerness. The final scene shows her in a hospital bed, wearing a blue surgical cap, with the text '5 weeks after my nose job!!!' Meanwhile, the audio features multiple speakers with a mix of male and female voices, expressing neutral or unclear emotions. The dialogue includes fragmented and unclear statements in English, discussing reminders, possibilities, and stopping certain actions, with one speaker briefly mentioning 'things we could be.' There is also a brief segment in Korean, which is not contextually clear. The ambient music has a steady tempo and creates a neutral or subdued atmosphere, contributing to the overall scene of individuals engaging in disjointed and unclear communication, contrasting with the woman's clear and excited documentation of her surgical journey."
+ },
+ {
+ "video_id": "PI-cf39e88f-7290-4578-9754-0a4458eb2c64",
+ "path": "/workspace/benchmark/video/PI-cf39e88f-7290-4578-9754-0a4458eb2c64.mp4",
+ "answer": "In this integrated media piece, the video begins with the camera approaching a vibrant yellow advertising poster, which features a striking pattern of the repeated white word \"aaaaaa\". Below this pattern, the enticing offer of \"Free ice Cream\" is displayed alongside a picture of soft ice cream in a waffle cone. As the camera zooms in, the details of the poster become clearer, enhancing the visual appeal. Overlaying the video is the text \"He did what they told him,\" suggesting a narrative or commentary about someone following instructions in a potentially humorous or satirical context. Complementing the visual content, the audio features a male speaker with a cheerful and excited tone. He greets with \"Hi, good evening,\" followed by a series of \"ah\" sounds that convey surprise or realization. This audio reaction aligns with the overall light-hearted and cheerful tone of the video, which appears to be set in a public place, possibly near a fast-food restaurant or an ice cream shop. The combination of visual and auditory elements creates a cohesive and engaging experience, with the overlay text adding a layer of humor or satire to the scene."
+ },
+ {
+ "video_id": "PI-89f76c27-b26a-4f97-82f1-2028b4009a68",
+ "path": "/workspace/benchmark/video/PI-89f76c27-b26a-4f97-82f1-2028b4009a68.mp4",
+ "answer": "The video is set against the backdrop of city streets, brick walls and doors, hinting at the background of streets or alleys. A woman in a leopard print dress stood there with her arms crossed, exuding a sense of composure and calmness. The background music is a romantic and supportive melody. The lyrics are like \"It's me against The world, just me and you\", emphasizing the themes of unity and support and opposing external doubts. As the picture unfolds, a man wearing a black hoodie, grey trousers and a hat enters the frame energetically. He performed a series of interesting and exaggerated movements, including jumping, spinning, and tilting towards the woman, as if whispering or teasing her. Although his funny actions were very active, the woman remained motionless most of the time. The text on the video reads as follows: \"When people say, 'I didn't know I needed to cooperate,' you are two kids from the theater, waiting for this moment all along.\" This indicates that this video is a relaxed collaboration between two people with a background in drama, capturing the moment they were eagerly awaiting. The man finally walked behind the woman and hugged her warmly with his arms. He gently lifted her off the ground and they shared a happy and affectionate moment. They all laughed, obviously enjoying this interaction. Then, the man put the woman back on the ground, but his arm was still around her waist, and the two continued to hold each other. This video conveys a core theme, namely the joy of interesting emotions and cooperation, highlighting the connection between the two people. The combination of the city background, interesting actions and romantic background music creates a charming and warm story."
+ },
+ {
+ "video_id": "PI-8022fc31-b7d0-4b53-b11c-7486b1686c2a",
+ "path": "/workspace/benchmark/video/PI-8022fc31-b7d0-4b53-b11c-7486b1686c2a.mp4",
+ "answer": "In the video, a person wearing a bright orange wig, a black jacket, and a red tie, with a patch that reads \"lcsign,\" is featured holding a neon \"OPEN\" sign. They stand in front of a backdrop adorned with various neon lights, making vivid gestures while speaking enthusiastically and persuasively. The individual emphasizes the quality and versatility of customizable neon signs, highlighting their suitability for a range of businesses, including clothing stores, cafes, and electronics shops. They stress that these signs can be tailored in size, shape, and color to meet specific needs. The video encourages viewers to click on the link in the bio to obtain a free quote and urges them to act quickly, mirroring the core message of the accompanying audio track, which also promotes the neon signs' adaptability and quality for commercial signage.The overall tone is passionate and persuasive, aiming to promote neon lights as an ideal choice for commercial signage."
+ },
+ {
+ "video_id": "PI-e7628ecf-4695-4e36-9fbd-1356551c2cbb",
+ "path": "/workspace/benchmark/video/PI-e7628ecf-4695-4e36-9fbd-1356551c2cbb.mp4",
+ "answer": "The video captures a young man energetically performing a pop song in a bedroom setting. He is wearing glasses and a black t-shirt, holding a hairbrush as a microphone, and actively moving his arms and upper body in sync with the music. The wall behind him features several posters, one of which prominently displays \"CUTS\" and \"OLIVIA RODRIGO.\" A text overlay on the screen reads: \"The aura Chappell Roan gained from having an entire ARENA filled with literally the music industry sing her song back to her is unmatched.\" The audio features the song playing, accompanied by the distinct sound of a large crowd singing along, suggesting this is an imitation of a live concert.\n"
+ },
+ {
+ "video_id": "PI-df6ee9fa-19ef-4c36-a2d2-20110063407b",
+ "path": "/workspace/benchmark/video/PI-df6ee9fa-19ef-4c36-a2d2-20110063407b.mp4",
+ "answer": "A young girl and an adult man engage in a humorous dialogue while dining, unfolding a delightful interaction. There are wooden cabinets, a dark-colored refrigerator, and a light-colored countertop, which holds a purple water bottle, a glass of water, and a bowl of food. The girl, wearing a pink long-sleeved shirt, is eating pasta, while the man, in a dark gray long-sleeved top and a black hat, responds to her statements. The conversation revolves around the girl's playful mention of having 'three boyfriends,' discussing Henry, Hudson, and Locke. The adult responds with feigned skepticism and humor, leading to the child's smile. The scene captures a typical family moment, highlighting the innocence and imaginative nature of childhood dialogue and the affectionate dynamic between parents and children. Natural lighting and clear audio contribute to a genuine portrayal of the exchange."
+ },
+ {
+ "video_id": "PI-5cd62932-3669-494a-a95e-dbc88006c730",
+ "path": "/workspace/benchmark/video/PI-5cd62932-3669-494a-a95e-dbc88006c730.mp4",
+ "answer": "This video shows the comparison between an 80-degree holographic projector and a 3000-degree holographic projector in a room illuminated by purple and blue LED lights. The words \"80 hologram VS 3000 hologram\" appeared at the top of the screen, setting up a stage for comparison. A man is holding a smart phone in one hand, showing a series of hologram options. The scene switches to a corner of the room where an 80 holographic projector is installed on the wall. The feature of the first hologram is that a character similar to Mario in popular video game series is floating in the air. The hologram is very simple and a bit blurry, reflecting its relatively low price. The video continues to emphasize the affordability and simplicity of the 80 hologram projector, presenting various holograms such as circular side images, a butterfly with pink and white wings, and characters similar to those from the animated series \"Naruto\". The background is a computer setup and a speaker, with a small decoration on top. The video effectively compares the functions of the 80 holographic projector with the higher quality of the 3000 projector, highlighting the budget-friendly nature of the 80 option. Vibrant lights and dynamic holograms create an engaging visual experience, highlighting the theme of being more economical and high-end holographic technology."
+ },
+ {
+ "video_id": "PI-7fde54c4-26f2-4f42-b6c1-bbe235893871",
+ "path": "/workspace/benchmark/video/PI-7fde54c4-26f2-4f42-b6c1-bbe235893871.mp4",
+ "answer": "The video unfolds in a large, well-lit lecture hall bustling with students, creating a lively academic atmosphere. The central figure, dressed in a black hoodie and a black beanie adorned with white patterns, is seated at a desk enjoying a taco wrapped in paper, with a brown paper bag likely containing more food items beside it. The background is filled with students either standing or seated, engaged in various activities, while a screen at the front displays content, enhancing the educational setting. The individual savors the taco, taking bites and gesturing with their hands as if sharing their thoughts or experiences. Meanwhile, the audio track features a male speaker with a casual and humorous tone, who appears relaxed and jovial. He talks about enjoying his taco and playfully asks for an edit of himself eating it. He mentions that it's Tuesday and jokes about what a king eats on Tuesdays, concluding with a light-hearted request to stop recording. The core message of the video is the speaker's enjoyment of his meal and his playful interaction with the recorder, set against the vibrant backdrop of a busy lecture hall where students interact and go about their day."
+ },
+ {
+ "video_id": "PI-24b02832-ad12-4361-9b2f-80a56bea03a8",
+ "path": "/workspace/benchmark/video/PI-24b02832-ad12-4361-9b2f-80a56bea03a8.mp4",
+ "answer": "The video shows a woman standing in a room with shelves filled with various perfume bottles and wearing a light-colored sweatshirt with a graphic design. The woman shared a secret, revealing that she and Peyton never dated, contrary to what fans might have assumed. The revelation was underscored by text overlays, \"me and payton were never dating.\" and \"payton and i never dated.\" The speaker expressed regret and apologies to the fans who created the fan page, accompanied by a crying emoji that showed empathy for her followers. At the same time, the track captures a neutral dialogue between a man and a woman, with no obvious emotional hints, suggesting that it may be an ordinary conversation. The video aims to clear up fans'misunderstandings and end relationship rumors, while the consistent backdrop of the perfume bottle adds a personal sense of life to the scene."
+ },
+ {
+ "video_id": "PI-fe5e49c0-d18a-40fe-a141-784b1d2cb277",
+ "path": "/workspace/benchmark/video/PI-fe5e49c0-d18a-40fe-a141-784b1d2cb277.mp4",
+ "answer": "This video captures the creative process of a musician building a song from scratch using unconventional sounds. The musician starts by creating a simple, robotic vocal melody using a melodica connected to a keyboard synthesizer. A series of emojis at the top of the screen—a robot, a telephone, a soccer ball, a snake, and a synthesizer—guides the viewer through his process. He proceeds to sample sounds from various objects corresponding to the emojis. First, he records the click of a vintage red telephone receiver. Next, he taps on a soccer ball to create a percussive beat. He then uses a toy snake as a shaker for another rhythmic layer. Finally, he combines all these elements on a different synthesizer, represented by the synthesizer emoji. The video culminates with the artist playing the finished track, an upbeat and layered electronic song, smiling and nodding along to his creation."
+ },
+ {
+ "video_id": "PI-c7dbe764-4bfa-4c20-b803-63370c9822da",
+ "path": "/workspace/benchmark/video/PI-c7dbe764-4bfa-4c20-b803-63370c9822da.mp4",
+ "answer": "This video captures a heartrending scene: the street is shrouded in thick orange smoke, indicating that a wildfire has broken out nearby. The background was mainly tall palm trees, almost invisible in the smog. Several cars were parked by the roadside, and some drivers seemed hesitant due to the low visibility. On the sidewalk, a group of people can be seen evacuating the area. They are carrying luggage and pets, hinting that they are leaving in a hurry for safety. A woman leading a dog on a leash highlights the urgency and concern for family members, including pets. The audio captured a painful voice saying, \"I don't know, dear. I'm fucking gone.\" This emphasizes the need for panic and sudden escape. The overall atmosphere was tense. Smoke cast a strange glow at the scene, emphasizing the severity of the wildfire and the direct threat it poses to the community. This video effectively conveys the chaos and fear that residents have experienced in this environmental crisis."
+ },
+ {
+ "video_id": "PI-b410d64c-6164-4b5b-82b8-ea75645498d1",
+ "path": "/workspace/benchmark/video/PI-b410d64c-6164-4b5b-82b8-ea75645498d1.mp4",
+ "answer": "The video humorously explores the daily stress experienced by men, intertwined with subtle audio cues that enhance the narrative. It begins with a stick-figure police officer leaning against a pole, engrossed in his phone, with a text bubble reading, 'Po: How Nervous Are Men'. The scene shifts to a man inside a car, gripping the steering wheel with a worried expression. Throughout the video, the audio features a male speaker expressing unknown emotions, with phrases like 'Are you okay?' suggesting concern or empathy, and another responding with 'This hold', possibly indicating a pause or hesitation. The ambient music maintains a steady rhythm, creating a neutral yet slightly tense atmosphere. As the video alternates between the anxious officer and the stressed driver, the officer's anxiety seems to grow as he interacts with the man, whose phone displays a game called 'BLOCK BLAST'. Eventually, the man in the car puts on sunglasses, gives a thumbs-up, and appears to find relief as the traffic light turns green, driving away. This leaves the police officer looking even more nervous, with tears welling in his eyes. The simple blue sky background consistently emphasizes the characters' expressions and movements, while the audio elements subtly underscore the emotional undertones of their brief exchange."
+ },
+ {
+ "video_id": "PI-d61d3125-569c-419c-a59f-cf1ad194e0b5",
+ "path": "/workspace/benchmark/video/PI-d61d3125-569c-419c-a59f-cf1ad194e0b5.mp4",
+ "answer": "This video is set inside a white-themed bridal shop, with contrasting gray-white and brown flooring, and a display platform. A woman stands in front of the camera, showcasing the wedding dress she's wearing. To better highlight the gown, she steps onto the display platform behind her, subtly touching the skirt to emphasize its overall effect. Her joyful tone, as she mentions the dress is from \"LOVELLA BRIDAL,\" clearly conveys her excitement and satisfaction. Soothing background music plays in the shop, creating a relaxing atmosphere. The woman concludes by saying any \"winter bride\" would need this dress, underscoring her deep contentment with the gown and radiating the happiness of an soon-to-be newlywed."
+ },
+ {
+ "video_id": "PI-b4cf22ed-c600-4a33-b6ae-f2b3522d6aa8",
+ "path": "/workspace/benchmark/video/PI-b4cf22ed-c600-4a33-b6ae-f2b3522d6aa8.mp4",
+ "answer": "The video vividly captures the essence of childhood play, particularly the excitement and imagination associated with Nerf wars. It begins with a person in a gray hoodie, black pants, and a helmet, crouching near a staircase indoors, holding a colorful Nerf gun and preparing for action. The scene develops as the individual cautiously ascends the staircase, aiming the Nerf gun ahead, with a framed photo on the wall and a window with a tied-back curtain in the background, allowing light to filter in. After reaching the top, the scene shifts to a living room where another person with long hair, wearing a black vest over a striped shirt, sits on a gray couch holding a red Nerf gun in a playful stance. The room is filled with various items, including a guitar and scattered belongings, enhancing the lived-in feel. The video then returns to the first person, now descending the staircase, still in a ready position with the Nerf gun, maintaining the playful atmosphere. Accompanying this visual narrative is a voice-over from a male speaker with an energetic and playful tone, reminiscing about childhood experiences with Nerf wars. His expressive and humorous language evokes a strong sense of nostalgia, highlighting the carefree and imaginative nature of such playful battles. The combination of visual and auditory elements effectively conveys the fun and nostalgic spirit of childhood adventures."
+ },
+ {
+ "video_id": "PI-4664602c-f46e-47fe-b009-0e638b4f48c5",
+ "path": "/workspace/benchmark/video/PI-4664602c-f46e-47fe-b009-0e638b4f48c5.mp4",
+ "answer": "In the video, a person is sitting in front of a modern curved table in a well-lit room, creating a modern and inviting atmosphere. The chandelier hanging from the ceiling illuminates the room, and the large Windows with blinds fill the space with natural light, enhancing the brightness of the space. This person is wearing a dark shirt. One hand is placed on the chin and the other arm is crossed over the back of the chair. It seems to be in a state of deep thought. The text on the video is written like this: \"When the songs at work start to become popular because you have to listen to them for eight hours every day.\" It humorously emphasizes the fatigue that people might feel when repeatedly hearing the same popular song during long working hours. This article captures the essence of workplace fatigue and the shared experience of being overwhelmed by overplayed songs. In addition, the audio also includes a clip of a song with lyrics that go, \"So now I'm a lucky woman, just like this is the last night,\" which adds a layer of irony and humor to the situation as it contrasts with the message about song fatigue in the text. The video effectively combines visual and auditory elements to create a relevant and humorous narrative about the challenges of a repetitive working environment."
+ },
+ {
+ "video_id": "PI-37cd5519-332f-4221-91ee-529ddba1b1b4",
+ "path": "/workspace/benchmark/video/PI-37cd5519-332f-4221-91ee-529ddba1b1b4.mp4",
+ "answer": "The video captures a lighthearted interaction between a young man and woman in a brightly lit indoor setting. The room, likely a hotel or residential space, features white walls and a light-colored floor, with a bed and a bunk bed visible in the background. The man, with distinct red curly hair, wears an orange t-shirt and loose light blue jeans. The woman, with dark curly hair, is dressed in a light blue crop top and black pants. Their conversation is brief and playful, with the man introducing the woman as his 'girlfriend' before she playfully corrects him to 'wife,' which he enthusiastically accepts as 'even better.' This affectionate exchange is followed by a moment where the man lifts the woman off the ground, holding her close as they share smiles and briefly rotate. The scene is underscored by a pop-style song with a female vocalist, contributing to the cheerful and romantic atmosphere of their interaction. The overall visual and auditory experience conveys a moment of genuine joy and affection between the two individuals."
+ },
+ {
+ "video_id": "PI-689ba3aa-eb28-4033-9586-7ae4b86328b8",
+ "path": "/workspace/benchmark/video/PI-689ba3aa-eb28-4033-9586-7ae4b86328b8.mp4",
+ "answer": "The video features a person wearing a dark cap and glasses, standing in a room with a white wall and a light fixture above, holding the brim of the cap with their right hand in a casual and contemplative pose. The text overlay on the video reads: 'Just found out he wrote this song for Lily, his daughter. My favorite Disney lullaby was written for Phil's daughter. I'm going to sob.' This text reveals a personal and emotional revelation about the origin of a song, indicating that it was written for someone's daughter, and the viewer's favorite Disney lullaby was also written for a daughter, leading to an emotional reaction. The scene remains consistent throughout the video, with minimal changes in the person's position and the background, emphasizing the emotional impact of the text. The accompanying BGM is an instrumental track featuring a gentle and soothing melody, likely a lullaby, which evokes a sense of warmth and tenderness. The video conveys a sense of connection and sentimentality, highlighting the universal theme of a parent's love for their child as expressed through music."
+ },
+ {
+ "video_id": "PI-a6be789a-f97b-4f5c-baf5-b7efe39683bf",
+ "path": "/workspace/benchmark/video/PI-a6be789a-f97b-4f5c-baf5-b7efe39683bf.mp4",
+ "answer": "The video and audio together create a humorous and engaging narrative. The video begins with a man in a kitchen, wearing a dark blue zip-up sweater over a white shirt, holding a large onion on a wooden cutting board, accompanied by the text overlay, \"I love your daughter.\" The scene transitions to sautéing sliced onions, with the text changing to, \"what do u have to offer her?\" Three tortillas are then laid out, and the man gestures with his hands as the text \"nothing.\" appears. The focus shifts to a raw ground meat patty on a tortilla, which the man seasons with salt and pepper, with the text \"only this\" appearing. He cooks the patty in a heated pan, pressing it down with a white plate, and tops it with a slice of melting cheese. The scene transitions back to stirring sautéed onions, followed by a bowl of mayonnaise, chopped pickles, and spices being mixed together. The video concludes with the man assembling the cooked patty topped with melted cheese, pickles, and sautéed onions onto the tortilla. Meanwhile, the audio captures a male speaker making a playful statement about loving someone's daughter and implying that his culinary offering is all he has to give, highlighting a light-hearted interaction. The combination of the visual cooking process and the humorous audio creates an entertaining and cohesive story."
+ },
+ {
+ "video_id": "PI-48364d32-3f4e-49f7-bd7b-b90a5baa6b4c",
+ "path": "/workspace/benchmark/video/PI-48364d32-3f4e-49f7-bd7b-b90a5baa6b4c.mp4",
+ "answer": "The video unfolds in a cozy restaurant setting, characterized by warm lighting and brown tile panels on the walls. The beginning is a close-up of a person wearing a colorful striped sweater, accompanied by the text: \"People think my sister is charming, but look at this.\" It implies an interesting contrast between the sister's charm and her actual behavior. Then the scene switches to a wider shot, showing the interaction between the person in the striped sweater and another person sitting on the sofa. The person sitting in the photo is wearing a light blue shirt, a colorful headscarf and a beige jacket over his legs. His posture is very relaxed. The person in the striped sweater jokingly touched the face of the sitting person, causing an exaggerated reaction and adding a comedic effect. On the table in front of them lay a black bowl, a glass of water and some folded napkins, creating a casual dining atmosphere. This video captures the light-hearted and humorous moments between the two, emphasizing the contrast between the sister's charming image and her mischievous and down-to-earth behavior."
+ },
+ {
+ "video_id": "PI-774cd7ac-176e-4432-ba25-d1a4e2afc2e4",
+ "path": "/workspace/benchmark/video/PI-774cd7ac-176e-4432-ba25-d1a4e2afc2e4.mp4",
+ "answer": "The video begins in a hospital room where a person in a red shirt stands beside a hospital bed, with an elderly person lying in the bed. The room is equipped with medical devices and a light fixture on the wall, creating a somber atmosphere. A text overlay appears, stating, \"I wonder if my face will change if I lose weight...\", hinting at personal concerns about physical transformation. The scene then transitions to a close-up of another person in a gray shirt inside a car, followed by a person in a denim jacket sitting in the car's front seat, suggesting a shift in location and possibly time. The final scene shifts to a gym where a muscular individual, wearing a black tank top with a white graphic, black shorts, and a black cap, stands confidently with hands on hips. The gym background includes various exercise equipment and other people working out, emphasizing a theme of physical fitness and transformation. Meanwhile, the audio features a male speaker conveying a tone of assertiveness and possible frustration, discussing the importance of responding decisively to disrespect and emphasizing physical action as a follow-up. The speaker expresses a strong stance on handling conflict, suggesting a confrontational or intense situation. The ambient music, with its rhythmic tempo and intense atmosphere, complements the speaker's emotional delivery, tying together the themes of personal transformation, physical fitness, and assertive confrontation."
+ },
+ {
+ "video_id": "PI-00016e0b-b1b4-4414-a7bb-8a3374aeb019",
+ "path": "/workspace/benchmark/video/PI-00016e0b-b1b4-4414-a7bb-8a3374aeb019.mp4",
+ "answer": "The video features a female dancer in a well-lit room with a neutral-colored wall and a chandelier, creating a simple yet elegant backdrop. She is dressed in a white long-sleeve crop top, white sweatpants with black designs, and white sneakers, embodying a fresh and energetic look. The video begins with the bold, pink and white text \"ONE MORE TIME TRY IT WITH ME\" displayed prominently, setting an encouraging tone. As the dancer performs a series of energetic moves, including arm gestures and hip sways, the text on the screen changes to guide viewers through the routine with phrases like \"WE GO OOOO,\" \"OOOOO HA TEE,\" and \"TEE TEE.\" The dancer's fluid movements are accentuated by text cues such as \"EE\" and \"E AND I POP,\" which highlight specific dance steps. The sequence continues with her raising her arms above her head, marked by the text \"HEY,\" and crossing her arms in front of her chest during the \"BREA THING MOMENT.\" She then flicks her hair, guided by \"HEY FLICK IT OUT 2,\" followed by counting \"THREE FOUR\" and instructions \"AND A LEFT AND A RIGHT\" as she steps side to side. The consistent background and outfit keep the focus on her precise and lively dance moves, with the text providing clear instructions to help viewers follow along. Meanwhile, the audio track complements the visuals with the dancer's energetic and encouraging voice, repeating phrases like \"one more time try it with me,\" \"we go,\" and specific dance instructions such as \"e and a push and,\" all of which encourage viewers to engage with the dance routine and enjoy the experience."
+ },
+ {
+ "video_id": "PI-a0904290-23dd-47a9-853e-e748dcd38eb1",
+ "path": "/workspace/benchmark/video/PI-a0904290-23dd-47a9-853e-e748dcd38eb1.mp4",
+ "answer": "The video features an individual with curly hair and a beard, dressed in a plaid shirt over a white t-shirt with a colorful design, set against an indoor backdrop that includes a potted plant and a television, suggesting a living room environment. The person remains mostly stationary, with minimal movements, while the focus is on the text overlays that shift from 'Liberals doing the Conservative makeup trend' to 'Conservatives doing the Liberal makeup trend,' offering a commentary on the cross-pollination of makeup trends between political spectrums. Accompanying this visual narrative is a BGM track ,characterized by a modern and upbeat tone, which enhances the dynamic atmosphere of the video. Together, the audio and visual elements explore the intersection of political identity and personal style choices, particularly how makeup trends transcend traditional political boundaries."
+ },
+ {
+ "video_id": "PI-4ec22771-8ca1-41fd-93b9-bae48794224a",
+ "path": "/workspace/benchmark/video/PI-4ec22771-8ca1-41fd-93b9-bae48794224a.mp4",
+ "answer": "The video humorously depicts a common stereotype about adults living with their parents through a relatable kitchen scene. Set in a kitchen with white cabinets and a granite countertop, a person in a cream-colored sweater, blue jeans, and green sandals is seen doing dishes, holding a wooden spoon and a plate. The text overlay reads, 'omg, you're 31 and still live w/ your parents???' and 'me and my mom everyday after i get off work:', highlighting the daily routine of the individual. Another person, dressed in a black shirt, dark pants, and a green cap, enters holding a can and a plate, engaging in a brief interaction. The first person then dons bright pink rubber gloves and continues washing dishes, while the second person observes. The scene transitions to the first person carrying a large laundry basket filled with clothes towards a door, presumably to do laundry, with the second person following, holding a can and a phone, and appearing to cheer or encourage them. Meanwhile, a male speaker with an indeterminate emotional tone is heard speaking in fragmented English, mentioning actions like 'right to the man down' and 'trying to see,' while ambient music with a rhythmic tempo creates a neutral atmosphere, subtly underscoring the everyday activities depicted in the video."
+ },
+ {
+ "video_id": "PI-9fd00ab5-c007-4012-bffe-50fe39c1787a",
+ "path": "/workspace/benchmark/video/PI-9fd00ab5-c007-4012-bffe-50fe39c1787a.mp4",
+ "answer": "The video showcases a series of stylish outfits worn by a male model, each exuding confidence and charisma, perfectly complemented by a background music track featuring a male vocalist with a confident and assertive tone. The first outfit, a white hoodie and matching sweatpants adorned with a striking black and red lion design, creates a bold and edgy look, with the model standing confidently on the steps of a modern building. The second outfit, featuring an intricate white dragon design on a white hoodie and sweatpants set, gives off a mystical and powerful vibe as the model poses dynamically on a street. The third outfit, a black t-shirt and pants with a roaring lion graphic and the words 'The King' printed on them, exudes authority and confidence, with the model standing against a brick wall. The final outfit, a white hoodie and sweatpants featuring a majestic white tiger walking through a snowy forest, evokes a sense of wild elegance in a modern indoor setting. Throughout the video, the BGM's rhythmic and bold lyrics emphasize a strong, no-nonsense attitude, encapsulating a core message of confidence and authority that mirrors the model's commanding presence in each outfit. The text 'Mens who dress like this >>>' and 'CLICK TO GET THEM NOW' appear on the screen, suggesting these fashionable and assertive outfits are available for purchase."
+ },
+ {
+ "video_id": "PI-d2ad692a-7ddd-4940-87c4-1157b287789e",
+ "path": "/workspace/benchmark/video/PI-d2ad692a-7ddd-4940-87c4-1157b287789e.mp4",
+ "answer": "The video features a young woman in a dimly lit blue room, taking a selfie. She is dressed in a light blue tank top that reveals extensive tattoos on her chest, neck, and arms, including intricate designs of cobwebs, flowers, and other patterns. With her long, dark hair partially pulled back, she adjusts her hair and poses to showcase her tattoos and outfit. In her right hand, she holds a white smartphone equipped with a triple-camera setup. The background includes a partially visible neon sign with cursive script, adding to the ambient blue tone of the room. Meanwhile, the audio features a female speaker with an unknown emotional tone, possibly due to non-verbal voices or unclear speech. The content seems to involve a monologue or conversation, but the specific semantic details are indiscernible, creating an ambiguous emotional expression that complements the visual elements."
+ },
+ {
+ "video_id": "PI-2c5f34b9-60f0-47f2-b22e-b00965d3223e",
+ "path": "/workspace/benchmark/video/PI-2c5f34b9-60f0-47f2-b22e-b00965d3223e.mp4",
+ "answer": "The video unfolds in a dark, cinematic setting where rocks are dotted with small glowing red particles, creating an eerie and mysterious atmosphere that sets the stage for an eventful narrative. In the center, an orange semi-circular object is partially buried, and the striking white text \"He's ALIVE!!\" appears, suggesting a major revelation and adding a layer of suspense and excitement. The object begins to move slightly, hinting that it is alive or has a mechanical function, as a character in white and red boots approaches and reaches toward it with a gloved hand. Passionate background music sung by a young male vocalist enhances the dramatic tension and emotional intensity of the scene. The video ends with the hand still reaching out to the object, leaving the viewer in suspense about the nature of the object and the character's intentions, while the repetitive text emphasizes themes of rediscovery or resurrection, making the whole sequence both thrilling and thought-provoking."
+ },
+ {
+ "video_id": "PI-b64c5b05-ab89-411d-b8bb-51bbb2f4f4f6",
+ "path": "/workspace/benchmark/video/PI-b64c5b05-ab89-411d-b8bb-51bbb2f4f4f6.mp4",
+ "answer": "The video features a woman with blonde hair tied back, dressed in a simple white t-shirt, speaking calmly and reflectively into a microphone labeled \"THE MEL ROBBINS PODCAST.\" The softly blurred background creates a cozy, well-lit indoor ambiance, suggesting a studio or home office setting, which enhances the inviting and professional tone of the video. Throughout her discourse, she delves into \"The Let Them Theory,\" a concept that encourages listeners to embrace a mindset of acceptance, allowing others to be themselves without the need for control. Her expressive gestures and engaged demeanor effectively convey the essence of her message. She emphasizes that by letting people reveal their true selves, you gain invaluable clarity on their suitability for your life, empowering you to make better choices. The video seamlessly transitions in its final frame to a promotional image of a book titled \"THE LET THEM THEORY\" by Mel Robbins. The book cover is striking, with a green background and large, bold, yellow letters. Below the book image, text reads \"The perfect gift this season!\" and \"Available wherever books are sold,\" along with a website URL \"LetThem.com.\" This final frame not only promotes the book but also reinforces the podcast's message, creating a cohesive narrative that champions personal empowerment and peace through acceptance and the release of control over others. The overall video maintains a professional and inviting tone, effectively bridging the podcast discussion with the book promotion."
+ },
+ {
+ "video_id": "PI-d3cb8143-d4ba-42f3-ad30-760aaa7fb28c",
+ "path": "/workspace/benchmark/video/PI-d3cb8143-d4ba-42f3-ad30-760aaa7fb28c.mp4",
+ "answer": "The video, set in a bustling grocery store, captures a moment of celebration or appreciation, likely tied to an event or holiday. Several men are seen holding bouquets of flowers, with one man in black pushing a red shopping cart while holding his bouquet, and another in a green shirt also holding flowers. The background features a sign that reads \"Lost in an Ocean of Love,\" adorned with illustrations of sea creatures, enhancing the theme of affection. As the men walk through the store aisles, passing shelves stocked with various products, a man in a blue shirt and black shorts is also seen holding a large bouquet. The store is well-lit with fluorescent lights, and other shoppers are visible, contributing to the lively atmosphere. The text \"We love to see it fellas!\" appears on the screen, emphasizing the positive and affectionate mood. Meanwhile, the audio features a male speaker expressing a sense of anticipation and optimism, suggesting that love will be discovered. His reflection on the potential for love conveys a hopeful and slightly mysterious emotion, complemented by background music with a moderate tempo that creates an ambient atmosphere. This combination of visual and auditory elements weaves a narrative of hopeful anticipation and joyful celebration."
+ },
+ {
+ "video_id": "PI-a037925e-7da9-45a0-ad68-38edc13d9e29",
+ "path": "/workspace/benchmark/video/PI-a037925e-7da9-45a0-ad68-38edc13d9e29.mp4",
+ "answer": "The video opens with a man dressed in a patriotic ensemble, a blue suit adorned with white stars, a red vest, a white shirt, a black bow tie with white stars, and a top hat featuring an American flag design—standing in a stadium filled with blurred lights, suggesting a large, enthusiastic audience. This scene quickly transitions to a vibrant dance sequence led by a man in a blue and white varsity jacket, black shirt, blue jeans, and black gloves, flanked by two dancers in white outfits with red hair. The brightly lit stage amplifies the energetic atmosphere, and overlay texts such as 'IT'S A CULTURAL DIVIDE,' 'I'MA GET IT ON THE FLOW,' and others emphasize themes of cultural significance and authenticity. These messages resonate with the background music (BGM) track, which features a male vocalist with a confident and energetic tone, delivering rhythmic and assertive lyrics that exude determination and defiance. The track's content underscores resilience and authenticity, encapsulating a core message of standing strong against adversity and staying true to oneself. The scene then shifts to an animated sequence featuring penguins in a snowy landscape under a bright blue sky, with one penguin in the foreground appearing particularly lively and animated. The video cuts back to a live-action segment showing a woman in a white crop top and black pants walking on a tightrope against a clear blue sky, showcasing her balance and focus. It then returns to the dancer in the blue varsity jacket, now joined by other dancers in similar blue outfits, performing synchronized moves on a stage with a large audience in the background. The video wraps up with a return to the animated penguins, highlighting their cheerful and energetic demeanor as they move in unison across the snowy terrain, leaving the audience with a sense of joy and unity."
+ },
+ {
+ "video_id": "PI-aa0fc068-e8e0-4ae3-a978-a238f3a01418",
+ "path": "/workspace/benchmark/video/PI-aa0fc068-e8e0-4ae3-a978-a238f3a01418.mp4",
+ "answer": "The video takes place in a modern living room featuring stainless steel appliances and white cabinets. The main subject is a person wearing a beige hat, a black hooded sweatshirt, and khaki pants, drinking water from a black cup. He turns around to show the back of his hoodie, and another person reaches out to adjust the sweatshirt to better display its design. The focus then shifts to a close-up of the black sweatshirt, highlighting its textured pattern and a small white label on the chest. The wearer also sports a gold necklace and a black ring. Throughout the video, a conversation unfolds as one person praises the other's new shirt, expressing admiration for its design and quality. The wearer mentions that the shirt was bought from \"Douyin enthusiasts\", emphasizing that it is affordable and comes in a variety of colors. The conversation conveys enthusiasm and satisfaction with the purchase. The core theme of the video revolves around showcasing the new sweatshirt, its design, quality, and the convenience of finding it online. The person in the sweatshirt also holds a black water bottle and occasionally adjusts his clothes during the conversation, reinforcing his approval of the garment."
+ },
+ {
+ "video_id": "PI-4960f6ee-441b-4ed5-b4d0-3192d8c4836c",
+ "path": "/workspace/benchmark/video/PI-4960f6ee-441b-4ed5-b4d0-3192d8c4836c.mp4",
+ "answer": "This video offers a close-up view of a fast food restaurant kitchen, where an employee can be seen handling fries. The stainless steel surface of the fry station stands out, and red containers filled with fries sit on the metal shelves. The employee's arm is clearly visible as he uses a metal scoop to collect fries from the station and place them into a container. The text overlay reads: \"Please tell me when a rude customer walks past the drive-through lane.\" From a personal perspective, this line describes the challenges that fast food employees face when dealing with difficult customers, implying a humorous or relatable context and highlighting the stress and daily tasks involved in preparing food in a fast-paced environment. The video captures the behind-the-scenes work of a fast food restaurant, emphasizing the interaction between employees and customers, as well as the sometimes frustrating experiences of the job. The audio keeps the audience's attention on the visual elements and text overlay, reinforcing the theme of dealing with rude customers while completing daily tasks."
+ },
+ {
+ "video_id": "PI-d5f56595-cf52-445d-9539-1fc8560262a4",
+ "path": "/workspace/benchmark/video/PI-d5f56595-cf52-445d-9539-1fc8560262a4.mp4",
+ "answer": "In a modern, well-lit living space characterized by light gray walls, a large window with a black frame, and a circular, gold-toned chandelier, three children are featured. The room's light wood flooring and white cabinets provide a bright and airy backdrop. Initially, two children capture the focus: one in a light blue sweatshirt and jeans playfully interacts with another seated on a light gray stool, dressed in a blue hoodie and matching sweatpants with white sneakers featuring black stripes. The standing child leans on the seated one, who has their arms crossed. Soon, a third child wearing a gray hoodie with \"NY\" on the front and black pants enters from the right, observing the interaction. As the scene progresses, the child in the gray hoodie moves closer before stepping out of the frame. They later re-enter holding a baby in a light pink outfit, standing next to the seated child while holding the baby up. Throughout, the child in the light blue sweatshirt continues to engage with the seated child. Meanwhile, a male speaker's firm and protective voice is heard, directing others to stay away from his brother, asserting his authority with a determined and serious tone. The ambient music, with its steady tempo, adds a neutral yet slightly tense atmosphere to the scene, underscoring the speaker's emphasis on setting boundaries."
+ },
+ {
+ "video_id": "PI-469ff900-bf69-4e44-95c0-8b1081a838b5",
+ "path": "/workspace/benchmark/video/PI-469ff900-bf69-4e44-95c0-8b1081a838b5.mp4",
+ "answer": "The video begins with a view of a snowy landscape where a large orange snow plow is diligently clearing a path through thick, heavy snow. The plow's blade is heavily laden with snow, and orange flags attached to it enhance visibility in the snowy conditions. The surrounding area is blanketed in snow, with snow-covered trees and a fence visible on the left side. As the plow moves forward steadily, it pushes the snow to the sides, creating a clear path and revealing more of the road beneath. A small sign becomes visible in the distance, indicating a possible destination or direction. The plow's progress is methodical, ensuring the road is cleared for safe passage. Meanwhile, the audio features a male speaker with an ambiguous emotional tone, briefly uttering the fragmented phrase, \"Where we,\" in a neutral or unclear context, accompanied by ambient music with a steady tempo that creates a neutral atmosphere. The video continues to focus on the snow plow's steady progress, maintaining the same snowy landscape and methodical clearing of the road, with the small sign becoming more prominent, reinforcing the possible destination or direction. The video wraps up with the plow still in motion, continuing its essential task of clearing the snow-covered road."
+ },
+ {
+ "video_id": "PI-e3741a7c-84e4-47f2-87b6-7f14f8e134e6",
+ "path": "/workspace/benchmark/video/PI-e3741a7c-84e4-47f2-87b6-7f14f8e134e6.mp4",
+ "answer": "This video showcases the unboxing and initial review of a ugee Fun Drawing Pad. A person is shown at a white desk, carefully opening the product's packaging with a utility knife. They remove the tablet, its folio case, and various accessories, including a stylus, charger, and a drawing glove. The video highlights the tablet's large screen size. The user then powers on the device and demonstrates its features, cycling through three different screen modes labeled Regular, Color Paper, and Ink Paper. These modes are suggested for various uses like reading, studying, and watching videos. To conclude, the user puts on the drawing glove and begins to sketch an eye on the tablet. This part of the demonstration focuses on testing the stylus's pressure sensitivity and the screen's paper-like texture. The entire sequence is accompanied by calm background music and the natural sounds of the unboxing and drawing process, focusing on the product's features and user experience."
+ },
+ {
+ "video_id": "PI-c11c5d1d-b6ef-4193-a8bb-08bc1047c9c4",
+ "path": "/workspace/benchmark/video/PI-c11c5d1d-b6ef-4193-a8bb-08bc1047c9c4.mp4",
+ "answer": "This video takes place in a car, with a person sitting in the driver's seat. The woman is wearing a light blue sweatshirt with a white Nike logo. The interior of the car is clearly visible, including the headrests and some of the windows, through which buildings and trees can be vaguely seen, indicating that it is daytime. The woman makes various gestures, such as touching her hair and adjusting her posture, suggesting she is having a conversation or reacting to something. The key audio content is conveyed through a text overlay, which reads: \"This is the man who sent his wife naked to the Grammys!!\" The text is white, with the capitalized words \"WIFE\" and \"GRAMMYS NAKED\" highlighting the shocking nature of the statement. The core theme of the video revolves around expressing astonishment or disbelief at a man's decision to let his wife attend the Grammy Awards naked. The overall tone is a mixture of surprise and criticism, focused on the controversial choice made by this man."
+ },
+ {
+ "video_id": "PI-7725aaf6-8af4-4055-ab9f-78984127be8c",
+ "path": "/workspace/benchmark/video/PI-7725aaf6-8af4-4055-ab9f-78984127be8c.mp4",
+ "answer": "This short video captures a unique point-of-view from a football field during what appears to be a pre-game ceremony. A young man with blonde and brown dreadlocks lies on his back on the green turf, filming himself with a phone. As the video progresses, a massive American flag is carried over him, held aloft by several people. The camera spins and moves, suggesting the man is being transported under the flag as it's unfurled across the field. In the background, the stadium stands are filled with spectators. Text overlaid on the screen reads, 'I bet your wondering how I got here.'"
+ },
+ {
+ "video_id": "PI-ca7cb456-8dc6-4902-a170-3deba8cf9514",
+ "path": "/workspace/benchmark/video/PI-ca7cb456-8dc6-4902-a170-3deba8cf9514.mp4",
+ "answer": "The video unfolds in a comfortable indoor environment. A woman in a bright red dress with white fringes sits beside window blinds, her long black hair draped over her shoulders. Visibly emotional, she wipes her tears with one hand while placing the other on her face. She wears multiple bracelets on her wrists, adding a personal touch to her appearance. The text overlay reads: \"My dad called me and said that he didn't know what to grab in the fire, so he took all my favorite stuffed toys from when I was a child so that I could give them to my children one day.\" This heartfelt message sets the emotional tone. The scene then switches to a split-screen view: in a small window at the upper right, the woman continues to react emotionally, while the larger window introduces a man in a black shirt, kneeling and holding a plush teddy bear toward the camera. Beige curtains and wooden floors serve as the backdrop, creating a warm, homelike atmosphere. The man's posture as he holds the teddy bear highlights the core theme of the video: sentimental value and family affection in adversity, the teddy bear being a precious item from the woman's childhood. The video perfectly captures a moment of emotional resonance, highlighting the lengths to which parents go to preserve memories for their children's future."
+ },
+ {
+ "video_id": "PI-072e79df-cc1f-47a0-92b7-840c6065cc92",
+ "path": "/workspace/benchmark/video/PI-072e79df-cc1f-47a0-92b7-840c6065cc92.mp4",
+ "answer": "The video features a woman dressed in a black coat and carrying a light blue bag standing outdoors on a masonry path, with a backdrop of shared bicycles, bare trees, and a cloudy sky. The camera then shifts to capture a multi-storey brick building, highlighting its numerous windows, red awnings at the ground floor entrance, and manicured bushes in front of the door. Several cars, including a prominent yellow taxi, are parked along the street. The camera pans the building from various angles, sometimes looking up to emphasize its height and other times focusing on its exterior details. A cyclist is also seen, adding to the vibrancy of the city streets. Complementing the visuals, the background music is a song with a relaxed and lively tune, enhancing the overall atmosphere of the scene and creating a harmonious blend of urban life and cheerful ambiance."
+ },
+ {
+ "video_id": "PI-2746c863-0246-4268-8534-cfcd7d58944f",
+ "path": "/workspace/benchmark/video/PI-2746c863-0246-4268-8534-cfcd7d58944f.mp4",
+ "answer": "The video begins with a view of a ceiling fan against a light-colored ceiling, creating a calm and ordinary atmosphere. The scene then transitions to a close-up of a mobile phone screen displaying a Facebook page. The page shows a follower count of 0, a following count of 0, and a total of 384.3K followers, with no likes indicated. The page's bio reads, \"We believe people can do more together, than alone,\" followed by a link to Instagram's Facebook page and an option to download the app. The screen then shows a red \"Follow\" button, which changes to \"Unblock\" after being clicked, indicating that the page was previously blocked. A message appears stating, \"You unblocked facebook. You're free to view other's content,\" confirming the action taken. Throughout these frames, a person wearing a blue headscarf is partially visible, with only the top of their head and shoulders in the frame, suggesting they are holding the phone or are in close proximity to it. Simultaneously, an audio overlay provides a narrative, urging viewers to block the Facebook page on TikTok before they start their regular Sunday scrolling. The speaker emphasizes the ease of blocking the page, demonstrating the action firsthand. The audio adds a layer of urgency and instruction to the visual content, guiding viewers on how to manage their social media interactions effectively. The video wraps up by highlighting the simplicity of unblocking a previously restricted page, while also offering a cautionary note on mindful social media usage."
+ },
+ {
+ "video_id": "PI-9c7bc85d-8411-4340-80fc-989ab6e1f8c3",
+ "path": "/workspace/benchmark/video/PI-9c7bc85d-8411-4340-80fc-989ab6e1f8c3.mp4",
+ "answer": "The video captures a lively scene on an urban sidewalk where a child, dressed in a gray jacket, black pants with red stripes, and red and black shoes, is energetically engaging with a microphone connected to a speaker on the ground. The child appears to be singing or speaking, occasionally gesturing and looking around, embodying a positive and spirited demeanor. The setting is enhanced by grass and trees lining the sidewalk, scattered fallen leaves, and a parked black car in the background. Nearby, tents and people suggest a bustling market or event. A person in a red shirt and black pants walks past carrying a white plastic bag, while another in a blue shirt and shorts heads in the opposite direction. The child continues to interact with the microphone, moving slightly along the sidewalk and glancing at passersby and the street. The backdrop of a brick building with large windows and white doors adds to the urban atmosphere. The audio complements this scene with a child’s lively tone, featuring phrases like “Come on, come on” and “the sweetest,” creating a casual and friendly vibe, while ambient music with an unknown rhythm enhances the overall cheerful mood."
+ },
+ {
+ "video_id": "PI-f1ffcdf5-9614-4375-acff-e9c7345ee8f8",
+ "path": "/workspace/benchmark/video/PI-f1ffcdf5-9614-4375-acff-e9c7345ee8f8.mp4",
+ "answer": "This video shows a young woman with brown hair creating a curtain-bangs hairstyle in her bedroom. She braids her bangs into symmetrical braids along her hairline and fastens the top section of hair with mint green claw clips to create a layered half-up bun. The scene feels warm and natural, with a melodious indie-pop English background track complementing the curtain-bangs styling inspiration and creating a comfortable, stylish visual experience. The video as a whole conveys relaxed, practical styling inspiration, reflecting a fusion of personal style and everyday aesthetics."
+ },
+ {
+ "video_id": "PI-44ae5c6c-7ec5-48a4-a9ff-a215f901623c",
+ "path": "/workspace/benchmark/video/PI-44ae5c6c-7ec5-48a4-a9ff-a215f901623c.mp4",
+ "answer": "The video showcases a dynamic and energetic performance by a female singer dressed in a striking red outfit, consisting of a shiny red jacket and matching pants with lace-up details. With long, curly red hair, she passionately sings into a microphone, moving energetically across the stage. The text overlay in the initial frames reads, 'Reply to mads's comment: SZA COOKED OH MY GOODNESS,' indicating that the performance is a response to a comment praising her skills. The stage is filled with dancers in coordinated outfits—some in red and others in white—performing synchronized dance moves around the singer, enhancing the visual appeal of the performance. The background reveals a large, enthusiastic audience, contributing to the vibrant atmosphere, while bright lighting and spotlights highlight the singer and dancers, creating a visually engaging spectacle. Accompanying the visuals, the audio features the singer's powerful and expressive voice, delivering an energetic and emotional song that emphasizes themes of personal empowerment or realization, aligning with the text overlay's implication of a remarkable performance. The video captures the essence of a high-energy concert, emphasizing the singer's powerful performance and the dynamic choreography of the dancers."
+ },
+ {
+ "video_id": "PI-c3d802fe-4ffc-40de-8dcb-8368e336cbcc",
+ "path": "/workspace/benchmark/video/PI-c3d802fe-4ffc-40de-8dcb-8368e336cbcc.mp4",
+ "answer": "The video begins with an intimate close-up of two individuals, both adorned with grills on their teeth and dressed in black attire, exuding a cheerful demeanor as they smile. The camera then pans up to reveal more of their upper bodies, showcasing the person on the left in a black hoodie and the person on the right in a black shirt with a gold chain, set against a backdrop of white walls and a window with blinds. As the camera continues to move, a third individual with dreadlocks, also in a black hoodie, enters the frame from the right, joining the other two in their joyful expressions. Meanwhile, the audio features a hip-hop beat with rap vocals, delivered by a male artist in a confident and energetic tone. The fast-paced lyrics focus on themes of success, confidence, and a lavish lifestyle, conveying a core message of self-assuredness and triumph. The video wraps up with all three individuals standing close together, smiling at the camera, perfectly complementing the audio's emphasis on success and camaraderie."
+ },
+ {
+ "video_id": "PI-680e9b0a-01a5-44c1-aeb5-6b03c845bf98",
+ "path": "/workspace/benchmark/video/PI-680e9b0a-01a5-44c1-aeb5-6b03c845bf98.mp4",
+ "answer": "The video features a person sitting inside a luxurious private jet, wearing a light-colored jacket with a black inner lining and a white shirt underneath. The jet's interior is adorned with cream-colored leather seats and wooden paneling on the walls, creating a warm and inviting atmosphere. The individual speaks directly to the camera, using various hand gestures to emphasize their points, and maintains a lively and engaging demeanor throughout. In the audio, a male speaker expresses excitement and happiness, revealing that he is on the private jet preparing to submit an official offer to acquire a platform, potentially becoming its new CEO. He enthusiastically announces a giveaway of $10,000 to five random followers and hints at the significant nature of the offer, though he is unable to disclose details at this time. The overall setting and the speaker's demeanor reflect a high-end, exclusive environment, filled with anticipation and energy."
+ },
+ {
+ "video_id": "PI-41b8b757-48c3-47dd-bdff-c54f8fd379a0",
+ "path": "/workspace/benchmark/video/PI-41b8b757-48c3-47dd-bdff-c54f8fd379a0.mp4",
+ "answer": "The video features a woman standing in a simple indoor setting with a white door and light-colored wall, sharing another woman's video on her smartphone. The video on her phone depicts a woman expressing surprise and humor over an awkward situation where someone's sibling discovered an intimate moment. The text on the video screen reads, \"I know I'm freaking his sis out of all the people who could've walked in and saw us and it had to be his own sister. I'm so dead.\" The woman sharing the video appears speechless and nervous, occasionally touching her head and waving her hands as she reacts. Meanwhile, the audio captures two female voices discussing the same awkward situation, expressing surprise and humor, and conveying the speaker's nervousness. One voice humorously laments, \"Like it is ridiculous that the bob house would even do that when there's so many other people you can collaborate with, it is okay if you want me to understand piper being in the bathhouse is wrong since she's seventeen, but the issue. She came over. To Sophie's place after we ran into her at the mall, it was literally just for fun. all we did was make Tik Tok s joy wow, are you crying? why verthis like the world's going to end you literally could have easily just sent me a text but instead you want to go cry on social media when you have zero context. What.\" The video and audio together highlight the humorous yet awkward situation, emphasizing the speakers' nervous amusement."
+ },
+ {
+ "video_id": "PI-5ef3186e-d4ca-413b-a520-2260faf35e0b",
+ "path": "/workspace/benchmark/video/PI-5ef3186e-d4ca-413b-a520-2260faf35e0b.mp4",
+ "answer": "The video begins with a group of people sitting on benches beside a track and field, watching a meet. The track features a yellow inner lane and a black outer lane, with a crowd of spectators in the stands in the background. A person in a green sports suit with 'COUGARS' written on the side stands near the track, appearing to be a competitor. The camera then pans to show more of the track and the audience. The runners sprint down the track, with the one in green leading the pack. The scene transitions to another part of the track and field, showing a hurdle setup with several people standing around, possibly preparing for another event. Throughout the video, the text 'BRO MOVE LIKE A TRACKHAWK' with an eagle emoji is displayed on the screen, emphasizing themes of speed and agility. An athlete in a green sports suit is seen sprinting towards the hurdles, leaping over them with agility and speed. The background features a green venue marked with white lines, with several other participants and spectators visible. The video wraps up with the runner in green successfully clearing the hurdles and continuing to sprint down the track. Meanwhile, the audio features a man shouting, expressing surprise or shock with repeated exclamations of 'Oh, my God' and the name 'Robert,' suggesting a reaction to an unexpected or astonishing event involving someone named Robert. The absence of background music makes the emotional tone raw and full of suspense, adding an extra layer of intensity to the visual narrative of speed and agility on the track."
+ },
+ {
+ "video_id": "PI-b2bdf2b7-803f-4381-8c6b-72ca7523f2f0",
+ "path": "/workspace/benchmark/video/PI-b2bdf2b7-803f-4381-8c6b-72ca7523f2f0.mp4",
+ "answer": "The video presents a cheerful and festive scene set in a cozy, Christmas-decorated living room. Two individuals, a male and a female, are engaged in a playful activity in front of a fireplace adorned with tall spiral Christmas trees and red stockings with white snowflake patterns. The male, dressed in a brown t-shirt, blue jeans, and white socks, stands with his hands clasped, while the female, wearing a black \"Sluggers Softball\" t-shirt, blue sweatpants, and light blue socks, uses hand gestures to guide him. Their interaction is light-hearted, with the female instructing the male to raise his right finger at various points, creating a sense of fun and spontaneity. The room's warm atmosphere is enhanced by a floor lamp and stacked storage bins, contributing to the casual and inviting setting. The audio complements this scene with multiple voices expressing a range of emotions; the male speaker conveys happiness, though his words are fragmented, while others express surprise with exclamations like \"Oh, my goodness!\" and similar outbursts. A third voice briefly adds to the mix, expressing an unknown emotion, all contributing to the overall joyful and surprising interactions within this festive environment."
+ },
+ {
+ "video_id": "PI-b38fbf05-d82c-46b3-ac13-92cbbd35dbad",
+ "path": "/workspace/benchmark/video/PI-b38fbf05-d82c-46b3-ac13-92cbbd35dbad.mp4",
+ "answer": "The video begins with a close-up view of a hand holding a small sketchbook, open to a page displaying an intricately shaded pencil drawing of a woman's face, focusing on details like the nose, lips, and chin. As the hand gently flips through the pages, more of the drawing is revealed, showcasing the completion of the woman's face, her long flowing hair, and upper body, all rendered with delicate lines and consistent artistic style. The background remains blurred, keeping the focus on the evolving sketch. Meanwhile, the audio features background music performed by a male singer with a gentle and melancholic voice, presenting a soft, slow-paced melody that evokes introspection and nostalgia. This contemplative and emotional tone complements the visual journey of the sketchbook, enhancing the overall atmosphere of meticulous artistry and deep reflection. The video wraps up with a final view of the completed sketch, highlighting the artist's skill and the harmonious blend of visual and auditory elements."
+ },
+ {
+ "video_id": "PI-1ca596d3-885f-4c90-aa07-98eb567ecfa1",
+ "path": "/workspace/benchmark/video/PI-1ca596d3-885f-4c90-aa07-98eb567ecfa1.mp4",
+ "answer": "The video captures a scene in a school security area, where students are undergoing security checks, with a sign in the background highlighting the school's dedication to a positive, safe, and student-centered environment. A man in a black hoodie and backpack is seen sorting out his belongings, while other students follow the instructions of security personnel, who are dressed in black jackets, to remove laptops and other electronics from their bags and lift them through security checks. The audio, featuring a firm and authoritative male voice likely from a security announcement or instruction in a public place, complements the visuals by instructing people to take their laptops and electronics out of their bags, remove hoodies if worn, and emphasizes compliance with these instructions. The core message of both the audio and video revolves around security checks, ensuring that all necessary items are properly presented for inspection, and students adhering to safety protocols, such as raising their hands and removing hoodies if required. This integration of audio and video content effectively portrays the importance of security measures in maintaining a safe school environment."
+ },
+ {
+ "video_id": "PI-7d1534d2-100f-4c4d-a777-a368b1f7ad8d",
+ "path": "/workspace/benchmark/video/PI-7d1534d2-100f-4c4d-a777-a368b1f7ad8d.mp4",
+ "answer": "The video unfolds in a cozy living room, with a person standing and holding a mobile phone, expressing concerns about potential stalkers. This person is wearing a black cotton-padded jacket with white stripes on the sleeves, black shorts and black sports shoes. The background is a kitchen area, showcasing wooden cabinets and various items on the countertop. Text message: \"I think I'm being followed!\" It appears on the screen and sets the tone for the unfolding narrative. This person continued to speak, revealing that \"This girl IRIS has been writing to me\" and \"Telling me that I'd better go and see her?\" Then the scene switches to two people sitting on a beige sofa. One of them is wearing a white T-shirt with blue text and jeans, while the other is wearing a pink shirt and dark trousers. Do you know her? What does she look like? \"Appear, then come,\" but she said that if I stopped answering her, she would come here! The person in the pink shirt thoughtfully made a gesture, and then words like \"This girl's name is IRIS\" and \"Maybe she spelled her name\" appeared. The scene switches to the person in the black jacket. His reaction is very surprised. The text message is \"I.R.S.?\" The words \"\" appear in large characters. The person in the black jacket went on to say, \"Yes, she does!\" It seemed relieved as the words \"OHHH\" appeared on the screen. The video ends with a close-up of the two sitting on the sofa. The core theme of the video revolves around this person's concern about receiving a letter from someone named Iris, who threatens that if they stop replying, they will come to visit, and humorously realizes that \"Iris\" might be \"IRS\"."
+ },
+ {
+ "video_id": "PI-a4e51fcc-55f5-46fd-904e-e0ae50fd6905",
+ "path": "/workspace/benchmark/video/PI-a4e51fcc-55f5-46fd-904e-e0ae50fd6905.mp4",
+ "answer": "The video features a heartwarming scene in a cozy living room, where a man dressed in a black t-shirt and green pants sits on a light gray carpet with two golden retriever dogs and a baby in a green outfit with a white 'C' on the chest. They are all engaged with a black laptop on the floor, with the man leaning in to show something on the screen, capturing a moment of family bonding and the dogs' curiosity. The background includes a light gray sofa with white and gray striped pillows, a map on the wall, and decorative items, creating a warm and engaging atmosphere. The scene then transitions to a close-up of a framed black-and-white photograph of two paws touching, symbolizing connection and affection, emphasizing its emotional significance. Complementing this visual narrative, the background music (BGM) is instrumental with a light and cheerful melody, enhancing the overall mood without overpowering the theme, and creating a pleasant and engaging experience that perfectly aligns with the heartwarming visual content."
+ },
+ {
+ "video_id": "PI-b4d0c74b-85d5-4f7a-87a0-56e877dd02d6",
+ "path": "/workspace/benchmark/video/PI-b4d0c74b-85d5-4f7a-87a0-56e877dd02d6.mp4",
+ "answer": "The video captures a dramatic moment during a basketball game between Indiana and Northwestern, where a player in a white jersey, number 13, is seen running with the ball under tight defense from a player in a red jersey. As the player in white attempts a shot, he is fouled and falls hard onto the court, making direct contact with the floor. The scene shifts to the injured player sitting near the basket, holding his head in pain, with teammates and staff rushing to assist him. Meanwhile, the audio describes a concerning situation where someone is injured, with the speaker's urgent and concerned tone highlighting the severity of the fall, noting that the injured person's face hit the floor, resulting in bleeding. The scoreboard initially shows Indiana leading 77-70 with 22.5 seconds left in the 2nd half, but updates to Northwestern leading 79-70 with 16.8 seconds remaining. The text overlay \"The exclamation mark\" emphasizes the dramatic and painful nature of the incident, concluding with the player still holding his head, indicating a potential injury."
+ },
+ {
+ "video_id": "PI-f56362c1-c4a9-4020-9530-c242536d64d0",
+ "path": "/workspace/benchmark/video/PI-f56362c1-c4a9-4020-9530-c242536d64d0.mp4",
+ "answer": "On a New York City sidewalk, an interviewer approaches two elderly women, Christine and Carol, who are walking hand-in-hand. Dressed warmly in winter coats and hats, they pause their stroll to share the story of how their relationship began. Christine explains that they met by chance in 1977. Both were separately heading to a women's NCAA final with friends and ended up sharing a taxi cab. Pointing to a nearby street corner, Christine identifies it as the exact spot where they picked Carol up. They have been together for nearly 48 years since that day. When asked about their favorite thing about each other, Christine fondly describes her partner as \"kind and patient,\" to which Carol humorously replies, \"nothing right now.\" The brief but charming interaction offers a heartwarming glimpse into their long-lasting love story."
+ },
+ {
+ "video_id": "PI-01c09e77-01e5-4bbe-8d85-d86ca28b22b3",
+ "path": "/workspace/benchmark/video/PI-01c09e77-01e5-4bbe-8d85-d86ca28b22b3.mp4",
+ "answer": "In this charming video and audio combination, the audio serves as a gentle backdrop to the visual narrative. The audio is a track of soothing background music performed by a female singer, its calming melody setting a serene and cozy tone for the scene that unfolds.The video showcases a girl enjoying a delightful date night at home. In the first four seconds, we see her in the indoor entrance hall. She kneels on the floor, opens the white door, and retrieves a brown \"TRADER JOE'S\" bag from outside. After closing the door, she turns to flash a warm smile at the camera, immediately inviting the viewer into her intimate moment. From the 4 - second mark to the 13 - second mark, the scene transitions to the living room. Here, the girl sits comfortably on the floor, skillfully using the remote control to operate the TV. With her left hand, she holds a vibrant yellow, green, and orange chip package bag, from which she takes chips and feeds them into her mouth with her right hand. Throughout this sequence, the text \"date nights in ❤\" is prominently displayed in the middle of the screen, reinforcing the theme of a cozy and romantic evening at home. The soothing background music perfectly complements the girl's relaxed and contented demeanor, enhancing the overall charm of this simple yet heart - warming date night scenario."
+ },
+ {
+ "video_id": "PI-46046371-2b7e-47f4-a564-b8b501bfc040",
+ "path": "/workspace/benchmark/video/PI-46046371-2b7e-47f4-a564-b8b501bfc040.mp4",
+ "answer": "In the video, a person is seen sitting at a table in a room bathed in blue lighting, wearing a light gray t-shirt with 'Miami' and 'United States of America' printed on it. The individual appears to be deeply engaged in a task, tapping on the table and rubbing a white powdery substance with a pen, all while muttering something under his breath. The atmosphere suggests he is thoroughly enjoying himself, with a sense of happiness and excitement evident in his actions. Meanwhile, the accompanying audio captures a man singing in an emotional tone, counting \"one, two, three, four, five, six, seven, eight\" and continuing the sequence. His voice conveys a sense of joy and relaxation, which aligns with the visual depiction of the person at the table, who seems to be lost in a pleasurable and stimulating activity. The combination of the visual and auditory elements creates a vivid portrayal of someone immersed in a moment of contentment and engagement."
+ },
+ {
+ "video_id": "PI-5805bb27-f613-41ee-a577-2877e9185430",
+ "path": "/workspace/benchmark/video/PI-5805bb27-f613-41ee-a577-2877e9185430.mp4",
+ "answer": "The audio track captures a humorous dialogue where a male speaker playfully jests about not being a \"toy monkey\" and expresses his reluctance to participate in creating viral videos for someone with few followers, while the other speaker seems to encourage or respond to his jests, creating a lighthearted exchange centered around the idea of not wanting to engage in viral content creation. The video, on the other hand, unfolds in a festively decorated room with a Christmas tree and green lighting, where a person dressed in a green furry costume, complete with a red Santa hat and red boots, steps out from behind a red curtain-adorned doorway. Another person, wearing a blue and red uniform, stands nearby, observing the scene. As the person in the green costume gestures, on-screen text appears, first saying \"what is up my guy\" and then changing to \"okay you know what I regret saying anything,\" indicating a shift in the person's mood or reaction. The person in the green costume continues to gesture near the doorway, while the person in the uniform remains in the background, with the festive decorations and the partially visible observer in green shirt and beige pants adding to the lively and somewhat comical atmosphere of the interaction. Although the audio and video content seem unrelated at first glance, both capture moments of playful interaction and humor, with the audio focusing on a verbal exchange about viral content and the video showcasing a visual comedy routine in a festive setting."
+ },
+ {
+ "video_id": "PI-b0bfaf5f-1abd-43d6-a027-b665d56d1197",
+ "path": "/workspace/benchmark/video/PI-b0bfaf5f-1abd-43d6-a027-b665d56d1197.mp4",
+ "answer": "There is a person in the room in the video, which is the main scene. There is a wire rack in the room, which is filled with all kinds of things, including shoes and folded clothes. This person is wearing a black hat, black sunglasses and a dark jacket. He held the top shelf with one hand and seemed to be speaking directly to the camera. The audio content includes monologues, in which the person seems to be encouraging or motivating themselves, possibly in a humorous or energetic tone. The key phrases in the audio include \"This is your second semester, damn it, do well in your grades\", suggesting a focus on academic performance and self-motivation. This person also mentioned, \"It's better to bet on this and scare yourself to sleep,\" which adds a layer of personal challenge and determination. There is no visible text in the video. The core theme revolves around self-encouragement and the motivation for success, especially in an academic context. The character's expressive gestures and direct speech to the camera create an engaging dynamic atmosphere, emphasizing his determination to overcome challenges and achieve his goals."
+ },
+ {
+ "video_id": "PI-c0ad261c-9ba7-4164-ada5-f6f3f062a271",
+ "path": "/workspace/benchmark/video/PI-c0ad261c-9ba7-4164-ada5-f6f3f062a271.mp4",
+ "answer": "The video captures a moment of artistic creation, featuring a person with long hair, dressed in a black top and large headphones, immersed in a recording session. Positioned indoors against a plain, light-colored background, the setting emphasizes the subject's focus and the significance of the recording process. The individual appears deeply engaged, with subtle mouth movements suggesting they are singing or speaking into a microphone. This scene is complemented by the audio, a vocal recording track where the female speaker conveys a personal or emotional theme through her singing. The track highlights her focused and expressive mood, enhancing the atmosphere of concentration and creativity. Together, the video and audio create a cohesive portrayal of the artistic process, underscoring the emotional expression and dedication involved in creating music."
+ },
+ {
+ "video_id": "PI-dd522cc0-adee-4d75-976b-173b2aa17faa",
+ "path": "/workspace/benchmark/video/PI-dd522cc0-adee-4d75-976b-173b2aa17faa.mp4",
+ "answer": "This heartwarming video captures a tender moment between a mother and her baby. From a first-person perspective, a hand is seen opening a door to a dimly lit nursery. The room contains a blue and white checkered rug, and the camera moves toward a wooden crib where a baby is sleeping peacefully on its stomach. The mother gently reaches in and strokes her baby's rosy cheek, whispering to him. At first, the baby only stirs slightly in his sleep. However, the moment she says his name, he wakes up. The baby lifts his head, his eyes open wide, and a look of recognition is followed by a large, joyful smile that lights up his face. He continues to gaze at his mother with pure delight, gurgling happily as she speaks to him. An on-screen caption explains that the mother had been away for a few days and this was the baby's reaction to finally realizing she was home."
+ },
+ {
+ "video_id": "PI-3fb234d5-a48e-4069-b521-225433820b3f",
+ "path": "/workspace/benchmark/video/PI-3fb234d5-a48e-4069-b521-225433820b3f.mp4",
+ "answer": "Four individuals stand side-by-side in a brightly lit studio, presenting a unified and stylish look. The setting is minimalist, featuring a seamless white backdrop and a single studio light visible in the upper corner, indicating a professional photoshoot or video recording is in progress. All four are dressed in chic black and white attire, with variations that give each person a distinct character. Their outfits consist of white collared shirts paired with black trousers, and either a black vest or a blazer. The person on the far right stands out slightly in a pinstripe suit. Throughout the short video, they make subtle adjustments to their clothing—fixing a collar, straightening a vest, or adjusting a jacket—as they hold their poses for the camera. The atmosphere is cool and composed, set to the sound of an upbeat pop song playing in the background."
+ },
+ {
+ "video_id": "PI-a968a06a-57e9-44b7-b600-7ffb93a6b1a1",
+ "path": "/workspace/benchmark/video/PI-a968a06a-57e9-44b7-b600-7ffb93a6b1a1.mp4",
+ "answer": "The video begins with a close-up of a hand reaching for a door handle, accompanied by a humorous text overlay: 'POV ur sister just gave birth and the first thing she tells u is 'why are u wearing my sweater''. The scene transitions to a hospital room viewed through vertical blinds, where a woman in a white long-sleeve shirt and hospital wristband sits in a bed, tenderly holding her newborn baby wrapped in a blanket. The room is equipped with medical devices, including an IV stand and a monitor. As the camera moves closer, it captures the intimate details of the baby's tiny features and the woman's affectionate gaze. The video continues to focus on the woman, now wearing a black long-sleeve shirt over the white one, cradling the baby dressed in a dark outfit with a white hat. She gently touches and looks at the infant, with the hospital room's medical equipment and a bedside table holding items like a pink tumbler and a gray cup visible in the background. The text overlay remains on the screen throughout, adding a touch of humor to the scene. Meanwhile, the audio track features a female speaker with a warm and affectionate tone, expressing amazement and tenderness towards the newborn with phrases like 'My god, nice little setup here,' 'My god, he's so tiny,' and 'Oh, you feel sweet.' This combination of visual and auditory elements creates an intimate and emotional atmosphere, capturing a precious moment between the new mother and her baby while highlighting the speaker's emotional response of awe and endearment."
+ },
+ {
+ "video_id": "PI-7c5804b6-a3e7-4e70-bef4-ffd3f410595c",
+ "path": "/workspace/benchmark/video/PI-7c5804b6-a3e7-4e70-bef4-ffd3f410595c.mp4",
+ "answer": "The video unfolds in a room with light-colored walls and a ceiling fan, setting a casual and relaxed atmosphere. It begins with two women dressed in black outfits standing near a doorway, through which a warmly lit room with a bed is visible. A person in a black hoodie and sweatpants approaches the doorway, and the two women start a synchronized dance routine, moving their arms and swaying to the rhythm. The person in the black hoodie joins in, taking the lead with various dance moves such as raising their hands, placing a hand on their chest, and making expressive gestures, while the two women follow along, mirroring the lead dancer's movements. The performance continues with the person in the black hoodie center stage, pointing forward and making more expressive gestures, with the two women maintaining their synchronized swaying and arm gestures near the doorway. The consistent background of the warmly lit room and bed adds to the cozy ambiance. Meanwhile, the background music features a female singer with a soulful and emotional tone, singing about a moment of reflection and prayer, followed by waking up and preparing for a new beginning or journey. The core message of the video revolves around themes of renewal and readiness for what lies ahead, captured through the lively and coordinated dance performance that showcases the dancers' energy and synchronization."
+ },
+ {
+ "video_id": "PI-ee729c2b-2c89-4344-942d-a5627f432872",
+ "path": "/workspace/benchmark/video/PI-ee729c2b-2c89-4344-942d-a5627f432872.mp4",
+ "answer": "On a bright and sunny winter day, two young men collaborate to push a giant snowball over the railing of a bridge. The camera focuses on their efforts as they heave the massive ball of snow over the side. It then follows the snowball's trajectory as it falls through the air and plunges into the murky water below, creating a significant splash and a perfectly circular ripple effect on the surface. The scene then cuts back to the two men on the snowy bridge, who are seen laughing heartily and celebrating their successful endeavor. An overlay text reads, 'We did this for the boys,'"
+ },
+ {
+ "video_id": "PI-f6d41c38-c6aa-4ee6-bef8-442075f70394",
+ "path": "/workspace/benchmark/video/PI-f6d41c38-c6aa-4ee6-bef8-442075f70394.mp4",
+ "answer": "This video, set in a Costco warehouse, showcases the features of a sleeper sectional sofa. A woman is seen demonstrating the product's functionality. Initially, the gray fabric sectional is in its standard seating configuration. The woman then pulls a section from beneath the main cushions, which easily lifts up to form a large, flat sleeper surface, effectively turning the sofa into a bed. Next, she moves to the chaise portion of the sectional and lifts the top cushion to reveal a spacious storage compartment inside. A close-up shot of the product sign identifies it as the 'Coddle Aria Fabric Sleeper Sectional,' priced at $1549.99. The sign details its key features, including a reversible storage chaise, a pull-out sleeper, four power outlets, and a combination of two USB and two USB-C ports for charging devices. "
+ },
+ {
+ "video_id": "PI-d535ffb6-810c-4679-909f-7ab0e49dd57b",
+ "path": "/workspace/benchmark/video/PI-d535ffb6-810c-4679-909f-7ab0e49dd57b.mp4",
+ "answer": "In a cozy living room setting, the video captures a lively and playful interaction among three individuals. It begins with a person in a white T-shirt and plaid pajama pants standing on the stairs, holding up a pair of pants in a jovial manner, while another person in a black T-shirt and gray pants lies face down on the carpet, wearing a black helmet. Soon, the person in the black T-shirt rises and walks toward the camera as a third person, clad in full ski gear including a helmet and goggles, appears in the foreground, adjusting their equipment. A new character, wearing a white T-shirt and a fluffy hat, then enters, standing in the foreground with the skier behind them, making various expressions and gestures. This person then interacts with the individual on the stairs before jumping onto the sofa, while the person in the black T-shirt stands beside the sofa, arms raised in apparent cheer. Finally, all three stand together, performing exaggerated actions like covering their ears and jumping, filling the scene with joy and energy. Meanwhile, an audio track plays in the background, where a male speaker conveys a sense of frustration and determination, discussing reaching a point where enough is enough, accompanied by steady, tense music that complements the speaker's emotional state. The overall atmosphere remains light-hearted, showcasing the fun interactions among friends or family."
+ },
+ {
+ "video_id": "PI-79ad677c-32e4-4971-a527-9cb16d97ef72",
+ "path": "/workspace/benchmark/video/PI-79ad677c-32e4-4971-a527-9cb16d97ef72.mp4",
+ "answer": "The video begins with a modern kitchen, which has white cabinets and tiled floors. A person wearing a black vest with patterns and blue shorts walked towards another person wearing a grey sweatshirt and grey sweatpants. The latter was sorting out items on the counter, including a large bag of yellow Doritos and several packs of bottled water. The text above reads \"POV: When he spanks your bottom vs. You spank his bottom\", indicating that this is an interesting interaction. The person in the black vest came from behind and playfully spanked the person in the grey hoodie. Then the scene transitions to a spacious living room with large Windows and a grey combination sofa. In the video, the person wearing a grey hoodie was lifted into the air and then landed on the grey carpet in front of the sofa. The person in the black vest makes gestures with his hand, possibly explaining or apologizing for his behavior. At the end of the video, the person in the black vest was still standing, and the person in the grey hoodie was still standing on the ground, looking up at them. There was a voice in the audio saying, \"What's wrong with you? You sneaked behind me. I thought you were an intruder.\" This added an atmosphere to the interesting and intense interaction between the two."
+ },
+ {
+ "video_id": "PI-cd46db80-3583-45d6-9b31-1a492be802de",
+ "path": "/workspace/benchmark/video/PI-cd46db80-3583-45d6-9b31-1a492be802de.mp4",
+ "answer": "The video, complemented by a brief and unclear audio statement from a male speaker with an indiscernible emotional tone, begins with a woman sitting cross-legged on a Pilates machine, taking a selfie in a mirror, with a golden retriever standing beside her. The scene then shifts to a green wall featuring a cart that holds a tablet and towels. The dog is seen walking around a room that has a patterned rug, dumbbells, and a bed. Following this, the woman's feet, clad in white socks and black leggings, are shown walking on a wooden floor and a checkered rug. She approaches a glass door, steps outside onto a concrete surface, and continues walking on a sidewalk. The scene transitions to her opening a door equipped with a keypad lock, leading into a well-equipped home gym. Inside, a man in a blue hoodie, black shorts, and white leggings is seen lifting a barbell. The video concludes with him continuing his weightlifting exercise in the gym, while the unclear audio statement adds an enigmatic element to the overall content."
+ },
+ {
+ "video_id": "PI-b02f61d6-bd61-462b-aebf-7c95ab4a6a4f",
+ "path": "/workspace/benchmark/video/PI-b02f61d6-bd61-462b-aebf-7c95ab4a6a4f.mp4",
+ "answer": "This video documents a casual exchange between two individuals in a retail parking lot on a rainy, overcast day. The scene begins with a man in a red hooded sweatshirt standing near a dark SUV. The vehicle's license plate, displaying \"WASHINGTON CMJ5156,\" is briefly visible. As the interaction progresses, the man opens the rear of the SUV, revealing a cargo area containing a multipack of bottled water and other items. He then takes out two shoeboxes. One box is black with a red lid, featuring \"JORDAN\" branding, and contains a pair of red and white sneakers, identified by an on-screen overlay as \"Jordan 4 Retro\" valued at $350. The other box, entirely black, holds a pair of red and black sneakers, also identified by an overlay as \"Jordan 12 Retro\" with a value of $300. Following the inspection of the footwear, a sum of cash, consisting of U.S. dollar bills, is handed over. The exchange concludes, and the man in the hoodie closes the SUV's trunk. The camera then captures a grey vehicle with gold-colored wheels parked nearby. The dialogue throughout the video is brief and pertains directly to the transaction, with phrases like \"How you doing, bro?\" and expressions of satisfaction regarding the items and payment."
+ },
+ {
+ "video_id": "PI-86f221e8-bcad-4b9e-8ba2-4aa5f50e0096",
+ "path": "/workspace/benchmark/video/PI-86f221e8-bcad-4b9e-8ba2-4aa5f50e0096.mp4",
+ "answer": "The video features a serene beach scene with gentle waves rolling onto the shore, creating a peaceful and relaxing atmosphere. The ocean water, a beautiful gradient of light blue and green, reflects the clear sky above, while the sand appears smooth and damp, glistening under the sunlight. A person wearing a white shirt lies on the wet sand near the water's edge, mostly stationary, enjoying the soothing touch of the waves. Complementing this tranquil visual is a soft and gentle background music track, which enhances the serene mood of the scene. The music's melody is calming, contributing to the overall sense of tranquility and calmness, making the viewer feel as though they are experiencing a moment of peace by the sea."
+ },
+ {
+ "video_id": "PI-97727df2-a7b5-480d-bb69-57e47d907983",
+ "path": "/workspace/benchmark/video/PI-97727df2-a7b5-480d-bb69-57e47d907983.mp4",
+ "answer": "The video features a cheerful and optimistic woman named Chitaqua standing beside a black Kia car adorned with red, white, and blue bows, parked in front of the 'Ride Time' dealership. The dealership's entrance is marked by a large sign, and its windows display various promotional papers. Chitaqua, dressed in a black jacket, beige trousers, and white sneakers, introduces herself with enthusiasm, as indicated by the text overlays: 'HEY MY NAME IS' followed by 'CHITAQUA AND I'M DOWN HERE.' She promotes the dealership's services, mentioning financial aspects such as a 500 down payment, bad credit, goodcredit,and all credittypes,with over lays like′500 DOWN BAD CREDIT' and 'GOOD CREDIT ALL CREDIT.' The lively atmosphere is enhanced by cheerful background music, reflecting her happy demeanor. She gestures towards the car and the dealership, encouraging viewers to 'COME SEE THE MAN' and assuring them with 'PROMISE YOU.' The video concludes with Chitaqua shaking hands with someone off-camera, accompanied by a pink heart emoji and a smiling face emoji next to the 'THANK YOU HAHA' text, along with the dealership's phone number, 219-888-9954, displayed at the bottom of the screen."
+ },
+ {
+ "video_id": "PI-3555026c-7c50-4589-935d-ef04b6495822",
+ "path": "/workspace/benchmark/video/PI-3555026c-7c50-4589-935d-ef04b6495822.mp4",
+ "answer": "The video features Mike from Ride Time on US30 introducing Marcia and her new 2020 Honda Accord in a snowy parking lot. The red car is prominently displayed, with a building in the background advertising a 'SALE' and a 'Capital One Pre-Approval Sale Event.' Mike highlights the car's low mileage and affordable monthly payments, while Marcia expresses her excitement and satisfaction with the purchase. Text overlays provide contact information and the website 'Rtautofinance.com,' along with playful comments about the fun and wholesome experience of buying the car and warming up in it. The video emphasizes the positive and enjoyable aspects of the car-buying process."
+ },
+ {
+ "video_id": "PI-4b9b9a7b-d686-4873-86ba-00a67bb1d66e",
+ "path": "/workspace/benchmark/video/PI-4b9b9a7b-d686-4873-86ba-00a67bb1d66e.mp4",
+ "answer": "The video humorously depicts a day in the life of a man treated like a baby by a woman, presumably in their house. The narration begins with the 'baby' waking up in bed, wrapped in a blanket, and being gently awakened. He is then taken to the kitchen, showing reluctance, and subsequently enjoys a banana breakfast. After that, the 'baby' is brought to the living room to watch his favorite show, eliciting smiles and laughter. Then, the woman gives him a shave, which leads to a tantrum. Ultimately, the 'baby' is sent back to bed for the rest of the day, still wrapped in his blanket. The entire sequence has a lighthearted tone, highlighting the comedic dynamics between the two individuals."
+ },
+ {
+ "video_id": "PI-044374d3-5922-42ac-81f6-b17c8c0858d8",
+ "path": "/workspace/benchmark/video/PI-044374d3-5922-42ac-81f6-b17c8c0858d8.mp4",
+ "answer": "In the video, a person is lying on a beige sofa, wearing a white T-shirt with a red logo on the sleeves and a grey hat. This person has a tattoo on his left arm, surrounded by all kinds of cosmetics, including a pink makeup plate, lip gloss and a small mirror, all of which are placed on a light-colored blanket. This person looks very relaxed. One arm is resting on the sofa and the other arm is slightly raised. The background includes light-colored walls with vertical panelling. This person said in a naughty and slightly annoyed tone, \"Come on, baby. Let's go, baby. Smile.\" Say, cheese, I just want to see nature. Yes, but I asked him to do the makeup with me. This is without makeup. So, what I want to do is to see nature and then say goodbye. This video humorously captures people's relaxed interactions, which might be with pets or others, as well as their teasing laments about being forced to put on makeup instead of watching nature. The entire theme revolves around a laid-back and humorous moment, shared in a relaxed family setting, highlighting their casual and humorous demeanor during their makeup adventure."
+ },
+ {
+ "video_id": "PI-2e64c485-142c-49ac-9271-eac59285424b",
+ "path": "/workspace/benchmark/video/PI-2e64c485-142c-49ac-9271-eac59285424b.mp4",
+ "answer": "The audio track is a motivational piece featuring a female voice with an encouraging and confident tone, reminding the audience that people often talk about them because they've lost the privilege of direct communication. She urges listeners to persist in their endeavors, thereby providing those individuals with topics to discuss. The video complements this message by depicting a person standing outdoors in the city, with a brick building featuring a fire escape and some green plants in the background. The individual, dressed in a dark blue shirt and adorned with layers of necklaces, including a heart-shaped pendant, embodies the theme of empowerment and confidence. Text overlays in the video echo the audio's motivational message, stating, \"Just a reminder, people often talk about you because they have lost the right to talk to you. So keep doing what you are doing and give them some topics.\" This visual and auditory combination effectively encourages viewers to continue pursuing their goals and activities, suggesting that the chatter of others might stem from their inability to communicate directly, and that by staying true to themselves, they can provide interesting topics for discussion. The overall mood is one of empowerment, urging the audience to embrace their journey with confidence."
+ },
+ {
+ "video_id": "PI-d8ff4e4d-926f-4797-b55b-e479aa9e2293",
+ "path": "/workspace/benchmark/video/PI-d8ff4e4d-926f-4797-b55b-e479aa9e2293.mp4",
+ "answer": "A young woman with blonde hair and pink hair clips is featured in a close-up shot within what appears to be a bedroom setting. She wears a white top and multiple necklaces, including one with a large, colorful cross pendant. Initially, she displays a pouty expression, looking around with a seemingly contemplative or slightly displeased demeanor. An on-screen text overlay reads \"But he's weird\". As the video progresses, her facial expression subtly shifts, transitioning into a cheerful smile, suggesting a change in mood or a playful reaction to the accompanying audio. The background includes a mirror with string lights and various items on a dresser, indicating a personal space. The video captures a brief moment of emotional transition, set against a musical backdrop."
+ },
+ {
+ "video_id": "PI-0c5d99e2-8edb-4375-8253-c85b6cdc2966",
+ "path": "/workspace/benchmark/video/PI-0c5d99e2-8edb-4375-8253-c85b6cdc2966.mp4",
+ "answer": "The video presents a lighthearted and satirical scene, where a man wearing glasses holds a smartphone against a background that alternates between a plain green screen and a white screen with a green border. With a red lipstick-like substance on his lips and an exaggerated smile, he expresses a happy mood, mentioning that nothing is going on between him and his girlfriend, contributing to the video's humorous tone. The text \"@SILLYBALOO\" displayed on the right of the smartphone possibly represents a username, brand, or social media handle, further emphasizing the theme centered on technology usage or social media behavior. Meanwhile, the audio track features a male voice speaking calmly, starting with \"Take a look at my girlfriend, she's the only one I got not mu,\" leaving the sentence incomplete, which may suggest underlying emotions or a story context. While the video's cheerful and relaxed emotional atmosphere is dominated by the man's playful interaction with the smartphone and his humorous declaration about his relationship, the incomplete sentence in the audio adds an intriguing layer, hinting at a deeper narrative that contrasts with or complements the video's lightheartedness. The combination of these elements creates an engaging and thought-provoking piece that explores themes of relationships, technology, and social media in a humorous yet potentially meaningful way."
+ },
+ {
+ "video_id": "PI-469d1384-e3bb-468b-bc0f-38268aaf0e4e",
+ "path": "/workspace/benchmark/video/PI-469d1384-e3bb-468b-bc0f-38268aaf0e4e.mp4",
+ "answer": "This video shows a dynamic vocal training session in which a coach teaches a student to sing the signature song \"All That Jazz\". The venue is a room decorated with patterned carpets and beige walls, with a speaker on a stand in the background, suggesting a practice or performance space. The text at the top of the video reads \"Vocal Coach teaches....\", and \"All That Jazz\" is written in a stylized font behind them, clearly indicating which song is being taught. The student, wearing a black coat, a blue tie and a badge, stands beside the vocal coach, who wears a green coat and black trousers. The text at the bottom of the video reads \"She held the note for too long!! Just wait!\", implying that she will deliver an impressive vocal performance. The coach demonstrates various gestures and movements to guide the student through the song. The student follows, imitating the coach's movements while holding a piece of paper, which might be a musical score. The coach's expressive hand movements and body language emphasize the teaching process, creating an engaging and dynamic learning environment. A recurring phrase, \"That That Jess Jess Jess Jess Jess Jess Jess Jess Je\", appears in the audio; this seems to be part of the song lyrics or perhaps a vocal exercise, adding a lively atmosphere to the scene. The video captures the essence of an interactive and passionate vocal tutoring session, highlighting the coach's dedication and the student's eagerness to learn."
+ },
+ {
+ "video_id": "PI-a59931bb-b924-4742-a247-4dfff8509320",
+ "path": "/workspace/benchmark/video/PI-a59931bb-b924-4742-a247-4dfff8509320.mp4",
+ "answer": "The video features a woman sitting indoors in a cozy room with shelves in the background, filled with various items. She is dressed in a light zip-up hoodie over a dark top, and the warm lighting enhances the intimate atmosphere. The woman speaks directly to the camera, her direct gaze and slight movements suggesting a personal confession. Text overlays reveal her thoughts: 'since this app is going away there's one more secret I feel I have to share w you' and 'all of my interviews were scripted.' Her hand gestures near her face emphasize her points, creating an introspective and candid mood. This personal revelation is contrasted by an audio overlay featuring two male speakers. The first expresses a somber, existential reflection on mortality and a secret he feels compelled to share, while the second dismissively states his dislike for the movie 'The Godfather,' repeating this sentiment. The conversation highlights a mix of serious, anxious tones and a dismissive attitude towards a popular film, adding a layer of complexity to the scene."
+ },
+ {
+ "video_id": "PI-01016f64-0069-4259-bfa9-11cf6e3673d6",
+ "path": "/workspace/benchmark/video/PI-01016f64-0069-4259-bfa9-11cf6e3673d6.mp4",
+ "answer": "In the video, a person is wearing a light green T-shirt with the words \"CHEER CAMP\" printed in bold on the front. The environment is a comfortable room with wooden floors, ceiling fans, and beds decorated with various plush animals, creating a warm and personal atmosphere. A television set can be seen on a small table in the background. The text at the top of the video reads: \"Have you heard that Charlie Damilio is pregnant?\" This indicates that the person is responding to a hot topic on the \"For You Page\" (FYP) of a social media platform. The person seems to react humorously or skeptically to the pregnancy rumor, as can be seen from their body language and the context provided by the text. The core theme of the video revolves around the impact of social media trends on personal expression, highlighting how individuals can participate in and respond to the popular news and rumors spreading on these platforms."
+ },
+ {
+ "video_id": "PI-1415c197-8445-4df8-a26c-27282ccd1eca",
+ "path": "/workspace/benchmark/video/PI-1415c197-8445-4df8-a26c-27282ccd1eca.mp4",
+ "answer": "The video showcases a light-hearted and affectionate interaction between two individuals seated on a patterned sofa with a light-colored backdrop. The person on the left, dressed in a grey sweatshirt and several necklaces, makes gestures while speaking, suggesting in a casual and humorous tone that Jesse Eisenberg should play him in a film. The person on the right, wearing a beige fur coat and black stockings and holding a piece of paper, initially appears engrossed in a newspaper but then joins in the conversation, with the text \"Dead or alive\" and \"Either like\" appearing on screen, adding a playful twist. As the discussion progresses, the conversation becomes increasingly humorous, with the person on the left looking preoccupied and the person on the right eventually covering her face with the paper, providing a comical ending to their banter. The audio from the video further complements this scene, with the man jokingly insisting on his choice of actor and the woman eventually recommending Marilyn Monroe, highlighting the relaxed and affectionate dynamic between the two as they imagine the characters in the movie."
+ },
+ {
+ "video_id": "PI-6922cb3c-05eb-4afb-8520-30f25831407f",
+ "path": "/workspace/benchmark/video/PI-6922cb3c-05eb-4afb-8520-30f25831407f.mp4",
+ "answer": "In the video, a person is sitting at a table. The ceiling of the room is decorated with string lights, creating a comfortable and modern atmosphere. The person is wearing a green shirt and holding a cotton roller. The words \"Acceptable Age Gap\" are prominently displayed at the top of the screen, indicating the theme of the video. The person humorously explores different age gaps in relationships, starting with \"one year older\", then \"two years older\", and then \"three or four years older\". As he discusses each age gap, he appears to weigh its acceptability in a relaxed, satirical way. The background includes a bed with white bedding and some items, adding a casual and relatable setting to the video. The person gestures with the cotton roller to emphasize his points. Near the end, the text changes to \"They are all adults and have left school\", suggesting considerations of maturity and life stage in the context of acceptable age differences. The video ends on a humorous note, with the person seeming to accept the proposed argument, as can be seen from the words \"Excellent arguments approved\" that appear on the screen. The entire theme revolves around social norms and individual views on age gaps in relationships."
+ },
+ {
+ "video_id": "PI-418f79a3-2a14-4844-b266-955544c3077a",
+ "path": "/workspace/benchmark/video/PI-418f79a3-2a14-4844-b266-955544c3077a.mp4",
+ "answer": "The video shows a lively and entertaining scene of three people dancing energetically in a room with textured walls and circular decorations. It opens with a close-up of a white curtain and a text overlay before the group begins dancing rhythmically: one person, in a shiny purple costume with a hood and a sheer chest patch, takes center stage, expressively moving their arms and head, while the others, dressed casually, join in the fun, creating an atmosphere full of joy and camaraderie as they enjoy their time together. Meanwhile, the audio is dominated by a chorus of male voices venting pent-up, introspective emotions, and the singers show their love for the song. The combination of vibrant visuals and emotional audio creates a vibrant and engaging experience."
+ },
+ {
+ "video_id": "PI-d9199359-d7f5-4f03-a5d6-58c2cce658d1",
+ "path": "/workspace/benchmark/video/PI-d9199359-d7f5-4f03-a5d6-58c2cce658d1.mp4",
+ "answer": "This video comically depicts a relatable scenario. It opens in a bedroom with a man wearing headphones and a beanie, displaying a startled expression. An on-screen caption reads: \"When you up late as hell and you hear your door open.\" The man's lips are synchronized with a dramatic, high-pitched scream, and his arm gestures convey a sense of panic, as if he's trying to explain something. The scene then abruptly shifts to a serene close-up of a brown dog, eyes peacefully closed, bathed in a warm orange glow. The same caption and intense audio continue to play over this tranquil image. The humor of the video stems from the stark contrast between the man's panicked reaction and the dog's complete calmness. This juxtaposition perhaps implies the man has given up on explaining himself, resigned to being lectured by a parent or partner. The video's overall humorous and witty tone, combined with the man's expressive performance, creates an amusing and relatable experience."
+ },
+ {
+ "video_id": "PI-a158cc7a-ff6a-4776-b99c-1500f47edd4a",
+ "path": "/workspace/benchmark/video/PI-a158cc7a-ff6a-4776-b99c-1500f47edd4a.mp4",
+ "answer": "The audio features a frustrated speaker detailing their morning struggle to open a car door that is heavily covered in thick ice, highlighting the difficulties brought on by cold weather. Correspondingly, the video begins with a close-up view of the ice-covered car door, showing a person in a blue shirt and silver watch using their hand to scrape off the ice from the door handle area. Set against a backdrop of snowy ground, the scene emphasizes the harsh, cold conditions. The person persistently scrapes and chips away at the ice, gradually revealing more of the car door underneath, though the ice remains stubborn and difficult to remove completely. As the video progresses, the same close-up view persists, with the individual still engaged in the arduous task of ice removal. The snowy ground remains a constant visual reminder of the wintry setting, while the person's focused and determined hand movements illustrate their ongoing effort. The video wraps up with the person still actively scraping off the ice, underscoring the continuous struggle and the challenging nature of the task, perfectly mirroring the frustration conveyed in the audio."
+ },
+ {
+ "video_id": "PI-87c762df-8f34-4a2b-9e2d-5930b5638e07",
+ "path": "/workspace/benchmark/video/PI-87c762df-8f34-4a2b-9e2d-5930b5638e07.mp4",
+ "answer": "The video unfolds in a festively decorated room adorned with plaid tablecloths and an array of food dishes, setting a cheerful holiday scene. An overlay text humorously teases, \"Santa came to our house... but wait for it 😂.\" Initially, a person dressed as Santa Claus descends a staircase, greeted by a group of excited children in colorful winter clothing who run into the room, thrilled by his arrival. Santa waves and engages with the children, who dance energetically, while adults observe and capture the moment on their phones. The festive atmosphere continues as Santa interacts with the guests near the food-laden tables. Suddenly, a person in a green Grinch costume enters, surprising Santa and the guests. The Grinch playfully moves around, eliciting laughter and excitement, while Santa reacts with surprise and amusement, engaging in the fun. The children watch and participate, adding to the joyful chaos. The playful interaction escalates as the Grinch pushes Santa, causing more laughter, and runs around the room with Santa in pursuit. The video wraps up with Santa and the Grinch ascending the staircase together, followed by a group of children and adults, all laughing and enjoying the festive atmosphere. In contrast, the audio track is a brief, unprocessed recording with minimal content, featuring a neutral acknowledgment, \"Yeah,\" providing no additional context or information to enhance the visual narrative."
+ },
+ {
+ "video_id": "PI-df90c2c2-9b5a-4479-bbb7-c9fc203c382a",
+ "path": "/workspace/benchmark/video/PI-df90c2c2-9b5a-4479-bbb7-c9fc203c382a.mp4",
+ "answer": "In a cozy living room with a modern kitchen visible in the background, a heartfelt interaction unfolds as a woman, dressed in a black t-shirt and shorts, lies curled up on a beige couch in a fetal position, clearly in discomfort from menstrual cramps, as indicated by the text overlay mentioning \"Having bad period cramps in front of my boyfriend to see his reaction...\" and a blood drop emoji. Her boyfriend, dressed in a dark t-shirt and black shorts, initially occupied with a pink phone, notices her distress and immediately shows concern. He puts down his phone, leans closer, and gently places his hand on her leg, asking if she is experiencing cramps. The woman nods slightly, and echoing the supportive tone from the audio where one speaker proposes offering snacks to alleviate discomfort, the man offers to get her some snacks to help her feel better. Continuing to show attentive care, he soothingly massages her leg. After a moment, he retrieves a bright green blanket and carefully drapes it over her, then resumes massaging her back. This tender moment captures the essence of the audio's casual and caring atmosphere, where the speakers discuss addressing someone's immediate needs with an apologetic and supportive tone, highlighting the empathy and concern the man shows for his partner's well-being."
+ },
+ {
+ "video_id": "PI-d53a2f2a-7024-4fa5-a5fb-4da194538312",
+ "path": "/workspace/benchmark/video/PI-d53a2f2a-7024-4fa5-a5fb-4da194538312.mp4",
+ "answer": "The integrated account of the award ceremony combines elements from both the audio and video summaries, offering a vivid depiction of the joyous occasion. The event, marked by a lively and uplifting musical background, was filled with applause and a celebratory mood. In the audio, multiple individuals expressed their happiness and gratitude. A female speaker, battling anxiety due to dyslexia, delivered a heartfelt speech of thanks, while another speaker, likely male, marveled at the fellow nominees and the mutual support among them. Meanwhile, in the video, the focus initially landed on six ladies before the subtitle \"Zoe Saldana 'Emilia Pérez'\" emerged. A dark-skinned woman, visibly excited, leaped up, and the camera then panned to a lady in a blue dress watching her admiringly. The dark-skinned woman, now on stage clutching a trophy, spoke into the microphone, with subtitles indicating she thanked her Canadian people and all who had supported her. Her emotional speech culminated in tears of joy, mirroring the heartfelt sentiments expressed in the audio, where a third female speaker also emotionally acknowledged her family with love and appreciation. Together, the audio and video summaries paint a rich tapestry of gratitude, support, and shared joy at the award ceremony."
+ },
+ {
+ "video_id": "PI-e09a2edb-cbbd-4df9-939b-2dff47df060a",
+ "path": "/workspace/benchmark/video/PI-e09a2edb-cbbd-4df9-939b-2dff47df060a.mp4",
+ "answer": "This video serves as a tribute to the popular TikTok creator known as Lynja. The video begins by briefly showing two other creators, identified as Kya and Mula, before shifting focus to Lynja's TikTok profile page. A male narrator then delivers the sad news, stating, \"I'm super sad to inform you that... unfortunately, Lynja has passed away.\" What follows is a montage of clips from Lynja's various cooking videos. The montage showcases her personality, featuring scenes of her in a turkey costume as cranberries fall around her, holding up an eggplant, and pouring juice. Various close-up shots highlight her expressive face and joyful demeanor. Throughout the tribute, which is set to somber music, the text \"SIGMA CULT\" is displayed. The video is a memorial compilation, celebrating her life and the content she shared with her many followers."
+ },
+ {
+ "video_id": "PI-ed22a002-e24a-4a50-9d26-6464cdaeefe6",
+ "path": "/workspace/benchmark/video/PI-ed22a002-e24a-4a50-9d26-6464cdaeefe6.mp4",
+ "answer": "The video unfolds on a bustling street, where pedestrians stroll, vehicles shuttle by, and several bright shops add a lively backdrop. At the heart of the scene, a couple, both dressed in grey tops—the man in black shorts and grey shoes, and the woman in white trousers and white shoes—sit comfortably on black chairs, eagerly awaiting the completion of their anime-style portrait by a street painter, which features blue and white as the dominant colors. Their faces are lit up with smiles as they engage in cheerful banter, exuding an aura of joy and contentment. This heartwarming interaction mirrors the casual, friendly conversation captured in the audio track, where multiple speakers of different genders share light-hearted remarks and laughter, creating a relaxed and jovial atmosphere that perfectly complements the video's depiction of the couple's shared happiness and the beauty of their loving bond."
+ },
+ {
+ "video_id": "PI-6a5cb2b0-2554-41af-8caf-821d943971e1",
+ "path": "/workspace/benchmark/video/PI-6a5cb2b0-2554-41af-8caf-821d943971e1.mp4",
+ "answer": "The video humorously contrasts the differing perceptions of casual attire between two individuals. It begins with a close-up of a wooden ceiling adorned with green foliage and small orange fruits, accompanied by the text 'POV: your man's 'casual' is not your casual,' setting a playful tone. The scene transitions to a person standing outdoors, dressed in a black sweater with textured sleeves and a long, striped skirt in shades of brown and beige, holding a white bag next to a decorative white chair. The text 'My casual:' appears, indicating this person's interpretation of a casual outfit. The video then shifts to another individual wearing a black hoodie, black pants with red stripes, and a black cap with a gold 'A' emblem, with the text 'His casual:' highlighting the contrast. The background features a brick building with a green door, outdoor seating, and string lights, enhancing the casual, outdoor setting. Meanwhile, the audio features a conversation between the two individuals, marked by a neutral or slightly perplexed tone, as they discuss and express disagreement or confusion about what constitutes being casual. The male speaker questions the situation, while the female speaker reiterates the lack of casualness, adding a reflective and conversational layer to the visual humor."
+ },
+ {
+ "video_id": "PI-f214332f-3f42-4c67-ba0b-3797a5c1cd0f",
+ "path": "/workspace/benchmark/video/PI-f214332f-3f42-4c67-ba0b-3797a5c1cd0f.mp4",
+ "answer": "The video captures a heartwarming pregnancy announcement during a family photoshoot. Set against a snowy outdoor backdrop, four people and their two dogs pose on the snow-covered lawn before a house. The man on the right instructs the group to count down before shouting in unison: \"Madison's pregnant!\" As the words echo, the couple on the left transforms their smiles into stunned disbelief. The rightmost man confirms the news, coordinating with the woman beside him for celebratory gestures. Witnessing this, the left pair shifts from shock to elation, culminating in an emotional embrace between the two women. The stationary camera documents genuine reactions of joy and wonder, revealing the family's warmth and happiness as they process this thrilling news in an unforgettable fashion."
+ },
+ {
+ "video_id": "PI-65d62db0-eae4-4137-9234-11fdad7b3ebc",
+ "path": "/workspace/benchmark/video/PI-65d62db0-eae4-4137-9234-11fdad7b3ebc.mp4",
+ "answer": "The video presents an aerial, isometric perspective of a virtual farm within a simulation game. The camera executes a slow, continuous pan from the top-left to the bottom-right, gradually revealing various sections of a meticulously designed landscape. A prominent feature is a large, heart-shaped formation of red-leafed trees, which forms a central visual element. The farm is intricately laid out with diverse elements, including structured pathways, flowing water channels, and numerous flowerbeds displaying red, yellow, purple, and white blossoms. Various farm structures, such as sheds and small houses, are visible, along with distinct animal enclosures housing static figures of horses, donkeys, and cats. Decorative items like a golden fountain, benches, and a pink popcorn cart are strategically placed throughout the digital environment. The overall visual impression is one of a well-organized and aesthetically rich virtual space. An overlaying text caption remains visible throughout the video, providing a narrative context. The visual content is accompanied by calm, instrumental background music featuring a female vocalist, contributing to the serene atmosphere."
+ },
+ {
+ "video_id": "PI-ec40bd18-cdcd-484a-abcc-5a67020f20af",
+ "path": "/workspace/benchmark/video/PI-ec40bd18-cdcd-484a-abcc-5a67020f20af.mp4",
+ "answer": "A person documents their struggle to get into their car on a cold, icy night. The video, presumably intended for their boss as an excuse for being late, shows the car covered in a layer of frost. The individual repeatedly tries to open the driver's side door, but the handle appears to be frozen shut. On-screen text comically narrates their escalating panic and feelings of hopelessness. They pull on the handle and even hit the door in frustration, all while a sad piano melody plays in the background, heightening the mock-dramatic effect. After several attempts, a moment of realization occurs. The person pulls out their key fob, presses the unlock button, and the door opens without any issue. The video concludes with the person laughing at their own oversight and the final on-screen text reading, \"I definitely need more sleep...\""
+ },
+ {
+ "video_id": "PI-2fe25558-179c-4141-89ea-e20c1bf67e2c",
+ "path": "/workspace/benchmark/video/PI-2fe25558-179c-4141-89ea-e20c1bf67e2c.mp4",
+ "answer": "The video starts with a close-up of a computer monitor displaying a scene from a movie or show, featuring a person in a dimly lit room, accompanied by content warnings for 'language, smoking, violence'. The player number 'Player 456' is displayed, hinting at a possible game or challenge context. The scene transitions to a mysterious character in a hooded outfit with a geometric mask, standing in a warmly lit hallway, with the text 'Looking for these?' suggesting a search or quest. The video then shifts to a video player interface titled 'LIQUID GAME', indicating online viewing. Meanwhile, the background music is a cheerful tune sung by a girl, adding a relaxed and uplifting atmosphere to the scene. The video wraps up with a hand unplugging a cable from a computer tower, marking the end of the session and bringing a sense of closure to the viewing experience."
+ },
+ {
+ "video_id": "PI-892a5cee-95bd-4dc2-9cff-7f7f1e22a301",
+ "path": "/workspace/benchmark/video/PI-892a5cee-95bd-4dc2-9cff-7f7f1e22a301.mp4",
+ "answer": "The video takes place in a bustling shopping center, characterized by modern architecture, bright lighting and tiled floors. The main focus is a person wearing a black suit, a white shirt, a red tie and a red hat, riding a huge plush rhinoceros toy. This person seems to be enjoying himself very much, making playful gestures and interacting with the surrounding environment. The background is filled with busy shoppers, adding a lively atmosphere to the shopping center. Throughout the video, the person on the rhinoceros toy delivers humorous lines, which are displayed on the screen as text. Phrases such as \"Don't worry, New Jersey. I'm taming all the rhinos, especially that big fat Chris Christie\" and \"There are so many rhinos in Washington D.C.. I'm going to get rid of every one of them, and I love animal rights\" appear one after another, adding a comedic element to the scene. These remarks suggest a playful narrative around \"taming\" or getting rid of \"rhinos\", metaphorically referring to people or problems, with a specific mention of Chris Christie. The video creates an amusing and engaging experience for the audience. The core theme revolves around the playful interaction between an individual and a plush rhinoceros against the backdrop of a busy shopping center, with the on-screen text adding layers of context and humor to the scene."
+ },
+ {
+ "video_id": "PI-eff178d8-5093-43d3-8ee5-bac4872268b0",
+ "path": "/workspace/benchmark/video/PI-eff178d8-5093-43d3-8ee5-bac4872268b0.mp4",
+ "answer": "The video features a woman seated inside a vehicle, dressed in a red hoodie, with earphones in, the wires visible against the red fabric. Through the vehicle's window, a clear blue sky, some leafless trees, and distant buildings are visible. The woman occupies most of the frame, appearing to look directly at the camera. Text overlays appear on the screen during the video. The first overlay states, \"Since tiktok’s getting banned there’s one more secret I feel I have to share with you.\" This is complemented by the audio, in which a male voice, here standing in for the woman on screen as the narrative's speaker, emphasizes in a serious and reflective tone a need to share a final secret due to the impending ban of TikTok. Later, another text overlay appears, reading, \"I have literally never lied on this app and that was not me in Tokyo.\" The audio reinforces this by asserting the speaker's honesty on the app and specifically denying being in Tokyo, referring to some unknown event or claim. The video focuses on the woman's upper body and face, though the face is not described in detail, and the overall setting suggests a casual, personal revelation. The core message conveyed is an insistence on truthfulness and a clarification about the speaker's whereabouts, integrating both the visual and auditory elements to present a cohesive narrative of personal revelation and honesty."
+ },
+ {
+ "video_id": "PI-fbf3a08f-1d25-423b-91f8-184960ad6ef8",
+ "path": "/workspace/benchmark/video/PI-fbf3a08f-1d25-423b-91f8-184960ad6ef8.mp4",
+ "answer": "In this heartwarming video, a young boy in pajamas walks out of his house onto the front porch, where he is met with a huge surprise. His father and sister, who were hiding, pop out to greet him. The boy's face registers pure shock, which quickly turns into overwhelming emotion. As the person filming wishes him a \"Merry Christmas,\" the boy rushes up the steps and into his father's arms for a long, emotional hug. His sister looks on, visibly moved and smiling. After the tender moment, the boy is guided toward the open garage. There, his Christmas gift is revealed: a green Kawasaki dirt bike, complete with a large green bow on the seat. The father explains he spent over a month restoring it."
+ },
+ {
+ "video_id": "PI-611b0b41-3651-4b66-aee0-dea6f0af8878",
+ "path": "/workspace/benchmark/video/PI-611b0b41-3651-4b66-aee0-dea6f0af8878.mp4",
+ "answer": "The video shows a woman with long curly hair, wearing a black top and gold jewelry, examining a beige lip liner labeled \"Rod\" in a softly lit room with a bed and a window with blinds. On-screen text reading \"what is this on the rhode lip liners? \" indicates the woman's curiosity about the product. With a tone of frustration and mild sarcasm, the woman humorously questions the lip liner's blending ability, saying, \"What the hell is this, because it likes to blend it? Because if so, it's not the girl who does it well, no shadow girl.\" This criticism shows dissatisfaction with the tool's functionality. However, she shifts to a more positive tone, expressing her appreciation for the product's color and deciding to focus on one side, concluding, \"Let me stick to this side, even though I love the color.\" The overall vibe is casual and frank, with no background music present, emphasizing the speaker's straightforward and relatable evaluation of the product."
+ },
+ {
+ "video_id": "PI-65c0fe3b-a6b1-4cc4-be6b-353c22f50daf",
+ "path": "/workspace/benchmark/video/PI-65c0fe3b-a6b1-4cc4-be6b-353c22f50daf.mp4",
+ "answer": "The video features a person standing in a room with wooden flooring and a ceiling light fixture, wearing a white tank top with a graphic design of skeletons and black pants. They are engaged in a series of hand and arm movements, crossing their arms in front of their chest, extending their hands forward, and making various gestures, suggesting a dance or exercise routine. The background includes a partially visible bed with a dark-colored blanket and a closed door, creating a simple, indoor environment with minimal decor that focuses attention on the person's energetic and rhythmic movements. Meanwhile, the audio track features a female speaker with an upbeat and confident tone, delivering a phrase that conveys a carefree attitude towards material possessions, emphasizing that having or not having something doesn't matter. The combination of the dynamic visual movements and the positive audio message creates a lively and liberating atmosphere."
+ },
+ {
+ "video_id": "PI-15b2b0c6-5acc-43bf-99f3-4822306b4ca5",
+ "path": "/workspace/benchmark/video/PI-15b2b0c6-5acc-43bf-99f3-4822306b4ca5.mp4",
+ "answer": "The video captures a warm and festive atmosphere in a cozy living room adorned with a small Christmas tree, where two roommates, dressed casually in dark green and light grey sweatshirts paired with black leggings, stand back to back, setting the stage for a lighthearted interaction. The screen displays the humorous text, \"Let my roommate's boyfriend earn overnight fees,\" hinting at the playful dynamic between them. As they begin to dance in unison, moving their arms and swaying to the rhythm, a third person joins the frame, dressed in a black and grey checked shirt, grey sports pants, and a black brimless hat, adding even more liveliness to the scene. Meanwhile, the audio features a female voice singing \"Pray for you\" in a playful and relaxed tone, followed by fragmented phrases like \"Before I put on my hair,\" \"Now welcome my hair,\" and \"I've changed to a new address,\" which further hint at a casual and interesting interaction among roommates. The combination of the video's festive setting, the roommates' synchronized dancing, and the third person's energetic entrance, along with the audio's playful singing, creates a relaxed and friendly atmosphere that perfectly encapsulates the joyful moments shared among friends during the holiday season."
+ },
+ {
+ "video_id": "PI-3a102cee-2995-43b2-a2e7-d69d92795d1d",
+ "path": "/workspace/benchmark/video/PI-3a102cee-2995-43b2-a2e7-d69d92795d1d.mp4",
+ "answer": "The video shows a heartwarming scene in a shop where a child wearing a purple shirt and white jacket decorated with purple stars is engrossed in examining a plush toy, white hair, red hat, blue costume, accompanied by a yellow pamphlet. The child's curiosity is evident when the child turns the toy around and carefully observes its features, particularly its large green eyes. The backdrop is vibrant and filled with shelves filled with colorful books and stationery. Meanwhile, the audio captures a happy conversation between the lady and the small child, possibly a mother and daughter, filled with laughter and warmth. The child's involvement with the toy and affectionate conversation creates a normal, comfortable and pleasant atmosphere throughout the video."
+ },
+ {
+ "video_id": "PI-b79aba92-0bee-422f-ac74-c50bdb9f225c",
+ "path": "/workspace/benchmark/video/PI-b79aba92-0bee-422f-ac74-c50bdb9f225c.mp4",
+ "answer": "The video features a person standing in a bathroom characterized by a distinctive blue wall decorated with cartoon characters and a shower curtain adorned with vibrant images of SpongeBob SquarePants and other animated figures, creating a playful and lively atmosphere. The individual, dressed in a gray sweater, stands in front of the shower curtain hung on a curved shower rod, with the bathroom's white tiled wall visible to the right and a purple bottle placed on the edge of the bathtub. Throughout the video, the person makes various hand gestures, including bringing their hands together in a prayer-like position, touching their face, and giving a thumbs-up, adding a dynamic element to the scene. Meanwhile, the audio content features a male singer who, in a happy and relaxed mood, describes his girlfriend's quirks, such as her tendency to call friends and her charm, reflecting on the dynamics of their relationship in a playful tone. The cheerful background music complements the overall joyful and casual atmosphere of the video, enhancing the light-hearted and animated environment."
+ },
+ {
+ "video_id": "PI-4be3d501-1193-4fe8-ae44-6001cab8e701",
+ "path": "/workspace/benchmark/video/PI-4be3d501-1193-4fe8-ae44-6001cab8e701.mp4",
+ "answer": "The video captures a cozy and intimate outdoor scene at night, where two individuals, one donning a red top and a green and white checkered blanket and the other in a maroon hoodie with a light-colored blanket, sit closely together, engaged in a playful and lighthearted exchange, making hand gestures and sharing smiles. The dimly lit background, dotted with faintly visible people and objects in the distance, suggests a social gathering or event, while a fire pit in the foreground adds warmth to the ambiance. Meanwhile, the audio, a musical clip with ambient music at a steady tempo, underscores the reflective and contemplative mood, enhancing the overall atmosphere of warmth and friendliness. The text overlay, \"what plays in my head everytime i'm around kids,\" hints at a humorous or relatable thought process, perfectly complementing the video's depiction of a joyful moment of connection between the two individuals, as well as the broader context of social interaction and the comforting presence of a fire at night."
+ },
+ {
+ "video_id": "PI-485b8042-0133-442a-88a3-0908bdfc1dff",
+ "path": "/workspace/benchmark/video/PI-485b8042-0133-442a-88a3-0908bdfc1dff.mp4",
+ "answer": "The video humorously captures a living room scene where a man in a blue shirt and black shorts is playing a drum set received as a Christmas present, creating a vivid contrast between his excitement and the limited use of the gift. The drum kit sits in front of a grey sectional sofa decorated with various pillows and blankets, and the room features striped rugs and wall decor, adding to a cozy and ironic vibe. His father doesn't allow the brother to play the drums often. The audio, meanwhile, features an unidentified speaker of uncertain gender and emotion, expressing the phrase \"roseose\" in English, accompanied by ambient music of uncertain rhythm and atmosphere, helping to form a neutral or ambiguous auditory background that contrasts with the visual humor of the scene."
+ },
+ {
+ "video_id": "PI-3de6afdb-30e6-4270-896c-2275572b8096",
+ "path": "/workspace/benchmark/video/PI-3de6afdb-30e6-4270-896c-2275572b8096.mp4",
+ "answer": "The video begins with a man speaking in front of the camera, and the video shows I got black friend what what the fuck is that. Then the man flipped the camera to a black-skinned statue of Santa Claus. Santa Claus was wearing red clothes with black skin, which made the man feel ridiculous. In the video, the man had a conversation with his black friend, and the subtitles at the bottom of the video were to the effect that the man felt that the statue was ugly and had a dispute with his friend. After the argument, the man flipped the camera and pointed it at himself again and spoke in front of the screen. The video shows the man's dissatisfaction with the statue and an argument with his friend."
+ },
+ {
+ "video_id": "PI-d2ead292-7543-422f-b9e9-7ea79b588f7c",
+ "path": "/workspace/benchmark/video/PI-d2ead292-7543-422f-b9e9-7ea79b588f7c.mp4",
+ "answer": "This video documents the charming transformation of a small area into a cozy bedroom for Giuseppe, the cat. It kicks off with a shelf on display, which soon gains new purpose. A woman then steps in, holding the shelf to showcase its role in the cat's room.We're treated to adorable scenes of Giuseppe. He first rolls around happily and later lies on the shelf, showing early approval. Next comes the cleaning of a blanket. Giuseppe, being the inquisitive cat he is, squats on it and is carried along, a comical moment.Two people then carry Giuseppe upstairs on the blanket, indicating the new room's location. Once there, the woman spreads the blanket to interact with him. She then presents a small blue cardboard sofa, which quickly becomes Giuseppe's favorite spot. It doubles as a scratching post, a great addition for the cat's fun.The audio adds to the warmth. The woman, brimming with happiness, describes the room's progress.Giuseppe loves a vintage rug, which gives the space a classic feel. She also painted wood pieces for a climbing gym to match the wallpaper, creating a visually appealing and fun space. Vintage frames for a gallery wall were carefully chosen too. The overall mood is cheerful, with the woman's enthusiasm for this cat - centered project evident throughout."
+ },
+ {
+ "video_id": "PI-1ddf521e-9773-48a8-b153-ae03206e5b73",
+ "path": "/workspace/benchmark/video/PI-1ddf521e-9773-48a8-b153-ae03206e5b73.mp4",
+ "answer": "The video features a TikTok profile page for the user \"@show,\" displayed against a black background with the time shown. The profile indicates that she has 199 followers, 4.3 million following, and 9.5 million likes, with a profile picture of a person in a suit and a red 'Follow' button beside it. Prominently marked in red and white letters, the text below the profile information reads: \"Breaking TIKTOK News!\" The main focus of the video is a person wearing a black shirt and multiple bracelets, gesturing expressively as they speak, suggesting they are reporting or discussing something significant. In the background, smaller video thumbnails display various scenes, including people in suits and ties, and a postcard with the words \"TikTok Live.\" Complementing the visuals, the audio features a man's voice, sounding both surprised and worried, as he reports breaking news about a change in someone's TikTok profile. He expresses confusion and asks what has happened, adding a sense of urgency and intrigue to the overall narrative."
+ },
+ {
+ "video_id": "PI-30274abe-af1a-4b76-a17d-2b4084faaa3f",
+ "path": "/workspace/benchmark/video/PI-30274abe-af1a-4b76-a17d-2b4084faaa3f.mp4",
+ "answer": "The video begins with a woman with long blonde hair wearing makeup and carefully outlining her lips with a black lip liner. She wore a black shiny top with visible tattoos on her fingers, including a cross and a ring, adding a touch of personal style. The woman then adjusted her hair to reveal its smooth, shiny texture before changing into light pink, silky clothes. As she gently moves her head, the video showcases the volume of the hair and the luxurious texture of the outfit, creating an elegant and cohesive visual presentation. Soft lighting enhances the overall aesthetic, highlighting the woman's stylish look and the interplay of textures and colors. Meanwhile, the background music features a boy singing a pleasant and relaxed tune, which complements the tranquil and sophisticated atmosphere of the scene and is perfect for a relaxed setting."
+ },
+ {
+ "video_id": "PI-caa5860a-da98-43bb-aaf2-976e8906e82d",
+ "path": "/workspace/benchmark/video/PI-caa5860a-da98-43bb-aaf2-976e8906e82d.mp4",
+ "answer": "The video presents a creative collage of four distinct images arranged in a 2x2 grid, each showcasing a unique style and label, while a cheerful male voice expresses happiness about the day, complemented by upbeat ambient music that enhances the overall positive mood. In the top-left frame, an animated character with a muscular build is featured, wearing a black crop top with \"DIII\" written on it and light blue jeans, standing confidently with one hand in the pocket; the label \"Bratz\" appears below in purple text. The top-right frame displays a real-life person in a black short-sleeved top and black leggings, making a peace sign with their right hand, labeled \"Disney\" in white text. The bottom-left frame repeats the same person in the same outfit and pose, now labeled \"Cartoon Network\" in white text. Finally, the bottom-right frame shows the person again in the same attire and pose, with the label \"Disney Princess\" in pink text, accompanied by an image of an animated prince in a purple and gold outfit. Throughout the video, the background remains a plain white door and light-colored wall, with static images and no movement, creating a playful comparison of different styles and labels associated with the person and characters."
+ },
+ {
+ "video_id": "PI-92aead6c-1060-446f-a60c-e24b9c8caaa8",
+ "path": "/workspace/benchmark/video/PI-92aead6c-1060-446f-a60c-e24b9c8caaa8.mp4",
+ "answer": "The video captures a serene suburban scene during late afternoon or early evening, bathed in warm, golden lighting. A beige house with an open garage reveals two vehicles inside—a black SUV and a white car—while a white boat on a trailer is parked on the driveway to the right. Near the garage, a group of people gathers: a woman in a pink top and black pants, a man in a white shirt and shorts, and a young child in a pink outfit, who walks towards the driveway. In the foreground, a man in a white T-shirt and black shorts walks down the sidewalk towards the group. The text overlay reads, 'Plan A was to build on land with no neighbors in sight.' As he approaches the child, he bends down to interact, and the text changes to, 'But this is plan B and we are never leaving.' He picks up the child, sharing a moment of affection before standing and turning to face the group. Meanwhile, the audio features a male speaker with an unknown emotional tone narrating a scene involving a house built from stone, a fallen window, and a chair he was sitting in, accompanied by calm and steady ambient music that creates a neutral atmosphere. The video emphasizes the joy of family togetherness and satisfaction with their current home, contrasting with the audio's more abstract narrative."
+ },
+ {
+ "video_id": "PI-cbc23ca9-a6d1-4777-a9f6-3b4cbf0c4ef3",
+ "path": "/workspace/benchmark/video/PI-cbc23ca9-a6d1-4777-a9f6-3b4cbf0c4ef3.mp4",
+ "answer": "This point-of-view video captures a humorous scene in a hospital room. An elderly woman is lying in a hospital bed while a nurse in scrubs and a face mask attends to her. The on-screen text reads, 'pov: my 76 year old great grandma pregnant.' In the background, a conversation can be heard where the nurse seemingly tells the woman she is pregnant, to which the woman reacts with audible shock and disbelief. The camera pans unsteadily between the patient and the nurse, who is standing by a counter. The room is a typical clinical setting with medical equipment like a vitals monitor and an IV stand. Loud rap music plays throughout the video, adding to the comedic and surreal atmosphere of the situation, which appears to be a lighthearted prank."
+ },
+ {
+ "video_id": "PI-88780f99-2140-479b-80c1-0f87364ba10c",
+ "path": "/workspace/benchmark/video/PI-88780f99-2140-479b-80c1-0f87364ba10c.mp4",
+ "answer": "The video takes place in a gymnasium, characterized by a metal structure and a concrete floor. A woman wearing a light blue off-the-shoulder top, white trousers and brown shoes is the main focus. She peeked playfully out from behind a metal structure, grabbed both sides with her hands, then came out and looked around as if looking for someone. The text on the video reads: \"It's always fun to scare the people I like!\" This indicates that she intends to joke and scare the person she likes. The subsequent scene shows a man sitting on a metal platform, wearing a shirt with blue patterns and shorts. That woman stood beside him, looking up at him with a naughty expression. The man seemed not to be aware of her existence at first, but later noticed her and smiled back. The woman stood there, with her hands in her pockets, looking very content and amused by this successful prank. The overall tone of the video is light-hearted and cheerful, capturing the funny and mischievous moments between the two people. The superimposition of text enhances the interesting interaction and the surprising theme."
+ },
+ {
+ "video_id": "PI-ee95723a-4c01-49eb-b7e8-c760a2a06047",
+ "path": "/workspace/benchmark/video/PI-ee95723a-4c01-49eb-b7e8-c760a2a06047.mp4",
+ "answer": "This video captures a moment of shared accomplishment followed by an unexpected mishap. Three individuals are seated around a wooden table, focused on completing a large jigsaw puzzle depicting a dramatic scene illuminated by candles. When the female member places the final piece, the entire group erupts in cheers and high-fives to celebrate. However, during the enthusiastic celebration, one of the individuals across from them deliberately swings his arm, causing the entire completed puzzle to slide off the table and scatter onto the floor. The camera quickly shifts from the fallen puzzle to the incredulous faces of the three, their expressions filled with shock and disappointment. The setting appears to be a communal space with light-colored walls, and simple decor including a poster and a collection of old radios.\n"
+ },
+ {
+ "video_id": "PI-a3983bd0-89de-4507-b707-d67b27a39b31",
+ "path": "/workspace/benchmark/video/PI-a3983bd0-89de-4507-b707-d67b27a39b31.mp4",
+ "answer": "The video and audio together weave a narrative of two characters—a baby and a dog—arriving at Padma's workplace, a bustling call center. The video begins with a crowded blue bus parked in front of a grand building, from which the white dog with a red collar and the baby in a yellow shirt and red overalls emerge. The baby comments on their arrival and expresses a sense of apology towards buses in America. The scene transitions to an elevator where the duo, accompanied by three men in suits, exits and walks down a hallway adorned with a poster of a woman in a headset. They stop in front of a door labeled 'CALL CENTER EMPLOYEES ONLY.' The audio complements this visual journey with a dialogue between two characters discussing their arrival; one voice is curious and concerned about finding Padma among the many employees, while the other is confident and reassuring, hinting at a plan. This confidence is realized when the dog pulls out a phone and instructs someone to stand on their desk, setting the stage for the next part of their adventure in the call center."
+ },
+ {
+ "video_id": "PI-5694b398-f840-4b6e-b55f-af24c2fd3bb3",
+ "path": "/workspace/benchmark/video/PI-5694b398-f840-4b6e-b55f-af24c2fd3bb3.mp4",
+ "answer": "This video shows a charming and surreal scene seen through an airplane window. A magnificent white horse was captured in flight against the backdrop of a bright blue sky dotted with fluffy white clouds. The horse's mane and tail flow gracefully in the wind, enhancing the ethereal quality of this moment. At the lower part of the picture, the wings of the aircraft can be seen, providing a sense of perspective and background, indicating that the audience is observing this dreamlike scene from inside the aircraft. This horse seems to be in a dynamic posture. Its legs are stretched out and its body is slightly tilted downward, as if it were soaring or descending among the clouds. The contrast between the mechanical wings of an airplane and the free and unrestrained horse in mythology creates an astonishing juxtaption that blends the realms of reality and fantasy. This video evokes a sense of wonder and magic, taking the audience into a dreamlike world where the ordinary and the extraordinary coexist. No audio content can allow visual elements to take center stage, emphasizing the beauty and surrealism of the scene."
+ },
+ {
+ "video_id": "PI-0366f6ba-6905-4e0f-bc5d-c13a0a4da6d1",
+ "path": "/workspace/benchmark/video/PI-0366f6ba-6905-4e0f-bc5d-c13a0a4da6d1.mp4",
+ "answer": "This video was shot in a comfortable and warm room with wooden walls and a large flat-screen TV in the background. A person with long brown wavy hair, wearing a dark patterned long-sleeved top and blue jeans is the focus. At the beginning of the video, this person is standing beside a wooden table, with a red chair placed beside it, viewed from behind. Then she sat down, clenched her hands tightly, expecting. Such text will appear on the screen: \"Today, you will perform a blind unboxing operation on your Alter Ego album\", setting the background for the event. This person is now seated, waving to the audience and asking, \"Hello everyone, what are we doing today?\" Then, she expressed excitement about opening the box and said, \"Hi, my children. Whichever one I want to see first, I'll go with him.\" This person picked up a photo album named \"ALTER EGO\" with various pictures on the cover and began to open it. She took out a stack of dark blue photo albums and showed them to the camera. At the end of the video, this person was holding several photo albums. On the cover of one of them, \"Lisa\" was written in pink letters, and he jokingly shouted: \"Roxy Roxy come on, Roxy Roxy Roxy 1,2,3.\" The overall theme of the video is an enthusiastic and engaging process of opening the Alter Ego album. Throughout the process, people share their excitement and interact with the audience. The warm and inviting environment combined with the lively manners of the characters creates an engaging and pleasant viewing experience."
+ },
+ {
+ "video_id": "PI-38a4bad0-b874-40d2-9094-6dfa6a0089b0",
+ "path": "/workspace/benchmark/video/PI-38a4bad0-b874-40d2-9094-6dfa6a0089b0.mp4",
+ "answer": "In the video, a person wearing a black shirt and a dark apron is demonstrating the process of making chocolate strawberries in the kitchen, while emphasizing common mistakes that should be avoided. Article \"3 Mistakes to Avoid Chocolate Strawberries!\" Displayed at the top of the screen, it is used to set the theme of the video. This person first explained the first mistake: not pickling the strawberries. This person proved this by washing fresh strawberries in a transparent glass bowl filled with water and vinegar, making sure they were thoroughly rinsed. Next, this person pointed out the second mistake: the strawberries were not dried properly. They explained that excessive moisture on the outside of strawberries would affect the chocolate coating. The third mistake is that there is no tempered chocolate. This person explained that thick chocolate shells are not advisable. On the contrary, chocolate should be thin and crispy. Throughout the video, this person provided detailed explanations and demonstrations to ensure that the audience understood each step and the importance of avoiding these common mistakes. At the end of the video, this person drizzled white chocolate over strawberries covered with dark chocolate, forming a decorative spiral pattern to showcase the final product."
+ },
+ {
+ "video_id": "PI-dba2c4a1-7f83-4c2f-9a8c-6c4c661279d1",
+ "path": "/workspace/benchmark/video/PI-dba2c4a1-7f83-4c2f-9a8c-6c4c661279d1.mp4",
+ "answer": "This video is a comedic skit portraying a typical interview scenario with a K-pop group. Filmed from the interviewer's point of view, it features two young men in colorful wigs role-playing as K-pop stars inside an apartment. The on-screen text reads, \"That one Kpop member that fluently speaks English.\" An off-screen interviewer asks questions in English. One of the 'idols,' wearing a ginger wig and a grey tank top, answers the questions articulately and calmly in English. In contrast, his bandmate, in a blonde wig and yellow sweater, acts shy yet energetic, dancing, making heart gestures, and speaking in enthusiastic gibberish. The fluent speaker often has to calm his partner down or speak for him, humorously highlighting the dynamic of having a designated English speaker in a foreign musical group. The skit playfully satirizes this common trope in the K-pop industry, all set to an upbeat pop track"
+ },
+ {
+ "video_id": "PI-55184bac-fdad-4fb2-82bd-45892b7a36cf",
+ "path": "/workspace/benchmark/video/PI-55184bac-fdad-4fb2-82bd-45892b7a36cf.mp4",
+ "answer": "The audio and video content together capture the essence of a close yet occasionally contentious friendship, reminiscent of sibling relationships. In the audio, a man and a woman discuss their friends' quarrels, comparing them to the arguments between brothers and sisters, creating an atmosphere of intimacy and familiarity. The video complements this narrative by depicting a person standing in a hallway with beige walls and a shiny, white tiled floor adorned with yellow squares. The individual, dressed in a coral-colored hoodie and black pants, is seen holding a phone and looking at it, with a text overlay that reads, 'that one friend u got so close with that yall argue like siblings,' emphasizing the theme of a close friendship marked by occasional arguments. The person is engaged in a lively conversation, gesturing animatedly with their hands while holding the phone, suggesting an ongoing and spirited discussion. At one point, they turn slightly to the side, still holding the phone and continuing to gesture, indicating an engaging and animated exchange that mirrors the close, yet sometimes tumultuous, nature of sibling relationships described in the audio. Together, the audio and video content paint a vivid picture of the complexities and joys of close friendships."
+ },
+ {
+ "video_id": "PI-fff34530-8208-40f3-87a9-539159f74be9",
+ "path": "/workspace/benchmark/video/PI-fff34530-8208-40f3-87a9-539159f74be9.mp4",
+ "answer": "This video presents a narrative in two distinct parts. The first segment opens in what appears to be a decorated garage or party room, where several young men are relaxing on couches under festive string lights. A text overlay reads, \"Thank god l'm single bc l'm not explaining this to an insecure boy.\" The scene includes a brief shot of a decorated Christmas tree and a close-up of a young woman's face. The video then abruptly shifts tone and subject matter. The second part showcases a matte black Mercedes-Benz E63 S parked at a gas station at night. Various shots highlight the car's sleek exterior, its rear featuring the license plate '9FLZ986,' and its modern interior with purple ambient lighting. A convenience store named \"BUZZIN' MARKET #1\" is visible in the background. During this car sequence, the word \"Hoe.\" appears on the screen. The accompanying audio also changes from a soft instrumental track to an energetic rap song.The video strongly suggests the woman's freedom to pursue her passions without needing to justify them to an insecure partner. The initial text, \"Thank god l'm single bc l'm not explaining this to an insecure boy,\" directly sets up this theme of liberation from the burden of an overly scrutinizing relationship"
+ },
+ {
+ "video_id": "PI-6ba28c31-1b26-4524-8d98-52edef65c9a6",
+ "path": "/workspace/benchmark/video/PI-6ba28c31-1b26-4524-8d98-52edef65c9a6.mp4",
+ "answer": "The video begins with a woman in a black crop top and black wide-legged pants assembling a large, modular wardrobe in a room. She unfolds and positions the panels, carefully aligning and securing them to form the structure, and then installs horizontal shelves and hanging rods inside. A man in a gray long-sleeve shirt and light gray sweatpants joins her to assist with the top shelf and final adjustments, resulting in a fully assembled, empty wardrobe with a clean, minimalist design. The scene transitions to a close-up of the wardrobe, now neatly filled with clothes: the top shelf holds folded jeans, the hanging rods display various pairs of jeans and sweaters in different colors, and the drawers are organized with accessories. The left side features open shelves with a woven basket, a red cap, and a brown bag. The wardrobe doors, with wooden frames and frosted glass panels, are closed, showcasing the completed and organized wardrobe. Meanwhile, the audio features a male speaker expressing a preference for mutual loyalty and satisfaction, emphasizing 'that's the way I like it,' with another male speaker echoing this sentiment and a female speaker briefly interjecting in Chinese, which translates to 'send.' The ambient music has a steady tempo, creating a neutral or casual atmosphere that complements the methodical assembly and organization of the wardrobe."
+ },
+ {
+ "video_id": "PI-e1e56b91-74f0-4f11-aa36-378b7c69b5ce",
+ "path": "/workspace/benchmark/video/PI-e1e56b91-74f0-4f11-aa36-378b7c69b5ce.mp4",
+ "answer": "This video records the pleasant and relaxed interaction between a man and a woman in a comfortable living room. There is a beige sofa in the room, and three photo frames are hung on the wall behind it, creating a warm and intimate atmosphere. The woman was wearing a yellow hoodie and sitting on the floor opposite the man. The man was wearing a grey hoodie with the \"Ford\" logo and a black hat. At first, the woman made an interesting gesture, raising two fingers, suggesting that she was playing a light-hearted joke. Then she leaned forward, tilted her head back, and laughed happily. Her laughter was loud and continuous, indicating that she had discovered something very interesting. The man spread out his legs and lifted one foot, taking part in this funny joke. Once, he leaned against her and they shared a moment of laughter and affection. The man gently stroked the woman's face and they exchanged warm smiles. This audio and video complement each other, highlighting the girl's hearty laughter and the man's relaxed admission of \"It's my fault\", suggesting a context of joking or teasing. The video and audio together convey the core message of shared humor and relaxed interaction, capturing the candid moments of connection between the two people."
+ },
+ {
+ "video_id": "PI-ba1cf5d9-6147-40de-a23c-ee6c3a8241e9",
+ "path": "/workspace/benchmark/video/PI-ba1cf5d9-6147-40de-a23c-ee6c3a8241e9.mp4",
+ "answer": "The video takes place in a modern hair salon characterized by a clean, minimalist design with light-colored wood floors and geometrically patterned walls. A young girl with long, straight, dark brown hair is seated in a salon chair, wearing a light blue shirt featuring a colorful cartoon character. The text overlay humorously reads, 'POV: You're a boy mom so they become silk press models' with a crying emoji, suggesting a light-hearted take on the experience of being a mother to boys and how it leads to her daughters getting silk press treatments. The camera focuses on the meticulous hair styling process, where the hairdresser's hands are seen using a comb and other tools to ensure the hair is smooth and straight. The background includes a black salon chair with a cape, a wicker basket with clothes, and a black cabinet with various salon products. Meanwhile, the audio features a mix of male and female voices in casual, somewhat chaotic conversations. One person confidently asserts their preferences, another makes humorous remarks, and a third mentions personal matters like yoga and a lack of response from their father. There is also a brief statement in Chinese expressing excitement. The ambient music with a steady tempo creates a neutral yet slightly dynamic atmosphere, complementing the detailed and precise hair styling process showcased in the video. The video concludes with the girl's hair looking impeccably straight and shiny, highlighting the care and precision taken throughout the transformation."
+ },
+ {
+ "video_id": "PI-1f3b10a8-eb3c-449c-a552-f8f6fa896e5f",
+ "path": "/workspace/benchmark/video/PI-1f3b10a8-eb3c-449c-a552-f8f6fa896e5f.mp4",
+ "answer": "The video humorously depicts a scene outside the house at night, featuring three people as the main player, supplemented by a piece of music that includes two men singing to music played by the instruments, suggesting a lively party atmosphere. In the video, people in green jackets and hats lively stand and gestured while speaking, dominating the interaction and adding to the vibrant atmosphere. People in camouflage hats and dark clothes sit, play keyboards and occasionally look up, contributing to the musical elements heard in the audio. A third person, wearing a beige jacket and hat, sitting on the right side, observes the scene, and the text \"How Woke Rappers be in the Studio\" is shown at the top. This video highlights the comedic nature of the behavior of the socially conscious rapper in the outdoor singing, capturing a vivid and captivating moment in which the man in the green jacket continues to speak and gesture with vigor, combining nicely with the vibrant soundtrack to represent a party scene."
+ },
+ {
+ "video_id": "PI-8c8e0d96-b56c-4b4d-9a2c-e5010f25040c",
+ "path": "/workspace/benchmark/video/PI-8c8e0d96-b56c-4b4d-9a2c-e5010f25040c.mp4",
+ "answer": "This video shows a vibrant concert scene where a male performer stands in the center of the stage, attracting a large number of enthusiastic audiences. The background is an indoor venue with high ceilings and industrial-style lighting, creating a vibrant atmosphere. He initially wore a white vest and light-colored trousers. Later, he changed into a uniquely patterned green hooded sweatshirt and dark trousers, but his vitality remained unchanged all the time. Throughout the entire performance, he interacted with the audience with expressive gestures, raised his arms and moved vigorously on the stage. The audience was clearly very excited. Many people raised their mobile phones to capture this moment, and their phone lights produced a flashing effect in the dimly lit venue. The stage is equipped with large speakers and microphones, emphasizing the live music experience. The audio was mainly composed of the performer's energetic music, which aroused the enthusiasm of the audience. Although the specific lyrics are illegible, the cheerful rhythm and caddies suggest a genre that blends rock or alternative influences, enhancing the atmosphere of the entire concert. It highlights the connection between the artist and the audience. This video effectively captures the essence of a live concert, showcasing the energy, excitement and shared experience of participating in the performance."
+ },
+ {
+ "video_id": "PI-cd85c134-6eab-4173-b1f8-60c8b3c2bb76",
+ "path": "/workspace/benchmark/video/PI-cd85c134-6eab-4173-b1f8-60c8b3c2bb76.mp4",
+ "answer": "The video provides a comprehensive demonstration of the Hyundai Fate 2 translation earbuds, led by an enthusiastic woman who effectively showcases their features and benefits. She starts by holding one earbud near her mouth, explaining its functionality while the camera zooms in to highlight its design and the intricate microphone area. She then demonstrates the earbud's comfort and fit by placing it in her ear briefly. Opening the sleek, compact black charging case, she reveals both earbuds, emphasizing their modern design and the prominent Hyundai logo. The accompanying promotional audio track enhances the demonstration with a warm and excited tone, detailing the earbuds' ability to translate over 135 different languages in real-time. The audio underscores the current affordability of the earbuds, priced under $10, presenting this as the optimal time for purchase. The speaker's persuasive tone encourages viewers to take immediate action, ensuring they never miss out on important conversations, whether it's catching up on gossip at a nail salon or engaging in multilingual discussions. The video concludes with a close-up of the earbuds nestled in their charging case, reinforcing the product's practicality, value for money, and its cutting-edge functionality."
+ },
+ {
+ "video_id": "PI-1df3b64d-f431-45e3-9caf-3c5b15895bf3",
+ "path": "/workspace/benchmark/video/PI-1df3b64d-f431-45e3-9caf-3c5b15895bf3.mp4",
+ "answer": "The video depicts a realistic scenario where a police officer in a dark uniform stops a car in a residential area, engaging in a conversation with the occupants, one of whom is wearing a yellow jacket and hat, and the other in a blue suit. The interaction alternates between the officer leaning against the vehicle and the individuals inside, discussing how to respond to questions about speeding and car searches. The officer is seen holding a speed gun, and a moot court scene is subtly highlighted to emphasize the speed dispute. Meanwhile, the audio track features a male voice providing advice on how to handle such a traffic stop, echoing the key points made in the video. The speaker emphasizes the importance of not admitting to speeding, refusing to consent to a car search, remaining calm, and legally filming the officer during the stop to protect one's rights and avoid inadvertently admitting guilt. The combined message from both the audio and video is clear: it's crucial to be informed and prepared during a traffic stop to safeguard one's legal rights and handle the situation with composure."
+ },
+ {
+ "video_id": "PI-a7e9b299-e8c8-4f8b-ae60-c05aa5b92793",
+ "path": "/workspace/benchmark/video/PI-a7e9b299-e8c8-4f8b-ae60-c05aa5b92793.mp4",
+ "answer": "The video features a light-hearted interaction between a woman and a man in a room with a white wall. The woman, dressed in a black off-shoulder top, black cardigan, and dark jeans, initially stands with her hands on her hips, while the man, in a white t-shirt and beige pants, leans against the wall with one hand in his pocket. The text overlay 'i can do better' sets a playful tone as the woman begins to move her arms and body energetically, and the man points and gestures, seemingly encouraging her. They engage in a synchronized dance or exercise routine, maintaining a fun and relaxed atmosphere. The scene continues with the man leaning against the wall, one arm raised, and the woman standing beside him with her hands in her pockets. The text 'markells time to shine' appears as the man starts moving energetically, bending his knees as if preparing to dance. The woman watches and occasionally moves her hands. The man's energetic movements continue, and the woman joins in, bending her knees and shifting her position. The text changes to 'think we used this one lol' as both individuals move playfully, keeping the cheerful atmosphere. Meanwhile, the audio features a disjointed conversation among three individuals—two males and one female—with neutral or uncertain emotions. The dialogue includes unclear statements, such as one male expressing confusion about a filming location and another mentioning numbers briefly. The female's contribution is minimal and lacks clarity, suggesting a casual and fragmented interaction with no evident emotional intensity or specific purpose, and no background music is present. The overall scene combines the visual playfulness with the audio's casual, disjointed conversation."
+ },
+ {
+ "video_id": "PI-939e40ba-3017-49ba-bb19-7dcf45530b9e",
+ "path": "/workspace/benchmark/video/PI-939e40ba-3017-49ba-bb19-7dcf45530b9e.mp4",
+ "answer": "In the video, a person wearing a yellow hat and a grey hoodie stands against a plain indoor background, with a door frame visible on the left. Throughout the clip, the individual remains largely still, with only slight movements of the head and shoulders. The lighting is soft and even, suggesting an indoor environment, perhaps a minimally decorated room. The audio overlay includes a clip of a song whose lyrics are: \"She just wants a party. Every time I'm not wrong...\" This implies a casual, perhaps reflective mood, consistent with the visual elements of the video. Text superimposed at the top of the frame reads \"When I don't reply to text messages\", then changes to \"When someone doesn't reply to text messages\". These overlays emphasize the theme of communication and the emotional responses tied to unanswered messages. The person's expression and posture, combined with the text, convey contemplation or mild dejection, perhaps reflecting the individual's own experience of such exchanges. The core theme of the video revolves around the common social experience of waiting for a text reply and its possible emotional impact. Simple visual elements, combined with relevant text and audio, create a narrative many viewers may find familiar and appealing."
+ },
+ {
+ "video_id": "PI-96d5b0e6-6ec3-4fee-9815-3694230dee6b",
+ "path": "/workspace/benchmark/video/PI-96d5b0e6-6ec3-4fee-9815-3694230dee6b.mp4",
+ "answer": "The video features a person in a medical setting, dressed in a blue scrub top, a colorful patterned surgical cap, and a white apron, standing in a well-lit, clean, and organized hospital storage room filled with shelves of medical supplies. The individual, who also wears a necklace with a small pendant, appears to be delivering a reflective commentary directly to the camera. This aligns with the audio content, where a neutral-toned female speaker calmly and matter-of-factly discusses the challenges of having a strong work ethic. She emphasizes that excelling at one's job often leads to taking on additional responsibilities beyond one's own role, highlighting the workplace dynamics where competence can result in increased workload. The video captures a moment of professional reflection, blending the visual context of a medical environment with the speaker's insights on workplace expectations and the unintended consequences of being good at one's job."
+ },
+ {
+ "video_id": "PI-0095105a-a978-428d-91d2-59eccd584797",
+ "path": "/workspace/benchmark/video/PI-0095105a-a978-428d-91d2-59eccd584797.mp4",
+ "answer": "This video features a brief but meaningful interaction between two women in front of a plain beige wall. The woman on the left wears a light blue off-the-shoulder sweater and holds a blue lanyard with \"Texas\" and \"Alana\" written on it. The woman on the right, wearing a black long-sleeved top and blue jeans, with curly hair, appears to be a guiding or encouraging figure. She directs and encourages the other woman with gestures, as emphasized by the text overlay \"Raise your YAA YA YA\". This phrase, combined with the visual cues, marks a moment of support and enthusiasm, perhaps meant to boost confidence or celebrate an achievement. The core theme of the video is encouragement and positive reinforcement from one person to another, underscoring the importance of support in both personal and professional growth."
+ },
+ {
+ "video_id": "PI-4112e806-b89e-48b9-a7bf-4edaa2bf1818",
+ "path": "/workspace/benchmark/video/PI-4112e806-b89e-48b9-a7bf-4edaa2bf1818.mp4",
+ "answer": "The video features two women interacting in various settings, accompanied by a light and lively BGM track sung by two female voices. The music, with its gentle piano melodies and soft strings, creates a warm and soothing atmosphere, enhancing the overall calm mood of the scene. Initially, the women are outdoors, engaged in conversation; one has long, wavy blonde hair and wears a light-colored top with a brooch, while the other has blonde hair tied with a headband and wears a checkered top. The scene shifts to them inside a car, where the first woman is now seen wearing sparkly gloves. A close-up captures the second woman making a hand gesture. The focus returns outdoors, where the first woman holds a lighter and the second woman, in lace gloves, holds a cigarette. The narrative continues with a close-up of the first woman in sparkly sunglasses, holding the steering wheel, suggesting she might be driving or posing as if driving. The scene transitions to both women in the car, with the second woman lighting a cigarette, and they appear to be enjoying a light-hearted conversation. The video concludes with a close-up of the first woman, emphasizing her sparkly sunglasses and top, as she continues to hold the steering wheel, all while the calming BGM plays in the background, providing a perfect ambiance for relaxation or reflection."
+ },
+ {
+ "video_id": "PI-1218964c-ca49-49a5-aa21-68bac8369beb",
+ "path": "/workspace/benchmark/video/PI-1218964c-ca49-49a5-aa21-68bac8369beb.mp4",
+ "answer": "The video showcases Ethan and Mattie engaging in a lively cup-flipping game in a kitchen setting, complete with a brick wall and white cabinets as the backdrop. Ethan uses blue cups, while Mattie opts for white ones, and they take turns attempting to flip their cups onto a wooden table. Below the table, several clear cups are labeled with varying dollar amounts, adding an element of excitement and anticipation to the game. The audio accompanying the video features a conversation between two people, with both male and female voices exuding excitement and amusement, perfectly mirroring the playful atmosphere of the cup-flipping challenge. As the game progresses, Mattie successfully flips a cup onto the table, winning the round, and the text \"MATTIE $5\" appears next to her, followed by the triumphant display of \"WINNER!\" above her, flanked by trophy icons, celebrating her victory. The combination of the engaging visuals of the game and the spirited audio conversation creates a dynamic and entertaining scene, capturing the joy and camaraderie of the participants."
+ },
+ {
+ "video_id": "PI-d0eccd74-7e66-4d8b-8d0b-dddb98caa7ac",
+ "path": "/workspace/benchmark/video/PI-d0eccd74-7e66-4d8b-8d0b-dddb98caa7ac.mp4",
+ "answer": "The video opens with a person in a beige cap and olive green shirt seated at a red picnic table, carefully holding a container of beans, setting a casual outdoor scene. Another individual in a beige t-shirt joins in, gesturing animatedly as if sharing a humorous anecdote, with playful on-screen texts like 'ayyo' and 'brought beans to school' adding a lighthearted touch. Meanwhile, a person in a black t-shirt, holding a sandwich and Takis chips, reacts with mock disdain, saying something akin to \"what a loser,\" injecting a bit of playful rivalry. The beige-capped person continues to impress by packing an array of homemade foods, including tamales, prompting the beige-tshirt person to exclaim, \"He thinks it's Christmas,\" highlighting the unexpected abundance. As the beige-capped person skillfully cuts vegetables with impressive knife skills, preparing salsa, the black-tshirt person reacts with surprise, perhaps saying, \"Wait, a salsa right now?\" The video culminates with the beige-capped person proudly holding a large, homemade tortilla, while the black-tshirt person, clearly intrigued, considers asking for a taste, suggesting a friendly and communal atmosphere."
+ },
+ {
+ "video_id": "PI-b39f00ec-018c-4f5f-b8d3-8389f3a7e8c4",
+ "path": "/workspace/benchmark/video/PI-b39f00ec-018c-4f5f-b8d3-8389f3a7e8c4.mp4",
+ "answer": "This video shows an intimate and gentle scene in which a little monkey, like a baby, is held in someone's hand. The monkey is snugly wrapped in a red and black checkered hat, adding comfort and warmth to the scene. The background is a beige tufted surface decorated with clear circular ornaments, giving a luxurious and cozy atmosphere. The person holding the monkey wears a gold chain and has nails painted white, suggesting care and attention to detail. Throughout the video, the monkey remains relatively still, occasionally moving its head and limbs, showing curiosity and slight unease. The person gently adjusts the hat to ensure that the monkey feels safe and comfortable. The background audio is a soft, melodious voice singing \"I... I... I... Woo woo\", which adds an emotional dimension to the scene and suggests happiness and tenderness. The core theme of the video is the bond between humans and monkeys, highlighting care, tenderness, and companionship. The overall atmosphere is heartwarming, emphasizing the joy and satisfaction gained from nurturing and protecting a vulnerable creature."
+ },
+ {
+ "video_id": "PI-f18659ef-ad97-4f7e-aa9c-55221cf80612",
+ "path": "/workspace/benchmark/video/PI-f18659ef-ad97-4f7e-aa9c-55221cf80612.mp4",
+ "answer": "In this short, humorous video, a man with a beard sits in the driver's seat of a car, wearing a white baseball cap. He looks directly at the camera as he shares a lighthearted observation about short girls. On-screen text accompanies his monologue. He begins by saying, 'Short girls wake up every single day with two options.' He then presents these options, with the corresponding text appearing on screen: 'Am I gonna be nice, cute, and sweet today.. Or am I gonna choose violence.' After a brief pause, he continues, 'And almost every single time...' He concludes his joke with a smile, as the final text appears: 'They choose violence,' followed by a winking face and a skull emoji."
+ },
+ {
+ "video_id": "PI-0215e791-b127-4aa4-a3e1-9e62b6c139c5",
+ "path": "/workspace/benchmark/video/PI-0215e791-b127-4aa4-a3e1-9e62b6c139c5.mp4",
+ "answer": "The video unfolds outside a baseball stadium. A group of people in sportswear, including purple sweatshirts printed with \"PITSTOWN\" and black shirts with numbers such as \"12\" and \"11\" on the back, are engaged in a lively discussion. The conversation revolves around a proposal involving baseball bats, along the lines of: \"If I give you a bat and I hit it, I'll keep it; if we give you a new bat and you hit it, you can keep it.\" The group negotiates game tactics, punctuated by enthusiastic encouragement such as \"Come on, come on, baby, come on.\" The scene switches to a baseball field, where a player in a black uniform with the number \"11\" on his back is batting. The player swings the bat, hits the ball, and rushes toward the base, capturing the dynamic movement and determination. The background features barbed wire and trees, emphasizing the outdoor environment. At the end of the video, the player interacts with the crowd near the fence. Text beginning \"BAT actually...\" appears on screen, referencing unreleased 2026 products and highlighting the excitement and camaraderie among the team members, who celebrate the prospect of a successful game."
+ },
+ {
+ "video_id": "PI-4c2e725c-fd0e-4e61-b9e1-8c9f47943648",
+ "path": "/workspace/benchmark/video/PI-4c2e725c-fd0e-4e61-b9e1-8c9f47943648.mp4",
+ "answer": "The protagonist of the video is a blond man in a black shirt, standing against a simple wall with a lamp on it, creating a plain, focused backdrop. The text at the top of the screen reads \"Whose man put a plate of food at ASHLEY's house?!\", setting up a humorous scenario in which someone's partner delivers food to ASHLEY's home. The man begins to speak, gesturing with his hands, explaining that a person was recording while dropping off a plate of food and saying, \"I brought you Christmas food. Merry Christmas\", filming the entire process. He then opens the aluminium foil on the plate, revealing a meal of ham, macaroni, rice, and beans. He holds up the plate and shows its contents in detail, perhaps describing the food and the amusing circumstances of the delivery. The video ends with a close-up of the plate of food, emphasizing the humorous framing provided by the content and text overlay, creating an engaging and light-hearted story about an unexpected food delivery."
+ },
+ {
+ "video_id": "PI-43a6ae2c-b42b-480c-90cb-12451dda9876",
+ "path": "/workspace/benchmark/video/PI-43a6ae2c-b42b-480c-90cb-12451dda9876.mp4",
+ "answer": "The video begins inside an airplane, showing the aisles and passengers and creating a tense, uncertain atmosphere. The subtitle reads, \"Passenger video shows the scene after the Toronto plane crash,\" indicating that the video will depict the aftermath of a serious accident. The scene then switches to an external view of the plane, which has skidded off the runway and now sits on the snow. The aircraft is severely damaged: the nose and part of the fuselage are crumpled, underscoring the severity of the crash. Several people walk around the plane, some taking photos or videos, capturing the chaotic and shocking scene after the accident. The sky is partly cloudy and the ground is covered with snow, adding a desolate, melancholy atmosphere. The text overlay remains unchanged throughout the clip, emphasizing the aftermath of the crash. The audio includes exclamations of shock and disbelief, underscoring the suddenness and horror of the event. The video effectively conveys the immediate consequences of the plane crash, highlighting the disbelief and chaos experienced by those present."
+ },
+ {
+ "video_id": "PI-53134bc3-341c-493a-99e3-2eb210135437",
+ "path": "/workspace/benchmark/video/PI-53134bc3-341c-493a-99e3-2eb210135437.mp4",
+ "answer": "The video takes place in a dimly lit indoor setting, likely a club or event space, characterized by exposed pipes and various posters on the walls. A group of men, dressed in formal attire consisting of dark suits with gold trim, white shirts, and ties or bow ties, are gathered closely together, exuding high spirits as they engage in animated conversation and gestures. Some of the men are clapping their hands, while others make expressive hand movements, suggesting a celebratory or jubilant mood. The atmosphere is lively and energetic, with the group seemingly enjoying a moment of camaraderie or celebration. Complementing this vibrant scene is the background music, featuring a male singer with high energy and enthusiasm. The lively and rhythmic melody, along with the singer's passionate vocal delivery, enhances the overall vibrant atmosphere, conveying a core message of excitement and celebration."
+ },
+ {
+ "video_id": "PI-9c2aeb2d-8bde-41e3-a9a8-98fcea987d6d",
+ "path": "/workspace/benchmark/video/PI-9c2aeb2d-8bde-41e3-a9a8-98fcea987d6d.mp4",
+ "answer": "This video captures a peaceful family scene in a cozy living room. The backdrop is a grey sofa decorated with beige blankets and patterned cushions, and a black-and-white photo collage frame above the sofa adds a personal touch. A wooden table in front of the sofa holds a large plastic bottle containing orange liquid and a plate with food residue. The main figures are a person wearing a black hoodie with white text and multiple necklaces, and a child wearing a grey hoodie and light grey sweatpants. The person sits on the sofa, absorbed in a green mobile phone. The child stands beside the coffee table, hands resting on it, occasionally glancing at the person or the table. There is very little audio and no significant dialogue, emphasizing the quiet, intimate atmosphere of the scene; the focus remains on the visual interaction between the person and the child. The child looks curious and moves slightly but stays close to the table, while the person remains focused on the phone. The core theme of the video is the peaceful coexistence and subtle interaction between an adult and a child in a home environment. The absence of verbal communication highlights the tranquility of the moment, a snapshot of daily life in which simple, quiet companionship takes center stage."
+ },
+ {
+ "video_id": "PI-66d20579-4a49-4780-9199-4eaa2f4c5d70",
+ "path": "/workspace/benchmark/video/PI-66d20579-4a49-4780-9199-4eaa2f4c5d70.mp4",
+ "answer": "The video captures a moment of energy and joy during a football game as Vikings players in white and purple uniforms celebrate on the field. The scoreboard shows a close score of 27-24, and a flag on the play indicates a tense and crucial moment in the game. The text \"Vikings Camp Rock\" is humorously overlaid on the scene, drawing an amusing comparison between the team's energetic celebrations and the movie Camp Rock. The players can be seen dancing and cheering with energy, reflecting a sense of accomplishment and determination. The scene then shifts to a group of fans in colorful costumes, dancing and enjoying themselves in the stadium, mirroring the players' enthusiasm. Meanwhile, a male speaker with a neutral or uncertain tone discusses a big moment in the game, when Jackson made his first interception of the season to secure the victory. The interception is highlighted as a decisive, impactful event, contributing to the overall atmosphere of excitement and camaraderie portrayed in the video."
+ },
+ {
+ "video_id": "PI-19b0a25d-94ac-41a0-934e-f775f20c5d61",
+ "path": "/workspace/benchmark/video/PI-19b0a25d-94ac-41a0-934e-f775f20c5d61.mp4",
+ "answer": "The video, which takes place inside a car, features a woman with long blonde hair wearing a dark hoodie with a graphic design. The words \"Go stream Sports Car\" are prominently displayed in the center of the screen. She performs various hand and arm movements, creating a dynamic visual effect. Sunlight shines through the windows, enhancing the casual, relaxed atmosphere. A man in a red hat is partially visible in the background. The video appears to promote the streaming of content called \"Sports Car,\" focusing on the woman's expressive gestures and the car's interior setting."
+ },
+ {
+ "video_id": "PI-669da437-0440-414f-a7f1-2d9f4ca48787",
+ "path": "/workspace/benchmark/video/PI-669da437-0440-414f-a7f1-2d9f4ca48787.mp4",
+ "answer": "The video begins with a tranquil view of a wooden pathway nestled in a lush forest, where a hand holding three red grapes is visible in the foreground. A black gibbon with white facial markings soon appears at the top of the pathway, and the text 'HE'S COMING' appears on the screen, signaling its approach. The gibbon descends the pathway energetically, swinging its arms and moving quickly. As it reaches the bottom, the scene transitions to a close-up of the gibbon sitting on the pathway, looking directly at the camera. The playful interaction begins with text on the screen saying 'THANKS FOR MEETING ME ON SUCH SHORT NOTICE,' followed by 'REALLY APPRECIATE IT' and 'JERRY'S BEEN ON MY BACK ALLLL DAY,' suggesting a humorous tone. The gibbon moves around, and more text appears, including 'QUIT MONKEYING AROUND' and 'I'M LIKE DUDE I'M LITERALLY A MONKEY,' adding to the playful banter. The gibbon then reaches out towards the hand holding the grapes, and the text 'IS THAT ALL YOU GOT?' and 'YOU HOLDING OUT ON ME?' appears as it grabs a grape and starts eating it. The video wraps up with the gibbon jumping off the pathway and swinging away into the forest, showcasing its agility. The voiceover, featuring a male voice with a playful and humorous tone, narrates the gibbon's approach, greets it, and engages in a light-hearted conversation, expressing appreciation for the gibbon's presence and joking about its activities, highlighting a moment of connection and humor in the forest."
+ },
+ {
+ "video_id": "PI-a5d3145d-ca02-4d7d-a379-b2a5d31ae646",
+ "path": "/workspace/benchmark/video/PI-a5d3145d-ca02-4d7d-a379-b2a5d31ae646.mp4",
+ "answer": "The video enthusiastically promotes a special event at Dairy Queen, aligning perfectly with the cheerful announcement from the audio that free ice cream cones will be offered on March 20th without any purchase. It begins with a close-up of a hand holding a Dairy Queen ice cream cone featuring a red swirl, positioned in front of a Dairy Queen store, accompanied by a text overlay that reads 'FREE Dairy Queen (No purchase)' with an ice cream emoji. The scene transitions inside the store, showing the same hand dipping the cone into a container of red syrup. A person in a light-colored sweatshirt is then shown holding the finished cone, giving a thumbs-up gesture, with a menu board displaying various Dairy Queen options in the background. A promotional graphic appears, announcing 'FREE CONE DAY March 20' with an image of a vanilla soft-serve cone. The video concludes with another close-up of the red-swirled cone, held by the person in the light-colored sweatshirt, with 'HAPPY TASTES GOOD' visible on the wrapper, as the person gives another thumbs-up, emphasizing satisfaction with the free treat. The overall atmosphere is lively and positive, effectively conveying the excitement of the promotional event."
+ },
+ {
+ "video_id": "PI-7aa33a44-295d-46f2-b3dc-64b90e21b09f",
+ "path": "/workspace/benchmark/video/PI-7aa33a44-295d-46f2-b3dc-64b90e21b09f.mp4",
+ "answer": "In the video, a little girl of about four wears a purple shirt with the words \"Just a girl who loves chocolate milk\" on it and a green stethoscope around her neck, indicating that she is pretending to be a doctor. The background is a white bookshelf filled with books and assorted items, creating a comfortable, homey atmosphere. The girl holds a green toy in her hand and seems to be examining it carefully. Text on the screen reads: \"At the age of four, she has already learned the art of distraction...\", highlighting her cleverness at a young age. As the video progresses, an adult's hand appears in the frame and the girl is asked, \"What do you like to do?\" She replies, \"I like...\" while continuing to focus on the toy in her hand, demonstrating her ability to deflect attention. Finally, she answers, \"I like editing videos,\" an unexpected reply that takes the audience by surprise. At the end of the video, the words \"Well done\" appear on the screen, and the girl says \"Thank you,\" showing her polite and appreciative nature. The video effectively captures the girl's mischievous, intelligent behavior and her ability to engage and entertain through her actions and reactions."
+ },
+ {
+ "video_id": "PI-f7c550d4-dd52-47b8-b526-a90b1e0cc146",
+ "path": "/workspace/benchmark/video/PI-f7c550d4-dd52-47b8-b526-a90b1e0cc146.mp4",
+ "answer": "The video features an experiment conducted by a group of four people to see how much air pressure different sports balls can withstand before exploding. For safety, the individuals stand behind a protective plexiglass shield. The experiment begins with a soccer ball, which is inflated using an air compressor. An on-screen gauge tracks the pressure in PSI. The soccer ball bursts at 34 psi. Next, they test a basketball. As the pressure increases, the men's anticipation and tension grow. The basketball dramatically explodes at 97 psi. The final test is a football. When the gauge climbs past the previous record, the men visibly tense up. The football loudly bursts at 103 psi. Slow-motion replays capture the moment of explosion and the men's startled reactions. After cleaning up the debris, the video concludes with a teaser for the second part, where they plan to inflate a giant soccer ball and a call to subscribe to their channel."
+ },
+ {
+ "video_id": "PI-60b88de4-770f-4f9f-bd59-15f7558389fe",
+ "path": "/workspace/benchmark/video/PI-60b88de4-770f-4f9f-bd59-15f7558389fe.mp4",
+ "answer": "This video takes place at night in a busy urban area, where a large crowd gathers in front of a prominent billboard. The billboard carries a blunt, impactful message: \"The number of deaths due to smoking this year: 444,947, and it is still increasing.\" The crowd is clearly engaged; many people take out their phones to record or photograph the billboard. The atmosphere is somber, and the gravity of the message heightens it further. The bright lights of the billboard contrast sharply with the dark sky, drawing attention to the troubling statistic. As the video progresses, the crowd's attention remains fixed on the billboard, reflecting the weight of the message. Suddenly, colorful confetti begins to fall from above, a surprising and somewhat jarring contrast with the serious message. The confetti keeps falling, and the crowd's reactions vary: some raise their hands excitedly, while others keep staring at the billboard. The climax comes when the text on the billboard dramatically changes to \"The number of deaths from smoking this year: 0, and it is still increasing\", signaling a sudden and unexpected shift. The falling confetti and the new message create a festive atmosphere, juxtaposed with the initial gloomy tone. The video effectively uses visual and textual elements to powerfully illustrate the impact of smoking and the hope for a future with zero smoking-related deaths."
+ },
+ {
+ "video_id": "PI-507b7da5-cc35-49bb-a11f-e77790ee166f",
+ "path": "/workspace/benchmark/video/PI-507b7da5-cc35-49bb-a11f-e77790ee166f.mp4",
+ "answer": "The video features a man participating in a 36-day weight loss or exercise challenge, as indicated by the overlaid text \"DAY 36 Weight Loss / Exercise!!\". Standing outdoors on a residential street, he is dressed in a black t-shirt with \"HAWAII EST. 1959 ALOHA STATE\" printed on it, gray shorts, white socks, and white sneakers. The background includes houses, trees, and a parked car, creating a typical suburban setting. The man, who has tattoos on his arms and a small ponytail on top of his head, is seen performing a series of movements, shifting his weight from side to side and moving his arms, demonstrating his commitment to the fitness challenge. The consistent background and attire suggest a continuous daily record, emphasizing the ongoing nature of his journey. Meanwhile, the accompanying audio consists of ambient sounds, possibly the wind, along with some footsteps that might be human, adding a subtle layer of atmosphere to the scene without distracting from the main focus of the video, which is the man's dedication to his fitness goals."
+ },
+ {
+ "video_id": "PI-15f7fca4-9df5-4035-92c5-025f0aa591fa",
+ "path": "/workspace/benchmark/video/PI-15f7fca4-9df5-4035-92c5-025f0aa591fa.mp4",
+ "answer": "The video portrays a reflective and light-hearted scene, where an individual, dressed in a chestnut and brown striped sweater, sits in a comfortably and elaborately decorated room, creating a warm and personal atmosphere with a bulletin board adorned with photos and notes, and a shelf filled with small plants, a basket of flowers, and small statues. Initially, the person rests their chin on their hand, appearing contemplative or relaxed, but then moves their hand away, perhaps in reaction to something, possibly recalling their past as an Internet celebrity mentioned in the accompanying audio. The soft and consistent lighting in the room enhances the nostalgia and intimacy of the moment. The audio, filled with humorous and self-reflective monologues, expresses an interesting and slightly skeptical feeling towards the individual's past online fame and the fleeting nature of internet trends. The person's expressions and body language in the video convey a sense of surprise or interest, possibly triggered by memories of their younger self in the context of these online experiences. Together, the video and audio create a harmonious blend of personal nostalgia and humorous insight into the brevity of online trends, capturing a moment of self-reflection with a light and nostalgic tone."
+ },
+ {
+ "video_id": "PI-4030017f-06f9-430a-9403-0dbcee04b815",
+ "path": "/workspace/benchmark/video/PI-4030017f-06f9-430a-9403-0dbcee04b815.mp4",
+ "answer": "The video features a dynamic and engaging visual narrative, showcasing a diverse group of individuals in various settings, while the audio presents a casual and humorous conversation about the theme music of \"Saturday Night Live.\" Visually, the video begins with a woman in a black velvet outfit in a dimly lit room, followed by a man in a suit in a warmly lit room, and a person with red hair against a backdrop of festive lights. The narrative continues with a man in a dark jacket speaking, a man in a dark blue shirt in front of framed pictures, and a woman in a white graphic shirt in a dark setting. As the video progresses, more individuals appear, including a man in a floral sweatshirt, a man in a plaid shirt, a man in a black leather jacket, and a man with long, dark hair in a pinstripe coat, creating a rich tapestry of scenes. Meanwhile, the audio conversation humorously discusses the \"Saturday Night Live\" theme, describing it as filled with saxophones and multiple horns, evoking an atmosphere of chaos and activity. The speakers try to recall the melody, with one joking about a saxophonist being escorted out of a building by police while playing. The energetic and paced background music enhances the lively and somewhat chaotic atmosphere of the conversation, complementing the diverse visual elements of the video."
+ },
+ {
+ "video_id": "PI-334502aa-319d-4509-a5f0-56084cd6e054",
+ "path": "/workspace/benchmark/video/PI-334502aa-319d-4509-a5f0-56084cd6e054.mp4",
+ "answer": "The video unfolds in a well-lit communal or dining area, characterized by tables and chairs, where a series of elderly individuals are featured, each holding a framed photograph in front of their faces. The sequence begins with a person wearing a red shirt with floral embroidery and a beige cardigan, holding a small, round, framed photograph. This is followed by another elderly person in a leopard print jacket over a yellow shirt, presenting a larger, rectangular, black-and-white portrait. The narrative continues with an elderly man in a gray cardigan over a striped shirt, showcasing a photograph with a red and green plaid border, and then an individual in a pink sweatshirt displaying a picture of a person in a red and white patterned outfit. The theme of reminiscence is further emphasized as an elderly person in a black jacket and beige pants holds a black-and-white portrait, and another in a striped shirt and black jacket presents a large, ornate, oval-framed photograph of two people in formal attire. The video maintains this reflective mood, with each participant sharing a piece of their personal history through the photographs they hold. The background remains consistent, reinforcing the communal setting where memories are cherished and shared. The audio content, though not explicitly detailed, suggests a nostalgic and heartfelt tone, with phrases like \"I'm glad whenever I can share a laugh\" and \"do that funny little girl slipping through my fingers all the time I tried to capture every minute the feeling,\" indicating a deep appreciation for shared moments and the passage of time. The video beautifully captures the essence of memory and the joy of reminiscing, highlighting the importance of preserving and sharing personal histories."
+ },
+ {
+ "video_id": "PI-c7931465-9723-47a0-943e-c710892adf90",
+ "path": "/workspace/benchmark/video/PI-c7931465-9723-47a0-943e-c710892adf90.mp4",
+ "answer": "This video, shot in a factory, features a male worker. The factory looks very large and spacious, and inside it stands a vehicle that appears to be purpose-built for work. The man, wearing a green work uniform, seems to be explaining how to operate something; while speaking, he gestures with his hands and looks steadily at the camera. There are no scene changes and no other characters appear in the video. The man keeps pulling the strap under the vehicle and finally loops it into a circle. On top of the vehicle, a very thick chain is visible, presumably used for pulling heavy objects. Overall, there is no particularly important text in the video. From the audio, it can be heard that the man is teaching viewers how to use the vehicle straps."
+ },
+ {
+ "video_id": "PI-261ea1bd-de7a-4412-842e-9bbcee93777f",
+ "path": "/workspace/benchmark/video/PI-261ea1bd-de7a-4412-842e-9bbcee93777f.mp4",
+ "answer": "The video features two individuals sitting at a table in a kitchen setting. The person on the left is wearing a purple top and has dark hair, while the person on the right is wearing a white long-sleeve shirt and has red hair. The table is covered with a white tablecloth that has blue circular patterns. The background includes kitchen cabinets, a window, and some plants. The video begins with the text overlay 'since this app won't be around much longer' appearing on the screen. The person on the right speaks, and the text changes to 'there's one more secret I feel I have to share with u..'. The person on the left listens attentively. The text then changes to 'this was never my mom' as the person on the left reaches for a yellow folder. The person on the left opens the folder and reveals a document with the word 'CONFIDENTIAL' at the top. The text overlay changes to 'we're actors😱' as the person on the left holds up the document. The person on the right points at the document, and the text changes to 'but she'll be a good mom one day🥰'. The video concludes with both individuals looking at the document and the text overlay remaining the same. The audio content includes the person on the right saying, 'Since we're all going to die, there's one more secret I feel I have to share with you.' The person on the left responds with, 'I did not care for the godfather, what did not care for the godfather, how can you even say that I didn't like didn't like it?' The dialogue suggests a humorous or candid conversation between the two individuals, with the revelation that they are actors and not actually related, as indicated by the text overlays."
+ },
+ {
+ "video_id": "PI-54f8ae78-5128-4f08-a2e5-6b911444cdd2",
+ "path": "/workspace/benchmark/video/PI-54f8ae78-5128-4f08-a2e5-6b911444cdd2.mp4",
+ "answer": "The video captures a warm celebration of a lifelong friendship between two women, set in a cosy room with aqua walls and speckled countertops. One woman dances energetically in a flowing white dress, while the other, wearing a green top with sequined sleeves, watches and smiles. The text overlay traces their journey from becoming best friends in their teens to now, in their late 50s, highlighting their shared experiences in high school, cheerleading, dance teams, and beauty pageants, as well as their resilience in overcoming life's challenges. Meanwhile, the audio features a male speaker discussing, in a firm, confident tone, cultural differences, the importance of distinguishing influence from manipulation, and influences that extend beyond music. Other speakers contribute fragmented phrases that add dynamic but unclear elements to the dialogue. The ambient music consists of rhythmic, energetic beats, creating an intense, focused atmosphere that contrasts with the joyful, camaraderie-filled visuals of a video that exudes warmth and celebrates lasting bonds of friendship."
+ },
+ {
+ "video_id": "PI-4b0aa16c-8bd7-4dd4-b138-148c83d577fc",
+ "path": "/workspace/benchmark/video/PI-4b0aa16c-8bd7-4dd4-b138-148c83d577fc.mp4",
+ "answer": "In the video, a person wearing a beige twill high-neck sweater speaks directly to the camera in a dimly lit room. The background is simple: a neutral-colored wall and a visible door create a focused, intimate environment. The on-screen text reads: \"Now TIK TOK has just confirmed which app they officially shut down in the United States (Instagram is terrible).\" This sets the tone for the discussion, which centers on the recent announcement that TikTok will shut down in the United States on the 19th. The speaker expresses concern and frustration about the future of social media, questioning where users will turn next. She notes in particular that Instagram is not a satisfactory alternative and emphasizes the challenge of finding a platform that offers the same kind of organic content as TikTok. She invites the audience to share their thoughts on potential alternatives, demonstrating a strong desire for community engagement and discussion. Facing TikTok's withdrawal from the US market, the overall mood of the video is one of uncertainty and a search for solutions."
+ },
+ {
+ "video_id": "PI-d3e2ba50-48f0-4404-bcdc-72c399c20299",
+ "path": "/workspace/benchmark/video/PI-d3e2ba50-48f0-4404-bcdc-72c399c20299.mp4",
+ "answer": "The video features a dynamic scene where a person is energetically playing the violin, embodying a deep passion for the instrument. The individual, dressed in a black hoodie with red text and checked trousers, has long curly hair and is seen indoors, likely in a dimly lit room with a dark background. As she plays, she moves her body and head expressively to the rhythm, with the camera focusing closely on her upper body and the violin, capturing both her movements and the instrument's details. Complementing the visual, the background music track includes a female voice initially, followed by a male voice, both conveying expressive and passionate emotions. The recurring phrase in the vocals, \"Stick to the violin,\" reinforces the theme of deep engagement and connection with the instrument, aligning perfectly with the video's lively and enthusiastic atmosphere. Together, the audio and visual elements create a cohesive narrative of fervent dedication to violin performance."
+ },
+ {
+ "video_id": "PI-c397a844-683a-4c05-a31f-21b35b722e3b",
+ "path": "/workspace/benchmark/video/PI-c397a844-683a-4c05-a31f-21b35b722e3b.mp4",
+ "answer": "At the beginning of the video, a person wearing a hoodie sits indoors with a window visible in the background. The text overlay reads: \"I asked a man in the bar to explain to me how difficult his job was.\" The tone is humorous or satirical, suggesting that the person is listening to someone explain their work in a condescending manner. The scene then switches to a medical environment, specifically an operating room, where medical professionals in blue work clothes are preparing for surgery. The text overlay changes to \"I had a brain surgery at 7:30 the next morning\", highlighting the dramatic shift from a casual social setting to a high-stakes professional one. The video then shows a close-up of a medical professional wearing a blue surgical gown, a surgical mask, and a hairnet, concentrating under a microscope during brain surgery, emphasizing the precision and focus required in the medical field. Next, a person in a bright pink sports jacket adjusts mechanical equipment at a conference or exhibition, suggesting a different professional background. The scene switches back to the operating room, where multiple medical professionals gather around a patient, discussing and pointing at medical images on a screen, underscoring the collaborative nature of surgery. Overall, the video contrasts everyday social exchanges with the high-pressure, highly skilled environment of medical work, highlighting the expertise and precision the field requires. The core theme revolves around the underestimation of professional skills, especially in medicine, and the stark gap between casual conversation and the reality of performing complex medical procedures."
+ },
+ {
+ "video_id": "PI-2c67ffc9-282e-4289-9bb0-0144b8f01ba4",
+ "path": "/workspace/benchmark/video/PI-2c67ffc9-282e-4289-9bb0-0144b8f01ba4.mp4",
+ "answer": "The animated video is set in a stylish pink and white room, where a girl in a black mask and red and blue checked shirt struggles with her identity at 15. She wishes to remove her mask but is warned against it, with exaggerated claims that revealing her face would destroy the world. A group of girls mock her, suggesting she's unattractive and unloved, leaving her distraught. A turning point occurs when a kind boy in a black leopard print top offers friendship and compliments, boosting her spirits. Though he asks to see her face, she hesitates due to self-doubt. The video ends with a close-up of the boy, Nina, leaving her face hidden, hinting at a future reveal. Themes of self-acceptance, kindness, and bullying's impact on self-esteem are explored, mirroring the audio's message of overcoming self-doubt through friendship."
+ },
+ {
+ "video_id": "PI-f1af0a1d-3ef8-466d-8060-5d28053098a3",
+ "path": "/workspace/benchmark/video/PI-f1af0a1d-3ef8-466d-8060-5d28053098a3.mp4",
+ "answer": "This visual scene presents a solemn, emotional moment outdoors in front of a brick building. A uniformed American soldier, kneeling on one knee, hands a neatly folded American flag to two elderly people seated on chairs. The white-haired old man takes the flag, visibly moved, and nods repeatedly. A group of onlookers stands quietly behind them, while several others stand aside, holding up their mobile phones to record the scene. Above the dark wooden door in the background is a sign that reads \"Family\". After shaking hands with the old man, the soldier slowly stands up, takes a step back, and gives him a solemn salute. Throughout, a stirring folk song plays as background music, conveying a powerful and inspiring feeling."
+ },
+ {
+ "video_id": "PI-c9f236ed-b9cf-40c2-be1d-acba8419794a",
+ "path": "/workspace/benchmark/video/PI-c9f236ed-b9cf-40c2-be1d-acba8419794a.mp4",
+ "answer": "The video takes place in a zoo or wildlife sanctuary, where a man in a black T-shirt with \"TEAM HUNI\" on the back stands by a wooden railing, observing a group of baboons in a naturalistic enclosure filled with green grass, trees, fallen logs, and a small stream. The man, who has a tattoo and carries a black backpack, watches the baboons with interest, pointing and gesturing towards them, especially focusing on one with a light-colored face and pinkish rump. The text \"Bro wanted it \" appears at the bottom of the screen, adding a humorous context to the scene. Throughout the video, the man reacts to the baboons' movements with amusement. Meanwhile, the audio features minimal content, with only a single sentence from a man accompanied by continuous laughter, enhancing the overall relaxed and light-hearted atmosphere of the video. The scene remains consistent, concluding with the man still observing the baboons, maintaining the humorous tone."
+ },
+ {
+ "video_id": "PI-c2885f9a-d833-4c52-a647-9606124bedd7",
+ "path": "/workspace/benchmark/video/PI-c2885f9a-d833-4c52-a647-9606124bedd7.mp4",
+ "answer": "The video unfolds outdoors, possibly in a park or backyard, featuring bare trees and a partially cloudy sky. It depicts two people engaged in a playful football activity. The first person wears a light grey hoodie, black sweatpants, white sports shoes, and a black hat, and holds a football; he appears to be directing the game, as suggested by \"Yo Chase go long\" heard in the audio. The second person, in a grey T-shirt, dark shorts, and a black hat, stands with his arms outstretched, ready to catch the ball. The words \"Beautiful catch cheer\" and \"99.9% accuracy rate\" appear on the screen, humorously exaggerating the accuracy of the throw. The first person throws the football, but the second person fails to catch it and looks confused or frustrated. The audio captures a light-hearted joke with the line, \"The ball we throw is not red. Can we buy some?\" The first person reacts with surprise and disappointment, his hand on his head, while the second person stands with his hands on his hips, seemingly questioning the throw. At the end of the video, the second person attempts another catch, succeeds this time, and raises his arm in celebration. The recurring text and audio elements play up the humor and exaggeration around throwing and catching, capturing the fun and friendship between the two."
+ },
+ {
+ "video_id": "PI-d2a52b45-30eb-4f81-b495-2c2a8b78efed",
+ "path": "/workspace/benchmark/video/PI-d2a52b45-30eb-4f81-b495-2c2a8b78efed.mp4",
+ "answer": "A woman dances in a well-lit living room furnished with, among other things, a very large carpet, windows, and cabinets, with a ceiling fan overhead. The whole room feels comfortable and creates a warm atmosphere. In the video, the woman dances in front of the camera, then drops into a split, repeatedly splitting and standing back up, which suggests she is a dance enthusiast. As she dances, a cheerful smile appears on her face, giving a very confident impression. From the audio, the background music is highly rhythmic, the kind that makes listeners want to dance along; overall, the video carries a cheerful melody. The words \"my only talent ever\" also appear in the video, further indicating the woman's enthusiasm for dance."
+ },
+ {
+ "video_id": "PI-f7ad1047-fe08-4bd7-bcb2-16fc62f6cc16",
+ "path": "/workspace/benchmark/video/PI-f7ad1047-fe08-4bd7-bcb2-16fc62f6cc16.mp4",
+ "answer": "The video captures a man's inner experience within a modern kitchen. As the background music begins, he's washing dishes at the sink, with on-screen text that reads: \"Me every minute since Sunday:\". Then, his own head appears in the upper right corner, seemingly in conversation with him, lips synchronized to the music. The background then shifts to a wider shot of the spacious kitchen, equipped with white cabinets and stainless steel appliances. The man starts twisting his body in sync with the background music and rap melody.\nSubsequently, the camera angle shifts to a low-angle view and begins moving backward, as the man gradually moves from the kitchen into the living room. Throughout the video, he's dressed in a white long-sleeved shirt, dark pants, and light-colored slippers, presenting a casual appearance. However, the combination of the aggressive rap melody, the man's facial expressions, and his body language conveys a sense of helplessness – that he can't even get proper rest, even on a Sunday."
+ },
+ {
+ "video_id": "PI-b1814560-0c19-4aff-88ca-fff5748c00f7",
+ "path": "/workspace/benchmark/video/PI-b1814560-0c19-4aff-88ca-fff5748c00f7.mp4",
+ "answer": "The video unfolds on a sunny day in the parking lot of an apartment building, capturing a tense and informal exchange between individuals. A man in a light grey hoodie, black trousers, and a black hat stands beside a car, with the camera positioned inside the vehicle, filming through an open window. A hand holding a $100 bill extends towards the man, who looks at the money and then at the people inside the car. The audio reveals multiple male voices, with one individual persistently seeking brief physical contact in exchange for the money, while another responds with annoyance and uses informal, abrasive language. The man outside eventually takes the money, checks the banknotes, and expresses surprise or disbelief at the softness of the person's hand. The video concludes with the man still standing in the parking lot, looking at the people in the car, encapsulating a scene where emotions fluctuate between mild insistence and sharp discontent."
+ },
+ {
+ "video_id": "PI-7afdd81c-3a6f-4075-b917-075ff56d8feb",
+ "path": "/workspace/benchmark/video/PI-7afdd81c-3a6f-4075-b917-075ff56d8feb.mp4",
+ "answer": "The video begins with a close-up of a person assembling a black rose by hand on a wooden table. The person carefully attaches the petals to the stem, ensuring they are firmly fixed in place. In the background, a vase filled with already assembled black roses creates a cohesive, elegant scene. The person continues adjusting the petals until they are perfectly arranged. The scene transitions to a close-up of a completed black rose held in hand, showcasing its intricate details and smooth finish. The person then picks up a small green component, seemingly part of a rose stem or leaf, and begins assembling it. The focus shifts to the person connecting the green component to the bottom of the black rose, making sure it is correctly aligned and functional. The person then fits a black device onto the rose, which might be used to adjust the petals. The fully assembled black rose is lifted up, highlighting its fine craftsmanship. The background is a blurred room with candles and books, adding a warm, inviting atmosphere. The scene transitions to a vase of black roses on a wooden table against a black-textured background; the uniform arrangement of the roses emphasizes their elegance and the meticulous craftsmanship. The video ends with a close-up of the vase of black roses, highlighting their uniformity and the careful attention paid to their arrangement. The consistent background and precise positioning of the roses throughout the video underscore the meticulous craftsmanship and elegant presentation."
+ },
+ {
+ "video_id": "PI-3d628d55-408e-40f2-bf74-e326739d96e0",
+ "path": "/workspace/benchmark/video/PI-3d628d55-408e-40f2-bf74-e326739d96e0.mp4",
+ "answer": "The video features a person sitting at a table in a modern kitchen with white cabinets and hanging pendant lights, creating a bright and inviting atmosphere. The person is holding a colorful phone case adorned with pink, orange, and blue flowers in their left hand and a calm tabby cat with black and gray stripes in their right arm. The cat appears relaxed, looking around the room. The text overlay at the bottom of the frames reads, 'pov: your cats obsessed w the greatest sonic edit,' suggesting a humorous or relatable scenario where the cat is fixated on a particular piece of audio or video content. The background shows a well-lit kitchen with various appliances and a clean, organized countertop, adding to the everyday life setting. The audio in the background is a music track with a humorous and playful tone, featuring a male voice with a casual and light-hearted delivery. The content of the audio appears to be a nonsensical or comedic phrase, enhancing the light-hearted and amusing atmosphere of the video. The video captures a moment of everyday life with a touch of humor, highlighting the bond between the person and their pet cat."
+ },
+ {
+ "video_id": "PI-8a211352-249e-420a-8cec-ea623bf37f0e",
+ "path": "/workspace/benchmark/video/PI-8a211352-249e-420a-8cec-ea623bf37f0e.mp4",
+ "answer": "This video captures a heartwarming moment at a formal event, such as a party or an award ceremony. The setting is elegant: a round table covered with a white tablecloth, soft lighting, and pink curtains in the background. The atmosphere is lively, with other participants chatting and contributing to the festive mood. The focus of the video is the affectionate interaction between two women. The woman on the left wears a strapless golden dress with an intricately detailed corset and a large bow at the back; her hair is tied up in a high ponytail and she wears a pearl necklace. The woman on the right wears a shiny silver dress with a distinctive cut-out design at the back, along with long white gloves and elegant earrings. The two women embrace warmly. The woman in the silver dress leans closer, speaking softly. They share smiles and laughter, hold each other's arms, and continue their intimate exchange. The background remains bustling with other attendees, adding to the vibrant atmosphere of the event. The core theme of the video is the sincere affection and friendship between the two women, highlighting moments of connection and joy at the grand event. There is no on-screen text or specific audio dialogue; the viewer's attention stays on their interaction and the elegance around them."
+ },
+ {
+ "video_id": "PI-f7a48163-384c-4a43-848d-89505902d366",
+ "path": "/workspace/benchmark/video/PI-f7a48163-384c-4a43-848d-89505902d366.mp4",
+ "answer": "This video captures a powerful stage performance by rappers Quavo and Offset, who are seen honoring their late Migos bandmate, Takeoff. The stage is set with dramatic lighting, featuring a central spotlight and a large projection of a space shuttle in the background. The two artists, dressed in dark attire, perform their song 'Hotel Lobby' with synchronized movements. As the performance progresses, the lighting shifts from cool blue tones to vibrant purple and pink hues, creating a dynamic visual experience. The performance culminates in a poignant tribute: as the song ends, the rappers stand on either side of the shuttle image, pointing upwards. Pyrotechnics ignite, simulating a rocket launch, while smoke fills the stage."
+ },
+ {
+ "video_id": "PI-f421c707-4e20-4e37-8305-bc18837030b3",
+ "path": "/workspace/benchmark/video/PI-f421c707-4e20-4e37-8305-bc18837030b3.mp4",
+ "answer": "This video is a fast-paced, edited montage featuring scenes and characters from what appears to be the horror game \"Poppy Playtime.\" The visuals cycle through dark, industrial environments, including long corridors lit with stark red or blue light, and dilapidated rooms with machinery. Interspersed are shots of various monstrous toy characters, such as a pink furry creature, a turquoise monster with a wide, toothy grin, and a silhouetted figure with a television for a head. The editing is rapid, with quick cuts synchronized to the beat of the music. A key visual element is a child's drawing on a wall that reads \"SAFE HAVEN\" amidst sad-faced cartoon animals. The entire video is set to a dramatic electronic pop song with a female vocalist. "
+ },
+ {
+ "video_id": "PI-92eb5f3a-f12a-4493-b6e8-3ea6e2782d00",
+ "path": "/workspace/benchmark/video/PI-92eb5f3a-f12a-4493-b6e8-3ea6e2782d00.mp4",
+ "answer": "This video shows a heartwarming scene in which a child of about two years old plays a pretend game. The child holds a blue phone and seems to be pretending to speak. Soft lighting suggests an indoor environment, possibly a living room; some household items can be seen, such as a blue bag, though out of focus. The child wears a blue and white shirt. Throughout the video, the child interacts with an adult, whose hand guides and supports the child in grasping the phone, highlighting a playful, nurturing moment between child and adult. The child's expressions and actions show deep immersion in the pretend game, imitating behavior likely inspired by real-life experiences. The audio captures the child's attempts to communicate, with phrases like \"ice\", \"Can you give me a photo?\" and \"Where is your mother?\", reflecting the child's curiosity and eagerness to take part in conversation. The adult's responses are inaudible, but their presence is felt through supportive hands and occasional verbal cues. A key element in the video is the text overlay, which reads: \"When your two-year-old child wants to order pizza by himself, you let him.\" The video captures the imagination of childhood and the joy of learning through play, emphasizing the child's growing curiosity and the adult's role in nurturing it in a supportive, playful way. The overall theme is warmth, learning, and the simple joy of parent-child interaction."
+ },
+ {
+ "video_id": "PI-5827f100-6975-427e-b69a-1cfc8cafca4a",
+ "path": "/workspace/benchmark/video/PI-5827f100-6975-427e-b69a-1cfc8cafca4a.mp4",
+ "answer": "This video, set against a city environment at night, presents a striking scene: a person in bright red, shiny clothing squatting beside a pile of garbage. The background includes two large garbage bins filled with all kinds of refuse, such as plastic bags, cardboard, and other debris. The person, covered head to toe in red sequins, begins to move gracefully, adjusting their position and gesturing with their hands. The sequins catch the light, creating dazzling effects that contrast sharply with the dark, rough surroundings. When the person stands up, she reveals a more elaborate costume, including a flowing red cape. A change in camera angle provides a close-up of her upper body and face, highlighting the intricate details of the sequined clothing and the large decorative earrings. At the start of the interaction, a voice asks, \"Sorry, can you speak English?\" The person in red responds, and the voice continues, \"OK, I'm a photographer in New York City. I just think you're very beautiful. If possible, I'd like to take pictures of you.\" She appears to agree, and the photographer continues capturing images, focusing on the intricate details of the clothing and the sharp contrast between the vivid red and the dark city backdrop. The video emphasizes the beauty and artistry of the costume against the less glamorous side of the city, creating a powerful visual effect. Visible text on a trash bin in the background, including the phone number \"866-276-9900\" and the word \"ACTION\", adds to the authenticity of the urban setting."
+ },
+ {
+ "video_id": "PI-9ebaec6f-5716-4577-9314-3c86793fb7ca",
+ "path": "/workspace/benchmark/video/PI-9ebaec6f-5716-4577-9314-3c86793fb7ca.mp4",
+ "answer": "This video captures two individuals, a female and a male, performing synchronized dance movements in an indoor setting. They are positioned against a plain white wall with a wooden floor, suggesting a residential space. An on-screen text overlay provides context, stating \"Pov: relationships where the girl has strict parents and the boy has chill parents >>>\". The female, wearing a black long-sleeved top and blue jeans, executes fluid dance steps, occasionally swinging her long hair. The male, dressed in a dark grey t-shirt and light grey sweatpants, mirrors her energy with his own set of movements. Their coordination and expressions suggest an engaging and playful interaction, aligned with the lighthearted tone of the background disco music. "
+ },
+ {
+ "video_id": "PI-8826920d-a3d5-4564-b10c-5fe5f88baa11",
+ "path": "/workspace/benchmark/video/PI-8826920d-a3d5-4564-b10c-5fe5f88baa11.mp4",
+ "answer": "The video captures a distressing moment during a live stream where a young male, dressed in a black Nike long-sleeve shirt and colorful shorts, is seen in a room with a white door, gray walls, and a black gaming chair. Initially, he moves around the room in obvious discomfort before sitting on the floor next to the gaming chair, holding his knee and expressing pain. The text overlay, \"Rakai Tears His Acl Live On Stream And Calls For His Mom...\" with crying and sad face emojis, sets the context for his emotional state. As he continues to hold his knee, showing signs of severe discomfort, his mother, wearing a red hoodie with \"BILLS\" written on it, enters the room. She approaches her son, who is now seated in the gaming chair, still clutching his knee. Their conversation reveals the mother's concern and the young male's detailed description of his injury, including the swelling and pain he is experiencing. The audio track from the live stream complements the visual narrative, with the young male tearfully explaining that he tore his ACL during the stream and calling for his mother. The combined audio and video content effectively conveys the young male's physical and emotional distress, highlighting his vulnerability and the immediate need for comfort and support from his mother."
+ },
+ {
+ "video_id": "PI-a5e4f03a-555f-4ab8-91cc-88a015b327d1",
+ "path": "/workspace/benchmark/video/PI-a5e4f03a-555f-4ab8-91cc-88a015b327d1.mp4",
+ "answer": "This video captures a heartwarming moment of compassionate assistance during a mother's journey with her baby. The scene opens in a modern airport terminal, where a woman holds her baby, who's snugly wrapped in a leopard-print blanket. A stroller and luggage nearby hint at their travels. An immediate on-screen caption reads, \"crying because the cutest man saw me struggling with my babies on a 5 hour flight today and decided to help me\". The video then transitions to an airplane cabin. Here, a man is seen holding a toddler (one of the woman's babies) on his lap. Their interaction is gentle, playful, and affectionate; the man smiles at the child, and the child touches his face, indicating a positive connection. This act highlights the stranger's kindness, making a potentially difficult journey more manageable. The entire video is set to a calm and upbeat song, enhancing its emotional impact. The video concludes with an overlaid text, \"TikTok do your job and help me find this man!!!,\" a plea to locate the helpful individual."
+ },
+ {
+ "video_id": "PI-d05909d0-278e-4ec8-8b81-0d4004006917",
+ "path": "/workspace/benchmark/video/PI-d05909d0-278e-4ec8-8b81-0d4004006917.mp4",
+ "answer": "In an amalgamation of personal and political discourse, a male speaker's emotions oscillate between neutrality, uncertainty, and joy as he discusses various topics during what appears to be a live address. The video, filmed in Washington, D.C. on January 20, 2025, and subtitled \"President Trump shouts out 'very tall son' Barron NBC NEWS Washington, D.C. Jan. 20, 2025\", captures the speaker, a white-haired man in a black suit, standing at a podium adorned with the \"SEAL OF THE PRESIDENT OF THE UNITED STATES\", introducing his tall son Barron. The speaker reflects on his success in winning youth votes by a wide margin and shares insights gained from his son, including advice on engaging with popular figures such as Joe Rogan. The atmosphere in the room is upbeat, and the speaker expresses satisfaction and positivity about the interactions, describing them as both successful and enjoyable. The video then cuts to a young brunette man, presumably Barron, rising from his seat and confidently waving to the audience, who applaud. As the speaker continues his speech, Barron gets up and waves again, the crowd stands in support, and the subtitle \"Barron!\" appears at the bottom of the screen, emphasizing the confident and determined mood of the event. The reflective and upbeat tone of the audio is mirrored in the video, which depicts a dynamic and engaging interaction between the speaker and the audience, with personal anecdotes and political reflections seamlessly intertwined."
+ },
+ {
+ "video_id": "PI-dc06826c-dd7b-421a-adfd-e165bf3e6387",
+ "path": "/workspace/benchmark/video/PI-dc06826c-dd7b-421a-adfd-e165bf3e6387.mp4",
+ "answer": "This video provides a demonstration of a classic physics toy, the gyroscope. A person's hands are shown preparing the device by wrapping a string around the central axle. With a quick pull of the string, the inner wheel of the gyroscope is set into rapid rotation. The spinning gyroscope is then carefully balanced on the tip of its stand, where it defies gravity and exhibits precession, slowly wobbling while remaining upright. The person picks it up, tilts it, and moves it around, illustrating the stability and resistance to changes in orientation caused by its angular momentum. The demonstration takes place on a desk with a few decorative items in the background."
+ },
+ {
+ "video_id": "PI-0181b1ad-0751-46ce-853f-2c2e9fa4c3f2",
+ "path": "/workspace/benchmark/video/PI-0181b1ad-0751-46ce-853f-2c2e9fa4c3f2.mp4",
+ "answer": "The video unfolds in a cozy living room with tiled floors, a black leather sofa and a colorful oil painting hanging on the wall. The main characters are a person wearing a black T-shirt with green patterns and green trousers, and a child wearing a light grey sports suit. The video captures the playful, dynamic interaction between the two. At the beginning, the child jumps around the room energetically, while the person in the black T-shirt makes vivid gestures, possibly encouraging or imitating the child's movements. The person raises an arm and makes exaggerated movements, which prompts the child to imitate them enthusiastically. The playful exchange continues: the adult covers their face with their hands and the child follows suit. The two then stand side by side, both raising their arms and jumping. The person covers his face again, and the child imitates the action once more. At the end of the video, the child sits on the floor, appearing to rest or play, while the person in the black T-shirt walks away from the camera. The core theme of the video is the cheerful and charming interaction between the two, highlighting moments of fun and connection in a family environment. Phrases can be heard in the audio, such as \"Hot morning? Did you run wild in the morning?\" and \"But when you get home\", suggesting a playful inquiry about the child's morning activities and the return to the comfort of home."
+ },
+ {
+ "video_id": "PI-947b81d0-3cae-47c5-a1a2-fa230bd4cc03",
+ "path": "/workspace/benchmark/video/PI-947b81d0-3cae-47c5-a1a2-fa230bd4cc03.mp4",
+ "answer": "The video features two women in a modern indoor setting with a wooden floor and a black wall adorned with a thermostat and light switch. The woman in the foreground, with long dark hair and wearing a light pink top, appears to be speaking, her expressions and lip movements indicating she is expressing frustration. The text overlay on the video reads, 'ash’s annoyed that the guys aren’t paying attention to her,' suggesting that she feels ignored by the men in the room. In the background, a blonde woman in a black sleeveless top stands near the wall, adding to the interpersonal dynamics of the scene. Meanwhile, the audio features a calm and reassuring female voice mentioning that someone is in the room and has been checked on, referencing 'Type C.' A male voice briefly responds in Korean, while another male voice in English mentions 'Type C here,' and a fourth male voice affirms with a simple 'Yeah.' The overall atmosphere is neutral and composed, focusing on a discussion or confirmation about someone's status or presence in the room, which contrasts with the visible frustration of the woman in the foreground. The video effectively captures the tension between Ash's desire for attention and the subdued, matter-of-fact tone of the audio conversation."
+ },
+ {
+ "video_id": "PI-1be34bb7-b78a-4066-93b2-a2fdee79015b",
+ "path": "/workspace/benchmark/video/PI-1be34bb7-b78a-4066-93b2-a2fdee79015b.mp4",
+ "answer": "This video takes place in a simple indoor environment. There is no text in the video, only a strongly rhythmic background music track. The shot is a close-up of a pair of hands. The person inserts a blue spring held in the right hand into a slot in the piece held in the left hand and presses it several times. Then a small black stick is embedded in the spring. After assembly is complete, the spring is pressed again. The video mainly records the process of assembling a slotted piece, a spring and a small stick into a small toy. The rhythm of pressing the spring is in sync with the rhythm of the background music, highlighting the relaxed and pleasant atmosphere of the video."
+ },
+ {
+ "video_id": "PI-ad5b0b5d-4eaa-4cb9-8741-26e4f80fc654",
+ "path": "/workspace/benchmark/video/PI-ad5b0b5d-4eaa-4cb9-8741-26e4f80fc654.mp4",
+ "answer": "The video presents a cozy and personal scene where a person, dressed in a black 'One Piece' t-shirt and holding a pink blanket, sits on a bed. Behind them, a shelf filled with stuffed animals, including a pink bunny and a yellow character, adds a playful touch to the setting. The individual engages with the camera, sharing a story in an excited tone about her dad coming home and playfully suggesting to act like her gay best friend, as indicated by the audio. She uses expressive hand gestures, at one point raising her hands to form circles with her fingers. The scene then transitions smoothly to her applying makeup, where she carefully uses a small brush near her eye and blends the product with a pink sponge. Throughout the video, the background remains consistent, featuring the stuffed animals and string lights, creating an intimate atmosphere focused on her activities and expressions."
+ },
+ {
+ "video_id": "PI-7d88790e-53fc-4919-93ef-bfcc441d1777",
+ "path": "/workspace/benchmark/video/PI-7d88790e-53fc-4919-93ef-bfcc441d1777.mp4",
+ "answer": "The video features a young girl wearing a black hat and a black shirt with white text, enthusiastically raising her arms in a vibrant setting illuminated by blue lighting. The text 'THIS' appears on the screen with a large white arrow pointing to her, followed by 'SPIT SOME BARS!' and a fire emoji, indicating she is about to rap or speak rhythmically. The scene transitions to a group of people on stage with 'WILDSTYLE' in the background. A man in a red sweatshirt speaks into a microphone, accompanied by the text 'THINK' and a brain emoji. The focus returns to the girl, now holding a microphone, with 'ALRIGHT' and an angry emoji appearing on the screen. The text 'LETS' with two people emojis suggests a collaborative effort as she prepares to speak or rap, with 'IT' and a winking emoji appearing. The girl and the man in the red sweatshirt stand on stage, with the text 'YA'LL' followed by 'OUTFIT!' and emojis of shirts, ties, and jeans, indicating a comment on her outfit. The man continues speaking, and 'I'VE' appears on the screen. The focus shifts back to the girl as she speaks into the microphone, with 'WORK FOR' appearing. The background music is lively and energetic, complementing the upbeat mood of the interaction, as the young girl delivers her rhythmic expression and the cheerful male expresses enthusiasm and approval."
+ },
+ {
+ "video_id": "PI-2401ea46-ce20-4088-9d64-be17ef6210f0",
+ "path": "/workspace/benchmark/video/PI-2401ea46-ce20-4088-9d64-be17ef6210f0.mp4",
+ "answer": "The audio features a male speaker offering a thoughtful and humorous observation about the logistics and risks involved in transporting valuable large goods, specifically noting that whenever he sees a towed trailer carrying a jet aircraft, he imagines the extremely high insurance premiums they must have in case of an accident. The overall tone is casual and reflective, showcasing the speaker's curiosity and interest in the subject. The video complements this by depicting a scene of a towing trailer transporting a large fighter jet on the highway, with the jet safely installed on a special trailer against a backdrop of a clear sky and green plants by the roadside. The camera then switches to a close-up of a person wearing a grey T-shirt with a logo on the chest and a tattoo on the arm, who, although not audible, is likely discussing the logistics and risks associated with transporting such valuable goods. The video further highlights the complex details and markings on the plane through close-ups of the cockpit and tail. Finally, the person is shown again, speaking in front of a large truck with the words \"Just Too Cool\" and \"Polar International\" written on it, along with a polar bear logo, possibly elaborating on the challenges and costs of transporting large items. Together, the audio and video provide a comprehensive look at the intriguing and complex world of transporting high-value, large-scale goods."
+ },
+ {
+ "video_id": "PI-535127fc-6a71-48e6-92a9-724df87ff47e",
+ "path": "/workspace/benchmark/video/PI-535127fc-6a71-48e6-92a9-724df87ff47e.mp4",
+ "answer": "In the video, a female speaker adopts a neutral to slightly contemplative tone and shares her observations and reflections on the pain associated with touching a cactus. Standing on a residential street with parked cars and houses in the background, she is dressed in black and examines a tall cactus with sharp thorns. She points out the deadly appearance of the cactus and expresses her curiosity about its tactile qualities. She gently touches the cactus, pondering whether it really hurts as much as it seems, or whether the pain is exaggerated, as in the movies. She concludes that while it may cause some discomfort if a spine penetrates deeply, it is not too serious or severe. The sunny setting enhances the calm, introspective mood of the scene, adding a casual, exploratory atmosphere as she reflects on her observations."
+ },
+ {
+ "video_id": "PI-b5b1ffe6-698f-4c5f-b44f-ef1186f961c5",
+ "path": "/workspace/benchmark/video/PI-b5b1ffe6-698f-4c5f-b44f-ef1186f961c5.mp4",
+ "answer": "The video presents a detailed close-up of a mop on a tiled floor, emphasizing its design with a blue circular base and white mop fibers, connected to a slender metal handle with blue grips. The camera remains fixed on the mop, which starts stationary but gradually moves, allowing the fibers to brush against the clean tiles. The word \"Mop.\" appears in white text at the bottom right, clearly identifying the object. In the background, a section of the tiled floor and a portion of a wooden surface are visible in the top left corner, with bright lighting highlighting the scene's cleanliness. Meanwhile, the audio features a male speaker with an ambiguous emotional tone, accompanied by ambient music that maintains a steady tempo and neutral atmosphere. Although the specific content of the audio is unclear, it seems to involve a discussion or presentation, but the lack of explicit details makes the emotional context uncertain. The overall composition focuses on the mop's interaction with the floor, creating a simple yet informative visual narrative."
+ },
+ {
+ "video_id": "PI-2fdea73b-8d2a-4323-a190-12c9f659fa1c",
+ "path": "/workspace/benchmark/video/PI-2fdea73b-8d2a-4323-a190-12c9f659fa1c.mp4",
+ "answer": "In the video, a person shows her curly hair care routine using Cake Beauty products. At the beginning, she sits in a room with a simple background, wearing a large white T-shirt and showing off her wet curly hair. The text \"Curly Hair all night....\" appears on the screen, setting the theme of the video. She explains that ten years ago she would have thought it crazy to wear her hair curly for a night out in Miami because of frizz problems. To solve this, she uses Cake Beauty products, especially the \"Curled Whip Curled Mousse Souffle\". She scoops it from a pink bottle into her hands and applies it to her hair to ensure even distribution. The text \"BOTH products are in myBIO in 'CURLY HAIR ROUTINE'\" appears, guiding viewers to her bio for more information. She continues to style her hair, scrunching it and adjusting the curls to make them more defined. The scene transitions to her fully styled look, with fluffy, well-defined curls, still in the white T-shirt. The video then shows the back of her fully set curly hair, highlighting the volume and outline of the curls. She is now wearing a black strapless top, showing her hair from different angles. Text appears on the screen marking time intervals: \"30 minutes later\", \"1 hour\", \"2 hours\" and \"3 hours\". Her hair remains fluffy, curly and shiny throughout, and she mentions going out at night in a floor-length dress. The video ends on a confident note, affirming the effectiveness of the Cake Beauty products in maintaining curly hair throughout the night."
+ },
+ {
+ "video_id": "PI-515147bc-f3df-45b7-a25c-525baebf593c",
+ "path": "/workspace/benchmark/video/PI-515147bc-f3df-45b7-a25c-525baebf593c.mp4",
+ "answer": "This video takes place in a fast food restaurant and focuses on the interaction between a customer and a counter staff member. The employee wears a light blue uniform shirt and stands behind the counter, which is equipped with a digital payment terminal and an orange menu stand. In the background is a menu board showing various kinds of food, along with a \"Pick up food\" sign indicating where customers can collect their orders. The text \"TikTok does your thing\" is overlaid on the video, indicating that this scene might be part of a TikTok trend or challenge. The customer's arm is visible in the foreground; he appears to be interacting with the staff member, possibly in a playful or humorous exchange. The atmosphere is casual and relaxed, and the employee maintains a neutral expression while handling the transaction. The video captures a moment of daily life in a fast food environment, with hints of social media influence. A voice-over in the audio says, \"Everyone knows. No, I'll let them do such a thing. That's enough,\" suggesting that this interaction might be part of a larger unspoken backstory or inside joke. Overall, the video combines the mundane setting of a fast food restaurant with the fun, viral nature of TikTok, creating an engaging and relatable scene."
+ },
+ {
+ "video_id": "PI-47f978a5-2342-4cc6-a95f-52a04b6a1e5a",
+ "path": "/workspace/benchmark/video/PI-47f978a5-2342-4cc6-a95f-52a04b6a1e5a.mp4",
+ "answer": "The video begins with a close-up of a burning frying pan by a swimming pool. The pan appears to be filled with burnt food; it is tilted and flames are visible, indicating a cooking mishap. The background features a stone-paved surface and the clear blue pool, forming a serene, sharp contrast with the chaos in the pan. The text above reads: \"When you try to surprise your boyfriend with the meals you cook yourself\", humorously setting up this cooking disaster. The camera then switches to a clearer view of the swimming pool, with the pan still in frame, emphasizing the outdoor environment. As the pan is moved away from the poolside, smoke rises from it, highlighting the consequences of the cooking incident. The scene transitions to a person in white holding the pan, with green plants and part of a modern building in the background. The person seems to be reacting to the situation, perhaps with surprise or frustration. The video ends with a close-up of the pan, now empty but still showing signs of burning, intensifying the humorous failure of this home-cooked meal. Against the backdrop of a picturesque outdoor setting, the overall narrative captures the relatable, comedic moments of a cooking mistake."
+ },
+ {
+ "video_id": "PI-cdc566f8-26bd-40d5-be6d-17ae422b4b02",
+ "path": "/workspace/benchmark/video/PI-cdc566f8-26bd-40d5-be6d-17ae422b4b02.mp4",
+ "answer": "The video takes place in a fast-food restaurant, where the main focus is on a person dressed in a vibrant red and yellow bee costume. This individual is seen moving energetically on the tiled floor, performing a series of playful dance moves and gestures. The background features several restaurant staff members wearing red uniforms and caps, standing near a service counter with a display case showcasing various food items. A sign on the counter reads \"CLAIM HERE,\" indicating a service point for customers. The audio in the video includes a song with lyrics such as \"Shoddy bad and she's fine fucking with me, let me change your life I know you could see the demons all in your eyes,\" adding a lively and somewhat mischievous tone to the scene. The overall atmosphere is light-hearted and entertaining, with the person in the bee costume being the center of attention, bringing a sense of fun and whimsy to the otherwise ordinary setting of a fast-food restaurant. The core theme of the video is the joyful and spontaneous performance by the individual in the bee costume, which brightens the environment and engages both the staff and potential viewers."
+ },
+ {
+ "video_id": "PI-7032bc8a-704c-49c2-a89b-bee98fa1ee32",
+ "path": "/workspace/benchmark/video/PI-7032bc8a-704c-49c2-a89b-bee98fa1ee32.mp4",
+ "answer": "The beginning of the video is a close-up shot of a person wearing glasses and leaning against a plain light-colored wall. The scene then switches to a kitchen, where the person is wearing a blue shirt. The kitchen features white cabinets and various items on the countertop, including a plant and some containers. The person appears to be speaking, as seen from her mouth movements and slight head movements. The speech in the audio goes: \"Guys, I'm worried that my new glasses make me look like I'm going to walk into the place where my stepson is naughty. Please tell me what you think, because I'm actually worried that they might be a little... It's as if they didn't have frames at all, so I have no idea how they could change my appearance to such a large extent. Is this my face without glasses?\" This indicates that the person is worried that her new glasses, lacking a traditional frame, will change her appearance. The video continues in the same kitchen setting. The person, still wearing the blue \"bear\" shirt, further discusses how she looks with and without glasses. The background remains consistent, showcasing white cabinets and a ceiling with recessed lighting. She uses gestures to emphasize her points and seeks feedback from the audience."
+ },
+ {
+ "video_id": "PI-4eed0daa-896e-44e3-bd81-b4ba2fa5bad6",
+ "path": "/workspace/benchmark/video/PI-4eed0daa-896e-44e3-bd81-b4ba2fa5bad6.mp4",
+ "answer": "The audio features a lively discussion among three individuals about a couple's outward appearance and demeanor, where a male speaker expresses frustration, using strong language to suggest the woman appeared aggressive or confrontational, while another male speaker comments on the man's hesitant behavior, implying he was about to speak. A female speaker, however, offers a neutral perspective, describing the couple as seemingly normal and typical on the surface. The video complements this narrative by beginning with a scene from a television show or movie displayed on a TV screen, showing a couple in formal attire, with text reading, \"Reid: THE COUPLE SEEMED LIKE A NORMAL, TYPICAL FAMILY,\" against a dark, indoor backdrop. The scene then transitions to a close-up, emphasizing the idea that their appearance may be deceiving with the text changing to \"ON THE OUTSIDE.\" The video then cuts to a man, identified as comedian Rob Campbell through his black beanie and t-shirt displaying his name and social media icons, sitting in a room with a casual, lived-in feel, reacting to the previous scene with facial expressions and body language that suggest a humorous or skeptical commentary on the couple's portrayal. The room's soft lighting and residential setting, complete with a staircase, doorway, shelf with folded clothes, and a lamp, create a relaxed atmosphere that contrasts with the audio's intense discussion, highlighting the video's focus on the disparity between the couple's seemingly normal appearance and the observer's reaction to it."
+ },
+ {
+ "video_id": "PI-8c564fe8-3ff1-4699-8823-016c9449043b",
+ "path": "/workspace/benchmark/video/PI-8c564fe8-3ff1-4699-8823-016c9449043b.mp4",
+ "answer": "The video presents a compelling blend of personal reflection and academic diligence. It begins with a close-up of a young woman with long hair, dressed in a dark shirt with \"ROMA\" partially visible, set against a backdrop softly illuminated by purple LED strip lights, evoking a warm and intimate ambiance. This visual is juxtaposed with the audio of a male speaker expressing a deep longing and desire for someone to be his, his contemplative and emotional tone resonating with the reflective atmosphere. The scene then shifts to a computer screen showcasing an online physics dashboard, where various lessons such as \"One-Dimensional Motion Problems,\" \"Motion Charts,\" and \"Two-Dimensional Motion Problems\" are displayed, all marked as 100% complete. The dashboard also features sections like \"Multiple Choice Challenge: Motion,\" \"Multiple Choice Challenge: Forces,\" and \"Multiple Choice Challenge: Circular Motion & Gravitation,\" with varying completion rates, highlighting a dedication to academic excellence. The video then returns to the initial close-up of the young woman, maintaining a focus on visual elements without auditory components, creating a seamless integration of personal emotion and scholarly achievement."
+ },
+ {
+ "video_id": "PI-7f4cbc38-d77f-4c36-b0a7-4694dcab886b",
+ "path": "/workspace/benchmark/video/PI-7f4cbc38-d77f-4c36-b0a7-4694dcab886b.mp4",
+ "answer": "This short video captures a prank played on a mother by a child in a dimly lit bedroom. The scene is filmed from just outside the room, through a door crack, giving a partial view. An older woman wearing glasses is lying in bed, her attention focused on something out of frame, likely a television. Suddenly, a small, bright red laser dot appears on the wall behind her. The dot moves around erratically before being guided directly onto the woman's face. The moment she notices it, her reaction is swift and dramatic, her face contorting with a mix of surprise and terror. The camera then quickly pulls back, and the video abruptly ends. The camera's shaky movement enhances the realism, confirming the prank's success."
+ },
+ {
+ "video_id": "PI-90022dbd-bab0-4f12-b0f0-9a8fee7afa0b",
+ "path": "/workspace/benchmark/video/PI-90022dbd-bab0-4f12-b0f0-9a8fee7afa0b.mp4",
+ "answer": "The video features a person with a thick, curly brown beard and hair, dressed in a dark gray shirt, set against a simple indoor backdrop of a blue wall and a white door, with a ceiling lamp visible in the upper left corner. The text overlay, which reads 'When a demonic record is played in reverse in a horror movie,' remains consistent throughout the video, suggesting a thematic exploration of horror movie tropes, particularly the idea of revealing hidden messages or invoking supernatural events by playing a demonic record backwards. The individual appears to be speaking, likely elaborating on this concept, although the audio content is unclear and lacks discernible semantic content, making the exact context or purpose of the communication ambiguous. The absence of background music or additional auditory elements further limits the contextual clues. Despite this, the video maintains a consistent visual style, focusing on the speaker and the overlaid text to convey its message, creating an atmosphere that aligns with the eerie and mysterious theme suggested by the text."
+ },
+ {
+ "video_id": "PI-576bae27-2281-48c1-8373-8d5ea7336aa9",
+ "path": "/workspace/benchmark/video/PI-576bae27-2281-48c1-8373-8d5ea7336aa9.mp4",
+ "answer": "The video takes place in a hardware store, where a person wearing a dark blue jacket is seen holding a metallic duct fitting in one hand and a price tag attached to a pair of metal snips in the other. The text \"40$ I think not\" is overlaid on the video, reflecting the person's disbelief or disagreement with the price. They closely examine the duct fitting, turning it around to inspect it from different angles, while the background reveals shelves stocked with various hardware items, such as rolls of tape and other metal fittings. This scene is accompanied by a vocal track singing in a soft and gentle tone, with a slow, melodic tune that has a dreamy quality. The overall atmosphere of the video is one of calm contemplation, as the person continues to hold and inspect the duct fitting, emphasizing their consideration or hesitation regarding the price. The video concludes with the person placing the duct fitting back on the shelf and reaching for another item, suggesting they are exploring other options, all while the gentle and soothing song conveys a sense of introspection."
+ },
+ {
+ "video_id": "PI-4a71c28c-2e45-4be9-bbb5-34c6c1b6eafc",
+ "path": "/workspace/benchmark/video/PI-4a71c28c-2e45-4be9-bbb5-34c6c1b6eafc.mp4",
+ "answer": "The video humorously explores the theme of missed \"red flags\" through a woman's gestures and text overlays, while a reflective, sarcastic tone is set by the audio track. The woman, in a light-colored sweatshirt, holds up a finger with blue nail polish and wobbles it, as if to emphasize a point, while overlaid text poses the question, \"how did you not see all the red flags?\" followed by \"my intuition:\". Superimposed measurements, such as \"4.61 cm\" and \"6.71 cm\", add a quirky, puzzling element, possibly symbolizing the distance by which her intuition fell short of recognizing the warning signs."
+ },
+ {
+ "video_id": "PI-4dccc9d7-8f2e-4fe4-bec7-b4dae6801fdd",
+ "path": "/workspace/benchmark/video/PI-4dccc9d7-8f2e-4fe4-bec7-b4dae6801fdd.mp4",
+ "answer": "The video features a person sitting in a room with a green wall and grey curtains in the background, wearing a black hoodie with blue and red lettering. At the beginning of the video, on-screen text reads \"Reply to greenarts2's comment: Can you sing Reese Young - No More Parties\", indicating that the person is responding to a viewer's request to sing Reese Young's song \"No More Parties\". The request text appears on the screen in yellow. The person brings the microphone close to their mouth, indicating they are about to sing or rap. \"reese young-no more parties\" is shown in yellow, after which the person starts singing or rapping, and the lyric \"I don't need no more parties\" is shown in yellow. As the video continues, the person still holds the microphone, and \"Reese Young-no more parties\" appears in yellow again. Additional green text reads: \"View my original music in my profile!!\", while yellow text reads, \"Just to go back to the past.\" The background is a messy room, with a dressing table and wall decorations covered with all kinds of items. The video appears to be a humorous or casual response to the viewer's request, with the person also promoting their original music."
+ },
+ {
+ "video_id": "PI-05e8ec6d-2260-478e-ac06-528783029b67",
+ "path": "/workspace/benchmark/video/PI-05e8ec6d-2260-478e-ac06-528783029b67.mp4",
+ "answer": "In a delightful blend of audio and video, the NBA All-Star 2024 Media Day in Indianapolis is brought to life with a series of humorous and candid interactions. The audio serves as the soundtrack to these light-hearted and fun exchanges between the players and interviewers. The overarching theme of the audio is centered around showcasing the players' personalities and off-court humor, which is perfectly mirrored in the video. The video unfolds at the media day, with the backdrop consistently displaying 'ALL-STAR 2024 INDY' and 'MEDIA DAY' logos, along with AT&T branding. Players like Anthony Edwards, Nikola Jokić, Trae Young, and Karl-Anthony Towns are seen speaking at podiums with microphones. The interviewer, often holding a microphone, asks quirky questions that range from bathroom habits during games to horse names, favorite sneaker brands, and even hair authenticity. Key OCR text, such as player names on nameplates and humorous comments like '25 horses?', 'White forces or black forces?', and 'fake hair!', adds to the comical tone. The players respond with laughter and candid answers, creating a lively and entertaining atmosphere. The video effectively captures the off-court humor and personalities of the players, highlighting their willingness to engage with the playful questions, while the audio amplifies the fun and light-hearted nature of these interactions, making for an enjoyable viewing and listening experience."
+ },
+ {
+ "video_id": "PI-311d72b2-a4f1-494b-ab13-671ca4888de6",
+ "path": "/workspace/benchmark/video/PI-311d72b2-a4f1-494b-ab13-671ca4888de6.mp4",
+ "answer": "The video takes place in an elevator with metal walls and herringbone wooden flooring, creating a narrow yet somewhat clinical atmosphere. A person wearing a black coat and black leather pants squats in the corner of the elevator, holding their head with one hand and leaning against the wall, looking pained or uncomfortable. The text \"There were only three drunk homosexuals in the elevator\" is overlaid on the video, providing background for the scene and hinting at a humorous or uncomfortable encounter. The camera then switches to another person standing closer to the camera, wearing a dark jacket and a green shirt. This person seems to be speaking or responding to the situation, though their face has been blurred, adding an element of anonymity. A voice in the audio says, \"I don't want to be anyone,\" which might imply not wanting to be part of the situation, or could be a reaction to the three drunk people. The overall mood of the video is a blend of amusement and unease, capturing a candid yet somewhat awkward moment in the elevator."
+ },
+ {
+ "video_id": "PI-55226cd6-cc71-4c84-8676-bce9205e7d3d",
+ "path": "/workspace/benchmark/video/PI-55226cd6-cc71-4c84-8676-bce9205e7d3d.mp4",
+ "answer": "The video features a curly-haired girl wearing a black and white striped shirt adorned with texts and graphics such as \"HOCKEY,\" \"SP,\" \"Bruins,\" and \"CHAMPIONS 2011 MIX XII.\" The setting is a simple indoor environment with wooden doors and white shower curtains in the background. Initially, the screen displays \"The worst answer Ever,\" which later changes to \"The worst response in history.\" Meanwhile, a neutral female voice poses a thought-provoking question about the importance of beauty versus intelligence. In response, a cheerful female voice expresses gratitude for the question and emphasizes that intelligence is more important because beauty is subjective and varies from person to person. She confidently concludes that she possesses both qualities, exuding a sense of happiness and confidence. The video captures a moment of self-reflection and self-expression, highlighting the individual's preference for intelligence while hinting at a more nuanced perspective."
+ },
+ {
+ "video_id": "PI-73bdfcef-afd0-48ee-9555-1ad27247101f",
+ "path": "/workspace/benchmark/video/PI-73bdfcef-afd0-48ee-9555-1ad27247101f.mp4",
+ "answer": "The video features a person with vibrant pink hair, dressed in a dark brown satin top and adorned with multiple gold necklaces, standing in a neutral indoor setting. On the left side of the screen, a vertical list of numbers from 1 to 10 is displayed, each accompanied by a small, indistinct icon. Above the person's head, a large, colored droplet icon appears, changing colors to represent different hair dyes, starting with 'BLONDE' and transitioning through 'MAGENTA,' 'DARK,' 'PASTEL,' 'ORANGE,' 'UNICORN,' 'AUBURN,' 'BLUE,' 'PLATINUM,' and 'RED,' before returning to 'BLONDE.' With each color change, the person's hair transforms accordingly, showcasing a variety of vibrant and pastel shades. The person interacts with their hair, adjusting and touching it to display the new colors, and at one point ties their hair into a bun with evident excitement. The background remains consistent throughout the video, emphasizing the dynamic hair color changes. The background music (BGM) features cheerful and lively melodies, with a female voice singing in the second half, enhancing the charming and vibrant atmosphere of the video."
+ },
+ {
+ "video_id": "PI-99d96a2f-ed82-4776-b913-cb5fbce3a3fe",
+ "path": "/workspace/benchmark/video/PI-99d96a2f-ed82-4776-b913-cb5fbce3a3fe.mp4",
+ "answer": "The video features two individuals sitting closely together at a table in a modern restaurant or cafe setting. The person on the left is wearing a black cap with a white 'LA' logo, glasses, and a dark jacket, while the person on the right sports a yellow cap with black text, a dark jacket, and long, curly hair. They are engaged in a lively and friendly conversation, with the person on the right making expressive hand gestures, creating an atmosphere of camaraderie and enjoyment. Meanwhile, the audio reveals two female speakers exhibiting a happy demeanor, performing synchronized breathing exercises. The main focus is on coordinating breathing patterns, with one instructing others to breathe along. The background music is lively and upbeat, enhancing the positive and encouraging tone of the scene. The combination of visual and auditory elements creates a cohesive and uplifting experience."
+ },
+ {
+ "video_id": "PI-513a97c6-0950-49b2-9348-9a4f6f91eea9",
+ "path": "/workspace/benchmark/video/PI-513a97c6-0950-49b2-9348-9a4f6f91eea9.mp4",
+ "answer": "At the beginning of the video, a person wearing a light blue hat and a black shirt sits among lush green leaves, seemingly in a state of deep thought or rest. Their hands are close to their face, suggesting a moment of reflection or relaxation. The backdrop of vibrant green foliage creates a serene, natural environment. The scene then transitions to a close-up shot of a row of smooth, colorful round stones at the bottom of shallow water. The stones come in a wide variety of tones, including purple, red, green and blue, forming a sharp contrast with the natural landscape. The camera moves slowly over the stones, revealing more of the underwater scenery. The view then expands to the entire scene: a serene lake whose clear, calm surface reflects the surrounding mountains and trees. The audio overlay includes a soothing message, \"Wait for me, be patient\", which enhances the calm, reflective mood of the scene. The mountains in the background are magnificent and imposing, their peaks basking in sunlight, adding a serene and picturesque atmosphere. The video emphasizes the beauty of nature and its tranquil scenes, with the reflections in the water creating a mirror-like effect. A voice can also be heard in the audio: \"I run very far and often bump into the wall.\" This adds a layer of mystery to the serene visuals, suggesting themes of patience, reflection and an unknown future."
+ },
+ {
+ "video_id": "PI-78e08059-57e4-4a52-9984-6ceded404bdb",
+ "path": "/workspace/benchmark/video/PI-78e08059-57e4-4a52-9984-6ceded404bdb.mp4",
+ "answer": "The video shows the detailed process of handling and organizing a series of intricately designed swords. The scene is set in a workshop or storage area, where many swords are laid out on a wooden surface. A person wearing a black jacket, black gloves and white sports shoes carefully inspects and cleans each sword, wiping the blades with a pinkish-white cloth to make sure there is no dust or debris. The hilts and blades bear detailed carvings and patterns, indicating a high level of craftsmanship. The person picks up each sword methodically, examines it carefully, and then puts it back in an organized manner. The background includes various tools and materials, suggesting an ongoing project or inventory management. As the video progresses, the person continues to handle the swords with the same care and precision, occasionally adjusting their positions to ensure they are neatly arranged. Near the end, the person begins to place the swords into a cardboard box, indicating that they may be ready for storage or transportation. The overall atmosphere is focused and meticulous, highlighting the aesthetics and potential historical value of these elaborately crafted weapons. The video emphasizes the careful handling and preservation of these items, demonstrating the person's commitment to maintaining their quality and appearance."
+ },
+ {
+ "video_id": "PI-755b9b16-3205-47ce-a1ca-a3bb73633df6",
+ "path": "/workspace/benchmark/video/PI-755b9b16-3205-47ce-a1ca-a3bb73633df6.mp4",
+ "answer": "The video features a person wearing a blue top and a black headband, holding a bottle of Anastasia Beverly Hills foundation, with long black hair, red nail polish, and gold jewelry. They discuss the darkest shade of the new Anastasia Impeccable Foundation, specifically \"Norvina,\" expressing frustration and disappointment over the lack of inclusivity in the shade range for darker complexions. The speaker, who notes Anastasia Beverly Hills' two-decade history in the beauty industry, compares the shade shown online with the actual product, highlighting discrepancies. They emphasize that this issue is not a mistake or lab error but a deliberate choice by the brand, reinforcing their critique by holding up a black contour stick alongside the foundation bottle. The background includes a plant and shelves with various items, providing a casual setting for the discussion. The speaker reiterates their dissatisfaction with the brand's long-standing approach to inclusivity, underscoring their message about the need for greater inclusivity in the beauty industry."
+ },
+ {
+ "video_id": "PI-688958f8-63e7-43a6-b03b-6b413188a0e0",
+ "path": "/workspace/benchmark/video/PI-688958f8-63e7-43a6-b03b-6b413188a0e0.mp4",
+ "answer": "The video shows a snowy outdoor scene in which a person wearing a black hooded jacket stands amid gently falling snowflakes. The background reveals a winter landscape of bare trees and snow-covered grass. Another person, wearing a maroon sweatshirt with white text and red checked pants, appears in the background and captures the moment with a smartphone. The atmosphere is filled with a sense of urgency and surprise, emphasized by the overlaid text: \"POV, you just realized that your rooster has come out.\" The person in the foreground seems to realize that their rooster, named Danny, has escaped, and reacts accordingly, shouting loudly, \"Josh has made too much snow, Mom. Oh my God, Danny is out. Run, run!\" The video captures the immediate worry and humorous chaos that arise upon discovering the rooster venturing into the snowy environment. The combination of the visual setting and the audio dialogue creates a vivid portrayal of a moment full of surprise and humor."
+ },
+ {
+ "video_id": "PI-c0a39605-7da5-41f4-af10-c30bda966556",
+ "path": "/workspace/benchmark/video/PI-c0a39605-7da5-41f4-af10-c30bda966556.mp4",
+ "answer": "This video shows a golden retriever and its owner having a warm interaction in a cozy indoor environment. At the beginning of the scene, the dog stands beside a small black piano while the owner, wearing a black long-sleeved shirt and chestnut trousers, kneels on the ground and plays the piano keys. The text overlay, \"My dog has learned to play the piano,\" sets a fun and lighthearted tone. The dog looks very curious, sniffing at the keys and pressing them with its paws, showing a strong interest in the instrument. The owner occasionally stops to observe and encourage the dog's exploration. As the video plays, the owner leans against a light grey sofa and continues to watch the interaction between the dog and the piano. The dog remains engaged, pressing the keys with its paws, and the owner's amused reaction highlights the entertainment of the moment. The background features a bookshelf filled with books and a large green plant, adding to the homey atmosphere. In the last part of the video, the owner sits at a table with multiple computer monitors, wearing a green and white checked shirt and grey shorts. The dog continues to explore the piano, pressing the keys with its paws, while the owner watches and smiles, clearly delighted. The video captures the light-hearted, cheerful moments of a dog's curiosity and the bond it shares with its owner, emphasizing themes of shared happiness and learning."
+ },
+ {
+ "video_id": "PI-e377a2f5-c3e7-40d5-aa9a-f9b0a43084b1",
+ "path": "/workspace/benchmark/video/PI-e377a2f5-c3e7-40d5-aa9a-f9b0a43084b1.mp4",
+ "answer": "The video documents a live online auction event hosted by a young man with blonde hair, taking place in what appears to be a spacious warehouse or office environment. The setting includes shelves stocked with various shoe boxes, a festive Christmas tree adorned with lights, and a prominent 'Supreme' sign. The host presents a pair of new Nike Blazer Mid 77 Premium sneakers, characterized by their white base and teal Nike swoosh, as he unboxes them and displays them to the camera. An on-screen overlay provides real-time updates, showing the current bid, the leading bidder's username, and a countdown timer. Initially, the host anticipates a sale price around $150, later adjusting to $130. However, as the auction nears its conclusion, the speaker's demeanor shifts from enthusiasm to evident disbelief and frustration. The bids remain unexpectedly low, culminating in a final sale price of $29. "
+ },
+ {
+ "video_id": "PI-c4f0d04a-5652-4944-a463-67e4355aee66",
+ "path": "/workspace/benchmark/video/PI-c4f0d04a-5652-4944-a463-67e4355aee66.mp4",
+ "answer": "The video features a person in an indoor setting that resembles an office or utility room, energetically moving their arms and body in various directions, suggesting they are dancing or engaging in expressive movements. The individual is dressed in a light-colored sweatshirt and dark pants. The background includes a drop tile ceiling with a large ventilation unit, a water cooler with a blue bottle to the left, and a black office chair nearby. On the right, electrical panels and utility boxes are mounted on the wall, with visible wiring and a bag of what appears to be salt on the floor. The room is well-lit with natural light filtering through blinds-covered windows. Throughout the video, the person continues their dynamic movements, occasionally shifting angles and stances. Over this scene, the background music features female vocals conveying emotional and heartfelt sentiments, with lyrics like \"He was crying on my shoulder, all I could do is...\" and \"Only made us closer until July. I know that you love...\" The song evokes themes of emotional closeness and support during difficult times, adding a layer of depth to the visual content."
+ },
+ {
+ "video_id": "PI-bf3a575e-6b58-4ec5-aae2-2dbcd2674e95",
+ "path": "/workspace/benchmark/video/PI-bf3a575e-6b58-4ec5-aae2-2dbcd2674e95.mp4",
+ "answer": "In the video, a young man sits in a car, with sunlight shining through the window and illuminating the interior. He wears a dark hoodie over a brightly patterned T-shirt, with a seat belt fastened across his chest. Throughout the video, he makes various gestures, such as pointing with his index finger, forming a phone shape with his thumb and little finger, and making a peace sign, suggesting he is in a state of contemplation or communication. The audio captures a conversation in which the young man answers a question posed by someone off-camera: \"Bro, give me five reasons why you don't like Trump.\" The young man responds by citing controversial remarks made by Donald Trump, especially those involving his daughter, elaborating in detail on the reasons he dislikes Trump. As he discusses the topic, his gestures and expressions convey a sense of deliberation and criticism. The core theme of the video revolves around the young man's critical views on Donald Trump's controversial remarks, especially those involving his daughter. His vivid gestures and thoughtful expressions highlight his dissatisfaction, as well as the broader social issues surrounding such remarks."
+ },
+ {
+ "video_id": "PI-d22c4917-fec1-40b7-9a9e-70e8f6504277",
+ "path": "/workspace/benchmark/video/PI-d22c4917-fec1-40b7-9a9e-70e8f6504277.mp4",
+ "answer": "The video offers a compelling visual comparison of the passenger capacities of three major ocean liners: the Titanic, the Wilhelm Gustloff, and a third unidentified ship. It begins with a large crowd of people gathered in front of the Titanic, accompanied by the text \"Titanic 420\" (indicating 2,420 passengers), which then increases progressively to 1,148, 1,449, and finally 1,517, illustrating the ship's capacity. The scene then transitions to the Wilhelm Gustloff, where another crowd is shown with numbers starting at 1,067 and escalating to 9,343, emphasizing its larger capacity. The video wraps up with a side-by-side comparison of the Titanic and Wilhelm Gustloff, clearly highlighting the difference in their passenger capacities. Meanwhile, the audio features a male speaker with an unknown emotional tone briefly saying \"Come\" in Chinese, suggesting an individual calling or summoning someone, though the emotional context remains unclear, accompanied by ambient music of an indeterminate tempo and atmosphere."
+ },
+ {
+ "video_id": "PI-406ff74a-1033-465e-86cf-23f64b125d7e",
+ "path": "/workspace/benchmark/video/PI-406ff74a-1033-465e-86cf-23f64b125d7e.mp4",
+ "answer": "The video begins with a person in a black jacket standing in front of a white door, gesturing as if to introduce a theme. The text 'some dads take a beautiful car and hide them away from the rest of the world 💔' appears on the screen, setting a tone of seclusion and privacy. The scene transitions to a close-up of a hand pressing a garage door opener, revealing a garage door slowly lifting to unveil a car covered with a black car cover. The camera moves closer, offering a detailed view of the covered vehicle and the garage's interior, which is equipped with various tools and equipment. The focus remains on the car, emphasizing its hidden state under the cover. The video continues to highlight the car, with the same text overlay persisting. The camera captures different angles of the covered car, eventually revealing red taillights and a portion of the white body. A hand with dark nail polish lifts the edge of the car cover, hinting at the car's identity. The camera pans to show more of the car's rear, including dual exhaust pipes and a spoiler, before concluding with a close-up of the car's side, displaying the wheel and part of the body, still partially covered. The garage's interior remains visible in the background, reinforcing the theme of seclusion. Overlaying this visual narrative is a musical clip with the lyrics, 'Some fathers will look around and hide away from the outside world,' which complements the video's message of isolation and the desire to keep something special hidden from others."
+ },
+ {
+ "video_id": "PI-6c322926-5f5b-4bde-8a8d-e1de69ea46ec",
+ "path": "/workspace/benchmark/video/PI-6c322926-5f5b-4bde-8a8d-e1de69ea46ec.mp4",
+ "answer": "The video features four women sitting closely together in a well-lit, modern room with a large spherical woven lamp hanging from the ceiling, all wearing matching black shirts with pink characters. The text 'Bop house Who's Most Likely' is displayed at the top of the screen, indicating a playful game of 'Who's most likely' among friends. The women engage in a lively discussion, pointing at each other and making gestures as they speculate on various scenarios such as who is most likely to become an alcoholic, be a stay-at-home parent, become wealthy, never marry, hook up within the group, experience multiple marriages and divorces, and other humorous or extreme situations. The atmosphere is casual and fun, with the women laughing and enjoying the game. The background remains consistent with light-colored walls and partially visible blinds on the windows, capturing the essence of friendly and humorous interaction among the group. Meanwhile, the audio features a male speaker with a neutral tone, engaging in the same light-hearted and humorous game, adding to the playful and casual atmosphere of the video."
+ },
+ {
+ "video_id": "PI-20625f75-67e9-4b6f-bdf6-8a5cc423b5ef",
+ "path": "/workspace/benchmark/video/PI-20625f75-67e9-4b6f-bdf6-8a5cc423b5ef.mp4",
+ "answer": "The audio features a female voice delivering a sarcastic tour of her fire-damaged property, using dark humor to highlight the destruction and ultimately ending with a resigned farewell. The video complements this narrative by depicting the woman standing amidst the ruins of her house, dressed in a light blue hoodie, black pants, and flip-flops, with a black backpack slung over her shoulders. The desolate atmosphere is accentuated by charred debris, burnt trees, and a smoky sky. As she walks through the ruins, she gestures towards the destruction, pointing out various remnants of the house, such as a burnt structure and a large, overturned metal box. The scene transitions to her standing near a large, round concrete fire pit filled with ashes and charred wood, further emphasizing the aftermath of the fire. Despite the grim setting, she manages to find a moment of lightness by picking an orange from a tree and examining it closely. The video continues to follow her as she kneels next to a large, round concrete table in the middle of the fire-damaged area, holding the orange and seemingly explaining or pointing out something about the table. Throughout the video, the woman's demeanor remains a mix of resilience and resignation, as she navigates through the debris and interacts with the remnants of her former home, all while the audio provides a sarcastic commentary that adds depth to the visual narrative."
+ },
+ {
+ "video_id": "PI-f020452b-dbe8-4f13-84ac-93951c512d05",
+ "path": "/workspace/benchmark/video/PI-f020452b-dbe8-4f13-84ac-93951c512d05.mp4",
+ "answer": "In the audio, a male speaker, whose emotional tone is somewhat ambiguous, reflects on a memory of seeing a statue for the first time, perhaps indicating a moment of personal revelation or nostalgia. The ambient music accompanying this reflection maintains a steady tempo, creating a neutral yet contemplative atmosphere that complements the speaker's introspective mood. Meanwhile, the video presents a peaceful and intimate scene of two individuals sharing a tender moment in bed. The dim background lighting enhances the soft and private ambiance, as a man lies on his side, gently wrapping his arm around a woman and stroking her head. The woman, partially covered by a blanket, sleeps quietly, embodying a sense of tranquility and security. The overall mood of the video is affectionate and tender, suggesting a deep emotional connection between the two individuals. The text \"gentle love\" appears on the screen, reinforcing the theme of tenderness and care that permeates their relationship, and adding a layer of emotional depth to the scene."
+ },
+ {
+ "video_id": "PI-40bb059b-e365-40ea-a5ec-8b535b0e1857",
+ "path": "/workspace/benchmark/video/PI-40bb059b-e365-40ea-a5ec-8b535b0e1857.mp4",
+ "answer": "In the video, a person sits on a white sofa with a fluffy textured pillow behind them, creating a comfortable, inviting atmosphere. The person wears a dark hoodie and is in a well-lit modern room. The background includes a kitchen area with wooden stools and various items on the counter, as well as a large green plant and a lamp with a white lampshade, adding a homey feel. The person speaks directly into the camera, using gestures to emphasize their points, indicating that they are conveying a sincere or important message. At one point, they place their hands on their chest, perhaps to express sincerity or emotion. The audio states: \"This application will be banned in less than 48 hours, so I have a request. Please do me a favor. My current follower count is 39.7 million. I do want to stay realistic. I'm only 300,000 away from 40 million. I hope to reach that milestone before we have to say goodbye, okay?\" This indicates that, before the application is banned, the person is urgently asking fans for support to reach 40 million followers, expressing a sense of urgency and vulnerability. The theme of the video revolves around a request for support and a personal appeal to achieve an important milestone before the possible loss of the platform."
+ },
+ {
+ "video_id": "PI-711145f2-ae4a-492e-a77c-2adc27817e90",
+ "path": "/workspace/benchmark/video/PI-711145f2-ae4a-492e-a77c-2adc27817e90.mp4",
+ "answer": "There are two people in the video, Casey and another person, sitting next to each other in a dimly lit environment, possibly a restaurant or bar. The person on the right wears a denim jacket over a black shirt and a necklace, while Casey wears a dark top with white stars on a blue background. The background is a decorative pattern, possibly a wall or partition, and the warm lighting creates a comfortable atmosphere. The person on the right is speaking, gesturing with their hand and pointing at something or someone off-camera. The audio goes roughly like this: \"Casey and I are having dinner. Someone down there is following us as if shooting a video. The problem is, she looks very strange. I want to find this person because they ruined our dinner. Well, not really, because we have already eaten and we are just calming down. Thank you very much.\" The audio indicates that the pair noticed someone filming them and made humorous, sarcastic comments about it, describing the person filming as creepy and saying they were spoiling their dinner, while also admitting that they had already finished eating and were just winding down. The overall tone of the video is light-hearted and fun, with both participants joking about the situation. The video captures the pair's reactions to being filmed in the restaurant, along with their casual interactions and sense of humor."
+ },
+ {
+ "video_id": "PI-514d902f-bcb9-46a6-9288-56e8f521b5bf",
+ "path": "/workspace/benchmark/video/PI-514d902f-bcb9-46a6-9288-56e8f521b5bf.mp4",
+ "answer": "This video was shot in a vibrant classroom with colorful banners hanging from the ceiling. The main focus is a person wearing a black hoodie, sitting at a white table and skillfully performing pen-spinning tricks, attracting the attention of the students around them. The atmosphere is lively, with students sitting or standing and watching the performance attentively. Among the onlookers, one student wears a purple hooded sweatshirt, while another, wearing a white hat, a cardigan with red and blue stripes, and black trousers, actively gestures to encourage the performer. The audio is characterized by rhythmic beats, possibly from a song, with lyrics such as \"one, two, three, four, six, four, four...\" and lines about telling the world \"I am the game of the future\", enhancing the dynamic, spirited atmosphere. The video effectively captures the students' creativity and camaraderie, highlighting the pen spinner's talent and the classmates' enthusiastic support."
+ },
+ {
+ "video_id": "PI-80887a6b-40ea-4c3b-b258-bb7e65724cfd",
+ "path": "/workspace/benchmark/video/PI-80887a6b-40ea-4c3b-b258-bb7e65724cfd.mp4",
+ "answer": "The video features a person standing outside a Target store, as indicated by the '@target' text overlay in red at the bottom of the frames. The individual is dressed in a white coat over a black top, and the background reveals the store's entrance and some interior details, such as glass windows and a sign that reads 'ARHAUS' (a furniture store), suggesting the video was shot in a shopping mall or a similar commercial area. The setting includes elements typical of a retail environment, with well-lit interiors and modern decor visible through the glass windows. The person stands in front of the store entrance, and the camera angle remains relatively stable, focusing on the individual and the store's exterior. Meanwhile, the audio track features a female speaker with a casual and somewhat sarcastic tone, expressing agreement and relief that Target doesn't need their money. This stance provides the speaker with a convenient excuse not to return to the store, aligning perfectly with their personal preference to avoid it. The core message of the video is the speaker's satisfaction with Target's position, which mirrors their own inclination to steer clear of the retailer."
+ },
+ {
+ "video_id": "PI-2400df13-490c-4541-b276-1747c0f5f042",
+ "path": "/workspace/benchmark/video/PI-2400df13-490c-4541-b276-1747c0f5f042.mp4",
+ "answer": "The video provides a detailed demonstration of the JBL removable speaker in a new Toyota Tundra, accompanied by a soundtrack of passionate music that complements the energetic presentation. A person wearing a checkered shirt is featured, showcasing the speaker's design by rotating it to highlight its features before placing it into a dashboard slot and adjusting it to fit securely. The vehicle's interior, including the dashboard and a rain-speckled windshield with a residential area visible outside, serves as the backdrop. Overlaid text, 'JBL REMOVABLE SPEAKER IN NEW TOTOTA TUNDRA!!', emphasizes the product being showcased. The video highlights the speaker's ease of use, portability, and seamless integration into the vehicle's design, offering Toyota Tundra owners a practical and stylish audio solution, all while the energetic music enhances the overall viewing experience."
+ },
+ {
+ "video_id": "PI-16762b7b-982b-4d99-8d26-44d2900fb422",
+ "path": "/workspace/benchmark/video/PI-16762b7b-982b-4d99-8d26-44d2900fb422.mp4",
+ "answer": "The video captures a scene in a spacious gym filled with various exercise equipment, including weightlifting machines and benches, where a person dressed in a black T-shirt, black trousers, and black shoes is attempting barbell squats. The individual starts by holding the barbell above their shoulders in a squatting position, but repeatedly loses balance and falls backward onto the gym floor, with the barbell remaining on the wooden platform each time. This process of effort and struggle is underscored by the accompanying audio, which consists of background music and a male speaker's exclamations that convey efforts, surprise, or disbelief, highlighting the intensity and challenge of the exercise. The consistent gym environment and layout in the background serve as a backdrop to the individual's persistent attempts, effectively illustrating the core message of the physical exertion and difficulty involved in the exercise process."
+ },
+ {
+ "video_id": "PI-9a3ff6aa-230b-40b8-bfd3-ff5ab1d4237c",
+ "path": "/workspace/benchmark/video/PI-9a3ff6aa-230b-40b8-bfd3-ff5ab1d4237c.mp4",
+ "answer": "The video captures a light-hearted moment in a bustling restaurant or bar, characterized by wooden furniture, neon signs, and televisions broadcasting a sports game. A server, dressed in a red shirt, black apron, and a white beanie with a logo, presents a large, raw piece of meat on a white plate to a customer in a light-colored hoodie. The customer, appearing jovial, gestures as if making a joke, while text overlays reading 'Manager couldn't figure out how to cook it!' and 'Blue!' with a laughing emoji emphasize the humorous situation. Meanwhile, in a separate, casual conversation, a male and a female engage in a neutral-toned discussion. The male mentions being busy with work, to which the female responds positively, expressing agreement. The male then shifts the topic by inquiring about the temperature. This interaction, devoid of background music, contrasts with the lively restaurant scene but shares the common thread of light, conversational exchanges. The video effectively blends these two scenarios, highlighting the humorous presentation of the raw meat while subtly acknowledging the mundane yet relatable work conversation."
+ },
+ {
+ "video_id": "PI-6da39f22-8084-403b-9fef-aa30b20cd35e",
+ "path": "/workspace/benchmark/video/PI-6da39f22-8084-403b-9fef-aa30b20cd35e.mp4",
+ "answer": "The video begins with a blurred image of a person with long blonde hair, wearing a black cowboy hat and a blue top, accompanied by the text 'To the guy I went home with after the Zach Bryan concert...' which sets a personal and intimate tone. The scene transitions to a clearer view of the same person, now also seen wearing white pants and holding a water bottle, walking in a parking area alongside another individual dressed in a black sleeveless shirt and black pants, suggesting a casual, post-concert atmosphere. The narrative then shifts to a heartwarming revelation with a close-up of a baby lying on a gray blanket, wearing a light green onesie adorned with small heart patterns. The text 'You now have a 3 week old' appears at the bottom of the screen, highlighting a significant life event. A hand gently touches the baby's head, emphasizing the serene and tender moment. The video continues to focus on the newborn, reinforcing the message with the same text and a hand with a visible tattoo gently touching the baby's head, adding a personal touch. The consistent background underscores the tranquility of the scene. The video wraps up with this heartwarming image, perfectly complemented by the background music track featuring a singer with a warm and welcoming tone. The lyrics express happiness about someone's return, aligning with the joy and tranquility of welcoming a new life and the joy of reuniting with a loved one."
+ },
+ {
+ "video_id": "PI-4d234e8a-d387-4667-a011-d8bc1287446a",
+ "path": "/workspace/benchmark/video/PI-4d234e8a-d387-4667-a011-d8bc1287446a.mp4",
+ "answer": "The video features two individuals standing close together in a room with a white wall and a ceiling fan in the background.The person on the left is wearing a white shirt and a gold chain with a cross pendant,while the person on the right is wearing a black shirt,large hoop earrings, and a gold necklace with a pendant.Both individuals have tattoos on their arms.Above them,the text \"My 2025\" is displayed,accompanied by a grid of twelve circles,each containing a number from 01 to 12 and a different emoji.The absence of dialogue is complemented by uplifting background music that enhances the enjoyable and lively mood of the scene.The individuals react to the changing emojis with various gestures and expressions,creating a dynamic and engaging narrative that humorously anticipates the events of the year 2025.The sequence of emojis and the individuals' reactions provide a playful and imaginative glimpse into what the year might hold,making the overall atmosphere both entertaining and lighthearted."
+ },
+ {
+ "video_id": "PI-9ff48331-ce44-49d7-a91e-e440f44248aa",
+ "path": "/workspace/benchmark/video/PI-9ff48331-ce44-49d7-a91e-e440f44248aa.mp4",
+ "answer": "The audio and video together vividly capture the delightful and creative process of crafting a decadent dessert at an ice cream shop. The audio emphasizes the fun and creativity inherent in the job, detailing how indulgent treats are made by coating ice cream in rich chocolate and crushed cookies, then finishing them with a drizzle of caramel sauce. The video complements this by visually showcasing each step: it begins with a close-up of a hand with blue-painted nails holding a small white cup under a dispensing machine releasing soft-serve ice cream. The hand then dips the cup into melted chocolate, rolls the coated ice cream in crushed cookies for added crunch, and places the final product into a white paper container. The dessert is then topped with a generous zigzag pattern of caramel sauce, resulting in a visually appealing treat ready to be savored."
+ },
+ {
+ "video_id": "PI-94682c20-eaee-40f0-9173-b96773d8e916",
+ "path": "/workspace/benchmark/video/PI-94682c20-eaee-40f0-9173-b96773d8e916.mp4",
+ "answer": "The video illustrates a medieval European torture method known as rat torture, narrated by a male speaker with a neutral tone and accompanied by steady, atmospheric ambient music. It begins with a shirtless man lying on a wooden table, arms and legs spread out, against a blue tiled floor, with the word 'punishment' displayed. The camera zooms in to reveal a metal pot secured with a wire frame on his stomach. The scene transitions to a close-up of a rat inside a cage, with the words 'be' and 'it' appearing sequentially. The focus returns to the pot, where a rock is dropped, igniting flames beneath it. The camera then shows the pot from above, flames still visible, before shifting to the cage with the rat inside, and the word 'escape' appears. Lifting the cage reveals glistening sausages inside, suggesting they are cooked. The final frames focus on the man's upper body and face, with words like 'method', 'torture', 'primarily', 'medieval', and 'mainly' appearing sequentially, providing historical context. The video effectively depicts the brutal torture method, emphasizing its historical use in medieval Europe."
+ },
+ {
+ "video_id": "PI-acc00f5d-9e67-400e-943d-0635f102f36e",
+ "path": "/workspace/benchmark/video/PI-acc00f5d-9e67-400e-943d-0635f102f36e.mp4",
+ "answer": "The video features a close-up of a man with spiky light brown hair against a plain light background. The man appears to be in a reflective or emotional state, as suggested by his gestures near his face. The text overlay at the top of the frame reads, \"Going back to my first saved sound and this made me emotional...\" followed by a crying emoji indicating that the man is revisiting a previously saved audio clip that evokes intense emotions. The man's hair is slightly messy, and he seems to be touching his face. He may be wiping tears or expressing deep affection. Meanwhile, the speakers in the audio (presumably male) recite a playful and rhythmic phrase in a neutral or unknown emotional tone: \"The interior I drive alligator alligator, the interior of a Chevrolet cinema.\" The phrase is delivered in a relaxed fashion, accompanied by moderately paced background music that creates a lively atmosphere. The overall mood of the video is introspective and emotional, focusing on the personal impact of the audio segment, while the audio content adds a layer of playful rhythm and casual description."
+ },
+ {
+ "video_id": "PI-6d1cfbfd-dfa3-4f18-a7dd-46264a2a832d",
+ "path": "/workspace/benchmark/video/PI-6d1cfbfd-dfa3-4f18-a7dd-46264a2a832d.mp4",
+ "answer": "The video begins with a close-up view of a person's feet wearing light blue sneakers standing on a bed of colorful pebbles, which are in various shades of blue, green, purple, and red, creating a vibrant and textured surface. The person slightly moves their feet, revealing more of the pebbles beneath, while the camera captures their intricate details and the way light reflects off them, enhancing their vivid colors and smooth textures. The scene transitions to a wider view, showing the pebbles extending into the distance, with a calm body of water in the background. The water is clear, allowing the pebbles to be seen beneath the surface, and the reflection of trees and the sky on the water adds to the serene atmosphere. Complementing this peaceful visual is the audio, which features a man singing in a quiet and peaceful manner. The music is enjoyable and immersive, creating a harmonious blend with the tranquil scenery, though it contains no dialogue or narration. Together, the video and audio evoke a sense of calm and relaxation, inviting viewers to immerse themselves in the serene natural setting."
+ },
+ {
+ "video_id": "PI-0e68b0c2-187f-491a-91c5-ddcd5ac5427c",
+ "path": "/workspace/benchmark/video/PI-0e68b0c2-187f-491a-91c5-ddcd5ac5427c.mp4",
+ "answer": "The audio presents a serene background music track accompanied by a man's affectionate singing, with lyrics that roughly translate to \"This must be what love looks like,\" setting a tender and romantic tone. The video complements this mood by depicting a passionate scene in a corridor, framed from a surveillance perspective, with a white wall and a door in the background. The central theme, \"the look of love,\" is prominently displayed on the screen. A man in a black coat and a woman in a brown coat engage in a heartfelt interaction, where the man steps back, the woman moves forward, and they share a tender kiss, followed by a gentle forehead kiss. This video poignantly captures the beauty of romantic love between two individuals, while the soothing background music enhances the emotional depth, effectively underscoring the theme of \"the look of love\" and creating a harmonious blend of auditory and visual elements."
+ },
+ {
+ "video_id": "PI-32e342fa-ed1f-416c-a7f7-5f934b10c6c7",
+ "path": "/workspace/benchmark/video/PI-32e342fa-ed1f-416c-a7f7-5f934b10c6c7.mp4",
+ "answer": "The video shows a woman sitting in an airplane seat, wearing a dark blue top and multiple curlers, creating a humorous scene, as shown in the text \"Trying hard part two on the plane that went wrong\". The woman engages in a daily beauty routine, makeup, including mascara and blush, and adjusts her curling iron. She carefully removes the rollers one by one to reveal the curls and styles them with her hands, while the background shows the interior of an airplane with compartments and windows overhead. The woman seemed focused on looking presentable on the fly, highlighting the challenges of maintaining an image in a limited space. Accompanying the scene is a piece of music played by a male, and the soothing and calming musical tune contrasts with humorous and somewhat messy attempts at grooming, adding a layer of serenity to the other related struggles of trying to look good while traveling."
+ },
+ {
+ "video_id": "PI-d8c91c62-c892-4a03-9c9f-cc4f76be3365",
+ "path": "/workspace/benchmark/video/PI-d8c91c62-c892-4a03-9c9f-cc4f76be3365.mp4",
+ "answer": "This video features a woman making a joke about her appearance. Initially, a close-up shows her looking sad, lamenting that because she is white, her desire to wear braids has resulted in her looking like a \"sister wife.\" The view then widens to show her full outfit, which consists of a long, floral, patchwork-style dress worn over a long-sleeved shirt, with her hair in a single braid. Adopting a country accent, she humorously praises her fictional husband, \"Jebediah,\" on behalf of herself and her two sister wives. To complete the comedic sketch, she lifts her dress to reveal a fake pregnant belly, announcing it's \"baby number 30.\""
+ },
+ {
+ "video_id": "PI-fdf40fb7-d219-4784-9067-fdcb53d1071b",
+ "path": "/workspace/benchmark/video/PI-fdf40fb7-d219-4784-9067-fdcb53d1071b.mp4",
+ "answer": "In a room with vintage decor, a woman wearing a white lace dress excitedly unpackages a large pink mailer from the brand 'Lace Made'. The background features ornate furniture, including a gold-framed mirror and a white vanity, set against patterned wallpaper. She sequentially unpacks three long, quilted puffer coats, each from its own branded garment bag. First, she tries on a light blue coat, which has a corseted waist and a full, gown-like skirt. She smiles and twirls, showing off the unique design. Next, she models a similar coat in white, which features a white fur-trimmed hood. Finally, she unboxes and wears a black version of the coat with a black fur-trimmed hood. Throughout the video, the classic song \"I Wanna Be Loved by You\" plays, enhancing the retro and joyful mood."
+ },
+ {
+ "video_id": "PI-ebab6c59-e5a3-4f63-9a39-fb373c66df78",
+ "path": "/workspace/benchmark/video/PI-ebab6c59-e5a3-4f63-9a39-fb373c66df78.mp4",
+ "answer": "The video captures a lighthearted exchange in a hospital room, centered around a male patient connected to medical equipment and utilizing a communication device. The patient, lying in a white bed and covered by a camouflage blanket, uses a synthesized voice to pose humorous questions to his family and a healthcare worker. These questions, displayed on the device's screen, elicit laughter and amused reactions from those present. For instance, the patient inquires about the timing of a \"weiner enlargement surgery\" he is supposedly having, and later asks his mother if she is stressed. The mother, while slightly stressed, joins in the laughter, and a male healthcare worker, wearing a mask, provides a playful response regarding the surgery schedule. Another male, presumably the patient's father, is seen laughing heartily from a chair in the room. Throughout the video, the patient's breathing apparatus produces distinct \"huffing\" sounds, underscoring his medical condition, yet the overall atmosphere remains jovial and supportive."
+ },
+ {
+ "video_id": "PI-27d84dc4-9184-44b4-b5bd-3afaf216ee52",
+ "path": "/workspace/benchmark/video/PI-27d84dc4-9184-44b4-b5bd-3afaf216ee52.mp4",
+ "answer": "The video presents a serene and peaceful scene where a woman in a gray hoodie and gray pants can be seen adjusting a large gray sheet on the bed, making sure it unfolds neatly. The room features light-colored walls, a framed painting and a shuttered window, maintaining a peaceful atmosphere. A big black dog is lying comfortably on the bed, covered with gray sheets and white pillows, and holding a gray donkey toy. The woman placed a small white fan and a bone-shaped object on the white bedside table beside her bed, adding to the serene environment. The video ends with a close-up of the dog resting peacefully, looking content and relaxed. Accompanying this tranquil vision is a song performed by a female vocalist whose tune is calm and soothing, perfectly complementing the quiet and peaceful atmosphere of the scene, making it an ideal place to embrace a peaceful atmosphere."
+ },
+ {
+ "video_id": "PI-37f98a02-e119-4d6e-b167-0cf53e894341",
+ "path": "/workspace/benchmark/video/PI-37f98a02-e119-4d6e-b167-0cf53e894341.mp4",
+ "answer": "A vibrant woman with pink hair showcases a detailed green corset, paired with a flowing purple sheer skirt in an indoor setting. The scene features a bed with white linens and white curtains draped to the side, complemented by a light-colored armchair, while two pillows sit on the right side. Throughout the video, the woman gracefully transitions from a standing to a kneeling position and back again, performing subtle arm movements that emphasize the contours and volume of her outfit. The camera remains steady, providing a consistent full-body view of her movements against the carefully arranged home background. Accompanying the visuals is an upbeat pop song that adds to the overall dynamism of the presentation. The combination of the unique outfit, fluid movements, and background music creates a focus on the clothing. ..."
+ },
+ {
+ "video_id": "PI-cc04fb9c-5102-4fe0-9d3a-4a90a7f5d1c0",
+ "path": "/workspace/benchmark/video/PI-cc04fb9c-5102-4fe0-9d3a-4a90a7f5d1c0.mp4",
+ "answer": "The video showcases a modern, minimalist interior with light wooden flooring and white walls, where a large window reveals a lush, green forest outside. In the background, a cozy seating area features a green sofa with colorful cushions and a small shelf filled with toys and books, all set upon a large, white rug with a dinosaur print. A person dressed in a white sleeveless shirt, light gray sweatpants, white socks, and white sandals stands next to a tall, wooden slatted room divider. The text overlay reads, 'POV: ur husband made this magic hidden door.' The person gestures towards the divider and begins to push it, revealing a hidden door that opens smoothly to another room. As the person interacts with the door, showcasing its seamless integration into the room's aesthetic, a female speaker's excited voice expresses happiness, exclaiming that this is a dream come true, while another female speaker briefly echoes the sentiment with surprise. The cheerful and enthusiastic atmosphere is enhanced by upbeat and lively background music, highlighting the clever design and functionality of the hidden door."
+ },
+ {
+ "video_id": "PI-513bb010-58d0-42ac-b062-ed47edc749b8",
+ "path": "/workspace/benchmark/video/PI-513bb010-58d0-42ac-b062-ed47edc749b8.mp4",
+ "answer": "At the beginning of the video, a woman walks confidently past a modern apartment characterized by wooden floors and minimalist design. She was wearing a blue and white checked shirt, light blue ripped jeans and black sandals, exuding a casual yet fashionable vibe. The apartment has a clothes hanger with a black jacket and an orange coat hanging on it, a white trash can, and a kitchen area with a counter and stools, providing a modern and residential feel. The text on the video reads like this: \"My mother said that when she was young, she rejected Shema Moore because she was dating my father...\" . The woman continued to walk towards the camera, gesturing with her hand as if sharing a story, hinting that she was telling this personal tale. Then the video switches to a black-and-white photo, in which there is a woman with braids, wearing a ribbed sweater, a heart-shaped pendant necklace and her arms crossed. This photo is very likely to represent the woman's mother or an important figure from the past, adding elements of nostalgia and reflection to this story. The video ends with this photo, arousing a sense of curiosity among the audience and connecting it with the personal story being shared, highlighting the themes of family history, personal choices and the passage of time."
+ },
+ {
+ "video_id": "PI-9eeee5f2-c844-4c57-aff5-50de7f5dd76d",
+ "path": "/workspace/benchmark/video/PI-9eeee5f2-c844-4c57-aff5-50de7f5dd76d.mp4",
+ "answer": "The video features a woman with long blonde hair, wearing a black tank top, standing in a well-lit indoor room characterized by a white ceiling and a circular light fixture. She holds a yellow package, likely containing a hot link sausage, and is seen preparing to eat it, bringing the sausage closer to her mouth as if about to take a bite. The woman has a tattoo on her upper arm and wears multiple bracelets on her wrist, adding personal details to her appearance. The background remains minimal, with a window allowing natural light to create a bright and clean atmosphere, keeping the focus on her actions. Meanwhile, the audio is a personal vlog-style track featuring a female speaker with a casual and slightly hurried tone, sharing her experience of eating the hot link sausage before heading to court. This reflects a moment of everyday life, highlighting her cultural familiarity with the food and adding a layer of authenticity and relatability to the scene. The overall narrative combines visual and auditory elements to convey a simple yet intimate glimpse into the woman's routine."
+ },
+ {
+ "video_id": "PI-8e96030b-f9ba-45e9-ac37-8c7d08207a37",
+ "path": "/workspace/benchmark/video/PI-8e96030b-f9ba-45e9-ac37-8c7d08207a37.mp4",
+ "answer": "In the video, a person is wearing a hoodie with a camouflage pattern and holding a pink Indomie instant noodle in his hand. The background is solid and light, with the focus on the individual and the noodle packaging. This person is speaking. The text on the video shows: \"At least she treats me like an important person.\" Next comes another text overlay, raising a reflective question: \"But if you were an unknown nobody, would she still love you?\" Then, the camera switches to a close-up shot, with the packaging of Indominium instant noodles lying on the bed sheet, emphasizing the product. The story continues. The man in the hoodie appears again, accompanied by this sentence: \"When I was unknown, no one loved me.\" The noodle packaging reappeared, with the pink words \"I made it!\" It implies personal affirmation or realization. The video switches to the person in the sweatshirt, who seems to be considering the previous statement. The noodle packaging reappeared, with the text \"Before the money...\" Then it is \"Before the money...\" Before fame... This sequence highlights the themes of unconditional acceptance and love, in contrast to conditional love based on wealth or fame. The core message revolves around the idea that true love or acceptance exists even without the constraints of money or fame, just like what Indomie noodles symbolize, a simple and lasting existence."
+ },
+ {
+ "video_id": "PI-d4959730-5cbe-42a5-8efc-cc9b318a05ee",
+ "path": "/workspace/benchmark/video/PI-d4959730-5cbe-42a5-8efc-cc9b318a05ee.mp4",
+ "answer": "In the video, a man wearing headphones and a beige t-shirt is seated at a desk featuring a black panel with green and red buttons, and the word \"MAJED\" illuminated on it. Behind him, a purple-lit globe and a shelf with various items, including a YouTube play button, set the scene. The background music is a vibrant, high-energy electronic beat with an infectious melody, creating a lively atmosphere. A comment appears on the screen: \"Reply to peekerj's comment: NASTY! BY IMPXSTR IS 🔥\", prompting the man to react with a frustrated \"fk\" as he continues speaking. Text overlays such as \"WTF IS HAPPENING\" and \"KINDA VIBING\" appear as he gestures, expressing his irritation with the repetitive nature of the content he's reviewing. He questions why he finds it groovy, with texts like \"WHY IS THAT GROOVY?\" and \"listening\" appearing as he analyzes the beat. His frustration is evident as he notes the repetitiveness, yet he continues, with \"something's coming\" and \"analyzing beat\" appearing. Eventually, he decides \"LET'S GO\", and as he speaks, texts like \"like this is\" and \"into something\" appear. Finally, he presses the green button, leading to a green screen displaying \"APPROVED\" with a rating of \"5.5/10\", as the background light shifts to green, highlighting his mixed feelings of discontent and approval towards the unoriginal material."
+ },
+ {
+ "video_id": "PI-2eb09166-983c-4455-a385-c2f8f56a5d3c",
+ "path": "/workspace/benchmark/video/PI-2eb09166-983c-4455-a385-c2f8f56a5d3c.mp4",
+ "answer": "In the video, a child is wearing a colorful jumpsuit, decorated with various shapes and colors, including the eye-catching red design at the front. Against the dim outdoor background, the child walked around energetically. There are some vague objects and people in the background, suggesting a casual, home-like environment. The child was holding a small object in one hand and seemed to be exploring this space with curiosity and excitement. The camera Angle is slightly tilted, adding a sense of dynamism and playfulness to the scene. The child's movements were lively, showing a sense of joy and exploration. This video captures an innocent moment of play. In the video, one can hear someone say, \"I shot your bed. She followed me and asked me to change your diaper.\" This indicates that the child might be following the speaker in bed, and the speaker has changed the child's clothes or diaper. The overall mood of the video is about nurturing and focusing, highlighting the tender moments and interactions between the child and the caregiver."
+ },
+ {
+ "video_id": "PI-434a8356-1d08-4f14-9b54-85cfa75e0f56",
+ "path": "/workspace/benchmark/video/PI-434a8356-1d08-4f14-9b54-85cfa75e0f56.mp4",
+ "answer": "The video shows a man standing in front of the \"Fanatics\" logo background, wearing a black hoodie with a colorful graphic design. The video begins with the words \"WHAT'S A CLASSIC MOVIE YOU'VE NEVER WATCHED\" that appear on the screen, followed by \"YOU'VE NEVER WATCHED\". An interviewer with a \"Fever\" microphone approaches, and the text \"I'VE NEVER WATCHED\" appears, indicating that the person admits not to have seen a classic movie. The conversation continues with the texts \"YEAH YOU MISSED OUT ON\"appearing in turn, suggesting the interviewer's teasing about missing the opportunity to watch a classic movie. The man then replies to the interviewer and the text \"Oh Harry Potter\" appears, indicating the film in question. The video ends with the individual acknowledging the advice with \"YUP\" highlighting a light discussion about classic movies. Meanwhile, a male speaker with a neutral emotional tone refers to a classic movie he has never seen, indicating apathy, while a male speaker responds with a seemingly unrelated comment, \"Oh, Harry Potter,\" which may indicate a confused or joking comment, adding to the casual and humorous nature of the conversation."
+ },
+ {
+ "video_id": "PI-41f0d21e-7885-4b68-9ae2-55d18a8d7ef0",
+ "path": "/workspace/benchmark/video/PI-41f0d21e-7885-4b68-9ae2-55d18a8d7ef0.mp4",
+ "answer": "The video features a nostalgic scene that revolves around a machine called \"SUITMATE\" designed to remove moisture from swimsuits. The video begins with a close-up of the machine's circular opening, accompanied by the text \"A core childhood memory.\". Follow the detailed instructions on the machine and place the dark blue swimsuit into the machine with one hand, including one swimsuit at a time and firmly hold the lid down for 5-10 seconds. Hand presses against the metal surface of the machine, emphasizing its texture. The video ends with hands ensuring the swimsuit is properly positioned inside the machine, evoking familiarity and nostalgia associated with childhood memories. Meanwhile, an emotionally uncertain male speaker discussed the importance of perseverance and not giving up, emphasizing that success in a deal or endeavor requires effort and determination. The atmosphere suggests a reflective or motivational backdrop, accompanied by ambient music with steady rhythms and neutral to uplifting tones, reinforcing the message of determination and the value of persistent efforts to succeed."
+ },
+ {
+ "video_id": "PI-8c3402c5-221d-4363-a236-fdbd3a8857a6",
+ "path": "/workspace/benchmark/video/PI-8c3402c5-221d-4363-a236-fdbd3a8857a6.mp4",
+ "answer": "This video takes place in a large warehouse-style grocery store, especially in the chilled beverage section. The background is a wall-mounted refrigerated cabinet, inside which various beverages are placed, with orange juice and other drinks prominently displayed. Above the refrigerated cabinet, a large sign reads: \"No one saves more milk than you\", emphasizing \"milk\" in red. The site is well-lit, a typical retail environment, and the floor is polished concrete. Two women pushing shopping carts in the video. They were dressed casually and their shopping carts were filled with all kinds of groceries. When they walked past the camera, they seemed to be talking, although their specific conversation was not clear. Their appearance adds dynamic elements to the static scene of the store aisle. The audio in the video captures a humorous communication. A person off the camera commented, \"What's written on it?\" No one said they would milk you more. This is what it says. Yes. No one has left any milk for you, you idiots. Yes, OK. This dialogue highlights an interesting misunderstanding of the sign information, which was intended to promote milk conservation but was playfully misunderstood.The core theme of the video revolves around a humorous misunderstanding of store signs. The original intention of this logo was to attract customers by promoting milk conservation, but the audio review added a layer of humor by humorously distorting the wording. This light-hearted marketing strategy emphasizes the comedic intent of the video, making it an interesting snippet of the daily life in the grocery store."
+ },
+ {
+ "video_id": "PI-7a8790b6-d326-463b-a7a9-957342af59b7",
+ "path": "/workspace/benchmark/video/PI-7a8790b6-d326-463b-a7a9-957342af59b7.mp4",
+ "answer": "The video takes place in a formal setting in Washington, D.C., on January 16, 2025, likely a hearing or meeting, where a significant debate on the federal minimum wage unfolds. It begins with Scott Bessent, Trump's pick for treasury secretary, seated at a table with a nameplate reading 'MR. BESSENT.' The NBC NEWS text overlay states that Bessent believes the federal minimum wage should not be increased. The scene transitions to Senator Bernie Sanders, identified as 'Sen. Bernie Sanders I-V.T.,' nearly 40 million people earn less than $17 an hour. The audio track features a dialogue between Sanders, advocating for an increase to a living wage, and Bessent, who responds calmly, stating that he believes the minimum wage is more of a statewide or regional issue. The video captures the essence of this critical debate, emphasizing the contrasting views on whether the federal minimum wage should be raised."
+ },
+ {
+ "video_id": "PI-2066f65f-93cd-4c3f-9266-69f45721d53c",
+ "path": "/workspace/benchmark/video/PI-2066f65f-93cd-4c3f-9266-69f45721d53c.mp4",
+ "answer": "The video gives a mysterious and artistic introduction to \"Ruby,\" which features a woman with striking red hair in a black costume with fishnet stockings and white boots. Close-ups from the eyes and character costumes emphasize detail and mood, enhancing the mysterious atmosphere. The words \"Jennie PRESENTS Ruby\" and \"THE 1ST STUDIO ALBUM\" highlight the announcement, hinting at an important musical release. The woman also wears a school uniform, adding a layer of versatility and thematic depth. Added names of featured artists such as Childish Gambino, Dominic Fick, Dua Lipa, and Kali Uchis appear, hinting at high-profile collaborations. The aesthetic of the video is complemented by an atmospheric track performed by a lady, featuring melancholy and immersive musical compositions with soft, haunting melodies or electronic beats. Together, these elements enhance the mood of mystery and introspection, and lyrics or vocal elements may reflect themes of introspection, transformation, and emotional depth. The overall sound design is designed to create an immersive experience that will transport the audience into the world of \"Ruby\". The video ends with a bright frame and the text \"March 7th\", indicating an important release date."
+ },
+ {
+ "video_id": "PI-1d9440cc-c26d-4e6f-8be1-90bbad63203d",
+ "path": "/workspace/benchmark/video/PI-1d9440cc-c26d-4e6f-8be1-90bbad63203d.mp4",
+ "answer": "The video captures a candid moment featuring Justin Bieber walking down a sidewalk in an urban setting, surrounded by palm trees, parked cars, and various storefronts, with a mix of modern and older buildings, some under construction, contributing to the lively urban atmosphere. Dressed in a light-colored hoodie, beige shorts, white socks, and white sneakers, and clutching a green drink, Justin appears agitated and animated as he walks, gesturing and possibly expressing frustration or annoyance, which aligns with the text overlay indicating his reaction to the presence of paparazzi. The audio track features Justin's voice, conveying his agitation and even anger through his tone and choice of words, culminating in the phrase \"peace, love unity, respect, suck my dick,\" which strongly reflects his resistance and outrage towards the paparazzi's incessant tracking and filming. This integration of visual and auditory elements provides a vivid glimpse into Justin Bieber's personality and his response to public attention, particularly the intrusive actions of the paparazzi."
+ },
+ {
+ "video_id": "PI-04a2d2ae-d2b6-4101-b3bb-174373fdc6db",
+ "path": "/workspace/benchmark/video/PI-04a2d2ae-d2b6-4101-b3bb-174373fdc6db.mp4",
+ "answer": "In a captivating fusion of audio and video, a boxer's intense training session comes to life. The audio is the soundtrack of a boxing training session, featuring the voice of the male boxer himself, speaking with unwavering determination and enthusiasm. He firmly emphasizes that he is \"definitely southpaw\" and goes on to describe his normal hitting routine, with the clear goal of increasing the power in his hands and exerting maximum effort during his training. The video provides a vivid visual representation of this dedication. It showcases a man dressed in an orange T-shirt, black leggings, and black and white sneakers, engaged in outdoor boxing training. The backdrop, a red-brick wall with a white Everlast punching bag hanging from a stand, adds to the gritty atmosphere. The man, wearing white boxing gloves, practices with intense focus, delivering powerful and rapid punches like jabs and hooks. The text \"I'm definitely southpaw\" accompanied by a skull emoji appears in the center of the screen, reinforcing his self-identification. His voice, filled with strength and determination as heard in the audio, perfectly matches his on-screen actions. The entire video exudes a positive and dedicated training atmosphere, showcasing the man's passion and commitment to boxing, as he strives to improve his skills through his regular hitting routine."
+ },
+ {
+ "video_id": "PI-2bc0e551-a9f7-4b2f-9130-aebb4d8f3499",
+ "path": "/workspace/benchmark/video/PI-2bc0e551-a9f7-4b2f-9130-aebb4d8f3499.mp4",
+ "answer": "The video shows the playful and affectionate interaction of a young man and woman in a comfortable kitchen setting, which suggests that they are a couple and part of a family of three, including a daughter, as inferred from the audio content. The woman, dressed in a blue striped shirt, is cutting vegetables and jokes humorously with the man, who is dressed in black and makes expressive gestures during their conversation. The audio track captures their lighthearted conversation filled with laughter and inside jokes, indicating a relaxed and happy family atmosphere. The kitchen setup, complete with kitchen supplies, a wall clock and a record player, enhances the warm and familiar environment. The video highlights the intimate and loving relationship between the two, emphasizing their fun connection and the overall cheerful dynamics of their home."
+ },
+ {
+ "video_id": "PI-728fe7d5-284a-4acf-b208-5b84e1fcf1a3",
+ "path": "/workspace/benchmark/video/PI-728fe7d5-284a-4acf-b208-5b84e1fcf1a3.mp4",
+ "answer": "The video takes place inside a Safeway store in Washington, where the camera focuses on a locked ice cream freezer, an unusual sight that suggests security measures against theft. The freezer is stocked with various brands and flavors, including Tillamook and Ben & Jerry's, all neatly arranged on multiple shelves with visible price tags, such as $6.49 for Tillamook products. The text overlay, \"this morning @ safeway in WA ... the only ice cream locked ...\", emphasizes the peculiarity of the situation. The camera then reveals a person wearing a blue, textured jacket and a white beanie adorned with yellow star patterns, standing in front of the freezer. The background features other grocery items and shelves, creating a typical supermarket setting. Accompanying the visuals is a modern and upbeat instrumental BGM track, which enhances the dynamic atmosphere of the scene, adding a rhythmic and melodic element that contrasts with the static nature of the locked freezer, thereby highlighting the store's security measures."
+ },
+ {
+ "video_id": "PI-b7781e79-c824-4376-88af-0f8a13198f3c",
+ "path": "/workspace/benchmark/video/PI-b7781e79-c824-4376-88af-0f8a13198f3c.mp4",
+ "answer": "The video and audio together create a heartwarming depiction of a birthday celebration that feels straight out of a fairy tale. The video shows a young woman in a stunning pink ball gown walking down a garden pathway, holding a bouquet, while a white horse-drawn carriage follows behind her, carrying another person in white. The garden setting is serene, with neatly trimmed hedges and leafless trees, and guests in elegant attire accompany the birthday girl as a photographer captures the moment. The accompanying audio enhances this enchanting scene with the sound of many people singing happy birthday songs, creating a warm and moving atmosphere. The text overlay on the video, 'It was the 'happy birthday song while she walked in', for me,' along with emojis of a smiling face, a heart, sparkles, and a puzzle piece, adds a personal and joyful touch to the occasion. Together, these elements weave a narrative of celebration, elegance, and pure magic."
+ },
+ {
+ "video_id": "PI-43b4df82-917c-4312-b53b-5eecd1f2a748",
+ "path": "/workspace/benchmark/video/PI-43b4df82-917c-4312-b53b-5eecd1f2a748.mp4",
+ "answer": "The video showcases a kitchen renovation project inspired by Scandinavian design, where the cabinet doors were removed to create a more open and inviting space. It begins with a close-up of open cabinets, revealing various stored items like a green pot and metal containers, accompanied by a text overlay explaining the inspiration behind the renovation. As the camera pans, it reveals a stainless steel refrigerator adorned with magnets and photos, and light blue walls that contribute to the bright and airy atmosphere. The kitchen features organized open cabinets filled with spices, bottles, and dishes, enhancing the aesthetic appeal. The countertops are cluttered with kitchen items, and a window above the sink floods the space with natural light. The consistent theme throughout the video is the transformation achieved by removing cabinet doors, aligning with Scandinavian design principles. Meanwhile, the background music with its light and cheerful tune and lively rhythm complements the positive and warm atmosphere of the renovation, enhancing the emotional resonance of the transformation."
+ },
+ {
+ "video_id": "PI-405b2b9c-d873-4ea2-a976-8a303ab098ab",
+ "path": "/workspace/benchmark/video/PI-405b2b9c-d873-4ea2-a976-8a303ab098ab.mp4",
+ "answer": "In the video, a person with long golden hair and wearing a black top is holding a rose gold smartphone with three cameras on the back. This person is wearing multiple bracelets, including a red one, and has a tattoo on the wrist. The background is a wall decorated with an ornate small mirror and some string lights, creating a comfortable and private atmosphere. The text on the video reads: \"Ash used artificial intelligence voice in the live stream, making people feel that she was really talking to someone else.\" A crying emoji follows, indicating that this person was using artificial intelligence technology to simulate conversations during the live broadcast. The dialogue in the video includes phrases such as \"Are you monitoring my life?\", \"Why would I monitor your life? I'm just talking,\" and \"Because obviously I won't waste time looking at your life anyway,\" which implies a humorous or satirical interaction. The conversation goes on with \"You still got married. Are you still married? If I am, then what does it have to do with you?\" and \"Because you really told me about you,\" hinting at the personal or relationship topics being discussed. The overall tone of the video is light-hearted and comedic. Just as the crying emoji emphasizes, it highlights the use of artificial intelligence to create engaging and entertaining content during the live broadcast."
+ },
+ {
+ "video_id": "PI-8949b4b6-b7fa-4f89-b42b-c8a6b1ea4f3e",
+ "path": "/workspace/benchmark/video/PI-8949b4b6-b7fa-4f89-b42b-c8a6b1ea4f3e.mp4",
+ "answer": "This video features a humorous and relaxing scene set in a cozy kitchen. A woman in a red and white sweater is sitting at the table preparing spring rolls. She brushes beaten egg onto the wrappers and then rolls them up. There are various ingredients placed on the dining table, including a bowl of beaten eggs and a plate of spring rolls. The conversation between this woman and another person in a grey shirt adds a comedic element to the video. They refer to the iconic scene from \"Titanic\", where Rose was standing on the door and Jack was in the icy water. The woman jokingly assigns roles, suggesting that the person in the grey shirt is Rose, while her daughter, known as \"Apple\", is Jack, freezing to death in the water. The conversation continues: the woman asks whether she would be willing to put Apple on the door to help her survive. The person in the grey shirt replies humorously, \"Of course not!\" The key OCR text in the video includes \"Hey, Mom, do you know that scene in 'Titanic'?\", \"Where Rose is standing on the door?\", \"Then Jack fell into the icy water,\" \"Let's put it this way. You're Ruth on the door, right?\" and \"She needs your help to survive\", all of which highlight the joking nature of their conversation."
+ },
+ {
+ "video_id": "PI-bc38ebfe-190f-4fac-9b92-037ef6b0021b",
+ "path": "/workspace/benchmark/video/PI-bc38ebfe-190f-4fac-9b92-037ef6b0021b.mp4",
+ "answer": "The video unfolds in a comfortable, well-lit room with white walls, with a large TV screen in the background showing a vibrant animated scene. The room exudes a casual and warm atmosphere. A red blanket is spread on the bed or sofa in the foreground, with a piece of paper placed on it, which further enhances this atmosphere. Three children are the focus of this scene. A child wearing a black hoodie and a hat is concentrating on playing a large, shiny brass instrument. The glittering surface of the instrument catches the light, adding a touch of brilliance to the room. Beside the player, another child wearing a white patterned T-shirt and dark trousers claps enthusiastically, injecting a rhythmic vitality into the scene. The audio is dominated by the loud, brassy sounds at the core of the children's impromptu performance. The child playing the tuba appears very focused, while the child in the white T-shirt claps along, creating a lively and charming atmosphere. The text on the video reads: \"The children came in and did this...\" It implies a spontaneous and happy moment captured by the camera. This video captures a carefree moment of fun and creativity, highlighting the children's interaction and the joy they find in their spontaneous musical efforts."
+ },
+ {
+ "video_id": "PI-8717e54d-04f5-4b34-8acf-93e3ecd4818d",
+ "path": "/workspace/benchmark/video/PI-8717e54d-04f5-4b34-8acf-93e3ecd4818d.mp4",
+ "answer": "The video features a person with long, curly hair wearing a white headband with brown spots, resembling a giraffe pattern, and a light green sweatshirt, attempting to contour their face for the first time. They use a pink makeup applicator to apply contour product to various parts of their face, including the cheek, jawline, and forehead, with the text overlay reading, 'Enjoy this video of me trying to contour for the first time,' indicating a personal makeup tutorial or demonstration. The background shows a room with a door and a colorful poster on the wall. The person carefully applies and blends the contour product with a makeup brush to achieve a smooth finish, refining the contouring and focusing on seamless blending. Meanwhile, the audio is a vocal track featuring a female singer with a playful and energetic tone, singing 'Oh oh oh' and the word 'Jail,' creating a light-hearted and catchy musical moment that adds a cheerful ambiance to the makeup application process. Towards the end, another individual wearing a brown shirt with the text 'Coors Banquet RODEO' and a graphic of a bull appears in the background, observing the process, maintaining a consistent setting and theme that emphasizes the learning and application of contouring techniques."
+ },
+ {
+ "video_id": "PI-ff238e9d-edfe-47fc-ae12-defb456caa8e",
+ "path": "/workspace/benchmark/video/PI-ff238e9d-edfe-47fc-ae12-defb456caa8e.mp4",
+ "answer": "In this video, a young woman with long, voluminous, reddish-brown hair responds to a viewer's comment accusing her of wearing a wig. The video is captioned \"Proving my hair is real.\" To disprove the accusation, she meticulously parts her hair in multiple places, showing her scalp up close to the camera to demonstrate that there is no lace or glue visible. She runs her fingers through her hair from the roots, highlighting her natural hairline and scalp. Throughout the video, she speaks directly to the camera, insisting that her hair is authentic and expressing frustration that some people doubt that Black women can have long, natural hair. Towards the end, she briefly shows a bottle of hair growth oil, which she credits for helping her grow her hair. She concludes by reaffirming that she is not wearing a wig."
+ },
+ {
+ "video_id": "PI-95e00dcd-ce4e-4f6a-9956-f2293dc18a7f",
+ "path": "/workspace/benchmark/video/PI-95e00dcd-ce4e-4f6a-9956-f2293dc18a7f.mp4",
+ "answer": "In the video, a man is sitting indoors with an ordinary white curtain in the background. He is wearing a dark green T-shirt with an orange pattern and a silver necklace. The man is holding a black nose dilator near his nose. Throughout the video, he seems to be demonstrating or responding to the use of this device, which is designed to assist with breathing. The text on the video reads: \"When your fiancé has a deviated nasal septum and has had to breathe through his mouth his entire life.\" This indicates that the man is experiencing an improvement in nasal breathing for the first time, which is a significant change for him. He expresses his surprise and relief through various facial expressions and body language, such as leaning back in the chair, raising his head and taking deep breaths. The key phrases in his conversation include \"No, this is really a big difference,\" \"Wow, this is strange,\" and \"It's actually like taking a deep breath. So much air. I can feel the air from below... my God.\" These words highlight his astonishment at the amount of air he can now breathe through his nose. Previously, due to a deviated nasal septum, he was unable to breathe effectively. The video ends with the words \"Happy Valentine's Day\" appearing on the screen, followed by \"Thank you\", suggesting a light-hearted or humorous context, which might be a message for Valentine's Day. The theme of the entire video is the man's newly discovered ability to breathe normally through his nose, which he finds both strange and wonderful."
+ },
+ {
+ "video_id": "PI-6202073d-28f6-4ac8-911d-efd58ed97e15",
+ "path": "/workspace/benchmark/video/PI-6202073d-28f6-4ac8-911d-efd58ed97e15.mp4",
+ "answer": "In the video, a dog is wearing a blue shirt decorated with colorful patterns of various shapes and figures. The dog stands on the dark wooden floor near the door, with a dark area and a pink object close to the wall in the background. Throughout the video, the dog remains mostly still, occasionally shifting its weight from one paw to the other and looking around. The dog's facial expression changes slightly, showing curiosity and a slight interest in its surroundings. The audio overlay includes a series of commands and statements, such as \"I said that if you change your clothes, you can have dinner\", \"No, you will stop that descendant. Stop\", and \"I don't care. You can turn around.\" These phrases imply conversations or instructions given to the dog, possibly in a joking or corrective way. The entire scene captures the dog's calm and curious behavior, with a simple indoor background as the backdrop. The video seems to focus on the interaction between the dog and the unseen speaker, highlighting the dog's response to commands and the gentle, almost humorous tone of the conversation."
+ },
+ {
+ "video_id": "PI-cfeeefd4-e484-4d42-88a3-53bf5ba44953",
+ "path": "/workspace/benchmark/video/PI-cfeeefd4-e484-4d42-88a3-53bf5ba44953.mp4",
+ "answer": "This video presents a hair styling technique intended to reduce tension associated with traditional slick-back hairstyles. A person with medium-length, dark hair is shown from a close overhead perspective, demonstrating the application of a hair-parting tool. The individual meticulously uses the tool to create a series of small, individual sections at the crown of the head, moving backward from the hairline. This method appears to be an alternative to pulling all hair tightly back, potentially alleviating stress on the hair follicles and scalp. The process is demonstrated clearly, highlighting the precise movements involved in sectioning the hair to achieve the desired effect. The visual instructions are complemented by a text overlay that emphasizes the purpose of the technique."
+ },
+ {
+ "video_id": "PI-e2d81784-24ed-4362-b1e5-9863542ef815",
+ "path": "/workspace/benchmark/video/PI-e2d81784-24ed-4362-b1e5-9863542ef815.mp4",
+ "answer": "The video presents a stark contrast to the intense and aggressive emotions described in the audio. It begins with a close-up view of a person's feet in black slippers and distressed jeans, standing on a wooden floor, while a blue and white spin mop is used to clean the floor. The mop head, white and rotating, moves smoothly back and forth, emphasizing the cleaning action from an overhead perspective. Natural light filters through a glass door with blinds in the background, creating a calm and methodical atmosphere. The scene remains consistent, highlighting the thoroughness of the cleaning process, which is meticulous and serene. This visual narrative of calm diligence stands in direct opposition to the audio's depiction of a heated verbal outburst, where a male speaker expresses frustration or anger through explicit language, accompanied by tense and chaotic ambient music. The juxtaposition of these two elements creates a striking dichotomy between tranquility and turmoil."
+ },
+ {
+ "video_id": "PI-dba07bb8-8ae9-4d18-a871-2fc47a9f9a96",
+ "path": "/workspace/benchmark/video/PI-dba07bb8-8ae9-4d18-a871-2fc47a9f9a96.mp4",
+ "answer": "The video unfolds on a city street, where a person in a gray hoodie and jeans stands near a stainless steel guardrail, while another individual in a dark jacket walks past the camera. The text overlay, \"POV your friends using a fake so you have to get creative,\" hints at a playful scenario. As the camera pans, a group gathers near the guardrail, where someone in a dark long-sleeve shirt and light-colored pants interacts with them, gesturing as if explaining something. Several purple cups are stacked in a chair nearby. The group listens attentively, some glancing at their phones. Overlaid on this scene, the audio captures a female speaker expressing uncertainty and surprise in English with exclamations like \"I don't know\" and \"Oh my god,\" reacting to something unexpected. A male speaker responds briefly in Korean, adding minimally to the dialogue. The person behind the guardrail raises a purple bag, possibly demonstrating something, while the group remains engaged. The background reveals a well-lit store with various items visible through glass doors, enhancing the casual and social atmosphere as people interact in a relaxed manner."
+ },
+ {
+ "video_id": "PI-0216673e-dce2-4321-9a75-fa6fd1970811",
+ "path": "/workspace/benchmark/video/PI-0216673e-dce2-4321-9a75-fa6fd1970811.mp4",
+ "answer": "In the video, a person with thick curly hair is wearing a green shirt with the white words \"VERELICS\" and \"UNIMU\" written on it, as well as a logo. The setting is a comfortably decorated room, with blue and green lights projected from the ceiling fan, creating a vibrant atmosphere. The ceiling is decorated with green vine ornaments, and string lights add a warm and inviting glow to the background. This person makes vivid gestures with their hands, indicating that they are speaking enthusiastically or expressing something. The lyrics \"I'm not a killer, but don't push me, don't push me, don't. Let's go.\" can be heard in the audio. This person might be singing, rapping or conveying a strong message. The core theme of the video is confidence and self-assurance, set against a visually captivating backdrop. The person's expressive gestures combined with the confident lyrics create a vibrant and captivating narrative that emphasizes resilience and preparedness."
+ },
+ {
+ "video_id": "PI-e37737d1-c81a-4c57-bb6e-94c8a7582560",
+ "path": "/workspace/benchmark/video/PI-e37737d1-c81a-4c57-bb6e-94c8a7582560.mp4",
+ "answer": "The video captures a striking scene in a residential neighborhood where a person, clad in a blue long-sleeve shirt, black pants, and a green-accented helmet, stands precariously on a drone-like device attached to a slowly moving car. The drone, equipped with multiple rotors, hovers just above the ground as the car progresses down the quiet street, lined with houses, palm trees, and parked cars. The rider skillfully maintains balance, adjusting their stance as needed. In the background, a few onlookers observe this unusual spectacle. Complementing the visual, the audio reveals a tense and chaotic atmosphere, with multiple voices expressing frustration and surprise. A male voice urges someone to move faster, criticizing the electric car's slowness, while a female voice echoes confusion. Another male voice repeatedly exclaims in shock, heightening the sense of astonishment and concern among the witnesses. The ambient music, with its steady tempo, adds a neutral yet slightly tense undertone to the scene. Together, the video and audio underscore the unique and potentially hazardous nature of this transportation method, set against the calm backdrop of a suburban street."
+ },
+ {
+ "video_id": "PI-e97b5b53-ca23-4057-873f-595495c4cb12",
+ "path": "/workspace/benchmark/video/PI-e97b5b53-ca23-4057-873f-595495c4cb12.mp4",
+ "answer": "The video depicts a young woman named Ash, wearing a black top and looking at her phone while talking to a woman named Hailey, and features a group of individuals, primarily female voices, engaging in a casual and somewhat competitive conversation about social media accounts and follower counts. The female speaker expresses frustration and excitement over reaching milestones like 600K followers, while another female mentions copying her hairstyle, suggesting playful rivalry. A female voice offers support by promoting her account, and the overall tone is lighthearted with hints of frustration and humor. The scene appears to involve friends or acquaintances discussing social media achievements and personal appearances in an informal setting, with no background music present. Ash repeatedly asks Hailey if she can create her own social media account, pointing out that Hailey already has a large number of followers. Another woman standing beside Ash occasionally joins the conversation. Ash expresses frustration, mentioning that she herself has 600,000 followers. The conversation shifts to Hailey dyeing her hair red, prompting the other woman to comment that Hailey copied her own hair color. Ash ends by playfully pleading for the account while rubbing her face, revealing a lighthearted moment of social media competition."
+ },
+ {
+ "video_id": "PI-fc63780b-164e-4642-9e93-6392d83a682d",
+ "path": "/workspace/benchmark/video/PI-fc63780b-164e-4642-9e93-6392d83a682d.mp4",
+ "answer": "This comedic vignette captures a delightful family moment. The camera focuses on a grandmother holding a light blue silicone mold, cheerfully convinced it's for creating Christmas tree-shaped ice cubes. An on-screen caption reveals the humorous truth to viewers: \"When Grandma thinks she bought Christmas tree ice molds...\" – while the molds are actually adult toys. Oblivious to the true nature of the design, the grandmother smiles while discussing the \"pretty\" ice shapes with family members, whose stifled laughter can be heard off-camera. In the unfolding scene, she places one of the distinctive ice shapes into a water glass on the dining table, her face lighting up with genuine surprise when the ice stands perfectly upright. The comedy stems from the striking contrast between the grandmother's innocent misunderstanding and the risqué reality."
+ },
+ {
+ "video_id": "PI-693005a0-e99a-442c-8791-bb97f7686de7",
+ "path": "/workspace/benchmark/video/PI-693005a0-e99a-442c-8791-bb97f7686de7.mp4",
+ "answer": "The video features a simple hand-drawn animation that humorously contrasts the idea of a boycott with a welcoming image of Ben and Jerry's ice cream. A figure labeled \"U.S Citizens\" stands behind a desk, initially smiling brightly, with the text \"Americans planning to boycott cooperations\" displayed at the top. The scene cuts to the same figure with a surprised expression, accompanied by the text \"Ben and Jerry:\" and a sheet that reads \"Welcome!Bent Juny's Ice Cream!\" Then the character's expression changes to sadness and tears, while the cow and logo remain. Meanwhile, an emotionally neutral male speaker refers to the aftermath of an incident and denies the need for cookies, while an equally neutral female speaker suggests hunger after eating a sandwich. The entire scene includes casual conversation between two people discussing food-related issues, with no apparent emotional intensity or background music, highlighting the ironic twist against the backdrop of the boycott."
+ },
+ {
+ "video_id": "PI-dd815296-edc8-41e8-a8d8-f30e647336dc",
+ "path": "/workspace/benchmark/video/PI-dd815296-edc8-41e8-a8d8-f30e647336dc.mp4",
+ "answer": "The video features a young woman with long, wavy blonde hair, dressed in a white strapless top and covered with a fluffy blue blanket, sitting on a light-colored couch against a plain wall. The text overlay on the video clarifies, \"Btw i didn’t write this tatiana did,\" as she speaks and gestures with her hands, occasionally glancing at the camera. At one point, another individual wearing a black hoodie with green text on the sleeves and blue jeans joins her on the couch, though the focus remains on the woman. The camera angle shifts slightly to reveal more of the room, including a lamp and some plants, enhancing the casual and relaxed atmosphere. Overlaying this visual narrative, the audio features a playful and teasing female vocal performance, conveying a light-hearted and flirtatious emotion through a catchy melody, which complements the overall easygoing vibe of the video."
+ },
+ {
+ "video_id": "PI-aa6e2a21-c856-4f01-9136-c7936ab70d92",
+ "path": "/workspace/benchmark/video/PI-aa6e2a21-c856-4f01-9136-c7936ab70d92.mp4",
+ "answer": "The video unfolds in a spacious gymnasium with high ceilings and bright lights, prominently featuring the word 'BRAVES' on the wall, indicating the associated school or team. A man in a black tracksuit is seen sitting on the polished wooden floor, dribbling a basketball. He then attempts a trick shot from a seated position, launching the ball towards the hoop. After the shot, he stands up and walks towards the camera, gesturing and speaking. The audio reveals a male speaker expressing anger and frustration, highlighting his ability to shoot from beyond half court while sitting and responding to comments directed at him. He mentions trying a technique or exercise with an uncertain emotional tone. Another person, also in a black tracksuit, enters the frame holding a basketball. The first man gestures towards the second, who then walks to the other end of the court. The first man continues to talk and gesture, possibly explaining or commenting on the trick shot attempt. The second man reaches the far end of the court, near the basketball hoop, and prepares to take a shot. The first man watches from a distance as the second man attempts a shot. The audio includes a brief interjection from the second man in a neutral tone, saying 'Nor human,' before the first man concludes with a statement of self-identity, claiming to be the 'Avenger.' The video ends with the second man making a shot, and the ball going through the hoop. The overall scene involves a dynamic and somewhat chaotic dialogue, with emotions ranging from frustration to neutrality, and an atmosphere suggesting a mix of personal defense and casual banter."
+ },
+ {
+ "video_id": "PI-49404913-56ec-495f-937b-315b7f404c56",
+ "path": "/workspace/benchmark/video/PI-49404913-56ec-495f-937b-315b7f404c56.mp4",
+ "answer": "The video showcases a woman in a black top, blue jeans, and black high heels dancing energetically on a concrete surface outdoors on a sunny day. Against the backdrop of a large beige wall and a clear blue sky, she begins her dance near a large rock and some cacti, with a few onlookers visible in the background. As the video progresses, she moves to a flat concrete area and then to a sloped surface, maintaining her dynamic and fluid movements throughout. Her dance, characterized by jumps, steps, and expressive gestures, highlights her confidence and rhythm. Meanwhile, the audio in the background consists of ambient music without discernible vocals or lyrics, enhancing the overall atmosphere of the scene. This combination of energetic dance and subtle background music creates a vibrant yet harmonious urban setting, capturing the essence of public space and the woman's spirited performance from various angles."
+ },
+ {
+ "video_id": "PI-8ee37f98-8967-491e-81c3-4287c5ba762c",
+ "path": "/workspace/benchmark/video/PI-8ee37f98-8967-491e-81c3-4287c5ba762c.mp4",
+ "answer": "The video begins with a man in a dark suit, white striped shirt, and yellow tie speaking in a studio setting, with a blurred background featuring hints of red and blue colors. The scene transitions to a close-up of a Crayola crayon box labeled 'Limited Edition 8 Retired Colors,' which is yellow with green accents and features cartoonish crayon characters. The text on the screen announces that for the first time in its 122-year history, Crayola is bringing back a collection of retired colors. The video then shows a close-up of several Crayola crayons, with the text explaining that the crayons will return as part of a limited edition eight-count pack. Specific colors like green, yellow, orange, and red are highlighted, with the text noting that the company has been flooded with requests over the years to bring back popular colors such as Dandelion, Blizzard Blue, and Magic Mint. The video returns to the Crayola box, emphasizing the 'Limited Edition 8 Retired Colors' and the characters on the box, reiterating that Crayola hopes the crayons will inspire even more creative moments for fans. The video wraps up with the man in the suit continuing to speak, with the text 'Inspire even more creative moments for our fans' displayed on the screen, reflecting the enthusiastic tone of the audio, which expresses happiness about the reintroduction of these beloved colors."
+ },
+ {
+ "video_id": "PI-8c72f852-2d3d-4458-a9a6-c6f67ede15c8",
+ "path": "/workspace/benchmark/video/PI-8c72f852-2d3d-4458-a9a6-c6f67ede15c8.mp4",
+ "answer": "The video features a group of contestants in blue sports jerseys with white numbers, participating in a tense challenge within a stark, white-walled room. The scene unfolds with one contestant bending over to place items into a brown paper bag, while others observe, their expressions a mix of disbelief and frustration. The text overlay highlights their astonishment that a 'Constant' is bringing them down, amplifying the tension. A bright light beam focuses on the bag, adding a dramatic effect. In the background, the audio captures multiple conversations: a male speaker expresses frustration, describing a game where the last person to enter the bag is left out, while a female speaker finds it difficult to perceive the white walls, suggesting a sense of challenge or discomfort. Another male speaker responds with surprise and suspicion, indicating urgency. The female speaker's emotional pleas for consent and expressions of distress add to the jittery atmosphere, enhanced by moderately paced ambient music. As the challenge progresses, contestants react with increasing frustration; one sits dejectedly against the wall, while another gestures in disbelief, emphasizing the unfolding drama and their collective distress."
+ },
+ {
+ "video_id": "PI-bb6f3387-102a-4fa1-88b2-d12cf2b723aa",
+ "path": "/workspace/benchmark/video/PI-bb6f3387-102a-4fa1-88b2-d12cf2b723aa.mp4",
+ "answer": "The video begins with a sleek, futuristic black flying car hovering in a minimalist white room, showcasing its aerodynamic design with a pointed front and smooth curves. The scene then transitions to a man sitting in a room illuminated with blue and purple lighting, speaking into a microphone. He is dressed in a dark t-shirt featuring a graphic design. As the video progresses, it alternates between the man speaking and various angles of the flying car, highlighting its advanced technology and features, such as the illuminated orange lights on its underside and the modern dashboard with a large screen displaying the word 'SONY' and various controls. The male speaker, maintaining a neutral or contemplative emotion, discusses the aspiration of owning flying cars in the future. He humorously contrasts this vision with the current reality, where people spend excessive time on trivial activities like sharing memes, spending fourteen hours a day on the toilet sending memes to each other. The ambient music, with its steady tempo and light, unobtrusive atmosphere, complements the casual and reflective tone of the scene. The video concludes with another view of the flying car hovering in the white room, emphasizing its sleek appearance and the futuristic possibilities it represents."
+ },
+ {
+ "video_id": "PI-401a41d8-705f-4567-bc5a-832e11fcc1e1",
+ "path": "/workspace/benchmark/video/PI-401a41d8-705f-4567-bc5a-832e11fcc1e1.mp4",
+ "answer": "The video features a person dressed in an elaborate and colorful costume walking down a dimly lit street at night, with the costume including a black base adorned with gold and white accents, red and gold beads, orange pom-poms, and golden fabric. The individual wears a large, ornate headdress decorated with fruits like pineapples and bananas, along with other vibrant elements. They perform dynamic movements such as squatting, raising their arms, and spinning, showcasing the vibrant details of their attire. The background reveals a quiet street lined with trees and buildings, some with illuminated windows, and a few potted plants along the sidewalk, creating a lively and festive atmosphere. Meanwhile, the audio is a rap track featuring a female artist with a confident and assertive tone, delivering rhythmic and energetic lyrics that enhance the lively and engaging mood of the scene. The performer's energetic actions, combined with the vibrant musical backdrop, create a dynamic and festive experience."
+ },
+ {
+ "video_id": "PI-06ae1fac-adb2-40a8-995b-dc7c9cc866bd",
+ "path": "/workspace/benchmark/video/PI-06ae1fac-adb2-40a8-995b-dc7c9cc866bd.mp4",
+ "answer": "In a harmonious fusion of audio and video, a unique and uplifting experience is created. The audio is a track of a song where the male singer croons in a positive mood. The lyrics of the song mainly express the uniqueness of a person, filling the air with an air of positivity and self - appreciation. The video, on the other hand, presents a series of delightful scenes. It is divided into three parts. The first scene is set in a meadow at dusk, where a Golden Retriever can be seen running happily with a piece of camera equipment in its mouth. The second scene shifts to a sunny forest. The Golden Retriever explores the forest, looking around curiously before jumping a little and continuing its run. The final scene takes place in a snow - covered clearing in the sun. The Golden Retriever runs left and right in the snow, seemingly having the time of its life. The video captures the Golden Retriever's daily joy of running through various outdoor scenes, all while chewing on the camera equipment. While the audio celebrates the uniqueness of an individual, the video showcases the carefree and unique spirit of the Golden Retriever, creating a charming and wholesome combination that is both visually and aurally appealing."
+ },
+ {
+ "video_id": "PI-20da628a-49c2-4185-a532-688df2f9589a",
+ "path": "/workspace/benchmark/video/PI-20da628a-49c2-4185-a532-688df2f9589a.mp4",
+ "answer": "In a recent development, both the audio and video content highlighted the breaking news from the Supreme Court concerning the potential ban on TikTok. The audio featured a speaker reporting in a neutral and professional tone that President-elect Donald Trump had requested the judge to lift the possible ban on TikTok, seeking an extension to find a political solution. Meanwhile, the video began with a news anchor, dressed in a purple sports top and a white top, broadcasting the same news, emphasizing that Trump had urged the Supreme Court to suspend the TikTok ban. The anchor further explained that TikTok must sever ties with its Chinese parent company by January 19th to continue operating in the United States, a point reinforced by a close-up of the TikTok logo on a smartphone screen. The video also included a man in a suit speaking into a microphone, noting that this deadline falls on the day before the inauguration ceremony. Additionally, the video showcased the Department of Justice building, highlighting that TikTok and the Department of Justice had submitted opposing briefs to the court, with the debate scheduled for January 10th. Together, the audio and video content provided a comprehensive overview of the legal proceedings and the timeline surrounding the potential TikTok ban, underscoring the urgency and significance of the issue."
+ },
+ {
+ "video_id": "PI-50145a3f-b9d8-4fd9-a2a7-168271505896",
+ "path": "/workspace/benchmark/video/PI-50145a3f-b9d8-4fd9-a2a7-168271505896.mp4",
+ "answer": "The audio clip presents a straightforward scenario where a male speaker briefly utters \"Go,\" signaling an instruction or direction, accompanied by the call of a little boy, creating a scene of direct verbal interaction devoid of additional background or emotional nuances. In contrast, the video unfolds with a person donning a black basketball jersey featuring blue trim and the number '0', standing indoors against a backdrop that includes a whiteboard, a string of red LED lights along the ceiling, and an array of posters and certificates on the wall. The individual's hand gestures and body language suggest they are engaged in conversation or reacting to an unseen stimulus, with a moment of confusion or disbelief captured as they raise a fist to their ear. Subsequently, the person bursts into energetic movement, punching the air with both fists and executing exaggerated gestures, as if demonstrating or mocking a concept, encapsulating a blend of confusion, frustration, and a touch of humor. While the audio's concise \"Go\" serves as a catalyst for potential action or reaction, the video enriches the narrative by visually depicting the person's emotional and physical response, creating a dynamic interplay between the auditory and visual elements."
+ },
+ {
+ "video_id": "PI-1ccc9945-08cd-4b27-9533-7705f628c867",
+ "path": "/workspace/benchmark/video/PI-1ccc9945-08cd-4b27-9533-7705f628c867.mp4",
+ "answer": "The video captures the daily routine of a person, wearing an orange shirt with a logo, preparing breakfast and lunch for their partner before heading to work. The scene opens with the text \"Before going to work, I make breakfast for the baby's mother at home,\" setting the tone for the morning tasks. A close-up follows, showing a hand cracking an egg into a frying pan on the stove, emphasizing the hands-on nature of the preparation. The video then transitions to the person reaching into the refrigerator, grabbing a bottle of juice, and placing a breakfast sandwich made of English muffins, eggs, and cheese, wrapped in cling film, into the fridge. The individual, now in the kitchen with the coffee machine and cabinets in the background, continues with the preparations, all while the same text overlay remains visible. The audio, featuring a male narrator with a casual and slightly humorous tone, complements the visuals by describing the process of cracking eggs, packing lunch, preparing coffee, and doing the dishes, highlighting the speaker's morning responsibilities and the efforts made to support their partner. The video concludes with a touching gesture—a notebook with a handwritten note: \"Cole is ready. Breakfast on the bridge. Have a wonderful day. I love you,\" encapsulating the love and care put into these daily rituals. Together, the audio and video paint a heartwarming picture of a morning routine filled with thoughtful gestures and familial love."
+ },
+ {
+ "video_id": "PI-5c9090f1-0936-4201-89c8-5869f51fa386",
+ "path": "/workspace/benchmark/video/PI-5c9090f1-0936-4201-89c8-5869f51fa386.mp4",
+ "answer": "The video shows a series of humorous live televised blunders with a blue and yellow gradient in the background and the text \"A WHOLE ACCOUNT DEDICATED TO LIVE TV BLOOPERS\". It shows two women in a studio, one in a black blazer and purple top, the other in a red cardigan and pearl necklace, joking. Words such as \"PALE,\" \"ON AIR,\" \"WASHINGTON,\" and \"LOWLANDS\" appear on screen, reflecting their relaxed conversation. The audio captures a variety of emotional tones, including a neutral observation of someone's skin color and concerns about the severe snowfall affecting the Washington Lowcountry. A female voice is incoherent, possibly apologetic, with a tone of confusion or regret. The video also includes a close-up shot of a police officer named Deidre Menged, a neon sign that reads \"TOWER\" and \"DELICATESSEN\" and a humorous veterinary scene where a woman in a pink top discusses a puppy with a veterinarian, and words such as \"CAN,\" \"US,\" \"TO,\" \"TEETH'S,\" and \"BIT\" appear. The female voice displays a playful and instructive tone, describing the activation of the sign and mimicking the action of brushing teeth, adding to the overall casual and relaxed vibe of the video."
+ },
+ {
+ "video_id": "PI-b36208d4-e2ab-4fdc-8ef1-472c112137d3",
+ "path": "/workspace/benchmark/video/PI-b36208d4-e2ab-4fdc-8ef1-472c112137d3.mp4",
+ "answer": "This video takes place in a rustic stable or barn, characterized by straw scattered on the earthen floor and an iron structure in the background, which might be part of a livestock pen or feeding area. The main focus is on the interaction between human hands and horses. This person is holding a black container, which might contain a substance used for horse care. It can be seen that he is applying yellow ointment or treatment to the horse's hooves. Throughout the process, the horse lifted its hind foot, indicating the degree of trust and familiarity between the animal and the human. The audio in the video is minimal, with some faint and blurry sounds that do not interfere with the main focus. The key text coverage is like this: \"I didn't register this when I bought the horse in my teens.\" This statement emphasizes the unexpected responsibilities and challenges brought about by raising horses, especially the practical operations that the new owner might not have anticipated. The core theme of the video revolves around the reality of taking care of horses, emphasizing the meticulous and sometimes unpleasant responsibilities. It depicts a moment of daily maintenance. Although necessary, the owner might not have fully considered this when buying the horse for the first time. This narrative helps to inform, perhaps humorously remind the audience, the commitment involved in caring for a horse, apart from the initial excitement of ownership."
+ },
+ {
+ "video_id": "PI-dd194da3-28ee-43e3-9484-9f3cd5888874",
+ "path": "/workspace/benchmark/video/PI-dd194da3-28ee-43e3-9484-9f3cd5888874.mp4",
+ "answer": "In a paved urban square, a street performer entertains onlookers with a unique act. The man, wearing a blue shirt and a baseball cap, skillfully manipulates a small skeleton marionette. Using a control bar and strings, he brings the puppet to life, making it dance rhythmically on the asphalt. The skeleton, complete with tiny black shoes, performs a series of fluid, modern dance moves that are surprisingly lifelike. His setup, including speakers and a pink bag, sits nearby. In the background, several people are seated on a low wall, watching the performance against a backdrop of green grass, trees, and large, classical-style buildings. "
+ },
+ {
+ "video_id": "PI-9235e6b5-25ac-4966-a72e-d796d310800c",
+ "path": "/workspace/benchmark/video/PI-9235e6b5-25ac-4966-a72e-d796d310800c.mp4",
+ "answer": "The video captures a vibrant scene at the Yung Gravy outdoor music festival, illuminated by bright purple stage lights that cast a lively hue over the entire venue. The stage features a large screen displaying the event's name and various visual effects, enhancing the energetic atmosphere typical of such festivals. In the foreground, a person wearing a dark blue T-shirt with 'West Virginia' printed on it faces the camera, surrounded by a dense crowd of attendees. The text overlay on the video reads, 'When you launch a pulled pork sandwich at Yung Gravy,' hinting at a humorous or regrettable incident involving a pulled pork sandwich during the event. Despite the unclear emotional tone and incomprehensible elements in the speech of some attendees, as noted in the audio summary, the video effectively conveys the collective experience and dynamic environment of the festival. The absence of distinguishable background music or additional auditory elements in the audio summary contrasts with the visual liveliness, emphasizing the visual and experiential aspects of the event. The video encapsulates the essence of the festival's energetic ambiance and the shared enthusiasm of the participants."
+ },
+ {
+ "video_id": "PI-5b24823d-3781-4f31-a93d-1615e0a78ff4",
+ "path": "/workspace/benchmark/video/PI-5b24823d-3781-4f31-a93d-1615e0a78ff4.mp4",
+ "answer": "The feature of this video is an engaging street interview conducted in front of a store with large glass Windows and a wooden exterior wall. The interviewer, wearing a black suit with white text and star patterns, raised such a question: \"In which year of birth is a person considered old?\" This question lays the foundation for the humorous exploration of intergenerational concepts. The first group of respondents consisted of two people. One was wearing a black sweatshirt and jeans, and the other was wearing a grey sweatshirt and light-colored trousers. The person in the grey hoodie replied, \"2000.\" Then the scene switches to another group of three people, and the interviewer repeats the same question. The new respondents were dressed more energetically: One wore a white down vest, paired with a black suit and beige sandals; Another one was wearing a bright pink sportswear, paired with a white down vest and pink shoes. The respondents wearing pink sportswear also answered \"2000\", which was exactly the same as the previous answer. The video ended with this consistent response, emphasizing that respondents generally regarded people born in 2000 as elderly. This video, set against the backdrop of bustling city streets, effectively captures the views of a generation on age."
+ },
+ {
+ "video_id": "PI-4e669029-bb7d-4a83-9813-73d99d2c5dd5",
+ "path": "/workspace/benchmark/video/PI-4e669029-bb7d-4a83-9813-73d99d2c5dd5.mp4",
+ "answer": "The video features a person sitting in the driver's seat of a car, wearing a green sweatshirt with yellow accents and a gold chain necklace. The car's interior is white, and through the windshield, a residential area with houses and a cloudy sky is visible, creating a quiet, suburban backdrop. The person, who is wearing a seatbelt, appears to be in a contemplative or distressed mood, as indicated by their hand gestures and occasional touching of their face. Subtitles on the screen reveal their thoughts: 'I think there must be something wrong with me,' 'minus Christmas is coming,' and 'but I'm not happy I don't feel the way I'm supposed to feel.' The background music complements this emotional tone with a soft, perhaps melancholic instrumental track that enhances the overall mood of personal reflection and emotional struggle. The BGM provides a subtle emotional background, focusing on instrumental elements to convey emotions without any clear vocals or voiceovers, thereby deepening the impact of the visual narrative."
+ },
+ {
+ "video_id": "PI-c6dd8b4a-36d1-4ed8-a7e1-e13d05db2945",
+ "path": "/workspace/benchmark/video/PI-c6dd8b4a-36d1-4ed8-a7e1-e13d05db2945.mp4",
+ "answer": "The video beautifully captures the transformative journey of a rescued dog over a year, enhanced by a soft and melodious background music (BGM) track that deeply enriches the emotional narrative. The BGM, characterized by its gentle, soothing tunes, mirrors the tender care and gradual improvement in the dog's life, creating a poignant atmosphere that draws viewers into the story. The video begins with the dog in a state of neglect, lying on a beige dog bed next to a bed with white bedding in a room featuring a wooden floor, a rug, and a partially open window that allows natural light to filter in. A red toy on the floor near the bed adds a subtle touch of color to the scene, while the text \"When I found her on the streets\" appears on the screen, highlighting the dog's difficult past. A close-up shot emphasizes the dog's thin body and visible ribs, underscoring its poor condition at the time of rescue. The narrative then transitions to a year later, where the dog is seen in a cozy, festive setting decorated for Christmas, complete with a lit tree and wrapped presents. Now wearing a white sweater and lying comfortably on a beige rug, the dog appears healthier, more relaxed, and content. The text \"1 year later\" marks the passage of time, and a subsequent clarification, \"No she’s not pregnant..\", addresses the dog's rounded belly as a result of weight gain, not pregnancy. Throughout this heartwarming transformation, the soft and soothing BGM continues to play, its melodic tunes complementing the visual journey and reinforcing the core messages of care, compassion, and renewal that define the dog's remarkable improvement in its new, loving environment."
+ },
+ {
+ "video_id": "PI-57a843f3-0062-496d-8432-a9cce3c46627",
+ "path": "/workspace/benchmark/video/PI-57a843f3-0062-496d-8432-a9cce3c46627.mp4",
+ "answer": "This video presents a humorous scene from the online game Roblox. In what appears to be a horror-themed game, a group of player avatars is being chased by a tall, shadowy monster with glowing eyes. One player stands out, controlling an avatar dressed as a priest, holding a bible and leading the escape. As the frantic chase unfolds through dimly lit rooms and narrow brick hallways, a powerful and dramatic gospel choir song serves as the soundtrack. The juxtaposition of the terrifying chase and the uplifting music creates a comedic effect. This is further enhanced by text bubbles that appear on screen, showing the priest character rebuking the monster 'IN THE NAME OF JESUSSS' and quoting scripture to protect the group.The clip highlights a moment of creative and funny role-playing within the gaming community, blending elements of horror, religion, and meme culture into a short, entertaining video."
+ },
+ {
+ "video_id": "PI-07ce65a8-a227-438a-9d50-a08c56c16219",
+ "path": "/workspace/benchmark/video/PI-07ce65a8-a227-438a-9d50-a08c56c16219.mp4",
+ "answer": "This is an integrated account of both the audio and video content. The video, as indicated by the NBC News logo and the date January 14, 2025, in the top corners of the frames, features a man in a blue suit and red tie speaking at a hearing or formal event in Washington, D.C. The backdrop shows an audience, with some individuals, like a woman with blonde hair and another in a maroon top, in clear view. The man, speaking into a microphone, is the focus, and text overlays offer context to his words. The first overlay states, 'Hegseth says he does pushups in sets of 47 in an apparent nod to Trump.' Subsequently, the audio comes into play, capturing the voice of a male speaker, presumably Hegseth, from the hearing, who asks another man about his push - up capacity in a somewhat challenging tone. The other man, in this case Hegseth himself through his speech in the video, responds proudly, \"I did five sets of 47 this morning,\" as shown in another text overlay. The final overlay reveals, 'Trump will be the 47th U.S. president when he takes office,' and Hegseth concludes with, 'My brother saw it this morning.' The audio and video together highlight a moment where Hegseth, through his statements about doing push - ups in sets of 47, seems to be making a reference to Trump, possibly as a sign of respect or alignment with Trump's upcoming presidency, emphasizing the significance of the number 47 in this context."
+ },
+ {
+ "video_id": "PI-e44358e9-6a19-4dd4-9ff2-412ad8cf3222",
+ "path": "/workspace/benchmark/video/PI-e44358e9-6a19-4dd4-9ff2-412ad8cf3222.mp4",
+ "answer": "The video features a young child in a pink shirt and dark pants exploring a bedroom filled with various items, including a quilted-pattern dresser, a mirror, and hanging clothes. The child interacts with the dresser, touching and pointing at the drawers, while occasionally glancing towards the camera. The background includes a television showing a colorful cartoon, a shelf with bottles and containers, and a backpack on the wall, capturing a moment of curiosity. Complementing this scene, the audio track includes a vague statement from a person and the child shouting, expressing tense or excited emotions. The combination of the child's exploratory actions and the vocal expressions creates a dynamic and engaging portrayal of a typical childhood moment, highlighting both the visual and emotional elements of the scene."
+ },
+ {
+ "video_id": "PI-c5ba83f2-7f46-4003-97e7-830968e9a6d2",
+ "path": "/workspace/benchmark/video/PI-c5ba83f2-7f46-4003-97e7-830968e9a6d2.mp4",
+ "answer": "The video features a family of four gathered around a laptop on a wooden table, engaged in a humorous activity. The family includes a man holding a gray chicken, a young woman in a brown hoodie, a young man in a green shirt, and an older woman in a black jacket. They are all focused on the laptop screen, with the man occasionally pointing at it while holding the chicken. The background shows a staircase and a wall adorned with a decorative wreath. A text overlay at the bottom of the screen reads, \"Pretending our chicken got into Harvard bc we’re bored,\" indicating their playful pretense to combat boredom. The atmosphere is light-hearted, and as the video progresses, the family's expressions shift to surprise and amusement, with laughter erupting among them. The background music complements the scene with its cheerful melody, enhancing the relaxed and humorous ambiance, and contributing to the overall comedic theme of the video. This combination of visual and auditory elements creates a cohesive and entertaining experience, capturing a moment of shared joy and humor within the family."
+ },
+ {
+ "video_id": "PI-886b1a76-ec9f-4cd7-b623-2ef3ff026028",
+ "path": "/workspace/benchmark/video/PI-886b1a76-ec9f-4cd7-b623-2ef3ff026028.mp4",
+ "answer": "The video begins with a close-up of the Starbucks ordering screen, on which a highly customized beverage order is displayed. Many add-ons and modifications are listed on the screen, including various types of milk, raw cream, cream pudding and syrup. The initial total order amount was $16.80, but as more goods were added, the total amount rose sharply to $1,293.60 and eventually reached $1,478.40. The text above reads: \"Oh my god, when did Starbucks become so cheap?\" Being surprised at the seemingly low price is actually the result of excessive customization. Then the scene switches to a Starbucks barista, wearing a red apron and standing behind the counter, holding an electronic device in his hand. The same text message, \"Oh my god, when did Starbucks become so cheap?\" It is covered on the screen. The barista seems to be communicating with someone outside the camera, perhaps discussing something. Then, the barista extends the device towards the camera, and the Starbucks logo is displayed on the screen. Throughout the entire interaction process, the text remained on the screen. The barista gestured as if explaining something, and then turned and left the counter. The barista seemed to be explaining something, gesturing with his hand. Throughout the entire interaction process, the text remained on the screen. Then, the barista turned and left the counter, but the camera followed closely behind, capturing the barista's movements. At the end of the video, the barista was still moving around, leaving the counter area, emphasizing the humorous and exaggerated nature of the total cost of custom orders."
+ },
+ {
+ "video_id": "PI-23514543-2ece-4247-885c-15999d492128",
+ "path": "/workspace/benchmark/video/PI-23514543-2ece-4247-885c-15999d492128.mp4",
+ "answer": "This video features a series of food-related scenes, each accompanied by a text overlay, conveying an individual's insights and philosophy on diet. The beginning of the post shows a person holding a piece of fried chicken in one hand and a fork of pasta dipped in orange juice in the other, with the words \"Because TIKTOK will be banned in two days\" and \"I think there's still a secret to share with you...\" written on it. Appear on the screen. This sets the tone of a sense of urgency and personal enlightenment. The scene switches to the same person, holding a large tacos dripping with sauce, next to a box of soup and a serving of tortilla chips. The same text is covered on top, emphasizing the importance of sharing personal secrets. In the video, a person is holding a fork of food from a container filled with fries, meat, cheese and other ingredients, with the words \"Food is culture\" written on it, highlighting the cultural significance of the food. The video ends with a close-up of a bowl of noodles, decorated with sausages, boiled eggs, dumplings and chopped scallions, accompanied by the words \"Food is fuel\", emphasizing the practical use of the food. Throughout the video, the background music is relaxed and casual, giving people a sense of ease. The dialogue in this audio adds a personal and humorous touch to the video, further emphasizing the theme of individuality and personal preferences in food and entertainment choices. The overall atmosphere is positive and indulgent, celebrating the joy and cultural significance of food."
+ },
+ {
+ "video_id": "PI-0a42c578-1293-4e68-8748-a5bdc3f0c3d9",
+ "path": "/workspace/benchmark/video/PI-0a42c578-1293-4e68-8748-a5bdc3f0c3d9.mp4",
+ "answer": "The video presents an intimate and soft - toned scene, featuring a person lying on a bed adorned with a beige pillow and a red and white striped blanket. The person is dressed in a white sleeveless top with small black bows on the front and a black choker necklace, their hair held back with white hair clips. They adjust their hair and move closer to the camera, offering a close - up view of their face and hair, with light and playful movements that add to the gentle atmosphere. Meanwhile, the audio features a female speaker calmly and reassuringly repeating the phrase \"I'll make it worth it,\" which conveys a promise or commitment to ensure that something will be valuable or rewarding. This audio element perfectly complements the video's theme of reassurance or commitment, as it aligns with the overall soft and intimate mood of the scene. The combination of the visual elements, such as the person's attire and the cozy bed setting, and the audio's soothing repetition of the promise, creates a cohesive and engaging narrative that evokes a sense of comfort and trust."
+ },
+ {
+ "video_id": "PI-4b8ea03c-4aab-4821-9f78-7471407cf0ed",
+ "path": "/workspace/benchmark/video/PI-4b8ea03c-4aab-4821-9f78-7471407cf0ed.mp4",
+ "answer": "The video features a person sitting at a table with a granite countertop, wearing a bright pink hoodie with red text that reads 'NONCLAN UNIVERSITY' and a black cap, set against a backdrop of a ceiling fan and a neutral-colored wall, suggesting an indoor kitchen or dining area. The individual is engaged in a casual and conversational monologue, using hand gestures to emphasize their points. Text overlays appear on the screen, providing context to their speech: 'like my women black and petite mix with pretty ash' indicates a discussion about personal preferences in women; 'they say shooter shoot...' implies a reference to taking action or seizing opportunities; and 'zoeeoliviaa_wsp with you??' poses a playful or inquisitive question directed at someone named Zoee Olivia, possibly inquiring about her contact with the viewer. Meanwhile, the audio clip complements the visual content with a musical segment that has a relatively stable rhythm, contributing to a neutral or ambiguous atmosphere that enhances the casual or intimate environment of the video."
+ },
+ {
+ "video_id": "PI-ebe52af4-63eb-4bf5-8fe8-3d74efd1d7f2",
+ "path": "/workspace/benchmark/video/PI-ebe52af4-63eb-4bf5-8fe8-3d74efd1d7f2.mp4",
+ "answer": "This video segment captures a live auction event in an indoor space that functions as both a studio and a storage area. A young man with bleached blonde hair and tattooed arms is seated at a desk, actively managing the auction. He holds and presents a green quilted enamel leather long wallet, the item being auctioned. The background reveals shelves filled with red boxes, a black-and-white checkered floor, and various personal touches, including colorful ambient lights and a large screen, creating a distinct, personalized workspace. On-screen text overlays continuously update details of the \"LIVE AUCTION,\" including product information, current bid amounts, and the remaining time, encouraging participants to keep bidding. The auctioneer's voice is clear and energetic, guiding the process with countdowns and motivating engagement. Bidding starts at one dollar and rapidly escalates, ultimately closing with the wallet sold for \\$200. The scene effectively conveys the fast-paced nature of an online auction.\n"
+ },
+ {
+ "video_id": "PI-67708f07-804b-4f5d-bb82-2b905b6d260d",
+ "path": "/workspace/benchmark/video/PI-67708f07-804b-4f5d-bb82-2b905b6d260d.mp4",
+ "answer": "The video depicts a scene where two individuals are engaged in drilling a hole in a wooden floor. One person, wearing a green hoodie and jeans, initially bends over the hole, while another, in a gray hoodie, operates a green ice auger. The setting features wooden walls with storage pockets, adding a rustic backdrop. As the drilling begins, white foam is revealed on the water's surface, and the person in the gray hoodie continues to drill deeper, aiming to reach a sufficient depth. However, the text 'Came up short' appears, indicating that the initial attempt did not achieve the desired result. Subsequently, an extension is added to the auger, as highlighted by the text 'Extension added'. The drilling resumes, and the auger penetrates further, creating a swirling motion in the foam. Eventually, the auger is lifted, revealing clearer water beneath, and both individuals observe the completed hole, signifying the end of their task. The somber and reflective mood of the background music, with its themes of struggle and conflict, complements the determined effort and perseverance shown in the drilling process."
+ },
+ {
+ "video_id": "PI-045adbd2-f084-4842-bb7d-3ecff302ece8",
+ "path": "/workspace/benchmark/video/PI-045adbd2-f084-4842-bb7d-3ecff302ece8.mp4",
+ "answer": "The audio clip presents a piece of music characterized by a dynamic rhythm, suggesting an energetic and engaging auditory experience. Meanwhile, the video depicts a woman in an office setting, where she is seated at a desk wearing a white sleeveless shirt. The office environment is adorned with two abstract paintings in gold and dark tones on the wall, and a black metal file organizer filled with papers and folders sits on the table. The woman is animatedly shaking her body and using expressive gestures, occasionally leaning back in her chair and gazing upward, perhaps lost in thought or emphasizing a passionate idea. The bright lighting in the room adds to the professional atmosphere, while her lively movements and facial expressions convey a relaxed and cheerful demeanor. Together, the dynamic rhythm of the music and the woman's spirited gestures create a vivid contrast between auditory and visual elements, each enhancing the other's energetic essence."
+ },
+ {
+ "video_id": "PI-e5784e8e-6db4-4d1d-882c-1c71cf68849e",
+ "path": "/workspace/benchmark/video/PI-e5784e8e-6db4-4d1d-882c-1c71cf68849e.mp4",
+ "answer": "The video captures a warm and intimate scene set in a well-lit, cozy living room, where a woman with shoulder-length dark hair sits gracefully, wearing a strapless beige top and a delicate necklace. The soft-colored wall, a lit wall lamp, and a glass coffee table adorned with books and a small decorative item create a serene backdrop, while a curtain is partially visible on the left side of the frame. As she engages in what appears to be a conversation or monologue, her expressive hand gestures and the movement of her mouth suggest she is speaking animatedly, occasionally glancing to her left and right as if addressing someone off-camera or reflecting on different points. Her relaxed and natural demeanor contributes to the inviting atmosphere of the scene. Complementing this visual content, the background music features a female vocalist expressing a playful and flirtatious mood, with a soft, rhythmic melody and a gentle beat that enhances the overall ambiance, creating a relaxed and intimate setting that perfectly mirrors the woman's casual and engaging presence."
+ },
+ {
+ "video_id": "PI-7be37528-d874-4594-9190-58db958d75a8",
+ "path": "/workspace/benchmark/video/PI-7be37528-d874-4594-9190-58db958d75a8.mp4",
+ "answer": "The video begins with a woman walking through a room filled with shelves and various items, dressed in a black long-sleeve top with a gold design on the front and black pants. The background includes a white door and shelves stocked with colorful items. The scene transitions to the woman standing in front of a row of green lockers, where she lifts her top to reveal her midriff. The lockers are filled with various items, including wigs and cleaning supplies. The focus then shifts to a man standing in front of the same lockers, wearing a white t-shirt with a graphic design, light blue jeans, and white socks. He lifts his shirt to reveal his midriff as well and starts dancing energetically, moving his arms and legs in various directions. Meanwhile, a male speaker with an indeterminate emotional tone briefly utters the phrase 'baby all' in English, accompanied by ambient music featuring a rhythmic tempo and neutral atmosphere, creating a subtle background to the playful and lighthearted scene in the storage or utility room."
+ },
+ {
+ "video_id": "PI-41f85a6e-33b9-4492-b4e3-e55df8611d9a",
+ "path": "/workspace/benchmark/video/PI-41f85a6e-33b9-4492-b4e3-e55df8611d9a.mp4",
+ "answer": "The video captures an exhilarating moment from a football match under the 'Concacaf CHAMPIONS CUP'. It opens with a view of a well-marked green field, setting the stage for the action. A player in a light blue uniform sprints across the field, but the focus soon shifts to a player in a black uniform who leaps to head the ball, skillfully evading two defenders in light blue. Demonstrating exceptional control and dribbling, the player in black advances towards the goal, pursued by defenders. Despite the goalkeeper's attempt to block the shot, the player successfully scores, sending the ball rolling into the net and celebrating the goal. The scene then transitions to a close-up of the player, revealing the number 10 and the name 'MESSI' on his jersey, as he continues to celebrate near the goalpost, wearing green gloves and a captain's armband. The background features a person in a bright yellow jacket, a fence, and some purple and white items. The video is accompanied by a vibrant and rhythmical pure music track, typically associated with sports events, which incorporates drum beats, electric guitar, and synthesizer effects to create an inspiring atmosphere. This music, with its fast rhythm and dynamic melody, echoes the tension and excitement of the match, enhancing the emotional and energetic atmosphere of the scene."
+ },
+ {
+ "video_id": "PI-44e91a8a-5259-4816-85f7-ab165552cbd1",
+ "path": "/workspace/benchmark/video/PI-44e91a8a-5259-4816-85f7-ab165552cbd1.mp4",
+      "answer": "The video begins with a vibrant scene in which two cartoon characters are set against a pink background. A green character smiles, wearing a green hat and holding hands, while a blue character has a battery icon on its chest, indicating that it is a battery-powered creature. The light blue text above reads: \"I had a great time in social activities!\", expressing the joy of social interaction. The scene transitions to a close-up of the blue character, now shown as a battery with a green top and a red bottom, symbolizing its energy level. The background changes to light blue, emphasizing the focus on the battery. The story then shifts to a pink room, where the green character looks shocked, mouth wide open and hands raised, surrounded by what appear to be fragments of walls or furniture. In the background, a blue figure looks very distressed, and a large crack on the wall behind suggests some form of chaos or accident. The video ends with humorous, exaggerated scenes highlighting the ups and downs of the characters' adventures. The audio includes phrases such as \"I had a great time\", \"Tomorrow is a big day\", \"The dog has to be released\" and \"Thank you\", adding narrative context and suggesting that a series of events has exhausted the blue character, leaving it needing a long time to recover."
+ },
+ {
+ "video_id": "PI-9c69e120-b7d7-4461-8d33-5dbbdf49d68a",
+ "path": "/workspace/benchmark/video/PI-9c69e120-b7d7-4461-8d33-5dbbdf49d68a.mp4",
+      "answer": "In the video, a person undergoes a dramatic makeup transformation. First, using a tube and their fingers, they apply beige foundation or concealer over their bald scalp. The background is a well-lit room with a white cabinet and some green plants. They then apply the same product to their eyebrows with a small brush and blend it carefully, before working the product across the face with a makeup sponge to ensure even application. Next, they apply a white or light-colored product to the eyebrow area, possibly to highlight or reshape it, followed by white eyeliner or a similar product on the eyelids, and then a brown product along both sides of the face and scalp, likely for contouring. The contours are blended with the makeup sponge for a seamless look. The person then puts on a reddish-brown wig and, now with reddish-brown hair, uses a purple brush to draw eyeliner and enhance their eye makeup. They style the hair, tying it into ponytails, and make the final touches to their appearance. At the end of the video, the person shows the completed look: thick long hair and exquisite makeup, including prominent eyelashes and a neutral lip color."
+ },
+ {
+ "video_id": "PI-9a5f8e4a-1143-417c-b9ec-c0298eaa9f02",
+ "path": "/workspace/benchmark/video/PI-9a5f8e4a-1143-417c-b9ec-c0298eaa9f02.mp4",
+      "answer": "In an animated video, two men play chess, one with white hair and the other with a purple face, while onlookers in the background capture the moment with their cameras. The audio overlay delivers a fierce, emotional speech from a competition contestant, centered on the purple-faced man's perspective. He sounds frustrated and defeated, repeatedly acknowledging his impending loss with strong emotional language and self-deprecating humor, conveying a deep sense of resignation and shame. Meanwhile, the white-haired man remains silent, while the purple-faced man shows clear signs of anxiety and panic through his behavior, such as scratching his head and cheeks, as noted in the captions. Despite the purple-faced man's emotional turmoil, a short interjection in a calm, neutral tone in the audio hints at a possible move, contrasting with the composed demeanor of his opponent. Eventually, the video ends with the purple-faced man winning and shaking hands with his defeated opponent, a moment that contrasts with the desperate expression captured earlier in the audio."
+ },
+ {
+ "video_id": "PI-c36465f0-daac-4e53-b0e1-a2a1abb73b37",
+ "path": "/workspace/benchmark/video/PI-c36465f0-daac-4e53-b0e1-a2a1abb73b37.mp4",
+      "answer": "The video shows a camouflage-patterned pop-up tent set up on a wooden floor indoors, in a cozy living room with a fireplace, a TV playing nature footage of a deer, and a decorated Christmas tree in the background. The opening text reads: \"When he said that blind people were seen through, I thought he was joking...\", setting the stage for showcasing the tent's uniqueness. A person inside the tent, dressed in dark clothes and wearing a hat, is the main focus. They first adjust their position, then manipulate the tent's mesh screen, pulling it back to reveal the indoor environment. The person emphasizes the tent's 180-degree visibility and how the netting slides open from all directions. They demonstrate the spaciousness of the 5x5 tent, pointing out that it could comfortably accommodate two people, or possibly three, and explain how the tent's design maintains a clear view while keeping its occupants concealed. Throughout the video, the person expresses enthusiasm for the tent's potential uses in the upcoming hunting season, repeatedly exclaiming, \"Wow, this is so cool.\" The video effectively showcases the tent's functionality, versatility, and spaciousness, making it an exciting prospect for outdoor enthusiasts."
+ },
+ {
+ "video_id": "PI-35f37f48-2d55-4c76-b0ba-176d115b1ac9",
+ "path": "/workspace/benchmark/video/PI-35f37f48-2d55-4c76-b0ba-176d115b1ac9.mp4",
+      "answer": "The video begins with a serene scene of a baby wrapped in a camouflage onesie, lying on a light blue blanket on a wooden deck. A woman's hand gently touches the baby, who remains still and peaceful, creating a moment of calm and tenderness. As the hand moves away, the camera shifts focus to the wooden railing and the lush green grass in the distance, where a fawn appears, moving gracefully across the lawn. The camera follows the deer's gentle steps, capturing the tranquil beauty of the natural environment. The audio, meanwhile, features a woman's soothing voice, possibly trying to calm the baby, while faint crying can be heard in the background. This combination of visuals and audio emphasizes the peaceful coexistence of the baby and the deer, creating a heartwarming and serene moment that highlights the harmony between humans and nature, as well as the woman's tender care for the baby."
+ },
+ {
+ "video_id": "PI-a98679aa-b935-4f25-859d-4f3c84e5bb09",
+ "path": "/workspace/benchmark/video/PI-a98679aa-b935-4f25-859d-4f3c84e5bb09.mp4",
+ "answer": "The video captures a festive scene in a spacious indoor hall with a high ceiling and wooden beams, filled with people engaged in lively activities. The atmosphere is playful and energetic, with participants moving around, interacting, and holding playing cards with red hearts. One person in a patterned vest over a white long-sleeve shirt holds a card close to another person's face, while the latter, wearing a light-colored sweater, is later seen holding a plate with food near their mouth. The background shows more people, some standing and some sitting at tables covered with white tablecloths, laughing and engaging with each other. The focus then shifts to a person in a green sweater kneeling on the floor, holding a playing card close to their mouth and interacting with it playfully, while another person in a light blue denim jacket points at the card. The video wraps up with the person in the green sweater still kneeling, now holding the card in their hand and looking down at it, as the background remains busy with other participants. This joyful scene is complemented by the audio, sung by a male speaker with a neutral mood, which describes the festive atmosphere of Christmas, the ubiquitous decorations, and the festive spirit, enhanced by a cheerful rhythm that amplifies the overall merry ambiance."
+ },
+ {
+ "video_id": "PI-28236ee0-9db2-4ca1-b912-59f092c4457d",
+ "path": "/workspace/benchmark/video/PI-28236ee0-9db2-4ca1-b912-59f092c4457d.mp4",
+ "answer": "The audio clip captures a touching moment between a woman and a crying baby, where the woman recounts a heartwarming story about a little boy who used his two-dollar lunch money to buy a small balloon as a gift for his mother. Her voice trembles with emotion, and it's evident that she is deeply moved, possibly to the point of tears. The video complements this emotional narrative by showing an adult woman and a baby in a cozy bedroom at home, adorned with white curtains, lamps, and potted plants in white flowerpots. Throughout the video, the woman is seen speaking with a sad expression, wiping away her tears, while the baby briefly appears and then disappears from the frame. The culmination of the video finds the adult woman in tears once again, her emotional state mirroring the poignant story she shared in the audio. Together, the audio and video create a poignant and tender scene, highlighting the profound impact of simple acts of kindness and the universal language of love and emotion that transcends words."
+ },
+ {
+ "video_id": "PI-357603d3-1949-4d2b-aa05-8bdf04f87361",
+ "path": "/workspace/benchmark/video/PI-357603d3-1949-4d2b-aa05-8bdf04f87361.mp4",
+ "answer": "This short video offers a humorous perspective on the dynamic between a man and his girlfriend's dog. It opens with a shot of a white Labrador retriever sitting on a hardwood floor, looking up attentively. An on-screen caption reads, 'POV: Girlfriend's dog.' The scene then cuts to a young man speaking directly to the camera. The audio features a dubbed dialogue that playfully subverts a classic trope. A voice representing the dog declares, 'This town is big enough for the two of us.' The man questions this phrasing, pointing out that the saying is typically the opposite. In response, the dog's friendly, accented voice confirms they can share because he considers the man his best friend. The video concludes with the dog standing, wagging its tail with an open-mouthed, happy expression."
+ },
+ {
+ "video_id": "PI-2a72146b-c51a-4a87-b602-c40994afa3af",
+ "path": "/workspace/benchmark/video/PI-2a72146b-c51a-4a87-b602-c40994afa3af.mp4",
+      "answer": "The video features a captivating visual and auditory experience that combines a mesmerizing display of a transparent cube with a playful and teasing audio narrative. The video begins with a close-up of a hand holding a transparent cube, gently rotating it to showcase its clarity and pristine condition. The background is softly blurred with blue and purple hues, creating a dreamy and enchanting atmosphere. A purple teddy bear ornament is subtly visible in the background, adding a playful touch to the setting. The audio overlay playfully warns, \"Don't show your girlfriend this...\" accompanied by emojis of a smiling face with hearts and a crying-laughing face, hinting at the cube's potential to charm or amuse. As the video progresses, the hand carefully places the transparent cube onto a dark wooden base. The base is positioned on what appears to be a book or a similar flat surface, providing a stable and elegant foundation for the cube. The audio continues with the same playful tone, reinforcing the idea that the cube is something special and potentially irresistible. Once the cube is securely placed, something magical begins to happen inside it. Blue glowing patterns start to emerge, gradually forming the shape of a beloved cartoon character. The character's image is vivid and dynamic, with different actions and expressions being displayed in a captivating sequence. The blue glow illuminates the cube from within, making the character's image stand out even more against the dark background. The audio remains silent during this visual spectacle, allowing the viewer to fully immerse themselves in the enchanting display. Throughout the video, the camera remains focused on the cube and its internal display, with minimal movement to ensure that viewers can fully appreciate the details and the beauty of the changing patterns. The combination of the transparent cube, the dark wooden base, and the blue glowing character creates a visually stunning and mesmerizing effect that is sure to captivate anyone who watches it. The playful audio narrative adds an extra layer of charm and intrigue, making the video a delightful and engaging experience."
+ },
+ {
+ "video_id": "PI-f8e5b905-362f-4bdc-a763-2c0b005a1044",
+ "path": "/workspace/benchmark/video/PI-f8e5b905-362f-4bdc-a763-2c0b005a1044.mp4",
+      "answer": "Under the brooding clouds stretching across a vast grassland, columns of uniformed U.S. Army soldiers stand in precise formation. As family members, friends and loved ones circulate among the ranks—presumably for a military ceremony—one solitary soldier draws attention by standing unattended.\n\nA small boy in a vibrantly colored hooded jacket detaches from another group and approaches the isolated soldier. With a gentle tap on the arm, the boy instantly shatters the soldier's rigid posture. Kneeling down to the child's level, the soldier's face blossoms into a radiant smile as he draws the boy into a warm, heartfelt embrace. Their shared laughter fills the air before they pose for a photo, the soldier flashing a peace sign.\n\nThis spontaneous act of kindness creates a profoundly moving tableau—a tender human connection piercing through the formal military pageantry, where a child's innocent gesture bridges the divide between warrior and civilian. The poignant contrast between the stern military environment and this unrestrained moment of joy lingers in the memory."
+ },
+ {
+ "video_id": "PI-d458fae0-8510-4d13-9a67-d53f0f374e7a",
+ "path": "/workspace/benchmark/video/PI-d458fae0-8510-4d13-9a67-d53f0f374e7a.mp4",
+      "answer": "This video was shot in a large supermarket, its many shelves stacked with all kinds of items. In the video, a man has his hands cupped around his mouth, seemingly shouting loudly. From the text that appears on screen, it is clear that the man has been mistaken for a shoplifter and is shouting to vent his frustration. Two women in the video are having a conversation with him. The subtitles reveal that the two women at first believed the man had taken items from the supermarket, but eventually realized they had misunderstood him. The video's theme is a misunderstanding about theft: from the audio, it can be heard that after being wrongly accused, the man was upset and shouted loudly in the supermarket. It also serves as a reminder that we should not casually speculate about a person without sufficient evidence."
+ },
+ {
+ "video_id": "PI-bb84e457-9513-4567-b45d-c63ef48d652e",
+ "path": "/workspace/benchmark/video/PI-bb84e457-9513-4567-b45d-c63ef48d652e.mp4",
+ "answer": "The video features a humorous and casual interaction between a man and a woman, set against an urban backdrop with a brick building and a white SUV. The man, dressed in a black quilted jacket and blue turtleneck, approaches the car holding a pizza box, with a text overlay indicating, 'SORRY MY HUSBAND WANTS TO TELL YOU SOMETHING.' The woman inside the car, wearing a dark blazer over a green top and gold hoop earrings, engages in a light-hearted conversation with him. The discussion revolves around a local tip about the Dearborn Meat Market in Detroit, which the man reveals is owned by his uncle. The text overlay changes to reflect their dialogue, including phrases like 'WELL HE'S ON SPEAKER SO IF YOU COULD...' and 'GO AHEAD BABE.' The man expresses a mix of surprise and irony, especially when he exclaims, 'HIS UNCLE OWNS IT! THE THING IS THOUGH... THAT'S AN INSIDER REVIEW!' The woman waves him off, saying, 'YEAH ALRIGHT, GET OUT OF HERE,' as he walks towards a pizza shop. The text overlay then reads, 'CAN'T BE DOING REVIEWS FOR YOUR OWN FAMILY,' highlighting the humorous conflict of interest. The man takes a bite of pizza, quipping, 'OH THIS GREAT RANDOM PLACE THAT MY FUCKING PARENTS OWN,' and gestures with the pizza box, adding, 'GET OUTTA HERE.' The video concludes with the man still holding the pizza box and saying, 'SHE BLEW IT, DID YOU HEAR HIM? I MEAN COME ON,' emphasizing the comedic tone of the exchange. The audio complements the video with a casual and slightly humorous conversation among several people, mainly men, discussing the family-owned meat market, adding a layer of irony and light-hearted banter to the scene."
+ },
+ {
+ "video_id": "PI-d7a250ce-89e3-4540-a8d8-526b036a0598",
+ "path": "/workspace/benchmark/video/PI-d7a250ce-89e3-4540-a8d8-526b036a0598.mp4",
+ "answer": "In a heartfelt and emotional video, a woman records herself as she prepares to evacuate her home due to a nearby wildfire. Visibly distraught and crying, she explains the difficulty of the situation, focusing on her sadness about leaving her beloved podcast studio. She walks through the studio, showing the personal items and decorations that make the space meaningful to her. She explains that the studio is more than just a room; it's the place that connects her to her community and audience. While she has packed essential documents, the thought of losing her home and this special space is overwhelming. She moves to a window to show the cause for the evacuation: large plumes of smoke are billowing from the hills in the distance, demonstrating the fire's proximity. She becomes overcome with emotion, reflecting on how a crisis clarifies one's priorities, and mentions she is choosing to save small, sentimental gifts from fans over expensive equipment."
+ },
+ {
+ "video_id": "PI-8bcad9c2-c6be-496e-a58f-5bb516cd6e6b",
+ "path": "/workspace/benchmark/video/PI-8bcad9c2-c6be-496e-a58f-5bb516cd6e6b.mp4",
+ "answer": "This video captures a demonstration of an advanced AI robotic arm's balancing capabilities. Set against a dark wall and on a floor with a bold black-and-white geometric pattern, a silver robotic hand is mounted on a gray pedestal. The robot skillfully balances a thin black wand on its index finger. A person repeatedly interacts with the setup by pushing the wand, attempting to knock it over. Each time, the robotic arm reacts with incredible speed and precision, making minute and rapid adjustments to keep the wand perfectly balanced. "
+ },
+ {
+ "video_id": "PI-20337f9e-d0ab-424b-995e-a29edeb187f6",
+ "path": "/workspace/benchmark/video/PI-20337f9e-d0ab-424b-995e-a29edeb187f6.mp4",
+      "answer": "The video shows a detailed examination of a smartphone, beginning with the removal of the dark grey back cover to reveal a sleek silver interior. The person then attaches a circular magnetic accessory labeled \"Nice\" to the back of the phone, giving it a shiny silver appearance. The background features a colorful, illuminated keyboard and vivid, detailed imagery on a computer monitor, adding a dynamic and visually engaging element to the scene. The phone turns on and displays a cityscape wallpaper with the time \"14:54\" and the date \"February 22nd\", providing a clear view of its interface. Meanwhile, the background music, a melancholy song performed by a boy, plays softly, its lyrics expressing regret over a loss of communication and enhancing the emotional depth of the video. The video ends with the person holding the phone, highlighting the included accessories and showing the phone's sleek design from multiple angles, providing a comprehensive view of its functionality and design. This combination of visual and auditory elements creates an engaging experience for the viewer."
+ },
+ {
+ "video_id": "PI-7db184fa-bf43-419a-a4c9-887b86fee02b",
+ "path": "/workspace/benchmark/video/PI-7db184fa-bf43-419a-a4c9-887b86fee02b.mp4",
+ "answer": "The video showcases the process of cleaning a heavily stained, beige Croc shoe, originally intended to be white. It begins with a close-up of the dirty shoe on a wooden surface, with a person in a pink sweatshirt pointing out its condition. The person, wearing a dark apron with \"SHOELADA\" and \"LOS ANGELES\" text, sits at a table and prepares to clean the shoe using a green bottle of \"SHOELADA SHOE CLEANER & CONDITIONER.\" They spray the cleaner onto the shoe, creating white foam, and use a wooden brush with black bristles to scrub it. The person dips the brush into a black bowl of water, generating bubbles, and continues scrubbing, spreading the foam. A gray cloth is then used to wipe off the foam, revealing a cleaner surface. The narrative progresses with close-ups of the person's hands, adorned with pink nail polish, vigorously scrubbing the shoe. The person rinses the shoe under running water from a faucet, washing away the foam and dirt. The scene transitions to the clean, white Croc shoe hanging on a metal fence outside, with a large RV and a brick wall in the background. The video concludes with a close-up of the clean shoe held up against a backdrop of yellow flowers and a clear blue sky, highlighting its transformation and the effectiveness of the cleaning process. The speaker's optimistic and positive tone underscores the successful restoration of the shoe's original whiteness using the product and a hard-bristled brush, appropriate for the rubber material."
+ },
+ {
+ "video_id": "PI-5f1b107b-0cd4-4bd3-b5e8-b0ce6c7d3148",
+ "path": "/workspace/benchmark/video/PI-5f1b107b-0cd4-4bd3-b5e8-b0ce6c7d3148.mp4",
+ "answer": "The audio and video content, when combined, create a humorous yet somewhat relatable scenario centered around a prank or a common mishap involving a misplaced or confiscated phone. In the audio, two male voices express confusion and repeatedly apologize, their voices flustered as they struggle to communicate, possibly due to the stress of a prank involving a taken phone. This conveys a sense of urgency and regret, suggesting that they are in a situation where they are trying to rectify a mistake or explain themselves. The video, on the other hand, features a person wearing a white hoodie standing in front of a brightly lit closet filled with various clothes, creating a clear and vibrant scene. The text overlay at the top of the frames, \"that friend who is constantly getting their phone taken:\", sets the context for the video, indicating a humorous or relatable situation about a friend who frequently has their phone confiscated. The person in the video appears to be speaking or reacting in a playful or exasperated manner, which aligns with the audio's tone of confusion and apology, suggesting that they might be the ones involved in the prank or the ones who had their phone taken. The contrast between the audio's sense of urgency and regret and the video's lighthearted and relatable setting creates a humorous narrative, where the viewer can imagine the person in the video being the one causing the confusion and apologizing over the phone, as depicted in the audio. Together, the audio and video content effectively convey a comical scenario centered around a prank or a common mishap involving a phone."
+ },
+ {
+ "video_id": "PI-12fc8c9a-9fa8-4f22-a6bb-4f6060d8049e",
+ "path": "/workspace/benchmark/video/PI-12fc8c9a-9fa8-4f22-a6bb-4f6060d8049e.mp4",
+      "answer": "The audio track features a confident and somewhat playful female voice discussing her reason for joining \"American Horror\", attributing her decision to the show's iconic status and her personal passion for engaging in well-known projects. The video complements this narrative by showing a composed and confident woman with a bob haircut, wearing a smooth black top, sitting in a dimly lit room. Occasionally glancing to one side with a contemplative expression, she is framed in a way that focuses on her facial expressions and body language, suggesting self-reflection and an underlying confidence. The video, taken from \"THE KARDASHIANS\" on \"Hulu\", includes subtitles that help the audience follow along as the woman elaborates on her admiration for the program's iconic reputation and her eagerness to be part of such a renowned endeavor. Together, the audio and video convey the woman's enthusiasm and the show's allure, creating a compelling narrative around her decision to join \"American Horror.\""
+ },
+ {
+ "video_id": "PI-6e70971d-fce4-4223-81c4-e3da289ee02d",
+ "path": "/workspace/benchmark/video/PI-6e70971d-fce4-4223-81c4-e3da289ee02d.mp4",
+      "answer": "In the video, a person sits in a car, wearing a dark jacket and a light-colored patterned shirt. The car's interior is visible, and through the window one can see trees and a clear sky outside. The text on the video reads: \"It is very important to have a suitable partner in life,\" emphasizing the importance of a relationship built on mutual support and love. Her gestures and mouth movements show that she is having a conversation with someone. The exchange seems full of emotion and sincerity, with phrases like \"Let you go out through the front door, or if you want to park, come with me. I can go with you\", \"I love you. I'm sorry. I have nothing to do\", \"Come on\" and \"Thank you\", suggesting moments of connection and comfort. The video captures a tender moment, and the text overlay further reinforces its emotional tone, underscoring the importance of having a supportive partner in life."
+ },
+ {
+ "video_id": "PI-50c31908-1e1b-4ce2-a2ea-d9a23a7c0149",
+ "path": "/workspace/benchmark/video/PI-50c31908-1e1b-4ce2-a2ea-d9a23a7c0149.mp4",
+ "answer": "The video showcases a detailed baking process in a modern kitchen, featuring a metallic stand mixer on a white countertop with a tiled backsplash. The mixer bowl contains a mixture of oats, melted butter, and chocolate chips, indicating the preparation of a delicious baked treat. A hand, adorned with a ring, is seen adjusting the mixer's attachment and operating the controls to blend the ingredients. The hand carefully ensures that the mixture is well combined, occasionally pausing to check the consistency. The audio overlay includes a comforting and motivational message: \"You're doing fine, you can't be everything you want to be before your time. It's okay. You can stop at your house. Too bad, but you can see when you're wrong. You know you can.\" This message subtly underscores the theme of patience and self-acceptance in the baking process, suggesting that perfection takes time and effort. As the video progresses, the mixture becomes more uniform and cohesive, with the chocolate chips gradually blending into the oat and butter mixture. The video wraps up with the mixture nearing completion, highlighting the transformation of the ingredients into a smooth, well-blended dough, ready for the next step in the baking journey."
+ },
+ {
+ "video_id": "PI-454def78-d8f6-4b19-87c8-3d77378b1686",
+ "path": "/workspace/benchmark/video/PI-454def78-d8f6-4b19-87c8-3d77378b1686.mp4",
+ "answer": "The video features a news segment from NBC News, dated January 16, 2025, in Washington, D.C., focusing on senators' reactions to TikTok CEO Shou Chew attending Trump's inauguration. The segment begins with a senator in a blue suit and tie expressing his disapproval, stating, \"It's not my favorite thing.\" This sentiment is echoed by Sen. Josh Hawley (R-MO), who also shares his displeasure. The discussion then shifts to Sen. Mark Warner (D-VA), who, with a neutral tone, highlights TikTok as a significant national security concern, noting that 80% of Congress, both Democrats and Republicans, agree on this issue. Warner expresses skepticism about the potential for a deal in the next 90 days, stating, \"I was a business guy longer than I've been a politician. I'm open to that, but the idea of simply kicking the can and continuing to have this national security issue, I don't know how that sits with the rhetoric out of the new administration. They're going to be tough to...\" The conversation, which includes input from two male senators and a neutral female inquiry, reflects a serious and contemplative atmosphere, emphasizing bipartisan agreement on the potential risks posed by TikTok and the implications of delaying action on this critical national security matter."
+ },
+ {
+ "video_id": "PI-096791f7-af45-4156-bc27-c87209953bee",
+ "path": "/workspace/benchmark/video/PI-096791f7-af45-4156-bc27-c87209953bee.mp4",
+ "answer": "This video features a close-up of a series of lip products, which are neatly arranged on a textured light-colored surface. These products include lip glosses, lipsticks and lip balms of various brands, with packaging in red, pink and black. The text on the picture is \"big fan of this is whatever\", suggesting personal preference or recognition of the colors presented in this series. This audio is in English. The general meaning of the audio is \"Marry your boyfriend, and he can cry for a lifetime and love you for a lifetime.\" This video seems to combine the visual display of beauty products with emotional or romantic messages, creating a unique atmosphere between these two elements. The core message of this video is that it highlights a series of lip products while conveying emotional messages about love and commitment."
+ },
+ {
+ "video_id": "PI-0e69993e-369b-4515-a42a-893122764b27",
+ "path": "/workspace/benchmark/video/PI-0e69993e-369b-4515-a42a-893122764b27.mp4",
+ "answer": "The video captures an energetic basketball game between two teams with a cheering gym in the background. The main scene alternates between close-ups of players dribbling the ball and wide shots of the court, showing the energetic game. Main characters include players actively playing the game and fans enthusiastically supporting the team. The audio includes the sounds of the game, cheers from the crowd, and occasional commentary to set the lively atmosphere. Important English text visible in the footage includes ‘Clippers’ and ‘Fans’, which describe the context of the event as a Clippers game and attracts a large audience. The video conveys the passion and energy of a live basketball game, emphasizing the collective joy and competitive intensity of the event."
+ },
+ {
+ "video_id": "PI-eee430ac-00cb-40ef-9361-a0382bda3651",
+ "path": "/workspace/benchmark/video/PI-eee430ac-00cb-40ef-9361-a0382bda3651.mp4",
+ "answer": "This video documents a man named Phil's persistent efforts to learn a specific dance move, referred to as \"the chicken head,\" throughout a bachelor trip, with the intention of performing it at a friend's wedding. The visual narrative unfolds across various locations, beginning with an outdoor street scene featuring buildings and palm trees, where Phil is seen practicing the dance alongside another individual. The setting then shifts indoors to a hotel corridor, where Phil continues his practice, now accompanied by two friends. Following this, the scene moves to a hotel room, where Phil demonstrates the dance as his friends watch. The culmination of his efforts is shown at a wedding reception, where Phil performs the \"chicken head\" dance under blue lighting, integrating his learned moves into the celebratory atmosphere."
+ },
+ {
+ "video_id": "PI-3895ff89-6207-4784-a5f7-75b4070b4647",
+ "path": "/workspace/benchmark/video/PI-3895ff89-6207-4784-a5f7-75b4070b4647.mp4",
+ "answer": "The video unfolds in a hair salon, capturing the sounds of human activities and a little dialogue, suggesting a shooting scene. A blonde woman, initially clad in a black salon cape with pink sleeves, is seen curling her hair with a curling iron. Text overlays reveal her preference for curling her hair herself and her request to do so at the salon, along with a hint of impatience as someone walks around her, adding a layer of context to the auditory backdrop. In the background, a person in a white top is seen working, contributing to the bustling salon atmosphere. The woman eventually removes the cape, showcasing a pink cardigan over a black top and gray jeans. She adjusts her newly curled hair and poses, clearly pleased with the result. The video wraps up with her taking a selfie with another woman in a black salon cape, accompanied by the text 'LOVE IT BYE ', leaving the viewer with a sense of satisfaction and closure."
+ },
+ {
+ "video_id": "PI-34b043bd-7a44-48bb-ad55-7ffe586187cc",
+ "path": "/workspace/benchmark/video/PI-34b043bd-7a44-48bb-ad55-7ffe586187cc.mp4",
+ "answer": "This video captures a lighthearted moment in a classroom. A female student is at her desk, seemingly practicing a technique with a brush on a gridded paper, when her teacher approaches from behind. The audio is a vocal recording featuring human voices; the main subjects are a smiling female student and a kind-hearted and humorous female teacher.The teacher playfully scolds the student for recording a video during class instead of working. She jokes that the student should be learning to do nails and that if the video goes viral,The student claimins it's just a daily vlog. The overall tone is lighthearted and friendly, with some background chatter from other students in the classroom.The teacher playfully confronts the student for filming, stating she should be learning instead. The friendly banter continues as the teacher jokes about her own appearance and demands payment if the video goes viral. The student laughs along, defending the recording as part of a 'daily vlog'. The interaction ends with the teacher walking away and the student smiling as she returns to her work. "
+ },
+ {
+ "video_id": "PI-1390ac32-1cd7-4810-9248-0c0271cdab03",
+ "path": "/workspace/benchmark/video/PI-1390ac32-1cd7-4810-9248-0c0271cdab03.mp4",
+ "answer": "The video begins with a close-up shot of hands skillfully assembling a clay model of a hamburger, guided by an interesting and instructive voice-over from the audio track, which likely walks the audience through each step of the creative process. The hands start by placing a round and flat piece of white clay on a curved piece to form the base of the hamburger bun, followed by adding a fan-shaped edge piece for the layering effect. Set against a backdrop of a grid-patterned cushion and a colorful hamburger poster, the video then shows the hands coloring the top bun yellow and adding intricate details to the hamburger layers, such as red for tomato sauce, green for lettuce, and brown for meat patties. The craftsmanship continues with the addition of white lines to mimic the texture of the bun and other components, followed by the placement of transparent pink clay, resembling jelly, on top of the model. Using a delicate tool, the hands draw white lines on the pink clay, enhancing its decorative appeal. The video concludes with a fully assembled and shiny hamburger model held in hand, showcasing the final, polished result, while the background remains consistent with the grid-patterned cushion and the colorful hamburger poster. The audio's core message of guiding step by step in creating an interesting and colorful clay burger model, emphasizing creativity and attention to detail, is perfectly mirrored in the visuals, creating an engaging and educational experience for the viewer."
+ },
+ {
+ "video_id": "PI-39d2dc14-deeb-4809-bac3-c1c9ef534595",
+ "path": "/workspace/benchmark/video/PI-39d2dc14-deeb-4809-bac3-c1c9ef534595.mp4",
+ "answer": "The video features a group of bridesmaids gathered in a room with yellow walls and a patterned rug, all dressed in matching dark green gowns of various styles, including one-shoulder, strapless, and with different slit lengths. The text 'the bridesmaids reactions to the dress:' is displayed at the top of the screen, setting the context for the scene. The bridesmaids are seen reacting with visible excitement and surprise, covering their mouths with their hands and clapping in unison, clearly delighted with their dresses. The atmosphere is one of shared joy and celebration, capturing a moment of unity and happiness among the bridesmaids. The scene then transitions to a closer view of two bridesmaids, one in a one-shoulder dress and the other in a strapless dress, who continue to express their joy and approval, further emphasizing the positive and enthusiastic reactions to the dresses. The video effectively conveys the theme of excitement and satisfaction among the bridesmaids as they see and react to their elegant attire for the upcoming event."
+ },
+ {
+ "video_id": "PI-d95a39ba-334d-47c3-b46f-8b59cae2147c",
+ "path": "/workspace/benchmark/video/PI-d95a39ba-334d-47c3-b46f-8b59cae2147c.mp4",
+ "answer": "This video is a selfie-style recording of a man impersonating someone named Lavar on TikTok, as indicated by the on-screen text. He is positioned in front of a dark, padded wall. Initially, he speaks in a loud and energetic voice, introducing himself as Lavar and encouraging viewers to help another account reach 500,000 followers. His demeanor then shifts. He says he can't get something out of his head and begins to sing a line from a song. The video concludes with him making a shocked and exaggerated facial expression, with his eyes and mouth wide open."
+ },
+ {
+ "video_id": "PI-0e9bb252-11eb-49c8-9ce5-6c54c0be0bd4",
+ "path": "/workspace/benchmark/video/PI-0e9bb252-11eb-49c8-9ce5-6c54c0be0bd4.mp4",
+ "answer": "The audio track captures a conversation between a man and a woman in the living room, discussing the functions and special effects of a facial mask, providing insights into its benefits and usage. The video content, however, takes a more playful and theatrical approach, beginning with a person holding a pink package labeled \"ARE YOU READY?\" featuring a girl's face and the words \"BRINGGREEN\" and \"Young Hee Fresh Mask,\" setting the stage for a prank. The person applies a sheet mask resembling the dolls from the \"Squid Game\" series and approaches another individual named Lexi, who is initially surprised and then delighted by the unexpected appearance. The prankster continues to move around, startling Lexi further, who then inquires about the mask's origin. The masked person playfully replies, introducing the \"Young Hee fresh facial mask brought by green,\" expressing amusement at the experience. The scene concludes with both individuals standing together, the masked person still donning the facial mask, looking at the camera, and the video effectively blending humor with the product's introduction, while the audio remains focused on the practical aspects of the facial mask's functions and effects."
+ },
+ {
+ "video_id": "PI-5e60f11d-ebcf-44bc-acf1-fa3c2244765c",
+ "path": "/workspace/benchmark/video/PI-5e60f11d-ebcf-44bc-acf1-fa3c2244765c.mp4",
+ "answer": "The video takes place in a lively bar or pub setting, where a group of people are dancing and socializing, creating a vibrant and joyful atmosphere. The scene features a person in a red hoodie and blue jeans energetically dancing while holding a drink in one hand. The background includes a wooden wall with a sign that reads 'MOONEY'S PUB' and a glass door reflecting the interior lights, enhancing the lively ambiance. The text overlay on the video reads 'Love with every stranger The stranger the better,' emphasizing a theme of embracing connections with strangers. Complementing the visuals, the audio track is an upbeat and energetic BGM that contributes to the overall enjoyment and lively interaction among the people in the bar. The combination of the energetic music and the dynamic social interactions creates a cohesive and engaging experience, highlighting the joy of spontaneous connections in a social setting."
+ },
+ {
+ "video_id": "PI-f6910ccf-d1bd-40bd-9b55-ff05f4b784cf",
+ "path": "/workspace/benchmark/video/PI-f6910ccf-d1bd-40bd-9b55-ff05f4b784cf.mp4",
+ "answer": "The video opens with a man in a black shirt speaking into a microphone, The man expresses his deep love and appreciation for his father, stating that his father is his life, his happiness, his best friend, and his everything. He concludes by directly addressing his father with the words, 'I love you my dear father.'Against an initial backdrop of deep green, the scene transitions to a twilight highway bustling with moving cars beneath a teal-tinted sky. With impassioned intensity, the man declares his love: \"My father means everything to me,my father is my life my father is my happiness,my father is my best friend.\" expressing profound gratitude toward his father. The visual tone shifts from a personal, direct address to a more expansive, moody landscape, creating a reflective and emotional atmosphere. \n"
+ },
+ {
+ "video_id": "PI-aae71fc9-0b60-434c-8233-a6af8cfcc50c",
+ "path": "/workspace/benchmark/video/PI-aae71fc9-0b60-434c-8233-a6af8cfcc50c.mp4",
+ "answer": "The video begins with a person clad in a black balaclava and a black hoodie featuring a yellow design, standing outdoors under a cloudy sky, setting the stage for an enthusiastic review. As the scene transitions, a young boy in a gray hoodie with white stripes on the sleeves and camouflage pants is seen interacting with an e-bike that boasts a brown leather seat and a sleek black frame. The camera then zooms in on the e-bike's brown leather seat, followed by a view of the battery compartment, which houses a removable battery branded \"Jansno.\" The e-bike is showcased from multiple angles, emphasizing its stylish design, front headlight, and overall structure. The handlebar area is explored in detail, highlighting the headlight, turn signals, and horn. The reviewer, who is the same person initially seen, demonstrates the e-bike's handlebar controls, including the shifter and LCD display, and points out the tail light. Later, the reviewer, now dressed in a blue jacket and gray pants, stands beside the e-bike as the young boy reappears. The reviewer continues to point out the e-bike's features, and in a final flourish, performs a wheelie, lifting the front wheel off the ground to showcase the e-bike's impressive balance and control, all while enthusiastically detailing its design and functionality to the viewer."
+ },
+ {
+ "video_id": "PI-0f6a33de-5f13-481c-8d47-cc5d7bc4ad58",
+ "path": "/workspace/benchmark/video/PI-0f6a33de-5f13-481c-8d47-cc5d7bc4ad58.mp4",
+ "answer": "The audio track features a lively conversation between a male singer and his fans, culminating in a passionate group sing-along, showcasing the strong bond and mutual enthusiasm between the artist and his admirers. The accompanying video unfolds outside a Nordstrom store, where the singer, dressed in a striking white and black suit with a unique geometric pattern, engages with a crowd of enthusiastic fans. The vibrant atmosphere is enhanced by the backdrop of a white building, trees, and large glass structures. Red text such as \"JACK,\" \"WHEN I'M NOT,\" \"WITH YOU,\" and \"MY MIND\" appear on the screen, suggesting a playful and humorous interaction, while the singer leans over to sign autographs, indicated by the text \"A SIGN,\" and makes a gesture, followed by the blue text \"TIME\" and a wave to the crowd. The video captures the dynamic and energetic exchanges between the singer and his fans, emphasizing the excitement and fervor of the event. Additionally, the audio playfully references the song \"Hit Me Baby One More Time,\" adding a layer of humor to the scene and further highlighting the lighthearted and engaging nature of the interaction between the singer and his fans."
+ },
+ {
+ "video_id": "PI-3e950133-32fb-4996-b9c0-9849ab30ef85",
+ "path": "/workspace/benchmark/video/PI-3e950133-32fb-4996-b9c0-9849ab30ef85.mp4",
+ "answer": "The video, shot in a bedroom setting, presents a humorous and interesting scene that contrasts with the calming backdrop of an accompanying BGM track. The audio features a gentle piano melody intertwined with soft strings, crafting a warm and soothing atmosphere that conveys a sense of calm and tranquility, making it ideal for relaxing or reflective moments. Meanwhile, the video captures a man speaking on a TV placed on a white TV cabinet, with text overlay reading \"Going to bed but sleeping with a large rodent,\" evoking a tone of shock. A person, seen grasping the quilt with his hand, appears to be sitting on the bed, which is adorned with a yellow blanket, emphasizing the cozy ambiance of the room. As the camera slowly pans right, a small brownish-yellow doll comes into view, likely the \"big rodent\" referenced in the subtitles, adding a playful twist to the otherwise ordinary bedroom scene. This blend of visual humor and auditory serenity together creates a unique atmosphere that celebrates both the warmth of the bedroom and the simple joys of life."
+ },
+ {
+ "video_id": "PI-7125611a-73fa-4461-8bdb-f5da092031da",
+ "path": "/workspace/benchmark/video/PI-7125611a-73fa-4461-8bdb-f5da092031da.mp4",
+ "answer": "The video opens with a warning about graphic content. It then transitions to a nighttime street scene in Baltimore, Maryland, showcasing a crossroads with streetlights and parked cars. A truck is observed at the intersection. As the truck drives, it strikes a person, causing that individual to fall to the ground and be run over. The truck continues to drive away from the scene. The person lies on the ground, and shortly after, others can be seen approaching the victim's location. The incident occurs at a nighttime intersection, with visible traffic lights and road signs indicating Chester Street and Pratt Street."
+ },
+ {
+ "video_id": "PI-d0756ddb-096d-4bd8-9ba0-8c2af144895e",
+ "path": "/workspace/benchmark/video/PI-d0756ddb-096d-4bd8-9ba0-8c2af144895e.mp4",
+ "answer": "The beginning of the video shows a gentle scene: A woman is sitting on the sofa, wearing a green sweatshirt, and a little girl in a pink pajama is standing behind her, leaning against her back. It reads: \"Blair loves her parents.\" The scene switches to a woman picking up her mobile phone, showing a photo of a couple walking hand in hand along the beach at sunset. The woman is wearing a white top and a floral skirt, while the man is wearing a white shirt and jeans. The following few frames are close-ups of the mobile phone screen, showing more intimate moments of the couple, including a black-and-white photo of them hugging and another photo of them walking barefoot on the beach. The text overlay remains unchanged within these frames. Then the video switches to the woman and the little girl on the sofa. The woman looked at the child affectionately. The child's hair was tied into two small buns and he looked around curiously. The final scene returns to the woman showing the mobile phone screen. What is displayed on the screen is a photo of the woman in a floral dress. She is holding a small thing in her pocket, which might be a mobile phone. The background shows a comfortable living room environment with a plant and a cabinet. Both the video and the audio create a warm scene where the girl misses her father while her mother comforts her on the side."
+ },
+ {
+ "video_id": "PI-3724c335-43f7-49a7-a12a-79d0b8a8311a",
+ "path": "/workspace/benchmark/video/PI-3724c335-43f7-49a7-a12a-79d0b8a8311a.mp4",
+ "answer": "The video features a person lying on a bed, wearing a black beanie and a black top, with long, dark hair spread out on the pillow. The background is a light-colored wall, and the person is holding a pink object in their hand, possibly a phone or a small device. The text overlay on the video initially reads, 'when a woman says she hates men,' to which the person appears to react in a humorous or sarcastic manner. The scene then shifts to show the same person with the text overlay changing to 'when a man says he hates women,' prompting a different reaction, suggesting a commentary on the double standards or varying societal reactions to similar gender-related statements. Accompanying the video is an audio track sung by a male voice with a relaxed and casual tone. The lyrics, 'She just wants a party, never wrong every time,' express a carefree and enjoyable attitude towards life, emphasizing the core message of embracing fun instead of being picky when you want to enjoy good times. This juxtaposition highlights the contrast between the lighthearted nature of the audio and the more serious theme of societal perceptions and reactions to gender-related statements in the video."
+ },
+ {
+ "video_id": "PI-02373b80-4a90-410c-a169-fead4e2e41ad",
+ "path": "/workspace/benchmark/video/PI-02373b80-4a90-410c-a169-fead4e2e41ad.mp4",
+ "answer": "The protagonist of this video is a person wearing a red and grey work uniform, and the background is a clinic or a spa center. First, the subjects stirred a black liquid in a metal container. The digital screen displayed the words \"zenoli\", with a shelf holding bottles in the background. Then this person was holding a small black thing, which might be a tool, seemingly explaining or demonstrating something. The camera switches to a person walking in the corridor, where there is a sign that reads \"Come in, stretch a bit\", and his gestures seem to be continuing their explanation. Next, this person stands in front of the display of various skin care products on the wooden shelf, holding two white tubes and extending them towards the camera, possibly providing or showcasing the products. The last scene of the first segment is that this person is lying face down on a grey massage bed, wearing a grey coat and with his hair tied up, looking very relaxed. The background is a framed painting and a window with blinds. The video continues to show that this person is still lying face down on the grey massage bed, with his left hand on the table and his right hand making gestures, possibly continuing their explanation or demonstration. The overall environment and atmosphere indicate that the focus is on skin care or health treatment. The rhythmic, hymatical voice in the video, \"Wax lady never plays playstation, four playstation, four playstation, four playmates\", adds interesting and catchy elements to the video. The combination of visual and audio elements creates a cohesive narrative centered on spa or clinical services, with people actively participating in the demonstration and explanation."
+ },
+ {
+ "video_id": "PI-acdeadd0-81f5-4d55-a0d8-a0664545baf3",
+ "path": "/workspace/benchmark/video/PI-acdeadd0-81f5-4d55-a0d8-a0664545baf3.mp4",
+ "answer": "The video presents a chilling scene set in a desolate, leaf-strewn yard during late autumn or winter, where a figure clad in a black hooded robe and a white mask stands ominously near a green shed with a brown roof. The figure holds a large bouquet of red balloons and a sign that reads, \"I'D KILL TO BE YOUR VALENTINE,\" creating an unsettling juxtaposition with the typically romantic connotations of Valentine's Day. The background of leafless trees enhances the eerie atmosphere, while overlaid text stating, \"The person that sent you this wants to be your valentine,\" accompanied by two small heart emojis, adds an ironic twist to the threatening message. Complementing this visual, the audio track features a casual conversation where one speaker expresses a sense of being scared with the phrase \"Like scary,\" followed by a question about favorite scary movies. The relaxed and inquisitive voices engage in a light-hearted discussion about horror films, reflecting an informal and friendly interaction. Together, the video and audio create a stark contrast between the unsettling visual elements and the casual, conversational tone of the dialogue, blending horror with a typically romantic theme in a uniquely disturbing manner."
+ },
+ {
+ "video_id": "PI-10977ea9-377c-414a-81a9-f11559e4e64d",
+ "path": "/workspace/benchmark/video/PI-10977ea9-377c-414a-81a9-f11559e4e64d.mp4",
+ "answer": "The video captures a crime scene, cordoned off by a yellow tape, with emergency vehicles, including police cars with flashing blue and red lights and rescue vehicles with red lights, dominating the scene. The dark sky, lit up by the lights of emergency vehicles and streetlamps, along with the damp ground reflecting these lights, creates a somber and tense atmosphere, emphasizing the gravity of the incident. Law enforcement and medical personnel are seen closely monitoring the situation, surrounded by background elements like trees and other vehicles, though the focus remains on the emergency response. Accompanying this visual is a BGM track with a melancholic and emotional tone, characterized by a slow, somber melody and a soft, reflective female voice. The music evokes a sense of sadness or longing, possibly related to themes of love, loss, or personal choice, enhancing the overall emotional impact of the video's serious and somber scene, where the interplay between the visual and auditory elements underscores the gravity and emotional depth of the depicted event."
+ },
+ {
+ "video_id": "PI-dd4c001c-4650-430e-8a2b-808fab1e280d",
+ "path": "/workspace/benchmark/video/PI-dd4c001c-4650-430e-8a2b-808fab1e280d.mp4",
+ "answer": "In a gentle interview setting, a man with long blonde hair and a mustache asks another individual, Hasbulla Magomedov, if he has ever been in love. The video cuts between close-up shots of the two as they converse. Hasbulla, wearing a black t-shirt and holding a spoon, initially laughs at the personal question. He then provides a thoughtful response, translated via subtitles, stating that he loves his relatives, the Prophet Muhammad, and other kind people. The interviewer acknowledges that Hasbulla is a religious man and proceeds to ask what qualities he would look for in a future partner. In a calm and sincere tone, Hasbulla replies, \"We'll see what God has planned for us. That's all.\" The exchange highlights a respectful conversation centered on personal values, love, and faith, leaving the future to a higher power."
+ },
+ {
+ "video_id": "PI-258cc268-1e1f-4f3b-8863-c67a83871c55",
+ "path": "/workspace/benchmark/video/PI-258cc268-1e1f-4f3b-8863-c67a83871c55.mp4",
+ "answer": "The video features a person sitting inside a car, wearing a gray hoodie.The background consistently shows the car's interior, including the headrest and window, through which some buildings and a clear sky are visible.The person is speaking and making hand gestures to emphasize their points.The text overlay at the bottom of the frames reads, 'How to tell if someone is a societal outcast:'.The person appears to be explaining or discussing characteristics that might indicate someone is a societal outcast.The video maintains a consistent setting and focus on the speaker throughout.The audio content includes the speaker's commentary, which adds context to the visual elements.The speaker humorously mentions, 'And then I told them to get up and teach them instead of handing them a freaking packet. yo hahaha, so listen if you go on your tok search bar and you type in milk, and it's just stars, traps and not people showing you their favorite milk. Have a very big problem.' This suggests a discussion about social behaviors and online trends, possibly critiquing the lack of genuine content on platforms like TikTok."
+ },
+ {
+ "video_id": "PI-1674ae13-8e48-4429-9eca-7722bb8edbbd",
+ "path": "/workspace/benchmark/video/PI-1674ae13-8e48-4429-9eca-7722bb8edbbd.mp4",
+ "answer": "The video captures a poignant moment during Josh Allen's MVP speech, where he is seen wearing a black suit and tie, surrounded by a celebratory atmosphere filled with blurred spectators and red lights. Initially, Allen is depicted hugging multiple people, including his fiancee, Hailee Steinfeld, who is elegantly dressed in a black gown with a sparkling necklace, as they share a passionate embrace. The text overlay, \"Josh Allen thanks Hailee Steinfeld in MVP speech,\" accompanied by a teary-eyed emoji, sets the tone for the heartfelt scene. The audio, characterized by a man speaking in a quiet setting before cheers erupt in the second half, aligns with the video's narrative, suggesting that the track is indeed a speech. Allen then turns to the audience, holding an award, and continues his speech, expressing profound gratitude and love for Hailee's support, capturing a moment of genuine affection and appreciation amidst the festivities. The integration of the audio's quiet dialogue transitioning to cheers, and the video's visual narrative of Allen's emotional speech and embrace with Hailee, paints a vivid picture of a heartfelt and memorable occasion."
+ },
+ {
+ "video_id": "PI-508da3fd-5e80-4223-bb06-b69c93fc8f6e",
+ "path": "/workspace/benchmark/video/PI-508da3fd-5e80-4223-bb06-b69c93fc8f6e.mp4",
+ "answer": "The audio and video content together create a poignant and heartfelt narrative about a man's deep longing and nostalgia for his late father. In the audio, a male speaker expresses a mix of nostalgia and happiness as he reflects on his desire to work with his father once more, imagining showing him the factory's operations, which include producing 8 to 25 cakes per minute. He shares his wish for his father to witness the unity of his family, including his children, and conveys his love and longing through his words, all set against upbeat and cheerful ambient music that adds a warm and positive atmosphere. The video complements this by featuring the man standing in a large industrial kitchen or factory setting, dressed in a white long-sleeve shirt, white pants, a blue hairnet, and a gold chain with a pendant. The background, filled with stainless steel equipment, large mixing bowls, and various kitchen appliances, underscores the professional food production environment. The man is seen leaning against a large industrial mixing bowl, pouring a liquid into a yellow bucket below, maintaining a contemplative and reflective posture throughout. Text overlays on the screen share the man's narrative, revealing his wish to work with his father again and for his father to see the factory's operations, the family's achievements, and the current family unity. The industrial setting contrasts with the personal and emotional narrative, highlighting the man's journey and success in the food industry while capturing a moment of profound reflection and nostalgia."
+ },
+ {
+ "video_id": "PI-acacfd50-4d73-4cfd-bf54-ca82e9b54bd3",
+ "path": "/workspace/benchmark/video/PI-acacfd50-4d73-4cfd-bf54-ca82e9b54bd3.mp4",
+ "answer": "The video unfolds in a retail store's baby section, marked by a sign above, where two individuals engage in a casual interaction. The person on the left, dressed in a black T-shirt, red shorts with white stripes, and gray slides, holds a smartphone towards the other individual, suggesting an interview or conversation. The person on the right, wearing a black vest, red sports pants, and black slides with white text, participates in the exchange. The background is filled with various baby clothes hanging on racks, and the text \"@BOOMANDBEZZIE\" overlays the video, likely indicating the creators or source of the content. Complementing this visual, the audio track features a segment of dialogue with a male voice conducting the interview in a casual and inquisitive tone. He asks the interviewee, Greg, about his name, occupation, and gender preferences. Greg responds that he is a manager at McDonald's and prefers women. The core message conveyed through both the video and audio is a light-hearted and informal interaction between two individuals discussing personal preferences in a casual setting, with the visual context enhancing the understanding of the scene."
+ },
+ {
+ "video_id": "PI-6447c85e-1d7a-4c67-ac29-b93e4f1d1a1c",
+ "path": "/workspace/benchmark/video/PI-6447c85e-1d7a-4c67-ac29-b93e4f1d1a1c.mp4",
+ "answer": "The video shows a woman in a red shirt standing in a room with a neutral-colored wall and a ceiling light, with a partially visible doorway in the background. Animation elements overlay the scene, including yellow and red geometric patterns at the bottom and vertical pink bars on the left. The top hexagonal counter shows numbers from 00 to 950, while the circular frame with a gold border shows a blurry face with letters inside \"L A A A\". At the 950 mark, confetti-like animation drops, adding to the celebratory effect. The audio includes a female speaker whose emotional state is unknown, with an untranslated Chinese clip followed by an English phrase expressing casual or dismissive emotions. The speaker seems to be singing or humming casually, followed by a light or humorous remark. The background music has a cheerful rhythm and creates a lively and lively atmosphere that complements the visual elements."
+ },
+ {
+ "video_id": "PI-bc609332-b212-4e88-80a8-3b4b518cfabf",
+ "path": "/workspace/benchmark/video/PI-bc609332-b212-4e88-80a8-3b4b518cfabf.mp4",
+ "answer": "The video presents a contemplative scene where a person, dressed in a black beanie and dark jacket, is seated inside a car. The setting is a residential area with brick buildings visible through the car window, under a cloudy sky that adds to the subdued atmosphere. The individual remains mostly still, with minimal movement, suggesting a reflective mood. Throughout the video, the person delivers a monologue, discussing the emotional challenge of staying away from someone their soul, body, and heart naturally yearn for. The speaker, a male with a reflective and somewhat melancholic tone, emphasizes the discipline required to avoid thinking about that person. He offers reassurance that things will be okay, advising to take life one day at a time and hoping for a better future if that person returns. The overall mood of the video is introspective and calm, with the urban backdrop complementing the speaker's contemplative state."
+ },
+ {
+ "video_id": "PI-360508d1-e450-4a41-9cbd-082c74c9b890",
+ "path": "/workspace/benchmark/video/PI-360508d1-e450-4a41-9cbd-082c74c9b890.mp4",
+ "answer": "In the video, a man and a woman are walking outdoors together on a sunny day, creating an atmosphere full of vitality and hope. This man is wearing a black coat and black boots, holding a piece of paper in his hand, while this woman is wearing a light-colored dress and black high-heeled boots, with a long grey coat on the outside. They walked confidently on the paved ground, with buildings and trees in the background, suggesting the urban environment. Such words appeared on the screen: \"Just a former drug dealer and a former call girl jumped into Congress to start a revolution.\" This article emphasizes the narrative of change and ambition, indicating that an individual has overcome past challenges and pursued a significant political goal. The scene transitions to a man and a woman approaching the entrance of the building. The man opens the door and they both enter the building. There is a glass door at the entrance, with the words \"Please enter here\" and \"Note: An alarm will sound when the door is open\" written on it, which adds a sense of formality and purpose to their entry. This video captures a purposeful moment of action, highlighting an individual's journey and determination as they enter a new stage of their efforts. The cheerful audio message \"One thing I can tell you is that you must be free\" complements the visual narrative and emphasizes the theme of freedom and liberation."
+ },
+ {
+ "video_id": "PI-fe4afd98-432f-4dbd-aedb-7a3623148bef",
+ "path": "/workspace/benchmark/video/PI-fe4afd98-432f-4dbd-aedb-7a3623148bef.mp4",
+ "answer": "This video captures a tense moment inside a car at night. A young man is engaged in a difficult phone call. The scene is dimly lit, with the main source of light coming from the smartphone in his hand. He is on a hands - free call with a serious - looking older man, who is likely his parent and seems to have tracked his location. The text on the screen, \" North liberty 81.8 % accuracy\", reveals the context of this discovery. The conversation shows that the young man is at a friend's house, a fact that angers the caller. The older man vehemently expresses his dissatisfaction and informs the teenager that his mother is on her way to pick him up. Although the young man remains calm verbally, his body language, such as rubbing his face, betrays his nervousness. This short video powerfully presents a moment of family conflict triggered by modern technology. "
+ },
+ {
+ "video_id": "PI-4b6fa0e2-d464-471e-b8f8-5753e11a33c8",
+ "path": "/workspace/benchmark/video/PI-4b6fa0e2-d464-471e-b8f8-5753e11a33c8.mp4",
+ "answer": "The video features an elderly man sitting at a table in a warmly lit restaurant, adorned with decorative elements on the walls and other patrons in the distance. He is dressed in a light-colored hat and a beige jacket over a striped shirt. In front of him is a plate with a cupcake topped with white frosting and colorful sprinkles, accompanied by a fork and a napkin. The text overlay on the video reads, \"the 'illegal alien' that people are afraid of,\" which contrasts with the serene scene of the man enjoying his dessert, taking a bite and savoring it. Meanwhile, the audio plays a piece of background music with poetic lyrics, evoking a nostalgic and dreamy atmosphere. The lyrics describe the fragrance of flowers in the New Year, a youthful sunset on the beach, and longing memories, adding a layer of reflection and nostalgia to the visual content. The combination of the man's peaceful enjoyment of his dessert and the evocative music creates a poignant contrast between the simplicity of the moment and the deeper, more reflective emotions suggested by the audio."
+ },
+ {
+ "video_id": "PI-2b9c2871-9f5e-4b46-9372-4ac0f4ffb537",
+ "path": "/workspace/benchmark/video/PI-2b9c2871-9f5e-4b46-9372-4ac0f4ffb537.mp4",
+ "answer": "In the video, a person is wearing a light pink red-letter sweatshirt and black trousers, dancing passionately in a comfortable living room. This sweatshirt is marked with the word \"FAULT\", and there are some smaller characters below it, adding a casual and fashionable atmosphere to the garment. This person has braids and moves his arms and body rhythmically, creating a lively and charming atmosphere. The background includes a ceiling fan, a window with patterned curtains, a pink bicycle, a brown sofa with all kinds of things on it, and a large green plant, creating a warm and cozy feeling. The text appeared at the top of the picture, saying: \"I need someone more talented than me to do this because I have a vision that I can't perfect.\" This passage implies a humorous or self-aware comment on the dancers' performance, suggesting that they recognize their own limitations and are eager to collaborate to achieve a perfect vision. Furthermore, a faint background voice can be heard saying, \"When I breathe, you breathe, you breathe.\" This might be part of the song or another sound in the room, adding a layer of atmosphere to the scene. The video captures a moment of joy and self-expression, combining movement with humor, highlighting the dancer's playful admission of her creative process."
+ },
+ {
+ "video_id": "PI-0eb4acec-4941-447d-bc45-1ffb710a48db",
+ "path": "/workspace/benchmark/video/PI-0eb4acec-4941-447d-bc45-1ffb710a48db.mp4",
+ "answer": "The video takes place in a restaurant setting revealing a woman seated at a table. The main character, a woman, is depicted enjoying a meal consisting of various vegetables and meat, accompanied by a drink. Her actions include eating, drinking, and engaging with the camera, often smiling. This is an audio track from a vlog-style recording featuring a female voice. With a playful yet slightly skeptical tone, the woman discusses her father's cooking, particularly his steak, speaking in Spanish to her mother in an attempt to badmouth her father's food and observe his reaction. However, her father is displeased with the use of Spanish and seeks genuine feedback from his daughter about the meal. Meanwhile, the non-diegetic text \\\"Speaking badly about my Dad's food in Spanish to my mom to see how he reacts\\\" appears in the center of the screen, summarizing the video's premise. The video captures an amusing and somewhat tense interaction centered around food, family dynamics, and language choice."
+ },
+ {
+ "video_id": "PI-5396e9c8-9585-4850-bb36-252837f87c41",
+ "path": "/workspace/benchmark/video/PI-5396e9c8-9585-4850-bb36-252837f87c41.mp4",
+ "answer": "The video begins with a modern kitchen. A woman in a light purple hooded sweatshirt operates a toaster, with a man in a black T-shirt standing behind her. The countertop is piled high with items, creating a daily atmosphere. The camera turns to a young woman in a dark blue zigzag-shaped sweater. She is sitting at the table, seemingly chatting or lost in thought. Next came another woman in a light purple hoodie. She had a basket in front of her. After looking around the room, she got up and left, reflecting the dynamics of the home. Then, the woman in the light purple hoodie stood beside the countertop with items, with the living room in the background. The woman in the red T-shirt is operating the coffee machine by the pool. The child in the grey T-shirt of \"HEAT BASEBALL\" is observing beside. The woman in the dark blue sweater is still at the table, adding to the family atmosphere. The scene switches to a little boy in a red shirt and a blonde woman in a light blue shirt sitting on the sofa, suggesting a family gathering or a relaxing moment. The video ends with a close-up of the little boy in the red shirt speaking or reacting. In the audio, the three people are chatting, each with a different mood, reflecting the tension and understanding within the family."
+ },
+ {
+ "video_id": "PI-234a6c13-8118-4d78-b531-9074539aac47",
+ "path": "/workspace/benchmark/video/PI-234a6c13-8118-4d78-b531-9074539aac47.mp4",
+ "answer": "The audio subtly underscores a collaborative dynamic between two men as they work in tandem to mix materials using a power drill, with one man offering guidance and support to his partner, creating an auditory backdrop of teamwork and coordination. This audio element is seamlessly woven into the visual narrative of the video, which vividly depicts the two men engaged in their task outdoors on a concrete surface. One man, clad in a camouflage jacket, blue jeans, and a cap, diligently operates a red power drill, meticulously mixing materials in a green bucket labeled \"Stine.\" Meanwhile, the other man, dressed in a gray hoodie and light blue jeans, stands close by, keenly observing and providing timely suggestions for adjustments, thereby enhancing the efficiency of their work. The backdrop of a residential area, complete with houses, trees, and a trailer brimming with construction materials, adds context to their endeavor. As the mixing progresses, the observer doesn't hesitate to assist by handling additional materials, such as bags of concrete mix, and incorporating them into the bucket. The power drill's whirring action stirs the mixture, generating dust that symbolizes the intensity and dedication of their labor. Throughout the video, the two men exhibit a seamless partnership, with the observer's support complementing the operator's focus, encapsulating a moment of unwavering teamwork and commitment to completing their task with precision and efficiency. Here, the audio complements the visuals by reinforcing the collaborative spirit that defines their work."
+ },
+ {
+ "video_id": "PI-626bb3e3-2a35-48fd-b527-851df8cf943e",
+ "path": "/workspace/benchmark/video/PI-626bb3e3-2a35-48fd-b527-851df8cf943e.mp4",
+ "answer": "The video and audio content together create a lighthearted and humorous scenario, blending a surprising encounter at a McDonald's with playful banter about cooking. The video opens with a person in a light-colored jacket and black shirt standing in a McDonald's, expressing surprise or disbelief through a text overlay about encountering someone named vexbolts after a mass unfollowing. The scene then shifts to a McDonald's employee in a light blue uniform and visor, holding a tray of red fries containers, engaged in a conversation with the person in the jacket. The employee appears defensive or surprised, gesturing with their hands, while the background displays a McDonald's menu board. The interaction continues with the employee holding the tray and possibly explaining or reacting to something, as the camera zooms in on the fries containers, showing the McDonald's logo. Meanwhile, the audio features a male speaker expressing happiness about finding someone and humorously mentioning the McDonald's employee, engaging in casual banter about cooking. The speaker's excitement and amusement mirror the video's humorous and surprising encounter between the two individuals, creating a cohesive and entertaining narrative that blends real-life interaction with playful commentary."
+ },
+ {
+ "video_id": "PI-3f055bab-4038-4466-98ce-dd0597430eae",
+ "path": "/workspace/benchmark/video/PI-3f055bab-4038-4466-98ce-dd0597430eae.mp4",
+ "answer": "In the video, a group of people gather in the corridor of the underground passage under the overpass, as if watching a dance similar to a street performance. People are dressed in all kinds of clothes, wearing hats, cotton-padded clothes and down jackets. This should be a video shot in winter. However, the woman dancing in the video is dressed completely differently from the people around her. She is dressed very lightly. Wearing a khaki top and khaki plush trousers, she looked like a casual outfit. She shook hands with the people around her and then demonstrated the dance to everyone along with the music, with a confident smile on her face. It could be seen from the expressions of the audience that people really liked her dance. The background music of the entire video is cheerful melodies, giving people a relaxing feeling. There are no special words in the video."
+ },
+ {
+ "video_id": "PI-5eeaf429-4931-4f41-a6af-f0df07bc1f8c",
+ "path": "/workspace/benchmark/video/PI-5eeaf429-4931-4f41-a6af-f0df07bc1f8c.mp4",
+ "answer": "The video begins with a woman in a gray sweatshirt and blue jeans standing behind a white door, holding a phone, while a man in a white and black tracksuit approaches. She opens the door to reveal two people outside: one in a blue sweatshirt holding a sign that reads 'Jackson, Let's have a BALL at PROM?' and another in a pink hoodie holding a phone. The man greets them, and the woman excitedly asks him to prom, to which he enthusiastically accepts. The person in the blue sweatshirt hands him a basketball, and they take a photo. The person in the blue sweatshirt expresses excitement for the man's upcoming championship game, and they pose for another photo, with the person in the blue sweatshirt saying, 'This is so fun.' The scene transitions to them at the doorway, holding the sign and basketball box. The person in the blue sweatshirt asks for a real hug, which the man gives. The woman tells the man to say thank you, and he responds, 'You are the best.' The person in the blue sweatshirt points to the sign, and the woman comments, 'You think I'm at BYU.' The person in the blue sweatshirt says, 'This is so exciting,' and the woman says, 'I can't wait.' The person in the pink hoodie asks, 'Is this really exciting?' to which the man responds, 'Yes.' The audio, featuring a female's excited and happy voice and a male's cheerful and encouraging tone, captures the prom invitation and their shared excitement about the upcoming prom and championship game, wrapping up with the group sharing a moment of anticipation and joy."
+ },
+ {
+ "video_id": "PI-855f053a-b8c3-44e0-8165-936cb31f6f5a",
+ "path": "/workspace/benchmark/video/PI-855f053a-b8c3-44e0-8165-936cb31f6f5a.mp4",
+ "answer": "The audio and video content intertwine to present a compelling narrative about a man's personal journey, emphasizing the spiritual significance of never having broken a bone and the protective measures that accompany such a physical record. The audio shares the man's personal story of his unbroken bones, highlighting the steps he has taken to avoid physical and medical burdens while pursuing self-discovery and spiritual growth. The video complements this narrative by featuring a person sitting at a kitchen table, wearing a dark hoodie and multiple rings, with text overlays suggesting that individuals who have never broken a bone are destined to face various challenges but are protected by guardian angels and deities. The person in the video shares a personal anecdote about their own unbroken bones, reinforcing the spiritual message of protection and burden-free growth on the path to self-discovery and spiritual enlightenment. Together, the audio and video content create a rich and layered exploration of the idea that physical resilience can be intertwined with spiritual protection and growth."
+ },
+ {
+ "video_id": "PI-e79697d2-f796-4962-9d92-bce774292d8c",
+ "path": "/workspace/benchmark/video/PI-e79697d2-f796-4962-9d92-bce774292d8c.mp4",
+ "answer": "The video presents a close-up, static shot of a young woman in a dimly lit room. She is focused on assembling a large, bright pink cake box. Her hair is pulled back, and she wears a long-sleeved, ribbed top.The audio consists solely of a slow, melancholic song with female vocals playing in the background. The somber melody and emotional singing style create a sad and poignant atmosphere , her facial expression is serious and appears laden with emotion, suggesting sadness or stress. On-screen text is overlaid on the video,saying \"Wrapping up the cake I just brought my brother for his 18 birthday .. he didn't wanna sing happy birthday because our mom and dad are both deceased not knowing I spent 50$ on the cake and I'm late on rent.\"The video captures a moment of sibling love, trying to create a moment of celebration for her brother on a significant birthday."
+ },
+ {
+ "video_id": "PI-20b881ed-1ad3-4250-b818-f930d70effca",
+ "path": "/workspace/benchmark/video/PI-20b881ed-1ad3-4250-b818-f930d70effca.mp4",
+ "answer": "The video centers on a plush teddy bear lying on a bed decorated with a patterned bedspread, presenting a cozy and intimate scene. This bear with light brown fur and dark sunglasses is the main focus of the video. It was holding a stack of US dollar bills on its chest, with denominations including a 10-dollar bill. This bear's relaxed posture and playful sunglasses add a laid-back and almost humorous appearance to it. Throughout the video, the Angle of the camera varies slightly, providing different perspectives of the bear and the money in its hand. This subtle movement adds dynamic elements to the originally static scene, drawing people's attention to the cute expression of the bear and the crispy banknotes. In the background, the soft and gentle color of the bedspread creates a calm atmosphere, enhancing the warmth of the entire environment. Near the end of the video, a voice could be heard saying, \"All I saw was...\" My name is John. This short audio clip adds an element of mystery and conspiracy, prompting the audience to be curious about the context and the identity of the speaker. The sound was clear and distinct, in sharp contrast to the visual tranquility of the scene. Overall, this video combines elements of ingenuity and curiosity, centering around a charming teddy bear and unexpected cash. The juxtaposed interesting visual effects and mysterious audio create an unforgettable and captivating experience for the audience."
+ },
+ {
+ "video_id": "PI-26b51e5c-9451-4fcd-b82e-00c8f3a1b9c5",
+ "path": "/workspace/benchmark/video/PI-26b51e5c-9451-4fcd-b82e-00c8f3a1b9c5.mp4",
+ "answer": "In the video, a person is sitting on a black office chair, and the room is carpeted with leopard print. This person is wearing a white vest, light blue shorts, white socks, a white hat and a tattoo on his arm. The room is decorated with all kinds of items, including a pink and golden chair, a blue arcade game, and framed photos on the wall. The text on the video shows, \"Kai is still shocked after meeting Billie eilish at the Grammy Awards ceremony,\" along with a shocked emoji, a heart-shaped eye emoji and a heart-shaped emoji, and the word \"Stake\" is displayed at the bottom of the screen. This person seems to be expressing excitement and doubt, using gestures to emphasize their point of view. The audio includes a account of this meeting, among which there is \"I'm staying here, but he's walking back.\" The overall mood is a high energy and enthusiasm, capturing the lingering excitement and influence after meeting a celebrity. This person stood up from the chair, still speaking and gesturing actively, and then got away from the chair, leaving it empty. This video highlights the unforgettable and joyous nature of meeting Billie Eilish at the Grammy Awards ceremony."
+ },
+ {
+ "video_id": "PI-65e6711e-b63f-4209-be4c-9476c5c0c413",
+ "path": "/workspace/benchmark/video/PI-65e6711e-b63f-4209-be4c-9476c5c0c413.mp4",
+ "answer": "In the video, a person is lying in bed, surrounded by a row of comfortable pink and white pillows, with patterned fabrics on them adding a touch of color. This person has long black hair and wears gold earrings. Throughout the video, they shared a personal anecdote with complex emotions. She expressed an unexpected sense of heartbreak, which was related to the upcoming event. This woman ordered a special set of clothes from Jaded London, calling it \"the most disgusting, most disgusting, most disgusting clothes in the world\", and it will arrive the next day. She plans to take off at 4 a.m. Apparently, she is holding back her tears, thinking that her audience should see her in this dress in New York. She disclosed that she spent 400 dollars on this suit and tried to downplay its significance, saying, \"It's really no big deal.\" However, her tone of emotion indicates that she is still influenced by the situation. This video captures a moment of vulnerability and anticipation, integrating personal emotions with the upcoming plans."
+ },
+ {
+ "video_id": "PI-de6239a1-4bde-4d24-bdee-28e3d2902825",
+ "path": "/workspace/benchmark/video/PI-de6239a1-4bde-4d24-bdee-28e3d2902825.mp4",
+ "answer": "The video opens in a brightly lit kitchen, where a woman is engaged in preparing a chicken casserole. She is seen using a light blue stand mixer and handling shredded chicken in a grey bowl. The kitchen features light-colored cabinetry, granite countertops, and a decorative tiled backsplash. A man subsequently enters the scene, carrying a canned beverage and a snack box. An exchange occurs between the two, leading the woman to forcefully throw a container towards the man, who then quickly exits the frame. Following this interaction, the woman expresses visible frustration, remaining in the kitchen. The on-screen text indicates the video's purpose: \"MAKE A CHICKEN CASSEROLE WITH ME!\"."
+ },
+ {
+ "video_id": "PI-3d0e02c3-c566-402a-a87b-616571982692",
+ "path": "/workspace/benchmark/video/PI-3d0e02c3-c566-402a-a87b-616571982692.mp4",
+ "answer": "The beginning of the video shows a scene of a press conference. A person is standing behind the podium, with signs such as \"PARIS GAMES 2025\", \"NBA\", \"2K25\", \"Nike\", \"Foot Locker\" and \"PARIONS SPORT\" on the blue background. This person is wearing a black shirt with the white logo of the \"SPURS\" printed on it. At the bottom of the screen, in large white characters, \"diagnosis\" is written, indicating an important medical notice. The scene switches to the outdoor scene in front of the Eiffel Tower, where three people are standing. The person in the middle is wearing a blue-green coat. The large white characters \"DEEP\" appear. The next frame is \"THROMBOSIS\", showing the disease under discussion. The camera switches to a group of basketball players wearing black Spurs jerseys, who interact with fans and sign autographs in a stadium with green stands and banners. The text \"OUT\" appears in white capital letters, suggesting its influence on the player's career. Subsequently, the player, wearing a light blue jersey with the words \"WEMBANYAMA\" printed on it and the number \"1\" on the back, celebrated in the stadium crowded with spectators. The player's powerful Slam Dunk is the focus of the basketball net, showcasing his athletic ability. Finally, there is a player in a white sleeveless jersey making a jump shot during the game, with the camera focused on the basket. There is also a close-up of the same player preparing for a free throw, emphasizing the precision and skills of basketball. At the end of the video, a player was celebrating on the field wearing a black \"SATX\" jersey printed with \"WEMBANYAMA\" and the number \"1\" on the back, highlighting the player's prominent position and the support of the audience."
+ },
+ {
+ "video_id": "PI-f7364b80-bf54-4270-b91c-0c54981eeca1",
+ "path": "/workspace/benchmark/video/PI-f7364b80-bf54-4270-b91c-0c54981eeca1.mp4",
+ "answer": "The video focuses on a male restorer taking one of a heavily soiled pair of white-and-green Jordan sneakers out of the delivery package in an indoor workshop or studio and renovating it through a series of professional steps. He wears a brown leather apron with the \"MSP HEATING AND AIR CONDITIONING\" logo on it, and uses shoe braces, foam cleaners, and a variety of brushes to scrub his sneakers, including the leather upper, midsole, and outsole. After cleaning, he had a professional adhesive repair and showed off the restored sneaker from multiple angles outdoors to show off its new look. The video is interspersed with sound effects such as scrubbing, foam dispensing, and water flowing, and upbeat, rhythmic hip-hop music plays in the background, as the male narrator enthusiastically and confidently explains the process of cleaning sneakers and emphasizes the importance of hard work. Key textual messages that appear in the video include branding on the apron and cleaning bottles. Through the stark contrast between the before and after restoration, the video highlights the importance of shoe cleaning and restoration, as well as the ingenuity of the restorer."
+ },
+ {
+ "video_id": "PI-2999d9f1-8124-4d1d-8683-ee311b367899",
+ "path": "/workspace/benchmark/video/PI-2999d9f1-8124-4d1d-8683-ee311b367899.mp4",
+ "answer": "The feature of this video is a stand-up comedy performance, with a vibrant red wavy curtain in the background. He was wearing a black hoodie, black leather pants and red and white sports shoes, and holding a microphone tied with a red rope in his hand. This performance humorously explored the comedian's changing drinking habits as he grew older. He began by recalling his younger days, saying, \"I can no longer drink as I did when I was 22. When I was 23, I drank for several consecutive days.\" He added in a playful but disrespectful tone, \"Dance later, your bitch No. 3.\" Then, the comedian compared this with his state at the age of 43 and humorously said, \"The state at 43 was different from now.\" He continued the topic with a joke about paranoia. He said, \"Someone put something in my drink. Someone is here tonight and wants to do something to me.\" The comedians' movements and expressions are full of vitality, adding a sense of humor to the performance. Haha! The logo of the network is visible in the lower right corner of the frame, indicating the performance of network broadcasting. This video captures the comedian's reflection on aging and the changes in his ability to handle alcohol, conveyed in a playful and irreverent tone."
+ },
+ {
+ "video_id": "PI-03f7a6cc-bb4c-4a7a-bf49-8e4d22122437",
+ "path": "/workspace/benchmark/video/PI-03f7a6cc-bb4c-4a7a-bf49-8e4d22122437.mp4",
+ "answer": "In a compelling interview - based video and audio presentation, an intriguing political exchange unfolds. The audio serves as the direct representation of the interview dialogue. A woman asks the man if he told Trump that Putin agreed to the peace talks because he was afraid of him, to which the man replies in the affirmative, adding that Putin also knew about this matter. The video complements the audio by providing a visual context. It begins with the woman posing the question, and at the same time, the core theme sentence \"Zelensky said he told Trump that Putin was afraid of him\" appears in the center of the screen, setting the stage for the discussion. The words \"MUNICH, GERMANY FEBRUARY 15\" show up in the upper - right corner, indicating the location and date of the interview. The camera skillfully transitions from a panoramic shot of the woman to a close - up, then to a medium shot of the man, and finally back to a close - up of him, creating a dynamic visual flow. As the man starts to answer the questions, he turns his head to the left, raising and lowering his left hand halfway, adding a subtle physical element to his response. After he finishes answering, the audience bursts into laughter, and the woman also covers her mouth with her hand and giggles quietly, perhaps indicating the unexpected or amusing nature of the revelation. Overall, the video mainly conveys the man's statement that he had informed Trump that Putin agreed to peace talks due to fear, and that Putin was aware of this communication, all of which is clearly articulated in the audio as well."
+ },
+ {
+ "video_id": "PI-55f60ffb-30c4-452f-a600-4ff84ad2413c",
+ "path": "/workspace/benchmark/video/PI-55f60ffb-30c4-452f-a600-4ff84ad2413c.mp4",
+ "answer": "The audio track captures a brief interaction between two female voices, where one, claiming to be a tourist, asks the other for directions to the Christmas tree, which is then promptly provided. However, the conversation takes an unexpected turn when the tourist, in a moment of humor or misstep, makes a comment about Trump that is misunderstood or deemed inappropriate by the other woman. The latter quickly denies and corrects the tourist, emphasizing that it was not her intention to convey such a message. This audio narrative revolves around a misunderstanding that is swiftly clarified. Meanwhile, the video unfolds in a store aisle filled with beauty and skin care products, where a woman in a grey sweater, blue jeans, and a white scarf is browsing the goods. A man in a grey hoodie, beige trousers, and a wide-brimmed hat approaches her, appearing to ask for directions or information. Text on the screen reveals that he is a tourist who doesn't speak English and is inquiring about the location of the Christmas tree. The woman obliges, pointing the way and providing clear instructions. However, the man then says something that surprises the woman, with on-screen text indicating it was an inappropriate or humorous remark about Trump. The woman immediately responds, \"No, that's not what I meant,\" and tries to clarify, but the man repeats similar words, prompting her to chase after him while still holding the product she was inspecting, her excitement and urgency to correct the misunderstanding evident in her actions. The video thus visually complements the audio's narrative of a misunderstanding and its prompt resolution."
+ },
+ {
+ "video_id": "PI-41ab9900-8177-459c-93c7-b951002d407a",
+ "path": "/workspace/benchmark/video/PI-41ab9900-8177-459c-93c7-b951002d407a.mp4",
+ "answer": "The video begins with a close-up of a bowl of yellow chicken noodle soup, decorated with green herbs and black pepper. The text above reads: \"The only thing that can make me feel worse than being ill is drinking a bowl of soup that I don't like.\" The camera switches to a person wearing a dark hoodie, holding a spoonful of soup to his mouth, with the words written on it: \"I'm really ill. Cough cough, prove it.\" This person explained that the only thing that could cure them was lemon chicken noodles. The last scene is a bowl of cooked soup, decorated with green herbs, with the words \"Whether you are ill or not\" written on it. At the end of the video, a close-up shot of stirring soup was used to highlight its bright colors and texture. In the video, a person wearing a dark hoodie is holding a spoonful of soup close to his mouth, on which is written: \"So the only thing that can heal me is this bowl of lemon chicken noodle soup.\" This video highlights the comfort and healing power of homemade chicken soup noodles, especially when carefully prepared and with the right ingredients."
+ },
+ {
+ "video_id": "PI-5f27600f-5a3d-445d-b163-1bc42bdd8d29",
+ "path": "/workspace/benchmark/video/PI-5f27600f-5a3d-445d-b163-1bc42bdd8d29.mp4",
+ "answer": "The audio and video content work together seamlessly to create an engaging and appealing promotional piece for Ninja Kitchen appliances. The audio, an instrumental background music track with a light, upbeat, and cheerful melody, sets a positive and inviting tone. The soft, rhythmic beat creates a pleasant atmosphere that enhances the visual content, making it more engaging and enjoyable for viewers. The music is likely intended to complement the video's message of showcasing the functionality and versatility of Ninja Kitchen appliances, providing a positive and engaging backdrop that encourages viewers to watch and learn more. The video, on the other hand, is a dynamic showcase of various Ninja Kitchen appliances in use. It begins with a visually appealing scene of a light blue Ninja air fryer on a countertop, revealing golden-brown chicken inside, immediately capturing the viewer's attention. The text overlay \"pov: you love your Ninja Kitchen products\" emphasizes user satisfaction and sets a positive tone for the rest of the video. As the video progresses, it transitions smoothly between different scenes, each highlighting a different Ninja Kitchen appliance in action. From a soda maker dispensing orange soda to a coffee machine pouring two-layered coffee, and from a hand stirring a creamy mixture in a Ninja pan to an espresso machine dispensing espresso into a chocolate-drizzled glass, the video effectively demonstrates the functionality and versatility of these appliances. The combination of the audio's upbeat and cheerful melody with the video's dynamic showcase of Ninja Kitchen appliances creates a compelling promotional piece that not only highlights the products' features but also evokes a sense of excitement and satisfaction in potential buyers. The audio and video content work together to enhance the overall viewing experience, making it more engaging and persuasive."
+ },
+ {
+ "video_id": "PI-cc6288a5-2f05-438e-8c4e-fdd942288e8b",
+ "path": "/workspace/benchmark/video/PI-cc6288a5-2f05-438e-8c4e-fdd942288e8b.mp4",
+ "answer": "The video captures a woman holding a brown poodle-type dog in an indoor setting with a city skyline visible through large windows. The woman is dressed in a light blue top and jeans, with sunglasses perched on her head. The dog is wearing a black harness. Throughout the short clip, the woman interacts affectionately with the dog, holding it close and displaying a range of expressions, from playful to smiling. The dog remains calm in her arms, occasionally looking towards the camera or licking its lips. The background shows a modern high-rise environment, indicative of an urban setting. An overlayed text reads, \"Going to the new Bop house today 👀,\" suggesting a transition or new experience for the duo."
+ },
+ {
+ "video_id": "PI-7c65de83-716d-4e1a-a897-bf6ef9c8129c",
+ "path": "/workspace/benchmark/video/PI-7c65de83-716d-4e1a-a897-bf6ef9c8129c.mp4",
+ "answer": "The video unfolds in a cozy living room with beige walls, a ceiling fan, and recessed lighting, creating a warm and homely atmosphere. A child, dressed in a bright turquoise shirt and blue jeans, sits on a green surface, flanked by two adults. One adult, wearing a white long-sleeve shirt, has their arm around the child, while the other, in a light gray hoodie, sits on the other side. The child points towards something off-screen, and the adult in white follows suit, indicating a shared focus. A person with dark nail polish and a ring gestures towards the child, possibly emphasizing a point or giving instructions. The interaction is friendly and engaging, with the child appearing curious and the adults showing attentiveness and care. Meanwhile, the background dialogue features a female speaker with a defensive and assertive tone, warning someone to stay away from her brother, highlighting a protective stance and emphasizing loyalty and protection towards family. This blend of visual warmth and vocal assertiveness portrays a scene of familial bonding and vigilance."
+ }
+ ]
+}
\ No newline at end of file
diff --git a/eval_scripts/UGC-VideoCap/generate_caption.py b/eval_scripts/UGC-VideoCap/generate_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ef9d3aff209a42c90c1123fd95be73551970b22
--- /dev/null
+++ b/eval_scripts/UGC-VideoCap/generate_caption.py
@@ -0,0 +1,87 @@
+import os
+import torch
+from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
+from qwen_omni_utils import process_mm_info
+import argparse
+import json
+from tqdm import tqdm
+from pathlib import Path
+
+VIDEO_MAX_PIXELS = 401408 # 512*28*28
+VIDEO_TOTAL_PIXELS = 20070400 # 512*28*28*50
+USE_AUDIO_IN_VIDEO = True
+os.environ['VIDEO_TOTAL_PIXELS'] = str(VIDEO_TOTAL_PIXELS)  # total pixel budget across all sampled frames
+video_dir = ""  # TODO: set to the directory containing the benchmark videos
+
+
+parser = argparse.ArgumentParser(description="Evaluate a model and save results.")
+parser.add_argument("--model_path", type=str, required=True, help="Path to the model checkpoint.")
+parser.add_argument("--save_path", type=str, required=True, help="Path to save the evaluation results.")
+args = parser.parse_args()
+
+model_path = args.model_path
+fout_path = args.save_path
+
+fout = open(fout_path, 'w', encoding='utf-8')
+
+model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
+ model_path,
+ torch_dtype=torch.bfloat16,
+ device_map="auto",
+ attn_implementation="flash_attention_2",
+)
+model.disable_talker()
+processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
+
+
+def chat(file_path, prompt):
+ conversation = [
+ {
+ "role": "system",
+ "content": [
+ {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
+ ],
+ },
+ {
+ "role": "user",
+ "content": [
+ {
+ "type": "video",
+ "video": file_path,
+ "max_pixels": VIDEO_MAX_PIXELS,
+ },
+ {
+ "type": "text",
+ "text": prompt
+ },
+ ],
+ },
+ ]
+
+ text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
+ audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+ inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+ inputs = inputs.to(model.device).to(model.dtype)
+
+ text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, do_sample=False, thinker_max_new_tokens=2048)
+
+ text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+ model_generation = text.split("\nassistant\n")[-1]
+
+ return model_generation
+
+for video_id in tqdm(os.listdir(video_dir)):
+ video_path = os.path.join(video_dir, video_id)
+ prompt = ("You are given a short video with both audio and visual content. Write a detailed and coherent paragraph that naturally integrates all modalities. "
+ "Your description should include: (1) the primary scene and background setting; (2) key characters or objects and their actions or interactions; "
+ "(3) significant audio cues such as voices, background music, sound effects, and their emotional tone; "
+ "(4) any on-screen text (OCR) and its role in the video context; and (5) the overall theme or purpose of the video. "
+ "Ensure the output is a fluent and objective paragraph, not a bullet-point list, and captures the video's content in a human-like, narrative style.")
+
+ model_generation = chat(video_path, prompt)
+ out_data = {
+ "video_id": video_id.replace(".mp4", ""),
+ "caption": model_generation,
+ }
+ fout.write(json.dumps(out_data, ensure_ascii=False)+'\n')
+ fout.flush()
\ No newline at end of file
diff --git a/eval_scripts/VDC/evaluation.sh b/eval_scripts/VDC/evaluation.sh
new file mode 100644
index 0000000000000000000000000000000000000000..52941f27e60d06d903b40c082c29d3eaa1784d57
--- /dev/null
+++ b/eval_scripts/VDC/evaluation.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+raw_file="$1"
+
+output_file="${raw_file/caption.jsonl/results.jsonl}"
+
+python eval_scripts/VDC/score_sglang_multi-threads.py \
+    --raw_file "$raw_file" \
+    --output_file "$output_file"
\ No newline at end of file
diff --git a/eval_scripts/VDC/generate_caption.py b/eval_scripts/VDC/generate_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..4e315a82eb5dc7df9e0b366268f5d9225d817980
--- /dev/null
+++ b/eval_scripts/VDC/generate_caption.py
@@ -0,0 +1,281 @@
+import os
+import torch
+from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
+from qwen_omni_utils import process_mm_info
+import argparse
+import json
+from tqdm import tqdm
+import multiprocessing as mp
+import re
+import traceback
+import logging
+import sys
+import time
+from multiprocessing import Value, Lock
+from torch.utils.data import Dataset, DataLoader
+from functools import partial
+import glob
+import gc
+
+
+VIDEO_MAX_PIXELS = 401408 # 512*28*28
+VIDEO_TOTAL_PIXELS = 20070400 # 512*28*28*50
+USE_AUDIO_IN_VIDEO = False
+video_dir = "path_to_Video-Detailed-Caption_videos"
+os.environ['VIDEO_TOTAL_PIXELS'] = str(VIDEO_TOTAL_PIXELS)  # total pixel budget across all sampled frames
+
+def setup_logger(rank, log_dir):
+ logger = logging.getLogger(f"worker_{rank}")
+ logger.setLevel(logging.INFO)
+ logger.propagate = False
+ if not logger.handlers:
+ log_file = os.path.join(log_dir, f"eval_rank_{rank}.log")
+ file_handler = logging.FileHandler(log_file, mode='w')
+ formatter = logging.Formatter('%(asctime)s - [%(levelname)s] - (%(processName)s) - %(message)s')
+ file_handler.setFormatter(formatter)
+ logger.addHandler(file_handler)
+ return logger
+
+class VideoDataset(Dataset):
+ def __init__(self, video_paths, prompts):
+ self.video_paths = video_paths
+ self.prompts = prompts
+
+ def __len__(self):
+ return len(self.video_paths)
+
+ def __getitem__(self, idx):
+ video_path = self.video_paths[idx]
+ prompt = self.prompts[idx]
+
+ conversation = [
+ {
+ "role": "system",
+ "content": [
+ {
+ "type": "text",
+ "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
+ }
+ ]
+ },
+ {
+ "role": "user",
+ "content": [
+ {
+ "type": "video",
+ "video": video_path,
+ "max_pixels": VIDEO_MAX_PIXELS,
+ "max_frames": 256
+ },
+ {
+ "type": "text",
+ "text": prompt
+ },
+ ]
+ },
+ ]
+ return conversation, video_path
+
+def collate_fn(batch, processor):
+ conversations = [item[0] for item in batch]
+ video_paths = [item[1] for item in batch]
+
+ texts = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)
+ audios, images, videos_pre = process_mm_info(conversations, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+
+ if videos_pre:
+ max_frames = max(video.shape[0] for video in videos_pre)
+ padded_videos = []
+ for video in videos_pre:
+ current_frames, C, H, W = video.shape
+ padding_needed = max_frames - current_frames
+ if padding_needed > 0:
+ padding_tensor = torch.zeros((padding_needed, C, H, W), dtype=video.dtype)
+ padded_video = torch.cat([video, padding_tensor], dim=0)
+ padded_videos.append(padded_video)
+ else:
+ padded_videos.append(video)
+ videos = padded_videos
+ else:
+ videos = videos_pre
+
+ inputs = processor(text=texts, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+
+ return inputs, video_paths
+
+def _extract_tag(text, tag):
+    # HumanOmniV2-style outputs wrap sections in <context>/<think>/<answer>
+    # tags; these helpers (assumed output format) pull each section out.
+    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
+    return match.group(1).strip() if match else ""
+
+def extract_context(text):
+    return _extract_tag(text, "context")
+
+def extract_think(text):
+    return _extract_tag(text, "think")
+
+def extract_answer(text):
+    return _extract_tag(text, "answer")
+
+def generate_captions(batch_inputs, model, processor, model_path):
+ text_ids = model.generate(**batch_inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, do_sample=False, thinker_max_new_tokens=2048)
+
+ decoded_texts = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+
+ results = []
+ for text in decoded_texts:
+ model_generation = text.split("\nassistant\n")[-1]
+ results.append(model_generation)
+ return results
+
+def worker_proc(rank, gpu_id, model_path, all_video_paths, all_prompts, out_path, counter, lock, log_dir):
+ logger = setup_logger(rank, log_dir)
+ logger.info(f"Worker-{rank} started on GPU-{gpu_id}, process ID: {os.getpid()}")
+
+ device_map = {"": f"cuda:{gpu_id}"}
+
+ try:
+ logger.info("Loading model...")
+ model = Qwen2_5OmniForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map=device_map, attn_implementation="flash_attention_2")
+ model.disable_talker()
+ processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
+ logger.info("Model loaded successfully.")
+ except Exception as e:
+ logger.error(f"Failed to load model: {e}", exc_info=True)
+ return
+
+ all_prompts = list(all_prompts)
+
+ num_gpus = torch.cuda.device_count()
+ video_paths_subset = all_video_paths[rank::num_gpus]
+ prompts_subset = all_prompts[rank::num_gpus]
+ dataset = VideoDataset(video_paths_subset, prompts_subset)
+ collate_with_processor = partial(collate_fn, processor=processor)
+
+ dataloader = DataLoader(
+ dataset,
+ batch_size=2,
+ num_workers=1,
+ collate_fn=collate_with_processor,
+ pin_memory=True,
+ )
+
+ fout = open(out_path, "w", encoding="utf-8")
+
+ for batch_inputs, batch_video_paths in dataloader:
+ logger.info(f"Processing batch of {len(batch_video_paths)} videos. First video: {os.path.basename(batch_video_paths[0])}")
+ model_generations = None
+ try:
+ batch_inputs = {k: v.to(model.device) for k, v in batch_inputs.items()}
+ model_generations = generate_captions(batch_inputs, model, processor, model_path)
+
+ for video_path, model_generation in zip(batch_video_paths, model_generations):
+ video_id = os.path.basename(video_path).split(".mp4")[0]
+ if model_path.endswith("HumanOmniV2"):
+ context = extract_context(model_generation)
+ thinking = extract_think(model_generation)
+ answer = extract_answer(model_generation)
+ out_data = {"video_id": video_id, "context": context, "thinking": thinking, "caption": answer}
+ else:
+ out_data = {"video_id": video_id, "caption": model_generation}
+ fout.write(json.dumps(out_data, ensure_ascii=False) + "\n")
+
+ fout.flush()
+ with lock:
+ counter.value += len(batch_video_paths)
+ except Exception as e:
+ logger.error(f"CAUGHT PYTHON EXCEPTION on batch starting with {os.path.basename(batch_video_paths[0])}: {e}", exc_info=True)
+ finally:
+ del batch_inputs
+ del model_generations
+ gc.collect()
+ if torch.cuda.is_available():
+ torch.cuda.empty_cache()
+ fout.close()
+ logger.info(f"Worker finished processing all its data.")
+
+def run_multi_gpu(model_path, video_paths, prompts, final_out_path, num_gpus=8):
+ log_dir = os.path.join(os.path.dirname(final_out_path), "eval_logs")
+ os.makedirs(log_dir, exist_ok=True)
+ print(f"Log files will be saved in: {log_dir}")
+
+ processes = []
+ tmp_files = []
+ counter = Value("i", 0)
+ lock = Lock()
+ total = len(video_paths)
+
+ for rank in range(num_gpus):
+ gpu_id = rank % num_gpus
+ tmp_out = final_out_path.replace(".jsonl", f".part{rank}.jsonl")
+ tmp_files.append(tmp_out)
+
+ p = mp.Process(
+ target=worker_proc,
+ args=(rank, gpu_id, model_path, video_paths, prompts, tmp_out, counter, lock, log_dir),
+ name=f"Worker-{rank}"
+ )
+ p.start()
+ processes.append(p)
+
+ with tqdm(total=total, desc="Processing videos", ncols=100) as pbar:
+ last_count = 0
+ while True:
+ with lock:
+ current_count = counter.value
+ if current_count > last_count:
+ pbar.update(current_count - last_count)
+ last_count = current_count
+ if current_count >= total:
+ break
+ time.sleep(1)
+
+ for p in processes:
+ p.join()
+
+ print("Merging results...")
+ with open(final_out_path, "a", encoding="utf-8") as fout:
+ for tmp in tmp_files:
+ if os.path.exists(tmp):
+ with open(tmp, "r", encoding="utf-8") as fin:
+ fout.write(fin.read())
+ os.remove(tmp)
+ print(f"New results have been appended to {final_out_path}")
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(description="Evaluate a model and save results.")
+ parser.add_argument("--model_path", type=str, required=True, help="Path to the model checkpoint.")
+ parser.add_argument("--fout_path", type=str, required=True, help="Path to the output caption file.")
+ args = parser.parse_args()
+
+ prompt = "Describe every aspect of the video in full detail, covering all the information it contains."
+
+ mp.set_start_method("spawn", force=True)
+
+ completed_video_ids = set()
+ if os.path.exists(args.fout_path):
+ print(f"Found existing output file at: {args.fout_path}. Attempting to resume.")
+ with open(args.fout_path, 'r', encoding='utf-8') as f:
+ for line in f:
+ try:
+ data = json.loads(line)
+ if 'video_id' in data:
+ completed_video_ids.add(data['video_id'])
+ except json.JSONDecodeError:
+ print(f"Warning: Skipping corrupted line in existing output file: {line.strip()}")
+ print(f"Loaded {len(completed_video_ids)} completed video IDs from the existing file.")
+
+ if not os.path.isdir(video_dir):
+ print(f"Error: Provided video directory does not exist: {video_dir}")
+ sys.exit(1)
+
+ print(f"Scanning for .mp4 files in {video_dir}...")
+ all_video_paths = sorted(glob.glob(os.path.join(video_dir, "*.mp4")))
+
+ if not all_video_paths:
+ print(f"Error: No .mp4 files found in {video_dir}")
+ sys.exit(1)
+
+ print(f"Found {len(all_video_paths)} videos.")
+
+ videos_to_process = []
+ for path in all_video_paths:
+ video_id = os.path.basename(path).replace(".mp4", "")
+ if video_id not in completed_video_ids:
+ videos_to_process.append(path)
+
+ if not videos_to_process:
+ print("All videos have already been processed. Nothing to do. Exiting.")
+ sys.exit(0)
+
+ print(f"Identified {len(videos_to_process)} videos remaining to be processed for this run.\n")
+
+ prompts = [prompt] * len(videos_to_process)
+
+ run_multi_gpu(args.model_path, videos_to_process, prompts, args.fout_path, num_gpus=8)
\ No newline at end of file
diff --git a/eval_scripts/VDC/score_sglang_multi-threads.py b/eval_scripts/VDC/score_sglang_multi-threads.py
new file mode 100644
index 0000000000000000000000000000000000000000..26956bfe2db6e3e929312be1209f894c497e1c23
--- /dev/null
+++ b/eval_scripts/VDC/score_sglang_multi-threads.py
@@ -0,0 +1,195 @@
+import os
+os.environ['OPENBLAS_NUM_THREADS'] = '1'
+import json
+import random
+import ast
+import argparse
+from tqdm import tqdm
+from sglang import function, system, user, assistant, gen, set_default_backend, RuntimeEndpoint
+import concurrent.futures
+
+
+@function
+def gener_pred_response(s, pred_cap, q):
+ s += system(
+ "You are an intelligent chatbot designed for providing accurate answers to questions related to the content based on a detailed description of a video or image."
+ "Here's how you can accomplish the task:"
+ "------"
+ "##INSTRUCTIONS: "
+ "- Read the detailed description carefully.\n"
+ "- Answer the question only based on the detailed description.\n"
+ "- The answer should be a short sentence or phrase.\n"
+ )
+ s += user(
+ "Please provide accurate answers to questions related to the content based on a detailed description of a video or image:\n\n"
+ f"detailed description: {pred_cap}, question: {q}"
+ "DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide short but accurate answer."
+ )
+ s += assistant(gen("answer_1", max_tokens=256))
+
+
+@function
+def gener_pred_score(s, qa):
+ s += system(
+ "You are an intelligent chatbot designed for evaluating the correctness of generative outputs for question-answer pairs. "
+ "Your task is to compare the predicted answer with the correct answer and determine if they match meaningfully. Here's how you can accomplish the task:"
+ "------"
+ "##INSTRUCTIONS: "
+ "- Focus on the meaningful match between the predicted answer and the correct answer.\n"
+ "- Consider synonyms or paraphrases as valid matches.\n"
+ "- Evaluate the correctness of the prediction compared to the answer."
+ )
+ s += user(
+ "Please evaluate the following video-based question-answer pair:\n\n"
+ f"Question: {qa['question']}\n"
+ f"Correct Answer: {qa['answer']}\n"
+ f"Predicted Answer: {qa['response']}\n\n"
+ "Provide your evaluation only as a yes/no and score where the score is an integer value between 0 and 5, with 5 indicating the highest meaningful match. "
+ "Please generate the response in the form of a Python dictionary string with keys 'pred' and 'score', where value of 'pred' is a string of 'yes' or 'no' and value of 'score' is in INTEGER, not STRING."
+ "DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string. "
+ "For example, your response should look like this: {'pred': 'yes', 'score': 4}."
+ )
+ s += assistant(gen("answer_1", max_tokens=256))
+
+
+def process_video(video_id, pred, answer, result_gtqa_list):
+ tp_result_dict = {
+ 'id': video_id,
+ 'pred_caption': pred,
+ 'gt_caption': answer,
+ 'qa_tp_list': []
+ }
+
+ qa_list = []
+
+ with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
+ future_to_qa = {
+ executor.submit(
+ gener_pred_response.run,
+ pred_cap=pred,
+ q=qa_dict['question']
+ ): qa_dict
+ for qa_dict in result_gtqa_list
+ }
+
+ for future in concurrent.futures.as_completed(future_to_qa):
+ qa_dict = future_to_qa[future]
+ try:
+ state = future.result()
+ qa_list.append({
+ "question": qa_dict['question'],
+ "answer": qa_dict['answer'],
+ "response": state["answer_1"]
+ })
+ except Exception as e:
+ print(f"[ERROR] generate response failed for {video_id}: {e}")
+
+ with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
+ future_to_qa = {
+ executor.submit(
+ gener_pred_score.run,
+ qa=qa
+ ): qa
+ for qa in qa_list
+ }
+
+ for future in concurrent.futures.as_completed(future_to_qa):
+ qa = future_to_qa[future]
+ try:
+ state = future.result()
+ response_dict = ast.literal_eval(state["answer_1"])
+ qa.update(response_dict)
+ tp_result_dict['qa_tp_list'].append(qa)
+ except Exception as e:
+ print(f"[ERROR] score evaluation failed for {video_id}: {e}")
+
+ total_score, total_acc = 0, 0
+ for qa in qa_list:
+ total_score += float(qa.get('score', 0))
+ if qa.get('pred') == 'yes':
+ total_acc += 1
+
+ tp_score = total_score / len(qa_list) if qa_list else 0
+ tp_acc = total_acc / len(qa_list) if qa_list else 0
+
+ return tp_score, tp_acc, tp_result_dict
+
+
+def main():
+ parser = argparse.ArgumentParser(description="Process VDC results and evaluate captions.")
+ parser.add_argument('--raw_file', type=str, help='Path to the raw input JSON file.')
+ parser.add_argument('--output_file', type=str, help='Path to the output JSONL file.')
+    parser.add_argument('--tp_qa_path', type=str, default='path_to/aurora/post_eval/detailed.jsonl', help='Path to the TP QA JSONL file.')
+ parser.add_argument('--num_workers', type=int, default=4, help='Number of parallel workers for processing videos.')
+ args = parser.parse_args()
+
+ tp_gt_qa_dict = {}
+ with open(args.tp_qa_path, 'r') as file:
+ for line in file:
+ data = json.loads(line)
+ tp_gt_qa_dict.update(data)
+
+ captions_dict = {}
+ captions_path = 'path_to/aurora/post_eval/VDC_1k_captions.jsonl'
+ with open(captions_path, 'r') as file:
+ for line in file:
+ data = json.loads(line)
+ key = data['video_id']
+ caption = data['captions']['detailed_caption']
+ captions_dict[key] = caption
+
+ set_default_backend(RuntimeEndpoint("http://localhost:30000"))
+
+ preds_dict = {}
+ with open(args.raw_file, "r") as f:
+ for line in f:
+ data = json.loads(line)
+ preds_dict[data["video_id"]] = data["caption"]
+
+ result_list, tp_scores, tp_accs = [], [], []
+
+ with concurrent.futures.ThreadPoolExecutor(max_workers=args.num_workers) as executor:
+ future_to_vid = {
+ executor.submit(
+ process_video,
+ video_id,
+ pred,
+ captions_dict[video_id],
+ tp_gt_qa_dict.get(video_id, [])
+ ): video_id
+ for video_id, pred in preds_dict.items()
+ if video_id in tp_gt_qa_dict
+ }
+
+ for future in tqdm(concurrent.futures.as_completed(future_to_vid), total=len(future_to_vid)):
+ video_id = future_to_vid[future]
+ try:
+ tp_score, tp_acc, tp_result_dict = future.result()
+ if tp_score and tp_acc:
+ tp_scores.append(tp_score)
+ tp_accs.append(tp_acc)
+ result_list.append({
+ 'id': video_id,
+ 'tp_score': tp_score,
+ 'tp_acc': tp_acc,
+ 'qa_tp_list': tp_result_dict['qa_tp_list']
+ })
+ except Exception as e:
+ print(f"[ERROR] processing video {video_id} failed: {e}")
+
+ tp_score = sum(tp_scores) / len(tp_scores) if tp_scores else 0
+ tp_acc = sum(tp_accs) / len(tp_accs) if tp_accs else 0
+
+ os.makedirs(os.path.dirname(args.output_file), exist_ok=True)
+ with open(args.output_file, 'w') as file:
+ for item in result_list:
+ file.write(json.dumps(item) + '\n')
+ file.write(json.dumps({'tp_score': tp_score, 'tp_acc': tp_acc}) + '\n')
+
+ print(f"Results saved to {args.output_file}")
+ print(f"Overall TP Score: {tp_score}")
+ print(f"Overall TP Accuracy: {tp_acc}")
+
+
+if __name__ == "__main__":
+ main()
\ No newline at end of file
diff --git a/eval_scripts/WorldSense/WorldSense_pipeline.sh b/eval_scripts/WorldSense/WorldSense_pipeline.sh
new file mode 100644
index 0000000000000000000000000000000000000000..197cac3aee6fc0516a5a490b8cd46ee7c0c30731
--- /dev/null
+++ b/eval_scripts/WorldSense/WorldSense_pipeline.sh
@@ -0,0 +1,61 @@
+#!/bin/bash
+MODEL_PATHS=(
+ "path_to_AVoCaDO"
+)
+RESULTS_DIR="$1"
+
+ORIGINAL_FILE="eval_scripts/WorldSense/worldsense_qa.json"
+MERGED_FILE="$RESULTS_DIR/captioned_results.json"
+if [ ! -f "$MERGED_FILE" ]; then
+ echo "MERGED_FILE not found. Creating from ORIGINAL_FILE..."
+ cp "$ORIGINAL_FILE" "$MERGED_FILE"
+fi
+CAPTION_FILES_TO_MERGE=()
+CAPTION_KEYS=()
+
+# Step 1: caption generation
+for model_path in "${MODEL_PATHS[@]}"; do
+ CLEAN_PATH="${model_path%/}"
+ model_name=$(basename "$CLEAN_PATH")
+
+ caption_file="$RESULTS_DIR/${model_name}_caption.jsonl"
+ echo "Output caption file will be: $caption_file"
+
+ python eval_scripts/WorldSense/generate_caption.py \
+ --model_path "$model_path" \
+ --fout_path "$caption_file"
+
+ if [ -f "$caption_file" ]; then
+ CAPTION_FILES_TO_MERGE+=("$caption_file")
+ CAPTION_KEYS+=("${model_name}_caption")
+ else
+ echo "Error: Caption file $caption_file not generated for model $model_path."
+ exit 1
+ fi
+done
+
+# Step 2: merge generated caption files
+echo "Merging all generated caption files..."
+python eval_scripts/WorldSense/merge_captions.py \
+ --original_file "$MERGED_FILE" \
+ --caption_files "${CAPTION_FILES_TO_MERGE[@]}" \
+ --merged_file "$MERGED_FILE"
+
+# Step 3: evaluation
+python eval_scripts/WorldSense/evaluation.py \
+ --merged_file "$MERGED_FILE" \
+ --caption_keys "${CAPTION_KEYS[@]}"
+
+# Step 4: analysis and save evaluation results
+for caption_key in "${CAPTION_KEYS[@]}"; do
+ echo "Running analysis for caption key: $caption_key"
+
+ result_file="$RESULTS_DIR/${caption_key}_result.jsonl"
+ answer_key="${caption_key//_caption/_resp}"
+
+ if [ -f "$result_file" ]; then
+ python eval_scripts/WorldSense/analysis.py --result_file_path "$result_file" --answer_key "$answer_key"
+ else
+ echo "Warning: Result file '$result_file' not found for analysis."
+ fi
+done
diff --git a/eval_scripts/WorldSense/analysis.py b/eval_scripts/WorldSense/analysis.py
new file mode 100644
index 0000000000000000000000000000000000000000..56a7e1c123acc5625b8b55ab40920009c7b3bd75
--- /dev/null
+++ b/eval_scripts/WorldSense/analysis.py
@@ -0,0 +1,18 @@
+import pandas as pd
+import argparse
+import os
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(description="Analyze the evaluation results.")
+ parser.add_argument("--result_file_path", type=str, required=True, help="Path to the result file (.jsonl).")
+ parser.add_argument("--answer_key", type=str, required=True, help="The key for the model's response in the result file.")
+
+ args = parser.parse_args()
+
+ data = pd.read_json(args.result_file_path, lines=True)
+
+ acc = (data['answer'].str.upper() == data[args.answer_key].str.upper()).mean()
+ print(f"Accuracy for {args.answer_key} is: {acc:.2%}")
+
+ with open(f"{os.path.dirname(args.result_file_path)}/{args.answer_key}.log", "w", encoding='utf-8') as fout:
+ fout.write(f"Accuracy for {args.answer_key} is: {acc:.2%}")
\ No newline at end of file
diff --git a/eval_scripts/WorldSense/evaluation.py b/eval_scripts/WorldSense/evaluation.py
new file mode 100644
index 0000000000000000000000000000000000000000..c299914ce36cb6d33b7d6f9f0283cb971cdb4299
--- /dev/null
+++ b/eval_scripts/WorldSense/evaluation.py
@@ -0,0 +1,227 @@
+###
+# Use an LLM to answer questions about each video based on a specific caption.
+###
+import os
+import sys
+import time
+import json
+import traceback
+import multiprocessing
+import random
+import argparse
+
+import numpy as np
+from google import genai
+from google.genai import types
+from google.genai.types import (
+    HarmBlockThreshold,
+    HarmCategory,
+    SafetySetting,
+)
+
+os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = ''
+LOCATION = "global"
+user_info_path = ''
+user_info = json.load(open(user_info_path))
+PROJECT_ID = user_info['project_id']
+MODEL = "gemini-2.5-pro"
+
+safety_settings = [
+ SafetySetting(category=HarmCategory.HARM_CATEGORY_HATE_SPEECH, threshold=HarmBlockThreshold.OFF),
+ SafetySetting(category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, threshold=HarmBlockThreshold.OFF),
+ SafetySetting(category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, threshold=HarmBlockThreshold.OFF),
+ SafetySetting(category=HarmCategory.HARM_CATEGORY_HARASSMENT, threshold=HarmBlockThreshold.OFF)
+]
+
+def set_seed(seed):
+    np.random.seed(seed)
+    random.seed(seed)
+
+SEED = 42
+set_seed(SEED)
+
+CONFIG = types.GenerateContentConfig(
+    temperature=0,
+    top_p=0.001,
+    thinking_config=types.ThinkingConfig(
+        include_thoughts=True,
+        thinking_budget=512
+    ),
+    safety_settings=safety_settings,
+    seed=SEED,
+    system_instruction='''
+    You are a precise QA assistant. Your task is to answer multiple-choice questions based ONLY on the video caption provided.
+    Do not use any outside knowledge or assumptions—your answer must strictly reflect information from the caption.
+    Always output only the capital letter corresponding to your choice (e.g., A, B, C, D).
+    If the caption does not provide enough information to answer the question, output "N/A" instead.
+    '''
+)
+client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
+
+def caption2json(json_path, caption_path):
+ with open(json_path, 'r', encoding='utf-8') as f:
+ json_data = json.load(f)
+ model = os.path.basename(caption_path).split("_")[0]
+
+ captions = {}
+ with open(caption_path, 'r', encoding='utf-8') as f:
+ for line in f:
+ if not line.strip():
+ continue
+ item = json.loads(line)
+ for vid, cap in item.items():
+ captions[vid] = cap
+
+ for entry in json_data:
+ vid = entry.get("video_id")
+ if vid in captions:
+ entry[f"{model}_caption"] = captions[vid]
+
+ with open(f"{model}_merge_data.json", 'w', encoding='utf-8') as f:
+ json.dump(json_data, f, ensure_ascii=False, indent=2)
+
+ print(f"merged successfully, the output file is {model}_merge_data.json")
+
+
+def generate(prompt):
+ contents = [prompt]
+
+ answer, thinking = None, None
+ max_retries = 10
+
+ for i in range(max_retries):
+ try:
+ response = client.models.generate_content(
+ model=MODEL,
+ contents=contents,
+ config=CONFIG
+ )
+
+ answer_parts, thought_parts = [], []
+ for part in response.candidates[0].content.parts:
+ if not getattr(part, "text", None):
+ continue
+ if getattr(part, "thought", False):
+ thought_parts.append(part.text)
+ else:
+ answer_parts.append(part.text)
+ answer = "\n".join(answer_parts).strip()
+ thinking = "\n".join(thought_parts).strip()
+ if answer:
+ break
+ else:
+ print(f"[WARN] Attempt {i+1}: empty answer, retrying ... ")
+ time.sleep(3)
+ except Exception as e:
+ print(f"[ERROR] Attempt {i+1} failed: {e}")
+ traceback.print_exc()
+ time.sleep(3)
+ if not answer:
+ return None, None
+ print(answer)
+ return answer, thinking
+
+def worker(task):
+ vid, video_duration, question, choices, answer, caption_key, answer_key, caption = task
+    choices_text = "\n".join(choices)
+ prompt_filled = f'''
+Here is the video caption:
+"{caption}"
+
+Question: {question}
+Choices:
+ {choices_text}'''
+ try:
+ resp, _ = generate(prompt_filled)
+ return {
+ "video_id": vid,
+ "video_duration": video_duration,
+ "question": question,
+ "choices": choices,
+ "answer": answer,
+ caption_key: caption,
+ answer_key: resp
+ }
+ except Exception as e:
+ traceback.print_exc()
+ return {
+ "video_id": vid,
+ "video_duration": video_duration,
+ "question": question,
+ "choices": choices,
+ "answer": answer,
+ caption_key: caption,
+ answer_key: None
+ }
+
+def run_multiprocess_tasks(tasks, num_processes=None, fout_path=None):
+ if num_processes is None:
+ num_processes = multiprocessing.cpu_count()
+
+ with multiprocessing.Pool(processes=num_processes) as pool:
+ results = pool.map(worker, tasks)
+
+ if fout_path:
+ with open(fout_path, "w", encoding='utf-8') as f:
+ for item in results:
+ f.write(json.dumps(item, ensure_ascii=False) + '\n')
+ f.flush()
+ return results
+
+
+def eval_worldsense_caption_qas(file_path, caption_keys=["omni_caption"]):
+ with open(file_path, 'r', encoding='utf-8') as f:
+ data = json.load(f)
+
+ all_results = []
+ for caption_key in caption_keys:
+ print(f"### Now processing {caption_key}")
+ answer_key = caption_key.replace("_caption", "_resp")
+ fout_path = f"{os.path.dirname(file_path)}/{caption_key}_result.jsonl"
+
+ tasks = []
+ for vid, video_info in data.items():
+ video_duration = video_info["video_duration"]
+ caption = video_info[caption_key]
+ task_keys = [k for k in video_info.keys() if k.startswith("task")]
+ for tk in task_keys:
+ qinfo = video_info[tk]
+ task_item = (
+ vid,
+ video_duration,
+ qinfo["question"],
+ qinfo["candidates"],
+ qinfo["answer"],
+ caption_key,
+ answer_key,
+ caption
+ )
+ tasks.append(task_item)
+
+ results = run_multiprocess_tasks(tasks, num_processes=20, fout_path=fout_path)
+ all_results.extend(results)
+
+ return all_results
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(description="Evaluate captions using Gemini.")
+ parser.add_argument("--merged_file", type=str, required=True, help="Path to the merged caption file.")
+ parser.add_argument(
+ "--caption_keys",
+ type=str,
+ nargs='+',
+ required=True,
+ help="A list of caption keys to evaluate"
+ )
+ args = parser.parse_args()
+
+ eval_worldsense_caption_qas(args.merged_file, caption_keys=args.caption_keys)
\ No newline at end of file
diff --git a/eval_scripts/WorldSense/generate_caption.py b/eval_scripts/WorldSense/generate_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..d63031970722a8a6219a34152d9b2aae2c5898b4
--- /dev/null
+++ b/eval_scripts/WorldSense/generate_caption.py
@@ -0,0 +1,141 @@
+import os
+import torch
+from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
+from qwen_omni_utils import process_mm_info
+import argparse
+import json
+from tqdm import tqdm
+from pathlib import Path
+import multiprocessing as mp
+import traceback
+import random
+import glob
+
+VIDEO_MAX_PIXELS = 401408 # 512*28*28
+VIDEO_TOTAL_PIXELS = 20070400 # 512*28*28*50
+USE_AUDIO_IN_VIDEO = True
+video_base_dir = "path_to_WorldSense_videos"
+os.environ['VIDEO_TOTAL_PIXELS'] = str(VIDEO_TOTAL_PIXELS)  # cap on total pixels across sampled frames
+
+def chat(file_path, prompt, model, processor, model_path, max_new_tokens=2048):
+
+ conversation = [
+ {
+ "role": "system",
+ "content": [
+ {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
+ ],
+ },
+ {
+ "role": "user",
+ "content": [
+ {
+ "type": "video",
+ "video": file_path,
+ "max_pixels": VIDEO_MAX_PIXELS,
+ "max_frames": 256
+ },
+ {
+ "type": "text",
+ "text": prompt
+ },
+ ],
+ },
+ ]
+
+ text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
+ audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+ inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+ inputs = inputs.to(model.device).to(model.dtype)
+
+ text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, do_sample=False, thinker_max_new_tokens=max_new_tokens)
+ text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+ model_generation = text.split("\nassistant\n")[-1]
+
+ return model_generation
+
+def worker_proc(rank, gpu_id, model_path, video_paths, prompt, out_path):
+ device_map = {"": f"cuda:{gpu_id}"}
+
+ model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
+ model_path,
+ torch_dtype=torch.bfloat16,
+ device_map=device_map,
+ attn_implementation="flash_attention_2",
+ )
+ model.disable_talker()
+ processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
+
+ fout = open(out_path, "w", encoding="utf-8")
+
+ for video_path in tqdm(video_paths, desc=f"Worker-{rank}[GPU-{gpu_id}]"):
+ try:
+ model_generation = chat(video_path, prompt, model, processor, model_path)
+
+ video_id = os.path.basename(video_path).split(".mp4")[0]
+ out_data = {
+ "video_id": video_id,
+ "caption": model_generation,
+ }
+
+ fout.write(json.dumps(out_data, ensure_ascii=False) + "\n")
+ fout.flush()
+ except Exception as e:
+ print(f"[Worker-{rank}] Error on {video_path}: {e}")
+ traceback.print_exc()
+
+ fout.close()
+ print(f"[Worker-{rank}] Done, wrote results to {out_path}")
+
+def run_multi_gpu(model_path, video_paths, prompt_list, final_out_path, num_gpus=8):
+ chunk_size = len(video_paths) // num_gpus + 1
+ chunks = [video_paths[i:i+chunk_size] for i in range(0, len(video_paths), chunk_size)]
+
+ processes = []
+ tmp_files = []
+
+ for rank, chunk in enumerate(chunks):
+ gpu_id = rank % num_gpus
+ tmp_out = final_out_path.replace(".jsonl", f".part{rank}.jsonl")
+ tmp_files.append(tmp_out)
+ prompt = random.choice(prompt_list)
+
+ p = mp.Process(
+ target=worker_proc,
+ args=(rank, gpu_id, model_path, chunk, prompt, tmp_out)
+ )
+ p.start()
+ processes.append(p)
+
+ for p in processes:
+ p.join()
+
+ with open(final_out_path, "w", encoding="utf-8") as fout:
+ for tmp in tmp_files:
+ with open(tmp, "r", encoding="utf-8") as fin:
+ for line in fin:
+ fout.write(line)
+ os.remove(tmp)
+
+ print(f"All results merged into {final_out_path}")
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(description="Evaluate a model and save results.")
+ parser.add_argument("--model_path", type=str, required=True, help="Path to the model checkpoint.")
+ parser.add_argument("--fout_path", type=str, required=True, help="Path to the output caption file")
+ args = parser.parse_args()
+ mp.set_start_method("spawn", force=True)
+
+ video_paths = glob.glob(os.path.join(video_base_dir, "**", "*.mp4"), recursive=True)
+
+ prompt_list = [
+ "Provide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned.",
+ "Thoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated.",
+ "Please describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions.",
+ "Offer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync.",
+ "Describe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized.",
+ "Please provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so.",
+ "Give a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ ]
+
+ run_multi_gpu(args.model_path, video_paths, prompt_list, args.fout_path, num_gpus=8)
diff --git a/eval_scripts/WorldSense/merge_captions.py b/eval_scripts/WorldSense/merge_captions.py
new file mode 100644
index 0000000000000000000000000000000000000000..23aa53a0aa8823fec26310af9a7aa080b0cb8247
--- /dev/null
+++ b/eval_scripts/WorldSense/merge_captions.py
@@ -0,0 +1,37 @@
+import os
+import json
+import argparse
+
+def merge_captions(original_file_path, caption_file_path, merged_file_path):
+    print(f"Now processing {caption_file_path}")
+ jsonl_data = {}
+ file_key = os.path.splitext(os.path.basename(caption_file_path))[0]
+ with open(caption_file_path, 'r', encoding='utf-8') as f:
+ for line in f:
+ item = json.loads(line.strip())
+ vid = item["video_id"]
+ caption = item["caption"]
+ jsonl_data[vid] = caption
+
+ with open(original_file_path, 'r', encoding='utf-8') as f:
+ data = json.load(f)
+
+ for vid, entry in data.items():
+ if vid in jsonl_data:
+ if file_key not in entry:
+ entry[file_key] = jsonl_data[vid]
+
+ with open(merged_file_path, 'w', encoding='utf-8') as f:
+ json.dump(data, f, indent=2, ensure_ascii=False)
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(description="Merge caption files into a single JSON file.")
+ parser.add_argument("--original_file", type=str, required=True, help="Path to the original JSON file.")
+ parser.add_argument("--caption_files", type=str, nargs='+', required=True, help="List of caption files to merge.")
+ parser.add_argument("--merged_file", type=str, required=True, help="Path to save the merged JSON file.")
+
+ args = parser.parse_args()
+
+    for i, cf in enumerate(args.caption_files):
+        # After the first pass, read from the merged file so previously merged
+        # caption keys accumulate instead of being overwritten from the original.
+        src = args.original_file if i == 0 else args.merged_file
+        merge_captions(src, cf, args.merged_file)
diff --git a/eval_scripts/WorldSense/worldsense_qa.json b/eval_scripts/WorldSense/worldsense_qa.json
new file mode 100644
index 0000000000000000000000000000000000000000..3be907e280023503c56c9f7bd6278f43390e2aff
--- /dev/null
+++ b/eval_scripts/WorldSense/worldsense_qa.json
@@ -0,0 +1,57916 @@
+{
+ "dvOkwKAs": {
+ "video_id": "dvOkwKAs",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A young man and woman, both wearing white short-sleeved shirts, sit facing the camera playing guzhengs. The woman's shirt has a small logo on the chest and a red sash around her waist. They pluck the strings with picks attached to their right hands.\u00a0 The background features a wall with a colorful mural incorporating celestial and organic imagery.\u00a0 Behind the woman is a partially visible, closed metal gate.The camera pans to the right, revealing two more young women seated and playing guzhengs.\u00a0They wear similar white shirts, one with a light teal sash and the other with a red one. They too use picks on their right hands. A staircase is partially visible in the background. The camera remains stationary as the four students continue to play.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the metal roller door relative to the woman wearing white in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. To the left of the woman wearing white.",
+ "B. To the right of the woman wearing white.",
+ "C. In front of the woman wearing white.",
+ "D. Behind the woman wearing white."
+ ]
+ }
+ },
+ "UYkFSXsh": {
+ "video_id": "UYkFSXsh",
+ "video_duration": "119s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a light background and a logo at the top center. Below the logo is a youtube subscribe button. At the bottom of the screen is a text overlay. The video transitions to a scene that is bordered by a multi-colored frame. In the center of the screen is a man holding a microphone. There is a white backdrop behind him with text at the top, and there is a christmas tree on the right. A text banner along with several social media icons appears at the bottom of the video. The scene then changes to show an audience of people sitting. The scene then returns to the man speaking, and he gestures with his hands. He then points at the camera and continues talking. The camera then continues to focus on him as he gestures and talks. During this time a variety of banners appear at the bottom of the screen with different advertisements. The video then ends with a final scene of him standing and facing the camera.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "At the end of the video, how did the volume of the man speaking in the black suit change, if at all?",
+ "answer": "B",
+ "candidates": [
+ "A. It is unclear to me.",
+ "B. The volume increased from soft to loud.",
+ "C. The volume decreased from loud to soft."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video do the audience members cheer?",
+ "answer": "D",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Throughout the video.",
+ "C. At the end of the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "hHwdqJxc": {
+ "video_id": "hHwdqJxc",
+ "video_duration": "408s",
+ "duration": "6-8min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video contains 14 distinct clips that are stitched together.The first clip shows a person wearing a white hard hat with the text \"THIESS\" and the name \"PHIL\" printed on it, sunglasses, and an orange shirt with a white stripe. They appear to be standing in front of a large industrial construction site, with various metal structures, scaffolding, and equipment visible. There are also orange safety cones on the ground.The second clip shows a person wearing a dark polo shirt with a red logo, and dark pants, standing in front of several garages with the text \"MARUSSIA F1 TEAM\" on them. Multiple people and several racing cars can be seen in the garage.The third clip shows race cars entering pit stops. A race team is present, and appear to be working on the cars. The first car is silver, the second is red and blue, the third is red, and the fourth is white and blue.The fourth clip shows the same person as in the second clip standing in the same location. A couple of other people appear in the background.The fifth clip shows a race car driving on a race track. People can be seen in the stands.The sixth clip shows a race car driving on a race track.The seventh clip shows the same person as in the second and fourth clips standing in the same location. Multiple people wearing orange uniforms appear in the background.The eighth clip shows the same person as in the second, fourth, and seventh clips standing in the same location.The ninth clip shows the same person as in the first clip wearing a plaid shirt, in front of a building with the text \"SCANIA\" on it.The tenth clip shows a person wearing a white hard hat with the text \"Schlumberger\" on it, safety glasses, and a blue uniform, standing in front of a white door and a gray wall, with various safety signs also visible.The eleventh clip shows the same person as in the ninth clip wearing a blue shirt, in front of a balcony. 
There are several buildings in the background.The twelfth clip shows the same person as in the first clip wearing a white hard hat, sunglasses, and an orange shirt in front of a body of water. There is construction equipment, and a partially constructed bridge visible.The thirteenth clip shows the same person as in the second, fourth, seventh and eighth clips wearing a red polo shirt with a black stripe on the side, standing in front of three framed pieces of wall art.The fourteenth clip shows four people standing together in front of a framed image of a vehicle.The fifteenth clip shows a person wearing a yellow safety vest and glasses in a car. A building and a road are visible in the background.The sixteenth clip shows a person wearing a gray hoodie against a plain white background.The seventeenth clip shows a person wearing a black jacket in front of a red staircase.The eighteenth clip shows a person with a beard, wearing a blue hoodie, against a plain orange background.The nineteenth clip shows a person wearing glasses in front of a tractor, followed by a brief shot of the same person in what appears to be a workshop, holding a piece of equipment. The clip concludes with a shot of the person working on two monitors, with a phone to the right of the monitors.The twentieth clip shows a person wearing a dark jacket and a white shirt, standing in front of a paved road, a tree, shrubs, and a modern building with an angular facade.The twenty-first clip shows a person wearing a white t-shirt, with a picture in the background, in a room, with a bed to the left and a dresser on the right.The twenty-second clip shows a person wearing a military uniform in front of a ship with the number \"150\" on its hull.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the woman's voice appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Four.",
+ "D. Three."
+ ]
+ }
+ },
+ "rgAZIyvn": {
+ "video_id": "rgAZIyvn",
+ "video_duration": "100s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A single person is visible in the video. They are holding a dark colored tablet and are wearing a white and light blue shirt. They also have a white object with black details wrapped around their neck. They are standing or sitting in front of a shelf with books and a dark brown door behind them. At certain points, they are in front of a shelf that is a light color. On this shelf is a bowl that has something that is a light reddish purple color in it, as well as a green and orange cylindrical object. The person in the video is moving their hands, looking at their tablet, and speaking.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurs in the little girl's voice in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It changes from loud to soft.",
+ "B. It changes from soft to loud.",
+ "C. It remains very quiet.",
+ "D. It remains very loud."
+ ]
+ }
+ },
+ "KZsaltBw": {
+ "video_id": "KZsaltBw",
+ "video_duration": "42s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video starts with the text \u201cUNPREPARED PASSENGERS\u201d on a black screen. The scene then cuts to an airport security checkpoint. Three security officers wearing blue uniforms are standing near an X-ray machine, with a conveyor belt going into it. A man in a light-colored shirt walks up to the conveyor belt and places a dark-colored item onto the belt. A woman in a tan coat walks towards the belt. Then a woman with blond hair in a black shirt walks to the conveyor belt. A text bubble appears with the text \u201cShe has a firearm in her bag\u201d near the woman with the tan coat. A security officer gestures, and a text bubble appears with the text \u201cHe\u2019s calling a supervisor.\u201d Another security officer speaks on the phone. Then, a text bubble appears with the text \u201cNow they\u2019re calling the police\u201d. The woman with the blond hair places an item on the conveyor belt. A text bubble appears with the text \u201cMeanwhile, everyone has to wait.\u201d At 4:58, the text bubble changes to \"It takes about 4 minutes for the officer to arrive.\" The security officer in the blue shirt speaks to the woman in the tan jacket. Then the text appears with \"Now they search her bag.\" The woman in the tan coat puts her bag in a bin. The officer searches it. Text appears with \u201cThey found the firearm and they will escort her out\u201d and an area is circled on the security officer\u2019s shirt, where she is holding a small red object. The video shifts to showing other people in the line.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the overall atmosphere of the entire video?",
+ "answer": "D",
+ "candidates": [
+ "A. Plain.",
+ "B. Lively.",
+ "C. Relaxed.",
+ "D. Tense."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the process followed by the security officer after discovering a firearm?",
+ "answer": "C",
+ "candidates": [
+ "A. Open and search the bag, alert authorities, and then call the supervisor.",
+ "B. Alert authorities, call the supervisor, and then open and search the bag.",
+ "C. Call the supervisor, alert authorities, and then open and search the bag.",
+ "D. Open and search the bag, call the supervisor, and then alert authorities."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What did the police officer do after confirming there was a gun in the bag in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Escort the woman away.",
+ "B. Speak with the woman to provide education.",
+ "C. Confiscate the gun.",
+ "D. Detain the woman."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the color of the box used for holding passengers' items during security checks in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Transparent.",
+ "B. Gray.",
+ "C. White.",
+ "D. Black."
+ ]
+ }
+ },
+ "kenhCuOM": {
+ "video_id": "kenhCuOM",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person in a long-sleeved black shirt plays a black grand piano on a stage in an auditorium with wood-paneled walls. A second person, seated to the right at a wooden table, plays a guzheng, a Chinese stringed instrument. The second person wears a long-sleeved red shirt. Behind them, a large screen displays a painting of a bird on a branch. To the right of the screen is a retractable banner with some text and logos. In the side area of the stage are two unoccupied chairs and an unseen instrument leaning against the wall.\u00a0 The audience sits in rows of chairs facing the stage. Some audience members wear hats. A few heads in the audience shift or turn slightly. Someone in the audience takes a photo or video with a phone. The two musicians continue to play their respective instruments.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What is the instrument on the right in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Guitar.",
+ "B. Cello.",
+ "C. Banjo.",
+ "D. Guzheng."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the woman in red do to the guzheng during the performance in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Taps the guzheng with fingers.",
+ "B. Strikes the strings with mallets.",
+ "C. Strikes the guzheng with her hand.",
+ "D. Scratches the strings with fingernails."
+ ]
+ }
+ },
+ "TGqYCmQM": {
+ "video_id": "TGqYCmQM",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A young woman and a man stand facing forward, playing violins. The woman is positioned on the left side of the frame and the man is on the right, slightly behind a black music stand.\u00a0The woman wears a short-sleeved, light-colored dress with a floral pattern in shades of blue and purple. The man wears a light blue, short-sleeved, button-down shirt and light-colored pants. They both hold their violins under their chins and move their bows across the strings. The woman's hair is dark and shoulder length. The man wears glasses.\u00a0The background is a light brown wood-paneled wall. A window with small square panes is visible in the center of the frame, behind the music stand. The lighting in the room appears soft and warm. They continue to play the violins for the duration of the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "How many instruments are in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. One.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the most likely emotional tone conveyed by the music in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The emotion in the video is joy and excitement.",
+ "B. The emotion in the video is melancholy and pain.",
+ "C. The emotion in the video is joy and relaxation.",
+ "D. The emotion in the video is calm and melancholy."
+ ]
+ }
+ },
+ "JWilJddD": {
+ "video_id": "JWilJddD",
+ "video_duration": "154s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a graphic of the earth. The screen then shows the interior of a train car with several women wearing brightly colored clothes. The camera angle then shifts to show the women on the train, and then a train platform outside the train. Next, the camera shows the interior of a train car with multiple people, some sitting and some walking through the aisle. The video then shows the exterior of the train at a train platform. Next, the interior of a train car is shown with 3 women sitting on a bench. The exterior of the train is shown with people climbing into it. The screen shows the exterior of the train with many people standing around the open train doors. Then a woman is shown sitting on the train car speaking. The camera then shows what appears to be an office with multiple people working at desks. The video shifts to a woman speaking. It then shows the exterior of the train, then multiple people walking on a path near green bushes. Several women are shown in an office using computers and wearing headsets. Then a woman is shown speaking. The camera shifts to three women inside the train car speaking. Then the video shows women in the train car talking. The video then shows a woman reading a newspaper inside a train car. The camera shows women inside a train car looking at a cellphone. Then the screen shows a train car filled with a large number of people standing. The camera then shifts to men in a large crowd on a train platform. The video shows a woman in a train car looking through a window. Lastly, the video shows a group of women inside a train car.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the profession of the man with a beard wearing a suit in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A news anchor.",
+ "B. A company spokesperson.",
+ "C. A railway builder.",
+ "D. A train ticket seller."
+ ]
+ }
+ },
+ "ksQrDUgx": {
+ "video_id": "ksQrDUgx",
+ "video_duration": "107s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a montage of clips featuring different men. In one clip, two men are wrestling; one is wearing black shorts and is shirtless, while the other is wearing a silver and white outfit with white wrestling boots. In several other clips, one man is wearing a suit and tie. He is seen speaking in front of a red and white background, in a room with a blue wall, and in front of a dark background. In one scene, he is wearing headphones. In one clip, two men are standing shoulder to shoulder, one in a light blue suit and one in a white suit. In another clip, a man is wearing a dark shirt and is sitting down. The background is blurry, but a \"KEG HOUSE & BAR\" sign is partially visible. Another clip shows a shirtless man standing on a wrestling ring with red ropes. In another clip, the same man is holding a golden belt over his head. He is wearing colored confetti and a black wristband. Another clip shows the same man with a man wearing a black and white top holding belts over their heads. Another clip shows the word \u201cWrestlemania\u201d with a bald man with dark wristbands. Another clip shows the words \u201cUndertaker Versus Undertaker\u201d with a bald man with dark wristbands. Two men are posed holding belts, one in white wrestling attire and one in a dark red outfit. The final clip shows a wrestling ring with two men wrestling and a referee nearby. There is also text displayed in each clip.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will McMahon return to WWE?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. No.",
+ "D. None of the above."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "According to the video content, what is the correct sequence? (a) Vince McMahon was erased from WWE history, (b) Chris Benoit's malicious actions, (c) Vince McMahon was charged, (d) WWE promoted matches featuring other wrestlers.",
+ "answer": "B",
+ "candidates": [
+ "A. (c)(a)(d)(b)",
+ "B. (c)(a)(b)(d)",
+ "C. (b)(c)(a)(d)",
+ "D. (d)(b)(c)(a)"
+ ]
+ }
+ },
+ "iEWqgNPY": {
+ "video_id": "iEWqgNPY",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are visible in the video. A woman on the left plays a silver flute while looking at a music stand with sheet music on it. She is wearing a black shirt. To the right, a man with dark hair and glasses plays an upright bass. He is wearing a button-down shirt with a checkered pattern. Both are in front of a light wall, with a green area on the right side of the frame. They both appear to be moving their instruments as if playing music.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man do while playing the double bass?",
+ "answer": "A",
+ "candidates": [
+ "A. Strike the body of the double bass with a hand.",
+ "B. Scratch the body of the double bass with fingernails.",
+ "C. Tap the double bass with fingers.",
+ "D. Use a mallet to strike the strings."
+ ]
+ }
+ },
+ "lJSKEFPd": {
+ "video_id": "lJSKEFPd",
+ "video_duration": "51s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video begins with a title card with images of people holding stringed instruments. Text is visible on the card as well, along with some dotted curved lines. The next shot features 3 people sitting and playing stringed instruments, then the image dissolves into 4 people sitting and playing similar stringed instruments. There are papers in front of one of the 4 people. The last part of the video features 3 people sitting and playing stringed instruments with text overlayed at the top of the screen, then the shot shifts to a closeup of an instrument and then an individual playing an instrument, then finally ending with another title card that has text and a QR code.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What are the four women doing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The woman in black is demonstrating the structure of the pipa to the other three women.",
+ "B. The woman in black is teaching the other three women how to play the pipa.",
+ "C. The woman in black is discussing pipa playing techniques with the other three women.",
+ "D. The woman in black is appreciating pipa sheet music with the other three women."
+ ]
+ }
+ },
+ "NuwraABy": {
+ "video_id": "NuwraABy",
+ "video_duration": "68s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video opens on a light blue background with a curly, yellow letter J on the left. The background changes to show white text that reads \"Tip Number\" and then the number 2. The camera then shows a woman with long, blond hair, wearing a blue top, who is talking to the camera. Behind her is a white desk, with make-up products, a plant, and make-up brushes visible on the desk. The woman\u2019s head and shoulder area are shown in a close-up view. The woman continues to speak to the camera, and then there is a shot of a hand holding a notebook with the words \"Essay Plans\" written on it in purple and green with multi-colored lines. The hand turns the page and writes in a notebook with a black pen and then a hand puts down the pen and is shown writing in the notebook again. The camera then shows the hand writing different words and phrases with a black pen and different drawings in multi-colored pens, including the words \"Social\" \"Psychological\" \"Personal\". The woman is then shown again speaking to the camera. The camera then shows her hands picking up multiple colored pens. The camera then shows her writing and drawing using a ruler and a black pen on a white notebook page. The camera shows a hand using a black pen and purple pen to write on a white notebook page. Finally, the woman is shown again speaking to the camera.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the overall mood of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Calm.",
+ "B. Surprised.",
+ "C. Happy.",
+ "D. Tense."
+ ]
+ }
+ },
+ "BnseaLEM": {
+ "video_id": "BnseaLEM",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person with dark hair and a black long-sleeve shirt is holding a violin and bow. The person is moving the bow across the violin strings. A hand is visible on a piano keyboard to the left. There is a white shelving unit on the left with various containers and objects, a wooden door in the center and an exercise machine with blue handles on the right. There is text in blue on the screen, above the head of the person.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur in the rhythm of the violin music at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It transitions from heavy to gentle.",
+ "B. It transitions from gentle to heavy.",
+ "C. It transitions from heavy to joyful.",
+ "D. It transitions from lively to gentle."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "How many types of instruments produce sound in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four types.",
+ "B. Three types.",
+ "C. Two types.",
+ "D. One type."
+ ]
+ }
+ },
+ "NBzSgbJK": {
+ "video_id": "NBzSgbJK",
+ "video_duration": "212s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A hand is drawing figures on a whiteboard. Initially, the hand draws a picture of a building with three columns and a head on a pedestal labeled \"MAN\" next to a fox on a pedestal labeled \"FOX\". A speech bubble saying \"WE HAVE ONE & ONLY LIFE\" points to the head. Words in red say \"WHEN WE TALK OF BUILDING AN EMPATHIC CIVILISATION\". Next, the hand draws a simple outline of a head, then adds a body with arms and legs, and gives the person in the drawing an object to hold. The hand writes \"WE ARE HOMO-EMPATHICUS\" in black text at the top left, followed by \u201cHOW DOES CONSCIOUSNESS CHANGE IN HISTORY?\u201d in red. Additional small circles and lines are drawn above the man's head, labeled \u201cAPPS\u201d, \u201cFOOTBALL\u201d, and \u201cCOFFEE\u201d. The drawing is added to by another outline of a person in a robe holding an axe. The hand adds \u201cMEDIEVAL SERF\u201d below that person, and writes \u201cMODERN MAN\u201d below the first. Small circles above this new figure's head are linked to lines and labeled \u201cFOOD,\u201d \u201cSOCIAL POSITION\u201d, and \u201cGOD\u201d. Another figure with a hunched back and large mouth is drawn to the far left. Below this figure is written \u201cFORAGER/HUNTER\u201d. Small circles with lines above the head are labeled \"BELONGING\", \"SEX\", \"FOOD\" and \"SHELTER\". Below all the figures, the hand writes \"WE ARE SOFT WIRED FOR EMPATHIC DISTRESS\". The hand then writes \"IS IT POSSIBLE WE CAN EXTEND OUR EMPATHY?\" in black, adding a curved red line that goes from the head of the \"modern man\" to a drawing of a cow, a fox, and a rabbit. The hand then draws an arrow pointing to the group of animals. The words \"OUR FELLOW CREATURES\" and \"EVOLUTIONARY FAMILY\" are written at the side of the animal figures. The hand adds a red arrow and a drawing of three people with the words \"HUMAN RACE\" written next to them. 
A large hand is then drawn and encompasses a drawing of the earth with clouds on it. The word \"BIOSPHERE\" is added at the bottom of the circle containing the Earth. The words \u201cIF ITS IMPOSSIBLE TO IMAGINE THEN I CAN'T SEE HOW WE ARE GOING TO MAKE IT!\u201d are written near the Earth drawing. Next, the hand writes \"EMPATHY IS THE INVISIBLE HAND\" with red writing, and also writes \"HUMAN RACE\", and \u201cOUR FELLOW CREATURES\u201d. The hand writes \u201cCOMMUNICATION ONLY EXTENDED TO LOCAL TRIBE\" at the top left, and adds a speech bubble to the figure on the right saying \u201cSHOUTING DISTANCE! HEY!\u201d Another stick figure is added with a spear pointing towards the figure on the right. A mountain is drawn behind the figures, and another stick figure in a flying saucer is added above the mountain with \u201cALIENS\u201d in a speech bubble. The hand writes \u201cEMPATHY ONLY EXTENDED TO BLOOD TIES.\u201d Underneath the figures the hand writes \"HYDRAULIC AGRICULTURAL CIVILISATIONS\" in red, and draws a figure with a long tunic carrying a round object. A new figure is drawn and walks to the center of a circle. Another person walks towards them on the circle and the words \u201cANNIHILATE TIME & SPACE BRING PEOPLE TOGETHER\u201d appear on the board. Under this the hand writes \u201cTHEOLOGICAL CONSCIOUSNESS\u201d, and \u201cDETRIBALISE & BEGAN ASSOCIATIONS WITH RELIGIOUS TIES\u201d. A drawing of two people with religious markings are talking to each other. Then, three people with religious ties are shown followed by a stick figure with a Muslim head cover. Two figures are talking and one says \u201cBROTHER!\u201d At the top of the screen the hand writes \u201cINDUSTRIAL REVOLUTION\u201d, and draws factory smokestacks. The hand writes \u201cMARKETS CREATE THE FICTION OF THE NATION STATE\u201d, and draws a book labeled \u201cBRITAIN\u201d, then a book labeled \u201cGERMANY\u201d, and then one labeled \u201cAMERICA\u201d. 
Another book labeled \u201cFRANCE\u201d is drawn next to the American book. Above the books the hand draws a power pylon and an oil derrick. Above all the previous items, the hand writes \"LOYALTIES BASED ON COMPLEX ENERGY COMMUNICATION REVOLUTIONS THAT ANNIHILATE TIME & SPACE\". Next the hand writes \u201cWHY STOP HERE?\u201d, followed by a drawing of a blood drop, and the word \u201cBLOOD TIES\u201d. An arrow points from the blood drop to a cross, a crescent moon and star, and a star of david. Under these religious icons is written \u201cRELIGIOUS\u201d. The arrow then moves from the blood drop, past these symbols, and to a drawn British flag. The words \u201cNATIONAL IDENTITY\u201d appear next to the flag. An arrow appears at the right of the page and points to \u201cHUMANS\u201d. The hand writes \u201cCAN WE CONNECT OUR EMPATHY TO A SINGLE RACE WRIT LARGE IN A SINGLE BIOSPHERE?\u201d with a red line moving from the head of the figure next to the words and pointing at the Earth with the hand already covering it. The hand is shown writing \"BELONG ITS AN EMPATHIC DRIVE\". Under this the hand writes, \u201cWE HAVE THE TECHNOLOGY TO THINK VISCERALLY AS A FAMILY\u201d and then draws a map of Haiti. Finally, the hand draws a bird with the word \u201cTWITTER\u201d, a cellphone, the word \u201cYOUTUBE\u201d, and a stick figure raising his hands. Under all of the items above it, the hand writes \u201cWHEN THE EARTHQUAKE HIT\u201d, and below that, \u201cTHE ENTIRE HUMAN RACE IN EMPATHY EMBRACE\u201d and then draws \u201c1 HOUR\u201d, \u201c2 HOURS\u201d, and \u201c3 HOURS\u201d.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many countries are mentioned in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Five.",
+ "C. Four.",
+ "D. One."
+ ]
+ }
+ },
+ "doGbGgQs": {
+ "video_id": "doGbGgQs",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman with short dark hair, wearing a light, patterned, long-sleeved shirt or jacket and darker pants stands on a set of stairs playing the violin. Beside her, and slightly below her on the stairs, a man with dark hair and glasses, wearing a dark suit jacket, plays the accordion. He has a small gooseneck microphone attached to his accordion. Behind them is a dark, latticed partition with diamond-shaped openings, and behind that, a wall with lit wall sconces. A music stand sits in front of the accordion player, with a small lamp attached. The two musicians play together throughout the video. The video focuses tightly on them, providing no wider view of the room. The lighting is subdued.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "How does the rhythm of the music in the video change?",
+ "answer": "C",
+ "candidates": [
+ "A. From deep and substantial, it becomes solemn and dignified.",
+ "B. From gentle and soothing, it becomes vigorous and high-pitched.",
+ "C. From lively and enthusiastic, it becomes brisk and compact.",
+ "D. From exhilarating and passionate, it becomes powerful and strong."
+ ]
+ }
+ },
+ "wuUXIvIu": {
+ "video_id": "wuUXIvIu",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a first-person perspective of a player moving through a corridor with grey walls. The player uses several guns to shoot at enemy figures in the corridor. The enemies are various sizes and shapes and there are visual effects such as blue light beams and splashes of red liquid. The player picks up various items from the ground that have floating panels displaying information about them. The player also receives visual indicators of damage as well as an indicator that reads \"Second Wind\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Are there any gunshots in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. None.",
+ "D. I am not sure."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the one-armed monster in the video and the television on the wall?",
+ "answer": "B",
+ "candidates": [
+ "A. The one-armed monster can interfere with the TV signal.",
+ "B. The TV lights turn on, causing the one-armed monster to appear.",
+ "C. The TV monitors the activities of the one-armed monster.",
+ "D. The one-armed monster draws energy from the television."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the one-armed monster appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two.",
+ "B. Four.",
+ "C. Three.",
+ "D. One."
+ ]
+ }
+ },
+ "oUugAAPs": {
+ "video_id": "oUugAAPs",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "There are six people on stage. From left to right, they are playing violin, violin, a lute-like instrument, guitar, violin, and cello. They are all seated. There are music stands in front of five of them. At the center of the group, in front of the guitarist, is a microphone on a tripod. The backdrop appears to be a solid tan wall.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Does the first sound in the video come from a pipa?",
+ "answer": "B",
+ "candidates": [
+ "A. None.",
+ "B. No.",
+ "C. Yes.",
+ "D. I am not sure."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur in the sound dimension of the music performed in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The beginning and end of the video feature guitar solos, while the middle showcases a multi-instrumental ensemble.",
+ "B. The instrumental ensemble plays throughout, with the tempo slowing in the middle.",
+ "C. The music performance in the video primarily features a single instrument, with other instruments serving merely as accompaniment.",
+ "D. The video begins and ends with a multi-instrument ensemble, while only the pipa and guitar play in the middle."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "At the beginning of the video, what is the woman in the black skirt on the left doing with the violin?",
+ "answer": "B",
+ "candidates": [
+ "A. Slowly sliding the bow along the strings.",
+ "B. Sweeping the bow across the strings of the violin.",
+ "C. Rhythmically tapping the head of the violin with the bow.",
+ "D. Gently touching the body of the violin with the bow."
+ ]
+ }
+ },
+ "cYmkYEXB": {
+ "video_id": "cYmkYEXB",
+ "video_duration": "287s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person's hands are shown preparing several pufferfish on a light green cutting board placed over a metal sink. The hands, belonging to a dark-skinned individual, first place a single pufferfish on the cutting board and then make two parallel cuts along its belly using a silver knife. Another, uncut, pufferfish is added to the board. The person then makes two parallel cuts along the belly of the second fish. The first fish is then flipped and two more parallel cuts are made on its back. The same is repeated for the second fish. The person then starts removing the entrails of the first fish, after which they repeat the cleaning process with the remaining pufferfish on the cutting board. After all pufferfish have been gutted, the person begins to fillet one of the fish, this time using a fork to hold the fish in place and remove the meat. The person proceeds to fillet the remaining pufferfish in the same manner.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence for handling a pufferfish? (a) Pull the flesh out from the cutting point using a fork, (b) wrap the fish head with a paper towel to prevent slipping, (c) scrape along the fish's back with a fork, (d) cut the head off.",
+ "answer": "C",
+ "candidates": [
+ "A. (a) (b) (c) (d).",
+ "B. (b) (d) (c) (a).",
+ "C. (d) (b) (c) (a).",
+ "D. (c) (a) (b) (d)."
+ ]
+ }
+ },
+ "FnOIAada": {
+ "video_id": "FnOIAada",
+ "video_duration": "24s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Supermodified race cars speed around an oval asphalt track. Various cars are shown in different parts of the pack. Some cars are red, some blue, some white, some black, some yellow, and some are multi-colored. The cars race close together. A green flag is briefly visible in the background, along with a man in a safety vest. The camera cuts to a shot of the cars lined up for the race start, then cuts back to the cars racing. At the end of the video, a woman with blonde hair and sunglasses is handed a red baseball cap by someone mostly off-screen.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the number on the rear wing of the red vehicle at the start of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 56.",
+ "B. 50.",
+ "C. 58.",
+ "D. 59."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Which car is driven by the girl with yellow hair and black eyes?",
+ "answer": "B",
+ "candidates": [
+ "A. A vehicle with a blue body and a white cockpit.",
+ "B. A vehicle with a black body and a white cockpit.",
+ "C. A vehicle with a yellow body.",
+ "D. A vehicle with a red body."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why did the female driver wearing black glasses get out of the car?",
+ "answer": "B",
+ "candidates": [
+ "A. Her vehicle broke down.",
+ "B. She had completed the race.",
+ "C. She was disqualified for violating the rules.",
+ "D. She experienced a health issue."
+ ]
+ }
+ },
+ "KdszHuKW": {
+ "video_id": "KdszHuKW",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are standing in front of a plain wall. The person on the left is wearing a light colored shirt with a blue and gray plaid overshirt, and blue jeans. He is holding a gold colored trumpet. The person on the right is wearing a black t-shirt with a graphic design on it and blue jeans. He is holding a light brown acoustic guitar and singing into a microphone. There are three microphone stands visible with two microphones and a pop filter. A green potted plant is visible on the right. In the bottom right corner there is an amplifier.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What happened after the man in black finished singing?",
+ "answer": "C",
+ "candidates": [
+ "A. The stage lights suddenly went out, and the music stopped.",
+ "B. A man in a plaid shirt started playing the guitar.",
+ "C. A man walked onto the stage and began playing the violin.",
+ "D. The audience stood up and applauded, requesting the man to sing another song."
+ ]
+ }
+ },
+ "oKjgTgfn": {
+ "video_id": "oKjgTgfn",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man and a woman sit side by side holding stringed instruments. The man, on the left, wears a light blue, long-sleeved button-up shirt. He holds a two-stringed instrument, a Chinese fiddle or erhu, with a dark wood neck and light wood resonator covered with python skin. He uses a bow to play the instrument, moving it back and forth across the two strings. His left hand fingers the strings. The woman, on the right, wears an off-white, ribbed turtleneck sweater. She holds a plucked string instrument, a pipa or Chinese lute, made of light brown wood. She uses her right hand to pluck the strings, while her left hand presses on the strings, presumably altering their pitch. Both musicians gaze straight ahead, focused on playing their instruments. They are positioned in front of closed, light gold curtains.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Is there a voiceover in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. I am not sure.",
+ "C. Yes."
+ ]
+ }
+ },
+ "HoifilGe": {
+ "video_id": "HoifilGe",
+ "video_duration": "244s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Camping",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a map of Jekyll Island, Georgia, followed by views of a campground with recreational vehicles, a lighthouse, a bridge, a pier, the Jekyll Fishing Center, pelicans flying over water, close ups of birds, a historical marker about the Spanish on Jekyll Island, a person riding a bicycle, a beach scene, a boardwalk, a person on a beach with two beach chairs underneath the boardwalk, ocean waves, a person taking pictures at the beach, more bicycle riding through open and wooded paths, a sign for turtle-friendly lights, a sign for bird watchers, a map of Beach Village, a marsh, a maritime forest sign, a Hammock sign, a Marshes and Tides sign, a Coastal Georgia Greenway sign, a sign about Barrier Islands, a sign for a Birding Trail, a white bird on a post, a cardinal, a deer in front of a house, the Jekyll Island Club Wharf historical marker, a dark colored creature swimming in the water, a helicopter in the air, a metal statue of a giraffe, historical markers describing The Skeet House, A Winter City, and the Clubhouse, an historic home identified as the Dubignon Cottage, an enormous tree with moss hanging from the branches, a large white building called the Sans Souci, and a directional sign with names of locations. A person wearing an orange shirt with white lettering is shown from the back, and two people are shown eating at an outdoor restaurant. A woman at the restaurant is then shown from the chest up, and is talking and gesturing with her hands. The video also shows a sign indicating that Tortuga Jacks is pet friendly and requiring leashes.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many white cranes are shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. One.",
+ "B. Zero.",
+ "C. Three.",
+ "D. Two."
+ ]
+ }
+ },
+ "DbEYyaTo": {
+ "video_id": "DbEYyaTo",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows an electrical substation with several transformers and electrical equipment. Then, it transitions to a shot of a construction crane on train tracks. Following this, an aerial shot of the substation is shown, and then a ground-level view of the substation. Then, a long blue train is shown passing over a bridge. Several workers are seen walking next to train cars with black coverings. A red train car carries large white metal rolls. Then, a green train passes. The video shows train cars carrying large cable reels. Workers are seen at the top of a train car adjusting equipment. The video shows an indoor shot of a gray control panel. Then, a top-down view of the substation is shown. The next screen displays gold text on a black background: \"Savings of \u20b9250 crore per year.\" This is followed by an image of a train with a red and white passenger car. The video shows a train at a station. A train with the words \"Indian Railways\" is shown. Then, a blue and yellow passenger train is shown. The next shot is an outdoor shot of the South Western Railway entrance. People and cars are seen moving in front of the building. The video then shows the interior of the railway station with a \"Welcome to Hospete Railway Station\" sign. Many people are walking on the platform. The video then shows ancient stone architecture, with the word \"Hampi\" and \"UNESCO Heritage Site.\" The video then shows an aerial view of the station, an ornate indoor gateway, and a relief of a seated figure on the wall. The video then shows the station entrance and people passing through. Then, several men in casual clothing are seen talking. A woman in a blue uniform talks to a man holding a small child. The video shows a room with beds, and a ticket booking window with people waiting. A woman in a wheelchair and a man walking behind her are seen, followed by a shot of a busy train platform. 
Finally, the video shows the following text on a black background: \"Dedication of longest railway platform in the world.\"",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the last train that passed by?",
+ "answer": "A",
+ "candidates": [
+ "A. Green.",
+ "B. Blue.",
+ "C. Red.",
+ "D. Yellow."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the tone of the speaker in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Serious and proud.",
+ "B. Joyful and proud.",
+ "C. Calm and proud.",
+ "D. Cheerful and humorous."
+ ]
+ }
+ },
+ "WzKwhSic": {
+ "video_id": "WzKwhSic",
+ "video_duration": "79s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A child wearing a black long-sleeved shirt with white robot images, black pants, and black shoes is being lifted from a car seat. The child walks on a path composed of light-colored bricks and mulch in a backyard. The child walks toward a playset with a green slide, a wooden ladder, and swings, and climbs on the playset. A person wearing blue jeans, a black top, and yellow shoes then places the child on a swing. The person then approaches the playset, puts the child on the slide, and catches the child at the bottom.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What type of shoes is the woman wearing in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Yellow canvas shoes.",
+ "B. Yellow casual shoes.",
+ "C. Yellow sneakers.",
+ "D. Yellow slippers."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "How many times did the woman swing on the swing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. I don't know",
+ "C. Two.",
+ "D. One."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What are the steps the little boy follows in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Ring swinging, stair climbing, swinging, and sliding.",
+ "B. Ring swinging, stair climbing, sliding, and swinging.",
+ "C. Stair climbing, swinging, ring swinging, and sliding.",
+ "D. Stair climbing, ring swinging, swinging, and sliding."
+ ]
+ }
+ },
+ "Pirddbfp": {
+ "video_id": "Pirddbfp",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three people are in the video. Two of them are playing violins, and one person is playing a guitar. The person playing the guitar is sitting to the left, with the other two standing. All three are in front of a large window. The guitarist has their hair in a bun, and is wearing a black long-sleeved shirt with a design on it. One violinist is wearing a sleeveless black shirt and white pants, and the other is wearing a gray sleeveless shirt and black pants. They are all looking forward, towards the camera. The guitar player has a black music stand in front of them. Both violin players have a black music stand in front of them, with pieces of white paper on them. They are all using bows to play their instruments.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the woman in black start playing the violin?",
+ "answer": "A",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Throughout the entire video.",
+ "C. At the beginning of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the first sound in the video coming from the instrument on the left?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. No."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "How many types of instruments can be heard producing sound in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three types.",
+ "B. One type.",
+ "C. Four types.",
+ "D. Two types."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What does the woman in black do to the violin when the woman in gray begins playing the violin?",
+ "answer": "C",
+ "candidates": [
+ "A. Adjust the violin's tuning pegs.",
+ "B. Play the violin strings with a bow.",
+ "C. Pluck the violin strings with her fingers.",
+ "D. Wipe the body of the violin."
+ ]
+ }
+ },
+ "ExlUpRsr": {
+ "video_id": "ExlUpRsr",
+ "video_duration": "38s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "swimming",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A swimming pool with several lanes is shown. There are people in the pool swimming. There are also people standing on the sides of the pool. A person in a pink top walks along one side of the pool. Some people are seated in the bleachers. There is a digital clock on the wall showing the time. The camera angle changes. The swimmers reach the other side of the pool. One of the swimmers touches the wall. The swimmers begin to get out of the pool. A close-up shot of a digital scoreboard is shown.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "Which column on the screen at the end of the video shows the contestants' rankings?",
+ "answer": "D",
+ "candidates": [
+ "A. The fifth column.",
+ "B. The first column.",
+ "C. The third column.",
+ "D. The fourth column."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual incident occurred during the competition shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A contestant was disqualified for staying underwater for too long.",
+ "B. The competition was stopped midway.",
+ "C. A spectator jumped into the water and disrupted the competition.",
+ "D. A contestant committed a false start."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Who is the final champion of this match?",
+ "answer": "A",
+ "candidates": [
+ "A. Pontus.",
+ "B. Hill Taylor.",
+ "C. Da Silva.",
+ "D. Tommy Sacco."
+ ]
+ }
+ },
+ "zmLYtPow": {
+ "video_id": "zmLYtPow",
+ "video_duration": "159s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Software",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A white router, two wire frames, and a black pen with a green cap are on a light gray surface. A hand enters the frame, touches the router, and picks up one of the wires. The hand manipulates the wire, then wraps it around the pen, and removes it. The hand picks up and manipulates the other wire, wrapping it around the router antenna and other wire components. The hand moves the router antenna, then touches the router body. The hand then places a wrapped wire component on the antenna, then touches the router antenna and body again. The hand leaves the frame, and then reenters, first touching the router, then moving the router antenna. The hand manipulates other wire parts and places them on the router antenna and body. The hand then picks up and touches the pen, router and parts again, before leaving the frame. The router is shown with a wire antenna and wrapped wire structures on the antenna and to the side. The hand reenters, touching the antenna and router body, before leaving the frame again. The hand reenters and is again shown manipulating and connecting more wire pieces to the router antenna, before leaving the frame again. The hand reenters once more, again showing a wire addition to the antenna, before the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Which props are shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A pen, a metal wire, a router, and a pair of pliers.",
+ "B. A metal wire, a router, and a pair of pliers.",
+ "C. A pen, several metal wires, and a router.",
+ "D. A pen, several metal wires, a router, and a pair of pliers."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did you see the man wearing a gold bangle on his hand?",
+ "answer": "A",
+ "candidates": [
+ "A. No, I didn't see it.",
+ "B. Yes, I saw it.",
+ "C. I'm not sure."
+ ]
+ }
+ },
+ "KCZSDaMq": {
+ "video_id": "KCZSDaMq",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video displays gameplay from a computer game, with a view of a rocky, blue-toned landscape, and a large, spiky, gray creature in the center. There is also a smaller, orange figure moving around the creature. There are also various game interfaces visible such as a status bar at the top and an action bar at the bottom. In the lower right, a man is visible from the shoulders up, wearing a black shirt and yellow headphones. He is mostly looking at the game being played, sometimes looking toward the camera. He makes hand gestures such as crossing his fingers, and then raises both hands above his head. A text box containing information appears over the video game landscape. The man\u2019s head moves around, before he puts on a mask and gives a thumbs up.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence of events in the video? (a) The man obtained a rare trophy, (b) the man replied to comments, (c) the man controlled a character in the game to fight",
+ "answer": "C",
+ "candidates": [
+ "A. (a) (b) (c)",
+ "B. (a) (c) (b)",
+ "C. (c) (b) (a)",
+ "D. (c) (a) (b)"
+ ]
+ }
+ },
+ "VfKCAONc": {
+ "video_id": "VfKCAONc",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Four individuals are seated and playing bassoons, viewed from an overhead perspective. They are arranged roughly in a diamond formation on a tiled floor of alternating beige and reddish squares and triangles. Each person has a music stand with sheet music in front of them.The person at the top of the diamond wears a dark long-sleeved shirt and dark pants and appears to be an older male with gray hair. The person to their lower left wears a royal blue shirt and dark pants.\u00a0 The person to the lower right wears a black long-sleeved shirt and dark-colored pants. The person on the bottom also wears black pants and a dark long-sleeved shirt or jacket.They are all playing their bassoons while looking at their respective music stands. There is a clear plastic bottle or cup visible on the floor by the lower-right person's feet. The feet of some other people can be seen in the upper-right corner of the frame. Empty music stands are visible on the right.\u00a0The musicians continue to play their instruments for the duration of the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Does the volume of the instrument playing change in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. There is no change.",
+ "B. The sound gradually becomes quieter.",
+ "C. The sound gradually becomes louder.",
+ "D. The sound first becomes quieter, then becomes louder."
+ ]
+ }
+ },
+ "rCOymcSS": {
+ "video_id": "rCOymcSS",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person is shown from the chest down, holding and playing a ukulele. A chord diagram is displayed on the left of the screen. As the person changes chords on the ukulele, the diagram changes to reflect the new chord. The chord names (D, F#m, Bm, G, A, C#7, B7, Em) are displayed below the diagram as they are played. The person strums the ukulele repeatedly while holding each chord. The person's right hand strums downwards. The fretting hand uses a variety of finger positions to form the chords. The background is solid white.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "Which strings did the person in the video press to produce a D chord on the ukulele?",
+ "answer": "D",
+ "candidates": [
+ "A. Index finger on the 1st fret of the A string, middle finger on the 3rd fret of the E string, and ring finger on the 2nd fret of the C string.",
+ "B. Index finger on the 9th fret of the A string, middle finger on the 10th fret of the E string, and ring finger on the 9th fret of the C string.",
+ "C. Index finger on the 8th fret of the A string, middle finger on the 9th fret of the E string, and ring finger on the 10th fret of the C string.",
+ "D. Index finger on the 9th fret of the C string, middle finger on the 9th fret of the A string, and ring finger on the 10th fret of the C string."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many different chords were used to play the ukulele in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Nine.",
+ "B. Six.",
+ "C. Eight.",
+ "D. Seven."
+ ]
+ }
+ },
+ "DrzYFTIk": {
+ "video_id": "DrzYFTIk",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Celebrity Interviews",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "Five people are seated around a glass table. Each person has a white mug with red logo in front of them. There is a pink floral arrangement in the center of the table. The background is a set with a bookcase and kitchen in it. The person second from the left touches the person in the center's shoulder and then that person reaches to adjust something on their jacket. A close up shows the center person pointing to a pin on their jacket. A photo of six men shirtless appears. The five people are now laughing, the person in the center is not. The center person is shown talking close-up, then the scene returns to the five people around the table.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the guest positioned in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. On the far left.",
+ "B. Second from the left.",
+ "C. On the far right.",
+ "D. In the center."
+ ]
+ }
+ },
+ "gTPSiUro": {
+ "video_id": "gTPSiUro",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video begins with three people standing in a large room with high ceilings. The person on the left, who has a beard, is playing a brown woodwind instrument. The person in the middle is a woman who is playing a similar woodwind instrument and wearing a dark dress and a blue scarf. The person on the right, a man, is holding and playing a long brown instrument, which has a curved metal piece near the top. Each person is standing behind a music stand. As the video progresses, the person on the left stops playing, as do the person in the middle and the person on the right, as the scene shifts to a closer view of them. The scene then shifts to a different view of the three people, and a fourth person comes into view, on the right side of the frame, holding a guitar. The video shifts to other views of the four people and close ups of each person. At some points, some of the four people are not visible. The person on the left with the beard is wearing a dark suit jacket, and a blue bow tie with a light colored shirt. The person on the right with the long instrument is wearing a dark suit jacket and tie. The fourth person with the guitar is wearing a dark suit jacket with a light colored shirt. All of them are playing the instruments they are holding.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What instrument is located to the left of the first sound device?",
+ "answer": "B",
+ "candidates": [
+ "A. Clarinet.",
+ "B. Flute.",
+ "C. Saxophone.",
+ "D. Banjo."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many types of instruments can be heard in the entire video?",
+ "answer": "C",
+ "candidates": [
+ "A. Five.",
+ "B. Three.",
+ "C. Four.",
+ "D. Two."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the emotional tone of the song in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Sad and lonely.",
+ "B. Calm and soothing.",
+ "C. Joyful and exhilarating.",
+ "D. Serious and somber."
+ ]
+ }
+ },
+ "lBBfaYEu": {
+ "video_id": "lBBfaYEu",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person with white hair and glasses is playing a banjo. They are wearing a purple patterned shirt and black pants. They are sitting down with a blue background behind them.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the strumming sound occur in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Six times.",
+ "B. Nine times.",
+ "C. Eight times.",
+ "D. Seven times."
+ ]
+ }
+ },
+ "stNwBdDX": {
+ "video_id": "stNwBdDX",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Editorials",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The first scene shows a woman with light brown hair, wearing a black and white patterned coat and a multicolored scarf, speaking to the camera. The background shows trees and yellow flowers. The next scene is a close up of hands using a computer mouse, a white keyboard and a small desk pad with a pen on a wooden desk. The next scene shows a woman with glasses and dark clothing standing next to a man with a checkered shirt, both looking at a computer screen. A hand on the monitor points to a page layout displayed on the screen. The next scene shows a boat on the water, with many people onboard. A magazine cover with a picture of this boat is shown. The cover reads: \"Humanity adrift: Why refugees deserve better\". The next scene shows the first woman, in a red shirt and a multi-colored scarf, speaking in an office. The next scene shows two women, one holding a cell phone, speaking outdoors in front of a stone wall with a plant in the background. A close up of one of these women is shown. The next scene shows people on a boat from an over the shoulder view, then a wide shot of people on a boat in the sea. The next scene shows a man in a black jacket and red arm band pointing while running past a light colored wall. Then a scene shows several people removing a woman from a scaffolding structure. The next scene shows four men, with black and navy jackets and light shirts, carrying the woman by her arms and legs. The next scene shows a man in a red knit cap holding a black recorder and speaking. The last scenes show the first woman from earlier, in a red shirt and multi-colored scarf, speaking in an office.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What does the man do to the woman in black on the shelf later in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The man in a fur-lined coat attempts to push the woman in black down.",
+ "B. The man with white hair attempts to dismantle the shelf to make the woman fall.",
+ "C. The man wearing a watch attempts to take the woman in black down.",
+ "D. The man with an orange wrist guard attempts to attack the woman in black."
+ ]
+ }
+ },
+ "YAwWqyZB": {
+ "video_id": "YAwWqyZB",
+ "video_duration": "89s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "swimming",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows an outdoor swimming pool with eight lanes, each marked by lane lines of red and yellow. There are swimming blocks at the edge of each lane. Eight swimmers are standing on their respective starting blocks at the edge of the pool. Several officials in white shirts and dark pants are also present. There is also a large sign with \u201cEUROPEAN JUNIOR SWIMMING CHAMPIONSHIPS ROMA 2021\u201d. The swimmers then enter the water and swim freestyle, splashing and kicking, the camera following the swimmers as they move down the lane. The video shows a graphic of each of the swimmers, with their names, country flag, and lane number, displayed on the pool deck. The swimmers continue down the length of the pool, with their movements creating significant water turbulence. Some of the lane markers are visible as the camera pans across the pool. At the end of the race, the camera focuses on a swimmer and the lane he is in as he reaches the edge of the pool, while other swimmers also reach the edge of the pool. After the race, graphic results are displayed. The video also shows the European Aquatics logo as well as a brief head-on shot of one of the swimmers, close up, while he is treading water. The camera then returns to a top view of the pool, while the results are displayed again.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is David Popovici's recorded time from the previous 100m World Championship?",
+ "answer": "D",
+ "candidates": [
+ "A. 48.77 seconds.",
+ "B. 47.3 seconds.",
+ "C. 49.37 seconds.",
+ "D. 47.56 seconds."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event occurred in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. No one breaks the 49-second record.",
+ "B. Three people break the 49-second record.",
+ "C. One person breaks the 49-second record.",
+ "D. Two people break the 49-second record."
+ ]
+ }
+ },
+ "NvhqgFYu": {
+ "video_id": "NvhqgFYu",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Four women play double reeds in what appears to be a living room. Three of them are standing and playing bassoons, while the fourth is seated and playing a contrabassoon. They appear to be reading music from tablets attached to music stands. They are playing in unison and moving slightly to the rhythm of the music. The three bassoonists are wearing short-sleeved dresses, with the leftmost wearing a mottled gold/brown dress, the center a dark teal dress and the rightmost a dress with blue and paisley patterns. The contrabassoonist is wearing light blue jeans and a teal shirt, and is seated in a light wooden chair. Behind them is a section of wall with a dark brown cabinet to the left, a white fireplace in the middle and a large picture on the right, showing desert scenery. A segment of stairs is visible to the left. The bottom of the screen has the caption \u201cWaltz no.2 \u2013 Shostakovich\u201d.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, where is the bassoon sound coming from?",
+ "answer": "C",
+ "candidates": [
+ "A. At the back.",
+ "B. In front.",
+ "C. On the right.",
+ "D. On the left."
+ ]
+ }
+ },
+ "JZKDMRrE": {
+ "video_id": "JZKDMRrE",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man and a woman sit side by side at a grand piano. The woman wears a mustard-yellow, long-sleeved blouse, and the man wears a dark gray, long-sleeved button-down shirt.\u00a0The camera angle changes periodically between a shot from behind the two, where their backs are visible, a close-up view of their hands on the keyboard, and a shot looking down at them and the keyboard from above.The man leans closer to the woman and they begin to play the piano together. Their hands move across the keyboard, sometimes simultaneously, sometimes alternately, their fingers pressing the keys. They play a fast-paced piece.The man speaks to the woman and gestures with his hands. The woman responds by playing a section of the piece. She looks up at the man and wipes a bit of hair out of her eyes. They continue playing, their hands moving rapidly and precisely. The camera focuses on different sections of the piano as they play.The man again leans over to speak to the woman while they both play, then stops playing himself to adjust the position of her hands on the keys. They both continue playing. The music becomes more complex and dynamic, their fingers moving fluidly across the keyboard. The camera focuses on their intertwined hands as they play in unison.The woman nods and continues to play. They resume their positions, their fingers dancing across the keys. The tempo of the piece fluctuates, with moments of rapid playing interspersed with slower, more deliberate passages. The camera captures close-up views of their hands as they navigate the intricacies of the music.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the piano continuously playing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "yAvDdKyd": {
+ "video_id": "yAvDdKyd",
+ "video_duration": "60s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person with shoulder-length brown hair and glasses, wearing a black long-sleeved shirt, stands before a music stand with blank sheet music and a whiteboard. They hold a saxophone and play a few notes. The camera then zooms in on their hands as they play a scale on the saxophone. After playing the scale once, they reposition their fingers and play it a second time, a bit slower. They return to the original position and play the scale a third time. They take the saxophone away from their mouth and the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "What instrument's sound can be heard in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. French horn.",
+ "B. Trumpet.",
+ "C. Saxophone.",
+ "D. Bassoon."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "How many sharps are in the C# minor scale?",
+ "answer": "B",
+ "candidates": [
+ "A. Five.",
+ "B. Four.",
+ "C. Three.",
+ "D. Seven."
+ ]
+ }
+ },
+ "BQDKdEep": {
+ "video_id": "BQDKdEep",
+ "video_duration": "36s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A bright blue light flares horizontally across the screen. The words \"ALL NINE FILMS\" appear on a black starry background. Two yellow podracers speed through a desert landscape, connected by pink lightning. A Star Destroyer floats in space above a planet, a smaller ship firing a green laser towards it. A TIE fighter flies past the Star Destroyer. A Lego Luke Skywalker fires a blaster at a Lego Slave I. A Lego Kylo Ren ignites his blue lightsaber in a darkened room. A Lego Poe Dameron grins in the cockpit of an X-wing. Countless Lego Star Wars ships fly through space. A giant Lego Yoda appears, looking downwards. The words \u201cIN THE BIGGEST LEGO GAME EVER\u201d appear against a black background. Several Lego Jedi ignite their lightsabers on a desert planet. A Lego X-wing flies and fires red lasers, exploding against a large ship and terrain. Lego Han Solo uses macrobinoculars in the cockpit of the Millennium Falcon. Lego Han Solo, Chewbacca, and Luke Skywalker walk together, surrounded by rebel soldiers. Lego General Grievous faces Lego Obi-Wan Kenobi. Lego Obi-Wan Kenobi smiles. A Lego Jedi deflects blaster fire with a lightsaber. A Lego snowspeeder flies past an AT-AT\u2019s legs in a snowy landscape. The AT-AT falls forwards into the snow. Lego Luke Skywalker\u2019s face is illuminated by a lightsaber. Lego Finn fights a Lego stormtrooper with melee weapons. Lego Luke Skywalker stands against a backdrop of flames, then ignites his lightsaber. A Lego Jedi deflects Force lightning with a lightsaber. Many Lego Star Wars characters are assembled on a hilltop against a starry night sky; their lightsabers are ignited.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will eventually happen to the two lightsabers?",
+ "answer": "B",
+ "candidates": [
+ "A. There might be a victory.",
+ "B. There might be a loss.",
+ "C. There might be a draw.",
+ "D. The fight might not continue."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Is there a large spacecraft in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Unclear.",
+ "B. No.",
+ "C. Yes.",
+ "D. None."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "How does the spaceship defeat the white robot on the snowy ground?",
+ "answer": "A",
+ "candidates": [
+ "A. Use a rope to wrap around the robot's legs.",
+ "B. Employ a special method to short-circuit the robot.",
+ "C. Attack the robot's legs using the spaceship's machine gun.",
+ "D. Fire spaceship cannons at the ground near the robot's legs."
+ ]
+ }
+ },
+ "FljuEXJO": {
+ "video_id": "FljuEXJO",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people, each holding an accordion, are shown against a light background. On the right-hand side of the screen, the flag of a country is displayed, along with the name of the country. The first image shows the flag and country name \"Israel\". Both people wear black shirts.The second image shows the flag and country name \"England\". Both people wear white shirts with a black bow tie.The third image shows the flag and country name \"Ghana\". Both people wear black shirts and black sunglasses.The fourth image shows the flag and country name \"Greece\". One person wears a blue shirt and the other wears a camouflaged pattern shirt.The people move the accordions while their hands move over the keys and buttons. Both people face directly forward.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "Which country is associated with the lively and energetic music with the highest pitch that appears in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Israel.",
+ "B. Greece.",
+ "C. England.",
+ "D. Ghana."
+ ]
+ }
+ },
+ "UwTfoSHD": {
+ "video_id": "UwTfoSHD",
+ "video_duration": "33s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with dark hair and glasses is standing in front of a purple sign that says \"Comedy Garage\" with three white stars above it, and is also standing in front of wooden paneling. The person is wearing a light-colored button-up shirt. The person is holding a black microphone with a wire. The person is talking and gesturing with their hands. Text appears over the video, some of which includes: \"savage audience DESTROYS helpless comic,\" \"I think I should stop there,\" \"How much ever I say,\" \"It's risky,\" \"I can't make fun of God or religion,\" \"in this country,\" \"But you know what I can make of?\" \"WOMEN!\" \"What ?!?!,\" \"If I make fun of God,\" \"I'll die!,\" \"If I make fun of women,\" \"you'll hate me,\" \"which is better!\" \"You Might Die!\" and \"WASTED.\" At the end of the video, a black screen appears with the text \"FULL VIDEO OUT SOON\" and a red YouTube play button.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What mood does the man have when he says that making jokes about women is preferable?",
+ "answer": "B",
+ "candidates": [
+ "A. Pain.",
+ "B. Excitement.",
+ "C. Resignation.",
+ "D. Anger."
+ ]
+ }
+ },
+ "gNZukOFC": {
+ "video_id": "gNZukOFC",
+ "video_duration": "42s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Camping",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person wearing a blue shirt, tan shorts, a tan vest, a black backpack, and a white patterned bandana is walking across a grassy field. They walk towards a tree and touch the tree. They then pick up a small piece of white trash off the ground. They hold it up to the camera and then put it down. They then continue walking across the field, holding up an orange leaf. They gesture with their hands. They walk across a parking lot behind a gray metal railing, past a blue sign that reads \"CALL BOX\". They then crouch down behind the railing.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What could the identity of the man in the video be?",
+ "answer": "C",
+ "candidates": [
+ "A. Outdoor sports enthusiast.",
+ "B. Landscape designer.",
+ "C. Urban camper.",
+ "D. Environmental agency surveyor."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man suddenly become very joyful at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. He discovers food.",
+ "B. He finds a spot to hang a hammock.",
+ "C. He finds a camping spot.",
+ "D. He discovers a stream."
+ ]
+ }
+ },
+ "xTJnGEBg": {
+ "video_id": "xTJnGEBg",
+ "video_duration": "149s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video opens with a shot of a skyline showing various buildings with gold and silver domes. This is followed by several aerial shots showing a line of 4 vehicles moving along a road, with some smoke and debris. There are also shots of one or more soldiers carrying rifles, moving through damaged buildings and terrain. There are interior shots of a group of soldiers in a dim space. Later shots show soldiers sitting next to a fire. Later in the video, there are more aerial shots of vehicles moving along roads and across fields, with plumes of smoke and explosions. The video concludes with a red logo with the number 13 inside a circle.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion does this video convey?",
+ "answer": "B",
+ "candidates": [
+ "A. Fear.",
+ "B. Anxiety.",
+ "C. Sadness.",
+ "D. Pain."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What distinguishes the 110th Mechanized Brigade from the 65th Mechanized Brigade of the Ukrainian Land Forces in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The 110th Mechanized Brigade benefits from a better geographical location.",
+ "B. The 65th Mechanized Brigade is situated in a less advantageous geographical location.",
+ "C. The 110th Mechanized Brigade has superior weaponry.",
+ "D. The 65th Mechanized Brigade possesses superior weaponry."
+ ]
+ }
+ },
+ "KUWlZumR": {
+ "video_id": "KUWlZumR",
+ "video_duration": "53s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows gameplay footage of a level on a video game. There is a man in the lower left part of the screen wearing a yellow hoodie, who is watching the game on a monitor or screen. On the top left side of the screen, there is a number of views and a heart which indicates health. The game starts with the text \"LEVEL COMPLETE!\" with \"gg\" below it. The next screen shows the options for \"DORAMI CHALLENGE III\" with a play button, a green diamond, and a blue button with three horizontal lines, and text with \"NORMAL MODE\" at 0% and \"PRACTICE MODE\" at 0%. Next, the screen says \"ATTEMPT 1\". The next part of the video shows the gameplay with a dark blue arrow with white streaks going from the center of it. There is a screen with the options \"PRACTICE MODE\" with a play button, a green diamond, and a blue button with three horizontal lines. The next shot shows a red car on the man's face. The text \"NEW BEST!\" appears and there is a gameplay part that follows that. The man shows his face while scrunched up, then the gameplay resumes again. On the screen, \"ATTEMPT\" is shown followed by the game loading. There is gameplay with a character jumping from spikes. A graphic appears on the screen saying \"X20 SPEED\". The gameplay resumes. A yellow circle with a dark outline and dark triangles inside it explodes. A wheel-like character appears and then gameplay continues. There is an explosion on the screen and it says \"NEW BEST!\" and \"14%\". The gameplay resumes and there is text that says \"OK, WHAT???\" . A black character appears in the video with a circular logo on its body with dark lines surrounding it. A dark wheel appears and the gameplay resumes. The game shows a blue background with objects that the character interacts with such as triangles and other objects. At one point, there is a text \"WOW\" on the screen, and then \"CC\" and then \"DORA...\". The screen reads \"NA\" and the video ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the man unusually excited at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Because the game's difficulty has been reduced.",
+ "B. Because the game's difficulty has increased.",
+ "C. Because he acquired a skill in the game.",
+ "D. Because he is very familiar with the game."
+ ]
+ }
+ },
+ "wdliOPEN": {
+ "video_id": "wdliOPEN",
+ "video_duration": "76s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A light brown llama wearing blue overalls with a yellow and green striped shirt touches the leg of another larger light brown llama wearing a pink shirt with a dark purple pattern and dark purple patterned pants. Text on the image says \"Llama Llama starts to whine\". The same two llamas are visible next, along with a \"SALE 20%OFF\" sign. Text on the image says \"Clearance sales and discount buys, what is little llama's size? Try it on and take it off. Pull and wiggle, itch and cough.\" The image shifts to a closer view of the lower portion of the larger llama and the text remains on the screen. Next, the smaller llama is shown holding up an orange shirt while the larger llama places their hands on the smaller llama's arm. Text on the screen says \"Shirts and jackets, pants and shoes. Does this sweater come in blue? Brand-new socks and underwear?\". The smaller llama appears again without the orange shirt, holding a small brown stuffed animal. A large light brown llama appears putting on a blue shirt. Text on the screen says \"Llama Llama does not care.\" The larger llama is shown looking at shelves stocked with jars, and then with both llamas, the smaller llama sitting in a shopping cart. Text on the image says, \"What would llama like for lunch? Llama Llama doesn't know. Llama Llama wants to go.\" The larger llama is shown with a red rectangular object and a wrapped loaf of bread. The image changes to show the larger llama with a frustrated expression. Text on the image says \"Loaf of bread and Cream of Wheat. Llama Llama wants his treat. It's no fun at Shop-O-Rama. Llama Llama MAD at Mama!\" The smaller llama is shown next, with text on the image stating, \"Flying pasta. Spraying juice. Paper towels rolling loose.\" The image changes to show the larger llama biting a roll of paper towels. The text on the screen says \"Coffee, bread, and chips galore. Shoes and sweaters hit the floor.\"",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How did the llama's emotions change after entering the supermarket?",
+ "answer": "C",
+ "candidates": [
+ "A. From calm to frustrated.",
+ "B. From calm to excited.",
+ "C. From calm to angry.",
+ "D. From calm to disappointed."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the scene of the baby alpaca playing with toys in the video take place?",
+ "answer": "C",
+ "candidates": [
+ "A. In a clothing store.",
+ "B. In a milk tea shop.",
+ "C. In a supermarket.",
+ "D. In a breakfast shop."
+ ]
+ }
+ },
+ "HbXJbFcJ": {
+ "video_id": "HbXJbFcJ",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man in a camouflage military uniform and hat is speaking to the camera while standing in front of a group of people in similar uniforms who are near a wooden structure. Two men in similar uniforms and helmets stand facing the camera in a grassy area with trees in the background. People in uniform climb over wooden beams and a wooden wall. People in uniform carry other people on their backs. Two people in uniform engage in a physical practice. Two people in uniform crouch and then stand. The video shifts to night vision, showing a dark field with people in uniform moving around. The video then shifts back to show the man in uniform standing in front of a group near the wooden structure. The video shifts again to night vision.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence of new recruit training shown in the video? (a) Flip bar. (b) Obstacle flip. (c) Double person weight-bearing. (d) Barbed wire training.",
+ "answer": "D",
+ "candidates": [
+ "A. (b) (a) (c) (d).",
+ "B. (a) (c) (b) (d).",
+ "C. (c) (b) (a) (d).",
+ "D. (a) (b) (c) (d)."
+ ]
+ }
+ },
+ "SklutrAL": {
+ "video_id": "SklutrAL",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man sits and plays a guzheng, a Chinese plucked zither. He wears a gray pinstriped shirt, brown suspenders, and a red and white checkered bow tie.\u00a0 He sits between two open black metal gates covered in vines.\u00a0 Behind him, a large warehouse-like building with glass panels on the roof can be seen.Cut to a man sitting on a small, foldable chair amongst tall green grass playing the erhu, a two-stringed Chinese bowed musical instrument. He sits on a train track and a green train car is visible behind him.Close-up on the first man, as he continues to play the guzheng.\u00a0Cut to a man standing on a metal walkway playing the dizi, a transverse flute. He wears a brown newsboy cap, striped shirt, brown pants held up with brown suspenders, and a brown patterned tie. He is partially obscured by green trees.Cut to a different angle of the man playing the dizi; the shot shows concrete stairs leading up behind him as he plays.A man sits on a metal grate landing between flights of concrete stairs and plays the pipa, a four-stringed plucked lute. He is dressed similarly to the other men: a gray striped shirt and brown pants held up by brown suspenders.Close-up on the man playing the guzheng as he plucks the strings with his right hand.Cut back to the pipa player on the metal stairs, and the space between two warehouse buildings is now visible behind him.\u00a0The four musicians are now seen together playing their instruments. The erhu player sits on a folding chair, and the dizi player stands next to him. 
Behind them stand the pipa player and the guzheng player, who is now standing to play his instrument.Cut back to the erhu player, partially obscured by grass, and now sitting on the train tracks by the green train.\u00a0Cut back to the pipa player on the metal stairs.Cut back to the dizi player standing on a metal walkway.\u00a0Cut back to the erhu player, again partially obscured by grass.Cut back to the dizi player, this time standing against a concrete wall.\u00a0The four musicians are shown playing together.Close-up on the dizi player.\u00a0The erhu player is shown again, sitting on the train track in front of the green train.Cut back to the dizi player standing on the walkway.Cut back to the pipa player on the metal stairs.Cut back to the dizi player standing on the walkway, now against a concrete wall.Cut back to the four musicians playing together.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the quartet performance scene, where is the guzheng located?",
+ "answer": "B",
+ "candidates": [
+ "A. To the left of the flute.",
+ "B. To the right of the pipa.",
+ "C. To the left of the erhu.",
+ "D. To the left of the pipa."
+ ]
+ }
+ },
+ "nrPszoKV": {
+ "video_id": "nrPszoKV",
+ "video_duration": "269s",
+ "duration": "4-6min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video displays a screen recording of gameplay footage of two different video games, Overwatch and Among Us, along with a small window showing a person wearing a headset. The video starts with a screen saying \"ROUND 2 COMPLETE,\" along with some overlaid text that states \"What the f*** is this game.\" The view then changes to the top of the screen, showing two names - \"FLATS\" and \"EMONGG\" with speaker icons and the number \"3\" beside each name and the text overlaid \"ITS OVERWATCH BABY!\". The view switches back and shows gameplay of Overwatch with the person's face at the top-left. The text \u201cROUND 2 COMPLETE\u201d is then shown. Next, the score is displayed: \u201c0\u201d, followed by \u201c1 / 2\u201d, and then \u201c2 / 2\u201d. At this point, the text \u201cSWITCHING SIDES\u201d is displayed. Gameplay of Overwatch begins, with the game displaying the perspective of the character. A few moments of gameplay are shown, with character eliminations, captures, assists and contested areas shown. The character is then shown to be eliminated. Then there is a screen where characters of other players are shown, followed by more gameplay from the character's point of view. More combat is shown and a text overlay \"That's a cursed image top right\" appears, and then \"Eliminated\" messages appears and more gameplay. The view changes again, showing character selection screens followed by more combat gameplay. A final \"Eliminated\" message appears. Then another screen with scores is shown and the text \"*marks highlight\" is seen. More gameplay and character movement takes place, before a new text message: \"Infra Sight Gained from VINTAGE!\" appears. More gameplay is shown with combat and movement, then voice commands are shown. The character is then shown to be eliminated, and gameplay footage is once again shown from the perspective of other characters. Gameplay footage of the game then ends. 
The video then shows gameplay of the video game Among Us, still with the person's face at the bottom-left of the screen. The words \"DEAD BODY REPORTED\" are then shown. A screen displaying players and \"DISCUSS!\" appears, then one which shows who has died. Then a screen shows the dead character floating in space, with the text \"Flats was An Impostor\" then \"1 Impostor remains.\" The gameplay continues with a top-down view of the game, where more movement and action take place. Then a screen is shown stating \"Later that round\". More top-down view gameplay footage is shown of more movement and characters are eliminated and an oxygen depletion occurs. Another top-down view of more gameplay is shown, followed by a screen saying \"Victory\", and finally a screen showing the game settings. The video ends with a screen saying \"THANKS FOR WATCHING\", with a logo in the center of the screen and social media handles at the bottom.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many players are involved in the game during the latter part of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Ten.",
+ "B. Eleven.",
+ "C. Eight.",
+ "D. Nine."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man in the video display a strange smile while holding his forehead?",
+ "answer": "D",
+ "candidates": [
+ "A. Because he was trapped and voted out.",
+ "B. Because he was voted off by others.",
+ "C. Because he sacrificed himself to ensure his team's victory.",
+ "D. Because he eliminated another player."
+ ]
+ }
+ },
+ "bOSHuPTz": {
+ "video_id": "bOSHuPTz",
+ "video_duration": "147s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a suit stands in a boxing ring, speaking into a microphone and reading from a piece of paper. He gestures toward someone outside the frame.Two fighters, one in black shorts and one in grey, stand facing each other in a boxing ring. A referee in black gives them instructions, then separates them. The fighter in black moves to his corner, while the fighter in grey bounces on his feet. The referee signals the start of round 2. The fighters circle each other. The fighter in black lands several punches, including one that causes the fighter in grey to stumble. The fighter in grey throws punches, some of which connect, others are blocked or miss. The fighter in black lands a punch to the head, followed by a quick flurry of body blows, forcing the fighter in grey against the ropes. They separate briefly. The fighter in black again forces the fighter in grey against the ropes and lands multiple punches. The referee breaks them up. The fighter in black corners the fighter in grey again and connects with several more punches before the referee breaks it up. The fighter in grey gets a few punches in. The fighter in black delivers a combination of punches that sends the fighter in grey reeling into the corner. The fighter in black lands a series of unanswered punches to the fighter in grey\u2019s head. The referee steps in and stops the fight.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the commentator excitedly exclaim \"Oh my God\" in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Because the boxer in black pants lands a punch on the opponent.",
+ "B. Because the athlete in white shoes defeats their opponent.",
+ "C. Because a wrestling takedown occurs in the boxing match.",
+ "D. Because a contestant falls during the competition."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did you hear the referee speaking?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes, I heard it.",
+ "B. No, I didn't hear anything.",
+ "C. None.",
+ "D. I am not sure."
+ ]
+ }
+ },
+ "kYKOXvlI": {
+ "video_id": "kYKOXvlI",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man in a black suit and glasses plays a tenor saxophone. He stands in front of a music stand with sheet music on it.\u00a0 To his right is a drummer sitting at a drum kit. The bass drum is dark blue. The drummer wears a dark t-shirt.\u00a0 To the drummer\u2019s right sits a keyboard player at an electric keyboard. The keyboard player wears a dark suit.\u00a0 Behind the saxophonist, a double bass is visible, presumably played by someone out of view. Light blue curtains hang behind the musicians. The words \u201cBlue Bossa\u201d appear on screen, followed by \u201cMusic Written by Kenny Dorham.\" The video transitions and the same musicians are now playing a different piece. The on-screen text changes to \"There Will Never Be Another You,\" followed by \u201cMusic Written by Harry Warren\u201d and \u201cLyrics Written by Mack Gordon.\u201d",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who is playing the piano?",
+ "answer": "A",
+ "candidates": [
+ "A. The man sitting and wearing glasses.",
+ "B. The man sitting and with a beard.",
+ "C. The man standing and with a beard.",
+ "D. The man standing and wearing glasses."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many different instruments can be heard in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Five kinds.",
+ "B. Two kinds.",
+ "C. Four kinds.",
+ "D. Three kinds."
+ ]
+ }
+ },
+ "rvsFdFqt": {
+ "video_id": "rvsFdFqt",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman wearing a dark top and glasses plays a black piano. In front of her is a music stand with music sheets on it. To the right of the piano, a man wearing a blue long-sleeved shirt and light pants plays a bassoon while standing. In front of the man is a music stand with papers. There are several chairs stacked against the back wall as well as a black case.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What kind of changes occur in the piano sound in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. From a whole-tone scale to a half-tone scale.",
+ "B. From low notes to high notes.",
+ "C. From the natural minor scale to the harmonic minor scale.",
+ "D. From a middle pitch to a high pitch."
+ ]
+ }
+ },
+ "NMPjCQGY": {
+ "video_id": "NMPjCQGY",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two women are playing musical instruments outdoors in a garden setting. One woman, seated on a white folding chair, plays the violin. The other woman, also seated on a white folding chair, plays the cello. Both women wear black long-sleeved shirts and dark pants. They are positioned near a light-colored wooden bench.\u00a0 A black instrument case lies on the bench next to the cellist. A patterned tote bag sits on the ground near the cellist's feet.\u00a0 Sheet music rests on stands in front of each musician.\u00a0 Trees, shrubs, and grass form the backdrop.\u00a0 Some trees show autumn foliage. A paved path runs between the musicians and the grassy area behind them. The ground around the musicians is covered with small stones. They play their instruments continuously throughout the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What instrument does the woman with curly hair play?",
+ "answer": "A",
+ "candidates": [
+ "A. A cello.",
+ "B. A violin.",
+ "C. A viola.",
+ "D. An erhu."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What is the tone quality of the instrument on the left?",
+ "answer": "B",
+ "candidates": [
+ "A. The tone is soft and rich.",
+ "B. The tone is bright and crisp.",
+ "C. The tone is powerful and deep.",
+ "D. The tone is rich and full."
+ ]
+ }
+ },
+ "NAWQcEWR": {
+ "video_id": "NAWQcEWR",
+ "video_duration": "222s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A large piece of meat, dark reddish-brown with a glistening surface, sits on a metal grill grate inside a black smoker. A hand reaches in and opens the smoker's lid further. The hand, holding a meat thermometer, probes the meat in several locations. Then the hand puts down the thermometer and picks up a spray bottle, spraying the inside of the smoker, creating a burst of steam. The hand puts down the spray bottle, dons a heat-resistant glove, and pulls the meat from the smoker, placing it on a wooden cutting board. The meat is a dark, charred color, with a crusty exterior.The meat is placed on a kitchen counter, and a hand points at a section of it. The hand pats the top of the meat, then picks up a long knife and begins to slice thin pieces from the meat. The slices are dark on the outside and pink in the middle. The hand continues to slice the meat.Two men stand in a kitchen, one wearing a Hawaiian shirt and a baseball cap, the other wearing a gray t-shirt and colorful shorts. The man in the Hawaiian shirt points toward a double oven behind them.The man in the Hawaiian shirt slices the brisket on a cutting board. He holds up a slice to display the dark, crusty exterior and the pink interior of the meat. He continues slicing, then cuts a larger piece off the end of the brisket and holds it up, showing the inside texture of the meat. He slices the point and the flat of the brisket separately. A woman in a pink shirt and white skirt stands by as he cuts the point into burnt ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many slices of beef brisket a quarter inch thick did the man cut in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Nine pieces.",
+ "B. Four pieces.",
+ "C. Seven pieces.",
+ "D. Six pieces."
+ ]
+ }
+ },
+ "CSrKZbvS": {
+ "video_id": "CSrKZbvS",
+ "video_duration": "106s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Physics",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a dark screen, then transitions to a black-and-white scene of a woman sitting in a chair, talking on a black rotary telephone. A brown leather-like bag is seen standing beside a doorway. The scene changes to a living room where two people appear to be lying down on the floor and on a couch. A man's head emerges from under the couch. A springy cartoon figure can be seen standing on a book on the floor beside him. The man is then shown surrounded by a web, with the cartoon figure still nearby. The scene transitions to an interior view of the interior of a couch, followed by a view of the couch cushion and then back to the man under the couch. The man then looks around the room and gets up off the couch. He puts on a jacket, and then walks towards the doorway where the bag stands. The man picks up the telephone and appears to be using it. A close up of the dial of the telephone is seen, with a cartoon figure standing on it.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the man's mood in the beginning part of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. From disappointment and curiosity to frustration.",
+ "B. From frustration and curiosity to happiness.",
+ "C. From sadness to joy with humor.",
+ "D. From frustration and sadness to happiness."
+ ]
+ }
+ },
+ "wgRPYHlH": {
+ "video_id": "wgRPYHlH",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three men are seated in chairs and are playing guitars. All three men are wearing black jackets and pants. They are sitting in a room with a brick wall and a plain floor. There are music stands in front of each man. In front of the men is a lit candle and a small box. There are small white objects scattered on the floor.The video appears to be a time-lapse, which makes the guitarists appear to move rapidly. Sometimes, there are multiples of each man visible at the same time, as if a time-lapse overlay has been applied to the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the instrument held by the man on the far right?",
+ "answer": "D",
+ "candidates": [
+ "A. Natural wood.",
+ "B. Gradient color.",
+ "C. Grass green.",
+ "D. Walnut brown."
+ ]
+ }
+ },
+ "kyuTwZAr": {
+ "video_id": "kyuTwZAr",
+ "video_duration": "81s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person in a red shirt and a white glove picks up cooked meat from a black pan using tongs. The camera briefly shows an animated clock with the hands indicating 5:30. Then, the camera shows a person wearing a blue t-shirt and a white apron putting a piece of cooked meat on an orange cutting board, and pulling at the meat with two forks to shred it. The person continues to shred the meat with the two forks until the piece of meat is fully shredded. The video then shows a logo of a moth with the word GREGFLY underneath.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "At approximately what time does the meat start roasting in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 11:00:00.",
+ "B. 10:00:00.",
+ "C. 12:00:00.",
+ "D. 10:30:00."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the man taste the grilled meat?",
+ "answer": "C",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. While tearing it apart and tasting it.",
+ "C. At the end of the video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Is there a knife shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes, there is.",
+ "B. I don't know.",
+ "C. No, there isn't."
+ ]
+ }
+ },
+ "XOgdqvYw": {
+ "video_id": "XOgdqvYw",
+ "video_duration": "111s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a map on a video game screen. A person is visible in the upper right corner wearing a green shirt, with a beard, and a headset. The map switches between zoomed-out and zoomed-in views. On screen, an avatar runs through a field. The avatar battles enemies. The avatar uses a shield that creates a circular gust of wind around it. The avatar gathers resources. The avatar shows a thumbs up emoji, then a thumbs down emoji. The video displays text stating \"42 minutes later 153x sped up.\" The video changes to a shot under the surface of the water. The avatar returns to the surface of the water. An in-game text box appears, stating \"You can't pick up any more of this.\" The video ends with the avatar standing in a field.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the blonde character use the Revalli Shield skill in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Twice.",
+ "B. Four times.",
+ "C. Once.",
+ "D. Three times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "How is the security vulnerability demonstrated in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The blonde character restores health.",
+ "B. The blonde character removes her armor.",
+ "C. The blonde character picks up a spear.",
+ "D. The blonde character picks up a shield."
+ ]
+ }
+ },
+ "rigroHAj": {
+ "video_id": "rigroHAj",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a dark brown screen with the words \"Connection Lost\", \"End of Stream\", and a button that says \"Back to title screen\" in white. A cartoon of a man with brown hair and a green shirt, sitting in a pink chair, looking shocked, appears. The man is then pushed by another cartoon character wearing a blue and white superhero outfit, causing the pink chair to tilt back and the first man to fall out. The first man is then shown on a red, orange and black background, looking concerned, followed by a shot of the second man, the superhero, posing. Then the man with the green shirt is shown again looking scared. The video then shows a hand holding a black phone, and a display of a message \"Tube Alert\", followed by a screen on the phone with a search bar, and then a profile for \"MinecraftFinest\" and a button to favorite it. The superhero character is then seen posing again. Then a view of the first man looking at the superhero followed by a view of a phone with a list of videos. Then there is another view of the phone, with a message \"Enhance your subscriptions!\" and a video of a Minecraft character is shown. Next the superhero character and the first man are next to each other again, the first man holding a phone. The superhero character then yells at the man with the phone. Then there is a view of a phone with a list of channels, with a message: \"Support your favorite Youtube channels!\" The man with the green shirt is shown again looking at his phone. The superhero character looks angry and stares down at the other man. The superhero character then covers the other man's mouth and grabs him. Finally, the superhero character is shown posing and pointing, followed by the first man, again, looking at his phone with two other people, and then with 3 other people, then looking at an orange clown fish in a fish tank. The video ends with the superhero character saluting.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "At the beginning of the video, what issue occurred with the computer of the boy wearing green?",
+ "answer": "C",
+ "candidates": [
+ "A. The computer is lagging.",
+ "B. The computer screen is black.",
+ "C. The computer's network connection is lost.",
+ "D. The computer speakers are broken."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, in what order does the man dressed as Superman introduce the features of this application? (a) View the top 25 most popular channels. (b) Add frequently watched channels to favorites. (c) Receive notifications for new channel updates.",
+ "answer": "A",
+ "candidates": [
+ "A. (b) (c) (a).",
+ "B. (c) (b) (a).",
+ "C. (b) (a) (c).",
+ "D. (a) (b) (c)."
+ ]
+ }
+ },
+ "rZozjOBI": {
+ "video_id": "rZozjOBI",
+ "video_duration": "36s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A hand appears and erases a black square with white stars on it. A drawing of the Earth with a speech bubble that says \"2\u00b0 Celsius\" is shown. Three pages with drawn graphs appear, and a hand taps each one. Then a black flag with white stars appears, a grid is drawn with a hand touching it, followed by a pile of rocks and then a stopwatch. The hand pushes the pile of rocks off the screen, then draws a rectangle with two arrows pointing in opposite directions within it. Under the arrows, the hand draws a downward curved arrow pointing towards another pile of rocks. The hand draws a number \"1\" followed by a line and a circle on the screen. A switch is drawn, and then a molecule consisting of a circle marked \"C\" with two circles marked \"O\" attached by lines is drawn. A drawing of a factory with smoke is drawn. The hand then draws a number \"2\" followed by a line and a circle on the screen. A drawing of a factory with smoke is drawn, and then a large \"X\" is placed over the drawing. Then stacks of paper marked \"100\" are drawn. Finally, a grid is drawn and tapped by a finger. A flag is drawn and pushed up with the hand, and then a pile of rocks is drawn and pushed with the hand.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence of events shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Explore potential methods to accelerate the phase-out of coal in the EU, reduce coal usage, close factories with the lowest revenue potential, and shut down power plants with the highest carbon emissions.",
+ "B. Explore potential methods to accelerate the phase-out of coal in the EU, reduce coal usage, shut down power plants with the highest carbon emissions, and close factories with the lowest revenue potential.",
+ "C. Explore potential methods to accelerate the phase-out of coal in the EU, shut down power plants with the highest carbon emissions, reduce coal usage, and close factories with the lowest revenue potential.",
+ "D. Reduce coal usage, explore potential methods to accelerate the phase-out of coal in the EU, shut down power plants with the highest carbon emissions, and close factories with the lowest revenue potential."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many tables are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Seven.",
+ "B. Three.",
+ "C. Five.",
+ "D. Nine."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Have you heard about Asia?",
+ "answer": "A",
+ "candidates": [
+ "A. No, I haven't heard anything.",
+ "B. Yes, I have.",
+ "C. I'm not sure."
+ ]
+ }
+ },
+ "eXAoktke": {
+ "video_id": "eXAoktke",
+ "video_duration": "83s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A stick figure walks along a white curved line, carrying a large speckled circle on their back. To the left, a drawing of a brain is depicted with red markings at its base and with lightning bolts coming out from it. To the right, a drawing of a brain with a graduation cap is shown. A white speech bubble appears above the stick figure, containing a drawing of two stick figures. The stick figure continues to walk along the line. The image then changes to a single animated elephant. The elephant is shown walking and tilting its body. The image changes to show the elephant walking near two larger elephant outlines. A third larger elephant outline is added to the scene. The image shifts to display the outlines of the three elephants on either side of the single smaller elephant. The image zooms out to reveal the outline of four large letters that are formed by the outlines of the large elephants.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What might the man's occupation be in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Parent of a student.",
+ "B. Lawyer.",
+ "C. Speaker.",
+ "D. Educational expert."
+ ]
+ }
+ },
+ "SoZgkCfM": {
+ "video_id": "SoZgkCfM",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person plays a white grand piano in a snowy landscape with large ice formations in the background. The piano has the brand name \"Yamaha\" visible on the front. The person is wearing a gray jacket. Cuts to a person playing a white cello in the same snowy and icy setting, wearing a gray jacket and dark pants. The person plays the cello with a bow. The video alternates between shots of the person playing the piano and the person playing the cello. The camera angles vary, including close-ups of the instruments and wider shots of the performers in the environment. At times, an aerial view shows the piano from above, surrounded by snow. The cellist sits on a dark wooden chair as they play. In some shots, the light has a blue tint. Near the end of the video, the two musicians play their instruments together, with the piano player seated and the cellist to their right, also seated. The final shot shows the white grand piano covered in snow and icicles, with the cellist and a chair visible at a distance through a haze of snow.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurs in the rhythm of the man's instrument in the video who is wearing a tie?",
+ "answer": "B",
+ "candidates": [
+ "A. The beat weakens and then strengthens.",
+ "B. The rhythm accelerates and then slows down.",
+ "C. The intensity diminishes and then strengthens.",
+ "D. The rhythm slows down and then speeds up."
+ ]
+ }
+ },
+ "ccliRIsr": {
+ "video_id": "ccliRIsr",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person wearing a blue and tan hat, a blue and white patterned button-down shirt, and glasses is playing a ukulele into a microphone. They are standing in front of a tan brick wall. Later, two additional people are visible. One person on the left is playing a stand-up bass. The person on the right is sitting on a box and playing it as a percussion instrument.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What is the rhythmic characteristic of the instrument played by the man on the far right?",
+ "answer": "B",
+ "candidates": [
+ "A. It uses a 4/4 time signature.",
+ "B. It uses a 2/4 time signature.",
+ "C. It uses a 3/4 time signature.",
+ "D. It uses a 1/4 time signature."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "What is the source of the sound of the wooden box drum in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The man wearing a white hat.",
+ "B. The man wearing a red hat.",
+ "C. The man wearing a black hat.",
+ "D. The man wearing a blue hat."
+ ]
+ }
+ },
+ "OCBeBmWU": {
+ "video_id": "OCBeBmWU",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man and a woman stand before microphones, seemingly singing. The man wears a white suit jacket and a black tie with white crosses on it. The woman wears a black jacket. Behind them, a large painting depicts a mountain landscape with trees. The scene then changes to a woman playing a trumpet alongside a woman playing a piano. The trumpeter wears a blue tie-dye shirt, and the pianist wears a black top and blue jeans. They are in a room with wood-paneled walls and a wooden door. The women continue to play their instruments.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What is the first instrument to make sound in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Guitar",
+ "B. A larger size.",
+ "C. A piano.",
+ "D. A French horn."
+ ]
+ }
+ },
+ "BUCEfyAF": {
+ "video_id": "BUCEfyAF",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman with long brown hair sits at a piano, reading sheet music illuminated by a small lamp attached to the piano. She is wearing glasses and a white and beige patterned short-sleeved top. To her right, an older woman with short white hair plays a trumpet. She wears a red and black plaid jacket and a long black skirt. Next to her, a man with short white hair and glasses plays a trumpet. He\u2019s wearing a gray suit jacket, white collared shirt and dark-colored pants. Behind them is a plain white wall. A black music stand holds the man's sheet music. The three individuals play their instruments for an extended period. The scene changes to a man and woman seated in front of a large painting of a mountain landscape. The painting depicts a waterfall and trees. The woman, with short brown hair, is on the left. She wears glasses, a pearl necklace, a zebra-print top and yellow jacket. The man, with short white hair, wears glasses, a light yellow cardigan over a white collared shirt, and a tie with a musical instrument print. Both hold microphones and speak, though what they say is not audible. They remain seated and speaking for the duration of the scene.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What instrument is played by the woman in the red checkered skirt?",
+ "answer": "C",
+ "candidates": [
+ "A. A French horn.",
+ "B. A piano.",
+ "C. A trumpet.",
+ "D. A tuba."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "How many instruments are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Three.",
+ "D. Four."
+ ]
+ }
+ },
+ "fBsxdmSA": {
+ "video_id": "fBsxdmSA",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person wearing glasses and a pink shirt is in the foreground playing a silver trumpet with a clear finish. They also have white wired headphones in their ears. In the bottom left of the video frame there is a smaller rectangular box containing a recording of a musical group. The visuals in the box change periodically showing differing segments of the group. Sometimes there is a person in the smaller box also playing the trumpet. Sometimes an orchestral group is visible. Sometimes a string section.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What instrument is the woman with black curly hair playing?",
+ "answer": "C",
+ "candidates": [
+ "A. Tuba.",
+ "B. Cello.",
+ "C. Violin.",
+ "D. Trumpet."
+ ]
+ }
+ },
+ "WKCdskJf": {
+ "video_id": "WKCdskJf",
+ "video_duration": "26s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man walks through a kitchen doorway and holds up a loaf of bread before setting it down. The video cuts to a single slice of bread toasting in a small toaster oven. A hand opens the toaster oven door and removes the toast. The toast is placed on a wooden cutting board. Butter is spread on the toast with a butter knife. The buttered toast is then placed in a clear glass pie plate. Cheese slices are placed on the toast to form a square border around the center. An egg is cracked and poured into the square formed by the cheese on the toast. Black pepper is ground onto the egg. The pie plate containing the toast, cheese, and egg is placed in the toaster oven.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order for making a sandwich as shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. (c) Toast one side of the bread. (e) Spread butter on the untoasted side. (a) Place four slices of Swiss cheese in the middle to form a frame, and put an egg in the center. (d) Sprinkle with salt and black pepper. (b) Put it in a preheated oven at 400 degrees and bake for 15 minutes.",
+ "B. (b) Put it in a preheated oven at 400 degrees and bake for 15 minutes. (c) Toast one side of the bread. (a) Place four slices of Swiss cheese in the middle to form a frame, and put an egg in the center. (d) Sprinkle with salt and black pepper. (e) Spread butter on the untoasted side.",
+ "C. (d) Sprinkle with salt and black pepper. (b) Put it in a preheated oven at 400 degrees and bake for 15 minutes. (a) Place four slices of Swiss cheese in the middle to form a frame, and put an egg in the center. (c) Toast one side of the bread. (e) Spread butter on the untoasted side.",
+ "D. (a) Place four slices of Swiss cheese in the middle to form a frame, and put an egg in the center. (b) Put it in a preheated oven at 400 degrees and bake for 15 minutes. (c) Toast one side of the bread. (d) Sprinkle with salt and black pepper. (e) Spread butter on the untoasted side."
+ ]
+ }
+ },
+ "UbIjfJBl": {
+ "video_id": "UbIjfJBl",
+ "video_duration": "80s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with brown, curly hair is centered in the frame. The person is wearing a black and white striped shirt. At first, their hair is parted into multiple small braids, pulled back into a bun. The person moves their head back and forth, and then the braids come undone, and their hair falls around their face. The person then throws their hair back, opens their mouth wide, and proceeds to play with their hair and do various facial expressions. In the background, there is a computer screen, pictures hanging on the wall, and a drawing of a pineapple in a frame. At the bottom of the frame, a translucent purple caption appears stating \"if you are confused this was filmed before my haircut!!\" This caption appears throughout the rest of the video. The person holds up two fingers to show a peace sign. The person moves their arms up and down. The person then continues to talk while moving their head and playing with their hair. The person puts both thumbs up. The person then touches their forehead and hair. The person continues to make facial expressions and play with their hair while looking at the camera.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is this woman so excited?",
+ "answer": "D",
+ "candidates": [
+ "A. Because someone copied her hairstyle.",
+ "B. Because her hair is soft and as adorable as a baby's.",
+ "C. Because someone subscribed to her videos.",
+ "D. Because she is about to get a short haircut."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many different hairstyles does the woman appear with in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. 5.",
+ "B. 3.",
+ "C. 6.",
+ "D. 4."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Is the woman in the video demonstrating how to style hair live?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes.",
+ "B. I don\u2019t know.",
+ "C. No."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Why does the woman's voice change in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It is for the program's effect.",
+ "B. She is reluctant to part with her long hair.",
+ "C. She needs to change her look.",
+ "D. She likes to intimidate the audience."
+ ]
+ }
+ },
+ "CXMtIuwC": {
+ "video_id": "CXMtIuwC",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are visible in the video, a woman on the left and a man on the right. They are both playing stringed instruments. The woman is playing a brown ukulele, and the man is playing a yellow guitar. They are sitting side by side, and the camera is positioned at a horizontal eye level. Behind them, there is a blue wall with a horse painting, a table with a floral pattern cover, a shelf with a clock, and a closed light green door with several hooks. The woman is wearing a white top with a pattern. The man is wearing a light gray shirt.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What instrument does the woman play?",
+ "answer": "A",
+ "candidates": [
+ "A. A guitar.",
+ "B. A guzheng.",
+ "C. A harp.",
+ "D. A ukulele."
+ ]
+ }
+ },
+ "ObbMhvLS": {
+ "video_id": "ObbMhvLS",
+ "video_duration": "278s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a woman with long blonde hair holding up paper bills, then transitions to a cartoon graphic with the words \"TRAVEL COMIC\" in red letters on a yellow background. The next scene is an outdoor market stand with a person sitting on a stool, and several tables covered in red cloths displaying numerous pieces of jewelry. The video transitions to other outdoor market stands displaying various items, including buckets of dried goods, packaged spices and soaps, socks, and clothing on hangers. People with and without face masks are seen throughout the market scene. A man wearing a face mask is shown, raising his fist in the air. A different man wearing a face mask gestures and speaks to the camera, followed by several scenes of various outdoor market stands and the people selling and shopping. The camera shows close ups of several food items, including: pumpkins, tomatoes, eggplants, green and red peppers, avocados, potatoes, and bananas. A woman buys pumpkin and hands a man a paper bill. A man asks if a batch of avocados is organic and makes a purchase. A man asks about the origin of the cucumbers at a market stand. The scene changes to several more outdoor market stands offering a variety of clothing and items. The video then shows a market stand displaying various kinds of cheese and cuts to the close up of cheese being purchased. The video then shows a person walking through a market stand filled with items, including perfume boxes, jackets, wallets, belts, shoes, and socks. A person then displays a blue sock with a German design. The man is shown purchasing socks and then showing off the socks to the camera. The video moves through the market as the man asks about the authenticity of an Adidas jacket and purchases items. The video then cuts to a cartoon of an oven with the words \"JUST PUT IN THE OVEN\". The video returns to the indoor market as a person wearing a mask purchases food and goods from the various market stands. The video concludes with a man stating his concerns about the pandemic and then waving goodbye to the camera.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the blogger's interpersonal standing in this market?",
+ "answer": "B",
+ "candidates": [
+ "A. Unwelcome.",
+ "B. Popular.",
+ "C. Disliked.",
+ "D. Average."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What is the nature of the interaction between the chestnut vendor and the blogger?",
+ "answer": "D",
+ "candidates": [
+ "A. The blogger asks about the price of the chestnuts, and the vendor replies.",
+ "B. The chestnut vendor discusses some events from home.",
+ "C. The blogger and the chestnut vendor exchange suggestions.",
+ "D. The chestnut vendor sings for the blogger."
+ ]
+ }
+ },
+ "bdSvWGSz": {
+ "video_id": "bdSvWGSz",
+ "video_duration": "79s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Basketball",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a basketball game. The first part of the video shows a play where several players are around the basket. Players wearing white and red uniforms attempt to score, but the player with the ball in the blue and white uniform receives the ball, dribbles, and then passes. The video then transitions to another play. Some of the players have changed positions. A player in the white and red uniform dribbles and passes to another player in the same uniform. The ball is then dribbled around the court. Then a player in the white and red uniform shoots the ball and scores. The camera switches to show a player wearing number 2 in a blue and white uniform briefly. The camera switches to a wide view of the basketball court. The ball is dribbled around the court and passed between players. The team wearing white and red attempts to score. One of their players dribbles the ball. A player wearing a blue and white uniform approaches the player and tries to block him, which results in a pile of players on the ground. The ball is then passed between players. A player in the white and red uniform shoots the ball and scores, with the ball going through the net. The camera briefly focuses on a player in the blue and white uniform. The video then shows a wide shot of the basketball court. The ball is passed between several players on each team. A player in the white and red uniform makes a jump shot. The camera briefly focuses on a player with number 2 on their blue and white uniform again. The players continue to move the ball around the court. A player in blue and white dunks the ball. The camera briefly follows a player wearing number 2 in the blue and white uniform and then shows the wide shot of the basketball court again. The players continue to run around the court, dribble and pass the ball, and make jump shots. Finally, a player in the blue and white uniform dunks the ball and the video shows a coach in a suit. The camera then shows a wide view of the court again. The players continue to pass and dribble the ball. A player wearing a blue and white jersey dunks the ball. The camera then shows a closeup of a coach in a suit. The camera then shows a wide view of the court and players continue to pass and dribble the ball as the video ends.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What change occurred in the score after the blue team scored a goal at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Add one point.",
+ "B. Add two points.",
+ "C. Add three points.",
+ "D. There is no change."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times do the players in blue uniforms stand up to celebrate in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. Six.",
+ "C. One.",
+ "D. Seven."
+ ]
+ }
+ },
+ "BSmEQwXk": {
+ "video_id": "BSmEQwXk",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man in a black suit and blue shirt plays the cello. He sits on a chair to the right of a grand piano. Another man in a black suit sits to the left of the piano, playing the trumpet. A woman in a sparkly black dress sits at the piano and plays. Another woman sits behind her. There are microphones set up in front of each musician and the piano. There is a small arrangement of fall flowers and candles at the back of the piano, and a similar but larger arrangement on a stand further back and to the left. There is a water bottle on the floor by the cellist's chair and another by the trumpet player. The camera zooms in close on the cellist, and then pulls back to the wider view. The camera zooms in on the trumpet player, then pulls back. The musicians continue playing. The camera again zooms in on the cellist, then pulls back. The camera zooms in on the trumpet player, then pulls back. The musicians continue to play their respective instruments. The camera again focuses on the cellist. The musicians continue playing. The camera zooms in on the trumpet player, and then back out. The musicians continue to play.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "What instrument is to the left of the piano?",
+ "answer": "D",
+ "candidates": [
+ "A. A cello.",
+ "B. A violin.",
+ "C. A saxophone.",
+ "D. A horn."
+ ]
+ }
+ },
+ "dVaZnqWE": {
+ "video_id": "dVaZnqWE",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A young woman with shoulder-length light brown hair plays a bassoon. She is wearing a black short-sleeved shirt. She is shown first in a small room with light walls and a door. The second part of the video shows her on a stage in an elegant hall with high ceilings, ornate white railings on a balcony, and arched windows, playing alongside a woman playing a flute and a grand piano. The bassoonist wears a teal skirt. The flutist wears a long black dress. They are playing in front of a seated audience. The third part of the video returns to the bassoonist alone, wearing the teal skirt and short-sleeved black shirt, playing in a smaller room with an empty grand piano behind her to the right.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, which other instrument is playing when the flute is sounding?",
+ "answer": "D",
+ "candidates": [
+ "A. A piano.",
+ "B. A bagpipe.",
+ "C. An accordion.",
+ "D. An oboe."
+ ]
+ }
+ },
+ "WcZMMSfc": {
+ "video_id": "WcZMMSfc",
+ "video_duration": "231s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video starts with a blank canvas in Adobe Illustrator. A black circle is drawn freehand using the brush tool. A vertical line is drawn down the middle, followed by two semi-circles on either side of the line, outlining a basketball. A horizontal line is drawn to complete the outline. A bounding box appears around the drawing. The stroke weight is increased. The \"object,\" \"path,\" and \"expand appearance\" options are selected from the top menu bar. In the \"Pathfinder\" menu, \"Merge\" is selected. The cursor moves to the \"Layers\" palette and duplicates \"Layer 1\" which contains the basketball outline. The visibility of the duplicated layer (\"Layer 1 Copy\") is switched off. A rectangle shape is drawn over the basketball design. It is filled with a mauve/grey color. The rectangle is sent behind the basketball outlines. The \"Layers\" palette is opened again and the visibility of the copy is turned back on, then \"Layer 1\" (the colored rectangle) visibility is turned off. The direct selection tool is used to select individual sections of the basketball outline and each section is filled with a different color using the color picker - grey, pink, blue, and two shades of red. The blob brush tool is then selected, and further strokes are added in similar shades to create shading on the design.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many colors are filled in the internal area of a basketball?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. One.",
+ "D. Four."
+ ]
+ }
+ },
+ "QymnFmBd": {
+ "video_id": "QymnFmBd",
+ "video_duration": "79s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A hand points to a white, plastic circular object, which appears to be part of a larger unit with wires of different colors attached. Liquid begins to spray from a section of the object, causing bubbles. The object is then shown in a black bucket, supported by two wooden planks, with a black hose running out of the object and into a gray bucket. A hand wearing a bright orange sleeve is visible in the video. A tool is used to make contact with two silver metal points in the object and a blue device, attached to wires connected to the object, lights up with a number displaying on the small screen. The blue device is shown displaying different numbers. A person is shown from the shoulders up, talking to the camera. The video returns to showing the object in the black bucket, and the blue device once again lights up and displays various numbers as the tool is used to make contact with the object. A black and silver battery is seen next to the black bucket. The hose is seen on a flat, gray surface and then again next to the black and silver battery.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the scene at the beginning of the video take place?",
+ "answer": "C",
+ "candidates": [
+ "A. Electric vehicle garage.",
+ "B. Motorcycle garage.",
+ "C. Car garage.",
+ "D. Airplane hangar."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence for testing the fuel pump as presented in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. (b) Connect the power probe to the ground wire, (a) Identify the power wire and ground wire, (c) The power probe lights up green, (d) Check the water flow in the pipe.",
+ "B. (a) Identify the power wire and ground wire, (b) Connect the power probe to the ground wire, (c) The power probe lights up green, (d) Check the water flow in the pipe.",
+ "C. (a) Identify the power wire and ground wire, (b) Connect the power probe to the ground wire, (d) Check the water flow in the pipe, (c) The power probe lights up green.",
+ "D. (c) The power probe lights up green, (a) Identify the power wire and ground wire, (b) Connect the power probe to the ground wire, (d) Check the water flow in the pipe."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What change occurs when the power probe is placed on the car's fuel pump in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. There is no change.",
+ "B. The light changes from off to green.",
+ "C. The light changes from green to red.",
+ "D. The light changes from off to red."
+ ]
+ }
+ },
+ "DlZzwIzS": {
+ "video_id": "DlZzwIzS",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man in a dark suit sits on a chair and plays a cello while a woman plays a grand piano beside him. Both musicians are positioned on a small stage in front of an audience. Behind them is a banner with Chinese characters on it. The cellist moves the bow across the strings of the cello and presses the strings with his fingers. The pianist's fingers press the piano keys. The lower part of a person\u2019s head and shoulders wearing glasses can be seen in the bottom right of the frame.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Is there any voice-over in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. There is.",
+ "B. None of the above.",
+ "C. I am not sure.",
+ "D. There is not."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the cello and the piano in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The cello and piano both share the roles of bass and treble, with the cello leading the melody.",
+ "B. The cello plays the treble while the piano plays the bass, with the cello leading the melody.",
+ "C. The cello and piano both share the roles of bass and treble, with the piano leading the melody.",
+ "D. The piano plays the treble while the cello plays the bass, with the piano leading the melody."
+ ]
+ }
+ },
+ "hJZQTvzL": {
+ "video_id": "hJZQTvzL",
+ "video_duration": "258s",
+ "duration": "4-6min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video transitions from a title card with geometric shapes to a news report with three different people in a panel, all with text overlays that are constantly changing. The first person is a man with blond hair, wearing a plaid shirt, and is standing outside in a street with old buildings and a cathedral in the background. The second scene shows two people, a woman with short blond hair wearing a brown blazer and a man with a shaved head wearing a blue collared shirt. The background behind this man is of a fire truck and a burnt building. Both of these people are on camera while simultaneously appearing on a screen behind them. The next clip is the same as the previous one but with a text overlay at the bottom which reads \"Partido Pirata Argentino\" for a couple of seconds, which later changes to a website address. Next, the man on the right is seen alone, he has a light blue collared shirt on and is located on a street corner with a fire truck and burnt out building behind him. He is interviewed, with Spanish subtitles at the bottom of the screen. While the man is speaking, a fire fighter in a yellow safety vest can be seen walking in the background. The final scene shows the same two people from earlier in the report sitting in front of a news desk. Both people are wearing blazers.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "What did the three police officers in the video do?",
+ "answer": "C",
+ "candidates": [
+ "A. Investigate the cause of the fire.",
+ "B. Arrest the arsonist.",
+ "C. I don\u2019t know.",
+ "D. Extinguish the fire."
+ ]
+ }
+ },
+ "hzbpQzOo": {
+ "video_id": "hzbpQzOo",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A rotating globe with blue oceans and green and brown continents is displayed on a white background, and then a close up of a cylindrical object with a geometric pattern is shown with a silver rod above. Then two tall, white towers with white smoke billowing above are shown, with a dark colored car on a paved road in the foreground. A man in a dark suit with a red and white tie is shown speaking, against a blue background. Then a white boat with red accents on the side is shown sailing on a body of water, with distant mountains. A man in a dark suit is shown gesturing with his left hand next to a taller man in a similar suit. A woman in a white lab coat and headscarf is shown handling petri dishes and then jars in bags. A woman in a white lab coat is shown using a tool to manipulate small plants. A woman in a white shirt is shown standing in a field of tall plants. Then a gloved hand is shown pointing to a computer screen with a graph. Finally, two men are shown looking at a large computer screen showing a map, and then a dark gloved hand is pointing to a different computer screen with text and colored graph. Finally, a man in a suit and tie, previously seen in the video, speaks again against a blue background.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotion does the man wearing a suit convey in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Sadness.",
+ "B. Anger.",
+ "C. Urgency.",
+ "D. Indifference."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "How does the Earth model rotate at the beginning of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Rotates to the left.",
+ "B. Rotates to the right.",
+ "C. Does not rotate.",
+ "D. Alternates back and forth."
+ ]
+ }
+ },
+ "lHpTrgnx": {
+ "video_id": "lHpTrgnx",
+ "video_duration": "146s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Fashion",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person with long blonde hair and red lipstick is holding and displaying a pair of blue jeans. The person holds up different parts of the pants to show details. The background includes a mirror, a plant, a door, and a white wall. The person later is shown wearing the blue jeans, and poses in front of the mirror, turning from side to side to display the pants. The person then holds the pants up again, showing details like frayed hems, and the person then waves at the camera.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the woman in the video like to wear bell-bottom pants?",
+ "answer": "D",
+ "candidates": [
+ "A. They are comfortable to wear.",
+ "B. They enhance the shape of the legs.",
+ "C. They are fashionable and durable.",
+ "D. They cover the shoes."
+ ]
+ }
+ },
+ "cQTAodvu": {
+ "video_id": "cQTAodvu",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three individuals are visible on a stage. Two individuals are seated in chairs on the left side of the image, and one is seated in a chair on the right side of the image. The two individuals on the left each have a music stand in front of them and are playing an accordion and an instrument respectively. The seated individual on the right is playing a tuba. Another tuba is standing upright on the stage to the right. All three individuals appear to be reading from music on their music stands. The individuals are situated on a reflective black stage under dark lighting, and there is a wall behind them with lights.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Does the instrument on the left side of the video have a stronger rhythm compared to the one on the right?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "maNiUSOx": {
+ "video_id": "maNiUSOx",
+ "video_duration": "113s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person's hands are seen holding and manipulating a piece of fabric, folding it into a triangular shape, and then folding it again. The hands place a see-through ruler on the fabric, and then a yellow rotary cutter cuts along the edge of the ruler. The hands remove a strip of fabric. The process is repeated, cutting more strips. The strips are picked up, and arranged on the cutting surface. Then, the hands use the ruler and cutter to trim along the edge. The hands place two strips of fabric together in an L shape. The hands use the ruler to align the fabrics, and then the rotary cutter to cut along the ruler edge. The person's hands hold, then cut, the fabric at various angles with the ruler and rotary cutter. The hands lay the pieces on the grid pattern of the cutting mat. The hands align fabric strips and press them together.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order for cutting the angled fabric strip as demonstrated in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Unfold the folded fabric and trim the uneven edges, fold the rectangular fabric at a 45-degree angle, continue folding the fabric, cut a 2-inch-wide fabric strip, and demonstrate how to trim the angled ends and sew the strips at right angles.",
+ "B. Cut a 2-inch-wide fabric strip, unfold the folded fabric and trim the uneven edges, continue folding the fabric, fold the rectangular fabric at a 45-degree angle, and demonstrate how to trim the angled ends and sew the strips at right angles.",
+ "C. Fold the rectangular fabric at a 45-degree angle, continue folding the fabric, unfold and trim the uneven edges, cut a 2-inch-wide fabric strip, and demonstrate how to trim the angled ends and sew the strips at right angles.",
+ "D. Unfold the fabric and trim the uneven edges, continue folding the fabric, cut a 2-inch-wide fabric strip, fold the rectangular fabric at a 45-degree angle, and demonstrate how to trim the angled ends and sew the strips at right angles."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the woman fold the fabric in the video to create a trapezoid?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. One.",
+ "D. Four."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the purple fabric relative to the woman in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Behind.",
+ "B. In front.",
+ "C. On the left.",
+ "D. On the right."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there any pink fabric in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I don't know.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "aBOKKdTm": {
+ "video_id": "aBOKKdTm",
+ "video_duration": "89s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a first-person perspective from someone riding a blue dirt bike on a dirt path through a wooded area. The path is muddy and uneven. Initially, a dark-colored side-by-side off-road vehicle is visible ahead. As the video progresses, another blue dirt bike can be seen moving on the path ahead. The camera moves with the motorcycle over several inclines. The trees in the area are mostly bare and the weather appears overcast. At one point in the video, blue smoke is visible coming from behind the motorcycle in the lead. There are various logs and tree stumps present near the sides of the path. At the end of the video, the camera is briefly on the ground, before being picked up again.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video, how many trees did the man in front ride past?",
+ "answer": "C",
+ "candidates": [
+ "A. Four trees.",
+ "B. Three trees.",
+ "C. Two trees.",
+ "D. One tree."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "How does the man at the front react when he sees an obstacle ahead?",
+ "answer": "A",
+ "candidates": [
+ "A. Turns.",
+ "B. Steps over.",
+ "C. Brakes.",
+ "D. Collides."
+ ]
+ }
+ },
+ "CAahpGpX": {
+ "video_id": "CAahpGpX",
+ "video_duration": "154s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a yellow screen displaying a logo with the words \"Curious Buddies\" and \"Available Fall 2004 on VHS and DVD!\". This is followed by a view of the edge of the screen as it pans to reveal a yellow stuffed animal toy on the right and a pink stuffed animal toy in front of a knitted white blanket. The screen then transitions to a baby wearing a pink outfit sitting with the green stuffed animal. Next, a boy wearing a plaid shirt is shown hugging a pink stuffed animal. A second baby is shown wearing a blue shirt and overalls as they sit in front of stuffed animal toys. A third baby is shown holding and touching stuffed animals. Then, a baby is in the view with the stuffed animals surrounding them. A blonde haired girl is in the view with a pink stuffed animal. Next, a boy with dark hair is shown with a green stuffed animal. This is followed by a boy crawling in front of stuffed animals. A blonde-haired girl is in the view with a pink stuffed animal. Then the video jumps to a series of clips of a cartoon dog named Blue from the show Blues Clues. Blue is shown in a variety of scenes in a house, drinking from a glass, holding a clipboard, reading a book, and drawing on an easel. In addition, several stuffed animals of Blue's character are shown on shelves, in a chest, and on the floor. Finally, the video displays an image of two VHS tapes of Blue's Clues, followed by text stating, \"Available Now, Where Videos and DVDs Are Sold\" and a Paramount logo.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times do the children kiss in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. Nine.",
+ "D. Five."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What can this DVD do?",
+ "answer": "B",
+ "candidates": [
+ "A. Discover many social lessons.",
+ "B. Embark on adventures with Little Blue.",
+ "C. Watch Little Blue and his friends.",
+ "D. Learn how to create Little Blue's tutorials."
+ ]
+ }
+ },
+ "fctpAzCw": {
+ "video_id": "fctpAzCw",
+ "video_duration": "282s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a construction site where a retaining wall is being built. Initially, the camera shows a dug-out trench with wooden boards and supports already in place. A black backpack, a yellow extension cord, and other tools are nearby on the ground. The scene cuts to a closer view of the trench, where two people wearing blue and orange shirts can be seen, with the camera focused on the one who is wearing a blue shirt. A person in a gray shirt is seen using a power tool on the wooden boards and then the camera cuts back to show the man in the blue shirt again. Then, the camera shows the trench from the side of the wall opposite to the construction site. The video then transitions to showing a man in a red hat who is using a hammer, and who appears to be working alone. After that the video shows a group of people working on the site. Two of them are wearing red hats. The camera then shows a pile of rebar and an orange piece of paper on the ground. It cuts to a red forklift that is carrying a bundle of rebar. Then a close up of a person using an orange tool to cut rebar, followed by the camera showing the wall, with red marks on the boards. A man in a yellow shirt then walks by. The camera then focuses on one person who is fastening wire to the wood, followed by the camera again moving to show the entire wall. The video then shows two people, one in a yellow shirt and one in a gray shirt, connecting rebar, using an orange tool. The camera then pans over the finished rebar grid along the wall. The next scene shows several people wearing safety gear who are now attaching more boards to the wall, including one person who is wearing an orange shirt. The final scene shows several people pouring concrete into the trench between the wallboards and smoothing it out.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the action of stacking boards occur in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. Four.",
+ "D. Five."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occur to the concrete at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. After being poured from a height, it hardens.",
+ "B. After being poured from the bottom, it settles and is eventually leveled by workers.",
+ "C. After being poured from a height, it settles and is finally leveled by workers.",
+ "D. After being poured from the bottom, it solidifies."
+ ]
+ }
+ },
+ "OuiGopBO": {
+ "video_id": "OuiGopBO",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are standing and playing saxophones. The person on the left is wearing a light colored jacket and a white and red patterned shirt. The person on the right is wearing a dark and white checkered long sleeved shirt. Both people are playing brass colored saxophones. Behind them are many other saxophones on stands. There are also two posters in the background with images of saxophones.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At which point in the video does the man in the suit smile?",
+ "answer": "C",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. In the middle of the video.",
+ "C. At the end of the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "oXKfEYqY": {
+ "video_id": "oXKfEYqY",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are visible in the video, each playing a gold-colored saxophone. They are both positioned facing forward, with a music stand behind them. The person on the left is wearing a white shirt and black jacket, while the person on the right is wearing a white, button-down shirt with a polka-dot pattern. Their hands can be seen pressing keys on their respective instruments. The background consists of a plain, gradient blue-to-white wall with several colored squares.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the music being played from over there louder?",
+ "answer": "A",
+ "candidates": [
+ "A. On the right.",
+ "B. On the left.",
+ "C. I am not sure.",
+ "D. Both sides are equally loud."
+ ]
+ }
+ },
+ "tovZZHhU": {
+ "video_id": "tovZZHhU",
+ "video_duration": "223s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two people wearing gis grapple on a large blue mat with a yellow border inside a gymnasium. A referee in black stands nearby. Several other individuals in gis stand or sit on the edge of the mat. Spectators sit in the bleachers. People walk around near the mat. A table with a laptop, chairs, and a water bottle sits near the mat. A person wearing a white gi walks in front of the camera. Two people grapple near the edge of the mat. They move around on the mat. More people walk by the camera. The two grapplers continue grappling. One wears a dark gi, the other wears white. The referee separates the two grapplers. One grappler, wearing a white gi, faces the referee with their back to the camera. The two grapplers stand and face each other. They shake hands. One grappler in white raises their arms. The two separate. People walk around on the mat. The person in white walks off the mat. A person in a blue gi walks by the camera. Another person in a blue gi walks onto the mat. The person in black walks off the mat. Two people in blue gis stand and face each other on the mat. They shake hands and begin grappling. They move around the mat and change positions. One grappler kneels over the other. The grapplers continue to grapple. They stand up. They separate briefly, then continue grappling. One grappler, with a black belt, kneels while holding onto the other grappler, who stands. The referee separates them. The grapplers reset their positions and continue grappling. They move towards the edge of the mat. The referee moves them back towards the center. One grappler is on their back while the other kneels over them. The referee stands them up. They continue grappling. One grappler kneels while holding the other. The one being held steps out of bounds. The referee resets them. They continue grappling. One is on the ground, the other is kneeling over them. They return to standing and grapple. The referee separates them. They continue. 
One is on their back, the other is on top. They remain mostly stationary. The grapplers are helped off the mat by other people.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the blogger's mood in the latter part of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Particularly happy.",
+ "B. A bit irritated.",
+ "C. Somewhat sad.",
+ "D. Very disappointed."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What happens at the end of the video after the person in white wins the match?",
+ "answer": "B",
+ "candidates": [
+ "A. Provocation.",
+ "B. Hug.",
+ "C. Fist bump.",
+ "D. Handshake."
+ ]
+ }
+ },
+ "tbxnShhc": {
+ "video_id": "tbxnShhc",
+ "video_duration": "216s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A first-person perspective in a virtual reality setting shows a pair of hands interacting with a virtual coffee grinder. The view transitions to a perspective behind the counter of a virtual cafe, focusing on two anime-style avatars, one with blue hair and the other with light hair and a Santa hat. The blue-haired avatar pours liquid from a kettle into a coffee filter held above a cup. The view returns to the first-person perspective as the hands place coffee grounds from the grinder into the filter and then pour the liquid from the kettle over the grounds. A white substance is then squeezed into the cup. The perspective shifts back to the cafe scene, where the blue-haired avatar is now making latte art by drawing a heart shape in the coffee foam. The scene changes again to the first-person view as a virtual box of tea is picked up and placed on the counter. A larger cup is selected, and more coffee is poured into it. Back in the cafe scene, the light-haired avatar drinks from her cup and reacts with widened eyes. The blue-haired avatar pours more hot water into the larger cup. The perspective shifts back to the first-person view as a hamburger is placed on the counter next to the tea box. The blue-haired avatar pours hot water onto the hamburger. After various interactions within the cafe, including drawing on the counter and picking up a small round object, the view shifts to outside the cafe where it\u2019s snowing. The perspective rotates and looks upward towards a spiral staircase with a net lining the bottom side. Next, the blue-haired avatar is shown floating in a dark outdoor pool. She picks up the light-haired avatar and drops her in the water. The blue-haired avatar is floating in space, and the image cuts to a color bar screen. The blue-haired avatar is wearing 3D glasses and eating popcorn while facing the color bars. Back in the VR cafe, the first-person view interacts with floating VR menu panels. 
The hands pull the light-haired avatar out of the water. The blue-haired avatar is back in the cafe, near the counter.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Which foods does the person in red request in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A cup of coffee and a hamburger.",
+ "B. A cup of coffee, a hamburger, and a steak.",
+ "C. Two cups of coffee and a hamburger.",
+ "D. Two hamburgers and a cup of coffee."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the person with blue hair do with the paintbrush in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Used the paintbrush to take notes.",
+ "B. Used the paintbrush as a hoe.",
+ "C. Chased the person in red with a paintbrush.",
+ "D. Used the paintbrush for painting."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes occur throughout the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Excited to absurd to shocking.",
+ "B. Humorous to absurd to shocking.",
+ "C. Joyful to frustrated to shocking.",
+ "D. Absurd to shocking to humorous."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "How many egg tarts are on the table?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. I am not sure.",
+ "C. Two.",
+ "D. One."
+ ]
+ }
+ },
+ "PuiIJOvI": {
+ "video_id": "PuiIJOvI",
+ "video_duration": "92s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Historical Analysis",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video consists of three graphic overlays on a background of dark space with stars. Two of the overlays are stylized red and black logos. The third overlay is a white rectangular video display with footage from a scene that changes periodically throughout the video. The video footage shows scenes inside what appears to be a metallic structure, with people that appear to be actors in costume and makeup. The footage shows these actors moving or speaking in a variety of ways. Towards the end, a white text box with a purple line and the text \"Support my work on patreon\" appears over the video footage. Shortly afterwards, there are animated yellow and white shapes and lines that overlay the video footage.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the pop-up sound effect occur in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. One.",
+ "B. Six.",
+ "C. Four.",
+ "D. Two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the female alien relative to the black-haired male alien in the latter part of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. On the right rear side.",
+ "B. Directly in front.",
+ "C. Directly behind.",
+ "D. On the left rear side."
+ ]
+ }
+ },
+ "qffZFeQq": {
+ "video_id": "qffZFeQq",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video shows two musicians playing instruments. On the left, a man with short brown hair and a beard plays a bassoon. He wears a black turtleneck shirt. The background behind him is white tile. On the right, a man with glasses and long light brown hair plays the violin. He wears a black long-sleeved shirt. The background behind him is black. Below the musicians, sheet music is displayed, scrolling from right to left in sync with the music.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Does the bassoon play continuously throughout the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "bYSLkryp": {
+ "video_id": "bYSLkryp",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people stand in a room facing each other, each holding a long, dark instrument. A person sits at a piano behind one of them. The room has a whiteboard and other musical instruments. There are several chairs around the room. The people are all stationary.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurs in the music at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The melody gradually becomes soothing.",
+ "B. The melody gradually becomes joyful.",
+ "C. The melody gradually becomes tense.",
+ "D. The melody gradually becomes somber."
+ ]
+ }
+ },
+ "UOYirfbr": {
+ "video_id": "UOYirfbr",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man plays a brass-colored trumpet. He wears a black t-shirt and a watch on his left wrist. He has short brown hair and light skin.\u00a0 A microphone is attached to his trumpet.\u00a0 He has earbuds in his ears.\u00a0 A silver trumpet stands on a piano behind him. There is a whiteboard on the wall behind him.\u00a0Sheet music scrolls across the bottom of the screen as he plays.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the trumpet continuously playing in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. /",
+ "B. I am not sure.",
+ "C. No.",
+ "D. Yes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does a high note appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three times.",
+ "B. Twice.",
+ "C. Four times.",
+ "D. Once."
+ ]
+ }
+ },
+ "CsJnelLD": {
+ "video_id": "CsJnelLD",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A man with glasses and a grey jacket over an orange shirt stands in front of a blue backdrop. To his right is a large white flip chart. The man\u2019s hands move as he is speaking. An animated woman wearing a white shirt and a light blue hair covering, stands in front of a factory and several trucks. The factory has the words \u2018Fritolay\u2019 on the front and the trucks are red, white, and have images of potato chips on their sides. An animated person is in front of a backdrop of blackboards that have coffee advertisements. The person wears a black shirt with a green apron, and has a silver earring. An animated scene shows three people, then another scene shows an animated man in a suit at a table, followed by an animated person sitting in a subway holding a yellow phone. The scene then shows three more animated people separated by different color backgrounds, then an animated blue car driving on a street. Then, the video shows the man in the grey jacket again, gesturing with his hands as he speaks. A large group of people holding signs and banners march down a city street. A split screen shows a skeleton dressed in an orange shirt holding a sign, and a woman holding a white sign that says \u2018On Strike\u201d. A group of people are standing on a roadside holding blue signs that say \u2018UAW On Strike\u2019. A group of people are standing outside holding signs and a banner that says \u2018Pro Worker Pro Act.\u2019 The video shows the man in the grey jacket speaking again. The screen changes to a blue background with text listing various names, then a screen showing green and white with the words \u2018Inequality Media\u2019 in the center.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual occurrences are depicted in the images of the worker strike?",
+ "answer": "C",
+ "candidates": [
+ "A. A dog participated in the strike.",
+ "B. Paranormal events took place during the strike.",
+ "C. A skeleton participated in the strike.",
+ "D. A cat participated in the strike."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "In how many ways have businesses exploited workers?",
+ "answer": "B",
+ "candidates": [
+ "A. Four ways.",
+ "B. Five ways.",
+ "C. Six ways.",
+ "D. Seven ways."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Which of the following is not a reason for workers to organize a strike?",
+ "answer": "C",
+ "candidates": [
+ "A. Work assignments during rest periods.",
+ "B. Withholding of wages.",
+ "C. Unjustified termination of employees.",
+ "D. Mandatory overtime work."
+ ]
+ }
+ },
+ "myXEtZLx": {
+ "video_id": "myXEtZLx",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three men play musical instruments outdoors. The man on the left plays a tuba, the man in the middle plays a clarinet, and the man on the right plays a banjo. They wear black vests and hats, and white shirts. The man playing the tuba wears sunglasses and a newsboy cap. The other two men wear fedoras. They play in front of a backdrop of bamboo. People walk past on a paved walkway behind them. There are signs indicating restrooms, a special events pavilion, and a picnic outpost. Some people walking by are dressed in costumes. The camera pans to the left, focusing on the tuba player. The camera continues to move to follow the musicians as they play. The camera then focuses back on the whole group as they continue to play their instruments.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Where is the clarinet producing sound located in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. In the hand of the man wearing glasses.",
+ "B. In the hand of the man wearing a necklace.",
+ "C. In the hand of the man wearing a watch.",
+ "D. In the hand of the man wearing sunglasses."
+ ]
+ }
+ },
+ "CmGDtpbg": {
+ "video_id": "CmGDtpbg",
+ "video_duration": "39s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A rocket stands upright on a launchpad at night, illuminated by spotlights and red lights on nearby towers. The view switches to a close-up of the rocket's base. Flames and smoke erupt from the bottom of the rocket. The view changes again to a wider shot as the rocket lifts off, creating a large cloud of smoke and fire around the launchpad. The final view focuses on the rocket ascending through the dark sky, leaving a bright trail of light. A starburst effect is visible around the rocket's engine flame.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Do the flames at the bottom of the rocket keep burning continuously?",
+ "answer": "A",
+ "candidates": [
+ "A. No, they don't.",
+ "B. I am not sure.",
+ "C. Yes, they do."
+ ]
+ }
+ },
+ "PuMtRSMz": {
+ "video_id": "PuMtRSMz",
+ "video_duration": "122s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A white statue holding a microphone is on the left side of the frame. A woman with blond hair, wearing sunglasses and a patterned top, looks at the statue. The scene shifts and two men wearing suits, one with a tie, and one with a grey striped tie, walk in the hallway where the statue was. They walk past the statue and head to an exit door labeled \"men\". One of the two enters the restroom. The man in the restroom uses the urinal, looks around, and then moves towards the sink. He turns on the water, and then notices that he has wet his trousers. He is taken aback, but quickly heads to the paper towel dispenser and begins pulling paper towels out. He moves the paper towel dispenser, and tries to dry the front of his trousers. He then tries the hand drier before moving to a nearby bin and moves it. He then stands on the bin and holds on to a horizontal metal bar in an attempt to reach a hand dryer. He then grabs the light, before the second man enters the restroom. The first man uses the hand drier again. He picks up a round white object off a soap dispenser. He moves away from the dispenser, gets on the bin, and pretends to mouth the ball, while the other man looks at him. The second man exits the restroom, and the first man remains on the bin with the white ball.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What issue does the man in the white shirt face in this video?",
+ "answer": "D",
+ "candidates": [
+ "A. There is no toilet paper after using the toilet.",
+ "B. The light bulb goes out.",
+ "C. The faucet does not supply water.",
+ "D. His pants get wet while washing his hands."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the final condition of the man\u2019s pants in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. I am not sure.",
+ "B. They are not wet.",
+ "C. They are not dried.",
+ "D. They are already dried."
+ ]
+ }
+ },
+ "sqXcQfvk": {
+ "video_id": "sqXcQfvk",
+ "video_duration": "87s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing dark boots walks through a brown liquid on a light gray floor. The boots kick and shatter a white object on the floor. A woman wearing a white lab coat and a black shirt is holding a gray tablet. She is standing next to a white electrical box with a yellow warning sign. She walks to the left of the screen, then she is shown inside a glass cube that has light green borders. The perspective shifts to a wide shot, where the woman is in front of the glass cube and a chair is on the left side of the screen. The woman is now reaching inside the glass cube to touch a white object shaped like a mask. The perspective shifts back to the woman inside the glass cube, where the cube is lit up in red. There is a timestamp on the screen, as well as a \"REC\" sign in the top left corner. The woman is reaching into the cube while the object is inside. The timestamp and recording signal remain in the top right corner of the screen, then the woman is shown from the waist up holding the object. The woman is touching the object, which changes to a different shape. The perspective shifts to the woman putting on the object which is shaped like a mask, then two people in dark, tactical clothing come through a door on the left of the screen. The perspective shifts to a close up of the woman's face as she puts her hand up to her face, then it shifts back to a wide shot, where the woman is now wearing a white mask and two people are holding rifles. The camera shows the perspective of one of the men in dark clothing, then it shifts back to a wide shot of two men in dark clothing holding rifles and one woman with a mask and another without one, then a close-up of the woman's face without a mask. Two men in dark clothing rush through the door, and one of them puts their hands on a woman without a mask. The perspective shifts to the woman with the mask, then it shifts to a door opening with two men with rifles rushing through, and the woman without the mask is being held. The woman without the mask is now seen looking through the barrel of a rifle. The woman is then seen running into the same dark room with a door and a clear box with the mask and two metal bars in it, and then shown again next to a wall, holding a small white object and moving it towards her head. She is shown with her hands on her head, looking distraught. She is shown again next to the door with two men rushing through holding the woman in the mask, then her hands are in the air. The woman without the mask is shown again in a distorted view of the room, and a man with a rifle is in the foreground. The last shot is of two men grabbing the woman without the mask, and they walk through the door on the right.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why was the brown-haired female researcher escorted out of the room by the guards?",
+ "answer": "D",
+ "candidates": [
+ "A. A broken coffee cup outside the room needs to be cleaned.",
+ "B. To prevent her from stealing the mask.",
+ "C. The room is too small to accommodate two people.",
+ "D. To avoid the risk of being attacked by the researcher with messy hair."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video display any stationary glass cases?",
+ "answer": "D",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No.",
+ "C. None of the above.",
+ "D. Yes."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the alarm sound in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Forty-two times.",
+ "B. Thirty-nine times.",
+ "C. Thirty-six times.",
+ "D. Forty times."
+ ]
+ }
+ },
+ "ybDCtmvQ": {
+ "video_id": "ybDCtmvQ",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video shows a performance taking place on a stage, in front of a large audience. The stage features a backdrop that depicts a snowy, wintery scene of bare trees. A large choir of people is arranged in two rows behind the main performing area on the stage. The choir appears to be made up of both men and women, with the men wearing black suits and the women wearing pink outfits. In front of the choir, there is a group of people seated with string instruments including harps and cellos. There are two people playing stringed instruments in the front right of the stage, and one woman in the front-center of the stage, who appears to be the main singer. There is also a large audience seated on the floor in front of the stage. The singer is wearing a red outfit, and black pants. She is singing and uses expressive hand movements. The video also shows an older woman with white hair wearing glasses. The video contains text in Chinese characters, and the logo for a media organization called \u91cd\u5e86.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who is singing in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The woman in a red shirt.",
+ "B. The woman in a white skirt.",
+ "C. The man in a black shirt.",
+ "D. The man in a blue shirt."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What instrument is to the left of the singer in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Violin.",
+ "B. Cello.",
+ "C. Harp.",
+ "D. Erhu."
+ ]
+ }
+ },
+ "LZCtNmRI": {
+ "video_id": "LZCtNmRI",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman plays the cello. She wears a white shirt and a silver bracelet. A ballerina in a dark green and gold tutu dances in a church-like setting with stained-glass windows and stone columns. The camera focuses on architectural details of the church, including more stained glass windows, arches, and a balcony area. The cellist is shown again, now with a braided hairstyle. As she plays, the ballerina dances in the background. The camera alternates between close-ups of the cellist, wider shots of the ballerina in the church, and architectural details. The ballerina performs various ballet moves including arabesques and pirouettes. At one point, the ballerina stands en pointe, her feet together. The lighting creates a shadow around her feet that expands and contracts subtly. The ballerina continues her graceful, fluid movements. A blonde woman plays the piano while the cellist plays. The ballerina\u2019s makeup changes, becoming darker and more dramatic. Her costume also shows subtle changes, including dark embellishments around the bodice. The cellist and pianist continue to play as the ballerina dances, her movements becoming sharper and more intense, reflecting the darker aspects of the music and setting.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Does the first sound in the video come from a cello?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "gwViqiot": {
+ "video_id": "gwViqiot",
+ "video_duration": "123s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Camping",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Three people are walking in a snowy forest, they are wearing dark colored jackets and carrying poles. A close up view shows a person's feet walking on snow, then a person uses their gloved hand to brush the snow aside. Two people in dark pants and jackets with backpacks walk along a snowy path with trees on either side. One person is holding a shovel. A person in a red jacket is standing next to a table with some documents and a man in a gray uniform is seated at the table. The man in the red jacket appears to be talking to the seated man. A man in a hat and glasses is facing a man in a gray uniform. The man in the uniform is pointing at a document in front of him on the table. A sign reading \"Glacier National Park\" is shown with snow and mountain peaks in the background. A large lake is visible surrounded by snow-covered mountains and a forest. A wide shot of a snow-covered mountainside shows a patch of trees and other vegetation. A single person is seen skiing down a snowy path in the woods. The person is wearing a dark colored jacket and appears to be holding ski poles. A person in a hooded purple jacket crouches in front of a wall of snow. A wide shot shows snow coming down a large mountain with patches of trees. Two people in jackets and backpacks walk over a snow covered landscape. One is wearing a red jacket and one is wearing a dark colored jacket. Two people are visible from the chest up as they walk on a snow-covered hillside. The person wearing a blue jacket has ski poles and both are wearing backpacks. A hand is shown using a clip, then a close up shows a hand zipping a jacket. The lower portion of a person's body is shown skiing in the snow. Two people walk downhill on a snow-covered slope, both are wearing backpacks and holding ski poles. Text appears over an image of a snow-covered cloud. The text says, \"In an Avalanche Shout an alert\". The same text appears with the following text underneath, \"Throw off pack\". The text changes to include \"Make swimming motions\" below it. A person in a blue jacket ties a pink ribbon around a small tree. Three people are visible from the waist up, they are skiing and walking on snow. One of them is carrying a shovel. Text appears over the scene. The text says, \"As you search Assign a spotter\". The text adds, \"Note safety zones\" then, \"Have an escape route\". A close up shows a person moving a pink ribbon that is attached to a small tree. A man stands near a snow-covered slope, he has a pole in his hand. Two people walk across a snow-covered slope while holding ski poles. A person in gloves holds a black object with orange highlights. A person wearing ski boots and dark pants is walking in the snow and holding a ski pole. Two people walk on a snow covered slope, both holding ski poles. They are wearing dark-colored jackets. A man on skis holds a shovel, while a view of mountains covered in snow is visible in the background.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What are the man in the red coat and his friends using to walk through the snow?",
+ "answer": "D",
+ "candidates": [
+ "A. A tree branch.",
+ "B. Ski poles.",
+ "C. An iron rod.",
+ "D. A hiking stick."
+ ]
+ }
+ },
+ "ORbtGuVC": {
+ "video_id": "ORbtGuVC",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person wearing a black and white shirt, a cap, and headphones, plays various percussion instruments, including a drum with a black top, a drum with a tan top, and a conga. Sometimes they use a red mallet, and other times they use their bare hands to play. There is a microphone near each instrument. The person is located in what appears to be a recording studio, with a piano and a guitar visible in the background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Where does the \"dumdum\" sound come from in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A palm strikes the drum surface.",
+ "B. A mallet strikes the drum surface.",
+ "C. A brush strikes the drum surface.",
+ "D. Drumsticks strike the drum surface."
+ ]
+ }
+ },
+ "RvPzvLxq": {
+ "video_id": "RvPzvLxq",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person in a dark blue outfit and a white head covering plays a saxophone. Another person wearing a tan suit plays a white guitar. A woman wearing a white outfit sits in a chair. There are other people on a bench, and other musical instruments visible, including another guitar and a harp. A black amp with white text is on the floor near the guitar player. A plant in a pot is on a table. There is a curtain with a gold rod hanging in the background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What instrument is on the far right in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The guzheng.",
+ "B. The pipa.",
+ "C. The guitar.",
+ "D. The bass."
+ ]
+ }
+ },
+ "XGQPsWgE": {
+ "video_id": "XGQPsWgE",
+ "video_duration": "39s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a title card: \"MADONNA LIVE! DROWNED WORLD TOUR 2001 PRE-SHOW\" against a light blue background. The title fades and cuts to Madonna walking backstage with a person beside her. Quick cuts follow, showcasing various aspects of the tour's preparation: costume sketches, a gymnast warming up, a man sorting through clothes racks, dancers practicing fight choreography, a woman singing, a crowd cheering in a concert venue, Madonna dancing on stage with others, more dancers getting ready, Madonna performing on stage again, and brief interviews with crew members. The video continues to intercut these scenes, showing behind-the-scenes moments, Madonna's performances, and more interviews. The countdown timer in the corner decreases as the video progresses. The last few seconds feature Madonna in a black outfit against a blue background, followed by a close-up of her face, and finally, a shot of her with long, dark hair against a purple backdrop.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Which part of the video first shows someone spinning on stage?",
+ "answer": "C",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the end of the video.",
+ "C. In the middle of the video.",
+ "D. At the beginning of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why do people like Madonna's songs?",
+ "answer": "B",
+ "candidates": [
+ "A. Her songs keep up with current trends.",
+ "B. Every moment features an amazing performance.",
+ "C. She is excellent at promotion.",
+ "D. She is highly skilled in presentation."
+ ]
+ }
+ },
+ "ATJXnVaG": {
+ "video_id": "ATJXnVaG",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man in a white shirt and tie sits on a chair, playing a hollow-body electric guitar. To his right, a banjo stands on a small amplifier. Further right, another man in a white shirt sits on a stool behind a drum set, playing the drums with drumsticks. To the drummer's right, a third man sits on a chair, playing a tuba. The text \"Basin Street Blues (C.S.Williams)\" appears on the bottom of the screen. The same visual continues, but the text changes to \"After You've Gone (T.Layton).\" A fourth man sits down to the left, holding a trombone, and begins to play. The other three musicians continue playing their respective instruments. The text changes to \"In a Mellow Tone (D. Ellington).\" The four musicians continue to play their respective instruments. The text changes to \"I Got Rhythm (G. Gershwin).\" The four musicians continue to play their respective instruments.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What is the instrument on the far left in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Piano.",
+ "B. Trombone.",
+ "C. Flute.",
+ "D. Guitar."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many types of sounds are present in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Five.",
+ "B. Six.",
+ "C. Three.",
+ "D. Four."
+ ]
+ }
+ },
+ "RPQcozvA": {
+ "video_id": "RPQcozvA",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A clapperboard appears in front of two men in a bar set. The clapperboard displays the following information: C 75 V14, 1, 23+16, 807, \u201cAlways Sunny in Philly\u201d, Richie Keen, Peter Smokler. As the clapperboard is removed, the two men look down at their stomachs, then hold hands and look towards the camera. They both point towards the camera. They both tap themselves on the shoulder, then continue pointing at the camera and laughing. They continue to point, laugh, and tap their shoulders. One man speaks while gesturing. Both start laughing hysterically and bend at the waist. One man holds up a large jug of milk while the other gestures. One man drinks from the milk jug while the other continues gesturing. The milk jug has a picture of a flexing bicep and the words \u201cFight Milk\u201d. The man holding the jug gives a thumbs up, and the other man also gives a thumbs up. The man drinks more from the milk jug. The second man takes the milk jug and speaks to the camera. The first man walks off screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the clapperboard strike in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. One.",
+ "D. Four."
+ ]
+ }
+ },
+ "ISYmMwyo": {
+ "video_id": "ISYmMwyo",
+ "video_duration": "361s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a shot of four pumpkins on a white shelf. The camera pans to the right, revealing an additional two pumpkins, which look a darker shade. Another pan shows the pumpkins again, and two additional orange pumpkins appear on the left side of the row. The scene cuts to a person with long black hair, wearing a white and orange cropped shirt and blue jeans. The scene then cuts to the same person standing near red register counters. Next, the scene shifts to a view of the same person through the glass doors of a store. Then the video cuts to an interior room, where the same person is sitting among moving boxes. A purple vacuum cleaner and some black garbage bags are also visible in this shot. The text \"Back Home\" appears over the image. The same person then stands up and walks slightly towards the camera, then turns and faces the vacuum cleaner. After the text \"Bedroom\" appears over the image, the camera shifts to a shot of a messy bedroom with a bed, 2 laundry baskets and an open cardboard box. The same person moves into view of this scene and moves some clothes into the laundry baskets. The scene then shifts to a view of the same person opening a closet door, and then it cuts back to the same interior room where the text \"Back Home\" originally appeared. The same person is shown walking toward a washing machine, and placing some clothes in it. Next the scene shifts back to the bedroom, and the person is shown putting clothes in laundry baskets. Then the scene cuts to a room with various shelves, three moving boxes, and a large mirror. The same person walks in this room, puts an object in the cardboard box, and then the camera cuts back to the same person facing forward and standing still in the same interior room where the text \"Back Home\" originally appeared. Then, the camera moves back towards the boxes in this room, and focuses on one of the open boxes. Next, the camera shows the same person putting another cardboard box in a stack of boxes, and then the same person picks up the vacuum cleaner and begins vacuuming the rug in the bedroom. The same person is shown vacuuming from multiple angles. Text appears in the corner reading \"Speed: X4.\" The camera then shows the same person walking out of the bedroom and picking up another cardboard box, and placing it in a pile of boxes in the original interior room. Then, the text, \"IRL Vlog Channel: Aranza Romero\" is shown over the interior room scene. The text \"Instagram: iicottencloud\", \"Roblox: iicottencloud\", and \"TikTok: iicottencloud\" appears below this. The scene cuts to another view of this room, now with garbage bags, a bucket with a yellow \"Wet Floor\" sign and some papers. Text appears in the middle of the screen, reading \"Hallways.\" The same person then walks across the frame, and picks up one of the garbage bags, and the camera focuses again on this person facing forward with text in the background. Then, the camera looks into the washing machine, and then at the washing machine's features. A wider shot of this machine is shown, and the same person is seen closing the door and pressing buttons on the washing machine. The person opens and closes drawers in the bedroom, and the scene cuts to a view of broken furniture leaning against a wall. Then a shot shows a paper on a wooden floor. Text appears in the middle of the screen, reading, \"Bathroom.\" The same person is shown walking through a doorway into this bathroom. The same person then comes back into the hallway, and begins using a purple broom to clean the floor, moving it across the hallway, into the kitchen, and finally to a wall. The person takes the broom apart, and then the scene cuts to a large kitchen, with a variety of dishes and a computer, along with multiple chairs. The person walks towards the kitchen sink, and begins washing dishes. Another close up view of the dishes is shown, as well as the kitchen table. Next a wide shot of the kitchen shows the person picking up an orange spray bottle. The same person then walks to the table and begins wiping the table, and then places the spray bottle back. The same person grabs a white cloth and returns to the table, and the camera moves back for a wide view of the kitchen, showing the same person wiping the table with the cloth. The scene then cuts to the living room and the same person. The text \"Living Room\" appears in the middle of the screen. The scene cuts to a wide view of the room showing a white couch and various furniture. The scene zooms in on the couch and the wooden tables in front of it. Next the camera zooms in on several cakes placed on the wooden tables in the center of the room, as well as on some tools and fake money nearby. The scene then changes as the camera zooms in on these items again. The person enters the scene, and walks around the furniture. Next, the person picks up an object from a wooden tool box. Then, the camera moves towards a soundbar on a shelf, and a small robot vacuum cleaner on the floor. Then the person is shown walking around the living room as the camera changes angles. The same person puts a purple object back onto the floor. The camera view changes, and focuses on the same person, walking towards a doorway, and then grabbing the broom from the wall. Then, the scene shifts to the same person, who is now mopping the kitchen floor, and the camera shows the same person at different angles. The video then displays an overview of the living room and then the person mopping the floor.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the drill relative to the table in the living room?",
+ "answer": "C",
+ "candidates": [
+ "A. The drill is to the right of the table.",
+ "B. The drill is under the table.",
+ "C. The drill is on the table.",
+ "D. The drill is to the left of the table."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the girl use to clean the bedroom floor in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A cloth.",
+ "B. A floor scrubber.",
+ "C. A mop.",
+ "D. A robotic vacuum cleaner."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Has the washing machine been turned from off to on?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No, the washing machine is not operating.",
+ "C. Yes, the washing machine has started operating."
+ ]
+ }
+ },
+ "KbVeuqQH": {
+ "video_id": "KbVeuqQH",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman sits on a chair, playing a pipa (a pear-shaped lute). She wears a short-sleeved, floral-patterned dress. Her hair is dark and shoulder length. She plucks the strings with a plectrum held in her right hand while her left hand fingers the strings near the instrument's neck. To her right, a younger woman sits playing a guzheng (a Chinese zither). She wears a light purple, sleeveless, high-necked dress. Her hair is dark and reaches her waist. She uses picks attached to the fingers of her right hand to pluck the strings while pressing down on strings with her left hand to adjust the pitch. She occasionally adjusts the picks on her right hand. Behind them, the wall features a vertical banner with Asian characters, a framed piece of calligraphy, and two framed pieces of artwork. A vase of reddish-orange flowers sits on a stand next to the guzheng player. A circular drum rests on the floor in the far right corner.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the pipa constantly being played in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. None of the above.",
+ "B. I am unsure.",
+ "C. No.",
+ "D. Yes."
+ ]
+ }
+ },
+ "usfkQbgD": {
+ "video_id": "usfkQbgD",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two people are in a dark room with scientific equipment, including computers, monitors, and a frame structure with cameras and lighting. One person wearing a white lab coat is working at a table with a number of screens. Another person wearing glasses and a dark jacket is seated at another desk working with computer equipment. A light-colored mouse with a harness attached to a wire is shown walking on a green ramp, and then on tiered platforms. A man wearing glasses, a suit jacket, and a striped shirt speaks in front of a screen displaying an image of a brain, and then a screen displaying a cartoon mouse.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where are the mechanical devices located in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The mechanical devices are attached from above and worn by the mouse.",
+ "B. The mechanical devices are connected from the left side and fitted onto the mouse.",
+ "C. The mechanical devices are connected from the right side and attached to the mouse.",
+ "D. The mechanical devices are connected to the mouse from below the desktop."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occurred in the behavior of the mouse in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It transitioned from staying still to moving.",
+ "B. It remained completely still.",
+ "C. It transitioned from moving to staying still.",
+ "D. It remained in continuous motion."
+ ]
+ }
+ },
+ "LhCOTGxu": {
+ "video_id": "LhCOTGxu",
+ "video_duration": "180s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man wearing a green hat and a dark shirt is shown in an initial frame. The video transitions to showing the inside of a car as it drives along a road and through a gate. The camera shows the car\u2019s dashboard, a water bottle, a red bag, and a hand that is seen opening a car door. The video then focuses on the street, where what appears to be salt or snow is scattered. It shows the rear of the same car as it is on fire with large flames coming out. It then shows a man using a red fire extinguisher on the burning car, with other people joining to also assist. The fire is shown being put out using the extinguisher and a grey container. The camera enters the car, showing smoke inside and a pile of clothes on a seat. The video transitions back to the exterior, showing the man with the green hat again and a white minivan driving by. One of the men at the car is then shown using a hose to spray what appears to be a fire retardant onto the car\u2019s exterior. The person spraying is then shown opening the car\u2019s trunk, which is filled with various items. The person then opens one of the car doors. Two people are then shown speaking outside of the car, and then the video transitions to one of the men with a fire extinguisher. The camera transitions to a shot of the damaged back of the car. A man in a black shirt is seen pointing towards a field and pond. The camera then shows the back of the car again, with another man using a fire extinguisher on the vehicle. The video transitions to show what is in the car\u2019s trunk again, before cutting to multiple men standing outside of the car. One of the men is seen walking towards the fence. Then multiple men are standing near the car and then around the fence with two other vehicles in the background. The video shows a person walking through a gate, and another person standing near the car again. One of the men is walking with the red fire extinguisher again, before one man points to the car and then the camera focuses on the damaged back of the car and the person walks away. The video ends with a man with a beard talking to the camera.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual situation occurred in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The front tires of the vehicle exploded.",
+ "B. The engine of the vehicle is on fire.",
+ "C. The rear tire of the vehicle is on fire.",
+ "D. A fire is occurring at the rear of the vehicle."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who said \"At least you didn't have it for too long\" in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A man wearing a black short-sleeve shirt with a cartoon image on the top-right corner.",
+ "B. A man wearing a black hat and a black short-sleeve shirt.",
+ "C. The car owner.",
+ "D. A man with short white hair wearing light blue jeans."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many fire extinguishers are used in the firefighting shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. Four.",
+ "C. Three.",
+ "D. One."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the silver fire extinguisher located relative to the red fire extinguisher at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. In front.",
+ "B. On the left.",
+ "C. Behind.",
+ "D. On the right."
+ ]
+ }
+ },
+ "hsFJUuie": {
+ "video_id": "hsFJUuie",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three musicians are seated on wooden chairs playing string instruments in a grassy area. Trees and shrubs are visible in the background. A speaker on a stand and a microphone are positioned behind the musicians. The musician on the left plays a cello. She is wearing a black long-sleeved shirt or jacket and has reddish-brown hair. The musician in the center plays a viola. She is wearing a dark jacket with lighter-colored buttons or trim, dark pants, and dark sunglasses. Her hair is pulled back and appears to be blond or light brown. She is wearing dark gloves on both hands. The musician on the right plays the violin. She is wearing a dark jacket or long-sleeved shirt, dark pants, and dark sunglasses. She has dark hair. All three musicians read sheet music from stands in front of them as they play. They are playing \u201cBohemian Rhapsody\u201d.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many double basses are playing in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. One.",
+ "B. Zero.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "eChzydTx": {
+ "video_id": "eChzydTx",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man and a woman are seated, playing erhu. The man, on the left, wears a dark jacket and dark pants. A glimpse of red cloth is visible under his jacket collar. He holds the erhu's bow in his right hand and the instrument's neck in his left. The woman, to his right, wears a black turtleneck or long-sleeved shirt and glasses. She holds the erhu in a similar manner to the man. Sheet music rests on a stand in front of her. Behind them, a golden wall features a painting of a green, stylized pine tree. They both move their bows across the erhu strings, their fingers moving on the necks of the instruments. They play in unison. The video continues with them playing, their hand and arm movements changing according to the melody. The woman's head moves slightly to the rhythm of the music. The man appears to sing along subtly.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Is there any narration in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. There is.",
+ "B. None of the above.",
+ "C. There isn't.",
+ "D. I am not sure."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes does the woman experience in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. From calm to distressed.",
+ "B. From calm to sad.",
+ "C. From calm to anxious.",
+ "D. From calm to happy."
+ ]
+ }
+ },
+ "WPWFcaaS": {
+ "video_id": "WPWFcaaS",
+ "video_duration": "127s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two men are standing in a gym in front of a stall bar, with a red and blue mat underneath it. A long trampoline is placed next to the mat. The man on the right, who has long brown hair, gestures towards the stall bars. The man on the left, wearing a black t-shirt and black pants, places his right hand on a bar at shoulder height and his left hand on a bar slightly below his waist. He then leans forward, placing his weight on his hands, and extends his right leg backward. He continues to lean forward until his body is almost horizontal, supported by his two hands on the stall bars. He keeps his body straight and his left leg tucked in. He then lowers his extended leg and returns to a standing position. He proceeds to repeat the exercise. The two men change positions. The man who previously performed the exercise is now standing on the mat facing the stall bars. The long-haired man grabs the stall bar with his right hand, at a rung a little above head height, and with his left hand, a little lower than his waist, leans into the wall as if performing a handstand and raises one leg at a time.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the scene in the video take place?",
+ "answer": "D",
+ "candidates": [
+ "A. A table tennis hall.",
+ "B. A basketball court.",
+ "C. A gymnasium.",
+ "D. A fitness center."
+ ]
+ }
+ },
+ "WftQbQtj": {
+ "video_id": "WftQbQtj",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man with a mustache is playing bagpipes. He is wearing a light blue long-sleeved shirt and brown pants. The bagpipes are black with gold or yellow accents and one white tassel hanging off of them. He has both of his hands on the bagpipes, one is on the black portion at the base and the other holds the chanter. The background is purple with what looks like light blue square accents. A second man enters into the frame, wearing a striped light blue long-sleeved shirt and brown pants. This second man plays a violin, and stands to the right of the first man, a black microphone is between them. The second man is shown by himself, playing the violin, in front of a purple background. The first man is shown by himself, playing the bagpipes, with the light blue and purple background. Both men are shown in the frame playing their instruments again, with the light blue and purple background. The second man is shown by himself again, playing his violin in front of a purple background. The first man is shown again playing his bagpipes by himself with the light blue and purple background, before finally both men are shown again playing their instruments.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion does the music played in the video convey?",
+ "answer": "A",
+ "candidates": [
+ "A. Joy.",
+ "B. Tension.",
+ "C. Sadness.",
+ "D. Melancholy."
+ ]
+ }
+ },
+ "rIHxIuPE": {
+ "video_id": "rIHxIuPE",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Fashion",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video opens on a white screen with the text \"IMPORTED_CHOCOLATE\" in yellow at the top left, and under it, in smaller text, \"INSTAGRAM.\" There are a few fine, dark lines moving on the screen. Some of the lines grow and change color, first to light blue then light green. The lines become more of a flowing, abstract shape and the text \"Wilaande Provisions\" appears in blue, then darkens to black. The scene changes to a woman wearing a pink shirt in a room. At the top left corner, the text \"IMPORTED_CHOCOLATE\" in a yellow box appears again, and under it, in smaller text \"INSTAGRAM\". Below the woman, white text appears that reads \"www.damisactivewear.com.\" The woman is holding a yellow fabric garment and talks while holding it. The scene then changes to the same woman wearing the yellow garment, now a two-piece swimsuit. She moves around to display it, then changes back to wearing the pink shirt while holding the garment. The swimsuit appears on the woman once more and then it changes back to her holding the swimsuit.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order in which the woman introduces the bikini in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. She tries it on first, then explains.",
+ "B. She explains first, then recommends it.",
+ "C. She explains first, then tries it on.",
+ "D. She explains first, then criticizes it."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "For what occasion does the woman plan to use this bikini?",
+ "answer": "A",
+ "candidates": [
+ "A. A vacation.",
+ "B. A yacht.",
+ "C. The beach.",
+ "D. A pool."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Based on the skin color, to which continent's ethnic group might the woman in the video belong?",
+ "answer": "B",
+ "candidates": [
+ "A. North America.",
+ "B. Europe.",
+ "C. Africa.",
+ "D. Asia."
+ ]
+ }
+ },
+ "wIaZkXzf": {
+ "video_id": "wIaZkXzf",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Software",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A blue and white hexagonal pattern appears first, followed by a vintage-style television. A man wearing a red t-shirt appears within the television screen, standing behind a computer displaying a blue screen. The man touches his head, then crosses his arms. The computer screen shows an \"Activate Windows\" message. Then the computer screen changes to a Windows 11 interface. The video then shows a pink grid background with \"CDKEYOFFER\" in a stylized font and then the words \"WINDOWS 10 + WINDOWS 11\" and \"ACTIVATION CODES\". A man with dark hair and glasses appears in front of the grid. A computer screen in front of the man displays the words \"LEGIT SAFE MURA\". The man gives a thumbs up. A computer screen displaying a website with a product page for Microsoft Windows appears. The website shows an \"Add to Cart\" page and then a payment method selection. The video shows a computer screen displaying an \"Activate Windows\" prompt. A person wearing a helmet and a patterned dress begins to dance. Then, the video shows multiple people and video snippets within the frame. Then, a woman named \"Kay Jaynalynn\" appears and is described as a \"housewife\". A woman in a blue shirt with a floral pattern and a man in a black shirt stand in front of a building. Then the woman\u2019s belly appears extremely enlarged. The man touches her belly. A woman is shown and named \"Coach Winnie Dapoix\" and is described as a \"Zumba Instructor\". She is seen teaching a Zumba class in an outdoor area. A man named \"Raphael Claude Ismael\" appears and is described as a \"Web Developer.\" He is shown using a computer and talking. A man with dark hair is shown in front of a banner that says \"Hardware Sugar\". He is shown holding up a graphics card box. A woman with dark hair is shown standing in front of a city backdrop. A man with a beard and glasses appears. He gives a thumbs-up. 
The video concludes with the \u201cCDKEYOFFER\u201d logo and the text \"Use Code: HS20 to get discounts\" and a smiling man in the foreground with many people behind him.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "How much of a discount is the Windows 10 activation tool recommended in the video offering?",
+ "answer": "A",
+ "candidates": [
+ "A. $5.55.",
+ "B. $16.66.",
+ "C. $16.21.",
+ "D. $21.21."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event occurs in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A man is cross-dressing and appears to be pregnant.",
+ "B. A man is cross-dressing and marries another man.",
+ "C. A man is cross-dressing and playing computer games.",
+ "D. A man is cross-dressing and doing fitness exercises in the park."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What do the text and numbers inside the red box on the page displaying shopping methods represent?",
+ "answer": "B",
+ "candidates": [
+ "A. Represent the original price of 25.21, with a payment of 16.66 and a discount of 8.55.",
+ "B. Represent the original price of 22.21, with a payment of 16.66 and a discount of 5.55.",
+ "C. Represent the original price of 21.21, with a payment of 16.66 and a discount of 4.55.",
+ "D. Represent the original price of 23.21, with a payment of 16.66 and a discount of 6.55."
+ ]
+ }
+ },
+ "qdfvlgVp": {
+ "video_id": "qdfvlgVp",
+ "video_duration": "53s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Software",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A webpage with the text \"Create a New Campaign\" in green on a gray background is visible. A white box appears with the text \"Campaign Save\" and \"You Saved Campaign Successfully!!\" A green \"OK\" button is in the box, and the box disappears after it is selected. A new page appears, with a red button that reads \"Created Campaigns\", and a table with the columns \"Product Image\", \"Campaign Name\", \"Product Title\", \"Description\", \"Created On\", \"Public Link\", and \"Action\". A blue button labeled \"Generate Link\" appears under the Public Link column. A url appears in a grey box, and is replaced with \"campaign?id=10014\". A new tab is opened, and the text \"Leave a Testimonial for 'Make the Leap to Paleo'\" is displayed in green, along with an image with the text \"Make the Leap to Paleo\" surrounded by images of fruits and vegetables. The text \"Get a 20% Coupon For a Review\" is also visible, along with a purple arrow, and text that reads \"Submit Your Testimonial\". The text is followed by a form that reads \"Fill out the following fields to proceed\" with the options to choose a picture, first name, last name, email, and select a date, and a red button labeled \"next\". The text \"Ben\" and \"Murray\" autofill the form. A file browser appears, and then the form is updated with a head shot of a person in a suit. A calendar appears to select a date. After a date is selected, the red \"Next\" button is selected. A new form appears that reads \"Write your comments about the product above,\" with an text field below and 5 black stars. The word \"Great!\" is typed into the field, and the fifth star is selected, turning it yellow, along with the first four stars. The red \"Next\" button is selected again, and a new form appears with three tabs at the top \"Info\", \"Your Review\", and \"Video Testimonial\". 
The \"Video Testimonial\" tab is selected and has the options to \"Enter YouTube URL\", or \"Record Live\" a grey box, and the option to click the red \"Record Testimonial\" button. The \"Record Live\" option is selected, and a page appears with a black box, labeled \"Video Status: inactive\", and a red \"Start Recording\" button, which is then selected. A man in a grey shirt appears on screen, and then the red \"Stop Recording\" button is selected, followed by a selection of the green \"Finish\" button. The webpage showing the man speaking appears with a white box over top of it with the text \"Great! Leave a Testimonial for Make the Leap to Paleo\", as well as the man's head shot, and a red \"Submit\" button. The red \"Submit\" button is selected.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where are the jump links located on the page?",
+ "answer": "D",
+ "candidates": [
+ "A. At the bottom left.",
+ "B. At the top left.",
+ "C. At the top right.",
+ "D. At the bottom right."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the person not shown in the video do on this webpage?",
+ "answer": "D",
+ "candidates": [
+ "A. Shop online.",
+ "B. Create an account.",
+ "C. Rate the product.",
+ "D. Record a tutorial."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times is the word \"Vidmonial\" mentioned by the commentator in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. 2.",
+ "B. 4.",
+ "C. 1.",
+ "D. 3."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What identity might the man speaking in the video have?",
+ "answer": "C",
+ "candidates": [
+ "A. Video blogger.",
+ "B. Review personnel.",
+ "C. Website staff.",
+ "D. User."
+ ]
+ }
+ },
+ "lRnVcNpX": {
+ "video_id": "lRnVcNpX",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three men sit on chairs on a stage. The man on the left plays a violin. The man in the center plays an accordion. The man on the right plays a guitar. Each man has a music stand in front of him. There are microphones in front of the accordion and guitar. A small portion of a person\u2019s head is visible at the bottom of the screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Which instrument can play simultaneously with the violin?",
+ "answer": "A",
+ "candidates": [
+ "A. A harmonica.",
+ "B. A cello.",
+ "C. A banjo.",
+ "D. A bassoon."
+ ]
+ }
+ },
+ "UceieAXi": {
+ "video_id": "UceieAXi",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video is split-screen. On the left, a man with a mustache and longish brown hair pulled back under a white fedora plays a saxophone at the water's edge of a calm body of water. He wears an open white jacket over a white shirt. The sky is a beautiful golden sunset color. On the right side of the split-screen, a young girl with long brown hair and a light blue dress plays the violin. She has a light-colored bow in her hair. The sun is setting behind her in an orange sky. The girl occasionally smiles at the camera while playing. The two musicians appear to be playing the same song together. The girl occasionally moves her head and upper body rhythmically to the music. The man occasionally moves his hands as if conducting or reacting to the music.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Where does the sound of the drum set in the video come from?",
+ "answer": "C",
+ "candidates": [
+ "A. The man on the left.",
+ "B. The performance of the man and woman together in the video.",
+ "C. The voiceover in the video.",
+ "D. The woman on the right."
+ ]
+ }
+ },
+ "vWcEVEGa": {
+ "video_id": "vWcEVEGa",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video begins with a ground-level view of the forest floor, showing plants, leaves, and a large rock. A person is then shown sitting on a stool in the forest, playing a cello. The view then returns to the forest floor, with various shots. The video then shows a woman standing in a shallow creek, wearing a brown hat, blue shirt, and brown shorts, while playing a violin. These shots alternate until the end of the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur in the audio of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The cello and violin shift from bowing simultaneously to plucking simultaneously.",
+ "B. The cello and violin transition from plucking simultaneously to bowing simultaneously.",
+ "C. The cello switches from plucking while the violin bows to the violin plucking while the cello bows.",
+ "D. The cello moves from bowing while the violin plucks to the violin bowing while the cello plucks."
+ ]
+ }
+ },
+ "ZWnYWVDp": {
+ "video_id": "ZWnYWVDp",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Biology",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A neuron appears in the center of the screen against a dark background. The neuron consists of a cell body with a purple nucleus, multiple dendrites extending outward, and an axon extending downward with axon terminals at the end. A white label with black text appears below the cell body with the words \"Cell body\". Then, white label appears with the words \"Dendrite\" above the dendrites and \"Axon\" below the axon. Then, white label appears at the bottom of the axon with the words \"Axon terminals\". The camera zooms in on the cell body. White labels with black text appear over the cell body, first with the words \"Cell body or Soma\", then \"Cytoplasm\", then \"Nucleus\". The camera pans up the cell body. The cell body glows yellow and a light travels along the dendrites. A white label with the word \"Dendrites\" appears above the dendrites. Another label with black text says \"Dendrites carry impulses towards the cell body\". The camera zooms out and then a light travels down the axon. White label with black text appears along the axon with the word \"Axon\". The light travels down the axon and illuminates the axon terminals.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the spatial relationship between dendrites and the cell body?",
+ "answer": "B",
+ "candidates": [
+ "A. Separated from the cell body by an axon.",
+ "B. Connected to the cell body.",
+ "C. No spatial relationship with the cell body.",
+ "D. Located at the center of the cell body."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the video most likely going to introduce next?",
+ "answer": "B",
+ "candidates": [
+ "A. Dendrite.",
+ "B. Axon terminal.",
+ "C. Axon.",
+ "D. Cell body."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Which part of the neuron receives neural impulses?",
+ "answer": "C",
+ "candidates": [
+ "A. Axon terminal.",
+ "B. Cell body.",
+ "C. Dendrite.",
+ "D. Axon."
+ ]
+ }
+ },
+ "qdudpHMG": {
+ "video_id": "qdudpHMG",
+ "video_duration": "92s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A white pickup truck with a lightbar on top is visible, with the camera facing the rear of the vehicle. The lightbar is emitting flashing blue and red lights. The truck is traveling on a paved road. At times, a sidewalk-like area is visible on the side of the road. The video shows two people on the road. The first person is wearing a red shirt, black pants, and black shoes. The second person is wearing a black uniform. The person in the black uniform appears to be assisting the person in the red shirt, who at times is lying on the ground. The two people are walking toward the white pickup truck and enter the vehicle. The white pickup truck is labeled \u201cPOLICE\u201d and \u201cEPHRATA\u201d. The number \"911\" is visible on the rear of the vehicle.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotional changes did the person holding the video camera experience due to the actions of the person in uniform?",
+ "answer": "C",
+ "candidates": [
+ "A. From anger to disappointment.",
+ "B. From anger to disappointment to tension.",
+ "C. From anger to tension to disappointment.",
+ "D. From anger to tension."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What scene is depicted in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A police car.",
+ "B. A grassy area.",
+ "C. A street side.",
+ "D. A highway."
+ ]
+ }
+ },
+ "wWxCqUmm": {
+ "video_id": "wWxCqUmm",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are visible. The person on the left is holding and playing a violin. The person is wearing a dark long-sleeve shirt, dark pants, and a light-colored belt. The person on the right is seated and playing a guitar, and has a music stand in front of them. The person is wearing a light colored, long-sleeved shirt. The background is a plain light color.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurred in the sound of the violin in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It changed from rapid to gradual.",
+ "B. It has always been slow.",
+ "C. It changed from slow to rapid.",
+ "D. It has always been rapid."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion does the music in the video express?",
+ "answer": "B",
+ "candidates": [
+ "A. Relaxed.",
+ "B. Tense.",
+ "C. Joyful.",
+ "D. Frightening."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the guitarist press the strings with their palm?",
+ "answer": "B",
+ "candidates": [
+ "A. Eleven.",
+ "B. Ten.",
+ "C. Eight.",
+ "D. Nine."
+ ]
+ }
+ },
+ "hhdeNHAT": {
+ "video_id": "hhdeNHAT",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video opens with a wide shot of a forested area. A man sits on a large tree stump playing a tuba. He is center frame in the foreground. In the background, groups of musicians stand among the trees, playing various brass instruments and a drum set. The title \"B\u00c4RENWALZER\" appears at the top of the frame in gold text, along with \"Solo f\u00fcr Bass/Tuba und Blasorchester\" and \"von Adam Hudec\" below.The camera zooms in slightly, focusing on the tuba player. The camera then cuts to a close-up of the tuba player. He wears a white shirt with colorful embroidered patterns on the sleeves and a dark green vest. He plays the tuba continuously.The camera cuts to a group of musicians playing clarinets. They are standing in a line among trees. They all wear black pants and dark green vests over white shirts. They, too, play continuously.The camera cuts back to the tuba player, zooming in slightly. The camera then cuts to a group of musicians playing brass instruments, including trumpets and trombones, along with a person playing a drum set.The camera cuts back to a close-up of the tuba player. The shot focuses on his hand moving along the tuba valves. The camera cuts back to the wider shot of the forest, the tuba player, and the musicians in the background. The camera cuts back to a close-up of the tuba player and then returns to the wide shot of the entire scene, including the tuba player and the orchestra. The music continues.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Which instrument has the strongest sense of rhythm in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Drum.",
+ "B. Clarinet.",
+ "C. Trumpet.",
+ "D. Tuba."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Where is the drum, which produces the sound in the video, located?",
+ "answer": "B",
+ "candidates": [
+ "A. In the center at the back of the video.",
+ "B. On the right side at the back of the video.",
+ "C. At the very front of the video.",
+ "D. On the left side at the back of the video."
+ ]
+ }
+ },
+ "xknKmxmY": {
+ "video_id": "xknKmxmY",
+ "video_duration": "365s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black jersey with the number 35 is standing at a desk and using a computer. Another person wearing a pink t-shirt and black shorts comes into the scene. A child is standing near the couch. Another person in a red t-shirt and dark pants is also in the room, and a dog walks up to them. One of the people hands a small white object to the person at the desk, and they talk with the other two people in the room. The child plays with a soccer ball. A camera is pointed at the person in the pink t-shirt, then at the person on the couch wearing a black jersey, then at the person with a red t-shirt. Finally, the camera focuses on the person in the pink t-shirt, the child, and then returns to the person in the pink shirt.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "Who did the little girl hit with a ball in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A woman in a red T-shirt.",
+ "B. A man in a pink T-shirt with ear piercings.",
+ "C. A woman in ripped jeans.",
+ "D. A man with a beard and glasses."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many people appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Six.",
+ "B. Five.",
+ "C. Four.",
+ "D. Seven."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the emotional state of the man with a beard and glasses change?",
+ "answer": "D",
+ "candidates": [
+ "A. From nervousness to frustration.",
+ "B. From nervousness to anger.",
+ "C. From anger to frustration.",
+ "D. From nervousness to humor."
+ ]
+ }
+ },
+ "mDWjoiOG": {
+ "video_id": "mDWjoiOG",
+ "video_duration": "610s",
+ "duration": ">8min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A hand is seen holding a black pen, and drawing on a piece of gridded paper. On the paper, there is a horizontal and vertical axis numbered from 0 to 10 and a small shape made up of four black dots labeled \"A\", \"B\", \"C\" and \"D\". To the right, there is a text written in black, \"3 RIGHT\" and \"2 down\". The pen is then used to move the letters A, B, C and D, following the rules mentioned above. This process is repeated again. Then, a ruler is brought into view and used along with the pen to create a vertical line through the location of C. The same ruler is then used to create the horizontal line through the location of C. Then, a piece of paper is used to overlay the original shape, and then rotated. The corners of the shape are then traced onto the gridded paper below, to map out the rotated version of the original shape. The original sheet is then removed, revealing the new shape. Then, the text \"Thanks for watching!\" is briefly superimposed on the image, before a blue screen is shown.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occur to the coordinates of point C after the geometric figure is reflected?",
+ "answer": "A",
+ "candidates": [
+ "A. The coordinates change from (4, 6) to (6, 2).",
+ "B. The coordinates change from (4, 6) to (6, 4).",
+ "C. The coordinates change from (4, 6) to (5, 1).",
+ "D. The coordinates change from (4, 6) to (5, 5)."
+ ]
+ }
+ },
+ "CLdCJsPA": {
+ "video_id": "CLdCJsPA",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video shows a person playing a guitar. The person has a bald head and wears a blue plaid shirt. The person sits in a black chair in a room with wooden furniture. Later in the video, the camera shows another person playing a trumpet. This person is wearing glasses and a dark-colored shirt and black headphones. The background is a blank white wall. These two clips alternate throughout the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Does the brass instrument have a stronger sense of rhythm in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No.",
+ "C. Yes."
+ ]
+ }
+ },
+ "yeqcNPai": {
+ "video_id": "yeqcNPai",
+ "video_duration": "66s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man sits in the driver's seat of a truck. The camera focuses on the dashboard. The odometer reads 256,642 miles. The man turns the key in the ignition. The camera view shifts back to the man, now driving. He talks to the camera while driving along a road lined with fields and power lines. A speed limit sign of 45 mph is visible. The camera focuses on the road again. The man continues talking while driving, looking between the camera and the road.The scene changes to a partially demolished camper trailer parked on a grassy area. A man in an orange shirt works on the camper's exterior, removing debris. Another man points to different parts of the camper, showing its damaged sections, exposed wooden frame, and scattered debris, including wooden panels, insulation, and metal screws. He climbs onto the camper's roof, showing a blue chalk line drawn along its surface. A large piece of insulation lies on the roof. One man pulls on a part of the camper's wall, revealing the damaged interior. Two men work together on removing a panel from the camper. The man pointing continues to show the camper's exterior, indicating the damaged areas and exposed insulation. He points upwards to the roofline where the top edge appears detached.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where is the disassembled door placed in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The grass.",
+ "B. A road.",
+ "C. A house.",
+ "D. A camping trailer."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color are the chalk lines on the roof of the camper van in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. White.",
+ "B. Purple.",
+ "C. Blue.",
+ "D. Red."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What was the driver's emotion when it was discovered in the video that the droplets on the car were actually insects?",
+ "answer": "D",
+ "candidates": [
+ "A. Finds it unbelievable.",
+ "B. Extremely surprised.",
+ "C. Half believing, half skeptical.",
+ "D. Takes it for granted."
+ ]
+ }
+ },
+ "SucySLvH": {
+ "video_id": "SucySLvH",
+ "video_duration": "198s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "Sports Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person wearing a gray long-sleeved shirt, dark pants, a gray baseball cap, and white shoes stands in front of a green mat and a white golf ball on the mat. They are holding a golf club. On the left side of the frame is a large mirror and a black chair. There is a person's reflection visible in the mirror. On the right side of the frame are several golf bags and an orange golf ball tee. The person adjusts their stance, then swings the club, striking the ball. A new view appears on the left of the frame, showing a golf course, with an orange line indicating the ball's trajectory. The person then adjusts their stance and swings the club again. The golf course view appears again. The person then stands in front of a wooden wall, and a black graphic appears at the bottom of the frame. The person manipulates the club in their hands and gestures with one hand. They then swings the club again. The golf course view appears again. The person gestures with one hand again, and adjusts their grip on the club again. They place the club on the ground again. They then use one hand to gesture again, and the video ends.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What does the \"5651\" on the electronic display indicate?",
+ "answer": "A",
+ "candidates": [
+ "A. The spin speed of the ball.",
+ "B. The speed of the ball.",
+ "C. The height of the ball's flight.",
+ "D. The traveling speed of the ball."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the man's emotion at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Happy.",
+ "B. Disappointed.",
+ "C. Angry.",
+ "D. Calm."
+ ]
+ }
+ },
+ "oRNJRDIe": {
+ "video_id": "oRNJRDIe",
+ "video_duration": "45s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A man with braided hair and a maroon long-sleeved shirt stands slightly angled to his right, facing a woman with long, curly brown and blonde hair, who is holding a microphone to his face. He appears to be answering her questions. Behind them, several other people are present, some holding cameras and recording equipment. One man behind the interviewer is wearing a mask. The room they\u2019re in has large beige curtains and a partially visible window showing the daylight outside. The name \"JOHN DAVID WASHINGTON\" appears in the bottom left corner, below which \"BOY WILLIE\" is written in smaller font. A logo reading \"B|S\" appears in the bottom left corner. The interview continues, with both individuals maintaining eye contact and engaging in conversation. The man occasionally smiles while the woman maintains a more serious expression as she listens and speaks into the microphone.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video is the man wearing a brown and red shirt smiling happily?",
+ "answer": "D",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. In the middle of the video.",
+ "C. Throughout the video.",
+ "D. At the beginning of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the emotion of the man wearing a brownish-red shirt in the video change?",
+ "answer": "C",
+ "candidates": [
+ "A. From calm to excitement.",
+ "B. From disappointment to seriousness.",
+ "C. From joy to seriousness.",
+ "D. From excitement to anger."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man wearing a blue mask do with the camera in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. He is protecting his camera.",
+ "B. He demonstrated his camera in front of the lens.",
+ "C. He showed off his camera to others.",
+ "D. He took several photos in front of him."
+ ]
+ }
+ },
+ "ElWbyEpe": {
+ "video_id": "ElWbyEpe",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Book Reviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with curly hair is talking and holding a paper. Then, three people, two women wearing green dresses and a man wearing a suit and holding a white object, are singing on a stage. Next, there is a picture of six people, four men wearing suits and two women, in front of green foliage. Then, the woman with curly hair is talking and a faint image of the people from the previous picture is overlaid on her. She is then shown speaking again, and then sitting in a chair in front of a bookshelf. Next a shot of the woman in the green dress singing is shown again. Then, a black screen with the words \"...AND CAPTURED THE HEART OF A MUSICAL LEGEND\" appears. A couple is hugging, and then the same couple is shown in a close-up shot. Then, a picture of the couple is shown in a newspaper. Then, a black and white image of the couple is shown. Then, a black screen with the words \"SHE WAS AT THE HEIGHT OF HER CAREER\" appears. Two men are shown at a podium, one wearing a black blazer and the other wearing a blue shirt with a bow tie and a dark blazer. Then, the curly-haired woman is shown in an audience, and walks up to the stage. Then a shot of a stage with the text \"BEST LATIN POP PERFORMANCE\" is shown, along with a text graphic displaying \"LANI HALL \"Es Facil Amar\". A woman with red hair is hugging a man at a podium. Finally, the same woman with red hair is shown speaking at a microphone on the podium. Then a black screen with the words \"Then Her Voice\" appears.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What type of music is played at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Jazz music.",
+ "B. Rock music.",
+ "C. Classical music.",
+ "D. Pop music."
+ ]
+ }
+ },
+ "HGsvAAug": {
+ "video_id": "HGsvAAug",
+ "video_duration": "228s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Book Reviews",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "A person wearing a blue shirt and an orange knit hat is in front of a bookshelf. They are holding up and displaying various books, one at a time. The books have different colors, including black, red, and tan. They display several different books, some while holding the spine facing the camera, and at other times the front cover of the books are facing the camera. They are speaking. At some times their hands are on either side of their torso, and at other times they are using their hands to gesture while holding the books.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "In the video, what are the titles of the books mentioned by the man that come before and after \"The Soul of West Lake\"?",
+ "answer": "D",
+ "candidates": [
+ "A. \"NOS4R2\" and \"The Art of Writing.\"",
+ "B. \"The Book of Tribal Members\" and \"The Underground Railroad.\"",
+ "C. \"The Nosra Book\" and \"The Art of Writing.\"",
+ "D. \"The Underground Railroad\" and \"The Nosra Book.\""
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Who is the author of the book *The Soul of West Lake*?",
+ "answer": "D",
+ "candidates": [
+ "A. Giyang.",
+ "B. Joe Hill.",
+ "C. Colson Whitehead.",
+ "D. Rio de Janeiro."
+ ]
+ }
+ },
+ "BPJUFPmO": {
+ "video_id": "BPJUFPmO",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A girl and a boy stand in a classroom, facing the camera. The girl, on the left, holds a saxophone. She wears a short-sleeved navy dress and a light gray scarf. The boy, on the right, holds a trumpet. He wears a navy polo shirt and khaki cargo shorts. They stand between two black music stands. A whiteboard is behind them, along with other classroom items including a desk, shelves, and posters. An American flag hangs on a pole near the corner. In front of the musicians, an older boy sits in a chair holding a younger boy on his lap. They watch the musicians. As the duo begins to play, the video zooms out slightly, revealing three more seated audience members. The girl and boy play their instruments. The audience watches attentively. The two musicians continue playing, and the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many types of musical instruments can be heard in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. Three.",
+ "D. One."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Are both instruments playing simultaneously throughout the video?",
+ "answer": "D",
+ "candidates": [
+ "A. No.",
+ "B. I am not sure.",
+ "C. None of the above.",
+ "D. Yes."
+ ]
+ }
+ },
+ "NYyFfAWf": {
+ "video_id": "NYyFfAWf",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "There are nine people standing in front of what appears to be an entrance with glass doors, with two people to their left. Each standing person has a music stand in front of them. All of the standing people are holding flutes. There is a red rope across the front. The person standing in the center of the nine has a microphone in his hand. Below the standing people are two people in wheelchairs. There is a sign in the lower right corner that says \"Five Floor\" with an arrow. There is a red banner above the glass doors that has some Chinese writing on it. There are two television screens hanging above the glass doors. A couple more people can be seen in the far background to the left of the standing people.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the person in the video playing the flute?",
+ "answer": "D",
+ "candidates": [
+ "A. For a course examination.",
+ "B. For a formal performance.",
+ "C. To participate in a competition.",
+ "D. To rehearse music."
+ ]
+ }
+ },
+ "InzNIMnE": {
+ "video_id": "InzNIMnE",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two girls play string instruments in a living room. On the left, a girl in a blue dress plays the violin. On the right, a girl in a white shirt plays the cello. A wooden bookcase filled with books is visible behind the violinist. A flat-screen television is mounted on the wall behind the cellist, above a brick fireplace. A wooden music stand holds sheet music between the two girls. They play their instruments throughout the duration of the video. The violinist moves her bow across the strings, her left hand manipulating the violin's neck and fingerboard. The cellist\u2019s left hand moves up and down the cello's fingerboard as she bows, and her right arm moves the bow across the strings. Neither girl looks at the camera, and their expressions remain focused on their instruments.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Which instrument has a stronger sense of rhythm, the violin or the cello?",
+ "answer": "B",
+ "candidates": [
+ "A. The cello has a stronger sense of rhythm.",
+ "B. The violin has a stronger sense of rhythm.",
+ "C. It is not known.",
+ "D. Both have the same sense of rhythm."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurs in the sound of the violin in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The tempo decreases.",
+ "B. The tempo increases.",
+ "C. The volume increases.",
+ "D. The volume decreases."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many types of instrument sounds are in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three types.",
+ "B. Four types.",
+ "C. Two types.",
+ "D. One type."
+ ]
+ }
+ },
+ "GLKpTtvz": {
+ "video_id": "GLKpTtvz",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man wearing glasses and a black shirt plays a four-string bass guitar. He is standing in front of a backdrop with the repeated words \"Sambil Barquisimeto\". The video has a reddish tint and rounded corners, giving it a vintage film effect.\u00a0Later, the camera pans out, revealing a drummer and another guitarist.\u00a0 The drummer sits behind a standard drum kit, wearing a plaid shirt and headphones. The other guitarist, bald and wearing a dark striped shirt, plays a hollow-body electric guitar. The three musicians continue to play.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Which instrument has a stronger sense of rhythm, the electric guitar or the drum set?",
+ "answer": "C",
+ "candidates": [
+ "A. Both have the same sense of rhythm.",
+ "B. The drum set has a stronger sense of rhythm.",
+ "C. The electric guitar has a stronger sense of rhythm.",
+ "D. I don't know."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many electric guitars are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. One.",
+ "B. Three.",
+ "C. Two.",
+ "D. Four."
+ ]
+ }
+ },
+ "MJnIbMgM": {
+ "video_id": "MJnIbMgM",
+ "video_duration": "46s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man and woman stand on a beach before large illuminated letters spelling \"MARRY ME\". Red rose petals are scattered around their feet, and several red roses are stuck in the sand. The ocean is visible behind them, and the sun appears to be setting. The man and woman are both dressed in white. The woman has a flower in her hair. They are holding hands. The man lets go of one of the woman's hands and reaches into his pocket. The camera zooms in closer to the couple. The man continues holding the woman's left hand with his right hand while pulling a small box from his pocket with his left hand. The woman brings her right hand up to her chest. The man opens the box to reveal a ring. He gets down on one knee, and the camera angle shifts, revealing more of the red rose petals and the ocean. The woman covers her mouth with her hands and nods her head yes. The man places the ring on her finger. The camera angle shifts closer again, as the woman continues to hold her face with both hands. She looks at the ring. She removes her hands from her face, touches her neck and hair, and then places her left hand on the man's face. They kiss.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "When was the man originally planning to propose in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. December 5th.",
+ "B. December 6th.",
+ "C. December 4th.",
+ "D. December 7th."
+ ]
+ }
+ },
+ "PFokqhaM": {
+ "video_id": "PFokqhaM",
+ "video_duration": "95s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Sports Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows the screen of a mobile game. In the first scene, five digital cards featuring football players are displayed. The cards have different numbers and positions, and one is labeled \"Jeremy Lane\" and another \"Chris Long.\" The screen then transitions to the game store, where \"Super Bowl Starter Pack\" and \"Big Game Pack\" are among the options. The \"Big Game Pack\" is selected, and the next scene displays the same pack card against a stadium background, with the text \"I WAS STREAMING\" overlaid on the screen. Then the text \"THEN THIS HAPPENED...\" appears. Next, a screen displays five cards with the NFL logo on their backs, and the text \"THEN THIS HAPPENED...\" appears again. After that, a set of five cards with football players is displayed. The first card is labeled \"DA'MON CROMARTIE-SMITH\" and another \"CHRIS CLEMONS.\" This series of cards also has the text \"THEN THIS HAPPENED...\". Another set of five cards is shown. Then the screen returns to the game store, with the text \"THEN THIS HAPPENED...\" briefly visible before fading. Another view of the Big Game Pack card is presented, then the NFL card backs are shown once more. A set of five blank cards is then shown, and finally, the last scene contains a set of five football cards. The first card is labeled \"BOB SANDERS,\" another \"LOUIS VASQUEZ,\" and \"D'JOUN SMITH\" is also visible. A series of card statistics then appears for one of the cards before the sequence loops again through a similar set of five cards, some of which have the word \"SETS\" in the corner of the card, and the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man in the video swipe the card?",
+ "answer": "C",
+ "candidates": [
+ "A. Four.",
+ "B. One.",
+ "C. Two.",
+ "D. Three."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What sport is the man in the video likely to enjoy?",
+ "answer": "A",
+ "candidates": [
+ "A. Rugby.",
+ "B. Soccer.",
+ "C. Basketball.",
+ "D. Table tennis."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Why does the man's voice suddenly change in the middle of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Because he drew his favorite soccer star.",
+ "B. Didn't draw the character he wanted.",
+ "C. Lacks the gold coins for the next draw.",
+ "D. Accidentally triggered the decompose feature."
+ ]
+ }
+ },
+ "SbbeAnCD": {
+ "video_id": "SbbeAnCD",
+ "video_duration": "54s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Historical Analysis",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video opens with a black screen displaying the text \"Part 3\" followed by \"19th Century Nationalism\" in white text. The screen then transitions to a graphic design program interface displaying a map of Europe with blue fill. A red line is drawn on the upper left of the map and the numbers \"1815-\" are written in red next to it. The number \"1914\" is written in red next to the previous text, completing \"1815-1914\". A web browser window is opened on top of the interface, displaying a wikipedia page about \"Rise of nationalism in Europe\". The browser window is moved out of the way, and the browser is used to open another wikipedia page, this one titled \"Sick man of Europe\". This browser window is then moved out of the way. Red, orange, and yellow lines are drawn around portions of the map, followed by the map being filled in with black to show an outline. An orange colored outline is drawn around another area, a red line is drawn beneath it, followed by three red exclamation points. A small red shape resembling fire is drawn on the bottom center of the map.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Is there a change in the background sound in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. No."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order in which the video listed other countries that became independent after the Napoleonic Wars? (a) Norway, (b) Luxembourg, (c) Italy, (d) Germany, (e) Belgium.",
+ "answer": "C",
+ "candidates": [
+ "A. (e)(d)(b)(c)(e).",
+ "B. (a)(b)(c)(d)(e).",
+ "C. (e)(a)(d)(c)(b).",
+ "D. (a)(c)(d)(b)(e)."
+ ]
+ }
+ },
+ "eXaCGcXh": {
+ "video_id": "eXaCGcXh",
+ "video_duration": "196s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man with long brown hair and a beard sits in front of a microphone. He wears a red t-shirt and has beaded bracelets on both wrists. He adjusts his hair and gestures with his hands while speaking.The video cuts to gameplay of Oddballers, showing cartoonish characters in a field with red propane tanks. A character throws a plunger attached to a rope, which wraps around a tank and pulls it toward them. Another character throws a tank at another player. A tank explodes.The video then transitions to a different game, showing gameplay on what appears to be a ship or barge in a canal. Cartoonish characters are again present, running around the barge.The scene shifts to a vertical phone screen showing Pikmin Bloom gameplay. Pikmin are walking across a green field.The man in the red shirt reappears.An animated cutscene begins. It shows artwork of a white, scaled creature with multiple heads and purple spikes on its back. Four warriors with swords, bows, and axes are depicted fighting. The scene shifts to a black and grey silhouette of a large monster being attacked by a small figure. A purple dragon flies through the air. It is hit by a pink beam and falls.A tweet from Wood Hawker (@BeatemupsWood) is displayed. It reads: \"The only 3 things that EVERY Nintendo Fan want to see tomorrow: 1. Fire Emblem 2. Xenoblade 3. New Kirby\" and is timestamped \"1:42 PM \u00b7 Sep 12, 2022 \u00b7 Twitter for Android\".The man in red returns.The animated cutscene continues, returning to the four warriors fighting. A young man with split blue and red hair is shown. He wears a white and gold outfit with blue accents. A red-haired person in armor fights a white dragon. They leap towards the dragon and swing their sword. The character with split hair reappears. He faces the camera and has a determined expression. The same character is now shown from the back, standing in a field with a sword at their side. 
The title \"FIRE EMBLEM ENGAGE\" appears in red and blue text on a dark background. The man with split hair is lying down, and a woman with light hair speaks to him. He sits up. Two light-haired people look at him with concern. The split-haired man appears again.The man in red is back onscreen.Gameplay of Octopath Traveler II is shown. A character is being sentenced to life imprisonment. A woman in a purple dress stands on a street. A person stands on the rooftop of a building in a city. A man in robes addresses a small group. Three characters stand in front of a building with a steeple. Two characters stand on a cliff overlooking a foggy landscape. A character stands in front of a large gate. A character enters a room with a checkered floor. Two characters fight in the same checkered floor room. Three characters stand on a dark cliff under a large moon. A character stands in a burning village. A character stands in a cave with green water. A character speaks to a lion and fox near a waterfall. A group of characters walks down a street in a village.The man in red returns, commenting on the gameplay.The \"PIKMIN 4\" logo is shown. It features white and yellow flowers forming the number 4.The man in red is back.A gameplay clip shows a creature being lifted by a rope through a foggy landscape with floating islands. A character climbs a giant root or vine. A figure in a green suit freefalls. The figure continues to fall, then uses a glider to fly. The figure flies on a larger, fan-shaped glider. Green symbols appear around the glider. The symbols form a circle, and the glider flies towards the center. The green circle glows and symbols appear on it. The words \"THE LEGEND OF ZELDA\" appear within the circle, in gold. Below the title, \"TEARS OF THE KINGDOM\" appears in smaller gold letters. The Nintendo Switch logo appears in the upper left corner.The date \"5.12.2023\" is displayed in gold on a black screen. The Nintendo logo appears on a red background. 
The man in red returns. The \u201cTHE LEGEND OF ZELDA: TEARS OF THE KINGDOM\" screen is shown again, followed by the date \"5.12.2023\". The Nintendo logo appears. The man in red returns, then the date is shown again, and the man reappears, saying his goodbyes.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times do dragons appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Twice.",
+ "B. Four times.",
+ "C. Three times.",
+ "D. Once."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the second dragon that appears in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Black and purple.",
+ "B. Blue and white.",
+ "C. Black and red.",
+ "D. Black and white."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the dragon relative to the man with the sword?",
+ "answer": "C",
+ "candidates": [
+ "A. Behind and to the right of the man.",
+ "B. Behind and to the left of the man.",
+ "C. In front of and to the right of the man.",
+ "D. In front of and to the left of the man."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the emotion of the man in the red short-sleeve shirt in the video change?",
+ "answer": "D",
+ "candidates": [
+ "A. From sad to calm.",
+ "B. From calm to sad.",
+ "C. From excited to calm.",
+ "D. From calm to excited."
+ ]
+ }
+ },
+ "quGdQZGH": {
+ "video_id": "quGdQZGH",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three individuals are standing in a room. Each person is playing a saxophone. Each person has a music stand in front of them. The person on the left is wearing a blue shirt, the middle person is wearing a red shirt, and the person on the right is wearing a black shirt. There are empty chairs, a trash can, and a whiteboard in the room. The camera angle is stable and the lighting is bright.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurs in the saxophone music in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From gentle to heavy.",
+ "B. From heavy to gentle.",
+ "C. From calm to joyful.",
+ "D. From joyful to tranquil."
+ ]
+ }
+ },
+ "TENrJyjh": {
+ "video_id": "TENrJyjh",
+ "video_duration": "148s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A slot machine screen displays a game called China Mystery. The screen shows a $10 bet, $45 credit, and no initial win. The reels spin, revealing various symbols including Chinese coins, letters J and K, and a tea pot. A bonus round is triggered, displaying the text \u201cTotal Free Games 30.\u201d The game shifts to a bonus selection screen where a golden scale presents the choice between \"Credit Prize\" (1350 with 60 credits) and \"Free Game\" (30 free games). A hand taps \"Free Game.\" The game returns to the main screen with 29 free games remaining. The reels spin multiple times, with various symbol combinations and small wins. During the free spins, bonus credits accumulate. The free games continue to spin down, displaying different combinations of symbols like Chinese lanterns, golden pigs, and letters. The win amount gradually increases throughout the free spins. The game ends with 0 free games remaining and a final win of $406. The player\u2019s credit increases to $451. The game resets to its initial state, displaying the original symbols and a $10 bet. The video ends with an aerial view of a cityscape at night.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What musical instruments are heard in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Erhu.",
+ "B. Drum set.",
+ "C. Suona.",
+ "D. Saxophone."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "In the video, how does the music change when the slot machine wins again?",
+ "answer": "A",
+ "candidates": [
+ "A. When winning, the slot machine emits a sharp sound to indicate the win.",
+ "B. When winning, the slot machine emits a deep sound to indicate the win.",
+ "C. When winning, the slot machine emits a faint sound to indicate the win.",
+ "D. When winning, the slot machine emits a melodious sound to indicate the win."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many times does the man pull the lever on the slot machine in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Sixty times.",
+ "B. Fifty times.",
+ "C. Thirty times.",
+ "D. Forty times."
+ ]
+ }
+ },
+ "FqhoneQR": {
+ "video_id": "FqhoneQR",
+ "video_duration": "252s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with a beard sits in a gaming chair and puts on over-ear headphones. The scene transitions to a car driving on a highway at night. The man's face, now wearing headphones, is superimposed in the top left corner of the screen. The main view shows the car following a white Mitsubishi Lancer Evolution. The overlaid video of the man shows him reacting to the driving. Various YouTube-related graphics, such as \"Like\" and \"Subscribe\" buttons, along with a website address, intermittently appear on-screen. The car continues following the Lancer, which has \"MILLERS OILS\" printed on the back. The car eventually begins driving alongside the Lancer. The perspective cuts back to the man in the gaming chair, who adjusts his headphones. The highway driving scene resumes. The car drives behind the Lancer for a while and then starts to overtake it. The scene cuts back to the man in his gaming chair again. The driving scene resumes, showing the car driving behind the Lancer and attempting to overtake other cars on the highway. The driving sequence and the cuts back to the man in the gaming chair continue to alternate for the remainder of the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many types of transportation are shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Two types.",
+ "B. Five types.",
+ "C. One type.",
+ "D. Three types."
+ ]
+ }
+ },
+ "FvaGjjSC": {
+ "video_id": "FvaGjjSC",
+ "video_duration": "43s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows two people in separate video feeds. The top feed shows a person wearing a black robe and glasses, sitting in a chair in front of two flags and two circular emblems on a wooden wall. The bottom feed shows a second person wearing a suit with a light colored shirt and dark tie. The top person raises their hands and appears to speak. The bottom person also appears to speak. The video includes text overlays at various points, as well as graphical elements such as a gavel, lightning bolts, a symbol resembling an eye with a line through it, and red stick figures with lines coming from their mouths. The words \"Judge vs. Attorney\" also appear at the top of the video toward the end. The word \"ZOOM\" is visible on the bottom screen.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the occupation of the white man?",
+ "answer": "B",
+ "candidates": [
+ "A. Police officer.",
+ "B. Lawyer.",
+ "C. Judge.",
+ "D. Legislator."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Why have the voices of black men changed?",
+ "answer": "A",
+ "candidates": [
+ "A. Doubts from white men.",
+ "B. Dissatisfaction with the other party.",
+ "C. Discontent with the verdict.",
+ "D. To intimidate the other party."
+ ]
+ }
+ },
+ "RNSbFfpi": {
+ "video_id": "RNSbFfpi",
+ "video_duration": "57s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black background. In the center is a white circle with a \"U\" with a number \"4\" in its center. Text below this circle reads \"UNREAL ENGINE\". There are silhouettes of two people on either side of this circle. Below that a text box appears that reads, \"TEKKEN 7 RAGE ART AND RAGE DRIVE POWERFUL ATTACKS WHEN LOW ON LIFE\".The background changes to a scene with water, and clouds and mountains in the distance. Two characters are facing each other. The character on the left has blue hair, and wears a blue coat and white pants. The character on the right has blonde hair, and wears a floral blue shirt and blue jeans. The blue-haired character turns away. The camera focuses on the blue-haired character's face. Text appears at the bottom of the screen: \"You can handle me?\". The blue-haired character and blonde-haired character fight, and the blonde-haired character is sent tumbling. The blue-haired character is seen from the back, and then turns to face the camera, holding a red rose. Text appears at the bottom of the screen: \"Chew on this!\". The rose is thrown and hits the blonde character. The blonde character falls backwards. Text appears at the bottom of the screen: \"We're done\". The camera shows the blue-haired character walking away and the blonde-haired character falling down with red particles around him.The background changes back to the water scene, and two different characters appear. On the left is a character with reddish-orange skin and hair, who wears a blue outfit. On the right is a character with grey hair, and wears a brown outfit. The characters fight. The reddish orange character is seen close up, with text at the bottom of the screen: \"Spare yourself!\". The camera then shows the fight and focuses on the reddish orange character. White text in Japanese is displayed on screen. The character moves forward, and two pink light bursts are shown. 
The camera shows the character from the back with a pink Japanese character shown on the outfit. Text appears below: \"Metsu!\". The camera zooms out and then shows a fight scene.The scene changes to a nighttime setting. Two new characters face each other. The character on the left has blonde hair, and wears a red outfit. The character on the right has long dark hair, and wears a blue and gold outfit. Numbers and two names, \"PAUL\" and \"FENG\", appear at the top of the screen. The two characters fight. Yellow text \"PERFECT\" appears on screen. The blonde character performs more moves that create orange bursts of light, and the dark haired character is knocked to the ground. The scene then shows the two characters facing each other again, as they move and attack each other.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order are the super move mechanics introduced in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The video starts by demonstrating that the lower the life bar, the higher the damage of the super move. Midway through, it shows the super move being blocked, and later it explains the super move and its unlock conditions.",
+ "B. The video begins by introducing the super move and its unlock conditions. Midway through, it shows the super move being blocked, and later it demonstrates that the lower the life bar, the higher the damage of the super move.",
+ "C. The video starts by demonstrating that the lower the life bar, the higher the damage of the super move. Midway through, it explains the super move and its unlock conditions, and later it shows the super move being blocked.",
+ "D. The video begins by introducing the super move and its unlock conditions. Midway through, it demonstrates that the lower the life bar, the higher the damage of the super move, and later it illustrates that the super move can be blocked."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What move defeats the blond man midway through the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A super move.",
+ "B. An uppercut.",
+ "C. A combo move.",
+ "D. A straight punch."
+ ]
+ }
+ },
+ "cBhWjNyR": {
+ "video_id": "cBhWjNyR",
+ "video_duration": "575s",
+ "duration": ">8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Book Reviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with glasses and a cap sits in front of a wall decorated with artwork and bookshelves. He gestures with his hands while speaking. A cartoon zombie walks across the screen from left to right. Text appears on the screen: \"Durability: 5% Skip Capacity \u2248 1,000 BEUs \u2248 4th Heightening Connection: 10% Skip Capacity \u2248 2,000 BEUs \u2248 5th Heightening\". A book cover for \"Yumi and the Nightmare Painter\" by Brandon Sanderson appears on the right side of the screen, alongside a picture of a person looking at an island in the ocean. A map of California and Nevada appears, with a circled area highlighted near Los Angeles. The man continues gesturing. Quotation marks appear briefly on either side of his head. A logo consisting of a circle with a stylized arrow pointing downwards appears to his left.\u00a0 Simultaneously, on the right, a shadowy figure stands amidst smoke or fog, with a bright light source behind them. A banner reading \"I \u2665 Hemalurgy\" appears across the man's chest. Finally, the word \"PATREON\" appears at the bottom of the screen, along with several names above it.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is used to represent the Cosmo Universe timeline titles in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Blue.",
+ "B. Green.",
+ "C. White.",
+ "D. Red."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How many pieces of content has the Argent team published, as shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. 6018.",
+ "B. 6013.",
+ "C. 6618.",
+ "D. 6613."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the jet engine relative to the interstellar spaceship in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The jet engine is on the left side of the interstellar spaceship.",
+ "B. The jet engine is on the upper-right side of the interstellar spaceship.",
+ "C. The jet engine is on the lower-left side of the interstellar spaceship.",
+ "D. The jet engine is on the right side of the interstellar spaceship."
+ ]
+ }
+ },
+ "aVOJIFgy": {
+ "video_id": "aVOJIFgy",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three people are on a stage, two holding violins and one seated at a piano. The person at the piano is on the left of the frame, and is wearing a long dark dress, and is playing the piano. The other two people are in the center and on the right, facing the audience with their violins held to their left shoulders. The person in the center is wearing a black dress and white shoes and socks. The person on the right is wearing a long black gown and dark shoes. In front of the violin players are music stands, with sheet music on them. Behind all the people are two pianos, and more sheet music is visible, hanging from stands. The stage floor is wooden.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the violin on the right louder than the piano on the left?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes, the violin is louder.",
+ "B. No, the piano is louder.",
+ "C. No, they are equally loud.",
+ "D. I am not sure."
+ ]
+ }
+ },
+ "lFJblBlb": {
+ "video_id": "lFJblBlb",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three musicians are performing on a stage for a seated audience. To the left, a person plays a grand piano. In the center, a musician plays an erhu, a two-stringed bowed musical instrument. To the right, a musician plays a dizi, a transverse flute. The musicians are dressed in dark clothing. The stage is covered in an orange carpet, and the walls are white with large windows and dark window shades halfway drawn. The musicians play their instruments continuously throughout the video. The audience watches attentively.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the sound of the flute louder than the sound of the piano?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes, the flute is louder.",
+ "C. No, the piano is louder.",
+ "D. No, both are equally loud."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the flute begin playing in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. At the end of the video.",
+ "C. Throughout the entire video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many types of instruments are there in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Two types.",
+ "B. Three types.",
+ "C. Four types.",
+ "D. One type."
+ ]
+ }
+ },
+ "TGpCRcsF": {
+ "video_id": "TGpCRcsF",
+ "video_duration": "145s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "There are multiple women wearing matching outfits consisting of a purple top with floral embroidery, a gold fringe at the bottom, and a gold skirt. The skirt has several layers of gold fabric, giving it a full look, and the women also wear gold arm bands. One woman in particular also wears large, golden, angled shoulder pads and a necklace with a circular design. The women appear to be on a stage with a large screen as a backdrop. They are dancing in unison and at times the woman in the shoulder pads is in the center, and at other times the women move together as a group across the stage. The backdrop on the screen shows various images, including a star, a bird, and a dragon. The lighting is bright and changes color throughout the video.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What prop did the lead singer drop at the front of the stage?",
+ "answer": "A",
+ "candidates": [
+ "A. I am not sure.",
+ "B. A golden fan.",
+ "C. A golden microphone.",
+ "D. A golden skirt."
+ ]
+ }
+ },
+ "JGxqrHjm": {
+ "video_id": "JGxqrHjm",
+ "video_duration": "55s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A group of about 10 people are on bicycles at the starting gate of a dirt track. They are wearing helmets, long-sleeved shirts, and long pants. The person in the middle of the group has the number 23 on their chest plate. A large black arch with the word \"KENDA\" is over the gate. The group is on a wooden platform and then the gates drop. The group then all travel down the dirt track which includes many hills and curves. There are more people viewing the event on the sidelines. Banners and flags surround the race track. Tents and cars are also visible. The group of bicycle riders rides over several dirt hills as the video ends.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "How did Contestant 23 surpass Contestant 11?",
+ "answer": "C",
+ "candidates": [
+ "A. On the second bend, overtakes Contestant 11 from the inside.",
+ "B. On the second bend, overtakes Contestant 11 from the outside.",
+ "C. On the first bend, overtakes Contestant 11 from the inside.",
+ "D. On the first bend, overtakes Contestant 11 from the outside."
+ ]
+ }
+ },
+ "rjsNCAdH": {
+ "video_id": "rjsNCAdH",
+ "video_duration": "129s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Fitness",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video displays the title \"IPF WORLD POWERLIFTING CHAMPIONSHIPS 2016\" on a black background. The scene then changes to an interior room. A woman wearing a dark blue shirt, dark pants and a black baseball cap and a man in a black shirt are standing next to a woman wearing a red and blue top, red shorts, a black weight belt, and a silver cross necklace. There is another woman in a white top and white shorts in the background. The woman in the red and blue top leaves the group. The scene shows her walking towards an area with a weightlifting rack, where two men squat down near the weight rack, and another man stands behind the woman. The video displays a caption saying \"USA, KIMBERLY WALFORD - SQUAT - ATTEMPT 1 - 167.5KG\". She places a barbell across her shoulders, squats down, and then returns to a standing position. The video then shows the same action, with a caption of \"USA, KIMBERLY WALFORD - SQUAT - ATTEMPT 2 - 177.5KG\". The video then shows her performing another squat with the caption \"USA, KIMBERLY WALFORD - SQUAT - ATTEMPT 3 - 185KG\" and is followed by a caption stating \"FAILURE DUE TO INSUFFICIENT DEPTH\". The woman then talks to the man behind the rack, then exits the area. The scene shifts to the woman performing a bench press, with the caption \"USA, KIMBERLY WALFORD - BENCH PRESS - ATTEMPT 1 - 105KG\" and is then followed by a caption stating \"FAILURE DUE TO FOOT MOVEMENT\". The woman exits the bench press area. She then returns to perform another bench press with the caption \"USA, KIMBERLY WALFORD - BENCH PRESS - ATTEMPT 2 - 105KG\". After the bench press, the woman exits the area. The scene changes to the woman performing a third bench press attempt, with the caption \"USA, KIMBERLY WALFORD - BENCH PRESS - ATTEMPT 3 - 110KG\". She then exits the bench press area. The scene changes to the woman performing a deadlift with the caption \"USA, KIMBERLY WALFORD - DEADLIFT - ATTEMPT 1 - 225KG\". 
She then drops the bar to the floor. The scene shifts to the woman performing a second deadlift with the caption \"USA, KIMBERLY WALFORD - DEADLIFT - ATTEMPT 2 - 235KG\", She then drops the bar to the floor. The video changes to the woman performing a final deadlift, with the caption \"USA, KIMBERLY WALFORD - DEADLIFT - ATTEMPT 3 - 243KG [WR]\". The woman drops the barbell and moves out of the lifting area. She is also seen talking to another man wearing a black shirt with the word \"STAFF\" on it.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What was the weightlifter's mood after the second round?",
+ "answer": "C",
+ "candidates": [
+ "A. Sad.",
+ "B. Disheartened.",
+ "C. Proud.",
+ "D. Disappointed."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What interaction did the competitors have with the man wearing a baseball cap at the stadium exit after the competition?",
+ "answer": "C",
+ "candidates": [
+ "A. A handshake.",
+ "B. A kiss.",
+ "C. A hug.",
+ "D. A high-five."
+ ]
+ }
+ },
+ "DJOXdzZH": {
+ "video_id": "DJOXdzZH",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A bald man wearing a light colored suit and yellow shirt plays a clarinet, standing in front of a black background with many small lights. He then walks toward the left and then to the back of a stage with three other men. One man is playing an orange guitar, another is playing a brown bass, and the third is playing a drum kit. The man in the suit sits on a stool facing away from the other men. Then the video focuses back to the man playing guitar as he is seated. Next, the video shows a close up of the man playing guitar with his fingers on the neck of the guitar. Then a close up of his hands playing the guitar is shown. Next, the video shows the man in the suit on a stool with the other three men behind him and then continues to show the close up of the guitar man and his hands as he plays the guitar. Next the video shows the bass player as he plays the instrument. The video then shows a man playing the drums in his kit. Finally, the video shows the three men playing instruments on the stage.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the man wearing a white suit in the video relative to the man wearing a striped suit?",
+ "answer": "B",
+ "candidates": [
+ "A. The man wearing a white suit is standing in front of the man wearing a striped suit.",
+ "B. The man wearing a white suit is standing to the left of the man wearing a striped suit.",
+ "C. The man wearing a white suit is standing to the right of the man wearing a striped suit.",
+ "D. The man wearing a white suit is standing behind the man wearing a striped suit."
+ ]
+ }
+ },
+ "OsXJMIJg": {
+ "video_id": "OsXJMIJg",
+ "video_duration": "134s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a series of motorcycle races. The motorcycles have different color combinations of the riders' suits and their helmets are also different colors. The riders are on a dirt track and are driving at high speeds. The first few races are during the daytime, and then a few races later it becomes nighttime. Later in the video, two different riders are being interviewed by people with microphones in what appears to be the garage area of the track. A separate shot shows some people holding and waving different colored flags in the stands at the races. One race, a rider appears to fall off of his motorcycle and is approached by several people. At the end of the video a logo appears with a subscribe button.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What technique does the motorcycle use to take turns in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Slowing down and driving normally.",
+ "B. Drifting.",
+ "C. Turning by using the clutch and throttle.",
+ "D. Leaning."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What were the emotions of the two people involved in the collision during the first match?",
+ "answer": "B",
+ "candidates": [
+ "A. Resentful.",
+ "B. Friendly.",
+ "C. Irritable.",
+ "D. Upset."
+ ]
+ }
+ },
+ "dUWwBOPW": {
+ "video_id": "dUWwBOPW",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A young woman and a man sit facing forward and play instruments. The woman, on the left, plays a black ukulele. The man, on the right, plays a soprano saxophone. A microphone on a stand sits between them.\u00a0 The woman wears a dark short-sleeved shirt and jean shorts, her dark hair pulled back in a ponytail. The man wears glasses and a dark t-shirt and light-colored shorts. They both appear focused on playing their instruments. Text displayed across the lower portion of the screen reads \"Tr\u00edade Music\" and \"Contatos: (21) 97256-4076 (zap)\". The background is a plain light-colored wall. The video consists of a static shot of them playing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Which instrument is the loudest in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Ukulele.",
+ "B. Bagpipe.",
+ "C. Clarinet.",
+ "D. Flute."
+ ]
+ }
+ },
+ "VnhtJDnH": {
+ "video_id": "VnhtJDnH",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three people are performing music in an outdoor setting at night. Two of the performers are standing and one is sitting. The person on the left is playing a violin and holding a bright orange object, and they are wearing a jacket that is mostly white and orange over a black outfit. The person on the right is playing a flute, and they are wearing a light pink patterned jacket over a black outfit with pink and white leggings. In the center, a person is playing an instrument that looks like a combination of a keyboard and a drum. This person is wearing a black and green jacket over a black outfit. They are all performing in front of a collection of small wooden tables. In front of the left-most performer is a white box with illustrations on it. There are several people watching the performance.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion is expressed by the music in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Melodious and graceful.",
+ "B. Majestic and heroic.",
+ "C. Energetic and inspiring.",
+ "D. Peaceful and gentle."
+ ]
+ }
+ },
+ "VSJaGsww": {
+ "video_id": "VSJaGsww",
+ "video_duration": "236s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with braided hair sits in a red and black gaming chair in front of a desk.\u00a0 A smaller video is overlaid on the screen showing another man in a white tank top.\u00a0 The man in the gaming chair holds a smartphone.\u00a0 He holds the phone up close to his face.\u00a0 He appears to be recording a video or voice note.\u00a0 He rubs his forehead and looks down at the phone.\u00a0 He turns the phone slightly, and a sticker on the back of the phone becomes visible.\u00a0\u00a0He sits up straighter and continues recording a message on the phone.\u00a0 He puts the phone down on the desk and leans back in his chair.\u00a0 He leans forward again, picks up the phone and continues recording a message. He puts on a pair of glasses and resumes recording. He continues speaking into the phone for a considerable duration. Finally, he puts down the phone and looks directly at the camera. The video cuts to a cartoon monkey's face and the text \"Subscribe!\" in black font on a white background. Another cut shows a white screen with a reddish-purple and dark blue abstract design before returning to the monkey image and \u201cSubscribe!\u201d text. The video concludes with the text \"HELP ME QUIT UPS!\" in black font on a white background.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the two people conversing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Family members.",
+ "B. Strangers.",
+ "C. Classmates.",
+ "D. Friends."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did you hear the sound of the chair being moved?",
+ "answer": "A",
+ "candidates": [
+ "A. No, I didn\u2019t hear it.",
+ "B. Yes, I heard it.",
+ "C. I\u2019m not sure."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How do the character's emotions change in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Curious, tense, terrified.",
+ "B. Curious and excited, amused, anxious.",
+ "C. Curious and excited, anxious, amused.",
+ "D. Happy, amused, anxious."
+ ]
+ }
+ },
+ "wfCeOHgY": {
+ "video_id": "wfCeOHgY",
+ "video_duration": "338s",
+ "duration": "4-6min",
+ "domain": "Games",
+ "sub_category": "Sports Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. The video then transitions to show two men standing near a glass doorway. One man wears a black shirt, and the other man wears a blue shirt. The man in the blue shirt then walks away from the camera towards a room containing a table and two computer monitors, while the man in the black shirt looks away from the camera. The man in the blue shirt sits in a black chair in front of one of the computer monitors, which shows a game of basketball being played. There is also a foosball table, along with some colorful chairs and carpet, in the background. The man in the blue shirt then looks directly at the camera and speaks. The video switches to a screen view of the basketball game being played on a computer. The video then shows the man in the blue shirt and the man in the black shirt on the top left, viewing the basketball game. The video switches back to a view of the men, where the man in the blue shirt speaks. The video switches again to a screen view of the basketball game, and the two men are still visible in the top left. The man in the blue shirt has an additional blue headset in. The video then shows the basketball game again, and switches back to the four men. The video switches again to the game, with the four men visible in the top left. The video then shows a zoomed view of the man in the blue shirt. Then the video shows all of the men again. The video switches to a screen view of the basketball game, and the four men are still visible in the top left. The video then switches to a screen view of the basketball game. The video switches to the four men and then switches to the basketball game again. The video then shows the four men laughing, and then a screen view of the basketball game. Then the video shows the four men and the basketball game again. Then a zoomed view of the man in the blue shirt. Then a screen view of a basketball game, and the four men in the top left. 
The video shows the basketball game and the four men and then ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the beginning of the video take place?",
+ "answer": "A",
+ "candidates": [
+ "A. An arcade.",
+ "B. A basketball court.",
+ "C. An office corridor.",
+ "D. A home theater."
+ ]
+ }
+ },
+ "QwwLHARV": {
+ "video_id": "QwwLHARV",
+ "video_duration": "480s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with the word \"Research\" written in a 3D metallic font against a gray background. Additional letters appear in front to complete the word \"AReResearch\" on the screen. The background is dark and blurry. Next the screen turns completely black and displays the text \"Watching the solar eclipse safely: chips-o-scope\". Then a man appears from the chest up with a gray background with horizontal blinds behind him. The man is wearing a long sleeved striped shirt with a zipper and small logo. The man is gesturing with his hands as he speaks. Next, there's a clip of an off-white table with many items on it including a red can, a tube of glue, scissors, tape, a hammer, drill, and other small items. A hand enters the screen and points to the items on the table as the voice from the previous clip speaks. The can is then shown alone, and two hands enter the frame, and are shown measuring and marking the can and a white circular piece of paper. The white paper is cut out with scissors, and a piece is glued onto another piece of paper. The hands are shown measuring with a ruler and making a cut into the side of the can. A large piece of gray material is measured, cut, and glued into the inside of the can. The can is then shown on the white table with pieces of material taped in place. Next the can is shown outside attached to a tripod, first at a distance and then up close. The video then closes on a black screen displaying text, first with the author's name and date, and then to ask the viewer to subscribe.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man do with the ruler at the beginning of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Measure the center point at the bottom of the bottle.",
+ "B. Use the ruler to measure the diameter of the bottle cap.",
+ "C. Measure the part of the bottle that needs to be cut off.",
+ "D. Use the ruler to measure the appropriate length of the cardboard."
+ ]
+ }
+ },
+ "FiCKkyUU": {
+ "video_id": "FiCKkyUU",
+ "video_duration": "46s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video starts with a wide shot of a landscape that includes a red cliff face in the foreground and a city in the background. Then there are multiple explosions shown in the red cliff face, with dust and smoke rising up. After this a large crane is shown with its bucket tilted down, filling the bucket with rocks. Then it is shown using the bucket to deposit the rocks into a large truck. Next a woman is shown in the driver's seat of the large truck. Finally, the large truck is shown driving into an indoor area and dumping its load of rocks.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the woman wearing glasses appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Inside an excavator.",
+ "B. Inside a tractor.",
+ "C. Inside a dump truck.",
+ "D. Inside a processing plant."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, in which direction is the excavator in relation to the rock truck?",
+ "answer": "B",
+ "candidates": [
+ "A. Behind.",
+ "B. The right side.",
+ "C. The left side.",
+ "D. Ahead."
+ ]
+ }
+ },
+ "qJMpcPcV": {
+ "video_id": "qJMpcPcV",
+ "video_duration": "94s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person\u2019s hand, wearing a gold ring, presses the button on an elevator. The hand and arm retreat, followed by a leg and foot that step out of the elevator. The elevator door closes. The elevator door opens again and a person in a grey suit walks out of the elevator. They turn to face the elevator, which now has a person in a dark coat exiting it. The camera switches to an outside view from a window, where a large building and an airplane can be seen. The camera switches again to an interior shot of a man in a white shirt and striped tie sitting on a couch. He is blurred by a divider. He moves over to a table where he opens a brown bottle. He turns to talk to another man sitting across from him. The camera switches between the two men as they talk. The man in a white shirt and tie takes a drink from his brown bottle. The man in a dark coat touches his face. The camera switches back to the man in the white shirt and tie. The man in the white shirt takes another drink from his brown bottle. The camera switches back to the man in a dark coat. The camera switches between the two men as they talk. A view through the divider shows the two men. The view changes to the man in the dark coat using a laptop. The camera switches back to the two men as they talk.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the act of drinking beer appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. One.",
+ "D. Two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why do men post the information they acquire on Twitter?",
+ "answer": "B",
+ "candidates": [
+ "A. Because the bribe giver and the president have a close relationship.",
+ "B. Because the bribe giver and Putin have a close relationship.",
+ "C. Because the president has a close relationship with Putin.",
+ "D. Because the president's lawyer has a close relationship with Putin."
+ ]
+ }
+ },
+ "bCTYQnTZ": {
+ "video_id": "bCTYQnTZ",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman and a man sit on a couch, playing musical instruments. The woman, on the left, plays a clarinet. She wears a light reddish-pink shirt with small darker pink flowers or shapes on it, and dark pants. Her dark hair is pulled back, with some of it in a topknot or bun on the top of her head. The man, on the right, plays a ukulele. He wears a short-sleeved collared shirt with a camouflage pattern in muted tones, a white sleeveless undershirt, and dark pants. He wears rectangular glasses.\u00a0Behind them, several items are visible on the back of the couch:\u00a0 Two large vinyl record album covers propped against the wall, with a third leaning against the center one;\u00a0 two pillows, one a solid medium red, and one a lighter orangey-red; a partially visible yellow object or surface to the left; and a light-colored wall. The image has a slightly faded, vintage filter, and a darkened vignette around the edges. The two people play their instruments continuously for the duration of the video, looking straight ahead. They remain largely still, with some small movements of their heads and hands as they play.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the clarinet on the left more rhythmic than the ukulele on the right?",
+ "answer": "D",
+ "candidates": [
+ "A. No, they have the same rhythm.",
+ "B. I am not sure.",
+ "C. No, the ukulele is more rhythmic.",
+ "D. Yes, the clarinet is more rhythmic."
+ ]
+ }
+ },
+ "kuypgHeB": {
+ "video_id": "kuypgHeB",
+ "video_duration": "31s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man in a wheelchair sits at a bar, speaking to the bartender, who holds a white cloth. The scene changes to a fancy restaurant where a man and woman are eating at a table. The man looks out the window at a neon sign that reads \"The Drunken Clam.\" The scene changes to a bar, where a man leads a bull on a rope. Patrons at the bar watch the bull. The bull walks along the bar and flips a man in the air. The bull charges through the bar, smashing tables, chairs, and bottles. The bull crashes through the bar, knocking over a television, cash register, and the rest of the bar. The bull smashes through a wall and runs out of the bar. The bull runs through the bar again, crashing through more tables and smashing through another wall.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occurred in the interior of the tavern as shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. From dilapidated to clean and tidy.",
+ "B. From empty and barren to abundant.",
+ "C. From abundant to empty and barren.",
+ "D. From clean and tidy to dilapidated and messy."
+ ]
+ }
+ },
+ "jXqPgQSL": {
+ "video_id": "jXqPgQSL",
+ "video_duration": "17s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A group of snowboarders and skiers stands at the bottom of a snowy slope. A snowboarder approaches a jump, launches off it, performs a trick, and lands. Another snowboarder approaches a different jump, slides over it, and continues down the slope. The next snowboarder goes over a rail and then rides down. Another snowboarder jumps over a different feature, does a trick, and lands. A different snowboarder slides down a slope, approaches a jump, and performs a backflip before landing. Another snowboarder rides quickly down a slope. A different snowboarder rides down a slope toward a jump and performs a trick before landing. A snowboarder rides down a slope in snowy conditions, heading towards a crowd of onlookers in the distance.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What are the emotional changes experienced by the man wearing a black snowsuit and a white hat in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. From tense to disappointed.",
+ "B. From tense to excited.",
+ "C. From calm to relaxed.",
+ "D. Always calm."
+ ]
+ }
+ },
+ "dzFzCRyR": {
+ "video_id": "dzFzCRyR",
+ "video_duration": "242s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man sits at a wooden table before a large burger. The burger is topped with a small cherry tomato on a skewer, a pickle slice, and a tall pile of crispy onion rings. Under the onion rings, there's a thick beef patty, melted cheese, and what looks like coleslaw or purple cabbage. Lettuce and tomato peek out from under the patty, sitting on the bottom bun. A small metal cup of white sauce sits on the tray next to the burger. A beverage in a tall plastic cup sits to the right, and another side dish, perhaps coleslaw, sits to the left. The background is yellow and red, with a large diagram of a cow showing different cuts of meat. The man removes the tomato, pickle, and skewer, adding them to a small container with lettuce and tomato next to the white sauce. He begins to compress the burger, then takes a large bite. He continues to eat the burger, occasionally taking sips of his drink.\u00a0After eating a portion of the burger, he separates the top bun and onion rings from the rest of the burger and sets them aside.\u00a0 He picks up the meat and cheese, now somewhat compressed and saucy, and takes bites from it. He displays the compressed burger patty and filling to the camera, showing its cross-section. He then continues to eat the remaining patty, taking occasional bites of the items on the tray next to the white sauce.\u00a0He finishes the patty and begins eating the top bun with the onion rings, along with the side salad from earlier. The video ends with him gesturing towards the camera.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man drink a beverage in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Once.",
+ "B. Three times.",
+ "C. Four times.",
+ "D. Twice."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "How much does the food cost that the man eats in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. $14.",
+ "B. $16.",
+ "C. $13.",
+ "D. $15."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where is the man located in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Inside KFC.",
+ "B. Inside McDonald's.",
+ "C. Inside the Burger King store.",
+ "D. Inside the Roll & Smoke store."
+ ]
+ }
+ },
+ "RLzKnJzC": {
+ "video_id": "RLzKnJzC",
+ "video_duration": "118s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A video opens with two people, a man and a woman, sitting at a table with a laptop in front of them. Both are smiling and wearing dark colored shirts. The woman is wearing a black shirt with the words \"SPACE FOR HUMANITY\". The man says \"Let's do it\". The screen switches to show the two individuals on a video call, and a phone ringing sound effect is played. The man says, \u201cHi Keisha, how\u2019s it going?\u201d On the left side of the split screen is a woman with dark hair and a black strapless top, standing on a porch. The woman on the left responds, \u201cI\u2019m good, how are you?\u201d. The man on the video call says, \u201cGood, this is Rachel\u201d, and the woman next to him waves and says \u201cHi\u201d. The woman on the left says \u201cHi Rachel\u201d. She continues, \u201cWe know that one of your goals has always been to go to space.\" The screen then shows the man and woman at the table again. The woman on the left can be heard stating that she always dreamed of life beyond Earth. The screen then transitions to show a man with white hair, who says \"I think dreams always can come true, maybe not in this lifetime, but it will come true eventually.\" The video then returns to the split screen with the three individuals. The woman on the left says \"I think dreams always can come true, maybe not in this lifetime, but it will come true eventually.\" The man on the video call says, \u201cKeisha, I just want to thank you for entering and being a finalist.\u201d The woman on the left says \u201cKeisha, I just want to thank you for entering and being a finalist.\u201d The man at the table is then shown, saying that before they launched Omaze, they did a whiteboarding session. The split screen reappears, and the man is heard saying, \u201cand wrote down \u2018this would be a success if\u2019, and one of the first things we wrote is that we\u2019d make it so that\u201d and the woman on the left says, \u201cand wrote down \u2018this would be a success if\u2019, and one of the first things we wrote is that we\u2019d make it so that\u201d. The man at the table continues, \u201cany person in the world would have a chance to go to space, so the fact that you are a finalist\u2026\u201d. The woman on the left says \u201cany person in the world would have a chance to go to space, so the fact that you are a finalist\u2026\u201d The man at the table says, \u201cit\u2019s also a dream come true for us\u201d. The screen then displays the man and woman at the table again, with a man with white hair sitting behind them at the table. The man at the table says \u201cAnd have you thought about all who you would take with you if you won?\u201d The woman on the left says, \u201cActually, my daughter wants to come with me, she says, \u2018Don\u2019t you dare take anybody else\u2019\u201d. The man with white hair laughs and says, \u201cActually, my daughter wants to come with me, she says, \u2018Don\u2019t you dare take anybody else\u2019\u201d. The video then shows the two individuals at the table again. The man at the table says \u201cSo envision this, you\u2019re standing there with your daughter and you\u2019re about to go.\u201d Then the video transitions to show the man with white hair at his own table. The video returns to the man and woman at the first table as the man says \u201cAnd Richard is standing there, what would you say to Richard?\u201d The screen transitions to the split screen with the woman on the left, who says \"I would tell him I love him, thank you.\" The man with white hair laughs. The video transitions back to the video call with the man and the woman. The man says, \u201cHey Keisha, I think we\u2019re having challenges here on the internet side, let us log off and call you back.\u201d The woman next to him also nods. The woman on the left says \u201cSure, no problem\u201d. The video cuts to the man with white hair who says \u201cWe\u2019re off to the races now, let\u2019s go!\u201d. The man, woman and the man with white hair are then shown together, standing, at a table with a laptop, as if the camera is behind them. The camera then cuts to show the three individuals walking outside past banana trees. The video then shows the woman on the porch again, who screams, while holding her hands up to her face. The man with white hair says \"Come here\", and then hugs her as they all laugh. The woman, who is wearing a black strapless top and tan pants, receives a yellow box with a black bow. Inside is a blue space suit. The man with white hair says \u201cYou\u2019re going to be an astronaut!\u201d The man with white hair takes out the suit and places it on her. The man with white hair is shown again, as the phone is ringing. He says \u201cHi Ana, you and your mother are going to space!\u201d The video cuts to show the man holding his phone while video chatting with the woman in the black top. He says \u201cAna, we won!\u201d. The screen switches to show the three individuals again. The woman in the black strapless top is seen on the man\u2019s phone. The man at the table says \u201cSo you would be the first astronaut from Antigua, of course.\u201d The woman with dark hair says \u201cI\u2019m making history.\u201d The woman in the black strapless top is given a yellow box and then opens it to reveal a space suit. The woman in the black top holds up the suit, the man with white hair holding one side.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the occupation of the person who brings gifts later in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A celebrity.",
+ "B. A news reporter.",
+ "C. A friend of the black woman.",
+ "D. An event staff member."
+ ]
+ }
+ },
+ "qhhHpwUl": {
+ "video_id": "qhhHpwUl",
+ "video_duration": "34s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A video shows two people facing each other. The next scene shows two animated characters sitting across from each other at a round table. The next scene shows a close-up on the characters sitting at the table. The next scene shows one of the characters on a screen with an animated character next to it at a control panel. The next scene shows two different animated characters at the same control panel. The next scene shows another animated character at the control panel. The next scene shows the two people from the beginning of the video sitting across from each other at a table. The next scene shows one of the animated characters from the beginning of the video sitting at the table. Then one of the animated characters from the beginning of the video is seen sitting at the table with the other animated character in the background. The next scene shows two of the animated characters playing musical instruments.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many people are present in the scene after the first change?",
+ "answer": "B",
+ "candidates": [
+ "A. Six.",
+ "B. Five.",
+ "C. Three.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does wearing a beard and gray clothes affect one's mood?",
+ "answer": "C",
+ "candidates": [
+ "A. From frustration to anger.",
+ "B. From anger to happiness.",
+ "C. From silence to anger.",
+ "D. From happiness to anger."
+ ]
+ }
+ },
+ "zIkBMIkw": {
+ "video_id": "zIkBMIkw",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three men are playing acoustic guitars on a patio overlooking a lake. The patio has a glass railing, and green grass is visible beyond it. In the background, a large body of water stretches across the scene, with mountains visible in the distance. A cluster of buildings, including a prominent white structure, stands on the far shore. The men are sitting on chairs. The guitarist on the left has dark hair tied back in a ponytail and wears a salmon pink shirt. He sits on a tan chair and plays a light-colored acoustic guitar. The center guitarist wears a light-colored, long-sleeved shirt and dark pants. He plays a dark-colored acoustic guitar. The guitarist on the right wears a plaid, long-sleeved shirt and dark pants. He plays a light brown acoustic guitar. Small amplifiers and microphones are placed on the ground near the musicians, connected by cables. The guitarist on the left has a foot pedal on a wooden stand. The angle shifts at 0:05 to focus on the two guitarists on the right. It returns to the wider view at 0:26. The angle shifts again to focus on the two guitarists on the right at 0:52, and back to the wide shot at 1:13. At 1:38 the angle shifts once more to focus on the two guitarists on the right, then returns to the wide view at 2:00.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What is the main instrument in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. All are main instruments.",
+ "B. The acoustic guitar on the right.",
+ "C. The acoustic guitar on the left.",
+ "D. The electric guitar in the middle."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Which bass guitar is producing the low sound in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The guitars on the left and right.",
+ "B. The guitar in the middle.",
+ "C. The guitar on the right.",
+ "D. The guitar on the left."
+ ]
+ }
+ },
+ "zmrcvcBP": {
+ "video_id": "zmrcvcBP",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man sits playing an accordion while wearing a black face mask. Another man stands and plays a bassoon. They are both on a stage in a large auditorium or performance hall. The stage is made of light-colored wood and curves slightly at the front. Behind the accordion player, a grand piano sits on the stage with the lid open. A padded bench sits in front of the piano. A music stand is placed in front of the bassoon player. Two microphones on stands are positioned between the musicians. The camera zooms out to show the rest of the auditorium. Rows of empty seats face the stage. The walls of the auditorium are wood and beige. Steps lead down to the seating area on either side of the stage. A small purple light shines near the top left corner of the frame. The camera zooms back in on the two musicians, who continue playing. The camera remains zoomed in on the musicians for the remainder of the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Which instrument plays alongside the bassoon?",
+ "answer": "B",
+ "candidates": [
+ "A. A piano.",
+ "B. An accordion.",
+ "C. A bassoon.",
+ "D. A trumpet."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When is the accordion used for accompaniment in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Throughout the entire video.",
+ "C. At the end of the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "kuhPQczc": {
+ "video_id": "kuhPQczc",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A young woman with long brown hair, wearing a black tank top, jean shorts, and over-the-ear headphones, sits at an electronic drum set in a grassy field and plays. The drum set is silver and red. The field is dotted with yellow wildflowers. Trees line the edge of the field in the background. The camera angle shifts to an aerial view looking down on the drummer and the drum set. The camera moves further out, showing a wider view of the field. The woman is small in the frame, next to a dirt path. The camera follows the path, going past a person walking towards the camera, dressed in a similar outfit to the drummer. Buildings can be seen in the distance beyond the field. The aerial view focuses back on the woman playing the drums. The view returns to the original side angle, showing her drumming close up. The camera switches back to the aerial perspective, following the path in the opposite direction, away from the buildings and past the drummer. The camera angle then shows the person in jean shorts and black top walking away from the camera down the path. It switches to a ground-level angle showing this person's black and white sneakers as they walk, moving closer to the camera. The camera flies over the person walking along a dirt path toward a forested area. It follows them as they walk toward the trees, and captures them running along the path once they are in the trees. The camera then follows them from behind as they walk along a path by the shore of a large body of water. The person is still walking along this path as the camera goes back to the field with the electronic drum set. The camera pans back to the person playing the drums and continues to film them playing from an aerial view. It rotates around to view from different angles as the woman continues playing. The camera cuts back and forth between the aerial view and a close-up side angle as she plays.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Which part of the drum set is used most frequently in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. First tom-tom drum.",
+ "B. Second tom-tom drum.",
+ "C. Bass drum.",
+ "D. Snare drum."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Is the drum always playing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion does the song in the video convey?",
+ "answer": "B",
+ "candidates": [
+ "A. Somber and gloomy.",
+ "B. Energetic and aggressive.",
+ "C. Gentle and calming.",
+ "D. Violent and blood-soaked."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the high note first appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. At the end of the video.",
+ "C. In the middle of the video.",
+ "D. Throughout the entire video."
+ ]
+ }
+ },
+ "UJBUDyvK": {
+ "video_id": "UJBUDyvK",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two people are visible. One person is wearing an orange t-shirt with \"TEXAS\" in white letters and a black baseball cap with a white logo on the front. The other person is wearing a red t-shirt with a yellow lightning bolt design and a navy blue baseball cap with a red logo on the front. They are sitting in front of a background with clothing hanging on racks and packaged items on a shelf. At times, the person in the red shirt is holding a white lacrosse head with colorful strings. At one point the person in the orange shirt holds the lacrosse head. A close up of a white lacrosse head with colorful strings is shown. At one point a flame from a torch is shown near an orange string. The person in the orange shirt is shown walking outdoors and swinging a lacrosse stick.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man wearing a black hat swing the club completely in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Four.",
+ "B. Twice.",
+ "C. Three.",
+ "D. Five."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man wearing a black hat do to the net bag in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. He tested the net bag's strength.",
+ "B. He untied the net bag's strings.",
+ "C. He removed the net bag from the pole.",
+ "D. He used a lighter to burn the net bag's strings."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the person wearing a black hat in the video doing?",
+ "answer": "A",
+ "candidates": [
+ "A. Makes and sells golf clubs.",
+ "B. Tests golf clubs.",
+ "C. Is a lacrosse player.",
+ "D. Designs helmets."
+ ]
+ }
+ },
+ "HdTNFvou": {
+ "video_id": "HdTNFvou",
+ "video_duration": "47s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A hand holds a circular object with red, green, and blue lights on its edge. The object is then placed flat on a wooden floor. The camera view switches to an overhead shot of the circular object, revealing wires connected to it. A sequence of images shows a pair of hands securing a thin, LED light strip around a round, wooden frame, and then sandwiching a mirror between two wooden frames. Finally, a person is shown using a screwdriver to attach the two wooden frames to each other.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many screws were installed on the item in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Six.",
+ "B. Eight.",
+ "C. Nine.",
+ "D. Seven."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man press the LED strip into the slot in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 14.",
+ "B. 15.",
+ "C. 16.",
+ "D. 13."
+ ]
+ }
+ },
+ "URxOzqgJ": {
+ "video_id": "URxOzqgJ",
+ "video_duration": "107s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Editorials",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person is in the center of the frame, against a white background, and they are speaking. They are wearing a black button-up shirt. The video shows many short, continuous segments with this same scene. At one point there is a text overlay that reads \"So What Happened?\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the mood of the boy in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Low-spirited.",
+ "B. Depressed.",
+ "C. Thoughtful.",
+ "D. Furious."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What are the potential consequences when interest rates rise, making credit less accessible?",
+ "answer": "D",
+ "candidates": [
+ "A. A financial crisis occurs.",
+ "B. Deposit interest rates decrease.",
+ "C. Borrowing costs increase.",
+ "D. House prices drop significantly."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why might a bank face discrimination lawsuits?",
+ "answer": "C",
+ "candidates": [
+ "A. Because the bank's staff are not white.",
+ "B. Because the bank lacks sufficient funds.",
+ "C. Because the bank does not provide loans to enough low-income individuals.",
+ "D. Because the bank refuses to lend to Black individuals."
+ ]
+ }
+ },
+ "kwbImNRt": {
+ "video_id": "kwbImNRt",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are in the video. The person on the left is wearing a grey short sleeved shirt with the text \"NUS NURSING\" on it, and is playing a long, brown stringed instrument. The person on the right is wearing a grey short sleeved shirt with the text \"CALIFORNI\" on it, and is playing a tan acoustic guitar. Both people are wearing glasses. In between the two people there is a microphone on a stand. The wall behind them has various rectangular patterns.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What technique is used in the guitar playing in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Using a pick for strumming.",
+ "B. Strumming.",
+ "C. Fingerpicking.",
+ "D. Plucking strings."
+ ]
+ }
+ },
+ "ROEyKoiF": {
+ "video_id": "ROEyKoiF",
+ "video_duration": "199s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black top and tan pants stands in front of a yellow wall. Three terracotta pots are stacked to the left of the person, and plants in terracotta pots are on either side of the person. To the person's right is a woven stool and a pink bucket. The person reaches for the top of the pot stack, removes the lid, and looks into the pot. The person places newspaper in the top pot, and then removes the pot from the stack. The person puts newspaper in the middle pot, places the top pot back on the stack, puts leaves from a pink bucket into the middle pot, and places the top pot on the middle pot. The person places soil in the top pot, and then adds food scraps. The person demonstrates how to remove and stack the top pot and bottom pot, and then reassembles the three pot stack. The person then places the top pot in the middle of the stack, removes the middle pot, and places it on the bottom. The person then removes the top pot and places it in the middle, and removes the pot on top. The person adds paper to the top pot, and places the lid on top. The video includes text overlay, such as \u201cStep 2 Setting up the composter\u201d, \u201cStep 3 How to start composting\u201d, \u201cAvoid putting WHOLE FRUITS AND VEGETABLES\u201d, \u201cStep 4 Repeat and maintain\u201d, and \u201cStep 5 Change when filled up\u201d.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the weather like in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Sunny.",
+ "B. Snowy.",
+ "C. Overcast.",
+ "D. Rainy."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Has the position of Unit C in the video changed?",
+ "answer": "A",
+ "candidates": [
+ "A. No change.",
+ "B. I don't know.",
+ "C. Yes."
+ ]
+ }
+ },
+ "pUxsOmrs": {
+ "video_id": "pUxsOmrs",
+ "video_duration": "338s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with dark, voluminous, curly hair, wearing a gray hoodie with the word \"RAIDERS\" printed in block letters, appears in front of a white background. At the beginning of the video, the person moves their head from side to side and fluffs their hair with their hands, then shows a small white bottle with a red cap labeled \"CHI Silk Infusion\" and then a red spray bottle labeled \"CHI Iron Guard\". The person then rubs their hands together and applies a clear liquid to their hair, working the liquid in with their fingers. The person then divides their hair, putting some of it up with two hair clips, leaving a large section down. The person then uses a pink comb, then a black and pink brush, repeatedly moving the hair downwards with each tool. The person talks, moves the brush through their hair, and then shows the final result of straight hair. The person moves their hands and hair around, at one point using their hands to comb through their hair and hold sections to compare the straight hair to the remaining curly hair, before moving it all to show the result of the fully straightened hair. The person then fluffs the hair again before applying some clear product to the edges and face with the ends of their fingers. Finally, the person moves their head and body and smiles for the camera.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the woman do with the small white bottle at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Apply cream to her neck.",
+ "B. Apply cream to her hands.",
+ "C. Apply cream to her cheeks.",
+ "D. Apply cream to her hair."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many clips did the woman use to hold her hair in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. One.",
+ "D. Two."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the woman's emotion later in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Feeling appreciative.",
+ "B. Feeling excited.",
+ "C. Feeling emotional.",
+ "D. Feeling sad."
+ ]
+ }
+ },
+ "YpTIwQLQ": {
+ "video_id": "YpTIwQLQ",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Two people wearing uniforms are facing each other on a yellow mat. One person is wearing a white uniform with a black belt, the other is wearing a blue uniform. The two people grapple with each other. The person in the white uniform lifts the person in the blue uniform and flips them onto the yellow mat. The person in the white uniform ends up on top, and the two people continue to grapple on the mat. The person in white then stands up while the person in blue lies on the mat. The person in the white uniform then faces off again with the person in blue as the person in blue also stands up. A similar sequence of grappling, lifting, and flipping occurs between the two people in the uniforms. The person in the white uniform again flips the person in the blue uniform onto the yellow mat, ending up on top of the person in the blue uniform. The person in the white uniform rises and steps away from the person in the blue uniform, ending the video.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the positional relationship between the two contestants at the moment they fell to the ground?",
+ "answer": "D",
+ "candidates": [
+ "A. Both contestants remained standing.",
+ "B. Both contestants fell to the ground at the same time.",
+ "C. The contestant wearing a white judo gi is below, while the contestant wearing a blue judo gi is on top.",
+ "D. The contestant wearing a blue judo gi is below, while the contestant wearing a white judo gi is on top."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the competitor wearing a white judo uniform kick the opponent?",
+ "answer": "C",
+ "candidates": [
+ "A. One.",
+ "B. Zero.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "TOgjgzHh": {
+ "video_id": "TOgjgzHh",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A young girl with shoulder-length dark hair sits at a drum set in a room with blue walls. She wears a black and white long-sleeved shirt and black pants with white stripes down the sides. She holds drumsticks and hits the drums and cymbals.\u00a0 Two smaller children sit on stools behind her, also with drumsticks and smaller drum sets. A man with short dark hair stands to the left and slightly behind the first girl, occasionally gesturing as if instructing her. He also has drumsticks and taps on the cymbals of the larger drum set. A second man, also with short dark hair, stands next to the first girl, facing the camera. He wears a black shirt with a silver dragon design, black pants, and red and white sneakers. He counts out loud with his fingers and taps his foot, sometimes mimicking playing drums, as if demonstrating for the girl.\u00a0 The room has a partially visible banner with some Chinese characters and what looks like the English translation \"Golden Age Cultural and Art Center\". A frame with several smaller pictures hangs on the wall, possibly depicting other students and instruments. The floor is covered with blue carpeting. The girl continues to play the drums for the duration of the video, occasionally looking towards the two men. The other children also play their smaller drums, occasionally looking at each other or the instructor.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Which type of drum is used most frequently in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Snare drum.",
+ "B. Bass drum.",
+ "C. Tom-tom drum.",
+ "D. Hi-hat cymbal."
+ ]
+ }
+ },
+ "VdCVCUsP": {
+ "video_id": "VdCVCUsP",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows two different women in separate locations. One woman is in a kitchen, wearing a patterned orange and blue garment with a headscarf. She holds a crumpled white object in her hands, touches it, and then brings her hands to her face and appears to wipe her eyes. The second woman is in a room with light-colored walls, wearing a dark top and beige-colored garment with blue trim. She faces the camera and appears to be speaking. The video alternates between these two women. The woman in the kitchen uses her hands to cover her face. The video also shows a 'like' icon and a 'subscribe' icon on the screen.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the two women in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Friendship.",
+ "B. Best friend relationship.",
+ "C. Mother-daughter relationship.",
+ "D. Sisterly relationship."
+ ]
+ }
+ },
+ "xqbzEYyz": {
+ "video_id": "xqbzEYyz",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in an orange sweater sits at a table while a man in a suit and glasses stands next to him, looking at a clipboard and writing on it. The man in the orange sweater stands up and walks alongside the man in the suit. The video cuts to a courtroom where a police officer escorts the man in the orange sweater to a table. A female judge sits at the judge's bench, with an American flag and a state flag behind her, a digital timer to her left displaying \u201c10:24:40\u201d, and a nameplate that reads \"Judge Crocker\". The police officer stands near the defendant's table, listening as the judge speaks. The video focuses on the judge as she continues speaking, and then shows the police officer again. The video focuses back on the judge, then back to the officer. The judge talks for an extended period. The words \"Clerk said 'No problem'\" appear overlaid on the video of the courtroom. The judge continues talking, looking at papers in front of her. The video then abruptly changes to darker lighting and lower contrast as the judge looks directly at the camera and points.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the profession of the lady sitting in the middle in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A guard.",
+ "B. A court clerk.",
+ "C. A lawyer.",
+ "D. A judge."
+ ]
+ }
+ },
+ "VnMgadZz": {
+ "video_id": "VnMgadZz",
+ "video_duration": "121s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a white helmet with a red visor, red gloves, and a white and red race suit is seen inside a red car. The person brings their hands up towards their face then puts them down. The scene switches to show a man adjusting the car in the driver's seat, the person's face is obscured by a helmet. The camera switches to the back of a red car surrounded by other cars and people. Next a man in an orange suit stands holding a sign that says \"30 seconds\". The scene switches to show a number of racing cars on a racetrack, then the cars begin to move. A red car is visible racing on a track, other cars are also on the track. The scene changes to show 4 cars racing on the track, then 3 cars racing on the track, then 4 cars racing on the track, then 4 cars racing on the track again. The scene changes to a leaderboard on the side of the screen with a race track in the background. The leaderboard is visible as the camera pans out and the cars continue to race. The camera continues to follow the cars as they move around the track. The leaderboard is visible on the left side of the screen. The view changes again, to cars moving around the track and the leaderboard is still visible. More cars are visible on the track. A red car drives onto a finishing zone. The same red car is seen driving back from the track toward the gate, and the camera then shows the same red car again and a person in it. Two people fist-bump near the red car. Then, a man in a blue shirt shakes hands with another man in a race suit. Three people stand on a podium holding trophies. Then there are shots of the same three people, and then one man in an orange and white suit opens a bottle and sprays the liquid around at other people. The man in the suit then drinks from the bottle. The man then takes off his hat and talks with people.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How do the emotions in the video change?",
+ "answer": "D",
+ "candidates": [
+ "A. From excitement to tension.",
+ "B. From tension to sadness.",
+ "C. From sadness to excitement.",
+ "D. From tension to excitement."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Who won first place in the competition?",
+ "answer": "C",
+ "candidates": [
+ "A. A driver wearing black.",
+ "B. A driver wearing white.",
+ "C. A driver wearing red.",
+ "D. A driver wearing blue."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "With what time advantage did the driver in red win the race over the second-place finisher?",
+ "answer": "B",
+ "candidates": [
+ "A. Approximately 9 seconds.",
+ "B. Approximately 5 seconds.",
+ "C. Approximately 7 seconds.",
+ "D. Approximately 2 seconds."
+ ]
+ }
+ },
+ "hPTJElGk": {
+ "video_id": "hPTJElGk",
+ "video_duration": "64s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A short video clip with digital distortion effects plays, followed by a purple smoky animation revealing a hairless cat logo and the name \"Molly Schuyler\". The scene transitions to a woman with a shaved head, sunglasses, and a black dress standing outside a restaurant. She gestures with her hands while looking at the camera. A US flag hangs from the wooden pergola above her, and several small metal tables and chairs are arranged on the sidewalk. The restaurant's brick facade and signage are partially visible, and a person wearing a face mask walks past in the background. The address \"1 Center Square, Hanover, Pennsylvania 17331, (717) 634-5090\" appears on screen. The woman continues gesturing as if in conversation.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "How much would a 40-inch pizza likely weigh?",
+ "answer": "C",
+ "candidates": [
+ "A. 16 pounds.",
+ "B. 8 pounds.",
+ "C. 10 pounds.",
+ "D. 14 pounds."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What happens when a woman participates in a food challenge?",
+ "answer": "A",
+ "candidates": [
+ "A. She will eat all the food on the plate.",
+ "B. She does not participate in the challenge.",
+ "C. She will give up halfway.",
+ "D. She will have someone else finish eating for her."
+ ]
+ }
+ },
+ "HKnCFlXB": {
+ "video_id": "HKnCFlXB",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Six children are playing musical instruments in a room. A girl in a turquoise shirt plays the piano. Two girls are seated at a large wooden table, one playing a recorder and the other a clarinet. A boy plays the viola, another boy plays the cello, and a girl plays the violin. They all appear to be reading from sheet music. The room's decor includes two framed artworks on the wall, a deer head mounted on the table, and a lamp. Another artwork, resembling a globe or planet, is visible further back in the room. The children continue playing their instruments throughout the duration of the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many instruments appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Five.",
+ "B. Seven.",
+ "C. Six.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "What type of ensemble is featured in the music of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Solo.",
+ "B. Quartet.",
+ "C. Duet.",
+ "D. Sextet."
+ ]
+ }
+ },
+ "IaLHTnTl": {
+ "video_id": "IaLHTnTl",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A young woman sits on a chair in a grassy backyard, playing an acoustic guitar. She wears a sleeveless top with a small, circular pattern and a brown skirt. Her hair is braided to the side. She watches sheet music on a stand in front of her. Beside her, a younger girl stands barefoot, playing a violin. She wears a reddish-pink, patterned dress and light-colored shorts. She also looks at the sheet music on the stand.\u00a0A small, golden-colored dog sits near a yellow plastic wading pool in the background. Trees and a section of fence are also visible in the background. Both girls play their instruments throughout the video's duration. The girl playing the guitar occasionally sings along as she plays.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the instrument on the left more rhythmic than the instrument on the right?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes, it is.",
+ "C. No, it is not."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What technique does the guitarist use in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Finger-picking.",
+ "B. Strumming.",
+ "C. Plucking strings.",
+ "D. Strumming with a plectrum."
+ ]
+ }
+ },
+ "zSqwqmCu": {
+ "video_id": "zSqwqmCu",
+ "video_duration": "120s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video is divided into four frames. In the top left frame, a man with short brown hair and sunglasses is seated and playing an acoustic guitar. He is wearing a light-colored hoodie and dark pants. There is a rug on the floor and a fireplace in the background. In the top right frame, a man with curly brown hair and sunglasses is seated and playing an acoustic guitar. He is wearing a black shirt and gray pants. There is a painting of red flowers in the background. In the bottom left frame, a man with short brown hair and sunglasses is seated on a green sofa and playing a red bass guitar. He is wearing a dark jacket and blue jeans. In the bottom right frame, a woman with long brown hair, a black hat, and sunglasses is facing forward. She appears to be singing. These four frames alternate throughout the video. During the video, the woman appears in the center frame, and she is also seen singing. The top left frame features a man singing, instead of playing the guitar at one point. The bottom right frame features the woman singing. The bottom right frame changes to a man with brown hair playing a keyboard. Another man is shown with the melodica at the top right frame. The man from the bottom right frame is seen with a melodica. The woman from the bottom right frame is seen playing a brass instrument. The four frames are shown all together again, with all 4 people singing and playing instruments. Then, the four frames change to a split view of the left two frames, with only those two people. Then, the four frames change to show only the bottom two frames. The frames are alternated several times. The video ends with a black screen. The text \u201cSEE MORE COVERS HERE\u201d appears in white on the black screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What type of music is being played in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Pop music.",
+ "B. Classical music.",
+ "C. Folk music.",
+ "D. Hip-hop music."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the sound of the melodica occur in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. Three.",
+ "C. Two.",
+ "D. Four."
+ ]
+ }
+ },
+ "mqbRcXdQ": {
+ "video_id": "mqbRcXdQ",
+ "video_duration": "97s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person is seated at a desk, with a black hoodie on, and is holding a microphone. The individual appears to be singing, and they are moving their hands on the various electronic musical instruments that are on the desk. The video is brightly lit with red light. Text is displayed overlaying the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the purpose of the items on the table in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. To serve as accompaniment for music.",
+ "B. To function as a decorative item.",
+ "C. To adjust the pitch and timbre of music.",
+ "D. To play music."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the last three words in the subtitle?",
+ "answer": "A",
+ "candidates": [
+ "A. Love you guys.",
+ "B. Please click.",
+ "C. Life passing.",
+ "D. Sign up."
+ ]
+ }
+ },
+ "dpLgBgDy": {
+ "video_id": "dpLgBgDy",
+ "video_duration": "129s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person is fishing from a small boat on a river. The riverbanks are lined with trees and vegetation. The water is murky brown. The person casts their line into the water. The fishing rod bends as a fish takes the bait. The person reels the fish in. They use a paddle to maneuver the boat and keep the fish close. The fish jumps out of the water several times. A second person in the boat gets ready with a net. The first person brings the fish close to the boat. The second person nets the fish. The second person lifts the fish out of the net and shows it to the camera. The fish is a large Chinook salmon. The second person releases the salmon back into the river.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many fishing rods are there in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three rods.",
+ "B. One rod.",
+ "C. Four rods.",
+ "D. Two rods."
+ ]
+ }
+ },
+ "qrQEtDWz": {
+ "video_id": "qrQEtDWz",
+ "video_duration": "90s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A cartoon figure stands near a large hole in the ground. The figure walks, then climbs over a wall while holding a large gun. The figure then stands while holding the gun. The figure is then shown in a black and white silhouette. The figure is then shown again while holding the large gun. Two cartoon figures sit on a couch, one looking at a tablet. A white cartoon dog is shown looking at papers on a table. The camera zooms in on a drawing that is being held, and then on a specific part of that drawing. The dog is shown again, then turns to look out of a window. A child is shown using the zipline at an indoor playground. The camera pans across the playground and shows a variety of cartoon figures, including children, parents, and other adults. Several children are then shown gathered around an arcade game. A cartoon figure is shown on a metal ledge next to three large bottles. The camera pans over an indoor playground. Large, cartoon bottles fall towards the playground. A close-up shows a cartoon-style device. Dark brown liquid falls downwards. A cartoon figure is shown holding a crossbow. A blue arrow flies across the screen. Three blue arrows fly across the screen. A large, bouncy structure is shown. A cartoon figure is shown holding two objects with rotating ends. The objects are released. Several children are shown on a trampoline. A close-up is shown of an object that has three rotating rods. The object is shown again. A structure of blue cloth with three rods is shown. A cartoon figure is shown holding two spray cans which are emitting green gas. The two cans emit green gas, then a single can is shown with green gas coming from it. The can rolls along the floor of the indoor playground while emitting green gas. The gas spreads throughout the playground, eventually covering the whole scene. People in the playground are then shown lying on the ground, as well as a child wearing a gas mask. A cartoon figure is shown holding a device emitting pink sparks. The device is then thrown to the ground. The figure is shown standing next to a device with a pink, electrical spark. A figure is then shown standing next to a dog. The figure and the dog are shown in the indoor playground with people lying on the floor in the background. The dog and the figure stand near several bouncy structures and a child is shown lying on top of the structures. The figure is then shown near arcade games. The figure is shown again, opening its hands. The figure is shown with a look of shock.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many items did Stewie Griffin use to disrupt the party?",
+ "answer": "C",
+ "candidates": [
+ "A. Three items.",
+ "B. Two items.",
+ "C. Five items.",
+ "D. Four items."
+ ]
+ }
+ },
+ "gCzkJpVl": {
+ "video_id": "gCzkJpVl",
+ "video_duration": "374s",
+ "duration": "6-8min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two boys and a man are in a room with a bookshelf in the background. The boys are seated in chairs while the man is seated on a stool. The man is wearing a red glove, then a red glove and a blue shirt, then he has a hammer over his shoulder, then a blue light saber, then he has a toy gun and a pair of goggles. The boys have cups. The two boys and the man drink and spit water. They all laugh. The camera zooms in on the boy in the yellow shirt and the man. There are black and white screen overlays with the words \"Superheroes Gone Wrong\" and \"Too Close for Comfort\". The video ends with the three figures sitting close together and talking.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the bearded man do to make the two young boys laugh?",
+ "answer": "A",
+ "candidates": [
+ "A. Tell a humorous and absurd joke.",
+ "B. Laugh in front of them.",
+ "C. Threaten them.",
+ "D. Make a silly face."
+ ]
+ }
+ },
+ "iHSbxsaa": {
+ "video_id": "iHSbxsaa",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man with shoulder-length curly brown hair sits at a grand piano, playing. He wears a dark gray, almost black shirt. Sheet music rests on the piano's music stand. The view changes to show a woman standing and playing the violin, with the pianist in the background. She wears a light teal-colored dress with a floral pattern of orange, red, and white flowers with dark green leaves. She has long dark hair and wears small pearl earrings. The view returns to the pianist, who plays with his head slightly bent down, eyes focused on the sheet music. The view changes back to the violinist playing, moving the bow across the strings and her fingers along the neck of the violin. The view focuses closely on the violinist's hands as she plays. The view shifts to a wider shot showing both musicians as they play together, the violinist standing to the left and slightly in front of the grand piano. The room is decorated with gold and white molding and long gold curtains. A window is partially visible in the background. The video cuts to a close-up of the pianist's hands as he plays, fingers moving swiftly across the keyboard. The video shows the violinist playing again. The video returns to the close-up of the pianist's hands. The view goes back to the violinist, then back to the pianist's hands, and then back to the violinist once more as the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Does the violin have a better sense of rhythm than the piano in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. I am not sure.",
+ "C. Not applicable.",
+ "D. No."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the high-pitched sound of the violin occur in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the entire video.",
+ "B. In the middle of the video.",
+ "C. At the beginning of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What kind of emotional atmosphere does the entire piece of music convey?",
+ "answer": "D",
+ "candidates": [
+ "A. Subtle and sorrowful.",
+ "B. Joyful and uplifting.",
+ "C. Intense and restless.",
+ "D. Solemn and melancholic."
+ ]
+ }
+ },
+ "oQaepmMl": {
+ "video_id": "oQaepmMl",
+ "video_duration": "78s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A woman in a light blue pantsuit stands to the right of a small wooden stool, speaking into a microphone to a seated crowd in what appears to be a town hall meeting. Two glasses and a water bottle sit on the stool. The crowd sits in white folding chairs, arranged in a semi-circle around the speaker.\u00a0 Behind the first rows of chairs, other attendees stand against a dark blue curtain.\u00a0 Some are recording the event with cell phones and cameras. The speaker gestures with her right hand as she speaks, occasionally pausing to take a sip of water. She makes a specific point with three fingers.\u00a0 A woman in the audience begins speaking loudly to the speaker. She holds a white rectangular object up in the air with writing on it. The crowd stands up and applauds, and most of them are facing away from the speaker, with some clapping over their heads.\u00a0 The woman with the white object lowers it, her arms still raised in the air. The crowd continues to applaud.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the woman in a blue shirt raise her clenched left hand towards the sky?",
+ "answer": "C",
+ "candidates": [
+ "A. Eight times.",
+ "B. Six times.",
+ "C. Twice.",
+ "D. Four times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "How did the sound of the crowd change at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Gradually decreases.",
+ "B. Gradually increases.",
+ "C. Increases first, then decreases.",
+ "D. Decreases first, then increases."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual action did the woman wearing a black top and white coat in the video take?",
+ "answer": "C",
+ "candidates": [
+ "A. She stood up and then sat back down.",
+ "B. She suddenly pointed at the woman who was speaking.",
+ "C. She encouraged those around her to join in resisting the speaker.",
+ "D. She pushed aside the man who was preventing her from speaking further."
+ ]
+ }
+ },
+ "yPseXUBS": {
+ "video_id": "yPseXUBS",
+ "video_duration": "24s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man rides a horse down a foggy street at night. The rider dismounts and approaches another man. The second man runs away. The rider remounts and follows him, then dismounts again. Cut to a different location: indoors, at night. A man picks a lock on a cabinet using a thin metal tool that glows with heat. He opens the cabinet. Another man stands near him, holding a satchel. The first man ties a bandana over his face as he walks through an open doorway. Cut back to the lockpicking: the metal tool breaks. The lockpicker throws the broken tool to the floor. Cut to a first-person view of someone shouldering a rifle, stepping through a doorway, and into a different room. The person lowers the rifle.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Why did the person riding the horse stop in the first scene?",
+ "answer": "B",
+ "candidates": [
+ "A. Someone called out for the person to stop.",
+ "B. Someone fell from the sky and landed in front of the person.",
+ "C. The horse is hungry and unable to continue running.",
+ "D. The person has reached the destination."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "At the end of the video, did the masked man wearing a black triangular scarf die?",
+ "answer": "C",
+ "candidates": [
+ "A. It is unclear.",
+ "B. He was shot.",
+ "C. He did not die.",
+ "D. He was killed by an explosion."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "In the video, whose saddlebag is the bearded man most likely reaching for?",
+ "answer": "A",
+ "candidates": [
+ "A. It is the saddlebag of the man wearing a black triangular scarf.",
+ "B. It is the saddlebag of the man with a mask made of red plaid fabric.",
+ "C. It is the saddlebag of the abducted bank employee.",
+ "D. It is the saddlebag of the man wearing a cowboy hat."
+ ]
+ }
+ },
+ "xrPRjopD": {
+ "video_id": "xrPRjopD",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man in a kilt plays the bagpipes while a younger man plays a snare drum next to him. They are in a large room with wooden floors and blue chairs arranged in a semi-circle around them. Blue and white streamers hang along the wall. The camera pans to show a group of people standing near tables covered in white tablecloths. Food is arranged on the tables. The people appear to be watching and listening to the two musicians. The camera pans back to the musicians, who continue playing. The camera then pans back to the group of people, some of whom are clapping along to the music. The camera returns to the musicians, and then pans to show the tables of food again. The camera pans from the tables back to the musicians one more time, and they are still playing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many drums are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 2.",
+ "B. 3.",
+ "C. 1.",
+ "D. 0."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Where might the person playing the wind instrument in the video be from?",
+ "answer": "D",
+ "candidates": [
+ "A. Japan.",
+ "B. The United States.",
+ "C. Scotland.",
+ "D. Australia."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the accordion more rhythmic than the drum?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No.",
+ "C. Yes."
+ ]
+ }
+ },
+ "vJgnPKdu": {
+ "video_id": "vJgnPKdu",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person plays a xylophone with two mallets. The mallets have red tips. The xylophone is on a black folding stand. A music stand with sheet music is positioned behind the xylophone. Another music stand with sheet music is to the left of the frame. A pair of hands holds and fingers a flute, but the flute is not played. The person playing the xylophone wears a teal t-shirt and gray pants. The background consists of tiled walls. A double electrical outlet is visible on the wall.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many types of musical instruments are heard in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Two types.",
+ "B. Four types.",
+ "C. One type.",
+ "D. Three types."
+ ]
+ }
+ },
+ "pRYMIemy": {
+ "video_id": "pRYMIemy",
+ "video_duration": "40s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Physics",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a black ink signature on white paper. The text \"Sister Maria Celeste Galilei\" appears below the signature. This is followed by a black screen. Then, a pile of old books stacked on top of one another is shown next to a row of books. There is a black background around the books. This is followed by a black and white etching of two people standing in front of a building with trees to their left and right. There is text below the etching. Next is a close-up view of several pages from what appears to be an old textbook. The pages contain text and diagrams. This is followed by an old book opened to an image of three men wearing hats and carrying scientific devices. The next image is of a large open book with dense text and a few simple geometric diagrams on the page. Finally, the video shows an illustration of an elaborate structure made of wood and black material with red lines on it. Text is superimposed on top of each of these book pages and drawings.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order are the three events introduced in the video? (a) Describing the topics covered in Galileo's book. (b) Explaining the writing format of Galileo's book. (c) Discussing the role of nuns as editors.",
+ "answer": "C",
+ "candidates": [
+ "A. (a) (c) (b).",
+ "B. (c) (a) (b).",
+ "C. (c) (b) (a).",
+ "D. (b) (c) (a)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the man's appearance in the painting shown at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Wearing 17th-century clothing with a hat.",
+ "B. Wearing 17th-century clothing with long braids.",
+ "C. Wearing 18th-century clothing with hair combed back.",
+ "D. Wearing 17th-century clothing with long hair and a beard."
+ ]
+ }
+ },
+ "fnJkEmjn": {
+ "video_id": "fnJkEmjn",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A young woman with blonde hair tied in a high ponytail sits on a wooden bench, playing a pipa (a Chinese lute). She wears a light pink coat over a gray hooded sweatshirt. She plucks the strings of the pipa with her right hand, while her left hand presses down on the strings along the neck of the instrument. Behind her, a metal railing borders a paved walkway. Behind the walkway, there's an area with some greenery, including bamboo. A portion of an orange building or structure is visible in the background, along with a section of a white building. A dark-colored sign stands near the walkway. Whitish text characters appear periodically overlaid on the right side of the video. A person walks by in the background.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotions does the music played in the video convey?",
+ "answer": "B",
+ "candidates": [
+ "A. Energetic and passionate.",
+ "B. Majestic and heroic.",
+ "C. Melodious and graceful.",
+ "D. Cheerful and lively."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurs in the rhythm during the chorus section of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. From gentle to heavy.",
+ "B. From rapid to gentle.",
+ "C. From quick to heavy.",
+ "D. From heavy to gentle."
+ ]
+ }
+ },
+ "QrDEhFHx": {
+ "video_id": "QrDEhFHx",
+ "video_duration": "100s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video starts with a split-screen showing different angles of a courtroom. A man in a light blue, short-sleeved shirt sits at a table with papers on it. A bailiff in a green uniform stands near him. The split-screen disappears, focusing on the man and the bailiff. The man gestures with his hand toward the bailiff. Another bailiff enters the courtroom. The man picks up some papers and then throws them on the table. The bailiffs walk toward the man. The man puts some papers over his face. He removes the papers. He gestures and talks to the bailiff. Another bailiff enters. The first bailiff has his hand on his taser. The man picks up the papers, crumples them up, and then throws them at the bailiffs. He talks and gestures at them. He picks up some papers and throws them toward the bailiff. The bailiff takes out his taser and points it at the man. The man picks up some papers and throws them at the bailiffs. The bailiff tases the man, and he falls to the ground. They struggle, and the camera points down as they continue to struggle on the ground. One of the bailiffs puts handcuffs on the man, and they lead him away. The judge watches from the bench. One bailiff points his taser at the man as they take him from the courtroom. The camera is pointed at the floor, and the man is heard saying something. The man in the green uniform is seen restraining the man on the floor. The camera then focuses on the ceiling. The camera focuses on a bailiff in a dark uniform. The camera briefly points at the ceiling again. The bailiff re-enters the courtroom, going toward the judge. The other bailiffs follow. The man on the floor picks up something and throws it at the judge. The bailiffs struggle to restrain the man again. The camera is very shaky, and they fall to the ground again. The camera mainly shows the floor as they struggle. The man continues to struggle. The camera shows close-ups of one of the bailiffs. The video ends while they are still struggling.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In the video, when does the man in gray clothes stand up and point at someone?",
+ "answer": "B",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the end of the video.",
+ "C. At the beginning of the video.",
+ "D. Throughout the entire video."
+ ]
+ }
+ },
+ "JLXCjVWM": {
+ "video_id": "JLXCjVWM",
+ "video_duration": "239s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins in a room where two women are sitting on a bed with red curtains visible. One woman is wearing a pink top and black pants, and the other woman is wearing a black t-shirt and black pants. The second woman turns her head and says \"We need help,\" and the first woman cries. The first woman responds by saying \"And this one it's only God that can help us. You need to see my pastor.\" The second woman then says, \"Me, a relationship expert, can't sort this out.\" The first woman replies \"It's okay,\" and the second woman says \"Just come and see my pastor\" and the first woman agrees. The scene changes to a different room with red curtains, where the first woman is now sitting at a white table with a man wearing a gray suit. The man says \"So you always have heartbreaks?\", and the woman replies, \"Yes, sir. Hmm\". The man says \"Let's pray\" and they do, before the man says \"Hello dear, It seems that late marriages run in your lineage.\" The woman then says \"Yes, sir\", and the man continues with \"Your elder sister. She's not yet married. And she's over 42. Your mother got married around 42 as well. And it's been a pattern among your aunties and all\". The woman says \"Yes, yes, pastor it's true\", and the man says, \"Holyspirit, The Spirit of God is showing me something that this was the result of what your grandfather did. He illegally collected land from somebody out of anger. The person went to the land and made some incantations and raised an altar you know, and that seems to be what is affecting your lineage\". The woman says \"Pastor. I don't understand. What is my business with land? Why is it affecting me?\", and the man responds \"That is the point. The fathers have eaten sour grape and the children's teeth are set on edge\", then the woman cries out. The man continues, saying \"But you see, no more, no more will a causeless curse stand. As you have come to this mountain there shall be liberation in your life. Your story will change for good in the name of Jesus. Do you mind if I pray with you?\", and the woman responds \"I don't mind sir. Because I feel the presence of God in this place.\" They both put their hands on the table, and the man says \"Father in the name of the Lord\", and then they say \"Amen\". The man says \"I know your presence is here. You go ahead and make every crooked path straight. Now go into the lineage of this family. You masquerade demon covering the identity of this one.\" The woman's face changes to include darker tones and a wrinkly face with wide eyes. The transformed woman says \"You cannot have her. leave her alone\", and the man responds \"You have no right to speak here. By the power of the Holy Ghost, I silence you now, in the name of Jesus.\" The transformed woman says \"You masquerade spirit appearing on her face. Catch fire nows\", and the woman then puts her hands on her face, and the man commands her to \"Leave my face alone, in the name of the Lord Jesus\". The woman falls onto the desk. The man continues, saying \"You have no place. Every covenant from the pit of hell, every altar against your family, is destroyed now in the name of Jesus. The covenant that speaks better things than the blood of Abel begins to speak for you and your family. In Jesus' name\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "Why did the woman with reddish-brown braided hair suddenly display a demonic expression on her face at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. She has lost her sanity.",
+ "B. The family is cursed.",
+ "C. She is possessed by a demon.",
+ "D. She committed a bad deed."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action does the man in the video, wearing a blue pinstriped suit and a white shirt, perform on the woman with reddish-brown braids?",
+ "answer": "B",
+ "candidates": [
+ "A. A kiss.",
+ "B. Holding hands.",
+ "C. A shoulder tap.",
+ "D. A hug."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who says the phrase \"We need help, we need help\" in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A woman in a blue and white shirt.",
+ "B. A woman in a black short-sleeved shirt.",
+ "C. A woman in a red and white shirt.",
+ "D. A woman in a red short-sleeved shirt."
+ ]
+ }
+ },
+ "DhphQeOP": {
+ "video_id": "DhphQeOP",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three individuals are seated and playing classical guitars. The person in the center is an adult woman with long, dark hair. She wears a black, short-sleeved top with a sheer, dark, floral overlay. She has red lipstick. The person to the left appears to be a young woman with long dark hair, wearing a black, short-sleeved top. The person to the right also appears to be a young woman with long dark hair, wearing a black, short-sleeved top and light blue jean shorts. The girl on the left wears a watch on her left wrist. The girl on the right wears a purple bracelet on her right wrist. All three people continuously play their guitars. They are sitting in front of a wall with two pieces of abstract art hanging on it. The artwork has shades of red, orange, and cream, with vertical and horizontal lines throughout. A corner of a dark piece of furniture is visible in the bottom left corner. The top of what appears to be a stereo receiver is visible behind and to the right of the adult woman. The light is relatively dim.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the instrument on the right in the video louder than the one in the middle?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. I am not sure.",
+ "C. Yes.",
+ "D. None of the above."
+ ]
+ }
+ },
+ "ZyFkyJpc": {
+ "video_id": "ZyFkyJpc",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman's hands pluck the strings of a guzheng. The camera focuses on the details of the instrument and her fingers. The scene changes to a wider shot, revealing a woman in a white dress playing the guzheng on a wooden platform by a lake. Lush green mountains and tea fields form the backdrop. The camera then zooms in on her face as she continues playing. She has long black hair styled in a braid. The camera returns to her hands on the instrument. Later, the focus shifts back to her face and the scenic background. The video continues alternating between close-ups of her hands playing the guzheng and wider shots of her playing in the serene natural setting.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the guzheng consistently played throughout the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. No."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "How does the rhythm of the music change in the first half of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. From heavy to joyful.",
+ "B. From passionate to soothing.",
+ "C. From joyful to heavy.",
+ "D. From soothing to passionate."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion is conveyed by the music played in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Sadness.",
+ "B. Regret.",
+ "C. Loneliness.",
+ "D. Confusion."
+ ]
+ }
+ },
+ "xArSBiiu": {
+ "video_id": "xArSBiiu",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Five people are sitting in chairs on a stage. They are holding instruments. The instruments are light brown, with a round piece on the top and a neck. They move their hands on the instruments as they are playing. In the background, a larger version of the instruments appear in a translucent overlay. A dark audience can be seen in the foreground at the bottom of the frame.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "How does the rhythm of the pipa music change at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. From gentle to heavy.",
+ "B. From rapid to slow.",
+ "C. From heavy to gentle.",
+ "D. From slow to rapid."
+ ]
+ }
+ },
+ "yFGuZVBy": {
+ "video_id": "yFGuZVBy",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman with blonde hair is playing a violin while standing in front of a piano. A man with dark hair and glasses is seated at the piano. The woman is wearing a dark dress with a sparkly belt and a headband. The man is wearing a dark suit jacket. The background is a plain white wall with a floor lamp on the right side.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Does the violin have a stronger sense of rhythm than the piano in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. No."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurs in the rhythm of the violin at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It remains swift throughout.",
+ "B. It shifts from a swift pace to a gentle one.",
+ "C. The rhythm alternates between gentle and swift.",
+ "D. It shifts from a gentle pace to a swift one."
+ ]
+ }
+ },
+ "GhYpNHml": {
+ "video_id": "GhYpNHml",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three people are visible in the video. A person in the foreground is playing a cello while seated. A person in the background to the left is seated at a piano. A person in the background to the right is playing a stringed instrument. The three people appear to be playing music together. There are two music stands visible. Christmas stockings are hanging on the fireplace. There are several other decorations in the room.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "From which direction does the sound of the piano come in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. To the right of the screen.",
+ "B. To the left of the screen.",
+ "C. Behind the screen.",
+ "D. In front of the screen."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur in the rhythm of the cello music at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. It transitions from joyful to calm.",
+ "B. It transitions from gentle to heavy.",
+ "C. It transitions from heavy to gentle.",
+ "D. It transitions from heavy to joyful."
+ ]
+ }
+ },
+ "vEouuPkx": {
+ "video_id": "vEouuPkx",
+ "video_duration": "170s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a garden area with various plants, flowers, and a few buildings and trees. Initially, there are three people in the garden. One woman wears a patterned black and white dress, another person wears a dark top and black shorts, and a third person is wearing a white shirt and dark shorts. The third person is pushing a green wheelbarrow with brown mulch. Later, the woman in the patterned dress empties the wheelbarrow. The other two people stand near her and are joined by a fourth person wearing a patterned blue dress with a head covering. Two other people are shown gathering crops, each with a brown woven basket. Later, three people are in the garden, one in a black t-shirt and white shorts, throwing a reddish-colored fruit into the air. One of the other two people is sitting on the ground in a black and white patterned dress. Four people are seen walking near some flowers in the garden area. One wears a black t-shirt with the word \"EQUALITY\" on it, and another woman wears a patterned black and white skirt and a dark top. The video shows an aerial view of the garden. There is a house nearby with cars parked in front of it. The woman in the green top is then shown sitting and talking. She is then shown holding a small purple flower and smelling a tall yellow flower. An aerial shot shows two more people working in the garden. The video also shows a plant with small light purple flowers being visited by bees. Finally, the video shows seven people standing and waving to the camera, then it shows more aerial views of the garden.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why are medicinal plants being grown in the community garden?",
+ "answer": "C",
+ "candidates": [
+ "A. To conduct research on drugs.",
+ "B. To earn money.",
+ "C. To have a garden.",
+ "D. To spend leisure time."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes have taken place in the community garden over the past three months?",
+ "answer": "B",
+ "candidates": [
+ "A. Weedy areas were turned into a vegetable garden.",
+ "B. Weedy areas were transformed into flower beds.",
+ "C. Weedy areas turned into wasteland.",
+ "D. Weedy areas became residential spaces."
+ ]
+ }
+ },
+ "VScKrlEw": {
+ "video_id": "VScKrlEw",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three women are seated and playing Chinese instruments. The woman on the left plays a two-stringed bowed instrument called an *erhu*. The woman in the middle plays a *dizi*, a transverse flute. The woman on the right also plays an *erhu*. Sheet music rests on stands in front of each musician. They are playing in a room with a light-colored wood floor and white walls. Several *erhu* are displayed on racks in the background. A dark red cabinet with glass doors is also visible in the background. The woman on the left wears a light-colored short-sleeved top with dark floral print. Her hair is dark and pulled back in a ponytail over her left shoulder. The woman in the middle wears a short-sleeved blue top and a blue skirt. Her dark hair is shoulder length. The woman on the right wears a lavender t-shirt with a dark design on the front, and a long, teal, paisley skirt. Her hair is dark and pulled back. The women play their instruments while looking at their sheet music. Their fingers and hands move over their instruments in a fluid manner. The woman playing the flute occasionally closes her eyes. The two women playing *erhu* use a bow and finger positions to change the notes. Their bows move rhythmically across the strings.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What kind of emotion does the music played in the video convey?",
+ "answer": "C",
+ "candidates": [
+ "A. Calmness.",
+ "B. Melancholy.",
+ "C. Sadness.",
+ "D. Relaxation."
+ ]
+ }
+ },
+ "zNQySNOV": {
+ "video_id": "zNQySNOV",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are seated and holding saxophones and appear to be playing them. The person on the left is wearing a striped blue and white shirt, and the person on the right is wearing a dark gray shirt with a blue strap over the right shoulder. A music stand with sheet music is in the lower-left. A whiteboard with written text and musical symbols is visible on the wall in the background, and there are two instrument cases to the left of the seated people.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many instruments in the video are producing sound?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Three.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the sound of the instruments in front louder than those behind?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "jHlldnZG": {
+ "video_id": "jHlldnZG",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person with a beard sits holding a 12-string acoustic guitar. A microphone is positioned in front of the sound hole. The person plays the guitar. The camera zooms in on the headstock, showing the Yamaha logo and the tuning pegs. The camera then returns to the original position and the person continues to play. The person stops playing and places their right hand on the body of the guitar.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What instruments are featured in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A banjo.",
+ "B. A guitar.",
+ "C. A violin.",
+ "D. A cello."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At which point in the video does the guitar's sound disappear?",
+ "answer": "A",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the end of the video.",
+ "C. Throughout the entire video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "hzhmNvxu": {
+ "video_id": "hzhmNvxu",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are standing outside in front of a light grey building with a single window with blue trim. The individual on the left is playing a gold saxophone. They are wearing a light-colored cap, glasses, a white long-sleeve shirt with a blue vest, and grey pants. The individual on the right is holding a red accordion. They are wearing a light-colored hat, a white long-sleeve shirt, a dark vest, and dark pants. There are some plants in front of the building and grass on either side.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man wearing a gray-blue vest change his breath while playing in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Five times.",
+ "B. Six times.",
+ "C. Four times.",
+ "D. Three times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur in the rhythm of the saxophone music in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The rhythm shifts from gentle to rapid.",
+ "B. The rhythm shifts from rapid to gentle.",
+ "C. The rhythm remains rapid throughout.",
+ "D. Smooth and slow rhythms alternate with each other."
+ ]
+ }
+ },
+ "RxfsoZbv": {
+ "video_id": "RxfsoZbv",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Three white duck statues with yellow beaks and feet sit on a shelf. The camera moves to a green four-leaf clover logo made of duck heads. Several roasted ducks hang from metal hooks. A green sign reads \"HAND REARED IN IRELAND.\" A man with glasses and a striped shirt sits at a table in a restaurant, gesturing with his hands. A hand torches hanging ducks with a small flame. A bald man in a black apron looks at hanging ducks. Several roasted ducks hang in a row. A chef holds a roasted duck by a hook. A restaurant storefront shows a video screen playing footage of a farm. The restaurant's logo, \"Duckland,\" is illuminated. Two men pose together, seen through a blurry reflection. Framed photos hang on a wall, depicting ducks, food, and people. People are seated at tables in the restaurant, with chefs standing by a counter.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence of events in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Have the chef roast the duck, taste Irish duck, introduce duck farms, and showcase the restaurant's exterior.",
+ "B. Taste Irish duck, introduce duck farms, have the chef roast the duck, and showcase the restaurant's exterior.",
+ "C. Introduce duck farms, taste Irish duck, have the chef roast the duck, and showcase the restaurant's exterior.",
+ "D. Taste Irish duck, have the chef roast the duck, introduce duck farms, and showcase the restaurant's exterior."
+ ]
+ }
+ },
+ "oJTgptaW": {
+ "video_id": "oJTgptaW",
+ "video_duration": "33s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A fisheye lens shows a large sign depicting the Most Autodrom track layout, with \"THE MOST DRIVING SINCE 1983\" written on it. The view changes to show several racing trucks parked in a line, with people standing around them. One truck, blue and black with orange details, has \"HOFMEISTER & MENCK\" written on its front. Then, a blue and red MAN truck with open doors is visible, parked in front of what seems to be a pit stop. The scene shifts to mechanics working on a large engine using a crane, which lowers the engine into a different compartment. Afterward, a reporter interviews a man wearing a blue and yellow racing suit. The camera then focuses on the interviewee as he speaks. The view returns to the mechanics installing the engine, this time inside the vehicle's chassis. The interview continues with the driver, who is looking at the reporter while answering her questions. The final shot shows a yellow and green racing truck with mechanics working around it.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the blue truck relative to the yellow truck in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. To the right, with two trucks in between.",
+ "B. To the left, separated by one truck.",
+ "C. To the right, with two trucks in between.",
+ "D. To the right, separated by one truck."
+ ]
+ }
+ },
+ "eTRFPxOV": {
+ "video_id": "eTRFPxOV",
+ "video_duration": "117s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Swimming",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A large indoor pool is visible. The pool is divided into multiple lanes by lane markers. There are swimmers in each of the lanes. Numerous individuals are standing along the right edge of the pool, most wearing green shirts. There are other individuals standing at the left edge of the pool. Banners are seen hanging along the top edge of the pool. At the start of the video, the swimmers are positioned at the end of the lanes. They enter the water and begin swimming. The video continues to display the swimmers swimming their length of the pool. Several of the individuals on the right edge of the pool stand watching the swimmers, some are holding clipboards. The swimmers continue to swim, and the camera angle and zoom shifts several times.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, where is Elizabeth Bryan positioned relative to Anaya Gautam at the moment the competition starts?",
+ "answer": "B",
+ "candidates": [
+ "A. Elizabeth Bryan is to the left of Anaya Gautam.",
+ "B. Elizabeth Bryan is to the right of Anaya Gautam.",
+ "C. Elizabeth Bryan is behind Anaya Gautam.",
+ "D. Elizabeth Bryan is in front of Anaya Gautam."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "When does the girls' diving take place in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. Throughout the video.",
+ "C. At the end of the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "pAZpYwBE": {
+ "video_id": "pAZpYwBE",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. Then it switches to show a man in a dark suit jacket and white button down shirt talking on a cell phone. Text appears at the bottom of the frame which reads \"is spelled...\". The scene cuts to a close up shot of a computer keyboard where text appears in the middle of the frame, first reading \"BU\" then \"BUSI\" then \"BUSINE\". The scene cuts back to the first man, still talking on the phone. Text appears at the bottom of the frame which reads \"and I'm glad\". Then the text changes to read \"I'm on the phone\". The man gestures with his hands and takes off his jacket while talking. The scene cuts to a second man with long brown hair, a white button down shirt, and a pink and white striped tie, also talking on a cell phone. The first man reappears and continues talking on the phone. Yellow text appears next to him which reads \"BUSINESS\". The text then changes to read \"whoever spelled BIZNA$$...\". The second man reappears, talking on the phone in a room with two computer monitors. The first man is then shown on the phone again, a small framed picture appears in the corner next to him, which shows a man holding a dollar bill. The second man is shown again, on the phone, adjusting his tie. The first man is then shown, followed by the second. Then the first man appears again, followed by the second man again. The first man then appears, taking a black tie and holding it up and talking on the phone. The second man then appears, talking on the phone, followed by the first man. The second man is shown again, talking on the phone. The first man appears with a tie, followed by the second man adjusting his tie. The first man holds the tie up again, talking on the phone. The second man takes both his blue tie and his red and white tie and holds them up while talking on the phone. The scene ends with the second man holding up the two ties.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man with long black hair in the video do with the tie?",
+ "answer": "C",
+ "candidates": [
+ "A. Adjusted the tie.",
+ "B. Loosened the tie.",
+ "C. Took off the tie.",
+ "D. Touched the tie."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who spells the word \"business\" in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A young man with light brown hair.",
+ "B. A man with long black hair.",
+ "C. A man with short hair.",
+ "D. A young man with a crew cut."
+ ]
+ }
+ },
+ "trVGczfP": {
+ "video_id": "trVGczfP",
+ "video_duration": "145s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Chemistry",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a striped long sleeve shirt, dark pants, and glasses is standing in a room. The person is manipulating a tall glass tube that is mounted on a metal stand. There are other glass containers visible on the counter. In another scene, the person uses the glass tube to introduce a liquid into a glass flask. The liquid is clear and turns purple once inside the flask. The person is talking and gesturing with their hands. They pick up the glass flask with the purple liquid and hold it up. They put it back down on the counter and move to another part of the counter to continue speaking. They return and continue gesturing while they speak. They touch the glass tube on the stand, then show something in their hand. They move their hands in different ways, then touch the glass tube again. They move something into another glass container. They move the glass flask with the purple liquid again. They pick up something blue. They move a glass graduated cylinder. They continue moving their hands and talking.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occur in the solution in the conical flask shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It changes from colorless to pink, then becomes cloudy, and finally remains pink.",
+ "B. It changes from pink to colorless, then becomes cloudy, and finally remains pink.",
+ "C. It changes from colorless to pink, then the pink disappears, and finally remains colorless.",
+ "D. It changes from colorless to pink, then the pink disappears, and finally remains pink."
+ ]
+ }
+ },
+ "qznKFtan": {
+ "video_id": "qznKFtan",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man wearing a black suit and white shirt is playing a flute. The man is sitting down. The flute is silver with gold keys. He is wearing a silver ring. A microphone is in front of the man. There are music stands in front of him. Behind him, the background has vertical red and white panels. A second and then a third person is visible playing a violin and a cello. The second person has brown hair and is wearing a black dress. The third person is also wearing black and has dark hair. There are black music stands in front of each of the three musicians with the word \"Dolce\" on them. The three musicians are playing together.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "From which direction is the cello sound coming in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The sound of the cello is coming from the center of the frame.",
+ "B. The sound of the cello is coming from the right side of the frame.",
+ "C. The sound of the cello is coming from the left side of the frame.",
+ "D. The sound of the cello is coming from behind the frame."
+ ]
+ }
+ },
+ "EaPoverA": {
+ "video_id": "EaPoverA",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "On a stage with gold curtains, two boys perform a musical piece. A banner reading \u201cBIRCH\u201d hangs from the top of the curtain. One boy sits at a brown piano on the left side of the stage playing with both hands. The other boy stands on the right, playing the violin. A music stand holding sheet music stands in front of him and a microphone is positioned to his left. He moves the bow across the strings of the violin with his right hand, while his left hand fingers positions along the neck of the instrument. The boy at the piano wears a light pink hooded sweatshirt, light gray pants, and dark shoes. The boy playing violin wears a horizontally striped blue and white long-sleeved shirt, blue jeans and light colored sneakers. He has dark hair and glasses. The boy playing piano also has dark hair. Both boys concentrate on playing their instruments and look at their music from time to time. Someone in the audience in front of the stage is holding up a phone to record the performance.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At which point in the video does the violin stop playing temporarily?",
+ "answer": "A",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. In the middle of the video.",
+ "C. At the beginning of the video.",
+ "D. Throughout the entire video."
+ ]
+ }
+ },
+ "QJnSEryX": {
+ "video_id": "QJnSEryX",
+ "video_duration": "142s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows the internal mechanisms of a safe. There is a silver metal frame with levers and springs. Within the frame, there are two circular plastic wheels. One wheel is black and the other is light yellow. A hand is seen in several shots of the video, pointing to or manipulating different parts of the safe's mechanics. The hand pushes the black wheel and then moves the metal frame and levers. Later the video shows a black handle being turned, which opens the safe. Another hand rotates a black dial with white markings. At the end, one part of the mechanism is pushed by a finger. The black dial then closes the safe.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What actions does the man perform in the video, in order?",
+ "answer": "D",
+ "candidates": [
+ "A. Press the spring to activate \u2192 Rotate the black knob \u2192 Spin the black wheel.",
+ "B. Rotate the black knob \u2192 Spin the black wheel \u2192 Press the spring to activate.",
+ "C. Rotate the black knob \u2192 Press the spring to activate \u2192 Spin the black wheel.",
+ "D. Spin the black wheel \u2192 Press the spring to activate \u2192 Rotate the black knob."
+ ]
+ }
+ },
+ "oFHrtYKJ": {
+ "video_id": "oFHrtYKJ",
+ "video_duration": "76s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person is in a kitchen and is speaking to the camera. They show two packages of cheese sticks: Cheddar Mozzarella and Colby Pepper Jack. They bite into one of the cheese sticks. Then they place a block of butter onto a digital scale on a kitchen countertop next to a large teal tumbler with a lid. Then the person removes the butter from the scale, and places the lid on the tumbler. Then the person shows a package of ground, decaffeinated coffee, then pours the coffee into a pour-over coffee maker positioned over the tumbler, then pours hot water into the pour-over. Then the person shows a carton of cashew milk and then shows a large container of vanilla collagen powder, before using a spoon to place the collagen powder into the tumbler.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many flavors of cheese does the woman taste in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. One kind.",
+ "B. None.",
+ "C. Three kinds.",
+ "D. Two kinds."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What are the people in the video discussing?",
+ "answer": "C",
+ "candidates": [
+ "A. The nutritional value of cheese sticks.",
+ "B. The producer of cheese sticks.",
+ "C. The flavors of cheese sticks.",
+ "D. The origin of cheese sticks."
+ ]
+ }
+ },
+ "druJhtnF": {
+ "video_id": "druJhtnF",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video begins with a close-up of three people, each holding a guitar and moving their fingers on the fretboard. The person in the center is wearing a light green shirt, a watch and has short black hair. The two people beside him are wearing light colored shirts. The camera angle changes to show 3 people, 2 men and 1 woman, all playing guitars while seated on a concrete bench in an outdoor area. The man on the left is wearing a light green shirt, and a watch. The woman in the center has long dark hair and a light shirt. The man on the right is wearing a dark blue shirt, glasses, and is playing his guitar. The camera angle changes to a person seated on a flight of stairs, playing a guitar. He is wearing a light colored polo shirt and gray pants. He is playing his guitar on a flight of stairs. The camera angle changes, focusing on his hands and the guitar as he plays, and then the camera zooms out. The camera angle changes again, to a close-up of someone's hand playing a guitar. The camera changes to show two people playing guitars. The camera changes to show a group of five people, all playing guitars, while seated in an outdoor area with greenery in the background. One of them is seated on a metal stool. The camera angle changes to a close-up of the same three people as seen at the beginning of the video, all playing their guitars. The camera angle changes again, showing the group of five people all playing guitars, some seated on the bench, some on chairs. The camera angle changes to a man, standing on a flight of stairs, and playing his guitar.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the instrument on the right louder than the one on the left in the middle of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "KFuroxuE": {
+ "video_id": "KFuroxuE",
+ "video_duration": "105s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a split-screen showing two men in suits. The man on the left has a multi-colored medal around his neck. There are two small black microphones in front of the man on the left. A text banner at the bottom of the screen displays the words, \u201cThe West Asia Post\u201d and \"Jordan kicked Syrian ambassador out of the country in 2014\u201d. Another text banner is displayed and has the words, \u201cJordan & Syria rebuild diplomatic ties.\"The video then shows a series of different scenes. One shows a desert landscape with dust and debris flying around. Next, a brown camouflaged truck, equipped with a cannon, appears. The truck is being operated by men wearing camouflage. Then, a cannon is shown firing. Then the landscape is shown again with lots of smoke. Afterwards, several clips appear of men in suits shaking hands with each other. There are also scenes of men in uniform, in outdoor settings, operating large weapons and walking through streets.The video cuts to an interior shot of an immigration office. Several people are standing in line. There are also people working at computers behind windows. Next, a close up shot shows a man\u2019s hands opening up a passport and checking documents.The video ends with a close-up shot of a man speaking to the camera while standing inside the immigration office. A text banner at the bottom of the screen displays the words, \u201cThe West Asia Post\u201d and \"Jordan & Syria rebuild diplomatic ties\". Another text banner is displayed and has the words, \u201cNAJEH NOFAL, MEMBER, JORDANIAN CHAMBER OF COMMERCE\u201d.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a motorcycle present in the middle part of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. I am not sure.",
+ "C. No."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What abnormal situation is shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. People are causing disturbances.",
+ "B. Syria is attacked.",
+ "C. Jordan is attacked.",
+ "D. Artillery is fired and guns are shot."
+ ]
+ }
+ },
+ "PamJqiYi": {
+ "video_id": "PamJqiYi",
+ "video_duration": "155s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Bernie Sanders is shown in a close-up shot against a blue background with the CNN logo. He wears a dark suit, light-colored shirt, and patterned tie, along with glasses. He speaks while gesturing with his hands. A graphic at the bottom of the screen reads \"QUESTION: Aren't you proposing Scandinavian spending, but failing to pay for it?\" and another reads \"CNN DEBATE NIGHT: THE GOP TAX PLAN.\" The top right corner has the location \"Washington\" and the time \"10:06 PM ET.\" A \"LIVE\" indicator and the CNN logo are also present.The view changes to a split-screen, with an interviewer on the left and Bernie Sanders on the right. The interviewer wears a grey suit and dark tie. Sanders points at the interviewer.The interviewer holds a microphone and speaks, and Bernie Sanders watches him.The split-screen returns with Sanders speaking and gesturing.The screen shows Sanders at his podium next to another podium. He points and gestures towards the empty podium to his right. Ted Cruz then walks to the podium. He is wearing a dark suit and a striped tie with a small American flag pin on his lapel. The view changes back to the split-screen. Ted Cruz begins to speak, using his right hand for gestures. Bernie Sanders watches him, occasionally smiling. Cruz continues to speak, gesturing with both hands. Sanders smiles more broadly. Cruz continues speaking, and Sanders picks up a glass of water from his podium, takes a drink, and sets it down. Cruz gestures with his right hand and points with his left index finger while speaking, and then continues using hand gestures while speaking. Sanders begins speaking while gesturing with both hands. He then smiles and nods his head. Cruz waves his hand dismissively and laughs. Sanders laughs and picks up his papers from the podium. Cruz continues to gesture as he speaks. Sanders bends over to retrieve something he has dropped near the podium. Cruz continues gesturing. Sanders stands back up.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who in the video says, \"This requires spending money. Nothing is free\"?",
+ "answer": "B",
+ "candidates": [
+ "A. A man in a light gray suit with a dark tie.",
+ "B. An elderly man in a suit with a dark striped tie.",
+ "C. A man in a dark suit with a white shirt.",
+ "D. A man in a dark suit with a white striped tie."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the man wearing a light gray suit appear?",
+ "answer": "D",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. In the middle of the video.",
+ "C. Throughout the video.",
+ "D. At the beginning of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the identity of the elderly man wearing a dark striped tie and suit in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Ordinary citizen.",
+ "B. Republican senator.",
+ "C. News anchor.",
+ "D. Democratic senator."
+ ]
+ }
+ },
+ "VLsQFrvO": {
+ "video_id": "VLsQFrvO",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A group of people are in a large room with tile floors and pillars. There are red ropes on stands delineating an area in front of a group of musicians. A man wearing glasses and a striped shirt is playing a red accordion, a woman wearing a red and white patterned shirt, pants, and a pink sunshade is moving her body, and a man wearing glasses and a light blue shirt is playing the violin. Another man with grey hair is seated and playing a wood instrument. Behind the musicians, a black piano is visible and a person is seated and playing. There are several rows of people sitting and observing the musicians. Other people are standing and observing in the background. Later in the video, some people in front of the audience begin dancing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What is the first instrument to begin playing in the piece shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Piano.",
+ "B. Accordion.",
+ "C. Violin.",
+ "D. Guitar."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many types of instrument sounds are featured in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. One.",
+ "D. Five."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "How many pianos are seen in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Two pianos.",
+ "B. Three pianos.",
+ "C. Four pianos.",
+ "D. One piano."
+ ]
+ }
+ },
+ "pDXLVdRH": {
+ "video_id": "pDXLVdRH",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two women sit on a stage and play the guzheng, a Chinese plucked zither. They are both dressed in formal black attire. The woman on the left has long dark hair, and her guzheng is placed on a stand. The woman on the right has shorter hair pulled back and wears glasses. Her guzheng is also on a stand. Both instruments have multiple strings and bridges. Behind them, a screen displays Chinese text and images of autumn leaves. A covered piano sits stage left behind the first woman. Microphones and stands are positioned near the instruments. Throughout the video, the women's hands move over the strings of their instruments, plucking and manipulating them. They maintain focus and coordination.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How many pieces are performed in this concert, as seen in the background images of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three pieces.",
+ "B. Two pieces.",
+ "C. Four pieces.",
+ "D. Five pieces."
+ ]
+ }
+ },
+ "YZGrxsiE": {
+ "video_id": "YZGrxsiE",
+ "video_duration": "295s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Fashion",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video starts with a close up shot of an unconstructed Gundam toy inside of a cabinet. Then the video cuts to a close-up of a person's hands, which then move away to reveal a person with short white hair and a black t-shirt, sitting in a chair with a black and red backing. The person holds up a see through plastic bag with white boxes inside, showing the logo \"galactictoys.com\". Then the video cuts to close-ups of many white toy pieces still in their plastic frames, with toy box art in the background of each of the pieces. Then the video cuts back to the first person with the white hair and they hold up a single plastic toy frame, containing a miniature model. Then the video cuts to a person wearing a green hat and orange vest, with their eyes bulging, and then cuts back to the person with the white hair, who is again holding up the plastic toy frame. Then the first person holds up a toy box with a red and black circular component and then holds up the toy in the box. Then the first person shows multiple toy magazines in the same style as the toy boxes, and the video cuts to still image of a person standing in front of a large gundam with the logo \"the witch from mercury\". Then the video cuts back to the first person who speaks to the camera. The video then ends with the youtube outro graphic.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the environment like around the woman in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Dim and crowded.",
+ "B. Dim and spacious.",
+ "C. Spacious and bright.",
+ "D. Neat and bright."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What reasons does she have for liking the anime version of Gundam?",
+ "answer": "D",
+ "candidates": [
+ "A. It is lightweight.",
+ "B. It is more gentle.",
+ "C. It is inexpensive.",
+ "D. It is more rugged."
+ ]
+ }
+ },
+ "DMCVJPdj": {
+ "video_id": "DMCVJPdj",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person is sitting in front of a wall, behind a brown couch. There are two framed pieces of artwork on the wall. The person is wearing a purple shirt and is gesturing with one hand. Then, a black screen appears with the text \"First the drums\". This text then goes away, and there is another black screen with the text \"Meant to say 'what I've been practicing'!!!\". This text then disappears and is followed by the text \"And prewarning for the capture quality of the drums...\". After this text, a person is shown sitting behind a set of black drums with wooden drum sticks in their hands. They are wearing a purple shirt. The person hits the drum kit, repeatedly hitting various drums and cymbals.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Which part of the drum set did the woman strike first during her teaching demonstration in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Bass drum.",
+ "B. Kick drum.",
+ "C. Snare drum.",
+ "D. Cymbal."
+ ]
+ }
+ },
+ "WPnhtfrV": {
+ "video_id": "WPnhtfrV",
+ "video_duration": "71s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person wearing a white t-shirt and a navy blue hat is sitting at a wooden table. Behind the person is a dark blue wall and multiple hanging light bulbs, as well as various books and objects. The person gestures with their hands while looking directly at the camera. Then the camera shifts to a kitchen scene. Two round loaves of bread sit on a wire rack placed on a wooden cutting board. Someone wearing red oven mitts and a dark blue apron uses a black metal peel to move the loaves. The camera returns to the first person at the table, gesturing and looking at the camera. Then, the camera shifts back to the kitchen scene where three tall jars sit on the table on a scale. A person in a dark blue apron uses a scoop to add a white substance to a jar, then pours water from a glass bowl into that same jar. Then, the camera shows the first person at the table again, gesturing to the camera. Next, the camera shows the kitchen again, where the person in the apron uses a black utensil to mix the contents of a jar, before placing a lid on it, next to two other jars. A close-up shot shows the content of the jars. The camera returns to the first person at the table. Then, the camera returns to the kitchen scene and shows the person in the apron kneading dough in a metal bowl, then removing the dough. The person shapes the dough on a wooden surface, before placing the dough into containers, and moving the containers, which are covered with plastic. Then, the camera returns to the first person at the table.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man wearing an apron fold the dough in the latter part of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Five times.",
+ "B. Seven times.",
+ "C. Four times.",
+ "D. Six times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "How many grams of ingredients are required to make a fermented product?",
+ "answer": "C",
+ "candidates": [
+ "A. 20 grams of baking powder, 30 grams of flour, 20 grams of water.",
+ "B. 20 grams of flour, 30 grams of baking powder, 30 grams of water.",
+ "C. 20 grams of baking powder, 30 grams of flour, 30 grams of water.",
+ "D. 20 grams of water, 30 grams of flour, 30 grams of baking powder."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Why were five loaves of bread made?",
+ "answer": "D",
+ "candidates": [
+ "A. Because he wanted to see which yeast-to-flour ratio produces better bread.",
+ "B. Because he wanted to see which flour ratio results in better bread.",
+ "C. Because he wanted to determine which type of flour makes better bread.",
+ "D. Because fermentation caused different changes in the bread."
+ ]
+ }
+ },
+ "NnAflpEq": {
+ "video_id": "NnAflpEq",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are sitting on a red couch and playing ukuleles. The person on the left is wearing a white sweatshirt and blue pants. The person on the right is wearing a light blue t-shirt and gray pants. There are microphones in front of both of them. In the background is a door, a cabinet with shelves, wall art, and speakers.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many pieces were performed in total in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three pieces.",
+ "B. Four pieces.",
+ "C. Two pieces.",
+ "D. One piece."
+ ]
+ }
+ },
+ "RHdUFnIx": {
+ "video_id": "RHdUFnIx",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman with dark hair is shown playing a cello. She is wearing a yellow top and blue pants. She moves a bow across the cello's strings. Later in the video, a second woman is shown playing an instrument with a long neck, and a bow. She wears a white top and blue pants and is seated on a stool. The woman with the cello is also shown, and she and the other woman both move their bows back and forth across the strings of their respective instruments. At times the view changes back to a solo shot of the cellist or the other woman. The backgrounds of the video include patterned wallpaper, a wall with wooden panels, and tall black speakers.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What instrument is the woman wearing yellow playing in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Cello.",
+ "B. Xylophone.",
+ "C. Pipa.",
+ "D. Violin."
+ ]
+ }
+ },
+ "pprFtmJX": {
+ "video_id": "pprFtmJX",
+ "video_duration": "329s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A stainless steel pot sits on a white stovetop. The pot contains chopped white and red vegetables in liquid. A metal spoon is used to stir the contents. A hand holds several sprigs of a green herb over the pot, and pieces of the herb are dropped into the pot. A metal spoon is used to stir the contents again. A wooden cutting board with chopped green pepper and white garlic is shown next to the pot, then the spoon is used to add the green pepper and garlic to the pot. A bottle of dark brown liquid is held over the pot, and the liquid is poured into the pot. A hand takes a metal can from a sink, then two more cans from a table, and uses a can opener to open each can. The contents of one can, brown beans, are poured into the pot, and followed by the contents of the second can, red paste. The contents of the pot are stirred, then a bag of light-colored grains is cut open and the contents added to the pot. Water is poured into the pot, then the pot's contents are stirred again. Then a white seasoning is sprinkled on the contents of the pot, which are stirred, the lid is placed on the pot, and the burner is turned off. The lid is taken off, and the food is stirred again. The pot with food in it remains on the stove and the video ends.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What ingredient was added immediately after the ketchup in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Garlic salt.",
+ "B. White sugar.",
+ "C. Pigeon peas.",
+ "D. Black pepper."
+ ]
+ }
+ },
+ "dcSdBPEm": {
+ "video_id": "dcSdBPEm",
+ "video_duration": "38s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "A person flips in the air and lands on a mat. A woman in a white karate gi with a bonsai tree on the back stands between two men in karate gis. Two people spar in a karate studio. A person kicks another person in the face. Someone slides on their knees on a shiny floor. Three people practice karate kicks outdoors. A woman wearing boxing gloves practices with a male partner. A shirtless man kicks in the air. Several men perform karate kicks and one falls to the ground. Three people fight under a covered walkway. One person flips and lands on the ground. Two people fight near a swimming pool at an outdoor party. The title \"Cobra Kai V\" appears on the screen with fire and sparks surrounding it. Credit information appears on screen, listing video and music credits, and identifies the maker of the video as MACAMTV. A man in a red headband stands in front of Cobra Kai Karate. A close-up of a man\u2019s face. A close-up of a woman\u2019s face. A woman sits on a couch and talks. A woman stands with a coffee mug near a man seated at a table. A woman sits on a red couch. A close-up of a man\u2019s face.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is happening at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The venue is opening.",
+ "B. Preparations for a performance are underway.",
+ "C. A game is possibly about to start.",
+ "D. A gathering among friends."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "What is the person wearing a red band on their head doing in the latter part of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Complaining about the dojo.",
+ "B. Looking for a friend.",
+ "C. Boxing at the dojo.",
+ "D. Passing by the dojo."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "At the beginning of the video, the person wearing black clothes is fighting with several people wearing red clothes. How many people in red clothes are involved?",
+ "answer": "D",
+ "candidates": [
+ "A. 3.",
+ "B. 4.",
+ "C. 1.",
+ "D. 2."
+ ]
+ }
+ },
+ "OHnvoMwt": {
+ "video_id": "OHnvoMwt",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a white hoodie with black drawstrings and a red and white cylindrical object attached to the zipper is sitting down and facing the camera. The video then transitions to a close-up view of a playmat with an anime design, on which a person with a silver bracelet is placing various cards with colorful artwork and text on them. They place multiple cards onto the playmat, organizing them into stacks. The video concludes with more cards being placed on the playmat.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Did the deck of cards change in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The deck is collectively displayed after being categorized by type.",
+ "B. The deck is shown in batches based on type after being categorized by quantity.",
+ "C. The cards in the deck are placed on the table after being shown.",
+ "D. The deck is reshuffled and then placed on the table."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many cards appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Forty-two.",
+ "B. Twenty-two.",
+ "C. Forty-one.",
+ "D. Forty."
+ ]
+ }
+ },
+ "scKLkPwA": {
+ "video_id": "scKLkPwA",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "On a stage, two young boys in white short-sleeved shirts and dark pants stand and play violins. They face the audience. Each uses a music stand.\u00a0A third person sits at a grand piano, partially obscured by the piano itself, playing as well. Behind the performers, light-colored curtains hang, and a red banner with some Asian script and a circular logo is displayed. Red curtains flank the banner and main curtains. A projector screen displaying white is positioned on the left side of the stage. A person walks across the front of the stage. Another person walks across from right to left. The musicians continue playing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Which instrument produces the first sound in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A guitar.",
+ "B. A violin.",
+ "C. A piano.",
+ "D. A cello."
+ ]
+ }
+ },
+ "VODOJrbf": {
+ "video_id": "VODOJrbf",
+ "video_duration": "118s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with an overview of an interior area, with a group of people visible inside a central cylinder-shaped structure with transparent walls.Two figures wearing orange and red suits approach the central structure from the left side.A close up shows a metal object, a gun, and a circular shield with a star in the middle.A figure in an orange suit walks past.The scene shifts to the inside of the cylinder-shaped structure. A man with a red skull mask, two masked white figures, and another figure with an orange mask walk through an archway. The man with the red skull mask and the two white masked figures stand facing a figure in a red, white, and blue suit in the middle of the room.The man in the red, white, and blue suit turns his head. A close-up shows his face and the green mask he is wearing.The man in the red, white, and blue suit's head is shown from a different angle. A tube is attached to the side of his neck and then an orange fluid flows from a dispenser into the tube.A person in an orange-lined black suit holds up a tube containing orange fluid.The man with the red skull mask holds up the tube of orange fluid.The man in the red, white, and blue suit faces a different direction and the person in the red skull mask stands in front of him.The man in a red, white, and blue suit faces the camera.The white masked figure moves.The camera focuses on the face of the white masked figure.The man in the red skull mask and the white masked figure are both visible.The scene returns to an overview of the interior area. The man in the red skull mask and two masked figures are in the cylinder structure.Two figures in orange-lined black suits stand side-by-side.The two figures in orange-lined black suits are looking at a red screen with an image of the man in the red, white, and blue suit.The man in the red skull mask, and the white masked figure appear. 
The man in the red skull mask is speaking.The man in the red, white, and blue suit is in the cylinder structure and faces the man in the red skull mask.The man with the red skull mask and the man in the red, white, and blue suit face each other directly.The face of the man in the red, white, and blue suit is shown up close again.A close-up shows the man with the red skull mask approaching and a red gloved hand reaching for the face of the man in the red, white, and blue suit.A close-up shows the man in the red, white, and blue suit with his mask removed.The man with the red skull mask holds up the green mask.The man with no mask is shown again, with black gloved hands attaching wires to his head.The man in the red, white, and blue suit is shown from another angle, in the same situation as before.A wide angle of the room shows the man with no mask in the cylinder structure.A figure in a black suit with red lines presses a button on a machine. The man with no mask begins to struggle.A screen shows an overview of the structure, then a closer view of the man with no mask who is visibly struggling.The man in the red skull mask watches and then a close-up of his face.The video ends with a screen that reads \"Thanks For Watching Feel Free to Subscribe, Share Like, and Comment\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action did the Red Skull take against Captain America in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Red Skull knocked Captain America unconscious.",
+ "B. Red Skull discarded Captain America's weapon.",
+ "C. Red Skull removed Captain America's mask.",
+ "D. Red Skull took away Captain America's abilities."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the soldier next to Captain America do with the syringe in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Extract Captain America's serum.",
+ "B. Administer medication to Captain America.",
+ "C. Inject Captain America with Red Skull's serum.",
+ "D. Inject Captain America with Taskmaster's serum."
+ ]
+ }
+ },
+ "ZRcNsILH": {
+ "video_id": "ZRcNsILH",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man with curly brown hair and a beard sits on a low stool and plays a cello. He wears a light beige, short-sleeved, open-collared shirt and dark shorts. He holds the bow in his right hand and moves it across the strings, while his left hand fingers the strings. Beside him, lying on the ground, is a black dog wearing a collar. They are outside on a gravel area next to a white house with a yellow-trimmed window and potted plants below the window. The house sits within a grassy landscape dotted with some trees and shrubs. A body of water is visible in the middle ground and a mountain range in the background under a blue sky with white clouds. The man and dog remain in these positions for the duration of the video as he plays.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the dog located in relation to the man in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. On his rear left side.",
+ "B. In front of him.",
+ "C. Behind him.",
+ "D. On his rear right side."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion does the entire piece of music convey?",
+ "answer": "D",
+ "candidates": [
+ "A. Mellow and graceful.",
+ "B. Majestic and heroic.",
+ "C. Joyful and positive.",
+ "D. Deeply emotional and sorrowful."
+ ]
+ }
+ },
+ "fufIXctz": {
+ "video_id": "fufIXctz",
+ "video_duration": "138s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A yellow train car is shown on train tracks in front of a building. The words \"Spencer's Demotion\" appear on the screen over the image. The view switches to two trains, one green and one blue, parked in open train bays. The words \"Narrated By: TheBuriedTruck\" appear on the screen over the image. The camera pans out to show three trains inside train bays: one red, one green, and one blue. The camera switches to a train moving along tracks outside. The train is red and has the number 5 on its side. The view blurs. Then the view is focused on the wheels of the red train moving on the tracks. The view goes back to the green train inside its bay, then shows the two trains again. The camera then focuses on the red train again, which is now in its bay. Then the blue train is shown alone in its bay. Then the three trains are shown together in their bays. A man in a black suit and top hat is seen from behind in front of the bays. The same man is then shown from the front standing on the train tracks. The view returns to the blue train in the bay, looking out at the man on the tracks, and then goes back to the man on the tracks. The camera cuts to the three trains in their bays with the man now in front of them with his back to the viewer. Then it cuts back to the man from the front on the tracks. Then the three trains are shown in their bays, with the camera panning from left to right across them. The view blurs. The camera focuses on a silver train in a train bay at sunset. Finally, the camera pans out from the train into the night, then the video fades to black.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between Spencer and James?",
+ "answer": "D",
+ "candidates": [
+ "A. Superior-subordinate relationship.",
+ "B. Stranger relationship.",
+ "C. Competitor relationship.",
+ "D. Colleague relationship."
+ ]
+ }
+ },
+ "ISjHoLoB": {
+ "video_id": "ISjHoLoB",
+ "video_duration": "337s",
+ "duration": "4-6min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with white text over a white background reading \"BOOK ONE: WATER\" above the text \"CHAPTER ONE: THE BOY IN THE ICEBERG,\" which appears below the previous text. Above \"BOOK ONE: WATER\" is an emblem made up of three black lines. This transitions to the text \"AVATAR: THE LAST AIRBENDER\" in orange boxes on the top of the screen over a white background. Directly below that are the words \"FIRST 5 MINUTES\" inside a similar orange box. Below the box is the original text, emblem, and background. The background then changes to a blue, icy-looking backdrop. As the background changes, the emblem disappears, and the two orange boxes become smaller and stay on the top of the screen. The background then zooms into a cloudy blue landscape as the screen is moving, as though underwater. Small, blurry images are then seen in a school of fish or other small creatures. Then a blue background with what appears to be icebergs moves into view. The screen then shows the icy landscape in detail. As the screen moves forward, a small kayak or boat is shown with two occupants. A closer shot of the boat and occupants follows. The first occupant has dark brown hair tied up in a topknot. They are wearing a blue coat with white trim. The second occupant also wears similar clothes, but their hair is not in a topknot. The occupant with the topknot holds a white staff. This is followed by a shot of only the occupant with the topknot, as they lift their staff. The second occupant looks down into the water with a concerned look. A figure is shown beneath the boat in the water, then the second occupant looks shocked and concerned toward the first occupant. A large flash of light appears around the iceberg and a light blue whale-like creature flies out of the light. The second occupant looks back in shock, and the first looks toward the light source with an open mouth, then closes their mouth. 
The first occupant has a different facial expression, as they show their hand out and look forward with squinted eyes. The second occupant raises their arms in alarm, then it returns to a shot of the boat moving. The first occupant has a hand on a staff while looking forward with wide eyes, and the second occupant has an arm raised in the air. The second occupant then looks toward the first with their eyes wide open. They show a close up of both their faces in the boat, as the first occupant looks confused. A light blue wave starts to come closer to their boat. The boat is then shown in the distance as it passes by an iceberg. A close up of the faces of both the occupants shows that they are screaming in panic, and then the boat tips and throws them into the water. They are seen lying on an iceberg and both get up and sit. The first occupant puts their hand to their forehead. They look to the side with wide eyes, and the second occupant crosses their arms and closes their eyes. The two occupants are then seen in the distance on the icebergs. They are then shown again in the distance on an iceberg, while one is standing and the other is sitting. The occupant that is standing raises both arms and waves them. A large flash and crack of light is seen above the iceberg, and both are shown in a close up. The first occupant has closed their eyes and the second is waving both arms in the air. The first occupant then puts their hands to their face. The second occupant then points their finger to the side while looking angrily at the first. The second occupant then looks down with their mouth open, as the second occupant is shown next to them with wide eyes and they look frightened and their eyes move from left to right. Then, they are shown in the distance, as they watch a white light forming an expanding circle on an iceberg. 
The light then explodes from the iceberg, as the two occupants are shown close up, as though the light is surrounding them, and they are now all light blue. The two occupants are shown side by side, looking concerned at a light in the distance. A light blue figure is then shown floating, then the first occupant looks at the second, as they raise their hand. The second occupant follows suit and they both grip each other's hands, as the floating figure is shown with both hands clasped in front of their body. The two figures are shown next to each other looking afraid and then holding the staff they had used earlier. The floating figure then is seen again surrounded by blue light. The two occupants then grip the white staff together and pull it away from the camera, as water rushes around them. The white figure is shown one last time before a black screen appears. Lastly, the video goes to an orange background with various colorful cartoon characters. The word \"SUBSCRIBE\" is on the left and \"WATCH MORE\" is on the right, with the words \"CHECK OUT MORE VIDEOS ON THE NICK REWIND CHANNEL\" in a white box. The word \"nickrewind\" is in purple in a circle next to the word \"SUBSCRIBE\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What might be the relationship between the boy and the girl in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Friends.",
+ "B. Brother and sister.",
+ "C. Strangers.",
+ "D. Father and daughter."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many distinct voices can be heard in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. Five.",
+ "C. Three.",
+ "D. Four."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In which scene did the boy and girl have their second argument?",
+ "answer": "A",
+ "candidates": [
+ "A. On an ice block.",
+ "B. On the shore.",
+ "C. In the water.",
+ "D. On the boat."
+ ]
+ }
+ },
+ "AiUECjgF": {
+ "video_id": "AiUECjgF",
+ "video_duration": "480s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A white dish holding two small bowls of sauces, one creamy yellow and the other a reddish-brown, sits on a brown wooden table. A black counter runs horizontally in the background. A person wearing a white chef's uniform and gloves wipes down a large flat top grill surface. The person sprays a liquid onto the grill, spreading it with a spatula. The person tosses the spatula into the air and catches it. A cylindrical container of oil is used to make a circle of oil on the grill. The person uses the spatula to spread out the oil, and then sprinkles some powder on the grill. Next, the person applies more of a powder which causes a large flame to erupt. The flame is followed by the application of other ingredients to the grill. A person puts sliced zucchini onto the grill, arranging them in a row. Sliced white onions are placed on the grill along with the zucchini, and small pieces of another food. The person uses the spatula to move and separate the food and then adds a liquid from a red bottle. Another liquid from a white bottle is put on the food. Another flame erupts, and is then extinguished, with smoke rising from the grill. The person then uses the spatula to mix the food on the grill.Cooked pieces of shrimp are put onto a white plate. Chopsticks are used to pick up a piece of shrimp and dip it into the creamy yellow sauce, which is then eaten. A plate of cooked zucchini, onions, and a white salad, sits on the table.A person in chef's garb scoops a pile of what appears to be small pieces of meat or chicken onto the grill surface. The person also adds mixed chopped vegetables to the grill along with several eggs. The eggs are scrambled in the bowl with the spatula, and then poured onto the grill next to the vegetables. A rectangular dish of cooked white rice is put onto the grill surface. The person uses two spatulas to cook the mixture of rice, vegetables and eggs. 
The person then mixes this mixture with the pieces of meat/chicken.A plate holding the mixture of rice, eggs, and vegetables with cooked shrimp is placed on the table. A person picks up pieces of the shrimp from a separate plate, and dips it in the white sauce. They take a bite and chew with their mouth open.A new plate of food, consisting of grilled mushrooms, a lobster tail, a small mass of what appears to be crab or fish, and a grouping of white scallops. A liquid from a red bottle is applied to the crab/fish, followed by a mixture of the various grilled items. The food is then transferred to several plates. One plate shows a close-up of the cooked meat, shrimp, and onions. A separate plate holds a mixture of grilled white scallops and onions, along with cooked zucchini. A white bowl of salad sits next to the plate, and a person eats the salad using a fork. Another plate holds a portion of fried rice. A person eats the fried rice with a fork. Additional footage shows the person in the grey shirt with the close ups of the food. The plates on the table show various mixtures of food. The plates include steak, shrimp, rice, vegetables, and white and brown sauces. The footage then shows the person in the grey shirt eating pieces of steak, dipping them in the sauce. There is then additional close up footage of the fried rice and also some of the shrimp. The video ends by showing the person eating the final bites of his meal.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the chef flip the eggs in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. 14.",
+ "B. 13.",
+ "C. 15.",
+ "D. 16."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of steps in the video for making fried rice? (a) Beat the eggs in a bowl and stir, (b) Pour the vegetable grains on the iron pan, (c) Pour the egg liquid on the iron pan, (d) Pour the rice on the iron pan.",
+ "answer": "C",
+ "candidates": [
+ "A. (a), (c), (b), (d).",
+ "B. (d), (a), (c), (b).",
+ "C. (b), (a), (c), (d).",
+ "D. (d), (b), (a), (c)."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the short-haired man holding in the video to display the fried rice?",
+ "answer": "C",
+ "candidates": [
+ "A. A spatula.",
+ "B. Chopsticks.",
+ "C. A fork.",
+ "D. A spoon."
+ ]
+ }
+ },
+ "NBhvskGP": {
+ "video_id": "NBhvskGP",
+ "video_duration": "335s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins at night on a residential street. A person wearing a dark hoodie appears, then lowers the hood. Two cars are visible parked on the street. The person turns, revealing another person with a beard, also wearing a dark shirt. They appear to be having a conversation, though the audio is not included in the description. The person in the hoodie gestures and begins walking towards a car with a camouflage paint job in dark gray, white, and blue. The other man walks around the camouflaged car, briefly pointing to it, then walking towards another car. The person with the hoodie shows the back of the camouflaged car, then moves to the trunk. The trunk opens to reveal a pair of red Crocs, a duffle bag, a folded dark-colored blanket or jacket, and a white box. The person filming puts a phone with a lit screen into the trunk, then closes the trunk.The two people are again seen together, and another person in a blue hoodie with a white Nike logo and the word \u201cBEAST\u201d appears. The person filming walks back towards the camouflaged Dodge Charger, showing its paint job in detail. The video cuts to the two Dodge Chargers on another street. The camouflaged car has \u201cLaramy\u201d written on the front windshield. The other Dodge Charger is solid blue. Both appear to be preparing to race. The video then cuts to the inside of the camouflaged car for the first race. After a brief period, the video cuts back to the street scene, showing the back of the camouflaged car and the man with the beard, who appears to be on the phone. The video cuts to him walking down the street and showing his phone to the camera. It contains messages in a chat interface. He puts the phone in his pocket. The camouflaged car is visible next to the person filming. The person touches the roof of the car, which has a GoPro camera mounted on top. The video cuts to a man in a dark hoodie running up to the camera. 
The video then cuts to the inside of the camouflaged car for the second race. After a brief period, it cuts back to a view of two blue Dodge Chargers parked at a gas station. The camouflaged car\u2019s hood is open, revealing the engine. The person wearing the \u201cBEAST\u201d hoodie has a conversation with the bearded man near the cars. A third person with an afro-style haircut approaches and looks at the cars. The person filming shows the camouflage car\u2019s engine in more detail, moving closer to the air intake system. They close the hood and show the side and rear of the car once more. The young man with the afro films the solid blue Dodge Charger\u2019s rear lights, then moves to the driver\u2019s side. He films the interior briefly. The person with the \u201cBEAST\u201d hoodie walks around the blue Dodge Charger and shows the front of the car. They are again seen together near the cars.The video then cuts to the inside of the camouflaged car for a third race. It then returns to the gas station scene, with the camouflaged car\u2019s hood still up. The men examine the camouflage pattern on the hood of the car. The person with the \u201cBEAST\u201d hoodie films the side of the camouflage Dodge Charger, and the man with the beard counts money. They film the camouflaged car again.The video then cuts to the fourth race, which now involves three cars. The two Dodge Chargers are lined up next to a silver car. The man with the afro-style haircut counts down, and the cars race. The camera stays on the street as the cars speed away. It then cuts to the inside of the camouflaged car. The person filming looks around the car's interior and shows the digital display on the dashboard. The video then cuts to the fifth race.Afterward, it cuts to the man wearing the black hoodie sitting in a car.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the man with a beard holding money in the middle part of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. He got the money from a robbery.",
+ "B. He won the money in a competition.",
+ "C. He found the money on the ground.",
+ "D. He took out his own money and said he was going to eat."
+ ]
+ }
+ },
+ "hYXTtYsk": {
+ "video_id": "hYXTtYsk",
+ "video_duration": "193s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a view from a kayak on a lake. The person in the kayak is wearing camouflage water shoes and has a styrofoam cup next to them. They are holding a fishing rod. Another person paddles by in an orange and black kayak. The camera pans to show the lake and the treeline on the far shore. The person in the orange kayak casts their line. The camera person casts their line towards the shore near some downed trees and other debris. After a while, they reel in their line and cast again. The person in the orange kayak paddles over. The camera person reels in their line again, lifts the rod, and brings it in closer to their body. A fish splashes on the surface of the water near their line. The camera person reels in the fish and scoops it into a net. The orange kayak paddles by again. The video shows the person in the kayak reeling in their line and casting again. A fish splashes the water near the bobber on their line. They reel in the fish and lift it into the net. The video ends with a still image that reads, \u201cWilkins\u2019 Outdoor Adventures\u201d and below that, \u201cWork Hard - Play Harder\u201d.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the float located in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. In shallow water.",
+ "B. On the water's surface.",
+ "C. In deep water.",
+ "D. At the middle water level."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Where does the animal sound in the video come from?",
+ "answer": "A",
+ "candidates": [
+ "A. The sky.",
+ "B. Underwater.",
+ "C. The surface of the water.",
+ "D. The ground."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotion does the adult male display in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Frustration.",
+ "B. Anger.",
+ "C. Sadness.",
+ "D. Joy."
+ ]
+ }
+ },
+ "dMgNJpGm": {
+ "video_id": "dMgNJpGm",
+ "video_duration": "103s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person brushes melted butter onto the bottom of a cast iron skillet with a pastry brush. Text appears on screen that says \"STEP 3: Roll, cut and bake at 375 degrees for 35-40 minutes.\" A person sprinkles flour onto a marble countertop. Dough is poured from a glass bowl onto the floured surface. More flour is sprinkled on top of the dough, and the person pats the dough with floured hands. Using a wooden rolling pin, the person rolls out the dough into a thin rectangular shape. Melted butter is brushed onto the rolled dough. A brown sugar and cinnamon mixture is sprinkled over the buttered dough. The person gently pats the cinnamon sugar into the dough, then folds the edge of the dough closest to them over the cinnamon sugar mixture by a small amount and begins rolling the dough up into a log, using a bench scraper to assist. The person cuts the dough log into rolls using a large knife. The rolls are placed cut-side up inside the prepared cast-iron skillet. The skillet filled with cinnamon rolls is placed in the oven.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is flour sprinkled on the table at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. To make dough.",
+ "B. Because it is considered fun.",
+ "C. To prevent the dough from sticking to the table.",
+ "D. To clean oil stains from the tabletop."
+ ]
+ }
+ },
+ "eLDjltll": {
+ "video_id": "eLDjltll",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person dressed in a red and blue suit with a hood walks on a tan box and then onto a counter. A man in a suit is standing nearby. Then the person in the suit jumps onto a table and falls onto the floor. The person lays on the ground, and then gets up. The person moves towards the center of the room where small cartoon-like creatures are standing. The person then walks back to the area with tables, and climbs back on to the table.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does Spider-Man fall in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Once.",
+ "B. Zero times.",
+ "C. Three times.",
+ "D. Twice."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What does Spider-Man do while standing on the circular table?",
+ "answer": "D",
+ "candidates": [
+ "A. Eats.",
+ "B. Dances.",
+ "C. Sleeps.",
+ "D. Performs a backflip."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of Spider-Man's action practice in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. First attempt a handstand, then perform a backflip, followed by a second backflip, and finally practice the worm dance.",
+ "B. First perform a backflip, then attempt a handstand, next practice the worm dance, and finally do a second backflip.",
+ "C. First practice the worm dance, then perform a backflip, followed by a second backflip, and finally attempt a handstand.",
+ "D. First perform a backflip, then do a second backflip, next practice the worm dance, and finally attempt a handstand."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "After performing two backflips, how does Spider-Man's emotional state change?",
+ "answer": "C",
+ "candidates": [
+ "A. Constant disappointment and anxiety.",
+ "B. From disappointment and anxiety to excitement.",
+ "C. From excitement to disappointment, and then to anxiety.",
+ "D. Constantly excited."
+ ]
+ }
+ },
+ "FLgWECZZ": {
+ "video_id": "FLgWECZZ",
+ "video_duration": "99s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a dark suit is standing on a red stage in a theater. There is an audience in the seats. He is facing towards the audience and then a woman walks onto the stage wearing a black shirt, black skirt, and a white and black striped jacket, they shake hands. Then a man wearing a dark jacket and pants with a white graphic T-shirt joins them and shakes hands with the first man. The second man presents a blue triangular object to the woman and they both hold it. The woman then speaks into a microphone. Then the man in the white graphic T-shirt and another man in a dark jacket with a white graphic T-shirt are on stage with the first man and the woman and shake hands with her. The audience applauds.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who said in the video that \"the Jury Prize will be awarded to a film that moved us all deeply, but may not be considered the best feature? Let\u2019s give a warm round of applause for Margot Dijan!\"?",
+ "answer": "C",
+ "candidates": [
+ "A. A man in a suit.",
+ "B. A man wearing a black T-shirt with white lettering.",
+ "C. A woman with red curly hair.",
+ "D. A bald man in a white short-sleeved shirt."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the bald man wearing a suit appear?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the end of the video.",
+ "C. At the beginning of the video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the applause occur in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Two.",
+ "D. One."
+ ]
+ }
+ },
+ "mMcrXfup": {
+ "video_id": "mMcrXfup",
+ "video_duration": "115s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video contains a series of shots of ponds and landscaping projects. In the first shot, multiple fish swim in a pond. The camera shows a close-up shot of many koi fish. Another shot shows a man wearing a gray shirt and a hat, standing outside in front of a building as a yellow excavator lifts a stone slab. The video then shows a man with a beard swimming in a pond with koi fish, followed by another man holding a plant in front of a landscaped waterfall. The video then features another man with a red beard and a grey shirt standing in front of a pond being constructed by a group of men. A man with a beard wearing a grey jacket stands in front of a pond with lily pads in it. The logo \"TEAM Aquascape\" is briefly displayed, followed by an animated graphic of a frog with a dragonfly in its mouth. A man with a red beard and a gray shirt is then shown speaking directly to the camera. The camera then pans through a yard and patio area, highlighting the various trees and plants. In one shot a green plant is held in the hand. The last shot features a pink tree in the garden.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Which plant is shown last in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Saxifrage.",
+ "B. Louisa crabapple tree.",
+ "C. Syringa tree.",
+ "D. Beech tree."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "Which type of plant did the man in the video touch?",
+ "answer": "B",
+ "candidates": [
+ "A. European beech tree.",
+ "B. Milkfruit tree.",
+ "C. Saxifrage.",
+ "D. Louisa crabapple tree."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will the mammee apple tree grow large enough to shade the pavilion?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. It will.",
+ "C. It will not."
+ ]
+ }
+ },
+ "QIouYHhc": {
+ "video_id": "QIouYHhc",
+ "video_duration": "187s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person with dark curly hair pulled into two buns on top of their head and a beard is sitting at a table wearing a white t-shirt with the words \u201cAPPLEHEAD ALEX\u201d in yellow letters. Their hands are clasped in front of them. They look to the side, then back forward. Their mouth moves open and shut, as if speaking, but no sounds are heard. They lift their hand and make a gesture, and move their hands in front of them as if explaining something. They raise their eyebrows and their mouth opens wider. The image changes. A person wearing a police uniform with a dark hat, a dark shirt with the word \u201cCOP\u201d on it, and a shiny police badge on the arm is holding a red flashlight, and then is shown speaking. The image cuts back to the first person wearing the white t-shirt with \u201cAPPLEHEAD ALEX\u201d and making hand gestures, then cuts to a black screen. The next scene shows the person in a police uniform talking, with a person with dark curly hair in buns with a unicorn horn headband on, wearing a black t-shirt with \u201cLAYLA FONSWAY\u201d on it at a table. The person in the police uniform holds a red flashlight. The person with the unicorn headband speaks with their hands and makes facial expressions, looks upward and their mouth opens and shuts. The person in the police uniform continues to hold the flashlight and speak. The person with the unicorn headband has their hands clasped and their mouth open. They make various hand gestures while talking. Their arms are crossed on their chest. The person in the police uniform speaks and turns their head and shines the red flashlight on the person with the unicorn headband. They make a variety of faces and hand gestures while speaking. The image changes, and a person wearing a black durag and a navy blue shirt with \u201cTOMMY HILFIGER\u201d on it talks to the camera, with their hands together. The person in a police uniform is shown again and is speaking to the side. The person with a black durag on is shown again talking to the camera and gesturing with their hands. The person in a police uniform is shown turning to look to the side, and then pointing and gesturing to the side. The person with pink and yellow buns in their hair and a grey shirt with a cartoon on it says something, and the person with the unicorn horn is shown again. They speak to the side with crossed arms, and then with gesturing hands. The person in a police uniform gestures and speaks. The person with the black durag on talks and points to themselves. The person in the police uniform is shown again, speaking to the side, and then they put their hand up. The person with pink and yellow hair buns speaks and looks to the side. The person with the black durag on talks and makes faces. They are shown making crying motions. The person in the police uniform is shown gesturing and moving to leave. The person with the black durag on is crying and rubbing their face. The person with the unicorn headband talks with their hands together. They look to the side. The person with the black durag on is crying and rubbing their face. The person in the police uniform is shown again. The person in a black durag is crying and rubbing their face, then gesturing. The person in the police uniform puts their hand up to the side. They are shown leaving the scene. The person in a black durag is shown crying and rubbing their face.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "In the latter part of the video, how does the emotional state of the man wearing a black hat change?",
+ "answer": "B",
+ "candidates": [
+ "A. He changes from sadness to surprise.",
+ "B. He changes from surprise to sadness.",
+ "C. He changes from indifference to happiness.",
+ "D. He changes from happiness to enthusiasm."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the water bottle with the red lid relative to the police in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The water bottle is on the right side of the police.",
+ "B. The water bottle is on the left side of the police.",
+ "C. The water bottle is in front of the police.",
+ "D. The water bottle is behind the police."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the girl with a hairband and the man in the black hat in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Mother and son.",
+ "B. Father and daughter.",
+ "C. Sister and brother.",
+ "D. Father and son."
+ ]
+ }
+ },
+ "QNdjTPvE": {
+ "video_id": "QNdjTPvE",
+ "video_duration": "114s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man wearing a motorcycle helmet high-fives another man. A man in a baseball cap holds a motorcycle part while two other men stand near him. Various motorcycle parts are laid out on a white cloth. The screen briefly displays static, followed by a title card that says \u201cMoonshine Harley-Davidson Shop Talk.\" A color bar test pattern appears, then a close-up of a man with a goatee is shown. The shot changes to a close-up of a different man speaking. The color bar test pattern reappears. A man holds what appears to be a cordless power tool. The color bar test pattern appears again. A man speaks to the camera in a motorcycle shop. A black Harley-Davidson motorcycle is shown. A close-up of the motorcycle engine is shown, focusing on a round piece of metal with \"103\" engraved on it. The man continues speaking.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual behavior does the man wearing glasses display in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Holding a drill, he points at the man with a nose ring.",
+ "B. Holding a wrench, he points at the man with a nose ring.",
+ "C. Holding a phone, he points at the man with a nose ring.",
+ "D. Holding a motor, he points at the man with a nose ring."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why hadn't the man with glasses previously recorded or shown a video of a dual-cylinder engine modification?",
+ "answer": "A",
+ "candidates": [
+ "A. Because the modifications before were simple and rough, not detailed ones.",
+ "B. Because there was no additional time to record and edit the video content.",
+ "C. Because there were no suitable devices for recording and editing.",
+ "D. Because the customer did not request documentation or display of the modification process."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What components does the man with a nose ring in the video use to increase engine RPM?",
+ "answer": "B",
+ "candidates": [
+ "A. Factory fuel system pump and high-compression-ratio motor.",
+ "B. Factory fuel system pump and wooden three-five camshaft.",
+ "C. Full exhaust system headers and high-compression-ratio motor.",
+ "D. Wooden three-five camshaft and high-compression-ratio motor."
+ ]
+ }
+ },
+ "ZATdpNTJ": {
+ "video_id": "ZATdpNTJ",
+ "video_duration": "393s",
+ "duration": "6-8min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with short hair and a goatee, wearing a short-sleeved black shirt with white designs and tan pants, stands on a stage and speaks, while different images are projected on a screen behind him. The first image shows autumn leaves on a tree. This image changes to an abstract watercolor painting with orange and blue colors, then to a painting of a person\u2019s head and shoulders against a dark background. The face is orange, and an orange streak extends down from the eye like a tear. Red, white, and blue spots are scattered on the right side of the image. The screen then shows a black and white photograph of an artist\u2019s workspace in front of large windows. A long, curved horn, along with a stone or piece of wood, is on the windowsill. An assortment of paintbrushes is in a cup near the center of the table, along with small boxes and tubes of painting materials. A large mesa or plateau is visible outside the window. The screen next displays a photograph of a brick archway or tunnel, extending into the distance. This image is replaced by a photograph of a cluttered artist\u2019s workspace, captioned \u201cFrancis Bacon\u2019s Open Creative Mess,\u201d and then by a photograph of a writing desk with a quill and inkwell, captioned \u201cJane Austen\u2019s Writing Desk.\u201d A black and white street scene, including a bus parked on the side of the road, is then projected. Two young children in dresses stand in the foreground, near the edge of the screen. Several men wearing white, short-sleeved shirts and dark-colored pants crouch near the center of the image. The screen then shows a man in a gray robe sitting cross-legged on the floor. Behind the man is a wall with a pattern of square panes, similar to shoji screens. Finally, a ginkgo leaf appears on the screen. The leaf is yellow, with some brown coloring near the stem and on one edge of the leaf.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "With new insights into creative blocks, how will the education and training sectors evolve in the future?",
+ "answer": "B",
+ "candidates": [
+ "A. Greater emphasis is placed on fostering students' autonomy and cooperation.",
+ "B. Greater emphasis is placed on cultivating students' diversity and flexibility.",
+ "C. Greater emphasis is placed on fostering students' cooperation and perseverance.",
+ "D. Greater emphasis is placed on cultivating students' autonomy and perseverance."
+ ]
+ }
+ },
+ "qPyJMJre": {
+ "video_id": "qPyJMJre",
+ "video_duration": "410s",
+ "duration": "6-8min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A video game is being played. A person is visible in a small box in the upper right-hand corner of the screen. The perspective then switches to first-person gameplay, with a gun visible in the foreground. The player moves through several different indoor and outdoor environments. Throughout the video, other characters in the game are visible, and are often shot by the player. Occasionally, other objects or symbols are visible on the screen, including a like button and subscribe button. A scoreboard is visible at different times, showing the points and statistics of the different players. The screen shows a black and white test pattern with audio feedback. The video closes with a screen showing information about the player's \"Current Level,\" and then a screen that displays a list of user names. Finally, the video ends on a YouTube channel screen with a membership offer.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the man's highest consecutive win record while wearing a black baseball cap in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A thirty-nine-game winning streak.",
+ "B. A twenty-game winning streak.",
+ "C. An eighteen-game winning streak.",
+ "D. A sixteen-game winning streak."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Is there barking in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes, there is.",
+ "B. No, there isn't.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "gnSnBGYp": {
+ "video_id": "gnSnBGYp",
+ "video_duration": "497s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A young girl in pajamas holds a stuffed duck. She sits at a table with an empty plate and fork. A woman in a purple shirt speaks to the camera in a kitchen. Two young girls eat from bowls at a table. A man holds a young girl in a purple dress. He looks at cookies on a nearby counter. A man in a white Vans t-shirt hands a cookie to a young girl. A family sits at a table and eats cookies. A man in a black t-shirt eats a cookie. Partially eaten food sits in a red bowl with a wooden spoon. A person points at a t-shirt with a design of a man drinking coffee. A young girl holds an adult's hand and looks at something in the distance. A child's drawing of stick figures and hearts is shown. Floating lanterns are on the surface of water at sunset. A boat is in the background. Many floating lanterns are on the surface of water as people watch from the shore. A girl in a light-colored dress dances while another girl twirls a hula hoop. Children lie on a sofa. A woman walks across the room.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the little girl in the pink floral pattern use to finish her breakfast?",
+ "answer": "C",
+ "candidates": [
+ "A. A straw.",
+ "B. A spoon.",
+ "C. A fork.",
+ "D. Chopsticks."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why did the man wearing glasses leave with two children?",
+ "answer": "C",
+ "candidates": [
+ "A. Because they are going swimming.",
+ "B. Because they are going shopping.",
+ "C. Because they are going to the restroom.",
+ "D. Because they are going to drive."
+ ]
+ }
+ },
+ "vfneWNFi": {
+ "video_id": "vfneWNFi",
+ "video_duration": "171s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man in a blue t-shirt and black shorts stands in an art gallery, holding a golf club and dancing. He removes his t-shirt. A group of people dance in the gallery. Two people hug. A man with a white beanie and white shirt dances. He has red tape wrapped around his torso that reads \u201cDANGER.\u201d He has a small red bin in one hand. A man and a woman stand next to a white structure with \u201cTHE GLORY\u201d spray-painted on it in green. A woman dances and takes photos with a cell phone. A man wearing a black fedora and green t-shirt takes photos. Several people dance in the gallery. A group dances, and they lift a person in the air and carry them around the gallery. A person carries a tray of lit cupcakes. A man with a hat and a green t-shirt blows out the candles on the cupcakes. People cheer and clap. The person who blew out the candles and the person holding the cupcakes kiss.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "Why does the first man in the video take off his clothes?",
+ "answer": "D",
+ "candidates": [
+ "A. Because his clothes are wet.",
+ "B. Because he feels hot.",
+ "C. Because he is trying on clothes.",
+ "D. Because he is attending a party."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion is conveyed by the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Joyful atmosphere.",
+ "B. Serious and contemplative.",
+ "C. Calm and relaxing.",
+ "D. Mysterious and captivating."
+ ]
+ }
+ },
+ "MeNYLOIW": {
+ "video_id": "MeNYLOIW",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Four people play flutes in front of a larger band. They wear black t-shirts with a gold and white design and black pants. They have music stands with sheet music and microphones in front of them. A person in a red t-shirt walks past. Two people in white dresses walk by. A woman carries a light-brown umbrella. A man in a blue t-shirt walks up to the flute players and appears to conduct them. He moves his arms, points, and gestures. Other members of the band behind them move instruments around the stage. People walk around in the background.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What are the identities of the three women in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Music teachers.",
+ "B. Company employees.",
+ "C. Volunteers.",
+ "D. Club members."
+ ]
+ }
+ },
+ "FltEpGtA": {
+ "video_id": "FltEpGtA",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman and a man sit side by side, playing ukuleles. The woman is on the left side of the frame, wearing glasses and a dark gray sweater or jacket. Her ukulele is a light, possibly natural wood color. She strums and fingers the chords throughout the video. The man is on the right, wearing glasses and a black, long-sleeved shirt. His ukulele is a darker, reddish-brown wood. He strums and fingers chords throughout the video, occasionally looking over at the woman's playing. Two microphones, one with a blue band around its base, are positioned in front of the instruments. They are near a wooden door and a wooden shelving unit with various framed pictures and objects on the shelves.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur in the sound of the ukulele in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. It gradually becomes faster.",
+ "B. It remains consistently slow.",
+ "C. It transitions from fast to gradual.",
+ "D. It remains consistently fast."
+ ]
+ }
+ },
+ "zetwUmDH": {
+ "video_id": "zetwUmDH",
+ "video_duration": "83s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows three adults outdoors during the day. One is wearing a white tank top and a patterned wrap skirt, and one is wearing a white top with an image on it and a blue wrap skirt. The third is wearing a multi-colored pattern dress. In the first shot, the individual in the white tank top is speaking, gesturing with their hands, and then touching their ears. A second individual is visible behind them, with a serious expression, and then a third individual appears behind the first individual in the white tank top. Next, a fourth individual appears wearing a white collared shirt with a red and black diagonal stripe. This individual is speaking and gesturing with one hand. The camera moves back to the first three people, and they appear to speak to each other, after which the individual in the white tank top ducks out of view. The individual in the white collared shirt is visible standing near a white car, before the camera again shows the first three individuals. The camera then shifts to the four individuals shown previously. Then, the first three individuals appear to approach the car, after which the camera cuts to the individual in the white collared shirt, who is now sitting in the driver\u2019s seat, gesturing towards the other individuals, who are speaking to him. Finally, the camera shows the three individuals; two are standing and one is sitting on a bench. The two standing are speaking, and at the end of the video, the camera shows all three again.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, in what order do the following events occur? (a) The woman with her hair tied up invites the man in a short-sleeved shirt to enjoy some food. (b) The man in a short-sleeved shirt wants to take the woman with loose hair away. (c) The man in a short-sleeved shirt drives away in a car. (d) The woman with her hair tied up and the woman with loose hair stand together to discuss what to do.",
+ "answer": "B",
+ "candidates": [
+ "A. (a)(b)(c)(d).",
+ "B. (b)(a)(c)(d).",
+ "C. (b)(c)(a)(d).",
+ "D. (c)(a)(b)(d)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the personality of the man wearing a vest in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Kind and approachable.",
+ "B. Discriminates between rich and poor.",
+ "C. Warm and loving.",
+ "D. Kind and considerate."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the woman with her hair down and the woman with her hair tied up in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Relatives.",
+ "B. Mother and daughter.",
+ "C. Friends.",
+ "D. Sisters."
+ ]
+ }
+ },
+ "GIfKJExJ": {
+ "video_id": "GIfKJExJ",
+ "video_duration": "136s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Mr. Bean, with two small cocktail umbrellas stuck in his hair, looks at a curly-haired man who sits next to him at a bar. The man talks, gesturing with his hands. Mr. Bean watches, tilting his head slightly and occasionally furrowing his brow. The other man continues talking, pointing his finger in the air and bringing it back down to his other hand. Mr. Bean pulls one of the umbrellas from his hair and starts picking at it, then watches the other man again. The curly-haired man brings his hands to his mouth, appearing distressed, and rests his forehead on the bar. He raises his hand and slams it down on the bar, scattering some peanuts from a nearby bowl. Mr. Bean stares straight ahead. The scene cuts to an aerial view of a street at night. Two men, now wearing black suits and yellow ties, walk down the street holding hands. One of them is Mr. Bean, still with the purple umbrella in his hair. They stop, look around, and continue walking. They swing their arms outward in exaggerated motions as they walk. They continue swinging their arms and skip together. The man next to Mr. Bean stumbles and almost falls, grabbing onto Mr. Bean for support. They look up at the sky. The scene cuts to a close-up of a full moon. The men resume their dance-walk down the street and pass a mailbox marked \"1150.\" The other man reaches out to touch some bushes. Mr. Bean grabs him and pulls him away. They both run, looking back, and enter a house with a short set of steps and a lit entryway. They continue dancing as they go inside. They go through a doorway flanked by lights and continue dancing.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes do the two people experience at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Surprised.",
+ "B. Suffering.",
+ "C. Resigned.",
+ "D. Happy."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "Why do two men have umbrellas sticking on their heads?",
+ "answer": "C",
+ "candidates": [
+ "A. There is no particular meaning.",
+ "B. Because they think it is entertaining.",
+ "C. Because they are drunk.",
+ "D. Because they are praying."
+ ]
+ }
+ },
+ "nxzOLWjJ": {
+ "video_id": "nxzOLWjJ",
+ "video_duration": "99s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a shot of a textured yellow object against a blurry blue background, and four people's heads visible in small boxes on the right side of the screen. The video then transitions to show a person's face close up, and then a hand holding colored, transparent sheets of plastic. Next, a man wearing a dark shirt is sitting at a table with various objects and turns a light on. He then picks up a small object and looks at it, followed by a shot of him in a darkened room, positioning various objects. The scene changes to a man in a dark room holding a small flashlight and a camera, looking down at colored sheets on the table. He points the flashlight towards the sheets and adjusts the camera, then there is a shot of a yellow object on a wall, with colored sheets on a table below. The scene changes to one where a blue light is used to highlight both a yellow object on a wall, and colored sheets on the table below, then another shot where a yellow object is floating in a clear box, with colored sheets on the table below. Next, a yellow object in a clear box is seen again, but with a blue light this time, followed by a shot where a yellow object in a clear box is seen again with a cloudy background, then a yellow object is seen in a clear box with a shadow behind it, and finally, the yellow object is seen again with a colorful light source. The camera shifts to a top-down view where colored sheets and yellow sponges are seen on a table, followed by a man putting colored sheets on a table with yellow lighting, then a view of three layered, colorful objects. Next, the video shows another textured yellow object with a purple background, and then a clear box containing a yellow object against a blue background. The video then shifts to a view of a man holding a Rubik's cube, sitting on a couch. This changes to a man taping red, blue, and yellow tape to a white background, with a Rubik's cube visible nearby. The video goes back to the man moving a blue strip of tape, then it cuts to a man using a camera to take a photo. Next, a man is using a laptop computer, then a video is seen on the computer. After that, there is a view of a Rubik's cube, then the same man from before is using a laptop computer again. Finally, the video shows a man using chopsticks to add rice to a pot on a laptop screen, followed by a man photographing the laptop with chopsticks on the screen. The video ends with the man using chopsticks to add rice to a pot on a laptop screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What principle does the first male in the video use during filming?",
+ "answer": "C",
+ "candidates": [
+ "A. The polarization of light.",
+ "B. The dispersion of light.",
+ "C. The refraction of light.",
+ "D. The diffraction of light."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the relative position of the glass plate and the sponge in the final product shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The sponge is above the glass plate.",
+ "B. The sponge is in front of the glass plate.",
+ "C. The sponge is below the glass plate.",
+ "D. The sponge is behind the glass plate."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes did the Rubik's Cube undergo in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. It was restored.",
+ "B. Its order was scrambled.",
+ "C. It was displayed.",
+ "D. It was dismantled."
+ ]
+ }
+ },
+ "taCZdrFA": {
+ "video_id": "taCZdrFA",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are playing instruments in a room. One person is seated and playing a guitar, and the other person is standing and playing a stand-up bass. Both of them have music stands in front of them. One wall of the room is made of brick, and another wall is painted white. There is a black speaker on the brick wall. A doorway is also visible on the brick wall.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What instrument is shown on the right side of the screen in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Guitar.",
+ "B. Ukulele.",
+ "C. Cello.",
+ "D. Double bass."
+ ]
+ }
+ },
+ "rXfMpPyy": {
+ "video_id": "rXfMpPyy",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three people are playing musical instruments. The person on the left is using a bow on a stringed instrument, while a tablet rests on a stand in front of them. The person in the middle is holding a large, lute-like instrument, plucking at its strings. The person on the right is playing an acoustic guitar. They are all standing in a room with a plain white wall. There are microphones on stands nearby. There is also a white table with a keyboard on it.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, from which direction is the erhu sound heard?",
+ "answer": "A",
+ "candidates": [
+ "A. From the left side of the image.",
+ "B. From the back of the image.",
+ "C. From the center of the image.",
+ "D. From the right side of the image."
+ ]
+ }
+ },
+ "lWfjIHMx": {
+ "video_id": "lWfjIHMx",
+ "video_duration": "76s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A Marine's leg is shown with a brace. A Marine launches a drone. Two Marines look at a computer screen, one of them touching the screen. A 3D printer is printing. A soldier looks through a scope with a green night vision lens. Marines crawl on the ground, pulling a wire. White smoke rises behind a barbed wire fence. A large explosion sends a cloud of dust and debris into the air. A soldier holds a rifle, silhouetted against the sunset. A Marine walks toward a mountain range capped with snow. A column of Marines marches on a dirt road. Two Marines look at and write on clipboards. A helicopter flies low over a building in dusty conditions. Two helicopters fly in formation close to the ground, creating dust clouds. Three fighter jets fly in formation over water. A pilot sits in the cockpit of a fighter jet. Marines operate small devices with screens. Two Marines look at their phones. A person in a hazmat suit works with equipment. A gloved hand scrapes a substance from a metal dish. A hand operates a control device with a screen and joystick. A small tan robot drives quickly across a sandy area. A pilot puts on her helmet. A crew member guides a helicopter landing on a tarmac. A Marine kicks a training pad held by another Marine. Waves crash against rocks. A person in camouflage looks through smoke. A soldier crawls under barbed wire in muddy water. Muddy boots jump from a vehicle. A diver swims underwater. Two soldiers in a helicopter look out over the ocean. A helicopter flies against a gray sky. A person leans out of a helicopter over a village with damaged buildings. A Marine hands a sack of supplies to a person in camouflage. Marines speed across water in inflatable boats. Several inflatable boats release smoke grenades over the water. Several personal watercraft travel quickly across the water. A soldier fires a machine gun from a helicopter. A soldier aims an artillery cannon. A Harrier jet lands on an aircraft carrier. The view from inside a cockpit as an aircraft approaches an aircraft carrier. Soldiers jump out of a helicopter. Soldiers fire rifles at a shooting range. The view from the cockpit as an aircraft flies over water at sunset. A Marine runs with a rocket launcher. Marines fire an artillery cannon. An explosion goes off. A helicopter flies above a beach. A tank moves forward with a soldier looking through its gun. Marines crouch behind rocks, looking off to the distance, as a helicopter flies past. Smoke rises from the desert. A helicopter fires flares. A Marine gathers a parachute on the deck of a ship. Marines walk through a barren landscape.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times do gunshots appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Zero.",
+ "B. Three or more.",
+ "C. One.",
+ "D. Two."
+ ]
+ }
+ },
+ "MWAuZEiu": {
+ "video_id": "MWAuZEiu",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A pipe band, dressed in kilts and vests, stands on a grassy field playing their instruments. Behind them are several vendor tents\u2014white, red, and green. Spectators sit in folding chairs in the background, watching the performance. The band members are arranged in rows, with bagpipers on the right, and drummers progressively filling in towards the left side. A single large bass drum is positioned in the center-back. One man in a black jacket and kilt stands to the left side, holding a piece of paper, possibly sheet music. As the video progresses, the camera very gradually pans slightly from right to left and zooms in a bit closer. The band continues playing throughout, and the arrangement of the musicians and the audience remains static.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Was the bagpipe continuously playing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "pEBaVsTr": {
+ "video_id": "pEBaVsTr",
+ "video_duration": "246s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view of a riverbank, with two fishing rods leaning against the ground. The camera pans to show the river, which is flowing quickly. A man in fishing gear then appears, holding a small fishing lure. He attaches a red bead to the lure. He then dips the lure in the river. The camera then shows a wider view of the river, with trees and some snow on the banks. The man reappears and attaches a float to the fishing line. Another man in fishing gear joins him, and they both look at the float. The first man casts his line into the river, and the float drifts downstream. The camera then shows a closer view of the river, with the float bobbing on the surface. The first man then reels in a fish, a steelhead trout. He holds the fish up for the camera, then releases it back into the river.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times do the fish open their mouths in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Twelve times.",
+ "B. Fifteen times.",
+ "C. Fourteen times.",
+ "D. Thirteen times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "Is there someone guiding another person on how to do something?",
+ "answer": "D",
+ "candidates": [
+ "A. I don't know.",
+ "B. Yes, someone is providing guidance.",
+ "C. None.",
+ "D. No, no one is providing any guidance."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct chronological order of the events shown in the video? (1) Casting the fishing rod (2) Teaching how to use the fishing hook (3) Releasing the fish (4) The fish biting.",
+ "answer": "C",
+ "candidates": [
+ "A. (1)(4)(2)(3)",
+ "B. (1)(2)(3)(4)",
+ "C. (2)(1)(4)(3)",
+ "D. (3)(1)(2)(4)"
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotion does the person in the video display?",
+ "answer": "B",
+ "candidates": [
+ "A. The person in the video appears anxious because the fish have not bitten for a long time.",
+ "B. The person in the video appears calm because their technique is highly skilled.",
+ "C. The person in the video appears joyful, feeling happy about their success.",
+ "D. The people in the video appear disappointed because they didn't catch any fish."
+ ]
+ }
+ },
+ "rsIdEXEM": {
+ "video_id": "rsIdEXEM",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three people are visible on stage in front of a large wooden wall. One person stands on the left holding a flute. One person sits at a black piano in the center, and one person stands on the right holding a bassoon. All three individuals are wearing dark clothing. There are two black microphone stands in front of the piano. The people appear to be playing their instruments. Another person enters and sits behind the piano player.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the flute playing continuously in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. I am not sure.",
+ "C. No."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who is playing the piano in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A woman wearing a strapless evening gown.",
+ "B. A man with short hair wearing a black top.",
+ "C. A woman wearing a sleeveless black top.",
+ "D. A man with braided hair wearing a black top."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At which part of the video does the sound of the bassoon first briefly disappear?",
+ "answer": "B",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. At the beginning of the video.",
+ "C. In the middle of the video.",
+ "D. Throughout the entire video."
+ ]
+ }
+ },
+ "LeFssVGH": {
+ "video_id": "LeFssVGH",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A woman in a patterned orange shirt and glasses raises her hand. A wide shot shows an island with a sandy beach and a large area of trees. Two men walk through a hallway and enter an office. The camera shows the side of a red boat with \"Griffith UNIVERSITY\" written on it. Three people are on the boat in the water. A man holds a device that reads 23.7 \u00b0C. The camera pans across a shore with several boats docked at the beach. A man in a blue shirt stands outside near the ocean. A market area with multiple people and produce is shown. A wide shot shows foliage and a building. A woman holds a leaf in a forest. A man cuts a branch of a tree. A man in a blue shirt looks at a large tree with hanging vegetation. Two people are in the water, with one on a raft. A woman with fins swims over coral. A hand with a glove reaches a coral. A clipboard underwater is shown. A man in a blue shirt speaks in front of a background of foliage.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will be done next in the video after examining the structure of the coral reef?",
+ "answer": "B",
+ "candidates": [
+ "A. Organize community education activities on coral reef protection.",
+ "B. Compare coral reef data from different years to assess changes.",
+ "C. Immediately start another unrelated research project.",
+ "D. Analyze the data from coral reefs and publish research papers."
+ ]
+ }
+ },
+ "pLyiJQzf": {
+ "video_id": "pLyiJQzf",
+ "video_duration": "228s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a slot machine screen with various symbols and numbers displayed in a grid. The screen is a rectangle with rounded corners and a dark background. The grid is 5x5 with the symbols in each cell changing rapidly. Many cells contain golden, circular icons with numbers on them. Other cells contain other symbols, including musical instruments, mountains, and tents. The text at the bottom says \"LAST SPIN! BONUS REELS IN PLAY\", and then \"3 SPINS REMAINING\", \"2 SPINS REMAINING\" or \"1 SPIN REMAINING.\" At various times the grid is replaced with text and numbers that read \"MEGA $38,706.24\", \"MAJOR $601.99\", \"MINOR $54.88\", \"MINI $17.94,\" and \"WINNER!\" At the end of the video the screen reads \"Jackpot Call Attendant HAND PAY $39,051.56\" along with images of coins flying. Text on the bottom of the screen changes throughout the video between \"PRESS TO SPIN!\" and \"1\u00a2\" and displays different credits and BET amounts. Another slot machine is partially visible on the right side of the screen throughout the video. A hand enters the frame briefly to press a button below the game screen. The final part of the video is shaky and shows the machine and an adjacent area. A text overlay reading \"Please subscribe to support my channel!\" appears on screen for several seconds.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the player apologize?",
+ "answer": "C",
+ "candidates": [
+ "A. Apologize for being too excited.",
+ "B. End recording during the game.",
+ "C. Stop recording when the game is won.",
+ "D. Suddenly shift the camera to the surroundings."
+ ]
+ }
+ },
+ "AHcfQMNu": {
+ "video_id": "AHcfQMNu",
+ "video_duration": "263s",
+ "duration": "4-6min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video consists of multiple segments, which feature two men. In some segments, one man is wearing a dark suit, light shirt and patterned tie. He is in a room with dark curtains in the background and a lamp in the corner behind him. He makes various hand gestures while speaking. In other segments, one man is wearing a dark t-shirt and red button-down shirt. He sits in front of a background of green hexagonal patterns. He is speaking into a microphone and also makes various hand gestures. One segment shows two men, with one wearing glasses, a grey shirt, and the other with a black t-shirt and gray suit. There is also a segment that shows the man in a white shirt sitting on a chair on stage, in front of a blue background with an image of colorful shapes. He uses his hands to emphasize what he is saying. There is one shot of a woman with long red hair wearing a blue top on a black background with gold writing. Several segments contain the words 'SELF-PAY', 'OBSESSION WITH FIRMNESS', and \u2018THE BLOCKS PODCAST\u2019 overlaid on the screen. One segment shows an image of a map with weather patterns on it, while the man in the red shirt makes hand gestures in front of it. Many segments alternate quickly between the various people.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the mood of the man wearing a brown shirt when the man in the black suit jacket mentions his \"sandwich rule\"?",
+ "answer": "A",
+ "candidates": [
+ "A. Surprised.",
+ "B. Angry.",
+ "C. Happy.",
+ "D. Sad."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a vase behind the man wearing glasses in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I don't know."
+ ]
+ }
+ },
+ "geBYSdUC": {
+ "video_id": "geBYSdUC",
+ "video_duration": "112s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with an image of an outlined figure in blue and red lines with accompanying text around it. The video then transitions to an animation of the red lines of the figure becoming more defined. The video ends with the fully outlined figure in red.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the drawing of the fingers appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. Throughout the video.",
+ "C. At the beginning of the video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the background color of the anime characters in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Red.",
+ "B. Orange.",
+ "C. Black.",
+ "D. Blue."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Did the man in the video finish painting?",
+ "answer": "B",
+ "candidates": [
+ "A. It is uncertain.",
+ "B. It is completed.",
+ "C. The painting is not completed."
+ ]
+ }
+ },
+ "cgTOPiPf": {
+ "video_id": "cgTOPiPf",
+ "video_duration": "135s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a person's head and shoulder visible in a large image in the center of the screen and a second person in a small screen on the lower right side. The person in the large image is wearing a bright orange shirt and matching hat with large round glasses with a tortoise-shell pattern on the frames. The person's left hand is placed on their chin in a resting position. A white bar appears across the center of the screen containing a comment that reads \"stop dying your hair you forgot\". Then, that bar and a new white bar are displayed with the text \"I HATE tan skin on asians.\" Then, three white bars are displayed in different parts of the screen, each containing a comment: \"The earth is flat. And so is your nose. Show less 1\", \"double barrel shotgun nose ahh 4 months ago\", and \"Nose looks like Metroid\". The person in the small screen is wearing a light blue hoodie and using a computer. They make a hand gesture towards their face and look in the direction of the smaller screen. Then, a red overlay is applied to the larger image, and a portion of the person's face appears, where the red overlay is not. The person in the small screen continues to look at the computer and make hand gestures. The red overlay appears over more of the person's face, and then more and more until their entire face is covered with the red overlay. The person in the small screen continues to interact with the computer. The screen is then altered to display only the person in the large picture, without any text or overlays. Then, the video cuts to a closer shot of the person's face in the large photo. A very fine circle is displayed moving over the person's nose, and then the video ends.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the man's emotion in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Humorous.",
+ "B. Discouraged.",
+ "C. Excited.",
+ "D. Indifferent."
+ ]
+ }
+ },
+ "ECRaxgHh": {
+ "video_id": "ECRaxgHh",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Seven people are sitting on black chairs on a stage. Each person is holding a stringed instrument. There are black music stands in front of each person. All the people are positioned in a single line across the stage. The background is a solid red color. The stage itself is a wood color. The people are actively playing their instruments.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What position is the woman in a pink top relative to the woman in a pink skirt?",
+ "answer": "B",
+ "candidates": [
+ "A. The woman in a pink blouse is to the right of the woman in a pink skirt.",
+ "B. The woman in a pink blouse is behind the woman in a pink skirt.",
+ "C. The woman in a pink blouse is in front of the woman in a pink skirt.",
+ "D. The woman in a pink blouse is to the left of the woman in a pink skirt."
+ ]
+ }
+ },
+ "tBTQfDVX": {
+ "video_id": "tBTQfDVX",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man sits under a small, dark green pop-up canopy, playing a two-stringed bowed instrument. He is seated on a folding chair placed behind a wheeled, rectangular, black case. In front of the case is a partially open hard-shell instrument case with Chinese characters written on it. A microphone on a stand is positioned next to the instrument case. The man wears dark clothing. A few people stand nearby watching and listening, including two women in the foreground. One woman wears a red sweater and the other a light colored jacket. Other pedestrians are visible walking in the background on a wide paved walkway or plaza. Several bicycles are parked to the man's right and a red fire hydrant is also visible in the background. Some trees without leaves are seen along the sides of the path, and a long, light colored building with red banners is visible in the background. The women hug briefly. A child wearing a blue patterned top and maroon patterned skirt, and carrying a light blue backpack approaches the case and appears to place something inside. The child takes the man's hand briefly and then departs with another, shorter child. A man in dark clothing walks past in the background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is there any sound of the erhu in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. I'm not sure.",
+ "B. No, there isn't.",
+ "C. Yes, there is."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man wearing a black shirt in the video do to the speaker?",
+ "answer": "D",
+ "candidates": [
+ "A. Wipe the exterior of the speaker.",
+ "B. Place the speaker on his leg.",
+ "C. Adjust the position of the speaker.",
+ "D. Adjust the volume of the speaker."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotions are expressed by the music played in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Tense and fearful.",
+ "B. Calm and sorrowful.",
+ "C. Melancholy and oppressive.",
+ "D. Free-spirited and optimistic."
+ ]
+ }
+ },
+ "mkxLykfz": {
+ "video_id": "mkxLykfz",
+ "video_duration": "279s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A muscular man with reddish-blond hair, wearing a khaki stringer, stands in a gym, holding a pair of lifting straps. He walks over to a cable machine, points to it, attaches the straps to the cables, and performs a cable row exercise. He then detaches the straps. He moves to a back extension machine, lies face down on it, grasps the handles, and performs several repetitions. He then sits on an incline bench, picks up a dumbbell with the straps attached, and demonstrates a dumbbell pullover, raising and lowering the dumbbell from behind his head to over his chest. He lies face down on a flat bench, holding the dumbbell with straps, and demonstrates several repetitions of lying dumbbell pullovers. He walks over to a chalkboard wall with gym rules written on it. He then takes out his phone and uses it, scrolling with his fingers. He proceeds to put on a lifting belt and pick up an EZ curl bar. He demonstrates a series of bicep curls. Next, he picks up a dumbbell and continues with dumbbell bicep curls. Finally, he stands in front of a double cable machine, grasps the handles, and performs overhead tricep extensions.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man lift the barbell in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Twelve times.",
+ "B. Thirteen times.",
+ "C. Fifteen times.",
+ "D. Sixteen times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What fitness exercises does the man in the video perform in sequence?",
+ "answer": "B",
+ "candidates": [
+ "A. Bent-over barbell row \u2192 Lat pulldown \u2192 Dumbbell tricep extension \u2192 Barbell bicep curl.",
+ "B. Lat pulldown \u2192 Bent-over barbell row \u2192 Dumbbell tricep extension \u2192 Barbell bicep curl.",
+ "C. Bent-over barbell row \u2192 Barbell bicep curl \u2192 Lat pulldown \u2192 Dumbbell tricep extension.",
+ "D. Bent-over barbell row \u2192 Lat pulldown \u2192 Barbell bicep curl \u2192 Dumbbell tricep extension."
+ ]
+ }
+ },
+ "NBjkKJhm": {
+ "video_id": "NBjkKJhm",
+ "video_duration": "108s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a series of panels from a comic book. The first panel shows a character in a red helmet and red and blue clothing standing with one arm raised in front of a yellow circle. Text in a blue box reads \"THE LAUGHING GAME\". The next panel shows a man with a mask pulling red objects from a table with a speech bubble that says \"ONE MONTH AGO. BEAUTIFUL. COME ON, BABY. EVERYTHING JUST FIT TOGETHER.\" The third panel depicts the red and blue character from the first panel standing in a crowd of people in front of a building. Speech bubbles appear above the character and some of the other figures. The fourth panel shows the red and blue character surrounded by three women. One woman is blond and wearing a pink top, another woman has long black hair, and a third, barely visible, woman has long brown hair. The blond woman has a speech bubble. The next two panels show an explosion, with the red and blue character holding a yellow rod. Text reads \"KA-BAAM\". The next panel shows a man and a woman running away from a doorway behind which the red and blue character can be seen. The following panel shows the red and blue character crouching with a red device in his hands. The next panel shows a red device being used, text in a white box reads \"EEEEHHMM.\" The following panel shows the red and blue character in front of a large circular door. Text in a speech bubble reads \"YES!\" The next panel shows the red and blue character in front of a large circular doorway surrounded by several police officers and their cars. Text in a speech bubble reads \"A LITTLE ULTRA-SOUND? SOUNDS ABOUT RIGHT TO ME.\" The next panel shows the red and blue character standing in front of several police officers, who have fallen over, along with loose money. The following panel shows the red and blue character with a large sack slung over his back, with several police officers lying on the ground. He's holding a black, bomb-like object. In the next panel, the same character is holding a red device. Text in both panels is in speech bubbles. The next panel shows the red and blue character standing in front of several figures wearing white bandages. One character wearing a \"PP\" hat is slightly behind a police officer who's running away. Text in speech bubbles is visible. The following panel depicts a pink substance shooting up towards the sky with text in a speech bubble that reads \"TRY SOME OF THIS, FREAKS\". The next panel shows the red and blue character covered in white bandages, holding a yellow rod and text in a speech bubble that reads \"GET THIS GARBAGE OFF ME!\" The next panel shows the red and blue character standing in a scene of explosions. Other figures can be seen. Text reads \"VARIUM\". Text in a speech bubble reads \"YES! NOTHING CAN STOP THE POWER OF CODPIECE!\" The following panel shows a police officer pointing a pistol. In the bottom section of the panel, the red and blue character with a thin beam of light emitting from his hands. Text in a speech bubble reads \"HEY! WHAT D'YOU KNOW? ONE OF THE LITTLE MEN HAS SOME GUTS.\" The following panel shows the red and blue character standing in front of a pile of white strands. Text in a speech bubble reads \"YOU REALLY THINK I'M GOING TO LET A PAIR OF MUMMIES BOTHER ME?\" The next panel shows the red and blue character, now wrapped in bandages, standing in different stances. The next two panels show the red and blue character standing next to a woman with a green frog mask. Text in speech bubbles reads \"DARLING! I JUST COULDN'T KEEP AWAY FROM YOU! YOU'RE SO AMAZING. HAD TO SEE YOU, PLEASE DON'T BE ANGRY.\" and \"YOU'RE JUST THE MOST AMAZING MAN I'VE EVER MET. YOU'RE SO BIG, AND SO POWERFUL. I'VE NEVER SEEN ANYONE LIKE YOU.\" The following panel shows a woman with blond hair standing in front of a wall. Text on the wall reads \"ALL DELIVERIES GO TO BACK ENTRANCE\". The next panel shows a woman with blond hair creating several small balls of color. Text in a speech bubble reads \"YOU MEAN LIKE THAT?\" The following panel shows a woman with blond hair holding a green frog mask talking to another woman. Text in speech bubbles reads \"WHAT DO YOU THINK, JEAN?\" and \"IT'S GREAT. THIS IS GOING TO BE SUCH A TERRIFIC PARTY. I'M REALLY GLAD YOU'RE COMING BACK FOR IT. BACK FOR THE PARTY.\" The following panel shows the woman with the green frog mask creating small balls of color. In the next section of the panel, the red and blue character and the woman with the green frog mask can be seen. Text in a speech bubble reads \"I OVULATE!\" and \"DISSOLVE\". In another speech bubble: \"WHAH! WHAT ARE YOU DOING.\" and \"NO! PLEASE STOP IT, PLEASE.\" The next panel shows three people wearing bandages, one is a woman taking off the green frog mask. Text in speech bubbles reads \"THANKS. THAT WAS --SPECTACULAR.\" and \"THANK YOU. YOU TWO ARE REALLY QUITE SOMETHING YOURSELVES.\" The final two panels show the red and blue character with the woman in the frog mask, and the final panel is the same as the first one with the same text.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What actions does the man in the blue and red spandex suit perform in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Use a drill to penetrate the bank vault \u2192 Use ultrasound to knock out the police \u2192 Destroy the bank's pillars and entrance with a laser cannon.",
+ "B. Use ultrasound to knock out the police \u2192 Penetrate the bank vault with a drill \u2192 Destroy the bank's pillars and entrance with a laser cannon.",
+ "C. Use a laser cannon to destroy the bank's pillars and entrance \u2192 Use a drill to penetrate the bank vault \u2192 Use ultrasound to knock out the police.",
+ "D. Use ultrasound to knock out the police \u2192 Destroy the bank's pillars and entrance with a laser cannon \u2192 Penetrate the bank vault with a drill."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the woman wearing a frog mask do to the man in the blue and red spandex suit?",
+ "answer": "B",
+ "candidates": [
+ "A. The woman wearing a frog mask abused the man in the blue and red spandex suit.",
+ "B. The woman wearing a frog mask defeated the man in the blue and red spandex suit.",
+ "C. The woman wearing a frog mask gave the man in the blue and red spandex suit an ideological lesson.",
+ "D. The woman wearing a frog mask did nothing to the man in the blue and red spandex suit."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the most comical weapon used by the man wearing a blue and red bodysuit in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A laser cannon.",
+ "B. Extendable boxing gloves.",
+ "C. A drill.",
+ "D. Large scissors."
+ ]
+ }
+ },
+ "fZraANmy": {
+ "video_id": "fZraANmy",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man plays a saxophone while another man plays an electric guitar. They stand outdoors, with trees and a silver SUV in the background. A drummer sits behind them, playing a drum set. There are speakers and other music equipment around them. The saxophonist wears a lavender t-shirt and patchwork shorts. The guitarist wears a light green t-shirt, a cap, and blue patterned shorts. He also wears sunglasses. Both men appear focused on playing their instruments. The drummer wears a red tank top and plays the drums with drumsticks. The guitarist occasionally moves his head and looks over at the other band members. The saxophonist mostly keeps his gaze forward. They continue playing their instruments.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the man wearing a purple short-sleeve shirt first pause playing the saxophone?",
+ "answer": "A",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. Throughout the video.",
+ "C. At the end of the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "MmRBiRGc": {
+ "video_id": "MmRBiRGc",
+ "video_duration": "225s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video starts with a close-up of a dark green decorative wall with a circular window revealing a dining area. The camera pans up to a large, textured, golden-yellow flower-shaped decoration on the wall. It then moves quickly across similar dark green swirling shapes, showing wooden gears on a dark blue background. The camera focuses on a gold circular sign with the words \"Wonderland Imaginative Cuisine\" written in cursive script and surrounded by a gear. The scene shifts to a silver, high-backed, ornate throne in a dining room lit with blue light. A young girl wearing a pink face mask and pink heart-patterned long-sleeved shirt sits in the throne. The view changes to a dining table with a man and a waitress, both wearing face masks. The man is seated, holding a paintbrush, and facing two blank menus framed in silver and gold, with the word \u201cWonderland\u201d at the top. The waitress stands next to him, gesturing. The camera zooms in on the silver-framed blank menu, then shows a close-up of the man dipping the paintbrush in water. Next, the camera focuses on the young girl, now holding a paintbrush and dipping it in water. She then paints the menu with water, revealing yellow text underneath. The camera shows two martini glasses on the table: one with yellow liquid and a red layer at the bottom, the other with yellow liquid and a blue layer. Then, a woman with red lipstick holds up a martini glass with yellow liquid and a blue layer at the bottom and takes a sip. The man and woman clink martini glasses together. The woman takes more sips of her martini. The camera focuses on the woman speaking, then returns to the table where a server in dark clothing and face mask is pouring a smoky substance onto a plate with a hard-boiled egg split open and nestled in a nest-like bed of shredded phyllo dough, with green garnishes. The server adds two more egg halves to the plate. The young girl takes a bite of the egg and savors it. 
The woman, wearing red lipstick, eats a bite and gives a thumbs-up. The camera pans across the dining room, showing several diners seated in ornate silver and white chairs at white tables with glass tops. A cut shows the waitress placing three small cones with avocado filling and sprigs on top in the middle of the table, next to a plate of mixed vegetables, asparagus, carrots, and small purple beets. The waitress pours a white sauce over the vegetables. The young girl picks up a deep-fried shrimp in a small metallic cup from the table. The man scoops up some dark crumbly food from the first course dish with a spoon. The girl begins eating the shrimp and remarks on it being tempura shrimp with cocktail sauce. The man picks up one of the avocado cones. The girl describes the cone as an avocado crema in a crispy shell. She takes a bite of the avocado cone, holding up two fingers, then finishes it. The woman eats one of the cones. The server comes to the table with a small white bowl, which she places in front of the girl. The bowl is made to resemble a cracked egg and has small pieces of food and some herbs in it. The server pours a yellow broth into the bowl. The girl takes a bite with a spoon and describes it as a dumpling in chicken broth with microgreens. She eats more of the dumpling and broth. The server brings two dishes: a plated chicken breast with mashed potatoes and boiled egg slices, and a salmon filet on rice with a small bowl of sauce. She pours brown gravy onto the chicken and a light orange sauce onto the salmon. The girl eats a bite of the chicken, mashed potatoes and vegetables, remarking on them. Two desserts are brought to the table: a large white frosted cake with a red candy apple on top and a smaller chocolate sphere with gold dusting. \u201cHappy Birthday Chelsea\u201d is written in brown on the large cake. The girl watches excitedly as the server slices the apple on the cake and then cuts the cake. 
The server places the chocolate ball on the same plate as the sliced apple cake, and cuts the chocolate ball open with two forks, revealing a cake interior. The server places the chocolate ball next to the apple cake slice. The scene shifts to a view of the empty dining room with white tables and silver and white ornate chairs. A chandelier hangs overhead. The video concludes with a shot of a woman sitting on a low wall in front of a cruise ship in tropical waters. Text on the screen reads, \u201cNEW VIDEOS WEDNESDAYS +MORE WHEN TRAVELING,\u201d and \u201cWATCH ANOTHER VIDEO!\u201d",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Which does the little girl prefer, beef or Bernardini fish?",
+ "answer": "D",
+ "candidates": [
+ "A. Beef.",
+ "B. Bernardini fish.",
+ "C. She likes neither.",
+ "D. She likes both."
+ ]
+ }
+ },
+ "bxWbXPdv": {
+ "video_id": "bxWbXPdv",
+ "video_duration": "54s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Biology",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with curly hair is shown outdoors wearing a white lab coat. They are holding a green folder with writing on it. They reach out and touch a small plant with red berries, and later point to the writing in their folder. The camera zooms in to show the text on the folder which includes headings such as \"SHRUBS\", \"SECTION II\", and \"ANGIOSPERMS\". The person flips through a few pages of the folder and points at more text. The camera then cuts to a close-up of the red berries on the plant, followed by the folder again, with the person pointing. Next, the camera zooms in on a pink flower, which is followed by a title card: \"ROSA WOODSII\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "After determining that the leaf structure shown in the video was a compound leaf, which question in the textbook did the woman turn to?",
+ "answer": "B",
+ "candidates": [
+ "A. Question 29.",
+ "B. Question 30.",
+ "C. Question 28.",
+ "D. Question 31."
+ ]
+ }
+ },
+ "eNnyPjlG": {
+ "video_id": "eNnyPjlG",
+ "video_duration": "142s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A young man stands next to a blue three-wheeled motorcycle in a parking lot, gesturing with his hands. He walks to the motorcycle and sits on it. He adjusts the mirrors and a small screen mounted on the handlebars. He gestures with his hand toward various parts of the vehicle. He drives the motorcycle in circles around the parking lot, occasionally looking back over his shoulder. He then drives it in a larger circle through the rows of parking spaces. The camera angle shifts to a close-up of the handlebars as the man continues driving. The camera returns to its original angle, showing the man driving. The man continues driving the motorcycle in circles and through the parking spaces. He drives towards the camera and stops. He gestures while continuing to sit on the vehicle.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What are the steps taken by the man in the black long-sleeve shirt to start the Can-Am Ryker in the video? (a) release the throttle (b) release the handbrake (c) then take off slowly",
+ "answer": "B",
+ "candidates": [
+ "A. Release the handbrake, release the throttle, then take off slowly.",
+ "B. Release the throttle, release the handbrake, then take off slowly.",
+ "C. Release the throttle, then take off slowly, release the handbrake.",
+ "D. Release the handbrake, then take off slowly, release the throttle."
+ ]
+ }
+ },
+ "eBdFvqdb": {
+ "video_id": "eBdFvqdb",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three men and a woman sit on chairs in a room and play banjos. Sheet music sits on stands in front of them. A grand piano and a pipe organ are visible in the background. The camera moves to show a glass display case with various stringed instruments. The man furthest to the left now plays a banjo with a dark fretboard, while the other two men continue playing their banjos. The camera pans back to the original view, where the woman has left. A few moments later, the man on the far right is replaced by a woman with short dark hair wearing an orange-red T-shirt. Another man and woman join them, playing banjos. All three are now seated, reading sheet music. The view shifts back to the two men with banjos near the instrument display case.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many musical instruments are being played in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. Six.",
+ "D. Two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What instrument is on the far left in the opening scene of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Xylophone.",
+ "B. Guitar.",
+ "C. Electric bass.",
+ "D. Banjo."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "From the audience's perspective, where is the white-haired elderly man positioned relative to the man wearing a gray long-sleeved shirt?",
+ "answer": "C",
+ "candidates": [
+ "A. Behind the man wearing a gray long-sleeved shirt.",
+ "B. In front of the man wearing a gray long-sleeved shirt.",
+ "C. To the left of the man wearing a gray long-sleeved shirt.",
+ "D. To the right of the man wearing a gray long-sleeved shirt."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Did the sound of piano playing appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes, it did.",
+ "C. No, it did not."
+ ]
+ }
+ },
+ "jYrthEYL": {
+ "video_id": "jYrthEYL",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video shows four people playing bassoons in a split-screen view. Each person occupies a quadrant of the screen. The top-left person wears glasses and a dark t-shirt. The top-right person wears a white tank top, a baseball cap, and dark sunglasses. The bottom-left person wears glasses and a dark suit jacket. The bottom-right person wears a light blue, button-down shirt. All four individuals play their bassoons continuously, their fingers moving over the keys. The top-right person occasionally bobs their head and sways slightly while playing. The bottom-left person briefly adjusts their glasses with their left hand. The lighting and background vary slightly between each quadrant, suggesting they are in different locations. The top-left and bottom-left have plain walls in the background, while the top-right and bottom-right have curtains visible behind them.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man in the light blue cardigan in the video briefly pause his playing?",
+ "answer": "D",
+ "candidates": [
+ "A. Once.",
+ "B. Four times.",
+ "C. Three times.",
+ "D. Twice."
+ ]
+ }
+ },
+ "UItmoRNs": {
+ "video_id": "UItmoRNs",
+ "video_duration": "85s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing bright yellow gloves uses a green and silver circular saw to cut a wooden piece that has been secured in an orange vice. The person first inserts zip ties and removes and replaces the saw blade, then sharpens the saw blade using a grey sharpening stone, and uses the saw to cut the wooden piece in the vice.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did two blades appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No.",
+ "C. Yes."
+ ]
+ }
+ },
+ "ePomGDPj": {
+ "video_id": "ePomGDPj",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen with white radial lines emanating from the center. The words \"BATMAN\" over \"BEYOND\" appear in black text. The white radial lines and text fade in and out. The text \u201cRETURN OF THE JOKER\u201d appears on the screen. An animated character in a red and yellow costume is facing left, and an animated character in a gray coat and white mask is holding a brown pole above their head, near the red and yellow character. The gray coated character is then shown holding a large brown mallet. The camera is angled upward at the gray-coated character. This character is then shown holding the mallet up, facing forward in a different scene, in a hallway. The character in red and yellow is then shown strapped into a machine, in black and white. A character in a white lab coat is leaning over them, holding a tool. A character in a dark gray suit with a dark cape, wearing a dark mask and hat with pointed ears is then shown, facing the camera. This same character is shown, now with a blue line at the edges of his hat. A character in a black and red suit is holding a vase of flowers. A doorway is behind them. Through that doorway is the character in a dark gray suit with a dark cape. The black and red suited character is shown facing the viewer with the flowers, another character is shown sitting in a chair on an elevated platform, smoking a pipe. The seated character holds a newspaper in front of their face. The seated character, still in a chair, has the paper pulled down slightly. A standing character with green hair, a white face, and purple suit is shown with an identical character, face obscured, standing on a raised platform, bound by ropes. The standing character points with his finger. The standing character with the white face is shown speaking directly to the shadowed character. The shadowed character with the green hair is shown more closely. Their teeth are bared, and they have glowing red eyes. 
The white faced, green haired character is shown with wide eyes. The white faced, green haired character, is shown standing next to the character in black and red, and the obscured character in purple on the machine. Then, two animated characters in red tops, white shorts, white caps, and red boots are shown dancing near a grey object. The two characters are shown closer to the object, sitting on it. A brown object with red supports is then shown lifting from below the platform. The brown vehicle is shown with its red supports extended. The two characters in red and white are then climbing onto the vehicle. A larger white character in a red outfit is driving, the red and white characters are shown next to them. An older woman in a blue coat is shown, holding a cane. The older woman and the two red and white characters are then shown looking at each other. The older woman is shown pointing the cane, and looking aggravated. The older woman is shown, pointing the cane again, and looking annoyed. The two red and white characters are shown running away from the older woman. The older woman grabs one of the red and white characters by the collar. The older woman, and the two red and white characters are shown standing in a hallway, with a dark suited character standing near a door.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the Joker couple do to Tim Drake in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Flatter him.",
+ "B. Erase his memory.",
+ "C. Inject a wicked soul",
+ "D. Torture and brainwashing."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the elderly lady want the twin sisters to be punished at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Insulted their grandmother.",
+ "B. Hurt their grandmother.",
+ "C. Made their grandmother sad.",
+ "D. Deceived their grandmother."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What action did the twin sisters perform before climbing onto the stage?",
+ "answer": "A",
+ "candidates": [
+ "A. Performed a front flip.",
+ "B. Bent backward.",
+ "C. Did a full split.",
+ "D. Performed a backflip."
+ ]
+ }
+ },
+ "tQjULUpl": {
+ "video_id": "tQjULUpl",
+ "video_duration": "130s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a group of four people standing near a bar. There are two women and two men in the group. The woman on the left is wearing a beige jacket. The man to her right is wearing a white and grey patterned shirt. The man behind them is wearing a light grey jacket. The woman on the right is wearing a blue and white patterned top. One of the women places her arm around the man in the patterned shirt. Two hands enter the frame holding up two different identification cards. The first ID shown is primarily red, the second ID is primarily yellow. Later, a third, primarily blue ID card is displayed. The hands holding the ID cards are then seen pointing to particular features on the cards. White text boxes appear on screen with text over the video. One reads \"Secondary ID checks\" and then lists items \"Does their name match their ID?\", \"Check a credit card\", \"Check other ID such as a student card\", and \"Ask to see their Facebook page\". Another text box on the video reads \"Don't inspect IDs through a wallet window take a proper look\". A final text box reads \"Check the security feature\" and lists \"Check for signs of tampering\", \"Air bubbles\", \"Wrinkles\", and \"Bumps\". The group of four then dances, and is replaced by another group of two women. One woman is wearing a red top and denim skirt. The other woman is wearing a green top and a darker skirt. They are then joined by a man wearing a black shirt. The final shot is of the NSW Government logo. The lighting changes color throughout the video. The background appears to be that of a busy bar or club with other people.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How do the emotions of the four young individuals change in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Excitement, anxiety, relaxation, happiness.",
+ "B. Excitement, anxiety, tension, happiness.",
+ "C. Excitement, anxiety, tension, guilt.",
+ "D. Sadness, anxiety, tension, guilt."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why did the man with shoulder-length hair wearing a shirt take out two IDs for the bartender to inspect?",
+ "answer": "B",
+ "candidates": [
+ "A. Because he mistakenly presented the wrong ID first.",
+ "B. Because the bartender noticed that the photo on the first ID did not closely resemble him.",
+ "C. Because the first ID was damaged.",
+ "D. Because the bartender observed that the age on the first ID did not match his appearance."
+ ]
+ }
+ },
+ "pDxKVQaX": {
+ "video_id": "pDxKVQaX",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person wearing glasses and a long-sleeved black shirt plays a bassoon, standing slightly angled to the right, in front of a black music stand. Their left hand is positioned near the top of the bassoon, and their right hand is closer to the bottom. A second person sits at a black grand piano, mostly obscured by the piano itself, wearing a light blue shirt and dark vest. They play the piano while the first person plays the bassoon. Both individuals appear to be focused on their instruments throughout the video, which features a static shot. The wall behind them is light-colored on the left and dark wood paneling on the right. A black chair sits behind and slightly to the left of the bassoon player. The music stand holds sheet music.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the bassoon louder than the piano?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "QJDFnwrm": {
+ "video_id": "QJDFnwrm",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video is split into two frames, with a woman visible on the left and a man on the right. The woman has long dark hair, is wearing dark over-ear headphones, a black tank top with red flower details, and jeans. She is holding and playing a gold-colored saxophone. The man is wearing a white t-shirt with a black and red cord around his neck, and is playing a brown ukulele. The background on the left is a plain white backdrop, while the background on the right appears to be trees. Black text in the bottom of the screen displays lyrics with chord notations above them. At one point, the woman makes a \u201ccall me\u201d hand gesture. At the end of the video, the split frame disappears and only the woman's face is visible, with the black headphones around her neck and the text \"SUBSCRIBE\" superimposed on the right side of the frame.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur in the pitch of the music in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It remains gentle throughout.",
+ "B. It remains urgent throughout.",
+ "C. It gradually speeds up from a slow pace.",
+ "D. It slows down from a fast pace."
+ ]
+ }
+ },
+ "CbtwvFWc": {
+ "video_id": "CbtwvFWc",
+ "video_duration": "89s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman wearing a black headwrap, eyeglasses, a grey shirt with a pattern, a silver necklace, and gold bracelets is in the foreground. She is speaking and gesturing with her hands. Behind her are other people, and some of them hold pieces of paper or paintings with various colors. A thin white string is strung across the area and has pieces of paper attached to it using clothespins. One of the pieces of paper has red text written on it. The camera shows a table with a plastic bag of dried blood orange slices in it, a gold shoe ornament, feathers, and other items. Then the camera focuses on a white layered dress or slip that's on the ground, and reveals a broken open container with what appears to be dark seeds, other debris, and a large dried animal tail. The camera focuses on the speaking woman again as she holds up pieces of paper and continues speaking, then shifts back to show the strung paper pieces.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video is the wedding dress displayed?",
+ "answer": "D",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Throughout the video.",
+ "C. At the beginning of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many golden bangles is the woman wearing a black headscarf holding in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Four.",
+ "B. Five.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "fkCrIHyu": {
+ "video_id": "fkCrIHyu",
+ "video_duration": "144s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Racing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a relay race on a running track. Multiple runners, some wearing red tops and some wearing black or dark tops, are running on the track and exchanging batons. There are multiple lines marking the running lanes. Orange cones are visible along the track, and there is a long jump pit near the track. Several people can be seen standing or sitting outside the track, and several rows of seating are visible in the background. There is a timer displayed on the screen with times ranging from 1:36.3 to 3:18.1. Some runners are wearing clothing with the letters \"THA\", \"MAS\" or \"AUS\". In some frames, a person in a yellow shirt is standing on the track and watching the runners. In several frames, a runner wearing a dark top runs alone on the track. In several frames, a runner wearing a red top runs alone on the track.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the positional relationship between the Malaysian and Thai competitors during the third lap?",
+ "answer": "C",
+ "candidates": [
+ "A. The Malaysian competitor leads at the start of the video but falls behind the Thai competitor by the end.",
+ "B. The Malaysian competitor consistently remains behind the Thai competitor.",
+ "C. The Malaysian competitor falls behind but later overtakes the Thai competitor.",
+ "D. The Malaysian competitor consistently remains ahead of the Thai competitor."
+ ]
+ }
+ },
+ "izyCUIrs": {
+ "video_id": "izyCUIrs",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Greyhounds are in starting boxes. A green light flashes and the dogs are released. They race along a dirt track. The dogs bunch together as they run around the first bend. The black dog pulls ahead, followed by the dog wearing red. The camera zooms out to follow the dogs as they run down the back straight. The black dog continues to lead, with the red dog and the light green dog in pursuit. The dogs approach the finish line, with the black dog winning comfortably, followed by the red dog in second and the light green dog in third. The camera switches to a woman in a black suit being congratulated by people around her. She smiles and walks away with her arm around another person in a black suit.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the video show Dog No. 1 running at the front?",
+ "answer": "A",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. Throughout the video.",
+ "C. At the end of the video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In what setting does the motion in the video take place?",
+ "answer": "C",
+ "candidates": [
+ "A. A swimming pool.",
+ "B. A basketball court.",
+ "C. An outdoor sports field.",
+ "D. An indoor sports arena."
+ ]
+ }
+ },
+ "TIyjKrir": {
+ "video_id": "TIyjKrir",
+ "video_duration": "459s",
+ "duration": "6-8min",
+ "domain": "Tech & Science",
+ "sub_category": "Tech Reviews",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Four mobile phones are placed on a wooden table. A fifth mobile phone is positioned above the others. All five mobile phones have black screens. A hand enters the frame holding the phone on the left side of the four phones. The hand connects a white charging cable to the left phone, and places the phone back down.The hand picks up a different phone, connects a white charging cable to the device, and places the phone back down. The hand picks up a third phone, connects a white charging cable, and places it back down. The hand picks up the fourth phone, connects a red and white charging cable, and places it back down.The hand picks up the phone on the far left and a graphic of a circle with a white outline appears on the screen. The hand selects the phone and swipes it to the left causing a graphic of a man in a blue and red suit to appear. A slider menu appears at the bottom and the hand adjusts the slider to the left. The phone then shows a home screen menu with multiple icons that the hand scrolls through. The hand places the phone back on the table.The hand then picks up the second phone from the left, and swipes it to the right causing a graphic of a red and white heat map to appear, which transitions to a screen showing a numerical pin pad. The hand then places the phone back down on the table. The hand picks up the third phone from the left, and swipes it causing a graphic of a man with red accents to appear, the hand then uses a finger to scroll through the phone, causing a menu to appear. The hand places the phone back down on the table. The hand picks up the fourth phone from the left, and uses a finger to scroll through the phone causing different menus to appear, the hand then places the phone back down on the table.The hand picks up a different phone and swipes to the left which causes a pin pad menu to appear with the word \"Biometrics\" written in white on the bottom of the screen. The hand places the phone back on the table. 
The hand then picks up the first phone again, and swipes to the left causing a menu of multiple icons to appear. The hand places the phone down on the table. The hand picks up another phone and scrolls to the left causing a red and white image to appear. The hand uses a finger to touch different areas of the screen. The hand then places that phone back down on the table. The hand picks up a different phone, and scrolls to the left causing a menu to appear with the words \"Biometrics\" at the bottom. The hand places the phone down on the table.The time on the mobile phone at the top of the screen reads 11:35. A new view of the phones on the table shows that the time on the phone at the top of the screen reads 12:50. Another view shows that the time on the mobile phone at the top of the screen reads 14:05. A different view shows the time reads 15:17. Then a different view with the time of 16:32. Another view with the time of 17:47. A view with the time at 19:03, then 20:04. Another view at 20:29 and then 20:54.A view of the devices with the time of 21:19. A different view with a time of 22:17. Another view with a time of 23:52. A view at time 25:27. A different view at 27:02. A view of time at 28:37. A view of time at 39:15. A view of time at 39:50, and then one at 40:25. Another view at 41:00 and then another at 41:35.A view of the phones on the table with the text \"41 mins\" and battery percentages showing near the top of the view. The time on the phone at the top reads 41:54. The hand picks up the first phone from the left, which shows the time and date on the home screen. The hand swipes left, which causes a graphic of a man with a blue hat and red accents to appear. The hand uses a finger to touch the screen. The hand then places the phone back down on the table.The hand then picks up the second phone from the left and a red and white heat map graphic appears which transitions to a pin pad entry screen. The hand places that phone down. 
The hand picks up the third phone from the left and a red and white heat map graphic appears which transitions to a pin pad entry screen. The hand places that phone down on the table. The hand then picks up the fourth phone from the left and a graphic of a man with red accents appears and a finger is used to move across the screen, causing a menu to appear. The hand places the phone back down on the table. The hand then picks up the phone on the left and swipes to the right, causing a menu to appear. The hand places the phone on the table.The hand picks up the second phone from the left. A red and white heat map graphic appears, which transitions to a pin pad entry screen. The hand places the phone down. The hand picks up the third phone from the left, which shows a graphic of a man with a white hat, and uses a finger to move across the screen, causing a menu to appear. The hand places the phone back down on the table.The hand picks up the fourth phone from the left, and a graphic of a man with a white hat appears on the screen. The hand uses a finger to navigate the screen. The hand places the phone back on the table. The time on the phone at the top of the screen now reads 42:40.Another view of the phones on the table, the time on the phone on top reads 42:46. The time reads 42:50, then 44:02, 45:22, 46:43, 48:03, 49:23, 49:41, 49:47, 49:53, 49:59. The time on top reads 50:06.The hand picks up the first phone, moves it slightly, and places it back down. The time on the phone at the top reads 50:13. The hand picks up the fourth phone on the right and a menu with white text on a black background appears. The hand uses a finger to scroll through the menu. The hand then places the phone down on the table. The time on the phone at the top now reads 51:00, 51:28, 52:22, 53:22, 54:22, and 55:22.A new view of the table shows the time and text \"60 mins\", and battery percentages for each of the four phones. The time reads 01:00:01. 
The hand picks up the first phone and a graphic of a man with a white hat and blue accents appears on the screen. The hand swipes the phone causing a home screen menu to appear. The hand then places that phone back down on the table. The hand picks up the second phone and a red heat map appears, then transitions to a pin pad. The hand then places the phone back on the table. The hand picks up the third phone and a red heat map graphic appears which transitions to a pin pad menu. The hand then places the phone back on the table. The hand picks up the fourth phone from the left and the graphic of the man with the white hat appears. The hand uses a finger to scroll through menus on the screen, and then places the phone back on the table. The time reads 01:00:10, then 01:00:16, and 01:00:17. The time reads 01:00:25. The hand then picks up the fourth phone from the left and a pin pad screen appears. The hand places the phone back down on the table. The hand picks up the fourth phone again and a graphic of a man with a white hat appears and then the hand uses a finger to touch different parts of the screen. The hand places the phone on the table. The time reads 01:00:30, 01:00:31, 01:00:32, 01:00:33, 01:00:34, 01:00:35, 01:00:39, 01:00:44, and 01:00:57. The time reads 01:01:09, 01:01:22, 01:01:34, 01:01:42, and 01:01:50. A new view of the table, with the time reading 01:02:12. Another view shows a time of 01:02:24. The time on the phone at the top reads 01:02:37. Another view with the time 01:02:52, then 01:03:10. The hand then touches the screen on the first phone from the right and scrolls through the home screen menu. The hand places the phone down on the table. The time now reads 01:03:38. A new view appears of the phones and text \"Huawei P30 Pro 100% 1hr 03 mins\". The time on the phone at the top now reads 01:03:40. The hand picks up the phone on the far right and a graphic of a white and red image appears, and the hand uses a finger to scroll up and down the screen. 
The hand places the phone back down. The hand then picks up the phone on the far right and then moves it to the far left and then moves the phone again to the second position. The hand places the phone back down on the table. The time on the phone on top reads 01:03:43, 01:03:44, 01:03:45, 01:04:01, and 01:04:21. The hand picks up a phone and scrolls using a finger. The hand then places it back down on the table. The time on the phone on top of the screen now reads 01:05:20, 01:05:40, and 01:06:20. The time on the phone on the top of the screen reads 01:06:40. The hand then picks up the phone in the second position, causing a pin pad to appear. The hand places the phone back down. The time on the top phone reads 01:07:00, 01:07:20, 01:07:40, 01:08:00, 01:08:20, 01:08:24, 01:08:25, 01:08:26, 01:08:27, 01:08:28, and 01:08:29. The time on the phone on top of the screen reads 01:08:30. The hand picks up the first phone from the left. The home screen of the phone appears. The hand swipes to the left, causing a graphic of a man in a blue hat and red accents to appear. The hand then scrolls through a menu. The hand places the phone down on the table. The time on the phone on top reads 01:08:40. The hand then picks up a different phone and causes a menu to appear when scrolling through the screen. The hand picks up another phone. A graphic of a man with a white hat and red accents appears, and the hand touches different parts of the screen. The hand puts the phone down. A new view of the table shows the time reading 01:09:08. The time on the phone at the top reads 01:09:25, 01:09:43, 01:10:01, 01:10:18, and 01:10:36. The time reads 01:10:50. The hand picks up one of the phones and touches the pin pad that appears on the screen. The hand puts the phone back down. The time reads 01:11:28, then 01:11:46. The time reads 01:12:03, 01:12:05, 01:12:06, 01:12:07, 01:12:08, 01:12:09, and then 01:12:10. 
The time reads 01:12:11, and then 01:12:12. The view changes and the time reads 01:12:15. A new view appears of the table showing the text \"OnePlus 7 Pro 1hr 12 mins\". The hand picks up the phone at the far right, and a pin pad screen appears. The hand then swipes the phone, causing a graphic of a man with a white hat to appear. The hand then touches different parts of the screen. The hand puts the phone back down on the table. The hand then picks up the third phone from the right and a pin pad screen appears. The hand places the phone back down on the table. The hand then picks up the fourth phone and a pin pad appears. The hand puts the phone down on the table. The hand then picks up the phone again and uses a finger to navigate through the menus on the screen. The hand puts the phone down on the table. The time on the phone on the top of the screen reads 01:12:23. The hand picks up the phone in the second position on the right and causes it to display a red and white heat map graphic. The hand places the phone down on the table. The time on the phone reads 01:12:24. The hand picks up the phone on the left, and causes a circle to appear on the screen. The hand puts the phone down. The hand then moves a phone from the far right to a position on the far left. The hand then picks up a different phone. The hand places it on the far right, and a light-colored version of the phone with a logo on the back appears on the far right. The hand then places that phone down. The view shows the four phones lined up next to each other, with charging cables connected to each device. The time on the phone at the top of the screen reads 01:12:25, 01:12:26, 01:12:27, 01:12:28, 01:12:29, 01:12:30, 01:12:42, 01:12:59, and 01:25:17. The time on the phone at the top of the screen reads 01:25:58. The hand then places the phone down. A view of the four phones lined up on the table. A view appears of a man with a beard. He wears a white t-shirt with a red graphic of a hand. 
He is gesturing with both hands as he is talking. The view changes back to the four phones lined up on the table with charging cables. The time on the phone at the top of the screen reads 01:28:57. The hand then picks up the phone on the left and swipes to the left causing the graphic of a man with the blue hat and red accents to appear. The hand places that phone down on the table. A view of the man with a beard. The man speaks to the camera, gesturing with both hands as he is talking. A new view shows the phones lined up on the table and charging cables and their respective wall chargers. The hand picks up one of the chargers showing the branding on the side. The view returns to the man with a beard. He speaks to the camera, gesturing with both hands as he is talking. A new view appears with four phones lying on the table with charging cables connected to each device. The mobile phone on the top of the screen shows the time 0:02:23, with black screens on each of the mobile phones. Then a new view showing a screen graphic on three of the phones indicating they are charging. The time on the top phone reads 0:03:21. Then a new view showing a logo on the screen of one of the mobile phones at the far right of the frame, the time reads 0:04:26. A new view with the time 0:05:24. A new view with the time 0:06:22. A new view showing a logo on the screen of one of the mobile phones in the second position from the right, and the time reading 0:07:21. A new view appears with the text \"HUAWEI\" on the screen of the third phone from the left with the time 0:08:25. A new view appears with a charging logo on the screen of the fourth phone from the left and the time reads 0:09:23. A view appears of the man with a beard. He gestures with a single finger. A new view with the man speaking to the camera, gesturing with both hands as he speaks. A new view of the phones from the top. The man gestures and points his finger. 
The view moves back to the phones on the table. A new view of the man with a beard, talking directly to the camera, gesturing with his hands. The man claps his hands and continues to talk, gesturing with a finger. A new view of the four phones lined up next to each other. He gestures to the side. A view of the man. A view of the phones. He continues to gesture. A view of the man with a beard, talking directly to the camera. The screen displays text \"Giveaway Link Below\", the man speaks to the camera. A new view with text \"Giveaway Link Below\". A view appears with the phones and charging cables. A view with the man with a beard, gesturing. The man smiles. A new view with the man with a beard, gesturing as he talks, the text \"Giveaway Link Below\" appears on screen. The man points with his finger. The man claps his hands. The man gestures. The man speaks and moves his hand. A view of the four phones lined up next to each other. The view returns to the man with a beard. The man speaks and gestures with his hands. The view changes to the table with four mobile phones and charging cables, and wall chargers. A hand picks up a charger showing the logo. The hand releases the charger. The view changes to the man with a beard, who makes a small circle with his hand and talks. He then makes a fist and gestures with his hand. He continues to speak, then points his finger. The screen changes to a new view with the four mobile phones. Then a final view of the man talking and gesturing with his hands, and the text \"Giveaway Link Below\" appears on screen.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "When the phone has been charging for 41 minutes, what time does the screen of the first phone display?",
+ "answer": "C",
+ "candidates": [
+ "A. 8:15.",
+ "B. 8:14.",
+ "C. 8:13.",
+ "D. 8:12."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "At the 41-minute mark, which smartphone had a battery level of 79%?",
+ "answer": "B",
+ "candidates": [
+ "A. Oppo Reno 10x Zoom.",
+ "B. OnePlus 7 Pro.",
+ "C. Huawei P30 Pro.",
+ "D. Samsung Galaxy Note 10 Plus."
+ ]
+ }
+ },
+ "uCSbaAHT": {
+ "video_id": "uCSbaAHT",
+ "video_duration": "52s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. It then transitions to a scene of a group of people with their hands in the center. The text \"Community\" appears in white. The video transitions again to a woman reading a book with text that says \"Lore\". It then cuts to two people playing a computer game with the text \"Gameplay\". A shot of a table with various items and the text \"Addiction\" is shown. The video moves to an image containing several characters in front of a backdrop, which then transitions to a game screen with a person playing in a bottom corner. The text \"The league of legends community\" is present in the center. The video transitions to a screen with text saying, \"Why is League of Legends Community so Toxic?\". The video transitions to a website showing the text \"How to Stop Being Toxic in League of Legends\". The next screen contains text saying, \"League's toxicity falls on Riot and the community\". The next website text reads, \"Why League of Legends Community is Toxic and Potential Fixes\". This is followed by another website with the headline \"Keeping Your Cool: A Guide to Tackling Toxicity in League of Legends\". A headline with the text \"Twitch streamer Bizzleberry calls out Riot for not punishing toxic streamers\" is shown with an image of a man. Then there is a shot of text saying, \"Riot is building a 'good behavior island' to deal with toxicity\". The video transitions to a black screen followed by an illustration of a medieval town. An image with a search bar that says \"Black Death\" along with text describing it is shown. The next image shows a man in a blue shirt and jeans. A black screen then appears, followed by a woman and a man sitting across from each other, with a window behind them. The next scene shows the man holding a \"League of Legends\" sign in front of his face. The video then shows two character figures in front of a background with a large structure. 
It then shows a side view of one of the characters with the other next to them. The scene changes to a landscape with a body of water and a structure. It then transitions to a car drifting in the snow. A car with flames coming out of the back of it is then shown also driving in the snow. The video cuts to a cartoon clown driving a green car. A picture of a man and woman facing opposite directions is then shown. A black screen appears, followed by a large image with the text \"League of Legends\". The Riot Games logo appears on a red background. A person carrying a box with the Riot Games logo is then shown. A picture of a long hallway is shown with a red arrow on the floor. The video concludes with a black screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many artillery shells are chasing the car as it makes a sharp turn on the snowy mountain in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Two.",
+ "D. One."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Which game do the male and female characters seen by the seaside in the video belong to?",
+ "answer": "C",
+ "candidates": [
+ "A. Chinese Paladin 3.",
+ "B. League of Legends.",
+ "C. Valorant.",
+ "D. Honor of Kings."
+ ]
+ }
+ },
+ "YUTPpLvp": {
+ "video_id": "YUTPpLvp",
+ "video_duration": "262s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "Two men stand in front of a camera on a tripod against a white background. The man on the left wears a light blue, short-sleeved collared shirt. The man on the right wears a dark blue, long-sleeved, collared shirt, a checkered bow tie, and a dark brown fedora. He also wears a name tag. The view changes to the back of the camera, displaying a small toy car. A hand places a black and white striped card and a gray card in front of the toy car. The hand moves the gray card out of frame, then adjusts the camera\u2019s focus. A black case with a black and white striped pattern on one side and a gray card on the other is shown, standing on a small built-in stand. The case is folded and then unfolded again. The man in the fedora holds the case up in front of the camera. He then opens the case, revealing a grid of color swatches. He closes and opens the case again, showing both sides. The view again changes to the camera\u2019s display, which shows the striped card and the toy car. The hand adjusts the camera\u2019s focus, zooming in and out on the striped card. The hand brings up the peaking level menu, and selects \u201cHigh.\u201d The hand returns to the previous screen and continues adjusting the focus of the camera, zooming in and out. The man in the fedora again gestures with his hands as the other man talks. The camera view returns, displaying the small car, and the hand continues adjusting the focus. The hand brings up the peaking level menu again, selects \u201cLow,\u201d and then returns to the main display. The focus is adjusted once again.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who says, \"This is not creative enough\" in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A man in an orange shirt wearing a hat.",
+ "B. A bald man in an orange shirt.",
+ "C. A man in a deep blue shirt wearing a hat.",
+ "D. A bald man in a light blue shirt."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the man wearing a deep blue shirt and a hat located in relation to the camera in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. On the left rear side.",
+ "B. On the right rear side.",
+ "C. Directly in front.",
+ "D. Directly behind."
+ ]
+ }
+ },
+ "dUGXVmDm": {
+ "video_id": "dUGXVmDm",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two individuals, one wearing an orange shirt and dark glasses and one wearing a dark blue shirt, stand on a stage, each with a violin and music stand in front of them. They play their violins. A woman wearing a light blue shirt is also on the stage, seated at a keyboard. She has a black microphone in front of her face, and appears to be playing the keyboard and singing. The two individuals finish playing and bow slightly forward with their violins. They then stand up straight with violins in hand and look forward.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Does the violin on the left have more rhythm than the one on the right?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "YGhXDutS": {
+ "video_id": "YGhXDutS",
+ "video_duration": "320s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A food processor with pinkish ground meat inside it sits on a kitchen counter. Next to it is an empty white bowl. The food processor is removed, and three bowls are visible: one with breadcrumbs, one with ground meat, and a third with an orange-brown paste-like substance and a metal fork in it. The person puts on clear plastic gloves. They lift the bowl with the breadcrumbs to show more of the breadcrumbs, then lift and show the bowl with the ground meat, and then stir the orange-brown paste in the third bowl. The person picks up a sealed plastic bag of croutons and holds it up to the camera, showing a picture of croutons and salad on the bag. The person puts the bag down and then slides the bowl of breadcrumbs closer to themselves. They then mix the orange-brown paste into the ground meat, scraping the paste out of the bowl and into the bowl containing the ground meat. They mix the ingredients in the bowl with their gloved hands. They then begin shaping the mixture into small patties and coating them with the breadcrumbs. They lightly coat a large frying pan with oil, then gently place the patties in the hot oil. The patties begin to sizzle and brown. They add two more patties to the pan. A slotted spatula is brought in and used to check the underside of the patties and move them around the pan as they cook. The camera angle shifts slightly, providing a better view of the pan and patties, which are becoming crispy brown on one side. They continue the cooking process, flipping the patties as needed.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man in the video use fish paste to coat the bread crumbs?",
+ "answer": "A",
+ "candidates": [
+ "A. Make the fish paste easier to mold.",
+ "B. Extend the shelf life of the fish paste.",
+ "C. Enhance the flavor of the fish paste.",
+ "D. Reduce cooking time."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where are the breadcrumbs located in relation to the fish paste in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. To the left of the fish paste.",
+ "B. Below the fish paste.",
+ "C. To the right of the fish paste.",
+ "D. On top of the fish paste."
+ ]
+ }
+ },
+ "ONjOsTJt": {
+ "video_id": "ONjOsTJt",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three musicians sit on a stage with maroon curtains behind them. The stage floor is polished wood. From left to right, the first musician plays a pipa, a pear-shaped lute. The musician wears a pale blue, long-sleeved shirt, dark pants, and white sneakers. The second musician plays the erhu, a two-stringed bowed instrument. This musician wears a peach-colored floor-length gown. The third musician plays the guzheng, a zither. This musician wears a white top with long, flowing sleeves, and a long red skirt. A microphone on a stand is positioned at center stage between the pipa and the erhu players, though no one is singing. The musicians play their instruments, their fingers and hands moving over the strings. Their bodies sway slightly as they perform.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who is playing the pipa in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The man on the left.",
+ "B. The woman on the right.",
+ "C. No one is playing the pipa.",
+ "D. The woman in the middle."
+ ]
+ }
+ },
+ "TxUufWUR": {
+ "video_id": "TxUufWUR",
+ "video_duration": "118s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A Minecraft character resembling Dr. Mario stands inside a wooden structure with checkered flooring. There is a pixelated table and chair, a chest attached to the wall, and some Super Mario themed objects on shelves. The character turns and walks towards an opening in the wall, revealing a grassy area surrounded by water. The character gets into a small boat and rows away from the island. The view shifts to third-person as the boat passes other small islands, including a GameStop, a Police Station, a McDonald's, and a building with an Austrian flag. The water changes to a darker turquoise and back again. The character approaches the McDonald's island, rows past it, and pulls the boat ashore. A character wearing a gold helmet and armor jumps out of the boat, runs and jumps among some small pigs and pixelated flowers, and then runs inside the restaurant building. The doctor character disembarks and approaches the other character, who runs and jumps inside. The doctor stands near the flowers while the other character runs and jumps inside the building, at one point colliding with one of the glass window panes. Another character resembling Dr. Mario walks out of the restaurant and stands near the doctor. The helmeted character is now seen behind glass inside the restaurant. The helmeted character briefly exits the restaurant, rejoins the doctor, and appears to start speaking.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What happens next after Bowser Jr. finishes his Happy Meal?",
+ "answer": "D",
+ "candidates": [
+ "A. He leaves McDonald's Island.",
+ "B. He starts playing a game.",
+ "C. He goes home.",
+ "D. He encounters McDonald's employees."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What silly gesture did Ronald McDonald make in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. He jumped onto the steps and shouted for beef.",
+ "B. He kept pacing back and forth.",
+ "C. He slipped.",
+ "D. He danced."
+ ]
+ }
+ },
+ "ymEposCa": {
+ "video_id": "ymEposCa",
+ "video_duration": "75s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A man with reddish-brown hair and a goatee sits in front of a white banner with two gold phoenixes and the words \"VALLEY FIRE PHOENIX RISING.\" He is wearing a dark suit jacket and a light-colored button-down shirt. He speaks directly to the camera. A calendar page shows the number 25 highlighted in a red box. A document titled \"Petition In Lieu of Filing Fee\" appears on screen. It names Nils I. Palsson as a candidate for United States Representative 5th District. It mentions the Presidential Primary Election to be held on June 7, 2016. The date issued is handwritten as \"1/27/2016.\" There are blank spaces for signatures. The man in the suit reappears. A logo for the Valley Fire Rebuild Expo is displayed. It shows a house and two trees on a green hill with a stylized wave beneath. The expo will be held February 20-21, 2016, at the Twin Pine Casino in Middletown. The man reappears. A group of people poses for a photo in front of a house. A projected image of Bernie Sanders with the word \"Bernie\" beneath it is visible on the house. Tree stumps and various items are scattered around the yard. The text \"Potluck Sunday 2/21 and Wednesday 2/24\" is displayed at the bottom of the screen. The man reappears. The URL www.peoplesvictory.us appears across the bottom of the screen. A group of people sit on a patterned rug and couches in a living room. A small dog walks among them. Some people write on notepads. A wooden bowl sits on the rug. The words \"SIGNATURES DUE THURSDAY 2/25 WILL YOU SIGN AND SHARE?\" are displayed on screen. The man reappears, and the video ends.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "At the beginning of the video, what activity is the man with reddish-brown hair participating in?",
+ "answer": "B",
+ "candidates": [
+ "A. Participating in activities at the welfare institute.",
+ "B. Participating in the Huofu Reconstruction Exhibition.",
+ "C. Participating in an assembly election.",
+ "D. Participating in a community potluck."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In the video, where is the banquet taking place?",
+ "answer": "B",
+ "candidates": [
+ "A. Office.",
+ "B. At home.",
+ "C. Community center.",
+ "D. Exhibition site."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What are the people holding pens doing at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Signing a petition.",
+ "B. Practicing their signatures.",
+ "C. Marking important dates on a calendar.",
+ "D. Drawing in a notebook."
+ ]
+ }
+ },
+ "NIMhhcYe": {
+ "video_id": "NIMhhcYe",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman with glasses and long brown hair, wearing a light-colored short-sleeved shirt and a long skirt, sits at a black grand piano, playing with both hands. A young man, wearing a dark short-sleeved t-shirt and shorts, sits to the right of the piano, playing a cello. He holds the bow in his right hand and uses his left hand to position his fingers on the strings. A black music stand with sheet music on it is positioned next to him, behind the cello. A microphone is positioned in front of the cello, pointing towards it, connected to a stand. A larger microphone is positioned behind the cello, angled downward. A patterned area rug is underneath and around the piano and cello. Beige curtains hang in the background, and a section of dark curtains or backdrop is partially visible on the right. Text appears in the lower left corner: \u201cTchaikovsky\u201d and \u201cChanson Triste, Op. 40, No.2.\u201d Later, text in a different font appears in the upper right corner. Then, different text appears in the lower left and lower right in a language other than English. The video then changes to a close-up view of the interior of the piano as the woman continues playing, revealing the hammers and other mechanical parts. Text appears in the center of the screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "How many instruments in the video remain silent from beginning to end?",
+ "answer": "A",
+ "candidates": [
+ "A. Zero.",
+ "B. Two.",
+ "C. Three.",
+ "D. One."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At which part of the video do the subtitles appear in the center of the screen?",
+ "answer": "B",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the end of the video.",
+ "C. At the beginning of the video.",
+ "D. Throughout the entire video."
+ ]
+ }
+ },
+ "PqdPigCf": {
+ "video_id": "PqdPigCf",
+ "video_duration": "134s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Art Exhibitions",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with three vertical panels labeled \u201cGAMEPLAY,\u201d \u201cVISUAL,\u201d and \u201cNARRATIVE\u201d with green, blue, and orange backgrounds respectively, each containing two light colored rectangles. A mouse cursor is visible on the bottom right of the screen. The mouse cursor clicks the rectangle labeled \u201cSensor Eyes\u201d in the \u201cGAMEPLAY\u201d panel. The image transitions to a single, white security camera mounted on a tan, horizontal siding background. The scene then transitions back to the three panel arrangement. The mouse cursor then clicks the rectangle labeled \"Oppressive USA Regime\" on the NARRATIVE panel. This transitions to an image of two white security cameras and two black traffic lights mounted on a metal frame against a light blue sky. This transitions back to the three panel arrangement, and the mouse cursor then clicks the rectangle labeled \"8 legs?\" on the VISUAL panel. The scene transitions to a stylized drawing of a gray, mechanical spider-like figure beside the photo of the two security cameras on the metal frame. The scene transitions to an animation of a blue figure sitting on a stool at a desk while holding a pen-like object to a computer. Text appears that reads, \"A while later...\" The scene transitions to three drawings of a spider like creature. The scene transitions back to the security cameras on the metal frame, and finally to a single white security camera. The video then transitions to a silhouetted man in a dark room moving around blue laser beams. Finally, the video transitions back to the spider-like creatures, and a line drawing of a spider with a camera eye. This image is then shown next to an image of the previous drawing.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How high can the spider drone jump in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. 10 feet.",
+ "B. 40 feet.",
+ "C. 20 feet.",
+ "D. 30 feet."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occur to the camera on the spider drone in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. It becomes bigger.",
+ "B. It becomes wider.",
+ "C. It becomes narrower.",
+ "D. It becomes smaller."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the man's identity in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Video producer.",
+ "B. TV host.",
+ "C. Mechanical designer.",
+ "D. Game designer."
+ ]
+ }
+ },
+ "FdjbERgJ": {
+ "video_id": "FdjbERgJ",
+ "video_duration": "43s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a split screen, with a man in the top half and a man in the bottom half. The man in the top half is wearing glasses, a light colored shirt, and headphones. There is a microphone in front of him. The background behind him is a dark, patterned wall. The man in the bottom half is wearing a light blue collared shirt. The background behind him is a reddish patterned tapestry or rug and what appears to be a wooden ceiling. Throughout the video, text overlays the screen.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the perspective of the man with curly hair in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Completeness does not equate to truthfulness.",
+ "B. Consistency does not equate to authenticity.",
+ "C. Completeness is equivalent to truthfulness.",
+ "D. Consistency is equivalent to authenticity."
+ ]
+ }
+ },
+ "gyqldzOm": {
+ "video_id": "gyqldzOm",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two young girls sit on chairs on a stage and play cellos. They are positioned in front of a black grand piano. Behind them is a red curtain decorated with multicolored, flower-shaped balloons. Text in an Asian language appears on the wall above the red curtain. The girl on the left wears a sleeveless black dress with a sparkly pattern over a white top. She has a pink ribbon tied around her waist. Her hair is dark and tied back in a ponytail with a pink bow. The girl on the right wears a short-sleeved white dress. A music stand sits between them. The camera occasionally zooms in on one of the girls, then returns to the wider shot of both of them. The audience sits in chairs in the foreground. Balloons are visible in the lower corners of the frame.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur in the volume of the music in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. From low to high to low, and then to high.",
+ "B. From low to high and then back to low.",
+ "C. From low to high.",
+ "D. From high to low."
+ ]
+ }
+ },
+ "KJrqxwgu": {
+ "video_id": "KJrqxwgu",
+ "video_duration": "29s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a peach-colored sports bra and leggings is standing on a tan mat on a beach. The person has blonde hair pulled back in two braids and is wearing pink weights on their wrists. The person moves their arms and hands out to the sides and then back towards their body, and bends their torso at the waist. The background shows a beach with light sand, an ocean with waves, and a sky with clouds. At different points, two additional people can be seen walking in the background on the sand behind the first person. A timer is visible in the top right, counting down from 30 to 0, along with text that reads \u201cEXERCISE 6/10\u201d. A small image labeled \"UP NEXT\" is shown at the top left.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the blonde woman in the video stretch her arms back?",
+ "answer": "C",
+ "candidates": [
+ "A. Eighteen.",
+ "B. Sixteen.",
+ "C. Seventeen.",
+ "D. Nineteen."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur in the volume of the background music in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The volume decreases as the character's speaking voice decreases.",
+ "B. The volume remains unchanged.",
+ "C. The volume increases as the character's speaking voice increases.",
+ "D. The volume decreases as the character's speaking voice gets louder."
+ ]
+ }
+ },
+ "SKpdjxSA": {
+ "video_id": "SKpdjxSA",
+ "video_duration": "142s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a wide shot of a flat snowy landscape with tire tracks on the snow. Three people are standing next to two snowmobiles, one of which is yellow and one of which is white with black details, in the middle of the landscape. The people are wearing winter clothing, including hats, gloves, and coats. A large fish is superimposed on the scene, and then a close-up view is shown of two of the people standing beside a circular hole in the snow. One of the people, wearing camouflage clothing, crouches and appears to put fishing line into the hole, and the other person, wearing a red-and-black plaid coat, is holding poles in the air. A red bobber is visible at the end of the line. The person in camouflage clothing takes the line from his hand and lowers it through the hole, and the other man is holding fishing line on a reel. Another man, wearing a red-and-black striped jacket, is now visible and appears to be helping at the hole. The man in camouflage is shown pulling a fish out of the hole, which is then held up and displayed by a man. The man holding the fish is then shown from a distance holding the fish. He holds it up once more, close to the camera.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many snowmobiles are there in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Two vehicles.",
+ "B. One vehicle.",
+ "C. Four vehicles.",
+ "D. Three vehicles."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a table visible in the fleeting shot at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. There is.",
+ "B. I am not sure.",
+ "C. There isn't."
+ ]
+ }
+ },
+ "eWCFtynR": {
+ "video_id": "eWCFtynR",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are playing flutes. The person on the left is wearing a light gray sweatshirt and a dark ponytail. They are using a wooden flute. The person on the right is wearing a light gray, long sleeve shirt, and glasses. They are using a silver colored flute. Both are looking down, appearing to read from the music stand in front of them. A screen in the background displays a movie titled \"La La Land\" and changes scenes periodically. There is a piano, a photo, and several items on the piano in the background. The background also has dark red curtains.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "What is the other instrument sounding in the video when the woman in the long-sleeved hoodie is playing the flute?",
+ "answer": "B",
+ "candidates": [
+ "A. Clarinet.",
+ "B. Flute.",
+ "C. Saxophone.",
+ "D. Bagpipe."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the instrument on the left in the video louder than the instrument on the right?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. Unable to determine.",
+ "C. Yes."
+ ]
+ }
+ },
+ "IvzBEWnV": {
+ "video_id": "IvzBEWnV",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Four people stand behind conga drums. Two women wear chartreuse tops and dark pants, one woman wears a magenta top and dark pants, and one man wears a dark purple button-down shirt and dark pants. Behind them are a drum set, a xylophone, and other percussion instruments. The four drummers play a rhythm on their congas. The woman in magenta and the man next to her tap the rims of their drums. The drummers continue playing a complex rhythm, occasionally tapping the rims or bodies of their drums. Their rhythm speeds up. The drummers look at each other as they continue playing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the drum on the left louder than the drum on the right?",
+ "answer": "A",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. Not applicable.",
+ "D. I am not sure."
+ ]
+ }
+ },
+ "nWavCdFB": {
+ "video_id": "nWavCdFB",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two Asian men stand facing forward, playing violins. They are positioned slightly offset from each other, with the man on the viewer's left standing slightly in front. A black music stand sits between them, holding open sheets of music. Both men wear glasses and black suit jackets. They move their bows across their violins, their fingers moving on the fingerboards. The background is a light-colored wall with rectangular panels in a slightly darker shade.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video do the two instruments play exactly the same music together?",
+ "answer": "C",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the beginning of the video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "OWGHZenn": {
+ "video_id": "OWGHZenn",
+ "video_duration": "500s",
+ "duration": ">8min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A slot machine game, Madame Destiny Megaways, is shown on the screen. The reels spin, displaying various symbols like a black cat, an owl, candles, card letters, and a crystal ball. The initial win is $180. Subsequent spins yield wins, and the total win amount increases. A bonus round is triggered, leading to a screen with a wheel divided into segments showing numbers of free spins and multipliers. The wheel spins, landing on 10 free spins and a 5x multiplier. The free spins commence, with more wins accumulating. A \"sensational\" win of $21,510 is displayed. The free spins continue, with the total win rising. The wheel spins again and lands on 20x. More free spins ensue. The total win increases steadily. Another wheel spin triggers, this time landing on 10x. The remaining free spins play out, showing a mix of smaller and larger wins. The final win amount displayed is $180,472.50. A man in the bottom right corner of the screen reacts to the large win.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What type of website is this?",
+ "answer": "A",
+ "candidates": [
+ "A. An educational website.",
+ "B. A gambling website.",
+ "C. A news website.",
+ "D. A lottery website."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the result of the man's first spin of the wheel?",
+ "answer": "D",
+ "candidates": [
+ "A. 10 spins resulting in 20X.",
+ "B. 5 spins resulting in 10X.",
+ "C. 5 spins resulting in 8X.",
+ "D. 10 spins resulting in 5X."
+ ]
+ }
+ },
+ "VLcfNbrH": {
+ "video_id": "VLcfNbrH",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman and a man are in a room. The woman is on the left, sitting on a chair, and playing a long silver flute. She is wearing a black top and a long teal and black skirt. The man is on the right, sitting on a chair, and playing a tall dark-brown instrument. He is wearing a white long sleeve shirt and black pants. Between them, two music stands are holding sheet music. In the background are two images on the wall and a black piano in the far left. A group of people can be seen in the bottom right corner.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Is the flute on the left more rhythmic than the bassoon on the right?",
+ "answer": "B",
+ "candidates": [
+ "A. Cannot be determined.",
+ "B. No.",
+ "C. Yes."
+ ]
+ }
+ },
+ "yulhbZOp": {
+ "video_id": "yulhbZOp",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman with red hair plays a black grand piano. Another woman with light brown hair pulled back in a bun plays at a keyboard partially obscured behind the piano. A third woman with pulled-back brown hair plays a flute. A man in a suit sits in a wooden chair with a maroon cushion. He looks down at his clasped hands in his lap. A thick book lies on a small table beside him. Another woman with long blonde hair sits behind the man. The musicians and the seated individuals are within a white-walled room with white wainscoting and maroon upholstered chairs visible behind the blonde woman. A potted plant is partially visible behind the flutist. The man looks up briefly, adjusts his glasses, then resumes looking down at his hands. The woman at the keyboard occasionally turns pages in the book of music before her. The blonde woman tucks a lock of hair behind her ear. The man and the women remain in these general positions for the duration of the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Is the piano playing consistently throughout the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Cannot be determined.",
+ "B. No.",
+ "C. Yes.",
+ "D. Not applicable."
+ ]
+ }
+ },
+ "XVwxNKuv": {
+ "video_id": "XVwxNKuv",
+ "video_duration": "78s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with an establishing shot of a news studio, which features a large screen displaying a map of the world. Text on the bottom of the screen indicates that this is a \"developing story\" concerning the APEC dinner. The large screen then shows a live feed of a person standing on a balcony, with ocean and beach in the background. The person wears a white t-shirt with black letters, a black strap over their shoulder and is moving their arms and speaking. The shot then switches to just the live feed and continues to show the person speaking, in front of the beach background. The t-shirt says 'OCCUPY WITH ALOHA'. The person moves their hands and arms throughout the feed. The time and location are listed in the bottom left hand corner as \"09:21 Honolulu, Hawaii\" and also displayed on the large screen next to the person's face. Text at the bottom of the screen indicates that the person being shown is \"Makana Hawaiian Guitarist.\" The live feed shows the person's face moving and mouth open as they are speaking. The video ends showing the person talking while the text at the bottom of the screen changes.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "On which day of the week does the broadcaster in the video connect with the young people?",
+ "answer": "B",
+ "candidates": [
+ "A. Sunday.",
+ "B. Saturday.",
+ "C. Friday.",
+ "D. Thursday."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the man start playing the guitar?",
+ "answer": "C",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the end of the video.",
+ "C. In the middle of the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "gBswQiaB": {
+ "video_id": "gBswQiaB",
+ "video_duration": "544s",
+ "duration": ">8min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with a bald cap, a gray wig, glasses, a black suit, and a long black beard stands behind a table. On the table are various bottles of water and clear plastic cups filled with water. A pH scale graphic appears on screen. The man picks up a small bottle and squeezes drops of a red liquid into the first cup. The water in the cup changes color, starting with orange and eventually turning a dark turquoise. The pH scale graphic disappears. The man proceeds to pour water from each bottle into the cups in front of him. He picks up a small white electronic device and places it in the first cup. He appears to be reading the device and writing down his findings on a notepad. He repeats this process with each water bottle. A table appears on-screen and is populated with information from the device. The categories are TDS and Conductivity. The man removes the device from the cups.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the audio, in what order did the man introduce the water from these 11 brands?",
+ "answer": "C",
+ "candidates": [
+ "A. (d) (f) (b) (g) (a) (c) (h) (i) (j) (k) (e).",
+ "B. (c) (b) (d) (i) (h) (e) (k) (j) (f) (g) (a).",
+ "C. (h) (b) (j) (d) (e) (f) (g) (a) (i) (c) (k).",
+ "D. (b) (g) (f) (c) (h) (i) (d) (j) (k) (a) (e)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What property of water is the man testing at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The total dissolved solids (TDS) value of the water.",
+ "B. The temperature of the water.",
+ "C. The electrical conductivity of the water.",
+ "D. The pH level of the water."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual accessory is the man wearing in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A black bald wig.",
+ "B. A blue bald wig.",
+ "C. A white bald wig.",
+ "D. A red bald wig."
+ ]
+ }
+ },
+ "StYFklhZ": {
+ "video_id": "StYFklhZ",
+ "video_duration": "57s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A completed mural on the side of a trailer depicts three pink pigs running from left to right, flames behind them, a hawk on the left, a rattlesnake on the right, desert mesas in the background, and blue sky above. The word \"PIPELINE\" is written across the top left. The video then transitions to show the process of creating the mural. A man climbs onto the top of the plain white trailer. Blue tape is applied to the bottom edge of the trailer. A man and a young girl work together to apply tape near the back of the trailer. A man on scaffolding paints the background blue and purple using spray cans. Yellow flames are added. In a close-up, the artist sits on scaffolding, using brown spray paint to outline and fill in the desert mesas. Then yellow is added to highlight the flames. He switches to white and adds highlights to the pigs. Orange and yellow are used to further detail the flames. The scene briefly cuts to a young girl sitting by a chicken wire fence, observing several chickens. A rooster dust bathes near the fence. Back at the mural, the artist uses orange, red, and yellow spray paint to color in a hawk. A blonde woman in sunglasses watches from the ground. A man adds orange details to the rattlesnake.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the length of the car being painted in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 52 feet.",
+ "B. 51 feet.",
+ "C. 53 feet.",
+ "D. 54 feet."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the man doing with the bottle in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Watering plants.",
+ "B. Drinking water.",
+ "C. Spraying paint.",
+ "D. Filling it with water."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many cans of spray paint are on the shelf in the second half of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Five cans.",
+ "B. Four cans.",
+ "C. Three cans.",
+ "D. Six cans."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video do two chickens appear to be fighting?",
+ "answer": "D",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. Throughout the video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "UECHEryZ": {
+ "video_id": "UECHEryZ",
+ "video_duration": "42s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Astronomy",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a close-up of a man with glasses wearing a suit and a yellow tie, in front of a background with scattered white dots. He is talking. Subtitles appear below him in Arabic, and, at different moments, also in English. The video then cuts to a shot of several flags waving in the wind, followed by a shot of a sign with the \"ESO\" logo. The video ends with a shot of a building with two people walking toward it.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "Which countries are represented by the flags in the row from left to right in the video? (a) Germany. (b) Belgium. (c) Denmark. (d) Finland. (e) France. (f) Italy. (g) Luxembourg. (h) Portugal. (i) UK. (j) Sweden. (k) Switzerland.",
+ "answer": "A",
+ "candidates": [
+ "A. (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k).",
+ "B. (b) (a) (c) (d) (e) (f) (h) (g) (i) (j) (k).",
+ "C. (j) (c) (a) (e) (d) (f) (h) (g) (b) (i) (k).",
+ "D. (b) (c) (a) (d) (e) (f) (h) (g) (j) (i) (k)."
+ ]
+ }
+ },
+ "gZUJOumt": {
+ "video_id": "gZUJOumt",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person in a tiger costume opens a white refrigerator. The inside of the refrigerator has two lightbulbs, and contains two white bottles, a loaf of bread, some red and green vegetables, and a small fruit with a stem and two peppers. A person with red hair appears. The tiger then drinks from two brown bottles. There is a close-up of the tiger's face, then the tiger puts a paw to its mouth. A man wearing a black and white striped coat with a white captain's hat appears and holds a long item. He opens his coat to reveal several items attached to the inside. A man in a black suit and hat appears, holding a wrapped item. Two women are next shown, then a man in a tan suit and hat carries a brown briefcase into the room. The tiger walks around. The tiger is shown again, close up. Three people are shown, raising their arms in the air. The tiger puts a cake and plates onto a small table. It hands a plate to the person with red hair. The tiger gives a bouquet of flowers to a woman in a green dress. The tiger is ridden by the red-haired person. The red-haired person and the man in the tan suit are standing and facing each other. Three people wave their arms, then a person in a tiger costume is shown sitting at a small table. Three people are shown, waving, then the video shows a title screen.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What does the man in the red tie do after opening the bag in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Display the bag.",
+ "B. Empty the contents of the bag.",
+ "C. Receive a loaf of bread.",
+ "D. Search for items inside."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In which position was the woman in a green outfit relative to the girl with a bow when the tiger was eating?",
+ "answer": "C",
+ "candidates": [
+ "A. Above.",
+ "B. In front.",
+ "C. On the right.",
+ "D. On the left."
+ ]
+ }
+ },
+ "OnQfwKVW": {
+ "video_id": "OnQfwKVW",
+ "video_duration": "30s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a dark screen. The video then transitions to a night time scene where there is a large white rocket with a yellow top on a launchpad next to a tall tower structure with orange lights. The camera moves to focus closely on the bottom of the rocket. There is steam coming from the three rocket nozzles. Then sparks start shooting from under the rocket, followed by flames which turn into a plume of smoke. The camera then pulls back to a wider shot, as the rocket starts to rise, the rocket is engulfed by a plume of smoke and fire. The rocket is still engulfed in flame as it rises into the dark sky, with a bright light below the rocket. The rocket continues to rise until the camera switches to a very bright light with only the outline of the rocket. The camera goes back to the rocket flying above the smoke which is illuminated by the bright fire. The video ends with a shot of the rocket, now smaller, moving into the sky with the plume of smoke and fire behind it, and a bright light underneath. The ground is visible in the lower portion of the screen.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of events in the rocket launch shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Inspect the equipment, ignite the engine, ignite the booster, launch the rocket, and separate the stages.",
+ "B. Inspect the equipment, activate the sound suppression system, ignite the engine, ignite the booster, launch the rocket, and separate the stages.",
+ "C. Activate the sound suppression system, ignite the engine, ignite the boosters, and launch the rocket.",
+ "D. Activate the sound suppression system, ignite the engine, ignite the boosters, launch the rocket, and separate the stages."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Does the volume in the video change?",
+ "answer": "D",
+ "candidates": [
+ "A. Yes, the sounds of people and the rocket become louder.",
+ "B. Yes, the sounds of people and the rocket become quieter.",
+ "C. No, the sounds of people and the rocket remain the same.",
+ "D. Yes, the sound of people remains unchanged, but the sound of the rocket becomes louder."
+ ]
+ }
+ },
+ "poiuzmrP": {
+ "video_id": "poiuzmrP",
+ "video_duration": "121s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A bald person speaks directly to the camera. Behind them is a white wall and a shelf with various objects on it, such as a black cloth with the letters \"DOG\" in white, a cluster of silver spikes, a black boot with a white skeleton design on it, and a book with the words \"ScytHe tLepPo\" on its cover. The wall also has several black and white patches with graphics on them. The person is wearing a black t-shirt with \"CITY OF Compton\" printed on it in light blue text. The video switches to footage of a screen displaying numbers and data in white text and graphs, followed by footage of a large wall of ice. The video then shows a large structure with the words \"Shanghai 2044\" at the top. Next, the video shows a vast, grey, mountainous terrain. Then, it returns to the person speaking into the camera. The video then displays footage of armored characters from a video game. The video then returns to the person speaking into the camera. Then it shows a shot of the inside of a spaceship and the video returns again to the person talking into the camera.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order does the man in black evaluate the movie? (a) Poor camera work (b) Great special effects (c) Bad translation (d) Not enough details (e) Great casting (f) Fine machinery.",
+ "answer": "B",
+ "candidates": [
+ "A. (b) (a) (c) (e) (f) (d).",
+ "B. (a) (b) (c) (d) (e) (f).",
+ "C. (b) (a) (c) (d) (e) (f).",
+ "D. (a) (b) (d) (e) (c) (f)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "How does the blogger's overall mood change?",
+ "answer": "D",
+ "candidates": [
+ "A. Excited, calm, unhappy.",
+ "B. Unhappy, excited, calm.",
+ "C. Unhappy, calm, excited.",
+ "D. Calm, unhappy, excited."
+ ]
+ }
+ },
+ "YssVAWRV": {
+ "video_id": "YssVAWRV",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man with curly brown hair and a beard plays a cello outside a white cottage with yellow window frames. He wears a light brown short-sleeved shirt and knee-length shorts. The cello rests between his legs as he plays it with a bow. A small black dog lies in the grass a few feet away from him. The cottage sits on a grassy hill overlooking a lake and mountains. The sky is blue with scattered white clouds. The man continues to play the cello throughout the video.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the man in short sleeves doing with the cello in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Cello teaching.",
+ "B. Cello practice.",
+ "C. Cello maintenance.",
+ "D. Cello performance."
+ ]
+ }
+ },
+ "ZFQhabpn": {
+ "video_id": "ZFQhabpn",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman in a white robe plays a black grand piano. A man in a gray suit and blue and gold tie sits and plays a trumpet. Another man, wearing glasses, a white shirt, and gold tie, sits and watches. Behind them, eight people stand in purple robes facing the audience. The audience sits in the foreground, partially obscuring the view. The musicians and choir are on a raised platform, separated from the audience by a wooden railing. Two dark, rectangular objects resembling screens are placed on the platform behind the grand piano. A wall of square, light-brown panels forms the backdrop.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "What other instrument is playing alongside the piano in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Trumpet.",
+ "B. Flute.",
+ "C. Clarinet.",
+ "D. Saxophone."
+ ]
+ }
+ },
+ "fVeryVen": {
+ "video_id": "fVeryVen",
+ "video_duration": "352s",
+ "duration": "4-6min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person is visible in the bottom left corner of the screen throughout the video. A window with eight icons appears on screen, and a textbox with text in it is directly below the window, and a save button is in the bottom right. The text box reads \"A lot of Farfetch'd have been spotted on Route 221!\". The \"Pokemon\" icon is selected and another window opens that lists two pokemon, named Goku and Gible. A 3d image of Goku appears on the right. Then it shows Gible on the right instead. Then it shows Goku again on the right. The window closes. The first eight icon window reappears, then fades, and a map of the region takes its place. A yellow marker moves on the map, then the map is replaced by the icon window, followed by a view of a game environment with buildings. A person walks around the area, then enters a path with green plants. Then, the person begins to ascend a rocky cliff area. A message window appears, \"The wall is very rocky... Would you like to use Rock Climb?\", then a second message window saying, \"A wild Bibarel helped out by using Rock Climb!\", then the Bibarel appears on screen using an attack. After this, the game environment reappears, and the character ascends the cliff, and then another message window saying \"The wall is very rocky... Would you like to use Rock Climb?\" then it fades into a black screen. Then, a game scene in a sandy environment with multiple buildings reappears. The scene shifts, and then a region map is displayed with a textbox and a yellow marker, \"Route 228 This rough path is harshly raked by a horizontally blowing sandstorm driven by strong winds off the sea.\". Then the sandy environment reappears with a character standing in the middle. A message window appears, \"I feel invigorated from that battle, even though we lost. After all, we gave it everything we had.\". Then, the eight icons appear in a window, then a message box appears with the text: \"!!\". The person continues to walk in the sandy environment, then the eight icon window reappears, then the \"Bag\" icon is selected. The \"medicine pocket\" menu appears, listing several items, followed by the key items pocket menu. The item \"Vs. Seeker\" is selected, and \"Deregister\" and \"Cancel\" appear as options. Then, \"Explorer Kit\" is selected, and \"Use this item, View Contents, Register, and Cancel\" appears as options. The \"Register\" option is selected, and then a new window appears, asking where to register the item. The icon window reappears, then the sandy environment reappears. Then a character and another message box with \"!!\" reappears, then the character walks around. The eight icon menu reappears, followed by the text box \"A lot of Farfetch'd have been spotted on Route 221!\". The pokemon menu reappears, then the video fades to black. Then a battle screen appears with two characters on screen. One is a green-haired woman with an orange top, and the other is a character with black hair, and the text box \"You are challenged by Ace Trainer Ariana!\" appears. Then the woman is closer on screen, then she sends out \"Ambipom\", and the purple monkey creature appears on the screen. A dialogue box appears, \"Go on, Goku! I know you can do it!\". Then, Goku appears using an attack. Then, \"The sandstorm is raging\" and then \"Goku is expecting you to come up with a great strategy!\". Then the menu appears at the bottom right, \"battle, pokemon, bag, run\". The user selects \"Mach Punch\", and a text box appears saying \"It's super effective! The opposing Ambipom fainted!\". Then the screen shifts to the team selection window, showing experience points gained. The level of \"Gible\" rises to level 17, then the screen shows a list of stats that were increased due to leveling up, followed by a text box \"Gible learned Sand Attack!\". The battle resumes, and Goku is buffeted by the sandstorm. Then a text box appears \"Ace Trainer Ariana is about to send out Rapidash. Will you switch your Pokemon?\". Then \"Ace Trainer Ariana sent out Rapidash!\", then a fire horse creature appears. \"Goku used Close Combat\", and \"Goku's Defense fell!\". The fire horse used a large attack. The screen shows the team selection window again and the level of \"Gible\" rises to level 20. \"Gible wants to learn the move Bulldoze. Should an old move be forgotten and replaced with Bulldoze?\" Then the screen shows \"Goku is buffeted by the sandstorm!\". Then, a message appears that says \"Ace Trainer Ariana is about to send out Chansey. Will you switch your Pokemon?\". \"Ace Trainer Ariana sent out Chansey!\" A pink creature with a large egg appears. The \"Goku is awaiting\" message appears, then \"Chansey\" is defeated with \"Mach Punch\" and \"It's super effective!\". Then, \"That's it! I knew you could do it!\", and then \"Goku landed a critical hit, wishing to be praised!\". Then the team selection menu reappears. Then, a text box appears saying, \"You defeated Ace Trainer Ariana!\" The green-haired woman appears with a sad face, then a text box appears saying \"Hmm... You got 7,552 for winning!\". Then the video fades to black. Then the Gible creature appears in an animation, followed by a bright flash. Then the sandy game environment reappears. The character is still in the same location and a message box \"!!\" appears. The character walks, and then a text box appears: \"Hello again! Please don't hold anything back - just like before.\" Then the scene fades to black. Then, the character reappears, and then the message box, \"A lot of Farfetch'd have been spotted on Route 221!\" Then, a menu appears with a pokemon and a list, then shows \"Garchomp\". The Pokemon list then reappears with Garchomp holding a \"Lucky Egg\". The scene fades to a blue background, and then a stat page appears for Garchomp. Then a moves page appears for Garchomp, then another page of stats. Another stat page appears for Garchomp. Then the sandy environment reappears and the character walks around, then another message box appears, \"Hello again! Please don't hold anything back - just like before.\", then the screen fades to black. A final scene appears with the person from the bottom left on screen and he is talking with his hands. Then a final screen of black appears.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, which of the following sequences of events is correct? (a) Activate the battle searcher, (b) go to Route 228, (c) set the battle searcher, (d) battle with the trainer.",
+ "answer": "D",
+ "candidates": [
+ "A. (a) (b) (c) (d).",
+ "B. (a) (c) (b) (d).",
+ "C. (c) (a) (b) (d).",
+ "D. (b) (c) (a) (d)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will the man do after the match according to the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Continue searching for other trainers to battle.",
+ "B. Start the next match immediately.",
+ "C. Exit the game.",
+ "D. Return to the Pok\u00e9mon Center to treat his Pok\u00e9mon."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "In the video, why are some trainers unable to use the battle searcher?",
+ "answer": "A",
+ "candidates": [
+ "A. Because they are in caves or other inaccessible locations.",
+ "B. Because they have already been challenged and cannot be challenged again.",
+ "C. Because the Pok\u00e9mon have no energy left to battle again.",
+ "D. Because the player has not completed all the prerequisite tasks."
+ ]
+ }
+ },
+ "ECbuOxRs": {
+ "video_id": "ECbuOxRs",
+ "video_duration": "100s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a purple, irregularly-shaped geometric background. Then, a diamond shape appears in the middle of the screen, along with the text \"DIAGRAPHICS\" and below that, \"CRYSTAL CLEAR COMMUNICATION\". Next, a program interface appears on screen. It has a grey background with several icons on the left and right side, and a white rectangle in the center. The mouse cursor is visible on the screen. From the menu options at the top of the program, 'View' is selected and then 'Show Grid' is selected in the dropdown menu. A grid pattern appears on the white rectangle and the grey background. Next, 'View' is again selected and then 'Snap to Grid'. After that, the cursor makes a series of movements within the white rectangle, before creating an elongated oval shape with a blue outline and filling. This is then copied and rotated 45 degrees, and is then duplicated multiple times, creating a flower-like figure. A selection box appears around the figure. Next, from the right menu a dropdown menu is selected and 'Open Swatch Library' is selected. Then from that menu, 'Other Library' is selected. A window opens, and 'Sunflower Swatches - By Diagraphic' is selected. A swatch palette opens on the right side, and yellow is selected. The petals of the flower figure are now filled with yellow.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the final color of the ellipse in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Red.",
+ "B. Orange.",
+ "C. White.",
+ "D. Yellow."
+ ]
+ }
+ },
+ "vHEUNCcw": {
+ "video_id": "vHEUNCcw",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A single person wearing a white top and a long, layered peach skirt is seen walking in a forest of tall, slender trees. She picks up a black instrument with a long neck and strings and begins to play it while seated on the ground. There are multiple shifts in the light source, some where the lighting is dark with a single blue source and some where the person is in full sun. There are also a variety of camera angles, including close up, from below, and from farther away.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many types of music are heard in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. One.",
+ "B. Three.",
+ "C. Four.",
+ "D. Two."
+ ]
+ }
+ },
+ "KpbwAljP": {
+ "video_id": "KpbwAljP",
+ "video_duration": "340s",
+ "duration": "4-6min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a white t-shirt, a red belt, and dark-colored pants is standing on stage. The background is a brick wall with a large red and white graphic. They are holding a microphone and gesturing with their arms. Their shadow is cast on the brick wall. A second microphone and a microphone stand is also visible. They move around the stage. At one point, the camera view is switched to an audience, where multiple people are sitting. The camera returns to the person on stage. A man with a microphone then walks onto the stage. The video changes to a display of the words \u201cChristina Canters\u201d followed by the image of an old, brown television set. The television shows the red and white graphic that was on the brick wall.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Is the woman seen wearing a coat in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. None.",
+ "D. No."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many times did the woman place her hands on her hips during the performance?",
+ "answer": "A",
+ "candidates": [
+ "A. Seven.",
+ "B. Three.",
+ "C. Five.",
+ "D. One."
+ ]
+ }
+ },
+ "BMXajgzh": {
+ "video_id": "BMXajgzh",
+ "video_duration": "122s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a close up of a hand stroking a white shirt. The image then fades to a black screen with white text that says, \u201cpls get ready for the WORST singing youve ever heard in ur life.\u201d The video cuts to a person with light brown and dark brown hair wearing blue headphones and a white t-shirt. The person has black winged eyeliner and pink lip color. They are standing in front of a large glass sliding door. A small box appears in the upper right hand of the screen showing a close up of a person. The box has white text below that says \"me wanting to die listening to myself\". The person in the box rubs their face, yawns, makes faces, and moves their hands around their face. The main person on the screen continues talking and moving around, holding a cellphone in their left hand. The person is moving their hands and gesturing while they are talking and singing. White text on the screen appears that says \u201cok that highlight yes ugly skinny kween!\u201d. The person then moves their hand up to their face and shows 5 fingers. Then the person dances a little. There is a cut to a test screen with colored stripes. The video cuts back to the person who is holding a cellphone and gesturing with both hands. Text on the screen says \u201cma ears are bleeding bye\u201d. The person continues singing while gesturing and making faces. The video cuts back to a test screen with colored stripes. Then the video cuts back to the person again singing and gesturing while holding the cellphone. The person says \u201cwait for the DROP\u201d and then begins dancing with quick hand motions while singing. The person gestures with both hands. The video ends with the person making facial expressions and moving their hands while singing. They then walk away and the video ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the woman suddenly burst into laughter in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Her voice became hoarse while singing.",
+ "B. Someone was peeking at her through the window.",
+ "C. She forgot the lyrics.",
+ "D. She felt happy."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the woman spin in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. Two.",
+ "D. One."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a microphone in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "fpuHTajr": {
+ "video_id": "fpuHTajr",
+ "video_duration": "54s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A boy wearing a red t-shirt with a \"Blue Jays\" logo stands in a dining room, holding a trumpet. He raises the trumpet to his lips and begins to play. A man wearing glasses and a red and white baseball jersey sits at an antique wooden sideboard or desk in the background, playing a keyboard. Sheet music rests on the keyboard's music stand. In the foreground is a wooden dining table with a tablet propped on a stand, along with loose sheets of paper, a pen, a book, and a keyboard wrist rest. A woven basket sits on the floor near the boy. The boy and the man continue to play their respective instruments. The room has light hardwood floors, and a window with white curtains is visible on the right. A painting of a sunflower is hanging on the wall above the sideboard. A decorative vase and a few other objects sit on the top of the sideboard. A wooden door is visible on the left side of the frame. The boy maintains eye contact with the camera as he plays. He occasionally lowers the trumpet very slightly from his lips between notes. He smiles slightly at the end of the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Which instrument produces the first note?",
+ "answer": "D",
+ "candidates": [
+ "A. An accordion.",
+ "B. A trumpet.",
+ "C. A tuba.",
+ "D. A piano."
+ ]
+ }
+ },
+ "dhZIVwfD": {
+ "video_id": "dhZIVwfD",
+ "video_duration": "67s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Four people on motorcycles with helmets and protective clothing compete on an oval dirt track inside an arena with a large audience. The riders are wearing different colors: yellow, white, blue, and red. The video shows the first few laps of the race. The video has a scoreboard at the top of the screen that displays \"SGP HEAT 11\" and the current lap number. Throughout the race, the riders pass a person dressed in a black and white uniform standing on the inside of the track. As the race continues, the camera angle shifts and pans as the motorcycles continue the race around the track. At one point, the video displays a listing of names and numbers: \"P. Ljung\", \"M. Vaculik\", \"C. Holder\", and \"C. Harris\". The last lap of the race is the final clip shown, and then a single rider is shown adjusting their helmet at the end. Finally, the video shows a final shot of the entire arena.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurred in the commentator's voice in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Gradually decreases.",
+ "B. Gradually increases.",
+ "C. I am not sure.",
+ "D. No change."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What position did the race car driver wearing a yellow helmet finish in during the last lap of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Third place.",
+ "B. First place.",
+ "C. Fourth place.",
+ "D. Second place."
+ ]
+ }
+ },
+ "SrLZGOSB": {
+ "video_id": "SrLZGOSB",
+ "video_duration": "102s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A Mobile Legends: Bang Bang match lobby screen is shown. The character Wanwan is selected. Gameplay begins, showing Wanwan running through the game map. Wanwan fights several enemy characters. The perspective briefly shifts to the mini-map. Wanwan fights more enemy characters and gets a \u201cMega Kill.\u201d More fighting ensues, with Wanwan getting a \u201cDouble Kill\u201d and a \u201cTriple Kill.\u201d A blue light emanates from the ground, and the camera zooms out to an overhead view, showing several characters lying on the ground. Wanwan attacks the enemy base. The fighting continues, and another blue light effect appears. The overhead zoomed-out view is again shown. More fighting occurs near a body of water. Wanwan destroys the enemy base.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "In the video, after being defeated several times, how many times did the hero Wanwan get defeated before starting her own kills?",
+ "answer": "C",
+ "candidates": [
+ "A. Five.",
+ "B. Six.",
+ "C. Four.",
+ "D. Three."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the sound of the hero Wanwan using her ultimate ability occur in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Four.",
+ "B. Five.",
+ "C. Three.",
+ "D. Two."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, what is the man's position in the friendly team during the hero selection phase?",
+ "answer": "C",
+ "candidates": [
+ "A. Fifth.",
+ "B. Second.",
+ "C. Fourth.",
+ "D. Third."
+ ]
+ }
+ },
+ "sDQwyyLK": {
+ "video_id": "sDQwyyLK",
+ "video_duration": "417s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man and a woman sit at a purple table against a purple backdrop. Behind them is a shelving unit with various jars and boxes. The man wears a red baseball cap and t-shirt, while the woman wears a paisley-patterned top. The man holds a vacuum-sealed package of chorizo. He uses a knife to cut open the package. He takes out one of the sausages and places it on a small paper plate. He then slices the sausage into small pieces with a knife. Both he and the woman each pick up a piece and eat it. An image of the chorizo packaging appears on screen. The man and woman continue eating the chorizo.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the man do to the lady after cutting the sausage in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Ate it himself.",
+ "B. Handed it to the lady to eat.",
+ "C. Smelled it himself.",
+ "D. Handed it to the lady to smell."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did any other types of sausages appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. I am not sure.",
+ "B. None of the above.",
+ "C. Yes, there are.",
+ "D. No, there aren't."
+ ]
+ }
+ },
+ "MbvGrNUS": {
+ "video_id": "MbvGrNUS",
+ "video_duration": "386s",
+ "duration": "6-8min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video displays a soccer game with a green field, white lines, and a goal with a net at each end. There are several people on the field wearing white or red uniforms with numbers. A large crowd is in the stands, and banners with the word KONAMI are also present. The video also shows flags on poles at each corner of the field. Various overlaid graphics appear in different parts of the video, including text that displays players\u2019 names, the names of countries, and the score. At one point, there is an animation of a player celebrating after scoring a goal, and several other players come over to celebrate with him. A circular logo with the word \u201cFootball\u201d appears several times in the video. After the first half ends, a screen appears displaying half-time statistics with the score 1-1. The video continues with more gameplay, and the score changes to 2-1. Several other graphic displays also appear throughout the video. At the end, a screen appears again showing that the score is 2-1.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why did the commentator suddenly become excited in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A player in a red uniform scored a goal.",
+ "B. A player in a white uniform scored a goal.",
+ "C. A player in a white uniform fell.",
+ "D. A player in a red uniform fell."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Have you heard the referee speaking?",
+ "answer": "C",
+ "candidates": [
+ "A. None of the above.",
+ "B. I am not sure.",
+ "C. No, I have not heard it.",
+ "D. Yes, I have heard it."
+ ]
+ }
+ },
+ "LSILdqrI": {
+ "video_id": "LSILdqrI",
+ "video_duration": "34s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A cartoon beaver in a space helmet rides a rocket through space, past a cratered moon. The rocket disappears, and the beaver stands on a tube of Ipana toothpaste, giving a thumbs-up. The beaver then reappears in the rocket, pointing at a distant star and brushing his large teeth. Inside the rocket, the beaver looks through a circular window, and the view zooms in on a cactus and then another rocket labeled \u201cD.K.-1.\u201d The D.K.-1 rocket, piloted by a dog-like character wearing goggles and a black suit, flies through space firing a corkscrew-shaped projectile. Through the D.K.-1\u2019s targeting scope, a small spaceship is seen, and the projectile hits it. The beaver\u2019s damaged rocket falls to a planet's surface, and the beaver encounters the pilot of D.K.-1. The beaver holds up the Ipana toothpaste as the villain falls backward in surprise. The villain floats in space, screaming, and then tumbles head over heels. The villain rides a reindeer-like creature and crashes into the ground next to the beaver, who is holding the toothpaste tube.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order do the events in the video occur?",
+ "answer": "D",
+ "candidates": [
+ "A. Black animals are discovered, black animals approach to attack, white squirrels are knocked down, white squirrels sing in space, white squirrels successfully counterattack black animals.",
+ "B. Black animals are discovered, black animals approach to attack, white squirrels are knocked down, white squirrels successfully counterattack black animals, white squirrels sing in space.",
+ "C. White squirrels are knocked down, white squirrels sing in space, white squirrels successfully counterattack black animals, black animals are discovered, black animals approach to attack.",
+ "D. White squirrels sing in space, discover black animals, black animals approach to attack, white squirrels are knocked down, white squirrels successfully counterattack black animals."
+ ]
+ }
+ },
+ "JZmHHeWC": {
+ "video_id": "JZmHHeWC",
+ "video_duration": "31s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen and white text that reads \"Longford Town 2-0 Cobh Ramblers SSE Airtricity League First Division\". The video then transitions to a view of a soccer field. There are 2 teams of players on the field. One team is wearing yellow jerseys, and the other is wearing red jerseys. A referee wearing black and green is also on the field. The soccer ball is moved around the field by players on both teams. The camera view moves from a wider view to more close up shots of players on the field. At one point a player in yellow attempts to kick the ball into the goal.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What was the final score of the match between Loughborough Town and Cobh Ramblers shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. One to zero.",
+ "B. Two to one.",
+ "C. Two to zero.",
+ "D. One to one."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does Verdon touch the ball in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. Two.",
+ "D. One."
+ ]
+ }
+ },
+ "brrfsvYs": {
+ "video_id": "brrfsvYs",
+ "video_duration": "140s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a man with light skin wearing a black shirt speaking against a red background. Then there is a blank white screen. Next, a man with dark hair and a brown suit touches his upper lip with his index finger. Then, there is a man with short dark hair wearing a military uniform looking off to the side. After that, there is a man with a brown jacket and a dark hat bringing his interlocked hands up to his nose. The same man then appears, with his hands up to his face. A man with graying hair and dark shirt then looks ahead and speaks. The same man closes his eyes as he speaks. A clapboard appears with the text \"PROD. NO.\", \"SCENE SCREEN TEST\", \"DIRECTOR F. F. COPPOLA\", \"CAMERAMAN HIRO\", \"DATE 1/7/71\", \"EXT.\" and \"INT.\". The man with the dark suit is seen looking downward and smiling, before bringing his hands up to his face. The man with graying hair and a dark shirt is then seen speaking. Then there is a woman with a cream colored hat and a red and white patterned dress speaking to the side. The man with the military uniform is also seen with people in the background. After that, people are sitting at a table, and one is seen raising his arms and smiling. After that, a clapboard is shown again with the text \"GODFATHER\", \"TAKE 13\" and \"SLATE 6\". There is then a woman with brown hair wearing a gray blazer with a white shirt and black bow looking to her left and smiling, with the text \"DIANE KEATON\" below. The same woman is then seen with a white headband and touching her face while looking to the side. There is then a man with graying hair and dark shirt is shown speaking and closing his eyes. The same woman from before is then shown laughing and looking to the side. A woman wearing a red coat and hat spins around in a dark room. Then there is a man with a military uniform and the woman with a cream colored hat and a red and white patterned dress speaking to each other. 
The man in the black shirt is then seen speaking against a red background, while the video shows other shots, of tables, and people. There is then a large group of people standing together in a garden like area. Then there are a few people in a wooded area speaking to each other. Then the man with the military uniform is seen again looking to the side.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What expression did the man with black hair in a military uniform display while facing the camera in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Excited.",
+ "B. Anxious.",
+ "C. Guilty.",
+ "D. Nervous."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many times does the clapperboard appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Five.",
+ "B. One.",
+ "C. Zero.",
+ "D. Three."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the director praise the man with blonde hair?",
+ "answer": "A",
+ "candidates": [
+ "A. He acts well.",
+ "B. He is well-liked.",
+ "C. He is handsome.",
+ "D. He has a good attitude."
+ ]
+ }
+ },
+ "PFWtSgmy": {
+ "video_id": "PFWtSgmy",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A young man with short brown hair sits on a stage in front of rows of black chairs. He wears a blue t-shirt, dark shorts, and white over-the-ear headphones around his neck. He looks down at a smartphone in his hands.\u00a0He puts the phone down and lifts the headphones onto his ears. He picks the phone back up.\u00a0He puts the phone and headphones down and rubs his hands together. He lifts the headphones back onto his ears and picks up the phone again. He sets both down again, picks up two drumsticks, and begins tapping them together.Four other people carrying conga drums walk onto the stage and sit down next to him.\u00a0 Two sit to his left, one female and one male; and two to his right, one female and one male. The young man in the blue shirt puts down the drumsticks and gestures with open palms towards the other musicians. One of the women on his left is holding a smartphone in her hand.They all begin playing their conga drums, including the young man in the center, who now holds a drumstick and a small shaker in his hands. They continue playing.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order of events presented in the video? (a) A female voice is heard. (b) Music begins. (c) A male chorus is introduced. (d) The sound of conga drums and small sticks being played.",
+ "answer": "C",
+ "candidates": [
+ "A. (a) (d) (b) (c).",
+ "B. (a) (c) (b) (d).",
+ "C. (b) (a) (d) (c).",
+ "D. (b) (d) (a) (c)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "How many conga drums are in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Four.",
+ "B. Five.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "YrfuwtDa": {
+ "video_id": "YrfuwtDa",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person stands in a living room and plays a bassoon. They are wearing a floor-length sleeveless patterned dress and have long dark hair. The bassoon is brown wood and they are looking at sheet music on a music stand placed beside a dark colored couch. There are windows behind them with sheer white curtains and large bay windows. A grey armchair sits to the left of the person playing the bassoon, partially covered with a blanket. A small round table is beside the armchair and a small potted plant is visible behind the music stand, near the couch. The bassoon case is laying on top of the couch.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the woman gasp?",
+ "answer": "C",
+ "candidates": [
+ "A. Six times.",
+ "B. Five times.",
+ "C. Seven times.",
+ "D. Eight times."
+ ]
+ }
+ },
+ "ruNfcJFd": {
+ "video_id": "ruNfcJFd",
+ "video_duration": "494s",
+ "duration": ">8min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a screen capture of a Roblox profile page, and a small box containing a person wearing headphones in the bottom right corner. In the profile page view, an avatar picture is circular and gray, a large image on the left is an advertisement for a \"grand opening\", and several game icons are in the \"Recommended\" section. The person in the smaller box is looking toward the screen and making various facial expressions and hand gestures. The video proceeds with a series of screen captures that show searches for various Roblox avatar customizations. These include items such as \"Cartoony Eyes\", \"[KITTY] Kitty Pants\", \"[KITTY] Kitty Shirt\" and \"8-Bit Kitty\". The video also shows screens of the Roblox avatar customization feature, including options for body scaling and skin tone. These are followed by additional avatar customization screens, and the final screen of the sequence shows an avatar wearing a red shirt and pants, with black eyes, and 8-Bit Kitty ears. The screen capture then switches to a Roblox game titled \"Kitty Chapter 1!\" which features a red cat character. The video proceeds to show gameplay in which the red cat moves through various rooms with blue, brown, and red walls. The person in the small box in the bottom right corner appears to be reacting to the gameplay. Several other red cat characters are introduced to the game. Finally, a scene of the game appears with the word \"Win\" and various numbers, followed by the person in the box making hand gestures and looking towards the camera.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many bottles of potion did the red cartoon mouse place on the shelf in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. One bottle.",
+ "B. Three bottles.",
+ "C. Two bottles.",
+ "D. Four bottles."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual scene takes place in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A duck is floating in space.",
+ "B. A cat is floating in space.",
+ "C. A duck is floating on Mars.",
+ "D. A mouse is floating on Mars."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the red cartoon mouse do with the green scissors?",
+ "answer": "D",
+ "candidates": [
+ "A. Cut the green wire of the machine.",
+ "B. Cut the red wire of the machine.",
+ "C. Cut the black wire of the machine.",
+ "D. Cut the yellow wire of the machine."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the red cartoon mouse enter a secret room filled with gasoline tanks?",
+ "answer": "B",
+ "candidates": [
+ "A. To rescue the trapped mice.",
+ "B. To rescue the trapped mice.",
+ "C. To rescue the trapped kitten.",
+ "D. To rescue the confined ducks."
+ ]
+ }
+ },
+ "EfuyFyfu": {
+ "video_id": "EfuyFyfu",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a red shirt and dark pants stands in a living room. He is moving his hands and arms. Two people are seated on a couch, and a woman stands in the kitchen. The man in red continues moving his hands and arms, then the words \"ANNOUNCMENT FAKE OUT\" appear on the screen in white text on a black background. The man in red begins to clap. The view of the couch now shows that there are three people seated. The man in red gestures and moves his hands and arms. The three people on the couch start raising their hands. The woman in the kitchen also raises her hands. The view moves to show the man in the blue shirt on the couch reaching up as well. He raises his leg too. The woman in the kitchen moves to touch the man on the couch. The view shifts to show all four people in the room.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the primary emotion conveyed in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Anger and disgust.",
+ "B. Happiness and sadness.",
+ "C. Excitement and joy.",
+ "D. Excitement and disappointment."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where is the woman in the white long robe located in the middle scene of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Sitting on the bench.",
+ "B. Squatting.",
+ "C. Sitting on the sofa.",
+ "D. Running in the room."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotion of the woman in the white long robe in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Frustrated.",
+ "B. Excited.",
+ "C. Disappointed.",
+ "D. Enthusiastic."
+ ]
+ }
+ },
+ "RibVkphS": {
+ "video_id": "RibVkphS",
+ "video_duration": "34s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A first-person perspective video game shows a player character aiming at a snowy, rocky landscape. Green projectiles rain down from the sky, impacting the ground and creating greenish-yellow explosions. The text \"Requesting orbital\" appears briefly. More projectiles strike the ground as a notification appears, reading \"Transport shuttle Pelican 1 approaching pickup. Mission Control: Extraction shuttle inbound\". As the bombardment continues, the rocky landscape becomes obscured by a cloud of dust and smoke. The player turns and runs across the rocky terrain as more projectiles fall. The notifications change to \"Shuttle Landing\" and then \"ELIMINATED\". Another notification appears, reading \"Transport shuttle: This is Pelican 1, preparing for touchdown\". The scene transitions, and the player now stands with arms outstretched as explosions continue in the background. The notification reads \"INBOUNDING 50:00\". More projectiles hit the landscape. The player remains stationary as the timer counts down, and the bombardment continues. The large explosions create a bright orange glow. Finally, the player starts running again as the message \"Calling in an Eagle\" appears.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occurred in the man's voice in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. First decreases, then increases.",
+ "B. Continuously decreases.",
+ "C. First increases, then decreases.",
+ "D. Keeps increasing."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the game's objective in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Destroy the base.",
+ "B. Destroy the eggs.",
+ "C. Destroy the bugs.",
+ "D. Destroy the shuttle."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the game character wearing a red cloak and the game character wearing a white cloak in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Enemy.",
+ "B. Companion.",
+ "C. Partner.",
+ "D. Mentor."
+ ]
+ }
+ },
+ "jlUtrjun": {
+ "video_id": "jlUtrjun",
+ "video_duration": "51s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person wearing a baseball cap and a gray hoodie is in a car and talks to the camera. The scene switches to an indoor rock climbing gym. The time \"9:30 a.m.\" appears on the screen, then two children walk in front of a climbing wall. The person from the first scene then appears and holds up two fingers. Text appears on the screen that says \"Fun with kids\" and \"Work Out\", and the person holds up a pink ball. Text appears on the screen that says \"Time\". More text appears that reads \"Fun with kids while working out\". The person from the first scene then helps a child climb on a climbing wall. The person then climbs a wall with small holds. The person jumps to the floor, and a child falls to the floor as well. The person helps the child get up. The person climbs the wall again, and then jumps to the floor, and then looks at the camera and waves. The scene switches to a view of the wall, and then a child appears. Two children are sitting in cubbies, and then the person from the first scene walks toward the camera and then sits on the cubbies.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What challenge did the man undertake with the little girl in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Rock climbing.",
+ "B. Hide and seek.",
+ "C. Long jump.",
+ "D. Ball kicking."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the video show a man in a yellow shirt rock climbing with a little girl?",
+ "answer": "B",
+ "candidates": [
+ "A. None of the above.",
+ "B. No, there isn't.",
+ "C. Yes, there is.",
+ "D. I don't know."
+ ]
+ }
+ },
+ "ngKCvvQp": {
+ "video_id": "ngKCvvQp",
+ "video_duration": "164s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video is split into two frames. In the left frame, two people in space suits lie in seats. One of them has a blue object by their seat. In the right frame, there are three people in blue shirts who appear to be working on items with white cords coming out of them, in a room with various flags along the top. The next scene zooms in on the three people in the right frame. Then the left frame changes to show two people still in their seats, one of them in a white space suit and the other in a black suit. The next scene shows a person with a blue shirt and headgear opening a round hatch in the left frame and reaching down into one of the seats. Another person in a blue shirt then comes to the seat and hands over a white block. The first person with a blue shirt then takes a rectangular object out of the seat and hands it to the second person. Finally, the screen shows a split frame again, with the person in the blue shirt on the left side, while on the right side there is a person at a table looking at a white object in a circular hatch.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What scene is on the right side of the video at the end?",
+ "answer": "A",
+ "candidates": [
+ "A. Dragon is inside the pressurized docking adapter.",
+ "B. The resting area for astronauts inside the space station.",
+ "C. The astronaut seating area of Dragon.",
+ "D. The space station inside the chamber of the pressurized docking adapter."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where was the white cargo placed at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. In the corridor of the astronaut area.",
+ "B. Under the seats in the astronaut area.",
+ "C. Inside the storage cabinets in the astronaut area.",
+ "D. Above the seats in the astronaut area."
+ ]
+ }
+ },
+ "PXoUvfgm": {
+ "video_id": "PXoUvfgm",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman sits on a chair playing the flute, with a man standing to her right playing the bassoon and another man sitting to the bassoonist's right playing the clarinet. A grand piano sits behind the musicians, and an audience sits facing the musicians. A small handheld video camera, placed on a tripod, sits between the audience and the clarinet player, partially obscured by audience members. The musicians' heads and bodies move subtly as they play, and their fingers move over their instruments. The bassoonist gestures slightly with his hands in time with the music, and occasionally turns his head to glance at the other two musicians. Near the end of the video, the flutist drops her sheet music, picks it up, and resumes playing. The camera angle remains static throughout the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "In the video, how does the rhythm of the music change?",
+ "answer": "D",
+ "candidates": [
+ "A. It shifts from somber and oppressive to majestic and stirring.",
+ "B. It changes from a deep and oppressive tone to a light and soothing one.",
+ "C. It changes from lively and cheerful to majestic and stirring.",
+ "D. It changes from lively and cheerful to relaxed and soothing."
+ ]
+ }
+ },
+ "PwujWvIy": {
+ "video_id": "PwujWvIy",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are standing and holding bassoons. The person on the left wears a dark blue T-shirt with a graphic on the front, beige shorts, and white sneakers. The person on the right wears a white T-shirt, dark blue jeans, and gray sneakers. The right person has an object strapped to their lower leg. They are standing in front of a music stand. The background wall has several framed pictures. A few people are seated in the foreground. There is a piano and a chair to the left of the two standing people.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Before taking a breath, how did the frequency of the men's contrabassoons change?",
+ "answer": "B",
+ "candidates": [
+ "A. From gradual to rapid.",
+ "B. From gradual to rapid and then back to gradual.",
+ "C. From rapid to gradual and then back to rapid.",
+ "D. From rapid to gradual."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man wearing white do with the bassoon?",
+ "answer": "B",
+ "candidates": [
+ "A. Play the bassoon in the music hall.",
+ "B. Play the bassoon in the classroom.",
+ "C. Help a classmate move a bassoon.",
+ "D. Help a classmate wipe the bassoon."
+ ]
+ }
+ },
+ "eQxVvndq": {
+ "video_id": "eQxVvndq",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A black screen appears, then a white background with a black line running horizontally. A red, yellow, green, and blue cube appear in a line near the right edge of the screen. A cartoon figure, wearing a blue garment with gold stars and black shorts, appears from the left holding a cartoonish gray figure with an orange cloth. The cartoon figure holds the gray figure up, then they both look slightly down. Then the cartoon figure\u2019s face changes. A large hand enters from the left and points its index finger at the small figure. The cartoon figure looks at the large hand then down at the gray figure. Then the cartoon figure looks up with wide eyes, then down. The cartoon figure holds the gray figure up, then looks at a book in front of him. The book shows diagrams and text. Then it shows a map of Italy, an upwards green arrow, a factory, a red arrow that curves clockwise, and a picture of a few green trees with a blue sky. Then a red currency symbol with a sun appears over the tree picture, and the tree picture darkens to gray. The book then rotates out of view and the cartoon figure looks upwards. The cartoon figure looks downwards holding the gray figure, then a large hand enters holding a calculator, with the numbers 478920, 863737, and 7863737 appearing on the screen in succession.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occur in the forest at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The forest changes from withered to lush.",
+ "B. The trees change from dense to stumps after being cut down.",
+ "C. The forest becomes denser.",
+ "D. The forest changes from dense to withered."
+ ]
+ }
+ },
+ "qSxLNtoC": {
+ "video_id": "qSxLNtoC",
+ "video_duration": "367s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view of a light-colored flat surface with lines and a dark square item on it, along with what appears to be a plant. The view shifts quickly, moving toward an exterior wall with two windows on the left and a door on the right. The video continues focusing on the bottom right-hand window, moving between close and far shots of someone's hand manipulating different colored strips of material around the frame of the window. This includes blue, red, and gray strips. The person's hand also smooths some gray material around the window. A diagram is displayed briefly during this process, showing how a window should open in order to be correctly installed. The video continues with the same person in view, then moving away to show them removing a vertical window panel. A series of quick shots is displayed of the person in a black shirt, using both their hands to position, align, and maneuver the window and other components. At the end of the video, the person steps back to show that the work is completed and speaks while gesturing with his hand, in front of a \"Subscribe\" message on screen. The video ends on a black screen with this \"Subscribe\" message.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man demonstrate actions using tools in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. Two.",
+ "D. One."
+ ]
+ }
+ },
+ "oDFEgayi": {
+ "video_id": "oDFEgayi",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are present in the video. The person on the left wears glasses, a red plaid shirt, and plays a red ukulele. The person on the right wears a blue t-shirt, and plays a light colored guitar. They both appear to be moving their mouths and are looking in different directions while playing their instruments. Behind them is a painting of a man wearing a hat and glasses. The background is a light brown color.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What instrument is located to the left of the electric guitar?",
+ "answer": "A",
+ "candidates": [
+ "A. A Hawaiian guitar.",
+ "B. A violin.",
+ "C. An electric bass.",
+ "D. A ukulele."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the emotional tone of this song?",
+ "answer": "D",
+ "candidates": [
+ "A. Somber and oppressive.",
+ "B. Serene and melancholic.",
+ "C. Energetic and passionate.",
+ "D. Warm and joyful."
+ ]
+ }
+ },
+ "NfqKjvcz": {
+ "video_id": "NfqKjvcz",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person wearing a white t-shirt with black stripes and black shorts enters a room carrying a bagpipe. They then put the mouthpiece to their mouth and the bagpipe is filled with air. They play the instrument. After some time, they stop playing. There are two people in the room watching.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the bagpipe playing continuously in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. I am not sure.",
+ "C. No."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurs in the rhythm of the bagpipes at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. From hurried to heavy.",
+ "B. From heavy to hurried.",
+ "C. From joyful to gentle.",
+ "D. From gentle to cheerful."
+ ]
+ }
+ },
+ "hMYAGFoQ": {
+ "video_id": "hMYAGFoQ",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Camping",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A tan and black patterned snake is lying in dry grass. The camera shifts to a different location, where the snake is on a grass verge next to a sandy area. A black shovel is brought into view. The camera shifts to a man with a light-blue t-shirt, sunglasses, and a light-colored hat, standing in front of a sandy beach. He gestures with his hands. The camera shifts to a dark scene where a light is illuminating a raccoon that is partially submerged in water. The camera shifts back to the man in the light-blue t-shirt, who is gesturing with his hands. The camera shifts to a man with glasses and an orange collared shirt standing next to a tent on a sandy beach, and then back to the man in the light blue t-shirt. Then it is back to the man in the orange shirt, then the man in the light blue t-shirt one last time.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual creature is on the water's surface in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A wolf.",
+ "B. A sheep.",
+ "C. A dog.",
+ "D. A leopard."
+ ]
+ }
+ },
+ "xfyeYBXS": {
+ "video_id": "xfyeYBXS",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are sitting and playing euphoniums. They both wear red shirts and dark caps. They are seated in chairs in what appears to be a band room, with music stands in front of them and microphones nearby. Behind them are empty chairs. The video then transitions to a black screen with the words \"Get this sheet music at TubaPeter.com.\" The video then cuts to a montage of short clips featuring other euphonium players in various settings, with the words \u201cNeed more Tuba?\u201d displayed on the left side of the screen. The clips include three euphonium players sitting in chairs, two euphonium players in black and white, and three euphonium players wearing hats. The titles \"Sonic Boom,\" \"Legend of Zelda,\" and \"Game of Thrones\" appear next to the respective clips.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "What is the source of the \"Zelda\" music featured in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Tuba performance video from the first row.",
+ "B. Tuba performance video from the second row.",
+ "C. Tuba performance video from the third row.",
+ "D. Three tuba performance videos."
+ ]
+ }
+ },
+ "OKyEAgAz": {
+ "video_id": "OKyEAgAz",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man and two young girls are playing drums. The man is in the center, behind a drum set. He is wearing a black t-shirt with a colorful graphic design. He holds a microphone in his right hand. The girls are on either side of him, also playing drum sets. They are wearing black t-shirts with orange designs. The girl on the left has her hair pulled back in a ponytail, while the girl on the right has short hair. The man occasionally sings or speaks into the microphone while tapping his drumsticks against a drum or cymbal. The girls are focused on playing the drums, following the rhythm dictated by the man. They hit their drums and cymbals in sync with each other and the man. The room has sound-absorbing foam panels on the walls, and there are boxes stacked in the background. A watermark for a video platform is visible in the top right corner.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the sound of hitting the cymbal on the drum occur in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three times.",
+ "B. Four times.",
+ "C. Five times.",
+ "D. Twice."
+ ]
+ }
+ },
+ "hXgprlfg": {
+ "video_id": "hXgprlfg",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a black screen. Then a light grey screen with horizontal lines appears, followed by the word \"TUTORIAL\". Then the screen shows a black square, containing a white circle with a black dot in the center. Below it is the word \"iPHOTOGRAPHY\". The \"i\" and \"Y\" of \"iPHOTOGRAPHY\" move, creating the word \"PHOTOGRAPH\" instead. The screen is briefly replaced by the letter \"Y\", before going to a black screen again. The black screen shows three red vertical lines which change into a light blue rectangle outlined in pink with the words \"GRADUATION PHOTOSHOOT\". The screen switches to a man, wearing glasses, a black t-shirt with white lines on the sleeves, and a small microphone clipped to his chest. He is standing in front of a blue fabric background. His hands are open and are facing upwards. He is seen moving his hands while speaking. Underneath him is a light blue box with the text \"STEPHEN - HEAD TUTOR\". He is seen continuing to speak and move his hands. The camera zooms slightly on him. He is seen still moving his hands and speaking.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did the video teach specific techniques for taking graduation photos?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes.",
+ "B. I don't know.",
+ "C. No."
+ ]
+ }
+ },
+ "CrOuXLhe": {
+ "video_id": "CrOuXLhe",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man and a woman are seated at a grand piano, and a man is seated playing a tuba. The woman plays the piano, while the man next to her flips pages of sheet music. The tuba player and the pianist play their instruments continuously. The audience sits in chairs in the foreground. One audience member appears to take a photo with a phone. The walls of the room are gray cinder blocks, and vents are visible high on the walls. The floor is gray.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the piano on the left louder than the tuba on the right?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. None of the above.",
+ "C. No.",
+ "D. I am not sure."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Where does the piano sound in the video originate?",
+ "answer": "B",
+ "candidates": [
+ "A. A man in a black suit.",
+ "B. A woman with short hair.",
+ "C. A man in a blue suit.",
+ "D. A woman in a black skirt."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the identity of the man in the blue suit?",
+ "answer": "A",
+ "candidates": [
+ "A. Piano page turner.",
+ "B. Tuba player.",
+ "C. Piano player.",
+ "D. Audience member."
+ ]
+ }
+ },
+ "vwNurRWU": {
+ "video_id": "vwNurRWU",
+ "video_duration": "164s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Physics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video starts with the words \"Cohesive Energy\" in black text on a white background. This is followed by a shot of a person standing in front of a white flip chart. The flip chart has \"Cohesive Energy\" written at the top in blue, followed by the statement \"depends on potential energy per atom u\". Below this, a formula is written: \"U=V(r0)\". Below that, it says \"Cohesive energy per atom\" and another formula \"e_coh=-U\". The person is wearing a gray sweater with a blue and white checkered pattern, and has their hair pulled back. The person gestures towards the chart as they speak, at times pointing to specific words or formulas on it. The flip chart is turned to a new page with a title \"V?\". The following is written below in blue: \"Ionic -> Coulomb interaction\", \"Covalent -> Morse potential\", \"Van der Waals -> Lennard-Jones potential\", and \"Metallic -> many-body interaction\". The person continues to gesture to the chart while speaking, at times putting a hand to their face. The flip chart is turned to a new page with the formula \"u=V(r)\" at the top, and in the top right, it says \"Nearest neighbours d=r0\". Below this, it says \"2nd nearest neighbours d=sqr(2)*r0\". On the bottom, it has the formula: \"1. E_coh=-V(r0)*6/2=-3V(r0)\" and \"2.E_coh=-V(sqr(2)*r0)*8/2+(-3V(r0))=-3V(r0)-6V(sqr(2)r0)\". The person gestures and points to the chart while speaking. The video then displays text on a white background, which reads \"What About Second Nearest Neighbours?\". The video shows the person standing in front of the flip chart with all of the same notes from the last shot. They continue to gesture towards the flip chart while speaking, and point to different formulas on the chart.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of bond types and their corresponding formulas mentioned by the woman in the video when discussing cohesive energy?",
+ "answer": "C",
+ "candidates": [
+ "A. (d) (b) (a) (c).",
+ "B. (c) (a) (d) (b).",
+ "C. (a) (b) (c) (d).",
+ "D. (a) (c) (b) (d)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What was the woman's emotional state when she discussed the potential functions of different bond types used in calculating cohesive energy?",
+ "answer": "B",
+ "candidates": [
+ "A. Indifferent.",
+ "B. Excited.",
+ "C. Confused.",
+ "D. Tired."
+ ]
+ }
+ },
+ "fcJwkNBh": {
+ "video_id": "fcJwkNBh",
+ "video_duration": "53s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video begins with a solid red background, which is followed by a series of animated images and lines drawn over a pale red background. This series of animated images includes a small human figure, an old fashioned film camera, a clapperboard, and a popcorn box. These animated images are also accompanied by small dots and lines that circle the central image. Finally, a solid blue background displays text reading \u201cGo Tell Aunt Nancy, VCM Grade B, Traditional Arranged by C. R. Tribe, Guitar and Ukulele Duet\u201d. After this text screen, the video cuts to a shot of a room in which a boy and girl sit playing stringed instruments. The boy is holding an orange ukulele and the girl is playing a guitar. There is a music stand in front of them, holding what looks like a songbook. Both people play their instruments throughout the rest of the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur in the music during the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It changes from lively and cheerful to relaxed and soothing.",
+ "B. It changes from somber and oppressive to majestic and inspiring.",
+ "C. It changes from somber and oppressive to light and relaxing.",
+ "D. It changes from lively and cheerful to somber and oppressive."
+ ]
+ }
+ },
+ "zHxmeJBW": {
+ "video_id": "zHxmeJBW",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Chemistry",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A grey and brown structure is on the left with a white wheel and other dark gray mechanisms. In the center of the screen is a gray structure with a light blue bucket next to it on the right side. White fluid emerges from the bottom and rises upwards. A red and orange flame appears at the bottom. The flame moves and grows, the white fluid remains. A rectangular box with black text appears in the upper right of the frame. The white fluid disappears. A white smokey substance moves up and towards the left, coming from a dark grey structure. A red and black downward arrow appears in the center of the screen. Red and blue molecules appear, falling towards the arrow.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many chemical formulas involving Fe are shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Five.",
+ "B. Four.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "KGokYCdm": {
+ "video_id": "KGokYCdm",
+ "video_duration": "95s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A child with blonde hair decorates a Christmas tree with red ornaments. A woman stands next to the tree with the child. A child wearing a blue hat and blue sweater holds a red stocking in front of a fireplace. A man hangs up three Christmas stockings in front of a fireplace. A child wearing light blue pajamas holds a star from a red box. A woman holds the child and a gold star is placed on top of a Christmas tree. The four family members stand next to the fully decorated tree. A child walks through a door and waves. Then the child is in a different room and waves. A snowman is standing between two planters. The child walks out of the house. A man is carrying a plate with food items and a woman is standing next to him. Three children walk outside behind them, holding gingerbread men. A man and a woman place various things on a snowman, such as a red hat, green mittens, and a scarf. The children give the snowman small cookies and small twigs. A child in yellow headphones is shown looking sad, then is shown looking happy. Two hands in yellow mittens cover the face of the snowman. The family is standing together by the snowman. They wave. The family stands in a snowy yard next to a snowman and cookies. A 'Subscribe' button with a red background is shown with the children. The family all wave their hands behind a grey subscribe button with a bell.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the Christmas stocking hung on the fireplace in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. On the left.",
+ "B. On the right.",
+ "C. Above.",
+ "D. In front."
+ ]
+ }
+ },
+ "hYcIkwMV": {
+ "video_id": "hYcIkwMV",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A young person with short brown hair, wearing a long-sleeved checkered shirt and khaki pants, stands playing a bassoon. They are positioned on a red carpeted stage in front of a black music stand. The bassoon is attached to a neck strap. The musician\u2019s fingers move over the instrument\u2019s keys as they play. Behind them is a wooden podium and to the right, partially obscured by the podium, a person holding what appears to be a stringed instrument can be seen. The background consists of peach-colored walls and wooden cabinets. Three decorative shallow bowls sit atop the cabinets. Two microphones are attached to stands on the podium. The musician continues to play the bassoon, their fingers moving over the keys throughout the duration of the video. Their head is slightly tilted downwards towards the music stand. The angle of the shot remains static throughout.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the boy take a breath?",
+ "answer": "B",
+ "candidates": [
+ "A. Six times.",
+ "B. Seven times.",
+ "C. Eight times.",
+ "D. Five times."
+ ]
+ }
+ },
+ "OIMEHpmB": {
+ "video_id": "OIMEHpmB",
+ "video_duration": "45s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A cartoon character with black hair, glasses, and a yellow scarf is moving a cardboard box with a black symbol on it across a room with a bookcase, a closet, and a closed door. The character pushes the box with their hand. Then, the character is shown pointing to the box. The same scene is shown again. The word \"IRONSIDE\" and a black logo with an ax appear next to the box. The next scene shows a stylized animated character standing in front of a custom gaming PC. Then there is a close-up of a figure inside the PC. Then the character from the first scene reappears holding a utility knife over the box. Next, the character is lying down on the box, and then a pink background appears with the character touching their chin. Next, a gray and black computer graphics card is shown floating against a pink background and the character is shown with their hands on their head. Then, the character is shown again looking at the box and opens it. Fluffy white packing material is spilling from the box as the character looks into it. Then, the character disappears into the box before only their head is visible surrounded by the white packing material. Then the box appears closed again with white packing material spilled on the floor around it and a second box is now also on the floor.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence in the unboxing video?",
+ "answer": "C",
+ "candidates": [
+ "A. Display the box, open the box, take out the small box, cut the box with a knife, and jump into the box.",
+ "B. Display the box, cut it open with a knife, open it, jump into it, and take out a small box.",
+ "C. Display the box, cut it open with a knife, open it, take out the small box, and jump into the box.",
+ "D. Display the box, cut it open with a knife, jump into it, open it, and take out the small box."
+ ]
+ }
+ },
+ "jbZgCwlJ": {
+ "video_id": "jbZgCwlJ",
+ "video_duration": "175s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A man in a blue, long-sleeved button-down shirt stands in front of a dark blue background. He holds a clicker in his right hand and gestures with his left hand as he speaks. Behind him, the graphics change. First, there is a white stick figure with a lightbulb next to its head. The words \"TOKEN BEHAVIOUR\" are displayed in white text. A logo consisting of a dragon and the word \"UQx\" are in the upper right corner. The graphic then changes to a flow chart that shows perceived scientific agreement leading to: \u201cGlobal warming is happening,\u201d \u201cWe are causing it,\u201d and \u201cGlobal warming is dangerous.\u201d These three statements lead to \u201cPolicy Support.\u201d The words \u201cCONSENSUS GAP\u201d appear in white text above the flow chart. Next, two incomplete circles labeled \u201cTHE PUBLIC THINK\u2026\u201d and \u201cIN REALITY\u2026\u201d are displayed side-by-side. \u201c55% of climate scientists agree on global warming\u201d is displayed inside the left incomplete circle. \u201c97% of climate scientists agree on global warming\u201d is displayed inside the right incomplete circle. The graphic changes to a line graph showing \u201cPERCEPTION OF SCIENTIFIC AGREEMENT.\u201d The x-axis is labeled \u201cFree Market Support,\u201d from -1 to 1. The y-axis shows percentages from 0 to 100. A dotted horizontal line is labeled \u201c97% Scientific Consensus.\u201d The plotted line shows a downwards trend from approximately 75% to 35% moving left to right across the graph. The left end of the x-axis is labeled \u201cliberal\u201d and the right end is labeled \u201cconservative.\u201d An orange rectangle is placed on the graph, covering the space between 0 and 40% on the y-axis. The label \u201cCultural Bias\u201d is placed inside the rectangle. A light blue rectangle is then added, covering the space between the \u201c97% Scientific Consensus\u201d dotted line and the orange rectangle. The text \u201cInformation Deficit / Misinformation Surplus\u201d is added to this blue rectangle. The graphic then disappears, leaving the man in the blue shirt against the dark blue background.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order do the different types of charts, used by the man to support his argument in the video, appear?",
+ "answer": "A",
+ "candidates": [
+ "A. Flowchart, pie chart, line graph.",
+ "B. Line graph, pie chart, flowchart.",
+ "C. Pie chart, flowchart, line graph.",
+ "D. Flowchart, line graph, pie chart."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "According to the chart about the debate among climate scientists regarding human-caused global warming, how much disagreement do the public believe exists among scientists?",
+ "answer": "A",
+ "candidates": [
+ "A. According to the chart, the public believes there is 55% disagreement among scientists.",
+ "B. According to the chart, the public believes there is 50% disagreement among scientists.",
+ "C. According to the chart, the public believes there is 97% disagreement among scientists.",
+ "D. According to the chart, the public believes there is 60% disagreement among scientists."
+ ]
+ }
+ },
+ "vSGVUrtn": {
+ "video_id": "vSGVUrtn",
+ "video_duration": "60s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man and a woman stand in a gymnasium, holding hands. They let go and walk backward, away from each other, then jump and land on their knees. They crawl forward on their knees and elbows, then sit back and lean back on their hands. They stand up, run toward each other, and pass each other, running in opposite directions. The woman turns and runs back toward the center, crouches on her feet, and crawls forward. The man slides into a seated position on the floor, and the woman stands up and turns around. The man stands up and helps the woman to her feet.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Can you identify when the smile of the long-haired woman and the man with short black hair occurs in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the beginning of the video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of the dancer's movements in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Kneel, jump with arms spread, palms touching, approach, jump, kneel, lie down, kneel.",
+ "B. Approach, jump, kneel, lie down, kneel, jump.",
+ "C. Approach, jump, kneel, lie down, kneel, jump with arms spread, palms touching.",
+ "D. Jump, approach, kneel, lie down, kneel, jump with arms spread."
+ ]
+ }
+ },
+ "XkZlbDXD": {
+ "video_id": "XkZlbDXD",
+ "video_duration": "82s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Fashion",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with long dark hair is in the center of the frame. The person has pale skin and is wearing a pink top and gold bracelet. The background is black with round yellow lights. The person claps their hands, then moves their hands near their face. The person holds a yellow stuffed minion, then the person holds a multi-colored keychain with keys attached to it. The person then displays three cell phone cases of different colors: red, purple, and black. After that, a shot shows four phone cases on a table. The colors are purple, red, black, and dark blue. Then the person is shown in front of a white vanity with a mirror that has lightbulbs on each side of it. The person is holding a black cell phone, puts the phone case on it, then takes the case off and puts it on a purple phone, then talks on the phone.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "In the video, how many phone cases did the woman buy in total?",
+ "answer": "A",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Four.",
+ "D. Three."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "Which sequence of events correctly reflects the protagonist's activity flow in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Recounts a friend's gift of a bell keychain, displays a pendant, talks about the origin and loss of the keychain, and shows a phone case while sharing its purchase price.",
+ "B. Displays a pendant, talks about the origin and loss of the keychain, recounts a friend's gift of a bell keychain, and shows a phone case while sharing its purchase price.",
+ "C. Shows a phone case while sharing its purchase price, displays a pendant, talks about the origin and loss of the keychain, and recounts a friend's gift of a bell keychain.",
+ "D. Displays a pendant, talks about the origin and loss of the keychain, shows a phone case while sharing its purchase price, and recounts a friend's gift of a bell keychain."
+ ]
+ }
+ },
+ "EfVSxyDK": {
+ "video_id": "EfVSxyDK",
+ "video_duration": "56s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Fashion",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person wearing a white shirt and dark blue pants stands in front of a gray background. The person moves their hands while talking. The video then transitions to a multi-paneled image of hands working on what appears to be a leather shoe, using tools. The video transitions to an image of two pairs of shoes with labels underneath that read \"BROWN ANTIQUE\" and \"DIABLO ANTIQUE\" respectively, and then transitions again to an image of 3 pairs of shoes with corresponding labels underneath. The video transitions back to the person holding a brown shoe and moving it around. The person holds the shoe from different angles.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the man's most likely identity?",
+ "answer": "A",
+ "candidates": [
+ "A. A blogger.",
+ "B. A shoe salesperson.",
+ "C. A shoe repairer.",
+ "D. A shoemaker."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video showcasing the shoe-making process, which of the following outlines the correct production order?",
+ "answer": "D",
+ "candidates": [
+ "A. Make the shoe upper, finish the seams, sew the shoe upper, and attach the upper to the sole.",
+ "B. Make the shoe upper, attach the upper to the sole, sew the shoe upper, and finish the seams.",
+ "C. Make the shoe upper, sew the shoe upper, finish the seams, and attach the upper to the sole.",
+ "D. Make the shoe upper, sew the shoe upper, attach the upper to the sole, and finish the seams."
+ ]
+ }
+ },
+ "uvSjATRD": {
+ "video_id": "uvSjATRD",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A spotted eel is shown opening and closing its mouth several times. It is then seen with more of its body exposed, partially inside a cave structure. Next, a person in a pink and orange diving suit is shown placing their hand near the eel, which then retracts into the cave structure. In the next scene the spotted eel is fully exposed swimming in blue water. A gloved hand offers the eel something, which the eel accepts. Next, the eel is shown close to a person's face in the water, who then places their hands on the eel. The person holds and moves the eel, and the eel interacts with the person's face. The person then holds the eel again in a different position before feeding it a small fish. In the final sequence the person is holding the eel and they move the eel to their own face as the video ends.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What would the woman feel when she hasn't seen eels for three years?",
+ "answer": "C",
+ "candidates": [
+ "A. Happy.",
+ "B. Angry.",
+ "C. Nostalgic.",
+ "D. Furious."
+ ]
+ }
+ },
+ "iErPftPQ": {
+ "video_id": "iErPftPQ",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are visible in the frame, seated and facing forward. The person on the left is playing a violin, and the person on the right is playing a guitar. Both are moving their hands to play their respective instruments. The person on the left is wearing a white and grey striped shirt and light colored pants. The person on the right is wearing a gray t-shirt and shorts. Both are indoors. Behind them are walls that are blue, a brown cabinet, a closed dark blue door, a flat screen tv, and a brown wooden shelf with various objects placed on it.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "At the end of the video, how many times does the man on the right pluck the string?",
+ "answer": "B",
+ "candidates": [
+ "A. Three times.",
+ "B. Twice.",
+ "C. Once.",
+ "D. Four times."
+ ]
+ }
+ },
+ "rMhuYqnC": {
+ "video_id": "rMhuYqnC",
+ "video_duration": "640s",
+ "duration": ">8min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A peach-colored screen shows three dark circles, likely film reel holes. The scene transitions to two women in a car. The driver wears a black bandana and black shirt, and the passenger wears a white tank top. They are talking and gesturing. The driver playfully pats the passenger's chin, then continues driving and talking, gesturing with her hands. They both laugh. The driver hands the passenger a drink, takes a sip of her own drink, then continues talking, gesturing, and occasionally looking at the road. The passenger intermittently looks at her phone. The driver offers her drink to the passenger, who takes a sip. They continue conversing, and the driver places her hand on the passenger's arm. The driver then kisses the passenger's cheek.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between these two women?",
+ "answer": "C",
+ "candidates": [
+ "A. Enemies.",
+ "B. Sisters.",
+ "C. Lovers.",
+ "D. Friends."
+ ]
+ }
+ },
+ "UFwlTsed": {
+ "video_id": "UFwlTsed",
+ "video_duration": "53s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with white text reading \"WHY ARE MODERN RELATIONSHIPS SO FRAGILE?\" on a pink background. The text is angled upward and appears to be hand-drawn. This is followed by white text reading \"7 REASONS WHY MOST RELATIONSHIPS FALL APART SO EASILY\" on a pink rectangle with a yellow outline. This is followed by white text reading \"ONE YOU'RE NOT READY YET\" on a pink background. A cartoon drawing appears of two cartoon figures, one with a ponytail, and one without. They are sitting at a table drinking a hot beverage, with smoke coming out of the cups. There are framed images on the wall. A split screen then appears with the cafe scene on the left, and on the right there is a scene of a party. The image zooms out and we see a cartoon figure in the foreground facing the split screen, then this cartoon figure walks into the party. The text \"STRONG, HEALTHY RELATIONSHIP\" is displayed, followed by images of a scale, a fist, an hourglass, a heart held in a hand and an extended hand that is open. A movie camera, film strip, book, musical notes, and logos for various media platforms appear. These images then float up through red arrows. Finally, a cartoon figure is shown holding a smartphone with a picture of two cartoon figures with the Eiffel Tower in the background. A text bubble appears reading, \"Boyfriend surprised me with a trip to Paris for our one year anniversary!\" The picture on the phone is then covered with a black oval.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "When the video narration refers to mainstream entertainment and social media settings, which element is not depicted in the video's imagery?",
+ "answer": "A",
+ "candidates": [
+ "A. A phone.",
+ "B. Musical notes.",
+ "C. Film.",
+ "D. WiFi."
+ ]
+ }
+ },
+ "dyBjnuIc": {
+ "video_id": "dyBjnuIc",
+ "video_duration": "519s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person wearing a blue shirt and a white cap is seen in a kitchen unpacking multiple white plastic grocery bags. The camera pans to the right and shows several items on the counter, including plates and glasses. It then pans to a wider shot of the apartment's living room. The video shows a person adjusting curtains on a window. There are also scenes of a person adjusting the light fixtures and various pieces of furniture. The video shows the person unboxing several picture frames and placing photographs into the frames. A small cat is also present during this scene. The person is seen placing and arranging several objects on shelves and a bar cart. A black sofa and a fireplace are also seen in the video. A person is seen taking a selfie in a mirror. Some of the text reads: \"Mademoiselle Chanel, A Novel\" and \"The Great Gatsby by F. Scott Fitzgerald\" and \"Wine Folly, The Master Guide\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the whiskey glass relative to the wine glass in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Above.",
+ "B. On the left.",
+ "C. Below.",
+ "D. On the right."
+ ]
+ }
+ },
+ "uyecmnpT": {
+ "video_id": "uyecmnpT",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Four men are playing musical instruments in a park. From left to right, they are playing a clarinet, a trombone, a double bass, a banjo, and a washboard. They are all wearing dark-colored suits and are standing on a paved area. People are walking in the background, and trees line the path. The camera angle changes, providing a full view of the band and its members. The band continues playing. A man and a woman start dancing in front of the band. A small child in an orange sweatshirt and multicolored pants joins them, mimicking their movements. The child dances with more enthusiasm as the song progresses. The band finishes playing, and the members smile.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is there a sound of a banjo?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. None.",
+ "D. I am not sure."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the five performers in the video and the man wearing glasses who is kneeling on the ground at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The band and the dancers.",
+ "B. The band and the audience.",
+ "C. The band and the conductor.",
+ "D. The band and the singer."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, where is the singing voice coming from?",
+ "answer": "C",
+ "candidates": [
+ "A. A man playing the trombone.",
+ "B. A man playing the suona.",
+ "C. A man playing the banjo.",
+ "D. A man playing the cello."
+ ]
+ }
+ },
+ "GkcryPKL": {
+ "video_id": "GkcryPKL",
+ "video_duration": "77s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man wearing a gray long-sleeved shirt and black shorts is sitting on a bench and lifting a barbell with weights. The man lifts the weight straight up from his chest, and then lowers it back to his chest. A caption that says \"MUST BE IN CALORIE DEFICIT\" appears in the lower portion of the screen. The man is now lying on his back and lifting two dumbbells. He brings the dumbbells straight up, and then lowers them back to the sides. A different man wearing a black shirt can be seen in the background. The man on the bench continues doing reps, bringing the dumbbells up and down. The man lowers the dumbbells down, and then does crunches holding both dumbbells over his chest.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "In the video, which method is not mentioned as the best way to lose fat?",
+ "answer": "B",
+ "candidates": [
+ "A. Perform low-intensity steady-state aerobic exercise.",
+ "B. Train with greater effort.",
+ "C. Use a tracking app to monitor food intake.",
+ "D. Reduce calorie intake through nutritional control."
+ ]
+ }
+ },
+ "lafhIogq": {
+ "video_id": "lafhIogq",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people stand on a stage, each holding a musical instrument. The person on the left is playing a violin and the person on the right is playing a cello. Both people wear long white dresses. There are music stands in front of both of them. A large screen behind them displays a series of changing images of people and nature. Other people walk in front of the stage.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a piano in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It is present only at the beginning of the video.",
+ "B. It is present only at the end of the video.",
+ "C. It is present throughout the video.",
+ "D. It is not present."
+ ]
+ }
+ },
+ "jdxgftDh": {
+ "video_id": "jdxgftDh",
+ "video_duration": "145s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A Minecraft player, \"wemmbu\", with a purple skin and an orange crown, is seen in a practice duel. They are repeatedly hitting the air with a sword, causing black star-shaped particles to appear. The background changes several times. Initially, it's a simple green and pink plane with a starry night sky. It then transitions to a Japanese-themed landscape with pink cherry blossom trees, then back to the simple plane. The player \"wemmbu\" stops swinging their sword. The perspective shifts to a first-person view, showing another player, \"jooonah\", running towards the camera. The players are on a green surface with floating islands. \"Jooonah\" is now wearing a black and purple outfit, while the player controlling the camera appears to be wearing a green outfit. \"Jooonah\" jumps towards and attacks the camera player, initiating a fight. The camera player opens a menu with options like \"Back to Game,\" \"Achievements,\" \"Multiplayer,\" \"Options,\" \"Lunar Options,\" \"Disconnect.\" They then close the menu. A list of players appears on screen, then disappears. The scene changes to a red and blue platform in a space-like setting. Both \"wemmbu\" and \"jooonah\" are on the platform. \"Wemmbu\" is now wearing blue and purple armor and boots, holding a sword. \"Jooonah,\" in red and green armor, jumps and attacks \"wemmbu.\" The camera player, as \"jooonah\", walks towards \"wemmbu\" and strikes them repeatedly, causing \"wemmbu's\" health to decrease. The camera perspective changes to third person, following \"jooonah\" as they chase \"wemmbu\" across the blue platform. \"Jooonah\" knocks \"wemmbu\" off the platform. The scene shifts to a maze-like arena with brown platforms, some marked with different colors. The camera, as \"jooonah\" with a green sleeve and arm visible, falls and respawns at the starting point. They jump across several platforms, attempting to reach \"wemmbu\" who is moving quickly through the maze. The camera player falls again. They respawn and quickly move through the maze, occasionally taking damage. They fall again. This sequence of respawning, navigating the maze, and falling repeats several times. The scene transitions to a Christmas-themed area with snow, presents, candy canes, and a Christmas tree. The camera player opens a menu and closes it. A list of online players briefly appears. The camera player navigates the Christmas village, moving past snow-covered blocks, wooden stalls, and presents. The video ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "At the beginning of the video, why does the character with the purple body only run and not attack?",
+ "answer": "B",
+ "candidates": [
+ "A. Because the character with the purple body has low stamina.",
+ "B. Because the attacking character is continuously replenishing energy.",
+ "C. Because the attacking character has high stamina.",
+ "D. Because the character with the purple body is not at full health."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What items are in the backpack of the character with a green head and body in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A green block.",
+ "B. A red block.",
+ "C. A blue block.",
+ "D. A brown block."
+ ]
+ }
+ },
+ "hINShVid": {
+ "video_id": "hINShVid",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with short brown hair is speaking into a microphone on a stage. He is wearing a dark blue plaid shirt over a dark t-shirt with a light-colored graphic and light gray pants. He has a yellow bracelet on his left wrist. There are other people in the background, some of whom are out of focus. He speaks and makes hand gestures while talking. At one point, he puts his hand over his mouth and nose while laughing and appears to wipe his eyes. Then he places his hand over one eye. At several points, he opens his mouth wide while talking. The background is dark and not clearly visible, though a few pieces of drum hardware are visible. There are subtitles along the bottom of the video, in white lettering.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the man's emotional reaction in the video after his daughter arrives at her destination?",
+ "answer": "D",
+ "candidates": [
+ "A. Shocked.",
+ "B. Happy.",
+ "C. Angry.",
+ "D. Sad."
+ ]
+ }
+ },
+ "DIpkcdYt": {
+ "video_id": "DIpkcdYt",
+ "video_duration": "369s",
+ "duration": "6-8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person holds a microphone. Initially, they are indoors, and a second person in the background wears a black and white patterned top. The camera angle moves outdoors to a street scene with many people. Some of them hold signs. The person with the microphone interviews several people. They hold out the microphone toward people they are interviewing and move the microphone to different people throughout the video. One person is holding a cell phone. At times the person with the microphone moves very close to the people they are interviewing or recording, and the image is obscured. A person who had been speaking then holds a newspaper in front of the camera lens to block it, and the camera focuses on the text of the newspaper. Eventually, the newspaper is moved aside and the person holding the microphone is shown close up, saying \"Can I tell you something?\" The video ends with blurred images of a city and some text is displayed.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many people were interviewed in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. One.",
+ "D. Two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the microphone positioned in relation to the man in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Left hand.",
+ "B. Right hand.",
+ "C. I don't know."
+ ]
+ }
+ },
+ "ZaTILlKa": {
+ "video_id": "ZaTILlKa",
+ "video_duration": "540s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A hand places an \"Ellie Hatchet\" box on the floor. The video cuts to a Spirit Halloween store, showing animatronics: a bride in a white dress and veil, a clown with glowing orange spiral eyes, and a girl with red eyes and pigtails. The view pans across various Halloween items, including a boxed \"Johnny Punk\" animatronic and a Sandy Cheeks costume. An orange shopping basket hangs from a display rack. A hand touches hanging skeleton props and a Frankenstein animatronic. A Beetlejuice prop stands against a wall. A hand holds up a pair of bone pliers. The camera pans past Blue's Clues and You trick-or-treat bags and packaged Scooby-Doo costumes. Finally, a hand touches animatronic zombie babies covered in spiders.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the doll of the girl with braided pigtails and pink clothing appear open-mouthed in terror in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. There is a scarecrow doll behind her.",
+ "B. A terrifying clown doll is in front of her.",
+ "C. There is a long-haired ghost with glowing blue eyes behind her.",
+ "D. Her arm has been broken."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What part of the bride doll does the bald man touch in the first half of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The thumb of the left hand.",
+ "B. The middle finger of the left hand.",
+ "C. The middle finger of the right hand.",
+ "D. The thumb of the right hand."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order in which these four dolls appear in the video? (a) The clown doll stuck in the storm drain grate. (b) The doll holding a machete. (c) The inflatable doll sitting on a swing. (d) The tall clown doll holding a hammer.",
+ "answer": "B",
+ "candidates": [
+ "A. (a) (d) (b) (c).",
+ "B. (b) (d) (a) (c).",
+ "C. (c) (d) (a) (b).",
+ "D. (b) (c) (a) (d)."
+ ]
+ }
+ },
+ "KzOiyMSp": {
+ "video_id": "KzOiyMSp",
+ "video_duration": "98s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Chemistry",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A finger is seen touching a screen and drawing various shapes and numbers. A hand is then seen handling various small, colorful beads and toothpicks. These beads and toothpicks are being connected together to form shapes. This process is repeated with different beads and toothpicks. After some time, the hand stops working with the beads and returns to the screen, where it draws more shapes and numbers.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the woman use a tablet in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four.",
+ "B. One.",
+ "C. Three.",
+ "D. Two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the first chemical symbol the woman writes on the tablet in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Oxygen.",
+ "B. Potassium.",
+ "C. Hydrogen.",
+ "D. Carbon."
+ ]
+ }
+ },
+ "asPEhVEL": {
+ "video_id": "asPEhVEL",
+ "video_duration": "86s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black shirt and black headphones is centered in the middle of the frame, looking directly ahead. On either side of the person, there are separate displays showing football match information. The displays show the same football matches and similar graphics, but display information relevant to different teams. Below the person are two more displays displaying the team logos and the current score of a game in progress. There are a number of text boxes visible in the bottom left of the frame containing what appears to be comments. The clock on the lower middle display begins at 53:37 and then counts upwards to 90:00. The score on that display begins as 1-0, changes to 2-0, then 3-0, then 4-0, then 5-0. The graphics change throughout, including showing the league table and match details for multiple football matches.The person in the middle of the frame looks around slightly and their head moves slightly throughout the video. They mouth words, and speak aloud.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotional changes does the man experience in the video upon seeing Adoze enter?",
+ "answer": "B",
+ "candidates": [
+ "A. From calm to excitement.",
+ "B. From excitement to disappointment.",
+ "C. From disappointment to anger.",
+ "D. From calm to anger."
+ ]
+ }
+ },
+ "Dmrohaza": {
+ "video_id": "Dmrohaza",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman plays a piano, a man plays a saxophone, and a woman plays a violin inside what appears to be a church or chapel. The pianist is seated on a bench in front of an upright piano, with sheet music on the music stand. She is wearing a dark top with a floral-patterned skirt. The saxophonist is standing with his feet apart, facing the camera. He is wearing a dark jacket and trousers, and a dark patterned shirt. The violinist stands to the right of the saxophonist. She is facing the camera at a slight angle and is wearing a long black dress. There are microphones in front of each musician and what appears to be a small video camera on a tripod nearby. The background features religious artwork and architectural details. The musicians' fingers move over their instruments as they play. The camera angle shifts briefly to a close-up of the pianist\u2019s hands on the keyboard and then back to the wider shot. The musicians continue to play.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the saxophone louder than the violin?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No.",
+ "C. Yes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Which instrument can be heard on the right side of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Cello.",
+ "B. Piano.",
+ "C. Violin.",
+ "D. Saxophone."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Did the sound in the video change?",
+ "answer": "D",
+ "candidates": [
+ "A. Consistently rapid.",
+ "B. From rapid to gentle.",
+ "C. Consistently gentle.",
+ "D. From gentle to rapid."
+ ]
+ }
+ },
+ "yNkBPAag": {
+ "video_id": "yNkBPAag",
+ "video_duration": "60s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. A black and white graphic of a balance scale appears in a circle with the words \"DARK JUSTICE\" underneath, over a blurred background with indistinct shapes. A silhouette of a person walking appears on the left of the screen in front of the same graphic, then another. The view changes to show a person in a red jacket, then another in a black jacket as they walk on a sidewalk in front of a large building. A logo for \"DARK JUSTICE\" and a website address appear in the lower corners of the screen. The person in the black jacket, now closer to the camera, appears to be talking on a phone while a person in a red jacket walks past. The person in black is now holding a small white object in their hand while talking on the phone. They unfold the white object to reveal several small white cards.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is likely to happen next after the two people meet?",
+ "answer": "C",
+ "candidates": [
+ "A. A 14-year-old girl will arrive at the scene.",
+ "B. The police will arrive at the scene.",
+ "C. A 13-year-old girl will arrive at the scene.",
+ "D. The police will not arrive at the scene."
+ ]
+ }
+ },
+ "vEYikJqy": {
+ "video_id": "vEYikJqy",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A four-person band plays on a stage. From left to right, a man plays a saxophone, a woman plays a five-string electric violin, a woman plays a keyboard, and a man plays a keyboard.\u00a0 The saxophone player wears a black suit jacket and a black cap.\u00a0 The violinist wears a long, light beige dress. The first keyboardist wears a long, red, sleeveless dress. The second keyboardist wears a dark suit.\u00a0 Behind the band, a large screen displays a static image of \u201cHikaru Entertainment\u201d logo. Two large black speakers flank each side of the stage.\u00a0 The screen's image changes to an advertisement for \"Durable\" and then \"Ultralife\" batteries.\u00a0 After a few seconds, the background changes to a picture of a street scene:\u00a0 a walkway bordered by buildings and lush greenery. The band continues to play. The woman in the red dress occasionally looks down at a laptop positioned on a stand in front of her keyboard.\u00a0 The lighting is dim, with purplish spotlights aimed at the backdrop.\u00a0 The camera angle remains static throughout the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "What is the first instrument sound you hear?",
+ "answer": "D",
+ "candidates": [
+ "A. Violin.",
+ "B. Cello.",
+ "C. Piano.",
+ "D. Saxophone."
+ ]
+ }
+ },
+ "vPVwGNej": {
+ "video_id": "vPVwGNej",
+ "video_duration": "35s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A hand uses pliers to cut two rows of metal pins from a larger piece. The two rows of pins are inserted into a white breadboard with red and blue lines along the side. A green rectangular microchip is then pressed onto the rows of pins. The hand uses a soldering iron and silver solder to add solder to the pins on the side of the microchip. The camera zooms in on a close up shot of the microchip. The hand adds solder to the remaining pins.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What steps does the man in the video follow, in sequence, to complete the object? (a) Place the footstrip. (b) Fix Pico. (c) Weld the footstrip. (d) Check and fix welding errors.",
+ "answer": "C",
+ "candidates": [
+ "A. (b) (a) (c) (d).",
+ "B. (c) (b) (a) (d).",
+ "C. (a) (b) (c) (d).",
+ "D. (a) (c) (b) (d)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "At the end of the video, what did the man do with the green object Pico?",
+ "answer": "B",
+ "candidates": [
+ "A. Welding was carried out.",
+ "B. An inspection was conducted.",
+ "C. An installation was performed.",
+ "D. A demonstration was made."
+ ]
+ }
+ },
+ "SiWVEKEY": {
+ "video_id": "SiWVEKEY",
+ "video_duration": "137s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Racing",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a track and field race taking place on a red track with multiple lanes. Initially, there are about 8 athletes visible, wearing various colors of athletic clothing, including red, yellow, black, and blue. They are spaced out in the lanes as they run a race. There are yellow cones marking the track and also white lines. There is a crowd of people visible in the stands. In the background, there are flags, tents, signs, and trees. As the video progresses, the athletes become more spread out, with one athlete in yellow and green ahead of the others and running along the inside of the track. The other athletes in different colors are visible behind him, on various parts of the track, and some lag behind as they run through different curves. The video includes a timer in the bottom right hand corner of the screen, displaying an elapsed time. There is also text that reads \"CR 1:48.95\" visible in this same area. As the yellow and green athlete crosses the finish line, some of the athletes behind him collapse on the track. Some individuals can be seen standing or walking around on the field, in white and other clothing, near the track. The video then shows people in the stands cheering, waving flags, and celebrating, before focusing again on the race. After that the video shows the athlete who finished first with a purple screen next to him, then again shows the athlete who finished first from different angles. The purple screen includes some of the results from the race. A red \"winner\" banner appears beneath the individual who won the race, with his name on the bottom of the banner.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What happened between Athlete 229 and Athlete 549 during the sprint phase of the competition in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Athlete 229 pushed Athlete 549 with their hand.",
+ "B. Athlete 549 overtook Athlete 229 and secured second place.",
+ "C. Athlete 229 overtook Athlete 549 and secured second place.",
+ "D. Athlete 229 tripped Athlete 549."
+ ]
+ }
+ },
+ "IpYvbmDt": {
+ "video_id": "IpYvbmDt",
+ "video_duration": "191s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man wearing a black t-shirt and glasses is holding a white rectangular object, and a white box labeled \u201cHOBOT\u201d is to his right. A small yellow cloth is to his left, next to a bottle.The man flips the object around to display different sides. It appears to have a clear plastic container on one side.Next, the man is seen holding a flatter, white, square-ish object. He is shown holding this device in multiple close-ups, pointing at different features. He then holds up a gray remote. Close-ups of the remote's buttons are shown.The man is shown holding up a yellow H-shaped object. He attaches it to the bottom of the square-ish object. Close-ups of the bottom of the device, both with and without the yellow pad, are displayed.The video shows a series of square-ish devices, in red, blue, white, and black, against a glass wall.The video then shows the white square-ish device moving up a glass wall, and then a different square-ish yellow device moving up a glass wall. Both devices have a thin black cord plugged into them as they move up the wall. The man touches the yellow square-ish device on a glass wall, and then removes the white one off the glass wall. Finally, the yellow square-ish device is shown as it is actively cleaning a glass wall.The video ends with the man holding a phone in one hand, and the white rectangular object in the other. He attaches the white rectangular device to a glass wall.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Can the machine move only in a straight line while working in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No.",
+ "C. Yes."
+ ]
+ }
+ },
+ "zegFpygl": {
+ "video_id": "zegFpygl",
+ "video_duration": "31s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A cartoon chipmunk wearing an orange tiger hoodie with a purple bow and a purple dress stands in a forest clearing. She taps her foot and looks around. She walks towards some bushes and looks concerned. She peeks through the bushes into a darker part of the forest. Two shadowy figures are visible in the distance. The chipmunk hides behind the bushes and watches as the figures come closer. The scene changes to show the two figures walking through the misty, dark area of the forest. As they get closer, it becomes clear that the figures are a bear and a fox, both wearing costumes. The fox is entangled in green vines, and is holding a small box. The chipmunk gasps as she watches them. The bear approaches the fox and helps him untangle himself from the vines. The chipmunk runs towards them.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What might happen after the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The tiger and the rabbit might leave immediately.",
+ "B. The tiger and the rabbit might become entangled in vines.",
+ "C. The tiger and the rabbit might apologize to the fox.",
+ "D. The tiger and the rabbit might fight the fox."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What are the emotional changes of the tiger and the rabbit in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. First fearless, then excited, and finally confused.",
+ "B. First confident, then fearless, and finally confused.",
+ "C. First fearless, then confused, and finally excited.",
+ "D. First scared, then fearless, and finally confused."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What did the tiger and rabbit misunderstand in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. They misunderstood that there was a bear stealing things in the forest.",
+ "B. They misunderstood that there were two bears fighting in the forest.",
+ "C. They misunderstood the sounds to be howls of wolves in the forest.",
+ "D. They misunderstood that there was a bear in the forest harming children."
+ ]
+ }
+ },
+ "rkCfGzYn": {
+ "video_id": "rkCfGzYn",
+ "video_duration": "116s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. It then cuts to a red screen with blurry shapes. The words \"PLAY\" appear in white text. Below that the words \"THE GAME\" appear in white text with a red border around them. The screen returns to black. The video cuts to a first person view of a video game. The player is holding a gun with a scope. Three bells are placed on a table in front of the player. The player shoots the three bells, and they explode. The player looks around the snowy environment. The player sees multiple snowmen and a large monster. The player shoots the monster, and a series of numbers and words appear on the screen. The player continues to shoot the monster through the scope. The player collects a green hat. A message on the screen reads \"MY HAT! MY BEAUTIFUL HAAAAAAAaaat...you're such a dick\". The player encounters other ice enemies, and shoots them from a distance. The player collects experience points as they defeat the enemies. The player runs into a town with several buildings. The video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does Tinder Snowflake attack with a cane in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Six times.",
+ "B. Five times.",
+ "C. Three times.",
+ "D. Four times."
+ ]
+ }
+ },
+ "rhiWCTTn": {
+ "video_id": "rhiWCTTn",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two guitarists play on a city sidewalk in front of a building with large glass doors and ornate stone pillars. The guitarist on the left sits and plays a classical acoustic guitar. He wears a gray sweater and dark pants. The guitarist on the right sits on what appears to be a guitar amp and plays an electric guitar. He wears a light denim jacket and dark pants. Behind them, the large glass entry doors of the building are mostly closed but have a narrow gap in the middle. A partially visible sign for HSBC is on the doors. A small cardboard box sits open between the two guitarists; a bottle and some other items are inside it.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the most likely emotional tone of this piece of music?",
+ "answer": "D",
+ "candidates": [
+ "A. Calm and steady.",
+ "B. Light and prolonged.",
+ "C. Sad and melancholic.",
+ "D. Passionate and intense."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurred in the sound of the electric guitar in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The sound of the electric guitar gradually becomes softer.",
+ "B. The sound of the electric guitar gradually becomes louder.",
+ "C. The sound of the electric guitar changes from loud to soft.",
+ "D. There is no change."
+ ]
+ }
+ },
+ "HjjZYonP": {
+ "video_id": "HjjZYonP",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person in a red shirt stands behind a blue table against a black backdrop. A red bucket sits on the table. The person points at a small circular opening near the bottom of the bucket. The camera zooms in on the opening, and the person's finger inserts a metal washer into the hole. The person spins the bucket on the table.Next, the person picks up a clear plastic jug with a red and white label, and pours its contents into the bucket. Then the person picks up a metal stand and a black rubber square, placing the square on the table and affixing the stand to it. A glass bottle is affixed to the stand using a metal clamp. The person puts on a pair of red-tinted safety glasses and places something inside the bottle. Then, using pliers, the person inserts an orange rubber stopper into the mouth of the bottle. The camera zooms in on the hole in the bucket's side, and a green laser beam is emitted from the hole, traveling in a downwards arc. The person briefly blocks the laser beam with their hand.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the man's mood after the experiment is successful?",
+ "answer": "D",
+ "candidates": [
+ "A. Elated.",
+ "B. Arrogant.",
+ "C. Indifferent.",
+ "D. Thrilled."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What liquid is added to the experimental container?",
+ "answer": "B",
+ "candidates": [
+ "A. Propylene glycol.",
+ "B. A mixture of propylene glycol and cream.",
+ "C. Cream.",
+ "D. Ethanol."
+ ]
+ }
+ },
+ "Fvvcidyv": {
+ "video_id": "Fvvcidyv",
+ "video_duration": "373s",
+ "duration": "6-8min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a Pok\u00e9mon battle from *Pok\u00e9mon Alpha Sapphire* on a Nintendo 3DS. The battle is between a Sceptile and a Lugia. The Sceptile is a light teal color and uses the move False Swipe. The Lugia is white and pink, and uses the move Struggle. The background is a light blue cloudy sky over water.The screen is bordered in pink with extra information outside the gameplay window. In the top right is the Pok\u00e9mon Alpha Sapphire logo, in the bottom left is a counter at 7110, and additional labels reading \u201cOriginal Pok\u00e9mon,\u201d and \u201cShiny Target\u201d are at the bottom. Images of the regular and shiny versions of Lugia are shown in their respective sections. A group of Pok\u00e9mon are pictured on the bottom right, and a Lugia flies in the top left corner.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How many resets are required for Lugia with a pink belly to appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Seven thousand one hundred and eleven.",
+ "B. Seven thousand one hundred and ten.",
+ "C. Seven thousand one hundred and eight.",
+ "D. Seven thousand one hundred and nine."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does Lugia with a blue belly appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the start of the video.",
+ "C. At the end of the video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What are the emotional changes of the woman in the video during the capture of Lugia?",
+ "answer": "D",
+ "candidates": [
+ "A. From happiness to nervousness.",
+ "B. From nervousness to anger.",
+ "C. From happiness to disappointment.",
+ "D. From nervousness to disappointment."
+ ]
+ }
+ },
+ "MJAXSqRz": {
+ "video_id": "MJAXSqRz",
+ "video_duration": "72s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with dark hair and a gray beard, wearing a light blue long-sleeved shirt stands outside, with a shoreline and ocean in the background. The sky is blue. Then, the camera is at a low angle looking up at the same person with his arms crossed, holding a stick. He is wearing a light blue long-sleeved shirt, brown pants, and a dark belt. The sky is dark blue. A woman in a long-sleeved light pink shirt and dark green skirt is running on the beach toward another person who is lying on the sand. The man is wearing a white shirt and dark pants. The woman kneels next to the person on the ground and puts her hands on their chest, then touches their face. The person in light blue shirt and a gray beard and hair is holding a stick with the ocean in the background and then his face comes closer to the camera while holding the stick. Then, the woman is kneeling and looking upward. Then, the person in the blue shirt is walking towards the woman and person on the ground. The scene is looking down at a low angle of the man's legs in brown pants, with the woman on the ground in the background.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is most likely to happen behind the white-bearded man in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Walks away.",
+ "B. Buries a man alive.",
+ "C. Hits a woman.",
+ "D. Buries a woman alive."
+ ]
+ }
+ },
+ "cpoJSETB": {
+ "video_id": "cpoJSETB",
+ "video_duration": "51s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Sports Game",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "The video depicts a soccer game on a green field with white lines. There are two teams; one team wears light blue jerseys with yellow accents and the other team wears jerseys with red and white vertical stripes. The soccer ball moves around the field as players from both teams run after it. The score is initially tied at 0-0. At one point, the blue team scores a goal. After the goal, one of the blue team players kneels on the field, and several teammates join him, patting his head and embracing him to celebrate the goal. The score then changes to 1-0.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the jersey number of the player from the blue team who scored the goal?",
+ "answer": "D",
+ "candidates": [
+ "A. Number 9.",
+ "B. Number 7.",
+ "C. Number 10.",
+ "D. Number 8."
+ ]
+ }
+ },
+ "onZhHJZa": {
+ "video_id": "onZhHJZa",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two young women stand side-by-side, facing the camera, holding ukuleles. The woman on the left wears a horizontally striped, rust and cream colored, long-sleeved, mock turtleneck top and blue jeans. She has shoulder-length dark hair with bangs. The woman on the right wears a black, long-sleeved mock turtleneck top and blue jeans. She has long, dark, wavy hair. Both women have small, dangling earrings. They both strum their ukuleles, looking at the instruments and occasionally glancing at each other. Their fingers move over the fretboards and they strum rhythmically. They appear to be playing the same song in unison. The video is a static shot, with the women remaining in the same position throughout. The background is a plain, light-colored wall.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, from which direction does the first sound of music come?",
+ "answer": "D",
+ "candidates": [
+ "A. The first sound comes from behind the screen.",
+ "B. The first sound comes from in front of the screen.",
+ "C. The first sound comes from the right side of the screen.",
+ "D. The first sound comes from the left side of the screen."
+ ]
+ }
+ },
+ "ESYTDrgJ": {
+ "video_id": "ESYTDrgJ",
+ "video_duration": "145s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A man with short dark hair, wearing a gray t-shirt, sits on a bed and speaks to the camera. A white bed with pillows is behind him. A cardboard box sits on the floor to his left. A small table with a lamp and a potted plant is to his right. An image of a glass cup of rooibos tea appears. The man continues speaking, gesturing with his hands. An image of rooibos tea with cinnamon sticks and a tea infuser appears. The man continues speaking. An image of a hand holding a cup of rooibos tea from above appears. The man continues speaking. The man holds up a white cat-shaped mug and dips a tea bag into it. He gestures with the tea bag and the mug as he speaks. The word \"CHAMOMILE TEA\" appears on screen. The man sets down the mug. An image of a clear glass mug containing chamomile tea appears, along with chamomile flowers. The man continues speaking, gesturing with his hands. He raises his arms and then brings them back down. He rubs his hands together and over his face as he continues speaking.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many chamomile flowers are shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Seven flowers.",
+ "B. Nine flowers.",
+ "C. Ten flowers.",
+ "D. Eight flowers."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What did the man do after taking out a cup shaped like a kitten?",
+ "answer": "A",
+ "candidates": [
+ "A. Took out a tea bag and displayed it.",
+ "B. Placed the tea bag on an acne spot.",
+ "C. Drank a sip of tea.",
+ "D. Did nothing and put it back."
+ ]
+ }
+ },
+ "HiDmUxKZ": {
+ "video_id": "HiDmUxKZ",
+ "video_duration": "36s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Chemistry",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a white lab coat stands in front of a white wall. On a table in front of them, there is a yellow box, a small black spoon, a shallow silver dish, and a clear jar. The person reaches for the yellow box and opens it with their hands. They then use the small black spoon to scoop a white powdery substance from the box. They move the spoon with the powdery substance over to the silver dish, and empty the contents of the spoon into the dish. Next they move to the clear jar and reach for the top of the jar. They then pour a liquid from the clear jar into the shallow silver dish. The person moves their hands in a circular motion inside the dish, and the powdery substance begins to froth up. They then move their hands away, and a white solid is visible in the dish. The person then uses the small spoon to stir the mixture, which appears to become bubbly and increase in size. Text on a black background then appears on the screen which reads \"But wait! Where did the fizz come from?\"",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the physical state of the substance produced when baking soda and apple cider vinegar react in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Only gas.",
+ "B. Gas and liquid.",
+ "C. Only solid.",
+ "D. Only liquid."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of actions in the woman's experiment shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. First, place the baking soda into a metal container, then put the measuring cup into a metal basin, next pour the apple vinegar into the measuring cup, and finally pour the baking soda into the measuring cup, triggering a reaction.",
+ "B. First, place the baking soda into a metal container, then pour the apple vinegar into a measuring cup, next put the measuring cup into a metal basin, and finally pour the baking soda into the measuring cup, triggering a reaction.",
+ "C. First, pour the apple vinegar into a measuring cup, then place the baking soda into a metal container, next put the measuring cup into a metal basin, and finally pour the baking soda into the measuring cup, triggering a reaction.",
+ "D. First, place the measuring cup into a metal basin, then put the baking soda into a metal container, next pour the apple vinegar into the measuring cup, and finally pour the baking soda into the measuring cup, triggering a reaction."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the woman's mood during a neutralization reaction?",
+ "answer": "B",
+ "candidates": [
+ "A. Depressed.",
+ "B. Frightened.",
+ "C. Sad.",
+ "D. Calm."
+ ]
+ }
+ },
+ "LoxMTauE": {
+ "video_id": "LoxMTauE",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Four individuals are sitting on chairs and playing saxophones. They are positioned in a row, facing the camera. Each individual has a music stand in front of them. The individuals are wearing dark jackets, jeans or dark trousers, and light shirts. The background appears to be a room with a large poster or display on the left, and a table and a group of people further in the background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "What types of saxophones are featured in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Baritone saxophone, tenor saxophone, saxophone.",
+ "B. Baritone saxophone, saxophone, alto saxophone.",
+ "C. Baritone saxophone, tenor saxophone, alto saxophone.",
+ "D. Tenor saxophone, alto saxophone, saxophone."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What is the instrument located farthest to the right in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Soprano saxophone.",
+ "B. Baritone saxophone.",
+ "C. Tenor saxophone.",
+ "D. Saxophone."
+ ]
+ }
+ },
+ "fWwiNnYW": {
+ "video_id": "fWwiNnYW",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A blonde toddler girl in a pink and purple swimsuit is in a kidney-shaped swimming pool. A woman in a yellow swimsuit is next to her in the water. A pink-haired baby sits in a yellow inflatable baby float shaped like a fish. Another toddler boy stands at the edge of the pool, watching the others. He wears blue and white striped swim trunks. The pool is surrounded by light gray paving stones and green grass. A white fence and a green bush are in the background.Close-up of the blonde girl in the pool.Close-up of the baby in the float and the woman.Close-up of the boy at the edge of the pool. He looks nervous, touching his face with his hands.A blond man puts on green swim shorts with a cactus pattern.The blond man approaches the nervous boy at the pool, puts a hand on the boy\u2019s head and speaks to him. The boy dips a toe into the pool. The boy touches the man\u2019s leg.Close-up of the boy\u2019s stomach. He touches it with both hands.The boy hugs the man\u2019s leg, burying his face against it.The man kneels and holds the boy\u2019s hands, speaking to him.Close-up of the boy. The man pats the boy\u2019s head.The man lifts the boy and carries him into the shallow end of the pool. The woman and other children are nearby in the pool.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where was the man before he walked over to the little boy?",
+ "answer": "A",
+ "candidates": [
+ "A. Sitting on the chair.",
+ "B. Sitting on the lawn.",
+ "C. In the swimming pool.",
+ "D. Inside the room."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the shape of the swimming pool in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Oblong.",
+ "B. Circle.",
+ "C. Rectangle.",
+ "D. Square."
+ ]
+ }
+ },
+ "wIyITRFB": {
+ "video_id": "wIyITRFB",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two individuals are visible in the video, each seated on chairs and each holding a stringed instrument. The individual on the left holds a violin, while the individual on the right holds a cello. Both individuals are positioned in front of a black music stand. The two individuals and music stands are set against the backdrop of a wooden deck with wooden railings, along with a body of water and trees in the background. The person on the left moves their arms and hands in a bowing motion as they play the violin. The person on the right also moves their arms in a bowing motion as they play their cello.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the violin on the left more rhythmic than the cello on the right?",
+ "answer": "C",
+ "candidates": [
+ "A. The left and right are the same.",
+ "B. Unclear.",
+ "C. No.",
+ "D. Yes."
+ ]
+ }
+ },
+ "oEKtGIOe": {
+ "video_id": "oEKtGIOe",
+ "video_duration": "126s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a white top and blue shorts initially sits in a blurry background. Then, a person wearing a long-sleeve, blue and white striped shirt grabs a long pole and rotates it. This is followed by a man in a dark suit entering the scene. Shortly after, two more men arrive: one wearing a blue shirt and light pants, and the other wearing a white t-shirt, a white cap, and dark pants. The man in the striped shirt places his hand on the shoulder of the man in the white t-shirt, while the man in blue places his hand on the man in the striped shirt. Then, the video shows a close-up of a man in a suit and sunglasses. The scene then changes to show the man in the blue shirt pointing towards the camera and the men in the white shirt and striped shirt talking. Finally the video shows the man in the blue shirt patting the hands of the men in the white shirt and striped shirt before returning to a view of the man in a suit. The video shows various views of all the men and some overlap in their actions. The background is always blurred. The video is dark in color.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Does the volume of the dialogue between the characters in the video change?",
+ "answer": "B",
+ "candidates": [
+ "A. No, there is no change in the dialogue volume.",
+ "B. Yes, the dialogue volume increases when the characters get closer and decreases when they move away.",
+ "C. The video has no sound.",
+ "D. I don\u2019t know."
+ ]
+ }
+ },
+ "wwIJcPKo": {
+ "video_id": "wwIJcPKo",
+ "video_duration": "348s",
+ "duration": "4-6min",
+ "domain": "Games",
+ "sub_category": "Sports Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a man wearing a blue shirt and headphones in a bottom left inset. The rest of the screen displays a simulated basketball court where multiple individuals wearing basketball attire are present. Throughout the video, some of the individuals pass a basketball, dribble the ball, and shoot it into a hoop. The video displays a scoreboard in the bottom right corner, which shows the score for two different players in the simulation. A text box sometimes appears at the top of the screen and displays the word \"release\" and \"coverage.\" Several additional text boxes are displayed throughout the video, containing various options and menu items for a gaming program. In the latter portion of the video, the man in the blue shirt gestures with his hands several times.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Which player number was the first to score a goal in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Number 14.",
+ "B. Number 41.",
+ "C. Number 18.",
+ "D. Number 51."
+ ]
+ }
+ },
+ "ZPygQlZO": {
+ "video_id": "ZPygQlZO",
+ "video_duration": "138s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person's hand adjusts a gold colored bolt with a finger. The hand holds two wrenches and uses them to adjust the same bolt. The hand adjusts the same bolt with their fingers, then touches other bolts. The hand adjusts the kickstand of a motorcycle. The hand reaches into the engine area of the motorcycle, and fiddles with the chain, then it pulls a gear sprocket off of the chain. The hand places two gears on a dirty, wooden table. The hand moves the gears to display the hole in the center of the gears, as well as the teeth on the outside of the gear. The hand moves the two gears and places them side by side. The hand reaches for two bolts with nuts, and points at the bolts individually.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many wrenches did the man in the video use to remove the screws?",
+ "answer": "D",
+ "candidates": [
+ "A. One.",
+ "B. Three.",
+ "C. Four.",
+ "D. Two."
+ ]
+ }
+ },
+ "YEIUOeNp": {
+ "video_id": "YEIUOeNp",
+ "video_duration": "294s",
+ "duration": "4-6min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person stands on a stage and holds a microphone while gesturing with their hands. They wear a light gray hoodie with blue lettering, blue ripped jeans, and white sneakers with black accents. Behind them is a painted wall with geometric shapes in purples, blues, and whites. To their right is a black stool, and behind that is a microphone stand. In front of the stage and facing the speaker are three people seated in white chairs. The person closest to the camera has on a tan hat. At times, the camera view shifts, changing the perspective and showing the person on stage from different angles and at different distances. The person's gestures and movements also change. At one point the camera is positioned so that the background is mostly a wall of greenery. The video ends with a title card that says \u201cKeep Your Distance\u201d in white and green letters, with pink and red accents.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion does this video aim to convey?",
+ "answer": "A",
+ "candidates": [
+ "A. Satirical commentary on racial discrimination.",
+ "B. Showcasing the charm of stand-up comedy performances.",
+ "C. Support for fans of the singer.",
+ "D. Displaying the benefits of a church's growth."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Do black men pray to God?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. I don't know."
+ ]
+ }
+ },
+ "XfhvArDf": {
+ "video_id": "XfhvArDf",
+ "video_duration": "49s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Historical Analysis",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a wide shot of the Sphinx and a large pyramid in a desert setting. Two people on horseback are visible in the distance. The scene changes to show multiple pyramids at sunset, with a bright sun low in the frame. Next, an indoor view of a museum is shown, with several statues and artifacts on display. A group of people are standing and walking around the room. The scene transitions to an extreme close up of a gold sarcophagus, with a dark headdress, and a decorated staff laid across it. The sarcophagus has a gold face and detailed features. Finally, a shot of a gold headdress encased in a glass display is shown. Several people are in the background. A woman in a light-colored dress is standing in the foreground. The last scene is a wide shot of a golden sarcophagus.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Has the condition of the Tutankhamun mask in the video changed?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. I am not sure.",
+ "C. Yes."
+ ]
+ }
+ },
+ "clgyOpgV": {
+ "video_id": "clgyOpgV",
+ "video_duration": "25s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "The video shows two shirtless men boxing in a ring with red ropes. Both men are wearing black boxing shorts and red boxing gloves. The man on the left has light-colored skin and the name \u201cDIAZ\u201d on his shorts. The man on the right has dark-colored skin and the name \u201cSHANE\u201d on his shorts. A referee wearing a white shirt and black pants can be seen in the ring throughout the video. The video shows the men throwing punches at each other. The man on the right throws several punches at the head and body of the other man. The man on the left goes down to the mat. Later, the man on the right throws several punches in rapid succession at the head and body of the man on the left. There is a crowd watching the boxing match in the background. There is a scoreboard in the bottom of the screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the boxer wearing black shoes punch?",
+ "answer": "D",
+ "candidates": [
+ "A. Nine.",
+ "B. Three.",
+ "C. Five.",
+ "D. Twenty-eight."
+ ]
+ }
+ },
+ "CjtRgQnY": {
+ "video_id": "CjtRgQnY",
+ "video_duration": "366s",
+ "duration": "6-8min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A close-up shows Manny Pacquiao's face, then switches to Lehlo Ledwaba's. Both boxers are in a boxing ring. They stand facing each other as the referee speaks to them. The camera zooms out to show the whole boxing ring, which is blue and orange, and the crowd surrounding it. Pacquiao wears blue shorts and Ledwaba wears purple shorts with silver stripes. The referee signals the start of the match. Round 1 begins. The boxers circle each other. Pacquiao throws a punch and Ledwaba blocks it. They exchange several punches. Pacquiao lands a left hook. Ledwaba stumbles and falls to his knees. He puts his gloves on the mat to steady himself as the referee starts the count. Ledwaba gets up before the count of 10 and the match resumes. Pacquiao is aggressive, throwing several punches. Ledwaba tries to keep him at bay but is having difficulty blocking all of the incoming punches. Pacquiao lands a straight left hand that sends Ledwaba stumbling against the ropes. Pacquiao pursues him, landing more punches. Ledwaba falls to the mat, face first. The referee kneels beside Ledwaba and begins counting him out. The timer shows 20 seconds left in round 1. Pacquiao goes to his corner, Ledwaba tries to get up. It seems he can't. The referee stops the fight. Pacquiao celebrates, jumping up and down in his corner and hugging his team. Members of Ledwaba's team comfort him. Round 2 begins. The two boxers circle each other again and the video ends shortly thereafter.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why did the referee separate the two fighters?",
+ "answer": "D",
+ "candidates": [
+ "A. The boxer wearing black shoes knocked down the boxer wearing white shoes with a right hook.",
+ "B. The boxer wearing white shoes knocked down the boxer wearing black shoes with a left hook.",
+ "C. The boxer wearing black shoes knocked down the boxer wearing white shoes with a left hook.",
+ "D. The boxer wearing white shoes knocked down the boxer wearing black shoes with a right hook."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the boxer wearing black shoes fall?",
+ "answer": "B",
+ "candidates": [
+ "A. Four times.",
+ "B. Three times.",
+ "C. Twice.",
+ "D. Once."
+ ]
+ }
+ },
+ "nvpqGIyi": {
+ "video_id": "nvpqGIyi",
+ "video_duration": "56s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a wide shot of a large, light-colored building with multiple arches and two tall towers, surrounded by a large crowd of people sitting on a blue tarp or mat outside. Many of the people are bowing in unison. The camera angle changes, and a different view of the building is shown. The angle shifts again showing another angle of the crowd of people on the tarp and on the steps of the building. The camera zooms in on the building, slowly rotating from left to right, until the green text on top of it can be seen. The camera angle shifts again to a row of people standing side by side, bowing their heads. Then it shifts to a close up of a person wearing a blue jean jacket, white t-shirt, and black backpack, speaking into the camera. This person uses their hand to gesture while speaking. The camera then follows the person moving out of the alleyway into the street. The view becomes blurry, and a person walking away can be seen on a sidewalk in the distance, then cars parked on the side of the road, followed by shops on a busy street.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotions do the people in front of the mosque display in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Sad and downcast.",
+ "B. Dignified and solemn.",
+ "C. Joyful and celebratory.",
+ "D. Angry and excited."
+ ]
+ }
+ },
+ "ODRcVCUE": {
+ "video_id": "ODRcVCUE",
+ "video_duration": "173s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A brown character with a beard and an eye patch, wearing a black hat with a blue rim, is shown. A black line forms an eye. The background becomes green. A figure in a dark suit is in a setting with \"TNT\" labeled boxes. The view changes to a cage with a orange and a yellow character. The orange character has a white chef's hat. The two characters look at each other with changing expressions. A green potion is visible with white star shapes and a green ring and yellow liquid. The yellow liquid moves and two white shapes appear above it. A yellow pear shape with a gray stem is shown. The yellow character moves its arms, then a scene shows a green character with green hair and a brown outfit holding its arms out. The yellow and orange characters are now on the surface of a gray planet with black dots, the characters stand by a black object with a white window, then the green-haired character appears. A hand with a red filter comes up and red shapes appear. The green-haired character and the eye-patched character are shown together. The eye-patched character waves. The green-haired character holds its black-gloved hands up and moves them in a fighting manner. An eye-patched character walks with a blue outline. The green-haired character walks around. A dark window moves with two red orbs inside and a green object is thrown at the bottom of the window. The green-haired character is shown with a black object appearing above its head. The three characters stand together on a gray surface. The green-haired character stands near a rock. The yellow character puts an arm in front of the dark rock. The yellow character stands in front of a black rock. The green-haired character looks at the black rock and a knife appears in their hand. The yellow character appears with a white helmet with a transparent window, then the green-haired character appears in a black rock, the rock floats. 
A light-pink rectangle appears, two eyes with an \"X\" in the middle appear in the rectangle. The rectangle turns gray and then black. A single eye with two \"X\" eyes appears. The eye-patched character is shown with the orange and yellow character by the dark rock. The scene goes dark then to white \"X\" eyes and shows a dark screen and then the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "In the video, how many things were shattered by the cry of the person wearing a chef's hat?",
+ "answer": "A",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Two.",
+ "D. Five."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What must the person dressed as a pirate do to alleviate the pain?",
+ "answer": "D",
+ "candidates": [
+ "A. Tell the woman with green hair how to escape the room.",
+ "B. Break free from restraints.",
+ "C. Use a multidimensional leap.",
+ "D. Tell the woman with green hair how to perform a dimensional leap."
+ ]
+ }
+ },
+ "CVMQQCTi": {
+ "video_id": "CVMQQCTi",
+ "video_duration": "149s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a dark blue t-shirt and a black cap is sitting on a stool and playing an acoustic guitar. To the right is a closed guitar case. Text overlays the video that reads \"Met you in the street,\" and then \"You walked like an angel\". The camera focuses on their face with the text still on the screen. A series of different people are shown on screen with their mouths open as if singing. They each have text overlays. The texts are \"Your voice was sweet\" and then \"Can't believe you were single\". Back on the guitarist the text reads \"Then I took you home\". A close-up of the guitar shows the person's hands strumming and picking at the strings. The text changes to \"I took your mask off and then..\". The person then holds a light blue disposable face mask and looks at it. The text changes to \"Is that how you look?\". Back to the guitarist with the text \"What the hell is that face?!\". Another series of different people are shown with their mouths open as if singing with text overlays. The texts are \"Yours is a big mistake\" then \"It doesn't match with your body\", then \"It's a 10 but your face is a 3\" then \"I'll just look at your back\", and finally \"The front gives me a heart attack\". Yet another series of people are shown with open mouths, text overlays read \"Do me a favor\". A new series with text overlays reads \"From morning till dawn\". Back to the guitarist with a close up on his mouth as the text \"Keep your mask on\" is on screen. A series of different people show up on screen each pointing to a face mask they are wearing and the text reads \"Keep your mask on\" or \"(Keep it on)\". The guitarist reappears on screen with the text \"Keep your mask on\". The guitarist is now singing and the text reads \"It's a disaster\". Another person is shown with the same text. The text on the guitarist's screen changes to \"How can this be true?\". Another person is shown with the same text. 
Then the text changes to \"They say looks don't matter\" the video again focuses on the guitarist. The text then changes to \"But in this case, they really do\". Two more people are shown on screen with the same text. The guitarist is shown with text reading \"You said I was your first guy\". The video changes to a close-up of the guitar. The text changes to \"Now I understand why\". A series of different people are shown with mouths open and the text \"Now I understand why\" appears on screen. The guitarist reappears with text saying \"What the hell is that face?!\". Two different people are on screen with the text \"What the hell is that face?!\". The guitarist reappears with text \"Yours is a big mistake\". Two different people are on screen with text \"Yours is a big mistake\" then another two are shown with text \"What can I do with you?\". Two more people show up with the text \"I could sell you to the zoo\". The text changes to \"You're ugly\" and shows the guitarist and then back to a series of people. The text now reads \"Oh my god\" and \"I'd rather get my mom\" as the people are on screen. The text returns to \"Do me a favor\" then \"From morning till dawn\" showing the guitarist. The video focuses on the guitar again then a close up of the face. The guitarist is on screen with text \"Keep your mask on\". A series of different people are on screen with the text \"Keep your mask on\". The guitarist is back on screen with the same text and then the video ends on the guitar strings.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "On which fret is the capo placed on the guitar in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 4th fret.",
+ "B. 5th fret.",
+ "C. 3rd fret.",
+ "D. 2nd fret."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man in the video appear surprised at the beginning?",
+ "answer": "A",
+ "candidates": [
+ "A. He saw someone as angelic as an angel.",
+ "B. He was surprised by the beauty of his own voice.",
+ "C. He was amazed by the smoothness of his own performance.",
+ "D. He saw someone who looked strange."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What guitar-playing techniques does the man demonstrate in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Strumming with a plectrum.",
+ "B. Plucking strings.",
+ "C. Strumming.",
+ "D. Finger-picking."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What genre of music does the male singer perform in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Rock.",
+ "B. Blues.",
+ "C. Folk.",
+ "D. Classical."
+ ]
+ }
+ },
+ "CmaDkTNS": {
+ "video_id": "CmaDkTNS",
+ "video_duration": "165s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "A woman in a white dress sits in a chair, looking downwards at something in her lap. She looks up and speaks. The camera angle widens to reveal a talk show set with the woman and three other guests seated around a small circular table. The set has grey walls with framed posters. A close-up of the woman is shown. The image temporarily distorts with intense colors. The woman holds a small, metal object, examining it with her fingers. The object is a small, ornate key attached to a chain. The camera zooms in as she continues to examine it. A man with a white beard, glasses, and a patterned bowtie sits in a chair, speaking, facing the camera. The words \u201cJAMES RANDI \u2018The Amazing Randi\u2019\u201d appear at the bottom of the screen. The man\u2019s image distorts and stretches, briefly accompanied by colored bars. The image corrects itself, but static bars continue to flash across the screen. The image momentarily blacks out and distorts further, then returns to normal with continued static. The woman in white is shown again holding the small key. James Randi leans towards her and reaches for the key. He holds the key and a small metal rod, demonstrating how they fit together. The woman covers her mouth and laughs. Randi holds both keys up and speaks, gesturing with his hands. The woman takes one of the keys. The camera zooms in as she holds the key, turning it over in her hands. Randi leans in close, gesturing at the key, and speaks. He takes the key back. Close-up of the woman and Randi's hands as Randi points out details on the key. The camera zooms out to reveal two more guests seated in the background, watching and listening. The woman briefly covers her face, then continues to hold the key. The camera angle widens to show the entire table with four guests seated. They all applaud.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occurred to the second key in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. No change occurred.",
+ "B. From straight to broken.",
+ "C. From curved to straight.",
+ "D. From straight to curved."
+ ]
+ }
+ },
+ "XhioWNSE": {
+ "video_id": "XhioWNSE",
+ "video_duration": "77s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a split screen. The left side of the split screen has four people standing in a room. The woman standing in the back has on a pink shirt and a blue head covering. The other three people in the room have their backs to the camera. The right side of the split screen shows a man sitting on a red sofa. The man has on a blue plaid shirt and grey pants. The split screen disappears and the video shows two people sitting on two different sofas in a room. The man has on a blue plaid shirt, grey pants and has his bare feet on a blue ottoman in front of the red sofa. The woman has on a light colored floral shirt and blue pants.The video focuses on the man on the red sofa, who makes various hand gestures while talking.The video focuses on the woman on the blue sofa, who appears to be talking.The video focuses on a woman with a blue head covering and a pink shirt speaking into a microphone at a podium.The video shows the man on the red sofa making various hand gestures and talking.The video focuses on the woman on the blue sofa, who appears to be talking and making hand gestures.The video shows the man on the red sofa, who is speaking and making hand gestures.The video shows the woman on the blue sofa, who is speaking and making hand gestures.The video shows the man on the red sofa, who is talking and making hand gestures.The video shows the woman on the blue sofa, who is speaking and making hand gestures.The video shows the man on the red sofa, who is speaking and making hand gestures.The video shows the woman on the blue sofa, who is speaking and making hand gestures.The video shows the man on the red sofa.The video shows the man on the red sofa, and the woman on the blue sofa sitting on their respective sofas and then quickly getting up.The video briefly shows the empty room with two sofas, and then cuts out.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the woman wearing a floral shirt like in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Discouraged.",
+ "B. Happy.",
+ "C. Angry.",
+ "D. Joyful."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who says, \"I will kiss you now\" in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A man wearing a green-checked shirt.",
+ "B. A woman wearing a floral shirt with braided hair.",
+ "C. A man wearing an orange-checked shirt.",
+ "D. A man wearing a blue-checked shirt."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What is the woman in the floral shirt doing to the man in the blue plaid shirt in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The woman in the floral shirt is trying to lift the man in the blue plaid shirt.",
+ "B. The woman in the floral shirt is trying to avoid the man in the blue plaid shirt.",
+ "C. The woman in the floral shirt is trying to hug the man in the blue plaid shirt.",
+ "D. The woman in the floral shirt is trying to hit the man in the blue plaid shirt."
+ ]
+ }
+ },
+ "lUdmMpqR": {
+ "video_id": "lUdmMpqR",
+ "video_duration": "35s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A black pan filled with spaghetti and meatballs in a red sauce is shown. The dish is garnished with chopped green herbs and grated cheese. The pan is on a light wooden surface. The pan is then shown with white liquid in it. The camera cuts to a clear glass bowl where two eggs are being poured into. Red sauce is added to the bowl, followed by grated cheese and bread crumbs. Chopped green herbs and red pepper flakes are then added. A light brown liquid is poured in and all of the ingredients are mixed with a black spatula. Raw meat is added and then mixed by hand with the other ingredients in the bowl. Small balls are formed by hand from the mixture and placed onto a lined baking sheet.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the glass bowl in relation to the iron plate at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The glass bowl is in front of the iron plate.",
+ "B. The glass bowl is behind the iron plate.",
+ "C. The glass bowl is to the right of the iron plate.",
+ "D. The glass bowl is to the left of the iron plate."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will be done to the prepared meatballs in the video next?",
+ "answer": "C",
+ "candidates": [
+ "A. Place them in the oven to bake.",
+ "B. Press the meatballs into a pancake shape.",
+ "C. Boil them in a pot of water.",
+ "D. Steam them in a pot."
+ ]
+ }
+ },
+ "YFoSwjHZ": {
+ "video_id": "YFoSwjHZ",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three men are seated on chairs, each holding a tuba. A fourth man stands behind them, observing. The men seated begin to play their tubas. The man standing claps his hands together twice, then drinks from a small bottle. The seated men continue to play their tubas. The man standing sets the bottle down on the floor. The seated men continue to play. The standing man claps twice more. The seated men continue to play. The man standing claps twice more. The seated men continue to play. One of the seated men adjusts his tuba. The standing man walks to a group of tubas standing against a back wall. He picks up a tuba and walks back to his prior position. He looks at the seated men while holding the tuba, and they continue to play. The standing man puts the tuba down next to the other tuba he set down, then resumes clapping. The seated men continue to play.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the tuba on the left in the video louder than the one on the right?",
+ "answer": "D",
+ "candidates": [
+ "A. Yes, the sound from the tuba on the left is louder.",
+ "B. No, both tubas produce the same sound level.",
+ "C. No, neither of the tubas is producing any sound.",
+ "D. No, the sound from the tuba on the right is louder."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the most likely emotional tone of the music played by the two men in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Serious.",
+ "B. Angry.",
+ "C. Sad.",
+ "D. Joyful."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "How many loud voices can be heard in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. One.",
+ "D. None."
+ ]
+ }
+ },
+ "msTnUwBG": {
+ "video_id": "msTnUwBG",
+ "video_duration": "277s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "Two female wrestlers, one blonde and one brunette, face off in a wrestling ring. The blonde wrestler wears a grey and black top, black shorts, and purple knee and elbow pads. The brunette wrestler wears a black top, black pants with purple details, and purple wristbands. The blonde wrestler punches the brunette wrestler repeatedly. The brunette wrestler grabs the blonde wrestler and slams her to the mat. The blonde wrestler gets up and Irish whips the brunette wrestler into the ropes. The blonde wrestler runs towards the ropes herself and kicks the brunette wrestler in the face as she rebounds. The brunette wrestler falls to the ground. The blonde wrestler whips the brunette wrestler into the ropes again and clotheslines her as she rebounds. She then whips the brunette into the corner turnbuckle and punches her in the head. She Irish whips the brunette across the ring and dropkicks her in the chest as she hits the opposite turnbuckle. She follows up with a leg drop, then pins the brunette, but the brunette kicks out at two. The blonde wrestler picks the brunette up, but the brunette reverses the hold and punches the blonde wrestler several times. She Irish whips the blonde into the ropes, then performs a Russian leg sweep, knocking the blonde to the ground. She then throws the blonde onto the mat by her hair. The brunette whips the blonde into the ropes and attempts a clothesline, but the blonde ducks under and clotheslines the brunette. The blonde wrestler runs towards the brunette wrestler and kicks her in the face. The brunette wrestler stumbles to her feet and throws a series of punches, which the blonde wrestler blocks. The brunette performs a snap suplex, sending the blonde wrestler crashing to the mat. She attempts a pin, but the blonde kicks out at two. The brunette tries to pick the blonde up, but the blonde reverses the hold into a backslide pin, also getting a two count. The brunette picks the blonde wrestler up again, but the blonde gets free and hits the brunette with a spinning heel kick. The brunette falls to the ground. The blonde whips the brunette into the ropes and performs a Lou Thesz press, mounting the brunette and punching her repeatedly. The brunette crawls to the corner and pulls herself up using the ropes. The blonde runs toward her and performs a monkey flip, sending the brunette over the ropes and onto the mat outside the ring. The referee begins counting. The brunette gets up at eight and slowly returns to the ring, where the blonde wrestler whips her into the corner and kicks her in the chest. The brunette slumps to the mat. The blonde wrestler picks up the brunette wrestler and slams her onto the mat. She goes for a pin, but only gets a two-count. The brunette kicks the blonde and pulls herself to her feet by the ropes. The blonde whips the brunette into the ropes, ducks under the brunette's clothesline attempt and hits the brunette with a bridging German suplex and pins her, getting the three count.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What move does the woman with black hair use to knock down the woman with blonde hair at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A standing moon throw.",
+ "B. A leg grab takedown.",
+ "C. A body slam.",
+ "D. A flying kick takedown."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the competition in the video take place?",
+ "answer": "B",
+ "candidates": [
+ "A. Gymnastics hall.",
+ "B. Wrestling arena.",
+ "C. Martial arts gym.",
+ "D. Taekwondo dojo."
+ ]
+ }
+ },
+ "isMcSZkA": {
+ "video_id": "isMcSZkA",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man plays a clarinet while another man plays a classical guitar. A grand piano is visible behind them. The camera zooms in slightly, and the names \"Valerio Battaglia, guitar\" appear on screen. The view changes, showing four musicians on a small stage: a pianist, a percussionist, the clarinetist, and the guitarist. The names \"Francesco Scrofani Cancellieri, piano\" are displayed. The percussionist uses a single cymbal, and the pianist has a small drum attached to the piano. The camera angle shifts again, focusing on the percussionist and pianist. The camera then cuts back to the full view of the four musicians on stage.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occurred in the volume of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The instrument's volume gradually decreased and then returned to normal.",
+ "B. The instrument's volume gradually increased.",
+ "C. The instrument's volume gradually increased and then returned to normal.",
+ "D. The instrument's volume gradually decreased."
+ ]
+ }
+ },
+ "uIcRxoGg": {
+ "video_id": "uIcRxoGg",
+ "video_duration": "109s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video starts with a shot of a dimly lit corridor, the walls and floor appear to be a dull yellow color. The camera moves to a cartoon image of a person with pale skin, large ears, and a third eye in the middle of their forehead, they are wearing a white collared shirt, a red tie, and a gray suit jacket. The cartoon person's mouth is open. The cartoon is shown several times in slightly different angles and with some filters. A black box appears with white text in a list format, \"2022 goals for @Broogli\", then two bullet points, followed by \"Broogli Merch\", and then \"2022 goals for @Twoogli\", and then three bullet points. Then there is a black box with the text, \"Broogli\" and \"200K subscribers\", then another black box with the text \"Stats\", \"Joined Jun 20, 2017\", and \"29,139,808 views\". The cartoon is shown again, in different angles and with different filters. Text appears over the cartoon, \"5 Backrooms entities that I would date...\", and the cartoon is shown again, in different angles. A black box appears with the text, \"Twoogli\" and \"11.6K subscribers\". A thumbnail appears with a person wearing a hat and headphones, with an arrow pointing to an image of an iceberg that says \"Broogli Iceberg\". The video ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will happen to the subscription for the second channel?",
+ "answer": "B",
+ "candidates": [
+ "A. Remains stable.",
+ "B. Increases day by day.",
+ "C. I am not sure.",
+ "D. Gradually declines."
+ ]
+ }
+ },
+ "PgbRbjoT": {
+ "video_id": "PgbRbjoT",
+ "video_duration": "129s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two men are standing on a badminton court. The man on the left wears a red long-sleeved shirt and black shorts. The man on the right wears a black and red short-sleeved shirt and black shorts. The man in red holds a shuttlecock and gestures with his right hand. The man in black holds a badminton racket. The man in red tosses the shuttlecock toward the other man, who bends his knees, shifts his weight to his right foot, and swings his racket as if hitting the shuttlecock. The man in red then continues gesturing and appears to be speaking. He picks up the shuttlecock and repeats the toss, with the other man mimicking the same swing, shifting his weight and bending his knees. The man in red repeats these actions once more. He then places his right hand on the shuttlecock and gestures again while appearing to speak. He takes the racket from the other man and holds it in his right hand. He then demonstrates a similar swing, shifting his weight and bending his knees, while continuing to gesture. He repeats these swings several more times, sometimes holding the shuttlecock in his left hand, sometimes not. He holds up his index finger as if making a point. He stops demonstrating swings and holds the racket and shuttlecock while he continues to gesture. He demonstrates the swings again a few more times. He continues to gesture and appear to speak while holding the racket.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What does the student wearing a black and white shirt and black shorts do with the badminton racket in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Catches the ball thrown by the coach.",
+ "B. Raises the racket high over the head to intimidate the opponent.",
+ "C. Continuously hits the air with the racket to improve reaction speed.",
+ "D. Checks and repairs the racket."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the coach wearing a red shirt and black shorts relative to the student wearing a black-and-white shirt and black shorts in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Directly behind.",
+ "B. Directly in front.",
+ "C. Directly to the right.",
+ "D. Directly to the left."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did the video include a scene of two people playing a sport?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "sWgpWzPZ": {
+ "video_id": "sWgpWzPZ",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "There are four people standing, each in front of a music stand. Each person is playing a saxophone. Each person is wearing different colored tops. The person on the left is wearing a striped top, the second from left is wearing a blue top, the third is wearing a brown top and the one on the right is wearing a red and black plaid top. They each have similar curly hair. All of the people are standing against a black wall, with four spotlights above them.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is there saxophone sound throughout the entire video?",
+ "answer": "B",
+ "candidates": [
+ "A. No, there is no saxophone sound.",
+ "B. Yes, there is saxophone sound.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "BNwDPOrI": {
+ "video_id": "BNwDPOrI",
+ "video_duration": "393s",
+ "duration": "6-8min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins in a dark environment, with a red, blurry, distorted image of a person in the center of the screen. The video then switches to a first-person view of two hands holding a green bladed weapon with a black handle. The character climbs over a wire fence. The view then shifts as the character climbs over rocky and brick-like surfaces, then proceeds up a wooden staircase. Once at the top of the stairs, the character continues to move forward, occasionally moving between buildings. At one point, an explosion can be seen at the edge of the screen. The character climbs onto the roof of a building, and two characters dressed in green appear. The character fires a weapon and the camera view becomes shaky. The character then enters an interior room with several chairs in it, then proceeds through multiple other rooms. In the interior of one room, the character appears to acquire a small orange object. The character climbs through the window of a room with a metal grate at the entrance, and proceeds through another room with a metal grate. The character then uses an orange object to climb up a wall. The character enters a room with white jail cells, and then proceeds through a door. In a new room, the character uses a bladed weapon and a gun to attack multiple characters, and to make their way through the different areas. At several points, there are blood splatters, and explosions. Some areas appear to have flames. At one point, the character is seen using two keys to lock a door, then proceeds to climb on different surfaces and use a bladed weapon. At 5:40, the character uses a zip line in an area with several enemies. The character then continues to move forward using a gun to fight multiple characters. After the character moves through a doorway they have successfully completed their task, and acquire items from various surfaces.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "In the video, what does the player's game character do after entering the first house but before climbing the first staircase?",
+ "answer": "D",
+ "candidates": [
+ "A. Wander around in circles.",
+ "B. Defeat an enemy.",
+ "C. Collect supplies.",
+ "D. Sprint to move ahead."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the weather like in the game shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Thunderstorm.",
+ "B. Overcast day.",
+ "C. Rainy day.",
+ "D. Sunny day."
+ ]
+ }
+ },
+ "KVrzfrNs": {
+ "video_id": "KVrzfrNs",
+ "video_duration": "47s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A child in a white shirt and dark shorts is seen pushing a toy car on a green surface. Then a two-tiered, decorated chocolate cake with lit candles on top is shown. Two adults hold the cake. Someone sprinkles decorations onto the cake and adjusts a yellow decoration. The two adults carry the cake. Several adults and children are visible in the background. The adults carrying the cake walk past them, and towards the children seated on a blue platform. The adults approach a child in a white shirt, who is standing and looking at the cake. The adult holding the cake and the other adult stand by as the child appears to make a wish. The child then looks at the cake.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the woman holding the cake in the video urge the little boy to blow out the candles?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. Five.",
+ "C. Eight.",
+ "D. Seven."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the mood of the boy celebrating his birthday in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Happy.",
+ "B. Sad.",
+ "C. Afraid.",
+ "D. Shy."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Is there a change in the volume of the people in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "whbDHTxD": {
+ "video_id": "whbDHTxD",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Chemistry",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a person wearing a dark blue striped shirt working in a cluttered room containing glassware, a fan, two cans, and other equipment. They are seen weighing a substance and handling it, placing it into what appears to be a pan. The video then cuts to a dark outdoor shot where a small, orange-red flame is briefly visible. Following this, the video shows the same flame again, and then shows the flame burning with more intensity, throwing off sparks and some purple smoke. The video then shows a person holding a can and bending or crushing it. The can is then held out and a flash of white light is visible. The person holds the can in two hands, bends and opens it more, and another flash of white light is visible. The video ends with the person holding the can still, with some of the white light still showing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In what setting do the man and the aluminum cans appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A bright mall.",
+ "B. Grass in the evening.",
+ "C. A pitch-black room.",
+ "D. A field of wheat."
+ ]
+ }
+ },
+ "WBtKutyY": {
+ "video_id": "WBtKutyY",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman and a man play a violin and a cello, respectively. The woman stands on the left side of the frame, wearing a sleeveless, knee-length maroon dress with small white floral patterns. Her dark brown hair is shoulder-length. She holds the violin under her chin, her left hand on the neck of the violin and her right hand moving the bow across the strings. The man is seated on the right side of the frame, wearing a dark blue, long-sleeved, button-down shirt and brown pants. He holds the cello between his legs, his left hand on the neck of the cello, and his right hand moving the bow across the strings. A silver and gray condenser microphone stands on a black stand to the left of the woman. They both look at their instruments as they play. The background is a plain off-white wall. The lighting is bright and even.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, which instrument is the first to make a sound?",
+ "answer": "B",
+ "candidates": [
+ "A. A violin.",
+ "B. A cello.",
+ "C. A piano.",
+ "D. An accordion."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "In the video, what method does the man use to produce sound from the strings?",
+ "answer": "D",
+ "candidates": [
+ "A. Pluck the strings with fingers and strike them with the bow stick.",
+ "B. Draw the strings with a bow and strike them with the bow stick.",
+ "C. Pluck the strings with fingers, draw them with a bow, and strike them with the bow stick.",
+ "D. Draw the strings with a bow and pluck them with fingers."
+ ]
+ }
+ },
+ "tnfuJyYK": {
+ "video_id": "tnfuJyYK",
+ "video_duration": "78s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A woman with shoulder-length brown hair is facing the camera and talking. She is wearing a black t-shirt and is standing in front of a beige-colored paved area and a white gazebo-like structure. The camera angle changes, and then pans across a building, a wooden bench, and a lot of trees. Next, a logo for \"Knotts Berry Farm\" is overlaid on the video, along with text stating \"Thank You Visit Buena Park For sponsoring today's video.\" The camera view changes to the interior of a vehicle where four children are visible. The first child on the left has short brown hair and wears a dark blue shirt. The child next to him has blonde hair and is wearing a long sleeve orange and dark blue shirt. The next child is in the middle, wearing a dark blue and gold shirt and is looking straight ahead. The child on the right has blonde hair pulled up into a ponytail, and is wearing a dark blue shirt. The children talk and look around. The view changes to an exterior shot of a building with \"Knotts Berry Farm\" signage and decorations. The video shows children on several different amusement park rides. One amusement park ride has multiple swinging seats in a row, and another is a tall, flat, rectangular surface with children seated across it as it moves up and down. Next, the camera shows some people looking at a different amusement park ride. The camera shows a figure of a dog with a black nose and large black ears in a canoe on top of a small waterfall. Several children are playing near the waterfall. The video shows an exterior view of a building with the words \"Snoopy's Camp Store.\" A character figure of a dog is seen on the building with moving gears and clock mechanisms. Next, a person in a dog costume is walking near a paved area. Finally, a woman with brown hair is shown walking in a park. The camera switches to a child wearing a red shirt standing next to a wooden barrier. Finally, the view shifts to show the same child standing in front of a sign.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order in which the children play on the rides?",
+ "answer": "D",
+ "candidates": [
+ "A. First the drop tower, then the giant swing, and finally the small airplane.",
+ "B. First the small airplane, then the giant swing, and finally the small airplane again.",
+ "C. First the small airplane, then the drop tower, and finally the giant swing.",
+ "D. First the giant swing, then the drop tower, and finally the small airplane."
+ ]
+ }
+ },
+ "QAGirumX": {
+ "video_id": "QAGirumX",
+ "video_duration": "56s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video starts with a first-person view of a narrow passage, with dark walls on either side. At the end of the passage, there's an opening with a ladder and a glimpse of daylight. The view then shifts to a first-person perspective inside a building, with orange crates in the background. The view changes again to show a first-person view, with a weapon, pointing forward. Then it shifts to another view, where bullet flashes are visible. The view switches to two people celebrating. The view changes again, now a dark circle with orange at the bottom, like looking through a scope. The view switches again, looking through a scope, at a wooden crate with a wall. The view shifts again, and a white sniper rifle is visible from a first-person perspective. The view shifts to show a person prone on the ground, visible in the distance, and then shifts again to looking through the sniper scope. The view shifts again to a first-person perspective showing a sniper rifle pointed forward. The view shifts again to show a person prone on the ground, visible in the distance, then shifts again to a first-person view of an assault rifle pointing forward. The view changes to a first-person perspective of someone holding a remote control device, with a prone body on the ground in the distance. Then the view shifts to a person kneeling and fiddling with a remote control. Then to a person walking forward with a gun, and then again a person crouched holding a rifle. Then a first-person view of an assault rifle, then a view of a first person holding a sniper rifle, then a view through a scope of a prone figure. Then, the view shifts to a first-person perspective of a sniper rifle, then to a view through the scope of a prone figure. Then the view changes to a first-person perspective of an assault rifle. Then there is a view of people celebrating. Then, a view of two people hugging. The video then ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Which part of the video shows a man wearing glasses?",
+ "answer": "C",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. He does not appear.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurs in the background music in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Remains loud.",
+ "B. From loud to soft.",
+ "C. Remains soft.",
+ "D. From soft to loud."
+ ]
+ }
+ },
+ "zGuBfeit": {
+ "video_id": "zGuBfeit",
+ "video_duration": "96s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A 3D animated character with gray hair, wearing a black and gray outfit is shown against a blue background. This character then falls and the screen shows a blur, before transitioning to an overhead view of trees, followed by a close up of two other characters. The character with red hair and a black hat faces another character with dark red hair, and they blink. The two characters' faces are shown in detail before transitioning to another two characters, one with yellow hair and one with black hair. In the next shot, four characters are shown together; one has yellow hair, two have dark hair, and one has pink and brown hair, and the character with pink hair is holding a sign that says \"YOU'LL BE SORRY!\" This is followed by a shot of four characters, one with white hair, one with yellow hair, one with light pink hair, and one with green hair, in front of a light brown background. Next, a character with white hair, and a black and white dog-like character are shown looking at a very large character with black and white, mask-like features. The next scene shows a character with orange hair holding a soccer ball, followed by a scene with seven characters in a forest. Then a soccer ball is shown and the scene transitions to another character with gray hair on a golden background. Then, a dark haired character on a yellow background. Next is an overhead view of trees with a sky and clouds, and a character with black hair falls from the sky, laughing. This character then hits the ground, followed by a soccer ball hitting the character's head. Three characters look at the fallen character and one other orange-haired character is shown smiling. Finally, the group of characters are shown looking at the fallen character, and the scene ends with a close up of that character and a sign that says \"NICE FOOTWORK DUM-DUM\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual events occur in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The soccer ball exploded after being kicked.",
+ "B. The soccer ball was kicked into the air.",
+ "C. Someone was injured.",
+ "D. There was no soccer ball."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Which team claimed victory?",
+ "answer": "C",
+ "candidates": [
+ "A. The team with the character who has pink-brown hair wins.",
+ "B. The team with the character who has white hair wins.",
+ "C. It is a draw.",
+ "D. The match has not yet ended."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the action of raising the sign occur in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. One.",
+ "D. Four."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the final state of the soccer ball?",
+ "answer": "C",
+ "candidates": [
+ "A. The soccer ball was kicked to pieces.",
+ "B. The soccer ball remains unchanged.",
+ "C. The soccer ball was kicked flat.",
+ "D. The soccer ball was kicked into the air."
+ ]
+ }
+ },
+ "FBjXFpMz": {
+ "video_id": "FBjXFpMz",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man and a woman stand before a painting of a mountain landscape. They both hold microphones and sing. The man wears a pink shirt and a tie with a floral pattern. The woman wears a pink turtleneck and a dark jacket. The scene changes to an older man and woman. The man sits and plays a white bass guitar. He wears a white long-sleeved shirt and dark pants. He sits facing the camera, angled slightly to the right. A small amplifier sits on the floor beside him. The woman sits at a piano and plays. She wears a light pink and white floral patterned top and lavender pants. She sits facing right, her back slightly towards the camera. Sheet music rests on the piano's music stand. The same mountain landscape painting hangs on the wall behind them.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the piano more rhythmic than the electric bass in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. None of the above.",
+ "B. I am not sure.",
+ "C. No.",
+ "D. Yes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the most likely emotional tone of the music performed by the elderly couple in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Sad and sorrowful.",
+ "B. Serene and peaceful.",
+ "C. Majestic and heroic.",
+ "D. Lively and cheerful."
+ ]
+ }
+ },
+ "JqpYjIoj": {
+ "video_id": "JqpYjIoj",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video starts with a first-person perspective view of a dilapidated brick rooftop at sunset. The camera pans to show a crumbling wall and overgrown foliage. The scene then transitions to a character customization screen in a video game. The screen displays various armor pieces and gear options. The background features a cloudy sunset sky. A pair of hands then enters the frame, holding a helmet in front of the camera. The hands place the helmet on the character's head, obscuring the view briefly. The hands then remove the helmet, returning to the view of the sky. The video repeats this action several times, showing different types of headwear being equipped and unequipped: a dark gas mask, a welding helmet, a biker helmet, another type of gas mask, and then back to the original helmet. Text appears on the left side of the screen, reading: \"Requirements: FDDA, DLTX, MCM Menu\". The video continues with the same cycle of equipping and unequipping headwear, now with the text visible on screen. The video ends with the view of the cloudy sunset sky.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "In the video, how does the man's mood change when the player puts on the helmet?",
+ "answer": "C",
+ "candidates": [
+ "A. Becomes more tense.",
+ "B. Shows no obvious change.",
+ "C. Becomes more excited.",
+ "D. Becomes more angry."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "At what time does the opening scene of the video take place?",
+ "answer": "B",
+ "candidates": [
+ "A. In the morning.",
+ "B. At dusk.",
+ "C. At dawn.",
+ "D. At noon."
+ ]
+ }
+ },
+ "rhudALCb": {
+ "video_id": "rhudALCb",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people are shown in the video, each sitting on a chair. The person on the left is holding and playing a cello, using a bow. The person on the right is holding and playing a trumpet. A music stand is situated between the two people. The background features a wall with a white, decorative pattern, and a shelf with candles and white flowers. \"Tryos Ensemble\" text appears over the video near the end.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is there a trumpet sound in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. No, only the sound of the cello.",
+ "B. Yes, there is a trumpet sound.",
+ "C. No, only the sound of the piano.",
+ "D. No, only the sound of the violin."
+ ]
+ }
+ },
+ "qfdgTeNb": {
+ "video_id": "qfdgTeNb",
+ "video_duration": "82s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "Two people are on a stage. One person, wearing a long pink dress, a head covering, and white shoes is holding a black microphone and is speaking. The other person, wearing a white head covering and a white dress, is sitting at a metal table with a black laptop on it. They too are holding a black microphone and speaking. Behind them, there are light blue curtains and a purple banner with text on it, as well as yellow and orange decorations with butterflies. A third person enters the stage wearing a dark jacket, black pants, a dark covering over their head and face, and a light blue square on their shirt with an \u2018f\u2019 on it. This person is holding a black microphone and is also speaking. The person wearing the white dress stands and then proceeds to kneel down on the stage. Confetti and white pieces of paper are on the ground. Three additional people dressed in dark clothing enter the stage. One of them wears a mask and has a light blue square with a 'g' on it, one has a number 3 sign attached to their shirt, one has a light blue square with a bird icon on it. They appear to interact with the person kneeling on the stage. All six individuals on the stage are holding microphones.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many actors in the video represent social media platforms?",
+ "answer": "A",
+ "candidates": [
+ "A. Four.",
+ "B. One.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "wWqnMCJy": {
+ "video_id": "wWqnMCJy",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person plays a grand piano, their hands visible in the foreground. The camera angle is low, looking up along the keyboard. Reflected in the polished surface of the piano is a slightly distorted view of their hands, the keyboard, and a cellist sitting further back in the room. The cellist is a young man with short brown hair, wearing a teal t-shirt. He holds the cello between his knees and draws the bow across the strings. A large diaphragm condenser microphone on a stand is positioned near the cellist, presumably recording the music. The background is plain white. Both musicians' hands move continuously as they play, their fingers pressing keys and moving across the cello strings and bow.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Does the cello have a stronger sense of rhythm than the piano in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. None of the above.",
+ "B. Yes.",
+ "C. I am not sure.",
+ "D. No."
+ ]
+ }
+ },
+ "tuSvIMdQ": {
+ "video_id": "tuSvIMdQ",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A mountain biker rides down a trail in a wooded area. The biker goes over several small jumps and bumps, and navigates around trees. The biker rides down a small drop. Another biker rides a drop between two trees. A short clip plays of two bikers riding down a steep and dusty trail. The video cuts to a first-person view of a biker riding through a muddy, wooded trail, narrowly avoiding a tree. A biker rides off a jump as another biker waits at the bottom. The MTB Travel | Review logo is displayed against a black background. Two mountain bikers stand in a wooded area with their bikes. They are wearing helmets and gloves. The biker on the right wears a light gray t-shirt with a blue logo. The biker on the left wears a dark gray t-shirt and khaki cargo shorts. Both bikers appear to be talking to the camera. The overlay text \u201cEPISODE 4: DROPS IN 3 EASY STEPS!\u201d appears on screen.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Did the man in the dark gray T-shirt eventually learn the skills for downhill mountain biking?",
+ "answer": "B",
+ "candidates": [
+ "A. He has not learned.",
+ "B. He learned.",
+ "C. None.",
+ "D. It is unclear."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual behavior did one of the men at the back exhibit while riding at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Collided with a tree.",
+ "B. The bike malfunctioned while riding.",
+ "C. Fell while going downhill.",
+ "D. Jumped over an obstacle."
+ ]
+ }
+ },
+ "FnkEbRve": {
+ "video_id": "FnkEbRve",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A girl is initially shown playing a violin while standing on a sandy beach with water in the background. She wears a yellow and orange dress. Then, another girl with braided pigtails appears, sitting on a striped case while playing a cello. She wears a white top and light-colored shorts. The two continue to play together.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the function of the cello case in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Used as a support for a cello.",
+ "B. Used as a music stand.",
+ "C. Used as a seat.",
+ "D. Used as a decoration."
+ ]
+ }
+ },
+ "yMUEXRfR": {
+ "video_id": "yMUEXRfR",
+ "video_duration": "55s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A black, a white and purple, and a green toy train are positioned on a light blue track, next to yellow, pink, and light blue foam letters D, E, and F, respectively. The camera focuses on each train and letter individually. Then, a hand places the three trains on three parallel tracks: a green track for the green train, an orange track for the black train, and a blue track for the white and purple train. The trains begin moving along their respective tracks. The green train pulls two green cargo cars filled with yellow cubes. The track splits, and the green train goes up a ramp, passing plastic palm trees, a \"Reptile Park\" sign, and a plastic tree with an orange snake wrapped around it. The three trains go up three separate ascending ramps, colored green, orange, and blue. The trains continue on their paths, going over hills and bridges, occasionally passing other toy trains and vehicles. The tracks converge as they pass under a beige and light blue overpass. The trains stop under the overpass. A circular inset with a picture of the white and purple train and the name \"Etienne\" appears on screen.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the black train relative to the green train at the end of the race?",
+ "answer": "D",
+ "candidates": [
+ "A. In front of the green train.",
+ "B. Behind the green train.",
+ "C. On the right side of the green train.",
+ "D. On the left side of the green train."
+ ]
+ }
+ },
+ "MTmzEYuz": {
+ "video_id": "MTmzEYuz",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person is sitting on a red couch. The person is wearing a light blue t-shirt and dark pants. The person holds a light brown ukulele. The person moves their fingers on the neck of the ukulele and strums the strings in a variety of different ways. At one point, a hand comes into the frame, touching the person's shoulder. The hand then guides the person's fingers on the ukulele. The video then blurs and then fades into greyscale, and finally, the video fades to black.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many variations of ukulele playing are shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Seventeen.",
+ "B. Eighteen.",
+ "C. Nineteen.",
+ "D. Twenty."
+ ]
+ }
+ },
+ "dpyuxcpa": {
+ "video_id": "dpyuxcpa",
+ "video_duration": "60s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video begins with a white screen. A black logo with the letter \"M\" appears in the left center, followed by the word \"TOOLKIT\" to the right. Then, \"MUSICIAN'S\" appears above it and \".com\" appears below it. The image then fades to reveal a background featuring a man holding a gold tuba. Another tuba can be seen on a black tablecloth in the background along with other various objects on a table. The man\u2019s fingers are shown on the tuba, pressing the valves on the instrument. The image zooms in to focus on the man\u2019s fingers pressing the tuba valves. The image zooms back out to show the man holding the tuba. The image zooms in again to focus on the man\u2019s fingers pressing the tuba valves. The image fades to black, with the logo in white and the words \"MUSICIAN'S TOOLKIT\" and \".com\" again displayed to the right.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the most likely emotional tone of the music being played in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Plaintive.",
+ "B. Melancholic.",
+ "C. Joyful.",
+ "D. Lyrical."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "Based on the analysis of this video, what is the performer's profession?",
+ "answer": "A",
+ "candidates": [
+ "A. An instrument shop owner.",
+ "B. A musician.",
+ "C. A music enthusiast.",
+ "D. A music teacher."
+ ]
+ }
+ },
+ "SFwSbVCw": {
+ "video_id": "SFwSbVCw",
+ "video_duration": "183s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man runs from a large, multi-limbed creature in a dark, watery landscape. It rains. The text \"ON THE ROAD\" appears. A man in a captain's hat sits at the controls of a futuristic vehicle. Snow falls around a large, wrecked crane in a desolate, wintry landscape. A futuristic, three-wheeled vehicle drives through a desert. Another vehicle, larger and with robotic legs, follows, and a man climbs onto its roof. The text \"ON THE BED\" appears. A man stands with his back to the camera in a minimalist room. Walls flicker and change, showing colorful interference and a cityscape. A man lies asleep on a bed. The text \"ON THE DRAWBRIDGE\" appears. A group of figures in orange armor hold rifles. One fires at a flying mechanical creature, which falls to the ground. A larger, four-legged mechanical creature with glowing orange parts is visible. The orange-armored figures stand near it. A cloaked figure approaches containers in a red, dusty landscape. Figures with umbrellas float in the air. The cloaked figure shoots a floating figure with a rifle, and the figure disintegrates. The text \"ON THE BEACH\" appears. Robed figures in red stand around a red container. The lid opens, mechanical arms extend outward, and a figure with long, dark hair and red and gold armor rises. The robed figures raise their hands, and the figure in armor raises their arms. The armored figure has black face paint resembling a smile. An orange aiming reticle hovers near them. A man in dark futuristic clothing sits in a chair while a figure in armor places a hand on his shoulder. The armored figure speaks, gestures, and smiles. The man reaches for the small, puppet-like doll attached to his clothes, and the doll moves its arms. The armored figure holds a dark pod-like container with straps. The man stands with the armored figure, who places a hand on his arm and takes the guns of two of the surrounding armored figures. The man turns his head to look at the armored figure with the painted smile, and she smiles at him. She places a hand on his chest, turns, walks away and fires a large gun. The armored figure fights another armored figure with a sword, deflecting their attack. The figure with the painted smile wields a gun-shaped weapon, which sparks with white electricity. The armored figure and sword fighter continue their duel. The two fighters leap, their weapons clashing, and the figure with the painted smile drops to one knee.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What kind of vehicle is the man with the yellow box driving at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A truck.",
+ "B. A sedan.",
+ "C. An off-road vehicle.",
+ "D. A motorcycle."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Does the volume of the background music change in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The volume first decreases and then increases.",
+ "B. The volume is gradually decreasing.",
+ "C. The volume first increases and then decreases.",
+ "D. The volume is gradually increasing."
+ ]
+ }
+ },
+ "IHjZUzwi": {
+ "video_id": "IHjZUzwi",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with brown hair is shown holding a small object constructed from a wire frame and red material, and a piece of paper. They then use their fingers to shape the object. The person then holds the same object and a larger white sheet of paper with a circular pattern printed on it and uses some wiring. A person wearing a green shirt is then shown holding a different, circular object. Then they are shown holding a rectangular object with blue and white colors. Lastly the person is shown holding the circular object with a piece of material on top and they place it on top of a silver ball on top of a table.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times is the metal ball lifted?",
+ "answer": "C",
+ "candidates": [
+ "A. Three times.",
+ "B. Five times.",
+ "C. Four times.",
+ "D. Six times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the metal ball float in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The strength of magnetic repulsion is adjusted with an electric current to suspend the metal ball.",
+ "B. The strength of magnetic attraction is adjusted using an electric current to suspend the metal ball.",
+ "C. The metal ball is placed in a vacuum and is suspended using the force of adsorption.",
+ "D. The airflow generated by a high-speed rotating turbine suspends the metal ball."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "From which angle was the final scene of the video filmed from the man's perspective?",
+ "answer": "C",
+ "candidates": [
+ "A. Back.",
+ "B. Left side.",
+ "C. Right side.",
+ "D. Front."
+ ]
+ }
+ },
+ "UjuGbwUi": {
+ "video_id": "UjuGbwUi",
+ "video_duration": "246s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A red bell pepper, two zucchinis, a white onion, and a pink knife with a floral design are arranged on a gray cutting board. A hand with holiday-themed nail polish appears and adjusts the vegetables' positions slightly. A metal bowl filled with water is brought into view, near the cutting board. One zucchini is picked up and sliced into round pieces, then diced into small cubes, which are placed in the metal bowl. The second zucchini undergoes the same process. The red bell pepper is cored and deseeded, then diced into small pieces, which are added to the bowl with the zucchini. The onion is sliced thinly, then diced, and the pieces are added to the bowl. The bowl is now filled with diced zucchini, red pepper, and onion. The inside of an Instant Pot is shown, clean and empty. A vacuum-sealed package of organic grass-fed ground beef is shown and then added to the Instant Pot. The package is opened inside the pot, and the ground beef is broken apart with a wooden spoon. Sliced mushrooms are added to the Instant Pot, followed by the diced vegetables. A small amount of garlic powder is added, then the vegetables are stirred with the ground beef. Beef broth is poured into a black measuring cup, then added to the pot. A colander containing kidney beans is shown and then poured into the pot. A can of diced tomatoes is opened and added to the pot, followed by another can of fire-roasted diced tomatoes. A can of diced green chilis is opened and its contents are poured into a measuring cup, then added to the pot. Three bay leaves are added to the pot. Chili powder, cumin, onion powder, garlic powder, salt, and pepper are also added, then stirred in. The lid is placed on the Instant Pot, and the steam release valve is turned to the \u201csealing\u201d position. After some time has passed, the pot is opened, and the chili is stirred with a wooden spoon. The cooked chili is served in a decorative bowl and topped with sour cream and shredded cheese.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many types of fresh vegetables are used in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. Six.",
+ "C. Five.",
+ "D. Four."
+ ]
+ }
+ },
+ "rTQrCLZh": {
+ "video_id": "rTQrCLZh",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A close-up of the bottom portion of a two-stringed Chinese instrument called an erhu is shown. The erhu rests on a black stand on a wooden surface. The main body is a shiny, golden octagonal shape, and the front is covered with a decorative mesh. A blue cloth is visible inside. The two strings extend upwards. The image transitions, briefly showing a man in a red shirt playing the erhu. The image then focuses on the man, now shown seated on a stool against a backdrop of blue curtains. He holds the erhu in his left hand and the bow in his right. He appears to be demonstrating how to play the instrument, moving the bow back and forth across the strings. The words \"Erhu\" appear on screen. The man continues to play, and the name \"Dr. Ming Wang\" is displayed. The camera zooms in slightly as he plays. The image then shifts to a close-up of his hands playing the erhu. The words \"Danny Boy\" are displayed. The video returns to a medium shot of Dr. Wang playing, and then cuts back to the close-up of his hands. After a while, the camera again shows a medium shot of Dr. Wang. The words \"Music from 'The Godfather'\" appear as Dr. Wang continues to play the erhu. The video then transitions to another close-up of Dr. Wang's hands. Finally, the video returns to a medium shot of Dr. Wang playing, as the screen fades to black.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion does the piece played in the video convey?",
+ "answer": "C",
+ "candidates": [
+ "A. Gentleness and calmness.",
+ "B. Violence and bloodshed.",
+ "C. Sadness and melancholy.",
+ "D. Happiness and joy."
+ ]
+ }
+ },
+ "ujUabiIh": {
+ "video_id": "ujUabiIh",
+ "video_duration": "120s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video is presented in a frame. In the top left corner of the frame, there is a video of one person sitting behind a desk. In the top right of the frame, there is a video of four people sitting at a table. Below that, there is a video of three of the same people also at a table. In the bottom left corner of the frame, there is a collage of the people from the video. A text box is displayed at the bottom of the frame that reads \"SUBSCRIBE FOR MORE CRITICAL ROLE HIGHLIGHTS!\". In the top left video, the person looks at the camera and then moves his hands and then looks down at the table. In the top right video, the person on the right side of the frame looks at the camera and then looks down at the table. The woman in the video reaches up and touches her hair. The other two people look at the table. In the middle video, the person in the middle of the frame puts her hands behind her head and looks up at the ceiling and moves her head, and then places her hands on her table. The man on the left looks at his phone, and then down at the table. The person on the right side of the frame looks down at the table. In the top left video, the person looks up and speaks, then smiles. In the top right video, the woman looks at the table and gestures with her hands. The man in the top right video on the left laughs, and then puts his hand to his face. The man in the top right video on the right side of the frame puts his hand to his mouth and looks at the table. In the bottom video the person in the middle of the frame looks up and smiles. The man in the bottom video on the right side of the frame looks down at the table. In the top left video, the person looks at the camera and then looks down at the table. In the top right video, the man on the left side of the frame gives a thumbs up sign. The woman puts her hand to her face. The person on the right side of the top right video laughs and smiles. In the middle video, the man in the center leans forward at the table and laughs, with his hand on his chin. The person in the middle of the video puts her hands on her face and laughs. The person on the right of the bottom video looks at the table. Then a picture appears in the bottom left corner of the frame. The picture is of a person riding a horse. The top right video shows the man on the left with his hand on his chin. The woman in the video has her hands on her face and the man on the right has his head in his hand and is looking at the table. The middle video shows the person on the left laughing, the person in the middle with her hands on her head, and the man on the right with his head in his hand. The picture on the bottom left changes to a person with a falcon. The top right video shows the man on the left with his hand to his face. The woman on the top right video laughs and looks around the table. The man on the right of the top right video has his hand to his head. The middle video shows the man on the left with his hand to his chin, the woman has her hands on her head and the man on the right has his hand on his chin. In the top right video the person on the right of the frame claps his hands. The woman in the video claps her hands as well and the man on the left claps his hands. In the middle video the man on the left of the frame drinks out of a mug, and the woman looks down at the table. The person on the right is clapping and looking up. Then, the man on the left drinks out of the mug, the woman in the top right drinks out of a glass, and the man on the right side of the video drinks out of a glass. The man in the bottom right is drinking out of a mug, and then the others drink out of their glasses, and the top left person drinks out of a glass and then looks at the camera. Another picture appears of a person with character information in the bottom left corner of the frame. The person in the top right video moves his arm and points. The person on the top left gestures with his arm. The top left person gestures with his hands. The person in the top right video puts his face in his hands. The middle video shows the person in the middle laughing, the man on the left looks down, and the man on the right is also laughing. The video in the bottom left corner changes to show the logo of the video. The man on the top left drinks from a glass, and the woman in the middle of the bottom video looks down at the table.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did the video feature the 8th speaker?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No.",
+ "C. Yes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who says \"Please roll\" at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A man with blue hair wearing a black short-sleeve shirt.",
+ "B. A woman in a plaid shirt.",
+ "C. A woman in a baseball jacket.",
+ "D. A man in a plaid shirt."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the woman wearing the baseball jacket relative to the woman wearing the plaid shirt in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. To the right of the woman in the plaid shirt.",
+ "B. To the left of the woman in the plaid shirt.",
+ "C. Above the woman in the plaid shirt.",
+ "D. Below the woman in the plaid shirt."
+ ]
+ }
+ },
+ "ZHiUnzfj": {
+ "video_id": "ZHiUnzfj",
+ "video_duration": "97s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video opens with the cover of a Nintendo Switch game, titled \"Race With Ryan,\" which features a cartoon drawing of a person riding a red toy car. In the lower right corner, a child is shown with a blue shirt. The background is a colorful, vibrant design. The camera then transitions to video game footage showing a racing kart on a blue track with colorful buildings in the background and other karts ahead. Next, the kart races on a sand-colored track near log cabins. A panel containing text, along with cartoonish splashes appears over the gaming footage. The text reads, \"Race against your friends in up to 4 players split screen.\" This panel is followed by a picture of a computer, then an Xbox console, a PlayStation console, and a Nintendo Switch console. Video game footage of the racing kart is again shown, along with split screen footage of the same race. Next, the video shows split screen footage of four game characters inside a car. The video returns to a bright blue background. The boy in the blue shirt is then seen cheering with the words, \"Ryan's World\" appearing next to him. A yellow and orange banner with various game characters at the top of the screen and YouTube video thumbnails appears next, followed by a written text warning. A boy and a box of toys are then shown against the blue background. To the right, a video game image of the boy appears. The image is followed by text describing a rate and review and then the same \"Race With Ryan\" video game covers are shown. A screen shot of a Google search describing the release date of the game is shown, then an \"Outright Games\" logo. Video game footage with different environments and characters then appears, with a blue dog on a wooden platform, a cat scaling a rock wall, a bird on a rocky terrain, a bird collecting purple gems, another bird leaping through a wintery landscape, a cartoon dog and other characters in a supermarket setting, and a character with a sombrero driving a tractor. More split screen racing footage is shown. Then, text appears, that reads, \"Take the wheel as me, Ryan, and all my friends.\" Then split screen racing footage with a cartoon panda on the side of the screen.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, in what order does the narration introduce the game features?",
+ "answer": "C",
+ "candidates": [
+ "A. (A) Introduce the game developer, (C) Guide viewers to consider if the game is worth playing, (E) Discuss platforms supported by the game, (B) Show game visuals, (D) Introduce the game background.",
+ "B. (C) Guide viewers to consider if the game is worth playing, (B) Show game visuals, (D) Introduce the game background, (A) Introduce the game developer, (E) Discuss platforms supported by the game.",
+ "C. (B) Show game visuals, (E) Discuss platforms supported by the game, (D) Introduce the game background, (A) Introduce the game developer, (C) Guide viewers to consider if the game is worth playing.",
+ "D. (D) Introduce the game background, (A) Introduce the game developer, (E) Discuss platforms supported by the game, (C) Guide viewers to consider if the game is worth playing, (B) Show game visuals."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "When was this racing game released?",
+ "answer": "C",
+ "candidates": [
+ "A. Four three seven seven zero.",
+ "B. Four three seven one five.",
+ "C. Four three six five five.",
+ "D. Four three seven seven six."
+ ]
+ }
+ },
+ "SyuApLeo": {
+ "video_id": "SyuApLeo",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two drummers play a synchronized drum beat in a room with recording equipment. The camera first shows a wide shot of the two drummers, then focuses on one drummer playing, then switches back to the wide shot, then focuses on the second drummer, and continues switching back and forth between these three views. The two drummers are wearing black t-shirts. The first drummer's t-shirt has an illegible graphic on it, while the second drummer's shirt has the word \u201cHELLOWEEN\u201d on it. The drummers wear over-the-ear headphones. Multiple microphones are positioned near each drum set, along with a laptop, and video cameras on tripods. The drummers hit various drums and cymbals in a steady rhythm, their movements mirroring each other\u2019s.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the laptop positioned relative to the performer on the left in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Behind.",
+ "B. On the right side.",
+ "C. On the left side.",
+ "D. In front."
+ ]
+ }
+ },
+ "MPcnwbTM": {
+ "video_id": "MPcnwbTM",
+ "video_duration": "583s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a close-up of a fig growing on a branch. The camera then pans out to show a patio with several fig trees in pots. A hand picks a fig from a tree and cuts it in half with a knife, revealing the inside of the fruit. The hand places the fig halves on a decorative plate. The camera then moves around the patio, showing more fig trees and a man in a red shirt sitting at a table. The man eats a fig from the plate. The video ends with a close-up of the figs on the plate.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will the man do next in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Purchase fig wine.",
+ "B. Care for the fig tree.",
+ "C. Make fig jam.",
+ "D. Make dried figs."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occur in the form of the Brown Turkey in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. First, it is complete, then only a little remains.",
+ "B. Initially, it is half, then only a little remains.",
+ "C. First, it is complete, then it becomes half, and finally, only a little remains.",
+ "D. First, it is complete, then it becomes half."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the Brown Turkey not popular?",
+ "answer": "D",
+ "candidates": [
+ "A. Because it matures early and is easily affected by rain.",
+ "B. Because it has many different names, which confuses people.",
+ "C. Because it tastes like cantaloupe but is not sweet enough.",
+ "D. Because its fruit has a large hollow center and is easily split."
+ ]
+ }
+ },
+ "erqXWPoB": {
+ "video_id": "erqXWPoB",
+ "video_duration": "64s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A man with short brown hair and glasses wearing a blue and white striped shirt stands in a room with multiple monitors, several of which display weather maps. He speaks, and the video then shows a larger view of the room, which has a semicircular desk and a large circular display area. A digital globe appears in the display area with an outline of the Arctic Circle. Blue arrows indicating wind direction appear on the globe, along with wave height indicators of \"5-1m\", \"2+m\", and \"1-2m\". A graphic of green and red shipping lanes and a label reading \"ICEBERG HAZARDS\" and \"ICE\" is shown on the globe. A label saying \"CLOSED\" appears with an arrow indicating ice expansion. The video then cuts to a graphic of a weather forecast for Nuuk, Greenland. It shows a blue circle partially filled, a yellow sun, and a wind direction symbol. Text labels say \"41\u00b0\", then \"19\u00b0\", then \"18\u00b0\" with \u201cMostly Sunny\u201d and \"ESE 15-20 KM/H\u201d.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the central passage across the Arctic closed?",
+ "answer": "B",
+ "candidates": [
+ "A. A better route has been established, and the central passage is no longer used.",
+ "B. The weather along the flight route has become severe.",
+ "C. The airways are closed during winter as airlines do not operate.",
+ "D. Anti-aircraft weapons have been deployed in certain areas along the route."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What could be the profession of the storyteller?",
+ "answer": "C",
+ "candidates": [
+ "A. A journalist.",
+ "B. A photographer.",
+ "C. A host.",
+ "D. A meteorologist."
+ ]
+ }
+ },
+ "WrcvSYWZ": {
+ "video_id": "WrcvSYWZ",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two drummers are playing on separate drum kits in a large room with an audience seated in chairs in the background. The drummers are facing each other at an angle. The drummer on the left is wearing a horizontally striped shirt and grey trousers. His drum kit is blue. The drummer on the right is wearing a black t-shirt and grey shorts. His drum kit is dark blue/black. They play complex rhythms in unison and occasionally trade short fills and solos. A man and a woman walk in from the back of the room and stop briefly to watch, then continue walking.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur in the rhythm of the music in the latter half of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The rhythm starts slow and gradually speeds up.",
+ "B. The rhythm suddenly changes from slow to fast.",
+ "C. The rhythm becomes erratic, alternating between fast and slow.",
+ "D. The pace gradually slows down."
+ ]
+ }
+ },
+ "hxdqtlal": {
+ "video_id": "hxdqtlal",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two people stand on a rocky surface. They both wear headphones and hold acoustic guitars, and each is in front of a microphone on a stand. A silver laptop sits on a small black stand in the foreground. The background is a landscape with red rocks. The camera view changes to be an overhead, wide shot. The two people stand in the middle of a large, red, rocky area. The camera view changes back to a close, front-facing shot of the two people. The camera view changes to be a close-up shot of the person on the right. The camera view then changes to be an overhead, wide shot of the two people. The video displays a superimposed image of mountains and rocks.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who made the first sound heard at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. It is unclear.",
+ "B. A man in a light-colored jacket on the right.",
+ "C. A man in a white long-sleeve shirt on the left.",
+ "D. Both men together."
+ ]
+ }
+ },
+ "UvexAryE": {
+ "video_id": "UvexAryE",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man with glasses plays the violin. He wears a dark blue, patterned long-sleeved shirt and has earbuds in his ears. A microphone is positioned to the left of him. Behind him, partially visible, is a bed with a gray headboard, gray pillows, and a white blanket. The wall behind him is a muted blue-gray, with an angled portion of the wall and ceiling meeting above him. A portion of a window is also visible in the top right corner of the frame.The video transitions to a split screen. On the left, a younger man plays an acoustic guitar. He wears a white t-shirt and jeans. The wall behind him is the same blue-gray, with an angled ceiling, and a window is visible. A red and white pennant is hanging on the wall, and the edge of a chair is visible next to him. On the right side of the split screen, the violinist continues playing, in the same setting.The video cuts back to just the violinist, in the same setting as before. He plays the violin, looking at the camera and occasionally smiling. The video transitions once again to the split screen, with the guitarist on the left and the violinist on the right. They both continue to play their instruments. The guitarist occasionally looks down at his guitar. The violinist continues to look at the camera.The video then cuts to the guitarist alone, playing a fingerstyle melody on the guitar. The setting is the same as before.Finally, the split screen returns, with the guitarist and the violinist concluding the piece together.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Which instrument is the first to produce sound when the video starts?",
+ "answer": "C",
+ "candidates": [
+ "A. A guitar.",
+ "B. A piano.",
+ "C. A violin and a guitar.",
+ "D. A violin."
+ ]
+ }
+ },
+ "sMJaPTZt": {
+ "video_id": "sMJaPTZt",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man in a red shirt sits at a grand piano and plays. Another man, wearing a light brown shirt, sits on a chair to the right of the piano and plays an erhu. The camera focuses on the erhu player\u2019s hands as he moves the bow across the strings and his fingers press and slide on the neck of the instrument. The camera angle widens to show both musicians playing. A bouquet of flowers sits on the floor beneath the piano. The camera focuses again on the erhu player. He is sitting with his legs slightly apart, and his body moves gently as he plays. He occasionally glances at the piano player. The piano player plays with focused concentration, occasionally looking at his hands, the piano keys, and the other musician. The camera cuts back to the erhu player. His body sways and moves gently, and he changes his fingering and bowing techniques throughout the performance. The camera pans to a wider shot of the stage and the two musicians. The camera then focuses on close ups of the erhu player again. The erhu player occasionally nods his head and looks at the other musician. He continues to move the bow across the two strings in varying strokes and speeds, sometimes holding the erhu closer to his face. The camera returns to the wider shot of the two performers playing together. The piano player plays continuous streams of notes, with his head slightly tilted down. The camera focuses on the erhu player again as he smiles briefly and then closes his eyes. Close up on the erhu\u2019s neck and tuning pegs, and the player\u2019s hands move rapidly on the instrument. The camera zooms out again to show both musicians from further away as they play. Close ups alternate between the two performers as they play. The piano player occasionally purses his lips, sings along silently, and nods his head in time with the music. The camera zooms out again to show the full stage and both musicians. 
The erhu player sways more dramatically, and the bow moves vigorously across the strings. The piano player matches the increasing tempo.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur in the sound of the erhu in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From fast-paced to heavy.",
+ "B. From heavy to fast-paced.",
+ "C. From joyful to gentle.",
+ "D. From joyful to gentle."
+ ]
+ }
+ },
+ "NvmDeSqG": {
+ "video_id": "NvmDeSqG",
+ "video_duration": "36s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Basketball",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a basketball game in progress. The court is light tan wood with a red circle in the center. There are multiple people on the court wearing white jerseys and blue jerseys. The score is tied, 110-110, in overtime, with seconds remaining on the clock. A person wearing a blue jersey shoots a basketball. The ball goes into the net, and people in the crowd raise their hands. After the shot, some of the people wearing the white jerseys walk towards a person in a white shirt. The person in the white shirt then high-fives the people on the court. Multiple people stand on the sidelines, wearing white and black shirts. Then the video shows the crowd sitting in the stands.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes does the player in the blue jersey exhibit in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From disappointment to excitement.",
+ "B. From anxiety to anger.",
+ "C. From excitement to disappointment.",
+ "D. From excitement to anxiety."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the role of the man wearing a white shirt, black pants, and a watch?",
+ "answer": "D",
+ "candidates": [
+ "A. The referee.",
+ "B. The coach of the blue team.",
+ "C. A spectator.",
+ "D. The coach of the white team."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event occurred at the end of the game?",
+ "answer": "B",
+ "candidates": [
+ "A. Overtime started immediately after the game ended.",
+ "B. At the final moment of the game, a player from the blue team scored a goal.",
+ "C. At the final moment of the game, a player from the white team scored a goal.",
+ "D. At the end of the game, a conflict broke out between the two teams."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man wearing the number 1 blue soccer jersey do with the ball during the match?",
+ "answer": "C",
+ "candidates": [
+ "A. Successfully scored a goal.",
+ "B. Had the ball taken by someone in a blue uniform.",
+ "C. Attempted to score but failed.",
+ "D. Passed to other team members."
+ ]
+ }
+ },
+ "JElOueCd": {
+ "video_id": "JElOueCd",
+ "video_duration": "293s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with short dark hair and a blue shirt sits in the lower left corner of the video, which is presented as a split screen with a close-up of hands holding a brown box in the larger portion of the screen. The box has red printed labels on it. The hands use a silver tool to pry the box open, revealing bubble wrap inside. The hands pull the bubble wrap out of the box and then remove a clear plastic case holding a trading card from within the bubble wrap. The camera focuses on the back of the card in the case, with the hands holding the case up to the camera. The hands show the back and front of the card several times, and the camera zooms in on details. The camera then focuses on the entire card in the plastic case. The man in the lower left is seen with his eyes wide and his hands on his head, and then with his hands in the air. The man is seen with hands covering his face, and then holding his head again. The hands then show both the front and back of the card again. A white bar with text at the bottom of the screen shows details about the card being shown. The video ends with several close-up shots of the card.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video, how many times did the man slash the delivery box with a knife?",
+ "answer": "C",
+ "candidates": [
+ "A. Two times.",
+ "B. Five times.",
+ "C. Three times.",
+ "D. Four times."
+ ]
+ }
+ },
+ "pcaszxvS": {
+ "video_id": "pcaszxvS",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person plays a flute while another person plays a xylophone. The flutist's hands are visible, holding the silver flute horizontally and pressing the keys. The flutist reads sheet music from a tablet attached to a stand. The xylophone player's hands are also visible, holding two mallets with pink heads and striking the wooden keys of the xylophone. The xylophone is on a black stand. There are blank sheet music pages on a stand next to the xylophone. The background is a wall of square, light brown tiles. A white electrical outlet is visible on the lower right of the frame.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, where is the loudest instrument located?",
+ "answer": "B",
+ "candidates": [
+ "A. In the front.",
+ "B. On the left.",
+ "C. In the back.",
+ "D. On the right."
+ ]
+ }
+ },
+ "IPGCgluH": {
+ "video_id": "IPGCgluH",
+ "video_duration": "511s",
+ "duration": ">8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "An aerial view of a coastline with rock formations and boats in the turquoise water is shown. Then, the video cuts to a city street with buildings, cars, and people walking. A tram travels up a graffiti-covered street. A winding road on a mountainside above a layer of clouds is displayed. The video shows an aerial shot of a domed stadium in a city. Next, there's an aerial view of a long sandy beach next to a town, with the ocean extending out. A castle sits atop a rocky hill, with a garden in the foreground. A bridge illuminated at night spans a river with a city on either side. Several wooden barrels are shown in front of a blurred background of buildings. An aerial view of a bridge crossing a river and a city built on the sides of hills is presented. The video cuts to black with an orange \"5\" and the words \"Fantastic English\" in white. A map shows the location of the Azores islands in the Atlantic Ocean. An aerial view of a steep, rocky, green coastline meeting the ocean is shown. A room with bunk beds and a coat rack is presented. Finally, an aerial shot of a long aqueduct spanning a valley with roads and trees below is displayed, followed by a close-up, ground-level view of cliffs along a shoreline.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many football stars were mentioned in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Five.",
+ "C. Two.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the sound of comments popping up appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Seven.",
+ "B. Five.",
+ "C. Eight.",
+ "D. Nine."
+ ]
+ }
+ },
+ "CFfWQggQ": {
+ "video_id": "CFfWQggQ",
+ "video_duration": "49s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A pink fox-like character with a crown sits on a purple cushion in a room and proposes playing hide-and-seek. The fox-like character then stands up and says \u201cok\u201d. The character then says that they will look for the others. There is a table in the center of the room with three other characters, a monkey-like character, a bunny-like character, and a green character with large glasses, each holding a book. The fox-like character covers their eyes and counts to 5. The bunny-like character then hides in the fireplace. The green character with glasses then appears confused and says \"hide\". The monkey character starts to walk around the room and says \"hide\". The pink fox covers their eyes and says \"hide\", then lifts their hands up and says \"hide\". The pink fox then goes to a spot next to the table, the bunny-like character peeks out from behind the table, and the green character hides next to the fireplace. The pink fox character then touches their chin, and walks away from the table, saying \u201cwhere are you\u201d. The pink fox turns around to face the green character, then asks \u201cwhere are you\u201d and the green character says \u201cyou found me\u201d. The green character asks \u201cwhere is Jen\u201d, then the bunny character closes their eyes and says \u201chide\u201d. The pink fox says \u201cYou found me. Where is my friend Poki\u201d and then the pink fox walks up to the yellow couch and says \u201cwhere is my friend Poki\u201d and then the pink monkey pops out from behind the side of the chair and asks \u201chuh?\u201d. The pink fox character raises their hands up and says \u201cboo\u201d and then there are four squares with the faces of the pink fox, the green character, the bunny character and the monkey character and they say \u201clet's play together, hide and seek\u201d.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What are the little animals doing in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Exercising.",
+ "B. Chatting.",
+ "C. Playing hide and seek.",
+ "D. Reading and studying."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "Where does the monkey in the video eventually hide?",
+ "answer": "A",
+ "candidates": [
+ "A. Behind the sofa.",
+ "B. Under the carpet.",
+ "C. Under the table.",
+ "D. Behind the table."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times is \"escondidas\" heard in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Eight times.",
+ "B. Seven times.",
+ "C. Six times.",
+ "D. Nine times."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotional transition of the baby monkey in the latter part of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. First proud, then surprised.",
+ "B. First tense, then proud.",
+ "C. First tense, then surprised.",
+ "D. First surprised, then excited."
+ ]
+ }
+ },
+ "RMofQasT": {
+ "video_id": "RMofQasT",
+ "video_duration": "57s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows two bottles, one with a red label and liquid, and the other with a white label and black floral pattern. In the foreground is a glass, and it begins to be filled with a red liquid. Next, a man is shown behind a kitchen counter, wearing a blue dress shirt with a gray bowtie and vest. He pours the red liquid from the bottle into a small glass, then pours this glass into the previously mentioned large glass. Then, the man appears to be shaking an ingredient from a silver container. He then pours from a clear bottle into a small clear glass, then pours this from the small glass into the larger glass. The man takes a white object and scrapes the rim of the larger glass, and then stirs the contents of the glass. The man opens a dark bottle and pours liquid from it into the large glass. The man uses a spoon-like utensil to stir the contents. Next, the man adds a slice of fruit to the glass and two slices of a red fruit. The man then adds a straw to the glass and then holds the glass. Finally, he drinks the contents of the glass. The video returns to showing the large glass with its contents and surrounding ingredients and bottles.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "How many milliliters of soda does the man pour into the measuring cup?",
+ "answer": "C",
+ "candidates": [
+ "A. 60 ml.",
+ "B. 40 ml.",
+ "C. 45 ml.",
+ "D. 50 ml."
+ ]
+ }
+ },
+ "ZDYBwNYz": {
+ "video_id": "ZDYBwNYz",
+ "video_duration": "100s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Two people wearing hard hats stand in front of a window and look outward; one person holds a rolled-up document. The view switches to a cityscape with many buildings and a blue sky with clouds. The view changes to a waterfront scene with a large white building with a distinctive shape; a city skyline is in the background. A dam is shown with flowing water. A movie poster is superimposed over the view of the dam; this view switches to a view of the dam with two movie posters superimposed over it, which switches back to the view of the dam. The view changes to a cityscape at night. The view switches to a beach with many people, umbrellas, and tall buildings in the background. Three instances of a tall tower are superimposed over a view of the Grand Canyon. A large indoor structure is shown, containing an aircraft body, and moving around in the indoor structure. The view switches to a different structure and shows a large aircraft engine. The view switches back to the indoor structure, showing people working on an aircraft. The view switches to a view of an aircraft being lowered into the structure. The view switches to a view of the aircraft, where it is being constructed. The view switches to an airplane that is fully built, and is being moved inside the structure. The view changes to a view of a swimming pool, overlaid over the airplane construction; a text overlay is included. The view changes to a view of a stadium, overlaid over the airplane construction; a text overlay is included. The view switches to a view of the interior of the structure, where an image of a white car is superimposed over the area where an airplane is being constructed. The view switches to an aerial view of an airfield, with many small planes parked and one larger plane taxiing. The view changes to an aerial view of a large, roughly circular area of water and land, which switches to an illustration of the same area with some text overlay. 
The view switches to an image of earth from space; the globe is rotating. The view switches to an image of a mountain range, overlaid with text. The view changes to a view of a barren landscape with some craters, and switches to different views of the planet Mars, overlaid with text. The view switches to a yellow star, with a black disc obscuring part of it. The view switches to a diagram of the solar system. Finally, a view of a countdown appears, from the number five to the number two.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What model of aircraft is manufactured by the factory shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Boeing 747 aircraft.",
+ "B. Boeing 787 aircraft.",
+ "C. Boeing 767 aircraft.",
+ "D. Boeing 737 aircraft."
+ ]
+ }
+ },
+ "hBPOXjKA": {
+ "video_id": "hBPOXjKA",
+ "video_duration": "272s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with glasses and a blue t-shirt stands in front of a hydraulic press attached to a digital scale. He holds a thin white cord looped through a metal ring. He gestures towards the apparatus. He loops the cord around the end of the press and pulls it tight. The camera zooms in as he pumps the hydraulic lever. The digital scale displays increasing numbers: 0.01, 0.02, climbing steadily. He continues pumping until the scale reads 0.07, then pauses. He resumes pumping, and the numbers increase until the scale briefly reads 0.25. He releases the lever slightly. He starts pumping again, and the scale increments until it briefly displays 0.53. He manipulates the metal rods of the press, slides his fingers along them, then returns to pumping the lever. The scale reading climbs to 0.58. He continues pumping until the cord slips, and the scale returns to zero. He picks up the white cord and runs his fingers along it. The camera angle shifts to a close-up of his hands. He has two metal rings and the cord. He demonstrates how to tie a knot with the cord around a metal ring, making several wraps. He repeats the process, demonstrating the knot again, pulling the cord tight against the ring. The camera returns to the wider shot of the hydraulic press and scale. He places the newly tied knot on the end of the press and pulls the cord taut. He begins pumping the hydraulic lever, and the numbers on the scale increase. He pumps continuously, and the scale reads 0.1, 0.2, 0.3, and continues upwards. The numbers increment until the scale briefly reads 0.83. The pumping continues, and the scale fluctuates around 0.8, going slightly higher and then lower. The text \u201cDouble wrap slip knot, unraveled 79kg\u201d appears on screen. The knot breaks, and the scale returns to zero. The man holds another length of cord and loops it through a metal ring. He demonstrates a different knot. The camera zooms in on his hands. 
He brings the cord over and wraps it in a figure-eight fashion around the ring. He repeats the process to show the knot again. The camera returns to the wider view of the man, press, and scale. He loops the new knot over the press apparatus, pulls the cord tightly, and pumps the lever. The numbers on the scale increase. He pumps continuously until the scale reaches 1.09, and the cord breaks. The text, \u201cFigure 8 loop with a single wrap, Break Force 109kg,\u201d appears. The man demonstrates the knot again with the cord and ring. The camera angle changes to a close-up of his hands. He ties a knot around one of the metal rings, quickly pulling it tight. He displays the knot for the camera. The scene shifts back to the man, the press, and the scale. He threads the cord through the ring attached to the press, pulling it tight. He begins pumping the lever. The scale reading increases as he pumps. The scale goes above 0.1 and keeps climbing. It fluctuates between 0.08 and 0.09. It passes 0.1 again and continues upwards. It reaches 1.42 before the cord breaks.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the purpose of the machine in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. To conduct testing.",
+ "B. To serve as a decoration.",
+ "C. To display at an exhibition.",
+ "D. To perform experimental tests."
+ ]
+ }
+ },
+ "OaDXXEyV": {
+ "video_id": "OaDXXEyV",
+ "video_duration": "51s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Historical Analysis",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows an archway with the words \"SWAZILAND NATIONAL MUSEUM UMSAMO WESIVE\". The video then cuts to a room with several mannequins that have dark skin and dark hair. Some of the mannequins are wearing grass skirts and beaded necklaces. One is wearing an animal hide. One mannequin is holding a wooden spear and shield. There is also a display case with a large round shield with cow print. Then, a display with several long spears is shown. Next, a woman wearing a white and black polka dot shirt is shown standing in front of a display of photographs and informational signs. She is then shown shaking hands with a man who is out of focus.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where are the people in the video visiting art pieces?",
+ "answer": "A",
+ "candidates": [
+ "A. The Vatican Museums.",
+ "B. The Tate Gallery.",
+ "C. The Metropolitan Museum of Art, New York.",
+ "D. The National Gallery, London."
+ ]
+ }
+ },
+ "UHiTZveu": {
+ "video_id": "UHiTZveu",
+ "video_duration": "25s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A white and black SpaceX rocket with the Axiom logo and an American flag is shown on the launchpad near a body of water. The timer counts down from T- 23:58.The view switches to inside the rocket, showing two astronauts strapped into their seats. The view switches back to the rocket on the launchpad, now against a grey sky. Flames and smoke erupt from the base of the rocket as it begins to ascend.The perspective shifts to a wider view, showing the launchpad and surrounding landscape as the rocket gains altitude. The rocket continues to climb higher, the flame from its engines becoming smaller as it rises into the grey sky.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many astronauts are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four astronauts.",
+ "B. One astronaut.",
+ "C. Two astronauts.",
+ "D. Three astronauts."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of events occurring in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A brief view of the astronaut inside the Dragon spacecraft is shown, the rocket conducts a countdown at the launch pad, and the rocket engines start at full power and ignite, launching into the air.",
+ "B. The rocket engines start at full power and ignite, launching into the air, a brief view of the astronaut inside the Dragon spacecraft is shown, and the rocket conducts a countdown at the launch pad.",
+ "C. The rocket conducts a countdown at the launch pad, a brief view of the astronaut inside the Dragon spacecraft is shown, and the rocket engines start at full power and ignite, launching into the air.",
+ "D. The rocket conducts a countdown at the launch pad, the rocket engines start at full power and ignite, launching into the air, and a brief view of the astronaut inside the Dragon spacecraft is shown."
+ ]
+ }
+ },
+ "XPDGJZTA": {
+ "video_id": "XPDGJZTA",
+ "video_duration": "87s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video displays a screen from the mobile game Clash of Clans. Initially, a countdown timer reads 18 seconds. Below the timer, text indicates \"Tap or press and hold to deploy troops.\" At the bottom of the screen is a horizontal bar of available units, each with a quantity label, and icons representing a variety of animated humanoids, creatures, and spells. The top of the screen shows values for \"Max\", \"Available loot\", \"Defeat\", and other resources and quantities. The game shows an aerial view of a base consisting of various buildings, fences, and other structures. After the timer reaches 0, several animated characters appear and begin to move towards the base. Some of these characters are green-skinned creatures that audibly vocalize and others are larger humanoids. The animated characters move towards the base and begin attacking the buildings and other structures, causing them to be destroyed or damaged. A percentage titled \"Overall Damage\" is shown on the right of the screen, which increases as the animated characters attack. The video concludes with the timer at 1M 40S, showing an \"Overall Damage\" percentage of 79%, and many of the structures having been damaged or destroyed.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order in which the goblin army captures the base?",
+ "answer": "C",
+ "candidates": [
+ "A. Goblins destroy the purple spherical building, steal all the money from the bank, breach the city wall, and are discovered and subjected to concentrated attacks.",
+ "B. Goblins are discovered and subjected to concentrated attacks, breach the city wall, destroy the purple spherical building, and steal all the money from the bank.",
+ "C. Goblins breach the city wall, destroy the purple spherical building, are discovered and subjected to concentrated attacks, and steal all the money from the bank.",
+ "D. Goblins breach the city wall, steal all the money from the bank, are discovered and subjected to concentrated attacks, and destroy the purple spherical building."
+ ]
+ }
+ },
+ "dVImZVpX": {
+ "video_id": "dVImZVpX",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man plays a trumpet in a gazebo-like structure with red pillars. He wears glasses and a colorful, patterned short-sleeved shirt. Several people stand and sit around him, some watching and listening, others seemingly passing by. Behind the gazebo, a wooden bridge and stairs lead up a hill, and several people walk up and down them.\u00a0 Trees and foliage surround the area. A man in a striped shirt stands next to a keyboard and other equipment, and he later picks up a saxophone and begins playing it along with the trumpet player. More people walk by on the stairs and bridge.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many different types of musical instruments can be heard in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Two types.",
+ "B. Four types.",
+ "C. Three types.",
+ "D. Five types."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "From which position does the sound of the tambourine come?",
+ "answer": "D",
+ "candidates": [
+ "A. From the front.",
+ "B. From the middle.",
+ "C. From the back.",
+ "D. From the right."
+ ]
+ }
+ },
+ "qTEUxUYL": {
+ "video_id": "qTEUxUYL",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A group of eight people are seated on chairs on a stage, playing cellos. They are arranged in two offset rows of four. Behind them are various percussion instruments, including timpani, snare drum, bass drum, and a xylophone. There are speakers on stands at the front left and right corners of the stage, and two microphones on stands near the center. The backdrop of the stage is a light tan, vertically-ribbed wall, curved at the edges. The stage floor is a light wood. The cellists are all wearing black skirts and white tops. They move their bows across the cellos in unison and individually, and occasionally move their heads and bodies slightly with the music. The camera angle is from the audience, slightly elevated and to the left of center. The video continues with the cellists playing. The overall lighting is dim, but the stage is lit enough to see the performers and instruments clearly.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the audio does the tapping sound occur?",
+ "answer": "B",
+ "candidates": [
+ "A. At the end of the audio.",
+ "B. In the second half of the audio.",
+ "C. In the first half of the audio.",
+ "D. At the beginning of the audio."
+ ]
+ }
+ },
+ "RhqiEayf": {
+ "video_id": "RhqiEayf",
+ "video_duration": "49s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A character with a light grey and dark grey headdress and orange skin pulls out a blue lightsaber. The character appears to fall from the ceiling and then a group of characters wearing white uniforms move forward. Two characters are standing near smoke and a flying ship travels over a gray landscape. Three characters walk in a field with tall structures. Rocks and dust fall through a shaft of light. Two characters in dark uniforms raise their hands upwards in a night sky with a glowing pyramid. Red lines and circles appear above the pyramid. A close up of a character with white markings and blue eyes is shown. Another character with a green face is shown and one with white markings and dark grey and black headress is shown. Then a green-skinned person and another person of color is shown in a cockpit. Then, a person with black hair and a orange and light brown top appears. Two figures wearing white uniforms appear in a dark space and another dark space reveals figures in white uniforms and a figure in a dark cloak. A character with white markings on their face is shown with glowing eyes, then two red lightsabers meet. A character in a dark cloak stands in a space with fire. A person with black hair and a orange and light brown top appears again, followed by a character with a light grey and dark grey headdress and orange skin, who is now holding a white lightsaber. Two red lightsabers are crossed. Then, a character with horns and red skin appears, holding a white lightsaber. Three figures engage in combat with lightsabers of various colors. A character with white markings on their face and a light grey and dark grey headdress appears and a character with horns and red skin is seen swinging a red lightsaber.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "What turning point occurred in the latter part of the battle shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A bearded man brought reinforcements.",
+ "B. Ahsoka brought reinforcements.",
+ "C. A blue-haired, blue-eyed man brought reinforcements.",
+ "D. A large number of reinforcements appeared out of nowhere."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What could be the relationship between Asoka and the person with red skin?",
+ "answer": "B",
+ "candidates": [
+ "A. Mother-son relationship.",
+ "B. Mother-daughter relationship.",
+ "C. Father-son relationship.",
+ "D. Father-daughter relationship."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "In the latter part of the video, how many people are holding blue lightsabers during the battle?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Two.",
+ "D. One."
+ ]
+ }
+ },
+ "HKRXuapf": {
+ "video_id": "HKRXuapf",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman with short dark hair, wearing a floral dress, plays a banjo ukulele. She strums and fingerpicks the instrument. The headstock of her ukulele is decorated with an orange and white crocheted pineapple and small fabric flower. A microphone on a stand is positioned in front of the instrument. Behind her, various ukuleles hang on a wall display.Later, a man with short dark hair wearing a teal and black short-sleeved shirt joins her. He plays a wooden ukulele with a leather strap. He stands to the woman's right, also facing forward, slightly angled towards her. They continue playing their respective instruments. The wall behind them features assorted ukuleles, flyers, and photographs.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Where does the sound of the ukulele in the video come from?",
+ "answer": "C",
+ "candidates": [
+ "A. From the front.",
+ "B. From the middle.",
+ "C. From the right.",
+ "D. From the back."
+ ]
+ }
+ },
+ "mmrwEcNl": {
+ "video_id": "mmrwEcNl",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video starts with an image of a Roblox character with orange hair and a light blue shirt with a pizza slice on it, and brown pants. There are two gray boxes to the left of it, one that says \"Customize\", and another one that says \"Shop\", and a speech bubble at the bottom that reads, \"I can't believe I got robux\". A new bubble then appears, reading \"Let's see how much I got\", before the video transitions to a screen that says \"Balance: 80 Robux\" and then \"Buy Robux\". Another speech bubble appears, saying \"Omg I got 80!!\" The screen then shifts back to the character. Another speech bubble reads, \"Now time to go buy stuff\". A menu appears on the left of the screen with a lot of hair options. A box appears at the bottom asking \"Hmm which hair should I get?\". One of the hair options is selected and then deselected before the screen shifts again to show another menu with hair options, and this time, one is selected. Then it shows a box that reads, \"Would you like to buy Blonde Flowy Pigtails?\" and the box is clicked. The screen then shifts to show face options, and one that reads, \"freckles\" is selected. A box appears that reads, \"Would you like to buy Freckles from Roblox?\" and the box is clicked, before showing options for clothing, and the box at the bottom reads, \"Now for the clothing\". A box labeled \"Cute Denim Shorts Jeans\" is clicked, and then the video shows \"Buy Item Would you like to buy Cute Denim Shorts Jeans Aesthetic Sporty Preppys from Pink Potato Clothing?\" and the box is clicked. Then the screen shows \"Classic shirts\" and a box that says, \"cottage core fairy vintage\" is clicked, with the accompanying pop up. The video then shows the avatar in the middle of a blank white screen with the words, \"Time to put them on! and a pink heart.\". 
The screen shifts to show menus of options, clicking on the blonde flowy pigtails hair and the freckles face, and the white crop top with light blue shorts, before shifting back to the avatar in the middle of the white screen, with the words, \"Omg I love my new avatar\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order of purchasing virtual items in the store as shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Virtual hairstyle, virtual expression, shoes, pants.",
+ "B. Virtual expression, shoes, pants, virtual hairstyle.",
+ "C. Virtual hairstyle, shoes, virtual expression, pants.",
+ "D. Shoes, pants, virtual hairstyle, virtual expression."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What change has occurred in the account balance?",
+ "answer": "A",
+ "candidates": [
+ "A. The balance decreased from an original balance of 80 to 10.",
+ "B. The balance decreased from an original balance of 90 to 0.",
+ "C. The balance decreased from an original balance of 100 to 10.",
+ "D. The balance decreased from an original balance of 80 to 0."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times was \"Buy Item\" used for purchases in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. One.",
+ "D. Four."
+ ]
+ }
+ },
+ "unweleor": {
+ "video_id": "unweleor",
+ "video_duration": "62s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with grey hair, glasses, and a beard, wearing a dark gray shirt and a tan apron is shown speaking to the camera. Behind them is a workbench with various tools. At one point, colorful shapes appear and the words \"HACK #5\" overlay the video. The person then holds up a rectangular wooden board and touches it. Text appears saying, \"NO RULER?\" and then \"NO PROBLEM!\". The person then points to the left, and then raises one finger up. The person is shown again speaking to the camera, and then the video cuts to a close up of the wooden board on the workbench. A hand can be seen drilling a hole into the board. Then a small block of wood, an eraser, and three pencils are laid onto the board. A black and yellow tape measure is put on the board, and then the tape measure is pulled out and placed against the small block of wood. A pencil is used to mark the block, and then several holes are drilled into it. The wooden block with the holes is then placed over the side of a larger piece of wood and the pencils are put through the holes.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video show any tools used to measure the precise center point of the wooden board's sides?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. I am not sure.",
+ "C. No."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order of tools used in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Pencil, measuring tape, drill press.",
+ "B. Scrap wood, measuring tape, pencil, drill press.",
+ "C. Measuring tape, pencil, drill press.",
+ "D. Scrap wood, pencil, measuring tape, drill press."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many pencils did the man in the video use to find the exact center point of the side of the wooden board?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. One.",
+ "D. Four."
+ ]
+ }
+ },
+ "qsUjvoIq": {
+ "video_id": "qsUjvoIq",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man with glasses and a vertically striped, light-colored shirt plays the violin while seated next to a shiny black piano. Another man, wearing a dark shirt, sits to his right and plays a black accordion labeled \"ALLODI.\" Both men face slightly towards each other. Two microphones on stands are positioned between them, seemingly to capture the sound of their instruments. Another microphone is set up in front of a sound mixing board further to the right. Maroon curtains hang in the background. The men play their instruments throughout the duration of the video. The man playing the accordion occasionally shifts his gaze between his instrument and the other musician. At one point, the camera zooms in slightly on the violinist.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Which instrument plays simultaneously with the accordion?",
+ "answer": "D",
+ "candidates": [
+ "A. A double bass.",
+ "B. A piano.",
+ "C. A cello.",
+ "D. A violin."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the emotional tone conveyed by the music in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Cheerful.",
+ "B. Serious.",
+ "C. Angry.",
+ "D. Sad."
+ ]
+ }
+ },
+ "oQVaBsPw": {
+ "video_id": "oQVaBsPw",
+ "video_duration": "120s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman wearing a wide-brimmed hat, a corset, stockings, garter belt, pearl necklace and gloves stands in a doctor\u2019s office. A man in a doctor\u2019s coat stands by a metal cart with medical supplies on it. A nurse wearing a blue uniform and white apron stands by the door. The woman walks over to the examination table and sits down. She places her peach-colored dress and handbag on the examination table beside her. The doctor and nurse stand near the cart, watching her. The woman gets off the table and stands near it. The doctor walks over to the metal cart with supplies, picks up a towel, and throws it on top of the cart. The woman sits back down on the table. The doctor turns around. The woman picks up her dress. The doctor helps her put it on. She then puts on her pearl necklace. He makes a gesture as if to zip something up. She pulls her gloves on and adjusts her hat. She walks over to the doctor and kisses his hand. He smiles at her. The nurse picks up a file and shows it to the doctor. They both leave the room. The woman picks up her handbag and exits the room.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "Does the video show a doctor performing a physical examination on a woman wearing a hat?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No.",
+ "C. Yes.",
+ "D. None of the above."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "What happens in the second half of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The doctor expels the woman in a hat from the clinic.",
+ "B. The woman in a hat reveals her identity.",
+ "C. The doctor examines the woman in a hat.",
+ "D. The doctor verbally abuses the woman in a hat."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What changes occurred in the doctor's attitude towards the woman?",
+ "answer": "A",
+ "candidates": [
+ "A. From anger to helplessness.",
+ "B. From disdain to flattery.",
+ "C. From dislike to contempt.",
+ "D. From resignation to flattery."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the role of the woman wearing a large white hat in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A doctor.",
+ "B. A pharmacist.",
+ "C. Hospital Director.",
+ "D. A nurse."
+ ]
+ }
+ },
+ "lFzfysgv": {
+ "video_id": "lFzfysgv",
+ "video_duration": "80s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man kneels on the ground with his head down, amidst a crowd of people who are standing, some with their arms raised. A man with a microphone sings in front of a large white building with columns. Another man plays a red keyboard. The crowd has their arms raised, singing along. An aerial view shows a large crowd gathered on a grassy area, facing a stage. A man with sunglasses and a baseball cap raises one hand, holding a water bottle in the other. The crowd claps and sings along. A woman kneels with arms outstretched. The crowd sways and sings, their arms raised. A man in a white shirt raises his hand and sings into a microphone on a stage. A man throws a microphone stand. The crowd pumps their fists in the air, cheering. A man with his arms outstretched sings in front of the white building. The text \"#LETUSWORSHIP\" appears over the crowd. A large building is shown with an American flag on top. The crowd cheers and raises their hands. A young woman with green hair has her eyes closed and palms together. A Black woman consoles a man with a backpack. A group of people put their hands on the shoulders of a young person sitting on the ground. More crowd shots with raised arms.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where was this faith movement held?",
+ "answer": "B",
+ "candidates": [
+ "A. The World Trade Center Twin Towers in New York City.",
+ "B. The White House.",
+ "C. The Pentagon.",
+ "D. Times Square in New York City."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why do movements centered on religious themes occur?",
+ "answer": "A",
+ "candidates": [
+ "A. Excessive government intervention.",
+ "B. Protests against the government.",
+ "C. Movements organized by specific individuals.",
+ "D. Spontaneous organizational efforts."
+ ]
+ }
+ },
+ "iXSgRNiF": {
+ "video_id": "iXSgRNiF",
+ "video_duration": "100s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person with dark hair, wearing a white and black plaid shirt with yellow and red detailing on the chest, is facing the camera. The background is a light gray wall. There is a small picture-in-picture window in the top left corner displaying a slightly different view of the same person. To the left, a dark red object is visible on a shelf. The person moves her hands and head as she speaks and occasionally brings her hands together. A blue banner appears at one point with the white text \"Respecting self and others\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the main emotion of the woman in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Anger.",
+ "B. Anxiety.",
+ "C. Excitement.",
+ "D. Pride."
+ ]
+ }
+ },
+ "xpALqvCU": {
+ "video_id": "xpALqvCU",
+ "video_duration": "83s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A woman with blonde hair, wearing a purple shirt and a necklace, sits on a light-colored sofa and speaks to the camera. She gestures with her hands while text appears on the screen alongside her. Some of the words shown are \"acid reflux\", \"gas\", \"painful bloating\", \"constipation\", \"hemorrhoids\", \"GI doctor\", \"dermatologists\", \"naturopaths\", \"specialists\", \"carb cycling\", \"paleo\", \"whole 30\", \"plant based\", \"no dairy\", \"no gluten\", \"fast metabolism diet\", and \"Atkins\". There are some images of the woman appearing throughout, some have red circles with arrows pointing at different spots on her body. The woman in these images is wearing a blue bathing suit top, and later a leopard print swimsuit. There is a waterfall in the background of one image, and a wall and pebbles in the background of another. She is demonstrating a headstand in one of the images. She continues to gesture while speaking to the camera.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "How did she feel after visiting the dermatologist?",
+ "answer": "B",
+ "candidates": [
+ "A. Excited.",
+ "B. Disappointed.",
+ "C. Calm.",
+ "D. Agitated."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, in what order does the woman list the diet plans?",
+ "answer": "B",
+ "candidates": [
+ "A. (a) (b) (c) (d) (e) (f) (g) (h) (k).",
+ "B. (e) (a) (k) (d) (f) (b) (c) (h) (g).",
+ "C. (c) (a) (k) (f) (b) (d) (e) (h) (g).",
+ "D. (e) (k) (f) (d) (c) (b) (a) (h) (g)."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How many diet plans does the woman in the video try?",
+ "answer": "B",
+ "candidates": [
+ "A. Four kinds.",
+ "B. Nine kinds.",
+ "C. Five kinds.",
+ "D. Seven kinds."
+ ]
+ }
+ },
+ "plOGuRmc": {
+ "video_id": "plOGuRmc",
+ "video_duration": "113s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Book Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a green hat and a red, white, and blue plaid shirt is seated in front of a red wall. A framed piece of art is visible in the upper left of the frame. The person holds a blue book with an image of a cloudy scene on the cover. The person turns the book around and shows the side, the other side, and the back of the book. Then, the person holds a brown book and the blue book and shows both covers at the same time, then side-by-side. The person shows the blue book again, moving it to show all sides of it. Then, they hold up a blue book with an image of a silhouette of a person on a horse in a small rectangle in the middle of the cover, showing the side, the other side, and the front of the book, and then the back of the book. Then, they again show the blue book, moving it around. They turn their head away and then back toward the camera. The person holds up a single white page, then a two-page spread showing a storm at sea. They show the two-page spread a second time, then the person turns their head away and then back toward the camera. The person shows the blue book again, and then turns their head away and then back toward the camera. They show a blue book with the cover showing a cloudy image a final time, and then turn away and back to face forward a final time.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did the video include content from the deluxe edition?",
+ "answer": "C",
+ "candidates": [
+ "A. None.",
+ "B. Yes.",
+ "C. No.",
+ "D. I don't know."
+ ]
+ }
+ },
+ "TQKHoMFL": {
+ "video_id": "TQKHoMFL",
+ "video_duration": "63s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with short brown hair and a gray t-shirt is in the foreground, in front of two all-terrain vehicles. The ground is muddy. The all-terrain vehicles have a gray, black, and white camouflage pattern. Some of the plastic body parts of the all-terrain vehicles are missing or broken. The person is facing the camera and gestures with his hands while talking.Later, the video shows the same person mounting one of the all-terrain vehicles in an open, grassy area with trees in the background. The person shifts position on top of the vehicle and then dismounts.The video then returns to the initial view with the two damaged all-terrain vehicles and the same person in the foreground talking. He gestures with his hands, pointing and holding up different numbers of fingers while talking.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Did the man in the video ultimately manage to lift the wheels?",
+ "answer": "C",
+ "candidates": [
+ "A. He lifted the front wheels.",
+ "B. He lifted the rear wheels.",
+ "C. It was not accomplished.",
+ "D. I am not sure."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did the man in the video fall off the car?",
+ "answer": "A",
+ "candidates": [
+ "A. Did not fall.",
+ "B. Fell while attempting to lift the rear wheels.",
+ "C. I am not sure.",
+ "D. Fell while attempting to lift the front wheels."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "How does the man in the video feel throughout the process?",
+ "answer": "A",
+ "candidates": [
+ "A. He feels mostly frustrated but humorously explains his inability to complete the specified action.",
+ "B. He feels mostly afraid and avoids riding the four-wheeler.",
+ "C. He feels generally happy and enthusiastically introduces his beloved four-wheeler.",
+ "D. He feels often irritated and expresses dissatisfaction with the four-wheeled vehicle."
+ ]
+ }
+ },
+ "mavnPLnL": {
+ "video_id": "mavnPLnL",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A young man with short dark hair, wearing a dark long-sleeved shirt, plays the violin in a wooded area. He is standing on a grassy area. The video cuts to a close-up of hands playing the keys of an electronic keyboard. The keyboard is on a stand. The video cuts back to the man playing the violin, then to the keyboard player\u2019s hands. The keyboard player uses his left hand to adjust a dial or slider on the left side of the keyboard. He then begins playing a small, white, rectangular electronic instrument with several rows of touch-sensitive keys. The camera focuses on the violin player for an extended period as he plays, then switches to the keyboard player. The video alternates between the violin player and the keyboard player, sometimes showing close-ups of their hands. At times, the images of the two musicians are superimposed. Near the end, the image of the keyboard player becomes blurry and out of focus. The video fades to black.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What feeling does this song convey?",
+ "answer": "D",
+ "candidates": [
+ "A. It conveys a sense of melancholy and pain.",
+ "B. It conveys a sense of melancholy and sadness.",
+ "C. It conveys a sense of joy and exhilaration.",
+ "D. It conveys a sense of happiness and relaxation."
+ ]
+ }
+ },
+ "IfaeYtPq": {
+ "video_id": "IfaeYtPq",
+ "video_duration": "52s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man with brown, wavy hair and a beard is sitting in front of a microphone. He is wearing a dark colored t-shirt. In the background to his right is a bookshelf. To the left is a plant. In a small rectangle in the bottom left corner of the screen there are video clips of a stage, with a man in a white top and dark pants. In the first clip, there is another person to the left of the man in white, who is wearing dark pants and a dark top. In later clips, there is also a man with a guitar visible, and people in the background who may be dancers or singers. There are also overhead shots of a larger crowd and stage setting, with the word \"MULTIPLY\" shown on a screen.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes does the man in the black clothes exhibit in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. First calm, then disdainful, finally puzzled.",
+ "B. First disdainful, then puzzled, finally calm.",
+ "C. First puzzled, then calm, finally disdainful.",
+ "D. First calm, then puzzled, finally disdainful."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the guitar being played continuously throughout the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. I don\u2019t know.",
+ "C. Yes."
+ ]
+ }
+ },
+ "SVPuLeYI": {
+ "video_id": "SVPuLeYI",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Four musicians sit on black stools arranged in a loose square, playing clarinets in an outdoor setting. The ground is reddish-brown and cracked, and the walls surrounding them are covered in graffiti. The musicians are wearing black suits and are focused on their instruments. Three of the musicians use typical Bb clarinets, while one in the center plays a larger, lower-pitched bass clarinet. They play in unison and harmony. The camera angle shifts, focusing closely on individual musicians, then broadening to encompass the whole group. At times, the camera zooms out to reveal more of the surrounding walls and a discarded tire next to some weeds. There are faded markings on the ground that appear to read \"ZILLER\". The lighting suggests late afternoon or early evening. The video shows a series of these shots, showcasing both the individual musicians and the quartet as a whole as they play their music.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Which instrument is louder in the video: the clarinet or the bass clarinet?",
+ "answer": "A",
+ "candidates": [
+ "A. Clarinet.",
+ "B. Bass clarinet.",
+ "C. They sound the same.",
+ "D. I am not sure."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who produces the bass sound in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A male with yellow hair.",
+ "B. A female with long hair.",
+ "C. A male with long hair.",
+ "D. A male with black hair."
+ ]
+ }
+ },
+ "iwgtgHVs": {
+ "video_id": "iwgtgHVs",
+ "video_duration": "131s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a large indoor area with a crowd of people. The majority of people are standing or seated along the edges of the room, while others are standing in the center. Some of the people are holding cameras. There are some signs on the walls. A man on a skateboard jumps over an orange ramp. There are close-up shots of a few people in the crowd. A man in a red shirt is shown sitting in a car, wearing ear buds. There is a short animated segment that consists of an 8-bit figure running. Then a logo appears on screen: \"THE BERRICS\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times was the man wearing a red hat pointed at with a laser pointer?",
+ "answer": "D",
+ "candidates": [
+ "A. Three times.",
+ "B. Four times.",
+ "C. Once.",
+ "D. Twice."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "How does the man wearing a white shirt and floral headscarf in the video feel throughout the process?",
+ "answer": "B",
+ "candidates": [
+ "A. Happy.",
+ "B. Confused.",
+ "C. Moved.",
+ "D. Sad."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the blonde man in green clothes located relative to the man wearing a hat with stars in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. In front.",
+ "B. Behind.",
+ "C. On the right.",
+ "D. On the left."
+ ]
+ }
+ },
+ "uEkEvyJM": {
+ "video_id": "uEkEvyJM",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Two individuals are present, each seated on a white stool. The person on the left is playing a brown and white ukulele. The person on the right is playing a black and silver clarinet. The person on the left is wearing a black long-sleeve shirt and black pants, while the person on the right is wearing a black long-sleeve shirt, black suspenders, a blue bow tie, blue pants, and a black hat. Behind them is a light-colored piano. On the right, a picture frame is leaning on a wooden crate. They continue to play their instruments throughout the duration of the video. The camera zooms in on the individual playing the ukulele as well as the clarinet at different points throughout the video. The shot also returns to show both individuals during parts of the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What materials might the strings of the ukulele in the video be made from?",
+ "answer": "B",
+ "candidates": [
+ "A. Nylon.",
+ "B. Metal.",
+ "C. Sheep gut.",
+ "D. Carbon."
+ ]
+ }
+ },
+ "CCxLmdIN": {
+ "video_id": "CCxLmdIN",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with long blonde hair, wearing a black turtleneck is speaking. There are several colored rectangles in front of her face. She has gold hoop earrings. Her name, \"Len\" is written to the left side of the screen and the text \"Being connected means accessing\" is written at the bottom of the screen. The name \"Lena Ostlund\" and \"France/Sweden\" is visible in a blue rectangle and the text \"Being connected means accessing\" continues at the bottom. Then the text changes to \"and sharing information any time\". And then \"and being able to connect with people\". And then \"from all around the world\". Then there's a new woman with dark braided hair, wearing a white top, speaking. \"Dunola Oladapo\" and \"Nigeria/UK\" is visible in a blue rectangle and the text \"Being able to keep up and catch up\" is visible at the bottom. Then \"with my friends and family\". A new woman with light brown, shoulder-length hair, wearing a grey shirt, is speaking. \"Birke Laubinger\" and \"Germany\" is visible in a blue rectangle and the text \"Feeling part of the global community\" is visible at the bottom of the screen. Then \"and knowing about my responsibilities within\". Then a man with short curly black hair and a beard, wearing a checked shirt is speaking. \"Geremachew Wagaye\" and \"Ethiopia\" is visible in a blue rectangle and the text \"It's a way of networking\" is visible at the bottom. Then \"to share opportunities and information globally\". Then a new woman with dark hair, wearing glasses and headphones is speaking. \"Rury Dempsey\" and \"Indonesia\" is visible in a blue rectangle and the text \"It's even more important than ever in this situation\" is at the bottom. Then \"when everyone in this world connected using technology\". Then a new woman with curly light brown hair, wearing glasses and a striped shirt, is speaking. \"Blenda Da Silva\" and \"Brazil\" is visible in a blue rectangle and the text \"It means leveling the playing field\" is at the bottom. Then \"so that all have the possibility and equal opportunity\". Then \"to take advantage of the benefits of digital technologies\". Then a new woman with dark hair, wearing a black shirt, is speaking. \"Xingyi Chen\" and \"China\" is visible in a blue rectangle and the text \"Technology also offers us a young generation\" is visible at the bottom. Then \"the opportunities to learn, connect, and communicate\". Then \"with the outside world\". Then a new man with dark hair and a mustache, wearing white headphones and a black shirt is speaking. \"Ali Altoblani\" and \"Bahrain\" is visible in a blue rectangle and the text \"It means to be able to be a part of\" is at the bottom. Then \"and contribute to something much greater\". Then \"than just my day to day life experience\". A logo with the words \"ITU\" and \"Generation Connect\" is visible in the top right of all the shots.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many headphones is the man with short black hair and dressed in black wearing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. One.",
+ "C. Zero.",
+ "D. Three."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "In which position does the black woman with braided hair appear?",
+ "answer": "B",
+ "candidates": [
+ "A. Third.",
+ "B. Second.",
+ "C. First.",
+ "D. Eighth."
+ ]
+ }
+ },
+ "eYlMVgMb": {
+ "video_id": "eYlMVgMb",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Three people are visible in the video. They are playing musical instruments, one a woodwind instrument, one a violin, and one a flute. All three people are standing and wearing dark clothing. There are three music stands with sheet music in front of them. There are other individuals visible in the bottom of the frame, they appear to be sitting. There is a lamp and a fireplace visible in the background. There are candles on the mantelpiece of the fireplace. There are curtains on the left and a mirror on the right.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What is the most rhythmic instrument in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Flute.",
+ "B. Violin.",
+ "C. Bassoon.",
+ "D. I am not sure."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the overall melody of the piece played in the video like?",
+ "answer": "D",
+ "candidates": [
+ "A. Gentle and relaxed.",
+ "B. Wild and dark.",
+ "C. Tense and thrilling.",
+ "D. Mysterious and perplexing."
+ ]
+ }
+ },
+ "vgJkumVZ": {
+ "video_id": "vgJkumVZ",
+ "video_duration": "68s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a bicycle race. Initially, five cyclists are visible on a multi-lane paved road. One cyclist is wearing pink, one is wearing yellow, one is wearing blue, one is wearing light blue, and one is wearing pink. The cyclist in pink is positioned behind the other four. As the video progresses, the cyclists are joined by several people standing along the side of the road. Some people are also seen walking on the road in front of the cyclists. The group of cyclists rides down the road, with the cyclist wearing pink typically riding behind the other cyclists. A short time later, the cyclists become spaced out and the pink cyclist is riding several feet behind a yellow cyclist. After a few more seconds, the pink and yellow cyclists are much further ahead of the group, before the video shows the pink cyclist riding solo for a bit. The pink cyclist is then followed by a group of cyclists, and finally, there are only two cyclists followed by a group of vehicles. The pink cyclist is seen riding up to an area with stopped cars and trucks and the yellow cyclist is behind them. Finally, the video ends with only one cyclist, wearing pink, riding the road.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many contestants are participating in the competition?",
+ "answer": "D",
+ "candidates": [
+ "A. Seven.",
+ "B. Four.",
+ "C. Six.",
+ "D. Five."
+ ]
+ }
+ },
+ "lWXbuymr": {
+ "video_id": "lWXbuymr",
+ "video_duration": "55s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person in dark pants and black shoes with buckles is walking on sand or dirt. There is a splatter of red on the ground to the person's left. A person in a black hat and suit is speaking. A person with long dark hair, a mustache, and a dark jacket is shown using a small dark object in their hand to groom their mustache, and speaking. Then the person with a dark suit and hat is shown speaking. Then the person with long hair and mustache is shown using a small dark object to groom their mustache, and speaking again. Then a person in a suit and black hat is shown speaking. Then the person with long hair and mustache is shown holding their hair and speaking. Then the person with a dark hat is shown smiling and speaking. Then the person with the long dark hair is shown running in a town. The person in the black hat is shown standing in a street where there are a number of people gathered on either side of the street. The long haired person walks in from the right. A close up shot of a gun firing is shown with smoke coming from the barrel.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What was the mood of the man wearing a hat in the video when he realized the long-haired man was a scammer?",
+ "answer": "A",
+ "candidates": [
+ "A. Happy.",
+ "B. Distressed.",
+ "C. Sad.",
+ "D. Serious."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "Did the long-haired man in the video do anything unusual when he mentioned he is skilled at shooting with both hands?",
+ "answer": "B",
+ "candidates": [
+ "A. Take off his clothes.",
+ "B. Comb his beard while looking in the mirror.",
+ "C. Draw a gun.",
+ "D. Touch his hair."
+ ]
+ }
+ },
+ "gvPWsabJ": {
+ "video_id": "gvPWsabJ",
+ "video_duration": "91s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A multi-story warehouse is shown at night. Inside, three men in suits are gathered around a table. A crashing sound is heard, and a man in a suit falls through the skylight, landing amongst broken glass on the floor. Two of the men draw pistols while another kneels by the body. A woman wearing a green mask and holding a submachine gun stands with one hand on her hip. A man kneels before her, begging for his life. Two men point guns at the woman, while a third pleads with her. The woman stands over the body of the man who fell through the skylight. A man looks down the barrel of the submachine gun. The masked woman is shot in the chest and falls backward. Blood splatters the wall behind her. She lies dead on the floor, mask askew. A large man stands with a shotgun, facing three other men.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times are gunshots heard in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. One.",
+ "B. Three.",
+ "C. Four.",
+ "D. Two."
+ ]
+ }
+ },
+ "xGrbnZgN": {
+ "video_id": "xGrbnZgN",
+ "video_duration": "116s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a close-up of a person with their hair in a partial bun, wearing a white shirt and a black undershirt. The scene then cuts to a room with a desk and a white chair. There's a blanket in a wicker basket, a plant and a lamp. Next, there are clips with a pink text against a background of green leaves. The text fades, replaced by a red button with the word \"subscribe\", which then changes to a grey button with the same text. Next, the first person reappears in a store, speaking to the camera. A title card is shown with the words \"REDECORATE OUR ROOM WITH ME\". The person continues to speak, and a red \"instagram\" logo with the handle \"@klailea\" appears. The camera then pans around the store, showing shelves with home decor. The first person is seen holding a woven basket with a white cushion inside, then touching it. The view changes to show the person accompanied by another person with dark hair, and a child. They walk through the store. The first person shows a pillow, then a woven basket, then a small ceramic piece. The video cuts to a scene where a hand touches a white rug, and cuts back to the store where the person with dark hair holds up a small blue object and is followed by a shopping cart. The video cuts back to the first person, showing more decor. The person opens a white box and then holds it up. They make an \"okay\" hand gesture. Next, they hold up a woven basket and then a leafy plant in a white pot. The video continues showing the first person walking through the store and holding various items. The video ends with the two other people standing in the store with a box.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the second item that the girl in the black coat touches in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A potted plant.",
+ "B. A pillow.",
+ "C. A blanket.",
+ "D. A basket."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the positional relationship between the white potted plant and the basket with pillows in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The potted plant is on the left side of the basket.",
+ "B. The potted plant is on the right side of the basket.",
+ "C. The potted plant is behind the basket.",
+ "D. The potted plant is in front of the basket."
+ ]
+ }
+ },
+ "dgrNkrce": {
+ "video_id": "dgrNkrce",
+ "video_duration": "120s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a dark, short-sleeved jumpsuit with a white patch on the right breast area is on the screen. They move their arms around, hold a chair, and lift their legs while holding the chair, all while looking directly into the camera. There are two chairs on either side of the screen, as well as a bookshelf in the left background. The background is a curtain. At one point, the person has both hands near their chest, and then touches their chest. They also touch their ear at another point.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is there clapping sound in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. I am not sure.",
+ "C. Yes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the man's emotion change in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From calm to disappointment.",
+ "B. From excitement to calm.",
+ "C. From calm to joy.",
+ "D. From calm to sadness."
+ ]
+ }
+ },
+ "itEqwxTw": {
+ "video_id": "itEqwxTw",
+ "video_duration": "234s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Soccer",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a wide shot of a group of eight men standing in a line on a grassy field, which is out of focus in the background. The men are individually outlined with colored lines and have a zooming effect applied to them. The scene transitions to show one of the men standing on the field holding a yellow oval-shaped ball. Two other men can be seen standing in the distance. This man throws the ball, and the video shows it moving through the air, before cutting to a different scene. The next scene has six men on the field, each holding a red, oval-shaped ball. Another man can be seen taking a photo of the group. This scene shifts to show another man on the field with a yellow ball, preparing to throw it towards two men who are standing further in the background. The video cuts to two men standing with their arms on their hips in a wide shot, in front of a line of white containers. The scene cuts back to show the man from the previous shot preparing to throw the yellow ball towards the two men in the background. The video zooms in on a dark oval-shaped object in the sky. The scene cuts to a view of the field along a fence line, with several white Nike logos visible on the side of a black fence. The word \"Eliminated\" is superimposed on the video, surrounded by two red hazard signs. The video shows 5 men standing on the field holding balls. The word \"Eliminated\" is superimposed on the video, surrounded by two red hazard signs. This scene is followed by another shot of the field, showing a yellow ball on the ground. The word \"Eliminated\" is superimposed on the video, surrounded by two red hazard signs. The man from a previous scene is preparing to throw a ball, with two men in the distance, with the word \"Eliminated\" superimposed on the video, surrounded by two red hazard signs. The video shows the man from a previous scene smiling and walking toward the camera. The video shows another man standing on the field, preparing to throw a red oval-shaped ball. Three men are seen in the distance, with the word \"Eliminated\" superimposed on the video, surrounded by two red hazard signs. The scene switches to another man, standing with the five other men, holding a red ball in his hands. The video shows him looking and gesturing toward the left of the screen. The scene shows another man throwing the ball towards the goals, with the word \"Eliminated\" superimposed on the video, surrounded by two red hazard signs. The next scene shows a red ball rolling on the ground and the man from a previous scene with a mustache walking on the field and the camera focusing on his face, with the word \"Eliminated\" superimposed on the video, surrounded by two red hazard signs. He looks down before the camera pans up, and shows the man in a white shirt being filmed on the ground, with the word \"Eliminated\" superimposed on the video, surrounded by two red hazard signs. The scene then shows another man on the field preparing to kick a red ball towards the goals. The next scene is a series of shots showing each of the remaining men throwing the ball towards the goals, with the camera often focusing on the ball in the sky, followed by shots showing their reactions on the field. The word \"Eliminated\" is superimposed on the video, surrounded by two red hazard signs, in several of the scenes, as the men are eliminated. The final shot shows the 2 remaining men and 6 other men running together and engaging in what appears to be a group hug. The video shows the 2 remaining men engaging in a celebration with the remaining men. Then, it ends by focusing on the man who is left, kicking the red ball towards the goals.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "How does the man in khaki shorts feel after winning?",
+ "answer": "D",
+ "candidates": [
+ "A. Disappointed and sad.",
+ "B. Inner peace.",
+ "C. Overjoyed.",
+ "D. Excited and thrilled."
+ ]
+ }
+ },
+ "kutBJtwn": {
+ "video_id": "kutBJtwn",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man and a woman are playing musical instruments on a cobblestone street in front of an ornate metal fence. The man, on the left, plays the violin, while the woman, on the right, plays the cello. Both have music stands in front of them. Behind the fence, two large statues of seated women are visible, along with a lion head relief. A building with many windows forms the backdrop. A partially visible white covering is behind the seated statues.\u00a0 A black cello case leans against the fence near the woman. An open, hard-shell instrument case lies on the cobblestones in front of the musicians. People stroll leisurely in the background beyond the fence. One person in the background sits on the edge of the fence. Other indistinct figures pass behind the fence and on the street to the right of the musicians. The musicians continue to play their instruments. A person with a pink and white checkered bag walks in front of the camera, partially obscuring the view of the musicians for a short time. The musicians continue to play.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the violin more rhythmic than the cello at the beginning of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. None of the above.",
+ "D. I am not sure."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does a person with a backpack walk past the camera?",
+ "answer": "A",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. At the beginning of the video.",
+ "C. In the middle of the video.",
+ "D. Throughout the entire video."
+ ]
+ }
+ },
+ "aVSXHEWC": {
+ "video_id": "aVSXHEWC",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "Four men are seated on stools, playing cellos on a stage with blue flooring and a backdrop of purple and blue vertical lights.\u00a0 They are all dressed in dark suits or jackets and white shirts, and they are all looking down at their instruments as they play. Two of the men are clean-shaven and have short dark hair, while the other two have longer, lighter brown hair and beards. One wears glasses. The camera angle switches between focusing on the whole quartet, focusing in on a pair of players, and close-ups of individual players' hands on their instruments and bows, and their faces as they play. The music is fast-paced and rhythmic.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the man on the far right in the video doing to the cello?",
+ "answer": "D",
+ "candidates": [
+ "A. Adjusts the tuning pegs of the cello.",
+ "B. Wipes the body of the cello.",
+ "C. Plucks the strings with his fingers.",
+ "D. Strikes the strings with a bow."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is the cello on the far left in the video more rhythmic than the cello on the far right?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. I am not sure.",
+ "C. No."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At which point in the video does the second man from the left begin playing the cello?",
+ "answer": "D",
+ "candidates": [
+ "A. Throughout the entire video.",
+ "B. In the middle of the video.",
+ "C. At the beginning of the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "gEAyCcwr": {
+ "video_id": "gEAyCcwr",
+ "video_duration": "75s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person is playing a video game. The game screen shows a pair of legs standing in a bathroom stall. After the level is completed, a screen appears that reads \"INFILTRATION SUCCESSFUL!\" along with some progress bars, and options such as \"BACK TO BASE\" and \"RETRY.\" The next screen shows the text \"LEVEL 4: THE MEDICAL.\" A man with blond hair and green eyes appears, along with text at the bottom of the screen that says, \"Gain system control of your eyes. Use them to look at things.\" The man's teeth fall out. The screen shows options for \"OPEN MOUTH\" and \"PULL OUT TEETH.\" A person in a chair appears next, in what looks like a waiting room. Then the level introduction screen appears again, followed by close-ups of the man's mouth. Some teeth are pulled out. An alert reads, \"TOO MUCH EYE CONTACT ALERT.\" More teeth are pulled. The man's eyes now move from side to side. Another screen shows a doctor talking to the player's character. The dialogue is presented at the bottom of the screen. While the conversation progresses, the player continues to manipulate the man's face, pulling his eyes and tongue out and back. The suspicion level rises, displayed by a bar at the top of the screen. The conversation with the doctor ends.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence in which the blond man's facial features malfunction in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Teeth, nose, eyes, ears.",
+ "B. Eyes, nose, ears, teeth.",
+ "C. Teeth, nose, ears, eyes.",
+ "D. Nose, eyes, ears, teeth."
+ ]
+ }
+ },
+ "GotjEJnD": {
+ "video_id": "GotjEJnD",
+ "video_duration": "157s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black t-shirt and a colorful cap walks through a doorway into a room with a white door. The room has gold curtains, a white chair with writing on it, and other decorations including a lit Christmas tree. The person in the cap and black t-shirt then walks towards the back of the room. Another person with short, light-colored hair is in the kitchen. The two people speak and then both retrieve items from a white grocery bag and from the refrigerator and pantry. The woman with the light hair places a jar of sauce and a container of eggs into the bag. The woman with the cap and black t-shirt takes a bottle of water and another food item and puts them in the bag as well. The two women exchange physical contact, and then the woman in the cap walks through the house and out of a door.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Did any eggs appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Yes, a box of eggs appeared.",
+ "B. Yes, several eggs appeared.",
+ "C. Nothing appeared.",
+ "D. No, only empty egg boxes appeared."
+ ]
+ }
+ },
+ "KQIbbWyN": {
+ "video_id": "KQIbbWyN",
+ "video_duration": "204s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a person with light brown hair and a plaid shirt speaking directly to the camera. They are standing in front of a white wall where a framed picture with a white design on a blue background hangs. To their left is an hourglass. The person\u2019s eyes open and close and their mouth moves as they speak. Sometimes, their gaze shifts slightly. There are also close-up images of a red substance under a microscope, and a diagram showing an electrochemical process with orange and black panels, circles, arrows, and lettering. There are diagrams of the structure of a battery, with various text labels. Later in the video, there are clips of gloved hands handling lab equipment, including pouring liquid from one container to another, and a pH testing apparatus, along with readings on a digital screen, and a chart tracking the rise in pH over time. Towards the end of the video, diagrams of two different battery designs appear, followed by a slide listing various statistics and cost figures. These scenes are interspersed with the person again speaking directly to the camera as before.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the function of large amounts of carbon black in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Increase the amount of trivalent iron ions.",
+ "B. Reduce the distance between the current collector and trivalent iron ions.",
+ "C. Reduce the distance between the current collector and divalent iron ions.",
+ "D. Reduce the distance between the current collector and electrons."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "Which substance in the mixture of electrolyte salts is not necessary?",
+ "answer": "B",
+ "candidates": [
+ "A. Sulfate.",
+ "B. Sodium ion.",
+ "C. Potassium ion.",
+ "D. Chloride ion."
+ ]
+ }
+ },
+ "AErOVvGi": {
+ "video_id": "AErOVvGi",
+ "video_duration": "189s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a competition involving multiple robots, some of which are blue and some red, with a time display in the upper right of the screen. The robots are on a grey mat, bordered by transparent plexiglass and a metal structure. There are many people in the background, and several referees. The score is displayed at the bottom, which updates throughout the video. Throughout the video, the robots move around the field and perform actions. There are several blue cylinders on the mat. At some points in the video, the camera shows a wide shot of the court, the score, and the people in the stands. Later, a transition shows a car, and finally displays the results of the match, showing that the blue team won.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color are the items that the blue team successfully placed in autonomous mode in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Red cubes and red cones.",
+ "B. Blue cubes and yellow cones.",
+ "C. Yellow cubes and blue cones.",
+ "D. Red cubes and yellow cones."
+ ]
+ }
+ },
+ "exhtqZyb": {
+ "video_id": "exhtqZyb",
+ "video_duration": "413s",
+ "duration": "6-8min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two boxers, Manny Pacquiao in blue and white trunks and Juan Manuel M\u00e1rquez in black and green trunks, circle each other in a boxing ring. Pacquiao throws a left jab. M\u00e1rquez jabs back. Pacquiao lands a combination of punches. M\u00e1rquez clinches. They separate. Pacquiao throws a left hook. M\u00e1rquez dodges. Pacquiao lands a right hand. M\u00e1rquez throws a combination. Pacquiao lands a left uppercut. M\u00e1rquez clinches again. They trade jabs. Pacquiao lands a left hand. M\u00e1rquez counters with a right. They clinch. Pacquiao lands a body shot. M\u00e1rquez throws a right. The bell rings, ending round one.Round two begins. They trade jabs. Pacquiao lands a left hand. M\u00e1rquez counters with a right. Pacquiao lands a combination. M\u00e1rquez clinches. They separate. Pacquiao throws a left hook. M\u00e1rquez ducks under it. Pacquiao lands a straight right. M\u00e1rquez throws a jab. Pacquiao lands a left uppercut. M\u00e1rquez clinches. They separate. The boxers continue to exchange punches. The bell rings.Round three begins. Pacquiao lands a left. Marquez throws a counter right. They trade jabs. Pacquiao lands a combination. Marquez clinches. Pacquiao lands a right hand. The round ends.Round four begins. Pacquiao and Marquez trade punches. Marquez lands a right hook that knocks Pacquiao down. Pacquiao gets up. They exchange punches again. The round ends.Round five begins. Pacquiao lands a right hand. They trade jabs. Pacquiao lands a combination. M\u00e1rquez throws a right hand. Pacquiao lands a left uppercut. M\u00e1rquez clinches. They separate. M\u00e1rquez lands a hard right that sends Pacquiao stumbling. Pacquiao lands a combination. Marquez clinches. They continue exchanging punches until the bell rings.Round six begins. Marquez lands a right hand to Pacquiao\u2019s head. Pacquiao lands a left to Marquez\u2019s body. They trade jabs. Marquez ducks under a punch. Pacquiao misses with a left. 
Marquez lands a combination. Pacquiao lands a right hand. The round continues with a flurry of punches from both fighters.Round seven. The boxers circle and jab. Pacquiao lands a right hand. Pacquiao lands a left uppercut. Marquez is bleeding from his nose. Pacquiao lands a combination. Marquez clinches. They separate. Marquez lands a left jab.Round eight. Pacquiao and Marquez trade punches.Round nine. They continue to trade punches.Round ten. They continue to exchange punches.Round eleven. They continue to exchange punches. Pacquiao lands a combination of punches and Marquez falls to his knees. Marquez gets back up. They continue to trade punches until the bell rings.Round twelve. They continue to trade punches. With ten seconds left, M\u00e1rquez lands a devastating overhand right that knocks Pacquiao face-first to the canvas. The referee waves off the fight. M\u00e1rquez is victorious by knockout.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "At the end of the first round, how many punches did the man in black boxing gloves throw?",
+ "answer": "D",
+ "candidates": [
+ "A. Thirty.",
+ "B. Twenty-seven.",
+ "C. Twenty-nine.",
+ "D. Twenty-eight."
+ ]
+ }
+ },
+ "YCsjSmIR": {
+ "video_id": "YCsjSmIR",
+ "video_duration": "121s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black jacket and gray baseball cap with reflective sunglasses is seen in the foreground of a wooded area. The video then switches to a man sitting in a boat on a lake. He is wearing a dark jacket and pants and is holding a red can. The video then shows the shoreline from the boat, then it returns to a man sitting in the boat, drinking from a red can. Next, the man who was initially seen in the wooded area is now in the boat, talking to the camera. The video then shows a wide shot of the boat from the back, where two people are fishing. One person is sitting, wearing a dark jacket and the other person is standing, wearing a black jacket, fishing. The man in the black jacket reels in a fish. The man picks up the fish and holds it up, then drops it to the floor of the boat. The video shows this man holding the fish again, and then the video transitions to a close-up of the fish. Then the man holds the fish out to show it to the camera again. The video then shows the first man again, fishing in the boat. He catches another fish, holds it up, and then puts it down on the floor of the boat. Next, it shows the man fishing, from the perspective of the man sitting in the boat, followed by a different angle of the two men fishing. A fish can be seen flopping around in the water behind the boat. The video shows both of the men moving about the boat. The man sitting in the boat picks up a fish and is seen examining it, then he places the fish with other fish on the floor of the boat. Then the man who was filming is talking to the camera again.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What are the two men in the video planning to do?",
+ "answer": "A",
+ "candidates": [
+ "A. Cook all three fish.",
+ "B. Release one fish.",
+ "C. Release all three fish.",
+ "D. Cook all four fish."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "How does the mood of the man wearing a hat change in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. First excited, then surprised, and finally happy.",
+ "B. First happy, then sad, and finally joyful.",
+ "C. First excited, then sad, and finally happy.",
+ "D. First excited, then happy, and finally sad."
+ ]
+ }
+ },
+ "oARvEbHX": {
+ "video_id": "oARvEbHX",
+ "video_duration": "149s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A ring girl holds up a sign with the number four on it. Two female boxers, one in white and gold, the other in green and gold, face each other in the boxing ring, with the referee between them. The boxers touch gloves and begin fighting. The boxer in white throws several punches, while the boxer in green circles her. The boxer in green throws a few jabs and then corners the boxer in white against the ropes, landing several punches to her head and body. The referee separates them. The boxers continue exchanging punches in the center of the ring. The boxer in white gets the boxer in green against the ropes and lands a combination of punches. The boxers exchange more punches. The boxer in white lands a hard right hand to the head of the boxer in green. The boxer in white continues to press the attack, landing more punches to the head and body of the boxer in green. The referee steps in and stops the fight. The boxer in white has her hand raised in victory. A man wipes her face with a wet cloth and squirts water into her mouth from a plastic bottle.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What type of competition is shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A boxing match.",
+ "B. A freestyle martial arts competition.",
+ "C. A wrestling match.",
+ "D. A sanda competition."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the man wearing a white long-sleeved shirt and the two women wearing shorts in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The coach and the athlete.",
+ "B. The coach and the referee.",
+ "C. The referee and the athlete.",
+ "D. The doctor and the athletes."
+ ]
+ }
+ },
+ "WvMbJyer": {
+ "video_id": "WvMbJyer",
+ "video_duration": "126s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Software",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a screen recording of a 3D modeling software. The first section depicts a tree-like figure with a brown body and green foliage, followed by the creation of a white cube. An array modifier is added, twisting the cubes into a diagonal line, then a cartoon-like insect is created from the cubes. The view then changes to a path of white objects following a curved line and the corresponding graph view below it. Next a similar curved line is created with a white pattern in the upper window and the corresponding graph view below it. Then there is a close up shot of swirls of orange, yellow, green, and blue paper-like strips. The view switches back to the software showing a yellow sphere and corresponding graph and data windows, which then changes shape into a crumpled mass and returns back to its original shape. The text \"GEOMETRY NODES\" is then displayed in green text against a dark gray background. A white heart-like shape is drawn over this text. A white cube is displayed and then a menu appears to select modifiers. Then an image of three boxes shows a geometry input with an arrow pointing to a wrench tool labeled Mod, and then another box with two wrench tools and then another box with three wrench tools. The video then switches back to the software and more modifiers are used to create circular array and transforms of the white cube. Finally different object types like meshes and curves are selected in the software.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many modifications did the speaker make to the tree shown at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. Two.",
+ "D. Five."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes were made to the cube first shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. It transformed into a monkey head model and was stretched and rotated.",
+ "B. It transformed into a flower model and was stretched and rotated.",
+ "C. It transformed into a vine model and was stretched and rotated.",
+ "D. It transformed into a cloud model and was stretched and rotated."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the purpose of geometric nodes in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Geometric nodes can be used to create modifiers.",
+ "B. Geometric nodes can be used to set modifier parameters.",
+ "C. Geometric nodes can be used to create geometry.",
+ "D. Geometric nodes can directly apply user-integrated modifiers to modify geometry."
+ ]
+ }
+ },
+ "qnwPYkyW": {
+ "video_id": "qnwPYkyW",
+ "video_duration": "69s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person with white hair and a white shirt is walking on a sidewalk. The person gestures with their hands. Cars are visible on the street next to the sidewalk. The video switches to another shot with the person standing in front of a sign for \"Ottawa Township High School\". The sign has \"GO PIRATES\" in red letters on a digital display.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the man in the video discussing?",
+ "answer": "D",
+ "candidates": [
+ "A. An introduction to large enterprises in his hometown.",
+ "B. An introduction to street traffic.",
+ "C. An introduction to a restaurant in his hometown.",
+ "D. An optimistic outlook for small businesses in his hometown."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "Which small businesses did the man in the video introduce in order? (a) small restaurant, (b) clothing store, (c) print shop, (d) store, (e) photography studio.",
+ "answer": "D",
+ "candidates": [
+ "A. (b)(a)(c)(d)(e).",
+ "B. (b)(a)(d)(c)(e).",
+ "C. (e)(a)(d)(c)(b).",
+ "D. (c)(e)(a)(d)(b)."
+ ]
+ }
+ },
+ "GJOqONSZ": {
+ "video_id": "GJOqONSZ",
+ "video_duration": "192s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video opens with the text \"MEGALIZED How the race was MEGALIZED Powered by MEGALIZED (Liege-Bastogne-Liege 2012)\" in white letters on a black background, below a red stylized \"M\" logo. The video transitions to a shot of five cyclists riding at night. The cyclists wear helmets and colored jerseys, including green, blue and white, and orange and black. The camera angle shifts, maintaining focus on the cyclists as they continue riding along the road.The camera view switches again, and the cyclists' jerseys now appear to be white, green, orange, and a blue-and-white.There are now six cyclists visible, including a new cyclist in a black jersey.The view then switches to a shot of a cyclist wearing a blue and white jersey, riding on a road during the day near trees and buildings. A second view of a similar road shows a cyclist in the same blue and white jersey. The view switches to another cyclist wearing a blue and white jersey followed by a cyclist in a blue, yellow, and red jersey.The view switches to a group of cyclists near some fences and crowds on a road, before the view is changed to a single cyclist in a white and green jersey sitting on the ground against a curb, with an eye injury. The video then pans out to show two people assisting the cyclist, as well as text on the screen reading \"Simon Geschke Team Argos Shimano\".The scene cuts to a cyclist in a white and green jersey riding again, a group of cyclists riding along a road. The view shifts to a man in an orange jersey walking through a crowd with a bicycle.The camera then shows a group of cyclists, including those wearing a green jersey, blue and white jersey, and white and green jersey. The view then shifts again and shows several cyclists riding along the road being filmed from a car following behind. The video cuts to another scene showing cyclists riding along a road, being filmed from behind and above the road.A group of cyclists are now riding along a road together. 
The view changes, showing the cyclists near some parked cars and people, with the camera position moving from being behind, to ahead of the cyclists.The video shifts to cyclists being viewed from above the road, with the camera focusing on a large group of cyclists moving along the road. The view then moves closer to a group of cyclists, and then to two fallen cyclists.The view changes to a cyclist in a green jersey riding along the road near orange barriers and several people.The view changes to show two cyclists on a road with several people watching, then another view of cyclists riding on a dirt road.The view then shifts to a group of cyclists riding on a paved road near a crowd, being filmed from behind them.The camera view switches to cyclists riding on a road near a crowd of onlookers; the camera then pans to a large group of cyclists and a group of motorbikes in the middle of the road.The camera position moves to show a group of cyclists from an overhead perspective near trees.The view then shifts to a group of cyclists riding along a dirt road.The camera view switches to show two cyclists riding on a road near a crowd; the camera then switches to a view of several cyclists riding together on a narrow road in the daytime.The view changes to show cyclists riding up a small hill as the camera follows closely behind them.The camera then transitions again, to show one cyclist alone, with two other cyclists behind him, moving along a paved road, followed by one cyclist moving alone.There is then a view of a group of cyclists being filmed from the side as they pass.The view switches to cyclists being filmed from behind as they are riding along the road.The view switches again, and shows a group of cyclists on a hill from above; then there is a view of a group of cyclists being filmed from a car following behind, the view shows a single cyclist leading the way.There is a view of a cyclist riding alone along the road with the camera filming from behind, and then to 
a view of the cyclists riding down a small hill.The view shifts to show two cyclists being filmed as they are riding down a road.The view shows a group of cyclists, with two of them leading the way, moving along the road, followed by the cyclists being filmed from behind moving down the road.The view changes to a shot of the finishing line with one cyclist raising their arms in victory, followed by another view of the same finish line with more cyclists approaching the end.The video ends with a view of a cyclist with another person on a scooter, followed by a final view of a group of cyclists crossing the finishing line.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "Where did the champion lay the foundation for their victory?",
+ "answer": "C",
+ "candidates": [
+ "A. Uphill.",
+ "B. At the finish line.",
+ "C. At the mountain peak.",
+ "D. Downhill."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the person who fell do to the other riders?",
+ "answer": "A",
+ "candidates": [
+ "A. Provide a boost.",
+ "B. Call for help.",
+ "C. Trip them.",
+ "D. Cheer loudly."
+ ]
+ }
+ },
+ "cEFXXYhy": {
+ "video_id": "cEFXXYhy",
+ "video_duration": "536s",
+ "duration": ">8min",
+ "domain": "Tech & Science",
+ "sub_category": "Biology",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A table displays the stages of seed germination: pregermination, germination, and post germination. Each stage lists corresponding events. Pregermination includes rehydration, RNA and protein synthesis stimulation, increased metabolism, hydrolysis of food reserves, changes in cell ultrastructure, and induction of cell division and growth. Germination involves the rupture of the seed coat and emergence of the seedling, usually the radicle first. Post germination includes controlled growth of root and shoot axis, controlled transport of materials from food stores to growing axis, and senescence of food storage tissues.The video then presents the phases of seed germination: imbibition phase, latent phase, and exponential growth phase. A diagram illustrates the imbibition phase in barley and lettuce seed germination. It shows water imbibition, starch hydrolysis into sugar by amylase, RNA transcription and translation from DNA, and GA3 growth. The latent phase is depicted with a graph showing fresh weight increase over imbibition time, divided into three phases. Phase I shows rapid water uptake, onset of mRNA synthesis and onset of starch degradation. Phase II shows reactivation of metabolism. Phase III shows radicle protrusion. The exponential growth phase is shown with diagrams of a seed underground. It shows water entering the seed, followed by GA, then \u03b1-amylase, then sugar, and finally the radicle emerging and growing roots.\u00a0Hypogeal germination is illustrated in a series of diagrams showing a seed underground with the epicotyl emerging, followed by the young shoot, and finally lateral roots developing.Requirements for germination are listed, beginning with light. The information details the role of light, photoblastic seeds (positive, negative, and non-photoblastic), and phytochrome, a plant pigment sensitive to red light. 
A light spectrum is shown, along with a diagram illustrating the functioning of the phytochrome system involving Pr (inactive form) and Pfr (active form), along with red light (RL) and far-red light (FRL). A table summarizes germination results based on exposure to red and far-red light.\u00a0The next requirement for germination is water, described as necessary for imbibition, which leads to swelling and breaking of the seed coat. Oxygen is also listed as a requirement for germination, needed for metabolic activity, breakdown of stored food, and radicle emergence. Finally, temperature is listed as a requirement, noting that many seeds require a cold period before germination. A thermometer is shown. Text on the screen explains that the temperature must be optimum and too-high temperatures speed up seed emergence and withering. The final screen shows information about Kharif and Rabi crops and their sowing seasons.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Do seeds break their seed coat during the imbibition phase?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes, they do.",
+ "B. I am not sure.",
+ "C. No, they do not."
+ ]
+ }
+ },
+ "btkelxqe": {
+ "video_id": "btkelxqe",
+ "video_duration": "96s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Art Exhibitions",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a wide shot of a large, wooden, stick-figure sculpture standing in a desert area, with tents and other structures visible in the background. The scene shifts to a large, horizontal mural panel in a wooden frame, surrounded by a group of people standing on a dusty surface. A man is seen holding a long wooden pole against the side of the mural, and then the camera shows a different angle of the scene, with a tall stick-figure sculpture visible in the background. The scene transitions to a vintage black-and-white drawing of a room filled with artifacts. Next, a man wearing a brown hat is shown speaking to the camera. The video then returns to showing the mural, showing someone adjusting it. Another shot of the mural is shown, with a person crouched nearby. The scene cuts to a stylized painting of a room, with people and animals. The next scene shows a different mural, this time with people in a painted scene that features a man running and large floating jellyfish. Next, the scene shows the mural, lit up at night, under a tent-like structure. Then, there\u2019s a wide shot of the stick-figure sculpture and mural together in a desert area, and the scene cuts to the man in the brown hat, speaking to the camera. The video then shows a large wooden structure being moved from a trailer by a number of people. The scene shifts to a red semi truck, with covered cargo. The scene then shows a large flat crate on the ground, with a truck and other objects in the background, in the same desert area as shown before. The video then shows a group of people standing in front of a similar framed structure on the desert at night. The scene then shifts back to the man in the brown hat, speaking to the camera again. The video shows people looking at the murals again at night. The video shifts to a man in a dark suit gesturing at the mural, with a group of people standing and sitting around him. 
The video returns to the man in a brown hat again before the video ends.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order are the events introduced in the video? (a) Describing the process of creating the artwork. (b) Introducing the inspiration and design of the artwork. (c) Discussing the positive reactions to the work.",
+ "answer": "A",
+ "candidates": [
+ "A. (b) (a) (c).",
+ "B. (a) (c) (b).",
+ "C. (b) (c) (a).",
+ "D. (a) (b) (c)."
+ ]
+ }
+ },
+ "PKvGUmnx": {
+ "video_id": "PKvGUmnx",
+ "video_duration": "89s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with an aerial view of a green landscape with a white text overlay saying \"Preserving Cambodia's environment and supporting rural communities\" and a logo in the top left corner. The scene then shifts to a shot of a red metal door within a wooden gate, and a sign that says \"Private Property No Entry.\" Next the scene quickly transitions to a website with photos of people and the text \"Our Story\". A woman is seen carrying a child in one photo, and other photos show people working outside. The video moves to a shot of a boy wearing a red shirt walking towards a small wooden structure and a river. A water basin is seen on a cart near a pond and waterfalls. Then a woman and a man are shown talking through a window with trees in the background, followed by a different scene of the same people talking in front of vegetation. The scene transitions to a table with several people and a woman walking up to it, in a lit room. A shot shows a woman walking out of an open doorway. An aerial view of a large building with a red roof is shown. Then a white building with a large lawn is shown. The video then cuts to an image of a wooden structure with a thatched roof. Next, a black and white image is seen of a woman on an airplane seat. The video continues with shots of a woman on a purple background. The video shows a home interior including dining rooms, and a library with a red couch. The woman in front of the purple background is shown again, and the video ends with a still image of the sky with colorful clouds.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Is there a perfume advertisement in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. None of the above.",
+ "B. No.",
+ "C. Yes.",
+ "D. I don't know."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the actress mentioned in the video expand her real estate holdings?",
+ "answer": "C",
+ "candidates": [
+ "A. To film a commercial.",
+ "B. To showcase her wealth.",
+ "C. To protect local communities and the environment.",
+ "D. Because property prices have decreased."
+ ]
+ }
+ },
+ "EgYEFOjD": {
+ "video_id": "EgYEFOjD",
+ "video_duration": "93s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a stage with two couches and an armchair, and a man with a microphone. A second man is standing to the left of the armchair, and another man is sitting on steps at the front of the stage.The man on the right walks around and uses his microphone, then the man sitting on the steps uses his microphone. At one point the man on the steps reaches out and touches the left arm of the man sitting to his left.At one point the camera angle changes to show the stage and an audience.The man sitting on the steps is speaking into a microphone and gesturing with his hands.The man with a microphone walks around the stage and leans down at one point. He uses a cane at one point.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who said, \"Couldn't you see Grandma dragging you around the whole block\" in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The black person wearing a red hat.",
+ "B. The white person themselves.",
+ "C. The grandfather of the white person.",
+ "D. The black person wearing a black hat."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why do people laugh out loud?",
+ "answer": "A",
+ "candidates": [
+ "A. Because of the voices of white men.",
+ "B. Because of the hairstyles of white men.",
+ "C. Because of the appearance of white men.",
+ "D. Because of the responses of white men."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Which variable in the video caused the performer's voice to change?",
+ "answer": "D",
+ "candidates": [
+ "A. The voices of the live audience drowned it out.",
+ "B. The distance between the microphone and the mouth.",
+ "C. The camera moved further away.",
+ "D. The climax of the story plot."
+ ]
+ }
+ },
+ "kCZqGlHr": {
+ "video_id": "kCZqGlHr",
+ "video_duration": "422s",
+ "duration": "6-8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Book Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with brown hair and glasses, wearing a red shirt and blue jeans, is sitting on the floor. He is holding and showing different books. There are several stacks of books next to him on the floor. He looks directly at the camera. He uses hand gestures while holding or gesturing towards the books. He puts one stack of books into a cardboard box. He makes several gestures to his head. He picks up a phone. He makes many gestures with his hands. He puts another stack of books down, puts his hand over the stack, moves his hands around, and ends the video sitting there.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man hit an object at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Four.",
+ "D. Three."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the red bottle positioned in relation to the man in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Behind him.",
+ "B. In front.",
+ "C. On the right.",
+ "D. On the left."
+ ]
+ }
+ },
+ "bmZIHDNM": {
+ "video_id": "bmZIHDNM",
+ "video_duration": "192s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with short brown hair is standing in front of a kitchen. They are wearing a black short-sleeved shirt with white accents. They speak and gesture with their hands. They pick up a piece of ginger and show it to the camera. They place the ginger piece into a blender. They use a cutting board to cut an apple, and place it in the blender. They add broccoli florets to the blender. They pick up a container of bean sprouts. A recipe is shown overlaid on the video. A piece of kiwi is cut and placed into the blender. A lemon is cut and juice squeezed into the blender. A package of walnuts is shown and some of the walnuts are poured into the blender. They pick up a bundle of parsley and add it to the blender. They turn on the blender and a green mixture is produced. The person turns to another person and motions with their hand, and they both smile. The person pours the green mixture from the blender into a glass. The other person in the kitchen takes a drink from the glass.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many ingredients are needed to make the green drink in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Ten kinds.",
+ "B. Nine kinds.",
+ "C. Eight kinds.",
+ "D. Seven kinds."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What ingredients does the man add to the blender in the video, and in what order? (a) ginger. (b) sprouts. (c) apples. (d) broccoli. (e) spinach. (f) walnuts. (g) parsley. (h) lemon juice. (i) kiwi.",
+ "answer": "D",
+ "candidates": [
+ "A. (b) (e) (d) (f) (a) (c) (h) (i) (g)",
+ "B. (e) (c) (d) (h) (b) (g) (f) (a) (i)",
+ "C. (b) (e) (f) (c) (a) (d) (g) (h) (i)",
+ "D. (a) (b) (c) (d) (e) (f) (g) (h) (i)"
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the man's emotion after drinking the green beverage in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Excited.",
+ "B. Happy.",
+ "C. Calm.",
+ "D. Regretful."
+ ]
+ }
+ },
+ "ZpyQFAFz": {
+ "video_id": "ZpyQFAFz",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a tan military uniform and hat stands in the foreground of a grassy area. Behind them, there are other people, most of whom appear to be wearing similar military uniforms. Some people wear non-military clothes. Some people are shaking hands. Text appears on the screen, identifying the person in the foreground as \"Lance Cpl. Lauren Kurkimilis.\" A close up of hands holding a medal in a box on top of a red folder appears. The camera pans to a group of people standing on bleachers wearing military uniforms. Many of them are saluting. A wider shot shows a very large group of people sitting on bleachers wearing military uniforms. A person in a military uniform is seen walking across the grass. The same person is seen walking towards a podium on a concrete area. The camera shows an additional view of people on bleachers applauding. Another view of the people standing at attention. The camera zooms in on a man who is wearing a military uniform; Text appears on the screen, identifying him as \"Sgt. Clifford Wooldridge.\"",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where was the red notebook located when the man in uniform stepped onto the stage?",
+ "answer": "B",
+ "candidates": [
+ "A. In the breast pocket.",
+ "B. In his left hand.",
+ "C. In his right hand.",
+ "D. In the pants pocket."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many times does the Navy Cross appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. One.",
+ "B. Four.",
+ "C. Three.",
+ "D. Two."
+ ]
+ }
+ },
+ "QkkvovCT": {
+ "video_id": "QkkvovCT",
+ "video_duration": "86s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A computer screen displaying software appears, with a blue and red circular design in the center. A mouse cursor clicks on the blue part, changing it to a deeper blue. The cursor selects a different part on the left sidebar and clicks on the deep blue circle, making it change to red, and the design changes in complexity to a two-dimensional shape with an outline. The cursor then clicks on a small square in a menu and moves the red circle and a second circle in the design. The program preview is selected and in a second window a blue, dotted path is shown with a green spiral on one end. A second green spiral appears when the button is pressed and the view moves to a side projection, revealing depth information. The view transitions to a close-up of a silver drill bit cutting through a piece of silver material. A human hand places a small, light-colored fastener into the cut in the silver material and then adjusts it before the view switches back to the software. The cursor then goes back to the design and selects the outer blue shape. Another red circle is added and a white line connects the two circles. A new window appears and values of X and Y are selected. The view changes to a silver tool head cutting into a black piece of material. The tool head initially cuts a vertical line and then a circular hole. The tool head then lifts out of the material and reveals a finished hole.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man in the video click the mouse's up button on the computer to change the direction of the blueprint?",
+ "answer": "C",
+ "candidates": [
+ "A. Twice.",
+ "B. Five times.",
+ "C. Four times.",
+ "D. Three times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What does the man in the video do with the nuts?",
+ "answer": "D",
+ "candidates": [
+ "A. Secure parts using nuts.",
+ "B. Install the nuts onto the machine.",
+ "C. Replace old parts.",
+ "D. Place the nuts into the T-slot workbench."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man in the video use a mouse to draw a line between two circles?",
+ "answer": "B",
+ "candidates": [
+ "A. To measure the size of the circle.",
+ "B. To easily find the coordinates.",
+ "C. To align the circles diagonally.",
+ "D. To measure the size of the workpiece."
+ ]
+ }
+ },
+ "ZSRBwQry": {
+ "video_id": "ZSRBwQry",
+ "video_duration": "118s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two people are seated at a round, wooden table. A small electronic device is in the center of the table. The person on the left wears a black t-shirt, jeans, and a leather watch. The person on the right wears a white t-shirt and khaki shorts. Both people are wearing glasses. At various times, they pick up small devices from the table. The two individuals face the camera at times and face each other at other times. At times the view switches to a close up shot of the device, and a small green and white item moves inside it. The video ends with a logo that states \"sparkfun ELECTRONICS\" in white font on a black background.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the two men in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Familial relationship.",
+ "B. Mentorship relationship.",
+ "C. Classmate relationship.",
+ "D. Partnership."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "In the middle section of the video, what do the two men do?",
+ "answer": "A",
+ "candidates": [
+ "A. The two men are playing mechanical table tennis together.",
+ "B. The two men are introducing mechanical table tennis together.",
+ "C. The two men simultaneously operate the mechanical table tennis, setting it to \"demonstration mode.\"",
+ "D. The two men have a disagreement over the mechanical table tennis."
+ ]
+ }
+ },
+ "jLYUvXrB": {
+ "video_id": "jLYUvXrB",
+ "video_duration": "135s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A woman with dark hair and red lipstick gestures with her hands. The background includes a bookshelf and a couch. Space-themed images hang on the wall behind her. The video cuts to an image of a space shuttle launching. Text appears on the image reading \"ENGINE = LIQUID FUEL, THROTTLEABLE\". An arrow points to the space shuttle's main engine. Another arrow appears, pointing to the top of the solid rocket boosters, along with text reading \"MOTOR = SOLID FUEL\". The video cuts back to the woman. She picks up and displays a model of a small, single-stage rocket. The text \"ENGINE/MOTOR COUNT: 0\" appears below her. She then picks up a model of the Saturn V rocket. The text below changes to \"ENGINE/MOTOR COUNT: 5\". A painted image of a Saturn V rocket in space appears, with Earth partially visible below. Flames trail from the rocket's engines. Text reading \"ENGINE/MOTOR COUNT: 5\" is overlaid on the image. The text changes to \"ENGINE/MOTOR COUNT: 13\". The woman is shown again holding the two rocket models. The text \"ENGINE/MOTOR COUNT: 13\" is displayed below. She detaches the first stage of the larger rocket model and holds it separately. The text changes to \"ENGINE/MOTOR COUNT: 21\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What are the advantages of liquid fuel compared to solid fuel?",
+ "answer": "B",
+ "candidates": [
+ "A. Liquid fuel is convenient to store.",
+ "B. Liquid fuel can be throttled.",
+ "C. Liquid fuel is easy to transport.",
+ "D. Liquid fuel is less expensive."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What occurs after about two and a half minutes of the first stage burning?",
+ "answer": "B",
+ "candidates": [
+ "A. It continues to burn alongside the rocket.",
+ "B. It separates and moves away from the rocket body.",
+ "C. Nothing happens.",
+ "D. It separates but stays close to the rocket's main body."
+ ]
+ }
+ },
+ "NQqTIkER": {
+ "video_id": "NQqTIkER",
+ "video_duration": "160s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a shot of a white tiled walkway with closed, slatted doors. There are a few people visible down the hall. The camera moves forward, turning right, and a young man with blond hair is visible in the foreground. He turns to the right and a woman with curly brown hair wearing clear glasses appears next to him, and the caption \"WE GET FREE FOOD\" - KAYLEE is visible in the center of the screen. The man and woman laugh. The camera focuses on the man's face and then back to both of them walking along a walkway lined with a black curtain. A man in a grey shirt walks in front of the camera through a door in the curtain. The camera follows the man into the stadium and a ladder and some scaffolding are visible. The man walks into the stadium and a basketball game is in progress. The camera turns back and shows the woman, who is smiling. Then, the man is visible again and he is smiling. He turns to the right and the woman is next to him. They are now walking along the side of the stadium seats. The camera focuses on a man's hand placing a red ball onto a seat. Then the hand picks up the red ball. The camera is now pointing into the stadium at the basketball court. A hand with a bright yellow band on its wrist is holding a red ball and tossing it into a small basketball net, then the ball is picked up off the floor. The camera moves again, and the woman's face is in the foreground, and the stadium is behind her. The woman then gestures, and then her face is in the foreground again. The man appears again, and the camera follows him through a series of shots. At one point, the man is holding a basketball and posing with two other men, one of whom is identified as Frank Jackson. The camera then focuses on the man's face again. He then turns around, and the woman's face is visible next to his, they both laugh. The camera pans to basketball players in blue walking onto the court. The camera goes back to showing the man and woman laughing. A shot of basketball players stretching and then the man holding a basketball is on screen again. The camera follows him and shows a third man placing a hand on his shoulder. There is a brief shot of a hand picking up the red ball from a black leather seat. The camera is on the man and the woman again, they both are laughing and smiling.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many basketball hoops are there in the arena?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. Four.",
+ "D. One."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event occurred in the middle of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two little girls in white clothing appeared.",
+ "B. A man made a basketball shot.",
+ "C. A basketball player suddenly appeared in the scene.",
+ "D. A toy basketball fell into the trash can."
+ ]
+ }
+ },
+ "nZOkcNql": {
+ "video_id": "nZOkcNql",
+ "video_duration": "54s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Soccer",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a football game at night. The first shot is of a scoreboard that shows the home and guest teams tied at 8 points, with 1:48 remaining in the third quarter. The next scene shows a football play, with players in red and black jerseys and players in white and blue jerseys lined up at the line of scrimmage. The players in red and black advance, and then a player in white and blue moves the ball down the field. The scoreboard in the following shot shows that the home team leads 22 to 8. Another scene shows football plays in which players in red and black and players in white and blue run across the field, and one falls to the ground. In another scene, the scoreboard reads 22-22. After several more plays, the final score is shown as 25 to 22. The last few shots of the video show a player in red and black lying on the ground as two coaches tend to him. The final scene shows cheerleaders in blue and white at the edge of the field with players in white and blue on the field.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional change did the athlete wearing a white jersey undergo in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. From nervous to sad.",
+ "B. From sad to nervous.",
+ "C. From excited to sad.",
+ "D. From nervous to excited."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "To which of his teammates, wearing a white jersey, did Charlie Pfister pass the ball, and what was the jersey number?",
+ "answer": "C",
+ "candidates": [
+ "A. Number 2.",
+ "B. Number 5.",
+ "C. Number 14.",
+ "D. Number 24."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the player wearing the red No. 3 jersey do to the player wearing the white No. 3 jersey?",
+ "answer": "A",
+ "candidates": [
+ "A. Chase after the player wearing the white No. 3 jersey.",
+ "B. Clearly pull the player wearing the white No. 3 jersey.",
+ "C. Punch the player wearing the white No. 3 jersey.",
+ "D. Use the body to knock down the player wearing the white No. 3 jersey."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the role of the man wearing black-and-white stripes and a hat near the stadium?",
+ "answer": "D",
+ "candidates": [
+ "A. A player.",
+ "B. A spectator.",
+ "C. A coach.",
+ "D. A referee."
+ ]
+ }
+ },
+ "OtDiByCB": {
+ "video_id": "OtDiByCB",
+ "video_duration": "67s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man in a white lab coat stands and gestures, a bookshelf, window, desk with pills, and a book visible in the background. Animated grey spheres connected by lines rotate, followed by a sphere with blue and green labels, then a blue and green connected line drawing and the words \"SugarMD\". The screen goes black. An animated man with glasses, wearing a black suit and yellow tie, appears against a blue background. The words \"45 YEAR OLD\" appear to his left. The same animated man sits at a desk with a laptop in a room with a bulletin board and a lamp. The same man is pictured in a gym with other animated people working out on treadmills and lifting weights. The man is then pictured at the desk again, followed by walking away. The man, wearing the same black suit, sits at a restaurant table along with other animated people. The same man stands with another animated man on the sidewalk. A grey graphic of a city skyline is displayed. The man sits on a chair with another animated person sitting on a couch. The man is then depicted slumping to the floor and holding his head, with the woman sitting on the couch looking at him. A calendar graphic is displayed with the words \"Yearly Medical check-ups\". A man is seen in a laboratory working with a test tube, and two clipboards with \"positive\" and \"negative\" labels. The man is then depicted sitting on an examination table in a doctor's office with an animated doctor. The man is sitting in a white chair in an office speaking with an animated doctor sitting at a desk. The same animated man is seen sitting at a restaurant table eating. The same man is then pictured in a different restaurant, then eating a meal at a table with a family. Finally, the man stands with his hand on his head in a living room.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the first scene in the animation take place?",
+ "answer": "C",
+ "candidates": [
+ "A. A restaurant.",
+ "B. A gym.",
+ "C. An office.",
+ "D. A hospital."
+ ]
+ }
+ },
+ "ZLPwsBNd": {
+ "video_id": "ZLPwsBNd",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person is using a hoe to till the soil in a garden. Several people are harvesting yellow flowers in a large greenhouse. A wide shot shows several rows of tilled soil in a greenhouse. An external shot shows two greenhouses and a grassy field. A person wearing a hat and gloves kneels on the ground while working in a garden. Two white buses drive on a road, followed by a person riding a bike. A blue car drives on a road past a sign that says \u201cBienvenidos a El Carmen de V\u201d. Several pieces of farm equipment are used to till soil. A person with a hoe works around the base of a tree. A person is shown pouring fertilizer into a piece of green farm equipment connected to a tractor. The fertilizer is then spread through the farm equipment on the ground. Several people with spraying tanks on their backs walk down rows of a field, spraying their crops.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What tool does the person in the video use to stir the fertilizer?",
+ "answer": "B",
+ "candidates": [
+ "A. Use a shovel to stir the fertilizer.",
+ "B. Use a large mixer to stir the fertilizer.",
+ "C. Use a hoe to stir the fertilizer.",
+ "D. Use a stick to stir the fertilizer."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times did the commentator mention the word \"greenhouse\"?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. One.",
+ "C. Four.",
+ "D. Two."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the purpose of the machinery shown in the middle of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Loosen the soil.",
+ "B. Enhance transportation efficiency.",
+ "C. Water the crops.",
+ "D. Improve the climate environment."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man in a white shirt and hat till the ground in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. One.",
+ "D. Four."
+ ]
+ }
+ },
+ "bREzcKHa": {
+ "video_id": "bREzcKHa",
+ "video_duration": "152s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Racing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a track and field race. Eight people wearing athletic clothing are at the starting line, crouched down in starting blocks. Each runner is in a lane with a yellow number marker in front of them. The racers are wearing clothes in various colors, including green, red, blue, yellow, black, and white. On the far side of the track are flags and a building. After a brief pause, the racers begin to run around the track. Some people can be seen sitting in bleachers in the background. Some people are standing near the track wearing various colored clothing, including white shirts and dark pants. The runners complete the race, with one runner in green and red clothes in the lead. The runners eventually begin to slow down and stop walking. One of the people walks toward a person in the green and red clothing and hands them a slip of paper. The camera zooms into the person in the green and red clothing. The camera continues to move to show other people in the background. A large crowd of people can be seen in the background in a stadium. The camera zooms out to show the runners lined up at the starting line with a clear view of all of them. Then they prepare to run again. Finally they begin running. The camera zooms out to show all the runners completing their race.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the sound of a starting gun occur in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Two times.",
+ "B. Four times.",
+ "C. One time.",
+ "D. Three times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What are the roles of the six girls in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Runners in the 100-meter sprint.",
+ "B. Runners in the 400-meter race.",
+ "C. Runners in the 200-meter sprint.",
+ "D. Runners in the 50-meter sprint."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the positional relationship between Adams and Campbell?",
+ "answer": "D",
+ "candidates": [
+ "A. The two are separated by one lane.",
+ "B. The two are separated by three lanes.",
+ "C. The two are separated by two lanes.",
+ "D. The two are adjacent."
+ ]
+ }
+ },
+ "dyTNOoaC": {
+ "video_id": "dyTNOoaC",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Historical Analysis",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person wearing a white hat and a white and brown outfit stands outdoors on a cracked dirt surface. The person holds a bow and arrow and aims to the left. The person has a quiver full of arrows on their back. Behind them are various potted plants on a table, and other objects including a scale, a refrigerator, buckets, and a brick wall. The video then switches to show a person inside, with glass paneled doors behind them. They are wearing dark clothing and a fur hat. The person holds a bow with two arrows pointed forward. They draw the bow back and aim it to their right. Next, the person is seen from a slightly different angle, holding the bow with a single arrow aimed in front of them to the right. The person in the previous two clips then draws the bow again with more than one arrow aimed in front of them to the right. Then the same person draws the bow again with multiple arrows, wearing red gloves and aimed to the right. The video then switches to show a person wearing a tan outfit with a red belt, standing indoors with a dark stage and purple drapes behind them. They are squatting on a bench and drawing back their bow. The person draws the bow a few more times in the same position. They do this several more times, and then switch to stand on the bench, drawing the bow. In the background, there is another person. Then, it switches to show the person from a slightly different angle, drawing the bow. Finally, the video switches to show a person wearing a red shirt and dark pants sitting on a small chair outdoors. They are drawing back a bow with an arrow, while multiple other objects surround them. The person continues to draw and aim the bow several times. They move their body while drawing the bow, and then bring the arrow up over their head. The person continues to aim the bow and arrow.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the mood of the man wearing glasses change after shooting five arrows in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From calm to angry.",
+ "B. From frustrated to excited.",
+ "C. From calm to excited.",
+ "D. From angry to excited."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What happened to the wooden board after the man in red clothes shot several arrows at it in succession in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The wooden board was knocked over by the arrows.",
+ "B. The wooden board was shot through.",
+ "C. The wooden board was taken away by the man.",
+ "D. The wooden board remained undamaged."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the woman in the white top doing in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. She is repairing a bow and arrow.",
+ "B. She is sitting on a chair.",
+ "C. She is watching others shoot arrows.",
+ "D. She is competing in archery with others."
+ ]
+ }
+ },
+ "wCZkPkru": {
+ "video_id": "wCZkPkru",
+ "video_duration": "87s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with the text \u201cTWO ROCK\u201d in white letters on a dark background, followed by an aerial shot of a long straight road stretching through a desert landscape with mountains in the background. The text \"BEM-VINDO A TWO ROCK\" appears in white on the road. Next, a close up of a sign that says \"Desert Dream MOTEL\" is displayed. The shot then changes to an aerial view of the motel, showing three people walking towards the motel building, and a vehicle parked in the car park. The scene cuts to a room with two men facing each other, one man holding a cigarette. The scene changes to a dark room with text overlay. A man and a boy are shown in the dark moving a body wrapped in a sheet. Then a boy is shown climbing through a skylight and down onto furniture. The scene changes to a close up of a man\u2019s face, and then a shot of two men looking into an open safe filled with stacks of money. A man is then shown putting money into a bag. A close up of a man\u2019s face is shown in a dimly lit room. A man with long hair and a mustache is shown in front of shelves of bottles. A woman is shown in front of a slot machine. A man, a boy and two other young men are standing in a wooden structure. A woman is shown lying down in a dark room with a man standing by the door. The scene switches to a man with a beard. The man is shown again with a woman. There is a shot of a family in a car, followed by another shot of the same family at the motel with a dog in front of them. A series of text overlays appear. The next scene shows two men in bandanas, one pointing a gun. More text overlays are shown. A man in a red top is shown talking to another man. Another man is shown talking to two younger people, followed by the text \u201cA SUA HIST\u00d3RIA\u201d on a black background. A close up of a woman in dark lighting is shown with the text \u201cPapai, n\u00e3o v\u00e1!\u201d. A woman\u2019s face is shown with her hands over her face. Then another close up of her face is displayed. A man and a woman are shown. A man is shown holding another man. Text overlays appear, including a quote from Distractify and Screen Rant. The scene changes to a house burning in flames. A road is shown. A car is shown on the road and then police cars. A helicopter is shown with the word \"POLICE\". A car is surrounded by police cars. A man is shown looking shocked. A man fights another man. An older man is shown hugging a young woman. A scene shows a young man with a young woman, then a scene of a man breaking a bottle. The scene changes to a close up of a young man\u2019s face with a woman standing beside him, and a text overlay \u201cSUAS DECIS\u00d5ES MOLDAM VIDAS\u201d Another young man and a woman are shown, and then a close up of their faces, and a man in a car. Text overlays include \u201cE REVELAM QUEM VOC\u00ca \u00c9\u201d. A young man wearing a bandana is shown. A young man is shown in front of a bookcase. Two men are shown falling in a swimming pool. A scene shows a car speeding along a road, then a man lying still, and then a close up of a man\u2019s face. A man is shown kissing a woman. A girl is shown in a car, then a hand is shown reaching out of the car. The motel is shown again and multiple faces are displayed. A graphic with three faces and the motel is shown. The same image is shown again with the title, \u201cAS DUSK FALLS EXPERIENCE SOLO OR MULTI PLAYER\u201d.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the scene of a police car chasing another car occur in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A grassland.",
+ "B. A highway.",
+ "C. A desert.",
+ "D. A forest."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the blonde girl doing with the bottle?",
+ "answer": "D",
+ "candidates": [
+ "A. Throwing it at a crowd.",
+ "B. Using it for self-defense.",
+ "C. Dropping it from a great height.",
+ "D. Making a homemade bomb to set a house on fire."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What did the man wearing a headscarf do?",
+ "answer": "A",
+ "candidates": [
+ "A. Stole money.",
+ "B. Killed a person.",
+ "C. Injured someone.",
+ "D. Stole a gun."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who said, \"I'm not just gonna abandon you\"?",
+ "answer": "D",
+ "candidates": [
+ "A. A boy in a yellow shirt.",
+ "B. A woman with curly hair and dark skin.",
+ "C. A man in a suit with a red tie.",
+ "D. A girl with golden curly hair wearing red pants."
+ ]
+ }
+ },
+ "vesHuoFw": {
+ "video_id": "vesHuoFw",
+ "video_duration": "98s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with an aerial view of a lake surrounded by trees and some buildings. The camera transitions to a closer view, focusing on a person standing on a boat in the middle of the lake. The person, wearing a dark jacket, jeans, and sunglasses, is holding a fishing rod. They cast the line into the water, reel it in, and then catch a fish in a net, before lifting the fish out of the net and holding it up. The person then places the fish back down and the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "How much does the fish that the man in the video catch weigh?",
+ "answer": "B",
+ "candidates": [
+ "A. 6 pounds.",
+ "B. 8 pounds.",
+ "C. 9 pounds.",
+ "D. 7 pounds."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many fish did the man catch in the video in total?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. One.",
+ "C. Three.",
+ "D. Four."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence in which the man in the video catches the fish?",
+ "answer": "A",
+ "candidates": [
+ "A. Pull back and yank the rod, quickly retrieve the line, engage in side-to-side motion, use a net to lift the fish.",
+ "B. Engage in side-to-side motion, quickly retrieve the line, pull back and yank the rod, use a net to lift the fish.",
+ "C. Use a net to lift the fish, quickly retrieve the line, engage in side-to-side motion, use a net to lift the fish.",
+ "D. Quickly retrieve the line, engage in side-to-side motion, pull back and yank the rod, use a net to lift the fish."
+ ]
+ }
+ },
+ "wyZdicSK": {
+ "video_id": "wyZdicSK",
+ "video_duration": "97s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows an indoor space with multiple bird cages, some of which contain birds. There is also a wooden structure with perches. A person wearing a white shirt, shorts, a baseball cap, and glasses walks through the space, looking up. A blue and green bird is on the floor. A grey and red bird flies from a hand. A pink and grey bird is seen on a railing. The grey and red bird is on the railing. The grey and red bird flies to a chandelier. The pink and grey bird is seen flying from the railing to the floor. The pink and grey bird is seen walking around on the floor.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "From which direction did the first parrot fly towards the camera?",
+ "answer": "C",
+ "candidates": [
+ "A. From the front left.",
+ "B. From the rear right.",
+ "C. From the rear left.",
+ "D. From above."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "At the end of the video, did the gray parrot find the man?",
+ "answer": "C",
+ "candidates": [
+ "A. None.",
+ "B. Not sure.",
+ "C. Did not find him.",
+ "D. Found him."
+ ]
+ }
+ },
+ "GeKVDdqj": {
+ "video_id": "GeKVDdqj",
+ "video_duration": "31s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person in a pink apron and green sweater stands up in a hearing room and raises a pink sign. The sign has various slogans on it, including \u201cStop Israeli War Crimes,\u201d \u201cNo to the Siege of Gaza,\u201d and \u201cStop Israeli State-Sponsored Terrorism.\u201d The person chants slogans including \u201cCeasefire Now,\u201d \u201cWhat you\u2019ve been doing is illegal,\u201d \u201cPalestinians aren\u2019t animals!\u201d and \u201cStop funding the genocide of Palestinians.\u201d A man in a suit, presumably part of the hearing panel, is shown looking down, seemingly unaffected. A woman in a dark suit sits behind him, observing the protest. The protester continues holding signs and chanting as security approaches and escorts them out. The bottom of the frame switches to a news headline that reads, \"CORRUPT TESTIMONY OF BLINKEN.\" As the protester is escorted away, they continue shouting and holding signs. The protest continues off-screen while the man in the suit remains at his seat, a different man with a patterned tie now visible behind him. The protester's chants continue with slogans such as \u201cStop funding the genocide!\u201d and \u201cDemilitarize the border!\u201d",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the woman in red and green clothing preparing to do after standing on the table in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. She wants to attract attention.",
+ "B. She is a journalist preparing to conduct an interview.",
+ "C. She is preparing to hold up a banner to protest.",
+ "D. She is a sanitation worker preparing to wipe the table."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What slogan is on the banner in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Oppose the rule of Gaza.",
+ "B. Oppose the release of Gaza.",
+ "C. Oppose the Gaza blockade.",
+ "D. Oppose Gaza's independence."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the woman in red and green clothing say 'stop firing' in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. Six.",
+ "D. Five."
+ ]
+ }
+ },
+ "UTDkoMuj": {
+ "video_id": "UTDkoMuj",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person is playing a long brown flute while standing in front of a piano. The person is wearing a grey patterned shirt. In a later scene, many people are sitting on stage playing instruments while the same person is playing the flute. Later, the flute player stands and moves their arms. In the final scene, the flute player stands and plays again.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What is the first instrument to appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The pipa.",
+ "B. The flute.",
+ "C. The erhu.",
+ "D. The guzheng."
+ ]
+ }
+ },
+ "gAZSVTpc": {
+ "video_id": "gAZSVTpc",
+ "video_duration": "222s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with short, blonde hair and wearing a blue, long-sleeved shirt is in a kitchen, talking. The camera angle moves to show a black frying pan on a gas stovetop. A clear yellow liquid is poured into the frying pan from a clear plastic squeeze bottle. The man picks up the frying pan. The camera angle returns to the man, who talks, then the camera moves to his hand putting two red pieces of meat in the frying pan. The video cuts to another man, with medium-length, dark curly hair, wearing a white and blue checked shirt and a black and white striped apron. He gestures with his hands. He takes a piece of meat and places it in a frying pan on the stovetop, then adds a second piece of meat. The camera angle returns to the man gesturing with his hands, and the camera focuses on the meat in the frying pan. He talks again, and then the camera returns to show the first man, who is talking. The camera focuses on the two pieces of meat cooking in the frying pan. The video cuts back to the man with dark curly hair, who is talking. The camera angle returns to the blonde man, who is talking. The camera moves to the frying pan, where the meat is now cooked on one side, and is being flipped over with a pair of silver tongs. The first man is shown talking. The camera returns to the frying pan, where the pieces of meat are now cooked on both sides. He adds something into the pan and shakes it. He speaks again, and uses tongs to add something to the pan and turn the meat. He turns the pan in a circle while it is still on the burner. The camera cuts back to the first man, who talks. The camera angle changes to a different man in a kitchen, with medium-length dark curly hair. He takes pieces of meat out of the pan. The camera angle moves to the first man, who is shown adding cloves of garlic to a wooden cutting board. He separates the garlic cloves, and then adds something to the frying pan. He talks, and then the video shows the meat in the frying pan. He moves to a different stovetop and continues cooking. He puts the frying pan down and adds some sprigs of green herbs to the frying pan. He picks up the frying pan, the pieces of meat now well-browned, and adds a clear liquid to the pan. The camera angle moves back to the first man talking. The camera moves to a shot of the meat frying. He picks up a spoon, and uses it to cook the meat and clear liquid. He uses tongs to take the meat out of the pan. He takes another piece of meat out of the pan and places it on a wooden cutting board. The first man is shown, talking and using a grinder on the meat on the cutting board. The camera angle returns to the first man, who is talking. The first man uses a cloth to wipe his hand. He talks, then the camera cuts to the pieces of meat in the sauce, in the frying pan. The second man is shown, talking and gesturing with his hands. The camera cuts to the first man, who is talking. He takes some butter and puts it in the pan. He uses the spoon to move the butter around in the pan. He talks again. The man picks up the frying pan and moves it while still over the heat. He takes the meat out of the pan and places it on a plate. The camera angle shows the pieces of meat as they are being added to a plate, and then two pieces of meat are shown on a white plate. The first man is shown cooking more meat. He uses tongs to move the meat. He takes the pan off the burner, then speaks again. The camera cuts to the meat cooking with sauce, and the second man adds more of a creamy sauce to it, which he stirs with a spoon. He adds pieces of butter, and then stirs it some more. The video ends with various \u2018like\u2019, \u2018subscribe\u2019, \u2018share\u2019, and \u2018comment\u2019 graphics.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occur to the steak prepared by the man wearing an apron in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The raw steak turns into a cooked steak coated in lemon juice.",
+ "B. The raw steak turns into a cooked steak coated in cream.",
+ "C. The raw steak turns into a cooked steak coated in mushroom sauce.",
+ "D. The raw steak turns into a cooked steak coated in red wine sauce."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Are there any spices in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes, there are.",
+ "B. No, there are not.",
+ "C. I am not sure."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man in short sleeves do with the garlic cloves in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Chopped them directly and added them to the pot.",
+ "B. Peeled, chopped, and added them to the pot.",
+ "C. Crushed them directly and added them to the pot.",
+ "D. Peeled, crushed, and added them to the pot."
+ ]
+ }
+ },
+ "DThfPzCz": {
+ "video_id": "DThfPzCz",
+ "video_duration": "100s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a gray shirt and a white and red baseball cap is sitting in a green John Deere utility vehicle inside a garage. The person puts both of their hands up in the air with both thumbs up. They wave their hands and then lean down to manipulate the vehicle's steering wheel. They then put their hands up in the air with their palms facing outward. They lean towards the camera with open hands and dirty fingers, and then return to sitting in the driver's seat of the vehicle. The person puts both hands on the steering wheel and then moves their left hand away. Then they once again put their hands up in the air with open palms facing the camera. They point with one finger towards the camera, then lean down to manipulate the steering wheel again. They point with one finger towards the camera. They walk out of the garage. The vehicle then moves to the driveway, and stops next to a dark car. The person leans towards the front tire, and then manipulates the vehicle's tire again. The person sits down in the driver's seat, and then moves to the back of the vehicle, manipulating something on the back. Then, the vehicle's dump bed is tilted upward. The person manipulates something near the tire again, and walks toward the camera, pointing at the camera with one finger. The person puts their hands up in the air with their palms open, then holds up three fingers. They close their eyes and make a face. They then put their hands up in the air with their palms open again, then hold up one finger, and finally point one finger at the camera. Then the video shows a black screen with \"3DMachines\" at the top left, \"Subscribe\" at the left, a black screen saying \"yesterday's video\" and a rectangle that says \"music by: Highlanderz.\"",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man in the gray shirt reverse the car in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Once.",
+ "B. Four times.",
+ "C. Three times.",
+ "D. Twice."
+ ]
+ }
+ },
+ "iFZNzQMl": {
+ "video_id": "iFZNzQMl",
+ "video_duration": "133s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Historical Analysis",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a globe and the text \"Human Rights\". Below that are three dolls, a book, a scale, and a red and white striped barrier. These images are replaced by three dolls each holding a sign with a black and white image on it. Each doll has white tape over their mouths. The scene shifts to a view of a computer screen showing the three dolls and then a red screen with the word \"Blocked\". The red screen then appears on a newspaper with the text \"Free Press\", \"Independent/ Non-Partisan\" and the text \"Fear of freedom of opinion\" below the red box. The top corner of the paper tears away to reveal a map of the world that is colored with three different colors. Then the text \"Press\", \"Free\", \"Restricted\", and \"Not Free\" appears below the map. This map is then replaced by a second world map with different colors. Below this map the text \"Death Penalty\" and \"Abolished for all crimes\" and \"Abolished for minor offenses\" and \"Abolished in practice\" and \"Not abolished\" are displayed. This map is then replaced by text that is layered over a simplified map of the world with text reading, \"Western Sahara\", \"South Ossetia\", \"Uzbekistan\", \"Sudan\", \"Somalia\", \"Syria\", \"Turkmenistan\", \"Saudi Arabia\", \"Equatorial Guinea\", \"North Korea\", \"Libya\", \"Eritrea\", \"Cuba\", \"Chad\", \"China\", \"Laos\", \"Cote D'Ivorie\", \"Belarus\", and \"Myanmar.\" These layered texts are followed by the appearance of an EU flag, then a segment of barbed wire, and three sheets of paper each with the word \"Asylum\" written on them. The three sheets tear to reveal a second image in which these images are superimposed over an image of a boat slowly sinking in the ocean. Then the video shows the EU flag and barbed wire again, followed by a US flag and a bar graph of executions showing a series of clear syringe images. Each syringe is filled with a brown liquid with a skull and crossbones on the syringe. The syringes decrease in quantity of brown liquid over time. Then a metal grid appears over the graph and three orange clad figures with masks stand behind the grid. The video then splits the screen. On the left a bald man with sunglasses wearing a suit holds a gun, and on the right a man in a military style uniform holds a gun. Then the background shifts to red on the left and right side of the split screen, and on the left side there is a doll in a red bikini and a cartoon hand with a gun pointing at her head, with the text \"live show\" above her, and on the right side there is a doll in a military style uniform in front of a burning building, with a red background. Then the screen shows a doll with dark hair wearing a suit and tie sitting at a desk next to a laptop and a globe. Then the scene splits the screen again, with the left side red and showing four dolls at sewing machines and the right side is also red and shows a doll with a paint sprayer and a red barrel with a skull and crossbones on it. Finally, the screen shows a series of different images on the screen at once including images from earlier in the video with the words \"Human Rights\" layering over all the images.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How many executions were carried out in the United States in the year 2000 according to the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 98.",
+ "B. 56.",
+ "C. 85.",
+ "D. 60."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the forced labor images at the end of the video relative to the images showing the number of executions in the United States?",
+ "answer": "B",
+ "candidates": [
+ "A. The images of forced labor are above the images showing the number of executions in the United States.",
+ "B. The images of forced labor are to the right of the images showing the number of executions in the United States.",
+ "C. The images of forced labor are below the images showing the number of executions in the United States.",
+ "D. The images of forced labor are to the left of the images showing the number of executions in the United States."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occurred in the montage concerning human rights issues in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It disappeared in the corner.",
+ "B. It was divided into two parts.",
+ "C. It broke in the middle.",
+ "D. It was painted red."
+ ]
+ }
+ },
+ "heOrfYdb": {
+ "video_id": "heOrfYdb",
+ "video_duration": "80s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Biology",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A blue, roughly cone-shaped object appears at the top of the screen within a brown, textured environment. Red and dark, v-shaped objects appear to enter the bottom opening of the blue object. The blue object is composed of a grid-like pattern. The red objects accumulate inside the blue object, while some of the dark v-shaped objects enter through the sides of the blue object. A yellow circle appears on the side of the blue object. As more red objects accumulate inside the blue object, a layer of orange objects appears, and the blue object appears to be pushed down further in the environment. The blue object turns white in sections, with orange appearing on the surrounding area. The orange section with the red objects increases in size. The perspective changes, and we see a close up of the boundary between the blue and orange sections. The grid-like pattern of the blue object remains visible, and orange appears to be held inside the blue object. The dark v-shaped objects remain, as well as the small red objects. The blue object opens up at the bottom, allowing some of the orange objects to exit. We see the surrounding brown environment again with more dark v-shaped objects flowing around it. Finally, the blue object is almost entirely white, and the large orange layer remains.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of tooth development? (a) The epithelial barrier sends inductive signals to the outer embryonic mesenchyme towards the root papilla, (b) secretes dentin matrix, (c) allowing direct contact between the dentin and odontoblasts with the dental sac, (d) cells differentiate into ameloblasts, (e) secreting proteins related to enamel, ultimately detaching from the root surface, (f) inducing primary cementum.",
+ "answer": "D",
+ "candidates": [
+ "A. (e), (a), (d), (b), (c), (f).",
+ "B. (d), (e), (a), (b), (c), (f).",
+ "C. (a), (b), (c), (d), (e), (f).",
+ "D. (a), (d), (b), (e), (c), (f)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship among primary cementum, the dental sac, dentin, and enamel-related proteins?",
+ "answer": "B",
+ "candidates": [
+ "A. Primary cementum interacts with the dental sac, inducing the production of enamel-related proteins and dentin.",
+ "B. Dentin and enamel-related proteins interact with the dental sac, inducing the formation of primary cementum.",
+ "C. Dentin and primary cementum stimulate the dental sac to secrete enamel-related proteins.",
+ "D. Dentin and primary cementum make contact with the dental sac, inducing the production of enamel-related proteins."
+ ]
+ }
+ },
+ "xWebFGfa": {
+ "video_id": "xWebFGfa",
+ "video_duration": "133s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video displays a man wearing a blue suit and a striped shirt. He appears to be talking. Behind him, there is a photo of several individuals standing next to luggage, a bottle, a passport, and a hat. There is a row of books stacked on the table next to him. The next scene shows multiple gold bars displayed with an overhead view. The man in the suit then reappears. A calendar is shown with a red frame around the letters \"SAT\". The next scene is of a gym with weight equipment, followed by a website of watches. The man in the suit then reappears again, followed by an outdoor scene of two lounge chairs near a swimming pool. The video concludes with the man in the suit speaking again.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why did the man sell his watch?",
+ "answer": "B",
+ "candidates": [
+ "A. Because a family member is sick.",
+ "B. To pay the car loan.",
+ "C. To pay the mortgage.",
+ "D. To lend money to a friend."
+ ]
+ }
+ },
+ "NKwcqozV": {
+ "video_id": "NKwcqozV",
+ "video_duration": "57s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A film is projected onto a screen. The film shows a figure in a dark uniform facing a woman behind a counter. The SCP logo is visible on the wall behind the woman. The figure in uniform turns to face the camera, revealing complex mechanical inner workings beneath a transparent shell. The image then changes to show a close-up of mechanical legs and feet. The feet have what appears to be thrusters attached. The next scene shows an explosion and a figure in a metallic suit propelled upwards by the thrusters. The suit continues ascending against the backdrop of the explosion.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will they do next?",
+ "answer": "D",
+ "candidates": [
+ "A. Release the SCP.",
+ "B. Destroy all SCP-A5.",
+ "C. Modify the SCP.",
+ "D. Capture the escaped SCP."
+ ]
+ }
+ },
+ "NBWJocjJ": {
+ "video_id": "NBWJocjJ",
+ "video_duration": "177s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A hand moves a lever and pulls a black battery pack out of a frame attached to a bike. A man is then seen standing next to a black electric bike, reaching down to make an adjustment. The man then poses with both arms outstretched next to the bike. The bike is then shown lying on its side in a garage. A dog is seen lying down in the background. A hand points towards the dog. The video shows a close-up of the dog. The dog opens its mouth, and turns its head. A black battery pack charging is shown. A view of the dog's legs moving is shown. The dog moves closer to the camera and laps out of a red bucket. The dog moves its head back and forth. A second dog appears and both drink from the bucket. The two dogs walk by the bike. The rear rack of the bike is shown, followed by the bike's display which reads 0.0 mph, then 4 mph, then 8.0 miles. The bike's handgrip and brake is shown, followed by the front tire. A full view of the black bike is then shown. Close up views are shown of various parts of the bike, including a pedal, the battery, a hub, the tires, and the handlebars. A hand adjusts a black wire on the bike\u2019s front wheel. A dog is shown in the background, lying on a white piece of styrofoam. The battery charger is shown with a green light. A dog\u2019s legs are shown quickly moving, followed by two dogs lapping from a bucket. The bike is shown again, followed by a closeup of the rear rack. The bike\u2019s display is shown, displaying numbers and a battery graphic. A hand grips the right handlebar, then the left handlebar. Then the handlebars are shown in motion on a road. A person is shown riding the black electric bike, followed by two dogs running alongside. The camera changes perspective, and the person riding the bike is shown on a paved road and grass, then the camera returns to the perspective of the person riding the bike. The camera then shifts to a ground view of the road being biked. A full view of a dirt road appears. The bike rider appears and rides by, followed by two dogs running alongside. The dogs then follow the bike rider past the camera. In a closeup, a man points at the camera, then opens his hand, then smiles. A dog walks toward the camera and away, then two dogs drink from a water source.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order in which the man introduces the parts of the bicycle?",
+ "answer": "D",
+ "candidates": [
+ "A. Cyclometer, seat, handlebar, wheel.",
+ "B. Seat, handlebar, cyclometer, wheel.",
+ "C. Seat, handlebar, wheel, cyclometer.",
+ "D. Seat, cyclometer, handlebar, wheel."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Which part of the bicycle does the man install first in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The tire.",
+ "B. The seat.",
+ "C. The handlebar.",
+ "D. The chain."
+ ]
+ }
+ },
+ "nJUtXswZ": {
+ "video_id": "nJUtXswZ",
+ "video_duration": "46s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A character wielding a sword and other smaller blades is battling several enemy characters. The main character engages the enemies with both melee attacks, and some that create an explosion. The scene is set in a dark, stone-walled area, and the main character moves to a higher, wooden platform after defeating a few enemies. The character then moves to a caged elevator and avoids getting on. At the end, the character moves from the elevator and proceeds through a stone passage.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What was the final game result at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Lose.",
+ "B. Draw.",
+ "C. Win.",
+ "D. I don't know."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "How does the player character enter the elevator?",
+ "answer": "B",
+ "candidates": [
+ "A. Walks in.",
+ "B. Rolls in.",
+ "C. Jumps in.",
+ "D. Climbs in."
+ ]
+ }
+ },
+ "zvqokOfj": {
+ "video_id": "zvqokOfj",
+ "video_duration": "82s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two individuals stand facing each other. One person wears a two-tone green and white outfit and white hand coverings. The other wears a dark outfit with black hand coverings. The individual in the dark outfit moves their arms and legs in various motions, which the person in the green and white outfit observes. The person in the dark outfit occasionally extends an arm or leg in the direction of the other person. The person in the green and white outfit holds their hands in front of them at chest height for the duration of the video. The video switches to a different angle, at which point the person in the dark outfit rotates in the middle of the room while the other person stands still, then switches back to the previous angle.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What is the emotional change experienced by the man in black after the man in white and green stripes blocks the move in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. From surprise to helplessness.",
+ "B. From surprise to extreme anger.",
+ "C. From surprise to excitement.",
+ "D. From helplessness to surprise."
+ ]
+ }
+ },
+ "EMovDXyK": {
+ "video_id": "EMovDXyK",
+ "video_duration": "145s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with curly hair, wearing a grey t-shirt, is holding a piece of meat in a clear plastic wrapper, standing in front of a black grill. The person removes the meat from the packaging and holds it in his hand and then uses a knife to cut it into smaller pieces. The person then places the meat pieces on the grill. He then puts salt on the meat and uses tongs to flip the meat over and then places the meat on a white plate. The person uses the knife to cut the meat on the plate and then uses the tongs to pick up a piece of meat. The person cuts into the piece of meat with the knife. He repeats this action for all the pieces of meat on the plate.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order of cooking meat shown in the video among the following options?",
+ "answer": "B",
+ "candidates": [
+ "A. Cut, sprinkle with salt, cook both sides, place on the grill, and cook both sides again.",
+ "B. Cut, place on the grill, cook by flipping, sprinkle with salt, and cook by flipping again.",
+ "C. Cut, sprinkle with salt, place on the grill, flip and cook, and sprinkle with salt again.",
+ "D. Cut, place on the grill, sprinkle with salt, flip and cook, and sprinkle with salt again."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, where is the salt located in the oven?",
+ "answer": "B",
+ "candidates": [
+ "A. In the front.",
+ "B. On the left.",
+ "C. On the right.",
+ "D. In the back."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "How does the man in the video catch the spilled salt?",
+ "answer": "D",
+ "candidates": [
+ "A. With a plate.",
+ "B. With a bowl.",
+ "C. Directly onto the grilled meat.",
+ "D. With his hands."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Why does grilled meat produce a lot of white smoke when turned over?",
+ "answer": "C",
+ "candidates": [
+ "A. Because a large amount of spices falls off the surface, producing smoke.",
+ "B. Because one side of the grilled meat becomes charred from overcooking, resulting in smoke production.",
+ "C. Because oil drips off, generating a significant amount of smoke.",
+ "D. Because the moisture on the surface of the meat evaporates, forming water vapor."
+ ]
+ }
+ },
+ "SIjnNIPr": {
+ "video_id": "SIjnNIPr",
+ "video_duration": "110s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a black screen with a small, white, horizontal icon in the upper right corner, composed of three short parallel lines. The image changes to a black screen with two thin, short, white lines in the top center. Next, the screen shows white text on a black background. The numbers 0745 are displayed on top, followed by the numbers 26/05/2020. The screen changes to a video of an angled bridge on top of a large barge floating on water, with trees visible in the background. A white text box is displayed in the bottom corner reading \"Credit: NTSB\". The next scene shows a damaged bridge with five bridge pillars, with text labels \"Pier 1\", \"Pier 4\", \"Pier 5\", and \"Navigation channel\", with arrows pointing to the relevant structures in the image. This is followed by an animated drawing of a barge and a bridge from the side view. A yellow car travels on top of the bridge, followed by a green truck, and an ambulance, all animated. The next screen shows an animated drawing of two cartoon figures, one in a gray outfit sitting, and the other in a green outfit standing. The sitting figure has a thought bubble saying \"Thank you for saving me\", and the standing figure replies in a thought bubble \"Anytime\". This is followed by an animated drawing of the barge and bridge, with a yellow car and a green truck on top, and a person in a blue outfit standing on the edge of the bridge, followed by an ambulance. The next screen shows a vintage television on a white table with a black background. There are three consecutive views of different angles of the damaged bridge as seen through the TV. Following that is a series of images showing white boxes drawn onto the black background with a white outline. In one scene, a hand gives a thumbs up. The white boxes then become filled with a solid blue color, before becoming white again. Lastly, a video is displayed of a damaged pier, with a barge visible in the background. A white text box is displayed in the bottom corner reading \"Credit: NTSB\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where is the person wearing a blue uniform in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. On a ship.",
+ "B. On the shore.",
+ "C. On the deck of Bridge No.4.",
+ "D. On the deck of Bridge No.2."
+ ]
+ }
+ },
+ "ejEIzXua": {
+ "video_id": "ejEIzXua",
+ "video_duration": "116s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video starts with a close up of many round, star, and heart-shaped balloons in varying colors including red, yellow, blue, green, and pink, with a pale blue background. It then transitions to a cartoon scene with a green background, showing various cartoon sea creatures including several baby sharks in different colors, a dolphin, an octopus, a crab, and several fish. There are two carts with purple roofs, one with a blue baby shark inside and the other with an ice cream stand. A striped blue cart also appears, filled with other sea creature characters and colorful balloons. A red-haired, child-like figure in a blue top is shown multiple times. Some additional animated hearts and other items appear throughout the video. The Baby Shark logo also appears at times.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the lanternfish do to the orange shark?",
+ "answer": "B",
+ "candidates": [
+ "A. Take a photo with it.",
+ "B. Take it away.",
+ "C. Hug it.",
+ "D. Shake hands with it."
+ ]
+ }
+ },
+ "SgOTJiLx": {
+ "video_id": "SgOTJiLx",
+ "video_duration": "448s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person writes the word \"Wholesome\" and then the word \"Roots\" on a chalkboard. Then, a person wearing a dark t-shirt, brown overalls, and a green patterned headscarf speaks to the camera. There are two bags of soil and a bucket containing dark soil behind them. Then a hand reaches into the frame and touches the soil in the bucket. The hand then points to a bag labeled \"Organic Perlite\". The hand then points to the bag labeled \"Organic Earthworm Castings\". The hand then points to a dark green watering can. Then the scene shifts to a close up of the bucket of soil, which is then watered from the can. Next, a hand uses a small gardening trowel to mix the wet soil. The person then tips the bucket into a blue tote, and then points to a cracked hinge on a table. Then the camera moves and the person pours perlite into a bucket. Then the person pours the earthworm castings into the bucket with the perlite, and then the camera focuses in on the person, holding a handful of the mixed soil and speaking to the camera. Then, they speak again, and then smile. Then the video ends.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the earthworm castings relative to the perlite in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Earthworm castings are to the left of the perlite.",
+ "B. Earthworm castings are to the right of the perlite.",
+ "C. Earthworm castings are in front of the perlite.",
+ "D. Earthworm castings are behind the perlite."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order in which these five actions occur in the video? (a) Add water and mix well, (b) Pour in the materials, (c) Add worm castings, (d) Add perlite, (e) Pour into a container with holes.",
+ "answer": "D",
+ "candidates": [
+ "A. (c) (b) (d) (a) (e).",
+ "B. (d) (e) (a) (c) (b).",
+ "C. (d) (b) (c) (e) (a).",
+ "D. (b) (a) (e) (d) (c)."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the action of adding water to the white bucket take place?",
+ "answer": "A",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. At the end of the video.",
+ "C. Throughout the entire video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "ROXVCobL": {
+ "video_id": "ROXVCobL",
+ "video_duration": "31s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A woman wearing a white blouse and black skirt walks into a room followed by a man in a white shirt and black jacket, and a boy in a white shirt and blue jeans. They pass in front of a white wall with a framed graphic and a light fixture. A boy walks away from the group and sits on a grey couch. A man holding a camera takes pictures. The image shows four people laying down on a white floor. The image shows three people laying down on a white floor. The image shows two people laying down on a white floor. A woman sits on a grey couch. A woman sits on a grey couch. A woman sits on a grey couch. A man, a woman, and a boy stand in front of a white wall. A man and a woman stand together. A man and a woman are in silhouette. Four people are in silhouette standing back-to-back. Four people are in silhouette standing back-to-back. A hand is shown sorting through articles of clothing in a suitcase. A young man is shown closing a suitcase while a woman is standing next to him. A young man wearing a white, navy and red jacket shakes hands with a boy. The young man is embraced by the woman. A young man and a man pull a suitcase along the street. A woman and a boy wave from a balcony. A woman and a boy wave from a balcony.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What happened after arriving home from filming?",
+ "answer": "C",
+ "candidates": [
+ "A. Mom is cooking.",
+ "B. The younger son is being mischievous.",
+ "C. The eldest son is packing his luggage to go out.",
+ "D. Dad is going on a business trip."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many sets of photos did this family take in the photography studio?",
+ "answer": "C",
+ "candidates": [
+ "A. 5 sets.",
+ "B. 10 sets.",
+ "C. 7 sets.",
+ "D. 3 sets."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Were props used to assist with posing during the shooting process?",
+ "answer": "B",
+ "candidates": [
+ "A. I don\u2019t know.",
+ "B. No.",
+ "C. Yes."
+ ]
+ }
+ },
+ "huniSiGf": {
+ "video_id": "huniSiGf",
+ "video_duration": "226s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Chemistry",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a blue lab coat and purple gloves is working in a laboratory. The person is using various laboratory equipment including a large flask with a filtration device on top, and a round-bottom flask connected to tubing, along with a beaker with a yellow liquid. At the beginning of the video, the person uses their gloved hand to move a white piece of paper towards them on the table. The person then uses their gloved hand to put a small, white container into the filtration device, and then pours the yellow liquid from the beaker into the white container. Then, the person puts a white solid in a small, transparent beaker and adds a white liquid to it, and swirls the beaker, before again working with the filtration device. The person uses a metal spatula to stir the white solid inside the filtration device, then removes the white container from the filtration device. They then put the white solid into a white tray before scooping it into a small glass bottle, using a metal spatula to move more of the substance into the bottle. The person is then seen holding a small, white container and stirring its contents. The person is then seen pouring its contents into a small tray, then scooping the solid into the small, glass bottle, using a metal spatula.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What does the filtration experiment shown in the video demonstrate?",
+ "answer": "D",
+ "candidates": [
+ "A. The experiment demonstrates that filtration can remove impurities from water and improve water quality.",
+ "B. The experiment demonstrates that a vacuum pump can speed up the filtration process.",
+ "C. The experiment demonstrates that the white powder is insoluble in water.",
+ "D. The experiment demonstrates that acetone can be used to clean other substances."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of the six steps shown in the video? (a) Dry the product, (b) Filter the reaction mixture, (c) Stir the mixture, (d) Transfer to a weighing dish, (e) Set up the filtration device, (f) Wash the product with acetone.",
+ "answer": "A",
+ "candidates": [
+ "A. (c) (e) (b) (f) (a) (d).",
+ "B. (b) (e) (f) (c) (a) (d).",
+ "C. (e) (c) (d) (a) (b) (f).",
+ "D. (b) (e) (d) (f) (a) (c)."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man pour the transparent liquid directly into the funnel?",
+ "answer": "B",
+ "candidates": [
+ "A. Five.",
+ "B. Three.",
+ "C. Two.",
+ "D. Four."
+ ]
+ }
+ },
+ "houTEfoV": {
+ "video_id": "houTEfoV",
+ "video_duration": "94s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Astronomy",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a dark background filled with stars. A long, narrow galaxy is visible, with a slight tilt from upper left to lower right. More stars are scattered across the background, and a couple of other more distant galaxies are seen in the upper left corner. Next, a man with curly hair and a beard, wearing a light colored shirt, is shown against a plain background. The words \"JUSTIN MYLES\" and \"Scientist, Dark Energy Survey\" appear in the bottom left corner. The video cuts to a dense cluster of stars. The video then shows a grid pattern overlaying an image of the night sky, with stars and galaxies, followed by an image of a dark silhouette of a mountain top with multiple observatories, and then an image that displays a white text graphic with a red line that reads \"THE DARK ENERGY SURVEY\" superimposed over the same image of a dark mountain with observatories. The video cuts to a woman wearing a red sweater and earbuds, with plants in the background. \"GIULIA GIANNINI\" and \"Scientist, Dark Energy Survey\" appear in the bottom left corner of the frame. The video shifts to a time-lapse of the night sky with multiple observatories in the foreground. The video shifts back to the night sky and shows a spiral galaxy. Next is another close up of the man from before and then back to the observatory. The video shows the same observatory under a bright blue sky. The camera then enters an observatory and shows a large, complex piece of machinery. The video then shows two observatories under a night sky and the words \"DARK ENERGY SURVEY\" \"YEAR 3\" superimposed over the image. The video then shows a series of graphics with a drawing of three yellow observatories on a gray hill with the words \"345 nights observed\". Then the same graphic appears with an image of a yellow telescope with a crescent moon and the words \"5000 deg\u00b2 of night sky\". 
The next graphic shows an image of three yellow swirling galaxies with the words \"226 million galaxies\". The video cuts to a woman wearing a blue gown and hair covering, looking into a large ring with a complex interior that contains a light. The video cuts to a woman with curly hair and a white top against a background with plants. The final scenes show two men standing inside an observatory, observing a black screen, then a yellow metal object rotates, followed by a view from the outside of a dome opening. The video ends by showing the observatories under a starry night sky.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the male voice in the video pronounce words?",
+ "answer": "A",
+ "candidates": [
+ "A. Twice.",
+ "B. Once.",
+ "C. Three times.",
+ "D. Four times."
+ ]
+ }
+ },
+ "rDkGognn": {
+ "video_id": "rDkGognn",
+ "video_duration": "53s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a woman with long brown hair wearing a white t-shirt and black pants. She stands in front of different background images. Some backgrounds show a dark blurry city, a vehicle on a rocky surface with a night sky, a vehicle on a surface with glowing orange tracks, a vehicle with a yellow sky, a fiery image, a red rocky terrain with a meteor, a red planet with beams of light, a chart showing driving distances on Mars and the Moon, and an image with circular lines. The woman gestures with her hands as she speaks. There are also words displayed on the screen above the woman at various points in the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video feature a Mars rover?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No.",
+ "C. Yes."
+ ]
+ }
+ },
+ "fqobruRt": {
+ "video_id": "fqobruRt",
+ "video_duration": "29s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Multiple race cars are shown driving on a racetrack. A text overlay reads, \"February 15, 1998 DAYTONA.\" A race car with the number 3 drives on the racetrack. A man is shown working on the racecar. The same race car is shown driving slowly on the track with several people in a row reaching out to touch it. Several men are shown embracing. A man is shown wearing a red cap and is interviewed with a microphone. Two men are shown opening champagne bottles.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What type of location is shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. City streets.",
+ "B. A highway.",
+ "C. A car race track.",
+ "D. A rally race track."
+ ]
+ }
+ },
+ "FSrCIICc": {
+ "video_id": "FSrCIICc",
+ "video_duration": "168s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Hiking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A single person wearing a gray flat cap, a black puffer jacket, and a gray backpack is visible in a wooded area. The person's face is framed in the foreground of the shots, while the forest floor, trees, and a narrow dirt path are visible behind them. At one point, a close up shot focuses on the tip of a hiking stick covered in dirt. Later the person is seen picking up a small black clip that appears to be a part of the backpack. The person moves through the trees and along the dirt path. The camera remains focused on the person in each shot.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man in the video share his personal experiences while on the road?",
+ "answer": "C",
+ "candidates": [
+ "A. Because he wants to reach Cannon Hall Farm quickly.",
+ "B. Because he enjoys the feeling of hiking.",
+ "C. Because he cannot find a suitable place to sit and share.",
+ "D. Because he needs to rely on objects along the way to help share his experiences."
+ ]
+ }
+ },
+ "tZtjazpa": {
+ "video_id": "tZtjazpa",
+ "video_duration": "65s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Physics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with long brown hair, wearing a blue shirt with white and red floral patterns, places a white ball at the top of a blue inclined track. The ball rolls down the track. The person catches the ball and places it back at the top of the track. The ball rolls down the track and impacts a piece of white paper hanging from a wooden frame. The person replaces the white paper with another sheet of white paper and repeats the process. The person then places a black covering on top of the paper. They use a pencil to mark the locations where the ball impacts the paper. The person then uses a pencil to draw lines on the paper.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the man with long curly hair doing in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The man with long curly hair in the video is conducting an experiment to measure the impact force of concrete balls sliding along a track.",
+ "B. The man with long curly hair in the video is conducting a distance measurement experiment.",
+ "C. The man with long curly hair in the video is conducting a projection motion experiment.",
+ "D. The man with long curly hair in the video is conducting an experiment to measure the speed of concrete balls sliding down a track."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why are circles drawn on the paper?",
+ "answer": "B",
+ "candidates": [
+ "A. To record the trajectory of the concrete ball.",
+ "B. To enable faster experimentation.",
+ "C. To indicate the position of the paper.",
+ "D. To make the data points more noticeable."
+ ]
+ }
+ },
+ "sCbtRgHy": {
+ "video_id": "sCbtRgHy",
+ "video_duration": "114s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person puts a black device on their arm. The person puts a small black object in their pocket. The person attaches two black, circular devices with red dots to their belt. A hand picks up a small black device off the ground. A girl sits in front of a screen and talks into a walkie talkie. A boy walks down stairs with a walkie talkie in his hand. The boy opens a door and walks into the doorway and looks at the girl. The girl is shown again in front of the screen. The boy in red walks into the doorway. The boy walks down the hallway, past a mirror and photos on the wall. The girl is again looking at the screen. The boy comes back into the room and looks in a closet. He grabs a box out of the closet. Two of the black circular devices are shown again on a pair of blue jeans, and a hand touches one. The person again has both devices in their hands, and they walk out of the room, and the camera shows a black round object rolling on the carpet. The black circular object is shown again on the carpet. The girl is in a room with lights that are out of focus. A woman\u2019s face is shown in profile, then again at an angle. A boy is seen talking underneath a bed. A shot of the two circular devices is shown again. The boy is seen running from behind the bed into the hallway. The girl is seen again, and then again in profile in a door way. She is seen speaking. The boy is shown speaking to the girl in a doorway.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the action of handing over the walkie-talkie occur?",
+ "answer": "D",
+ "candidates": [
+ "A. Throughout the entire video.",
+ "B. At the end of the video.",
+ "C. In the middle of the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "YUMRCeAK": {
+ "video_id": "YUMRCeAK",
+ "video_duration": "16s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person's hand places a green drink on a counter. The camera pans across the counter to the right showing multiple items on the counter, including a white paper with text. The camera then rotates and the view changes to a metallic shelving unit containing clear plastic bags filled with items. The camera rotates again, showing a bicycle rack holding a bicycle above the person and the top of a metallic object. The view then changes to a person holding a green drink standing outside on a walkway and speaking. They take a drink from the beverage. The view shifts to the ground on the walkway, where a person is placing a foot onto a skateboard. The person is then seen riding the skateboard. The view returns to the person on the walkway who is still holding the green drink, and is speaking.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What technique does the man in the video use to navigate the obstacle on his skateboard?",
+ "answer": "B",
+ "candidates": [
+ "A. Jump over it while skateboarding.",
+ "B. Tilt the skateboard to an angle.",
+ "C. Dismount from the skateboard.",
+ "D. Allow the skateboard to slide with the wheels facing straight."
+ ]
+ }
+ },
+ "uVuESdIK": {
+ "video_id": "uVuESdIK",
+ "video_duration": "260s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a white hoodie, black pants, a dark colored baseball cap, and a backpack, enters a room. A person is sitting on a couch in the room wearing a pink shirt and light colored pants. The person in the white hoodie approaches the couch, leans over the person sitting, and then moves away from the couch. The person in the white hoodie can be seen carrying a gray-colored rectangular object. The person wearing the white hoodie then is seen holding a dark-colored suitcase. This person is seen talking to the camera. A person is seen standing in an open window wearing an orange top and dark pants, then is seen driving in a car wearing a black hat. The person in the white hoodie can be seen talking to the camera. The person in the white hoodie then goes to sit on the couch with the other person. They are both seen using controllers. The person in the white hoodie is seen again talking to the camera. They are both then seen at a gym, which is full of workout equipment. Both people then are seen in a dark room, where workout machines are being used with a screen visible in the background. The person in the white hoodie is then seen entering through a door into the gym with a sign on the door. This person is then seen doing leg exercises with workout machines. The person in the white hoodie and the other person are seen leaving the gym. The person in the white hoodie is seen holding a water bottle, and is seen entering a car. Back inside the first room, the person in the white hoodie is seen wearing red plaid pajamas and a baseball cap, and walks up to the other person sitting on the couch. The two sit on the couch and are seen on a large TV selecting a character in Mario Kart game. Both are seen playing the game. Then, a chart is shown ranking the players. The two people are seen playing the game and laughing. The person in red plaid pajamas is then seen talking to the camera, and says the other person started the game without them. 
The person in the red plaid pajamas can be seen dancing on the floor. The two people are then seen talking on the couch. The time 2:35 is shown on a timer of some kind. A person with a beard wearing a blue hoodie can be seen standing next to a counter where there is a sink. Then, the three people can be seen on the couch, the person in the middle has a long red piece of fabric across his legs. The person in red plaid pajamas is seen dancing in place again, and the person with the beard gives him a hug. Then all three are seen sitting on the couch, laughing.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order did these four events occur in the video? (a) Playing a Mario game, (b) Going to the home of the lady in a pink shirt, (c) Going to the gym, (d) Meeting the lady in a gray shirt.",
+ "answer": "D",
+ "candidates": [
+ "A. (d) (b) (a) (c).",
+ "B. (b) (c) (d) (a).",
+ "C. (d) (b) (c) (a).",
+ "D. (b) (d) (c) (a)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In the video, where is the scene set when the man asks \"Why do cardio when you can vape instead?\"?",
+ "answer": "A",
+ "candidates": [
+ "A. Inside the car.",
+ "B. Living room.",
+ "C. Gym.",
+ "D. Entrance to the gym."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "On which side of Mario wearing a green hat is the small icon of Mario wearing a red hat located in the video game?",
+ "answer": "C",
+ "candidates": [
+ "A. Mario wearing a red hat is to the right of Mario wearing a green hat.",
+ "B. Mario wearing a red hat is below Mario wearing a green hat.",
+ "C. Mario wearing a red hat is to the left of Mario wearing a green hat.",
+ "D. Mario wearing a red hat is above Mario wearing a green hat."
+ ]
+ }
+ },
+ "sQXTYOqF": {
+ "video_id": "sQXTYOqF",
+ "video_duration": "126s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "A group of people is initially seen posing for a photo using a smartphone on a tripod. A woman in a red coat then moves from her seated position, followed by a man in a dark jacket who walks in front of the tripod. Next, the group is gathered behind some chairs, and they suddenly drop one of the men to the floor. He is then assisted back up and another person falls on the floor. This repeats several times, with individuals on the floor being assisted back up, while others fall. The video then transitions to another room with a Christmas tree and many wrapped presents. A woman in a gray sweater is holding a phone and is speaking to a woman in a red coat. The woman in the gray sweater opens a gift bag. A woman in a black sweater with red pants laughs. A woman in a red coat removes an item from the gift bag. Another woman in red speaks. A woman is seen holding a shopping list, while another woman holds a clear glass. A woman wearing a red sweater and a gray scarf is holding and examining a small wrapped gift. A man is seen in a brown sweater talking on his phone. Another woman in a red coat opens a present and is holding a pair of green stretchy undergarments, and she puts them around her waist, showing them off to the other people. There is an edited version where text appears on the screen that says \"Sherri Sagum\" while the woman is showing off the undergarments. The video transitions to a person in an orange jacket in the kitchen, using a white spray bottle. The first woman in red is back and speaking to the camera. The man with a brown sweater is opening a present. A box reveals an inflated balloon that looks like a sandwich, then a floral object, and then a small gift, followed by a small toy of a man in a box that has a warning symbol on it. The box is then opened to reveal a small toy that the man then gives to the teenager wearing a black shirt that reads \"NOEL DAYS OFF.\" The toy man is held upside down by the teenager.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What colors are the shirts of the people in the kitchen wearing in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Orange, gray.",
+ "B. Gray, red.",
+ "C. Orange, white.",
+ "D. White, red."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What does the man wearing black shorts do while taking a photo at the beginning of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. He hides.",
+ "B. He stands behind.",
+ "C. He leans against the wall.",
+ "D. He sits on the bench."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotion of the girl in the red coat when she opens the gift?",
+ "answer": "D",
+ "candidates": [
+ "A. Indifferent.",
+ "B. Excited.",
+ "C. Happy.",
+ "D. Sad."
+ ]
+ }
+ },
+ "OtfiBlnB": {
+ "video_id": "OtfiBlnB",
+ "video_duration": "137s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A space shuttle stands on the launchpad, surrounded by a metal framework. White smoke or vapor emerges from the base of the shuttle. Flames ignite beneath the shuttle, growing more intense. The shuttle begins to ascend, leaving the launchpad and framework behind. The scene changes to a wider view of the launch, showing spectators gathered near a body of water, watching the shuttle rise. A large digital timer in the foreground displays the elapsed time since liftoff. The shuttle climbs higher, leaving a trail of white smoke against the clear sky.The camera angle shifts to a close-up of the shuttle's engines, viewed from below. The engine nozzles glow brightly as they propel the shuttle upwards. The background darkens as the shuttle gains altitude.The camera angle changes again, showing the shuttle from the side, still ascending and leaving a long, fiery trail. The shuttle gets smaller as it continues to climb. The sky gradually turns dark. The flame from the engines appears more defined against the darkened sky. The shuttle shrinks to a small point of light with a bright flame trail.The camera perspective changes once more, now showing the shuttle from behind, still firing its engines. The shuttle is tilted slightly and is moving against a dark, almost black, sky.As the shuttle continues upwards, a white plume of exhaust expands behind it. The engine glow diminishes, leaving only the white plume and a small orange glow visible. The shuttle appears smaller within the expanding plume of smoke. Two bright, separate flares appear behind the main plume. The solid rocket boosters separate and fall away from the shuttle, appearing as two small, bright flames falling downwards.The scene changes to a view of three distinct lights against the dark sky. The central light, presumably the shuttle, continues to rise, while the other two, presumably the solid rocket boosters, arc downward. The two downward-arcing lights gradually fade and disappear.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotions did the audience express about the rocket launch?",
+ "answer": "C",
+ "candidates": [
+ "A. Disappointment.",
+ "B. Aversion.",
+ "C. Joy.",
+ "D. Worry."
+ ]
+ }
+ },
+ "WFALCVgv": {
+ "video_id": "WFALCVgv",
+ "video_duration": "240s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a plaid shirt uses a small, black tool to crimp a metal fitting onto a length of metal cable. The person rotates the tool around the fitting and squeezes the handles repeatedly. The camera then zooms out to show the person standing at a metal workbench in a garage. There are various tools and parts on the workbench. Behind the person is a classic red and white car. The person picks up a short metal rod and examines one end, which appears to be threaded. They then use a pocketknife to scrape or shave a thin layer off the rod\u2019s outer surface near the threaded end. They insert the rod into a larger, silver tool on the workbench. Next, they pick up the cable again and insert it through a hole in the side of a small, silver tool, and continue to rotate the tool around the cable near the end. The camera zooms in as they repeatedly squeeze the tool\u2019s handles. They then insert the metal rod into a slot on another tool clamped to the workbench. One hand holds the rod while the other hand turns a lever-like handle attached to the tool. This appears to tighten the tool around the rod. They continue to rotate the lever several times, tightening it further. The person then turns the tool around to show the opposite end, and rotates another lever on this end several rotations. They then remove the rod from the tool. The person now picks up the metal cable again, and inserts it through a hole in the tool attached to the bench clamp. They insert the rod into another slot on the tool and screw down a knurled knob on top of the tool with their fingers. The knob presses down against the rod. Next, they insert a metal rod into the tool attached to the bench clamp. One hand holds the tool while the other hand inserts a metal lever through a hole on the side and rotates it. This appears to be tightening the tool around the rod. They continue to rotate the lever until the handle is roughly parallel with the rod, and then remove the rod from the tool.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many types of tools were used to create the items in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Six types.",
+ "B. Four types.",
+ "C. Five types.",
+ "D. Three types."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What tool does the man use in the video to remove the cutting burrs?",
+ "answer": "B",
+ "candidates": [
+ "A. A saw.",
+ "B. A dagger.",
+ "C. A rasp.",
+ "D. A plier."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence for making flexible tubes as shown in the video? (a) Cutting the tube, (b) deburring, (c) installing the fitting nut, (d) adjusting the length of the tube extending out of the mold, (e) flaring, (f) pushing in the complete flared fitting.",
+ "answer": "A",
+ "candidates": [
+ "A. (a) (d) (e) (b) (c) (f).",
+ "B. (a) (c) (d) (e) (b) (f).",
+ "C. (a) (c) (d) (b) (e) (f).",
+ "D. (a) (b) (c) (d) (e) (f)."
+ ]
+ }
+ },
+ "wbOpeowX": {
+ "video_id": "wbOpeowX",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man holds a frying pan over a gas burner. He places the pan on the burner and pours oil from a bottle into it. He picks up a fish fillet and places it skin-side down in the pan. He sprinkles salt and grinds pepper onto the fish. He flips the fish over with a knife and adds butter and thyme sprigs to the pan. He spoons hot oil over the fish and flips it again, skin-side down. He then plates the fish with a serving of cooked vegetables alongside. He squeezes lemon juice over the dish, cuts into the fish with a knife and fork, takes a bite, and savors the flavor.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What is the purpose of squeezing lemon juice?",
+ "answer": "C",
+ "candidates": [
+ "A. To make the fish firmer.",
+ "B. To make the fish more tender.",
+ "C. To make the fish skin crisper.",
+ "D. To alter the flavor."
+ ]
+ }
+ },
+ "VIJpIFKQ": {
+ "video_id": "VIJpIFKQ",
+ "video_duration": "91s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with an outdoor shot of a body of water, surrounded by trees. The scene changes to a person wearing an orange long-sleeved shirt standing on a wooden dock. The person is holding a small white and blue object. The video then shows a wider view of the body of water, which is surrounded by docks with boats. The person in orange casts a fishing line into the water. The video shifts perspective and a fish jumps out of the water. This repeats several times, with fish jumping out of the water as the person in orange is fishing. At one point, the person in orange can be seen standing on the dock and reeling in a fish. After a few more shots of fish jumping out of the water, the video shows a shot of boats docked in the water. Then the video shows a view from a car driving past an open body of water, before switching back to an outdoor shot of the person in orange fishing again in another outdoor setting of a body of water surrounded by trees. The person in orange reels in a fish. Then the person is in the water, holding a fish. The person removes the fish from the hook, and the video ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What type of fish took the bait in the video when the man wearing orange was fishing?",
+ "answer": "C",
+ "candidates": [
+ "A. Pompano.",
+ "B. Bass.",
+ "C. Redfin snapper.",
+ "D. Marlin."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional change does the man in orange experience after the fish takes the hook?",
+ "answer": "D",
+ "candidates": [
+ "A. Gradually becomes tense.",
+ "B. Gradually becomes afraid.",
+ "C. Gradually becomes more excited.",
+ "D. Gradually becomes happier."
+ ]
+ }
+ },
+ "NtMdPseK": {
+ "video_id": "NtMdPseK",
+ "video_duration": "320s",
+ "duration": "4-6min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video is a first-person view of a Minecraft game. Initially, the player is standing on a flat surface made of blocky white and gray textures, with water all around. A message reading \"ALERT! The BRIDGES ARE SPAWNING!\" appears on screen. Blocks of blue and gray then generate in the water, creating pathways to other land. The player walks across these newly-generated pathways. There are other players visible, some wearing red and white blocky armor and others wearing more rudimentary blocky clothes. Other blocky figures, which might be trees, can be seen around the edges of the various landmasses. The player appears to fight several other players in close quarters, sometimes with swords and sometimes with pickaxes. A red liquid effect is visible as they hit each other. The player also uses a special attack labeled \"Berserker Leap.\" The player walks across a larger island, then goes down into a dark stone area. The player continues to fight other players here, using weapons and a special leap ability. Several chests are visible in the dark underground area, and the player opens and loots at least one of them. The player then ascends back to the surface, fighting along the way. The player continues to fight a few more players before ascending to a platform, with an announcement reading \"Green won the game!\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is Nez's profession in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Student.",
+ "B. Broadcaster.",
+ "C. Teacher.",
+ "D. Host."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What occurred when Nez attempted to eliminate the last member of the blue team?",
+ "answer": "B",
+ "candidates": [
+ "A. Fell into the water.",
+ "B. Fell from a height.",
+ "C. Fell into a cave.",
+ "D. Hit by a block."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What is the fate of the yellow team character in the underground mine depicted in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Killed by the yellow team character.",
+ "B. Killed by the red team character.",
+ "C. Killed by the green team character.",
+ "D. Killed by the blue team character."
+ ]
+ }
+ },
+ "fVwwlLFH": {
+ "video_id": "fVwwlLFH",
+ "video_duration": "136s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two fighters, one Black and one white with a pink mohawk, face each other in a cage. The referee signals the start of the match. The white fighter throws a kick. The Black fighter takes the white fighter down. They grapple on the ground. The Black fighter takes the white fighter's back and gets one hook in. The white fighter rolls, but the Black fighter maintains back control and sinks both hooks in. The Black fighter appears to attempt a rear-naked choke. The referee stops the fight. The fighters stand up. The referee brings the fighters to the center of the cage and raises the Black fighter's arm. The Black fighter circles the cage, and the white fighter walks towards his corner.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "When the referee separated the two men, how many punches did the bald man land on the red-haired man?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. One.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will the competition continue after the video ends?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "KJuuMzVZ": {
+ "video_id": "KJuuMzVZ",
+ "video_duration": "57s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A woman with long, curly hair stands center stage, wearing a short black dress and a sash. Two other women in similar black dresses flank her, positioned slightly behind. The central woman places her hands on her hips, then extends her arms outwards. The scene transitions to the same woman standing onstage beside a man in a suit. He holds a microphone to her and a small, flat object, possibly a plaque or award, in his other hand. She speaks into the microphone. The background features red and white draped fabric and palm fronds. The woman continues speaking into the microphone as the man listens, occasionally shifting his weight. Text overlays are present in the bottom corners.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Did a microphone appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. No.",
+ "B. None.",
+ "C. I don't know.",
+ "D. Yes."
+ ]
+ }
+ },
+ "qzXPKVgW": {
+ "video_id": "qzXPKVgW",
+ "video_duration": "32s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Brock Lesnar and Goldberg face off in a wrestling ring. Goldberg lifts Lesnar and slams him to the mat. Brock Lesnar walks down a red-lit ramp, accompanied by another man in a suit. Lesnar lifts Goldberg and carries him, then slams him down. Lesnar pins Goldberg. Lesnar raises a championship belt above his head in victory, while Goldberg lies on the mat. Lesnar crawls on the mat, facing away from Goldberg, who stands over him with his arm raised. Lesnar lifts Goldberg onto his shoulders. Lesnar pins Goldberg.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What changes will occur for the white guys with bowl cuts after their match in WrestleMania 33?",
+ "answer": "D",
+ "candidates": [
+ "A. Achieve a higher ranking.",
+ "B. Become wealthier.",
+ "C. Inspire others.",
+ "D. Popularity increases."
+ ]
+ }
+ },
+ "fBtekRGw": {
+ "video_id": "fBtekRGw",
+ "video_duration": "98s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A white line appears on a black background that shows an oscillating pattern, as of an electrocardiogram reading, and then the line descends. A white line appears in the black background at an angle descending from the top left to the bottom right. A series of drawings appear on the white line: First, a person wearing a baseball cap, carrying a backpack, and holding a camera. Next, a person wearing a hat and white outfit with a staff, follows the first person. Then, a third person appears wearing a white outfit and hat holding a stick. Then a final person appears holding a child. The next scene is of a person in a chair using a phone while eating a hamburger, and a graphic heart is displayed on the person's upper chest as well as on a monitor on the wall. There is a drawing of shelves with multiple jars or boxes all decorated with a red heart symbol. The shelf contains a drawing of a pig. The scene moves back to a drawing of a person eating a hamburger and using a phone, while an image of a heart is displayed on the wall. The scene moves to a depiction of a refrigerator with the door partially open, and a pink pig is seen inside along with several food packages that are each decorated with a red heart symbol. The next scene is of a refrigerator with a display on the outside door indicating \"100%\", \"182%\" and \"92%\" along with what appears to be a stylized person, DNA helixes, and rows of small brown bags. The next scene focuses on the bags, and shows text which says, \"GM soy\". The screen is briefly blank white, before a single green ear of corn appears. Next, some brown bags appear. The text, \"GM corn\" and a DNA helix appear. The next scene shows a group of people walking away, and a plane on the horizon, with a drawing of a pig on the tail of the plane. The next scene is of a large machine driving over a landscape with written text. The next scene shows an assembly line where pigs are being put into packaging. A final scene is of the refrigerator with a digital display, before switching to a graphic that shows a picture of two globes, with red humanoid shapes overlaid on each globe. The scene zooms in and changes to red humanoid figures and red pig figures. The next scene is a graphic that depicts black cars in formation, some with visible drivers.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "How effective is genetic engineering in boosting food production?",
+ "answer": "D",
+ "candidates": [
+ "A. Soybean production decreases, and corn production decreases.",
+ "B. Soybean production increases, while corn production decreases.",
+ "C. Soybean production increases, and corn production increases.",
+ "D. Soybean production decreases, while corn production increases."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "What is missing from the fridge in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Corn.",
+ "B. Potato.",
+ "C. Hamburger.",
+ "D. Banana."
+ ]
+ }
+ },
+ "sbBtBpsT": {
+ "video_id": "sbBtBpsT",
+ "video_duration": "42s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with no shirt on is crouching on a counter in what appears to be a derelict bathroom. There is blood visible on his body and the sink. A separate scene shows a police officer talking on a phone in a dark room with other people in the background. The next scene shows a computer screen displaying an abstract diagram with a control panel next to it with green and red lights. The next scenes show 3 men in a dark room, where a man with a white sweatshirt is visible at a machine. Another man in a suit stands near the entrance to the room. Then, the man in the suit is standing looking directly at the camera. The last scenes show the 3 men, then just the man in the suit again.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order do the following four events occur in the video? (a) The screen appears, (b) the lights turn on, (c) the high-tech doors open, (d) the glowing box appears.",
+ "answer": "D",
+ "candidates": [
+ "A. (a), (b), (c), (d).",
+ "B. (b), (c), (a), (d).",
+ "C. (b), (a), (d), (c).",
+ "D. (b), (a), (c), (d)."
+ ]
+ }
+ },
+ "VtYRqYwy": {
+ "video_id": "VtYRqYwy",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows two microscopes side-by-side, one labeled \"Transmitted Light Microscopy\" and the other labeled \"Fluorescence Microscopy\". A white light beam shines from the base of the transmitted light microscope, passes through the sample, then travels upwards to the eyepiece. A teal/green light beam shines from the side of the fluorescence microscope down towards the sample. This light passes through a cube-shaped filter and then continues downwards towards the sample. Light from the sample then travels upwards towards the eyepiece, and a dichroic mirror is shown angled in the light path, reflecting the upward travelling light beam towards the eyepiece.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What happens to the laser light after it passes through a filter in fluorescence microscopy?",
+ "answer": "C",
+ "candidates": [
+ "A. It passes directly through the filter.",
+ "B. The filter completely absorbs the laser light.",
+ "C. It produces different wavelengths for the sample.",
+ "D. It produces the same wavelength for the sample."
+ ]
+ }
+ },
+ "BxDKxiTA": {
+ "video_id": "BxDKxiTA",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a person wearing a dark vest, a dark hat with a bear on it, and glasses, sitting at a round wooden table. There is an open bottle of wine and a wine glass on the table. The person picks up the wine bottle, and pours wine into the glass. The camera focuses on a book with \"TOTAL ECLIPSE OF THE SUN 2024\" printed on it. The person pours wine into another wine glass. The video then cuts to various outdoor street scenes with many people and cars visible. A road sign is visible, with text reading \"Kerrville\" and the names of various roads. A person wearing a pink patterned dress is shown outdoors. The video shows more people walking on a sidewalk.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Before the man pours wine, what is the zebra painting's position relative to him?",
+ "answer": "D",
+ "candidates": [
+ "A. The painting is to the man's front right.",
+ "B. The painting is to the man's front left.",
+ "C. The painting is to the man's rear right.",
+ "D. The painting is to the man's rear left."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What might be the occupation of the woman wearing a floral skirt in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A bartender.",
+ "B. A journalist.",
+ "C. A winery owner.",
+ "D. A photographer."
+ ]
+ }
+ },
+ "OEFiEVUv": {
+ "video_id": "OEFiEVUv",
+ "video_duration": "90s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A woman wearing a green top holds a red plug with a red cord, a man with a black top holds a red circle with the number 700 and a blonde woman behind him. A man wearing a grey top holds a red socket with a red cord. Below these individuals there is a laptop screen with a man wearing an orange top speaking. Two people sit on each side of the laptop and they are wrapped in white blankets with mugs next to them. A factory with a chimney is to the right and the background is a blue sky with a row of blue mountains. A figure of a man wearing a yellow helmet and blue clothes is by an electrical box. Green money symbols fly around a red circle with the number 588 and euro symbol. The video fades to an image of only the laptop screen and the two people wrapped in blankets. The man is shown within the laptop screen speaking, gesturing and moving his head and hands.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What action did the man take in the video while discussing government fraud?",
+ "answer": "A",
+ "candidates": [
+ "A. He pointed at the screen with his finger and then pointed to himself.",
+ "B. He picked up the cup next to him and slammed it on the table.",
+ "C. He did not make any discernible action.",
+ "D. He jumped up in a fit of rage."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the main emotion expressed by the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Frustration and sadness.",
+ "B. Anger and disappointment.",
+ "C. Excitement and negativity.",
+ "D. Happiness and excitement."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man in the video clap?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. One.",
+ "C. Three.",
+ "D. Four."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the animation in the video disappear?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the entire video.",
+ "B. In the middle of the video.",
+ "C. At the end of the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "nKEnpTcZ": {
+ "video_id": "nKEnpTcZ",
+ "video_duration": "305s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows several scenes, often presented in a picture-in-picture format. In some scenes, a classroom with wooden desks is shown with people walking through an open door. Later, a similar classroom with students at the desks is shown on a monitor screen. In other scenes, a man in a light-colored shirt with glasses is in front of a whiteboard and a green chalkboard. Two people are sitting at desks facing him. The chalkboard reads \"CLASSROOM MANAGEMENT\" in white letters. A monitor on the wall shows video clips of other classrooms with students in them. In many of these scenes, the man speaks and gestures with his hands. In a couple of the scenes, a black wire with a microphone is clipped to his shirt. In some of the scenes, three people are sitting at desks, facing the camera. They each have papers in front of them. These people include the same woman seen previously in the green top and another woman wearing a black top. In other scenes, a woman is in a classroom with student desks. She is holding a book. She gestures with the book. Many young students are sitting at desks, and most of the students are wearing red and blue plaid shirts.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who in the video suggested considering students' vision issues and properly arranging the lights and the placement of the blackboard in the classroom?",
+ "answer": "B",
+ "candidates": [
+ "A. A man in a white shirt.",
+ "B. A woman in green.",
+ "C. A woman in black.",
+ "D. A man in a blue shirt."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order discussed by the teachers in the video regarding the classroom setup? (a) The correct placement of the blackboard. (b) The order of seating arrangement in the classroom. (c) The installation location of basic facilities.",
+ "answer": "C",
+ "candidates": [
+ "A. (a) (c) (b).",
+ "B. (a) (b) (c).",
+ "C. (b) (c) (a).",
+ "D. (c) (a) (b)."
+ ]
+ }
+ },
+ "smCyEVbM": {
+ "video_id": "smCyEVbM",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a woman with blonde hair and a light blue shirt in the center of the frame. Behind her is a bulletin board with various images and writings. A pink graphic in the top left says \"SUBSCRIBE\". The word \"BLOOPERS\" appears with the handle \"@BROOKESBEAUTYBAZAAR\" underneath, both overlayed in light grey on the lower part of the frame. The woman looks down, then looks up, and is then seen talking. There are occasional shifts in her facial expression while she talks. She picks up a small cylindrical silver can. The video then cuts to a screen with the woman to the right and various graphics and text. She now has a green floral patterned dress. The background is yellow, pink and red. The words \"Watch More\" and \"Subscribe\" appear. There are two hand icons pointing to the left and two gold stars, and a lightning bolt. The letters \"bb\" are overlayed, also in white.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the woman wearing the blue shirt laugh at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. She does not know the content of the script.",
+ "B. She sees a funny video.",
+ "C. She hears a funny story.",
+ "D. She wants to release stress by laughing."
+ ]
+ }
+ },
+ "JxXbveQd": {
+ "video_id": "JxXbveQd",
+ "video_duration": "32s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Fitness",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a black screen, followed by footage of multiple cheerleading teams performing stunts in an indoor arena with red floors and rows of red seating in the background. Some team members wear black and white uniforms, while others wear red and black, navy and white, green and white, or orange and white. The video shows various cheerleading formations, including people being lifted into the air. Next, a light-skinned woman wearing a blue polo shirt with a logo speaks in front of a light blue background. Then, more cheerleading footage is shown, including stunts with flyers in the air. This is followed by footage of the same woman speaking again, in front of the same light blue background. The video then cuts to outdoor shots of campus buildings and a flag pole. Then, a man wearing a white coat shows another man multiple images on two computer screens. Finally, the video shows the woman speaking again, followed by another campus shot.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What type of location is depicted in the video where the girls perform stunts?",
+ "answer": "A",
+ "candidates": [
+ "A. Stadium.",
+ "B. Gymnasium.",
+ "C. Basketball court.",
+ "D. Tennis court."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What might be the relationship between the girl in the blue short-sleeved shirt in the video and William Jessup University?",
+ "answer": "D",
+ "candidates": [
+ "A. The girl might be a student at the university.",
+ "B. The girl might be the president of the university.",
+ "C. The girl might be the parent of a university student.",
+ "D. The girl might be a coach at the university."
+ ]
+ }
+ },
+ "VpgqgMjl": {
+ "video_id": "VpgqgMjl",
+ "video_duration": "93s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a distorted American flag with a blue line stripe. The flag becomes clear with the blue line stripe. The words \u201cBLUE LINE PREDATORS\u201d appear above the flag, and the text \"The PEOPLE are Watching YOU!\" appears below. The next scene shows a man wearing a suit and tie in a news studio with multiple monitors in the background. The banner at the bottom of the screen reads \u201cTARGET 12 INVESTIGATORS\u201d and \u201cTim White\u201d and \u201cCOVERAGE YOU CAN COUNT ON\u201d. The man is speaking. The video shifts to a courtroom scene with two men wearing light-colored suits on the left and right, and a man with a shaved head in the middle. The video then cuts to a man wearing a black robe sitting at a bench in a courtroom. The scene shifts back to the three men in the courtroom. A man wearing a red shirt and backward black baseball cap is shown standing in front of a brick wall speaking. The video then shows an empty street, followed by a red brick building. There is a close-up shot of a window with bars on it. The video shifts back to the man in the red shirt, who is being interviewed by a woman holding a microphone. The next shot is a close-up of a man's eye and cheek being held in someone's hand. The same shot shows the man closing his eye. The video then shows the man in the red shirt being interviewed. Next is a courtroom scene with the man with a shaved head. A statue of a man wearing a police uniform is shown. A building that says \"RHODE ISLAND STATE POLICE\" on it is shown. Two men are at a press conference, with one speaking. The next scene shows a woman standing in front of a building with a plaque that says \"RHODE ISLAND DEPARTMENT OF THE ATTORNEY GENERAL\". The video then shows the same courtroom scene as before with the man with the shaved head. The video cuts to a building saying \"RHODE ISLAND STATE POLICE PUBLIC SAFETY COMPLEX\". A shot is shown through the inside of a car towards a sign that says \"RHODE ISLAND STATE POLICE\". A quote appears on the screen that reads, \"This has caused grave concern to all Troopers and their families, since it obviously opens all of us to serious personal liability while facing the ever-increasing hazards required by our jobs.\" Another quote appears on screen that reads, \"Committing the crime of assault upon a prisoner - or anyone - is outside of the course and scope of the duties of a state trooper, and the taxpayers should not have to defend or pay for the criminal actions of Donnelley-Taylor.\" The video then shifts to the man in the blue suit speaking in front of a blue background. The video cuts to another man speaking. Finally, the video shows the man in the suit speaking from the news studio at the beginning of the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In which country does the event shown in the video take place?",
+ "answer": "C",
+ "candidates": [
+ "A. Germany.",
+ "B. France.",
+ "C. The United States.",
+ "D. The United Kingdom."
+ ]
+ }
+ },
+ "ZQtYufWS": {
+ "video_id": "ZQtYufWS",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Historical Analysis",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a timeline of events displayed horizontally on screen. The timeline contains photographs, names of months, and text descriptions of events. The first image shown is labeled February and the text below reads \u201cYalta Conference.\u201d The next image, also labeled February, is below the text \u201cBombing Dresden\u201d. The next image is labeled March and is below the text \u201cCrossing the Rhine.\u201d This image shows three people. The following image is labeled April and the text reads \u201cSoviets Take Berlin\u201d. This image shows tanks and wood rubble. The next image is labeled \u201cMay 8\u201d and the text reads \u201cV.E. Day\u201d. The photograph shows a large crowd of people. The next image is of a soldier with a lit cigarette in his mouth. Another soldier in the background points forward. The next scene shows a large tank and smoke in the background. The next shot shows multiple people in military uniforms moving in a city environment with smoke in the background. The following shot shows a sign with text and an arrow pointing to the left. There is also fire visible. The next shot shows one person in military uniform standing on the left of the screen, pointing a gun. Multiple civilians are standing at the entrance of a building. The following shot shows a group of people moving down the street. A dead body is lying on the ground. The next image shows buildings in a city. The following image shows two tanks moving down a street. The following image shows a tank moving across a large bridge. The next image shows two people seated at a table. One person wears a dark hat with a German symbol. The other person wears a dark beret with a military emblem. Both people have microphones in front of them. The next image shows several people in street clothes moving along a city street. The timeline of events from the beginning of the video is shown again. The May 8 \u201cV.E. Day\u201d image is shown again, more closely, to show the large crowd of people. A crowd of people waving their arms is shown. A logo for DoD news is displayed on a white screen. The website address defense.gov is also shown. The DoD news logo is shown again on a purple background. The next image has a logo of the US Department of Defense. Social media handles are shown below the logo, along with the URL defense.gov. Text reads \u201cSubscribe\u201d and \u201cWatch More\u201d.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Are there any Chinese people in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I don't know."
+ ]
+ }
+ },
+ "bbZxYynw": {
+ "video_id": "bbZxYynw",
+ "video_duration": "138s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A group of three people are sitting in front of a large window. A person on the right has a dark jacket on and is holding a black object. A person on the left has a red and black checked shirt on. The person in the middle has a red and black checked shirt on. The person on the right moves his arm, and then the person on the left passes the black object from his left hand to his right hand. The person in the middle is also holding an object. Then, the person on the right puts their hand on the arm of the person in the middle. The person on the left holds their hands out. The video then shows a person in green scrubs sitting at a table with a drink and a tray of food in front of them. Another person with dark, curly hair is facing away from the camera with their hands raised in the air. The person in green scrubs is moving their hands. The video then shows a different group of three people at a table. A man in a black shirt is pouring a white liquid onto a tray of food in front of the table. Three people are seated at the table and are looking at the man. Another man in a black jacket is visible in the background at another table with food. A close up of a person's eye is shown. Then, the video returns to the three people at the table. The person in black is pouring more liquid. A different person at the table is now eating with the aid of a utensil. The video then shows a man with dark hair and a beard who has a red and black plaid shirt on. Then, this person is seen in a full shot sitting at a table with others. They are eating from a tray of food. The video then returns to the first scene. The three people are sitting by the window again. The person in the middle is using a phone. A fourth person walks into the scene and looks at the person on the right. Another person is visible behind the person on the right. 
The video then shows a person in a grey shirt with their hands in handcuffs, being held against a car by a person in a black shirt with a gun in a holster. A person in a blue shirt is then shown, looking down. The video then shows a man from behind walking down a paved road. He has a red and white checked shirt on. The video ends with a close up shot of a person\u2019s eye. The video shows a group of four people at a table. The person on the left has a grey hooded sweatshirt on, and another person on the right has a black and white checked shirt on. Another person in the middle has a green shirt.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "In the video, did the boy wearing the black and white checkered shirt physically assault someone?",
+ "answer": "D",
+ "candidates": [
+ "A. None of the above.",
+ "B. Yes, he did.",
+ "C. It is unclear.",
+ "D. No, he did not."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotions does the curly-haired boy in the green shirt go through in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Anger - irritation - panic.",
+ "B. Anger - happiness - panic.",
+ "C. Anger and fear - rising hope - gratitude.",
+ "D. Irritation - fear - calm."
+ ]
+ }
+ },
+ "KDDrzWZn": {
+ "video_id": "KDDrzWZn",
+ "video_duration": "456s",
+ "duration": "6-8min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a large outdoor gathering of many people. The first few seconds show the crowd from a distance. Some people are raising their arms, and a stage with speakers is visible. In the next clip, a person holds a large framed picture over their head while surrounded by a crowd of people. The third clip shows a man in a black outfit talking and then hugging a man in a white tank top; they are in a crowded tent. Several similar clips follow, depicting the two men hugging and other people standing around tables in the tent. The video then shows a man in a long white outfit performing on a red-carpeted stage in front of a large crowd, accompanied by a security officer in black. He walks around and interacts with people around the stage. Several similar scenes follow, all from different perspectives. A new scene shows a person in a suit talking and then hugging people within a large crowd; the scene is very crowded with people taking videos of the events. Similar scenes of individuals greeting and hugging with a large crowd in the background follow. Next, the video shows several different clips of a man in a white outfit speaking into a microphone on a red-carpeted stage; the stage also has two laptops. A large crowd faces the stage. A man in a blue shirt is taking a video of the stage. Several similar scenes of people on the stage or in the crowd follow. The video shows more clips of a crowd before transitioning to many clips of the outdoor stage from various perspectives. Towards the end of the video, it shows people at what appears to be a stadium, where people in gold and red outfits are sitting in the stands. It also shows a band on the stage with large flags and tents in the background. The last shot of the video is an overall shot of the crowd in the stadium with the stage on the right.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What religious group is the crowd in the video affiliated with?",
+ "answer": "D",
+ "candidates": [
+ "A. Buddhism.",
+ "B. Christianity.",
+ "C. Hinduism.",
+ "D. Islam."
+ ]
+ }
+ },
+ "YOmEyvxf": {
+ "video_id": "YOmEyvxf",
+ "video_duration": "79s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Fitness",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "A man is lying on a weight bench lifting a barbell with red weight plates on either side. Two other men stand by to assist. The text overlay reads 227.5KG/500LBS X3. The scene changes to a woman standing in a gym in front of a barbell with large weight plates. The barbell has at least two large red weight plates and two smaller green ones. She bends at the waist, then straightens. The scene changes to a woman doing deadlifts with a barbell weighted with large red and green weight plates. Text overlay reads 210KG/462LBS X3. The scene changes to a man loading an additional red weight plate onto a barbell. The text overlay reads 320KG/705LBS. The man steps away from the barbell, and the camera cuts to a full shot of the same man preparing to deadlift. He is wearing black pants and a black shirt. He hoists the barbell and then puts it back on the ground. A man is seen adjusting a belt around his waist. The scene changes to the man deadlifting a barbell with orange weight plates. The text overlay reads 270KG/595LBS. He lifts, and puts it back down. The scene changes to the man with the beanie again deadlifting a barbell. The text overlay reads 365KG/804LBS. He lifts the barbell, then puts it back down. He holds a fist up to the side. The scene then changes to a close up of some plates on a pallet jack. The camera shows a barbell with many large orange weight plates. A person is seen placing a green weight plate onto the barbell. The scene changes to a man with a grey beanie. The camera cuts to the man. He is talking in a medium close-up. A man loads more weight onto a bar. The text overlay reads 320KG/705LBS. The man is shown in a close up again. The scene shows a woman doing deadlifts. The text overlay reads 210KG/462LBS X3. The camera then shows the woman lifting a barbell. The scene changes to a man putting on a belt. The camera shifts to another person preparing to deadlift. The text overlay reads 300KG/661LBS. 
The man deadlifts the weight and places it back down. The scene shows a closeup of the barbell weights again. The camera shifts to a man putting on a belt. Several people are seen in the background. The man with the beanie is shown preparing to deadlift. The text overlay reads 410KG/903LBS. The man lifts and puts down the barbell. Another person gives him a high-five, as several people clap. The video ends with a white logo on a black background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the deadlift movement appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. 13.",
+ "B. 12.",
+ "C. 10.",
+ "D. 11."
+ ]
+ }
+ },
+ "sYVBxvNj": {
+ "video_id": "sYVBxvNj",
+ "video_duration": "283s",
+ "duration": "4-6min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a blue t-shirt with a red and yellow logo and a blue, white and red jacket speaks to the camera. They are initially facing directly forward. They look to their right, then look back toward the camera. They begin to speak again. The person holds a black guitar and begins to strum it while speaking. The person is shown from the shoulders up and wears a blue, red, and white jacket over a blue t-shirt with a red and yellow logo. A black strap is visible on the guitar. A red, orange and yellow cloth is behind them. Then a white diagram appears next to the video, showing notation for guitar music, and a person playing the guitar in a corner of the video. The screen switches to a view of the music editor, showing blue audio patterns. The screen then switches back to the person speaking while wearing a black headset. The person puts their hand to their face, then gestures using both hands. The screen shows the music editor screen, then a screen showing a search bar in a browser. A search for \"boom\" is performed. An audio pattern titled \"booms.wav\" is displayed. The person is shown again, using hand gestures. Their hands are clasped, and then they continue to gesture with their hands. The screen switches to the music editor, with audio patterns visible. The person is seen again, looking at the camera, then covering their face with their hand and laughing. The person then speaks to the camera and continues making hand gestures. The person is shown again, singing. The person continues to sing while looking down toward a microphone. The person ends the video by singing and looking toward the camera.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the middle section of the video, while the man is creating music, where does his portrait appear?",
+ "answer": "C",
+ "candidates": [
+ "A. The lower-left corner.",
+ "B. The lower-right corner.",
+ "C. The upper-left corner.",
+ "D. The upper-right corner."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What does the man do with his guitar at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Teach various guitar finger techniques.",
+ "B. Teach how to tune a guitar.",
+ "C. Sing and play, accompanying himself.",
+ "D. Record and produce music."
+ ]
+ }
+ },
+ "KOHWnhue": {
+ "video_id": "KOHWnhue",
+ "video_duration": "30s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "A man and a child, a girl, are shown walking on a paved road with trees and houses in the background. The man is wearing a black puffy jacket and pants and a baseball cap and has a beard. The child is wearing a dark jacket, blue hat, pink shirt, and tan boots. The video shows various scenes of the man and child. In some they are walking and in others they are crouched near a brown cardboard sign that reads \"SYRIAN REFUGEE HELP\". The man wears a black beanie in the scenes near the sign. The lighting changes between some scenes, appearing to be filmed at various times during the day, or at night.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why do Syrian refugees seek help?",
+ "answer": "D",
+ "candidates": [
+ "A. The Syrian people are impoverished.",
+ "B. Uprisings are occurring in Syria.",
+ "C. War is occurring in Syria.",
+ "D. An earthquake has occurred in Syria."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the boy's hat in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Dark blue.",
+ "B. Blue.",
+ "C. Black.",
+ "D. White."
+ ]
+ }
+ },
+ "IZAGVUgC": {
+ "video_id": "IZAGVUgC",
+ "video_duration": "89s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a blurred background with blue and red sections, with white diamond and hexagon shapes. Text appears on top of the background. The text includes the words: \"SOCIOPATHS,\" \"SOCIOPATHS DON'T CARE ABOUT OTH,\" \"SOCIOPATHS DON'T CARE ABOUT OTHERS WHILE PSYCHOPAT,\" \"SOCIOPATHS DON'T CARE ABOUT OTHERS WHILE PSYCHOPATHS PRETEND TO\". The video then transitions to a cartoon panel showing a blue figure with two green leaves on its head sitting on a pink surface, holding a black rectangle. The figure raises its right arm. In the next panel, the figure appears to be hitting a white figure with a single green leaf on its head. The video then transitions to another panel with a red figure with a single green leaf on its head waving. Next, it transitions to an animation showing a red figure with three white figures behind it. Following this, the video returns to the blurred background with text, which includes the following words: \"SOCIOPATHS ARE IMPULSI,\" \"SOCIOPATHS ARE IMPULSIVE WHILE PSYCHOPATHS,\" \"SOCIOPATHS ARE IMPULSIVE WHILE PSYCHOPATHS TEND TO BE METHODIC,\" \"SOCIOPATHS ARE IMPULSIVE WHILE PSYCHOPATHS TEND TO BE METHODICAL\". The video transitions to a split panel, one side shows a blue figure with a single green leaf punching a gray figure. The other panel shows a blue figure, and a red figure. Following this, the video returns to the blurred background with text, which includes the following words: \"SOCIOPATHS ARE,\" \"SOCIOPATHS ARE GENERALLY,\" \"SOCIOPATHS ARE GENERALLY A PRODUCT OF THEIR ENVIRONMENT,\" \"SOCIOPATHS ARE GENERALLY A PRODUCT OF THEIR ENVIRONMENT WHILE PSYCHOPATHS AR,\" \"SOCIOPATHS ARE GENERALLY A PRODUCT OF THEIR ENVIRONMENT WHILE PSYCHOPATHS ARE BORN\". Next, the video shows a light blue background with horizontal white lines and the text \"VS.\". This transitions to an animation of a double helix shape with dots to the left of \"VS.\" and a blue heart shape to the right. 
Lastly, the video transitions to a split panel: on the left, a blue figure standing in front of a yellow background, on the right a red figure stands in front of a light blue background. The red figure raises its hand, and next to it a portion of a gray body is visible with red marks on it. The word \"LEARNED\" appears above the blue figure.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action does the blue cartoon character with a green sprout on its head take toward the gray cartoon character in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. First shows concern, then pushes away.",
+ "B. Discusses studying.",
+ "C. First comforts, then hugs.",
+ "D. Aggressively attacks."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual actions does the pink cartoon character in the video exhibit towards the white cartoon character wearing a leaf?",
+ "answer": "D",
+ "candidates": [
+ "A. Shows fear and avoids.",
+ "B. No unusual behavior.",
+ "C. Friendly but maintains a distance.",
+ "D. Always initiates greetings."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What does the DNA double helix structure signify at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Both personalities are genetically inherited.",
+ "B. It signifies the genetic traits of borderline personality disorder.",
+ "C. It signifies the genetic predisposition to antisocial personality disorder.",
+ "D. Two lines intertwine."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotional state of the blue cartoon character with a green sprout on its head in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Excited.",
+ "B. Angry.",
+ "C. Agitated.",
+ "D. Sad."
+ ]
+ }
+ },
+ "XjKlLynb": {
+ "video_id": "XjKlLynb",
+ "video_duration": "68s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two shirtless men, one wearing yellow shorts, are shaking hands in what appears to be a boxing ring with various flags in the background. The scene then cuts to black text on a black background that reads \"IT'S ON DESHAE...\". Next, two men are in a room, standing in front of a bed. The man on the left wears a black tank top and green shorts. The man on the right wears a black t-shirt and light gray pants. The man in the gray pants stands on the bed and begins to put items into a bag that sits on the bed. The man in green shorts watches him. The two men then stand and face each other, briefly making contact before separating. The man in gray pants moves around behind the bed. The man in green shorts then jumps onto the bed and tackles the other man.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct chronological order of the events depicted in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Two people duel in a ring, then they argue, and a conflict arises.",
+ "B. Two people argue, then a conflict arises, and then they duel in a ring.",
+ "C. Two people duel in a ring, then a conflict arises, and then they argue.",
+ "D. Two people argue, then they duel in a ring, and a conflict arises."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the bell ring in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Four.",
+ "D. Three."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the spatial relationship between the two people at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The person wearing gray pants is to the right of the person wearing shorts.",
+ "B. The person wearing shorts is lying on top of the person wearing gray pants.",
+ "C. The person wearing gray pants is lying on top of the person wearing shorts.",
+ "D. The person wearing gray pants is to the left of the person wearing shorts."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What are the two people doing in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. They argue first, followed by a conflict.",
+ "B. They are arguing but not engaging in a conflict.",
+ "C. They neither argue nor engage in a conflict.",
+ "D. They are in conflict but not arguing."
+ ]
+ },
+ "task4": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What are the emotional changes of the man in gray pants in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. From happiness to fear.",
+ "B. From happiness to anger.",
+ "C. From fear to anger.",
+ "D. From anger to fear."
+ ]
+ }
+ },
+ "JIyZyUJF": {
+ "video_id": "JIyZyUJF",
+ "video_duration": "47s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a close-up view of a computer system with multiple circuit boards and cooling components. There are seven black objects with copper colored heat pipes on top, arranged in a row. To the right of them is another circuit board with many colored wires connected to it. Further right, there is a block containing multiple black fans in a row. The camera moves around the system, showing the different parts from various angles, including circuit boards with silver connector ports and multiple rows of green chips. The camera pans across the system, showing more cooling fans and the black components with copper heat pipes. The video ends by focusing on the black components with copper heat pipes.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the fan positioned on the device shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. On the right.",
+ "B. Below.",
+ "C. Above.",
+ "D. On the left."
+ ]
+ }
+ },
+ "svbTwZoO": {
+ "video_id": "svbTwZoO",
+ "video_duration": "52s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a solid black screen. Two translucent, pale hands appear on the screen, palms facing out. They move around the screen slightly. The view then cuts to a wide shot of a humanoid silhouette with large arms outstretched. The silhouette has a small face with oval eyes. The silhouette is surrounded by small, dark objects, and there are short flickering lines of fire below the silhouette. The view cuts to the silhouette of a bird with outstretched wings. The view cuts to a long, pale, worm-like creature with a dark outline and simple facial features, standing against a black background with small white lines in the distance. The view zooms in on the creature's face, then cuts to another view of the same creature upside down, hanging from a dark, tree-like shape. Small dark drops are falling from the tree. The view cuts to a close up of a grey scribble on a black background. The view cuts to a black and white image of concentric circles with the illustration of a human figure inside a circle. The circles move closer and closer to the center. The view cuts to the illustration of a human like figure with its arms outstretched, shown against a background that is half-black, half-white. The figure is pale with black outlines. The view cuts to the same illustration shown three times in a row in an alternating pattern of black and white. The view cuts to the black silhouette of a humanoid figure, arms bent with palms together, against a white background. The view cuts to the same figure with arms outstretched, palms upward. The view cuts to the same black silhouette of a humanoid figure sitting with legs bent on a staircase. The figure has small arms outstretched, and is shown under a sign. The view cuts to a close up of the same face shown earlier. The view cuts to several figures standing in a line, with one in front leaning over with hand near his face. 
The view changes to another figure leaning towards another figure with a hand near his face, as if they are listening. Another figure raises his hands in the air. The video cuts to the black ground with several pale, humanoid faces, with small black hands reaching up from the ground as a pale, humanoid figure raises its hands in the air.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What happens after the explosion in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Darkness descends.",
+ "B. People chant for freedom.",
+ "C. The god of darkness emerges.",
+ "D. A figure sits on the throne."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What does the god of darkness do to the little beings in the darkness in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The god of darkness becomes the teacher of the little beings in darkness.",
+ "B. The god of darkness grants freedom to the little beings in darkness.",
+ "C. The god of darkness bestows pain upon the little beings in darkness.",
+ "D. The god of darkness bestows happiness upon the little beings in darkness."
+ ]
+ }
+ },
+ "DHEgydBG": {
+ "video_id": "DHEgydBG",
+ "video_duration": "518s",
+ "duration": ">8min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person is kayaking on the ocean at dusk. The shoreline, which includes trees and buildings, is visible in the background. The camera is positioned on the kayak, providing a first-person perspective. The person's bare feet and part of their legs are visible, as is a fish finder mounted on the kayak. The sun appears to be setting behind the camera, and the moon is visible ahead and to the right. The camera pans across the kayak, showing two fishing rods. The person picks up one of the rods and begins reeling. They continue reeling for some time, pausing occasionally to check the fish finder or reposition their hands. The sky grows lighter. Several birds appear and circle above the water. The camera shows the fish finder again. The person checks one of their other lines, pulls it up, and inspects the hook. It's bare. The person recasts their line. The sun is higher in the sky now, reflecting off the water, and the sea is a deep blue. A bird gets caught on the fishing line. The person reels it in, and carefully removes the hook and the line from the bird's feet and wing. The bird is released back onto the water. The person switches to a different fishing rod and casts the line. They continue reeling. They switch to another rod. They inspect a white fishing lure with a yellow back, a hook, and green feathers. They cast the line. They check the line. They examine the fishing lure, which appears to have become unrigged. They start rerigging the fishing lure. The person looks toward the camera, wearing a black beanie hat and black wraparound sunglasses. The camera pans across the kayak again, showing the fishing rods, a net, and various other pieces of equipment. The camera changes scenes. Two people are fishing from a small motorboat on a bright, sunny day. The coastline is visible in the background. One person is wearing a black jacket and hat, the other a brownish-orange jacket. They each take turns reeling in a large fish.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What does the man wearing a woven hat discover at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Many birds are flying.",
+ "B. A friend catches a tuna.",
+ "C. The sun rises above the horizon.",
+ "D. A large school of fish passes under the boat."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "In the video, what does the man wearing a woven hat mean by \"full-size 6x\"?",
+ "answer": "C",
+ "candidates": [
+ "A. It refers to a fishing rod.",
+ "B. It refers to a turbine fin.",
+ "C. It refers to a fishing line.",
+ "D. It refers to a fish finder."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "At what time of day does the man wearing a woven hat begin fishing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Afternoon.",
+ "B. Dawn.",
+ "C. Morning.",
+ "D. Midnight."
+ ]
+ }
+ },
+ "UyWRQoKg": {
+ "video_id": "UyWRQoKg",
+ "video_duration": "407s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "A man with a beard, wearing a short-sleeved shirt with a tropical print, stands at a wooden table outdoors. He chops cilantro and lime wedges on a cutting board. Three ears of cooked corn are placed on the cutting board. He inserts wooden skewers into the ends of each corn cob. Using tongs, he spreads mayonnaise on each ear of corn. He then sprinkles cotija cheese from a plastic bag onto the corn, followed by Tajin seasoning from a bottle. He squeezes lime juice onto the corn, adding more Tajin and cotija cheese. He takes a large bite of one of the corn cobs. He picks up a bottle of Valentina hot sauce.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, is the oven located to the man's right front or left rear?",
+ "answer": "A",
+ "candidates": [
+ "A. The left rear.",
+ "B. The right front.",
+ "C. Directly in front.",
+ "D. Directly behind."
+ ]
+ }
+ },
+ "TnhFyVsw": {
+ "video_id": "TnhFyVsw",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person with long brown hair sits on an orange couch covered with a pink and yellow blanket. They wear a black long-sleeved shirt and bring their right hand to their forehead and eyes. A person with long dark hair and a beard sits on a green couch. They wear a black hoodie and red sweatpants. They have their hands clasped together in their lap. The person with long brown hair points with their right index finger. A woman with long dark hair and bangs sits on an orange couch. She wears a pink sweater and a plaid skirt. She holds a microphone in her right hand. The lower portion of the screen shows containers of Wendy's food, including french fries, nuggets, two drinks, and condiment packets. The person with long brown hair gestures with both hands. The person with the beard sits next to a person with a reddish-brown beard and a yellow jacket. The woman with bangs looks downwards.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, what is the position of the man wearing red pants relative to the man wearing orange clothes?",
+ "answer": "C",
+ "candidates": [
+ "A. On the left.",
+ "B. In front.",
+ "C. On the right.",
+ "D. Behind."
+ ]
+ }
+ },
+ "YvDzQAnH": {
+ "video_id": "YvDzQAnH",
+ "video_duration": "108s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Four people are standing behind a white sign with the words \"Junior Presidents Cup\" and a trophy icon on it. Two of the people are wearing navy blue baseball caps and striped yellow and navy blue shirts. The other two people are wearing white baseball caps and red shirts. The sign also has \"2017 Junior Presidents Cup\" and \"Plainfield Country Club - Edison, NJ. Day 1 - Four-ball\" written on it. The scene transitions to a close-up of two men wearing baseball caps and another person who is not in view. One of the men is wearing a blue baseball cap, and the other is wearing a white baseball cap. The scene transitions to show a crowd of people, most wearing red shirts and white baseball caps. One man is taking a swing at a golf ball. He is wearing a red shirt and navy blue shorts. The camera angle changes to a wide shot of a man taking a swing at a golf ball. He is wearing a red shirt and navy blue shorts. A man in the background is carrying a golf bag. The scene transitions to a wide shot of a golf course. A yellow flag is in the distance. The scene transitions to a close-up of a man wearing a striped yellow and navy blue shirt taking a swing at a golf ball. There are several other people in the background, some of whom are standing around golf bags. The scene transitions to a wide shot of the golf course with several people and two golf bags on the periphery. A man with a yellow flag approaches a golf ball on the green. The scene transitions to a man in a red shirt preparing to take a golf swing with two other people in the background. The scene transitions to a man in a red shirt taking a swing at a golf ball. The scene transitions to a wide shot of the golf course. The camera is tracking a golf ball in flight. The scene transitions to a wide shot of a man in a yellow and blue striped shirt getting ready to putt. The scene transitions to a wide shot of the golf course showing the flag in the distance and the golf ball on the green. The scene transitions to a wide shot of the golf course with a golfer in the distance.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Which team won the match shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The U.S. team.",
+ "B. The international team.",
+ "C. It was a draw.",
+ "D. I am not sure."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "How did Garrick Higgo perform on the eighth hole in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. He scored two strokes over par.",
+ "B. He scored one stroke over par.",
+ "C. He scored one stroke under par.",
+ "D. He scored two strokes under par."
+ ]
+ }
+ },
+ "zLfhsrxJ": {
+ "video_id": "zLfhsrxJ",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a soccer field with three people in blue uniforms practicing soccer. The camera is behind a goal. A woman with brown hair is talking and waving, with colorful orbs in the background. The camera then shows the woman in a blue jumpsuit, sitting in a spinning apparatus. She spins around repeatedly. Next the woman is in a white spacesuit putting on a helmet in front of a structure. The woman is then seen using a microscope. A woman with blonde hair is shown, along with the text \"Ms. Lauren Wernstrom, Math Teacher\". A rocket is shown launching with lots of smoke. A man with graying hair is shown. There is a wall with numerous photos, articles, and a framed photo of the woman in a blue space suit, holding a helmet. Then a framed picture of the same woman in blue suit appears, along with a magazine cover and article about her. Then a group photo of the woman and about 12 other people in green t-shirts in front of a rocket. Another shot shows a rocket launching from underneath, with a view of its engines. The woman with brown hair is shown talking. She is then shown in a pool diving with scuba gear. There are several people, some with scuba gear and some with space suits, performing an underwater activity. A man with a bald head is shown, along with the text \"Mr. Ryan Cook, Principal\". The woman is shown in a bedroom setting writing in a notebook with a computer next to her. There is a close up of her hands while she writes. A woman with blonde hair is shown, speaking. A man with gray hair is shown, along with the text \"Paul Foerman, Public Affairs Officer, Stennis Space Center\". Finally, the woman is shown outside on the soccer field, looking around and smiling.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Did the rocket shown in the first half of the video complete booster separation?",
+ "answer": "A",
+ "candidates": [
+ "A. No, it did not.",
+ "B. None of the above.",
+ "C. Yes, it did.",
+ "D. It cannot be determined."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many rocket launches are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two.",
+ "B. Four.",
+ "C. Three.",
+ "D. One."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What type of environment are the people wearing white spacesuits in during the latter part of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Inside the spacecraft.",
+ "B. Outside the spacecraft.",
+ "C. In the water.",
+ "D. Inside the space station."
+ ]
+ }
+ },
+ "nglFPgXE": {
+ "video_id": "nglFPgXE",
+ "video_duration": "63s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a shot of a wooden archway, leading into a cobbled street lined with buildings. People walk along the street. The scene shifts to a hippogriff flying over a rocky coastline near the sea. Several trolls emerge from a dark cave. In a stone corridor lit by torches, a figure is thrown backward through the air. A person in a dark robe runs through a grassy field, pursued by a large, horned, dragon-like creature. The scene changes to a stone wall with a circular, spiral design. A figure emerges from a hole broken in a stone wall as debris flies outward. The scene transitions to a dark cave opening to the sky. A dark figure appears to be lying on a raised platform within the cave, surrounded by small lights, under a waterfall. A young man with dark hair walks toward the platform. In a darker cavern, a stone bridge leads to a large symbol adorned with fire and flanked by torches. The flames on the torches intensify. A dark-skinned young man touches the face of the dragon-like creature. He then fights a troll with a club in a stone chamber. He stands and raises his wand, which emits a white light. A winged creature attacks him, and he is flung into the air. In a candlelit, circular chamber with columns, a figure stands before a raised platform. A masked figure stands before an open book on a table lit by candles and a lantern. Another figure is knocked backward onto a stone floor. A person in a dark robe, holding a wand, casts a purple spell toward an unseen opponent. A shadowy figure plummets down through debris and smoke. A woman in a red-trimmed dark robe aims her wand. A large spider-like creature is flung back and explodes in a shower of red sparks. A cloaked figure casts a reddish spell toward the camera. Another cloaked figure is knocked backward. A dragon breathes a stream of fire towards a running figure, who uses a shield spell to deflect it. The scene changes to a view of the sky through dispersing clouds, as a dragon flies upward. The camera pans up to reveal Hogwarts castle in the distance. The title \"Hogwarts Legacy\" appears in front of the castle, as a figure with a glowing wand walks toward it.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does a player appear riding a dragon over a highland landscape?",
+ "answer": "C",
+ "candidates": [
+ "A. Throughout the entire video.",
+ "B. In the middle of the video.",
+ "C. At the beginning of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many dragons are shown flying in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two.",
+ "B. One.",
+ "C. Three.",
+ "D. Four."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What occurs in the video when the wizard confronts the wall within the ruins?",
+ "answer": "B",
+ "candidates": [
+ "A. The wall opens automatically, propelling the wizard behind it, and ultimately the wizard transforms into a droplet of water.",
+ "B. The wall opens automatically, then propels the wizard behind it, and finally the wall closes automatically.",
+ "C. The wizard casts a spell on the wall, is propelled behind it, and finally the wall closes automatically.",
+ "D. The wall vanishes, propelling the wizard behind it, and ultimately the wizard transforms into a droplet of water."
+ ]
+ }
+ },
+ "ASFUHqmN": {
+ "video_id": "ASFUHqmN",
+ "video_duration": "134s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a person with dark hair and facial hair wearing a maroon t-shirt against a gray background. The video starts with a quick motion of the person's head and hair. The person sticks out their tongue. Then they put one hand up to touch their hair, and briefly close their eyes. The person opens their mouth and speaks before speaking more, and then they move their hand up to their head. The person then holds a bright orange hair dryer in front of them, and shows the hair dryer up close, before they comb the hair while using the hair dryer. They use the comb and hair dryer to style their hair while looking directly at the camera. The video then shows a side by side comparison of the person's hair after air drying and after blow drying, with the labels 'AIR DRY' and 'BLOW DRY' written below. After this, the person is seen holding up their hands in front of the camera, then showing the hair after using some product from a small can, while talking and making gestures with their hands. The video ends by showing the hair from different angles.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many tools does the man use to style his hair in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Five.",
+ "B. Three.",
+ "C. Two.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What content deficiencies are apparent at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The video audio is unclear.",
+ "B. The video image is blurry.",
+ "C. The video abruptly cuts off.",
+ "D. The video lacks post-production special effects."
+ ]
+ }
+ },
+ "BWlTtVQn": {
+ "video_id": "BWlTtVQn",
+ "video_duration": "49s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video starts with the Crown Jewel event title card. Then Brock Lesnar makes his entrance, shirtless and wearing black MMA-style shorts and gloves. The camera cuts to a wide shot of the wrestling ring. After Lesnar enters the ring, Bobby Lashley, wearing red and yellow wrestling attire, enters from the opposite side. The match begins with Lashley tackling Lesnar. They wrestle outside of the ring, where Lashley slams Lesnar into the barricade. Back in the ring, Lashley gets Lesnar in a Hurt Lock submission hold. Lesnar struggles to escape, eventually lifting and slamming Lashley to break the hold. Lesnar then picks up Lashley and delivers an F5 for the win. He celebrates his victory as young fans cheer in the foreground.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What was the final outcome shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The man in black shorts ties with the man in red and yellow pants.",
+ "B. The man in black shorts wins.",
+ "C. The man in red and yellow pants wins.",
+ "D. The referee raises the hand of the man wearing black shorts."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual action does the man wearing red and yellow pants perform at the start of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The man in red and yellow pants sits on a man wearing black shorts.",
+ "B. The man in red and yellow pants lifts and throws down a man in black shorts.",
+ "C. The man in red and yellow pants puts a man in black shorts in a chokehold.",
+ "D. The man in red and yellow pants pushes a man in black shorts onto a sofa."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video, how many times does the man wearing red and yellow pants attack the man wearing black shorts?",
+ "answer": "C",
+ "candidates": [
+ "A. One time.",
+ "B. Zero times.",
+ "C. Three times.",
+ "D. Five times."
+ ]
+ }
+ },
+ "hnbnwqLz": {
+ "video_id": "hnbnwqLz",
+ "video_duration": "63s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Astronomy",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video starts with a close-up view of the ground. There is a partially visible individual wearing a blue jacket with yellow accents and black footwear. A white spiral-bound notebook is open on the ground. The first page of the notebook has a sketch of a small house. The following page has handwritten text and an oval with arrows pointing to and from it. The notebook is then picked up. The view changes to show the legs and lower torso of a person in dark pants and shoes standing on a dirt trail. The camera pans up to reveal the person in dark clothing squatting down, reaching into a black bag, and then holding something that looks like a laptop. The laptop screen shows the blurry reflection of the person filming, standing among trees under a blue sky. A black rectangle appears and moves across the reflection on the screen. The camera angle moves to the laptop screen, then it is moved so that a person's hand and face appear in the frame. The video shifts to a front-facing shot of a man with tousled light brown hair, a beard and dark eyebrows. He's wearing a black leather jacket over a blue shirt. He touches his forehead a few times. The background shows bare trees and a white dome-shaped building. The man speaks to the camera, moves his head and smiles. The video shows a close-up of tree shadows on the ground, and then a spiral notebook lying on the ground. The video ends with text asking what was on the laptop screen and notepad. Then, a clip is repeated showing the reflection of the person in the laptop screen and then the page with writing and sketches is shown.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes occur in the man wearing a jacket in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. There is no change.",
+ "B. From sadness to happiness.",
+ "C. From anxiety to excitement.",
+ "D. From sorrow to weeping."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What kind of interaction occurs between the two people in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. They pass a computer.",
+ "B. They pass a notebook.",
+ "C. They pass a camera.",
+ "D. They have no interaction."
+ ]
+ }
+ },
+ "xLIZphBh": {
+ "video_id": "xLIZphBh",
+ "video_duration": "23s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A red train with a number 5 is on a train track. It is next to a red truck. The train has a black face with white eyes and a surprised expression. Then there are two light blue figures wearing hats standing in front of a beige building with two green windows. The train is shown on the track pulling several brown freight cars. The red train pulls the brown freight cars along a track with a large green hill next to it. Then the red train pulls the freight cars next to a black tar car. Then the train moves in front of a building and there are two light blue figures standing behind green benches. The red train's black face shows an angry expression and steam is coming out from the train.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, from which direction does the second braking sound originate?",
+ "answer": "C",
+ "candidates": [
+ "A. The sound comes from the left.",
+ "B. The sound comes from below.",
+ "C. The sound comes from the right.",
+ "D. The sound comes from above."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the small red train angry?",
+ "answer": "B",
+ "candidates": [
+ "A. The coal supply is insufficient.",
+ "B. The puppet stewardess is singing.",
+ "C. It is overtaken by other trains.",
+ "D. The railway track surface is uneven."
+ ]
+ }
+ },
+ "kGHcZoyy": {
+ "video_id": "kGHcZoyy",
+ "video_duration": "160s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Software",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A pair of hands holds a silver metal object with a hole in the middle. A red rag is under it. The hands then pick up a separate silver metal piece with various parts extending from it. The hands put a finger inside the hole, then add another finger. The hands insert a black rubber disc onto the hole. After that, a silver spring is inserted into the hole. The hands then add a black plastic piece and adjust it. The hands then begin to press a piece of black rubber along the edge of the silver metal piece. The hands adjust the spring again, then show the black plastic piece. The hands then place the black plastic piece onto the silver metal piece and adjust it with the fingers. The hands then pick up a red screwdriver and use it to insert screws into the black plastic piece. The hands then show the silver metal piece and black plastic piece on top of it, then set it down.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order for assembling the carburetor, as demonstrated in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Place the O-ring, inspect the diaphragm, align the needle and jet, cover with the black cap, insert the screw, tighten the screw.",
+ "B. Inspect the diaphragm, place the O-ring, align the needle and jet, cover with the black cap, insert the screw, tighten the screw.",
+ "C. Align the needle and jet, place the O-ring, cover with the black cap, inspect the diaphragm, insert the screw, tighten the screw.",
+ "D. Place the O-ring, align the needle and jet, cover with the black cap, inspect the diaphragm, insert the screw, tighten the screw."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes happen to the spring in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It is straightened.",
+ "B. It is cut short.",
+ "C. It is compressed.",
+ "D. It is stretched."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many screws did the man place on the cover in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. One.",
+ "B. Three.",
+ "C. Two.",
+ "D. Four."
+ ]
+ }
+ },
+ "rBVlRilL": {
+ "video_id": "rBVlRilL",
+ "video_duration": "311s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video starts with a close-up of a monstera plant in a terracotta pot. The camera moves up the plant's stem, which is attached to a moss pole with green ties. A finger points at the stem and green tie and moves down the stem, showing the moss pole. The video then transitions to a tropical leaf design graphic. Next, a woman with long brown hair wearing a mustard yellow top sits at a wooden table. In front of her are several plants, a clear glass jar with a plant and water in it, a dark green pot with a plant in it, a small black pot, and a gray rock. She picks up the dark green pot and rotates it to show the leaves of the plant inside. She then picks up the clear glass jar, rotates it to show the roots of the plant, and then pulls the plant out of the jar, showing a longer stem and larger roots. She holds the plant in her palm and moves it closer to the camera. She then puts the plant down and picks up a second plant in a net pot filled with what appears to be clay pebbles or leca. She removes the net pot, showing a tangle of white roots attached to multiple leca balls. She shows the roots close to the camera and manipulates the leca balls around the roots. She then places the net pot with the plant back into a small glass of water. The woman then picks up a plant from a standard pot. The plant has long, white roots. She plants the plant's roots into the dark green pot. She picks up a mason jar with soil in it and begins to plant a different plant cutting into the soil. She then picks up another plant cutting and holds it close to the camera, showing a small node towards the top. She plants the cutting into the mason jar of soil. She repeats this with another cutting, pointing out the node on that cutting as well.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many different samples of cultivated plants did the woman wearing a yellow T-shirt display in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. One.",
+ "C. Six.",
+ "D. Four."
+ ]
+ }
+ },
+ "nLFRStsr": {
+ "video_id": "nLFRStsr",
+ "video_duration": "311s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a dark background with orange and white particles floating across the screen. These particles move together into a horizontal line, which then forms the letters \"EALSH.\" The letters then change to \"THEREALSHOWBBQ\" in gold. The words stay still before transitioning to black text on a dark background reading \"We Be.\" The scene changes to show a person wearing a gray long sleeve shirt and black baseball cap standing in a kitchen. In front of them are ribs on a counter, a red jar, a brown bottle, and a green bottle. The person speaks, pointing towards the camera with a black-gloved hand. The person points at the spices, picks up the bottle on the far right, then grabs the red jar. The person reaches for the white paper on the counter then wipes down the ribs. The person then picks up a white bottle and puts a white liquid on the ribs. They use a paper towel again to wipe the liquid, then cut the ribs using a knife. The person uses a black glove to pick up one side of the ribs, then they pick up the whole piece and put it back down before repeating this action several times. They then open the jar, shaking it over the ribs. They rub the red spice on the ribs with a black-gloved hand, and the scene cuts to a shot of the person opening a black box grill. They check inside and then grab a red cutting board with the ribs to put inside the grill, then spray the ribs. They close the grill and the scene changes to show the person arranging items on a sheet of silver foil. They pour white, brown, and yellow liquids onto sugar and spices that they put on the foil.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man do with a white tissue in the first half of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Absorb the fat from the ribs.",
+ "B. Remove the membrane from the ribs.",
+ "C. Wipe the table.",
+ "D. Wipe the blood off the ribs."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Did a knife appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Uncertain.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "mkLbMJPA": {
+ "video_id": "mkLbMJPA",
+ "video_duration": "69s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person uses a long, thin metal utensil to flip flat, yellow pieces of dough in a large black pan of hot, bubbling oil. Steam is rising from the oil. There is a price of $0.2 and \"halwa puri\" text on the screen. Several people are standing in the background. One person is seated, preparing dough pieces. A person on the left throws a golden-brown dough into the air before placing it into the pan. Several people are seen standing near the frying pan. There is a silver metal container full of small white dough balls. The camera switches to show two round pieces of fried dough on a metal counter with a silver container of white balls and a few fried doughs. The camera angle changes to a city street scene, where a few people are cooking and tossing fried doughs into the air. The camera switches back to show an overhead view of a black counter where a person scoops a yellowish, stew-like dish out of a metal pot and places it into a small silver dish. The camera switches to an overhead view of a metal dish of tan colored, lumpy food. Two large, round fried doughs are on plates and held up. The camera angle changes to show those two doughs placed on a black table. A person places two small metal dishes with a stew-like dish onto the table. The camera angle changes to show two people at a table in a restaurant. Both people are facing the camera. The person on the left is wearing a blue shirt and gold-framed sunglasses. The person on the right is wearing a white outfit with a white hat. The same two people are then shown in different angles speaking to the camera. Text appears on the screen: \"$0.4 Butter Milk\". The camera switches back to an overhead view of the black table with the fried doughs and small dishes. One person points at the metal dishes. A waiter arrives and serves a tall silver glass filled with a white beverage to the people at the table. There is a text over one of the doughs, \"Hot.\"",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will the two people do after their conversation in the video concludes?",
+ "answer": "B",
+ "candidates": [
+ "A. They will order another portion of food.",
+ "B. They will eat the food in front of them.",
+ "C. They will discard the food.",
+ "D. They will go home."
+ ]
+ }
+ },
+ "oVXhRJQm": {
+ "video_id": "oVXhRJQm",
+ "video_duration": "130s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a first-person perspective of a player using various weapons in what appears to be a video game. The player uses a pistol-like weapon and a shotgun. The player moves through several indoor and outdoor environments with various structures, including wooden buildings, metal structures, and rocky areas. The player shoots at other characters and receives damage. At one point a screen is shown that has a series of illustrated movement instructions. The player's health and ammo are also visible on screen. Electrical effects are also shown emanating from the shotgun. The player's character also has a view of their arms and hands.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Has the player character ever defeated an enemy using a thrown weapon?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes, they have.",
+ "B. It is unclear.",
+ "C. No, they have not."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many enemies does the player kill with a shotgun in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. Eight.",
+ "C. Four.",
+ "D. Seven."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the purpose of the boxes left behind after defeating enemies in the game?",
+ "answer": "C",
+ "candidates": [
+ "A. Summoning teammates.",
+ "B. Restoring health.",
+ "C. Replenishing ammunition.",
+ "D. Obtaining new weapons."
+ ]
+ }
+ },
+ "zbJtRhqA": {
+ "video_id": "zbJtRhqA",
+ "video_duration": "151s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with shoulder length wavy blonde hair is standing and handling the hair of another woman, while facing the camera and talking. The second woman is seated in a salon chair with her back partially to the camera and her head in profile. The first woman is wearing a black jacket and a necklace. The second woman is wearing a black tank top. The first woman uses her fingers, a comb, a hair spray bottle, and a curling iron to style the second woman's hair. The background shows a salon with salon chairs and black drawers. There is a logo in the bottom right corner.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the woman in black section a fringe from the model\u2019s forehead?",
+ "answer": "D",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Throughout the video.",
+ "C. At the beginning of the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "ixoMLCRM": {
+ "video_id": "ixoMLCRM",
+ "video_duration": "132s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a competition in a large room with a black floor. There are white lines on the floor forming a path. There are several structures constructed out of white panels placed on the floor creating a small maze. Several people wearing maroon and gold colored clothing are standing on a raised platform watching the floor. There are also people standing on a lower level near tables. On the floor, three toy vehicles, one blue, one red and one multi colored are moving along the white lines and interacting with gray balls that are also on the floor. The vehicles are navigating through the maze. In the foreground, there are several white panels held up by wooden support structures. A person carrying a laptop enters and moves away from the camera.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual events occur when the wheel of No. 3007 gets stuck?",
+ "answer": "A",
+ "candidates": [
+ "A. A robot without a number attempts to push No. 3007.",
+ "B. No. 2181 goes out of bounds.",
+ "C. Car No. 3007 rolls over.",
+ "D. Wheel No. 3007 falls off."
+ ]
+ }
+ },
+ "iCIHneHo": {
+ "video_id": "iCIHneHo",
+ "video_duration": "43s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two people are visible in the video, both standing on a paved area. A person wearing a blue jacket holds a yellow measuring tape and marks something on a flat, grey surface while another person wearing a black coat holds a notebook and pen. The person wearing a blue jacket then places a small, dark object on the surface. A red graphic of \"d = 5m\" is superimposed over the scene. The two people then stand on opposite ends of the grey surface, while a cell phone is shown which has a stop watch on screen. The timer on the screen starts, and it reads \"1.015\". The person in the blue jacket throws a small, dark object, and a semi-transparent green circle graphic is superimposed over the scene. A red graphic of \"\u0394t = d/v\" is superimposed on the scene, then the person in the blue jacket is shown throwing the object. A semi-transparent green circle graphic is superimposed over the scene again, and then a red graphic of \"\u0394t = d/v\" is superimposed on the scene. The video is taken from a slightly high vantage point. There is a landscape with trees and fields visible in the background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Who clapped second?",
+ "answer": "C",
+ "candidates": [
+ "A. The man in the black jacket.",
+ "B. The man in the yellow jacket.",
+ "C. The man in the blue jacket.",
+ "D. The man in the green jacket."
+ ]
+ }
+ },
+ "SeJPVYAW": {
+ "video_id": "SeJPVYAW",
+ "video_duration": "35s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video displays a first person perspective of a video game. The player is using a gun. A circular inset in the upper center shows a person wearing a beige hat, glasses, a beige vest, and black headphones. The first person view shows what the player is seeing in the video game. The player aims at a person lying on the ground in the distance. The player shoots and the person appears to go down. The player begins moving forward, and they come across a person on the ground who appears to be injured. There are two other video game characters that appear in the view and then leave the screen as the player moves. The injured person is shown from multiple different angles. The player moves to a location with a sandy background. There is a structure in the background that appears to be blown up.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Are there any moving cars in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes, there are.",
+ "B. No, there are not.",
+ "C. It is unclear."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Do the female characters in the video also have teammates?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes.",
+ "B. Unclear.",
+ "C. No."
+ ]
+ }
+ },
+ "QeDQYPoA": {
+ "video_id": "QeDQYPoA",
+ "video_duration": "120s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A single animated figure with short blonde hair stands in a room with a sloping ceiling, holding a hockey stick, and wearing a striped shirt and blue jeans. A blue bag lies on the floor. The figure turns their head to the side, and appears to have closed their eyes. The animated figure\u2019s head drops down, and then it turns to look to the left. Another animated figure with dark hair, glasses, and a yellow striped shirt, places their hand on the shoulder of the first animated figure. The first animated figure has a blue backpack and a yellow shirt on, as the second animated figure speaks to them at a door. The first figure steps through the door and waves. The first animated figure is walking out of the door. A blue orb containing three people seated on a tree branch is shown. The people are then surrounded by children in matching red clothing. The first animated figure is then shown hoisted in the air, carried by the children. The image then fades to black. The animated figure with a yellow shirt appears in a classroom, and appears to look down. Another animated figure is talking in the first animated figure\u2019s direction, and then it looks at someone else. The first animated figure is then shown again in a close up, and appears to look sad. Another figure appears to be in a clear tube. Then, a blue animated figure drops onto a surface, and slides down into a blue tube. The scene then changes to the blue animated figure on a purple background. Three animated figures are then shown interacting in an animated scene with a dark background. The scene fades to black. A small, glowing blue animated figure is shown in a dark space. The animated blue figure and then a small, glowing red figure are then shown. The animated figure with the yellow shirt is then shown at a table. Then, a shot of the first animated figure walking into a dark room with a bed. The door is then closed. The scene fades to black. Then, an animated blue figure with glasses and a sweater, is shown looking frightened. The animated blue figure is then shown falling through a dark space with a blurred background, and is shown again, and then again. The blue animated figure is shown at the edge of a cliff. Then, there is a close up of a table, and a blurry view of the first animated figure in the background, in a room with a kitchen. The first animated figure is then shown looking downward and frowning. Then, the blue animated figure is shown, working with buttons and lights on a platform. Then, there is a close up of the blue animated figure working with the platform. The blue animated figure is seen pushing down a lever on the platform. The first animated figure is then shown in dark light, leaning against a window with a street view. The first animated figure is shown in a close up. Then a blue light is shown on a black background. The shot changes to a dark space. A small, glowing figure then appears, and then begins to move towards a larger group. The animated blue figure is then shown, looking worried. The blue animated figure is then shown, and appears to be holding two spheres. One is a pale blue, and the other is dark. The blue animated figure hugs the two spheres. The blue animated figure\u2019s head is turned down, with its eyes closed. A golden orb, with many figures is shown, as the animated blue figure is shown in a downward position. The animated blue figure\u2019s head is turned down, with its eyes closed.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will the little girl experience sadness again after accepting it?",
+ "answer": "B",
+ "candidates": [
+ "A. No, she won't.",
+ "B. Yes, she will.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "skAGZRUx": {
+ "video_id": "skAGZRUx",
+ "video_duration": "130s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a close-up of a shirtless person with glasses smiling. Text on the screen reads \"during the 'STREAMING UNTIL I HIT 300 SUBS!!!' live stream...\" The camera then moves to show the person from the chest up. It then shifts to a full view of a pair of closed white doors. The scene then changes to show a digital screen with the words \"PICK MAP READY UP ITEMS START\" in front of a dark background. The scene changes again to display the words \"MAIN MENU LOBBY\" along with a list of names and the word \"KICK.\" The screen changes again, now showing a player avatar in a white and red outfit standing in front of the previous screen. The scene changes again to show a \"Cups: 311\" text in a white box in the top left corner of the screen, along with the previous \"MAIN MENU LOBBY\" screen and two player avatars in holiday-themed outfits. The view changes again to a dimly lit scene of the \u201cMAIN MENU LOBBY\u201d screen with text that is unreadable. This screen changes again to a different view of the \"MAIN MENU LOBBY\" screen along with a player avatar dressed in white and red, and another dressed in red and green. The scene then changes to a different view of the \"MAIN MENU LOBBY\" screen with multiple player avatars dressed in holiday-themed outfits. The scene then changes to show two player avatars standing together in a dimly lit room. The view then changes again to show one player avatar walking in a room with red couches. This avatar is then shown holding two green paw-print cards. The screen changes to black, and then shows the same image but in a third-person perspective. The words \"Part 1: The Beginning (Paranormica)\" appear on the screen and then transition to a different view of the \u201cMAIN MENU LOBBY\u201d screen and the player avatars. Then it shows the same room and avatars but with a third-person perspective, and then it changes to show a first-person perspective where the camera quickly spins around. The screen then changes to show a dark room with a rainbow-colored television and 5 player avatars standing together. The view then changes to a single player avatar, in red and green, walking through a room with red couches. This avatar holds two paw-print cards and rotates them in front of the screen. The screen changes to black, then back to gameplay. The camera moves through a dark area, passing multiple players, then switches to a first-person view in a hallway where a player avatar is standing next to a wall. The view then shows the avatars standing outside with a dark building in the background. The scene changes again, showing a white screen with text on the left side and a player avatar in red in the center. The camera then moves again, showing a first-person perspective in the dark. The scene then shows a first-person perspective with a white tripod in the dark, then showing a box that says \"MrNeroYT\" in the corner. The video then shows a view of a player in silhouette, in front of a white wall with the player's username in the corner, and the camera switches between different angles of the silhouette and wall. Then the screen switches again to show the first-person perspective with a different set of player avatars, then showing one of the player avatars with a red and yellow outfit. The camera perspective changes to a player avatar moving through a room while being followed by the camera. The scene then shows an avatar in red and green standing next to a broken wall. The perspective shifts again, showing a dark view with two avatars on either side of a broken wall. The scene changes again, with the camera showing another avatar in red, walking through a dark area. Then, an avatar is shown holding a metallic cage, and then moves the cage around. The screen goes black for a second, then back to the game. The perspective shows the camera, on a tripod, then the screen shows the word \"MrNeroYT\" again. Then the scene changes to show more avatars in a brightly lit room, with a first-person perspective. The view changes again to a player avatar in red standing next to an area with graffiti. The camera then moves forward, showing more of the room and graffiti. The camera then moves forward again, showing two of the player avatars in front of a broken wall with a hole in it. The perspective changes once more, showing three player avatars and a broken wall. A third-person view shows player avatars standing together, some holding swords. Then, the scene changes to a first-person perspective, showing the word \"Nero doing math\" on the screen, and shows multiple player avatars standing together. The first-person perspective shows two player avatars in front of a wooden wall. Another first person perspective shows player avatars interacting with each other and their weapons. The camera then changes again, showing the word \u201cJulian\u201d in the corner, and then switches to a dark hallway, and shows a first person perspective of the user walking through the hallway. The video then goes completely dark, showing some player names in the corners, and text chat in the left corner of the screen, along with a percentage of 6% in the corner. The scene goes completely dark again, the percentage goes down to 0%, and then the screen shows some player avatars walking in the dark. The scene goes completely dark again and the words \u201cMrNeroYT\u201d flash on the screen. This pattern repeats a few times, with some brief moments of gameplay shown in the dark. The video ends on a completely black screen with the text \u201cMrNeroYT\u201d and 0% visible.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What might the weather conditions be like when the game environment suddenly turns dark?",
+ "answer": "A",
+ "candidates": [
+ "A. Clear skies.",
+ "B. Thunderstorm.",
+ "C. Rainy weather.",
+ "D. Overcast skies."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What are the main emotions experienced by players in the latter part of the game?",
+ "answer": "A",
+ "candidates": [
+ "A. Joyful and tense.",
+ "B. Peaceful and relaxed.",
+ "C. Joyful and relaxed.",
+ "D. Fearful and tense."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did the player in the video survive until the end of the game?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. Unclear."
+ ]
+ }
+ },
+ "mDhhXBXn": {
+ "video_id": "mDhhXBXn",
+ "video_duration": "398s",
+ "duration": "6-8min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a title card showing the name \u201cTHEKENNYBROWN1\u201d and the social media handle \u201c@IAMKENNYBROWN\u201d on a dark background with a lens flare effect. The video then cuts to a room with taupe walls, a brown couch to the left, a desk with computer equipment in the center against the window, and a Michael Jackson poster framed on the wall. A large white reflector screen stands near the poster. A man sits on the couch with a laptop. Another man sits at the desk, wearing a grey beanie and black shirt, and adjusts a microphone. A third man, wearing a red backwards baseball cap, light-wash sleeveless denim jacket, white t-shirt, and red pants, faces away from the camera. A teddy bear sits on the desk. Cutouts of musical notes are placed on the upper wall near the window. The man in the red cap faces and walks toward the camera. He switches the red cap for a blue one and begins to rap, gesturing with his hands and occasionally touching his chest or adjusting the strap of his bag. He wears a ring and a watch on his left hand. The other two men remain in the background. One continues to adjust the microphone while the other looks at his phone. The man rapping occasionally turns his head to the side, looks at the other men, or points toward them, then faces the camera again, continuing to rap and gesture along with the music. He switches between the blue cap and the red cap throughout his performance. At one point, he lifts his shirt to reveal his abs. He also occasionally claps along with the beat. At the end of the rapping, he walks back to the desk and puts the red baseball cap back on. The video ends with a screen that promotes the creator\u2019s other social media links and a previous video. The previous video shows a person with a long dark wig looking to the side and apparently whistling, though the audio is cut off.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man in the video wear his hat backwards while singing?",
+ "answer": "C",
+ "candidates": [
+ "A. Four times.",
+ "B. Three times.",
+ "C. Twice.",
+ "D. Once."
+ ]
+ }
+ },
+ "JVOffdac": {
+ "video_id": "JVOffdac",
+ "video_duration": "49s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Software",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a logo consisting of the number \"1\" colored orange and the words \"PAGE PROFITS\" stacked below in white and orange, all against a black background. A yellow curved element emerges on the right of the screen, revealing a cartoon man wearing a red shirt and dark pants, holding a laptop. The man and laptop remain, as the background fades to orange, and a white, radial graphic appears. This then transforms into a series of web page mockups, as dollar bills float around the man and text reads \"MAKING $300+ PER DAY.\" The scene then fades to an orange background, and text appears stating, \"BUILT IN TRAFFIC SOURCE,\" followed by the silhouette of a group of people. This transitions back to a black background, as text and web page mockups appear with \"MONEY PAGE\" text below. A group of white and yellow silhouettes of people appear on the right, and \"BUILT IN FREE TRAFFIC\" appears above them, along with the text, \"PLUG IN YOUR AFFILIATE ID.\" This scene is followed by \"STEP 1\", followed by \"PLUG IN YOUR AFFILIATE ID,\" \"STEP 2\" and \"ACTIVATE THE BUILT IN TRAFFIC,\" and \"STEP 3\" and \"THERE IS NO STEP 3.\" The cartoon man reappears, gesturing with open hands as the text \"WE CREATED '1 Page Profits'\" appears. The video then transitions to show the cartoon man at a desk, with computer, chair, and other objects, such as a cup and small potted plant. The man is sitting at the desk, as two circles stating \"AGE\" and \"TECHNICAL EXPERIENCE\" appear, as the man gestures with open hands. Finally, the cartoon man and \"1 Page Profits\" reappear as the background fades away.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the man's position in the video relative to the coffee?",
+ "answer": "C",
+ "candidates": [
+ "A. In front of the coffee.",
+ "B. To the right of the coffee.",
+ "C. To the left of the coffee.",
+ "D. Behind the coffee."
+ ]
+ }
+ },
+ "pkapMNjo": {
+ "video_id": "pkapMNjo",
+ "video_duration": "60s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Historical Analysis",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A sepia-toned drawing of a woman wearing glasses and a hat is shown. In the background, a horse-drawn carriage passes by. The image transitions to a drawing of a woman holding a Votes for Women sign, addressing a crowd of women holding signs with various slogans including \u201cDeeds Not Words\u201d. A police officer stands to the side, observing the gathering. Then, the image changes to another scene of a woman holding a sign and addressing a crowd with her back to the viewer. The image transitions to a close-up view of the crowd, focusing on several women holding placards and police officers standing nearby. Next, a black-and-white drawing depicting a crowd of police officers and women appears. A final black-and-white drawing shows a chaotic scene with people, mostly women, in hats and some police officers, jostling for position.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Did Miss Davidson ultimately become a prisoner?",
+ "answer": "D",
+ "candidates": [
+ "A. No, she did not.",
+ "B. None of the above.",
+ "C. It is not known.",
+ "D. Yes, she did."
+ ]
+ }
+ },
+ "QMaVZkmC": {
+ "video_id": "QMaVZkmC",
+ "video_duration": "263s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing an orange shirt and a military-style vest is in front of an orange wall, removing various items from boxes. The person unpacks, examines, and assembles the items. Some items include multiple, small objects in clear plastic bags. Some of the items are guns with black frames and orange tips. There is also a long black object. A piece of white paper with a handwritten drawing is mounted to the wall. A poster with text is also mounted to the wall. A tan bag is visible next to the person.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional transitions does the man in the video go through?",
+ "answer": "C",
+ "candidates": [
+ "A. From curiosity to excitement and then to shock.",
+ "B. From curiosity to shock and then to excitement.",
+ "C. From excitement to curiosity and then to shock.",
+ "D. From shock to curiosity and then to excitement."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the man in the video likely to do next?",
+ "answer": "C",
+ "candidates": [
+ "A. Test the accuracy.",
+ "B. Assemble all the parts.",
+ "C. Sell a gun.",
+ "D. Open another box."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What material is the quick reload device in the video made from?",
+ "answer": "C",
+ "candidates": [
+ "A. Plastic.",
+ "B. Polymer.",
+ "C. Metal.",
+ "D. High molecular material."
+ ]
+ }
+ },
+ "ObdqVMLc": {
+ "video_id": "ObdqVMLc",
+ "video_duration": "163s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with short brown hair and a beard is speaking into a microphone. He is wearing a black blazer with white trim over a patterned shirt, and has a red and black patterned pocket square. He gestures with his hands as he talks. The background includes red curtains, a small black stool, and two large cutout shapes of handguns. The video switches between color and black and white footage. An audience is seated in the foreground, facing him. At one point, the man raises his hand. He points with one finger. At other points, he has both hands in front of him gesturing.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What will the man's age in the video be next year?",
+ "answer": "D",
+ "candidates": [
+ "A. 35 years old.",
+ "B. 40 years old.",
+ "C. 50 years old.",
+ "D. 42 years old."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What skill does Neo possess in the movie *The Matrix* as mentioned in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Instant teleportation.",
+ "B. Precognition.",
+ "C. Photographic memory.",
+ "D. Superhuman abilities."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where on the stage does the stool appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Exactly in the center.",
+ "B. Slightly right of center.",
+ "C. At the edge of the stage.",
+ "D. Slightly left of center."
+ ]
+ }
+ },
+ "MJeOZpCt": {
+ "video_id": "MJeOZpCt",
+ "video_duration": "196s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Basketball",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows two people in a basketball court. The court has a yellow floor and a basketball hoop. Both people are wearing black shirts and shorts. One person has blue and red shorts and white shoes and the other person has black shorts and black shoes. The video begins with a close up of someone's hand and then focuses on the person with black shorts holding a basketball, who then smiles at the camera. The video then switches to a view of the whole court with the two people on it. They face each other and begin to practice basketball, passing the ball back and forth and shooting the ball at the hoop. Throughout this, the numbers 1 through 8 appear in the bottom left corner. The video shows several passes and shots from each person. They then begin talking and smiling. Then, the person in black shorts and black shoes runs around the court.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotion of the bald man when he shoots the basketball in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Angry.",
+ "B. Surprised.",
+ "C. Nervous.",
+ "D. Excited."
+ ]
+ }
+ },
+ "uEfdyFzK": {
+ "video_id": "uEfdyFzK",
+ "video_duration": "68s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a title card that reads \"ROLLING WITH THE BIG DOGS\". The background is a blurry, brown wood-grain pattern. Following the title card, there are various scenes. A woman with sunglasses adjusts the hair of a man with a mustache. Two men stand in front of bowling lanes, one wearing a blue hat and a leopard print jacket, and the other a red baseball cap. A woman with short gray hair and a man with red hair stand next to each other. There are four people in front of bowling lanes, including the two men from the previous scene and one woman in a light-blue shirt and jeans. There is a classroom scene where various young children and two adult men are visible. There are three men in black and white shirts and the middle man wearing a gray hat. Two men are looking at a computer monitor. Four men, all wearing white shirts, are huddling in a circle with their arms on each other. A hand places a bowling ball into a ball return machine. A man wearing a gray shirt throws a red bowling ball at bowling pins, while another man watches. The bowling ball hits the pins. A man is using a brush to clean clothes hanging on a rail. A man in a hat uses a machine on an article of clothing. There are five people in a bowling alley. One man is wearing a black shirt with a red bowling ball. A man is standing in the dark. A man with dreadlocks peeks out from behind a curtain. Two men stand in front of bowling lanes. One is wearing an orange shirt and one is wearing a yellow plaid jacket. There is another title card that reads \"THEY'RE GONNA NEED SOME BALLS\". A woman in a light blue shirt is bowling a ball with people in the background. A man wearing a white shirt and black jacket gestures with his arm in front of a crowd. A man in a blue shirt and a red kilt is stretching next to a bowling alley. A man with dark hair and a mustache is holding his mustache with his fingers. A man is holding a bowling ball with red, yellow and black colors on it. Two men, both wearing white shirts, are touching elbows. Three men wearing black shirts stand in front of a bowling alley while a man with a mustache is talking. Various people at a bowling alley are cheering, including a man with a mustache. Three men are sitting in front of a bowling alley and laughing. Two men are looking at merchandise in a shop. One has a plaid cap and the other has a black cap. One man is looking over merchandise while wearing a wig and mustache. There is a scene where one man is in the foreground and doing pushups, with other men in the background. A man is holding a jump rope in a yard. Three men are standing in a room and one man is gesturing towards them. Two men in blue shirts are fist-bumping in a bowling alley. There are three people in a kitchen, with one wearing a red apron. One man with dreadlocks is in a dark car. A man in a red shirt is driving. A man with a mustache and a hat is running down a street. A man with a black and white bandana is looking at something. A man in a blue shirt is talking with a woman. A man in a light green and white hat and a light green and blue jacket with rhinestones is talking to a woman. A man in a black and white shirt is pumping his fist with a crowd behind him. The word MURS appears over a man in a black shirt and he is smiling. The word DEE-1 appears over a man wearing a gray head wrap, with dreadlocks and a red apron. The word TABI BONNEY appears over a man holding a blue bowling ball and a hat. The words ACADEMY AWARD\u00ae NOMINEE ERIC ROBERTS appear over a man with gray hair wearing a blue polo. Bowling pins are knocked down. A man in a white shirt is staring intently while holding a bowling ball with a crowd in the background. A man in a hat is talking to the camera. Four men are facing forward wearing white polo shirts with one man with dreadlocks holding a bowling ball. There is a scene where a man in a blue shirt is rolling while people cheer in the background. A man wearing a white shirt is looking to the right. The video ends with the words \"MAUL DOGS OF BEWARE\" on a golden circular logo. The words \"COMING SOON\" appear underneath this logo, followed by a list of the actors' names.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the sound of bowling pins being knocked down occur in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Five.",
+ "B. Six.",
+ "C. Seven.",
+ "D. Four."
+ ]
+ }
+ },
+ "ipBErquQ": {
+ "video_id": "ipBErquQ",
+ "video_duration": "98s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Baseball",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a baseball game. A batter wearing a white jersey with \"ASTROS\" printed across the chest, and the number 11 on the right side is shown holding a wooden bat. He swings at a pitch. The pitcher wears a grey jersey with \u201cChicago\u201d written across the chest and the number 43. He throws a pitch towards the batter. At one point in the video, the batter and the pitcher are standing facing each other at the baseball field. The batter stands at the plate. There is a catcher and an umpire behind the batter. The catcher wears a dark uniform. In the background are other people sitting in the bleachers. There is a graphic on the screen in the video showing the pitcher's name, and other baseball statistics such as \u201c3 H, 1-8\u201d. Another graphic is shown showing the batter\u2019s name and the words \"CAREER PINCH-HITTING .310 AVG, 4 HR\". There are banners in the background that read \"ASTROS.COM\" and \"THIS BUD'S FOR YOU\". The pitcher walks over and hugs the catcher. The pitcher and catcher are shown shaking their gloves. There is a zoomed in view of the catcher\u2019s mask. Then the game continues with the pitcher throwing another pitch and the batter swings.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the video feature a hitting sound?",
+ "answer": "A",
+ "candidates": [
+ "A. Five times.",
+ "B. Three times.",
+ "C. Six times.",
+ "D. Four times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video do the catcher and pitcher change their strategies?",
+ "answer": "D",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. In the middle of the video.",
+ "C. No strategy change occurs.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "zbTQBYzD": {
+ "video_id": "zbTQBYzD",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A black and white marine animal swims in dark water; a boat railing is visible at the bottom. Bubbles rise from the animal. The video transitions to a view from the deck of a boat. A white railing is visible in the frame. The video cuts back to an animal in the water, then back to the deck of the boat, and the video darkens to night. There is a shot from the boat deck showing an opening in the deck, inside which water and some objects are visible. A hand reaches into the deck opening. The view shifts to inside the deck opening, where objects are being moved around. There are items in orange, white, and yellow visible inside the opening. The view is back on deck, and more objects are moved. The video shows an exterior view of the boat; it is dark and lights are visible on the deck. There is another boat nearby and the two boats are connected by a line. A man wearing a helmet is seen pulling on the line. A boat is shown; it has a sail and lights are visible. The man wearing a helmet continues to pull on the line. There is a final scene where a man is sitting on the deck of a boat. There is a stack of life preservers next to him.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "How were the crew members in the Strait of Gibraltar rescued as shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Smart emergency equipment was used, and rescue teams provided assistance.",
+ "B. A distress signal was sent, and rescue teams provided assistance.",
+ "C. Nearby vessels provided assistance, and rescue teams offered help.",
+ "D. A distress signal was sent, and nearby vessels provided assistance."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What type of noise is heard in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The sound of orcas colliding with the boat.",
+ "B. The sound of the crew using the intercom.",
+ "C. The sound of the crew tapping on metal.",
+ "D. The sound of the boat starting its engine."
+ ]
+ }
+ },
+ "rhYqHLMS": {
+ "video_id": "rhYqHLMS",
+ "video_duration": "73s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen displaying the white text \"MASTERMIND\" and \"HD\". It then transitions to a screen showing a computer terminal with the words \"CENTRAL INTELLIGENCE AGENCY\" at the top, with multiple fields displayed. The computer screen shows someone inputting code. Next, the video shows a tape reel with some writing on it, a floppy disc with writing and a red stain, and a piece of paper with a code on it. After this, a newspaper with the title \"Nationwide Observer\" is shown. Then, the tape reel appears again. Following this, the words \"EVIDENCE 01: 'A CODED MESSAGE'\" appear, and the scene shifts to a man on a rooftop. The man's speech bubbles appear with the phrases \"Tell us where Arash is and you'll live\", \"You messed with the wrong people\", and \"Where's Arash?\" These speech bubbles are then followed by the phrases \"I... I just handle the money... I have no idea where Arash is.\", and \"I don't think you understand the situation...\" The man is then approached by another individual, and then he says \"You Americans have rules!! You have rules!\" and \"You took hostages. The rules changed\", before the other person makes a threatening statement, and then the man reveals \"W-wait! He's in Turkey! He's meeting someone at Trabzon Airfield tomorrow night!\" and \"I swear I swear I don't know!\", and \"They only communicate with coded messages.\" The scene ends with an \"Operation Roof Climb\" message.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional shift does the man with the bleeding nose display after being held captive in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. He shifts from calm to anger.",
+ "B. He shifts from unease to fear.",
+ "C. He shifts from unease to depression.",
+ "D. He shifts from calm to fear."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many pieces of evidence are required to decrypt the disk in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four pieces.",
+ "B. One piece.",
+ "C. Three pieces.",
+ "D. Two pieces."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "Into which stages is the process of decrypting floppy disks in the video divided?",
+ "answer": "B",
+ "candidates": [
+ "A. A coded message, the front page of an observer newspaper, a radio station broadcast, collecting evidence, decrypting disk codes, decrypting passwords.",
+ "B. Collecting evidence, decrypting disk codes, decrypting passwords, a coded message, the front page of the Observer newspaper, a radio station broadcast.",
+ "C. Decrypting passwords, a coded message, collecting evidence, decrypting disk codes, the front page of an observer newspaper, a radio station broadcast.",
+ "D. A radio station broadcast, decrypting passwords, a coded message, collecting evidence, decrypting disk codes, the front page of the Observer newspaper."
+ ]
+ }
+ },
+ "OYHnbxhW": {
+ "video_id": "OYHnbxhW",
+ "video_duration": "151s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A reddish slab of meat is cooking over a bed of coals inside what appears to be a large smoker. The meat is brushed with a dark liquid, and then a dark-handled spatula is used to lift and transfer the meat to a metal tray. The tray is placed on a metal grate, and the meat is then taken off of the tray and placed on a white surface. The meat is sliced and then separated into pieces. A similar process occurs with another piece of meat. A man wearing a blue shirt, an apron and a light-blue bracelet is shown placing the first batch of sliced meat onto a serving tray. A different man is shown lifting and then placing a piece of meat onto a metal grate in a smoker. The door of the smoker is closed. Later, a man with a camouflage hat is shown holding and unwrapping a piece of foil to reveal a piece of meat. He puts the meat on a metal tray. The man then returns to the first piece of meat on the white surface and brushes the pieces with a dark liquid. Another man in a black shirt and a baseball cap gives a thumbs-up sign. Another man wearing a navy-blue shirt is shown eating a piece of meat, while holding a foil package. A group of people can be seen waving at the camera. Another group of people are shown posing and raising their hands. A group of people are seen sitting underneath a white tent. Another group of people are shown waving their hands while under a red tent. A man in a blue shirt gives a thumbs-up sign. Another man waves and smiles at the camera. A group of people are seen under a red tent, waving their arms and hands. Two people walk by. A group of people is shown standing near a doorway. Another man is shown talking to another person. A man in a blue shirt is shown giving a high-five to another person. Next, a man is shown slicing a thick cut of meat with a knife. Another man is shown slicing a similar cut of meat. The man is shown placing pieces of meat in a container, by hand. 
Two men are seen working together, one placing meat in a container while the other pours a liquid on top. The two men are seen handing off a container of meat. A man with a red cap walks away while carrying a container. Several people are seen walking in a line. A group of people are shown entering a building. A group of three people can be seen talking together. A man in a blue shirt walks up to a man in a red shirt and high-fives him.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who said, \"What we need to do now is make a phone call\"?",
+ "answer": "B",
+ "candidates": [
+ "A. A man with a white beard, wearing a white baseball cap and a black apron.",
+ "B. A man with a white beard, wearing a brown baseball cap and a black apron.",
+ "C. A man with a white beard, wearing a brown baseball cap and a brown apron.",
+ "D. A man with a black beard, wearing a brown baseball cap and a black apron."
+ ]
+ }
+ },
+ "seSggjOP": {
+ "video_id": "seSggjOP",
+ "video_duration": "57s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A soccer match is being shown. At first the view is a wide shot of the field, and players wearing black and white, and others wearing green and white are shown on the field. Some players are near the goal. The camera zooms in and a goal is made, the camera follows a man wearing black with the number 7, as he celebrates. The camera shows a group of men on the sidelines celebrating. The camera changes to a shot of a crowd cheering and jumping. The camera returns to the field and focuses on a man wearing black with a red design, who is receiving a hug from his teammates. The camera shifts back to a view of the game and then to shots of players at the corners of the field. The camera then pans to a wide shot of the field and players. The camera shows a player in black kicking the ball and his teammates surrounding the goal. The camera then follows a player in black and a goal is scored. The camera returns to the wide shot.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video, how many times did the team wearing black short sleeves pass the ball?",
+ "answer": "A",
+ "candidates": [
+ "A. One time.",
+ "B. Three times.",
+ "C. Two times.",
+ "D. Zero times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "Which team scored in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Shelbourne Football Club.",
+ "B. Dundalk Football Club.",
+ "C. Drogheda United Football Club.",
+ "D. Dundalk Football Club (duplicate, replace if intended)."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Which type of set piece is shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A free kick.",
+ "B. A corner kick.",
+ "C. A penalty kick.",
+ "D. A throw-in."
+ ]
+ }
+ },
+ "umpKoxkH": {
+ "video_id": "umpKoxkH",
+ "video_duration": "17s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video is split into two sections, one above the other. In the top section, a man with a beard and glasses is seen standing at a podium with two microphones. He is wearing a blue suit jacket over a light blue shirt. In the bottom section of the video, a person with glasses, curly hair, and a beaded headband is holding a microphone. They are wearing a black top, and have a bracelet. There are people standing around them. The text \"ASK YOU RIGHT NOW. OKAY,\" \"BUT YOU KNOW THAT YOU'RE A,\" \"WOMAN. YES. WHAT IS THAT?\" \"SO GENDER IS A SOCIAL THING,\" and \"WHAT IS A WOMAN?\" appear on screen above the lower section. After this, the video changes to a wide shot of a large room filled with people seated in chairs. The text \"WHAT IS A WOMAN? THE WORD,\" \"WOMAN. DEFINE IN THE DICTIONARY\" appears on the screen. The entire video also has a circular watermark with a symbol and the text \"PROVEN MENTOR\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotion is the man in the blue suit showing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Calmness.",
+ "B. Excitement.",
+ "C. Dullness.",
+ "D. Pain."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the emotion conveyed in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Indifferent.",
+ "B. Passionate.",
+ "C. Composed.",
+ "D. Dull."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What are the two men discussing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The distinction between women and men.",
+ "B. The definition of femininity.",
+ "C. The origins of women.",
+ "D. The way to write about women."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times is \"what is a woman\" mentioned in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Five.",
+ "B. Three.",
+ "C. Two.",
+ "D. Four."
+ ]
+ }
+ },
+ "QbAFkYAu": {
+ "video_id": "QbAFkYAu",
+ "video_duration": "133s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a blurry, fast moving, time-lapse-style shot, with what appears to be car headlights on a road at night. The video then transitions to a shot of a man in a tan police uniform, standing outside at night, in front of a building and parked police vehicle. This shot is then repeated multiple times, in a mosaic pattern on the screen. The video then goes back to a single shot of the first man, as before. The man is then briefly replaced by a shot of a second man in a similar tan uniform, also standing outside at night, near vehicles and a building. This second man is also then shown in a repeating mosaic pattern. The video then cycles back and forth between shots of each man, while also including shots of the police vehicle, a white and black SUV, with the word \"POLICE\" in white lettering on the side. The vehicle is shown parked in front of a building, and again at various points with a flashing light on top. The video ends with the first man shown once again, before a final shot of the police vehicle. There are words overlayed over the video throughout, in various colors. There is also a sparkle effect at certain points.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who said \"It's reasonable suspicion that there's a crime\" in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A white-haired police officer with one hand on his hip and one finger pointing at the car.",
+ "B. A Black police officer with one hand on his hip and one finger pointing at the car.",
+ "C. A black-haired police officer with one hand on his hip and one finger pointing at the car.",
+ "D. A black-haired police officer standing with both hands on his hips."
+ ]
+ }
+ },
+ "jUNtvqPm": {
+ "video_id": "jUNtvqPm",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a grey shirt and dark shorts is standing next to weight machine. A second person is lying on the bench of the weight machine with their legs straight in the air. The second person has orange and grey shoes on their feet. The weight machine has a bar across it. The second person's feet are positioned with the soles of their shoes on the bar. The first person uses both of their hands to help stabilize the bar, and then moves their hands closer together. The first person's hands move toward the middle of the bar, and the second person lowers their legs to where they are in a more horizontal position. The first person is wearing a grey shirt and black shorts. The second person is wearing a black tank top and grey shorts.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many barbell leg raises did the man wearing a black vest perform in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Thirteen.",
+ "B. Fifteen.",
+ "C. Twelve.",
+ "D. Fourteen."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the purpose of the man wearing a gray short-sleeved shirt in the video pressing both hands on the barbell?",
+ "answer": "B",
+ "candidates": [
+ "A. To increase the weight on the left arm for the man wearing a black vest.",
+ "B. To increase the weight on the right leg for the man wearing a black vest.",
+ "C. To increase the weight on the left leg for the man wearing a black vest.",
+ "D. To increase the weight on the right arm for the man wearing a black vest."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color short-sleeved shirt is the cameraman wearing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. White.",
+ "B. Blue.",
+ "C. Gray.",
+ "D. Black."
+ ]
+ }
+ },
+ "xGSanuth": {
+ "video_id": "xGSanuth",
+ "video_duration": "119s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with an aerial view of a large mansion at twilight. A bright light shines down on the mansion's back patio and pool area. The camera then pans to show the mansion at night, surrounded by trees and gardens illuminated by small lights. Next, the video shows the front of the mansion during the day, featuring a large staircase leading to the entrance, flanked by manicured lawns and trees.The video then cuts to the interior, showcasing a grand foyer with a dual staircase and a large round table in the center. A crystal chandelier hangs above. The next scene is a light green bedroom with a four-poster bed and matching furniture. This is followed by a hallway with high arched ceilings supported by columns.The video then cuts to a blonde woman in a pink sweater and black pants against a purple background. She talks, gesturing with her hands, but no audio is available.The video cuts again to a large, round living room with high windows draped in orange and sheer curtains. A crystal chandelier hangs from the center of the ceiling. The next scene is a two-story living room with exposed wooden beams and a large chandelier. Floor-to-ceiling windows offer views of the surrounding greenery. A staircase to the second floor is visible in the background. This is followed by a library with red walls and a black and white zebra-print rug. Black bookshelves line the walls, and a spiral staircase is in the corner. The video then cuts to a large kitchen with wooden cabinets and marble countertops. Two crystal chandeliers illuminate the space. Next is a study with dark wood furniture, built-in bookshelves, and a table with a world map design.The video then shows a game room with a pool table, a bar, and a foosball table. This is followed by a dimly lit, medieval-style sitting room with a vaulted ceiling supported by stone columns. Next, a formal dining room with a high ceiling is shown, featuring a long wooden table with chairs and a large chandelier. 
A mural adorns one wall. Then, a home theater with brown reclining chairs and a large screen is displayed. This is followed by a small, indoor swimming pool in a room with stone walls and arches. Another dining room is shown, featuring dark furniture and red leather chairs. The blonde woman appears again, talking against the purple background.The video then cuts back to the exterior, showing a large green lawn with a gazebo and two lion statues. This is followed by an aerial view of the mansion during the day, with a fountain in the circular driveway. The video then cuts to a large swimming pool with an infinity edge. Next, a wooden sauna with towels neatly folded on the benches is shown. This is followed by a tennis court on the property. A gym with various exercise equipment is shown briefly, then cut to another view of the swimming pool and patio area, with lounge chairs and umbrellas set up. Finally, an aerial view shows the entire mansion property, including the pool, tennis court, gardens, and surrounding trees.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "Where is the vase located in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It is next to the sofa.",
+ "B. It is next to the television.",
+ "C. It is next to the curtains.",
+ "D. It is next to the dining table."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the primary construction material of the kitchen shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Stone.",
+ "B. Plastic.",
+ "C. Stainless steel.",
+ "D. Wood."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many outdoor scenes are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Five.",
+ "C. Four.",
+ "D. Two."
+ ]
+ }
+ },
+ "bywZFvPE": {
+ "video_id": "bywZFvPE",
+ "video_duration": "71s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A woman with long brown hair and a black headband is centered in front of a white background with a gold frame. The camera then shifts to an image of a globe focused on the region of South Asia. The globe zooms into an area including Vietnam and Cambodia. Words appear, \"FPT University, Green Space.\" The camera zooms into an architectural rendering of buildings that are covered in green trees. There are people pictured in various places around the buildings and a small pond is also shown. A woman with brown hair and a black headband is then shown again, this time with the text overlaying \"Fresh, Moves, Mobile, Markets\". A colorful bus is shown with the words \"Growing Power\", \"Fresh Moves\" written on the side. The camera then cuts to a picture of a man inside the bus looking at bins of fruit. Another picture shows the inside of the bus with two people at a cash register and shelves of fruits and vegetables. An architectural rendering shows a schematic of the interior of the bus with a check out station, shelves for produce and refrigerators.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual behavior does the individual exhibit on the bus in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Singing on the bus.",
+ "B. Shopping on the bus.",
+ "C. Buying vegetables on the bus.",
+ "D. Dancing on the bus."
+ ]
+ }
+ },
+ "AOOFZaPS": {
+ "video_id": "AOOFZaPS",
+ "video_duration": "264s",
+ "duration": "4-6min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with long blonde hair, wearing a sleeveless black top and long black pants, stands on a stage with a man in a purple suit. They are both holding microphones. The stage has a circular design with white concentric circles and blue lights. There are musical instruments and other performers on the stage behind them. A large screen displays their performance to the audience. The background behind the performers features a large circular structure with lights and patterns, and the words \"Peter Cetera\" are displayed.The camera pans to the audience, showing a large crowd in tiered seating. Blue lights illuminate the audience and stage. The camera cuts back to the performers, and then back to the audience, showing different angles and perspectives. The camera focuses on some audience members singing along and enjoying the performance. The camera cuts back to the performers. The woman is singing while the man stands beside her, holding a microphone and occasionally joining in the song.The camera cuts back to the audience, focusing on individual audience members singing along and reacting emotionally. The camera then focuses on the two performers as they sing together. The background changes to a starry night effect with small white particles falling. The man in the purple suit sings a solo.The camera pans across the audience again. Then, it's back to the stage where the duet continues. Confetti falls from the circular structure above. The camera returns to audience members as they watch and sing along. The duet continues, and the performers hold hands at one point. The man sings solo again. The duet resumes. The woman puts her hand on the man's shoulder. The man extends his hand out towards the audience. The woman laughs while singing. The camera shows several close-ups of audience members. Confetti falls heavily now. The duet ends with a hug, and the woman pats the man's hair. 
She blows kisses to the audience, bows, and the video ends with a logo.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times did applause occur?",
+ "answer": "C",
+ "candidates": [
+ "A. Three times.",
+ "B. Five times.",
+ "C. Four times.",
+ "D. Two times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action did the singer perform as interaction at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A high-five.",
+ "B. A handshake.",
+ "C. A kiss.",
+ "D. A hug."
+ ]
+ }
+ },
+ "fNLgOPsN": {
+ "video_id": "fNLgOPsN",
+ "video_duration": "63s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a rotating view of the southern hemisphere, centered on Antarctica. A color scale indicates ozone levels, ranging from 100 to 500 Dobson units, with blue representing higher ozone and red/orange representing lower ozone. The years 1980 through 2000 are displayed sequentially, and the ozone hole over Antarctica is shown growing larger and deeper (redder) over time.A brief animated sequence shows molecules interacting: two joined blue spheres (oxygen molecules) and a single green sphere (chlorine atom). The chlorine atom separates one oxygen atom from the molecule, bonding with it and leaving a single oxygen atom.After the molecular animation, the video displays twelve snapshots of the Antarctic ozone hole from 1988 to 2000, arranged in a grid. The ozone hole's size and depth fluctuate from year to year, but the overall trend is toward a larger and deeper hole by 2000. The video then shows the ozone hole in the year 2000, followed by a comparison of the ozone hole in 2000 and 2017 side-by-side. The 2017 ozone hole is visibly smaller than the 2000 hole.Finally, a grid of twelve snapshots of the Antarctic ozone hole appears, depicting the years 2005 through 2016. These snapshots show year-to-year variations, but the hole generally appears smaller than in the previous sequences. The video concludes with an extended view of the ozone hole in 2017, followed by a view of the ozone hole in 2018, which is shown to be smaller than in previous years.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the emotional progression depicted in this video?",
+ "answer": "A",
+ "candidates": [
+ "A. The video shifts from worry to optimism.",
+ "B. The video shifts from optimism to concern.",
+ "C. The video shifts from pain to concern.",
+ "D. The video shifts from pain to optimism."
+ ]
+ }
+ },
+ "VWeDZbED": {
+ "video_id": "VWeDZbED",
+ "video_duration": "485s",
+ "duration": ">8min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man wearing a wide-brimmed hat smiles at the camera. The camera focuses on a blue-green body of water, where a red and white fishing rod is visible. The man holds up a package of Rapala Husky Jerk fishing lures. An aerial view of the pond shows the surrounding trees and grassy areas. Pink flowers grow by the edge of the water near a small spillway. The man casts the fishing line into the water. He pulls a fish into a net. He removes a small silver and green lure from the fish's mouth. He walks through a wooded area, towards a large, multi-story building. He holds up a large largemouth bass that he has caught. He lowers the fish back into the water. The man now wears sunglasses and stands near a bridge over a small waterway, facing the camera. A golf cart drives past on a paved path. The words \u201cNO FISHING\u201d appear on the screen, superimposed over the image of the fishing net on the ground.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How long did the man wearing sunglasses in the video spend catching a fish?",
+ "answer": "D",
+ "candidates": [
+ "A. 2 hours.",
+ "B. 3 hours.",
+ "C. 5 hours.",
+ "D. 4 hours."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the weight of the fish caught by the man wearing sunglasses in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 4.5 pounds.",
+ "B. 5 pounds.",
+ "C. 4.34 pounds.",
+ "D. 4 pounds."
+ ]
+ }
+ },
+ "xcEQlFDc": {
+ "video_id": "xcEQlFDc",
+ "video_duration": "556s",
+ "duration": ">8min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A woman with long brown hair, wearing a white tank top and jeans, sits on a bed with a tan comforter. She gestures with her hands while speaking directly to the camera. The bed has a metal headboard and several pillows in brown, white, and light green. Behind her, a window reveals trees, and a bookshelf stands beside it. String lights are visible on the wall behind the bed. The woman occasionally touches her hair or brings her hands together. She points off to the side at one point, and continues gesturing throughout the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many pillows are shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. Five.",
+ "D. Four."
+ ]
+ }
+ },
+ "UJBlPdiS": {
+ "video_id": "UJBlPdiS",
+ "video_duration": "42s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a man wearing a baseball cap in several different shots. In the first set of shots the man is wearing a gray shirt. In a later set of shots, he is wearing a white and black jacket. There is a large array of weapons on a counter and on the wall behind the man. The man picks up various objects and presents them to the camera. In one shot, the man holds a black firearm, then puts it down. He also holds a shorter black firearm and moves it around. The room in the video is illuminated by a bright overhead light.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the apparent emotion of the man wearing a black hat in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. He appears sad.",
+ "B. He appears excited.",
+ "C. He appears discouraged.",
+ "D. He appears angry."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video show an AR-15 firearm?",
+ "answer": "A",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man wearing white do with the machine gun in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Pack the gun.",
+ "B. Sell the gun.",
+ "C. Disassemble the gun.",
+ "D. Put it back in its original place."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the AK-47 relative to the man with the white hat in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. In front of him.",
+ "B. To his right side.",
+ "C. To his left side.",
+ "D. Behind him."
+ ]
+ }
+ },
+ "IecTfBGr": {
+ "video_id": "IecTfBGr",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A rotating logo appears on a navy background, which then becomes a static logo, and then a golden emblem. The logos and emblems appear on the navy background. Then there is an image of two hands shaking over a navy background, with a logo on either side of the hands. A group of people wearing blue shirts and one person in a green flightsuit are inside of a room, some of them are looking at a computer screen, one is holding a white medical instrument to his mouth, and another person is holding onto the computer. A man wearing a white coat and gloves applies a red band to another man's arm, who is wearing a green flight suit. Then, an image of a person is lying in a blue covered container, with multiple medical instruments on and around him. Then, a screen appears displaying two men, in blue medical garb, behind another man who is wearing a light blue shirt and working with multiple medical instruments. A man in a white T-shirt, with a belt is being tended to by multiple people wearing light blue shirts. Then, there is an image of multiple men wearing similar tan colored clothing. Then, there is an image of four men wearing white flight suits in front of a large piece of equipment. Then, there is another image of two men wearing the white flight suits putting on and adjusting their gear. Inside of a yellow interior, a man is in a pink suit and wearing headphones is being attended by two people. Inside of a large metallic cabin, a man in a tan flight suit is shown attending to another man on a metallic chair. In a room, three men are tending to another man on a chair, they are securing him in the seat. In a larger room, several people are floating in zero gravity, two of them appear to be wearing spacesuits. In a forest with snow on the ground, several men are shown adjusting the outer layer of their snow suits, another man is chopping on a log. Several men wearing white clothing are shown adjusting and packing a parachute. 
A watercraft appears in a body of water with two men wearing white shirts, and another watercraft appears next to it with a man in blue. Finally, an image of an apparatus appears in an interior room, it is large, grey, and cylindrical.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What is the person wearing white doing to help the soldier sitting in the chair?",
+ "answer": "B",
+ "candidates": [
+ "A. Performing surgery.",
+ "B. Drawing blood.",
+ "C. Measuring blood pressure.",
+ "D. Measuring lung capacity."
+ ]
+ }
+ },
+ "liYaHkhW": {
+ "video_id": "liYaHkhW",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person is visible, sitting at a table. There are objects on the table, including two small rectangular trays. One tray contains several small silver-colored fish, and the other tray contains two small pink objects, one black object, and one gold object. Also on the table are a small orange cylinder, a roll of clear line, and some packages of fishing equipment. The person is holding one of the small silver fish, and also a small yellow and black tool. The person is using the tool to cut a small white object on a white tray. Behind the table are multiple racks of clothing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times is the sound of foam being cut audible in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two.",
+ "B. Five.",
+ "C. Three.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What method did the man in the video use to cut the white foam?",
+ "answer": "B",
+ "candidates": [
+ "A. Cut along the diagonal of the white rectangular foam.",
+ "B. Cut along the long side of the white rectangular foam.",
+ "C. Cut along the center of the white rectangular foam.",
+ "D. Cut along the short side of the white rectangular foam."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man in the video repeatedly compare a fishing hook with white foam?",
+ "answer": "C",
+ "candidates": [
+ "A. To make the foam sink into the water.",
+ "B. To cut the foam to the appropriate length.",
+ "C. To trim the foam to the appropriate width.",
+ "D. To insert the fishing hook into the foam."
+ ]
+ }
+ },
+ "ABXtrFIK": {
+ "video_id": "ABXtrFIK",
+ "video_duration": "242s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows several parts of a mechanical device being disassembled and inspected. It begins with two silver-colored gears meshing together. Then, a drawing of the disassembled parts is shown. The video cuts to a gloved hand placing a gear into a red housing. A blue bin of miscellaneous parts is also visible. Multiple components are shown and inspected, including a bronze-colored ring, a series of toothed black plates, and various cylindrical pieces of metal. A damaged, broken brown ring is also shown. The video switches to showing a blue tow truck with a black articulated arm on the back. Text appears on the screen at various points, along with the phrase \"PTO quickly disconnects now\", and a phone number. The video then cuts to a black screen with orange text.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Which of the following phone numbers can be used to contact the mechanic shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 755716153.",
+ "B. 755716152.",
+ "C. 755716155.",
+ "D. 755716154."
+ ]
+ }
+ },
+ "nJLAtIjS": {
+ "video_id": "nJLAtIjS",
+ "video_duration": "76s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows three people standing in a corner of a room with wood panel walls. One person is wearing an orange jumpsuit and a mask shaped like a skull with antlers. Another person is wearing a long grey robe and a mask shaped like a bird's head. The third person has short dark hair and is wearing a white lab coat. A door with a sign on it is shown next, before the video cuts back to the three people in the corner. The video cuts to show a close-up of the person wearing the white lab coat, and then cuts to show an empty hallway. A dark speaker attached to the wall is shown next, before cutting back to a close up of the person wearing the lab coat. The video then cuts to show the orange-clad and grey-robed figures standing in front of the door with the sign. They are then shown entering a hallway with a checkerboard floor. A red blob-like creature with eyes appears on the floor between the two figures. The creature moves around, and the video cuts to a full shot of the lab-coated individual standing over the creature. The video then cuts to black, before showing a line of people wearing black tactical gear and helmets, pointing guns forward. The video then closes by showing a door with a sign on it.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many characters that appear to be non-human are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. Three.",
+ "D. Five."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What interaction occurs between the blob-like non-human entity and the man in the white coat after they enter the elevator?",
+ "answer": "D",
+ "candidates": [
+ "A. They narrow their eyes and stare at each other.",
+ "B. They narrow their eyes and engage in physical combat.",
+ "C. They widen their eyes and engage in physical combat.",
+ "D. They widen their eyes and stare at each other."
+ ]
+ }
+ },
+ "AYAPrJaU": {
+ "video_id": "AYAPrJaU",
+ "video_duration": "90s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two fighters, Whittaker and Costa, circle each other in an octagon. Whittaker wears green trunks, and Costa wears black trunks. Whittaker throws a kick, then they exchange punches. Whittaker lands a head kick, and Costa stumbles. They continue exchanging punches. Whittaker lands another head kick, knocking Costa to the ground. Whittaker follows up with punches on the ground, and the referee stops the fight. Whittaker celebrates by raising his arms, running around the octagon, jumping on the cage, and flexing. The perspective changes to a slow-motion replay of the head kick knockout, showing it from a top-down angle. The replay ends, and Whittaker continues celebrating in the octagon.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual action did the tattooed man in the video perform after his victory?",
+ "answer": "A",
+ "candidates": [
+ "A. Sat on the ropes.",
+ "B. Raised his hands.",
+ "C. Ran towards the audience.",
+ "D. Lifted his opponent."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Who is the victor in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The outcome is uncertain.",
+ "B. Paulo Costa.",
+ "C. Robert Whittaker."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Where in the video is slow-motion replay used?",
+ "answer": "D",
+ "candidates": [
+ "A. Slow-motion replay is used at the beginning of the video.",
+ "B. Slow-motion replay is used throughout the video.",
+ "C. Slow-motion replay is used in the middle of the video.",
+ "D. Slow-motion replay is used at the end of the video."
+ ]
+ }
+ },
+ "RTjweEbZ": {
+ "video_id": "RTjweEbZ",
+ "video_duration": "348s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view of the driver's area of a recreational vehicle (RV). The camera pans to the left, showing the driver's seat, steering wheel, and a side window that opens outward. It then moves to the passenger seat area, showcasing the entry door and passenger seat. Next, a small screen displays the radio interface, followed by a view of the digital dashboard. The camera pans back to the front view of the driver's area and then focuses on the center console, where two screens display an iPhone interface and navigation respectively. The camera shifts to a JBL speaker on the wall before returning to the driver\u2019s area. It then focuses on a beige pillow with white tassels on a couch, followed by another beige pillow with a white bottom panel and a fuzzy beige blanket. The video then transitions to a full view of the RV's interior, showing couches, a kitchen area, and a dining area. The camera zooms in on a three-seater couch under a window, followed by another couch closer to the driver's area with a TV mounted on the wall above. Next, the camera focuses on the kitchen cabinets with glass doors, then a small wine fridge. The video returns briefly to the interior view before focusing on the kitchen sink, then a pull-out cabinet with vertical dividers. The camera pans back to show the kitchen area with cabinets, a sink, a stovetop, a microwave, and a refrigerator. It then lingers on the upper kitchen cabinets and a wall-mounted touchscreen before returning to the couch under the large window. Next, the video focuses on a Bose soundbar, a ceiling-mounted TV, and the ornate ceiling. The camera moves to the flooring, then to the Bosch logo on the refrigerator, the glass stovetop, and the Bosch microwave. It shows the refrigerator again, followed by the kitchen sink, the spice rack cabinet, and the kitchen area. The video then returns to the upper cabinets, followed by another wall-mounted touchscreen. Next, the camera shows the dining area, followed by the half-bath area, and the bedroom with a king-sized bed, pillows, drawers, and a TV. The video concludes by showing a cedar-lined closet with shelves, a safe inside a cabinet, a dresser under the window, the bathroom vanity with two sinks, the shower with a wooden seat, and finally, a stacked Whirlpool washer and dryer.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What Fahrenheit temperature is shown on the dashboard at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 65 degrees.",
+ "B. 60 degrees.",
+ "C. 70 degrees.",
+ "D. 75 degrees."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In relation to the floral pillow, what is the position of the white pillow on the master bedroom bed?",
+ "answer": "D",
+ "candidates": [
+ "A. To the right of it.",
+ "B. To the left of it.",
+ "C. Behind it.",
+ "D. In front of it."
+ ]
+ }
+ },
+ "FPdOSpde": {
+ "video_id": "FPdOSpde",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A white and black robot with a screen on its chest is shown in a hallway. The screen has lines on it and the eyes on the robot's face display glowing blue circles. The eyes change into a happy face with blue lines for the mouth. The image switches to a structure made of light-colored metal with a blue barrel on top, placed on red dirt surrounded by trees. A rainbow-colored spiral pattern in the shape of a human head is shown over a backdrop of stars. The image changes to a young girl with blonde hair being held by two women. The girl turns her head and looks up at one of the women. The girl and the women are shown again, then another scene starts, depicting a woman with dark hair sitting on a red bean bag. She is holding a small baby, while a toddler with blonde hair sits next to her, wearing headphones. A hairless cat walks by. The scene cuts to a close-up of a book being opened by two people, then switches to a slightly different angle. The camera shows a dog with a person who has blond hair and wears a white shirt taking a photo of themselves with a cell phone. The image cuts to a person with long curly hair in a yellow and white checkered dress. They are hugging a tan-colored dog. The camera cuts to an older man and woman looking at a book of pictures. The woman is holding a small white photograph.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many cats were shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. One.",
+ "C. Three.",
+ "D. None."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the hairclip worn by the little girl in a pink long-sleeve top?",
+ "answer": "C",
+ "candidates": [
+ "A. Gold.",
+ "B. Brown.",
+ "C. Pink.",
+ "D. Green."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What action does the girl in the yellow checkered skirt perform while holding a puppy?",
+ "answer": "D",
+ "candidates": [
+ "A. Raise her hand to brush her hair.",
+ "B. Hold the puppy in both arms.",
+ "C. Make a V sign with her fingers.",
+ "D. Kiss the puppy."
+ ]
+ }
+ },
+ "NJHrwiES": {
+ "video_id": "NJHrwiES",
+ "video_duration": "51s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins by showing a bedroom with a bed, a lamp on a bedside table, and a red shirt and a blue flag hanging on the wall. The video then pans to a white board and then turns and shows a doorway. The video shows a bathroom with a toilet and a bathtub. Then it pans over to show toiletries on the bathroom counter, and then pans back to show a garbage can on the floor next to the toilet. The video pans out of the bathroom through another doorway and into a room where there appears to be a washing machine and dryer. The video then continues to pan out into a room with a wooden floor and shows a living room with an L-shaped couch, a coffee table, and an area rug. The video then pans over to a kitchen with a sink full of dirty dishes, along with a variety of other kitchen appliances. It pans back to show the dining area and then to the living room. It then moves around to the kitchen area and then shows a whiteboard on the side of a refrigerator. Finally, it shows a staircase with a wooden railing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the color of the clothing displayed on the bedroom wall in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. White.",
+ "B. Blue.",
+ "C. Gray.",
+ "D. Orange."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the scene take place where the lady appears in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A kitchen.",
+ "B. A bedroom.",
+ "C. Inside a car.",
+ "D. A living room."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "On which side of the sink is the air fryer located in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. To the left of the sink.",
+ "B. Behind the sink.",
+ "C. Inside the sink.",
+ "D. To the right of the sink."
+ ]
+ }
+ },
+ "RGENQMAN": {
+ "video_id": "RGENQMAN",
+ "video_duration": "170s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two people are fishing on a grassy bank alongside a body of water. They are holding fishing rods, and one is wearing a blue patterned cloth on his wrist. One person is wearing a green long-sleeved shirt, a hat, and sunglasses, and the other is wearing a black t-shirt and black shorts. In a separate scene, a person wearing a black t-shirt with the word \"NINES\" in green lettering and a face mask with a clown design is holding a fish. Another scene shows a person holding a shark in their hands while on a boat. The video transitions back to the fishing scene, where one of the individuals is actively fishing, and then another individual is shown holding an iguana by a string. They eventually put the iguana in a small enclosure. Then they can be seen placing various vegetables into the enclosure with the iguana. Finally, the last two scenes show one of the individuals using a camera on a tripod, and the iguana inside the cage.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the man wearing black wrapping a towel around his left hand?",
+ "answer": "D",
+ "candidates": [
+ "A. To wipe away sweat.",
+ "B. To prevent sunburn.",
+ "C. For aesthetic reasons.",
+ "D. To prevent cuts from a fishing line."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What stage is the captured iguana in?",
+ "answer": "D",
+ "candidates": [
+ "A. A young adult.",
+ "B. Elderly.",
+ "C. An adult.",
+ "D. A juvenile."
+ ]
+ }
+ },
+ "owaUYnud": {
+ "video_id": "owaUYnud",
+ "video_duration": "32s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A split-screen video shows a young man with glasses and headphones playing a video game. The top portion of the screen shows his face, and the bottom portion shows the gameplay. Initially, a voting screen appears in the game, showing an option to forfeit. The text \"TM8 VOTES TO FORFEIT\" is superimposed on the video. The gameplay begins, showing a 3 vs. 0 score in favor of the opponent. The player, using a blue car, scores a goal, making the score 1-3. The player scores another goal, bringing the score to 2-3. The player scores again, tying the game 3-3. The game goes into overtime. The player maneuvers their car, attempting to score, with the ball moving around the field. The video ends with the player still in overtime.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What occurred before the player scored their third point?",
+ "answer": "A",
+ "candidates": [
+ "A. A collision occurred with the player 'ZEEPOOL'.",
+ "B. A collision occurred with the player 'Bru'.",
+ "C. The player collided with the railings.",
+ "D. A collision occurred with the player 'ILYARE1X'."
+ ]
+ }
+ },
+ "jkKlFsZp": {
+ "video_id": "jkKlFsZp",
+ "video_duration": "107s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a red loading screen with the word \u201cLOADING\u201d in orange text in the center. The scene then shifts to a view of a green and white-colored cloudy structure moving from left to right, then it transitions to a white background with a blurred figure moving on the right side of the screen while purple smoky shapes appear on the left side of the screen. The next scene is in warm colors with a blurry image of debris and an amorphous shape moving in the center. This is followed by a close-up, blurry, transparent image of what appears to be a man holding a rifle in his right hand. The image is overlaid onto another blurry image of structures. Next, there\u2019s a close-up shot of what appears to be a soldier holding an object in his hands and a man in the background. It then shows a blurry image of a man\u2019s body being moved and another blurry image of a structure behind it. This image is then replaced by a shot of several explosions with red debris. Another blurry image shows structures and a person walking in the distance. The next image is a dead bird on a stone-like surface. After, there\u2019s a blurry image of a street with buildings and people in the background. The video transitions again to a blurry shot of a man\u2019s legs being moved. This is followed by an aerial view of what appears to be an area with text overlaying the image. The next scene is a close-up of a face with a gas mask and a walkie-talkie. The video continues to show another close-up of faces wearing gas masks. The scene changes to show blurry rocks with a body in the center and then a blurry image of structures with green lighting. Then the video changes to a blurry image of a man with a large weapon. This is followed by another blurry image of structures and the road in a town. Next, there is a blurry shot of a destroyed car and a street with bodies lying around. Following that is a blurry image of a soldier with a gun, and then another image of a soldier with a gun and a body on the ground next to him. The next scene shows a man in a mask and another man in the background. After that is a blurry image of a child moving her arms. The next scene shows four men in camouflage holding weapons. This image transitions to a blurry image of a person lying in bed and someone standing nearby. The scene then shifts to a close-up shot of a man lifting a green-colored object. This image is followed by another shot of a blurry leg being dragged, then an image of a soldier with visible injuries. The next scene is a dead soldier with visible lacerations. Then, it shows a close-up of medical instruments with blood on them. The next image shows a man holding a light bar, then it transitions to a shot of the same man looking at the screen. This is followed by a black screen and then a shot of a white structure outside. The video then transitions to a shot of an explosion. The video then goes back to the red loading screen with the word \u201cLOADING\u201d in the center. It then transitions to show a man with a hood on his head with the text \"U.S.S. Reagan - CITA[text in another language]. This is then replaced with the same man but with the added text \u201cU.S.S. Reagan - CITADEL Command, Offshore South\u201d. The next scene contains the same man with a hood with the added text \u201cU.S.S. Reagan - CITADEL Command, Offshore Southwest Manhattan\u201d. The man remains in focus in the next scene. The scene then zooms into the man\u2019s face. The next shot is still of the same man, then there is another man in the background. It then shows both men in the scene. The video focuses on the second man and then shows the first man with the hood and text in front of him. The video shifts to show the second man\u2019s boots. The next shot shows a man being lifted by another man. The video then shows the man with the hood being lifted by another man. The video then shifts to show a man's face. After that, it shifts to show the two men again and then a black screen. The video shifts to show the man with a hood sitting on the floor. The next scene shows a figure illuminated in a red light walking on a striped road. The video then zooms into a man\u2019s eyes. Then it shows a man on the floor with another man in the background. It then shifts to show a figure being lit with red light, walking on a surface with two bare feet in the foreground. The camera follows this figure, then switches to a shot of another figure on the floor with the camera focusing on the bare feet figure. The video then shows a scene of the two figures fighting. The next shot is from above and shows two men on the floor. The next scene shows a man lifting another man above his head. This shot is followed by a man being moved. The video shifts to show the two figures fighting in front of a wall. After that, the video shows the man with a hood crouching on a surface. The next shot is from the perspective of a large creature's feet while a man in a hooded jacket kneels before them. The next scene is a view of the large creature from behind as a man in a hooded jacket watches. The camera continues to follow the large creature, which is now at the edge of a tall building. The video then transitions to show the large creature on what appears to be a ship deck surrounded by soldiers firing weapons. The video shows the creature exploding and being covered with red and black tendrils. More soldiers fire their weapons at the creature. The camera moves around the creature as it is attacked by soldiers and shows a man in a jacket standing in front of the creature. The large creature then falls to the ground on its side, with a timer that reads \u201c2 MINUTES REMAINING\u201d, \u201cOBJECTIVE COMPLETE\u201d, and \u201c500000 EP\u201d. The creature continues to explode, with the image zooming in. It is then followed by an image of the creature exploding and then a final shot of the creature being destroyed.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event occurs at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The army kills the monster at the end of the video.",
+ "B. A monster decimates the army at the end of the video.",
+ "C. A monster kills a man wearing a black hoodie at the end of the video.",
+ "D. The monster transforms into a human at the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the state of the hands of the man wearing a black hoodie at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Both hands are normal.",
+ "B. The left hand is normal, and the right hand is a knife.",
+ "C. Both hands are knives.",
+ "D. The left hand is a knife, and the right hand is normal."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the soldiers in the video use a chopping knife for?",
+ "answer": "A",
+ "candidates": [
+ "A. Killed a child.",
+ "B. Committed suicide.",
+ "C. Attacked a man wearing a black hoodie.",
+ "D. Attacked a monster."
+ ]
+ }
+ },
+ "UGFoQrpI": {
+ "video_id": "UGFoQrpI",
+ "video_duration": "73s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with long braids and a floral head wrap leans against a door and dances. The scene changes to a bathroom where a different person with two buns in their hair and wearing a black t-shirt talks to the camera. Cut back to the person with braids, now in a bedroom, dancing, while the person from the bathroom sits on the bed. Cut back to the person in the bathroom talking and playing with their hair. Cut again to the person at the door, now wearing a Louis Vuitton head wrap, leaning against the door frame. Back in the bathroom, the person continues talking. The person at the door is shown again, still leaning and talking. Back in the bathroom, the person continues talking. A hand reaches in and puts a blue towel over the doorknob. The person in the bathroom reacts to the towel on the knob. Another person comes to stand next to the person in the bathroom and they both look at the door. The person in the bathroom reacts with a worried expression.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the most likely relationship between the girl with dreadlocks and the woman in a black short-sleeved top in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Sisters.",
+ "B. Friends.",
+ "C. Mother and daughter.",
+ "D. Neighbors."
+ ]
+ }
+ },
+ "YZAUjAGl": {
+ "video_id": "YZAUjAGl",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a red chevron graphic on a white background. Then, a man wearing a red t-shirt is shown. He stands behind a countertop on which there is a blender, various clear glasses containing liquid, a glass bowl of cut pineapple, and a small metal bowl. Behind the man, there is a wall with a bar display and a wooden sign. The man picks up the bowl of pineapple and begins to pour the contents into the blender, using a white spatula. The scene cuts to an overhead view of the blender where a hand pours the pineapple into the blender using the spatula, while text on the screen says, \"1 1/2 cups Pineapple\". The man is back, and he is gesturing and speaking. The scene changes to an overhead view of the blender, and ice is added, while text on the screen says, \"1/4 cup ice.\" Next, white liquid is poured in while text on the screen says, \"1 cup coconut milk.\" Then the same hand pours in yellow liquid while text on the screen says, \"1 cup pineapple juice\". The man is shown pouring more of the yellow liquid into the blender. The scene is once again an overhead view, where a spoonful of brown substance is added while text on the screen says, \"1 tbs brown sugar\". The black base of the blender is twisted onto the glass top. The blended yellow liquid is shown inside the blender, from a front view. Then, an overhead view shows the blender whirring and blending. The scene changes to an overhead view of a hand pouring the blended yellow liquid from the blender into a clear glass. Then the scene changes to a view of the man's red shirt in the background as he pours the liquid into the glass. Then, an overhead view shows a straw placed in the blended drink. Finally, a red sphere on a skewer is added as a garnish.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the video mention increasing the amount of ice cubes?",
+ "answer": "D",
+ "candidates": [
+ "A. Pineapple juice without ice.",
+ "B. Coconut milk without ice.",
+ "C. Cherries without ice.",
+ "D. Pineapple chunks without ice."
+ ]
+ }
+ },
+ "igEjGGbB": {
+ "video_id": "igEjGGbB",
+ "video_duration": "79s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A hand holds a small, green and black fishing lure. The lure has a segmented body and a paddle tail. The hand moves the lure slightly. The scene changes to a man in a boat on a lake. He holds a fishing rod with a different lure attached. The lure is white and chartreuse. The scene changes again, showing the view from the boat, with a fish finder display inset in the upper left corner. The man casts the line into the water. The rod bends as a fish takes the lure. The man reels in the fish, a smallmouth bass. He removes the hook and releases the fish. The scene changes back to the man sitting in the boat. The view changes to a close-up of the fish finder display. The scene changes back to the man in the boat. He continues fishing, casting and reeling, and catches a second smallmouth bass. He removes the hook and releases the fish.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "According to the video content, which of the following is the correct sequence of events?",
+ "answer": "A",
+ "candidates": [
+ "A. (a) Introduction of Euro Tackle B Vibe, (b) Introduction of Classic Curly Tail, (c) Discussion of how water depth affects fishing techniques, (d) Presentation of fishing skill advice.",
+ "B. (b) Introduction of Classic Curly Tail, (a) Introduction of Euro Tackle B Vibe, (c) Discussion of how water depth affects fishing techniques, (d) Presentation of fishing skill advice.",
+ "C. (a) Introduction of Euro Tackle B Vibe, (d) Presentation of fishing skill advice, (b) Introduction of Classic Curly Tail, (c) Discussion of how water depth affects fishing techniques.",
+ "D. (c) Discussion of how water depth affects fishing techniques, (a) Introduction of Euro Tackle B Vibe, (b) Introduction of Classic Curly Tail, (d) Presentation of fishing skill advice."
+ ]
+ }
+ },
+ "oVnTPKjh": {
+ "video_id": "oVnTPKjh",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video starts with text on a black background that reads, \u201cTHEY MUST FIGHT FOR THEIR FREEDOM\u201d. Next, a shot of a damaged city with multiple broken buildings, and trees. Then, a man in a black tank top is shown with his hands in restraints and pulling them. A man\u2019s leg is shown moving quickly over some rocks. Inside a building, one person with a red bandana on their head punches another person. A bald man with a beard is shown speaking in an interior space with light-colored walls. Then, the camera shows a bald man with a tattoo near his eye. A person is shown running on the top of a wall, then jumping down and hitting two people on the ground. A person in a tan outfit is shown crawling on the ground. Then, an explosion is shown. The camera shows one person punching another in an interior space, then a person jumps down from a high position into an interior space. A man in a black jacket is shown pulling on a chain, and then a man in a red jacket falls. A man in a black tank top is shown kicking another man. Then, a man with a red bandana is shown on the ground. The bearded man is then shown holding a machine gun. Next, two people are shown shooting guns in an outdoor area. Then, an explosion is shown. A man is shown aiming a bazooka. One man is shown jumping on another\u2019s back, then two men are shown fighting in an interior space. A man is shown aiming a gun in an interior space, and then a man is shown shooting a bazooka outside. Two men are shown with guns and red bandanas on their heads. The bald man from earlier is shown restraining another man with chains, followed by a scene of two men fighting in an interior space. A group of people are shown in an outdoor area, and then a man with the name \"Mike Tyson\" is shown, followed by a scene of that man fighting another. He is then shown aiming a gun. 
A man is then shown at night surrounded by three other individuals, two of whom are wearing face coverings, and several flames. The bearded man is shown shooting a gun in an interior space. A dark silhouette is shown walking next to a burning object at night, and then an explosion is shown in the dark. Multiple cars are shown driving at night, followed by a large explosion at night. A man is then shown with a gun in an interior space, and then two men are shown crashing through a doorway. The video ends with the text \u201cDESERT STRIKE\u201d in golden text on a black background. Additionally, a man is shown with text beneath him stating his name.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who says, \"I'm not in the plan anymore,\" in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The man wearing a red headscarf.",
+ "B. The man in a gray coat.",
+ "C. The bald man.",
+ "D. The bearded man wearing black clothes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What does the bearded man in black want the bald man to do?",
+ "answer": "C",
+ "candidates": [
+ "A. Kill someone for him.",
+ "B. Develop a plan with him.",
+ "C. Wake up his friend.",
+ "D. Look for money and gold with him."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the man wearing a black outfit and sporting a beard most likely to be?",
+ "answer": "A",
+ "candidates": [
+ "A. A robber.",
+ "B. A scammer.",
+ "C. A bodyguard.",
+ "D. A police officer."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the most likely thing the bearded man wearing a black shirt in the video will do next?",
+ "answer": "A",
+ "candidates": [
+ "A. Search for money and gold.",
+ "B. Reformulate the plan.",
+ "C. Escape.",
+ "D. Kill a companion."
+ ]
+ }
+ },
+ "jTXcQoqj": {
+ "video_id": "jTXcQoqj",
+ "video_duration": "410s",
+ "duration": "6-8min",
+ "domain": "Games",
+ "sub_category": "Sports Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a basketball game taking place on a virtual court with surrounding buildings and spectator figures. Initially, many figures stand on the court. Eventually the figures split up into 2 teams of 2, and gameplay begins with a basketball being moved back and forth between figures. A figure wearing a red and blue superhero suit with a red mask and one figure wearing a white shirt with a green mask play basketball, with other figures on the sidelines. The game ends and a scoreboard appears, indicating that one of the teams lost with a score of 9-23. The video then returns to the area where the game occurred, and all the figures are back on the sidelines with gameplay finished. There are also several intermittent images of random people and food.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many people are playing ball in the scene after the game starts?",
+ "answer": "D",
+ "candidates": [
+ "A. Seven.",
+ "B. Four.",
+ "C. Six.",
+ "D. Eight."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "With which hand did the man wearing the yellow headband throw the second ball?",
+ "answer": "B",
+ "candidates": [
+ "A. The right hand.",
+ "B. The left hand.",
+ "C. Both hands.",
+ "D. I don't know."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes does the man wearing red experience in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Confident to crying to excited.",
+ "B. Confident to angry to excited.",
+ "C. Angry to confident to crying.",
+ "D. Crying to happy to angry."
+ ]
+ }
+ },
+ "VxuFjupi": {
+ "video_id": "VxuFjupi",
+ "video_duration": "134s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two people are visible in the video. One person wears a white collared shirt and light tan pants. The person also has a white headband and is wearing jewelry. The second person is wearing a black tank top or dress and is also wearing jewelry. The person in the light-colored clothing is sitting on the floor. The person in the black clothing is sitting in a director's chair. The person in the white shirt reaches for and handles a small object. The person in the white shirt and the person in the black clothing then move toward each other, and their faces touch. The person in the white shirt can be seen touching their face with their fingers. The video is shot in a room that has at least two doors and a light on a stand. The camera moves throughout the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "How many distinct voices are present in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. Four.",
+ "D. One."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the girl in white doing with the brush in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. She planned to use the brush to help another girl with makeup.",
+ "B. She uses the brush to do her own makeup.",
+ "C. She uses the brush to help another girl with makeup.",
+ "D. She planned to use the brush to do her own makeup."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What anomaly is observed in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A girl wearing a white shirt suddenly kisses a girl in a black skirt in the video.",
+ "B. A girl wearing a white shirt suddenly pulls the hair of a girl in a black skirt in the video.",
+ "C. A girl in a black skirt suddenly pulls the hair of a girl wearing a white shirt in the video.",
+ "D. A girl wearing black suddenly kisses a girl in a white shirt in the video."
+ ]
+ }
+ },
+ "xksrdjJE": {
+ "video_id": "xksrdjJE",
+ "video_duration": "99s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Book Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with long brown hair is in the foreground, speaking while looking directly at the camera. They are wearing a black shirt and appear to have makeup on. In the background, there are bookshelves filled with books and some decorations. In the upper right corner of the video frame, there is an image of a book cover, with the text \"Ovid Metamorphoses\". The person then shows a white book with an anime drawing on the cover. They then rotate the book to show the front and back, and then open it to show the pages inside, and then close it again and show the cover, rotating it several times while speaking.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the woman in the video like Sailor Moon?",
+ "answer": "D",
+ "candidates": [
+ "A. Because Sailor Moon provided her with a lot of companionship during her childhood.",
+ "B. Because the art style of Sailor Moon is visually appealing.",
+ "C. Because Sailor Moon conveys deep symbolic meaning.",
+ "D. Because each Sailor Guardian has unique characteristics."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the final state of the Pretty Soldier manga shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Open, on the right side of the video.",
+ "B. Closed, on the right side of the video.",
+ "C. Open, on the left side of the video.",
+ "D. Closed, on the left side of the video."
+ ]
+ }
+ },
+ "CbQTQoPZ": {
+ "video_id": "CbQTQoPZ",
+ "video_duration": "80s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A fishing pole is seen pointing at a body of water. The pole moves up, down, left, and right. The camera view shifts to the sky which is blue with some clouds. Then, the pole points at a canal that has grass growing on its banks. The pole points at the water again. Houses and trees are visible in the background. The fishing pole goes back and forth between pointing up at the sky, and pointing towards the water and shore. The water's surface is disturbed by something moving below. The pole then points at the sky, then back at the water again. A fish breaks the surface of the water and splashes. This is repeated several times. Later in the video, a person in a dark shirt and light blue shorts holds a fishing pole. Then, the view goes to the sky, and then back to the person by the water. The pole then points at the water where a fish is moving on the surface. A fish breaks the surface of the water. The camera shows a wide shot of the canal, and then the view returns to the pole pointing at the water. The camera view moves to show the surface of the water. A fish is seen near the rocky bottom of the water. A hand grabs a fish from the water.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the man catch the fish in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the end of the video.",
+ "C. Throughout the entire video.",
+ "D. At the beginning of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What happens after the man brings the fish up?",
+ "answer": "A",
+ "candidates": [
+ "A. The man removes the fish from the fishing rod.",
+ "B. The man immediately kills the fish.",
+ "C. The man falls into the water.",
+ "D. The man releases the fish."
+ ]
+ }
+ },
+ "FHXGTrhC": {
+ "video_id": "FHXGTrhC",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A graph appears, with \"carbon emissions\" in red and \"global GDP\" in blue. The graph shows data from 1996 to 2012. A person with brown hair, wearing a maroon shirt, speaks to the camera. A new graph appears, this time with a yellow line labeled \"GDP,\" a purple line labeled \"MEDIAN INCOME - 2015 DOLLARS,\" and a light blue line labeled \"LOWEST FIFTH\". A red line labeled \"TOP 5%\" begins to appear on the graph. The person speaks to the camera again. An animation shows a crowd of people in blue and purple walking along a street, some of whom are carrying boxes, and a building with the word \"TV\" at the top. A man in a black suit and red tie points to a sign with the letter \"P\" on it as a woman in pink stands next to him. The woman gives money to the man. The person speaks to the camera again. Another graphic appears with three trees, all covered with green leaves and orange fruit. The text \"COMMUNAL ORANGE GROVES\" is under the trees. A white rectangular building appears, and then it gains a banner that says \"JUICE ME UP! INC.\" and an orange flag with a white circle. Next to the building, three white jugs of juice appear, each bearing an orange circle. A red circle with an upward arrow appears above the building and jugs. The person in the maroon shirt speaks to the camera again. A black-and-white video clip shows a person using a water pump with a bucket next to it. Water pours into the bucket from the pump. A red circle appears with a black eye and mouth; then a tear appears coming from the eye. An animated red hand appears with a stack of cash, which sparkles.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "In the video, was a hospital visible during the exchange of currency for medication?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes, a hospital was visible.",
+ "B. I am not sure.",
+ "C. No, a hospital was not visible."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Over the past 40 years, by how many times has GDP increased?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Four.",
+ "D. Three."
+ ]
+ }
+ },
+ "pFssNbmp": {
+ "video_id": "pFssNbmp",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A group of people, mostly women wearing pink shirts and some with red paint on their hands, are sitting in an auditorium-like setting with rows of seats. A blonde woman in a pink shirt is standing and holding up a sign reading \"UNRWA SAVES LIVES\". She's facing forward towards a window or screen and raising her voice. Several other people in pink are clapping and also raising their hands with red palms visible. As the video progresses, a security officer in a dark uniform approaches the woman with the sign, puts her hand on the woman\u2019s back, and starts to escort her away from the seating area. The woman continues to hold the sign and speak loudly while being escorted. The other protesters continue clapping and raising their hands. The security officer continues to escort the protesting woman out of the frame.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the predominant emotion conveyed in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Calm.",
+ "B. Joyful.",
+ "C. Surprised.",
+ "D. Angry."
+ ]
+ }
+ },
+ "VjGXajbg": {
+ "video_id": "VjGXajbg",
+ "video_duration": "532s",
+ "duration": ">8min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A pixelated close-up of Donald Trump's face transitions to a clearer image. The image zooms in on his nose. Text appears: \"LIBRARY LIBRARY OF CONGRESS.\" A title page of a book is displayed: \u201cBARON TRUMP\u2019S MARVELLOUS UNDERGROUND JOURNEY by INGERSOLL LOCKWOOD.\u201d The author\u2019s other works are listed below the title. A page of text from the book is shown, featuring the phrase, \"The people will tell thee.\" The cover of another book, \u201cThe Travel and Adventures of Little Baron Trump,\" appears. A cartoon-like image shows a boy and a small dragon. A drawing shows a child in what appears to be Chinese traditional clothing with a dog standing next to a block. The title page of the same book is shown again. A book titled \"The Last President,\" by Ingersoll Lockwood, is presented. The drawing of the child and the dog appears again. An image of a person wading in water at sunset is shown. The image of the child and the dog is shown again, larger this time. A picture of Donald Trump\u2019s son, Barron Trump, at a podium next to Donald Trump, is presented. The image of the traditionally dressed child is shown next to a picture of Barron Trump in a suit. The image stays on screen as more text appears from \u201cBaron Trump\u2019s Marvellous Underground Journey,\u201d chapter 2. The chapter title is: \"DON FUM\u2019S MYSTERIOUS DIRECTIONS.\u2014BULGER AND I SET OUT FOR PETERSBURG, AND THENCE PROCEED TO ARCHANGEL.\u2014THE STORY OF OUR JOURNEY AS FAR AS ILITCH ON THE ILITCH.\u2014IVAN THE TEAMSTER.\u2014HOW WE MADE OUR WAY NORTHWARD IN SEARCH OF THE PORTALS TO THE WORLD WITHIN A WORLD.\u2014IVAN\u2019S THREAT.\u2014BULGER\u2019S DISTRUST OF THE MAN AND OTHER THINGS.\u201d Text below reads, \u201cAccording to the learned Don Fum's manuscript, the portals to the World within a World were situated somewhere in Northern Russia\u2026\u201d Images of Donald Trump with Vladimir Putin and Melania Trump are shown. 
A Wikipedia entry for Melania Trump appears, stating she is the First Lady of the United States. The title \u201cTHE LAST PRESIDENT\u201d appears again. Then, another image of the words \u201cLAST PRESIDENT\u201d appears. The video cuts to black. A view of the Brooklyn Bridge and the New York skyline is shown, with the sun shining through the bridge. Trump Tower is then presented in a series of shots, including a street-level view with people walking around. Donald Trump speaks at a podium, pointing his finger. A page of text from \u201cThe Last President\u201d describes a young president who stood firm. The text lists cabinet members: \u201cSecretary of State\u2014William M. Stewart, of Nevada. Secretary of Treasury\u2014Richard P. Bland, of Missouri. Secretary of War\u2014John P. Altgeld, of Illinois\u2026\u201d The video shows \u201cSecretary Agriculture - Lafe Pence, of Colorado,\" and reads, \"The first thing that flashed across the minds of many upon glancing over this list of names...\u201d A nighttime scene depicts a street filled with smoke or tear gas. A car drives through the haze with its headlights on. People with cameras film the events. The smoke grows denser. People in what appears to be riot gear are visible in the smoke. Several vehicles drive through the haze, their headlights on. More flashes of light are seen. Then, more abstract gray pixels appear. A picture of Nikola Tesla is shown. A black and white photo shows Tesla sitting in a chair, reading a book, surrounded by what appears to be electrical sparks and equipment. Black material spreads on a white surface around two central points, resembling magnetic field lines. The Earth rotates in a grid-like structure in space, with a satellite orbiting it. A black fluid forms peaks as it is seemingly pulled downwards. The fluid flattens out. A picture of Tesla is shown next to a drawing of a flying saucer-like craft. A sketch depicts Tesla sitting, surrounded by images of inventions and electricity. 
A closer image of the sketch of Tesla's inventions is shown. A typed document shows an image of a busy office, with several people sitting at desks or typewriters. The document reads, \u201cGERHARD & HEY INC. NEW YORK TAKEN OVER BY THE ALIEN PROPERTY CUSTODIAN. The Office of the Concern. Oct. 3, 1918.\u201d A black and white photo shows a group of men in suits, holding hats. Another photo depicts a group of men sitting around a large table in a room with dark curtains. A close-up of a drumstick hitting a cymbal is shown. A picture of a man in a suit is shown beside a large cylindrical coil. Donald Trump's face is shown. The picture of Tesla, surrounded by sparks from earlier, is shown. The camera zooms in on the metal cage. Patent drawings for inventions attributed to Nikola Tesla are shown. Then, a video shows the Trump Tower in Chicago at night. The words \u201cONE MORE THING\u2026\u201d appear. A Wikipedia page on Mary Anne MacLeod Trump lists her as Donald Trump's mother. A Wikipedia page on Fred Trump lists him as the father of Donald Trump. A Google search for \"June 14, 1946\" shows that it was the date of a total lunar eclipse and the birthdate of Donald Trump. The earlier image of the child and the dog appears, with overlaid text that reads, \u201cPrecocious, restless, and prone to get in trouble, often mentions his massive brain, and has a personalized insult for most of the people he meets.\u201d Donald Trump\u2019s face, looking downwards, is shown. The title card \"The Baron Trump Adventures\" is displayed against a background of a waterfall and a cave. The video cuts to black. A cartoonish figure with a square head stands at a podium and makes several hand gestures. A scene from the movie *Back to the Future Part II* shows Marty McFly walking down a street, looking up at Biff Tannen's Pleasure Paradise Casino. Biff Tannen, shirtless and surrounded by two women, looks at the camera. The cartoon figure gestures again. 
Marty is shown again, looking at the casino. A close-up of Biff is shown holding a gun. The cartoon figure continues to gesture. The video shows a digitally rendered head, comprised of lines and dots, spinning against a black background. It cuts to a blue and pink nebula in space. The cartoon figure shrugs. A clip from *Back to the Future Part II* shows Marty looking disgusted with the dystopian alternate 1985 he finds himself in. A clip shows Biff Tannen holding a gun in a darkened room. The cartoon figure is back at the podium, making gestures. A glowing blue sphere floats among skyscrapers at night. The New York skyline is shown with a full moon in the sky. The cartoon figure again makes gestures from behind his podium. A cartoon figure with dark hair is shown wearing an apron. Suddenly, the face transforms into a large smile. A blue outline of a box encompasses the figure. A first-person video game perspective is shown, with figures flying through the air being shot at. A group of children watch television on a couch, including two children fighting over a bowl on the floor. An underwater scene shows bubbles rising. The cartoon figure crosses its arms. A screen with options to subscribe, learn more, play, or watch the next video is shown. Social media links for \"Spirit Science Central\" are also displayed. The words \"THE CONSPIRACY THEORY OF EVERYTHING\" are displayed in the night sky. A cartoonish figure is shown with its face splitting off and reforming with a smile. The video shows a surveillance camera with a red recording light. Four surveillance camera screens display various desolate industrial scenes. A UFO hovers above a road at night, with a vehicle parked on the roadside. A crosshair is superimposed on some equipment on what appears to be the lunar surface. A photo of an astronaut on the moon shows an unidentified object reflected in his visor. The reflection is magnified, next to a photo of a similar-looking object. 
A blue planet spins in space, surrounded by an aura of energy. The words \"THE CONSPIRACY THEORY OF EVERYTHING\" appear in white against the cosmic background. A figure with its head replaced by a smiling mask is shown. Then the camera focuses on a security camera lens.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the character Baron in the video \"Subterranean Trip\" and the real-life Baron?",
+ "answer": "A",
+ "candidates": [
+ "A. They share similar family backgrounds.",
+ "B. They possess analogous life experiences.",
+ "C. They exhibit alike personality traits.",
+ "D. They have comparable social positions."
+ ]
+ }
+ },
+ "tLDFojmL": {
+ "video_id": "tLDFojmL",
+ "video_duration": "178s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A hand holds several small, clear plastic spheres. Some of the spheres contain a small collection of red cylindrical items. Other spheres are filled with a cluster of white objects. One sphere is a solid gray color. A yellow box with the words \"Snap Dragons\" is visible in the background. The hand then holds one of the spheres in the foreground and displays it to the camera. The hand begins removing small objects from the box, including a string of red items and several small paper cylinders with attached wicks. The hand is then seen reassembling the clear plastic spheres and holding them. The scene shifts to an outdoor environment with snow on the ground. A person is seen holding a black and green painted object and pointing it towards a tree. They are seen loading a clear sphere into the object and then firing it at the tree. This process is repeated with a different sphere each time. The spheres break open when they make contact with the tree.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many red paintballs are in the shooter's hand at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Four.",
+ "B. One.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "NImXOCeW": {
+ "video_id": "NImXOCeW",
+ "video_duration": "97s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Physics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a close-up shot of a woman with long dark hair. The background includes red curtains and blurred images of objects, like bookshelves. The scene transitions to a computer-generated image of a spherical planet with rings around it, and a smaller sphere in the background. The camera pans to follow the movement of a satellite, and then zooms closer to show details of its construction. The scene cuts back to the woman, who is now in an indoor setting with two armchairs in the background. On the table in front of her are a large white sphere, a red apple, a green lime, and several small red and white objects. Her finger points to the small white objects. The hand is shown touching each object, including the red apple, the green lime, and the two small red objects. The scene transitions back to the close-up of the woman talking. She is then shown moving the green lime and holding it in her hand. The scene cuts back to the interior room. The woman picks up the small metal object, and is then shown standing by a set of doors. She then walks through the doors, outside and onto a stone path, and walks towards the yard. The woman continues walking, and the view cuts to the yard, which is shown through a wooden gate. The woman then walks across a field, stops, and holds up the green lime. The video ends with a close-up shot of her hand holding the green lime.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "In the video, which planet does the woman use an apple to represent?",
+ "answer": "B",
+ "candidates": [
+ "A. Saturn.",
+ "B. Jupiter.",
+ "C. Earth.",
+ "D. Venus."
+ ]
+ }
+ },
+ "nCtBHgbW": {
+ "video_id": "nCtBHgbW",
+ "video_duration": "33s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The first scene shows a person lying in a hammock with a black laptop on their lap. They are in a green, wooded area and the sun is shining through the trees. The next scene shows a person lying in a hammock near a beach. There are palm trees, a beach and an ocean in the background. The third scene shows a person sitting on rocks. There is a large city and ocean in the background. This person is holding a black mobile phone on a tripod. The fourth scene shows a person standing in a field. They are holding a black mobile phone on a tripod. There is a camera also attached to the tripod.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video is the job introduction given?",
+ "answer": "A",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the beginning of the video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "nKRwyQOk": {
+ "video_id": "nKRwyQOk",
+ "video_duration": "439s",
+ "duration": "6-8min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Three people are present in the video. Two men are in a wrestling ring, with a third man standing outside the ring. The first man is wearing a white and black jacket over a black t-shirt, red wristbands, and black patterned pants. The second man is shirtless, with red wristbands and black pants. The third man is wearing a dark suit. The video shows wrestling moves performed between the two men, including kicking, throwing, and pinning. The video ends with one of the men on the ground while the other stands and raises his arms in the air.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurs in the background music of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The volume of the music increases after the competition starts.",
+ "B. The music begins, stops at the start of the competition, and restarts after the competition ends.",
+ "C. The music starts at the beginning of the competition and stops at the end.",
+ "D. The volume of the music decreases after the competition starts."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Who wins the competition in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The man wearing patterned pants.",
+ "B. The two competitors tie.",
+ "C. The man wearing black pants.",
+ "D. The man wearing black trousers."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "When the subtitles display the competition results, what is the positional relationship between the two contestants?",
+ "answer": "B",
+ "candidates": [
+ "A. The contestant wearing black pants is to the right of the contestant wearing patterned pants.",
+ "B. The contestant wearing black pants is to the left of the contestant wearing patterned pants.",
+ "C. The contestant wearing black pants is lying on top of the contestant wearing patterned pants.",
+ "D. The contestant wearing black pants is behind the contestant wearing patterned pants."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "In the video, what technique does the man wearing patterned pants use to defeat the man wearing black pants?",
+ "answer": "B",
+ "candidates": [
+ "A. Leg sweep.",
+ "B. Grounded lion lock.",
+ "C. Double elbow throw.",
+ "D. Backflip."
+ ]
+ }
+ },
+ "GXVWYASR": {
+ "video_id": "GXVWYASR",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Soccer",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The first clip shows 3 people on a sideline, all wearing masks. One person in a football jersey with the number 5 on it has their arms held by the first masked person. A second football player in a jersey with the number 2 is drinking from a bottle. In the next clip, 2 teams in football uniforms are lined up on the field. The team on the right has maroon uniforms and the team on the left has white and blue uniforms. The two teams have a huddle in the center of the field. The next clip shows the two teams lined up on the field with several officials in black and white striped shirts. The teams walk and mingle and eventually appear to be shaking hands. The last clip shows one team walking away from the scene, and their coach in a gray jacket with a mask is approached by several people. One person in a yellow raincoat can also be seen. The two teams walk off in opposite directions, again shaking hands and engaging each other.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional change did player No. 40 of the Red Team experience after the match ended in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Calmness turned into excitement.",
+ "B. Calmness turned into irritability.",
+ "C. Calmness turned into oppression.",
+ "D. Calmness turned into depression."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What does the player from Team White, wearing number 2, do with the water cup in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Pours water into a teammate's neck and feeds water to the teammate.",
+ "B. Pours water onto a teammate's hand and drinks water.",
+ "C. Pours water into a teammate's neck and drinks water.",
+ "D. Pours water onto the ground and drinks water."
+ ]
+ }
+ },
+ "ClYMxHPZ": {
+ "video_id": "ClYMxHPZ",
+ "video_duration": "224s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "In one scene, several people stand around a table with clothes on it. A woman holds a blue sign with white lettering. In another, a woman wearing a red shirt stands in front of a group of people and gestures with her hands. In another scene, a paper target showing the image of a man with a red circle in the center hangs on a wooden structure. In a subsequent scene, a woman with a gray baseball cap speaks to the camera while sitting at a wooden table. After this, a person is shown drawing a handgun from a holster on their waist. Several people are then shown holding handguns and aiming them. One woman can be seen with a bullet target with a cluster of bullet holes in the center of it. Another scene shows a woman aiming a handgun at a paper target. A woman in a purple shirt can be seen at an outdoor range, gesturing with her hands. Next, the video shows a woman in a black shirt firing a handgun while another woman with hearing protection on stands behind her and places her hand on the shoulder of the person holding the gun. After that, the woman with the gray baseball cap can be seen speaking to the camera again. The video shows a white sign with the words \"Thank You\" and \"Ammo Sponsor.\" The next scene shows several people standing in a field, near a person who is kneeling and preparing a rifle. The scene then cuts to a view of rifle targets placed in a field. Then the woman with the gray baseball cap can be seen speaking to the camera again. A woman is then shown lying down and firing a rifle at a target. Three women can be seen standing around a wooden table with a rifle resting on it. A woman wearing a black t-shirt can be seen talking to three women while sitting on the ground. She then shows another woman how to aim a rifle with a scope on top of it. The video shows two women at a table with a rifle, one of them aiming and firing. The woman with the gray baseball cap can be seen speaking to the camera again. 
The video shows several people lying on the ground firing rifles. The woman with a black baseball cap is shown kneeling and helping another person adjust her rifle. Then the video cuts to a shot of rifles with scopes on top of them. The video cuts to the woman with the gray baseball cap speaking to the camera again. After this, the camera zooms in on the firing mechanisms of a rifle. A scene shows the woman with a black t-shirt speaking into the camera while gesturing. The camera then shows several people lying down and aiming rifles, followed by a woman helping someone aim a rifle. The video then shows several people walking in a field.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times were gunshots heard?",
+ "answer": "A",
+ "candidates": [
+ "A. Six.",
+ "B. Eight.",
+ "C. Seven.",
+ "D. Five."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What do participants learn after joining the women's armed training program?",
+ "answer": "D",
+ "candidates": [
+ "A. Learn how to use an assault rifle.",
+ "B. Learn how to use a machine gun.",
+ "C. Learn how to use a sniper rifle.",
+ "D. Learn how to use a pistol."
+ ]
+ }
+ },
+ "lfNqlQVa": {
+ "video_id": "lfNqlQVa",
+ "video_duration": "151s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Biology",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a white background and text that reads \"hhmi BioInteractive\" in a light blue and dark green color with a dark green vertical line separating the text. A green and blue drawing of a lizard is on the top right corner. The text and the lizard slowly fade out. A grey, shaded rectangle appears on the left side of the screen with the white text \"SARS-CoV-2\" on the rectangle. Next to this text is a 3D drawing of a virus in green, blue, and white colors. A white line appears under the text. Then, the text \"Infection\" appears on the same line. The word \"Infection\" fades out after a few seconds and the same phrase fades in and out. A series of white text appears next to the grey, shaded rectangle that reads \"Severe Acute\", then \"Respiratory Syndrome\", and then \"Coronavirus 2\". The virus drawing remains on the right side of the screen. The text, shaded rectangle, and line fade out. The virus drawing remains in the center of the screen. The text \"Coronavirus\" appears on the top left of the screen in a light blue, shaded rectangle. The same virus drawing appears in the center of the screen with \"SARS-CoV-2\" below it. A new text box is added in the top right corner in a light blue, shaded rectangle. The text reads \"COVID-19\". Then, the text box fades and several small, drawn, virus particles appear to the left of the original virus drawing in light blue, grey, and light orange colors. The text \"Family of viruses\" appears below the smaller viruses. A dotted line appears from one of the smaller viruses, and connects to the larger virus. A drawing of a human body outline appears next to the original virus drawing. A dotted line connects the outline to the virus. The text \"Disease\" appears under the human body drawing. The drawings of the virus particles and the human body outline fade out. The drawing of a single human outline reappears with a drawing of the lungs inside the body. The text \"Lungs\" appears with a dotted line connecting it to the lungs. Below this drawing is the text \"Coronaviruses infect humans and other animals\". The outline of the drawing appears to be coughing, and the text \"cough\" is repeated twice. The text \"Lungs\" appears again with the dotted line connected to the lungs. The lungs in the outline are shaded green. The drawing fades out with a light grey background, and only the larger virus particle remains in the center of the screen. The larger virus particle fades to show a cut-away view of the virus, revealing a blue, twisted strand and a light blue outer layer. Text with an arrow appears to the right of the twisted strand that reads \"Viral genome (RNA)\". The same text is repeated with the outer layer that reads \"Envelope.\" The twisted blue strand remains as the outer layer fades out. The screen zooms in on the outer parts of the virus particle, with blue shaded parts around the circumference. Text appears that reads \"Crown = Corona\" with the text slowly fading. The video transitions to an animated drawing of a boy with brown hair and an orange shirt. He coughs and particles come out of his mouth. A cut-away view of his nose, mouth, and lungs appears with the lungs in a pink color. A virus particle appears in a circle in the center of the lungs. The text \"Outside human cell\" appears with a drawing of a single virus particle on a pink background, with yellow circles on the surface of the background. The same text is repeated with an arrow pointing to the surface and then pointing to the virus particle which now has a dotted line to the text, \"Human cell\". The virus particle fades and the drawing turns into a blue strand, and the text reads \"Inside human cell\". The text, \"Viral genome (RNA)\" appears next to the blue strand. The text is repeated next to the drawing of a straight strand of blue. The text changes to include a pink drawing connected to the blue strand that reads, \"Cell's ribosome.\" The drawing changes to include another blue strand below it, and text next to a green drawing appears that reads, \"Viral RNA polymerase.\" Then green sticks appear. Text next to the green sticks reads \"Viral RNAs\". The green sticks fade and the green virus protein drawings appear with text that reads \"Viral proteins\". A blue line connects to the protein. Then the proteins become attached to the strand and form a twisted line, turning into the virus particle. The text \"Inside human cell\" remains at the top. The virus particle fades out and shows the text \"Outside human cell\". The video displays three virus particles, and then goes back to the boy with the same view of his lungs. The virus particle appears in the middle of his lungs. The boy coughs and the virus particle comes out of his mouth. The boy's lungs are shaded green and the text \"Fever Cough\" appears on the top right. The green color on the lungs is removed. The boy looks from side to side, and the video transitions to credits. The credits list the producers, reviewers, writers, art direction, animations, narration, and post-production.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What happens when the viral protein spike binds to a specific protein on the surface of human cells?",
+ "answer": "B",
+ "candidates": [
+ "A. The viral RNA is released, the entire genome is replicated, and RNA polymerase and proteins are translated.",
+ "B. The viral RNA is released, RNA polymerase and proteins are translated, and RNA copies are transcribed while proteins are translated.",
+ "C. RNA copies are transcribed by RNA polymerase while proteins are translated, and the viral RNA is released.",
+ "D. The viral RNA is released, RNA polymerase transcribes RNA copies while proteins are translated, and RNA polymerase and proteins are translated."
+ ]
+ }
+ },
+ "LARxRUcg": {
+ "video_id": "LARxRUcg",
+ "video_duration": "121s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two wrestlers, one in a dark singlet and one in a white singlet, face each other in a wrestling match on a purple mat with gold lettering. The wrestler in the dark singlet wears bright green wrestling shoes. There are spectators in the bleachers, and a referee in a black and white striped shirt stands nearby. The wrestlers circle each other, then the wrestler in the dark singlet shoots in for a takedown. The wrestler in the white singlet sprawls to defend. The wrestler in the dark singlet then switches to a single leg takedown and brings the wrestler in the white singlet to the mat. He then transitions to a pinning combination, attempting to turn the wrestler in the white singlet onto his back. The wrestler in the white singlet bridges and fights to avoid the pin. The wrestler in the dark singlet adjusts his grip and continues to work for the pin. He lifts the leg of the wrestler in white and appears to almost secure the pin, but then the wrestler in white breaks free. The referee watches the action closely. The wrestler in the dark singlet maintains control, holding the wrestler in white down on the mat. He continues working towards a pin, adjusting his position several times. The wrestler in the white singlet continues to struggle and bridge, attempting to avoid the pin. The wrestler in the dark singlet eventually gets the wrestler in white onto his back. The referee drops to the mat and slaps the mat, signaling a pin. The wrestler in the dark singlet gets up and walks away. The referee helps the wrestler in white up off the mat.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the role of the man wearing a gray shirt at the venue?",
+ "answer": "B",
+ "candidates": [
+ "A. An observer.",
+ "B. A coach.",
+ "C. A security officer.",
+ "D. A referee."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the man wearing black shoes fail to defeat the man wearing green shoes in a wrestling attempt?",
+ "answer": "C",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the beginning of the video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What move did the man wearing green shoes use to defeat the man wearing black shoes at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. An arm lock immobilizing both arms.",
+ "B. A leg grab and locking takedown.",
+ "C. An arm-through over-the-shoulder throw.",
+ "D. A neck hold and over-the-shoulder throw."
+ ]
+ }
+ },
+ "eipFDMvQ": {
+ "video_id": "eipFDMvQ",
+ "video_duration": "85s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A bearded man with a top knot sits at a table with a blue tablecloth. He lifts his head from his arms. He opens a white packet. He lifts the lid of a dark-colored teacup. He uses the packet to wipe the outside of the teacup. He looks at the tea leaves inside the teacup. He picks up a copper kettle and pours hot water over the tea leaves. The camera offers a close-up, overhead shot of the hot water pouring onto the tea leaves. Back to the man, he pours tea from the teacup into a small white cup. He drinks the tea from the white cup. He pours more tea from the teacup into the white cup. He drinks the tea. He refills the white cup from the teacup and takes several sips, while seeming to express his thoughts to someone offscreen. He pours more tea and drinks it. He refills the white cup again and takes another sip. He repeats this process several times.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video, how many times does the man pour water into the dark brown teapot?",
+ "answer": "D",
+ "candidates": [
+ "A. Four times.",
+ "B. Three times.",
+ "C. Two times.",
+ "D. One time."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the man unpack the shrubs?",
+ "answer": "A",
+ "candidates": [
+ "A. The video begins with it.",
+ "B. It is present throughout the entire video.",
+ "C. It occurs in the middle of the video.",
+ "D. It is shown at the end of the video."
+ ]
+ }
+ },
+ "JYBseFUa": {
+ "video_id": "JYBseFUa",
+ "video_duration": "55s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a black screen. Then, a woman with blonde wavy hair is seen talking to a man wearing a light blue collared shirt. The woman wears a pink off-the-shoulder top. The man is wearing a tie with a purple and pink pattern. The camera cuts to a wider view showing that they are in a shop with a variety of items on shelves. Another woman, wearing a red robe over a tie-dye shirt, is seen in the background holding a white bag with a red logo. The camera cuts back to the woman in pink, who opens her mouth in a surprised expression. The camera then cuts to the woman in the robe who appears to be smiling. The camera cuts to the man in the blue shirt who is clapping his hands and appears happy. The screen goes to a dark background with hearts in the background, and text that reads \"1 SALON\" in gold lettering. The camera goes to a woman in a red off-the-shoulder dress who appears to be looking down. The camera goes to a scene with three people, a woman in a red dress, a woman in a white shirt and a man in a red apron, all of whom are interacting with another woman in the foreground. The camera cuts back to the woman in the red dress, then to the woman in the tie-dye shirt and red robe who is getting her hair styled by the woman in pink. The camera cuts again to a woman in the red off-the-shoulder dress. The scene cuts to black, then to an exterior view where a woman in the red dress and a man are getting into a black car with red wheel trim. The screen goes to black, and then to another shot in the salon with a woman in a red off-the-shoulder dress looking down. Then there is a shot with a woman in a white shirt making a face and another shot with two more people in the same room. The camera goes to a black screen and then it changes to an angle with the same room and the woman in the red off-the-shoulder dress looking sad, then to a scene with a woman in pink, and the woman in the red robe and tie-dye shirt. The camera cuts again to the woman in pink. Then a shot with a man in a red apron, and a woman in a white shirt with others in the background. The camera goes to a black background and gold text that reads \"4 LOVE STORIES\". Then a shot with the woman in pink and the woman in red. There is a shot of two men sitting in chairs, then the woman in pink. Then the camera goes back to shots with the woman in pink, the woman in red, and the man in the red robe. Then the woman in pink is seen talking to the woman in the tie-dye shirt. Then there is a shot of the man with the red apron and another shot with a man and woman sitting in chairs. Finally, the last shot of the video is of a group of people hugging.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the woman wearing a pink top and the man wearing a light blue shirt in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. They are romantically involved.",
+ "B. They are colleagues.",
+ "C. They are friends.",
+ "D. They are neighbors."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who just said, \"What does that mean?\"",
+ "answer": "C",
+ "candidates": [
+ "A. A short-haired woman in a colorful dress.",
+ "B. A woman with an afro wearing a white jumpsuit.",
+ "C. A short-haired woman in a red dress.",
+ "D. A blonde woman wearing a pink blouse."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional change occurs in the man wearing a light blue shirt in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. From calmness to happiness.",
+ "B. From pain to happiness.",
+ "C. From anger to happiness.",
+ "D. From sadness to happiness."
+ ]
+ }
+ },
+ "cEtyreWh": {
+ "video_id": "cEtyreWh",
+ "video_duration": "225s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two boxers, one Black and one white, are in a boxing ring. The Black boxer wears black shorts and red gloves, while the white boxer wears blue shorts and silver gloves. They circle each other, exchanging jabs. The Black boxer lands a combination of punches, causing the white boxer to stumble backward. The white boxer is backed into a corner and takes several more hits. The referee intervenes and gives the white boxer a standing eight count. The white boxer continues but soon receives more hard blows to the head and body, again staggering backward. He leans against the ropes. The Black boxer continues punching the white boxer who is pinned against the ropes. The referee steps in and stops the fight. The Black boxer climbs onto the ropes facing the audience, raising his gloves. The white boxer sits on the floor of the ring being attended to by his corner. The Black boxer then comes down from the ropes and goes to his corner while the white boxer gets up. The referee calls the two boxers to the center of the ring, and the announcer states the name of the winning boxer and the method and time of victory. The Black boxer\u2019s arm is raised by the referee. The Black boxer celebrates while his cornermen enter the ring to congratulate him.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "Who was knocked down at the beginning of the match, and how was it done?",
+ "answer": "C",
+ "candidates": [
+ "A. The black boxer knocks down the white boxer with a right hook.",
+ "B. The black boxer knocks down the referee with a left straight punch.",
+ "C. The black boxer knocks down the white boxer with a left hook.",
+ "D. The white boxer knocks down the black boxer with a left hook."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When was the first cheer heard?",
+ "answer": "C",
+ "candidates": [
+ "A. There was no cheer.",
+ "B. There was no cheer.",
+ "C. There was no cheer.",
+ "D. There was no cheer."
+ ]
+ }
+ },
+ "TIrfzzMu": {
+ "video_id": "TIrfzzMu",
+ "video_duration": "36s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Bowling Ball",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a white long-sleeved shirt, white pants, and a dark-colored helmet with a face guard is standing on a green field. The person raises their right arm, then lowers it and takes several steps. Later, the person hits a ball with a bat. The ball goes towards a fence with logos including \"Ford,\" \"Sahara,\" and \"Lufthansa.\" The person raises both arms, and then swings a bat. There are other people wearing white clothing on the field as well. In the stands, many people are cheering. There is also a flag with a white crescent moon and star. A dark-colored box in the upper right corner of the screen shows numbers.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How do the emotions of the people in the video change?",
+ "answer": "B",
+ "candidates": [
+ "A. Emotions shift from excitement to disbelief.",
+ "B. Emotions shift from anticipation to excitement.",
+ "C. Emotions shift from anticipation to sadness.",
+ "D. There is no emotional change."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Who is the pitcher in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The person standing next to the man in the white hat.",
+ "B. The man wearing a white hat.",
+ "C. The person hit by the cricket ball.",
+ "D. The person holding a cricket bat."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "After the second strike, did the video show any abnormalities?",
+ "answer": "B",
+ "candidates": [
+ "A. The video shows snow along with a black screen.",
+ "B. The video shows flickering lines accompanied by color distortion.",
+ "C. The video display shows black vertical stripes.",
+ "D. There are no abnormalities; the video feed remains normal."
+ ]
+ }
+ },
+ "VpgZcUdk": {
+ "video_id": "VpgZcUdk",
+ "video_duration": "109s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a white t-shirt with the words \u201cEVERYTHING IS POSSIBLE\u201d is visible in the video. The person unwraps a parcel with red tape on it. Inside the parcel there is a food container. There is also a small packet containing soap. The person opens the packet of soap and spills the contents into their palm. The person then rubs their hands together until they get soapy. The person then returns to unwrapping the parcel. Once the parcel is unwrapped, a white cloth is removed, then a piece of paper is removed, and finally the food container is revealed. The person takes the red tape off of the food container. Inside the container, there is yellow rice, three sauces, and four small brown balls.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of events shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Display the hand sanitizer, apply it to both hands, wash hands, examine the pancake, tear open the outer packaging, display the price, remove the seal, open the packaging box, display the meal.",
+ "B. Display the hand sanitizer, apply it to both hands, wash hands, display the price, tear open the outer packaging, examine the pancake, remove the seal, open the packaging box, display the meal.",
+ "C. Display the hand sanitizer, apply it to both hands, wash hands, tear open the outer packaging, display the price, examine the pancake, remove the seal, open the packaging box, display the meal.",
+ "D. Display the hand sanitizer, apply it to both hands, wash hands, tear open the outer packaging, display the price, remove the seal, open the packaging box, display the meal, examine the pancake."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Did the man in the video tear off the receipt from the outer packaging?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. Not sure."
+ ]
+ }
+ },
+ "BhjkyvRO": {
+ "video_id": "BhjkyvRO",
+ "video_duration": "147s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Hiking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing light gray shorts, black leggings, and a blue shirt with a green hat and red bandana is walking on a path in a wooded area. They are wearing a gray backpack and carrying a black bag on one side. The ground is covered with pine needles and fallen branches. The person continues to walk along a path with more tree roots. They look around. Then the person is shown from the back walking on a path in a wooded area, and then standing in a clearing. A white piece of garbage is visible on the ground. Then the person is shown with a mountain range in the background. The camera moves to the person's feet, which are stepping over a rock with a white and black item under the rock. The camera moves to show a view from a high place with a blue lake visible in the distance. Then the person is shown again talking to the camera with the mountain range and lake in the background.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "How do the two men in the video feel when they climb to two thousand feet?",
+ "answer": "D",
+ "candidates": [
+ "A. Discouraged.",
+ "B. Overwhelmed with passion.",
+ "C. Energetic and exhilarated.",
+ "D. Excited."
+ ]
+ }
+ },
+ "EXLowmQV": {
+ "video_id": "EXLowmQV",
+ "video_duration": "38s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person in a grey suit enters a room through a doorway with a stained-glass window. A person in a gold-patterned dress and dark hat is already in the room. They both then walk further into the room. In a different scene, a person in a salmon-colored top and red turban is sitting and talking in front of a floral arrangement. A person in a grey suit is shown lighting a cigarette. A person in a gold-patterned dress and dark hat is standing in a dark room holding a black object. A person in a gray suit is standing nearby holding a thin, white object. In a different scene, a person in a tan dress approaches and touches the arm of a person in a white shirt and dark hat. The person in a white shirt is holding a gray suit jacket. In another scene, a person in a salmon-colored top and red turban is sitting and talking in front of a floral arrangement.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Where is the woman in the gray dress in the video preparing to go?",
+ "answer": "C",
+ "candidates": [
+ "A. Church.",
+ "B. Hospital.",
+ "C. Bar.",
+ "D. School."
+ ]
+ }
+ },
+ "drCNbTDR": {
+ "video_id": "drCNbTDR",
+ "video_duration": "119s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A white dry-erase board with a wooden frame is visible, secured to the background with white tape. The background is blurry and features black and white shapes. A person wearing a red and black plaid shirt and a black beanie enters the frame from the left and stands facing the camera. The text \"STEP ONE\" appears in purple letters. The person picks up a black marker and draws a circle on the whiteboard. The text \"PRO TIP\" appears in pink letters over a background with animated fog and two red horns. The person stands next to the circle. The person turns away from the camera and the text \"STEP TWO\" appears in purple letters. The person draws a curved line inside the circle on the whiteboard, then turns to face the camera. The person points at the camera. \"STEP THREE\" appears in purple letters. The person holds up two fingers and then begins speaking again, and he is then seen drawing two circles on the first curved line. The text \"STEP IV\" appears in purple letters. The person draws a line below the circle and two dots above it. The person continues talking. The person speaks and draws a line below the face shape and adds another line and says \"Oh, that looks terrible\". The text \"STEP FIVE\" appears in purple letters. The person draws a curved line on the bottom of the face. The text \"STEP SIX\" appears in purple letters. The person continues drawing by adding a body to the drawing. The person draws two pointy lines on top of the face and states, \"These Look Like Fox Ears\". The text \"STEP 7\" appears in purple letters, and the person is shown standing next to the drawing. The text \"NICE LEGS\" appears over the person\u2019s blurred image, who is adding legs to the drawing. The text \"STEPATE\" appears. The person adds an udder to the drawing and some spots on the cow, then stops drawing. The text \"STEP N\" appears. The person draws some lines inside the cow's face. The person adds spots to the body of the cow. The person looks at the camera. The person adds hooves to the drawing. The person steps away from the board and looks at the camera. The person begins talking while facing the camera. The person is visible from the waist up.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many steps are required for the man in the video to draw a cow?",
+ "answer": "B",
+ "candidates": [
+ "A. Eight.",
+ "B. Ten.",
+ "C. Eleven.",
+ "D. Nine."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "In which step does the man draw the cow's eyes in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Step five.",
+ "B. Step three.",
+ "C. Step four.",
+ "D. Step two."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What does the man do in the sixth step of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. He draws a pair of ears.",
+ "B. He draws a tail.",
+ "C. He draws four legs.",
+ "D. He draws two nostrils."
+ ]
+ }
+ },
+ "OmZnoEHV": {
+ "video_id": "OmZnoEHV",
+ "video_duration": "72s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Several packaged toys are on a white table. A person's hands remove several of the dog figures from their packaging, then place two of them onto the table, in front of the packaged toys. The person's hands place two additional figures onto the table and then manipulate them briefly. The person picks up the figures and places the red and green figures back on the table and then picks up the remaining blue figure. The person then picks up the other figures and sets them up again, placing them near the red and blue spaceship toy, and then places some of them into the spaceship. Next, the red and blue figures are set on the table. Finally, two small grey platforms are placed in front of all the figures.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, what is the order in which the toy dogs of different colors appear after being unboxed?",
+ "answer": "C",
+ "candidates": [
+ "A. Blue, pink, orange, red, yellow, green, white.",
+ "B. Blue, pink, green, white, red, orange, yellow.",
+ "C. Pink, red, green, blue, yellow, orange, white.",
+ "D. Pink, blue, red, yellow, green, orange, white."
+ ]
+ }
+ },
+ "NnDlOyNu": {
+ "video_id": "NnDlOyNu",
+ "video_duration": "241s",
+ "duration": "4-6min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with short dark hair wearing a dark blue long-sleeved shirt and a seatbelt is seen in a car. They move their lips, point their finger, cover their face with their hand, and touch their nose. They seem to be in a moving vehicle, and buildings with a store awning can be seen through the window. Next, two men are shown standing behind a wooden counter with a blackboard behind them. One man is wearing a white shirt and the other a white shirt and tie. They are both counting stacks of cash. The scene then switches to a large room with two gray doors, a table and white mini fridge, and four pink folding chairs. Two people enter, one wearing a dark long-sleeve shirt and jeans, and another wearing a gray hooded sweatshirt and dark shorts holding a toy gun. They both sit, the person in jeans touches the other\u2019s head, and then walks away as the other person sits fidgeting with the toy gun. A woman is then shown on a street at night. She moves her head and is talking to someone out of view. The scene shifts to one of the men at the wooden counter with a blackboard, interacting with the woman from the street. She is now in the space with him, and is talking to the man. He hands her cash, and they both talk. The scene switches again, now back to the large room, with the child in the gray hooded sweatshirt holding two toy figures while seated in a pink chair. Back at the counter, the woman is talking to the man, and he is pointing in her direction. The scene switches back to the large room. The woman in jeans is comforting the child, then they walk out of the room. The scene switches back to the man at the counter who is now counting the cash by himself. He then stands at the counter and looks around. The scene transitions back to the large room. A single toy figurine is shown in a pink chair. Then, a man in a cowboy costume enters the room, points a toy gun, and gives a high five to another man. A man wearing a space ranger costume stands in front of shelves and moves his arms. A woman in a cowboy costume moves in front of the wooden counter. Then all three people in costumes appear to do a dance. The man at the counter holds up his hands in a scared manner. Then the space ranger poses. The three people in costumes are then shown holding items. The man in the white shirt is tied up with a rope. The room is shown once again, empty. The tied-up man is shown again, and he falls to the floor and writhes as if in pain. A single toy figurine is shown on the floor and is touched by a foot. The costumed man with the toy gun approaches the man tied up. The tied-up man is seen writhing on the floor one more time. The child in the hooded sweatshirt approaches the camera with the toy figures, and puts the toys into a grey backpack. The man at the counter looks around, then seems frightened. The three costumed figures do a final pose, and the video ends with the child in the hooded sweatshirt walking towards the tied-up man on the floor.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "How does the tall man wearing a white shirt in the video treat the woman seeking a loan consultation?",
+ "answer": "C",
+ "candidates": [
+ "A. He is forceful, directly refusing and driving away the customer.",
+ "B. He is warm and friendly, actively offering assistance.",
+ "C. He initially refuses but later compromises, engaging in limited communication.",
+ "D. He is very cold and refuses to communicate, citing regulations."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event occurs at the video's conclusion?",
+ "answer": "A",
+ "candidates": [
+ "A. Toys transform into people, then threaten and restrain a tall man wearing white clothing.",
+ "B. A red-haired boy threatens and restrains a tall man wearing white clothing.",
+ "C. A man wearing a tie threatens and restrains a tall man wearing white clothing.",
+ "D. A woman wearing black clothing threatens and restrains a tall man wearing white clothing."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What is the reason the tall man in the white shirt single-handedly shut down the entire bank in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The banking system experienced a temporary fault and requires staff to investigate and resolve it.",
+ "B. New business training needs to be conducted after work hours.",
+ "C. Other colleagues have appointments and are leaving early.",
+ "D. The bank has important tasks that are unfinished."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Can you determine when the man wearing a tie and a white shirt leaves in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Throughout the entire video.",
+ "B. At the end of the video.",
+ "C. In the middle of the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "JHHBkPVt": {
+ "video_id": "JHHBkPVt",
+ "video_duration": "26s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A close-up of a photograph's corner reveals a signature, \"Martin 11\". The scene transitions to a man in a suit examining a black and white photographic print of a person's face, wearing white gloves. Multiple inverted black and white images of a couple embracing appear in a grid pattern. Next, a cylindrical device emits light from one end. This scene transitions into one where a man with glasses, wearing a gray shirt, gestures with his hands. Another person in a suit is also present. Two men are then shown laughing in a photograph displayed on a computer screen. Boxes labeled \"Glossy FineArt\" and \"FineArt Pearl 285 gsm\" are stacked on a shelf. Finally, black and white photographs of a rock formation, an older man in a plaid shirt, and an older woman holding a framed picture and a cane are displayed, culminating in the \"Hahnem\u00fchle FINEART\" logo in red text on a white background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In which setting is the man looking at the photo?",
+ "answer": "D",
+ "candidates": [
+ "A. A study.",
+ "B. An art exhibition.",
+ "C. An office.",
+ "D. A photo printing studio."
+ ]
+ }
+ },
+ "IlWrdqwa": {
+ "video_id": "IlWrdqwa",
+ "video_duration": "112s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A silver Ford Mustang drives slowly on a paved surface at night. The scene changes to a drag strip. Two cars, a red Camaro and the silver Mustang, are staged at the starting line. White smoke comes from under the Mustang, then from both cars. The starting lights sequence activates, and both cars launch quickly down the track. The Mustang appears to win. The scene cuts to empty bleachers alongside the drag strip. The scene changes again to the Mustang driving slowly on the paved surface towards the camera. A man gets out of the car and leans back in to wipe something off. The video cuts to a time slip that shows car #332 winning against car #501. The elapsed times were 10.055 seconds and 10.159 seconds, and the speeds were 136.43 mph and 133.32 mph. Finally, a logo for Destroyer 1320 appears. Below the logo is a URL and a YouTube channel name. A final shot shows a Dodge Challenger.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the likely temperature at the competition venue depicted in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The temperature is rather muggy.",
+ "B. The temperature is extremely cold.",
+ "C. The temperature is very hot.",
+ "D. The temperature is relatively cold."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unexpected event occurred during the competition?",
+ "answer": "A",
+ "candidates": [
+ "A. Brian completed a five-second tunnel run but did not win the race.",
+ "B. Brian completed a five-second tunnel run and won the race.",
+ "C. Brian completed a five-second tunnel run and tied with the opponent in the race.",
+ "D. Brian committed a foul during his five-second tunnel run."
+ ]
+ }
+ },
+ "bVXKQLJB": {
+ "video_id": "bVXKQLJB",
+ "video_duration": "42s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The first scene shows a view from inside a moving vehicle. The vehicle travels down a road with a median and a stone wall running parallel to the road on the left. There are several other vehicles on the road traveling in the same direction and in the opposite direction. In the second scene, the view switches to a street view of an intersection. A truck carrying cars is traveling through the intersection along with other passenger vehicles. In the next scene, the view switches again to an overhead view of a flipped trailer truck on a highway. There is a fire truck at the scene as well as one other person walking along the side of the highway. There is a white vehicle as well at the scene of the incident. Next the scene shows a close up of a man in a green shirt, who has a name placard below him reading 'Chris Howie' and 'Acting Operations Manager SA Ambulance Service'. In the next sequence, an overhead view of a highway scene with a police vehicle parked in the center and a few people around the vehicle. Debris is strewn across the highway. After that the scene shows again a close up of a man in a green shirt. Next there is a view of a highway intersection at night showing a few cars traveling down the road. Then another view of a highway intersection at night, showing a few cars traveling. After that a view of the highway again, showing a trailer truck turning over. There is a white car in the foreground. In the last scene, there is an overhead view of a flipped trailer truck and an ambulance. There is one person walking along the side of the highway. After that, the video shows a close up of a man in a green shirt.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the video contain photographs of casualties?",
+ "answer": "B",
+ "candidates": [
+ "A. None of the above.",
+ "B. No, it does not.",
+ "C. I do not know.",
+ "D. Yes, it does."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "In which month did an ambulance appear in the video surveillance footage?",
+ "answer": "A",
+ "candidates": [
+ "A. October.",
+ "B. August.",
+ "C. November.",
+ "D. December."
+ ]
+ }
+ },
+ "FlDOjEOZ": {
+ "video_id": "FlDOjEOZ",
+ "video_duration": "157s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man wearing a black shirt and black headphones is sitting in a black chair. Behind him are two screens. The left screen shows a soccer team logo with the name \"SOUTHAMPTON FC\" below it, and the words \"4TH\" and \"PTS 78\" and a table showing 40 games played, 23 games won, 9 games drawn and 8 games lost. The right screen shows another soccer team logo with the name \"WATFORD\" below it, and the words \"14TH\" and \"PTS 52\" and a table showing 42 games played, 12 games won, 16 games drawn and 14 games lost. Below the man and the screens, there is a rectangle that displays two soccer team logos and the numbers \"2\" and \"1\". In the center of the rectangle, the time reads \"84:52\". The timer counts up. The score changes from 2 - 1 to 2 - 2. Both the left and right screens change to show soccer team information. The timer increases to \"90:00\". The score remains 2 - 2. Both the left and right screens change to show soccer team logos in the upper corners, and a soccer match score in the middle of each screen, along with other soccer team match stats. The score changes from 2 - 2 to 3 - 2. The left and right screens change to show pictures of soccer players. The left and right screens change back to the soccer match information. The video transitions to a screen that says \"THANKS FOR WATCHING\" with several people in black clothing dancing around a disco ball.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man in the video hit the table?",
+ "answer": "A",
+ "candidates": [
+ "A. Five times.",
+ "B. Six times.",
+ "C. Four times.",
+ "D. Three times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does this video show the Bournemouth team's ranking?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes, it does.",
+ "B. No, it does not.",
+ "C. None of the above.",
+ "D. I do not know."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the competition end?",
+ "answer": "A",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the beginning of the video.",
+ "C. At the end of the video.",
+ "D. Throughout the video."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes does the man exhibit throughout the video?",
+ "answer": "B",
+ "candidates": [
+ "A. His mood moves from disappointed to upset.",
+ "B. His mood shifts from indifferent to excited.",
+ "C. His mood transitions from serious to relaxed.",
+ "D. His mood changes from happy to indifferent."
+ ]
+ }
+ },
+ "qzPkIxWe": {
+ "video_id": "qzPkIxWe",
+ "video_duration": "536s",
+ "duration": ">8min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man with short brown hair, wearing an olive-green t-shirt, sits in front of a wall with a map and a model train. He speaks directly to the camera. A train approaches and stops at a station platform. A few people are waiting on the platform. The man in the olive shirt continues speaking. A light rail train passes over a highway on an elevated track. Cars drive on the highway below. The man continues speaking. A light rail station is shown. The platform and station building are modern and made largely of glass. An animated map shows different lines in red and green. The green line extends further. A train arrives at an underground station platform. People are seen on the platform and inside the train through the windows. A white train with red stripes pulls into a station. A person with a backpack stands on the platform looking at their phone. The man in the olive shirt continues speaking. A train passes by a platform in an underground station. People stand on the platform as the train approaches and departs.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How many BRT lines are in Guadalajara according to the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 4 lines.",
+ "B. 2 lines.",
+ "C. 3 lines.",
+ "D. 1 line."
+ ]
+ }
+ },
+ "NOdmttWy": {
+ "video_id": "NOdmttWy",
+ "video_duration": "176s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A bald man wearing a black t-shirt and a striped apron stands in front of a cob oven. He gestures with his hands while looking at the camera. He picks up a metal pizza peel, turns, and approaches the oven. He opens the oven door and uses the peel and a wooden-handled tool to place a pizza inside. After a moment, he retrieves the pizza. He carries it over to a wooden table, places it on a wooden cutting board, and removes it from the peel. He picks up a mezzaluna and cuts the pizza into slices. He then picks up a slice, eats it, and reacts enthusiastically. He continues to slice and eat the pizza.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man do with a shovel inside the oven midway through the video?",
+ "answer": "D",
+ "candidates": [
+ "A. He uses the shovel to remove the pizza.",
+ "B. He uses the shovel to remove the ashes.",
+ "C. He uses the shovel to remove the yellow soil.",
+ "D. He uses the shovel to remove the charcoal."
+ ]
+ }
+ },
+ "LOOORjqu": {
+ "video_id": "LOOORjqu",
+ "video_duration": "31s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. Then, a title card appears with the words \"BOXING LEGENDS present\" in white text on a black background. To the right, there is an image of two gold boxing gloves hanging on a dark wall. The screen then goes black again, and the \"BOXING LEGENDS\" logo appears in the top right corner. Next, a black and white image of two people appears, with one facing the other and speaking into a microphone. A car is visible in the background, and also some trees. The screen then goes black again, and the \"BOXING LEGENDS\" logo appears in the top right corner. Finally, a black and white image of two people appears, with one in focus and moving their hands, while the other is out of focus. The words \"MUHAMMAD ALI\" and \"THAT'S MY NAME\" are overlaid in white text.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the role of the man in the white suit shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A journalist.",
+ "B. A coach.",
+ "C. A writer.",
+ "D. A host."
+ ]
+ }
+ },
+ "ZpVwdjZe": {
+ "video_id": "ZpVwdjZe",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a light blue shirt, a brown apron, and a hat is holding a small, flat piece of metal and showing it to the camera. They then place the piece of metal onto a workbench next to another piece of metal. They pick up an electric sander with a black disk on the end and use it to sand the metal on the workbench. Then, they pick up a rectangular metal piece that has been cut in the middle, and use the electric sander to smooth the rough edges of the cut. The sander is then moved to the end of the metal, to round the edge. The person picks up a small, angled piece of metal, and is shown placing it onto the corner of the rectangular metal, where the two edges meet.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many types of sounds are primarily featured in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. One type.",
+ "B. Two types.",
+ "C. Four types.",
+ "D. Three types."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Is the technique demonstrated by the man in the video helpful for other related professionals?",
+ "answer": "A",
+ "candidates": [
+ "A. I don't know.",
+ "B. No, it isn't.",
+ "C. Yes, it is."
+ ]
+ }
+ },
+ "jlcZDHya": {
+ "video_id": "jlcZDHya",
+ "video_duration": "47s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A young man with short dreadlocks, wearing a blue football jersey, sits in front of an indoor football field and speaks to someone off-camera. The words \"BEFORE 2018 SEASON\" appear on the screen. The scene changes to a football stadium. A football player in a crimson jersey with the name \"MURRAY\" and the number \"1\" on the back walks onto the field and greets fans. Subsequent clips show Murray, wearing the same jersey, making several plays, including a touchdown run against a team in gold uniforms, a run and spin move in the open field against a team in black uniforms, a run against a team in gray uniforms, a touchdown celebration with teammates in which he jumps up and down, and various other runs and plays. The scene transitions to a night game against a team in yellow and black uniforms. Murray runs for a touchdown. Then, against a team in orange uniforms, Murray runs for a touchdown in which he celebrates just before crossing the goal line by dropping to the ground and lying flat on his back. In a night game against a team in blue uniforms, Murray runs for a touchdown down the sideline. Finally, the same young man who appeared at the beginning of the video is shown standing at a podium in a suit, giving a speech.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did Kyler Murray jump to celebrate with his teammates?",
+ "answer": "C",
+ "candidates": [
+ "A. It did not happen.",
+ "B. It happened twice.",
+ "C. It happened once.",
+ "D. It happened three times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between Kyler Murray and the person wearing a vertically striped shirt with the number F?",
+ "answer": "C",
+ "candidates": [
+ "A. One is a player, and the other is a member of the medical personnel.",
+ "B. They are teammates.",
+ "C. One is a player, and the other is a referee.",
+ "D. They are opponents."
+ ]
+ }
+ },
+ "jzdbThgZ": {
+ "video_id": "jzdbThgZ",
+ "video_duration": "46s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video is split into two frames. In the top frame, a man wearing a black t-shirt, a silver chain, and a watch is seated in a wooden chair. He laughs with his head tilted back and one hand over his mouth. A microphone is visible in front of him and behind him are shelves containing decorative objects, and a wall with artwork. In the bottom frame, a man wearing a pink hoodie and pink pants is seated in a wooden chair and laughs with his head back and one hand on his forehead. A microphone is visible in front of him and behind him are shelves containing decorative objects, a wall with artwork, and a game box. Both men are then shown individually at different moments. They talk with hand gestures, then both men laugh and put their hands over their eyes. The man in the black shirt then talks about a story. The man in the pink shirt also talks about the story while gesturing with his hands. The video then shows the man in the black shirt repeating the story and talking about the sign interpreter. He then says \u201cWhat?\u201d and gestures, while the man in the pink hoodie laughs. The man in the black shirt says the phrase \"take the award so he can chat\". Finally, both men are shown laughing again, the man in the pink hoodie says \"Awkward,\" and the video ends. Throughout the video, words appear on the screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who said, \"It\u2019s hard to express in sign language that the man is holding a trophy\"?",
+ "answer": "A",
+ "candidates": [
+ "A. The man wearing a black short-sleeve shirt.",
+ "B. The award-winning deaf person.",
+ "C. The host of the award ceremony.",
+ "D. The man wearing a pink hoodie."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man in a black short-sleeved shirt cover his face?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. Four.",
+ "D. Five."
+ ]
+ }
+ },
+ "WuXuoxOW": {
+ "video_id": "WuXuoxOW",
+ "video_duration": "172s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Celebrity Interviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two people are in the video. A person with long brown hair and a black shirt gestures with a hand. A second person with long red hair and a blue necklace holds a purple and white container. Both people drink from different containers, a red cup and the white and purple container.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times do the two people drink water in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Six.",
+ "B. Four.",
+ "C. Five.",
+ "D. Three."
+ ]
+ }
+ },
+ "pmatNJEU": {
+ "video_id": "pmatNJEU",
+ "video_duration": "75s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person wearing a blue patterned long-sleeved shirt is standing in front of a canvas on an easel, holding up various sheets of paper. The first sheet shows a hand-drawn illustration of a series of bolts and nuts, labeled in handwriting. The person points to various parts of this illustration. The person then displays another sheet of paper that shows a photo of what appears to be a metal bike frame with various nuts and bolts. The third sheet the person displays contains a photograph of what seems to be a bike helmet with wires and padding. The person points to details in each of these photos. The background is a plain white wall.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the man in the blue shirt doing in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Displaying students' paintings.",
+ "B. Displaying photographs he has taken.",
+ "C. Displaying his own paintings.",
+ "D. Displaying diagrams of part design dimensions."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence in which the man in the blue shirt showcases his paintings in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Nuts-Bolts-Bearings.",
+ "B. Bolts-Nuts-Bearings.",
+ "C. Nuts-Bearings-Bolts.",
+ "D. Bolts-Bearings-Nuts."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many voices are heard in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. Three.",
+ "D. One."
+ ]
+ }
+ },
+ "VBOtTObR": {
+ "video_id": "VBOtTObR",
+ "video_duration": "49s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A figure in a red Santa suit is shown walking left with their hands on the sides of their head. Another figure in a blue outfit is shown standing to the right. The background shows a large window with an airplane outside of it. Four more figures appear, standing between the figure in the Santa suit and the figure in the blue outfit. Then, the background changes and shows a number of monitors. Finally, a red filter is applied to the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the little girl in blue disappear from the screen at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Because the woman in red criticized the little girl in blue.",
+ "B. Because the man in green criticized the little girl in blue.",
+ "C. Because the boy in yellow took the little girl in blue home.",
+ "D. Because the girl in green took the little girl in blue home."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What is the emotional progression of the family members of the little girl in blue clothing as shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. From initially expressing deep apologies to Santa Claus to defending the girl in blue clothing, and finally becoming angry with her.",
+ "B. The father-daughter relationship transitions from initial confusion to sincere apologies to Santa Claus, eventually leading to anger with the little girl in blue clothing.",
+ "C. From initial anger towards Santa Claus to ultimately defending the girl in blue clothing.",
+ "D. From initial confusion to becoming angry with the little girl in blue clothing."
+ ]
+ }
+ },
+ "LgmWgDQo": {
+ "video_id": "LgmWgDQo",
+ "video_duration": "603s",
+ "duration": ">8min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with a beard and a white t-shirt looks at the camera. A bald man with glasses and a red beard sits at a desk, picking up a small blue tool. A man in a suit holds a flaming ball in his outstretched right hand. A bearded man in a black jacket and beanie stands in a field, tossing a small red object towards a line of smoke or fog emitting from a long black pipe. A close-up of the back of a small toy airplane is shown; blue flames emerge from the rear of the plane. A man in a dark shirt crouches while holding the toy airplane as flames shoot out the back. A bearded man wearing protective glasses and face shield holds a hotdog over a small flame. A small drone flies through trees, the camera following close behind. A small drone hovers briefly before lifting off and flying through a dark room. A close-up of the drone's electronics shows blinking lights. A bald man with glasses sits at a desk with a microphone and speaks. Close up of someone assembling drone parts. Small drones with lights fly in a dark room. A small drone takes off from the ground and flies away, leaving traces of light behind. A drone moves quickly through the air, near the top of a screen. More footage of someone assembling a small drone. A bearded man with headphones and a microphone speaks from what looks like a workshop. Drones fly in the dark, leaving streaks of colored light behind. More drone assembling. Two men sit at a desk working on a drone together. \"Building Drew's Set of Props\" appears at the bottom of the screen. A man places a small, black circular piece inside a slightly larger silver cylinder, over a clear, plastic, 3-pronged base. A different person opens a box revealing bags of colorful LED lights inside. A drone flies in a parking lot. Small drones with LED lights are shown in the dark, on sand. A drone flies over a bridge at night, passing near some trees. Another drone flight in the dark. More drone flying, this time with the camera seeming to be mounted on the drone. A man wearing FPV goggles and a headlamp walks at night. More drone assembly at night. A group of men gathers at night, each one holding a small drone. A dark background with a silver logo featuring a skull in a crosshair spins and zooms towards the camera. The text \"NINE MONTHS LATER\" appears in white text on a black background. A drone flies in the dark. A group of people at a picnic table are working on assembling more drones. A bearded man wearing FPV goggles works at a table with other drone hobbyists. A first-person view from a drone shows it flying fast through trees.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes are observed in the drone during the first half of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The drone remains unchanged.",
+ "B. The main body of the drone explodes.",
+ "C. Two of the drone's LED lights are broken.",
+ "D. The drone's propellers explode."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many drones are flying in the air at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Five drones are present.",
+ "B. Four drones are present.",
+ "C. Two drones are present.",
+ "D. Three drones are present."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What happens when a man makes a drone perform a flip?",
+ "answer": "A",
+ "candidates": [
+ "A. The drone crashes to the ground.",
+ "B. The drone crashes onto the roof.",
+ "C. The drone successfully performs the flip and lands safely on the ground.",
+ "D. The drone crashes onto the balcony."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why do LED lights malfunction in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The contact points on the battery are not installed in the correct position.",
+ "B. The centrifugal force is so great that the contact points on the battery cannot make contact.",
+ "C. The contact points on the battery are rusted.",
+ "D. The centrifugal force is so great that the contact points on the battery are thrown off."
+ ]
+ }
+ },
+ "XRAXieIb": {
+ "video_id": "XRAXieIb",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The first video shows a street with buildings on either side. The street is filled with pedestrians and parked motorcycles. Many of the buildings have colorful signs. The second video shows several tall apartment buildings with many windows. The buildings are surrounded by parking lots and a few trees. The third video shows a wet street with buildings on either side. There are several bicycles parked along the side of the street. Pedestrians are walking in both directions. One person is wearing a yellow jacket. The fourth video shows a sidewalk with buildings on one side. Pedestrians are walking in both directions. There are several signs on the sidewalk. Parked cars are visible on the other side of the street. The fifth video shows a webpage with the heading \"Walk Score\". Under the heading is the text \"London, Ontario\" and a score of 25 which is described as \"Car-Dependent\". Underneath is a map of the area. The sixth video shows a webpage with the heading \"Walk Score\". Under the heading is the text \"London, England\" and a score of 99 which is described as \"Walker's Paradise\". Underneath is a map of the area. The seventh video shows a street with buildings on either side. Pedestrians are walking on the sidewalks in both directions. Bicycles are also being ridden on the street and parked along the side. The eighth video shows police officers riding horses. The officers are wearing yellow jackets. The horses are dark brown. There are trees on either side of the path. The ninth video shows a street with buildings on either side. Pedestrians are walking in both directions. Bicycles are parked along the side of the street. The tenth video shows people walking through a covered walkway with glass walls. The walkway is long with a white ceiling and lights.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What are the differences between the old town and the new town mentioned in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The old town has fewer pedestrians than the new town.",
+ "B. The old town has more pedestrians than the new town.",
+ "C. The old town has more cars than the new town.",
+ "D. The old town is more prone to traffic accidents than the new town."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Did a car with its right turn signal on appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. Unclear.",
+ "C. No."
+ ]
+ }
+ },
+ "fmucmNhj": {
+ "video_id": "fmucmNhj",
+ "video_duration": "151s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Fitness",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black sweatshirt stands in a kitchen and holds a small object. The video transitions to a living room. The same person appears in the living room and a box appears at the bottom of the screen with the words \"MONDAY\", \"Turf Games Prep\". The person looks around. The person holds a small object in their hand. The person is in the kitchen and holds a gold-colored object in his hand. The video cuts to a gym, where several people are shown lifting weights. The video focuses on a man lifting a barbell and squatting. The video then transitions to the same person on a rowing machine. He is also seen lifting weights and doing pull-ups. The video shows a wrist wearing a smart watch that displays the time, total exercise time, calories burned and average heart rate. The video cuts to the person in a car. The video shows the person driving with another person in the passenger seat. The person driving is wearing a black top and has a towel around their neck. The video ends with the watch displaying the heart rate information.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order of the man's training routine in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Light back squats, 30 pull-ups, 2000-meter rowing, 40 cleans.",
+ "B. Light back squats, 2000-meter rowing, 30 pull-ups, 40 cleans.",
+ "C. 2000-meter rowing, light back squats, 30 pull-ups, 40 cleans.",
+ "D. 30 pull-ups, light back squats, 2000-meter rowing, 40 cleans."
+ ]
+ }
+ },
+ "HIWyvwld": {
+ "video_id": "HIWyvwld",
+ "video_duration": "25s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a black screen. Then a grey screen appears with a logo. This logo consists of two stylized letters 'em', then a vertical line, then the text 'Echoe Malone', with the text 'Beauty & Brains' underneath. The screen then turns white, and the logo is still present. The screen turns black with the text 'at BeautyCon LA 2015'. The scene changes to an outdoor scene where there is a large crowd, white tents, two large balloons floating, and building in the background. Then a black SUV drives past a long line of people waiting. Another black SUV is shown driving, with tents, a tall wooden structure, a green slide, and a ferris wheel visible in the background. Next, a red pickup truck with cardboard piled high on its bed drives past the line of people. Then several people are shown walking across a street intersection, then many people walking on a sidewalk. The view cuts to a close up of two people. One person has blonde hair and is wearing sunglasses, while the other has dark hair and is wearing bright red lipstick. The view cuts to a group of people in a crowd. One person has blonde, spiky hair and is wearing a black sleeveless top with white stars, and is taking photos with their phone. The video shows a group of people in front of a black wall with logos printed on it. Then the scene cuts to an indoor space with large pink foil balloons that spell out 'BEAUTYCON'. There are white mannequins displaying shirts and multiple people are milling about. The view switches to a close up of several people, some of whom are waving. The next view shows many people waving. The view changes to show a person with dark hair wearing a pink crop top, surrounded by people waving. There are more views of crowds of people waving.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the video contain any footage of cosmetics?",
+ "answer": "D",
+ "candidates": [
+ "A. None of the above.",
+ "B. Yes, it does.",
+ "C. I do not know.",
+ "D. No, it does not."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What might the girl wearing a pink sleeveless top in the video be doing?",
+ "answer": "D",
+ "candidates": [
+ "A. Atmosphere enhancer on site.",
+ "B. Event organizer.",
+ "C. Advertising promoter.",
+ "D. Interview journalist."
+ ]
+ }
+ },
+ "EJyEnDLu": {
+ "video_id": "EJyEnDLu",
+ "video_duration": "166s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two boxers, Tenshin Nasukawa and Floyd Mayweather, stand in a boxing ring with a referee between them. Nasukawa has blond hair styled upwards and wears white boxing shorts and gloves. Mayweather is bald, has a goatee, and wears black and silver shorts and blue and white gloves. A man dabs Nasukawa\u2019s face with a cotton swab. Mayweather has his face wiped with a yellow towel and takes a sip of water from a bottle. Both boxers look around the ring. The referee separates from the boxers. The bell rings and the match begins. Nasukawa throws some punches but Mayweather dodges them. Mayweather lands a punch to Nasukawa\u2019s face. The boxers circle each other. Mayweather lands another punch and Nasukawa stumbles. Mayweather lands a left hook and Nasukawa falls to the mat. He gets up quickly. Mayweather corners Nasukawa and delivers a combination of punches, forcing Nasukawa back towards the ropes. Nasukawa falls to the mat again. He rises unsteadily. Mayweather lands a left hook that sends Nasukawa to the canvas. The referee steps in and stops the match. Nasukawa looks dazed and upset while members of his corner enter the ring to tend to him.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What was the black athlete's emotional state before the start of the race?",
+ "answer": "D",
+ "candidates": [
+ "A. Fearful and anxious.",
+ "B. Sad and sorrowful.",
+ "C. Excited and enthusiastic.",
+ "D. Happy and relaxed."
+ ]
+ }
+ },
+ "DyHVNvpw": {
+ "video_id": "DyHVNvpw",
+ "video_duration": "115s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A white pickup truck and a gray sedan are driving on a dry, grassy field. A dark tire lies on the field in the foreground. Trees and a road are visible in the background. The cars begin to drift, kicking up dust. The white truck drives further away, while the sedan continues to drift closer to the foreground. The sedan spins in a circle multiple times, creating a large cloud of dust. The white truck continues driving around the field, making wide turns and kicking up dust. Two figures appear in the bottom left corner, watching the cars. One of them is wearing a yellow shirt. The sedan and the truck continue driving in circles, occasionally drifting. The video ends with the sedan drifting in the mid-ground, and the truck driving away into the background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, from which direction did the car sounds originate before the off-road race began?",
+ "answer": "B",
+ "candidates": [
+ "A. The car sounds come from the road on the right side of the camera view.",
+ "B. The car sounds come from the road on the left side of the camera view.",
+ "C. The car sounds come from the road in front of the camera view.",
+ "D. The car sounds come from the road in front of the camera view."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the blue car drift in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. One time.",
+ "B. Four times.",
+ "C. Three times.",
+ "D. Two times."
+ ]
+ }
+ },
+ "sgMZzmvL": {
+ "video_id": "sgMZzmvL",
+ "video_duration": "100s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A pink background with horizontal lines moves from right to left. A girl with red hair tied into two buns is visible in the right portion of the screen. The background changes to show a purple background with horizontal lines. The words \"The Powerpuff Girls\" are displayed above three girls with different colored hair each with a colored streak behind them, they move from right to left. The screen turns to black with white text that says \"princess snorebucks\". Then, a large round head with a large toothy grin with orange wavy lines in the background is shown with the words from before overlaid on it. The same head is now wearing a yellow crown with a purple jewel, its eyes are narrower, and its cheeks are colored with pink spots. Three girls are now in bed, each under a different colored blanket. A man in a white lab coat stands beside the bed holding a golden box. A blue dog figure is in the foreground. The man smiles. The three girls are now shown closer, the three are all looking forward under their covers. The man is now kneeling on the floor with a yellow box in his hands. A musical note appears from the box and heads towards the bed of girls, the girls are now sleeping. Three dream bubbles show an ice cream cone, a butterfly, and a rabbit. The golden box is shown again. The golden box plays musical notes. The golden box now opens to reveal a pink-dressed figurine spinning inside. The figurine is replaced by a dark-haired figure in a bee costume and crown. A character with red, curly hair in a yellow top and purple skirt is shown with an angry expression. The screen shows a man in a suit signing a document, the red-haired figure is seen standing below him. The screen focuses on her angry face. The girl with red hair appears, with an angry expression with eyebrows furrowed and a bared mouth, then her eyes and mouth are open wide with a surprised expression. 
The girl with red hair is seen standing behind red curtains, she has a yellow top and a black bottom. Then in the same outfit she is standing in a room with a checked floor and a gold crown on a table in the foreground. The girl with red hair is seen again with a wide-open angry mouth, then her teeth are closed in an angry expression, and finally her mouth is closed with an angry eyebrow line. The three girls are seen sleeping in bed. The red-haired character is seen again with a smile and her hands up and open. The three girls are shown in bed sleeping again with their dreams appearing above them.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the orange-haired girl in the video want the other three girls to remain asleep?",
+ "answer": "B",
+ "candidates": [
+ "A. She was previously bullied by the other three girls.",
+ "B. She wants to be a hero and banish the Powerpuff Girls from the town.",
+ "C. She wants the three girls to ask her for help.",
+ "D. She desires to be the leader of the three girls."
+ ]
+ }
+ },
+ "IOoGwVsu": {
+ "video_id": "IOoGwVsu",
+ "video_duration": "105s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A hand holding a marker draws a semicircle, then adds scalloped lines to make a helmet, then adds more lines to complete a helmeted warrior\u2019s head. The hand then draws the body of the warrior, and then starts on a second one, facing the first. The hand draws the body of the second warrior, completing the drawing with black lines, and then writes the word \u201cAthens\u201d above the first warrior and \u201cSparta\u201d above the second. The hand then fills in the warriors and draws additional figures underneath them, and then draws flames coming from the warriors' helmets. The hand continues to fill in and highlight more details, then a large explosion appears on the left side. The hand proceeds to fill the left warrior with black, then outlines the right warrior.The hand then starts to draw a new image, starting with the head of a person, then the ear, the neck, and finally adding a beard. The hand then adds a body and a face to complete the drawing. Next, the hand adds a horizontal rectangle with a triangle on top to resemble a building, and then adds detail to that building. The hand then adds additional structures around the building, completing the image by drawing a landscape.The hand then draws a new person, starting with the head and then adding detail to complete the body. Next, the hand draws another human figure, also starting with the head, then the body, and then adds a red line vertically down this figure. The hand adds a horizontal red line crossing the first red line. Next, the hand draws a line vertically down a new person, and the adds 6 and a half heads vertically stacked on top of the new person. The hand then draws a slightly taller figure, also with a stacked representation of 8 heads beside him. The hand then draws a figure in a dynamic pose, and writes the word \u201cSMASH\u201d.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Are four torches burning in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. None of the above.",
+ "B. No.",
+ "C. I don't know.",
+ "D. Yes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many different explanations were heard in total in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Two.",
+ "D. One."
+ ]
+ }
+ },
+ "qtFlislJ": {
+ "video_id": "qtFlislJ",
+ "video_duration": "286s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Hiking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a shot of a light-colored antler lying on a patch of dirt and sparse grass. The camera zooms in and around the antler. A hand picks it up. Then the camera follows a different light-colored antler, also lying on the ground. The camera follows the hand as it sets the antler next to a skull. Then, the hand lifts the first antler and sets it on a tan-colored backpack. More antlers are placed on top of the backpack and secured with the straps. The words \"End of Day 1\" appear on a black screen. The video then shows a person with a grey hat and dark jacket. They are walking along a path through the trees. Then, a shot of their hiking boots resting in the grass. The camera switches views to an open field with sparse trees. The camera then switches to a close view, showing antlers lying on the ground. Then, a hand picks up the antler to show its shape. The hand also picks up another white antler. The camera pans to show a tan-colored bush. A white antler appears wedged within the bush. The hand reaches in and tries to pull out the antler. It is pulled out. The hand moves another antler around and then places two antlers on the ground with a skull. The video then ends with a logo for \"Fire Mountain.\"",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why did the hunter sit down to rest halfway through his journey?",
+ "answer": "B",
+ "candidates": [
+ "A. Because there was no road ahead.",
+ "B. Because his feet were extremely fatigued.",
+ "C. Because he was attracted by the animals nearby.",
+ "D. Because he had been injured."
+ ]
+ }
+ },
+ "mqCxshCf": {
+ "video_id": "mqCxshCf",
+ "video_duration": "115s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a shot of a person with blonde hair wearing a black jacket in an urban setting. The camera then shifts to a view of two women. One woman has blonde hair in a ponytail wearing a black jacket, and the second woman is wearing a gray hooded sweatshirt with a red \"R\" on the front, orange patterned leggings, a bright green belt, and white tennis shoes with neon yellow soles and accents. The second woman is holding a water bottle and a phone with wired headphones. The first woman is speaking into the camera and the second woman waves. The two women continue to stand on a paved area near some steps. The first woman continues speaking into the camera. The camera then shifts to show the second woman's outfit. The camera then switches between the two women who are continuing to speak. The first woman points to a set of steps, and the camera moves to show an American flag and a blue and yellow flag on flagpoles in the distance. The camera then shows a wide shot of a plaza with a fountain in the center. The camera returns to a close shot of the first woman who then points to a concrete ledge. The second woman is seen stepping onto and off of the ledge repeatedly, followed by the first woman joining her in stepping onto and off of the ledge repeatedly. They are then seen doing abdominal exercises while seated on the ledge. The camera then switches to a view of the silhouettes of the two women walking on the ledge as the video comes to a close. The video shows the first woman walking and speaking with a view of a building, a statue, and a plaza in the background. The video continues to shift between showing the first woman speaking to the camera and the buildings in the distance, before ending with a view of the first woman speaking with the second woman in the background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the action of running up the path occur in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. One.",
+ "B. Four.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "UwQqrXaw": {
+ "video_id": "UwQqrXaw",
+ "video_duration": "40s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Bowling Ball",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A cricket match is being played. The score is shown as \"AUS 116/3\" then \"AUS 116/4\", and the \"Overs\" progress from \"23.3\" to \"23.4\". One person, wearing a dark-colored uniform, throws a ball towards a person wearing a yellow and green uniform. The person in yellow and green hits the ball with a bat. Another person in a dark-colored uniform catches the ball. Then, a group of people wearing dark uniforms gather and begin to celebrate. Then the camera shows a crowd of people, many of them wearing wide-brimmed hats. There are more shots of the people in dark uniforms celebrating. Then it zooms into a logo which includes a fern. Then the camera shows the person in dark uniform who threw the ball in another sequence. The person in yellow and green swings a bat to hit the ball. Then there is another wider shot of the field. Then it focuses on the person in dark uniform who caught the ball, who is holding the ball. Then it zooms into his face. He raises his arms in celebration. Then there is a flash. Finally, there is a shot of a man in yellow and green walking off the field. The score is displayed again, and a name \"MICHAEL CLARKE\" is displayed in text.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does Clarke leave the field?",
+ "answer": "D",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. Throughout the entire video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "ROcytYWJ": {
+ "video_id": "ROcytYWJ",
+ "video_duration": "485s",
+ "duration": ">8min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows several clips of what appears to be a soccer match. In the first clip, two players wearing yellow jerseys and black shorts are passing a soccer ball to each other, while two other players, one in a yellow jersey and red shorts, and one in a blue jersey and dark shorts, are standing apart.The next series of clips show a soccer match in progress. There are many players in different positions on the field. Some players are wearing red jerseys and red shorts, and some are wearing yellow jerseys and black shorts. The ball is passed back and forth between players in the different jersey colors. A player in a blue jersey and black shorts is running on the field, as is a player in a green shirt, black shorts, and a green cap.The scene then shifts to show a player wearing a yellow jersey and red shorts setting up for what looks like a penalty kick. The player with the yellow jersey and red shorts then backs up as another player with a yellow jersey and red shorts moves into the kick position. Players in various colored uniforms run around the field. A player in a blue jersey runs on the field. Players wearing yellow jerseys and black shorts run, jump, and slide on the field. The scene then shows a player with a yellow jersey and red shorts attempting a goal, but it\u2019s stopped by a player in a red jersey. Other players run to the scene. Players start to celebrate, jumping and hugging each other. The scene is chaotic, with players running and jumping in multiple directions.The final series of clips show a match in progress. Players wearing yellow jerseys and red shorts run around a player wearing a yellow jersey and black shorts and a green sleeve. The player in the yellow jersey and black shorts is on the ground near the soccer goal, appearing to have been hurt. Other players run to his aid as a player wearing a blue shirt runs onto the field and pulls out a card. 
The scene shifts to show another player in a yellow jersey with black shorts and a green armband running towards a soccer ball and kicking it. A person in a red jersey tries to stop it, but the ball goes in, so the other team celebrates by gathering together and hugging. A player is briefly shown in a yellow jersey and red shorts walking away from the action with his head down.The video continues by showing a variety of game situations and angles. The players in red and yellow jerseys and shorts run around the soccer field as the game continues. The video ends with players huddling together while a coach instructs them. Some players are shaking hands, and then a coach is shown giving instruction to his team. After more scenes from the game, players in yellow jerseys celebrate after a goal is made. The video continues with scenes from game play.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What was the game clock time displayed when the team wearing fluorescent jerseys scored their first goal in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 89 minutes and 10 seconds.",
+ "B. 76 minutes and 46 seconds.",
+ "C. 62 minutes and 20 seconds.",
+ "D. 47 minutes and 30 seconds."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual incident occurred in the middle of the video that caused the referee to show a yellow card?",
+ "answer": "D",
+ "candidates": [
+ "A. A player in the fluorescent team uniform suffered a leg injury.",
+ "B. A player in the red and yellow uniform maliciously tripped a player in the fluorescent team uniform.",
+ "C. A player in the red and yellow team uniform was hit on the head by the ball.",
+ "D. A player in the fluorescent team uniform maliciously tripped a player in the red and yellow team uniform."
+ ]
+ }
+ },
+ "vTcpByqO": {
+ "video_id": "vTcpByqO",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person in yellow and blue clothing is touching a black motorcycle tire. There are other people standing around the motorcycle. There is a small graphic of a manual transmission in the upper right corner of the screen. Three motorcycles ride along a wet track. There are some blurry figures in the background. The motorcycles continue to race along the wet track. There is now a group of 3 motorcycles, and one of them is number 3, with a mostly orange coloring. The 3 motorcycles continue to ride, and the #3 motorcycle is leading. The #3 motorcycle begins to veer left, and a person on a black motorcycle can be seen in the background, who is #19. The #3 motorcycle falls off to the left of the track. The motorcycle and person are sliding on the grass. A person on a #3 motorcycle slides on the ground in the dirt and grass area of a track, while other motorcycles continue around the track behind them. There are now multiple motorcycles traveling around the track and the fallen motorcycle is visible in the grass. One motorcycle falls on the track. A person in yellow clothing can be seen sliding on the grass alongside their motorcycle. Another motorcycle speeds past. Three motorcycles appear on the track, with numbers 1, 19, and 12. The 3 motorcycles are moving down the track together. Now only 2 motorcycles are visible, and then only 1 motorcycle, #501, is in the shot, leaning far to the left. The motorcycle speeds down the track, and a graphic is shown with a race results: Gobert is first, Fogarty is second, and Guaresti is third.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "Which drivers were involved in an accident during the race?",
+ "answer": "B",
+ "candidates": [
+ "A. Frankie and Fogarty.",
+ "B. Corsa and Frankie.",
+ "C. Lat Haga and Govan.",
+ "D. Gobert and Corsa."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the positional relationship between Corsa and Frankie in the middle of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Corsa is behind Frankie.",
+ "B. They are back to back.",
+ "C. Corsa is in front of Frankie.",
+ "D. They are shoulder to shoulder."
+ ]
+ }
+ },
+ "ibchzmVW": {
+ "video_id": "ibchzmVW",
+ "video_duration": "68s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two men sit at a table with open laptops in front of them. A bookshelf filled with books and magazines is visible behind them. The man on the left wears a light blue, long-sleeved button-down shirt, while the man on the right wears a maroon button-down shirt and thin-framed glasses. The man on the left scratches his ear. The man on the right gestures with both hands as he speaks. The man on the right looks down at his laptop and types, while the man on the left watches him. The man on the right gestures with his hands as he continues to speak. He turns to the other man, who smiles. The man on the right turns back to his laptop and continues typing. The man on the left watches and gestures with his hands as he speaks. The man on the right looks up and turns to the left while speaking. The man on the left nods as the man on the right continues speaking. He raises a hand to his mouth as he talks to the man on the left. The man on the right raises his hand and gestures as he speaks. The man on the left picks up his phone and holds it to his ear. The man on the right continues typing, then looks up at the man on the left as he talks. The man on the left puts down his phone. The man on the right continues talking. He gestures with one hand while speaking and makes a typing motion with the other hand. The man on the left listens intently. The man on the right smiles and nods as he continues talking. He types on his laptop. The man on the right keeps typing. The man on the left leans his head to the side as he watches. The man on the right glances up at him, looks down, and keeps typing. The man on the right turns to the left, shrugs, and looks expectantly at the other man. The man on the left looks down and frowns. He raises his hands as he speaks. The man on the right watches and listens. He types on his laptop. The man on the right speaks, gesturing with his right hand while typing with his left. The man on the left watches and listens. 
The man on the right continues typing and gesturing with his hand. The man on the left looks at the man on the right and smiles.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotional state of the person wearing a blue shirt in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Compromising.",
+ "B. Angry.",
+ "C. Happy.",
+ "D. Sad."
+ ]
+ }
+ },
+ "aWYLcGSY": {
+ "video_id": "aWYLcGSY",
+ "video_duration": "64s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Bowling Ball",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two people are visible on a green field, one wearing a red shirt and black pants with a black hat, and the other person wearing a blue and yellow shirt and blue pants. The person in the blue and yellow outfit runs a few steps, lifts their arm, and throws a small white ball. Then a third person is visible at the base of the screen wearing an orange and black outfit. The third person hits the ball with a bat. The camera pans upwards, showing people in a stadium. The camera moves back to the field, where the ball goes to a person in a blue outfit. The camera then moves to the stands, showing crowds of people, some of whom are waving flags. Then it shows the man in blue walking slowly back to his position. Then it shows a man in an orange outfit who looks concerned. It then shows a man with sunglasses and a goatee in an orange shirt. The man in orange looks very stressed and has his head down. Then, there is another shot of the field, showing the man in blue with a ball. It shows a man with sunglasses again, who claps his hands. There is a shot of a man with a helmet, looking down. Then there is another shot of the man in the blue outfit. The camera moves back to the field, where the man in orange hits the ball with a bat. The camera shows the stadium again, then a bright light. Then it shows the field again. There is a man with sunglasses and a goatee looking excited. A group of people in orange shirts celebrate. And the video ends with another man in blue looking stressed.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the role of the man wearing purple sunglasses in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. He is the coach of the blue team.",
+ "B. He is a fan of the orange team.",
+ "C. He is a fan of the blue team.",
+ "D. He is the coach of the orange team."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "In the video, did the referee stop the competitor from continuing the match?",
+ "answer": "C",
+ "candidates": [
+ "A. None of the above.",
+ "B. Yes, the referee did.",
+ "C. No, the referee did not.",
+ "D. I don't know."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes did the batter wearing the orange jersey go through in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From excitement to disappointment.",
+ "B. From disappointment to excitement and then to enthusiasm.",
+ "C. From excitement to disappointment and then to excitement.",
+ "D. From excitement to excitement and then to disappointment."
+ ]
+ }
+ },
+ "QrpQnBfp": {
+ "video_id": "QrpQnBfp",
+ "video_duration": "45s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen displaying the word \"WORLD\" in white. The scene then changes to a third-person perspective of a character with light green hair and a red and brown outfit, interacting with a menu. The character then runs across a light-colored terrain with some rock formations. A large bird with multicolored wings is seen flying overhead. A white text overlay displays \"NEED\", \"ICE\", \"PAL\", \"INSTANCE\", and \"TO\". The video shows a large brown and white bird flying near the character. The character faces a large, light-blue creature with large pink wings, firing a large, dark-colored gun at it, which produces an explosion of ice. The text overlay says \"SO\", \"IN\", \"JETRAGON\", \"SEE\", \"COUPLE\", \"ICE\". Then a second ice explosion occurs, after which the text overlay says \"DOWN\", \"HAVE\", \"GRENADE\", \"ROCKET\", \"SHOWN\", \"YOU\", \"WHILE\", \"FROZEN\". A fireball explosion and a white smoke explosion occur. Then the text overlay says \"GO\", \"THE\", \"SOMETIMES\", \"SHOW\", \"ANY\", \"ALL\", \"DON'T\", \"AGAIN\", \"SHOOT\", \"AIR\", \"THIS\", \"ONCE\", \"DOWN\", \"41,000*\", \"CASE\", \"IS\", \"A\", \"DO\", \"VERY\", and \"FARM\". A large, dead, multicolored creature is then shown on the ground.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What mood is the man in after the player defeats Jetragon?",
+ "answer": "D",
+ "candidates": [
+ "A. Excited.",
+ "B. Depressed.",
+ "C. Mocking.",
+ "D. Proud."
+ ]
+ }
+ },
+ "SbImIxzv": {
+ "video_id": "SbImIxzv",
+ "video_duration": "610s",
+ "duration": ">8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A video plays on a screen showing two dark-skinned boys. Two men with beards sit in front of the screen. One wears a gray sweater and the other a white shirt. The scene changes to a woman with long black hair, wearing a black turtleneck dress, standing on a stage with a dark background. She gestures with her hands as she speaks. The scene cuts to two men on a brown leather couch. One has partially gray, short hair, and wears a black t-shirt. The other man has short black hair, a black t-shirt with a green design, and yellow and white sneakers. A bearded man appears in a smaller frame in the bottom right corner. The man with gray hair talks and gestures while the other man listens. The perspective switches to the bearded man who is wearing headphones and a black t-shirt against a swirling purple, teal, and red background. \u201cCHECK THE DESCRIPTION\u201d and \u201cBOBBY\u2019S PERSPECTIVE\u201d appear in orange and purple text. The scene returns to the two men on the couch; both are now gesturing as they talk. The video cuts back to Bobby, who is talking and gesturing. Two men sit at a table with a laptop between them. One wears a camouflage tank top and a baseball cap, and the other a black t-shirt. They appear to be having a conversation. The scene cuts back to the men on the brown couch. Bobby reappears, looking down. An image of a shirtless man with tattoos appears with text that reads \"LAST TIME TATE DRUNK\". The scene changes back to Bobby, who raises his eyebrows and speaks, looking directly at the camera. Bobby closes his eyes and lowers his head slightly.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What does the man wearing a black short-sleeved shirt do at the end of the short video shared in the clip?",
+ "answer": "B",
+ "candidates": [
+ "A. Displays a Bugatti car.",
+ "B. Lifts the largest nut in the world.",
+ "C. Shows a standard-sized nut.",
+ "D. Performs on stage."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion does the man wearing headphones in the video display towards the content posted by Nas Daily?",
+ "answer": "D",
+ "candidates": [
+ "A. Approval.",
+ "B. Disgust.",
+ "C. Disappointment.",
+ "D. Sarcasm."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the account rating displayed on the Nas Daily statistics interface in the second half of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. B+.",
+ "B. A-.",
+ "C. B-.",
+ "D. A+."
+ ]
+ }
+ },
+ "dJnHlLlR": {
+ "video_id": "dJnHlLlR",
+ "video_duration": "68s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a series of comic book panels and character illustrations. The first panel shows a purple character with a golden glove with different colored gems. The second panel features the same character but now blue with golden shoulder pads, holding a glowing orb. The next panel shows the character again, in blue and gold, wearing a gem-studded gauntlet. Then the same character is depicted with a shiny gold appearance, with glowing white eyes. Another panel is shown with many characters in the night sky around a single glowing character. Next, a panel shows a blue character with golden armor firing a burst of bright orange energy at the character in a seated position wearing black and gold. Then there are a few panels with a partially-obscured view of the blue and golden character, and a final set of panels shows two versions of the same character in the night sky with outstretched arms, with a single panel showing just the night sky.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Which of the following scenes does not appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Thanos reflects on the differences between the Infinity Gauntlet and the Cosmic Heart.",
+ "B. Thanos uses the Cosmic Heart to destroy the universe.",
+ "C. Thanos engages in an intense battle with other characters.",
+ "D. Thanos restores the universe after its destruction."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Which outcome is most likely to occur after Thanos restores the universe?",
+ "answer": "D",
+ "candidates": [
+ "A. Thanos is defeated by other characters.",
+ "B. Thanos acquires a new cosmic heart.",
+ "C. The Heart of the Universe disappears from reality.",
+ "D. Thanos continues to use the Infinity Gauntlet."
+ ]
+ }
+ },
+ "BAYNeYkE": {
+ "video_id": "BAYNeYkE",
+ "video_duration": "192s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two individuals wearing wrestling uniforms are on a red mat with a blue area in the middle. One person's uniform is black with gold trim and a letter 'E', while the other person's is navy blue and white with the letter 'M'. There is a referee wearing a striped shirt and black pants, as well as another referee. There is a large crowd of people seated in the bleachers behind them. During the video, the two wrestlers engage with each other, attempt to take each other down, and move across the mat. The referee makes hand gestures and walks around the two wrestlers. The camera focuses on the two wrestlers.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Who won the match?",
+ "answer": "B",
+ "candidates": [
+ "A. A draw.",
+ "B. Donathan.",
+ "C. Hart.",
+ "D. Unknown."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What was the final score between Donathan and Hart?",
+ "answer": "C",
+ "candidates": [
+ "A. 5 to 2.",
+ "B. 5 to 3.",
+ "C. 7 to 4.",
+ "D. 7 to 3."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many types of sounds are there in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. One.",
+ "C. Two.",
+ "D. Four."
+ ]
+ }
+ },
+ "orphuEQS": {
+ "video_id": "orphuEQS",
+ "video_duration": "159s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a large green rectangle on the left side of the screen and a black background. Then the view shifts to show a person with red hair, yellow clothing, and a silver sword walking through a stone corridor with columns. Next, the video transitions to the person standing in a round, dark room. The next scene shows the green rectangle and black background again, but with red numbers descending on the screen, followed by descriptive text next to each number. Afterwards, a person is displayed in an inventory screen and equipped with different outfits. Then, a text screen with the green rectangle and black background lists more numbers and descriptive text. After this, a person is shown at a merchant's stall, where several items are offered. Then the video displays another list with numbers and descriptive text before transitioning to a person flying over a green field with multiple enemies, followed by yet another list of numbers and text. The video then shows a person in a blue environment, opening a chest and obtaining a lightscale trident. The person enters the inventory screen to equip it before another list appears. After this, a person flies over a town before landing in the middle of it and running towards a building. Next, a list appears, followed by two puppet figures standing near a computer. The scene then switches to one puppet looking away from the computer, and then to a person near a dog and a horse in a wooded area near a shop. Then the view shows another list of numbers and text before the scene transitions to a person rolling a snowball. Next, the snowball rolls into some water, revealing a group of wooden figures standing in the water. The video then moves to another list of numbers and text. Following that is an aerial view of a person flying over a green environment. The view then shifts to a person standing before a large monster with a flaming sword. The person hits the monster with purple magic, and then takes items off of it after it collapses. This is followed by another list of numbers and text. Then the view transitions to a horse and a person in an open area and text boxes appear that ask the person if they want to change the horse's mane, the saddle, or if they are done. After the person selects the \"change the bridle\" option, a list of possible bridles appear. Then a new list of numbers and text appear, followed by a view of multiple photographs in an album, and then a list of numbers and text. The next scene is of a person standing before a monk like figure, with the phrase \"A gift from the monks\" appearing on the screen. After that, a view of a person obtaining \"cap of the wild\" appears, followed by another list of numbers and text before the video transitions to a person fighting a large monster in a dark and fiery environment.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "How many game tasks does the man in the video introduce in total?",
+ "answer": "D",
+ "candidates": [
+ "A. Thirteen.",
+ "B. Seventeen.",
+ "C. Fifteen.",
+ "D. Twelve."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the game character perform the \"snowball push\" action?",
+ "answer": "B",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. In the latter half of the video.",
+ "C. At the beginning of the video.",
+ "D. Throughout the entire video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What is the purpose of the player completing the tasks at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Obtain the monster mask.",
+ "B. Collect clothing.",
+ "C. Capture memories.",
+ "D. Defeat Ganon."
+ ]
+ }
+ },
+ "pePKevuL": {
+ "video_id": "pePKevuL",
+ "video_duration": "35s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "Greyhounds wearing different colored racing jackets are in starting boxes. The gates open, and the dogs burst onto the track. They race down the straight, with dog number eight in a pink jacket taking an early lead. The dogs bunch together as they approach the first turn. Number eight maintains its lead, followed closely by the dog in the teal jacket. The dogs continue around the bend, with number eight still ahead. As they come out of the turn and onto the back straight, dog number eight extends its lead. The camera pans to follow the dogs down the back straight. Dog number eight is well in front. The other dogs follow, vying for the minor placings. The dogs approach the final turn, and number eight rounds it smoothly, maintaining its lead. They enter the home straight with number eight a clear winner. The camera zooms out to a wider shot of the track as the dogs cross the finish line. The dogs slow to a walk after finishing the race. A man approaches them. The camera focuses on the track scoreboard.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the dog wearing red stripes first achieve first place?",
+ "answer": "B",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the beginning of the video.",
+ "C. At the end of the video.",
+ "D. Throughout the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "At the end of the video, what did the person wearing dark clothes do to the dog on the field?",
+ "answer": "D",
+ "candidates": [
+ "A. The person uses food to entice the dog away from the field.",
+ "B. The person uses ropes to bind and lead the dog away.",
+ "C. The person uses hand gestures to direct the dog into a specific area.",
+ "D. The person uses fences to enclose the dog on the field."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "At the beginning of the video, what is the position of the dog wearing red stripes relative to the dog wearing black clothing?",
+ "answer": "B",
+ "candidates": [
+ "A. Behind the dog wearing black clothing.",
+ "B. In front of the dog wearing black clothing.",
+ "C. To the left of the dog wearing black clothing.",
+ "D. To the right of the dog wearing black clothing."
+ ]
+ }
+ },
+ "eFmgqjpo": {
+ "video_id": "eFmgqjpo",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A cursor moves through a Google search bar and types \"robl\" before several suggestions appear below. The cursor then clicks the second suggestion, \"Roblox - web.roblox.com\". A Google page appears, followed by a Roblox page, and finally a Roblox profile page. A profile with the username \"SubToXLGamerBros\" is displayed, and then the cursor moves to the text \"SubToXLGamerBros\", and then to the URL address bar, highlighting a portion of the text. The cursor returns to the search bar at the top, clicks \"Games,\" then types \"jailb\" in the search bar. The cursor clicks \"Search 'jailb' in games\", and a page of game results appears. The cursor then moves over the first result, called \"Jailbreak,\" and clicks the game. The cursor then moves to and clicks \"Servers,\" and then the other servers menu appears. The cursor highlights and then types the numbers \"1583533203\", and moves to and clicks the \"Userid\" box, which turns into \"Username\". The \"Username\" box is clicked, then the cursor moves to and clicks \"Subscribe\". The Subscribe button turns gray and says \"Subscribed\" and a notification bell icon is then clicked.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many components make up the 3D graphic shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Six.",
+ "B. Eight.",
+ "C. Ten.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of Twitter relative to Discord in the middle of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Twitter is below Discord.",
+ "B. Twitter is to the right of Discord.",
+ "C. Twitter is to the left of Discord.",
+ "D. Twitter is above Discord."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How many digits does the user ID searched for in the video have?",
+ "answer": "A",
+ "candidates": [
+ "A. Ten digits.",
+ "B. Eleven digits.",
+ "C. Eight digits.",
+ "D. Nine digits."
+ ]
+ }
+ },
+ "DgRjZadN": {
+ "video_id": "DgRjZadN",
+ "video_duration": "241s",
+ "duration": "4-6min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a white screen. Then there is a scene showing a town viewed from above. The town contains various buildings, walls and other structures, and a timer on screen counting down. Below the town is an area with several buttons showing images of characters, a shield, and vials of liquid. The screen alternates between the town scene and a white screen. Then it shows another town with a slightly different layout. The timer is counting down from 29 seconds. A message says, \u201cTap or press and hold to deploy troops\u201d. A series of buildings are visible in the town. After that, large creatures are deployed and walk through the town destroying many of the structures. The battle continues. Structures are broken, there are small explosions of fire and dust, and the timer counts down. Eventually, the battle is won. There are two star shapes above the town and the words \"Victory\". Some text appears showing loot gained, trophies and troops expended. The screen transitions to white and then shows another town, similar in structure. Then a screen appears showing a cartoon of a green troll like character wearing a pirate hat, an eye patch, and a necklace with a skull, with other text that says \"Subscribe\", \"Free Gems\" and \"Free my Apps\". It also includes an invite code of \"Godson.\" This screen is visible for several seconds and the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many cannons are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Six.",
+ "C. Five.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is encased in the sphere of light in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A bomb.",
+ "B. A laser tower.",
+ "C. A cannon.",
+ "D. A bow tower."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the person wearing gray and the person wearing red in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Superior and subordinate.",
+ "B. Adversaries.",
+ "C. Partners.",
+ "D. Teammates."
+ ]
+ }
+ },
+ "NurouABK": {
+ "video_id": "NurouABK",
+ "video_duration": "31s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a man sitting in a chair and gesturing with his hands while speaking. He is wearing a light-colored t-shirt with an American flag design on it. He has grey hair and light-colored skin. Behind him is a computer monitor displaying an image of a green, hilly landscape. A desk is visible on his right with a computer speaker, some papers and a few other objects. A tall, brown dresser is on his left and some toys are visible on a shelf beneath the desk. There is a window to the right of the monitor and a piece of art hangs on the wall behind it. There are intermittent shots of a person standing on a rock formation with a mountain in the background; of a person standing near a blue tarp and tent on a cliff near a beach; of a person biking on a path through a forest; of a person biking in a desert-like area; of a person standing on a rocky outcropping with a city in the distance; and of a person riding a bicycle on a paved surface. There are also two shots of a body of water with canyon walls, one with red text superimposed and one with white text superimposed. There are a couple of shots that feature the man talking in a chair with an Instagram logo and the text \"@DUZER\" superimposed.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the wooden cabinet relative to the man in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. To the left rear.",
+ "B. To the right front.",
+ "C. To the left front.",
+ "D. To the right rear."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What profession might the person wearing orange in the video have?",
+ "answer": "A",
+ "candidates": [
+ "A. A sanitation worker.",
+ "B. A construction worker.",
+ "C. A traffic police officer.",
+ "D. A repairman."
+ ]
+ }
+ },
+ "gYRMdxDT": {
+ "video_id": "gYRMdxDT",
+ "video_duration": "136s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A white bowl contains thin, julienned, light yellow pieces, likely onions, coated in a slightly darker yellow mixture of spices and flour. A hand briefly mixes the ingredients within the bowl with a spoon. The scene changes to a black, cast-iron skillet containing hot oil. A small clump of the onion mixture from the bowl is placed into the oil, where it sizzles and begins to fry. A hand appears and begins to form small balls of the onion mixture from the larger bowl. Several more clumps of onions are placed into the skillet, which gradually becomes more full. The onion balls gradually become more golden-brown as they fry. The scene again shifts to the white bowl where, this time, a person in white chef\u2019s attire is mixing and forming the onion mixture into balls using both hands. The camera returns to the skillet, where several more onion balls are being fried, becoming increasingly brown. More onion balls are dropped into the skillet by the chef. Finally, some of the cooked onion balls are removed from the skillet using a slotted, metal spatula and placed into a metal colander to drain. Excess oil drips back into the skillet. The frying process continues. More onion balls are formed and dropped into the hot oil, while others continue to cook to a golden-brown color.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is used to coat the surface of the onion in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Bean flour.",
+ "B. Cornstarch.",
+ "C. Sweet potato starch.",
+ "D. Flour."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many onion pancakes were fried in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Fourteen.",
+ "B. Fifteen.",
+ "C. Thirteen.",
+ "D. Twelve."
+ ]
+ }
+ },
+ "UMQwinaI": {
+ "video_id": "UMQwinaI",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a black square on a white background that says \"TCR\", \"THROWING CHAIN REACTION\", and \"THROWS MECHANICS & TRAINING SYSTEM\" in white text. A man wearing a black shirt, grey pants, and a black hat stands inside of a concrete circle surrounded by a grey track and green grass. Next to the man is a black square with the white text \"FREE EXCLUSIVE\", \"TCR MINI COURSE\", \"DISCUS THROW\", \"& SHOT PUT\", and \"GRAB YOURS TODAY!\" on it. This black box is partially covered by a moving bright yellow box. The man begins moving his arms and a white ring appears under his foot and is then moved off-screen by his foot. The black box is replaced by a black square with a white silhouette of a person throwing a disc with \"ARETE\", \"THROWS\", \"NATION\", and \"TM\" below it in white text. This is then replaced by a black screen with the same silhouette and white text. The camera shows a close-up of a man with a beard, dark sunglasses, and a dark colored hat who speaks to the camera. The scene changes to a man throwing a discus inside of a concrete ring on a track. There is a person laying down in the background and another person further away. The camera returns to the close-up of a man speaking to the camera. A four-way split-screen appears showing a collage of people throwing a discus in various settings with the text \"ARETE THROWS NATION PRESENTS\" in green text. A video plays of a person throwing a discus with text that reads \"T.C.R.\", \"THROWING\", \"CHAIN\", \"REACTION\", \"SYSTEM\" and \"THROWS\", \"ANALYSIS\" over the video in yellow and green. This is followed by another man throwing a discus with text that reads \"THROWS TECHNIQUE\" and \"THROWS ANALYSIS\" in yellow and green. A split-screen of different individuals throwing a discus, text that reads \"LOVETHEGRIND\", and then two clips of men throwing a discus. The text \"TCR\" is shown with an orange background. Two white logos are placed on a red background, \"ARETE THROWS NATION\" on the left, and \"TCR THROWING CHAIN REACTION SYSTEM\", \"THROWS MECHANICS & TRAINING SYSTEM\" on the right. A man throws a discus in a concrete ring with the words \"TODAY'S EPI\", \"HOW TO USE HEAVY BAR FOR DISCUS THROW TRAINING\" in yellow and white text in a black box over the video. Two men stand in a concrete circle. One man is wearing a long-sleeved blue shirt, black shorts, and a blue baseball cap. The other man is wearing a short-sleeved pink shirt, black shorts and a black baseball cap. A yellow animation of a figure throwing a discus is placed over the second man. The two men continue to stand in a concrete circle, the animation disappears, and the man in the pink shirt begins speaking. The logo \"ARETE THROWS NATION\" appears in the bottom-left corner with a small rectangle with a black background that displays \"HEAD COACH | ERIK JOHNSON\" and \"ARETETHROWSNATION.COM\" in white text. The text \"HEAVY BAR TRAINING\" is placed in white text between the two men, along with the text described previously. A man holding a long, red bar appears next to a second man speaking. This continues with different shots of people throwing a discus, the two men standing in a circle, and text that says \"OFF SEASON\". A man wearing a white shirt with a dark text design and a straw hat is shown taking video of a man throwing a discus. The video ends with a white hand clicking the subscribe button.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many rotations did the man with blue sleeves and a black shirt make before throwing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three and a half rotations.",
+ "B. Two and a half rotations.",
+ "C. Four and a half rotations.",
+ "D. One and a half rotations."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the man in the red shirt and black pants relative to the man in the black clothes with blue sleeves in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. In the back-right direction.",
+ "B. In the front-left direction.",
+ "C. In the front-right direction.",
+ "D. In the back-left direction."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What are the man in the dark blue shirt and the man in the pink shirt going to do in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Start doing rotation exercises.",
+ "B. Start practicing projectile throws.",
+ "C. Start waist training.",
+ "D. Start strength training."
+ ]
+ }
+ },
+ "FPBRVwEr": {
+ "video_id": "FPBRVwEr",
+ "video_duration": "86s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Chemistry",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Three people stand outdoors in front of trees. The person in the middle is wearing a yellow shirt and dark pants, and is moving their arms. The person on the left is wearing a blue button-down shirt and light pants, and the person on the right is wearing a dark shirt and light blue pants. Then, text appears on a black background with a white glow around the edge, which says \"Kinetic Molecular Theory\". The scene switches to an indoor location where five people are standing. One person is standing to the side, holding a book, wearing a yellow shirt and dark pants. The other four people begin a series of actions, such as bowing, spinning, and walking around, while the person holding the book remains stationary. Finally, the people walk out of the room into a hallway. The person holding the book walks into the hallway, closes a door, and the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Where is the woman wearing a blue and white striped shirt sitting in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A cabinet.",
+ "B. A table.",
+ "C. A red sofa.",
+ "D. A white sofa."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many cabinets are shown in the room in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 9.",
+ "B. 8.",
+ "C. 7.",
+ "D. 6."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In which direction was the man in the blue shirt when the man in the brown shirt bumped into the red sofa?",
+ "answer": "B",
+ "candidates": [
+ "A. In front.",
+ "B. To the right.",
+ "C. To the left.",
+ "D. Behind."
+ ]
+ }
+ },
+ "IiJODLGV": {
+ "video_id": "IiJODLGV",
+ "video_duration": "111s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black baseball cap, olive green jacket with a fur-lined collar, a black shirt, and orange plaid pants gestures towards the camera in a parking garage. They turn and walk toward a gray BMW i8 with butterfly doors open, revealing tan leather seats. The person continues to walk past rows of parked cars, including a gold BMW X2. A second person with glasses and a black baseball cap, wearing a black t-shirt with white letters, approaches the first person. The first person returns to the BMW i8 and gestures excitedly. The second person walks over to the car, and a third person wearing a tie-dye Rugrats sweatshirt and neon green shorts approaches the car and sits in the driver's seat. The second person stands beside the open door, while the first person looks on. The third person sits in the driver's seat and fiddles with something in her hand. The second person leaves the frame. The third person gets out of the BMW and walks around it. The second person reappears and watches the first person get into the driver's seat. The first person sits in the driver's seat. The first person sits in the driver's seat and plays with the car's controls. The third person attempts to get back into the car but struggles due to the butterfly doors. The first person gets out of the car and the two hug. They stand talking briefly before the first person begins to walk away. The first person turns and dances. They hug again, and the first person gets back into the driver's seat of the BMW.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "How many girls are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Five.",
+ "B. One.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "tBMdPGhO": {
+ "video_id": "tBMdPGhO",
+ "video_duration": "230s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a red text overlay of \"16+\" and the word \"Xdik\" below it on a dark background with a blurred static effect, along with firework graphics. There are images of 2 boxes, with white material in both and hand written numbers 350 and 100 on the boxes. There is also a clear plastic cup to the side with the number 2 hand written on it. The next scene shows a cartoon dog holding a white piece of paper and a golden Christmas tree to the right. Then the scene changes to both the dog and cat each holding a white paper, with the Christmas tree in between them. The cat raises its arms with the paper in its hand. Another scene shows a silver ring in a black box, followed by a fireworks graphic with the same text as the beginning. The scene changes to the dog and cat holding papers with the Christmas tree, and then to the cat raising its arms with the paper in its hand. This is followed by another fireworks graphic with the text. There are various scenes including snowmen, some made of snow and others made of cabbages, a person in an outfit resembling a Christmas tree in a store, a graphic of 2019 with a cartoon dog and cat at a table in the foreground, a raw piece of meat shaped like a fish on a wooden background, and two cartoon aliens staring. The video ends with a series of fireworks with text overlays of \"16+\", the year 2020 and a mouse wearing a small red Santa hat, and again, more fireworks with the \"16+\" text.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many meme images were inserted into the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Six.",
+ "B. Eight.",
+ "C. Thirteen.",
+ "D. Ten."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does Tomcat's mood change after reading each manuscript?",
+ "answer": "B",
+ "candidates": [
+ "A. Sorrowful and frowning.",
+ "B. Excited and thrilled.",
+ "C. Scratching his head in confusion.",
+ "D. Wide-eyed and astonished."
+ ]
+ }
+ },
+ "paCxaaXR": {
+ "video_id": "paCxaaXR",
+ "video_duration": "39s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Fashion",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man wearing a dark blue denim jacket over a white and black horizontally striped sweater and a light blue collared shirt stands against a solid gray background. He holds up a gray baseball cap and then a yellow baseball cap, sometimes one at a time, sometimes both simultaneously. He crumples the yellow cap in his hand. He folds and unfolds the bill of the gray cap. He holds both caps, crumpled, in his hands. He unfolds the gray cap and then puts it on his head. He adjusts the cap, pulling it down slightly. He puts his hands in his pockets. He takes his right hand out of his pocket and adjusts his sleeve. He crosses his arms. He holds his hands out with palms up. He runs his hand through his hair. He takes the gray cap off and crumples it. He places both crumpled caps on his left palm and then shakes them as if weighing them. He shrugs his shoulders and smiles. He points with his right index finger towards the right side of the screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the man wearing a dark blue denim jacket put on a hat?",
+ "answer": "D",
+ "candidates": [
+ "A. The first half of the video.",
+ "B. The middle of the video.",
+ "C. The entire video.",
+ "D. The second half of the video."
+ ]
+ }
+ },
+ "PRgxbMeQ": {
+ "video_id": "PRgxbMeQ",
+ "video_duration": "35s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A dark screen shows white text in Spanish which translates to \"It is February 13, 1986\". The scene changes to show a man in a grey jacket and tie sitting at a table. The same Spanish text is displayed. The man says \"I am agent Bill Hagmaier\". The scene changes to show another man wearing a grey shirt with handcuffs on, who has his hands clasped in front of him. The man in the grey jacket says \"and I am with Theodore...\". The handcuffed man then says \"No, Ted\". The grey jacket man says \"I am with Ted Bundy\". The scene changes back to the grey jacket man. The scene changes again to show 3 men holding a sign that says \"BURN BUNDY\". The text on the screen reads \"Ted Bundy has 13 hours left to live\". The scene changes to a television screen showing an empty execution chair. The text from the previous scene continues. The scene changes to show 2 men and a woman in what appears to be an enclosed room. The woman is wearing a pink top, one of the men is wearing a grey top, and the other is wearing a tan top. The text on the screen continues. The scene changes to a closer shot of the man in the tan top. The same text continues. The scene changes again to show a room with several people, one person is holding a camera. The man in the yellow shirt says \"He's ready to talk\". The text continues \"He is ready to confess everything but only to one person\". A man wearing a suit walks into view and says the same line as before. The scene changes to a man in a suit, sitting at a desk with a picture of the white house behind him. The scene changes to a person with an orange shirt behind a window with the text \"There are families waiting for answers\". The man in the orange shirt has his arms behind his head. A man in a suit walks down a hallway with the text \"The world needs to know why he killed those girls\". The same text is displayed as a woman in a purple dress walks into a house. 
The scene changes to the man in a suit sitting at a desk again with the text \"You're a rising star and I'm your next big case\". The scene changes to the man in the yellow shirt with the text repeating from the previous scene. The scene changes to the man in the grey jacket with the text \"You will be the one who made me talk\". The screen goes black. The man in the grey shirt appears and is rubbing his mouth with the text \"There are many myths and misunderstandings about me\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the man in the video being handcuffed?",
+ "answer": "D",
+ "candidates": [
+ "A. He stole an item.",
+ "B. He engaged in a fight.",
+ "C. He defrauded others of money.",
+ "D. He murdered a girl."
+ ]
+ }
+ },
+ "pUNdafJc": {
+ "video_id": "pUNdafJc",
+ "video_duration": "143s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video starts with a still image of a polaroid, showing a group of six cartoon people with varied skin tones. Two red hearts, a yellow infinity symbol, and a white frame are also visible. The background is green with a repeating leaf pattern. This is followed by a yellow oval shape with the infinity symbol and then the number 4. The words \"We're Not Trying To Make You Uneasy\" then appear over the oval. A purple question mark on a pale green background comes next. This is followed by an image of a cartoon person in a blue top holding the purple question mark and is then grabbed by another cartoon person with an infinity symbol over their head. Then two cartoon people are shown sitting on a bench, one with a green leaf and one with an infinity symbol on their heads. The second character is shown having thoughts in a cloud above their head. Then a text bubble appears with the words \"THEY'RE TALKING!\" A line is shown and then the words \"Acceptable Distance\" is displayed above. Then text appears saying \"EYE CONTACT!\", followed by \"BREAK EYE CONTACT!\". Next a cartoon character with a green leaf on their head appears looking surprised. This is followed by a musical symbol and sound waves next to the second character. Text appears saying \"THEY'RE SURPRISED!\" and then \"Oh, now THEY LOOK DIFFERENT\". The character with the infinity symbol smiles and text appears saying \"TIME TO REACT!\" and then \"AM I SMILING?\". The second cartoon character appears with text over them that says \"ARE THEY STILL TALKING?\", \"SHOULD I BLINK?\", \"DO I LOOK LIKE I'M LISTENING?\", \"TIME TO NOD\", \"THEY FROWNED\". This text then changes to \"SHOCKED\", \"WIDE YOUR EYES\", \"LAUGH AT THE JOKE\", \"THEY FROWNED AGAIN\" and then \"BREAK EYE CONTACT!\". All these words are written in dark red and overlapping each other. Then the text \"TOO FAR AWAY?\", \"TOO CLOSE? AGAIN!\" appears in dark purple and is also overlapping. 
The text then shifts to \"BREEZY\", \"EYES WIDENED...\", \"MOUTH IS OPENED...\", \"THEY ARE SHOCKED!\" all in the dark purple font. Then we return to the character sitting on the bench again as the text \"HOW DO I RESPOND?\" appears. The second cartoon person is shown again and text appears \"DO YOU HAVE A SCRIPT?\" The final text reads \"TOO CLOSE\" and \"ADJUST?\" The video ends with the two characters sitting on the bench.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the white cartoon character raise its hand in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. One time.",
+ "B. Two times.",
+ "C. Four times.",
+ "D. Three times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the man in the upper-left corner of the photo wearing at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Green.",
+ "B. White.",
+ "C. Red.",
+ "D. Blue."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotional change of the yellow cartoon character after pulling the white cartoon character's hand in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. From joy to embarrassment.",
+ "B. From happiness to sadness.",
+ "C. From sadness to happiness.",
+ "D. From embarrassment to joy."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "In the latter part of the video, what does the yellow cartoon character do to the white cartoon character?",
+ "answer": "A",
+ "candidates": [
+ "A. Pulls the hand of the white cartoon character.",
+ "B. Hugs the white cartoon character.",
+ "C. Touches the head of the white cartoon character.",
+ "D. Lifts the foot of the white cartoon character."
+ ]
+ }
+ },
+ "aSVLVenX": {
+ "video_id": "aSVLVenX",
+ "video_duration": "197s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a grey zip-up hoodie and black shorts with white stripes is carrying a folded blue tarp on a green field with white lines. They hold the tarp up in the air then let it fall to the ground. They take a roll of white tape, and while kneeling, put the tape on the edges of the blue tarp. They then stand and pour water from two clear bottles onto the tarp and walk around it. A different person throws a white container into the air, then juggles an orange, then pours milk from a white container over the tarp. They then pour a blue liquid from a clear bottle on to the tarp. A different person throws a soccer ball, then juggles a soccer ball and carries a soccer ball. They move the soccer ball back and forth, then the person is seen again in the video. Two people are seen passing a soccer ball. One person is dressed in a grey zip-up hoodie with black shorts, with a pair of soccer gloves. The other person is wearing a white long-sleeved shirt with blue shorts. They pass the soccer ball back and forth. Then one person attempts to kick the soccer ball into the soccer net while the other person is the goalie, jumping to attempt to block the ball.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the man in gray clothing doing on the cloth by pouring white liquid onto it?",
+ "answer": "A",
+ "candidates": [
+ "A. Securing the cloth in place.",
+ "B. Cleaning the cloth.",
+ "C. Lubricating the surface.",
+ "D. Creating a non-slip surface."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur in the background music of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. It becomes quieter and quieter.",
+ "B. It fluctuates between loud and quiet.",
+ "C. It becomes quieter when someone speaks.",
+ "D. It becomes louder and louder."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the football that the man wearing a gray shirt is holding in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Black, white, blue, and gray.",
+ "B. Black and white.",
+ "C. Black, red, and white.",
+ "D. Black, blue, brown, and white."
+ ]
+ }
+ },
+ "HUOgTYtW": {
+ "video_id": "HUOgTYtW",
+ "video_duration": "215s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Book Reviews",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a person with a beard, wearing glasses, a tweed jacket, a grey hat, and a bowtie. The person is sitting in front of a wall decorated with posters, drawings, newspaper clippings and a shelf of toys. The camera focuses on this person as they speak, gesturing with their hands. At one point a graphic with the word \u201csavvy\u201d appears on screen. This is followed by more of the person speaking as well as additional graphics appearing on screen including \u201ccrazy adventure,\u201d \u201cbus,\u201d \u201cinsecure bus driver,\u201d and \u201cmom issues.\u201d The person holds up a small red object, and then places their hands on their temples and winces. The camera focuses on a bright pink snow globe containing a skull, followed by the person again with a distressed look on their face. Then the camera focuses on a slice of white bread, which becomes part of the scene in front of the person. The person again looks distressed and makes an \u201co\u201d shape with their mouth, and another graphic appears that says \u201cHOBBLEDEHOY.\u201d The scene transitions to a different person sitting on a white sofa in a library. This person holds a book, looks around, and speaks as well. The scene goes back to the first person, who is seen speaking again, followed by a quick scene of the second person speaking again. This is followed by a view of a library shelf and sofa, then back to the first person speaking. Then there is a view of a library shelf and sofa, and then back to the first person again. The scene then focuses on a pink globe and then back to the first person rubbing their temples. After this, there\u2019s a close up of the pink globe, and then another close up of the person\u2019s face as he winces. Next, there\u2019s a close up of a slice of white bread. 
Then the camera zooms in on the face of the first person as he is talking and makes a mouth \u201co.\u201d The scene transitions to a black and white image of three men, followed by the words \"@monsieurwaskeau production.\" The camera then focuses on the first person again, who is now sitting on a white sofa, and is holding a slice of white bread. The first person is then again, speaking with more graphics flashing on screen. Next, there is a close up of some library shelves and the scene switches to a person in front of a screen, talking. Next, the camera focuses on a slice of bread falling on a sofa, then another slice of bread falling onto the floor, and then transitions back to the first person dancing, rubbing their temples, and wincing. Then, there's a close up of a line of book spines, then a close up of multiple slices of bread. Following this, there's a close up of a single slice of bread on a white surface. Then the camera switches to the first person again, who is looking distressed. This is followed by another image of a slice of bread, and then back to the first person who grabs the bread and puts it in his mouth. The first person is then seen dancing with the bread in their hand and putting it to their mouth, and then looking around. The video then cuts to a purple animated pony, and then a slice of bread next to an eye ball toy, followed by the first person wincing and dancing. Next there\u2019s a shot of a hanging plush toy, followed by bread on a wall, followed by the first person again, dancing, then a view of a female doll next to some slices of bread, followed by a close up of slices of bread on a table, followed by the first person dancing with the bread on his head. 
After this, there's a view of a black CGI character, followed by an empty room, then a slice of bread on a wooden floor, followed by the first person dancing and then a close up of a library bookshelf, and then multiple slices of bread on a splattered background, followed by a single slice of bread on a white surface, followed by the first person speaking in front of a wall of drawings, followed by another single slice of bread on a white surface, followed by the first person speaking, and then finally the person is seen holding bread, and then the video goes to black.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the man wearing glasses and the man in blue in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Friends.",
+ "B. Father and son.",
+ "C. Colleagues.",
+ "D. Strangers."
+ ]
+ }
+ },
+ "euqCCnjV": {
+ "video_id": "euqCCnjV",
+ "video_duration": "210s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a pale pink background. The title \u201cJenn\u2019s Journey\u201d in a cursive, light brown font appears in the center, along with a small heart to the right of the title. The title fades and blurred pastel-colored circles appear. The words \u201c5-ingredient Pulled Pork\u201d in a purple font appear in the center of the screen. A piece of raw pork in a plastic wrapper sits on a white plate. Two cans of Swanson Chicken Broth, small jars of cumin and garlic powder, and a container of sea salt are displayed next to it. The camera zooms in to show the label on the pork package, which reads \u201cSmithfield Bone-In Pork Shoulder Picnic Roast.\u201d The camera then zooms in on each of the other ingredients, focusing on the labels. The pork is then placed in a slow cooker lined with a plastic slow cooker liner. Garlic powder, cumin, chicken broth, and sea salt are added to the slow cooker. The slow cooker lid is placed on top. The slow cooker is a stainless steel Crock-Pot, and is programmed for 4 hours on high.Later, the lid is removed to reveal the cooked pork. Steam rises from the slow cooker. The pork roast is removed from the slow cooker and placed on a pink cutting board. A silver aluminum pan sits beside the cutting board. A small pairing knife with a pink and white floral design is used to cut the cooked pork. Some of the outside portion is cut off and placed into the aluminum pan. The rest of the pork is sliced and pulled apart. The pulled pork is placed in the aluminum pan.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "How long was the meat cooked at each heat setting in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three hours on high heat, then three hours on low heat.",
+ "B. Four hours on high heat, then three hours on low heat.",
+ "C. Five hours on high heat, then three hours on low heat.",
+ "D. Six hours on high heat, then three hours on low heat."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, as the woman displays the food ingredients and seasonings, from which direction does a person pass by her?",
+ "answer": "B",
+ "candidates": [
+ "A. To her right.",
+ "B. On her left.",
+ "C. On her right.",
+ "D. In front of her."
+ ]
+ }
+ },
+ "QCUTogGm": {
+ "video_id": "QCUTogGm",
+ "video_duration": "43s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a person standing on the edge of a mountain, followed by an aerial view of a large bridge over water, then a man holding a plate with food sitting next to a drainage ditch, followed by an aerial view of a beach with palm trees. A train on a bridge is shown, then a waterfall, followed by a boat in a wide river, then two men in a small boat in the water using poles. An aerial view of a city is shown. A woman and a man wade into water at a waterfall. A train is shown pulling into a station. An uncompleted building is shown. Traffic on city streets is shown. A man stands on a bus, followed by an aerial view of a city. Another beach is shown. A man points on a city street. He is shown getting into a van. A man does pushups on a rocky surface, then three men are shown together. The video concludes with a graphic of a handprint with a black Africa continent shape in the center.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video is the man seen sitting by the roadside eating?",
+ "answer": "C",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. In the middle of the video.",
+ "C. At the beginning of the video.",
+ "D. Throughout the video."
+ ]
+ }
+ },
+ "uYczFocm": {
+ "video_id": "uYczFocm",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A view of the London skyline showcases The Shard and Tower Bridge. A figure in a green suit falls from a cliff engulfed in a cloud of dust and smoke. A figure, seemingly Doctor Strange, flies through the dusty air using green magical energy. The same green-suited figure soars past a building, creating a dust cloud upon impact. Doctor Strange, amidst green energy, battles the figure. Inside a dimly lit pub, two men sit at the bar. On a bridge at night, a young man and woman converse. The London Bridge is ablaze, collapsing amidst explosions and billowing smoke. A man in a metallic suit looks concerned. The same man, now with a clear helmet and emitting green energy, floats in the air. People watch in horror as a fiery elemental creature ravages London Bridge. A double-decker bus explodes violently. People flee from the scene. A man in a golden suit fights against a green force. A green vortex of energy swirls. A young man in a car removes his sunglasses in shock. A devastated town is seen from above. Spider-Man swings and lands on the spire of a damaged building. Spider-Man crashes into a tiled roof amidst sparks. A group of people run away from explosions and debris. A young man looks at his hand, which glows with blue and pink energy. A man in a suit walks through an opulent airplane cabin. The helmeted man unleashes green energy near a Ferris wheel. The screen goes black. A man with an eye patch addresses a figure in a black suit. A mural of Iron Man, adorned with candles and flowers, is seen on a wall. A fiery elemental emerges, causing explosions. A figure leaps between rooftops as a fire rages below. The figure falls from a rooftop as a large fire engulfs it. Red letters appear on the screen, followed by the title card: \"Spider-Man: Far From Home.\" A futuristic aircraft flies over London. The same aircraft soars over Tower Bridge. A group of teenagers and an adult stand on a bridge, observing an explosion in the background. 
The teenagers and the adult yell in reaction to another, larger explosion. The final shot is a promotional image for \"Spider-Man: Far From Home,\" featuring Spider-Man in a new suit.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "In the last scene of the video, what is the relationship between the man in the black suit and Peter?",
+ "answer": "C",
+ "candidates": [
+ "A. They are teacher and student.",
+ "B. They are father and son.",
+ "C. They are colleagues.",
+ "D. They are friends."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "In the last scene of the video, why does the man in the black suit instruct the four people in front of him to board the plane?",
+ "answer": "B",
+ "candidates": [
+ "A. Their luggage is missing.",
+ "B. They are in danger.",
+ "C. They are unable to walk.",
+ "D. They are late for their class."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "At the beginning of the video, what type of elemental monster is the person wearing a red cloak fighting?",
+ "answer": "B",
+ "candidates": [
+ "A. Wood.",
+ "B. Water.",
+ "C. Earth.",
+ "D. Fire."
+ ]
+ }
+ },
+ "TfRFcrax": {
+ "video_id": "TfRFcrax",
+ "video_duration": "232s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two men sit in front of microphones and laptops, watching and reacting to an animated video. The video depicts an orange cartoon monster with horns and blue eyes, holding a phone to its ear. The monster is standing on a city street, in front of a building and a dumpster. The scene cuts to two animated men, also on their phones. One wears a white tank top, and the other a black shirt with a Superman logo, and a baseball cap. The man in the tank top holds a knife, then puts it away. They continue their phone conversation. A puff of smoke transitions the scene back to the first animated monster. It is now talking to the man in the tank top on the phone. The man pulls out two pistols. The video cuts back and forth between the two characters as their phone conversation continues, with their words displayed as subtitles. The man in the tank top holsters the pistols. As the cartoon continues, the two men watching the video react to the dialogue with laughter and facial expressions. The two cartoon characters hug as they finish their conversation. The cartoon monster reappears, this time talking to the man in the tank top.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes does the man wearing a hat exhibit in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The man's emotions shift from calmness to confusion, then to anger.",
+ "B. The man's emotions shift from relief to anger, then to confusion.",
+ "C. The man's emotions shift from confusion to anger, then to relief.",
+ "D. The man's emotions shift from anger to confusion, then to relief."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotion does the man in the tank top display when the man in the hat first mentions wanting to call Shante?",
+ "answer": "A",
+ "candidates": [
+ "A. A confused emotion.",
+ "B. An anxious emotion.",
+ "C. A fearful emotion.",
+ "D. A joyful emotion."
+ ]
+ }
+ },
+ "SGNYLwwj": {
+ "video_id": "SGNYLwwj",
+ "video_duration": "280s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with a short beard, wearing a white t-shirt with a rose emblem, eats a burger. A digital orange outline of a laptop appears on a black background. Then, a digital orange play button with a magnifying glass appears on a black background. Text appears on a black background saying \"DaddyBhriss\", followed by \"subscribe pls\" in a smaller font. The background switches to a blue color with the same text. The background switches to a black background. Two men stand inside of an elevator, one wearing a white shirt and a black face mask, and the other wearing a white shirt and a black backpack. Text appears saying \"aint opening\" followed by \"its only tonight ;)\". The video cuts to a hallway, where a group of people are present. One person in a white top makes two gestures with their middle fingers. A person wearing glasses and a backpack stands in the hallway. Several people are visible in the hallway. The video cuts to a \"A Few Moments Later\" caption on a red background with pineapples. A woman wearing a white sweater and pajamas appears, along with several other people in the hallway, who are gesturing. The person wearing glasses and a backpack is in the foreground, and the other people are behind them in a hallway. The video switches to a shot of the hallway from the person with the glasses and the backpack, who is speaking. One person says \"follow me on instagram\" and a handle @BhrissYT, followed by logos of Instagram, Twitter, TikTok, and Twitch. The video switches to a white wall with a drawing of a raised fist on it. Three people are standing near the wall. The video changes to a black and white view. The person wearing glasses and a backpack faces the camera. He has his hand on the side of his head. A title appears that says \"were besties ;)\" with several people standing in the hallway behind the person wearing glasses and a backpack. Another title appears that says \"hmmmm\" followed by \"idk about that\". 
The video switches to a black and white scene with the man with glasses and backpack, and a man in a white shirt and black face mask. The first man puts his hands on his head, and then a person in a green shirt is visible in a doorway with lights decorating the doorway, along with a sign reading \"Maria\". The person in the green shirt says something, and the person with the glasses says something in response. The view changes to a person with dark hair wearing a white and blue shirt and the same man with glasses, and then the man with the glasses and backpack knocks on a door. A person in a black shirt appears behind the door and then says \"oh hi\". A title that says \"good\" appears. The man with glasses and a backpack high-fives the person in black. A title appears that says \"lets go\". The person with the black mask enters the room with the man with glasses and backpack, and another person in black. Text appears that says \"i dont know doing saw him room chance was over but this is what happened\". The person with glasses and backpack then stands in front of the door, and then turns to the camera to say \"wait your the wait me about a\". A text box appears that says \"Delivered \u2022 3w\". A blue cap emoji appears, followed by the text \"gonna answer\". The person in glasses and backpack then turns to the camera and says something. The man with glasses and backpack then says \"so a few later he wanted to and i was like ayoooo tryina do but then some weird noises what I heard...\"The man with glasses and backpack looks surprised, and then a purple eggplant emoji appears, and the man laughs. A clock appears on a black background that says \"5 minutes later...\" A man with glasses and backpack stands in a hallway. Another man appears, and the video shows him and a woman with blonde hair in the doorway. The man with glasses and backpack turns towards the camera and walks toward a door and the man with the face mask. He moves his hands to say goodbye. 
He points at something in the distance and says something. He puts his arms out sideways, then points again, before knocking on a door. He looks into the doorway, where two women are standing. A title says \"a little bit\". A winking emoji is visible. The three people inside a room shake hands and say \"i'm Aisha\" \"Bhriss\" \"no i'm Aisha\" \"i'm Nikki\". Text appears that says \"*awkward moment*\". The man with glasses and backpack makes a gesture that is followed by a checkmark. The person in glasses and a backpack makes a motion with his hand. Text appears that says \"thanks daddy ;)\". The man is in the room now. He is seen near an air diffuser. The view changes to the man in the white shirt and a black face mask again, and the man with glasses in the hallway and they point at each other, and the man with glasses is smiling and wiggling his fingers, and then puts his arms out to the side. The video goes black.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "In the latter part of the video, does the air diffuser switch from an on to an off state?",
+ "answer": "A",
+ "candidates": [
+ "A. No, it does not.",
+ "B. Yes, it does.",
+ "C. It is uncertain.",
+ "D. None of the above."
+ ]
+ }
+ },
+ "tqzhWKKZ": {
+ "video_id": "tqzhWKKZ",
+ "video_duration": "109s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A bald man in camouflage pants stands facing another man in khaki shorts and a gray t-shirt. Several other people stand around them, observing. The bald man touches the other man's forehead with his index finger. They continue to face one another. The man in camouflage pants then brings his right hand up to the other man\u2019s face and brings him to the ground. The scene changes. Now the bald man faces a Black man wearing a light green t-shirt and camouflage pants. They stand on a blue mat. Again, other people stand around them. The bald man touches the Black man\u2019s face. He then puts his arm around the Black man\u2019s neck and brings him to the ground. The scene changes. Now the bald man stands facing a man in a black t-shirt. Other people stand around them in a loose circle. They\u2019re on the blue mat, but now part of the background is green. The bald man puts his hand on the other man's forehead and appears to push his head slightly. He keeps his hand in place. Then he brings his other hand up and places it on the side of the man\u2019s head. He moves his hands down the man\u2019s face, onto his shoulders, then around his neck, and brings the man to the ground. He rises and gestures with his hands as the other man lies on the ground. He then moves in front of the man again and places a hand on his head, moving his hand down his face. He reaches around the man\u2019s neck and brings him to the ground once more.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "After the bald man wearing black clothing and camouflage pants throws the man in a white short-sleeved shirt and gray pants to the ground, what does he do?",
+ "answer": "C",
+ "candidates": [
+ "A. He circles around him.",
+ "B. He attacks his nose.",
+ "C. He sits on him.",
+ "D. He simply knocks him down."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, when the bald man wearing a black short-sleeved shirt strikes the nose of the man wearing a gray short-sleeved shirt, what is the bald man's position relative to the man in gray?",
+ "answer": "A",
+ "candidates": [
+ "A. Directly in front.",
+ "B. To the left side.",
+ "C. To the right side.",
+ "D. Directly behind."
+ ]
+ }
+ },
+ "otyTqkci": {
+ "video_id": "otyTqkci",
+ "video_duration": "65s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A dark screen is shown, followed by an animated scene. A person wearing a blue outfit is asleep in a sleeping bag. A notebook and a green crayon are next to them. The person wakes up with wide eyes. They sit up and look around, then pick up a flashlight and shine it on the ground. The person appears to be sitting on a sleeping bag in a forest setting. Next, a fire is shown in the background. The person looks around again, seemingly frightened. The person ducks into a bush. Then a large character is shown lying on the ground, with glowing red mushrooms surrounding them. The large character turns their head towards the camera, and they have a flag printed on their face. The person is then shown again hiding in the bushes. Then, a character with a mushroom-shaped hat and a stick appears in the frame, with more mushroom characters in the background. The mushroom character moves their mouth, then looks worried, then looks angry. The person hides behind a tree, then falls to the ground. The person sits up, appears to be startled, then runs. A mushroom character appears in the water, and a mushroom character is on the ground with their stick broken. The person is shown looking worried again. A backpack is shown with various food items in it, then the person reaches out for a can. Two mushroom characters appear with a mace. Several other mushroom characters appear in the background. The two mushroom characters drop a mace, with several snacks scattered on the ground. The person pulls a meat stick out of their bag, then puts on their backpack with a mushroom character clinging on. The mushroom character then bites a belt on the person's clothes. The person is lying on the ground, and a mushroom character is next to them. The person rolls over with a container next to them. The container and a mushroom character are shown with a white liquid spilled around them. 
The person and a mushroom character are shown again in the next scene, with the mushroom character jumping out of a tree with a knife. The character stabs itself with its own knife, and then falls to the ground with white liquid coming out. The person appears to be startled, lying on the ground.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why do the sprites attack the girl in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The girl and her friends are camping on the sprites' territory.",
+ "B. The sprites want to steal the girl's food.",
+ "C. The girl harmed a sprite.",
+ "D. The girl stole something from the sprites."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Was the stick snack in the backpack opened in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. I am not sure.",
+ "C. Yes."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the white substance on the face of the knife elf in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Milk.",
+ "B. Flour.",
+ "C. Clay.",
+ "D. Bird droppings."
+ ]
+ }
+ },
+ "ZdqVyOwy": {
+ "video_id": "ZdqVyOwy",
+ "video_duration": "74s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a boxing match between Richard Commey and Teofimo Lopez. The two boxers circle each other in the ring. Lopez is wearing gold and white shorts and gloves, and Commey is wearing gold shorts and gloves. Lopez throws a few jabs at Commey. Commey dodges some of the punches, but Lopez lands a few punches to Commey\u2019s head. The boxers continue to circle each other, and Lopez throws a right hook that knocks Commey to the ground. The referee starts the count, and Commey struggles to get back up. The referee reaches ten, and the fight is over. Lopez is declared the winner.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What are the roles of a darker-skinned man and a lighter-skinned man, respectively?",
+ "answer": "B",
+ "candidates": [
+ "A. A player and a commentator.",
+ "B. A player and a player.",
+ "C. A referee and a referee.",
+ "D. A player and a referee."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action did the lighter-skinned competitor take against the darker-skinned competitor in the video, prompting the referee to pause the match and separate them?",
+ "answer": "C",
+ "candidates": [
+ "A. A straight punch.",
+ "B. A right hook.",
+ "C. A left hook.",
+ "D. A combination punch."
+ ]
+ }
+ },
+ "gifWfvVv": {
+ "video_id": "gifWfvVv",
+ "video_duration": "93s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man wearing a blue button-up shirt stands with his hands clasped in front of him, with a sign with the words \"Rock 92\" on a stand to his left. Another man, with long blonde hair, a beard and a blue tank top, walks into the scene and stands facing the first man. The blonde-haired man moves his arms and hands while the man in the blue shirt moves his hands in response. The blonde-haired man points his finger, and gestures with his hands. Two additional men enter the scene and the blonde-haired man interacts with a man with a red, white and blue bandanna on his head while the first man stands behind him. A fourth man, wearing a light blue button-up shirt with glasses, walks around the other three. The man in the blue tank top points his finger at the man in glasses and gestures. The video fades to black, and white text appears that says \"To be continued... Sunday, July 24th at Wet N Wild Emerald\". The logo for Rock 92 appears throughout the video on the bottom right of the screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who just said, \"Roses are red, and there are impurities in my car's oil\"?",
+ "answer": "A",
+ "candidates": [
+ "A. A man with white hair.",
+ "B. A man wearing a gray shirt.",
+ "C. A man wearing a headscarf.",
+ "D. A bald man."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotional state of the white-haired man in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Excited.",
+ "B. Sad.",
+ "C. Happy.",
+ "D. Angry."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man with white hair do to the hat in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Placed it on the head of the man wearing a gray shirt.",
+ "B. Took it off and threw it beside him.",
+ "C. Put it into the pocket.",
+ "D. Stepped on it."
+ ]
+ }
+ },
+ "sqqskMRA": {
+ "video_id": "sqqskMRA",
+ "video_duration": "498s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a blurry view of a living room. Text reading \"AFTER A FEW LONG MINUTES!\" appears on the screen. The view then cuts to a person in a black t-shirt and black pants in a room with a gold-framed mirror and white door. They are speaking and gesturing, and move their hair. The view then cuts to a wide shot of a living room that includes a white sofa, a dog bed with a dog inside, and a partially visible kitchen. The view goes back to the person in the black shirt and pants in the first room, where they gesture to their face, then the view returns to the wide shot of the living room. Then, the view cuts to the person in the black shirt picking up a blue bag. Another person enters the frame. They are wearing a black shirt with a small gold design, black shorts and black shoes. They carry a black bag as a brown dog walks into the view. The person in the black t-shirt exits the frame. The view goes back to the wide shot of the living room. Then the view cuts to the person in the first room and text appears reading \"WHAT IS THAT?\", as they walk to the door. The view then shifts to the wide shot of the living room where the person wearing the black shirt and shorts and the person wearing the black t-shirt and pants walk into the room. The two people begin to put things in a gray suitcase. The camera view shifts to the room with the gold-framed mirror, where the person in the black t-shirt and pants looks in the mirror and gestures. They then exit the frame. The person with the black shirt and shorts enters the frame with the brown dog. The view goes to the living room and the woman in the black shirt and pants packs clothes into the suitcase. She grabs more clothes and the man in the black shirt and shorts looks annoyed. The woman begins to unpack more items from the gray suitcase while the man continues to gesture with his hands. 
The camera then moves to the room with the gold-framed mirror where the man gestures as the brown dog moves around the room. The view goes back to the wide shot of the living room and the two people continue to pack the suitcase with clothes. The man in the black shirt and shorts holds a white t-shirt over his head. The man then does a lifting motion with his arms. The woman then walks back into the shot from the doorway and she gestures to the man. The two then unpack more items from the suitcase as a grey cat enters the frame, walking across the white couch. They continue to pack the suitcase and move items around on the couch. The view goes back to the room with the gold-framed mirror and the woman is gesturing and a text box with \"IDK\" appears. The woman continues to gesture in that room. Then the view goes to the wide shot of the living room and the man enters. The view then shifts to the room with the gold-framed mirror where the woman in the black t-shirt and pants is looking around. The man with the black shirt and shorts is also in that room and begins to move and stretch his arms, and then looks at a phone. The camera returns to the living room where the woman is putting things in the suitcase and looking at her phone. The man does stretches with his legs before picking up a small white box. The view changes to the first room where the woman in the black t-shirt and pants does a hand gesture, and the view cuts back to the living room, where the man is touching his chest and moving his hands around. The woman is looking into the suitcase. There is text on the video reading \"TRIED PUTTING SOME BASS IN HIS VOICE\" and it follows the man as he does gestures. Then the camera pans back to the woman packing the suitcase. The camera view then changes back to the room with the gold-framed mirror. The woman does some movements before the man does some large stretches with his arms. 
The view changes to the living room and the woman is taking items out of the suitcase. The man is also in the shot and is throwing clothes around. The view changes back to the room with the gold-framed mirror and the woman is holding some pants with a green and black pattern. The man and woman are in that room and the man is gesturing with his hands while the woman is inspecting some pants. The view changes back to the living room, where the woman is unpacking a gray bag. The man has his arms out to the side and then the camera shifts to the woman and the man in the room with the gold-framed mirror, where she does a dance move. The woman then walks out the door frame and the view changes back to the living room. The woman is putting some shoes in the suitcase, and text reading \"WHERE'S MY BAGS?\" appears. The man does an arm stretch. The camera view shifts to the woman and the man in the living room again. The woman opens and closes the suitcase as she looks at the man. The view shifts to the man gesturing and then to the woman packing more clothes into the suitcase. The camera view changes to the wide shot of the living room. The two people walk into and out of the doorway as the camera view changes to the room with the gold-framed mirror where the man does a leg stretch. The woman walks into the living room carrying the blue luggage and the man is gesturing. The woman puts shoes on the couch and the man stretches his arms again. The camera shifts to the woman packing the luggage with the shoes. The view changes to a shot of the man looking at a phone in the room with the gold-framed mirror. The view changes to the woman putting the suitcase on the couch. The man stretches again. The view changes to the woman and the man in the living room as the woman picks up a pair of orange and black shoes. The view then changes to the man in the room with the gold-framed mirror looking at his phone. 
The view returns to the living room where the woman is adding the orange and black shoes to the suitcase. The man enters the living room again. The woman then sits on the couch and the man is holding on to the suitcase's handle. The view shifts to the woman sitting on the couch and the man is carrying a suitcase and then the two walk out the door, then the video ends with a final wide shot of the living room, with both people outside the view.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What color is the suitcase used by the woman in black to carry items in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. White.",
+ "B. Blue.",
+ "C. Black.",
+ "D. Gray."
+ ]
+ }
+ },
+ "ohpXmCnf": {
+ "video_id": "ohpXmCnf",
+ "video_duration": "24s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A cartoon of a young person with brown hair, glasses and a red headband is shown against a dark purple background speckled with white and brown spots. The young person is on top of a large dark object. The young person's expression changes from concern, to happiness, to anger, in quick succession. Then a yellow triangle with a black top hat and bow tie, and one large white eye appears with a small brown object in its hand. The triangle moves around with the object. The young person then punches a large white object, with the text \"POW\" written over the image. The yellow triangle then reappears holding the brown object. The triangle moves around the screen, and it and the brown object slowly shrink.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "In the animated video, does the yellow triangular creature have a mouth depicted?",
+ "answer": "A",
+ "candidates": [
+ "A. No, it does not.",
+ "B. Yes, it does.",
+ "C. It is not clear."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "In the video, why does the yellow triangular-shaped creature release the little girl?",
+ "answer": "A",
+ "candidates": [
+ "A. The little girl hits it.",
+ "B. The little girl pries open its hand.",
+ "C. The little girl apologizes to it.",
+ "D. The little girl endears herself to it."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Does the volume of dialogue between characters change in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The sound gradually increases as the video progresses.",
+ "B. There is no change.",
+ "C. The sound gradually decreases as the video progresses.",
+ "D. The closer the characters are, the louder the sound."
+ ]
+ }
+ },
+ "qKwAdiXH": {
+ "video_id": "qKwAdiXH",
+ "video_duration": "46s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with the text \"Radon Machine\" at the top of a white screen. Below that are a question mark, the text \"But how to obtain a strong model?\", and a diagram depicting labeled green and blue boxes, each of which is connected by lines. The diagram is partially obscured by a dark blur. The screen goes white with a dark blur. The screen goes white again, displaying \"Radon Machine\". An arm holding a marker comes into view from the left side of the screen, and text begins to be written on the screen, first \"Let us picture\". The text continues as \"weak model as a point\" while a cartoon stick figure is shown in the bottom left corner of the screen. A diagram is shown on the right side of the screen consisting of blue points scattered within a light blue gradient oval. The text continues as \"and the model quality as a convex function\". Additional text appears as \"Each blue point represents a model\". The text disappears and the words \"The Radon Machine partitions the set of points into subsets...\" appear. On the right the blue points change to gray, yellow, and orange. The labels S0, S1, S2, and S3 are shown on a few of the points. The text continues as \"...and then it computes the Radon point of each subset.\" On the diagram, lines in yellow and red connect the point clusters. A yellow square box is shown in the center of the diagram. The yellow and red lines disappear. The yellow square is shown connected by a blue line to an orange square, then to a red square, and then to a violet square. The text on the left of the screen states, \"After having calculated the Radon point of each subset, it begins the process iteratively again...\". The text continues as \"...with each iteration yielding stronger models.\" The violet square is shown at the center of the diagram, labeled with r^h. The text on the left of the screen continues as \"The last point remaining is the output of the Radon\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is only one dot visible in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The video does not demonstrate it.",
+ "B. The other dots were erased.",
+ "C. The model has become weakened.",
+ "D. It is the result of the completed iterations."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "In the video, when the man says that the blue dots represent a model, how many blue dots are there in total?",
+ "answer": "D",
+ "candidates": [
+ "A. Twelve.",
+ "B. Four.",
+ "C. Eight.",
+ "D. Sixteen."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the cartoon portrait of the short-haired man appear?",
+ "answer": "A",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Throughout the video.",
+ "C. At the end of the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "IWWlNWAT": {
+ "video_id": "IWWlNWAT",
+ "video_duration": "28s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person in a black graduation gown and white shirt is standing in front of a window with a black frame. There's a brown table to the left, and another person\u2019s arm comes into frame from the left with a hand holding a black graduation hat, placing it on the table. The person in the graduation gown then picks up the hat and puts it on their head. Another arm and hand enters the frame again, and the person in the gown hands them a black book. The person in the gown makes exaggerated gestures, then holds the book and speaks. The person then holds the book to their ear as if it's a phone. Finally, the person in the graduation gown presents their fist to another person\u2019s fist.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order do the five actions occur in the video? a. Throwing something at a hat. b. Handing a book to the other hand. c. Putting on a hat. d. Writing on the book. e. Bumping fists.",
+ "answer": "A",
+ "candidates": [
+ "A. acbde.",
+ "B. aebdc.",
+ "C. acdbe.",
+ "D. ecbda."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the hand throw something into the hat?",
+ "answer": "C",
+ "candidates": [
+ "A. Four.",
+ "B. Once.",
+ "C. Two.",
+ "D. Five."
+ ]
+ }
+ },
+ "ExOAnMyv": {
+ "video_id": "ExOAnMyv",
+ "video_duration": "34s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The screen displays the text, \"I'm gonna get you home safe.\" A figure aims a gun in a dark room. Another figure, Ada Wong, aims her gun back at him and says \"Ada\". A man grabs Leon's arm. Leon asks \"who are you? And what are you doing here?\" The man holding Leon raises his hand as debris falls behind him. Leon fights a large creature in a cave. The creature roars, and Ada shouts \"Ashley, run!\" Leon and Ashley Graham run through the rain. Leon is knocked off a ledge into water. He falls into a chamber, where a man stands at an altar with candles. Two robed figures stand on either side of the altar. Ashley coughs up blood and looks at her bloody hand. A robed figure stands behind her. Leon lies on the floor, injured. Ashley and Leon stare at an explosion behind a ruined doorway.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "How many different types of monsters appeared in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four types.",
+ "B. One type.",
+ "C. Three types.",
+ "D. Six types."
+ ]
+ }
+ },
+ "uYNtVWOB": {
+ "video_id": "uYNtVWOB",
+ "video_duration": "279s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with text on a white background. \"Step 2: Shape Bagel Dough and cover with a kitchen towel and let rest for 30-60 minutes or until puffy.\" Below that it reads \"Step 3: Bring a pot of water to boil. Add 1 tablespoon of sugar to the water.\" Next, a woman's hands are shown in a kitchen, using a metal dough scraper to divide a large piece of dough into smaller, roughly triangular portions. Several glass jars containing various ingredients are visible behind the dough. A baking sheet lined with parchment paper is placed on the stovetop behind the dough. The woman portions the dough into eight pieces. She picks up one piece of dough and begins shaping it by rolling it in her hands, using a circular motion against the countertop. The dough forms into a smooth, round ball. Then, she pokes a hole through the center of the ball with her thumb, and widens the hole by working the dough with her fingers to create a bagel shape. She places the shaped bagel on the baking sheet. She repeats this shaping process with each of the remaining pieces of dough, placing the shaped bagels on the baking sheet, evenly spaced apart. Next, the woman folds a clean white kitchen towel and places it over the bagels on the baking sheet. She lifts a red pot filled with clear liquid off the stovetop and places it on a different burner. She picks up a large jar filled with sugar and dips a spoon into the sugar. The next text screen reads \"Step 4: Boil bagels for 2 minutes on each side. Dip one side in seeds and set on parchment paper lined baking sheet. Step 5: Bake bagels at 425 degrees F for 25 minutes or until golden brown.\" Next, she uses a slotted spoon to lift boiled bagels, one by one, from the red pot, placing them on the baking sheet lined with parchment paper. A jar filled with what appears to be brown sugar is visible. A shallow bowl containing a mixture of seeds is then shown. 
The woman picks up one of the boiled bagels and presses one side of it into the seed mixture. She lifts the bagel, now coated with seeds on one side, and places it back on the baking sheet. She repeats this process with a second bagel, using a slightly different seed mixture. She repeats this until all bagels are seeded.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the woman doing in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. She is sharing a doughnut.",
+ "B. She is tasting a doughnut.",
+ "C. She is buying doughnuts.",
+ "D. She is making doughnuts."
+ ]
+ }
+ },
+ "TYvrtyRg": {
+ "video_id": "TYvrtyRg",
+ "video_duration": "117s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video starts with a wide shot of a race track, with cars lining up at the starting line and several people standing on the track dressed in orange. There are several vans and other vehicles parked in the background, and green trees in the distance. The cars begin to move forward after the traffic light turns green. The camera angle changes to follow the cars as they go around the race track, which has sections of both paved and dirt track. There are multiple cars, which are various colors: yellow, pink, blue, and black/white. The camera angle changes multiple times to follow the cars from different viewpoints. The video ends with a wide shot showing multiple cars continuing around the race track.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Who is driving the yellow vehicle in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Max Weatherly.",
+ "B. Anto Pitcannon.",
+ "C. James Chismann.",
+ "D. James Constantine."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What place will the white and blue vehicle finish in the race shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Sixth place.",
+ "B. Third place.",
+ "C. Fourth place.",
+ "D. Fifth place."
+ ]
+ }
+ },
+ "kTJatlWN": {
+ "video_id": "kTJatlWN",
+ "video_duration": "49s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a man with dark hair and sunglasses, wearing a white and blue striped long-sleeved shirt, speaking while standing by a wooden fence. In the background, there is a white barn with a sign. Next, two people, a boy wearing a black short-sleeved shirt and a man wearing a blue long-sleeved shirt and a black hat, are seen looking at a wooden structure. Then, a band of four men are on a stage beneath a white tent, playing guitars and a banjo. A man with a beard wearing a black jacket is seen, with a white barn in the background, and many people sitting on hay bales. A woman with her hair in a bun wearing a grey sweatshirt and a man wearing glasses and a plaid shirt with a backpack are standing outside near a white tent with blue stripes. The video shows a long view of a building with \"Bob Evans Farms\" on the side and an American flag flying. The camera also captures \"Bob Evans\" on the side of a building and inside there are baked goods and bottles on shelves. Packages of frozen food with \"Bob Evans\" printed on them are shown. Finally, there is a graphic that reads \"Bob Evans Farm Festival\" with the dates of \"October 14, 15 & 16\". The video includes a view of many white tents, some with striped tops, and a man in a red shirt driving a truck, a large open area with people and cars, people putting clothes on a clothes line, and people decorating a pumpkin. The video shows musicians playing instruments and singing. A white barn with a large sign reading \u201cOhio Bicentennial 1803-2003\u201d on its side is shown. The video also contains shots of signs that read \"Taste of the Farm,\" \"Bakery,\" and \"Old Route 35\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What musical instrument is featured in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A guitar.",
+ "B. A drum set.",
+ "C. A piano.",
+ "D. A violin."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who just said, \"This is our first time participating in this program\"?",
+ "answer": "D",
+ "candidates": [
+ "A. The woman wearing gray clothing.",
+ "B. The man wearing sunglasses.",
+ "C. The man wearing a black hat.",
+ "D. The man wearing a red plaid shirt and a green hat."
+ ]
+ }
+ },
+ "qvadvwyD": {
+ "video_id": "qvadvwyD",
+ "video_duration": "29s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Celebrity Interviews",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with light brown hair wearing a grey camouflage shirt is shown walking outdoors, then indoors. They appear next to a projected image of a tweet that has two images. Another tweet appears over the first one. The video transitions to a group of people, including one wearing a black dress, walking on a reflective floor. The view changes to a close-up of a person with dark hair wearing a black top, bathed in red light, then another tweet appears on the screen. The scene changes to a person with light brown hair wearing a yellow dress twirling. A tweet appears on the screen with a person in the foreground. There is a performance with multiple people and a large crowd in the background. The person in the foreground is holding a blue microphone while on a stage. The scene changes to an overhead view of a stage with moving props above a crowd. The scene changes to a group of people walking on a brightly lit stage. The video transitions to a person with light brown hair on a stage holding a pale pink guitar. The last scene shows a person on a platform with other people below it, with red light and a dark sky with stars above.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color clothing is the singer wearing while seated on the sofa in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. White.",
+ "B. Cyan.",
+ "C. Gray.",
+ "D. Black."
+ ]
+ }
+ },
+ "UvHQjxLh": {
+ "video_id": "UvHQjxLh",
+ "video_duration": "130s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows two people interacting. One person is wearing a dark-colored pilot's hat with goggles, a dark-colored coat, a white shirt, a dark tie and an orange scarf. The other person is wearing a dark colored cap and a dark-colored double breasted coat with a dark tie and has a mustache. The video alternates between shots of the two people. They are both talking and gesturing with their hands. There is a large fan in the background of some of the shots. In some shots, there is also a typewriter and a small silver pitcher visible. In some shots, a white rectangular object is visible on the table in front of one of the people.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the profession of the person speaking throughout the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A commercial airline pilot.",
+ "B. A logistics officer.",
+ "C. A fighter jet pilot.",
+ "D. An ordinary soldier."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Where does the table-tapping sound occur in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. In the middle of the video.",
+ "C. Throughout the entire video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "qSVRkFvR": {
+ "video_id": "qSVRkFvR",
+ "video_duration": "29s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Race cars line up on a track. A man in a white and red uniform waves a flag to start the race. A red race car with the number 38 drives on the track. A jumbotron shows a race car driving on a track. The camera is placed inside a car driving on a track. The driver's hands are visible on the steering wheel. A red race car with the number 38 drives toward the pit crew. A man in a white shirt with red lettering walks toward the car. A jumbotron shows race cars driving on the track. The red car with number 38 is parked, and a man in a uniform leans into the driver's side window. The car drives away. The camera is inside the car as it drives on the track. Two men are in a pit booth. One is wearing a red shirt with \"Kia\" written on it and is speaking into a headset. He adjusts the microphone. A third man joins them.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the primary emotion conveyed by the person speaking through the walkie-talkie in the latter part of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Sadness.",
+ "B. Excitement.",
+ "C. Anger.",
+ "D. Eagerness."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "In the middle of the video, what position is the black race car on the large screen likely to be in?",
+ "answer": "D",
+ "candidates": [
+ "A. Fourth.",
+ "B. Third.",
+ "C. Fifth.",
+ "D. Sixth."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Are there any blue race cars in the competition?",
+ "answer": "B",
+ "candidates": [
+ "A. None of the above.",
+ "B. No, there are not.",
+ "C. It is not known.",
+ "D. Yes, there are."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Did Mark finish the race?",
+ "answer": "C",
+ "candidates": [
+ "A. None of the above.",
+ "B. Yes, he completed it.",
+ "C. No, he did not.",
+ "D. It is not known."
+ ]
+ }
+ },
+ "kulBDGMt": {
+ "video_id": "kulBDGMt",
+ "video_duration": "71s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with blonde hair and sunglasses perched on top of their head is wearing white headphones and a blue denim button down. They are sitting in a black gaming chair in front of two computer monitors, a keyboard, and a mouse. The background includes white walls with a geometric light fixture that changes color, and a water bottle. The person gestures with their hands while looking at the camera.A close up of a Nintendo Switch with two different color joysticks is displayed, showing multiple game icons. This is followed by a closeup of a screen being held by two hands with long, pale painted fingernails displaying a screen with a PlayStation game library interface, with a Nintendo Switch visible in the background.The person with blonde hair is again shown in their chair gesturing with their hand while speaking, with the color of the geometric lights on the wall changing. The camera focuses on the person again speaking while the background lights change colors. The video cuts to one of the two monitors, where a game map is being shown. The person then turns to face the camera and smiles, still in their chair in front of the computer monitors with the background lights changing.The camera focuses on the blonde person again who is speaking and pointing while the lights change color. The video shows a small white figurine on a shelf, lit in red, and then focuses on the person again, who appears surprised, with the lights changing color again.The camera displays a monitor screen showing gameplay footage from a fantasy game, and then shows the person again who appears to be displeased. They then speak with the background lights changing color. A close up is shown of a multi-colored RGB speaker and a light fixture is shown.The person appears on camera again, speaking while the background lights change color. 
A monitor screen is shown with gameplay footage, and then shows the person with a box containing a light fixture they are holding in their hands. The person looks at all sides of the box while sitting on the floor.The person returns to their chair and is speaking while the background lights change colors. They are shown unboxing the lighting kit and holding up a small white connecting piece. They then put together some of the light pieces on the floor. The video shows the person in their chair again speaking. The person is shown assembling more of the light fixture on the floor. They are shown back in the chair looking confused and touching their face, and then smiling and speaking while the background light is pink. The video cuts to a close up of the person speaking, and then to the back of a person with long brown hair in a ponytail who appears to be installing the light fixture. The person with blonde hair is now also helping with the installation.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will the woman in the video do in the future?",
+ "answer": "C",
+ "candidates": [
+ "A. Share recommendations for gaming products.",
+ "B. Watch gaming videos.",
+ "C. Create a gaming hub.",
+ "D. Commentate on games."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why did the woman install Nanoleaf light strips?",
+ "answer": "C",
+ "candidates": [
+ "A. To illuminate the room.",
+ "B. To decorate the room.",
+ "C. To create a gaming station.",
+ "D. To protect her eyes."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the PS5 located in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. On the floor.",
+ "B. On the table.",
+ "C. On the shelf.",
+ "D. Inside the box."
+ ]
+ }
+ },
+ "mADEFANv": {
+ "video_id": "mADEFANv",
+ "video_duration": "399s",
+ "duration": "6-8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with curly reddish brown hair and glasses is shown speaking directly to the camera. They are wearing a grey shirt. The person shows a small glass vial, a pentagram, a trefoil knot on a string, and a glass and metal pendant. At one point, their back is turned to the camera, revealing a tattoo of a trefoil knot on their upper back. The person continues to speak for the duration of the video. There are various objects on a dark wooden surface behind them, such as a lit candle. A window and dark paned windows are also behind them.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the woman present items to the camera?",
+ "answer": "D",
+ "candidates": [
+ "A. Three times.",
+ "B. Four times.",
+ "C. Six times.",
+ "D. Five times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the tattoo relative to the woman's body in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. In front of the woman's body.",
+ "B. On the right side of the woman's body.",
+ "C. Behind the woman's body.",
+ "D. On the left side of the woman's body."
+ ]
+ }
+ },
+ "nqndGLQu": {
+ "video_id": "nqndGLQu",
+ "video_duration": "57s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a soccer game taking place on a grass field, with the camera positioned at an elevated angle along the sideline. The players are wearing red or green and white striped uniforms. Throughout the video, they run across the field, pass a black and white ball, and engage in soccer plays. A goal is scored by a player wearing a red uniform and celebrated by the red team. Later, a player from the team wearing green and white uniforms speaks with a referee and his assistants. The video ends showing an extended view of the soccer game in progress. There are trees and buildings visible in the background. A fence runs along the near side of the field. There are also spectators watching from the sidelines.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, where is the referee wearing black positioned relative to the man in green jersey number 9 as he takes a penalty kick?",
+ "answer": "D",
+ "candidates": [
+ "A. On the right.",
+ "B. In front.",
+ "C. Behind.",
+ "D. On the left."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the two men do after running to the baseline following the red team\u2019s goal in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Shook hands.",
+ "B. Hugged.",
+ "C. Danced.",
+ "D. High-fived."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "At the end of the video, what emotion is the man from the Red Team experiencing?",
+ "answer": "D",
+ "candidates": [
+ "A. Calmness.",
+ "B. Anger.",
+ "C. Sadness.",
+ "D. Happiness."
+ ]
+ }
+ },
+ "XEhRmCgu": {
+ "video_id": "XEhRmCgu",
+ "video_duration": "128s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a white screen. An orange circular logo appears in the center left of the frame with the word \"GEEKCOSMOS\" to the right of it, both in black. The word \"GEEKCOSMOS\" is replaced with \"GEEKS\". The screen then displays a character walking in a dark, forested area from a third person perspective, heading through a wooden archway. The word \"SEKIRO\" appears on the lower portion of the frame in white text. The screen then changes to a menu interface with the word \"Equipment\" highlighted. The menu displays a variety of available items. The highlighted section then switches to \"Inventory\" showing available items there. The screen reverts back to the character standing in the same location beneath the archway. The screen zooms in on the character, and the character then becomes transparent with floating blue lights around them. The screen changes to an over-head perspective as the transparent character moves through a dark area and up stone steps. A health bar appears in the upper left hand corner along with the text \"Corrupted Monk.\" The character faces a ghost-like figure wearing robes and holding a long sword. A second ghost-like robed figure appears. The character uses an item and the robed figure explodes into flames, leaving behind a trail of smoke. The transparent character then battles the robed figure. The robed figure\u2019s sword glows blue as the character is hit. The robed figure is then surrounded by a white light. The character then uses another item which creates a fiery explosion on the robed figure and the character. The battle continues, and the robed figure is hit with a red attack. The screen then changes to show the robed figure sitting in the center of the frame. The character attacks the robed figure from behind, and the robed figure collapses. The text \"SHINOBI EXECUTION\" appears in white text on a dark screen. This continues as the camera pans around the area with the character on the screen. 
The text disappears, and the screen returns to the white background with an orange circular logo on the left with the words \"GEEKCOSMOS\" to the right in black text.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the player use Snapseed in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Six.",
+ "C. Five.",
+ "D. Three."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the purpose of the Snapseed skills shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Attack monks.",
+ "B. Make oneself invisible.",
+ "C. Cause monks to recoil and step back in fear.",
+ "D. Make monks freeze in place."
+ ]
+ }
+ },
+ "uUMBAtTr": {
+ "video_id": "uUMBAtTr",
+ "video_duration": "107s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A hand places a small, clear bowl of mayonnaise onto a dark countertop. Beside the mayonnaise are a package of bread, a block of cheese on a plate, and a metal cheese grater. The hand puts on a clear glove and picks up the cheese and the grater. The cheese is grated onto the plate. The grater is set aside. A knife spreads butter onto the cheese. Two slices of bread are placed into a skillet with melted butter. Once browned, they are transferred to a clean plate. Shredded cheese is placed on one slice of bread, and the other slice is placed on top, forming a sandwich. More butter is added to the skillet. The sandwich is placed in the skillet and flipped with tongs. Once both sides are browned, the sandwich is placed on the clean plate and the pan is wiped clean with a white cloth.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the person in the video place a plate on top of the food being cooked?",
+ "answer": "A",
+ "candidates": [
+ "A. To make the crispy crust of the bread thicker.",
+ "B. To flatten the bread.",
+ "C. To heat the plate.",
+ "D. To prevent dust from falling into the pan."
+ ]
+ }
+ },
+ "cilnnxGf": {
+ "video_id": "cilnnxGf",
+ "video_duration": "72s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A blurry figure performs jump squats in a dance studio. The scene transitions to a title card that says \"Level 1 - HIIT - Coming up... Star Jumps\". After a pause with the words \"READY\" and \"SET\", a man in a blue t-shirt and black shorts begins performing star jumps. He stands with feet together and arms by his side, then jumps while spreading his legs shoulder-width apart and raising his arms above his head. He then jumps back to the starting position. A timer counts down from 30 seconds in the top right corner. A small video in the bottom left corner shows a preview of the next exercise: lunges. The man continues performing star jumps as a countdown from 10 appears on screen. He keeps jumping after the countdown ends, and the video ends with a blurry image of him doing a star jump, overlaid with the words \"GREAT WORK - let's go\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many star jumps does the man perform in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Sixty-six.",
+ "B. Sixty-five.",
+ "C. Sixty-eight.",
+ "D. Sixty-seven."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the man begin to accelerate his jumps?",
+ "answer": "B",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the end of the video.",
+ "C. Throughout the video.",
+ "D. At the beginning of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Based on the video, what is the man's likely occupation?",
+ "answer": "C",
+ "candidates": [
+ "A. A badminton coach.",
+ "B. A weightlifting coach.",
+ "C. A fitness trainer.",
+ "D. A dance instructor."
+ ]
+ }
+ },
+ "cqdTfeaq": {
+ "video_id": "cqdTfeaq",
+ "video_duration": "51s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Fitness",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a white shirt and red shorts is shown performing a weightlifting maneuver with a barbell. The barbell has large red and yellow weights. The man squats while holding the barbell. He then stands while holding the barbell above his head. Afterwards, he is sitting on a chair and being helped by other men, wearing red shirts and white shirts. He is using a white towel. Finally, a man is shown filling out paperwork at a desk. He is wearing a grey t-shirt, a dark baseball cap and a black face mask.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "How many commentators are in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Zero.",
+ "C. Two.",
+ "D. One."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the atmosphere like in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Calm.",
+ "B. Passionate.",
+ "C. Angry.",
+ "D. Painful."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the result of Menas?",
+ "answer": "C",
+ "candidates": [
+ "A. 207.",
+ "B. 203.",
+ "C. 205.",
+ "D. 202."
+ ]
+ }
+ },
+ "dvgrOXBj": {
+ "video_id": "dvgrOXBj",
+ "video_duration": "97s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A white background is shown, followed by text that reads \"Air suspension\". A blue rectangle appears and the text \"HOW DOES IT WORK\" is added within the rectangle. A car tire is shown with metallic support elements attached to it. The metallic support elements are labeled as \"Air suspension system\". The image is displayed with a semi-transparent overlay to reveal an internal view of the suspension system. The elements of the suspension system are labeled \"Suspension support\" and \"Air suspension bellows\". The wheel, and the suspension elements attached to it, then rotate in three dimensions as they are being displayed. A semi-transparent image of the axle, suspension, and wheels is shown. The text \"Axles act as one system\" is added to the image. The semi-transparent image of the axles, suspension, and wheels is then replaced by solid images of the same components. A six wheeled truck chassis is shown without the truck cab attached. A truck trailer is then shown being attached to the chassis. A side view of the trailer and chassis is shown. The internal view of the axle and suspension is shown again, along with the loading of the trailer with oval shaped brown cargo. The pressure gauge readings of the suspension system increase as the cargo is loaded. After the cargo loading the pressure gauge reading stabilizes. Another view of the trailer is shown with two portions of the cargo separated by color. The cargo on the right is red and the cargo on the left is blue. The camera then focuses on the kingpin. Two different views of the trailer on different types of terrain are then shown. The STAS logo is shown. A link to the STAS website is provided.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the loading point at the end of the video located relative to the original loading point?",
+ "answer": "C",
+ "candidates": [
+ "A. On the right.",
+ "B. On the left.",
+ "C. At the back.",
+ "D. At the front."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What change did the blue potatoes in the car undergo at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. There is no change.",
+ "B. They have disappeared.",
+ "C. There are fewer now.",
+ "D. There are more now."
+ ]
+ }
+ },
+ "NeJzKrOI": {
+ "video_id": "NeJzKrOI",
+ "video_duration": "111s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two girls are sitting on a bed. One girl is wearing a black jacket and yellow pants. The other is wearing a black t-shirt and purple pants. The girl in the black t-shirt is holding a Rubik's cube. Both girls are talking, and the girl in the black jacket is gesturing with her hands. A man wearing a red t-shirt enters through a door. The man is holding paper money. The man talks while the two girls on the bed look at him. The man leaves. The girl in the black t-shirt takes the money. She appears to count the bills. She smiles. A woman enters the door and talks with the man, while the two girls on the bed continue to talk. The girl in the black t-shirt is now holding a toy and talking. The two girls on the bed are looking in different directions. The man is shown looking in the direction of the two girls. The video then shows a cut where the two girls are sitting next to the man on a balcony, playing a board game. They appear to high five with one another. The man enters the door again and begins to speak. The two girls are again sitting on the bed, one is holding money and the other a notebook. The girl with the money laughs, throws a pillow, and the video cuts off.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How do the emotions of the two women change in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. From happiness to disappointment.",
+ "B. From sadness to happiness.",
+ "C. From disappointment to anger.",
+ "D. From happiness to anger."
+ ]
+ }
+ },
+ "anLknBCI": {
+ "video_id": "anLknBCI",
+ "video_duration": "98s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A young woman with long, dark hair pulled half-back sits in front of a plain off-white wall. She yawns briefly at the very beginning, then speaks to the camera while gesturing with her hands. After a while, she picks up a pink plastic bottle labeled \"Sugar Crush Body Wash\" and holds it up, pointing to the label and pump, turning it over to show the back, and opening the pump top. She then sets down the pink bottle and picks up a smaller, transparent red plastic bottle with a flip-top. She holds this up as well, turning it in her hands, opening the flip-top, and sniffing the bottle's contents. After setting that one down, she brings a brown plastic bottle with a black flip-top into view and repeats similar actions: turning the bottle over, pointing at it, opening the cap and smelling the contents. She rubs the fingers of one hand across her cheek. She then sets the brown bottle down.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "On which finger is the woman in the video wearing her ring?",
+ "answer": "B",
+ "candidates": [
+ "A. The ring finger.",
+ "B. The middle finger.",
+ "C. The thumb.",
+ "D. The index finger."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the woman in the video open the pink shower gel bottle?",
+ "answer": "D",
+ "candidates": [
+ "A. She wants to examine the texture of the shower gel.",
+ "B. She intends to use the shower gel.",
+ "C. She wants to demonstrate how easily the bottle opens.",
+ "D. She wants to smell the fragrance of the shower gel."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times is the sound of a bottle being opened heard in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Twice.",
+ "B. Four times.",
+ "C. Once.",
+ "D. Three times."
+ ]
+ }
+ },
+ "qANJrnKL": {
+ "video_id": "qANJrnKL",
+ "video_duration": "31s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A soccer player wearing a yellow jersey with the number 9 runs with a soccer ball, while a second player in a dark blue jersey tries to intercept.A player in a yellow jersey runs with the ball, while two other players in dark blue jerseys pursue.A soccer player in a dark blue and red jersey runs with a soccer ball while three other players in white and green jerseys run towards him.A soccer player in a red jersey kicks a soccer ball. A referee in a yellow jersey stands in the background.A player in a red jersey runs and kicks the ball while a player wearing blue and a goalkeeper in black sit on the ground.A player in a red jersey runs and kicks the ball.A player in a blue and red jersey holds a soccer ball and then kicks it.A player in a blue and red jersey runs with the soccer ball.A player in a dark red jersey runs with the ball while three other players in yellow try to steal the ball.A player in a dark red jersey runs with the ball while three other players in yellow pursue.A soccer player in a blue and red jersey runs with a soccer ball while a player in a light blue and white jersey attempts to intercept.A soccer player in a blue and red jersey runs with a soccer ball while a player in a light blue and white jersey attempts to intercept.A soccer player in a blue and red jersey runs with a soccer ball while a player in a white jersey attempts to intercept.A soccer player in a blue and red jersey runs with a soccer ball while a player in a white jersey attempts to intercept.A player in a dark blue and red jersey runs with the ball, while two other players in a light blue and white jersey try to intercept.A soccer player in a dark blue and red jersey runs with a soccer ball, while a player in a light blue and white jersey pursues.A player wearing a red jersey runs with the ball while two players in blue and white jerseys are on either side of him, and a referee is in the background.A player wearing a blue and red jersey dribbles a soccer 
ball, while a player in a yellow jersey runs to intercept.A player wearing a dark blue and red jersey kicks the soccer ball towards the goal while a player in a yellow jersey stands near the goal and the goalkeeper in black jumps to intercept the ball.A player in a dark blue and red jersey kicks the soccer ball into the goal while a player in a yellow jersey stands near the goal and the goalkeeper in black tries to block it.A player in a dark blue and red jersey kicks the soccer ball. A goalkeeper in black and a player in a white jersey try to stop it. A player with the number 10 on the back of his dark blue and red jersey approaches.A player wearing a dark blue and red jersey kicks the ball forward while a goalkeeper in black lies on the ground.A soccer player in a red jersey runs with a soccer ball, while a player in a light blue and white jersey runs towards him.A soccer player in a red jersey runs with a soccer ball, while a player in a light blue and white jersey runs to intercept him.A soccer player in a light blue jersey runs with the ball, while a player in a red and white jersey pursues.A soccer player in a dark blue and red jersey runs with the ball, while three other players in red and white jerseys pursue.A soccer player in a dark blue and red jersey runs with a soccer ball while two players in red and white jerseys attempt to intercept.A soccer player in a dark blue and red jersey runs with a soccer ball, while a player in a light blue and white jersey attempts to intercept. A referee wearing a yellow jersey is in the background.A player in a blue and red jersey runs with a soccer ball, while a player in a white jersey slides on the ground.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "In the video, what is the relationship between the player wearing the green jersey, number 9, and the short-haired player wearing the blue jersey?",
+ "answer": "D",
+ "candidates": [
+ "A. They are friends.",
+ "B. They are brothers.",
+ "C. They are teammates.",
+ "D. They are opponents."
+ ]
+ }
+ },
+ "YABdtFZf": {
+ "video_id": "YABdtFZf",
+ "video_duration": "293s",
+ "duration": "4-6min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman in a red dress walks with a man in a suit. The scene cuts to the woman confronting another man in a suit while a third man wearing a patterned shirt and cap stands to the side with his arms akimbo. The woman tugs at the second man\u2019s suit pocket. The man in the patterned shirt gestures as if speaking. The woman points at the man she confronted. She pushes the man she confronted and points to the third man while she interacts with both men, raising her hands between them. The man in the patterned shirt briefly high fives the woman. He points his finger and gestures as if speaking. The woman holds the second man\u2019s suit and shakes it. She resumes speaking and gesturing with her hands and pointing. The man in the patterned shirt makes hand gestures as if speaking. The woman continues speaking and gesturing at the other two men. The man in the patterned shirt gestures as if responding. The second man covers his face. The woman gestures downwards. The man in the patterned shirt continues making hand gestures as if speaking while the other man looks down. The woman resumes holding the man\u2019s coat while he gestures with his hands. The third man points his index finger up. The woman continues confronting the second man, gesturing with her free hand. The second man raises one arm. He pulls something from his pocket and gives it to the woman. The woman examines what she was given as she steps back. The scene cuts to an aerial view of an island with buildings on it, with a small boat traveling alongside it. The scene cuts again to a building exterior with gated windows. A door in the building opens, and the woman and third man exit the building. The man holds paper money and uses a cardboard rectangle to fan the woman as they walk away. The man continues to fan the woman. They walk along a dirt area between some buildings, with clothing hanging on a line. The second man in the suit joins them, and the third man hands him the cardboard rectangle. The second man fans himself and speaks. He hands the cardboard rectangle back to the third man and the woman speaks to the third man. The third man gives the cardboard rectangle to the woman, and the second man speaks. The woman begins fanning herself with the cardboard rectangle. The second man speaks and gestures with his hands. The woman hands him the cardboard rectangle, which he begins fanning himself with. He continues to fan himself while speaking. He runs away, still holding the cardboard rectangle, as the third man runs after him. The video cuts to a green screen with a red button that says subscribe, which is clicked by a cursor hand. The button then turns white and reads subscribed, and a cursor hand clicks a bell next to it.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What is the progression of the emotional tone throughout this video?",
+ "answer": "A",
+ "candidates": [
+ "A. The tone transitions from tense and confrontational to relaxed and soothing, then back to tense.",
+ "B. The tone remains consistently relaxed and soothing.",
+ "C. The tone remains consistently tense and confrontational.",
+ "D. The tone shifts from relaxed and soothing to tense and confrontational, then returns to relaxed and soothing."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the man in the floral shirt do to the man in the suit at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Took off his hat and chased him.",
+ "B. Took off his shoes and bowed to him.",
+ "C. Took off his shoes and chased him.",
+ "D. Took off his hat and bowed to him."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Why does the woman in red thank the man in the flowered shirt in the middle of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Because he previously helped the woman in red and asked her to pay him.",
+ "B. Because he previously helped the woman and asked the man in the suit to pay her.",
+ "C. Because he previously helped the woman and asked her to pay the man in the suit.",
+ "D. Because he previously helped the man in the suit and asked the man to pay him."
+ ]
+ }
+ },
+ "OqSMEoZS": {
+ "video_id": "OqSMEoZS",
+ "video_duration": "33s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A hand dips a tortilla chip into a bowl of orange dip. The dip is thick and creamy. On the table are also a bowl of tortilla chips, and a plate with broccoli florets, baby carrots, and celery sticks. A woman with long brown hair, wearing an olive green sweater, sits in a kitchen. The inside of a blender is shown, containing walnuts, a head of garlic, lemon wedges, and roasted red peppers. Chopped chives and a reddish-brown powder, possibly paprika, are added to the blender. A plate with round crackers and an orange-red dip is shown on a table with a bamboo placemat. A white rectangular dish with a dessert and a red berry sauce is also visible. A close-up of a dip, similar in color to the previous one but with a coarser texture, is shown. Pomegranate seeds and a drizzle of oil are on top of the dip. A piece of pita bread is dipped into the dip. The woman in the kitchen gestures with her hands.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the woman's emotional response when discussing Muhammara?",
+ "answer": "D",
+ "candidates": [
+ "A. She feels excitement.",
+ "B. She feels disappointment.",
+ "C. She feels anger.",
+ "D. She feels satisfaction."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What change occurs to the dipping sauce in the bowl at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It becomes smooth from being uneven.",
+ "B. It becomes dirty from being clean.",
+ "C. It becomes clean from being dirty.",
+ "D. It becomes uneven from being smooth."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What genre of music is featured in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The music is joyful.",
+ "B. The music is angry.",
+ "C. The music is sad.",
+ "D. The music is exciting."
+ ]
+ }
+ },
+ "LFQHCYRD": {
+ "video_id": "LFQHCYRD",
+ "video_duration": "32s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man wearing a black t-shirt, black pants, a baseball cap, and a bandana over his lower face stands in front of a cinderblock wall. Two large, gold dollar signs flank him on either side. He squats slightly, hands on his thighs, then stands up straight with his arms outstretched. Next, he raises both middle fingers. Animated bursts of flame appear on either side of him. These flames are followed by dark plumes of smoke that obscure the edges of the frame. The man returns to his arms-outstretched pose before squatting again, hands on thighs. He repeats the cycle of middle fingers up, flames, smoke, arms outstretched, and squatting multiple times.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What does the man wearing a baseball cap in the video consider uninteresting?",
+ "answer": "C",
+ "candidates": [
+ "A. Spending a lot of money on compartmentalized shelters.",
+ "B. Working 40 hours a week just to pay bills.",
+ "C. Working hard all one's life just to make others richer.",
+ "D. Wasting precious time as a hired laborer."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order of the man's criticisms of capitalism in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Wasting time as an employed slave, cubicle shelters being too expensive, working 40 hours a week to pay bills, essentials should be free, and working hard to make others rich is not enjoyable.",
+ "B. Wasting time as an employed slave, working 40 hours a week to pay bills, essentials should be free, cubicle shelters being too expensive, and working hard to make others rich is not enjoyable.",
+ "C. Wasting time as an employed slave, working 40 hours a week to pay bills, cubicle shelters being too expensive, essentials should be free, and working hard to make others rich is not enjoyable.",
+ "D. Wasting time as an employed slave, essentials should be free, working 40 hours a week to pay bills, cubicle shelters being too expensive, and working hard to make others rich is not enjoyable."
+ ]
+ }
+ },
+ "RPnItRsE": {
+ "video_id": "RPnItRsE",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video is split into three sections: the top section shows a man wearing a white t-shirt and black headphones, sitting in a chair and looking at a monitor; the middle left section shows a man with a beard looking at a monitor; and the bottom section shows a first-person view of a video game. In the bottom section, a character wearing blue and red gloves holds a red and white gun. The player character moves through a variety of corridors and rooms. Different characters, represented by solid colors, are visible at different points in the video and are shot by the player character. A large orb is visible in some segments. The player character's perspective is often partially obscured by a blue-purple haze. At some points a text overlay is visible in the bottom section of the video, such as \"SPIKE PLANTED\". In the middle of the video, text overlays appear, including \"How\", \"plays Yoru\", \"again look\", \"we talked about\", \"he's covered\", \"Hooka's\", \"from backsite\", \"its such a good\", \"isolated position\", \"prefriring the\", \"was insane\", \"he has 2 flashes\", \"spike is down\", \"me personally\", \"i keep a tp here\", \"i take the tp\", \"and then take it back\", \"he sends his tp\", \"actual teleporter\", \"and rotate as well\", \"yup\", \"a good play\", \"either way\", \"that's a good\", \"somebody took\", \"if you heard that\", \"one backsite\", \"great flash\", \"again\", \"Eggster flashes\", \"are pixel perfect\", \"every single time\", \"i think this is why\", \"he gets so many kills\", \"he's aiming\", \"no dodging that\", \"and then instantly\", \"he knows\", \"always have a TP\", \"there's just no way\", \"fake TP her\", \"he could flash her\", \"flash TP her\", \"infinite options\", \"flash TP\", \"it's over\". Red circles are drawn on top of the bottom section in multiple places.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many enemies did the Yoru player defeat in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Three.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, where does the last enemy appear relative to the Yoru player?",
+ "answer": "D",
+ "candidates": [
+ "A. On the left.",
+ "B. Behind.",
+ "C. On the right.",
+ "D. In front."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the skill order used by the Yoru player to eliminate the second enemy in the video? (a) Teleport, (b) Blink, (c) Dash",
+ "answer": "C",
+ "candidates": [
+ "A. (a)(c)(a)(b).",
+ "B. (c)(a)(b)(a).",
+ "C. (a)(b)(a)(c).",
+ "D. (b)(a)(c)(c)."
+ ]
+ }
+ },
+ "JtCczCsx": {
+ "video_id": "JtCczCsx",
+ "video_duration": "40s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A baby in orange overalls lies on their back on a white blanket and looks concerned. The baby then sits up and starts crying while being held by their mother. The mother looks worried as she holds the baby. Next, the baby sits alone on the blanket with a sad expression. Another panel shows the baby sitting on the bed playing with a toy and looking angry, waving their arms around. Their mother sits on a small stool in the background, looking thoughtful with her hand on her chin. The mother is then shown sitting on the stool while her toddler plays with a toy truck on a rug in the foreground. Other toys, a ball and a smaller truck, also lie on the rug. A dresser with a lamp on it and a framed picture of a teddy bear are in the background. A new scene shows the mother standing next to a medical exam table. A doctor sits on the other side of the table, facing the toddler. The doctor points their finger upwards while talking to the mother.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "Which of the following sequences correctly orders the events related to Timo as presented in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Timo avoids eye contact, can only say two or three words at four years old, becomes uneasy during hugs, does not smile, displays tantrums while playing with friends, and is diagnosed with ASD.",
+ "B. Timo avoids eye contact, becomes uneasy during hugs, does not smile, displays tantrums while playing with friends, can only say two or three words at four years old, and is diagnosed with ASD.",
+ "C. Timo becomes uneasy during hugs, does not smile, displays tantrums while playing with friends, can only say two or three words at four years old, avoids eye contact, and is diagnosed with ASD.",
+ "D. Timo displays tantrums while playing with friends, avoids eye contact, does not smile, becomes uneasy during hugs, can only say two or three words at four years old, and is diagnosed with ASD."
+ ]
+ }
+ },
+ "rmVAnNMn": {
+ "video_id": "rmVAnNMn",
+ "video_duration": "146s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black t-shirt, light-colored patterned jeans, a black baseball cap with a pink brim, and a black backpack walks through the aisles of a convenience store, browsing the shelves of snacks. The person takes a brown drink bottle off a lower shelf and places it in the backpack, then walks down a different aisle, and adds several bags of chips and a pack of candy to the backpack. Another person wearing a pink shirt and a red covering over their head, waits by the door outside the store, adjusting the head covering. This person briefly appears within the store as the first person continues shopping, and says, \u201cWhat is taking him so long there? He\u2019s just supposed to be getting some snacks.\u201d The person in pink then turns their head to face outside again. The first person continues to browse, and adds more snacks to the backpack. When the first person approaches the counter, the second person in pink is shown briefly outside, facing towards the shop. The person working behind the counter walks onto the counter when the first person reaches it and asks \u201cHey, what are you doing man?\u201d. The first person walks away from the counter, and the counter worker approaches him grabbing at the backpack. There is a brief struggle where the contents of the backpack are partially revealed. The first person then points at a blue backpack on the counter and says \u201cThat\u2019s not my bag!\u201d. The counter worker then turns away and retrieves the backpack of the first person. The counter worker checks the contents of the backpack and asks, \u201cWhere is your dad? Where is your momma?\u201d to which the first person replies \u201cMy mom is at home, my dad is at work\u201d. The counter worker continues checking the contents of the backpack and says \u201cYou're lucky I know your daddy bro.\u201d The counter worker then says, \u201cDon\u2019t come back here stealing, ok?\u201d and gives the backpack to the first person. The first person puts on their backpack and leaves the store, and then stands by the store entrance.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many bags of red-packaged snacks did the boy steal?",
+ "answer": "A",
+ "candidates": [
+ "A. One.",
+ "B. Three.",
+ "C. Four.",
+ "D. Two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What good deed did the boss perform in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Reported the boy but gave him snacks.",
+ "B. Did not report the boy and gave him snacks.",
+ "C. Reported the boy but pleaded on his behalf.",
+ "D. Did not report the boy."
+ ]
+ }
+ },
+ "tSOipTEP": {
+ "video_id": "tSOipTEP",
+ "video_duration": "182s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man wearing a gray LA baseball cap, a black t-shirt, a denim apron, and black gloves stands in a kitchen. He places chopped, pickled jalape\u00f1os in a light-colored baking dish, and adds sliced white onions. Next, he adds several pieces of cooked chicken to the baking dish. He picks up a cast iron skillet containing an orange-colored sauce and pours it over the chicken and vegetables. He uses a spoon to ensure the sauce covers all the chicken. He then places the baking dish into an oven. Later, he takes the baking dish out of the oven. The chicken now appears to be fully cooked, covered in the sauce. He plates white rice beside a small bowl of chopped green onions. He uses tongs to place two pieces of chicken on the rice, and uses a spoon to add sauce over the chicken. Next, he uses a fork and knife to cut into one of the plated chicken pieces. He picks up a small amount of chicken with the fork, brings the fork to his mouth, and appears to eat the chicken. He then points at the nearly-empty plate with the knife and fork resting on it.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many pieces of chicken are in the baking dish shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Eight.",
+ "B. Six.",
+ "C. Four.",
+ "D. Ten."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times do chewing sounds occur in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. One.",
+ "B. Four.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "wGYlOVgl": {
+ "video_id": "wGYlOVgl",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Chemistry",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A woman is talking in front of a blue background with white lines. She is wearing glasses, a red shawl, and a white top. The scene switches to a pair of hands placing a piece of brown paper on a small metal rod on a black background. A glass beaker with clear liquid, and another one containing a blue liquid are visible. The same hands remove the paper from the rod and place the rod onto the background next to two other rods. Then the hands use a pair of metal tweezers to place two metal rods into a glass of blue liquid, and then remove the two rods to place back on the background. Finally the hands place one metal rod into the blue liquid. The scene switches to show two beakers, one holding the blue liquid and a smaller empty one. The same hands put one rod into the glass of blue liquid and then remove the rod and drop it into the empty beaker. Next a pair of silver and red nails are shown next to each other on a black background. Then a hand holding a test tube with blue liquid is shown. Finally a chart with observations is shown.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many types of equipment were used in total during the experiment?",
+ "answer": "B",
+ "candidates": [
+ "A. Eight.",
+ "B. Six.",
+ "C. Five.",
+ "D. Three."
+ ]
+ }
+ },
+ "OSwWaELc": {
+ "video_id": "OSwWaELc",
+ "video_duration": "117s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a series of race cars, mostly pickups, on a track. The cars begin by driving in a curve, then continue down a straightaway. Several cars crash on different parts of the track. One black race car crashes against the wall, producing smoke, sparks and flames. Another white race car crashes and flips over and continues producing sparks and flames as it slides. The race continues around the crashed vehicles.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where did the faulty vehicle flip over in relation to the photographer's position?",
+ "answer": "A",
+ "candidates": [
+ "A. On the right side.",
+ "B. Directly in front.",
+ "C. On the left side.",
+ "D. Ahead and to the right."
+ ]
+ }
+ },
+ "EuaNjosy": {
+ "video_id": "EuaNjosy",
+ "video_duration": "40s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A Minecraft player stands inside a glass box. The player flies above a landscape of several small islands, using a fill command to replace the area with dirt blocks. Then the player creates a large, square, flat area of grass blocks at the world height limit. Dirt is added underneath the grass platform. The player proceeds to dig holes into the grass platform and adds dirt and grass blocks to create raised islands, along with a dirt path connecting some of them. Using spruce logs and azalea leaves, the player creates small trees on some of the islands. Then, some brain coral is placed on an island. A small pond is dug and filled with water. Using bone meal, the player grows grass, flowers, and saplings on the grass platform. A white block structure, an Aether Temple, is constructed on the platform. Birch wood blocks are used for doorway details. A chest and a birch door are added inside the structure. More white blocks form a decorative top to the structure. Cows are added to the grass platform, along with several rainbow-colored particles. The particles fade to reveal a yellow and white cow-like creature, a Moobloom. The Minecraft player stands by a portal, holding buckets of milk. The player goes through the portal and is transformed into a cow-like Moobloom.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Towards the end of the video, which structure, when completed, activated the colored block effect?",
+ "answer": "C",
+ "candidates": [
+ "A. A temple.",
+ "B. A pond.",
+ "C. A cow.",
+ "D. A tree."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How many items with a stock quantity of 64 are listed in the inventory during the latter part of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Five kinds.",
+ "B. Three kinds.",
+ "C. Four kinds.",
+ "D. Two kinds."
+ ]
+ }
+ },
+ "kgaIdFRN": {
+ "video_id": "kgaIdFRN",
+ "video_duration": "146s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with white text on a purple background that contains a disclaimer, along with a logo at the bottom left that says \"FAMOUS ENTERTAINMENT.\" At the top left corner, an image of a woman and a tall building is shown. After a few seconds, the view changes to black and white clips of a house. The view zooms in on the house, then shifts to a night view of the same house. Inside, a person is seen in bed. The view shifts to a full color house during the day with trees surrounding it, then shifts to interior views showing family photos on a wall. The view switches to a man with a beard, then to a view of several newspaper articles. The view changes to an outdoor scene with three men, one of whom is wearing a blue robe, and one of whom is putting his hands on the man in the robe. The next view is a black and white scene of two people carrying a person covered in a sheet on a stretcher. Then, a woman with blonde hair wearing a brown costume appears. She makes hand motions while looking directly at the viewer. The scene cuts to a view of a cracked wall, then a woman with brown hair, and finally a man with a beard, while the scene appears to be indoors, and the view goes back and forth between these two people. The scene cuts to a night view of a dilapidated shed, and then an interior of a man with a beard holding a flashlight. The view shifts to a woman looking through a glass door, and then back to the woman with blonde hair in the brown costume. The scene changes to a house where a woman is talking to a young girl. In the next scene, a woman is putting her hands on a small child sleeping in bed. The scene then goes back to the blonde woman in the costume. She makes hand motions, and sometimes claps, while the scene changes to say \"KARA\", then \"@KARATHEVAMPIRESLAYER,\" and then finally, @KAR. At the very end, the view shifts to the words \"HOUSE TOURS,\" and then \"FASHION,\" and \"DRAMA\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the woman describe the DeFeo murders?",
+ "answer": "B",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. At the beginning of the video.",
+ "C. Throughout the entire video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there an axe in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. I am not sure.",
+ "C. Yes."
+ ]
+ }
+ },
+ "YmZvQOvV": {
+ "video_id": "YmZvQOvV",
+ "video_duration": "72s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with long, light brown hair is facing the camera, sitting in front of a couch. The person is wearing a colorful floral patterned dress and a black bracelet. They are extending their hands out to their sides. Behind the couch is a shelf with several figures, and a pillow shaped like an animal\u2019s head is also visible. The next shot is an overhead shot of text on a black background that describes \"Wondrous Journeys,\" a show celebrating 100 years of Walt Disney Animation Studios, and it names the movies that will be referenced. A series of movie poster images is then shown for the movies \"Encanto,\" \"The Princess and the Frog,\" \"Frozen,\" \"Treasure Planet,\" \"Big Hero 6\" and \"Moana.\" The video then shows several images of a fireworks show over a castle. There is text on a black background that provides more information about \"Wondrous Journeys.\" Next, a black screen shows three white, stylized building designs with a scene of characters flying on a magical object above the castle in the center design. Below the designs is the text \"Animatic Conception\". The next image has the three buildings with moving lights and flowers displayed over them, with the \"Animatic Conception\" text still at the bottom. Two cartoon character images are displayed: one of Mickey Mouse and one of Bambi. This is followed by images of Ariel from \"The Little Mermaid\" and Mirabel from \"Encanto\". Finally, a poster for the movie \"Baymax!\" is shown.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In what setting does the Frog Princess appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Inside a forest.",
+ "B. On the roof of a house.",
+ "C. Outside a house.",
+ "D. On the snow."
+ ]
+ }
+ },
+ "jWISDmxm": {
+ "video_id": "jWISDmxm",
+ "video_duration": "43s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person wearing a black and gold football jersey with the number 43 is sitting in a wicker chair. Behind them is a white background with a circular logo that reads \"THE PUPPET SHOW\" and the head and shoulders of a person with fair skin. To the left is a cigar. The video transitions to a split screen, with one side showing the individual in the chair and the other showing a large group of people. The person in the chair continues to move their head and hands while a different background is visible on the other side of the split screen. The split screen is then removed, and the background changes again. The person is still in the same chair with the same clothes and moves his hands. The background switches again, and then the original background appears again with the circular logo.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will the man do after the interview ends?",
+ "answer": "A",
+ "candidates": [
+ "A. Create a puppet show.",
+ "B. Produce a movie.",
+ "C. Act in a movie.",
+ "D. Perform in the puppet show."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many times does the letter 'P' appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. One.",
+ "C. Four.",
+ "D. Three."
+ ]
+ }
+ },
+ "BUdBDkNt": {
+ "video_id": "BUdBDkNt",
+ "video_duration": "118s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A hand turns a black oven knob from the \"Broil\" setting to 350 degrees. The video then cuts to a light brown substance being mixed in a large orange bowl with a metal whisk. Next, chunks of yellow butter are dropped from a turquoise spatula into a silver mixing bowl with a white paddle attachment. The mixer is turned on, and the butter is mixed until smooth. White granulated sugar is poured into the bowl as the mixer runs. After a moment, the video abruptly cuts to more of the light brown substance, which is now revealed to be some kind of dry ingredient mix, and continues to be mixed with the whisk. The video cuts back to the stand mixer, which now contains a thick, light tan batter, and two eggs are added from an orange spatula. A hand mixes the eggs into the batter briefly with the spatula and then turns the mixer back on. Next, a small amount of dark liquid is poured from a white cup into a green bowl containing a white liquid. The dark liquid is mixed into the white liquid with a spoon. The video cuts back to the stand mixer as a powdery white substance is added to the batter while the mixer runs. The mixer continues to run, and the batter lightens in color and becomes very smooth. The video briefly cuts to the now empty green bowl as the spoon stirs the thin white liquid it contains. The video then returns to the stand mixer as the batter is mixed and eventually comes together into a soft, uniform consistency. An orange spatula scrapes down the sides of the bowl. Next, paper liners are placed in a dark gray muffin tin. Finally, batter is scooped from a bowl with a metal ice cream scoop and placed in the lined muffin tin.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of baking steps demonstrated in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Adjust the temperature, then mix the flour and butter, add and mix in the eggs, and pour into the mold.",
+ "B. Adjust the temperature, then mix the flour and butter, pour into the mold, and add and mix in the eggs.",
+ "C. Mix the flour and butter, then add and mix in the eggs, pour into the mold, and adjust the temperature.",
+ "D. Mix the flour and butter, adjust the temperature, then add and mix in the eggs, and pour into the mold."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the woman use the blender in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Once.",
+ "B. Twice.",
+ "C. Four times.",
+ "D. Three times."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the woman using an ice cream scoop for in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. She is eating dessert.",
+ "B. She is portioning butter.",
+ "C. She is mixing butter.",
+ "D. She is scooping egg mixture."
+ ]
+ }
+ },
+ "RmzhNhPW": {
+ "video_id": "RmzhNhPW",
+ "video_duration": "120s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video starts with a blurry view of a person wearing a dark colored t-shirt. The view switches to show a downward view of a stairway in a room with unfinished white walls. A few lightbulbs hang from the ceiling, and metal grate is on the right. The view changes to show the bottom of the stairway with construction materials. The video shows the view moving upwards, following the stairway. The view changes again, showing a view of the area surrounding the bottom of the stairway. A person's hand points to the left, and the view moves with the person's hand. The view changes to show the stairway in the lower left and the metal grating to the right. The view switches to show more of the room, with the camera moving down the stairway with a hand visible. The view switches to a hand moving across the top of a set of horizontal boards in the room. The video moves to show a hand pulling on an electrical cord. The view changes again showing the stairway with a dark opening. The video view switches back to the person walking on the level above the stairwell. The person is wearing a white hardhat with the word \"PCCR\" on it, a black t-shirt and blue jeans. The person moves on to some boards, using their hands to move around. The camera shifts to the side, and then up at the person's face. The person is still wearing a white hardhat. The view shifts down at the person moving through an opening in the floor. The video then switches to show the person standing on a raised platform. The view looks down at the room with unfinished walls. The person then adjusts the platform controls. The view changes again to show the person working on some walls. The view looks up through a glass ceiling. The camera moves down to show more of the construction on the level beneath the glass ceiling. The view then shows a person wearing a hardhat and a black shirt standing on a platform, looking at the camera. The view shifts again to show the person talking, with the sun in the background through a glass ceiling. The person then flexes their bicep. The video ends with a final shot of the person talking with a white hardhat.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the scene after the man climbs the scaffolding?",
+ "answer": "B",
+ "candidates": [
+ "A. By the balcony.",
+ "B. In the basket.",
+ "C. In the basement.",
+ "D. On the roof."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of events in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Climbing the scaffolding, riding in the basket, going down the stairs.",
+ "B. Climbing the scaffolding, going down the stairs, riding in the basket.",
+ "C. Going down the stairs, riding in the basket, climbing the scaffolding.",
+ "D. Going down the stairs, climbing the scaffolding, riding in the basket."
+ ]
+ }
+ },
+ "ARzNUZVm": {
+ "video_id": "ARzNUZVm",
+ "video_duration": "162s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man and a woman are sitting on a gray couch in a sparsely decorated room with white curtains. The woman is wearing a blue tank top and has shoulder-length braids adorned with colorful beads. The man, wearing a black and white striped polo shirt, leans towards the woman. She gets up and walks away. The man leans over the back of the couch and speaks to the camera. A brief aerial shot of a city by the water is shown. A hand knocks on a white door. The same man from the first scene opens the door. He is now wearing a cream-colored long-sleeved polo shirt. The woman from the first scene is at the door. She is now wearing a low-cut, short black dress with sheer, striped sleeves, and is holding two cellphones. The two of them talk. The man leans in and kisses her. He touches her neck and hair, and she closes her eyes. He kisses her neck and she throws her head back. She holds her arms around herself as if cold. The man is now shirtless and sits in bed covered by a white sheet. The woman is holding her phones again. She fidgets with the sleeve of her dress. The man and woman continue their conversation. The woman picks up pink slippers from beside the wardrobe and brings them to the bed. She places a hairbrush on the nightstand. The man sits up in bed, smiling.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What occurs after the woman knocks on the man's door in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. They share a kiss and then smile at each other.",
+ "B. A kiss is followed by intimate physical contact.",
+ "C. A kiss is followed by an embrace.",
+ "D. Nothing happens."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "At what point in the video does the man part the woman's hair?",
+ "answer": "B",
+ "candidates": [
+ "A. After the woman adjusts her clothes.",
+ "B. After the man opens the door and they have a conversation.",
+ "C. Immediately after knocking on the door.",
+ "D. After the man lies down on the bed."
+ ]
+ }
+ },
+ "kiHIJLnd": {
+ "video_id": "kiHIJLnd",
+ "video_duration": "164s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A wrestler in gold trunks and knee pads grabs another wrestler in a grey shirt and pink tights and lifts him into the air, holding him horizontally. The wrestler in gold then flips his opponent forward and throws him down on the mat. Both wrestlers are now on the mat, with the pink-clad wrestler lying face down and the gold-clad wrestler quickly moving around him to pin his arm behind his back. A man in a black suit and hat approaches the edge of the ring. The gold-clad wrestler has the other wrestler in a submission hold. The referee leans down to ask the pink-clad wrestler if he wants to give up. The gold-clad wrestler lets go of the pink-clad wrestler, gets up, and flexes his muscles for the crowd. The referee checks on the wrestler still on the mat, and then begins counting him out. The pink-clad wrestler eventually rises to his feet. He and the gold-clad wrestler lock arms, and the gold-clad wrestler lifts him up again. He pushes the pink-clad wrestler against the ropes and then pushes him backward onto the mat. The pink-clad wrestler grabs the gold-clad wrestler's legs and twists his body so that the gold-clad wrestler is now on the mat. He then locks his legs around the gold-clad wrestler and bends one arm behind his back. The referee asks the gold-clad wrestler if he gives up. The gold-clad wrestler gets to his feet and leans over the ropes, holding his arm. The pink-clad wrestler gets to his feet. He motions to the referee, the audience, and the other wrestler and then lifts his arms into the air. The gold-clad wrestler approaches the ropes and runs off them toward the other wrestler, only to be grabbed and hoisted into the air again, then flipped and dropped on the mat. The pink-clad wrestler lifts the gold-clad wrestler's legs up from the mat and then drops his knee onto the gold-clad wrestler's back. A wrestler in a black singlet and red tights and mask climbs into the ring. He kicks the pink-clad wrestler in the back, and then holds him while the wrestler in the gold trunks gets up. The black-clad wrestler holds the pink-clad wrestler up in the air horizontally and then flips him forward and throws him on the mat.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event occurred when the referee began speaking with the audience?",
+ "answer": "B",
+ "candidates": [
+ "A. The referee is attacked.",
+ "B. Audience members storm the ring.",
+ "C. Spectators are fighting.",
+ "D. A contestant rushes off the stage."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In what part of the video does the referee's third ground strike occur?",
+ "answer": "A",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. At the beginning of the video.",
+ "C. Throughout the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "BdRGUaFO": {
+ "video_id": "BdRGUaFO",
+ "video_duration": "144s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with short dark hair wearing glasses and headphones stands in front of a microphone with a pop filter. The person holds a phone in one hand and gestures with the other hand, first by pointing a finger up and later by raising the entire hand. The person's shirt has white block letters that seem to spell out the word \"VLL\" with three stacked white lines in the middle, and a tattoo on the arm is visible. The person's head remains within a dark gray area, and in the background are black soundproofing panels and a black speaker. Behind and to the right of the person is a black desk, a laptop and other electronic equipment. The video ends with a black screen displaying the words \"CRYPTIC WISDOM\" in a stylized font.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "Is the songwriter of the song in the video the man himself?",
+ "answer": "A",
+ "candidates": [
+ "A. No.",
+ "B. I don't know.",
+ "C. Yes."
+ ]
+ }
+ },
+ "KGOBsdJm": {
+ "video_id": "KGOBsdJm",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with dark curly hair is seen looking at the camera in front of a white slatted window. This is followed by a person with dark curly hair, a mustache, and a short beard, with their face close to the camera and waving one hand. Next, a stainless steel kitchen sink is filled with baby bottles and lids. The camera cuts back to the person with dark curly hair, a mustache, and a short beard, who then walks to a corner of the counter where there is a white and gray bottle sterilizer and a gallon of water. They open the top of the sterilizer, then the lid to the sterilizing rack inside, before walking away. The camera cuts to two adults then to the person with dark curly hair, a mustache, and a short beard looking at the camera. The camera cuts to a stainless steel sink where a person is rinsing glass bottles with a stream of water. Then the camera cuts to a baby wearing a grey and white onesie with text on it lying in a bed, with an adult trying to put a pacifier in the baby's mouth.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What is the volume comparison between the man's speaking voice and the infant's cry in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The man's speaking voice is louder than the infant's cry.",
+ "B. The man's speaking voice is higher-pitched than the infant's cry.",
+ "C. The man's speaking voice is clearer than the infant's cry.",
+ "D. The man's speaking voice is quieter than the infant's cry."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What is the woman teaching the man to do in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. How to do house cleaning.",
+ "B. How to take care of a baby.",
+ "C. How to clean and sterilize.",
+ "D. How to clean baby bottles."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the dish sponge shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Green-white.",
+ "B. Yellow-green.",
+ "C. Yellow-pink.",
+ "D. Yellow-white."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the sound of running water appear?",
+ "answer": "C",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Throughout the video.",
+ "C. At the end of the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "KPipMdmR": {
+ "video_id": "KPipMdmR",
+ "video_duration": "156s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with an image of the Seattle skyline with the Space Needle visible. Then the video transitions to an image of a building with a sign reading \"PACHYDERM STUDIOS\" in front of it. Next, an album cover is displayed that reads \"NIRVANA\" and \"IN UTERO\". Below the album cover, the text \"FRANCES FARMER WILL HAVE HER REVENGE ON SEATTLE\" appears. The image changes to a shot of a place setting on a table; the album cover is placed on a plate with silverware next to it, and a glass in the background. The video transitions back to the album cover with the text \"RADIO FRIENDLY UNIT SHIFTER\" below it. Then the video shows a person with white arms and a black and white patterned shirt holding an album that reads \"AUTOMATIC FOR THE PEOPLE\" and \"R.E.M.\". The video transitions to an image of a person with no head but a single eye in place of it wearing a chef's hat. The person holds an album that reads \"AUTOMATIC FOR THE PEOPLE\" and \"R.E.M.\". Below the person, the text \"NOW PLAYING MAN ON THE MOON\" appears. The video transitions to the album cover with the text \"HEART SHAPED BOX\" below it. A hand with a white glove on it holds a yellow and black banana in front of a circular brown and red patterned background. Below the banana appears the text, \"EVERY GOOD ALT ROCK ALBUM HAS SOME VELVET UNDERGROUND IN IT\". This changes to \"EVERY GOOD ROCK ALBUM HAS SOME VELVET UNDERGROUND IN SOME\". Then the text \"NOW PLAYING IF SHE EVER COMES NOW\" appears under the banana. The image transitions to the album cover, and below that the text \"ALL APOLOGIES\" is shown. The album cover reappears, with hands holding it from either side. The album cover then appears cracked, and the image shifts to a collage within that crack. The video transitions back to the whole album cover. The video shows a black and white shot of a person with long hair singing into a microphone. This transitions to a montage of the single eyed chef figure from earlier, the singer from the previous frame, and images of the album cover.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the central message the video author aims to convey?",
+ "answer": "B",
+ "candidates": [
+ "A. The relevance of arrangement in the creative process.",
+ "B. The role of isolation in the creative process.",
+ "C. The importance of knowledge in the creative process.",
+ "D. The significance of teamwork in the creative process."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What song is playing in the latter part of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. \"Radio Friendly Unit Shifter.\"",
+ "B. \"Man on the Moon.\"",
+ "C. \"If She Comes Now.\"",
+ "D. \"All Apologies.\""
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual scene is shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The animated chef character has only an ear on their face, with no other features.",
+ "B. The animated chef character has only an eye on their face, with no other features.",
+ "C. The animated chef character has only a nose on their face, with no other features.",
+ "D. The animated chef character has only a mouth on their face, with no other features."
+ ]
+ }
+ },
+ "fusoYZoM": {
+ "video_id": "fusoYZoM",
+ "video_duration": "111s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a blue screen with several horizontal lines. Three cartoon figures appear, all with circular white faces and dark eyes. The leftmost figure is wearing an orange shirt and blue pants, and is pushing a red lawnmower. The middle figure is wearing an orange shirt, and is reading a red book. The rightmost figure is wearing an orange shirt, a white headband, and blue pants, and is lifting a barbell. A fourth figure appears, wearing an orange shirt and blue pants, and is reclining on a green couch. The scene changes to a park with green grass and several flowers, and a gray sidewalk. In the park are three figures. One has dark hair with blue and red accents, and is wearing a purple shirt. One is white with puffy white hair, and is wearing an orange coat. The third is a blue penguin. The dark-haired figure gestures with one hand and looks startled. The white figure looks around with wide eyes. The dark-haired figure stares ahead with a square in her eye. The white figure takes a brown bird with red spikes on its head by the neck. The dark-haired figure puts her hands together in front of her. The white figure gestures with the bird while the penguin stands next to him. The white figure raises the bird, and looks shocked, and other cartoon figures appear behind the characters in the park. The dark-haired figure speaks, and other figures appear behind them in the park. The white figure looks startled again, and the penguin clenches its fists. A cartoon figure appears sitting behind a white box with the text \"DR. is IN\". He is wearing a blue shirt, a headband and a stethoscope. The figure behind the box is smiling. An indistinct gray figure appears and moves in front of the white box. A view of the gray moon with lines and circles appears. The scene cuts back to the park, with the figures in their original positions. The scene shifts back to show the figure in the white box again, and it reads \"WELCOME TO PORTLAND\" with two skulls on either side of the box. The figure sitting in the white box now looks sad. The scene changes again and the white figure is again holding the bird. The dark-haired figure gestures with a hand. A close-up of the dark-haired figure shows her gasping and with a square in her eye. The scene moves to show a close-up of the white figure holding the bird, who looks around with spiraling eyes. The white figure is holding the bird and the penguin now strikes him, causing a yellow starburst. The white figure looks bewildered, the dark-haired figure stares forward, and the penguin clenches its fists. The dark-haired figure talks while gesturing with one hand.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What are the colors of the cartoon character's hair in the video, if the character is wearing purple clothing?",
+ "answer": "C",
+ "candidates": [
+ "A. Black, green, and blue.",
+ "B. Black, yellow, and green.",
+ "C. Black, red, and blue.",
+ "D. Black, purple, and red."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the penguin relative to the cartoon character wearing yellow in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Behind.",
+ "B. On the right.",
+ "C. On the left.",
+ "D. In front."
+ ]
+ }
+ },
+ "ShWVlzGB": {
+ "video_id": "ShWVlzGB",
+ "video_duration": "88s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Camping",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with a beard and bald head wearing a navy blue hoodie with the word \"fearless\" and an image of several tanks on the front stands facing the camera. Behind him is a shelf with black containers, a large purple folded blanket, a dark bag with a \"Goal Zero\" logo, and a white and gray portable power station. To the right of the man are several bicycles hanging on the wall. The man gestures with his hands, and then holds up a black cylindrical bag with the word \"Trekology\" written on it. The man then holds up a black rectangular folding object. The video then transitions to a close up view of a black table frame being unfolded and assembled, followed by a table top being attached to the frame.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes does the man in the hoodie exhibit throughout the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From excited to calm.",
+ "B. From calm to excited.",
+ "C. Consistently calm.",
+ "D. Consistently excited."
+ ]
+ }
+ },
+ "dAYIRizP": {
+ "video_id": "dAYIRizP",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A flock of birds flies against a pale sky, partially obscured by silhouetted leaves in the foreground. The scene transitions to a woman with short dark hair, seated at a desk, gesturing with her hands as she speaks. She wears a white shirt and a patterned vest. Framed photos and books are visible on the shelves behind her. The video then cuts to close-ups of various faces, each with their eyes closed. First, a man with a mustache, then a darker-skinned man with a goatee. Next, a light-skinned man with light brown hair and a goatee speaks, wearing a light-colored shirt and a dark beaded necklace, standing in front of a tree with reddish bark. The sequence returns to more close-ups of faces with closed eyes, now with colorful makeup applied around the eyelids. This transitions to a blurry image of people with hands clasped, seemingly in prayer or meditation. The woman at the desk reappears, still speaking. The screen cuts to black and displays the text, \"An ancient teaching offered a way out.\" The scene changes to a large open-air space covered by a green and red canopy. Rows of white bedding are arranged neatly on the ground. A man in light clothing walks across the space. The camera pans across the area, revealing more rows of bedding. As the lighting shifts to night, the space becomes filled with people in white clothing sitting on the bedding, facing a small stage where a few figures stand. The video then cuts back to the man with a mustache from earlier, who is speaking, followed by a scene where people in white are serving food from large pots. The man with the mustache reappears, looking downward.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What action is the black-haired, bearded man performing with a spoon at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Adding soup.",
+ "B. Stirring ingredients.",
+ "C. Tasting food.",
+ "D. Adding rice."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "In what area does a black-haired, bearded man provide assistance to prisoners in jail?",
+ "answer": "B",
+ "candidates": [
+ "A. Healthcare services are provided.",
+ "B. Assistance with daily living is given.",
+ "C. Emotional support is provided.",
+ "D. Social assistance is offered."
+ ]
+ }
+ },
+ "sWznoNyJ": {
+ "video_id": "sWznoNyJ",
+ "video_duration": "118s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Celebrity Interviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen for approximately one second. Then, text reading \"ANYTHING GOES CLIPS\" appears in a stylized font over a swirling background of brown hues and small, dark particles. This image lasts for about 12 seconds. Then the image switches to a man with short dark hair and a white t-shirt sitting in front of a microphone with his arms folded. There are objects on a shelf behind him including books, a globe, a golden cat statue, and a lit lamp. A logo with the text \u201cJames English\u201d appears in the lower left corner of the screen. This lasts for about 9 seconds until the image switches to a second man. He has short gray hair and is wearing a gray collared shirt. The same logo appears in the lower left corner of the screen. He is also sitting in front of a microphone. This lasts about 18 seconds. Then the first man reappears in the foreground, talking. He is still in the same position with the same backdrop. He gestures with his hand at one point while talking. Then the image switches back to the second man, talking, with the same backdrop. Then the first man reappears, laughing. Then the second man reappears, talking. Then, the video shows the first man followed by the second man several times, both talking. At one point the second man runs his fingers over his forehead, then gestures with his hands. In the final scene of the video, the camera switches to a shot of both men sitting across from one another at a table with a screen divider behind them. Then the camera goes back to the second man. Then finally the text \"ANYTHING GOES CLIPS\" appears once more in the same stylized font as before over a swirling brown background with dark particles. The video then ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the two characters in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Older brother and younger brother.",
+ "B. Superior and subordinate.",
+ "C. Interviewer and interviewee.",
+ "D. Interviewee and interviewer."
+ ]
+ }
+ },
+ "yBPIlkMO": {
+ "video_id": "yBPIlkMO",
+ "video_duration": "80s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view of a white truck with a large white tank on its flatbed, parked in a lot. Next, a young child in a light blue shirt is shown walking. A person reaches out and takes the child's hand. The video cuts to a close-up of a chalkboard sign with white lettering that reads \"Babbs & Eric June 17 2017\". Another chalkboard is shown with a table arrangement written in white marker, with a hand pointing at some of the names. Next, the interior of a wooden structure is shown, decorated with lights and chandeliers, with tables and chairs set for a meal. A woman in a floral dress and the child from earlier are seen near the tables. The video then shows several more close ups of the tables, the decor, and the cakes. There are several white cakes decorated with flowers and a teacup with a tag that reads \"Danielle Carolan\" is shown. A woman is then shown drinking from a blue teacup with the tag visible. The video then shows two women speaking to the camera. The video cuts to a close up of place settings on the tables. Next, a piece of bread and a small open faced sandwich with meat and mashed potatoes on top are shown. The video then shows four men playing various instruments on a stage, including a guitar, banjo, mandolin, and double bass. People are then seen milling around. Several people are shown dancing. A woman stands at a microphone and speaks, while another woman plays a double bass next to her. Finally, people are shown dancing again.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What could be the relationship between the girl filming the video and the bride?",
+ "answer": "B",
+ "candidates": [
+ "A. The bride is the girl's younger sister.",
+ "B. The bride is the girl's aunt.",
+ "C. The bride is the girl's aunt on her mother's side.",
+ "D. The bride is the girl's sister."
+ ]
+ }
+ },
+ "NmnhhAtl": {
+ "video_id": "NmnhhAtl",
+ "video_duration": "90s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with short red hair and a maroon blazer is seated at a desk covered in stacks of paper and sticky notes. A framed photo is on her left. On the wall behind her are various posters. She raises her hand and makes an exaggerated expression, then picks up a phone. She then raises one hand up into the air, still holding the phone. Three other women walk in through a door; one has blonde hair and is wearing a red shirt with suspenders, the other two have dark hair and are wearing a tan jacket and a dark grey shirt. The woman with red hair repeats the same motion as earlier, raising her hand up, then looks down at the desk. Then a woman with blonde hair is seen briefly, followed by the other two women. The red-haired woman raises both hands up as she speaks, then picks up a few of the colored sticky notes from her desk. The three women are shown again; the one in the brown jacket is speaking, followed by the one in the grey shirt. Then, the woman in the red shirt. The red-haired woman is shown again, speaking. The three other women are shown again, looking somewhat bewildered. The woman in the red blazer is shown again, speaking, and then she clasps her hands together on the desk, looking at the camera. The three other women are shown again, looking directly at the camera. The red-haired woman is shown again, then, the three women are shown again. The blonde woman takes off her red shirt, then walks off camera. The blonde woman with a black and white striped shirt, suspenders and a blue skirt puts on a police captain hat. She dances in front of a graffiti wall. She raises her arms, holding a clear plastic box filled with various colorful objects. Then, she holds a black dog wearing a red and black jacket. She removes a purple piece of clothing. She holds up two posters of movies, pointing at them. Then, she puts her hands in front of her mouth. She grabs a small figurine and places it on top of a stack of discs. She then opens her hands. She holds up a series of pictures of men. She opens her hands again. Then, she turns to the camera. Then her body moves to the side.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is a possible occupation for a woman with short hair?",
+ "answer": "C",
+ "candidates": [
+ "A. A purchasing agent.",
+ "B. A business owner.",
+ "C. A headhunter.",
+ "D. An agent."
+ ]
+ }
+ },
+ "ftGqgdIo": {
+ "video_id": "ftGqgdIo",
+ "video_duration": "131s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A Monster High doll stands on a doll stand, wearing a purple jacket over a pink and black dress and pink high heels.\u00a0 A male doll sits on the floor with his legs stretched out, wearing blue jeans and a blue and teal shirt.\u00a0 There are tally marks on the wall.\u00a0 The camera angle changes, focusing on the male doll.\u00a0 He sits up and gestures with his hands. A rolled-up rug leans against a blue moon chair.\u00a0 The Monster High doll places her hand on her hip. The male doll stands up.\u00a0 He walks over to the Monster High doll and stands next to her.\u00a0 They leave through a black door. The scene changes to a room with striped wallpaper.\u00a0 A doll in a white hazmat suit and orange boots stands on a doll stand, wearing pink gloves and a colorful face mask and goggles underneath the hood.\u00a0 A male doll in blue jeans and a blue and teal shirt stands facing the hazmat doll. The Monster High doll stands next to him.\u00a0 The male doll speaks to the hazmat doll and gestures with his hands. He fixes his hair with his hands.\u00a0 A female doll with brown hair and a blue dress sits on a purple chair.\u00a0 The hazmat doll speaks and gestures with one gloved hand.\u00a0 The camera angle changes to focus on the hazmat doll.\u00a0 The doll rubs its gloved hands together. It gestures with one hand, then holds it up in a stop motion. The scene focuses on the room, showing a purple chair, a boarded window with purple sheer curtains, a white table with items on top, a grey staircase, a partially boarded doorway, and a picture hanging on the wall.\u00a0 The male doll touches the hazmat doll's shoulder.\u00a0 The Monster High doll places her hand on his arm. The Monster High doll walks away from the male doll. The hazmat doll turns to follow the Monster High doll.\u00a0 The screen goes black. Text appears on the screen reading, \u201cLater that morning\u2026\u201d The scene changes back to the same room with the striped wallpaper. A blue Monster High doll lies on the purple couch, wearing purple heels, black pants, and a tank top.\u00a0 She sits up and gestures with her hand. The hazmat doll enters, carrying a yellow sponge, and cleans the wall with it.\u00a0 The male doll waves his hands in the air, and the Monster High doll in the purple outfit approaches him, gesturing with her hand.\u00a0 The Monster High doll adjusts the male doll\u2019s hair. He speaks to the hazmat doll.\u00a0 The blue Monster High doll on the couch speaks and gestures.\u00a0 The male doll speaks and gestures towards the Monster High doll in purple, who also speaks and gestures.\u00a0 She gestures towards the hazmat doll. The hazmat doll speaks to the Monster High doll, gesturing with its hand.\u00a0 The hazmat doll raises its gloved hands in a shrug.\u00a0 It gives a face mask to the green Monster High doll, who puts it on.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the short-haired woman confined to the basement?",
+ "answer": "C",
+ "candidates": [
+ "A. The short-haired woman committed an offense and is punished by confinement in the basement.",
+ "B. The short-haired woman had a conflict with a girl wearing white, and as punishment was confined to the basement.",
+ "C. The short-haired woman may carry a virus and requires isolation in the basement.",
+ "D. The short-haired woman is not confined to the basement; it is her living space."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the short-haired woman's emotional state in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. She appears angry.",
+ "B. She appears calm.",
+ "C. She appears happy.",
+ "D. She appears anxious."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the final state of the door in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It is ajar.",
+ "B. The answer is not known.",
+ "C. It is fully open.",
+ "D. It is closed."
+ ]
+ }
+ },
+ "tTHbptSD": {
+ "video_id": "tTHbptSD",
+ "video_duration": "107s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Book Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with red hair and a beard is standing in front of two computer monitors and a textured wall. The person is wearing a gray t-shirt with the phrase \"KIND PEOPLE ARE MY KIND OF PEOPLE\" on it. The person is holding a red book titled \"LOGO MODERNISM.\" They put the book down on a table, then they pick it up and show the front cover again. They show the spine of the book and then turn the book to show the back cover. They then open the book and flip through the pages, showing various graphics and text. The person points at certain graphics in the book while flipping through it. They remove a sheet of paper from the book and display it. They then put the sheet back into the book and hold it closed.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order in which the man in the video presents the books?",
+ "answer": "B",
+ "candidates": [
+ "A. Front cover, author, spine, back cover, front page, front design, content.",
+ "B. Front cover, spine, back cover, front page, author, front design, content.",
+ "C. Front cover, spine, back cover, front page, front design, author, content.",
+ "D. Front cover, front design, back cover, front page, spine, author, content."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "In the latter part of the video, does the man close the book?",
+ "answer": "B",
+ "candidates": [
+ "A. None of the above.",
+ "B. No, he does not.",
+ "C. It is not possible to determine.",
+ "D. Yes, he does."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man turn the pages in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Twenty-nine.",
+ "B. Thirty.",
+ "C. Thirty-one.",
+ "D. Twenty-eight."
+ ]
+ }
+ },
+ "yxvYMCRR": {
+ "video_id": "yxvYMCRR",
+ "video_duration": "242s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with black screen text reading \"and mommy is scared!!!!\" The text is then followed by a shot of a man in a white shirt with black text. He is replaced by a shot of a woman in a red shirt and black apron. They then appear together with the man on the left in a white shirt with a red heart and the woman on the right in a red shirt with a black apron. The man is holding a silver clippers and the woman is holding a silver trimmer. The woman then holds up a black clippers with a yellow section, and then shows the clippers with the blades detached. She then states \"mommy LOVES to talk!\" She then holds up a black trimmer, and a pair of silver scissors, then uses a comb on the man's hair. The video then shifts to her using the black clippers on the man's hair, and then a white comb on his hair. She uses the silver clippers on his hair as well, then the silver scissors. She alternates between the comb, clippers, and scissors. She then uses a black comb and the black clippers on his hair, and the silver scissors on his ear area. She alternates between these implements and then finally holds up the comb and smiles. The woman then grabs the comb and goes over his hair one more time.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of events in the video? (a) The woman accidentally cut the man, (b) The woman explains the reason for cutting hair, (c) The man gives feedback on the haircut result, (d) The woman shows the haircut result to the man, (e) The woman cuts the man's hair.",
+ "answer": "B",
+ "candidates": [
+ "A. (b)(e)(d)(a)(c).",
+ "B. (b)(e)(a)(d)(c).",
+ "C. (e)(b)(a)(d)(c).",
+ "D. (e)(b)(d)(a)(c)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unexpected event occurred during the haircut?",
+ "answer": "D",
+ "candidates": [
+ "A. The hairdressing tool in the woman's hand broke.",
+ "B. The woman accidentally damaged the man's bangs.",
+ "C. The man suddenly felt uncomfortable.",
+ "D. The woman accidentally cut the man."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur to the volume of the background music in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. When characters speak, the background music decreases.",
+ "B. The background music gradually increases in volume over time.",
+ "C. When characters speak, the background music increases.",
+ "D. The background music gradually decreases in volume over time."
+ ]
+ }
+ },
+ "pmorJqxl": {
+ "video_id": "pmorJqxl",
+ "video_duration": "100s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two people, a man and a woman, stand on yoga mats, in front of a banner. The man\u2019s mat is blue, and the woman\u2019s mat is orange. They have their hands together in front of their chests, then they raise their arms up over their heads, then arch back. They bring their arms back down, then bend forward, touching their hands to the ground in front of them. They then lower their knees to the ground into a lunge position. They then lower their chests towards the ground. They then raise their bodies back up. They raise their hips towards the sky. They then lower their knees to the ground. They stretch forward with their arms reaching forward. They come onto their stomachs, then raise their upper bodies upwards. They return to a plank position and push up into a downward dog pose. Then they push up from the downward dog pose and into the plank position, then they bring their knees to the ground. Then, they stretch forward with their arms reaching forward. The people return to a lunge position, then stand. They bend forward, touching their hands to the ground. They then raise their arms up over their heads, then arch back. The man then sits cross legged on his mat. He is wearing a white shirt, and white pants, and a beaded necklace. He holds his hands together. The screen then shows blue and brown dots on a white background.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "When does the music start playing in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. In the middle of the video.",
+ "C. At the end of the video.",
+ "D. There is no music throughout the video."
+ ]
+ }
+ },
+ "oRzhNgpf": {
+ "video_id": "oRzhNgpf",
+ "video_duration": "88s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Astronomy",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A drawing of a house with a dog lying in front of it is shown, with the text \"Chicago, Illinois. USA, 1868\" in the upper-right hand corner of the video. A brown mailbox labeled \"THE HALES\" appears in the foreground. The perspective shifts, and a man with glasses and short brown hair is shown standing in front of a group of people and a red ribbon. A drawing of a large telescope appears above the scene. The scene shifts, and a drawing of a building is shown with people, a red ribbon, and a man in a dark suit. The text \"KENWOOD ASTROPHYSICAL OBSERVATORY\" is displayed above the drawing. A similar scene is displayed, with the text \"MOUNT WILSON OBSERVATORY\" and a different building. The scene shifts again to the same man, a red ribbon, a group of people and the text \"PALOMAR OBSERVATORY\". Another scene is shown with the same man, a red ribbon, a group of people, and a tall tower with the text \"MT. WILSON 150-FOOT SOLAR TOWER\". A newspaper clipping is shown. The same man is shown working with machinery. Confetti falls. A dark blue night sky is shown, with a scattering of stars. A white line drawing of a pyramid is displayed. A white line representing a ray of light travels through the prism, creating a rainbow on a screen. A line of light is shown passing through a flat prism, creating a rainbow on a screen. A yellow sun is shown. A gray cloud appears with the letters \"H\", \"He\", \"Fe\" and \"Ca\" displayed within it. A white rectangle appears with the text \"v=s/t\" inside it. A magnet is shown. A telescope is shown. A ray of light is shown passing through the telescope and reflecting a rainbow on a screen. A magnet is shown waving by the telescope. A sphere is shown being illuminated by the telescope and reflecting a rainbow on a screen. A sphere with two segmented concentric circles surrounding it is displayed against a red background. A view through the interior of the tower is shown.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence of the ribbon-cutting events featured in the video? (a) Telescope. (b) Kenwood Astrophysical Observatory. (c) Mount Wilson Observatory. (d) Palomar Mountain Observatory. (e) Mount Wilson 150-foot Solar Tower.",
+ "answer": "A",
+ "candidates": [
+ "A. (a) (b) (c) (d) (e).",
+ "B. (c) (a) (b) (d) (e).",
+ "C. (c) (b) (a) (d) (e).",
+ "D. (a) (c) (b) (d) (e)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many thumbs-up gestures does Haier display in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Zero.",
+ "B. One.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "wsvmfXZd": {
+ "video_id": "wsvmfXZd",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man and woman embrace in what looks like a hospital hallway. The woman is then shown soaking in a bathtub. Next, her head and shoulders are above water, possibly in the ocean. The man is shown crouched down, looking forlorn, in a messy room. The woman looks at him as if arguing, then they are shown kissing in the hallway, both wearing scrubs. Cut to the woman in the bathtub again, followed by the woman in the water. The man is again seen crouched in the messy room. The couple are shown kissing again, this time more passionately, and the man has one hand on the woman\u2019s head. The couple kisses again, the man wearing a suit and tie, the woman a black sleeveless top. The man in the suit smiles at the woman. They kiss again. The woman rests her head on the man\u2019s shoulder as they embrace in scrubs. The man is shown on a beach, wearing a pale blue, long-sleeved, linen shirt, open at the collar, raising his hands to his forehead. The woman, wearing a white scoop-neck blouse and blue jeans, stands on the beach and places her hand on her chest.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Does the song's volume fluctuate?",
+ "answer": "C",
+ "candidates": [
+ "A. I do not know.",
+ "B. None of the above.",
+ "C. No, the volume remains constant.",
+ "D. Yes, the volume changes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times is the phrase \"I will go down with this ship\" sung in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Two times.",
+ "B. One time.",
+ "C. Four times.",
+ "D. Three times."
+ ]
+ }
+ },
+ "uxMXudbr": {
+ "video_id": "uxMXudbr",
+ "video_duration": "35s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "President John F. Kennedy delivers a speech at a podium in front of a large crowd. The scene transitions to a room filled with people working at computer consoles. A man with glasses and a white shirt looks at the camera. A rocket launches from a platform, surrounded by smoke and fire. People watch and film the rocket launch from a field. Mission control personnel clap and cheer. A family watches the launch on a small television. A crowd in a city street looks up at the sky. A news reporter with glasses talks excitedly, adjusting his glasses and smiling. Grainy footage shows two figures moving on the surface of the moon. Newspaper headlines announce \"Man Walks on Moon\" in English, French, another language that looks like Urdu and one that seems to be traditional Chinese characters. An astronaut plants an American flag on the lunar surface. A lunar rover drives across the moon's surface. The Earth rises above the lunar horizon. A lunar module sits on the moon.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video is the sound of applause heard?",
+ "answer": "B",
+ "candidates": [
+ "A. The beginning section.",
+ "B. The middle section.",
+ "C. Throughout the entire video.",
+ "D. The end section."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Were Japanese newspapers shown in the middle section of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. I don't know.",
+ "B. None of the above.",
+ "C. Yes, they were.",
+ "D. No, they were not."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "In the video, how many distinct groups of people are shown paying attention to the rocket launch?",
+ "answer": "D",
+ "candidates": [
+ "A. Six.",
+ "B. Four.",
+ "C. Three.",
+ "D. Five."
+ ]
+ }
+ },
+ "FHlbeENJ": {
+ "video_id": "FHlbeENJ",
+ "video_duration": "122s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The first portion of the video depicts two people indoors. One person is wearing a red t-shirt, and the other is wearing a black t-shirt. The person in the black shirt is holding a long, thin structure. This structure appears to have several lights and wires attached to it. The two people appear to be speaking to each other. Later in the video, a person wearing a red hoodie can be seen outdoors, holding a controller. There are several large pipes and metal structures visible. There is a flying structure visible in the air, which is also lit with lights. The perspective changes to that of the flying structure, revealing that it is a multi-rotor configuration with a perimeter frame, moving around various outdoor industrial locations. Later, it returns to the view of the person holding the controller, and then focuses on various aerial views of the flying structure as well.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occurred in the state of the ring-shaped drones in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The drones transitioned from a state of motion to rest and then back to motion.",
+ "B. The drones changed from a state of rest to motion and then remained in motion.",
+ "C. The drones transitioned from a state of motion to rest and then remained at rest.",
+ "D. The drones changed from a state of rest to motion and then returned to rest."
+ ]
+ }
+ },
+ "UtkTrJmB": {
+ "video_id": "UtkTrJmB",
+ "video_duration": "338s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a green tank top and white shorts is rowing a boat. The camera appears to be attached to the boat, facing towards the person rowing. The person is holding onto a rowing oar with yellow grips. There are other boats with people rowing in the water further away. The sky appears overcast.There are buildings visible on the shoreline in the background. The rowing oar is moving back and forth as the person rows. Water splashes at the ends of each oar stroke. The sun becomes partially obscured by the clouds at times. As the video goes on, the other boats appear to be further and further away. At one point a yellow ball is seen floating in the water. After a while the rower turns their head and looks over their left shoulder and then their right shoulder. At the very end, the rower raises their left hand.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, what is the direction of the other ships relative to the ship in the middle?",
+ "answer": "C",
+ "candidates": [
+ "A. To the right.",
+ "B. Directly behind.",
+ "C. To the left.",
+ "D. Directly in front."
+ ]
+ }
+ },
+ "lNQdrmMn": {
+ "video_id": "lNQdrmMn",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a soccer match. Players wearing white and red/black striped jerseys are visible, as is a goalie in a green jersey. The soccer players run across the grass field, and the video shows multiple instances of players kicking the soccer ball toward the net. In multiple instances, the goalie can be seen diving to block the soccer ball, and in two instances, the ball goes into the net. After a goal, the soccer players celebrate. At one point, the video shows the soccer player who kicked the goal running and yelling toward the crowd in the stands, and then turning and smiling at another player who runs up to him. The camera pans around the soccer field during the video. There are signs that say \"betway\" around the perimeter of the field. There is a red banner that says \"British Heart Foundation\" and \"revivr\" in a portion of the video. The \"afcb tv\" logo appears in the upper-right corner of the screen throughout the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Which team, the white team or the black team, is more likely to win the entire game?",
+ "answer": "D",
+ "candidates": [
+ "A. A draw is more likely.",
+ "B. The white team is more likely to win.",
+ "C. It is not possible to determine.",
+ "D. The black team is more likely to win."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the cheering occur?",
+ "answer": "B",
+ "candidates": [
+ "A. In the earlier part.",
+ "B. Throughout the entire video.",
+ "C. In the later part.",
+ "D. In the middle part."
+ ]
+ }
+ },
+ "uPhymZtz": {
+ "video_id": "uPhymZtz",
+ "video_duration": "167s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman with short white hair, wearing a black dress with small white polka dots, speaks directly to the camera in what appears to be a grand hall. The words \"Nadine Dorries MP\" and \"Secretary of State for Digital, Culture, Media and Sport\" appear in a purple banner at the bottom of the screen. The video cuts to a brief clip from the film *Naked Gun 33 1/3: The Final Insult*. Two men in suits and hats wave to a clapping crowd.The video returns to the woman. She gestures with her hands as she continues speaking. The words \"PARTIES\" and \"OOPS!!\" appear overlaid on the video in large red letters. A thought bubble appears above the woman\u2019s head containing the words \"OH SHIT.\"The video cuts again, this time showing a man in a suit and glasses speaking to the woman. Another man in a suit stands in the background. A statue of a man holding a scroll is visible behind them. The video returns to the interview with the woman, now with a thought bubble above her head with the words \"I DON\u2019T CARE.\" This text also appears below the Channel 4 News logo in the upper left-hand corner.The video shows another shot of the two figures speaking in the hall, then returns to the interview with the woman. The text \u201cGATHERINGS NOT PARTIES\u201d appears in large red letters overlaid on the video. The video then cuts back to the woman, speaking directly to the camera. Another thought bubble above her head shows \"LA-LA-LA I AM NOT LISTENING.\" Several images of old newspaper clippings reporting on Boris Johnson\u2019s past controversies overlay the video. The video cuts between the woman speaking alone and the two figures talking, before finally returning to the woman alone, this time with a thought bubble that reads \"ERROR 404 FILE NOT FOUND\". The video ends with a black screen displaying a white outline of a face with glasses, followed by a pair of eyes and a set of scales.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many instances of comical sound effects were inserted into the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two.",
+ "B. One.",
+ "C. Three.",
+ "D. Four."
+ ]
+ }
+ },
+ "vBoFDzVN": {
+ "video_id": "vBoFDzVN",
+ "video_duration": "47s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with long dark hair and glasses is shown with a light-blue tin can. The person takes the lid off the can, revealing several cookies inside, each in a white paper liner. The can is held up, showing the viewer the cookies. The person takes a cookie, eats it, and then takes a drink from a light-blue mug. The person talks while holding up the can. The camera then shows a close-up of the cookies. Finally, the person speaks again while holding the can.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional change does the woman exhibit after eating the cookies in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From curious to sad.",
+ "B. From curious to angry.",
+ "C. From curious to confused.",
+ "D. From curious to excited."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the cookie box located in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. On the table.",
+ "B. On the ground.",
+ "C. On the woman's lap.",
+ "D. On the woman's hand."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the woman eat cookies in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. Four.",
+ "D. One."
+ ]
+ }
+ },
+ "IgIUTxHY": {
+ "video_id": "IgIUTxHY",
+ "video_duration": "184s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video opens with text on a black background that reads \"5 LAT PULLDOWN VARIATIONS.\" The video then transitions to a person wearing a black tank top who is facing away from the camera and is performing various lat pulldown exercises on a machine in a gym. The camera angle is primarily from behind the person, but at times the camera will switch to a side view as the person lifts. Several other people wearing various colored shirts and pants can be seen in the background throughout the video, as well as additional gym equipment. Different bars and attachments are also shown at different points in the video, and white text at the bottom of the screen labels the type of exercise being demonstrated.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotional state of the man in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Calm and relaxed.",
+ "B. Very excited.",
+ "C. Becoming more impatient.",
+ "D. Increasingly excited."
+ ]
+ }
+ },
+ "UiRMmEIp": {
+ "video_id": "UiRMmEIp",
+ "video_duration": "161s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a competition area with two teams on opposite sides. The teams each have 3 robots. There are multiple purple cubes scattered on the playing surface. There are also angled ramps on each side of the arena. Several individuals are standing at the edges of the playing field. There is a person wearing a yellow shirt standing in the middle of the playing field who is operating a computer of some sort. There are multiple cameras on the scene. The camera angle is that of a birds-eye view of the whole playing field. The two teams are labeled Alliance 6 (blue) and Alliance 2 (red). The teams are scoring points throughout the video by navigating around the playing field and moving the purple cubes. The scoring numbers are displayed on the screen. There are people sitting in stands behind the competition area, watching the action. There are a few people standing on the edges of the court, and a couple people standing next to the person wearing the yellow shirt. There is a split-screen view of the playing area in the last section of the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Who is the man in a black short-sleeve shirt walking back and forth on the track in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The cameraman.",
+ "B. The security personnel.",
+ "C. The commentator.",
+ "D. The referee."
+ ]
+ }
+ },
+ "aPCVbwDg": {
+ "video_id": "aPCVbwDg",
+ "video_duration": "554s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A figure with long brown hair and a blue cap stands behind a table. On the table are many objects, including a box of cereal, two bottles, a mug, several cans, several hot dog shaped objects, a donut, two pizza slices, a smartphone, and a drumstick. There is a pink chair to the left of the table. To the right of the table is a large rectangular box. A hand touches one of the hot dogs and the drumstick, and moves them slightly. The same hand then pushes two bullets forward off the table. The hand holds the box and turns it so the labeling on the front and side is seen. The hand proceeds to open the box and removes white paper packaging. The hand then removes the packaging to reveal many colorful objects. The hand assembles a toy chainsaw, a toy riot shield, a toy nail gun, a toy throwing star, a toy baseball bat, a toy ax, a toy brain, a toy eyeball, an arrow, boxing gloves, and a toy crossbow. The hand then picks up a second figure and lays it in the box before closing it again. The hand then disassembles the chainsaw and fixes a different blade to the tool. The hand then places the boxing gloves on another figure, along with the newly assembled chainsaw. The video ends with a screen showing the text \"My Damn Toys\". There is also a short scene showing two animated characters from Spongebob Squarepants. The scene shows an animated yellow fish in a green shirt, holding up a burger, next to the animated character Plankton. Then a picture shows a moldy and dirty burger.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where were the chicken legs positioned on the pizza at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. At the back.",
+ "B. On the right.",
+ "C. At the front.",
+ "D. On the left."
+ ]
+ }
+ },
+ "EBXdoHXv": {
+ "video_id": "EBXdoHXv",
+ "video_duration": "65s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a green shirt hands a clear plastic wrapped package with red and blue ribbons to a person wearing a black pilot uniform. Another view of the package being handed over. A close up of the pilot looking towards the camera, a person with a camera behind him, and another view of the package. A close up of the pilot speaking. A wide view of the pilot. A woman in a turquoise jacket is handing out items to other people. The woman hands out items to more people. A close up of hands exchanging small rectangular items. A close up of the woman in a turquoise jacket speaking. A wide view of the woman in turquoise handing small rectangular items to people, as people walk by. A low view of people's legs walking down a hallway. An airplane with a Malaysia logo on the side sits on the runway. Another view of the plane on the runway. A plane is shown landing at dusk. A close up of a man in a black suit and tie speaking. Another close up view of the same man speaking. Another close up of the man speaking. A plane is shown landing in the daytime. Another shot of the same plane landing. An airplane with a Malaysia logo taking off. Another view of the plane taking off. Another view of the same plane taking off. The man in the suit speaks again. The same man speaks again. The man speaks again. An airplane with a Malaysia logo is shown from the cockpit. An airport authority car is shown driving on the runway. More airport cars and a bus drive behind the authority car. Multiple people wearing lime green reflective vests are shown watching an airplane land in the distance. More views of the people in lime green watching the planes land. A Malaysia airplane is shown taking off. Another view of the plane taking off. People in lime green vests watch the planes take off. A stylized white silhouette of an airport is shown against a blue background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the airplane taxi in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Five times.",
+ "B. Four times.",
+ "C. Three times.",
+ "D. Two times."
+ ]
+ }
+ },
+ "LdNwMILI": {
+ "video_id": "LdNwMILI",
+ "video_duration": "92s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Software",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a white modem with several indicator lights on the front panel. The \"internet\" light is not lit, and a red circle is drawn around the space where the light is expected. A green text overlay reads, \"Modem has No internet light.\" Then, the video cuts to a man in a gray t-shirt, and the man's face is shown for an extended time as he talks. Next, the video shows the back of the same modem, with multiple black and grey cables and one blue cable plugged into it. A hand reaches into the frame and presses the modem's power button. Then, the hand removes the power cord from the back of the modem and then re-inserts it. Again, the hand presses the modem's power button. The video cuts back to the same man again and shows his face for an extended time as he talks. Finally, the video cuts back to the front of the modem with multiple green lights on and the \"internet\" light still off.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, which of the following sequences does the man use to troubleshoot the modem issue?",
+ "answer": "B",
+ "candidates": [
+ "A. Unplug the power adapter, turn on the power switch, reinsert the power adapter, and wait for the modem to start.",
+ "B. Turn off the power switch, unplug the power adapter, turn on the power switch, reinsert the power adapter, and wait for the modem to start.",
+ "C. Turn off the power switch, unplug the power adapter, reinsert the power adapter, and wait for the modem to start.",
+ "D. Turn off the power switch, unplug the power adapter, wait for the modem to start, and reinsert the power adapter."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the television in relation to the man in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The TV is in front of the man.",
+ "B. The TV is to the man's right.",
+ "C. The TV is between the man and the wall.",
+ "D. The TV is to the man's left."
+ ]
+ }
+ },
+ "Sweoflyp": {
+ "video_id": "Sweoflyp",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Salukis are released from starting boxes on a dirt racetrack. People in white robes stand by the boxes as the dogs begin running. Spectators in traditional attire watch from behind a fence on the left. The dogs run toward the right side of the frame, with a white SUV and other cars driving alongside the track. A timer in the lower right corner of the screen counts up from zero. As the race progresses, the pack thins out, and two dogs emerge as the clear frontrunners, a dark-colored dog and a lighter-colored dog with a patterned cloth on its back. The two leading dogs continue running with the cars pacing them, with the lighter dog gradually pulling ahead. More dogs trail behind at a greater distance. The road alongside the racetrack has a fence next to it, and parked vehicles and spectators can be seen beyond the fence. The lighter-colored dog maintains its lead and approaches what appears to be the finish line.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the first half of the video, are the three people positioned on the left or right side of the track?",
+ "answer": "B",
+ "candidates": [
+ "A. The three people in the first half of the video are in the middle of the track.",
+ "B. The three people in the first half of the video are on the right side of the track.",
+ "C. The three people in the first half of the video are on the left side of the track.",
+ "D. The three people in the first half of the video are at the left rear of the track."
+ ]
+ }
+ },
+ "JlhoxYuh": {
+ "video_id": "JlhoxYuh",
+ "video_duration": "92s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A weathered wooden cabin sits on a grassy slope, with a wooden well in the foreground and trees and hills in the background. A person in jeans and a white shirt stands at the bottom of the cabin's steps. The camera moves closer to the cabin, showcasing its aged wooden planks and a small covered porch. Two windows are visible, one on the second floor and one on the first floor. The camera moves along the porch, where an old rocking chair is seen at the far end. Inside the cabin, clothes hang on a wall rack. An ironing board with an iron rests nearby. Dresses hang on hooks in a sparsely furnished room with a window and door. More clothes are seen hanging from a line near a window. In the kitchen area, a white teapot sits on a small table, a double-burner stove holds a few eggs in a dish, a small creamer, and a can. The walls are covered in newspaper clippings. Various cooking utensils hang on the walls and shelves. Pots and pans are visible. A black and white photograph depicts a family posed together. The family includes a man, a woman, and several children of varying ages. More kitchen details are shown, including hanging pots, pans, strainers, bowls, and jars on shelves. A bedroom is visible, featuring newspaper-covered walls, a single bed, a rocking chair, a sewing machine, a window, and a wooden dresser. A man in a dark shirt and jeans is partially visible in the foreground. A small, gray outbuilding bears a sign reading \u201cLoretta Lynn\u2019s Coal Mine No. 5\u201d. A sign by the open door warns of danger and prohibits smoking. Inside, mining equipment is dimly visible. The entrance to the Loretta Lynn Coal Miner\u2019s Daughter Museum is shown. A stylized image of Loretta Lynn is displayed above the entryway. A long, glittering pink gown hangs near the entrance. Inside the museum, various signs including \"Danger No Smoking\" and \"Close Clearance\" are visible. A plastic curtain leads into a dimly lit replica of a mine shaft where a figure of a miner kneels, holding a mining tool. A silver tour bus with a painted design is shown. \"Coal Miners\" is written across the front. A large white house with columns stands on a grassy lawn. People are visible on the porch. Snow falls on the house and surrounding lawn in a different shot. A woman wearing a yellow top and a jeweled collar is seen speaking. The footage appears to be older. A walkway leads to a large white house with a covered porch and columns. Shrubs line the walkway. A closer view of the house shows white wicker chairs on the porch and various plants in the garden. The interior of a house features patterned wallpaper, a chandelier, and a dining table set for a meal. A doorway leads to a bathroom with green tile. The camera moves closer to the dining area. The video ends with snow falling over the white house with columns.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is Loretta's father's profession?",
+ "answer": "B",
+ "candidates": [
+ "A. A driver.",
+ "B. A miner.",
+ "C. A construction worker.",
+ "D. An engineer."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the girl with two braids and a skirt in the video and Loretta?",
+ "answer": "C",
+ "candidates": [
+ "A. They are colleagues.",
+ "B. They are classmates.",
+ "C. They are sisters.",
+ "D. They are friends."
+ ]
+ }
+ },
+ "GmnobDZn": {
+ "video_id": "GmnobDZn",
+ "video_duration": "69s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a close-up of a yellow piece of paper with four letters written on it in black marker. Two 'T's are stacked vertically above one 'C'. There is a horizontal line below the top 'T'. There is a blur at the bottom left and bottom right of the paper. The scene then changes to an overhead shot of multiple red petri dishes on a metal tray. The words \"Harness the Data\" are superimposed over the image. The scene changes to a machine that appears to have many small vertical pins that descends and touches a tray. The words \"Harness the Data\" are again superimposed over the image. The video then transitions to a closeup view of several circular petri dishes. The scene shifts to a robotic arm moving above a tray of small white tubes, dispensing a liquid. The scene changes again to an older man with white hair talking in front of a background that shows a grid with blue and purple boxes. The same man speaks again with the same background. A machine is shown performing an action with a tray of small circular test tubes. Then a tray with small test tubes moves across the frame. A blurred view of a shelf of equipment is shown, followed by a view of a computer screen. Then a screen with text is shown, followed by another screen with text. The scene changes to a black and white shot of a person in a white shirt looking at a document. The same shot is repeated multiple times. The video transitions to a man with balding hair wearing a jacket in front of the blue and purple grid background. His name and title are displayed as Maynard Olson, Professor Emeritus, Medicine and Genome Sciences, University of Washington. A vintage computer is then displayed with a dark screen, light grey keys, and a black disk drive. The video shows a black and white shot of two men at a lit table looking at a document. The same man with balding hair appears again in front of the blue and purple background. The same man appears again and speaks in front of the same background. Then a man with grey hair and glasses speaks in front of the blue and purple background. The video moves to a computer with a white frame, a small white box, and blue on the screen. A panning shot of a room full of computer workstations is shown, with one person in a white lab coat walking through the room. A woman in a lab coat works on a computer. The video zooms in on a computer screen with data displayed on it. An overhead shot of a lab with several workstations is shown. The scene changes to rows of letters in a dark background. The text appears to move forward. A 'g' is displayed on the screen with light beams emanating out from the center. The man with balding hair wearing a jacket is shown again in front of the blue and purple grid background. The same man is then shown speaking in front of the blue and purple background. The same man appears again and speaks in front of the same background. The video ends with a fade to white.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who states in the video, \"Without computers, we would be at a loss\"?",
+ "answer": "B",
+ "candidates": [
+ "A. A man with white hair, a white beard, and glasses.",
+ "B. A man with gray hair wearing a black coat.",
+ "C. A bald man with a white beard.",
+ "D. A bald man wearing glasses."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many people are speaking in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. Five.",
+ "D. Four."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why should biology be integrated with computer science in research?",
+ "answer": "A",
+ "candidates": [
+ "A. Because the data becomes more accurate.",
+ "B. Because conducting experiments is challenging.",
+ "C. Because it allows access to more specialized knowledge.",
+ "D. Because it is easier to share experimental data."
+ ]
+ }
+ },
+ "roHillMF": {
+ "video_id": "roHillMF",
+ "video_duration": "85s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A white dog wearing a black harness with colorful text is on a purple and green floor. A woman wearing a purple jacket and black pants walks the dog on a leash from a purple section to the green section of the floor. The woman removes the leash from the dog, and places a toy on the dog\u2019s neck. The woman stands in front of the dog and the dog sits, looks up, and licks its face. The dog sits in the middle of the green floor, next to a black and orange duck and an orange toy. The woman runs across the green floor, and the dog runs in different directions through various obstacles. The obstacles include jumps, tunnels and an A frame. The dog finishes the course and runs back to the woman. The woman puts the leash back on the dog, then puts the harness back on the dog. The video concludes with a shot of the dog jumping through a purple and yellow ring.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does Rudy perform the action of digging the tunnel for the second time?",
+ "answer": "B",
+ "candidates": [
+ "A. The end of the video.",
+ "B. Throughout the entire video.",
+ "C. The middle of the video.",
+ "D. The beginning of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event occurred during the puppy competition?",
+ "answer": "A",
+ "candidates": [
+ "A. Experienced an error.",
+ "B. Began to bark wildly.",
+ "C. Remained completely still.",
+ "D. Completed an extremely difficult maneuver."
+ ]
+ }
+ },
+ "hYVzvKzs": {
+ "video_id": "hYVzvKzs",
+ "video_duration": "69s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with a beard and a black baseball cap is visible in the first few seconds, looking at the camera. They are in a room with a shelf of storage bins and some aquariums. The camera then pans to show multiple aquariums stacked on shelves. Each aquarium is filled with water and various plants. Some contain small fish and snails. In one tank, a black sponge-like object is visible. The camera zooms in on this object, showing small white worms attached to it, as well as snails nearby. A person's finger points at the sponge, and then at other areas of the tank. A container of brown and tan fish food is held up to the camera, and a small amount is sprinkled into the tank, as well as into a tank with plants on the surface. Another container of brown and tan food is also held up, and a small amount is sprinkled into a tank with plants on the surface. The camera then shows a tank with a black sponge-like object, with a finger pointing at it, showing a snail nearby, and a collection of snails next to it. The camera then pans to other tank views.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many different kinds of food did the man place in the bathtub?",
+ "answer": "C",
+ "candidates": [
+ "A. Three types.",
+ "B. Four types.",
+ "C. Two types.",
+ "D. One type."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the purpose of the black foam ring shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Absorbing impurities from the water.",
+ "B. Filtering the water.",
+ "C. Decorating the fish tank.",
+ "D. Transferring worms."
+ ]
+ }
+ },
+ "bMUvHHkR": {
+ "video_id": "bMUvHHkR",
+ "video_duration": "372s",
+ "duration": "6-8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Book Reviews",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person with light blonde hair wearing a pink top is in front of a wall with many items. The person talks while making various hand gestures, and at one point she holds up a book titled \"I Hunt Killers.\" She shows the book from different angles, then puts it down. She finishes talking, then waves.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many characters are mentioned in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Five.",
+ "D. Six."
+ ]
+ }
+ },
+ "oScelWEH": {
+ "video_id": "oScelWEH",
+ "video_duration": "153s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a black and white silhouette of a person in a suit and hat with their arms raised against a white background. Below this image is a gray banner containing the words \"MJ VideoArchive\". The video transitions to a shot of a large crowd of people with their arms raised. The camera pans across the crowd, showing the people from different angles. The view changes to show people sitting on bleachers, with some standing. One shot shows a boy and a woman sitting with the woman's arm around the boy. Some people have headbands and are wearing casual clothing. The video pans across the crowd again and then to the stage. The stage has a dark metal background with a light colored floor. Several people are on stage. The view zooms in on one person in the middle of the stage, with two dancers beside him. He is wearing dark pants and a dark jacket with white designs. He is holding a microphone and is singing. The camera zooms in on the face of the person with the microphone. He is then shown with a group of other people who also have microphones. A large crowd of people is shown again with the stage in the background. A banner reading \"Michael Jackson\" is shown. A white building is seen next to the banner. People are shown in a close-up shot and then the person with the microphone is shown singing again. The camera then shows the stage from different angles with the people on stage. People in the audience are shown from a distance again. Finally, the camera closes in on the person with the microphone, who is singing and moving, with the other people in the background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many backup dancers were on stage during Michael Jackson's appearance in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two dancers.",
+ "B. Three dancers.",
+ "C. Four dancers.",
+ "D. Five dancers."
+ ]
+ }
+ },
+ "UNPwVpfz": {
+ "video_id": "UNPwVpfz",
+ "video_duration": "118s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two boxers are in a boxing ring, facing each other. One boxer wears white and gold trunks and black gloves. The other boxer wears red trunks and red gloves. A referee in a white shirt stands between them. The boxer in white and gold throws a left jab, then a right hook that lands on the red boxer's head. The red boxer staggers back. The white and gold boxer pursues, landing another right hook. The red boxer falls to the canvas. The referee checks on him, then waves his hands, signaling the end of the match. The red boxer remains on the canvas. The referee raises the winning boxer\u2019s right hand. The winning boxer raises both hands in victory. He embraces a man in a black shirt.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How did the audience react when the black boxer was knocked down in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. They displayed disappointment and excitement.",
+ "B. They were consistently excited.",
+ "C. They were consistently tense.",
+ "D. They were anxious and excited."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "In what manner did the boxer wearing red gloves knock out the boxer wearing black gloves?",
+ "answer": "B",
+ "candidates": [
+ "A. With a right hook.",
+ "B. With a left hook.",
+ "C. With a straight punch.",
+ "D. With a downward elbow strike."
+ ]
+ }
+ },
+ "nGZPRohb": {
+ "video_id": "nGZPRohb",
+ "video_duration": "93s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows two firefighters in dark blue clothing and red helmets using a hose to spray water onto a fire. It then shows a firefighter silhouetted in front of a burning fire. The video continues by showing five people walking along a hillside. Next, multiple trucks are parked on a tarmac while a plane flies overhead and sprays a cloud of water towards the ground. A firefighter wearing a yellow helmet holds a hose, while a yellow and red fire plane passes in front of a red firetruck. Then a man in a blue shirt is shown speaking in front of a yellow and red plane. A dried up lake bed is shown, first as an aerial view and then close-up. A truck with a large tank drives down a street, before the video shows people filling large water bottles from the tank. A man in a blue shirt fills a water bottle from a tank. Next, an aerial view of a dry lakebed and water dam is shown, with multiple people standing on top. Another truck is shown with its tank, then people are again shown filling plastic bottles from a tank. A street is then shown with people walking and sitting at outdoor cafes, with the temperature displayed as +34\u00b0. A couple is shown sitting by a fountain while talking. The video ends with additional shots of people walking in the street and a large fire burning through trees. The video features text overlays at the bottom with the words \u201cHVM NEWS\u201d, \u201cUNUSUAL SPRING HEATWAVE IN SPAIN\u201d, \u201cGROWING CONCERN OVER UNSEASONAL WARM SPELL IN SPAIN\u201d, \u201cSPAIN HEATWAVES BRING RECORD TEMP\u201d, \u201cHEATWAVE IN SPAIN TO DRIVE TEMPERATURE ABOVE 40\u00b0C\u201d, \u201cSPAIN PLEADS FOR EU CRISIS FUNDS\u201d, \u201cHEAT WAVE GRIPS SPAIN\u201d, \u201cFUNDS AS DROUGHT HITS FARMERS\u201d, \u201cSPANISH TOWN GETS WATER TRUCKED IN AS TEMPERATURES PEAK\u201d, \u201cDROUGHT FEARS WORSEN IN SPAIN\u201d, \u201cSPAIN BRACES FOR A WEEK OF SCORCHING HEAT\u201d, \u201cSPAIN FACES UNUSUALLY HOT TEMPERATURES FOR APRIL\u201d, \u201cAT LEAST 27% OF COUNTRY IN DROUGHT, WATER AT 50% CAPACITY\u201d. The video also includes a date and location in the top corner of the screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many types of firefighting tools are shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four types.",
+ "B. Two types.",
+ "C. Six types.",
+ "D. Eight types."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a white water truck parked at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Uncertain.",
+ "B. None of the above.",
+ "C. No.",
+ "D. Yes."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What is the reason residents are collecting water in buckets?",
+ "answer": "D",
+ "candidates": [
+ "A. A grassland fire causes a lack of groundwater.",
+ "B. Factory pollution causes a lack of groundwater.",
+ "C. A pipe burst causes the water supply to be suspended.",
+ "D. A forest fire causes a lack of groundwater."
+ ]
+ }
+ },
+ "ELRYYncZ": {
+ "video_id": "ELRYYncZ",
+ "video_duration": "223s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A young woman with her hair in a bun stands in a bathroom, gesturing with her hand. The scene changes to a young man lying on a gray couch, looking at his phone. He throws his head back and laughs. The scene changes again to a younger girl standing in a doorway, holding a microphone and singing. She kicks her leg back and forth as she sings. The young man on the couch laughs hysterically. The girl continues to sing, peeking around the doorway. The young man on the couch makes faces and gestures, appearing annoyed. The girl jumps on the couch next to him, throwing her legs over the armrest. She grabs the remote and holds it. The young man makes exaggerated faces at her. The girl picks up the dropped phone and uses it. The young woman enters the room from the bathroom and approaches the couch. The young man frantically types on his phone. The girl on the couch starts singing again, and the young woman puts her hand to her head in frustration. The young man throws his head back and laughs. The girl cries on the couch. The young man continues to make faces at the girl as she cries, then picks up the phone and dials. He talks on the phone. The young woman raises her hands in exasperation. The girl sings into the microphone, then picks up the hairbrush and brushes her hair while singing. The young man reacts in surprise and laughs. The young woman yells at the girl, who is now crying and brushing her hair while she sings. The young man is still talking on the phone. The two young women have an argument about singing. The younger girl sings into the microphone. The young woman reacts in annoyance and then leaves. The man hangs up his phone as the younger girl continues singing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Whose phone rang in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Alex.",
+ "B. Their mother.",
+ "C. Their father.",
+ "D. Casey."
+ ]
+ }
+ },
+ "cznYACpP": {
+ "video_id": "cznYACpP",
+ "video_duration": "91s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two hands are shown manipulating a gray and black mechanical device with number displays. The hands are pressing on the number display and pushing the metal lever on the side of the device. The device has two numbers displayed, the right one showing the numbers \"341044\". The hands press the lever and the right set of numbers changes and becomes \"341045\". They press on the number display again, and the number on the left changes to \"87\". The hands press the metal lever on the side again, and the right number display changes again. The hands manipulate the device, opening a panel to reveal the internal workings. The hands then remove a series of parts from within the device including a metal cylinder and a plastic cartridge containing a small roll. The device is then closed.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order does the man remove the ink cartridge?",
+ "answer": "D",
+ "candidates": [
+ "A. Rotate the button on the left - slide the middle clip - enter the password - remove the ink cartridge.",
+ "B. Slide the middle clip - rotate the button on the left - enter the password - remove the ink cartridge.",
+ "C. Rotate the button on the left - enter the password - slide the middle clip - remove the ink cartridge.",
+ "D. Enter the password - rotate the button on the left - slide the middle clip - remove the ink cartridge."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many times did the man rotate the left button in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Zero.",
+ "B. Two.",
+ "C. Four.",
+ "D. One."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the function of the ink cartridge in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It records voice.",
+ "B. It records data.",
+ "C. It decorates the interior.",
+ "D. It is completely useless."
+ ]
+ }
+ },
+ "mrJDxSwO": {
+ "video_id": "mrJDxSwO",
+ "video_duration": "64s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with blonde hair, wearing a black sweatshirt with a white logo on the front left side, is in a kitchen with white cabinets. They are looking at the camera and moving their head and talking. They then show a piece of paper with a drawing of five figures and text on it. Then the person with blonde hair is in the kitchen again talking, and is then shown holding a thermostat which says 56 degrees. Then a person with light blonde hair, wearing a grey hoodie, is in front of a sign and shelves with bottles. They are talking and give a thumbs up. They are back in the kitchen again talking, then the person is wearing a grey hoodie with writing on the front and is at a stationary bike, looking down. They lift their head to look at the camera and move out of the frame onto the bike.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "When thinking about coffee, what is the woman\u2019s mood?",
+ "answer": "C",
+ "candidates": [
+ "A. Joyful yet melancholy.",
+ "B. Depressed and sorrowful.",
+ "C. Happy and excited.",
+ "D. Sad and disappointed."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many people appear in the upper part of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. One.",
+ "B. Seven.",
+ "C. Three.",
+ "D. Five."
+ ]
+ }
+ },
+ "PCJzPGGw": {
+ "video_id": "PCJzPGGw",
+ "video_duration": "179s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "There are three people in the video. One person is wearing a black tank top and red shorts. Another person is wearing a green t-shirt with a smiley face and a maroon baseball cap. The third person is wearing a navy blue t-shirt with a white star and light blue shorts. The person wearing the black tank top grabs a green cup and holds it up to the person in the navy blue shirt's face. The person in the navy blue t-shirt mixes something in a white bowl using a white straw. The person in the black tank top picks up a bottle and pours it into the bowl. The person in the green t-shirt is mixing something in the bowl using a white spoon. The person in the black tank top uses both of their hands to rub something onto the bowl. The person in the green t-shirt starts to run through the house and the person in the navy blue shirt begins walking with their arms outstretched. The person in the navy blue t-shirt continues to walk, picking up a white object and then they drop to their hands and knees. They start rubbing the white object on the floor and they try to crawl under a black line that is on the floor. The person wearing the green shirt also gets on their hands and knees, and rubs something onto the floor next to the person wearing the navy blue shirt, and then the person in the navy shirt stands up and runs away, followed by the person in the green t-shirt who is crawling on the floor. Both people can then be seen running through the house, and the person in the green shirt can be seen rubbing something onto the floor while walking through the house. The person in the black tank top appears to be recording and turns the camera up towards the ceiling, panning across it, and then the person wearing red shorts walks out of frame.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video, how many times does the boy wearing a Captain America T-shirt go into the bathroom?",
+ "answer": "D",
+ "candidates": [
+ "A. Three times.",
+ "B. Once.",
+ "C. Four times.",
+ "D. Twice."
+ ]
+ }
+ },
+ "ZuBxXVvg": {
+ "video_id": "ZuBxXVvg",
+ "video_duration": "312s",
+ "duration": "4-6min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with dark curly hair and a black t-shirt walks into a room with a bed. They straighten the covers, fix the pillows, and move a blanket. They then pick up a light-colored, book-shaped object, and look inside of it. Afterwards, they go to another room with a computer and other objects and sit in a large chair. They continue to look at the book-shaped object. Another person with dark hair in two pigtails, a light blue shirt and a red backpack walks into the room with the bed. They pick up a pillow and then use their phone while standing near the bed. They continue to look at the phone. They then take their phone from the backpack and continue to look at it. They start to speak while looking at their phone. The person in the chair continues to look at the book-shaped object and talk. They are then shown holding the book-shaped object while in the chair and talking. The person by the bed also talks while looking around the bed. The person by the bed picks up a phone and begins to talk while looking at the phone. This person starts talking while laying down on the pillow. They start moving while still laying down and talking. The person in the chair continues talking while holding the book-shaped object. This person starts using a cell phone. The person by the bed starts thrashing. They then grab their pillow and start thrashing. The person in the chair continues to talk while holding the book-shaped object. The person by the bed continues to thrash. The person in the chair puts down the book-shaped object and continues to talk while holding a cell phone. The person by the bed continues to move. A text message pops up on the phone of the person sitting in the chair. The person by the bed makes a shocked expression, and continues to move on the bed. The person in the chair sits up, and then moves to another room, carrying the book-shaped object. They sit on the bed with their feet up, holding the object, and using their phone. The person on the bed is shown next. The person in the chair uses their phone and then tosses the phone aside and goes back to the room with the bed, carrying the object. The person on the bed has their face pressed into the pillow and is shown with their eyes wide and looking forward. The person in the doorway is shown with the object. The person on the bed continues to have their face pressed into the pillow while looking forward. The person in the doorway is again shown. The person is shown again laying face-down on the pillow. The person is shown again in the doorway with the book shaped object. The person on the bed continues to lay with their face against the pillow.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the man's emotional state at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Shocked and angry.",
+ "B. Helpless and guilty.",
+ "C. Excited and tense.",
+ "D. Surprised and guilty."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "Did the little girl forgive the man in the end?",
+ "answer": "B",
+ "candidates": [
+ "A. It is unknown.",
+ "B. No, she did not forgive him.",
+ "C. Yes, she forgave him.",
+ "D. They are still in a cold war."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "In the end, can the girl retrieve her phone?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I don't know."
+ ]
+ }
+ },
+ "PmbSvrYy": {
+ "video_id": "PmbSvrYy",
+ "video_duration": "72s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Biology",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A long, thin object with colored segments moves across a light blue background. It then rotates and appears vertically against a purplish surface with many small, concave openings. The object then appears to extend further, and a dark vertical line appears to be its other side. It then rotates vertically, appears against the light blue background, and bends. A tan, textured object with a small opening appears, and the long object moves through the opening. Another texturized object and the long object are labelled as Ribosome and Messenger RNA respectively. Small objects, each with a single, colored stalk and a small sphere at the top then move into view. The single-stalk objects are labelled as Amino acid, and a cluster of four linked single-stalk objects is labelled as Transfer RNA. More single-stalk objects and clusters of them then appear, followed by a cluster of spheres linked in a line, and the tan textured object appears again. A similar tan textured object then appears, and is labelled as Protein.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "Which substances are required for protein production?",
+ "answer": "C",
+ "candidates": [
+ "A. Transfer RNA molecules, bases, amino acids.",
+ "B. Messenger RNA, ribosomes, transfer RNA molecules, amino acids.",
+ "C. Messenger RNA, ribosomes, transfer RNA molecules, bases, amino acids.",
+ "D. Messenger RNA, ribosomes, transfer RNA molecules."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the function of messenger RNA?",
+ "answer": "B",
+ "candidates": [
+ "A. Carries the genetic code for protein synthesis.",
+ "B. Is read in groups of three bases.",
+ "C. Binds to ribosomes.",
+ "D. Encodes the sequence of amino acids."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "Please briefly describe the process of protein synthesis.",
+ "answer": "D",
+ "candidates": [
+ "A. Messenger RNA is read in three-nucleotide codons, ribosomes read the messenger RNA code, and transfer RNA delivers amino acids. The amino acid chain folds to form proteins.",
+ "B. Transfer RNA transports amino acids, messenger RNA is read in triplets, and ribosomes read the messenger RNA code. The amino acid chain folds to form proteins.",
+ "C. Messenger RNA is read in three-base codons, the amino acid chain folds to form proteins, transfer RNA delivers amino acids, and ribosomes read the messenger RNA code.",
+ "D. Ribosomes read the messenger RNA code, and transfer RNA delivers amino acids by reading the three-nucleotide codons. The amino acid chain folds to form proteins."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What are the changes in state from amino acids to proteins?",
+ "answer": "A",
+ "candidates": [
+ "A. Amino acid chains are converted into proteins.",
+ "B. Amino acids form amino acid chains and ultimately become proteins.",
+ "C. Amino acids are converted into ribosomes.",
+ "D. Amino acids form amino acid chains."
+ ]
+ }
+ },
+ "TXnLoLuU": {
+ "video_id": "TXnLoLuU",
+ "video_duration": "71s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person in a white spacesuit stands on a platform while multiple people dressed in black, some with masks, adjust equipment around them. There is a large pool of water visible behind them. In the next scene, a group of seven people are seated around a table inside a room, many wearing masks, with a laptop, water bottle, papers, and other items on the table. Following that, four people in blue jumpsuits stand in front of a white airplane with the registration number N95NA; there is a black sign to the left with four names on it. A rocket launches with large amounts of white smoke near a coastal area. The rocket is next seen reaching a high altitude and separating into pieces, which continues to ascend past a planet into space. A large launch tower and rocket with a white tip is next shown. There is a view from the inside of the spaceship with two people in white space suits and blue screens visible, with hands moving within the vehicle. A robotic arm is shown with earth in the background and a small, round object is visible in the distance. The camera moves closer to the object. There is then a view of the International Space Station. A space capsule docks with a port on the space station, and the docking process is shown up close, along with views of the inside of the space station with people floating in zero-g, and exiting through a portal. The people are then seen embracing.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What structural changes occur to a rocket after it reaches orbit?",
+ "answer": "C",
+ "candidates": [
+ "A. The tail flame changes.",
+ "B. Debris falls away.",
+ "C. The fairing separates.",
+ "D. No change occurs."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "How many crew members were originally stationed on the space station?",
+ "answer": "C",
+ "candidates": [
+ "A. Zero.",
+ "B. One.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "SYYvkZZA": {
+ "video_id": "SYYvkZZA",
+ "video_duration": "39s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "On a stage, there are six men and one woman wearing blue outfits. The men raise their right arms. One of the men holds a microphone and then makes a kicking motion with his leg. The camera focuses on a man playing guitar and another man speaks to the camera. The man speaking is wearing a blue polo shirt with a logo on it. The camera returns to a man in a blue suit who is singing into two microphones, and moves his right arm out. There are now six men on stage in blue suits, each holding a microphone and with arms raised above their head. A second set of singers, dressed in red, are shown on a tiered stage. One of the women in red is shown singing into a microphone. A band is shown, the members are wearing white suits and playing musical instruments. An audience is shown clapping. The camera returns to the stage where there are now three women in white and gold outfits and matching hats and boots. They are sitting on stools, singing into microphones.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the scene with applause appear?",
+ "answer": "D",
+ "candidates": [
+ "A. The applause occurs in the middle of the video.",
+ "B. The applause occurs throughout the entire video.",
+ "C. The applause occurs at the beginning of the video.",
+ "D. The applause occurs near the end of the video."
+ ]
+ }
+ },
+ "TvSQaalJ": {
+ "video_id": "TvSQaalJ",
+ "video_duration": "84s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person in a dark suit with a light-colored tie is sitting at a desk. The person picks up a black object with a gold microphone attached, and then moves it to their ear. The person then puts the object back down. The person then picks up a white piece of paper and looks at it. The person makes a gesture with their hand, then points their index finger, and opens their mouth. There is a skeleton with glowing blue eyes, wearing a grey suit and a yellow tag that says \"GEOFF\". The person then sits back down, looking at a piece of paper. The skeleton reappears. A brown figure with a red bandana appears in a doorway. The person laughs and waves their hand. The person talks and gestures with their hands. The person picks up the black object again and moves it to their ear. The person then puts the object down on the desk.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What background music plays during the middle section of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Rock music.",
+ "B. Horror music.",
+ "C. Light music.",
+ "D. Classical music."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the phone positioned for the host?",
+ "answer": "A",
+ "candidates": [
+ "A. On the left side.",
+ "B. Directly in front.",
+ "C. On the right side.",
+ "D. Behind."
+ ]
+ }
+ },
+ "FkSVwjdv": {
+ "video_id": "FkSVwjdv",
+ "video_duration": "89s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A young woman with long brown hair and a black beret speaks directly to the camera. She wears a plaid shirt. The background is slightly blurred, showing an orange couch, a large window, and some plants. The background then changes to plain white. Images of various people wearing face masks appear around the woman. The video then transitions to a montage of pictures of people wearing face masks in different settings. The pictures are shown in a dynamic collage style against a peach-colored background. Next, the video shows two women sitting at a white desk, using a sewing machine and working with colorful fabrics. A TV screen displays a video in the background. The scene changes to a split-screen showing three different shots of people making face masks. In the first shot, a person cuts fabric on the floor. The second shows someone sewing at a machine. The third features two people organizing piles of completed masks. Another series of split-screen shots follows, displaying different people, young and old, using sewing machines to create masks. Cardboard boxes filled with fabric scraps and masks are also visible. A young woman is shown sitting outside, wearing sunglasses and using a sewing machine. Close-ups of handmade face masks are shown, including a pink unicorn-themed mask and piles of finished masks with ties. A sewing machine is actively stitching a green patterned mask. The background becomes peach-colored. Three animated circles appear one by one. The first contains a drawing of a face mask, labeled \"01 Face mask\". The second circle shows \"10 million\" and is labeled \"02 Three Days\". The third circle depicts a crossed-out virus and is labeled \"03 Repress Infection\". An animated graphic shows a profile view of a person's head wearing a mask. The text \"95 - 100% droplets caught\" appears to the right. Small purple dots representing droplets appear near the mask but don't pass through. A zoomed-in cross-section of the mask fabric with the droplets caught is shown. The hashtag \"#Masks4All\" is displayed in dark red against the peach background. The website \"masks4all.org\" appears above \"#Masks4All,\" followed by the text \"Please share this video and help us save more lives.\" The video concludes with the young woman in the beret, back on the orange couch, putting on a black face mask. \"#Masks4All\" appears one last time.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the most frequently used tool for making masks in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A measuring tape.",
+ "B. A sewing machine.",
+ "C. Scissors.",
+ "D. A fabric cutter."
+ ]
+ }
+ },
+ "zpshTYYZ": {
+ "video_id": "zpshTYYZ",
+ "video_duration": "56s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Three men are visible in the video. They are standing next to a large gray object that has a cross-section cut away revealing red and blue colored areas. The man on the right, who is wearing a black shirt, puts his hand near the upper right side of the large gray object. He then picks up a small black object. The man points at an area of the gray object with his finger. The man with the black shirt then takes a small black object and inserts it into the gray object. A close up shows the valves and a piston in the gray object, with the piston moving up and down. The video moves to the original wide shot. The man with the black shirt puts the black object back down. He moves his hand to the side of the gray object.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Which of the following locations is the person in the video most likely to be in?",
+ "answer": "D",
+ "candidates": [
+ "A. Car dealership.",
+ "B. Inside a car.",
+ "C. Supermarket.",
+ "D. Vehicle repair shop."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will the man's explanation help improve the work of others?",
+ "answer": "A",
+ "candidates": [
+ "A. I don't know.",
+ "B. No.",
+ "C. Yes."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "After the close-up of the valve, how many times did the man in a black short-sleeve shirt and black wristband turn the valve?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. One.",
+ "C. Five.",
+ "D. Seven."
+ ]
+ }
+ },
+ "eKUImCBZ": {
+ "video_id": "eKUImCBZ",
+ "video_duration": "525s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a pink background with 12 tubes of lipstick scattered on top. A text graphic is overlaid that reads \"Bourjois Rouge Velvet The Lipstick Lip Swatches and Date Night Wear Test\". The text fades to black and white, then a light pink, and then fades away. The lipstick tubes are once again visible and displayed on a pink background. The video shifts and shows a woman with long brown hair and an off-the-shoulder red top. She is shown talking, smiling, and touching her hair. She is seen holding up tubes of lipstick. The woman is then shown applying lipstick with a hand mirror. She is shown on a pink background with lipstick tubes and is shown applying various lipsticks. A pink graphic appears over her stating \"Matte finish No transferring!\". She kisses her hand. She uses her hand to show the lipstick colors on her lips. She holds and wiggles each tube to display the name of the color. She is shown smiling, touching her face, and blowing kisses. The video shows a sign of a restaurant at night with the word \"ANANAS\" in pink neon lights. The camera shifts to a table setting at the restaurant. The camera shifts and displays the woman in a red off-the-shoulder top on her phone. The camera displays a framed quote on the wall. The woman in the red top is then shown holding and drinking a mixed drink, and touching her hair. A man in a suit is shown making a drink at the bar. The woman reapplies lipstick. A shot shows the woman eating. She then holds a glass and drinks. She is then shown working at a laptop. The camera shifts and shows the woman in a red top walking away at night in a blurred nightscape. She is then at a dessert stand. She is shown buying an ice cream and walking away. She applies the same lipstick again while walking. She then licks the ice cream and blows a kiss. At the end, a pink text graphic appears with the words \"Michelle Crossan BEAUTY . LIFESTYLE . TRAVEL . MUSINGS SUBSCRIBE HAVE YOU SEEN THIS? INSTAGRAM: @michellecrossan_ SNAPCHAT: michellecrossan TWITTER: @beautylifem\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order does the woman in the video introduce the lipstick colors? (a) Red. (b) Light nude. (c) Deep red.",
+ "answer": "A",
+ "candidates": [
+ "A. (b) (a) (c).",
+ "B. (a) (b) (c).",
+ "C. (b) (c) (a).",
+ "D. (c) (a) (b)."
+ ]
+ }
+ },
+ "NoxKqPuo": {
+ "video_id": "NoxKqPuo",
+ "video_duration": "208s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with a gray wig and a green cardigan gestures with her hands while speaking. She changes to a brunette wig with a black lace top and holds a red velvet cake with white frosting and a single candle. The camera cuts to a person with short black hair, glasses, a red shirt, and a denim vest sitting on a couch. Back to the first woman, now wearing a purple top with turquoise jewelry, she looks surprised, leans forward, then leans back and puts her feet up on the back of the couch, revealing yellow flip-flops. She switches to a blond wig and a navy jacket, speaking and gesturing. The scene changes to a man with a gray wig and mustache eating cake from a blue plate. The woman reappears in the lace top with both hands up. The person on the couch makes a disapproving face. The woman in purple gestures while talking, and the man continues to eat cake.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the woman wearing a blue necklace do to the little girl?",
+ "answer": "D",
+ "candidates": [
+ "A. Stopped the little girl from eating potato chips.",
+ "B. Gave the little girl rum.",
+ "C. Told the little girl to turn off the TV.",
+ "D. Handed her a piece of cake."
+ ]
+ }
+ },
+ "oMbOCwOD": {
+ "video_id": "oMbOCwOD",
+ "video_duration": "176s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black Brooklyn Nets hoodie and a black Nike baseball cap holds a black spatula. The person places a blue and white patterned bowl on a beige plate on a woven placemat. A person in a tan and green t-shirt sits at the counter. A hand holding the spatula puts pulled pork, mac and cheese, and baked beans into the bowl. The person in the Nets hoodie smiles. The person in the t-shirt drinks from a black mug. A person wearing a pink tank top appears and sits at the counter next to a plate and bowl on a woven placemat. The person in the t-shirt changes to a light green and white baseball tee with a dinosaur graphic. The person wearing the pink tank top taps the person in the dinosaur shirt on the arm. The person in the Nets hoodie points their index finger in the air. The person in the pink tank top yells and points at the person in the Nets hoodie. The person in the dinosaur shirt smiles broadly. The person in the pink tank top makes disgusting sounds while looking at their food. The person in the Nets hoodie makes a disgusted face while holding their hat. The person in the dinosaur shirt dips a piece of cornbread into their food and takes a bite. The person in the Nets hoodie throws their hands up in disgust. The person in the dinosaur shirt picks up their bowl and gestures with it. The person in the Nets hoodie holds their hands up and makes a kissy face. The person in the dinosaur shirt throws their hands up, sticks their tongue out, and falls on the floor. The person in the pink tank top picks up several individually wrapped Zebra Cakes. The person in the pink tank top opens a Zebra Cake and feeds it to the person lying on the floor. The person in the pink tank top eats the remaining Zebra Cakes under a bed, while kneeling and facing the basket which serves as a bed post. The person in the Nets hoodie throws their hands up again.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man in the video say he is full in the middle of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Because he is actually full.",
+ "B. Because he wants to save the food for when he is hungry later.",
+ "C. Because he has other matters to attend to and needs to leave.",
+ "D. Because the food smells unpleasant."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the man wearing a green short-sleeved shirt and the woman wearing a pink short-sleeved shirt in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. They are siblings.",
+ "B. They are classmates.",
+ "C. They are friends.",
+ "D. They are a couple."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the color of the food held by the woman wearing a pink short-sleeved shirt at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. White and brown.",
+ "B. Brown.",
+ "C. White.",
+ "D. Black and white."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action by the woman in the pink short-sleeved shirt restored consciousness to the man in the green short-sleeved shirt at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. She gently slaps him a few times and then feeds him a treasured snack.",
+ "B. The woman in the pink short-sleeved shirt attempts to revive the man by hitting him with bread.",
+ "C. The woman in the pink outfit performs cardiopulmonary resuscitation on the man.",
+ "D. The woman in pink directly feeds the man a snack."
+ ]
+ }
+ },
+ "RBdrFMQM": {
+ "video_id": "RBdrFMQM",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Four figures in dark, armored suits with glowing orange flames atop staves spin around a circular platform. The platform is maroon with black concentric circles. They stop spinning and raise their staves. The scene cuts to a crowd watching two figures on a similar platform. These two figures wear dark, segmented armor with red accents. They perform a synchronized dance. The view focuses on a man in the crowd with slicked-back black hair and a soul patch. He has a red and grey outfit and two piercings below his left ear. The two dancers continue their performance, ending in a bow. The man from the crowd walks onto the platform. Red lasers shoot across the stage from a central point. The man dodges the lasers. A hooded figure leaps across the stage and kicks a small fiery explosion. The man in the crowd is knocked to the ground. A blonde figure in a dark suit rushes in, wielding a staff or scythe, and attacks the two armored dancers. The blonde figure throws the man from the crowd across the stage. One of the dancers transforms into a large, many-limbed creature. The man gets up and throws a circular, glowing, orange energy projectile. He throws another projectile, and then holds a third in front of him.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "In the video, after the bearded man raises the staff toward the monster, is a flame emitted?",
+ "answer": "A",
+ "candidates": [
+ "A. No, there isn't.",
+ "B. Yes, there is.",
+ "C. I don't know."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes does the bearded man experience in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. From nervousness to anger.",
+ "B. From curiosity to panic.",
+ "C. From calmness to fear.",
+ "D. From confusion to sadness."
+ ]
+ }
+ },
+ "qVHRuokn": {
+ "video_id": "qVHRuokn",
+ "video_duration": "213s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a black screen, then displays the \u201cGreen Regimen\u201d logo. The logo features a muscular, green torso holding a small plant in its hands, surrounded by the words \u201cGREEN REGIMEN\u201d. The background fades to white. On a table sits a bowl of ice cubes, a glass of water, a peeled banana on a small plate, two papaya halves on a plate, chopped papaya pieces in a clear bowl, and some kale. A muscular, dark-skinned man wearing a short-sleeved blue athletic shirt enters the scene. He pours the water into a blender, then adds the ice cubes. He picks up the banana and puts it in the blender. He holds up the papaya, displaying a sticker that reads \u201cBrazil Caliman Papaya 3111\u201d. He peels the papaya, cuts it in half lengthwise, showing the dark seeds inside, scoops out the seeds, and then adds the papaya to the blender. He scoops some papaya seeds from a bowl and adds them to the blender as well. He adds diced papaya chunks to the blender. Finally, he adds the kale to the blender and blends the ingredients. The contents of the blender begin as a light green color, which turns darker green as the ingredients mix. The blended mixture is poured into a glass. A straw is inserted into the glass and the man takes a sip.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "How does adding different materials affect the flavor of a passionfruit drink?",
+ "answer": "B",
+ "candidates": [
+ "A. Adding papaya seeds makes it spicy.",
+ "B. Adding papaya seeds makes it sweeter.",
+ "C. Adding papaya makes it spicy.",
+ "D. Adding papaya makes it sweeter."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order of making a papaya passion drink?",
+ "answer": "B",
+ "candidates": [
+ "A. First add the papaya and papaya seeds, then add the banana, and finally add the watermelon and vegetables.",
+ "B. First add the banana, then add the papaya and papaya seeds, and finally add the watermelon and vegetables.",
+ "C. First add the watermelon and vegetables, then add the banana, and finally add the papaya and papaya seeds.",
+ "D. First add the watermelon and vegetables, then add the papaya and papaya seeds, and finally add the banana."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Do papaya seeds and grapes appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Only papaya seeds appear.",
+ "B. Both appear.",
+ "C. Only grapes appear.",
+ "D. Neither appears."
+ ]
+ }
+ },
+ "ALfOUzDH": {
+ "video_id": "ALfOUzDH",
+ "video_duration": "43s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Physics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A beaker and a heater are on a surface. A blue screen with an orange bar at the bottom appears. A procedure is listed on the orange bar. An empty, clear double-walled container is on the surface. A disembodied arm picks up a beaker. The arm pours water from the beaker into the double-walled container. A new beaker, partially full of liquid, is placed on the surface. The liquid in the second beaker appears to be steaming. A second beaker with steaming liquid is added to the surface. A disembodied arm places the second beaker inside the larger container. A silver stirring rod is added to the inner beaker. The top of the larger container is covered with a wooden lid that has a hole in the middle. Two disembodied arms enter the frame. Each arm is holding a thermometer. Speech bubbles with the labels T1 and T2 appear above the thermometers.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the spatial relationship between the beaker filled with hot water and the sealed apparatus in the final part of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The beaker and the apparatus are placed side by side.",
+ "B. The beaker is beneath the sealed apparatus.",
+ "C. The beaker is inside the sealed apparatus.",
+ "D. The beaker is on top of the sealed apparatus."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "In the video, what temperature of water in degrees Celsius should be added to the beaker suspended in the closed apparatus?",
+ "answer": "B",
+ "candidates": [
+ "A. 35\u00b0C to 45\u00b0C.",
+ "B. 75\u00b0C to 85\u00b0C.",
+ "C. 15\u00b0C to 25\u00b0C.",
+ "D. -5\u00b0C to 5\u00b0C."
+ ]
+ }
+ },
+ "HkPkGkIh": {
+ "video_id": "HkPkGkIh",
+ "video_duration": "53s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with brown hair, a mustache, and wearing a light gray suit and striped tie is looking off to his right. An orange caption says \"Wow.\" He continues to look off to his right, as another orange caption says \"Never ceases to amaze me.\" He continues to look off to his right, and an orange caption says \"What cologne are you gonna go with?\". He continues to look off to his right, as an orange caption says \"London Gentlemen or,\". A second man, with brown hair and a mustache, wearing a light blue pinstripe suit and a patterned tie, is visible in profile, and he says \"London Gentlemen or,\". The first man looks off to his right again as an orange caption says \"No, no, no. Hold on.\" He continues to look to his right, as an orange caption says \"Blackbeard's Delight.\". The second man says \"No.\". The second man opens a cabinet door. A caption says \"She gets a special cologne.\" The second man reaches into the cabinet. Another caption says \"It's called Sex Panther by Odeon.\" The second man opens a black box, with a dark, circular hole in it. Another caption says \"It's called Sex Panther by Odeon.\" The camera zooms in on the circular hole. The camera continues to zoom in on the circular hole, which has a black panther head with teeth visible. A caption says \"It's illegal in nine countries.\" The second man is visible again, saying \"It's illegal in nine countries.\" The second man reaches into the box and picks up the black panther bottle. A caption says, \"Yep, it's made with bits of real panther.\" He is holding the bottle. The first man says, while looking off to his right, \"So you know it's good.\" The first man says, while looking off to his right, \"It's quite pungent.\" The second man holds up the bottle, as another caption says \"Ooh it's a formidable scent.\" He continues to hold up the bottle, with the caption \"Ooh it's a formidable scent.\" The first man looks off to his right again, and says, \"It stings the nostrils.\" The second man puts the bottle to his neck. The first man says, \"Bryan, I'm going to be honest with you that smells like pure gasoline.\" The second man is holding the bottle and says \"that smells like pure gasoline. They've done studies, you know.\". The second man is holding the bottle and says \"60% of the time, it works every time.\" The first man looks off to his right and says, \"That doesn't make sense.\" The second man smiles and says, \"Well,\". The second man is then shown standing and moves his hands up and down, as a caption says \"Let's go see if we can make this little kitty purr.\" The second man is then shown standing and moves his hands up and down as the final caption says \"Let's go see if we can make this little kitty purr.\"",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What is the facial expression of the brown-haired man in the video when he smells the perfume?",
+ "answer": "C",
+ "candidates": [
+ "A. Raises an eyebrow disdainfully.",
+ "B. Surprised and cheering.",
+ "C. Corner of his mouth twitches.",
+ "D. Smiles and nods."
+ ]
+ }
+ },
+ "wYKRfRNr": {
+ "video_id": "wYKRfRNr",
+ "video_duration": "108s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Black Widow and Captain America are sitting in a dark place, seemingly looking at something in front of them. The Hulk, wearing a grey cardigan, stands before a wooded area, looking down. Captain America, with blood on his face and a grimace, looks upward. The Hulk, Thor, Captain America, Rocket, and another person stand on a wooden dock by a lake. Captain America, in his suit, stands in a destroyed city, debris falling from the sky. The destroyed city is shown in a wider shot, the sun rising behind it. Captain America, in civilian clothes, stands before a wooded area. Captain America, with his shield, faces a large army. Iron Man and Captain America, both in civilian clothing, stand face to face in a city. Iron Man, inside his suit, looks downwards. Iron Man flies through the air. Many characters charge forward in battle. A large army appears through glowing orange portals. Captain America and Black Widow, in uniform, stand inside what looks like a hangar. Captain America, his suit torn, runs amid a large group in battle. Black Panther, Okoye, and Captain America, walk with the Dora Milaje and Wakandan army. Thor swings his hammer, while Black Widow readies weapons, with Hulk and Captain America behind them. The Hulk, Black Widow, Iron Man, and Captain America fight in New York. Several people on motorbikes jump over a barricade, the Hulk visible on the left. Iron Man fires a beam at unseen enemies. Iron Man lies on the ground, injured and with blood on his face. Doctor Strange gestures, sending orange energy. Thanos' golden gauntlet grabs Star-Lord's throat. Wanda uses red magic against Thanos. Captain America strains against Thanos' gauntlet. Thor and Captain America fight amidst a fiery explosion in New York. Iron Man flies through the air and shoots beams of energy. Iron Man, inside his suit, looks at some technical readouts. Captain America, wearing a safety harness, sits and breathes heavily. A compass shows a photo of Peggy Carter. An injured man lies on a cot. Doctor Strange extends his hand, projecting orange energy. Iron Man, injured, sits down. Doctor Strange looks on. Iron Man raises his hand, using all six Infinity Stones, creating bright lights. Thanos punches Iron Man. Spider-Man hugs Iron Man. Thanos slams Iron Man into debris. Iron Man, now in a suit, places his hand on Peter Parker's shoulder and guides him. Black Widow, wearing a white blouse, smiles slightly. Black Widow, wearing a black vest, looks at Hawkeye, who has a mohawk hairstyle. A woman's hand rests on another person's hand. Iron Man raises his hand, wearing the gauntlet. Energy flows from the gauntlet. Pepper Potts walks down the steps of a cabin, a young girl at her side, with other people in black suits. Pepper Potts cries, hugging a man. Iron Man raises the gauntlet, energy flows from the stones. Pepper Potts sits with her daughter at a lakeside. A floral wreath floats on the water, the center of which bears an arc reactor with the inscription, \"Proof that Tony Stark has a heart.\" Iron Man's hand stops moving, the light fading from the gauntlet. Spider-Man, crying and dirty, looks at Iron Man. Iron Man looks at the gauntlet on his hand. He snaps his fingers. Captain America walks with another man on a grassy area. Iron Man and Captain America walk side by side. Captain America and a man in a suit walk toward an orange Audi R8 sports car. A small, yellow house surrounded by trees is shown. Captain America hugs Peggy Carter, who closes her eyes.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the weather like during the fight scene in the ruins in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It is sunny.",
+ "B. It is rainy.",
+ "C. It is foggy.",
+ "D. It is overcast."
+ ]
+ }
+ },
+ "prCzmGtq": {
+ "video_id": "prCzmGtq",
+ "video_duration": "77s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A single person wearing a black t-shirt, black pants, a black head covering, and white shoes is in a room with light-colored walls and a dark brown door. The person can be seen holding a cell phone to their ear in several frames. A white towel hangs on a hook in the background, and white light switches are visible. In other frames, the person walks into an adjacent room that has a blue chair, and a mirror on the wall. In later frames, the person holds a small circular object and applies it to their face.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the woman's emotion in the video who has long black hair?",
+ "answer": "D",
+ "candidates": [
+ "A. Excited.",
+ "B. Sad.",
+ "C. Calm.",
+ "D. Angry."
+ ]
+ }
+ },
+ "nvmqSUXN": {
+ "video_id": "nvmqSUXN",
+ "video_duration": "343s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A wrestling match takes place in a cage. The cage walls are orange-brown and decorated with stylized flames. The ring mat is predominantly white with colorful designs. The ropes are blue, and the turnbuckle pads are hot pink. Two wrestlers, one in a colorful one-piece outfit and one in a green and white one-piece outfit, face off. The wrestler in green grabs the other wrestler and performs a takedown. The referee checks on the downed wrestler. The wrestler in the colorful outfit gets up and circles the wrestler in green. They lock up again, and the wrestler in green performs another takedown. The referee checks on the downed wrestler. The wrestler in the colorful outfit gets up and circles the wrestler in green. The wrestler in the colorful outfit performs a takedown. The wrestler in green gets up. The wrestler in the colorful outfit kicks the wrestler in green, who then climbs the turnbuckle and jumps off, landing on the other wrestler. Both get up. The wrestler in colorful clothing grabs the wrestler in green, and the wrestler in green flips the other wrestler and pins them near the ropes. The referee counts the pin, but the wrestler in the colorful clothes escapes at the count of two. The wrestlers grapple near the ropes. The wrestler in green Irish whips the wrestler in colorful clothing into the ropes and performs a spinning move, taking the other wrestler down and pinning them. The referee counts the pin, one...two...three. The words \"Winner: Nightmare Noah Tyndal\" appear on the screen. The scene changes to a replay of the final moments of the match, from a different angle. The wrestler in green celebrates in the ring. The scene changes to a match rating screen, which says \u201cGood.\u201d",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the action of a competitor wearing green using a stick occur?",
+ "answer": "D",
+ "candidates": [
+ "A. Eleven times.",
+ "B. Eight times.",
+ "C. Nine times.",
+ "D. Ten times."
+ ]
+ }
+ },
+ "fAxVqEIX": {
+ "video_id": "fAxVqEIX",
+ "video_duration": "108s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Four people are present in the video. There are two dark-haired women standing on the left side, behind a table covered in purple cloth. The woman on the left wears a dark sleeveless top, and the woman next to her is wearing a blue top. They are both cooking with individual cooking burners and skillets. There is a man in a white polo standing between them and two dark-haired women on the right. The two women on the right are cooking with individual burners and skillets behind a table covered in black cloth. The woman on the left is wearing a black shirt, and the woman next to her is wearing a striped blazer. There are various bowls, plates, and cooking tools on each table. At the end, the people put their hands in the air.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video contain curry?",
+ "answer": "D",
+ "candidates": [
+ "A. None of the above.",
+ "B. Yes, it does.",
+ "C. I am not sure.",
+ "D. No, it does not."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, who said, \"Pass me the curry that was stolen from her\"?",
+ "answer": "B",
+ "candidates": [
+ "A. A woman in a sleeveless top.",
+ "B. A woman in black.",
+ "C. A woman in blue.",
+ "D. A man in a white shirt."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "What ingredients did the man in the video use to prepare the dish?",
+ "answer": "A",
+ "candidates": [
+ "A. The man did not cook.",
+ "B. Shrimp, curry spices, coconut milk, and whale.",
+ "C. Corn flakes, ramen, Thai chili, and shrimp.",
+ "D. Shrimp, curry spices, coconut milk, and mixed bell peppers."
+ ]
+ }
+ },
+ "ZJqGIAcP": {
+ "video_id": "ZJqGIAcP",
+ "video_duration": "29s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with a beard and glasses, wearing a black cap and a dark gray shirt, speaks in front of a microphone. Behind him, a cartoon graphic reads \u201cSmile 2 Jannah.\u201d He gestures with his hands as he talks. The video cuts to a split screen, showing trending topics on Twitter in India, Worldwide, the United Arab Emirates, and Pakistan. The man continues speaking. The video cuts to a man in light clothing, bending over a small television placed on top of a cabinet. The man speaking reappears. An image is shown of an older, light-skinned man with white hair, lying on a blue yoga mat. He wears white clothing with an orange, white and green sash across his chest. The video cuts to a man in a blue suit speaking at a podium, facing a seated audience. Two large screens show him speaking. He raises his fist. A wider shot of the room is shown, as the audience applauds. The bearded man with glasses reappears, speaking and gesturing. An animated image of planet Earth is shown. The animation then shows a cartoon man in a suit, sinking in blue water, his leg chained to a large ball labeled \u201cDEBT.\u201d The final animation shows a yellow, egg-shaped character with a disapproving expression, working on a laptop placed on a desk or table.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What do the animations presented in the latter part of the video represent?",
+ "answer": "B",
+ "candidates": [
+ "A. Taxes, Islamophobia, and climate change.",
+ "B. Climate change, debt, and taxes.",
+ "C. Debt, taxes, and Islamophobia.",
+ "D. Climate change, taxation, and debt."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Which of the following events did not occur after the speech in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Someone suddenly fainted.",
+ "B. Someone immediately paid back taxes.",
+ "C. Imran Khan became a trending topic.",
+ "D. Someone kissed the television."
+ ]
+ }
+ },
+ "sXlCyYCG": {
+ "video_id": "sXlCyYCG",
+ "video_duration": "77s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A close up black and white drawing of a face is shown with the right eye looking to the right. The left eye is slightly angled, looking away. The video then moves to a wider shot of the black and white drawing, framed within a black border. The drawing gets dark, and then moves to a scene with a person standing on top of a large canvas. The person is painting onto the canvas with a black liquid, and is not wearing any shoes. Next, a close up of hands holding white pieces of paper to their chest is shown. The video then shows a man wearing a black shirt holding many dismembered doll limbs. Then, a man wearing a red shirt is shown holding a bucket. He splashes black paint onto a white canvas. Next, a man wearing a black shirt is shown crouching on a large white canvas, using a small brush and red paint. Then, a man wearing a blue t-shirt is arranging playing cards on a table. Next, a man wearing a grey shirt is shown kneeling on a white surface, painting onto it with a brown liquid and drinking from a brown bottle. A man is shown drawing a large black and white picture with small dots. Next, a man in a jacket and grey pants is standing on a red circle on a stage, giving a presentation. He is shown standing on another stage with a background of purple geometric panels. The man is seen rolling up his sleeve in a close up shot. He is shown painting on his chest with white paint and covering it with a piece of paper. Then he adds black paint to make the shape of a face. He is then shown painting his chest with blue paint, then with white and green to make the earth. The man is shown pouring a liquid into a black tray with a white paper. The man is shown throwing a black object against a white wall, then another object is thrown against the wall, creating a splattered art piece. The drawing is shown complete. Then the man is shown cutting a piece of paper with a pair of scissors. He then tears a photo in two and shows it. He then feeds the pieces into a paper shredder. Then black and white writing appears saying, \"Goodbye Art\". The man is then shown arranging multiple layered playing card portraits. He then lights the edge of the layered cards on fire, creating a burning display.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the setting the man is in during the middle of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Outdoors.",
+ "B. An art gallery.",
+ "C. An office.",
+ "D. A studio."
+ ]
+ }
+ },
+ "yLaVkfro": {
+ "video_id": "yLaVkfro",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The first part of the video shows the exterior of a severely damaged two-story building. Several people wearing black and yellow uniforms are walking amid the debris. There is a large orange truck and a yellow construction vehicle near the ruins. Several fire trucks are visible in the distance. A severely damaged car sits in the foreground. The remaining portion of the video is shot from ground level showing a series of large trucks with large tires. Several people in dark uniforms are loading rectangular objects into the rear of the trucks. A large \"Z\" is visible on some of the trucks. Some of the trucks appear to have damage including large holes in the sides and bent metal. Debris is scattered across the ground.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, what is the position of the ammunition boxes on the truck relative to the soldiers?",
+ "answer": "D",
+ "candidates": [
+ "A. The ammunition boxes are behind the soldiers.",
+ "B. The ammunition boxes are to the front right of the soldiers.",
+ "C. The ammunition boxes are to the right of the soldiers.",
+ "D. The ammunition boxes are to the left of the soldiers."
+ ]
+ }
+ },
+ "AzJvyuVm": {
+ "video_id": "AzJvyuVm",
+ "video_duration": "82s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a woman with short dark hair and glasses wearing a pink collared shirt with a pearl necklace. She is gesturing with her hands as she speaks. In the background, behind her, a man with a shaved head and glasses can be seen, he is wearing a suit with a white shirt and a dark tie. There is also a logo that says \"PROJOURN MEDIA TV\" with a \"SUBSCRIBE\" button on it in the top right corner. The left corner of the video contains the text \"Give Your Life To Jesus Christ (Mat 3:2)\" and \"PROJOURN MEDIA LIVE UPDATES\". In the bottom right corner there is the text \"HAJJI MUKISA OMUSOMI W'EDDUWA\" along with a phone number and a WhatsApp icon. At times, there are also colorful speech bubbles containing the words \"SUBSCRIBE\", \"COMMENT\", and \"LIKE\". At one point the text \"TOP GOVERNMENT OFFICIALS ARE NOT VICTIMS OF CORRUPTION, - IGG\" is displayed, then it changes to \"BETI KAMYA, INSPECTOR GENERAL OF GOVERNMENT\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotion is the woman in the video expressing?",
+ "answer": "C",
+ "candidates": [
+ "A. Serious and pained.",
+ "B. Excited and thrilled.",
+ "C. Angry and frustrated.",
+ "D. Calm and composed."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the woman snap her fingers in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. Five.",
+ "D. Two."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Where is the clearly audible laughter in the video coming from?",
+ "answer": "B",
+ "candidates": [
+ "A. The people around.",
+ "B. A bald man.",
+ "C. A woman.",
+ "D. The radio broadcast."
+ ]
+ }
+ },
+ "dwjcoxJl": {
+ "video_id": "dwjcoxJl",
+ "video_duration": "45s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with blond hair, wearing a dark lace shirt, is looking out a window with white lace curtains. A dark-colored truck is visible outside, along with a man with a shaved head and light-colored shirt. The woman then looks down as feet stomp on the floor near the door, with sounds of glass breaking. She crouches under a table, and looks towards the sound. The video switches to a shot of a yellow kettle on a stove, with flames underneath. Then two men, one with a beard and a red plaid shirt, the other with a hoody, stand in a room. The bearded man picks up an item. A man with a shaved head walks towards a window. The blond woman with the lace shirt is in the corner of a room, appearing distressed. The bearded man is holding her, then carries her away. The man with the shaved head is shown, along with the blond woman. A man speaks, then the blond woman is shoved towards a sink. A man with a beard and a red plaid shirt is shown. The blond woman, then a man with a hoody are shown. The woman is held over a shoulder. The man with a red plaid shirt extends his arm. The man in the hoody walks up wooden stairs. The man with the beard looks surprised as the woman struggles against him. The video ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What led the three men in the video to discover the woman in her home?",
+ "answer": "A",
+ "candidates": [
+ "A. A kettle was boiling water in the house.",
+ "B. The woman's breathing was too loud.",
+ "C. The door to the house was open.",
+ "D. The lights in the house were not turned off."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where did the woman hide in the video when she realized the bad guy was approaching?",
+ "answer": "D",
+ "candidates": [
+ "A. Inside the cabinet.",
+ "B. Behind the door.",
+ "C. In the basement.",
+ "D. Under the table."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will the woman in the video ultimately escape from the villains?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes, she can escape.",
+ "B. No, she cannot escape.",
+ "C. I am not sure."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What sound was heard in the background as three men entered the house in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The sound of a key turning in a lock.",
+ "B. The alarm of a kettle.",
+ "C. The sound of breaking glass.",
+ "D. The sound of a door being kicked in."
+ ]
+ }
+ },
+ "TYRjyrsC": {
+ "video_id": "TYRjyrsC",
+ "video_duration": "33s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video opens with a black background with white, sparkling diamond-shaped objects moving across the screen. A transparent logo appears in the center. This transitions to a shot of a person sitting in a chair. The person has dark curly hair pulled up, is wearing a grey sweatshirt, and is sitting in front of a partially visible kitchen. The person gestures and moves their arms and speaks. On the right of the screen is a logo with the words \"SEXUALITY | SHENANIGANS \\/*\\ SHANBOODY SUNDAYS | WEDNESDAYS\". The person gestures with their finger and hands while talking. At the end of the video, the person leans out of view. The camera is not stationary during the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a cup in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A cup is shown.",
+ "B. No cup is shown.",
+ "C. A cup is shown in the kitchen.",
+ "D. A cup is shown and someone drinks water from it."
+ ]
+ }
+ },
+ "NLvWrilq": {
+ "video_id": "NLvWrilq",
+ "video_duration": "212s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a split screen, with different content on either side. The left side shows a man speaking into a microphone, wearing a dark suit with a red tie. The right side of the split screen shows a man with dark hair wearing a black shirt. The man on the left side is often looking directly into the camera while the man on the right is often looking directly at the man on the left side, but also sometimes at the camera. At various points, the man on the right side can be seen gesturing with his hands, and also eating food items. This split screen is also displayed in front of an audience, on a stage with ornate framing. The split screen is shown multiple times, with the men's actions changing slightly. Interspersed with the split screen are shots of different audience members, some of whom are smiling and clapping their hands. These audience members are sitting in rows of red seats. A man on the stage is shown holding a microphone, and also a cell phone to his ear. At one point, another person enters the stage wearing a long black dress. There are bright lights and gold decorations on the stage. The video also displays text and logos for 'asianetHD' and 'hotstar' in the bottom left and top right corners. At the end of the video there is a black screen that features a logo for 'hotstar' with the text 'go solo'.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion does the man on the right side of the big screen in the video appear to express?",
+ "answer": "D",
+ "candidates": [
+ "A. Happiness.",
+ "B. Anger.",
+ "C. Calmness.",
+ "D. Humor."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What event in the video signifies that Dalmajian has won the debate?",
+ "answer": "C",
+ "candidates": [
+ "A. Dalmajian disappears.",
+ "B. Dalmajian receives an award.",
+ "C. Dalmajian steps onto the stage.",
+ "D. Dalmajian speaks."
+ ]
+ }
+ },
+ "sqosqbEK": {
+ "video_id": "sqosqbEK",
+ "video_duration": "60s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Book Reviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person is holding a large hardcover book and then a stack of four books. The video then shows a close up of the stack of books, which appear to be \"Curse So Dark and Lonely\" by Brigid Kemmerer, \"Sadie\" by Courtney Summers, \"Crooked Kingdom\" by Leigh Bardugo, and \"Harry Potter and the Prisoner of Azkaban\" by J.K. Rowling. A person reappears, who is wearing a grey sweater. They clap, give two thumbs up, hold out both palms, put their hands to their cheeks, put their hands to their mouth, hold their hands out, put their hands together, and wave. They point downward, hold up two fingers and then three fingers, and then wave. The video ends with a screen that says \"Sabine's Book Nook\" and has a subscribe button along with two rectangles that say \"previous video\" and \"more sabine\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the first gold-colored letter visible on the spine of the bottommost book in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. R.",
+ "B. H.",
+ "C. P.",
+ "D. Y."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the woman in the video so excited?",
+ "answer": "B",
+ "candidates": [
+ "A. Because she discovered four interesting books.",
+ "B. Because she participated in a reading marathon.",
+ "C. Because she got a new camera.",
+ "D. Because she gained a lot of attention."
+ ]
+ }
+ },
+ "ChIOcnWi": {
+ "video_id": "ChIOcnWi",
+ "video_duration": "163s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Software",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a desktop environment with a mountain background, a \"Home\" folder, a \"Trash\" icon, and a taskbar at the bottom. A cursor moves around the screen, opening various menus and windows. Initially, the start menu is opened at the bottom left corner of the screen, displaying several submenus like \"Accessories\", \"Games\", \"Graphics\", and others. Then the \"About this computer\" setting is shown, followed by an appearance setting. The desktop background is changed to an image of the moon, then a galaxy, then a blue abstract pattern, and finally to a city background, before returning to the original background with mountains. Subsequently, the start menu is opened again, showing programs such as Klondike, Mines, and Gimp. \"Virtual Box\" and various other programs are opened and used, and finally, the system is shut down. The appearance of the windows, icons, and background changes throughout the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the purpose of the Zorin web browser manager?",
+ "answer": "A",
+ "candidates": [
+ "A. To download and switch between web browsers.",
+ "B. To download image editors.",
+ "C. To download video editors.",
+ "D. To modify browser plugins."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the memory size of the computer operating system shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. 4 GB.",
+ "B. 8 GB.",
+ "C. 2 GB.",
+ "D. 1 GB."
+ ]
+ }
+ },
+ "tkZqcJYF": {
+ "video_id": "tkZqcJYF",
+ "video_duration": "43s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. Then, green text appears that reads \"HIT THAT LIKE BUTTON\". Then, four individuals appear, standing side-by-side in a hallway. One individual holds a black rectangular object. That individual then extends the object toward the person standing next to him, while talking. The individual holding the object is on the left side of the screen. The third individual from the left starts talking. The person on the far right of the screen holds part of his jacket in front of his mouth. The first and fourth individuals then perform a hand shake, with both touching hands and then the individual on the right touching his hand to his mouth. The third person then starts to dance while the other three watch. The two individuals on the left continue walking while making contact with the hallway walls. The third individual continues walking while making contact with the wall.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotional state of the man holding the phone in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Humorous.",
+ "B. Fearful.",
+ "C. Calm.",
+ "D. Painful."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What special action is performed after smelling something in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Rubs the face.",
+ "B. Waves a hand in front.",
+ "C. Adjusts the clothes.",
+ "D. Pinches the nose."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What is the man holding a phone doing at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Running.",
+ "B. Chasing.",
+ "C. Walking.",
+ "D. Playing basketball."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Has a microphone appeared in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. Unclear."
+ ]
+ }
+ },
+ "RwRbtKkh": {
+ "video_id": "RwRbtKkh",
+ "video_duration": "401s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video starts with two people sitting next to each other, both wearing red clothing with lettering. The video then shows a sequence of people, often in groups, wearing shirts with university names, and many people wearing red. These people are located in classrooms, hallways, and cafeterias. A brown bear mascot is shown briefly. A poster with a map of the United States with colored dots on it is also shown. The video also features two people in a car.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who said \"I am going to Rutgers University\" in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A person wearing a gray hooded sweatshirt.",
+ "B. A person wearing a red hoodie.",
+ "C. A person wearing a red striped undershirt.",
+ "D. A person wearing a red short-sleeved shirt."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a white map in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "TdrhnSYo": {
+ "video_id": "TdrhnSYo",
+ "video_duration": "185s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person's hands are visible manipulating an open book. The book has different pages with drawings and different color patterns on them. The first two pages shown feature a drawing of two cats with green eyes, one grey, one white, along with other drawings of birds, butterflies, and flowers. On top of the two cats' heads, there are two small blue rectangles with the words \"Purr-fect\" and \"Together.\" The book is then flipped to the next two pages, which are decorated with light blue, yellow, and white shapes, including circles and dots, as well as green and white flowers. These pages are then flipped to different pages of the book, which are patterned with dark, and light pink, green and yellow, and blue shapes and patterns. The person's hands are seen opening a plastic container with black lids. A small bottle of dark grey liquid and a small piece of paper with text on it are removed. The container is put back down and then a small bottle of gold liquid is taken out of the container. The book is then flipped to pages with a watercolor painting with a butterfly on them. Finally, the book is then flipped through various other pages. Throughout the video the person's hands are rubbing or touching different parts of the book. The book rests on a blue cutting mat.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many paintings are displayed on the laptop in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Seven.",
+ "B. Six.",
+ "C. Four.",
+ "D. Five."
+ ]
+ }
+ },
+ "uJzMtZOf": {
+ "video_id": "uJzMtZOf",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A piece of raw ribeye steak is placed on a wooden cutting board. Olive oil is drizzled onto the steak. Salt and pepper are added using a grinder. The steak is flipped, and more salt and pepper are added. A frying pan is placed on a gas stove. The pan has a honeycomb pattern. The steak is placed in the hot pan. The steak is flipped and pressed down. Garlic cloves and butter are added to the pan. The garlic and butter are spooned over the steak. More olive oil is added to the pan. The steak is cooked until browned on both sides. The cooked steak is placed back on the wooden cutting board. The steak is probed with a meat thermometer. Butter from the pan is spread onto the steak. Two eggs are cracked into the same frying pan. The eggs are cooked until the whites are set and the yolks are still runny. The cooked eggs are transferred to a plate next to the steak, using a spatula. A fork is placed on the plate. A screenshot is shown of a calorie-counting app displaying the nutritional information of the steak and eggs.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many types of seasonings were added while cooking the steak in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Six.",
+ "C. Seven.",
+ "D. Five."
+ ]
+ }
+ },
+ "opfzTCSo": {
+ "video_id": "opfzTCSo",
+ "video_duration": "30s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Wladimir Klitschko, in red boxing trunks, and Samuel Peter, in light blue boxing trunks, are boxing. Klitschko lands a right hook to Peter\u2019s head. Peter throws a left jab, followed by a right-handed punch to Klitschko\u2019s body. The two boxers grapple briefly. Klitschko lands a combination of punches, including an uppercut, to Peter\u2019s head. Peter lands a left hook to Klitschko\u2019s head. Klitschko throws a right-handed punch to Peter\u2019s body. The boxers exchange punches. Peter throws a series of punches. Klitschko backs away, then throws a punch to Peter\u2019s body. Peter throws a punch to Klitschko's head. They briefly grapple. Klitschko throws a right-handed punch to Peter\u2019s head as Peter simultaneously throws a punch. Klitschko and Peter trade punches again. The referee steps in to separate the boxers as they grapple. Klitschko throws a right hook. Peter jabs at Klitschko, and the two boxers grapple again. Klitschko lands several jabs as he backs away from Peter. Peter throws a right hook to Klitschko's head, and they grapple once more. Klitschko lands a punch to Peter's head. Klitschko and Peter exchange punches and grapple, with Klitschko throwing a combination of punches.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the video show two referees?",
+ "answer": "B",
+ "candidates": [
+ "A. None of the above.",
+ "B. No.",
+ "C. I am not sure.",
+ "D. Yes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Who ultimately wins in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The man in blue shorts wins.",
+ "B. The man in red shorts wins.",
+ "C. The match results in a draw.",
+ "D. The outcome is unclear."
+ ]
+ }
+ },
+ "geGkHJBA": {
+ "video_id": "geGkHJBA",
+ "video_duration": "51s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A skier stands at the top of a snowy peak, overlooking a mountain range and a large lake or sea in the distance. The camera zooms out, revealing the vastness of the snowy landscape. The scene changes to a higher vantage point, showing a skier standing at the edge of a steep slope. Two other skiers are visible further down the slope. Next, a skier descends a steep, snowy chute, kicking up a cloud of snow. The skier navigates through the chute, maneuvering around rocks and trees. The perspective shifts to a first-person view from a skier's skis. The skier begins their descent down a very steep, untouched slope. The camera follows the skier's rapid descent. The final shot returns to the aerial perspective. The skier continues down the mountain, leaving a trail in the snow. The skier eventually disappears into the snowy terrain dotted with trees.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the woman's mood as she stands on the mountaintop in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. She is calm and contemplative.",
+ "B. She is excited and thrilled.",
+ "C. She is disappointed and discouraged.",
+ "D. She is furious and enraged."
+ ]
+ }
+ },
+ "sEsyAQxf": {
+ "video_id": "sEsyAQxf",
+ "video_duration": "95s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Historical Analysis",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with a mustache, wearing a suit with a white collared shirt and a striped tie is shown speaking into several microphones. The man is shown in a close-up, facing either the camera or off to the left of the camera. Later, the video shows an audience clapping, and then goes back to the man at the microphones, where the man continues speaking. Then the video cuts to a wider shot showing multiple microphones on a podium, with the man now walking to the left side of the frame.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Who is the man in the video giving a speech while wearing a suit?",
+ "answer": "D",
+ "candidates": [
+ "A. A diplomat.",
+ "B. A teacher.",
+ "C. A protester.",
+ "D. The king."
+ ]
+ }
+ },
+ "mkGswQLR": {
+ "video_id": "mkGswQLR",
+ "video_duration": "205s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Three men are on a dimly lit stage in front of a crowd. A man in a blue t-shirt and baseball cap raps into a microphone, gesturing with his hands. Another man in a red hoodie stands next to him, also rapping and gesturing. A third man in a green t-shirt stands near a table with equipment, also rapping. The camera zooms in on the man in blue. The camera zooms in on the man in red as a repeated visual effect momentarily tiles the screen. The screen briefly goes black. The camera focuses on the man in red, then the man in green. The camera focuses on the man in red, then a repeated visual effect of the man rapping fills the screen. The camera focuses on the man in blue, then red, then blue, then red. The camera focuses on the man in green, who is rapping, then on the man in red, then green, then red. A repeated effect of the man in red fills the screen. A repeated effect of the man in green fills the screen. The camera pans to the crowd, who are raising their arms and clapping. A repeated effect of the crowd fills the screen. The camera returns to the three men on stage. The camera zooms in on the man in blue. The man in red raps. The camera focuses on a man in a pink hoodie who is rapping, then briefly on the man in red. The camera focuses on the man in red, then the man in green. A repeated visual effect of the man in red fills the screen. A repeated visual effect of the man in green fills the screen. The camera returns to the three men. The camera focuses on the man in green, the man in red, then the man in green. A repeated effect of the man in red fills the screen. A repeated visual effect of the man in green fills the screen. The camera returns to the three men. The camera focuses on the man in blue, and then the three men interact and gesture. The camera focuses on the man in red rapping. A repeated visual effect of the man in red fills the screen. The camera focuses on the man in green. The camera returns to the three men. The camera focuses on the man in blue. A repeated visual effect of the audience fills the screen, along with flashes of light. The camera focuses on the man in blue. The man in a red hoodie raps. The camera focuses on the man in green, then the man in red. The man in blue raps and points at the audience. The screen goes black, then the man in blue reappears. A repeated effect of the man in red rapping fills the screen, then a repeated effect of the man in blue, then a repeated effect of the man in red. The screen goes black. The camera focuses on the man in red, then green. A repeating visual effect of the man in green fills the screen, then the image shifts to a repeating effect of the man in red. The camera returns to the man in green rapping. The camera returns to the men on stage. The camera focuses on the man in blue. The screen briefly goes black. The camera returns to the man in red. A repeating effect of the man in red fills the screen, then the man in green rapping. The camera returns to the stage. Three men on stage gesture and move. A repeated effect of the man in green fills the screen, and then a repeated effect of the man in red. The camera focuses on the three men on stage. A repeated effect of the man in blue fills the screen. The camera returns to the three men, ending the video with a screen that says \u201cHot Sauce Committee: Part One,\u201d \u201cBeastie Boys,\u201d and \u201cHere\u2019s A Little Something For Ya,\u201d with other names below.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What interaction occurs between the man wearing a blue shirt and the audience in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The man in blue gives the middle finger to the audience.",
+ "B. The audience gives the middle finger to the man in blue.",
+ "C. The man in blue makes a face at the audience.",
+ "D. The audience makes rap gestures towards the man in blue."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many different instruments produce sounds in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. One.",
+ "B. Four.",
+ "C. Two.",
+ "D. Three."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man in red move his arms in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Eighteen.",
+ "B. Fifteen.",
+ "C. Sixteen.",
+ "D. Seventeen."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "In which part of the video is the instrument player positioned at the center of the frame?",
+ "answer": "B",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. At the end of the video.",
+ "C. Throughout the entire video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "kSuLzBHn": {
+ "video_id": "kSuLzBHn",
+ "video_duration": "34s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "An animated figure with cat ears, glasses, and an orange scarf holds up one finger. The figure smiles. A small, light-green-haired elf with pointed ears stands on a stage in front of pink curtains. Three other figures join her: a pink-skinned demon with horns, a blond elf, and a green goblin-like creature with long dark hair. The scene changes to show these four around a table, looking at papers. The green-haired elf leans forward and gestures excitedly. Later, the green-haired elf and the goblin look at a purple computer screen with the title \"The Venture Place.\" The green-haired elf sits on a couch with the blond elf, looking at a phone and laughing. The green-haired elf sits next to the demon in a movie theater, watching a monster on the screen. The elf circles a date on a calendar. She then looks at several pieces of paper, with a determined expression on her face. As she reads the papers, images briefly appear: a muscular, blond warrior with a sword; a blue fuzzy monster; a small wizard in a pointed hat and purple robes. She talks to the demon and blond elf, gesturing broadly with her hands. Two hands frame the three figures as though taking a photo. A red curtain opens to reveal the green-haired elf on a brightly lit stage. The blond elf, now in armor, kneels with a sword, then raises it in a fighting stance. The goblin runs through a dark forest. The demon dips a human prince in a ballroom.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the girl with green hair appear alone in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Eight times.",
+ "B. Three times.",
+ "C. Four times.",
+ "D. Six times."
+ ]
+ }
+ },
+ "DLvKfgfe": {
+ "video_id": "DLvKfgfe",
+ "video_duration": "92s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A multicolored object with animal features stands on a table. A hand picks up the object, revealing it to be a segmented bracelet. The object twists and curls. A series of these segmented bracelets and other objects are arranged in rows. Two young people wearing hats and trench coats appear and hold magnifying glasses. The bracelets are displayed on a hand. A series of rapid edits of the bracelet objects, bracelets on hands, and the two young people are shown. Two young people are dancing. Two young people are seated in front of a table covered with segmented objects. A young person speaks to the camera while a circular inset of a segmented object is displayed in the corner. The two young people continue to talk. The bracelets and segmented objects are displayed on the table. The two young people are talking to the camera. A question mark graphic is displayed above their heads. One of the young people looks at the camera with surprise. The segmented objects are shown again. The text \"Cutie Frutti Unicorn\" appears above a segmented object. A hand picks up the object. Two hands pull apart a segmented bracelet. The text \"How2 Twist! Pull Apart\" is displayed. The hands separate a bracelet into a straight line of connected segments. The young people are sitting in front of the table with the segmented objects. A hand connects two ends of a segmented bracelet. The text \"How2 Twist! Connect Both Ends\" is displayed. The young people are shown with the bracelets on their wrists. The text \"Color Plunge\" is displayed. The young people are shown sitting behind a table. A hand unlatches a bracelet. The text \"How2 Twist! Unlatch\" is displayed. A hand twists the legs of the segmented object. The text \"How2 Twist! Twist Back Legs\" is displayed. The two young people are shown sitting in front of the table with the segmented objects. A hand twists the front legs of a segmented object. The text \"How2 Twist! Twist Front Legs\" is displayed. The two young people are shown sitting in front of the table with the segmented objects. A graphic of a unicorn face with the text \"Mane Friend\" is displayed. A young person puts their hand to their head in confusion. The two young people are talking to the camera. A \"Style Alert\" graphic appears. One of the young people is holding a segmented object. The two young people are sitting in front of the table with segmented objects. Photos of a young person are displayed in black and white frames. The young person in one of the photos has a unicorn horn and blue hair. A young person talks to the camera. The two young people are taking selfies in front of the table. The text \"Unicam\" appears. A young person is holding a segmented object. The two young people are shown together. The young people are smiling with segmented objects in their hands.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many steps are required to turn a pet into a twist bracelet?",
+ "answer": "B",
+ "candidates": [
+ "A. Four steps.",
+ "B. Two steps.",
+ "C. Five steps.",
+ "D. Three steps."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the two girls do together at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Played around.",
+ "B. Had an argument.",
+ "C. Pretended to be unicorns.",
+ "D. Took a group photo."
+ ]
+ }
+ },
+ "hoWvXKgK": {
+ "video_id": "hoWvXKgK",
+ "video_duration": "125s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Three people are visible in the video. Two are wrestling in a ring, and a third person, wearing a black and white striped shirt and black pants, is present as an observer. The wrestler on the left initially wears a silver tank top and silver pants. The wrestler on the right wears a blue and black top and a blue and black short skirt and black boots. Throughout the video, the two wrestlers grapple, perform various holds and takedowns, and at one point they struggle near the ropes. Eventually, the wrestler in the silver clothing is pinned on the floor of the ring, and the referee raises the other wrestler's arm. After that, both wrestlers can be seen recovering, and then wrestling some more, using a variety of moves on each other before the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "Who won at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The woman in a silver outfit.",
+ "B. The woman in a blue outfit.",
+ "C. It was a draw."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the occupation of the man wearing the black striped shirt in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Cameraman.",
+ "B. Spectator.",
+ "C. Commentator.",
+ "D. Wrestler."
+ ]
+ }
+ },
+ "SZHDrkgA": {
+ "video_id": "SZHDrkgA",
+ "video_duration": "232s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A piece of graph paper is shown with the title \"Two Point Perspective\" and the subtitle \"Cube.\" A horizontal line is drawn across the page below the title. Two small vertical marks are drawn on this horizontal line, one near the left edge and one near the right edge of the paper. Below this line, another parallel horizontal line is drawn. A small vertical mark is made on the graph paper, centered between the two previous vertical marks and below the lower horizontal line. Two diagonal lines are drawn from this vertical mark up to the vertical marks on the upper horizontal line. A vertical line is then drawn downward from the centered vertical mark to the bottom horizontal line. A ruler is placed along one of the diagonal lines and a new line is drawn along the ruler from the upper vertical mark to a point on the lower diagonal line. The ruler is repositioned along the other diagonal line and another line is drawn from the other upper vertical mark to a point on the lower diagonal line. The ruler is now placed along the bottom horizontal line and two vertical lines are drawn, one extending upward from each of the two intersection points created in the previous step. These vertical lines terminate where they intersect the two upper diagonal lines. The ruler is now placed horizontally at the highest intersection point on the left of the drawing and a line is drawn to the highest intersection point on the right. The same is done with the second highest pair of intersection points, completing the cube.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence for drawing a two-point perspective view, as presented in the video? (a) Establish the true height lines, (b) determine the origin and vanishing points, (c) set the ground line and horizon, (d) connect the height lines to the vanishing points, (e) determine the back edges of objects, (f) connect the back edges to the vanishing points.",
+ "answer": "A",
+ "candidates": [
+ "A. (c)(b)(a)(d)(e)(f).",
+ "B. (b)(a)(a)(d)(e)(f).",
+ "C. (a)(b)(c)(d)(e)(f).",
+ "D. (c)(a)(b)(d)(e)(f)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How does the lecturer determine the cube's position in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The cube's position depends solely on the horizon setting.",
+ "B. The cube's position depends on both the vanishing point and origin settings.",
+ "C. The cube's position depends solely on the vanishing point setting.",
+ "D. The cube's position depends solely on the origin setting."
+ ]
+ }
+ },
+ "NUMkSFya": {
+ "video_id": "NUMkSFya",
+ "video_duration": "111s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Astronomy",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a dark screen. Then a circular black area is shown in the center of the screen with a grey-white luminous background. The black circle remains in the center of the screen surrounded by grey-white light which changes its shape. The camera then shifts to an animation of the earth, moon, and sun with a purple line indicating an eclipse. The camera then shifts to a map of the Earth, focused on North America, with purple shaded areas, a timeline marker at the bottom, and red dots marking cities. The camera then shifts to an illustration of people dancing in an outdoor area with a can over the sun. Then the camera moves to a cartoon of a male character with messy red hair looking frustrated, with text to his left. The next scene is an image of the moon and sun animated with faces. The camera then moves to a map focused on Texas and surrounding states with a purple line showing the path of an eclipse with times and locations written on it. The map then moves to cover more states of the Midwest and east coast, with the timeline changing with the locations. The camera then shifts to an image of the cartoon sun in bed with a moon pillow. Next, the scene shifts to a yellow hardhat and red earmuffs, with the word \u201cSAFETY\u201d displayed. A close-up of a pair of human eyes is then shown followed by a collage of many different sets of human eyes. The scene then cuts to a black-and-white photo of multiple people wearing eclipse viewing glasses looking up. This is followed by a series of photos of people wearing eclipse viewing glasses in a black-and-white collage. The word \u201cINSTEAD\u201d is imposed across these images in large red letters. Next there is a tutorial showing someone cutting a hole in paper with scissors, taping a foil square over the hole, then making a tiny hole in the foil with a pin. The hands are shown holding this device over another sheet of paper where a circle of light appears, then the word \u201cvoila\u201d appears, followed by instructions on how to use it. The video then shows a mannequin head wearing sunglasses and the text \u201cNOT suitable for eclipse viewing\u201d. The video then has an arrow pointing at the head then text reading \"DON\u2019T DO\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In the video, many people are wearing glasses and looking at the sky. What appears in the sky at this time?",
+ "answer": "D",
+ "candidates": [
+ "A. Dark clouds appear.",
+ "B. Lightning flashes.",
+ "C. Rain falls.",
+ "D. A solar eclipse occurs."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual scene appears in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. People watch a solar eclipse.",
+ "B. Luxury sunglasses are displayed.",
+ "C. A pinhole camera is being made.",
+ "D. The sun and moon are sleeping at the head of a bed."
+ ]
+ }
+ },
+ "AWSgwejL": {
+ "video_id": "AWSgwejL",
+ "video_duration": "243s",
+ "duration": "4-6min",
+ "domain": "Culture & Politics",
+ "sub_category": "Historical Analysis",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video starts with a black screen. A map with a white, blue, and red flag overlaying Russia and a flag with red crosses over a red and white flag overlaying Georgia appears. The map is of the region between the Black and Caspian seas. A small flag of what appears to be Ajaria appears on the Georgian flag. The Russian flag fades away and then the Georgian flag. The map pulls back, showing more of Europe, and the Georgian flag is now near the bottom right. A map of Europe with a variety of colored flags appears. The Georgian flag appears on the map of Georgia. A white rectangle with a picture of what appears to be a group of people waving red and white flags appears over a map of Ukraine. In the picture of the people, there is a sign with a stylized M. The image of people disappears. A map of Ukraine with blue and yellow appears. The map pulls back and shows a large white star in a circle over a map of Europe, adjacent to the Ukraine and Georgia flags. The flags then transition to the flags of NATO and Romania. Then the map shows NATO, Ukraine, and Russia. The map shows the borders of Belarus, and Latvia appear along the borders between Russia, the NATO flag, Ukraine and Belarus. The map continues to show Lithuania, Estonia, and Belarus. The map then shows Ukraine, Georgia, and Russia. The map then transitions to Russia, NATO, and Ukraine. The map then shows just Georgia, and then the map pulls back further, showing all of Europe and Georgia. A blue circle appears and then transitions to a NATO flag over the map of Europe and Georgia. The NATO flag is again replaced by a map of Russia and Ukraine. The map shows Russia, Ukraine, Georgia, NATO, and then Moldova and another country next to it. The map then shows the regions in green. The map then shows a man with short brown hair wearing a black suit and blue tie. He is in front of a red and blue flag. The map transitions back to a map with Russia, Ukraine, Georgia, and the NATO symbol. The colors are slightly different. The map then slowly zooms in on each area, and different flags appear on top of the countries. The video ends with a white text box with red text that reads \"THANKS FOR WATCHING\" as well as the word \"SUBSCRIBE!\" and the name @mad_mac_official along with a drawing of 2 globes with flags on them in the colors of Georgia and a second country.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How long did the Russo-Georgian War last?",
+ "answer": "B",
+ "candidates": [
+ "A. 8 days.",
+ "B. 6 days.",
+ "C. 7 days.",
+ "D. 5 days."
+ ]
+ }
+ },
+ "XUeAUTXY": {
+ "video_id": "XUeAUTXY",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a drawing of a man next to a car under a sun, with three arrows pointing to a picture of the same man. Then, a hand enters the frame and draws a circle around the arrows. The drawing adds a ribbon and seal next to the picture of the man. Then, the drawing adds three curved arrows pointing down to the circle with the three straight arrows. Next, a hand enters the frame and adds a drawing of an aperture above the arrows, then a stopwatch above the arrows and the word \"ISO\" next to the picture of the man. The hand moves the stopwatch and \"ISO\" next to the circle with arrows, and exits. The video shows an aperture with a background changing from a white background to a black background. Two hands frame the aperture with their index fingers pointing up. Then, a drawing of a window with curtains appears above the aperture, then disappears. Three arrows appear pointing towards the aperture, then a single finger points to the window with curtains, then to the aperture. The two hands return, framing the window. The window changes, showing a night sky. Five arrows point toward the aperture, then the aperture is replaced with a black circle. The window shows a bright sun. The two hands enter the frame again to frame the window with their index fingers pointing up, then the two hands move to the aperture. A single arrow points to the aperture, then the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What abnormality is present at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The aperture and curtain states at the end of the video are inconsistent; one is open while the other is closed.",
+ "B. The explanation of the concept of exposure time at the end of the video is incorrect.",
+ "C. An incomplete photo appears at the end of the video.",
+ "D. An overexposed photo is displayed at the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What happens to the curtains when there is insufficient light in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. They change from open to closed.",
+ "B. The curtains are taken down.",
+ "C. They remain unchanged.",
+ "D. They change from closed to open."
+ ]
+ }
+ },
+ "grhchCFB": {
+ "video_id": "grhchCFB",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Fitness",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man wearing a blue t-shirt, blue shorts and grey sneakers is shown performing an incline chest press with a barbell in a Smith Machine. He lowers the barbell to his chest and pushes it back up. Then, the man is shown performing push-ups on a green mat, using his palms on the mat. After that, the man performs dumbbell chest flyes on an incline bench. Then he is shown again performing push-ups and a forward/reverse crawl on the mat using his palms on the mat. The video concludes with a screen displaying the text, \"TRAIN LIKE A SUPERHERO WITH THE RULK\", followed by the text \"DAY ONE\", followed by the text \"Functional Compound Movement 1\", followed by the text \"Incline Smith Chest Press: 15 reps\", \"RFT Sit Thru Push Ups: 20 reps\", \"Workout Structure:\", \"2 circuits per round\", \"Rounds: 5\", \"Rest: 1 minute between rounds\", \"Functional Compound Movement 2:\", \"Flat Dumbbell Chest Flyes: 15 reps\", \"Push Ups: 10 reps\", \"Forward/ Reverse Crawl (5m each way)\", \"Push Ups: 10 reps\", \"Workout Structure:\", \"2 circuits per round\", \"Rounds: 5\", \"Rest: 1 minute between rounds\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How many sets of exercises should be performed according to the method shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Thirty-five.",
+ "B. Twenty-five.",
+ "C. Thirty.",
+ "D. Twenty."
+ ]
+ }
+ },
+ "iXmtxHzc": {
+ "video_id": "iXmtxHzc",
+ "video_duration": "142s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen with the words \u201cRomans 8:8\u201d and \"Those who are in the flesh cannot please God.\" The text then changes to \u201cScenario #1\u201d and \u201cAngry fleshy friend & friend not in the overflow.\u201d A woman in light blue jeans and a white shirt with the words \"Believe in Miracles\" is standing by a tree. A person wearing a silver skull mask and a black hooded cloak walks toward her, holding a dark object with a cord and what appears to be a green digital display. The first woman takes the object from them, revealing it to be a bundle of tubes with the digital device attached. The person then walks behind a tree out of sight. The woman takes the object, looks up, and appears to be yelling. She walks to the right, then looks to the left. The person with the mask gives the woman a dark substance which she smears on her face. She recoils, yelling again, then walks away from the tree. The person with the mask touches a chain necklace and grabs the woman by her arms. Another woman, with blond hair and a blue and yellow plaid coat, walks into view, and the first woman is now behind her. She puts the chain necklace around the neck of the first woman. The woman in the plaid coat looks at the other woman and appears to speak with her, while the person in the mask walks around her. They give her a bundle with a digital display, which she throws on the ground. They then attempt to smear a dark substance on her face. They proceed to chain the woman's neck. The woman in the plaid coat looks sad, while the first woman turns her face to her. The person in the mask pulls her to a parking lot, then lets her go. The masked person goes to the driver's seat of a blue car with a silver emblem and gets in. The woman in white gets in on the passenger side. They drive away. The woman in the plaid coat sits on a rock, bends over, and appears to cry. The video ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What changes in emotion does the woman wearing the colorful plaid jacket experience after speaking with the two women when she approaches the woman in white in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From anger to sadness.",
+ "B. From anger to happiness.",
+ "C. From happiness to sadness.",
+ "D. From sadness to happiness."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the woman wearing a colorful checkered jacket relative to the woman in white clothing when she first appears in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. On the right side.",
+ "B. Ahead.",
+ "C. Behind.",
+ "D. On the left side."
+ ]
+ }
+ },
+ "inmzdEoM": {
+ "video_id": "inmzdEoM",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A young person with blonde hair, glasses, and a tie-dye shirt speaks into a microphone held by someone off-screen. They are surrounded by other people, including a woman with long brown hair and a tan blazer on the right. The video cuts to a man with glasses and a dark beard, standing at a podium in front of a dark blue background. He gestures with his right hand. The video cuts back to the young person with the microphone. Behind them are the woman in tan and a person wearing a white shirt. The video then returns to the man at the podium. He continues gesturing with his right hand. The background is now purple. The split screen repeats this sequence several times, alternating between the young person speaking and the man at the podium gesturing and responding. The background behind the man shifts between blue, purple, and red.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "During which part of the video does the action of \"supporting oneself on the lectern with both hands\" occur?",
+ "answer": "D",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. In the middle of the video.",
+ "C. Throughout the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "JiIMoWSj": {
+ "video_id": "JiIMoWSj",
+ "video_duration": "111s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a dark, cloudy night sky. This is followed by an animated shot of a residential street at night, with the lights on in some windows. Next, there's an animation of the outline of a woman's head in white, with red, blue, and green outlines layered around the edge. The outline transitions to an animation of a silhouette of a woman with two buns. The video shifts to a colorful, animated scene of two women; one with pink hair and a pink dress eating a cupcake, while another, upside down, with brown hair and a dark dress smiles. The same two women then appear with a background that looks like rainfall. The video goes black. The scene changes to two women in witch costumes standing on a residential street; the woman on the left has purple hair and the woman on the right has red hair. Another woman with pink hair and a white shirt with a skeleton pattern is added to the center. The video cuts to a new woman with pink hair, a black jacket, and a red scarf in the center. The original three women stand on either side of her. The same woman now stands in the center with the original three women again on either side, until she takes a bow. The focus returns to the first three women, then the woman in red hair points toward the right of the screen. There is a shot of a brown door with four panels. A woman with yellow hair, white and red clothes, and a large yellow tail is shown with her arms outstretched. The first four women are shown again in a group. A woman with pink hair and a skull on her head is shown, holding a bag of candy. She looks at the woman with red hair, who points at her. The camera zooms in on the woman with red hair, then cuts to a door. A woman with pink hair and white wrappings then appears. A woman with blonde hair appears, with a pink hat with horns. The woman in wrappings appears again, with the woman with pink hair and the skull to her right. The woman in wrappings appears again, with the woman with the pink hair. The woman with the purple hair is shown. She looks to the side. A man with red and yellow clothes steps out of a door. The first four women in witch costumes are shown again, now with the man in yellow and red. A woman with long white hair is seen, clasping and then opening her hands. Another woman with a large, pointed hat and yellow trim on her red clothes is seen. She is shown with a white background. She then looks to the side, as an explosion occurs in the background. The woman with red hair and a witch hat appears. She looks to the left, then she is seen with the same background as the explosion. The woman with purple hair is shown, and the background is on fire. A black and white static screen appears. A man's jacket is shown. The camera then cuts to a man in a red bandana and no shirt. A woman with purple hair and many purple tendrils is seen, and then is zoomed out to show her in the street. The woman's body is only shown to the waist. A shirtless man in a red bandana is seen, then a hand reaches out from behind a door. The four women from the beginning of the video are seen with the woman with purple tendrils and then with just the woman with purple tendrils. The door is shown. The door shifts to show a woman with red hair and a red bow in a dark street holding an electric guitar. She remains still, until her face is shown and then the background becomes more distorted. She returns to her original pose. A man is shown in a yellow and black outfit. He is shown gesturing with his hands. The same woman with the guitar appears and asks for candy. She says this twice, and then says it a third time while making motions with the guitar. The shot cuts to a side shot of her silhouette with the man in the yellow and black outfit behind her. The man is shown again, asking if she knows who he is. He is shown saying she's wasting space. He is shown saying she'll get no candy. The woman with the guitar asks if he did not hear her riff. She asks again. She asks a third time. A blurred shot of the woman is shown and then she is shown again with the man with red hair. Her silhouette is shown with the man in the red and yellow outfit behind her. She is shown lying horizontally and then she is shown holding a guitar and saying \"Now you owe me double!\"",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, what is the correct order in which the four women visit the five households?",
+ "answer": "C",
+ "candidates": [
+ "A. The visit begins at the house of the woman with yellow hair, then proceeds to the house of the woman with horns, followed by the house of the man with white hair, then the house of the man wearing a golden outfit, and concludes at the house of the man wearing a red headscarf.",
+ "B. The visit begins at the house of the woman with horns, then proceeds to the house of the woman with yellow hair, followed by the house of the man with white hair, then the house of the man wearing a red headscarf, and concludes at the house of the man wearing a golden outfit.",
+ "C. The visit begins at the house of the woman with yellow hair, then proceeds to the house of the woman with horns, followed by the house of the man with white hair, then the house of the man wearing a red headscarf, and concludes at the house of the man wearing a golden outfit.",
+ "D. The visit begins at the house of the woman with yellow hair, then proceeds to the house of the man with white hair, followed by the house of the woman with horns, then the house of the man wearing a red headscarf, and concludes at the house of the man wearing a golden outfit."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional change does the man wearing gold clothing display in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. He transitions from excitement to confidence.",
+ "B. He remains confident throughout.",
+ "C. He transitions from panic to confidence.",
+ "D. He transitions from confidence to panic."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many households do the four women visit in total in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. Five.",
+ "C. Three.",
+ "D. Four."
+ ]
+ }
+ },
+ "NTkgrdue": {
+ "video_id": "NTkgrdue",
+ "video_duration": "323s",
+ "duration": "4-6min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with the text \"few minutes later\" displayed on a black screen. Below this text is another line of text that reads, \"Uncle Wowo has done it again\". The scene cuts to a medium shot of a man wearing a brown and white plaid shirt. He has a confused expression as he looks around. The background reveals a clothing rack with an orange dress. Another man wearing a red shirt and green baseball cap enters the frame. He looks to his right, and then a man wearing a black shirt appears to his right. The scene shifts to show a woman wearing a white blouse and pearl earrings. She turns her head and looks down. A boy with a blue shirt and red backpack is briefly shown, with a slightly concerned expression. The camera then switches to a man wearing a white tank top and a red and black patterned cap. He is looking down and rubbing his face. The man wearing the plaid shirt is then seen again, looking bewildered, and asks a question. The woman with the white blouse responds to him and is then shown gesturing with her hands, as she speaks. She looks down and then is seen looking at the camera. The scene then returns to the men from earlier. The man in the red shirt and baseball cap stands next to the man wearing the black shirt. The woman speaks again, this time with more emphasis and a raised tone. She is then seen putting her hands together. The man wearing the red and black patterned cap is then shown again, as is the woman. The man wearing the plaid shirt asks \"what did you say is the problem?\". The man wearing the green hat says \"Principal, I am not going to take this\". The woman replies to this, looking upset. The man in the plaid shirt says \"I'm not understanding this thing\". The camera then switches back to the man with the red cap, he asks a question of the woman. She is then shown again as is the man in the plaid shirt. The man wearing the red shirt says \"Your teacher got my nephew spanked\". The plaid shirt wearing man replies to this, and the red cap wearing man responds by asking \"show me that stupid teacher, so i can eat him raw\" then he says \"calm down sir we are not wrestlers here\". The woman is then shown again, and she states that she will get to the bottom of this. The man in the plaid shirt is looking around and then says \"where is this teacher, uncle Wowo\". The man in the red and black cap asks \"what is it?\". The man in the plaid shirt says \"Can you explain to me what is really happening here?\". The man with the red and black cap responds. The woman is then shown, and says \u201cSir, little little\". The man in the red and black cap replies. The man in the plaid shirt turns his head, the man with the red and black cap then says \"the brain is just somehow\", then the man asks \u201cI dont know if he got the brain from the father\u201d. The group of people are then shown together. The man in the red and black cap says, \u201cHe\u2019s too dull\u201d. The man wearing a black shirt is then shown briefly, before the young boy from the earlier scene. The young boy is shown briefly as is the woman. The man in the red and black cap is then shown again, and he says \"the word very well and he flogged me\". The man in the plaid shirt says, \"my friend come and write this word\". The man in the red and black cap says \u201cLet me see what you asked this boy to pronounce\u201d. The man in the plaid shirt gestures with his finger. The scene then shows the man wearing the red and black cap standing in front of a white board. The camera then shifts to show the other people present, including the man in the red shirt and the man in the black shirt. The red and black cap wearing man is shown again, asking \"Sir, should I write it?\", and then he proceeds to write on the white board. The man wearing a red shirt looks on. The red cap wearing man writes \"PHOTOSYNTHESIS\" on the white board. He points to the spelling, and says \"this is it\", then \"pronounce it now\". The red cap man looks away, briefly touching his face. The man then touches the first four letters and says \"PHOTO\". He points again, at the whole word and says \u201cPHOTOSYNTHESIS\u201d. He turns to the other people in the room. The man in the plaid shirt comes up to him and loudly asks \"who employed this man to this school\". The man in the red and black cap responds. The woman then says \"Me? I am mad?\". A man wearing a black shirt says \u201cYou are MAD principal\u201d. Then, \"like teacher like principal\" while raising his voice and gesturing his hand upwards. The red and black capped man is shown again in front of the white board. The man in the black shirt is shown and he states, \"by the time I'm back to collect all my school fees, you will know me, lets go\". The plaid shirt wearing man responds to this. The red and black capped man touches his face again, and states that, \u201cSir, there is room for improvement\u201d. The man in the plaid shirt gestures and responds angrily. The red and black capped man then says \"Sorry Sir\". The screen goes black. Text appears reading \u201cShotz by Alex Hanifa Lex for MARKFILMS\u201d. The scene fades into a blurry shot of a woman in a black shirt and blue jeans, standing in a room near mannequins. The camera continues to move slowly and the room is shown again. The scene is now outside, showing the same men from before, one in black, one in red, and the young boy from the earlier scene. The man in black is gesturing and appears to be laughing, while the other two men are shown, then it ends. The final scene shows a black screen with white text that says \"Subscribe\" and a cursor points to a white bell icon on the right, and a red youtube icon on the left. Then it fades.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many Black men appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Five.",
+ "B. Two.",
+ "C. Three.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the role of the man wearing white clothes in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Teacher.",
+ "B. Parent of a student.",
+ "C. Student.",
+ "D. Principal."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the mood of the man in red at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Sad.",
+ "B. Happy.",
+ "C. Angry.",
+ "D. Discouraged."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will the man wearing black in the video do?",
+ "answer": "C",
+ "candidates": [
+ "A. He will complain to the school teacher.",
+ "B. He will reprimand the teacher.",
+ "C. All of the boy's tuition fees will be refunded.",
+ "D. He will ensure the child continues to study at the school."
+ ]
+ }
+ },
+ "kqTvbhku": {
+ "video_id": "kqTvbhku",
+ "video_duration": "92s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a white shirt with a black design and blue jeans is visible in the first few seconds. Two pots sit on a black stovetop with the front burner glowing with a purple light. Then, a hand wearing a red glove is visible near a black pot on the stovetop. Then, a person holds up a can, and opens it using a can opener. Then, another person wearing an orange shirt and a black beanie hat is in the frame. The view goes back to a pot of cooked food on the stove. Then, a person wearing a white shirt and black design is holding a yellow package of sauce. Another person appears in the frame. A person with an orange shirt is visible again, along with a person wearing a white shirt and red sleeves. Another person wearing a black shirt and blue jeans is also in the frame. A person wearing a gray tank top with a dark pattern and blue jeans and a tattoo is in the frame. This person holds up a can and opens it. This person then turns the can over. They open the can, using a can opener. This person hands the open can off to another person, then the person wearing black opens a yellow package. Then the person wearing a black shirt and blue jeans grabs a bag of sugar and a container of cobbler mix off the counter. They open the cobbler mix. The person then uses a knife to light a cigarette. The frame shows a stackable washer and dryer, and then pans towards two white doors. Then, the person wearing the black shirt is no longer in the frame and the person with an orange shirt is. The person wearing the orange shirt laughs and looks upward. The person wearing the orange shirt is visible alone.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the woman in the vest doing with a knife in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Opening the can.",
+ "B. Cutting the pork.",
+ "C. Cutting open the bag.",
+ "D. Washing the utensils."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "Who made the woman in the vest laugh?",
+ "answer": "A",
+ "candidates": [
+ "A. The man wearing a hat.",
+ "B. The man wearing glasses.",
+ "C. The man in a short-sleeved shirt.",
+ "D. The man in a vest."
+ ]
+ }
+ },
+ "hdqBPiST": {
+ "video_id": "hdqBPiST",
+ "video_duration": "462s",
+ "duration": "6-8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Art Exhibitions",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with white text on a white background, reading \"My flesh and my heart may fail, but God is the strength of my heart and my portion forever. Psalm 73:26\". The video then transitions to a shot of a bible opened to two pages with writing on them. A hand enters the shot holding a black pen and uses it to trace the lines of the text on the right page. The hand then traces the lines of a sketched image. The hand uses a paintbrush to put green and yellow paint over the entire sketched image area. Next, pink paint is applied in the shape of a heart and red paint is used to fill in a small figure at the bottom right of the page. The sketched lines of a kite and the figure at the bottom right are then outlined using the pen. A hand then adds the words \"You are the Strength of my\" in white to the drawn heart area. The word \"heart\" is then written on the bottom portion of the painted heart. The hand then removes a piece of white paper, and a clear piece of material from over the page. A hand then uses an iron on the page, then reapplies the watercolor. Finally, the camera shifts back to show the whole bible with the painted pages, and the words \"Bible Journaling Made Simple\" appear in the lower left corner, as well as two books with the same title.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the woman drawing in the Bible?",
+ "answer": "C",
+ "candidates": [
+ "A. Because she wants to express her joy to the Lord.",
+ "B. Because she wants to express dissatisfaction with the Lord.",
+ "C. Because she wants to express gratitude to the Lord.",
+ "D. Because she wants the Bible to be more aesthetically pleasing."
+ ]
+ }
+ },
+ "hKvMovoh": {
+ "video_id": "hKvMovoh",
+ "video_duration": "19s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A white Jeep Gladiator drives along a dirt path through dry brush.\u00a0 The camera follows closely behind as the Jeep pulls a small trailer. The \"Muddy Ruttz Overland\" logo is superimposed on the lower portion of the frame. The video then cuts to a different, steeper section of path. The Jeep, no longer towing the trailer, climbs the sandy incline between two low ridges of dirt and vegetation. The tires kick up small amounts of dust. The video briefly displays superimposed \u201clike and subscribe\u201d and \u201ccomment\u201d icons in the lower left corner.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the initial scene depicted in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A white off-road trailer drives by.",
+ "B. A silver SUV drives by.",
+ "C. A white truck drives by.",
+ "D. A silver trailer drives by."
+ ]
+ }
+ },
+ "eWfghuzr": {
+ "video_id": "eWfghuzr",
+ "video_duration": "103s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Astronomy",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "A white box with two black cables connected is mounted to a wooden fence. In the background there is a house and some trees. Text appears that says \"Lunatico Cloudwatcher\". A view of a computer interface is shown which has several graphical data plots displayed. The video then cuts to a dark shot with light shining on some curved parts. A door is then opened and a man wearing a yellow shirt and dark pants enters a white room. Inside this white room there is a white telescope with an orange bar mounted to it. The man adjusts wires and the telescope. The video cuts to a close up of the telescope and then back to the man. The man then appears to adjust a black object on the end of the telescope. The video then shows an electronic white box with black wires mounted on the roof inside the white room. A hand points at it. This is followed by a close up of that box, which is covered in small black insects. Another hand then points at a black electronic device. The video then returns to the man standing next to the telescope, he has changed into a black shirt. He claps his hands together and then opens them out towards the camera, and then repeats the action. He appears to be talking. The man then picks up a black device with some red coloring, and adjusts parts on it. He has changed back to a yellow shirt and then back to a black shirt again. The man is standing beside the telescope, holding his hands together in front of him. He continues to talk while moving his hands.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What is the emotional progression of this man?",
+ "answer": "B",
+ "candidates": [
+ "A. From disappointment to pride and then to disgust.",
+ "B. From pride to disgust and then to disappointment.",
+ "C. From pride to disappointment and then to disgust.",
+ "D. From disappointment to disgust and then to pride."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why are ants appearing there?",
+ "answer": "A",
+ "candidates": [
+ "A. Because ants are attracted to heat sources, and that area happens to be warm.",
+ "B. Because the man keeps ants.",
+ "C. Because the door is not closed tightly.",
+ "D. Because there is honey."
+ ]
+ }
+ },
+ "xqCwOIyA": {
+ "video_id": "xqCwOIyA",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. This is followed by static and a scene of a tall rock formation against a blue sky with clouds. The screen goes to static again. Then a dark room with two figures; one is sleeping in a bed, the other is standing near the foot of the bed. Then there is a close up of a figure with yellow and purple coloring against a background of sky and clouds. This is followed by another shot of the same figure. Then there is static again. Then a pink figure is seen in a dark room. Then the scene changes to a pink figure with rain in the background and a rocky cliff. Then there is a light interior scene with the yellow and purple figure standing in front of open windows. Then there is static. Then the rain scene is seen again with the pink figure. Then a close up of the pink figure, followed by a close up of the yellow and purple figure. There is static. Then there is a wide shot of the yellow and purple figure. This is followed by static and a shot of the pink figure in a dark stormy background with a blurry image of the yellow and purple figure overlaid. Then there is a close-up of the yellow and purple figure. Then static again and a shot of the yellow and purple figure looking at a dark figure. The next shot has the yellow and purple figure with the pink figure. The next close up is of the pink figure. The scene changes to a shot of a dark tree line against a purple sky, then static. Then there is another shot of a forest area with a purple sky. Then a shot of a purple moon over a purple landscape with static. Then the pink figure stands on a rocky cliff and a wide shot of the yellow and purple figure running with lines indicating motion against a purple sky. Then there is static. Then a pink figure looks at a white sheet of paper. Then there is a fast scene of a blue and black figure. The last few shots are close ups of the blue and black figure, a blue figure with wings, a black figure with fire coming out of its head, a blue and purple figure in a wooded area, and the yellow and purple figure running again against a purple background.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "After talking to the yellow pony, how did the black skeleton-costumed pony\u2019s mood change?",
+ "answer": "B",
+ "candidates": [
+ "A. Sadness turns to happiness.",
+ "B. Sadness turns to anger.",
+ "C. Anger turns to sadness.",
+ "D. Happiness turns to sadness."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the yellow pony and the black skull-costumed pony in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Father-son relationship.",
+ "B. Friendship.",
+ "C. Mother-daughter relationship.",
+ "D. Marriage."
+ ]
+ }
+ },
+ "MzGgXQTh": {
+ "video_id": "MzGgXQTh",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a man wearing a black t-shirt with a white triangle logo and blue jeans sitting in a doorway. The scene then changes to an outdoor setting, where the same man stands, holding a camera, in front of a backdrop. A person sits on a wooden chair in front of the backdrop, holding a small bouquet of flowers. The man takes pictures of the person in different poses. In some frames, the person is sitting on the chair, and in others, they are lying down on the floor next to the chair. Another man in an orange top holds up a white reflector panel during the photoshoot. The video displays images that have been taken and includes white text on a crumpled paper background that reads \u201c1. Understanding the direction of the light\u201d and \u201c2. Using Reflectors/ Bounce boards\u201d and then includes short clips of the man taking photographs.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the last three letters when the subtitles first appear?",
+ "answer": "D",
+ "candidates": [
+ "A. ing.",
+ "B. the.",
+ "C. rec.",
+ "D. ght."
+ ]
+ }
+ },
+ "vzBihdvg": {
+ "video_id": "vzBihdvg",
+ "video_duration": "347s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a black polo shirt and dark shorts stands in a gym, facing the camera. He holds a small grey foam roller and places it on his left tricep, near his elbow. He moves the roller up and down his tricep. He then leans against a squat rack and demonstrates another tricep exercise using the foam roller, this time placing his right leg on a blue balance pad for support. He switches legs and repeats the exercise. He then uses the foam roller on the inside of his right bicep and tricep, near the elbow, rolling up and down his inner arm. The man then picks up a green resistance band and demonstrates how to use it for tricep exercises. He loops the band around one of the squat rack posts and holds both ends in his right hand. He places his left hand on his chest for support. He performs tricep extensions with the resistance band, pulling it straight. Next, he demonstrates an exercise using the same green band attached to the post, but this time holding it with his left hand. His right hand is open, facing the camera, for demonstration purposes. He performs outward arm rotations at a 45-degree angle. He then switches to performing upward arm rotations, keeping the same 45-degree angle.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man in the video pull up the resistance band?",
+ "answer": "C",
+ "candidates": [
+ "A. Ten times.",
+ "B. Eleven times.",
+ "C. Nine times.",
+ "D. Six times."
+ ]
+ }
+ },
+ "HRdQoXAH": {
+ "video_id": "HRdQoXAH",
+ "video_duration": "90s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man in a yellow shirt and grey pants stands in a kitchen, stirring a pot on the stove with a long-handled spoon. A woman in a peach-colored outfit stands near him. The pot is silver with a lid slightly ajar, and contains a yellowish liquid. Two other containers, one smaller and one larger than the pot, contain shredded orange food. A bag of white powder is visible on the counter. The text \"BHARAT JODO YATRA\" is displayed in the top right corner. The scene transitions to a close-up of the pot on the stove. The lid is still slightly ajar, revealing the yellowish liquid inside. The scene cuts back to the kitchen, and the man leans against the counter with his arms crossed, and the woman is visible next to the stove top. The scene changes to a light-yellow background with the superimposed text \"As a mother, what's the one thing that annoys you about Rahul?\". Then, the scene returns to the kitchen, showing the woman and man in the same positions. The scene then transitions to a light-yellow background again with the superimposed text \"What is the one thing you love most about him?\". Once again, the scene returns to the kitchen, showing the woman and man in the same positions. The scene changes again to a light-yellow background with the text \"Who's the best cook in the house?\". The scene shifts back to the kitchen, the woman and man in the same positions.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What genre of music plays at the beginning of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Light music.",
+ "B. Folk music.",
+ "C. Rock music.",
+ "D. Classical music."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What is the change in the woman's emotion when the topic of \"mint\" is introduced?",
+ "answer": "D",
+ "candidates": [
+ "A. Her emotion shifts from calm to angry.",
+ "B. Her emotion shifts from happy to calm.",
+ "C. Her emotion shifts from calm to disappointed.",
+ "D. Her emotion shifts from calm to happy."
+ ]
+ }
+ },
+ "QFlhWAxH": {
+ "video_id": "QFlhWAxH",
+ "video_duration": "89s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a close-up of a woman with light brown hair speaking. She is wearing a dark jacket, a blue and orange scarf, and a black knitted cowl. In the background, there are many people, some holding signs, and the backdrop is a city street with trees and light colored buildings. The camera angle changes, showing a wide view of the crowd. Many people are holding signs, and a few are holding orange and white banners. A man near the center of the frame is raising his left arm. The scene transitions again to a group of people carrying a large banner with text. There are people holding up signs and a person holding a multi-colored flag. The camera zooms in on a few individuals, each holding signs, including a man holding a sign with text that reads, \"WRETCH OF THE EARTH\". The next scene focuses on two women with dark hair, both speaking. One is wearing a striped scarf. A graphic on screen reads, \"Collective of London Latins activists.\" The camera angle changes again to a group of people with painted faces holding signs. A graphic on screen reads, \"Max, Fabio, and Alana are fighting the PEC 215 amendment to the Brazilian constitution.\" The camera focuses on an individual with black paint covering their nose and cheeks. The video then cuts to a wider view of a crowd carrying a banner with the text, \"STILL FIGHTING COLONIALISM\". There are many people behind the banner holding up various signs. The video then cuts to a closer shot of some of the individuals behind the banner, including a woman in a black hat and a man holding a sign that says, \"CAUSES CLIMATE INJUSTICE.\" The camera then moves across the crowd showing various individuals. The video zooms in on a man holding a sign and then again transitions back to the man with black paint across his face. The man speaks to the camera as the video comes to an end.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion is conveyed by this video?",
+ "answer": "B",
+ "candidates": [
+ "A. Love and loyalty toward the country.",
+ "B. Anger and opposition to the environmental destruction caused by imperial capitalism.",
+ "C. Concern about humans causing climate change.",
+ "D. Discontent and resistance against wars initiated by powerful nations."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "This video most likely represents the indigenous people of which region?",
+ "answer": "D",
+ "candidates": [
+ "A. Indigenous people of South America.",
+ "B. Indigenous people of North America.",
+ "C. Indigenous people of Africa.",
+ "D. Indigenous people of Europe."
+ ]
+ }
+ },
+ "DPgIUYyt": {
+ "video_id": "DPgIUYyt",
+ "video_duration": "119s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with short blond hair holds a knife. He wears a dark short-sleeved shirt, dark pants, and a harness. He looks to his left. The scene changes to show the blond man standing across from a larger man wearing camouflage pants, a dark short-sleeved shirt, and a maroon beret. Both men hold knives. They begin to fight. The larger man grabs the blond man's knife and tosses it away. The larger man holds his knife to the blond man's throat. He throws the blond man to the ground. The blond man gets up and they continue to fight. They struggle over the larger man\u2019s knife. They exchange blows. The blond man cuts the larger man across the chest. The blond man spins and kicks the larger man in the head, knocking him to the ground. The larger man gets back up. The blond man is on the ground, and the larger man is standing over him holding a knife. They struggle again. The blond man is now on the ground, and he appears to be injured. A woman wearing a red dress appears behind the larger man. She is holding a gun. She points the gun at the larger man. She lowers the gun. The larger man falls to the ground. The blond man is lying on the ground, injured. The woman disappears. The blond man gets up and retrieves his knife.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What action will the man wearing the red hat take next?",
+ "answer": "B",
+ "candidates": [
+ "A. Fight alongside the woman in the red dress against the blond man.",
+ "B. Leave this place.",
+ "C. Fight with the woman in the red dress.",
+ "D. Beg for mercy from the woman in the red dress."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "During the fight, what did the man wearing the red hat kick at the blond-haired man?",
+ "answer": "D",
+ "candidates": [
+ "A. A flashbang.",
+ "B. A dagger.",
+ "C. A pistol.",
+ "D. An iron bucket."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "In this fight, what did the blond man do to the man wearing a red hat?",
+ "answer": "B",
+ "candidates": [
+ "A. The blond man scratched the face of the man wearing a red hat.",
+ "B. The blond man scratched the chest of the man wearing a red hat.",
+ "C. The blond man shot the dagger out of the man wearing a red hat's hand.",
+ "D. The blond man kicked an iron bucket at the man wearing a red hat."
+ ]
+ }
+ },
+ "SQRyZAtU": {
+ "video_id": "SQRyZAtU",
+ "video_duration": "79s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Red text on a black background reads, \"ANTIFA animals stalk and menace Conservative reporters in Portland.\" A man in a blue hoodie, baseball cap, and white-framed sunglasses stands on the sidewalk near a street. Cars drive by on the street. There are other people standing on the sidewalk and street. The man adjusts the hood of his sweatshirt. He looks around. The camera moves, and two men walk toward the man in the blue hoodie. One is wearing a dark-colored, possibly black, jacket and jeans and a newsboy-style cap. The other wears a black hooded sweatshirt and pants and has most of their face covered, possibly with a mask or bandana. Another person is partially visible, wearing a black coat and yellow scarf or bandana. The man in the blue hoodie talks to the three people. The camera moves quickly, showing a blurry image of a building and a tree. It then refocuses on the interaction between the man in the blue hoodie and the other three people, before again moving away, and showing a parking lot, street, and sidewalk. The three individuals are now walking in front of the man in the blue hoodie, as other pedestrians walk by. The man in the blue hoodie appears to be following the three individuals. Two of the people wear face coverings; the third does not. The camera refocuses, showing the three walking away from the man in the blue hoodie, as cars drive by on the street behind them. Then, the camera focuses again on a close-up of the man in the blue hoodie. He vapes and exhales. The camera focuses again on the three walking in front of the man in the blue hoodie. Cars drive by in the background. The person in all black crosses their arms. The man in the blue hoodie rejoins the three other people. Now, the person who was previously wearing a newsboy-style hat is wearing a baseball cap and has something covering his face, leaving only his eyes visible. The other two continue to stand with crossed arms.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action does the man in the green coat take towards the man in the blue coat in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Offers him water.",
+ "B. Mediates a dispute.",
+ "C. Takes a photo.",
+ "D. Engages in verbal abuse."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the last three letters of the initial subtitle displayed in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Con.",
+ "B. Men.",
+ "C. Ant.",
+ "D. And."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who says \"no no no\" in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The man in a black coat.",
+ "B. The man in a blue coat.",
+ "C. The man in a green coat.",
+ "D. The man in a brown coat."
+ ]
+ }
+ },
+ "zCXtCYUJ": {
+ "video_id": "zCXtCYUJ",
+ "video_duration": "222s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A pair of hands holds a bottle labeled \"Vitamin E Oil.\" Three carrots are placed on a white cutting board, one at a time. The hands then hold one of the carrots, followed by the hands using a knife to cut slices of the carrot. The sliced carrot pieces are put into the container of a small blender. Water is added and the blender is then used to puree the carrots. The pureed carrots are poured through a strainer into a clear bowl. The hands then add a dark powder from a tea bag into a bowl. Then, a scoop of white cream from a blue tub is added to the bowl, followed by another scoop of white cream from a glass jar. The hands then add a light colored oil and then a yellow-tinted oil from a bottle, and finally carrot juice, to the bowl. The hands stir the mixture in the bowl. Finally, the mixture is poured into a small glass and the glass is held up. Text appears which reads: \"Store in Refrigerator.\"",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "In the video, what is the difference between a high-power mixer and a mixer used for adding water?",
+ "answer": "B",
+ "candidates": [
+ "A. Mixers used for adding water have a faster stirring speed.",
+ "B. High-power mixers have a faster stirring speed.",
+ "C. Mixers used for adding water consume more electricity.",
+ "D. High-power mixers consume more electricity."
+ ]
+ }
+ },
+ "LZnrjRkE": {
+ "video_id": "LZnrjRkE",
+ "video_duration": "150s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a light brown, medium-sized dog wearing a plastic cone collar and a red harness with a black leash attached. The dog has a bandage on one of its front legs. The dog walks through a doorway held open by a person wearing navy pants. The dog then walks into a room with a blue examination table, light tan cabinets, and a white floor. A person wearing tan pants, black boots, and a red and black plaid shirt kneels and pets the dog under the table. Another person wearing blue jeans and a red and black plaid shirt also pets the dog. The dog rotates a bit while both people pet it, and then shakes its head. The cone collar moves in a rapid blur. At the end of the video, the dog is still being petted by two people.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct chronological order of events at the hospital shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. First, the dog excitedly greets the grandmother, then the grandmother strokes and comforts the dog, and finally, the dog explores the new environment.",
+ "B. First, the dog excitedly greets the grandmother, then it explores the new environment, and finally, the grandmother strokes and comforts the dog.",
+ "C. First, the grandmother strokes and comforts the dog, then the dog excitedly follows her, and finally, the dog explores the new environment.",
+ "D. First, the grandmother strokes and comforts the dog, then the dog explores the new environment, and finally, the dog excitedly follows the grandmother."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "How many years have the dogs been blind?",
+ "answer": "C",
+ "candidates": [
+ "A. 4 years.",
+ "B. 1 year.",
+ "C. 2 years.",
+ "D. 3 years."
+ ]
+ }
+ },
+ "AQWsXmzI": {
+ "video_id": "AQWsXmzI",
+ "video_duration": "53s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows an airplane moving down a runway. The plane then takes off and flies into the sky. Then the video cuts to a group of people with cameras and another group of people, one of whom is being interviewed, while a plane sits behind them. Then the video switches back to two planes taking off into the sky. Then one plane flies alone in the sky.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the airplane, which appears in the middle of the video, relative to the white-haired man wearing a suit?",
+ "answer": "C",
+ "candidates": [
+ "A. The airplane is to the left of the white-haired man in a suit.",
+ "B. The airplane is to the right of the white-haired man in a suit.",
+ "C. The airplane is behind the white-haired man in a suit.",
+ "D. The airplane is in front of the white-haired man in a suit."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Does the background sound in the video change as the airplane takes off?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes, it does.",
+ "B. It is not possible to determine.",
+ "C. None of the above.",
+ "D. No, it does not."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the white-haired man in a suit believe the Paramount project was successful?",
+ "answer": "D",
+ "candidates": [
+ "A. Because the project was straightforward.",
+ "B. Because the research and development plan was complex.",
+ "C. Because the team was led by experienced engineers.",
+ "D. Because it was a team of young engineers and technicians."
+ ]
+ }
+ },
+ "QeCZwxHr": {
+ "video_id": "QeCZwxHr",
+ "video_duration": "45s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with a beard and a black baseball cap, wearing a white shirt with a black logo, is shown at close range. A thin object protrudes from their mouth. They hold a microphone with \u201cStreet Deejay\u201d written on it. The camera pans up to show the person\u2019s surroundings, which appear to be a store. The person with the microphone approaches another person wearing a blue and white shirt. They are joined by another person wearing a black and white striped shirt. The person with the microphone holds the microphone in front of the person wearing the blue and white shirt. The two men embrace. The man with the microphone hands the microphone back to the camera person. The camera then zooms in on a different person with long hair in the background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does a person wearing an orange vest appear?",
+ "answer": "D",
+ "candidates": [
+ "A. The middle part of the video.",
+ "B. The entire video.",
+ "C. The ending part of the video.",
+ "D. The beginning part of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Who is the short, black curly-haired man in the video most likely to have gone to the supermarket with?",
+ "answer": "D",
+ "candidates": [
+ "A. He is alone.",
+ "B. His friend.",
+ "C. His parents.",
+ "D. His father."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What occurs in the middle of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A bearded man interviews an ordinary man.",
+ "B. During the interview, a short, black-haired man suddenly bursts into laughter.",
+ "C. A short, black-haired man tearfully hugs the bearded man.",
+ "D. While shopping at the supermarket, a short, black-haired man unexpectedly meets his idol."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the interviewer's emotional state change in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From friendly to curiously thankful.",
+ "B. From friendly to surprisingly happy.",
+ "C. From enthusiastic to surprisingly grateful.",
+ "D. From tense to surprisingly friendly."
+ ]
+ }
+ },
+ "UEYOIzEo": {
+ "video_id": "UEYOIzEo",
+ "video_duration": "152s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman with long blonde hair wearing a red top is centered on a screen. Then there is a split screen with this same woman on the left. The right side of the split screen shows a wooden door opening. Inside the door, there is a living room with white furniture, a Christmas tree, and white framed windows. The camera moves through this house, past a set of stairs with a red garland, through the living room with a set of beige couches and armchairs with a black throw blanket. Then the camera pans around, through the hallway and towards a dining room. The dining room includes a set of 8 white chairs with red and green decorations, and a dark wooden table with black and white patterned plates, a centerpiece, and more red and green decorations. The final split screen shows a dining table with different decorations from the previous frames on the left, and the dining room seen before on the right.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is a woman in a red shirt placing a dog statue at the restaurant entrance in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. To decorate the restaurant.",
+ "B. In memory of a deceased pet.",
+ "C. Because the antique has investment value.",
+ "D. To showcase her artistic taste."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, where is the statue of Santa Claus located relative to the cabinet?",
+ "answer": "B",
+ "candidates": [
+ "A. Directly to the right.",
+ "B. Directly above.",
+ "C. Directly in front.",
+ "D. Directly to the left."
+ ]
+ }
+ },
+ "aWWzQIuh": {
+ "video_id": "aWWzQIuh",
+ "video_duration": "491s",
+ "duration": ">8min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A screen recording of a YouTube video shows a young woman singing on a stage with a band. In the upper left corner is a smaller embedded video of a man wearing a headset and glasses reacting to the performance. The main video shows the girl, barefoot and wearing a white dress and a floral headpiece, standing center stage singing into a microphone placed on a stand. Behind her, musicians sit and play their instruments. A man in a dark suit with red details conducts the band. Other elements on the screen include social media handles in the upper left, a video title and date in the upper middle, video controls along the bottom, and a spherical object next to the conductor, partially obscuring another band member playing a brass instrument. As the video progresses, the man in the corner reacts to the performance, and images of a different woman singing appear overlaid in the bottom left and right corners. The video title in the upper middle reads \"Angelina Jordan (13) - I Have Nothing Whitney Houston Tribute - remix\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "In the video, when did the woman wearing a white dress record \"It's A Man's World\"?",
+ "answer": "C",
+ "candidates": [
+ "A. 44,363.",
+ "B. 43,998.",
+ "C. 43,267.",
+ "D. 43,632."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "In the video, how many clarinets appear in the singing video watched by the man sitting in the room?",
+ "answer": "D",
+ "candidates": [
+ "A. Four.",
+ "B. One.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "lCnCPmBf": {
+ "video_id": "lCnCPmBf",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A dark moon rises amidst gloomy clouds. The camera focuses on a circular stone platform with fire pits and palm trees in a misty jungle setting. It then transitions to a small pond surrounded by rocks and bushes, with carved pumpkins placed around it. Next, the scene shows a large, illuminated jack-o'-lantern on a pedestal, surrounded by floating candy and lollipops. Three characters appear: a blue-skinned, muscular figure, a small witch with a pumpkin shield, and a man in a top hat and coat, holding a glowing blue sword. The video then cuts to gameplay footage, showing a character moving through a Halloween-themed map, collecting candy. The next scene features a character in a lab coat with a large weapon on their back, standing on the stone platform. They shoot fire from the weapon. After that, a furry white creature appears, attacking a small, blue, hovering character. A cuckoo clock on a branch emerges, with floating gears around it. A small, winged creature flies around the clock. The man in the lab coat returns, performing an attack. He transforms into a werewolf-like creature. The final shot displays promotional artwork for the Mobile Legends Halloween event, featuring a witch, a small creature with goggles, a werewolf, and carved pumpkins.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times is candy mentioned in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Twice.",
+ "B. Four times.",
+ "C. Three times.",
+ "D. Once."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Was a cake present in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. No, there was not.",
+ "B. Yes, there was.",
+ "C. I do not know.",
+ "D. None of the above."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the creature in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Green.",
+ "B. Blue.",
+ "C. Yellow.",
+ "D. White."
+ ]
+ }
+ },
+ "MUuVsNZR": {
+ "video_id": "MUuVsNZR",
+ "video_duration": "368s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing glasses and a brown t-shirt is holding a flat, closed cardboard box with white tape on it. He places it on a light wooden table next to another, larger, closed cardboard box, also with white tape. He then opens the larger box, revealing pink bubble wrap inside. He pulls out a cardboard insert, then reaches further into the box and pulls out another flat cardboard box with white tape, which he opens with a knife. He then opens the first box again and pulls out a black box which contains a transparent cube, a red paper envelope, an instruction booklet, a small red bracelet, and a small CPU. He opens the booklet and looks at the pages. Then he opens a small black cardboard box which contains the computer graphics card he then holds to display its front and back, and finally, its edges. The video shows the graphics card resting on the table alongside its box. The person reassembles all the parts within the larger black box and closes it. He then places the assembled black box next to the original shipping boxes. The video shows the unpacked graphics card, as well as its opened box and a few smaller boxes from within, sitting on a light wooden table. The video also shows two computer monitors, a black chair and computer hardware behind the person.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What object does the man in the video use to open the box?",
+ "answer": "C",
+ "candidates": [
+ "A. A paring knife.",
+ "B. A cleaver.",
+ "C. A small knife.",
+ "D. Scissors."
+ ]
+ }
+ },
+ "TxIZAGCJ": {
+ "video_id": "TxIZAGCJ",
+ "video_duration": "76s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A group of soldiers in dark uniforms with tall white plumes on their hats march in formation, holding rifles. They are wearing face masks. The scene changes to a group of people standing in front of a red curtain. A woman in a pink dress and a man in a military uniform are in the center. The scene changes back to the soldiers marching. Then, a young woman with long light brown hair, wearing a white blazer, is shown. She smiles slightly. The scene shifts to a group of people sitting in ornate chairs. The young woman in the white blazer is seated next to a woman in a light blue dress. An empty chair with a red cushion sits between two men. A group of dancers in dark red dresses perform in a grand ballroom with chandeliers. They move gracefully across the wooden floor, their arms raised. The scene cuts back to the seated audience. The young woman in white is now looking serious. The scene changes back to the dancers, now closer to the audience, performing with expressive movements. The audience watches attentively. The scene returns to the young woman in white, and then to the woman in blue next to her. The woman in blue smiles. The young woman in white then smiles brightly. Two women, one in a pink outfit and the other in a white dress, walk out onto a balcony and descend a set of stairs in front of an elaborate red building with intricate details and statues. They walk down the stairs towards the camera. Photographers and other people are gathered at the bottom of the stairs. The two women pose for photos. The woman in pink holds a light-colored clutch purse and wears pink and purple earrings. The woman in white wears a watch. They walk down the steps, and then enter the building. They stop inside near a wall with text. The woman in pink looks around the room.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many dancers are performing at the classical music event?",
+ "answer": "B",
+ "candidates": [
+ "A. Seventeen.",
+ "B. Nineteen.",
+ "C. Fifteen.",
+ "D. Twenty-one."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Which outfits does Princess Elizabeth wear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A white suit and a pink jumpsuit.",
+ "B. A pink dress, a blue suit, and a white suit.",
+ "C. A pink dress, a blue suit, and a white dress.",
+ "D. A white suit and a pink two-piece outfit."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How do Princess Elizabeth's emotions change during the concert in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. From sadness to happiness.",
+ "B. From calmness to happiness.",
+ "C. From happiness to sadness.",
+ "D. From happiness to calmness."
+ ]
+ }
+ },
+ "MPBLgQiU": {
+ "video_id": "MPBLgQiU",
+ "video_duration": "325s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two people are in a kitchen. Various ingredients and kitchen utensils are on the counter. One person is cutting chicken pieces on a white cutting board. The other person pours oil into an instant pot. The person cutting chicken pieces then adds them to the instant pot. The person in blue then stirs the chicken. The person cutting chicken pieces adds chopped onion to the instant pot, and continues stirring. The person in blue adds rice, closes the instant pot lid, sets a timer on the digital display, and presses a button. They then reopen the instant pot lid and the person cutting chicken pieces adds a bag of cheese to the pot. They then open a can of cream of chicken soup and add it to the instant pot. They then add a package of frozen vegetables, stir the pot, and then add water from a measuring cup. They reclose the lid and press a button. They then open the lid and the video ends.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "How does the brown-haired woman in the video put the chicken into the pot?",
+ "answer": "D",
+ "candidates": [
+ "A. She grabs the chicken with both hands and throws it into the pot.",
+ "B. She holds the cutting board with both hands and then grabs the chicken with her right hand, pouring it into the pot.",
+ "C. She holds the cutting board with both hands and pours the chicken into the pot.",
+ "D. She holds the cutting board in her left hand and grabs the chicken with her right hand, placing it in the pot."
+ ]
+ }
+ },
+ "uxhWKYxG": {
+ "video_id": "uxhWKYxG",
+ "video_duration": "78s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with light skin and short brown hair wears glasses and a green and purple plaid shirt. He appears to be indoors with a blue wall behind him. There are various items displayed on the shelf behind him, including books, pictures, and a pennant. At one point a video of a large group of birds in flight is displayed, followed by more footage of the same man talking. The man makes hand gestures with his fingers at one point. Text appears on screen at one point with the text: \"(you are listening to the long-drawn cheery whistle in question.)\". The video then returns to the man speaking before ending.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times do images with densely packed black dots appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. One.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What did the man do when the sound of a bird was heard in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Looked straight ahead.",
+ "B. Looked to his left.",
+ "C. Looked to his right.",
+ "D. Looked behind."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional change does the man experience at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. From frustration to calm.",
+ "B. From calm to helplessness.",
+ "C. From helplessness to anger.",
+ "D. From calm to excitement."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What are the dense black spots in the sky in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Locusts.",
+ "B. Nuthatches.",
+ "C. Bees.",
+ "D. Flies."
+ ]
+ }
+ },
+ "SwvPQLck": {
+ "video_id": "SwvPQLck",
+ "video_duration": "121s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video opens with the Uproxx Reports logo. It then cuts to a man with glasses holding a microphone with the Uproxx logo, interviewing people at a protest. Protest signs are visible in the background, including one that reads \"In the name of Humanity, We REFUSE To Accept a Fascist America.\" Other signs say \"No Racism, Sexism, Bigotry, Xenophobia, No Trump\", \"Love Trumps Hate,\" \"Shove It!\", \"Organize, Support, Resist,\" and \"Love is up.\" A person wearing a sign on their back that reads \"Queers against\" walks by. The interviewer speaks to a blonde woman in a white top, and then to a bald man wearing sunglasses and a black shirt. A group of people holding signs march down a street; some of the signs say \"Love,\" \"Organized Resistance,\" and \"Brown and Proud.\" A screenshot of an article titled \"How President Trump Could Reshape the Supreme Court\u2014and the Country\" is shown. The video returns to the protest, where the interviewer is speaking to a young Black man wearing a white t-shirt with a red stain. The background is a wall covered with various protest signs. Some of them read \"DUMP TRUMP!\", \"No Racism, Sexism, Bigotry, Xenophobia, No Trump,\" \"LOVE,\" \"Latina Pride,\" \"Organize, Support, Resist,\" \"Refugees are welcome here,\" \"Diversity,\" \"Make America Safe Again,\" and \"Black Lives Matter\". The young man raises his arms against the wall of signs. The camera pans to the reporter, who is standing at a crosswalk as protesters march in the background. The protesters carry signs reading \"We Will Not Be Silenced,\" \"My Body My Choice,\" and \"Unity = Power.\" Three people with signs walk in the street as several police cars follow behind them. The video ends with the reporter holding the microphone.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event occurs at the beginning of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A protest march is taking place.",
+ "B. A street interview is being conducted.",
+ "C. A tour group is departing.",
+ "D. A star concert is being held."
+ ]
+ }
+ },
+ "frrdpuhE": {
+ "video_id": "frrdpuhE",
+ "video_duration": "85s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Fireworks explode above a wrestling ring. A wrestler in a red, white, and blue singlet and knee-high boots removes a wristband. A bald wrestler in a black shirt faces him. They circle each other. The wrestler in black grabs the other wrestler and slams him to the mat. He places him in a leg lock and pulls it up high. The wrestler in the singlet kicks his way out and briefly puts the black-shirted wrestler in a hold against the ropes. He kicks at his opponent\u2019s midsection. The wrestler in black reverses the momentum and slams the other wrestler to the mat. They struggle on the mat, switching positions and holds. The wrestler in the red, white, and blue singlet wraps his legs around the other wrestler\u2019s neck and then kicks him in the face several times. The wrestler in black picks up the other wrestler and slams him into the mat, then places him in another leg lock. The wrestler in the red, white, and blue singlet kicks his way out again, throws a punch, and then places the wrestler in black in an ankle lock. He stomps on the wrestler\u2019s leg repeatedly, then releases the hold and raises a hand in the air.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What technique does the man wearing a black shirt use to win at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A German leg lock.",
+ "B. A hip toss.",
+ "C. A back suplex.",
+ "D. An angle takedown."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the bell sound occur in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Twice.",
+ "B. Once.",
+ "C. Zero times.",
+ "D. Three times."
+ ]
+ }
+ },
+ "hgWdgCpO": {
+ "video_id": "hgWdgCpO",
+ "video_duration": "155s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view inside a vehicle, then the camera is angled towards the driver-side door, where a hand opens it. The view is then of a parking lot, with cars in the background. A dark car with the license plate \"Y77874\" is parked in a parking spot. Then a person with a white, red and blue jacket, black pants, and white shoes is seen walking towards the car. They get into the car. The camera then shows a close-up of the person inside the car, who is wearing sunglasses, a mask, and a black hooded sweatshirt with a logo on it. The same person can be seen walking around the front of the car from the outside. Then, the person is again shown inside the car, now adjusting the hood. Then, they are shown from the outside again. Then, the camera is angled to show the inside of the car from the front, and the person inside the car again. Then, a second person wearing a white hoodie walks into frame and speaks. They are shown again, talking and gesturing with their hands. The camera angle shifts to show the person in the driver seat again. The second person is once again shown gesturing. The person in the driver seat has a dark covering over their hood. The person in the driver's seat takes their hand off the wheel, and then adjusts their mask. The second person moves to the other side of the vehicle, as the camera angle shifts to follow them. The second person then walks around the car, as another person in a blue jacket comes into view. All three individuals are shown together, and then it shifts back to the person in the driver's seat. Then, the second individual is shown walking towards the car, and holding what appears to be a phone. Then, it shows the person in the driver's seat again. The second individual is once again seen walking towards the car with the phone. Then, the person in the blue jacket is shown standing near the car. The person in the driver's seat is shown adjusting a dial. The second person is seen with their arm crossed. The person in the driver's seat is seen in a close-up view. The second person can be seen walking around the car. Then, the person in the driver's seat can be seen in a close-up view. The second person is standing next to the car. The third person is standing nearby, then the second person is shown once more. The person in the driver's seat adjusts their mask, then the second individual is shown from the outside, and they talk with their hands by their face. The camera shows the second individual speaking again. Then the person in the driver's seat is shown adjusting the gear. The second individual gestures, then the person in the driver's seat adjusts the car controls again. Then the second person is shown, gesturing, as the individual in the driver's seat makes another adjustment. Then the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the man wearing sunglasses in the video admit to stealing the gaming console?",
+ "answer": "C",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the end of the video.",
+ "C. In the middle of the video.",
+ "D. At the beginning of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the man in the blue coat located around the car?",
+ "answer": "D",
+ "candidates": [
+ "A. On the left side of the car.",
+ "B. On the right side of the car.",
+ "C. Behind the car.",
+ "D. In front of the car."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes does the man wearing sunglasses undergo in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. From anger to tension.",
+ "B. From tension to fear.",
+ "C. From irritability to anger.",
+ "D. From fear to tension."
+ ]
+ }
+ },
+ "GtNxfwCV": {
+ "video_id": "GtNxfwCV",
+ "video_duration": "93s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A police car drives down a residential street. The camera is mounted inside the vehicle, facing forward. The officer stops the car, exits, and approaches a man on a bicycle. The man is wearing a helmet and carrying a backpack. The officer and the man speak to each other. The officer appears to be asking for identification. The cyclist produces a notepad, which he appears to read from, and then accepts a piece of paper offered by the officer. The officer returns to his vehicle. The cyclist remains standing on the sidewalk.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the cyclist's position relative to the police car in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The cyclist is behind the police car.",
+ "B. The cyclist is to the right of the police car.",
+ "C. The cyclist is to the left of the police car.",
+ "D. The cyclist is in front of the police car."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times did the police car's siren sound in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. One time.",
+ "B. Three times.",
+ "C. Zero times.",
+ "D. Two times."
+ ]
+ }
+ },
+ "rnMUzHiF": {
+ "video_id": "rnMUzHiF",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows three people in the aisle of a grocery store. One person is wearing a plaid shirt and grey pants, another is wearing a white hoodie and tan pants, and a third is wearing a blue jacket and jeans. The person in the hoodie reaches for an item on the shelf, and then another person in the blue jacket walks over to him, and gestures towards the shelf. The man in the plaid shirt walks a shopping cart past them, and they continue looking at the shelves. At the end, the man in the blue jacket picks up a blue basket from a shopping cart.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does this video show an infant?",
+ "answer": "D",
+ "candidates": [
+ "A. It is not clear.",
+ "B. Yes, it does.",
+ "C. None of the above.",
+ "D. No, it does not."
+ ]
+ }
+ },
+ "QqbUijcL": {
+ "video_id": "QqbUijcL",
+ "video_duration": "62s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with short brown hair and glasses is shown in the foreground, wearing a green shirt. They hold out their index finger towards the camera. Then, the person's head and shoulders are shown centered in the shot. They appear to be speaking. They then hold a gold-colored fishing lure in their open hand. They are then shown examining and manipulating another fishing lure. They then speak to the camera again. Then, their hands are shown manipulating a final lure with a white body and a gold spinner. They speak again, and then hold the lure in front of their open palm. The person is then shown speaking to the camera. Finally, an aerial shot of a lake with a small tree-covered island in the center of it appears. White text appears on screen that reads \"FOLLOW US LIKE & SUBSCRIBE\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the man hold a silver blade at any point in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It is not clear.",
+ "B. None of the above.",
+ "C. No, he does not.",
+ "D. Yes, he does."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man in the video demonstrate the bait?",
+ "answer": "A",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. One.",
+ "D. Three."
+ ]
+ }
+ },
+ "YcixhURC": {
+ "video_id": "YcixhURC",
+ "video_duration": "320s",
+ "duration": "4-6min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with tattoos and a black baseball cap is shown in a small picture-in-picture window in the bottom-left corner. The main part of the screen displays a TV studio, bathed in blue and pink light. The word \u201cOPINI\u201d appears briefly on the screen, along with Indonesian text. A four-person band appears in the studio: a bassist, two guitarists, and a drummer. All four band members are wearing hijabs and dark clothing. The camera focuses briefly on each musician. The tattooed man in the picture-in-picture window watches the band with interest, nodding his head slightly. A small dancing Super Mario figure is superimposed on the screen in front of the drummer. The tattooed man raises his fist in the air, mirroring a pose struck by the guitarist. The main guitarist and singer approaches the microphone and begins singing \u201cGuerilla Radio\u201d by Rage Against the Machine. The band plays the song in its entirety. Throughout the performance, the tattooed man watches intently, sometimes making gestures and facial expressions in reaction to the music. At one point, the face of Shaquille O\u2019Neal eating is superimposed over the tattooed man\u2019s head. Images of the band performing elsewhere flash on the screen before the video cuts back to the studio and the band\u2019s performance.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "According to the video, which band member does the man consider to be the most talented?",
+ "answer": "D",
+ "candidates": [
+ "A. The drummer.",
+ "B. The guitarist.",
+ "C. The lead singer.",
+ "D. The bassist."
+ ]
+ }
+ },
+ "OfRKevJN": {
+ "video_id": "OfRKevJN",
+ "video_duration": "130s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a stylized cartoon shot of an outdoor street scene, showing a red building and an off-white building. A white, puffy cloud is in the light blue sky. The words \"THE DOMAIN\" appear in blue block letters with a logo of a grey circle with some black lines within it in between the words. The scene then changes to a shot from inside a moving car of the same outdoor street scene, but the image has slightly different coloring. The video continues as a person wearing a blue mask, a red and blue beanie, glasses, a dark blue shirt, and patterned grey and black gloves, walks down an aisle in a store. The aisle is lined with shelves containing toys. The person in the video examines many boxes and packages, picking them up, putting them down, and looking at them. The person also looks at their cellphone, examining the screen. The person holds up a package of toys and reaches into an open box of similar packages. The video then changes to a shot of a person with long blonde hair staring forward. Their face is slightly blurred. The shot changes back to the person in the store who continues to examine packages. Some of the packages have \"Halo\" written on them. The person holds a package up, pulls it out of a larger open box, and examines the back of it. The camera continues to show a variety of toy packages.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many individual mystery boxes are contained within the larger box shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Thirty-two.",
+ "B. Twenty-four.",
+ "C. Thirty.",
+ "D. Eight."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man take out his phone to select a blind box?",
+ "answer": "B",
+ "candidates": [
+ "A. To reply to a message.",
+ "B. To collect 32 blind boxes.",
+ "C. To check the price of the blind box.",
+ "D. To check what is inside the blind box."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the man in the video buy the blind box?",
+ "answer": "B",
+ "candidates": [
+ "A. Shopping mall.",
+ "B. Retail company.",
+ "C. Toy store.",
+ "D. Blind box specialty store."
+ ]
+ }
+ },
+ "vrUnLHUv": {
+ "video_id": "vrUnLHUv",
+ "video_duration": "260s",
+ "duration": "4-6min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man in an orange shirt walks toward a desk where a man in a red shirt is working on a laptop. The man in the orange shirt speaks, and the man in the red shirt responds while looking at documents. After some back and forth between the two, the man in the orange shirt walks away. The man in the red shirt continues looking at papers on the desk. The man in the orange shirt then reappears. They speak again. The man in the orange shirt then disappears again, and a woman in a multicolored shirt walks into the room and joins the men. All three people continue to talk. The man in the orange shirt is then told to submit some paperwork. He then begins to laugh. The woman in the multicolored shirt speaks, and is then congratulated by the man in the red shirt, who offers her his hand to shake. The woman in the multicolored shirt is told to say \"thank you\", and says \"thank you\" to both men. She is then told that she is the new branch manager. She is shocked by the news and the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event happened in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A woman with earrings was fired.",
+ "B. A man wearing orange was fired.",
+ "C. A man in red clothing was promoted.",
+ "D. A man in orange clothing was promoted."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the man in orange and the woman wearing earrings?",
+ "answer": "B",
+ "candidates": [
+ "A. The man in orange is the subordinate of the woman wearing earrings.",
+ "B. The man in orange is the boss of the woman wearing earrings.",
+ "C. The man in orange is the romantic partner of the woman wearing earrings.",
+ "D. The man in orange is the friend of the woman wearing earrings."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Does the audio change in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes, it becomes heavier as the mood of the woman with earrings changes.",
+ "B. Yes, it becomes heavier as the mood of the man in orange changes.",
+ "C. Yes, it becomes cheerful as the mood of the man in orange changes.",
+ "D. No, there is no change."
+ ]
+ }
+ },
+ "efYYhSuH": {
+ "video_id": "efYYhSuH",
+ "video_duration": "40s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing blue gloves is holding a round, semi-transparent plastic container covered in dark specks. A white cloth is seen in the background. The container is labeled \"Parmesan Cheese.\" The video cuts to the gloved hands holding a black container with blue and gold labeling, also with dark specks. The label is \"Crumbled Feta Cheese.\" A third round container is shown, also with dark specks, with a clear base. The gloved hands are holding the container. A fourth round container is shown being held up, also with a clear base, showing a brown substance. Then the video shows the gloved hands using a cloth to wipe down the inside of a box or cupboard. The video shifts to a view of the inside of the box, showing several dark, covered containers and bags with dark specks on them. A gloved hand holds up another round container with a clear base. The date on the bottom of the container is \"2015.\" The video cuts to a view of someone wearing a white mask and a yellow strap. Then the video shows a gloved finger pointing at dark debris in the box, as well as several plastic bags. The video shows a mesh bag containing several red and green orbs, as well as a white, round container of brown liquid. The gloved hands pull up several plastic bags containing a soupy, brown liquid. The gloved finger points at brown pieces inside. The video cuts to a view of a white surface with many small brown pieces scattered across it. Then a gloved hand picks up a square, black container, with a visible label. The label reads \"Pasteurized Cream.\"",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the condition of the maggots in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The maggots are feeding.",
+ "B. The maggots are floating in the water.",
+ "C. The maggots are being taken away.",
+ "D. The maggots are dead."
+ ]
+ }
+ },
+ "FlKUHLoY": {
+ "video_id": "FlKUHLoY",
+ "video_duration": "78s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video displays a series of rapidly changing images on the screen. In the background, there is a recurring pattern of the word \"Lolo\" on the left and right of the screen. Two pixelated characters with brown hair and red jackets are displayed at the bottom left and bottom right of the screen throughout the video. The images displayed on screen range from blocky pixelated images of a rainbow, text, a wave pattern, food items, a duck, a weasel, a clock, a computer window and cars. Some of the images have a color filter applied.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many distinct individuals appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Five.",
+ "D. Two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Did the background music in the video change?",
+ "answer": "A",
+ "candidates": [
+ "A. No, it did not.",
+ "B. It is not clear.",
+ "C. There is no music.",
+ "D. Yes, it did."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "How many scene changes occur in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Eighteen.",
+ "B. Thirteen.",
+ "C. Seventeen.",
+ "D. Fifteen."
+ ]
+ }
+ },
+ "QFKCydSV": {
+ "video_id": "QFKCydSV",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with long, dark hair holds a microphone with a \"RODE\" logo towards a second person with a beard and a green jacket. They are standing in front of green doors with a gold design and the words \"MONETARY\". A police officer in a yellow vest is visible through the doors. The man with the beard is then standing with a group of nine other people on a set of steps in front of a brick building. Some of them are wearing winter attire. Next, multiple people with cameras are shown. Many are holding large black cameras. Then, the two people from the first clip are shown speaking to each other in front of the doors again. A police officer in a yellow vest is still visible in the background through the doors. Text reading \"Russell Miller, Former Civil Rights Lawyer\" is overlaid at the bottom right. The two individuals from the first shot are shown one last time speaking to each other in front of the same doors. Then a smartphone is held in front of a group of people who are standing in front of the green doors, and are then shown exiting the doors. People are hugging each other. Next, a wide shot of a crowd is shown, and many of them are holding signs. Some of the signs have words or symbols on them. Finally, the man with a beard exits the green doors. He is holding a brown bag and a cup. He raises his right fist in the air. There is a police officer visible inside the doors.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times is the brown-skinned woman's voice heard in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Twice.",
+ "B. Once.",
+ "C. Four times.",
+ "D. Three times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the identity of the man wearing green clothes in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A journalist.",
+ "B. A teacher.",
+ "C. A lawyer.",
+ "D. A police officer."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Was the man who passed by the woman with brown skin at the beginning of the video wearing glasses?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. I am not sure.",
+ "C. No."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How did the emotions of the crowd change in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The crowd's emotion shifted from calm to happiness.",
+ "B. The crowd's emotion shifted from anger to happiness.",
+ "C. The crowd's emotion shifted from happiness to calm.",
+ "D. The crowd's emotion shifted from calm to anger."
+ ]
+ }
+ },
+ "bOjFxYrX": {
+ "video_id": "bOjFxYrX",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. Then a view of two planets in space appears. The smaller planet, a light brown, orbits the larger planet, which has stripes in light browns and yellows. Text appears in white on the left of the screen reading, \"EU,\" then a new screen shows text reading, \"EUROPA: MOON OF J.\" The text changes again to \"EUROPA: MOON OF JUPITER.\" Then, the smaller planet is shown alone against the black background, with a network of fine lines all over its surface. Next, a man with long brown hair and a light blue shirt appears. Text appears beneath him: \"ROBERT PAPPALARDO EUROPA MISSION PROJECT SCIENTIST.\" Then, the planet Earth is shown against a starry background. After this, the video shows a school of fish swimming in dark blue water. Next, a cross-section view of the interior of the large planet with the smaller planet shows up, illustrating a molten core, a rocky layer, and an ocean beneath the icy surface. Following this, a woman with shoulder length brown hair and eyeglasses appears. She is wearing a navy blue sweater with a white and yellow collared shirt underneath. Text appears beneath her: \"CLAUDIA ALEXANDER GALILEO MISSION PROJECT MANAGER (1999-2003).\" Then, the video shows a close-up of the surface of the small planet, with the intricate patterns on its surface. Next, the woman in the blue sweater appears again, and a translucent image of the surface of the small planet is superimposed over her face. Then, it shows a sequence of views on the surface of the small planet. Finally, the words \"ALIEN OCEAN NASA'S MISSION TO EUROPA\" appear on the screen in white over a starry background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is there any sound of a drum being played in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. I am not sure.",
+ "C. Yes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is Europa's position relative to Jupiter in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Above Jupiter.",
+ "B. Below Jupiter.",
+ "C. In front of Jupiter.",
+ "D. Behind Jupiter."
+ ]
+ }
+ },
+ "BitGCGWd": {
+ "video_id": "BitGCGWd",
+ "video_duration": "594s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a red baseball cap and a camouflage jacket is shown. Then, the same person is shown wearing a black hoodie with gold letters. A person wearing a white top and jeans is shown. Then, the same person is shown with their hands covering their face. The person with the black hoodie is shown speaking to the camera, while another person moves around in the background. Then, the person in the black hoodie is shown gesturing with their hands while looking at the camera and a laptop. In the bottom right corner of the screen, a video is shown of two people hugging. Then, a different video is shown with a group of people. Then, a video is shown of one person and then two, and then three and then four. The video shows many different clips of different videos. Then, the person wearing the black hoodie is shown looking into a black screen where text is flashing. Then, the person wearing the black hoodie is shown watching the video with the black screen, and they are shown again looking into the black screen. At the end, the person is shown waving.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the prank video, when the man in white opened the door, where was the woman positioned in relation to the man without a shirt?",
+ "answer": "D",
+ "candidates": [
+ "A. The woman was lying to the right of the man without a shirt.",
+ "B. The woman was lying in front of the man without a shirt.",
+ "C. The woman was lying to the left of the man without a shirt.",
+ "D. The woman was lying on top of the man without a shirt."
+ ]
+ }
+ },
+ "TGIlcJSQ": {
+ "video_id": "TGIlcJSQ",
+ "video_duration": "89s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows several skincare products on a white surface. These include a brown glass bottle with a black dropper top, a small white jar, a frosted white bottle with a silver top, and a tall cylindrical container made of bamboo. The video then shows a person wearing a pink satin robe. The person applies a clear substance to their face, rubbing their fingers over their skin. The person then appears to adjust their robe. The person displays their hands, and then wipes their left cheek with a small, flat object. The video then shows a water faucet turning on and water flowing into a white sink. A hand reaches under the running water. The person is then shown wiping a white substance onto their face. The person is again shown from the neck up, applying the substance. The video then shows the white bottle being held, and liquid being released into a hand. The person then rubs their hands together to mix the liquid and white powder. The person reapplies the paste to their face and neck. The video then ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the purpose of the facial mask shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It cleanses the skin.",
+ "B. It minimizes pores.",
+ "C. It moisturizes and softens the skin.",
+ "D. It whitens and brightens the skin."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct skincare routine followed by the black woman with long curly hair in the video? (a) Wash face (b) Apply mask (c) Use charcoal powder (d) Apply facial oil (e) Apply eye massage cream",
+ "answer": "B",
+ "candidates": [
+ "A. (a) (d) (c) (b) (e).",
+ "B. (a) (b) (c) (d) (e).",
+ "C. (a) (c) (b) (d) (e).",
+ "D. (e) (a) (d) (c) (b)."
+ ]
+ }
+ },
+ "wbMOiZGx": {
+ "video_id": "wbMOiZGx",
+ "video_duration": "232s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man sits at a desk with a bong, a bottle of Sweet Baby Ray's barbecue sauce, a small glass container, a lighter, and other miscellaneous items. He uses the lighter to heat the bowl of the bong, which is partially filled with barbecue sauce. The inset video shows a closeup of the barbecue sauce being poured into another glass container. He inhales from the bong, pulling the barbecue sauce vapor into the chamber. The inset video switches to slow motion as he inhales and white vapor fills the bong. The man removes the lighter and inhales the remaining vapor. He exhales, then coughs and holds his chest. He grabs a tissue and wipes his nose and mouth. He coughs and blows his nose into the tissue, then gives two thumbs down. A red circle with a line through it is superimposed over an image of the barbecue sauce bottle. The man coughs and blows his nose several more times, while occasionally waving his hand dismissively. He wipes his face with the tissue.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man use an alcohol lamp to light his tobacco pipe?",
+ "answer": "A",
+ "candidates": [
+ "A. Four.",
+ "B. Five.",
+ "C. Three.",
+ "D. Two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What are the main instruments featured in the background music of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Violin.",
+ "B. Guitar.",
+ "C. Drums.",
+ "D. Piano."
+ ]
+ }
+ },
+ "ZXrOYGZx": {
+ "video_id": "ZXrOYGZx",
+ "video_duration": "174s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a shot of a two-story white house with black shutters. There's a basketball hoop attached to the side of the house near the front door. Trees and shrubs surround the house. The video then cuts to a blurry image and back to the house, seemingly rewinding slightly. The video stays on the house for a few more moments before cutting to a close-up shot of a woman with shoulder-length brown hair. She appears to be speaking, although there is no audio. The background is a stone wall. Briefly, the video distorts, creating a wavy visual effect on the woman's face. The shot returns to normal, and the woman continues to appear as if she is speaking. After a few seconds, the video cuts to blurry horizontal lines and then to a hole in the ground. The hole is roughly rectangular, with loose dirt and rocks around the edges. The hole appears to be several feet deep. The camera angle changes to show the inside of a dimly lit basement. A metal ladder is leaning against a round concrete structure. An orange rope is dangling into a hole at the base of the structure. The lower portion of a person wearing khaki shorts and brown work boots is visible. The video lingers on this shot before cutting to blurry, angled lines. Then, a wooden ladder is shown descending into a hole in a concrete floor. The hole is surrounded by crumbling concrete and dirt. The bottom of the hole is not clearly visible. The video cuts to a view through a hole broken in a concrete ceiling. The torso, arms, and hands of a person wearing a blue and white patterned shirt are visible as they climb through the hole. Lastly, the video shows a dimly lit section of a stone wall within a confined space. The wall is rough and uneven, and there's a darker area at the bottom. The camera pans slowly to the right, then up, revealing more of the stone wall and a small, protruding, rectangular stone or brick in a corner. The video ends with a black screen showing a blue thumbs-up icon, a blue arrow, and the words \"LIKE & SHARE\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What action will Alexandra Poulos take next?",
+ "answer": "C",
+ "candidates": [
+ "A. She will not provide guided tours of the secret passage.",
+ "B. She will falsify the story of the secret passage.",
+ "C. She will not conceal the entrance to the secret passage.",
+ "D. She will conceal the entrance to the secret passage."
+ ]
+ }
+ },
+ "OQBWWJjv": {
+ "video_id": "OQBWWJjv",
+ "video_duration": "72s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view through two large windows into a partially constructed room. A ceiling fan hangs from the ceiling and some construction materials are lying on the floor. The view pans slowly to the right, revealing a chandelier in the room. A woman then appears in the video, wearing a white blouse and jeans, standing in a small, undecorated room with a door and wire shelving. She gestures with her hands and points. The woman opens a closet door. She points again, then touches her cheek, then points with one finger. The woman shows the room with her arms outstretched, turns, and walks toward a different doorway. She enters a room with wooden shelves and walks to a kitchen area, where there are wooden countertops, white cabinets, a large sink, a black faucet, and a stainless steel oven. She walks around the kitchen touching different elements.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the small pendant lamp relative to the black fan?",
+ "answer": "A",
+ "candidates": [
+ "A. It is in front of the black fan.",
+ "B. It is to the left of the black fan.",
+ "C. It is behind the black fan.",
+ "D. It is to the right of the black fan."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Does the volume of the woman wearing white change at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. No, there is no change in her volume.",
+ "B. I am unsure.",
+ "C. None of the above.",
+ "D. Yes, her speaking volume increases as she moves closer to the camera and decreases as she moves farther away."
+ ]
+ }
+ },
+ "MPlKWYIR": {
+ "video_id": "MPlKWYIR",
+ "video_duration": "472s",
+ "duration": "6-8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Imran Khan sits in front of two flags, the Pakistani flag on the right and a flag with a white crescent moon and star on a green background to the left. He wears a light blue collared shirt, mostly covered by a dark blue vest. He speaks directly to the camera, using hand gestures. The video occasionally cuts to a split-screen view, showing Imran Khan on one side and a woman with short, light brown hair and a red blazer, presumably a CNN anchor, on the other. The anchor appears to be speaking but the audio is focused on Imran Khan. The lower third of the screen displays various text captions related to the interview's topics. These include \u201cPAKISTAN\u2019S POLITICAL TURMOIL,\u201d \u201cIMRAN KHAN CALLS FOR U.S. DIPLOMAT TO BE DISMISSED,\" \u201cKHAN CLAIMS HIS OUSTING WAS A RESULT OF U.S. INTERVENTION,\" \u201cKHAN: I HAD PERFECTLY GOOD RELATIONS WITH TRUMP ADMIN,\u201d \u201cKHAN ON UKRAINE: I DON\u2019T BELIEVE IN MILITARY SOLUTIONS,\u201d and \u201cKHAN WILL RUN AGAIN IN PAKISTAN\u2019S NEXT ELECTION.\" The CNN logo and the word \"LIVE\" are also present on the screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, who asks, \"Will you run for prime minister again in the next election?\"",
+ "answer": "D",
+ "candidates": [
+ "A. A man wearing a blue shirt and a brown vest.",
+ "B. A man wearing a white shirt and a brown vest.",
+ "C. A woman with long blonde hair.",
+ "D. A woman with short blonde hair."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the man wearing a white shirt and brown vest discuss the United States' charges against him?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the beginning of the video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "Based on the video, will the man wearing a white shirt and brown vest participate in the next prime minister election?",
+ "answer": "C",
+ "candidates": [
+ "A. It is not known.",
+ "B. This option is not provided.",
+ "C. Yes, he will.",
+ "D. No, he will not."
+ ]
+ }
+ },
+ "IoZVRUJA": {
+ "video_id": "IoZVRUJA",
+ "video_duration": "180s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with white text on a black background reading \"THE BOYFRIEND INTERVIEW\". A man wearing a white and blue shirt appears in front of a black background with two vertical lights. He speaks and waves. Three women appear sitting at a table in front of a green wall. The woman in the middle speaks, and the man then reappears. He states his name, age, and occupation. The three women are shown again, and the woman in the middle asks a question. The man reappears, and answers. He is then shown again as he answers another question from the three women. One of the women speaks and a question is posed to the man. He answers, then the women speak and one reacts negatively, saying \"Wrong\". The man answers the question again. One of the women is displeased and says, \"Irrelevant. Hot lady like me doesn\u2019t eat.\" Another woman asks him a question about a selfie, and the man answers, then the woman in the middle says \"Blah blah blah...\". A question is asked by the same woman. Another woman asks, \"Am I fat?\", and then she says, \"Monkey also can answer that question. Can you take this seriously?\". A woman then asks what to get her as a gift for her birthday. The man answers, and the woman asks, \"A present? A? Singular?\". A woman says it is difficult, and she gives a clue. The man says he still doesn't get it, and another question is posed. The man answers. One woman says \"So two present! Two!\". He then holds his face in his hands and one of the women says he is \"obviously not boyfriend material\". One of the women asks what kind of watch he has. He answers and wipes his face with a tissue. She says, \"That could make up for the last undesirable trait of his.\" He is asked a hypothetical question. He answers \"No,\". One woman states that it is a good answer. Another calls the man \"young\". The man responds that that is a lie and calls one of the women hot, which leads one of the women to pull on another, looking under her skirt. The man says, \"Upskirt! Upskirt!\". They ask another question about a girlfriend in an accident, and the man says \"I don't know\". He says, \"I will politely reject,\". The women react and ask, \"Why the f*** would you even pick up a phone call from her?\". The man starts to cry, saying, \"Because I deleted her number, that\u2019s why I didn\u2019t know who was calling.\" The women talk among themselves, then one of them says, \"Ladies, I think Mr. Eden Ang is eligible to proceed to the next round of the interview\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Is the man's watch in the video worn on his left wrist or his right wrist?",
+ "answer": "C",
+ "candidates": [
+ "A. Worn on the right wrist.",
+ "B. Not sure.",
+ "C. Worn on the left wrist."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "In the video, what is the attitude of the woman in the colorful dress towards the man?",
+ "answer": "D",
+ "candidates": [
+ "A. Questioning.",
+ "B. Indifferent.",
+ "C. Hostile.",
+ "D. Friendly."
+ ]
+ }
+ },
+ "ZJkmqQHN": {
+ "video_id": "ZJkmqQHN",
+ "video_duration": "163s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Biology",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A white paper with black text provides instructions for an endospore stain, listing the following steps: 1. Apply malachite directly to heat-fixed slide for 3-5 minutes. 2. Rinse with water. 3. Apply Safranin counter stain for 1 minute. 4. Blot dry and observe under oil immersion. A video shows a person wearing blue gloves and a white lab coat, performing the staining process. The person holds a glass microscope slide with a white label that reads \"E\" in black marker. The person dips the slide in a metal beaker of liquid, then takes it out, holding it above a metal grate. Another person wearing gloves is then seen holding a clear bottle with a white label that says \"Malachite Green\". The person pours the malachite green into the metal beaker above the grate. A slide is then placed on the grate. A hand holds a clear glass bottle with a white label that says \"Malachite Green\". Next, a person in blue gloves adds more of the green liquid on top of the slide, placed above the metal grate in the beaker, then sets the bottle down. The hand then removes the slide from the grate and places it onto a white paper towel, pressing down. Then, a white piece of paper with black text appears, which reads: \"Heat over water bath for 3 - 5 minutes. Do not allow stain on slide to dry out. Apply more malachite green if necessary.\" A gloved hand applies malachite green to a slide sitting on a metal grate. The slide is blotted dry with a piece of white paper towel, then, a person in blue gloves rinses the slide under a stream of running water into a metal basin, then moves to another station. A person's hand is seen holding a brown bottle with a white label that reads \"Safranin\". Then, the person uses a dropper and places the pink liquid from the bottle onto the slide resting in a metal tray. A text card appears that says, \"Wait one minute\". The slide is then rinsed with water, and blotted dry.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video is the action of applying malachite green dye to the slide performed?",
+ "answer": "A",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. Throughout the entire video.",
+ "C. At the end of the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "zNdYoLzH": {
+ "video_id": "zNdYoLzH",
+ "video_duration": "85s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a split screen with two individuals. On the left is a man with a light colored turban, wearing a dark outfit with vertical gold stripes. On the right is a monochrome image of another man in a white shirt. A CNN logo appears in the bottom right corner with the words \"TAIPEI A. 131.62\" below it. The bottom of the screen also shows a red newsbar with white text and a logo on the far left. The scene changes to show a man in a black long sleeve shirt in front of a large, abstract tan and brown painting, with another man in black behind him. The man in black touches the painting. The video then shows the man in black speaking while two paintings can be seen behind him. The scene changes again, now showing the man in black walking to the side of the frame in what appears to be an art gallery. The frame is briefly obscured by his shoulder, before showing a different man wearing a suit walking in a space with boxes. A still image is then shown with the monochrome image of the man in the white shirt on the left and the logo seen before in the video on the right, along with a line of text that says \u201cDR RASHID TAFIK Gaddafi Foundation\u201d. The video then shows three men walking in a line followed by others. The middle man shakes the hand of a man in dark attire. They walk side by side in front of a crowd. The video cuts to a group of people ascending airplane stairs, then it cuts to a group of people boarding a helicopter. Then, another group of people are seen in front of camera holding cameras up, with one camera pointing at the frame. Finally, a large crowd of people holding signs that say \u201cMerseyside Council of Faiths\u201d can be seen standing outside. Then the screen zooms in on the sign. Then a small cluster of people with various religious attire are seen. Then a man in a tan jacket is seen talking in front of what appears to be a cathedral. Text appears above his name, which reads \u201cIRAQ CONFLICT Shiv Pande COUNCIL OF FAITHS\u201d.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many hostages does the Kadhafi Foundation rescue in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Four people.",
+ "B. Six people.",
+ "C. Ten people.",
+ "D. Fourteen people."
+ ]
+ }
+ },
+ "jOVDSXlG": {
+ "video_id": "jOVDSXlG",
+ "video_duration": "64s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Chemistry",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man and a woman are standing at a lab table with lab equipment. The man is wearing a light blue button-down shirt and the woman is wearing a white and grey striped top. They are performing a chemistry experiment. They pour liquid from one clear bottle into another. They then start shaking the bottle, which causes a white foam to form. They then pass the bottle back and forth. After a few passes, the woman runs away while still holding the bottle. The man then walks over to the white board and begins pointing at the writing on it.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "How does the girl react to the bag filled with gas?",
+ "answer": "C",
+ "candidates": [
+ "A. She appears sad.",
+ "B. She appears angry.",
+ "C. She appears excited.",
+ "D. She appears doubtful."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the action of mixing citric acid and baking soda in a plastic bag occur?",
+ "answer": "C",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Throughout the video.",
+ "C. At the beginning of the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "xrmeuMrq": {
+ "video_id": "xrmeuMrq",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black suit and a blue and green striped tie is shown adjusting their tie. The person is then shown looking into a mirror, adjusting their jacket and hair, and speaking. The person then exits through a doorway. The person is later shown in an elevator, smiling and waving. The person is then shown walking through a doorway. They are shown talking to a person and holding a brown folder. The person is then shown with a group of other people in an office, and at a whiteboard where there are several doodles and writings. They are seen gesturing with their hands, and are later shown speaking directly to the camera. The person is then seen standing in front of an exit door, and then seen picking up a cardboard box and walking away.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the action of closing a door occur in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. Six.",
+ "C. Four.",
+ "D. Five."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Does the video show a man in a black suit slowly squatting down?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. I am not sure.",
+ "C. Yes."
+ ]
+ }
+ },
+ "sBoStFTj": {
+ "video_id": "sBoStFTj",
+ "video_duration": "497s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with a red jacket and white stripes holds their hand in front of the camera. The video cuts to an out-of-focus scene, and then to a color test pattern. Three people are seen sitting on and around a bed in a room with light gray walls. There are two paintings on the wall. The person sitting in the middle, wearing a white top and patterned pants, laughs and points a phone towards the other people. The video cuts back to the color test pattern. The three people are seen again, and the person in the middle touches the other two. The person on the left moves up next to the person in the middle, and the person on the right moves off of the stool. The video cuts to another color test pattern, this time with static. The person in the middle stands up from the bed and grabs a black jacket, and a subscription box appears. The cursor clicks the thumbs-up icon, the bell, then the subscribe button. The person stands by the bed while using a phone, then the other two people move near her, and all three people begin to interact with each other. The video concludes with one of the people being held up by the other two. Finally, \u201cthanks for watching\u201d appears on the screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man sitting on the bench on the right side of the video lift his buttocks off the bench?",
+ "answer": "C",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. Three.",
+ "D. One."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What food does the boy wearing a red hat in the video prefer to eat when he is hungry?",
+ "answer": "B",
+ "candidates": [
+ "A. Chicken wings.",
+ "B. Noodles.",
+ "C. A hamburger.",
+ "D. Rice."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color are the girl's nails in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Red.",
+ "B. Black.",
+ "C. Green.",
+ "D. White."
+ ]
+ }
+ },
+ "ThwjrIGs": {
+ "video_id": "ThwjrIGs",
+ "video_duration": "52s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows black, finger-shaped grapes hanging from vines. A man in a light-colored shirt and dark pants walks between rows of vines with green grapes, touches the grapes, and appears to taste one. Then the video shows a view of rows of grape vines from a moving vehicle.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What are the black grapes shown in the video called?",
+ "answer": "D",
+ "candidates": [
+ "A. Jelly bean grapes.",
+ "B. Cotton candy grapes.",
+ "C. Teardrop grapes.",
+ "D. Moon drop grapes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the purpose of organizing the consumer competition shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. To implement a promotional campaign.",
+ "B. To promote a new variety.",
+ "C. To increase brand awareness.",
+ "D. To name a new variety."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Which type of grape from the vineyard is the most delicious?",
+ "answer": "B",
+ "candidates": [
+ "A. Cotton Candy Grapes.",
+ "B. Moonlight Grapes.",
+ "C. Flavorful Grapes.",
+ "D. Spicy Grapes."
+ ]
+ }
+ },
+ "EOqEbrpQ": {
+ "video_id": "EOqEbrpQ",
+ "video_duration": "87s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video starts with an aerial view of a town with brick buildings, steeples, and trees. A man in a grey shirt with red sleeves, wearing glasses, taps a rocky surface with a hammer. A small brown piece of rock falls onto the man's hand. A man's hands hold a larger rock with multiple colors. Then, two people wearing hardhats, one white and one red, walk down a path and into a tunnel. Inside the tunnel, the man in the red hardhat points at a rock in the wall and touches it with a tool. Then, a close-up shows a dark grey rock in the wall being touched by a finger. Next is an aerial shot showing a river next to a forest. A black and white image shows five people sitting in a cave. In another part of the cave a person is kneeling with a light next to some mining tools and a wheelbarrow. A black and white image shows two buildings with a horse, people, and chimneys. A similar black and white image shows a building with a tall chimney, a wooden bridge, and a dirt road. Then the video shows the interior of a cave with rails, metal buckets, and a stairway. Two people wearing hardhats walk through a cave holding flashlights. They are still walking and then a woman is shown in a red dress. Next, the man in the red hardhat is looking toward a dark tunnel. He then faces the camera. A black and white scene shows a man working at a workbench in a large factory, with many guns. Another black and white scene shows three men working at machines in a dark factory. Then, a black and white shot shows three men next to a doorway. Next, a small metal wheelbarrow filled with rocks is shown. Then there is a close-up of many bullets on a grey surface. The video shows another aerial view of a town, and a water tower labeled \"Galena.\" The man in the red hardhat is shown looking into the shadows of a dark space. The man appears to be talking. A woman holds a grey rock and then, a man holds a rock in his hands with shiny metallic pieces.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man in the video go to Galena, Illinois?",
+ "answer": "A",
+ "candidates": [
+ "A. To collect lead ore from Galena.",
+ "B. To visit a zinc mine.",
+ "C. To explore a silver mine.",
+ "D. To travel for leisure."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a bullet in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "LVSoqxLS": {
+ "video_id": "LVSoqxLS",
+ "video_duration": "130s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A black and white video shows a group of people swing dancing. The background changes color periodically. A man in a suit dances in the center of a circle of other dancers. He performs a variety of swing dance moves, including kicks, spins, and jumps. He is later joined by a female partner and they perform several steps before the woman takes center stage and dances solo with various energetic swing moves. A second man wearing suspenders then enters the center of the circle and begins to dance solo, performing various energetic steps and arm movements. Several other individuals also take turns dancing solo in the center of the circle before returning to the larger circle of dancers who are still clapping and keeping the rhythm.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What instrument was featured in the audio?",
+ "answer": "D",
+ "candidates": [
+ "A. Piano.",
+ "B. Bianzhong (chime bells).",
+ "C. Suona.",
+ "D. Drum set."
+ ]
+ }
+ },
+ "dTTMOntX": {
+ "video_id": "dTTMOntX",
+ "video_duration": "384s",
+ "duration": "6-8min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two shirtless men wearing boxing gloves and boxing shorts are in a boxing ring with blue canvas and surrounded by ropes. The man on the left side of the screen is wearing red and black boxing shorts with the word \"RICOR\" written on it and red gloves. The man on the right side of the screen is wearing white shorts with a red trim and red gloves, he has tattoos visible on his arms. A referee in a blue shirt and black pants is walking around the edges of the ring. There is a large crowd visible around the ring. A timer is visible on the bottom of the screen that displays the current round and the time remaining. The two men exchange punches throughout the video until one man falls to his knees. The referee then signals the end of the match and the two men touch gloves. Some of the people on the side of the ring begin to enter and join the men. A top down view of the ring is shown with the crowd visible. The timer on the screen changes to display 5.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In the video, at what point does the referee first use their hand to separate the bald boxer wearing white shorts from the boxer with short dark hair wearing red shorts?",
+ "answer": "C",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. Throughout the video.",
+ "C. At the beginning of the video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What move did the bald boxer wearing white shorts use to hit the boxer with short black hair and red shorts on the head?",
+ "answer": "C",
+ "candidates": [
+ "A. A straight punch.",
+ "B. A right hook.",
+ "C. A left hook.",
+ "D. An uppercut."
+ ]
+ }
+ },
+ "kCMTuNVH": {
+ "video_id": "kCMTuNVH",
+ "video_duration": "57s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a red and white striped long sleeve shirt is handing an object to another person wearing a dark colored jacket. The person in the red and white shirt is also holding a cup and bag. The scene shifts and shows the person in the red and white shirt looking around, with a red ribbon in their hair. Then there is a shot of a cloudy sky, before the scene shifts to a view of a house. There is text that reads \"Addie's Childhood Home\". Inside, a group of people are sitting at a long table, some of whom are wearing red and white striped clothing. There are candles lit on the table. The same scene is shown from a different angle and there are emoji faces floating near some of the people. Then the scene changes and many of the people are shown standing in a circle, looking down at their feet. Their feet are wearing different colored socks. Then the scene goes back to the dining table and several plates with decorated cookies are visible. The video then shows a woman with curly hair holding a small penguin figurine, and the scene cuts back to people sitting at the table, decorating cookies. Finally, the scene shifts to a room with a Christmas tree, and multiple people are opening presents.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a person wearing blue socks visible in the middle section of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes, there is.",
+ "B. It cannot be determined.",
+ "C. No, there is not.",
+ "D. None of the above."
+ ]
+ }
+ },
+ "GKJNrORU": {
+ "video_id": "GKJNrORU",
+ "video_duration": "94s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person is sleeping in a sleeping bag next to a book, a pencil, and a pair of binoculars. The person wakes up and is startled. They sit up and use a flashlight. They shine the flashlight around their campsite, showing a sleeping person laying on the ground and many small glowing mushroom creatures. The person goes into the bushes and hides. The camera shows another sleeping person with a pinecone on their forehead. The person is hiding in the bushes. A mushroom person is on a tree stump with many mushroom creatures around him. The mushroom person opens his mouth and grins. The camera shows the person again. The person falls onto the ground. The person looks around startled and sits up. Another mushroom creature falls into a bowl. The person seems startled. The person appears to be reaching into their bag, and pulls out a salty meat stick. The person runs away with a mushroom creature on their back. The camera closes up on the mushroom creature on the person's back, before showing the person falling again. The mushroom creature appears to be jumping on the person's back. The person is shown again on the ground. A small yellow bird is shown on a branch. The person is looking around the forest and seems startled. The person is standing by a tree stump, and then moves behind the tree. The person is walking in the woods. The camera is focusing on a large mushroom creature, and the person is behind it. The mushroom creature opens it's mouth and then looks angered, before looking delighted.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct chronological order of events in the video's story? (a) The girl discovers her friend is missing. (b) The bird sings to help. (c) The fairy launches an attack. (d) The friend is trapped.",
+ "answer": "D",
+ "candidates": [
+ "A. (a), (b), (c), (d).",
+ "B. (d), (a), (b), (c).",
+ "C. (d), (a), (c), (b).",
+ "D. (a), (d), (c), (b)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What change occurs in the yellow bird's state in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Its motion changes to stillness.",
+ "B. It remains in motion.",
+ "C. It remains still.",
+ "D. Its stillness changes to motion."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the girl use a flashlight in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Zero.",
+ "B. Three.",
+ "C. One.",
+ "D. Two."
+ ]
+ }
+ },
+ "COgenLTn": {
+ "video_id": "COgenLTn",
+ "video_duration": "73s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Soccer",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a football game. In the beginning, a man wearing a headset and a blue vest is talking to players sitting on a bench. Then the view switches to players on the field. A player wearing a white jersey with the number 31 catches a football and runs with it. He is then tackled by a player wearing a purple jersey. The view switches back to the man with the headset who is on the sideline, and then to more players on the sideline. There are a lot of people in the background. Then the view switches back to players on the field. A player wearing a white jersey with the number 81 catches a football, but is then tackled by a player wearing a purple jersey. The view switches back to the players on the sideline. The final shots are of the players on the sidelines with a large crowd in the background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Which team has the advantage in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. It is a draw.",
+ "B. The White Team.",
+ "C. The Blue Team.",
+ "D. It cannot be determined."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the most common action performed by teammates on player number 31 after he scores a goal?",
+ "answer": "C",
+ "candidates": [
+ "A. High-five him.",
+ "B. Hug him.",
+ "C. Pat his cap.",
+ "D. Lift him up."
+ ]
+ }
+ },
+ "dDquedye": {
+ "video_id": "dDquedye",
+ "video_duration": "123s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows several people gathering dried grass and creating a large pile of it in the middle of a dirt road. Some of the people are wearing white shirts, some are wearing plaid shirts, and one is wearing a striped shirt. A man with a metal pot and another man holding a rounded dark object wrapped with green leaves walk through the scene. Several men carry a small tray with lit candles and hold out their hands, then begin moving and swaying with a group of people in a circle. The large pile of dry grass is set on fire. A group of white cows decorated with garlands and colorful items, followed by a group of black cows, are run through the burning pile of grass. A small white goat is paraded through the embers of the burning pile. Some people add more fuel to keep the fire going and someone pours water on parts of it. Several people can be seen gathered around watching this ritual. Fireworks are then seen going off in the distance. Finally, a decorated grey cow is pulled through a crowd of people at night. A woman in a printed orange sari is seen pouring liquid out of a silver cup over the cow's head. Many people are standing around the cow and pat it. The video ends with credits.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the woman wearing an orange shawl positioned in relation to the cow adorned with a flower garland at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. In front of the cow adorned with a flower garland.",
+ "B. Behind the cow adorned with a flower garland.",
+ "C. To the left of the cow adorned with a flower garland.",
+ "D. To the right of the cow adorned with a flower garland."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is the elderly man wearing a white shirt in the video holding a tool for drying clothes?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. Uncertain.",
+ "C. Yes."
+ ]
+ }
+ },
+ "gUiRfHwy": {
+ "video_id": "gUiRfHwy",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A digital figure with red hair and a blue shirt with a pizza design raises its arms. A caption that says \"THIS IS THE BEST DAY EVER\" appears below the figure. The figure moves its hand to the back of its head, then moves its arms to its side. The same caption still appears below the figure. Another caption appears that says \"Time to buy my dream items\" and a smiley face. The figure's face is zoomed into. Heart shapes and stars appear around the figure's face, along with a caption that says \"Woo\" and another smiley face. A screen displays various digital items, some of which are labeled with names and prices. A caption reads \"All the items looks so cool\" along with a face with teary eyes. A crown item is selected, then a black ruffled skirt. Then a screen says \"A FEW MINUTES LATER...\". The digital figure is shown on a different background. A caption reads \"I got so many stuff!\" and a smiley face. Another caption reads \"Time to make an avatar\". A screen displays the figure along with various digital items organized by categories labeled \"Head\", \"Face\", \"Neck\", \"Shoulder\", \"Front\", \"Back\", \"Waist\", and \"Gear\". The figure is standing to the right of the item options. A caption appears that says \"I love my stuff wow\" with some hearts. The figure is shown while the various menus are selected and it's appearance is changed. One item selected is black and white sunglasses, and another is a blue flower on the shoulder. Another is black and white characters at the waist. Another is orange plaid shirt. Another is white pants with black designs, and black and white shoes. The menu changes to head and the text \"Almost done!\" appears. A new face is chosen. The menu changes to right legs. The figure is shown with its completed new avatar. A caption says \"Omg wow\" and a shocked face. The figure sits down, and another figure appears to its left. The final caption says \"I love my avatar\" with some hearts.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "In the video, what is the price of the crown, in gold coins, listed in the shop's inventory?",
+ "answer": "C",
+ "candidates": [
+ "A. 40.",
+ "B. 50.",
+ "C. 70.",
+ "D. 90."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video, how many times does the virtual character's facial expression change during the dressing-up process?",
+ "answer": "C",
+ "candidates": [
+ "A. One time.",
+ "B. Four times.",
+ "C. Two times.",
+ "D. Three times."
+ ]
+ }
+ },
+ "QqgwOfPW": {
+ "video_id": "QqgwOfPW",
+ "video_duration": "45s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man with reddish, wavy hair and a beard speaks into a microphone while standing in front of two banners, one red with the word \u201cETZNER\u201d and the other light blue with \u201cWooThemes\u201d. He is wearing a gray t-shirt.The video cuts to four people sitting around a round white table, each with a laptop. Three are men and one is a woman. The men are actively discussing something while looking at the laptops.\u00a0The video cuts back to the first man with the microphone.The video cuts to a different man with a beard, wearing a gray sweater, holding a microphone.The video cuts to several people working on laptops at a large table. There is a large Apple monitor at the center of the table.\u00a0The video cuts back to the man in the gray sweater holding the microphone.The video cuts to three men working at a table with laptops.\u00a0The video cuts back to the man in the gray sweater.The video cuts to two men working on laptops in a more open area with a low wooden table between them. One man has longer hair and sits on a white mesh chair. The other is sitting on a brown couch.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the main message conveyed by the content on the poster behind the first speaker?",
+ "answer": "C",
+ "candidates": [
+ "A. It is used to enhance the atmosphere of the event.",
+ "B. It specifies the theme direction of the development.",
+ "C. It shows the name of the company developing the theme.",
+ "D. It serves a decorative purpose."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What are the roles of the two people speaking with microphones in the video during this event?",
+ "answer": "D",
+ "candidates": [
+ "A. Web developers.",
+ "B. The initiators of the event.",
+ "C. The hosts of the event.",
+ "D. Providers of technical support and advice."
+ ]
+ }
+ },
+ "mwwaXugn": {
+ "video_id": "mwwaXugn",
+ "video_duration": "27s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A video game is being played. The player is in a first-person perspective using weapons in a building, moving from room to room. A 1 VS 5 appears at the top of the screen, then changes to 1 VS 4. The player appears to be eliminated and a message appears that says \"BLUE TEAM ROUND 2 LOST.\" The perspective changes to a replay and then a view of a map with two teams labelled blue and orange, with a message that says \"ROUND 3.\" There is a screen of game options before a screen goes black. The same person is in the lower right corner of the screen throughout the entire video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the man kill his enemies using fire?",
+ "answer": "B",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. In the first half of the video.",
+ "C. In the middle of the video.",
+ "D. Throughout the entire video."
+ ]
+ }
+ },
+ "ewOrZBhz": {
+ "video_id": "ewOrZBhz",
+ "video_duration": "51s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A golden retriever is lying on a gray sofa. A small, white puppy with an orange collar is next to the larger dog. The golden retriever gently nuzzles the puppy. A person wearing a black t-shirt, grey shorts, and a black headband picks up the puppy and the golden retriever moves closer. The person holds the puppy as the golden retriever sniffs it. The golden retriever nudges the puppy, sniffing its head and body. The golden retriever walks around the person and puppy, pausing to sniff it again. The person brings the puppy closer to the dog, allowing the dog to sniff the puppy's face. The golden retriever walks away and then sniffs the puppy. It then briefly lowers its head and is hugged by the person, before going back to sniff the puppy. The golden retriever is next seen standing near the person who is holding the puppy. The dog is sniffing and nudging the puppy. The person then holds the puppy closer to the dog, who is seen nudging the puppy again. The golden retriever moves back and forth before the person and puppy. The golden retriever licks the puppy and walks away and then returns and sniffs the puppy. The golden retriever lies down and is next seen with its head lowered toward the puppy, as the puppy is being held by the person. The golden retriever is shown sniffing the puppy again, this time the puppy has a blue collar. The golden retriever moves towards the puppy and then sits. The dog is then shown with its head lowered toward the puppy again. The golden retriever stands and turns away. The dog is next shown facing the puppy held by the person.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the large golden retriever and the small golden retriever?",
+ "answer": "C",
+ "candidates": [
+ "A. They are father and son.",
+ "B. They are sisters.",
+ "C. They are brothers.",
+ "D. They are mother and son."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the golden retriever jump off the sofa?",
+ "answer": "C",
+ "candidates": [
+ "A. The beginning of the video.",
+ "B. It never jumps off the sofa.",
+ "C. The middle part of the video.",
+ "D. The latter part of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many women's voices can be heard speaking in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 3.",
+ "B. 1.",
+ "C. 2.",
+ "D. 0."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the name of Xiao Jinmao?",
+ "answer": "B",
+ "candidates": [
+ "A. I don't know.",
+ "B. Sky.",
+ "C. Oreo.",
+ "D. Blue."
+ ]
+ }
+ },
+ "MNNsuGnJ": {
+ "video_id": "MNNsuGnJ",
+ "video_duration": "29s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A first-person view shows a player holding a black airsoft gun with a teal sight, walking through a grassy field. Another player in camouflage gear walks beside them. The person in camouflage takes something from the other player\u2019s outstretched hand. It appears to be a blue and silver canister. The player holding the gun fires repeatedly, producing a cloud of smoke. The view of the field becomes obscured by smoke. The player in camouflage drops to the ground. The smoke dissipates, revealing the field. The player holding the gun fires multiple shots again. The player in camouflage remains on the ground. The view pans to show another player in black clothing, holding a similar airsoft gun. Another canister is tossed to the first player. The player fires the gun again, producing smoke. This action repeats several more times.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the depicted purpose of the hand grenade in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. To locate enemy positions.",
+ "B. To demonstrate grenade throwing techniques.",
+ "C. To instruct viewers on identifying hand grenades.",
+ "D. To illustrate the harsh realities of war."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many smoke grenades were launched from the cannon in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. One.",
+ "D. Four."
+ ]
+ }
+ },
+ "NfWbakun": {
+ "video_id": "NfWbakun",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a news segment featuring Joe Biden. He is wearing a suit and tie and speaking directly to the camera. A banner at the bottom of the screen reads \"DOMESTIC SPYING CONTROVERSY\" and identifies him as \"Sen. Joe Biden, D-Delaware.\" A cityscape is visible in the background.The video then cuts to Barack Obama standing at a podium with a Presidential seal. He's wearing a suit and light blue tie. A woman in a dark jacket with light detailing stands to his left and a man in a suit and striped tie stands to his right. An American flag is visible behind them to the left, along with dark blue curtains. Obama speaks, gesturing with his hands. He picks up the podium microphones and continues speaking.The video then cuts back to the news segment with Joe Biden, still speaking to the camera against the backdrop of the cityscape. The banner with his name and title, and the words \"Domestic Spying Controversy,\" remains on screen. He blinks, moves his head slightly, and continues speaking.The video cuts back again to Barack Obama, still speaking at the podium and gesturing. The woman and man standing beside him remain motionless.Finally, the video cuts back to Joe Biden in the news segment one last time. The visual elements remain the same: the suit and tie, the background cityscape, and the bottom-of-screen banner. He continues speaking.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What activity is depicted in the video scene?",
+ "answer": "B",
+ "candidates": [
+ "A. People are gossiping.",
+ "B. A debate is taking place.",
+ "C. An argument is occurring.",
+ "D. A speech is being delivered."
+ ]
+ }
+ },
+ "CdDlQhao": {
+ "video_id": "CdDlQhao",
+ "video_duration": "43s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A countdown appears with the number \"5\" superimposed on multiple different shots in a music video.In the first shot, a person with long hair and sunglasses is standing on a blue boat which is on a trailer near a beach. The person is wearing a yellow and black patterned jumpsuit and black gloves. There are several people behind them on the boat. Palm trees are visible on the beach. In the next shot the person in the jumpsuit is sitting on a white part of the boat holding a black object in each hand, one which looks like a small gun. There is a patterned bag and pillow nearby. The person is then laying on their side on the boat. The next shot shows a person wearing white shorts and a white shirt leaning over the side of a car with an open roof, and then the shot shifts to a woman in a car wearing a black jacket, sunglasses, and a patterned bandana.The person in the jumpsuit is then shown standing on the boat again with their arms extended, then shown holding a black object near a beach. The next shot shows two hands with one hand holding a brown paper bag. A person's legs are partially visible. Next the person with long hair is seen wearing sunglasses, then a man is shown talking to the camera with a black hat and glasses. The person in the patterned jumpsuit is then shown again, this time riding the blue and white boat on its trailer past palm trees and a beach. Next a close up of a woman with red lipstick and a red lollipop is seen sitting in a car with another person in the background, followed by a shot of the person in the jumpsuit sitting on a white boat, then hands washing a car with soapy mitts. The person in the jumpsuit is shown again on the boat, sitting with a patterned pillow and bag nearby. The car and red lollipop are shown again. Two men are shown washing a car with a hose and sponges. 
The person in the jumpsuit is shown from a low angle pulling at their long hair, then is seen from a lower angle waving their arms in the air on the boat.Next the person with long hair is seen in a close up with sunglasses, and then sitting on the boat again. A man is shown in front of a gray wall talking with black hat and glasses, then a similar shot is shown again. Next the blue and white boat is again being driven on its trailer with beach scenery and palm trees visible. The person in the jumpsuit is shown holding a blue bottle, standing between two women, one in a pink dress and one in a blue shirt with palm trees on it, the woman on the left is holding a glass. The same shot is shown again and then again. The next scene is a woman on a bicycle and another with a skateboard in front of a building with white and red accents. The person in the jumpsuit is shown holding the blue bottle between a woman in a pink dress and a man in a Santa suit. A woman with a white hat saying \"Fresh\" and a patterned shirt is shown next to a man with no shirt in front of a backdrop of trees. The two women with unique makeup shown next to a red fence.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "Which line of the lyrics does the man wearing a hat in the video find the most interesting?",
+ "answer": "C",
+ "candidates": [
+ "A. I\u2019m so high, like\u2026 an attic. I\u2019m so thrilled, addicted.",
+ "B. I\u2019m so excited\u2026 addicted. I\u2019m high up, like\u2026 an attic.",
+ "C. I\u2019m so high, the attic. I\u2019m too excited, like\u2026 a drug addict.",
+ "D. I\u2019m so thrilled, like\u2026 an addict. I\u2019m up high, in the attic."
+ ]
+ }
+ },
+ "toXKvvpd": {
+ "video_id": "toXKvvpd",
+ "video_duration": "56s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a white lab coat and a tie appears briefly in the lower right corner of the video and makes various gestures. Multiple circuit boards with metal pins are displayed on a black surface, along with a yellow device with a small screen. The video displays various close ups of these components. A finger appears, interacting with one of the circuit boards. Several animated text boxes appear with phrases such as \u201cyet eBoard MK II,\u201d \u201c2.8\u201d TFT Touch Screen,\u201d \u201cDual Wemos ESP32 Minis,\u201d \u201cOne for Deauth, One for stealing passwords,\u201d and \u201cMicro SD card for storing handshakes.\u201d At one point, the circuit board and touchscreen are connected to the yellow device and a menu is displayed with the options \u201cMarauder, WIFI, Bluetooth, General Apps, Device, and Reboot.\u201d The finger uses a stylus to interact with the various menu options. Further menus are displayed, and animated text boxes say, \u201cRidiculous WIFI Capability,\u201d \u201cYou can even draw,\u201d and \u201cBluetooth Sniffer and Card Skimmer Detection.\u201d The final text box displayed says, \u201cHack the Planet.\u201d A cartoon image of a bigfoot with a smile is displayed in the upper right corner throughout the video.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order of actions performed in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The touch screen device is picked up, the front of the screen is shown, the back of the screen is shown, the suction cup is attached, the side with the microSD card is displayed, the program is started, the wireless connection is checked, and the Bluetooth sniffer is checked.",
+ "B. The touch screen device is picked up, the front of the screen is shown, the back of the screen is shown, the side with the microSD card is displayed, the program is started, the suction cup is attached, the wireless connection is checked, and the Bluetooth sniffer is checked.",
+ "C. The touch screen device is picked up, the front of the screen is shown, the back of the screen is shown, the side with the microSD card is displayed, the suction cup is attached, the program is started, the wireless connection is checked, and the Bluetooth sniffer is checked.",
+ "D. The touch screen device is picked up, the back of the screen is shown, the front of the screen is shown, the side with the microSD card is displayed, the suction cup is attached, the program is started, the wireless connection is checked, and the Bluetooth sniffer is checked."
+ ]
+ }
+ },
+ "xDWEzlOU": {
+ "video_id": "xDWEzlOU",
+ "video_duration": "226s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Soccer",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a sports game being played on a field. Multiple players, some wearing dark red uniforms and some wearing green, are shown running around the field and holding a ball. There are also people standing along the edges of the field, appearing to be spectators. The video includes several close-up views of individual players as well as wide shots of the whole field. There is also a scoreboard at the top of the frame which says that the score is 3 to 2 with \"South\" in the lead. In the later scenes there is text overlaid on the video that says \u201cFull Time\u201d and shows the final score as \"South Sydney 3\" and \u201cWest Tigers 2\u201d. At the end of the video, players shake hands with each other.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the jersey number of the player kicking the ball in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. 14.",
+ "B. 2.",
+ "C. 1.",
+ "D. 9."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "How does the Tiger team prevent the South team from attacking in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Intercept the South team's flank breakthrough.",
+ "B. Increase physical confrontation to stop the South team's attack.",
+ "C. Cover key areas of the field to halt the South team's offense.",
+ "D. Execute a quick counterattack to relieve defensive pressure."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At which point in the video does the southern team make a mistake near the sideline?",
+ "answer": "A",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. At the beginning of the video.",
+ "C. Throughout the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "YPJUPTKJ": {
+ "video_id": "YPJUPTKJ",
+ "video_duration": "309s",
+ "duration": "4-6min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with glasses and dark hair stands at a podium with a blue front that displays the web address \u201cBERNIESANDERS.COM\u201d. She wears a grey t-shirt with \u201cBERNIE\u201d and \u201cSANDERS\u201d on it, as well as a black and white checkered scarf with red and white stripes. A light blue button is pinned to her shirt. She speaks and gestures while reading from some papers on the podium. A woman in a blue dress, wearing a round sticker, stands to the speaker's left and smiles. Behind the speaker, slightly to the right, stands a dark-skinned man in a suit. After the speaker finishes, the woman in blue claps, and then two men, a woman in a red shirt, and the suited man all walk over and embrace the speaker. The woman in blue shakes the speaker's hand. The suited man walks to the podium after the speaker steps away and also addresses the audience.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What happens when the white-haired man and blonde woman come onto the stage at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A woman wearing glasses continues the presentation.",
+ "B. The white-haired man begins to speak.",
+ "C. The blonde woman begins to speak.",
+ "D. The presentation officially concludes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What is the primary instrument featured in the video's background music?",
+ "answer": "B",
+ "candidates": [
+ "A. Flute.",
+ "B. Drum.",
+ "C. Piano.",
+ "D. Violin."
+ ]
+ }
+ },
+ "DgttEGqd": {
+ "video_id": "DgttEGqd",
+ "video_duration": "310s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with shoulder-length curly hair that has dark roots and blue-tinted ends, wears a gray sports bra and khaki colored leggings. They apply product from a purple bottle to their hair. They then begin using a grey and pink hair dryer on their hair. Then, they wrap a dark blue and gold patterned scarf around their head. After, the person takes off the scarf and uses their fingers to style the blue-tinted curly hair. They then hold up a round white container and another purple bottle, and then continue to style their hair. The video then shows a black screen with three circles, a white rectangle, a red subscribe button, white text that says, \"THANKS FOR WATCHING\" and \"New Videos Every Monday, Wednesday, & Friday @ 12 pm EST\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many different hair dye products are shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. One.",
+ "D. Three."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the purpose of the purple hair dye shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. It increases hair elasticity and shine.",
+ "B. It improves hair health.",
+ "C. It maintains hair hydration.",
+ "D. It enhances hair smoothness."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will the woman with curly hair in the video use this purple hair dye in the future?",
+ "answer": "A",
+ "candidates": [
+ "A. She will.",
+ "B. She might.",
+ "C. It is unclear.",
+ "D. She will not."
+ ]
+ }
+ },
+ "KyZqpMOd": {
+ "video_id": "KyZqpMOd",
+ "video_duration": "76s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A blonde woman leans out of a window. She climbs out and looks around. Another woman with dark hair and two pigtails leans out of the window looking in the same direction. The blonde woman is now outside and meets another woman with shoulder-length black hair, a black AC/DC hoodie, and ripped black pants. The second woman offers the blonde woman a green bottle. The woman in the AC/DC hoodie takes a swig of the bottle. She hands the bottle to the blonde woman, and their hands touch. The blonde woman takes a long drink from the bottle and smiles. She hands the bottle back to the other woman. The two women are now sitting on a couch, snuggling close. The woman with the AC/DC shirt rests her head on the blonde woman\u2019s shoulder. She lays down across the blonde woman's lap. The blonde woman is shown drinking more from the green bottle as the woman in the AC/DC hoodie stands next to her making woozy hand motions. They are now sitting on the floor. Silhouetted figures of other people are visible. The blonde woman raises the green bottle for another drink. The woman in black holds her belly as she watches. The two women are shown looking at each other and smiling. Hearts appear as they look lovingly at one another. The woman in black reaches up and adjusts her hair. The blonde woman\u2019s hand is shown reaching toward the spinning bottle on the floor. The woman in the AC/DC hoodie leans close to the blonde woman and puts one hand on her shoulder, while the other hand reaches around and touches the blonde woman\u2019s throat. They kiss, and the two silhouetted figures look on. The blonde woman places her hands over her heart and looks happy. Then, her expression changes to confusion and then to sadness.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "In a spin-the-bottle game, what did the girl with yellow hair do to the bottle?",
+ "answer": "A",
+ "candidates": [
+ "A. She spun the bottle while it was lying flat.",
+ "B. She poured alcohol into it.",
+ "C. She broke the bottle.",
+ "D. She stood the bottle upright."
+ ]
+ }
+ },
+ "cCchEFcQ": {
+ "video_id": "cCchEFcQ",
+ "video_duration": "120s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two women stand smiling in a television studio, in front of a backdrop depicting the Atlanta skyline. The woman on the left has long, reddish-brown hair and is wearing a patterned blue and maroon blouse with a pussy bow, and dark pants. The woman on the right has short, dark hair and wears a patterned blue and green dress.\u00a0The scene transitions to a photograph of four people posing against a wood-paneled wall. The woman on the far left sits on a wooden stool and wears a dark denim jacket over a dark top, with Coco Chanel written on the back of the jacket in gold script. She has dark curly hair. The man next to her wears a black and white splatter-patterned shirt and olive-green pants. The woman next to him wears a cream-colored dress and a sleeveless burgundy jacket or cape. She has reddish-orange hair. The man on the far right wears a dark long-sleeved shirt and white pants. He has a beard. A wooden ladder stands against the wall behind the group.The scene transitions to a video of four people dancing in a white room. All four wear all-white clothing. The two men wear short-sleeved shirts and pants, while the two women wear knee-length, sleeveless dresses. One woman's dress has an off-the-shoulder ruffled top. The people perform synchronized, rhythmic hip movements to music.\u00a0The scene returns to the two women in the TV studio. They smile and nod as they speak to the camera.The scene transitions to the TV studio, where the four dancers are now seated on white armchairs arranged in a semi-circle. One woman wears brown and black striped flared pants and a black button-down shirt; the other woman wears a striped brown, maroon, and orange top and shiny pink pants; one man wears a brown corduroy jacket and dark pants; the other man wears a studded black leather jacket and gold-striped pants. 
They begin to sing.The four singers continue to sing, some with closed eyes.The scene changes to show just the two women on the right, both with eyes closed, singing, before returning to the full group of four. They occasionally laugh and sway as they sing.As the song ends, the two women hosts from the beginning walk over and dance with the singers. The woman with the reddish-brown hair dances with the first man as the short-haired woman dances with the remaining three singers in a line. The four singers bend at the knees and move their hips as they dance in the line.They continue dancing until the video ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the video include instrumental music during the vocal performance?",
+ "answer": "B",
+ "candidates": [
+ "A. None of the above.",
+ "B. No, there is no instrumental music.",
+ "C. Yes, there is instrumental music.",
+ "D. It is not clear."
+ ]
+ }
+ },
+ "JqnfUUDR": {
+ "video_id": "JqnfUUDR",
+ "video_duration": "189s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a soccer game in a stadium with a large crowd. The game starts with a view of the full stadium, then transitions to a close up of the field with a goal post and 2 people. A graphic appears with the team logos: Real Madrid CF on the left and FC Barcelona on the right, and the words \u201cKONAMI Cup\u201d above. The video transitions to a top down view of the soccer field and 22 players, 11 in white, 11 in dark colors. A graphic shows the score as 0-0 for both teams. The video then shows the soccer players moving the ball around the field and several shots of the field from a closer view. A rainbow colored frame and the word \u201cFootball\u201d appears. The video shifts to replay of the soccer action with close ups of the players running, kicking, passing, and falling to the ground. At one point a player raises his hands up to his head. The video returns to the top down view of the soccer field and the score is still 0-0 for both teams. The game continues with players passing the ball, and a text appears with the words \u201cGoal Celebration\u201d and options for players to pick. The screen shifts again to show a goal being scored, and a rainbow frame and the word \"Football\" appears. The video returns to the top down view, and then another goal is scored. The text shows the score as 2-0 with Real Madrid CF in the lead. The game continues with shots of the players moving the ball down the field, and then the words \"Half Time\" and the Konami Cup logo appears over a blue screen. Lastly, the words \u201cHalf Time\u201d and the Konami Cup logo is over the stadium, with information including \u201cPossession\u201d \"Shots (On Target)\", \"Fouls\", and \u201cPasses (Successful)\u201d.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the video show Barcelona scoring a goal?",
+ "answer": "D",
+ "candidates": [
+ "A. It is not clear.",
+ "B. None of the above.",
+ "C. Yes, it does.",
+ "D. No, it does not."
+ ]
+ }
+ },
+ "hmROrkZi": {
+ "video_id": "hmROrkZi",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video opens with a courtroom scene. Two men in suits are walking toward the judge's bench. Several other people in suits are seated in the pews.The scene changes to another courtroom, this one brighter and with larger windows. Two men are standing before the judge, seemingly presenting a case. There are other people seated in the gallery behind them.Cut to close-ups of the two men arguing the case. One appears to be explaining or accusing, while the other reacts with confusion and then a look of realization.A split-screen shows the first man continuing to address the court in the top half of the screen, while in the bottom half, a woman in the gallery listens intently.The older man in the suit reappears in a close-up. The scene cuts back to the two men from the previous courtroom before the judge, who sits reviewing documents presented to him. A bailiff is visible, standing near the men.A split-screen shows the first man again addressing the court in the top half of the screen. The bottom half shows another man in a suit listening. The older man is shown again in a close-up.Finally, the judge sits at his bench, looking thoughtful, then addresses the courtroom. The two lawyers stand before him once again.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the atmosphere of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Depressed and anxious.",
+ "B. Tense and anxious.",
+ "C. Calm and surprised.",
+ "D. Confident and happy."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the low, somber music begin?",
+ "answer": "B",
+ "candidates": [
+ "A. The music plays throughout the entire video.",
+ "B. The music starts in the latter half of the video.",
+ "C. The music starts at the end of the video.",
+ "D. The music starts at the beginning of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What event is depicted in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Two lawyers are in opposition within the courtroom.",
+ "B. The lawyer wearing blue clothing engages in a heated exchange with the judge.",
+ "C. The judge delivers a verdict in court.",
+ "D. The lawyer wearing blue clothing amicably presents their viewpoint to the opposing counsel."
+ ]
+ }
+ },
+ "gQEImgyK": {
+ "video_id": "gQEImgyK",
+ "video_duration": "21s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A black car drives forward on a road, then continues into a parking lot, where it stops in the middle of an empty parking space. A man wearing a dark green shirt and tan pants walks towards the back of the car. He leans over and looks at one of the car's tail lights. Then, the scene cuts to four people in black vests with \"Police\" printed on the back walking together on a street at night. Finally, the scene cuts to two cats walking in a wooden room. One cat exits a doorway, the other walks towards the foreground.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What are the colors of the bulletproof vests and hats worn by the four police officers in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The bulletproof vests are black, and the hats are yellow.",
+ "B. The bulletproof vests are green, and the hats are black.",
+ "C. The bulletproof vests are blue, and the hats are black.",
+ "D. The bulletproof vests are black, and the hats are black."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the condition of the car that appears at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Stationary state.",
+ "B. Reversing state.",
+ "C. Moving state.",
+ "D. Bumpy condition."
+ ]
+ }
+ },
+ "rBpYoeaG": {
+ "video_id": "rBpYoeaG",
+ "video_duration": "304s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A wrestler with long dark hair, a denim jacket, and checkered knee-high socks stands in a wrestling ring. The camera pans to the audience, then to a female wrestler entering the ring. She wears a two-piece outfit with purple and white checkered patterns. She leans against the ropes as another female wrestler with darker skin enters the ring. She wears an outfit with red, blue, gold and purple patterns. They face each other briefly, circle, and lock up. The wrestler in purple takes the other wrestler down and pins her for a one-count. After the wrestlers get back on their feet, there's a series of strikes, dodges, and grapples. The wrestler in purple executes a move, knocking the other wrestler to the mat. She pins her, but the other wrestler kicks out at two. The wrestler in purple climbs the ropes and jumps onto her opponent, then goes for another pin. The other wrestler kicks out at two again. The wrestler in purple locks in a submission hold, and the referee checks on the wrestler being held. She taps out, and the wrestler in purple releases the hold. She raises her arms in victory, then walks around the ring acknowledging the audience. The video then cuts to a sequence of short clips of different wrestling matches, featuring both male and female wrestlers.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual situation involves the woman wearing black boots?",
+ "answer": "A",
+ "candidates": [
+ "A. She is overpowered by someone wearing colorful stockings.",
+ "B. She overpowers a woman wearing colorful stockings.",
+ "C. They are evenly matched.",
+ "D. The woman wearing black boots concedes."
+ ]
+ }
+ },
+ "ayCJscCe": {
+ "video_id": "ayCJscCe",
+ "video_duration": "299s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two video screens display highlights of an upcoming wrestling match. On the left screen, a wrestler wearing a zebra-striped shirt and dark pants performs various moves. On the right screen, white streaks flash across a black background as the name \"Konrad\" appears in red letters. The screens are set against a backdrop of shimmering blue lights.The screen then switches to a split-screen view of the two wrestlers. On the left is Shawn Rife, wearing a referee shirt. On the right is Konrad, shirtless and wearing black gloves. Clips of each wrestler in action play in the background.The scene transitions to a wrestling ring. Shawn Rife, wearing a black tank top and teal tights with stars, climbs into the ring. Konrad, wearing beige and black tights and black armbands, follows. They face off as a referee in a zebra-striped shirt watches.Konrad and Rife circle each other. Konrad attempts to grab Rife but Rife dodges. Konrad throws a punch, which Rife blocks. Konrad grabs Rife and lifts him for a fireman\u2019s carry, but Rife escapes. Rife then whips Konrad into the ropes. Rife then bounces off the ropes himself, but Konrad leaps over him and lands facing the audience.Rife attempts a clothesline, but Konrad ducks. Rife throws more punches, which Konrad mostly dodges or blocks. Konrad kicks Rife in the gut. Konrad whips Rife into the ropes. Konrad leaps at Rife with a knee to the stomach. Konrad lifts Rife again, this time executing a fireman's carry slam.Konrad climbs on the ropes in the corner near a member of the crowd in a blue jacket, and taunts Rife and the audience. He jumps off, and Rife catches him in a bear hug from behind and takes him down. Konrad grabs Rife\u2019s legs and escapes. Rife jumps up from the floor and kicks Konrad. Konrad staggers toward the ropes and another official in a blue jacket.Konrad kicks Rife in the stomach and whips him into the ropes. Konrad again leaps at Rife and gets him into a headlock. 
Rife escapes, throws punches, but Konrad puts him in another headlock. Rife pushes Konrad to the ropes to escape and he goes down. Konrad whips Rife to the ropes and runs towards him. Rife picks Konrad up and slams him to the mat.Rife then places Konrad in a leg lock, keeping his shoulders to the mat. The referee in stripes counts 1, 2, and Konrad is pinned. Konrad and the other referee exit the ring while Rife celebrates and taunts the crowd.The video concludes with the UWC Online logo displayed against a riveted metal background, with flames and smoke rising from it. The text \"Facebook.com/UWCGeorgetown\" appears below.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "In the video, how many attacks did Conrad make to win the match?",
+ "answer": "A",
+ "candidates": [
+ "A. Eleven attacks.",
+ "B. Eight attacks.",
+ "C. Five attacks.",
+ "D. Thirteen attacks."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What maneuver does Conrad use to defeat Sean Rieve at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A leg takedown.",
+ "B. A flying splash.",
+ "C. A body slam.",
+ "D. A shoulder throw."
+ ]
+ }
+ },
+ "bXiUtrSf": {
+ "video_id": "bXiUtrSf",
+ "video_duration": "118s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video takes place in a large arena with blue seating. The floor has a gray mat with a red geometric shape in the center. The video begins with one person standing in the center of the mat while another kicks a pad. Then two people in white outfits with red and blue helmets stand on a marked circle and a person in a white shirt and black pants stands near them. The two people in white outfits step forward, and one of them holds a blue helmet. The person in the white shirt with black pants claps their hands together. The person in a white outfit with the blue helmet takes a stance, and then the two people in white outfits spar with kicks. A person in a white shirt and black pants walks around the ring. One of the people in a white outfit falls to the ground. The person in the white shirt raises their hand up then walks to the two people in white outfits. One of the people in a white outfit walks to the side of the ring and then onto a blue mat. The person in the white shirt claps their hands again. Then the two people in white outfits are sparring with kicks again. The person in a white shirt continues to walk around the ring. They spar a few more times before the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What action did the red team player use to earn the crucial 3 points?",
+ "answer": "B",
+ "candidates": [
+ "A. Hook kick.",
+ "B. Front kick.",
+ "C. Downward strike.",
+ "D. Back kick."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "During the match, how many times did the player's hand from the blue team touch the ground?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. One.",
+ "D. Zero."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why did both teams fall during the match?",
+ "answer": "C",
+ "candidates": [
+ "A. Because the blue team consistently launched attacks.",
+ "B. Because the red team member lost balance while kicking backward.",
+ "C. Because the red team consistently launched attacks.",
+ "D. Because the blue team member lost balance while kicking backward."
+ ]
+ }
+ },
+ "YVAxgwZw": {
+ "video_id": "YVAxgwZw",
+ "video_duration": "64s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "swimming",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Four individuals are in an outdoor pool at night. One individual wearing a white swim cap and dark shorts is sitting near the edge of the pool, partially submerged, and is facing the viewer. A second individual wearing a white swim cap is swimming a modified breaststroke. A third individual wearing a white swim cap and dark swimsuit is swimming in the far right of the pool. Then the individual closest to the camera dives into the water. The individual swimming a modified breaststroke swims laps. The camera is now underwater and pointing upward as the two individuals swim laps. The individual swimming laps breaststrokes while the individual in the foreground kicks their feet as they swim. The individual in the foreground comes to a stop near the pool edge, while the other individual continues swimming. The camera is now angled above the pool again, facing downward. The first individual comes back into view, and the camera moves closer to them. The camera is now to the right side of the individual in the foreground, pointed directly at them as they swim. They continue to swim for a time, facing towards the right side of the pool, while the individual in the background also continues to swim. The video then shifts to a daytime video of a swim meet. Many swimmers are lined up and on diving blocks in a swimming pool. A sign above them reads \"MEN'S 200 BREASTSTROKE\". The video shifts again to an individual inside a car wearing a bag over their head. A person with a red shirt is at the window of a drive through, and the individual in the car is handing the person cash. The video then shifts again to an individual wearing a face mask and surgical cap in what appears to be a store, with many shelves of items around them. The video shifts again to an individual standing in a kitchen with several food items on the table, as well as a blender. The video shifts again to a swim meet, with several swimmers diving into the water at the same time. 
The video shifts again to two individuals sitting in chairs, in what appears to be a school hallway. The two individuals are talking to each other. The video shifts again to two individuals on a baseball field, in which it is snowing. The two individuals are hitting and catching baseballs. The video shifts again to the video of the pool, from a side view, with the two individuals swimming laps. The camera is pointed straight down at the pool, with the two individuals swimming laps. The camera is pointed more at the pool edge as the individual closest to the camera swims breaststroke and comes to a stop on the edge. The camera is at a medium close up of the pool with the two individuals swimming. The video shifts to a collage of three different individuals, with the text KMAASTV in the center.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur to the audio of the man speaking in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It gradually becomes quieter.",
+ "B. It alternates between loud and quiet.",
+ "C. It remains unchanged.",
+ "D. It gradually becomes louder."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the man holding the camera in the video relative to the man giving the commands?",
+ "answer": "A",
+ "candidates": [
+ "A. On the left.",
+ "B. In front.",
+ "C. On the right.",
+ "D. Behind."
+ ]
+ }
+ },
+ "KrnSrKPU": {
+ "video_id": "KrnSrKPU",
+ "video_duration": "102s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A window labeled \"Duel Options\" is visible, with a list of options, including \"No Ranged,\" \"No Melee,\" \"No Magic,\" \"No Sp. Attk,\" \"Fun Weapons,\" \"No Forfeit,\" \"No Prayer,\" \"No Drinks,\" \"No Food,\" \"No Movement,\" \"Obstacles,\" and \"No Weapon Switch,\" with \"Show Inventories\" selected at the bottom. The player selects \"Wait.\" The window changes to \"Duel Stake,\" where the user adds 1,000,000 gold pieces, to their side and the opponent\u2019s. The player selects \"Accept.\" The game viewpoint switches to a fight arena where two figures with red health bars are fighting a figure with a blue health bar. In the player's inventory, they select the option to use a ranged potion. The player character switches to using a spear weapon, then to an Abyssal tentacle. The player character switches again to a Dragon crossbow. A second figure with a red health bar appears and a figure with a blue health bar appears. The player character equips a blue shield. The player selects the \"Fight Skiddler\" option and the player character attacks a figure in red with a spear. The player character uses another item and uses a purple potion. The player continues fighting the red player, and selects a \"Fight Skiddler\" option. The player uses another purple potion and equips an Armadyl godsword. A menu appears with the \"Fight Skiddler\" option and the player character attacks a figure with a red health bar. A health potion is then used. The player character attacks the figure with a red health bar once again, and uses a purple potion. The character then attacks a figure with a blue health bar. The player character attacks the red health bar character again, the menu with \u201cFight Skiddler\u201d reappears, and the player selects this option again. The figure in red moves slowly away. The figure with a blue health bar attacks and the player character is defeated. A \u201cYou Have Won\u201d window appears with information about the duel. 
The player character then moves and interacts with other game characters, then opens a bank chest. The window switches to a black background that contains a single line of text that reads \u201cpay me 2m for 1 last fight\u201d. The player character then returns to the arena, drinks another potion, and moves again to a trading area. They trade 2,000,000 gold pieces to another player. The player character then moves back to the duel stake window and the player enters 5,000,000 gold pieces to the stake and presses \"Accept\". The player character then moves back to the dueling arena to fight. The player character drinks a potion and moves to attack the other player. The player character equips the Armadyl godsword, and then another player appears with a yellow health bar. The player character attacks a figure with a red health bar and a player character with a blue health bar, then attacks a different figure with a red health bar and a yellow health bar. The player character moves to another location and is defeated. The window switches to \"You Lost\", with information about the duel. The player wins a duel and a \u201cYou Have Won\u201d window appears with information about the duel. The player clicks \"Claim\" and moves away, interacting with other player characters. The player drops some coins, and then the video ends. A small window in the upper-right portion of the screen is visible, showing a person in a chair.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is C Engineer's mood as he continues to compete after the second match?",
+ "answer": "C",
+ "candidates": [
+ "A. Guilty.",
+ "B. Excited.",
+ "C. Bored.",
+ "D. Anxious."
+ ]
+ }
+ },
+ "lsetflhk": {
+ "video_id": "lsetflhk",
+ "video_duration": "317s",
+ "duration": "4-6min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A white figure with a black head appears in a blurry image, which quickly fades to black. A black silhouetted bird like figure appears. The silhouetted bird changes to show that it has white, red and black colouring. The creature's name \"Yvel\" appears, and after a moment the words \"Yveltal appeared!\" show up. The bird moves about, and then a white figure with pink and orange clothes appears. The words \"Go on, Smeargle! I know you can do it!\" appears at the bottom of the image. The picture moves to show Smeargle, a creature with a white body and head, brown ears and tail and a green brush for a tail. The picture focuses to show the battle between Smeargle and Yveltal. The words \"Yveltal's Dark Aura\" appear. The words, \"The wild Yveltal is radiating a dark aura\" show up. The picture continues to show a battle scene between Smeargle and Yveltal and then the words, \"You and Smeargle are breathing in perfect sync with one another\" appear. The scene shows Yveltal attacking Smeargle with the words, \"Smeargle used False Swipe!\" appearing at the bottom of the scene. The words \"The wild Yveltal used Dark Pulse!\" appear at the bottom of the scene. The camera focuses back on Yveltal who starts to do a move. After this, the picture shows the health of Smeargle has gone from 226 to 143 and the words \"Smeargle is awaiting your directions\" appears. The same move occurs again and the health of Smeargle is reduced. The words \"Smeargle used False Swipe!\" appear once again. The words, \"The wild Yveltal used Psychic!\" appear at the bottom. After this, the scene focuses on Smeargle with a purple light around it and then shows the health bar has reduced from 143 to 75. Once again, the words \"Smeargle is awaiting your directions\" appear. The scene then shows a move being used by Yveltal. The words \"The wild Yveltal used Oblivi\" appear and then the camera cuts to Yveltal doing a move. 
Then the camera cuts to show the health bar for Smeargle has reduced dramatically and the words \"Smeargle is in a bit of a pinch. It looks like it might cry\" appear. The scene cuts to show Yveltal again, and the words \"Smeargle is awaiting your directions\" appear once again. More actions take place, and the words \"Smeargle used False Swipe!\" are present and \"The wild Yveltal used Psychic!\" show up at the bottom of the scene. Smeargle is shown with a purple light around it, and then its health is shown again. The health of Smeargle is at zero, and a blue ball of light appears. The scene cuts to show the blue ball again and then cuts to show Yveltal and the words \"Go for it, Decidueye!\" appear at the bottom. A brown and green owl-like creature, Decidueye, appears, along with a battle scene that shows Yveltal again. A battle scene takes place again, with \"Decidueye is awaiting your directions\" appearing. The text \"Angel used a Rot\" appears, followed by a red and white creature with the words \"Rotom Power unlocked! Zzz-zzt!\" A text box at the bottom of the image reads: \"Your chance of catching Pok\u00e9mon increased!\" A scene with Yveltal is shown and then one with Decidueye being hit. A shadow appears behind Decidueye and the words \"The wild Yve\" are displayed. Decidueye glows purple and the scene cuts to the health of Decidueye being reduced to 196. Decidueye\u2019s health continues to be reduced. The text box reads \"Decidueye is awaiting your directions\" and then the battle scene with Yveltal reappears. The words, \"Angel used a Nest Ball\" appear at the bottom of the screen. Then a move takes place. A brown ball with yellow stripes is present, followed by the words \"Angel used a Premier Ball\". A white ball appears with a red ring in the centre, and the ball glows yellow with star shapes around it. The scene changes to show \"Gotcha! Yveltal was caught!\" followed by a white ball with a red ring in the centre. 
Decidueye is shown again and the words \"Decidueye boosted 2,322 Exp. Points!\" appear, followed by \"The rest of your team gained Exp. Points thanks to the Exp. Share!\" then the words \"Mewtwo grew to Lv. 89!\" appear. The screen then cuts to a black screen, then some blue patterns appear with the words \"Would you like to give Yveltal a nickname?\" Then another pattern appears, with the text box reading \"Where do you want to send Yveltal to?\" Then the text, \"Please select a Pokemon to swap from your pa\" appears. The camera cuts to black again, and then to a green pattern. This is followed by a black screen and then an image of Yveltal along with a text box that reads \"Smeargle has an Amulet Coin. What would you like to do with it?\". This is followed by a black screen. The camera then cuts to a white figure in pink and orange clothing standing on a brown platform, and a text box reads, \"Would you like to go back to where you came from?\" This is followed by a plain white background. The screen goes black and then it cuts to a white figure in pink and orange clothing standing in a purple landscape, then it goes black and shows a brown pointy figure. This is followed by a colourful map and the words \"Alola Region\". This cuts to various locations on the map including \"Ancient Poni Path\", \"Poni Beach\", \"Aether Paradise\", \"Heahea City\", \"Hau'oli Outskirts\", \"Hano Grand Resort\", \"Paniola Town\" and \"Route 5\". After showing these various locations, the scene returns to the Alola map again. The screen then shows an individual with a white helmet and pink and black clothing standing on a blue background, and then the camera zooms in on the individual's face. A brown and orange creature flies around. The scene moves to focus on the individual's legs. The individual appears again on a blue background. The screen then cuts to black, and then to the interior of a building in the dark. 
The individual in white and pink clothing is standing near a light on a path. The scene then cuts to the individual in white and pink clothing walking in a forest. The screen fades to black and then to show the individual walking through a forest again, with the words \"Lush Jungl\" at the bottom of the screen. The screen then cuts to show the individual walking through the forest, and then it goes black. The screen then goes back to show the individual walking through the forest again. The scene then cuts to blue, with the sun shown. The screen fades to white. The screen fades to black, and then shows a grey monkey creature with white and purple fur on a dark background and the words \"A wild Oranguru appeared!\". The screen cuts to show the monkey again with a green leaf-like object in its hand. The words \"A wild Oranguru appeared!\" are present again. The individual in white and pink clothing is shown near the monkey, and then Yveltal appears and a battle scene begins. The text \"Yveltal's Dark Aura\" appears. The words \"Yveltal is radiating a dark aura!\" appear, then the words \"What will Yveltal do?\" then \"You got away safely!\" are shown. The screen fades to black and then shows the individual walking through a forest again. This is followed by the individual walking through a forest once more. The individual is shown walking and a light is visible at the end of the path. Then the camera cuts to a blue background, with a sun visible. The scene then fades to white, then to a dark grey background, and then to a dark grey monkey with a purple outline. The words \"A wi\" are present. A small green, white and purple creature appears on a black background and the words \"A wild Steenee appeared!\" appear. This is followed by the individual in white and pink clothing and the words \"Go! Yveltal!\". This is followed by Yveltal appearing. Yveltal appears in the scene again and then is battling with Steenee. 
The text \"Yveltal's Dark Aura\" appears, along with \"Yveltal is radiating a dark aura!\". The camera focuses on the battle, and \"What will Yveltal do?\" is shown. The screen then fades to black and shows the individual walking through a forest, with the camera zooming closer to them. The scene then cuts to a sunny background. The scene fades to white, then to a black screen, and then to a monkey like creature who is grey, purple and white. A screen shows details about the creature Yveltal. A black screen is shown again. A green pattern is shown again. A white figure in pink and orange clothes is standing on a brown platform once more. This is followed by the figure running towards the camera on a brown platform and then the camera cuts to show the figure running through narrow passages on the brown platform. The figure is shown running towards a black light at the end of a path, and the text \"Would you like to go back to where you came from?\" is shown.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, what was the order in which Angel used the props?",
+ "answer": "C",
+ "candidates": [
+ "A. Beast Ball, Decidueye, Smeargle.",
+ "B. Decidueye, Beast Ball.",
+ "C. Smeargle, Decidueye, Beast Ball.",
+ "D. Smeargle, Beast Ball, Decidueye."
+ ]
+ }
+ },
+ "LveTOgyb": {
+ "video_id": "LveTOgyb",
+ "video_duration": "32s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a dark outfit does push-ups on a grass field. There are trees and a cloudy sky in the background.A person wearing a black top and blue shorts places a heavy round weight onto a barbell.A person with curly dark hair, wearing a black top, is facing the camera.A person wearing blue shorts and white shoes lifts a heavy round weight onto a barbell.A person in a dark outfit walks down the metal stairs of an outdoor stadium.A person in dark pants and a bright pink top runs on a field.A person in blue shorts and a blue and white jacket runs on a field.A person wearing a dark top and blue shorts stands in a gym and picks up a barbell with heavy round weights on it.A person wearing a light green top does pull ups on a black metal bar. Another person is visible in the background.A person with short red hair is shown in a dark city environment.A person wearing an orange shirt throws a dark medicine ball in a gym.Multiple people on a grass field are passing a ball back and forth.A close-up of the face of a person with dark curly hair is shown.A close-up of the face of a person with a beard is shown.A close-up of the face of a person with short dark hair is shown.A close-up of a person adjusting a prosthetic leg.A person wearing a red top and black boxing gloves is sparring with a person wearing black clothing and black gloves.A person wearing a black outfit lies on the ground while another person reaches out their hand to them.Two groups of people gather their hands in the middle and lift them up.A person wearing a dark outfit swings a sledgehammer over their head outdoors.A person wearing a dark outfit is walking while wearing a baseball cap.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the man in the orange short-sleeved shirt throw the ball in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A park.",
+ "B. A boxing gym.",
+ "C. A school.",
+ "D. A gymnasium."
+ ]
+ }
+ },
+ "qOkXGpvb": {
+ "video_id": "qOkXGpvb",
+ "video_duration": "63s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a distorted image in purple, white, and blue. This is followed by a man in a dark suit, a white shirt and a patterned tie. He is speaking and in the bottom left corner a banner reads \u201cANGUS TAYLOR | ENERGY AND EMISSIONS REDUCTION MINISTER\u201d and in the bottom center another banner reads \u201cPM, ENERGY MINISTER PITCH NET ZERO PLAN\u201d. This is followed by a slightly closer shot of the same man speaking. Then, there is a shot of a different man, wearing a dark suit, a white shirt and a dark tie, speaking at a podium. In the background are two flags, one is the United Nations flag, and the other is the Australian flag. Then the video cuts to a yellow and red distorted image followed by another shot of the second man at the podium, with the flags behind him. This is followed by an aerial shot of wind turbines on a grassy hill. The video then cuts to a man with glasses and a grey suit, speaking in front of a bookshelf. Then, a back shot of the same man looking at a stone building. The man is then seen walking through a paved plaza. There are trees in large pots and in the background, there are other people. The next scene shows a group of people surrounding a table, on which there is a cylindrical object with a red light on it. Then the video shows a pile of trash under a blue sky. Next, there is an aerial view of a factory with smoke stacks. This is followed by an aerial view of a deep pit. Next there is a shot of a large field of solar panels with the sun setting behind them. There are multiple shots of the solar panels with the sunset behind them. There is a shot of two people wearing lab coats and safety glasses, they are holding up small square objects. The camera goes back to the man speaking in front of the bookshelf, with the banner \u201cProfessor Thomas Maschmeyer FAA FTSE University of Sydney\u201d in the bottom. The video shows the same man, along with another man, working in a lab. 
This is followed by a shot of three people, wearing lab coats, looking at a white board with writing on it. Then there is a shot of two men, wearing lab coats, walking through a lab. The video cuts to a woman sitting at a picnic table, using a laptop. She is outside and there is a light orange structure behind her. Next, a graphic appears depicting a battery with the words \u201cZinc Bromine Gel\u201d on the top. This battery is separated into two halves of different colors. There is a graphic of the same battery, showing an electrical connection to a light bulb. The video cuts to a shot of a large field at sunset. There is a road in the distance and some buildings. This is followed by an animated graphic showing different components of a battery. Then a shot of a completed battery, and then the same battery from a slightly different angle. Then there is an overhead shot of many cylindrical tanks.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the drop in solar energy prices considered a major transformative event?",
+ "answer": "B",
+ "candidates": [
+ "A. Solar energy can be harnessed in harsh environments.",
+ "B. It enables numerous technologies, including energy storage.",
+ "C. Nothing else has experienced a 95% price decrease within a decade.",
+ "D. It resulted in Thomas Maschmeyer receiving the Prime Minister's Prize for Science in 2020."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video is the Gelion battery introduced?",
+ "answer": "B",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. At the end of the video.",
+ "C. Throughout the entire video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "JvSccFip": {
+ "video_id": "JvSccFip",
+ "video_duration": "503s",
+ "duration": ">8min",
+ "domain": "Tech & Science",
+ "sub_category": "Physics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a person, who is wearing a light blue, button-up shirt and a dark tie, sitting at a desk with various scientific equipment. To his right is a blue stand with a glass vacuum tube and a second larger grey piece of scientific equipment. Behind him is a laptop. There is also a mug on the desk. Text in a stylized font appears overlaid on the video, saying \"The Model of the Atom\". The scene transitions to a white background with the title \"Alpha Scattering The Nuclear Model of the Atom\" written in a stylized font, and the text \"GorillaPhysics\" on the bottom of the screen. A Lego figurine that appears to be a gorilla appears on screen. Then the video returns to the person at the desk. The text \"Kit Betts Masters\" appears on the screen. The text \"Atom\" appears on the screen. A statuette of Democritus with the text \"Democritus\" appears on the screen. The video shows an image of a block of white cheese. The video shows an overlay of what appears to be a cracked glass texture, followed by an overlay of what appears to be a bright cloudy sky, then an overlay of animated flames. The video returns to the person at the desk. A portrait of a man, John Dalton, with the text \"John Dalton\" appears next to the person at the desk. A diagram of molecular structures appears overlaid on the video. Then an image of a page with writing in columns appears. The video returns to the person at the desk. Then an image of a different cathode ray tube is shown. The text \"Models are accepted if they explain the evidence available\" appears overlaid on the video. The person at the desk picks up the glass vacuum tube, then the video transitions to a shot of the tube. 
A hand touches the left side of the tube and text appears saying \u201ccathode\u201d, and then the hand touches the right side of the tube and text appears saying \u201cpositive plate.\u201d A green line appears from inside the tube and text appears saying \u201cCathode Ray a beam of electrons\u201d. The scene is shown from different perspectives. The person at the desk is shown holding a cookie, which he then breaks apart. The video transitions to an image of a green light inside the tube and the text appears saying \u201cAtoms aren\u2019t fundamental.\u201d The person at the desk is shown again and then an image of a cathode ray tube with text saying \u201ccathode.\u201d The person touches the different sides of the tube again. Then a diagram with text appears showing the cathode and anode, and two charged plates within a cathode ray tube. Then the video shows a photograph of a man from the neck up. Another version of the cathode ray tube is shown. Then there is a photograph of a man sitting at a table with a series of scientific equipment. The person is shown at the desk and then a portrait of a man, Ernest Rutherford, with the text \u201cErnest Rutherford\u201d appears next to the person at the desk. The person at the desk is shown touching his hair and then the video transitions to a drawing showing a circle with \u201calpha particles source\u201d, \u201cgold foil\u201d, and a red line showing a straight line path. A second drawing next to that showing \u201cobserved result\u201d with the lines having different and diverted paths is shown. Then a diagram with the text \u201cThomson\u201d and an image of a circle with small blue spheres inside it. Then a diagram with the text \u201cRutherford\u201d and an image of a circle with small blue spheres inside it, and a solid red sphere in the center, and black arrows going through it. Then the text \u201cThe Nuclear Model of the Atom\u201d is shown. 
The video shows a hand pushing a small silver ball across a white surface. The hand pushes some of the balls towards a line of fruit scones. The video shows a hand using markers to highlight words. Then the video shows a green line moving from left to right through a sequence of four silver balls. Then, a few lines and text about an experiment and what was observed appear on the screen. Then the video shows a series of hand-drawn diagrams showing a line with alpha particles going through an atom, with a measurement diagram. Then the video transitions back to the person at the desk. A portrait of a man, Niels Bohr, with the text \u201cNiels Bohr\u201d appears next to the person at the desk. A diagram of an atom with a nucleus and electrons orbiting the nucleus appears on the screen, with the text \u201cThe Bohr Model\u201d. Then a diagram of the atom with a series of concentric circles. The text \u201cEnergy Levels\u201d appears. The video returns to the person at the desk. Text saying \"1. Most of the atom is empty space\" appears on the screen, then \"2. The mass and positive charge is concentrated in a nucleus.\" Then text reading \"3. The electrons orbit with fixed energy levels\" appears. Then the person at the desk is shown again. The diagram of the atom with a series of concentric circles appears. Then text saying \"Thanks to Jabberwock for the diagram\" appears. The video shows the person at the desk and text on the screen saying \"Thanks for watching.\" The video concludes with the text \"Like it, share it, and subscribe for more!\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the cherry pie in the video used for?",
+ "answer": "C",
+ "candidates": [
+ "A. As a negative magnetic field.",
+ "B. As food.",
+ "C. As a generator of gaps.",
+ "D. As an anode magnetic field."
+ ]
+ }
+ },
+ "dJCrkXVh": {
+ "video_id": "dJCrkXVh",
+ "video_duration": "30s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A woman with long brown hair and a red headband that says \"KISS\" on it is shown speaking and making hand gestures. She wears a purple and grey striped long-sleeved top. She is standing near a staircase with wood steps. Then the video cuts to the same woman, now with short brown hair, a leopard print hair band, and large star-shaped earrings, wearing a purple, pink and blue jacket. She is holding a black book that says \"RECORD\" on it. Behind her is a television screen showing a dance program. Then the video cuts back to the woman with the headband who makes more hand gestures. Finally, the video cuts back to the woman with the short hair holding the book.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the long-haired woman waving her arms at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. She is warming up for a fight with a short-haired woman.",
+ "B. She is practicing boxing movements.",
+ "C. She is rehearsing a dance.",
+ "D. She is doing aerobics to lose weight."
+ ]
+ }
+ },
+ "MyeHDJxB": {
+ "video_id": "MyeHDJxB",
+ "video_duration": "85s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A figure with black hair, cat ears, a yellow bow, glasses, and a yellow scarf, is shown writing a letter with a white quill. The figure is wearing a black sweater. To their right is a text box on a cream-colored background with black writing. In front of the figure is a small, green figurine with a red dress, standing on a grey stand. To the figure's left is a plate of cookies and a grey bottle. In the upper-left, a black lantern with a yellow glow hangs from a chain. Behind the figure is a wooden window with a moon and stars visible. Intermittently, images of two orcs in armor are superimposed over the scene, one of the orcs has red hair. Additionally, two images of a green-skinned orc wearing different armors are also superimposed over the scene, and a pink-skinned demon is superimposed over the scene along with the main character. These images fade in and out, sometimes with the main character visible behind them, and other times completely covering the scene. The scene then returns to the main character.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the man wearing a yellow scarf and the winged mutant do in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The man wearing a yellow scarf held the winged mutant.",
+ "B. The winged mutant carried the man wearing a yellow scarf.",
+ "C. The man wearing a yellow scarf pinned the winged mutant against the wall.",
+ "D. The winged mutant pinned the man wearing a yellow scarf against the wall."
+ ]
+ }
+ },
+ "xKZjTmtc": {
+ "video_id": "xKZjTmtc",
+ "video_duration": "200s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two shirtless men wearing boxing gloves face each other in a boxing ring. A man wearing a white shirt, black pants, and a black bow tie stands between them and speaks into a black microphone. The camera angle changes to an overhead view of the ring, and shows the previous three men as well as a number of additional individuals in the ring. The camera zooms in again, focusing on one of the shirtless men wearing blue gloves who walks towards the opposite side of the ring. The overhead view returns, and the two shirtless men wearing boxing gloves are seen moving towards each other and begin throwing punches. The man in the white shirt steps into the fray several times to separate the two men. The camera zooms in on one of the men as he is approached by a man in a black shirt who wraps him in an embrace and puts a white towel over his head. The other shirtless man has a towel applied to his face by a man in a black t-shirt. The overhead view returns, showing the end of a boxing match. Both men are embraced by handlers as a result of the match.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the man wearing blue boxing gloves most likely to do next after finishing the match in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Begin preparing for the next match immediately.",
+ "B. Perform boxing techniques in the arena with the audience.",
+ "C. Get ready to go home and rest.",
+ "D. Undergo examination and treatment by the medical team."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the last three letters in the final row of the competition location information bar mentioned in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. THE.",
+ "B. MID.",
+ "C. EST.",
+ "D. DOM."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times do the audience's cheers occur in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two.",
+ "B. One.",
+ "C. Four.",
+ "D. Three."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In which direction is the man wearing red boxing gloves on the stage positioned before the match starts?",
+ "answer": "C",
+ "candidates": [
+ "A. Upper right corner.",
+ "B. Upper left.",
+ "C. Lower right corner.",
+ "D. Lower left corner."
+ ]
+ }
+ },
+ "YnwBJbVs": {
+ "video_id": "YnwBJbVs",
+ "video_duration": "98s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A character with pale skin, white hair, and a horned helmet walks into a scene at night. The view changes to a map with various icons, a red circle is drawn around a cluster of icons. Then a window showing possible upgrades appears, followed by a window showing an option for more upgrades. A scene with dead bodies laying on the ground, and a building engulfed in flames is shown. A winged creature with horns is then shown with a burning building in the background. Next a creature with black and red armor is shown with a weapon, followed by a character with a purple hooded cloak. The view changes to a character with a horned helmet standing in front of a line of armored warriors. The view changes to an overhead shot of the battle, with multiple warriors. The horned warrior character is shown in the middle of the battle, next to a large creature that is being hit by fire. The winged creature with horns is shown again, then a warrior with white skin and dark red armor. More scenes of the battle are shown, with multiple different creatures participating. The winged creature is shown flying above the battle, before the scene fades to black.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Through what can the feelings of the winged harpy towards the Saranis Legion be understood?",
+ "answer": "B",
+ "candidates": [
+ "A. Her sword.",
+ "B. Her shield.",
+ "C. Her warhorse.",
+ "D. Her singing voice."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a tank in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "bVsFUogL": {
+ "video_id": "bVsFUogL",
+ "video_duration": "25s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a series of comic book panels.The first panel shows a large group of characters, some wearing superhero costumes. There is a banner at the top that says \"Record World Records Holder.\" In the foreground there is a character in a white dress with a veil, and a character in a red and black suit and a top hat. The following panels show different scenes with character dialog in speech bubbles. In one panel there is a man in a suit talking to a character in a red and black suit with a black sword on their back. Later the red and black suited character is pointing a pistol. A figure with purple skin and horns then appears. There are also panels with a character in a blue and white suit and a character in a red, white and blue suit. Another panel shows a character in an orange plaid shirt. Finally there are panels of the red and black suited character talking to a purple skinned character. There are two men in black suits and ties as well. In one panel there is a large pink \"BAMF!\" written.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does Deadpool invite his companion to his wedding in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. Throughout the video.",
+ "C. At the beginning of the video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotional state of his companions upon learning that Deadpool is getting married?",
+ "answer": "D",
+ "candidates": [
+ "A. Sadness.",
+ "B. Anger.",
+ "C. Happiness.",
+ "D. Doubt."
+ ]
+ }
+ },
+ "fVpzGknZ": {
+ "video_id": "fVpzGknZ",
+ "video_duration": "35s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man in a red shirt gestures with his hands as he speaks.\u00a0 A map of the United States appears on screen, with a dotted green line connecting New York and Los Angeles. The map changes to one of Europe, with a dotted green line connecting London and Istanbul.\u00a0 A small, dark-colored model, resembling a miniature missile or rocket, sits on a small, round, blue table to the man\u2019s right.\u00a0 A diagram appears, depicting a cross-section of a cylindrical device. The diagram is labeled \u201cHigh Pressure Feed Stream\u201d on the left, \u201cLow Pressure\u201d on the top and bottom, \u201cBarrier\u201d in the center, \u201cEnriched Stream\u201d on the top right, and \u201cDepleted Stream\u201d on the bottom right. Small green and magenta spheres enter from the left, labeled \u201cHigh Pressure Feed Stream.\u201d The spheres disperse within the device. More green spheres appear to move towards the top right (\u201cEnriched Stream\u201d) and more magenta spheres move towards the bottom right (\u201cDepleted Stream\u201d). A solid yellow rectangle briefly covers the central portion of the screen, obscuring part of the diagram, then rotates to become a vertical yellow grid.\u00a0 The grid expands, taking on a triangular shape, suggestive of a cascade. Lines of magenta and green spheres continue to move towards the top and bottom points of the triangle, respectively.\u00a0 The words \u201cFeed\u201d appear to the left, \u201cEnriched Uranium\u201d near the top, and \u201cDepleted Uranium\u201d near the bottom of the triangular shape.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "In a video depicting two uranium isotopes traveling from New York to Los Angeles, how much sooner would the lighter isotope arrive compared to the heavier one?",
+ "answer": "B",
+ "candidates": [
+ "A. 26 feet.",
+ "B. 25 feet.",
+ "C. 27 feet.",
+ "D. 24 feet."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What change is demonstrated with U-235 in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Its mass increases from light to heavy.",
+ "B. Its concentration increases from low to high.",
+ "C. Its mass decreases from heavy to light.",
+ "D. Its concentration decreases from high to low."
+ ]
+ }
+ },
+ "PXcjWzaT": {
+ "video_id": "PXcjWzaT",
+ "video_duration": "176s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A crowd cheers and lifts a person in a hot dog costume. Another person in a pizza costume dances. A person in an ice cream sundae costume walks through the crowd. A highway sign for Los Angeles 101 appears. The entrance to Hollywood Arts High School is shown. Students walk and gather around a red pickup truck. A band performs on a balcony above the crowd. The band consists of a guitarist, a drummer, a bass guitarist, and a female singer. The crowd cheers and claps along. Two men stand next to a red SUV. One man talks on a cell phone. People in food costumes exit a blue van. The person in the pickle costume falls down and the person in the hamburger costume helps them up. The band continues to perform, and the crowd continues to cheer and clap along.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action are the woman wearing a shirt and the man wearing a coat performing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. They are dancing.",
+ "B. They are kissing.",
+ "C. They are singing.",
+ "D. They are applauding."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who in the video says, \"This is called 'Song To You'\"?",
+ "answer": "D",
+ "candidates": [
+ "A. A girl with wavy hair, wearing a purple dress.",
+ "B. A girl with short blonde hair, wearing an orange vest.",
+ "C. A boy wearing a hat and blue clothing.",
+ "D. A boy with dreadlocks, wearing a pink shirt."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Does the drum set play continuously throughout the video?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No, it does not.",
+ "C. Yes, it does."
+ ]
+ }
+ },
+ "kMTpkPFb": {
+ "video_id": "kMTpkPFb",
+ "video_duration": "179s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person in a red shirt and a beige apron is standing in a kitchen. The person is standing next to a silver and black pressure cooker on a counter. There is a sink next to the pressure cooker. The pressure cooker has a digital display showing OFF. The person\u2019s left hand is resting on the top edge of the pressure cooker, and their right hand is gesturing. The background shows shelves filled with various jars and items. The person places both hands on the pressure cooker lid and rotates it. The lid is opened, revealing chicken bones, carrots, an onion, and water inside the pressure cooker. The person, wearing striped oven mitts, pours the contents of the pressure cooker through a strainer into a glass jar in the sink. After pouring, the strainer contains three carrots and some bone fragments. The strainer is removed from the jar. The glass jar has a yellow liquid in it. The person is pointing to a level in the jar. The person rotates back to face the camera, and a logo appears with the words \"Living Life on the Rural Side, The Homesteading Hippy, Living a Rural Life, in an Urban Environment\" printed in it. The logo disappears and the screen goes white then black.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the Instant Pot release steam in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four times.",
+ "B. One time.",
+ "C. Two times.",
+ "D. Three times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will the woman in the video do next?",
+ "answer": "B",
+ "candidates": [
+ "A. Remove the vegetables and chicken frame from the pot.",
+ "B. Add water to boil the second batch of meat broth.",
+ "C. Add water to boil the third batch of meat broth.",
+ "D. Discard the food from the pot and clean the cookware."
+ ]
+ }
+ },
+ "bObABELW": {
+ "video_id": "bObABELW",
+ "video_duration": "106s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man and a woman are kneeling next to a building with a sign that reads \"Presidential Goylon\". The woman's mouth opens and closes. The man's mouth opens. The woman looks down, then looks up and closes her eyes and opens her mouth again. The two of them are now standing among trees and talking with their mouths open. The woman puts her head on the man's shoulder. The screen says \"The Next Day.\" Two other identical people are standing next to the man and woman. The four people now look to their right. The man looks to his left. Another man with a beard appears. Words \"Ready\" and \"Go!\" briefly appear on screen. Two black boxes say \"50/5\" and \"0\" above both men. A black box appears near the first man with words \"Fudgen Gun Crystal Sword Oil W/ Fire\". The first man appears to shoot a beam at the other man. The number above the second man changes. A new black box appears by the second man with words \"Punch, Throw, Bat\". He throws a bottle at the first man. A black box appears again by the first man with words \"Fudgen Gun Crystal Sword Oil W/ Fire\". He appears to throw oil and fire at the other man. The background changes to fire and the number above each man changes. A crystal sword appears. The first man points his finger at the other man and then the sword appears to strike him. The number above the second man changes. A new black box appears by the second man with words \"Punch, Throw, Bat\". The first man appears to slam the other man on the ground, changing the numbers. The man throws oil and fire at the other man. Numbers above each man change. A new black box appears by the second man with words \"Punch, Throw, Bat\" and the man throws a bat at the first man. Background changes to mountains. The numbers change again. The first man shoots a beam at the other man and then the numbers change again. A black box says \"Missed\". A black box appears near the first man with words \"Fudgen Gun Crystal Sword Oil W/ Fire\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the emotional state of the young man wearing glasses change in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Confusion changes to fear.",
+ "B. Excitement changes to fear.",
+ "C. Fear changes to excitement.",
+ "D. Tension changes to excitement."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Does the background audio in the video change?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the third tool used by the young man wearing glasses in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A wooden stick.",
+ "B. A bottle.",
+ "C. Fire.",
+ "D. A sword."
+ ]
+ }
+ },
+ "OvlukcZq": {
+ "video_id": "OvlukcZq",
+ "video_duration": "49s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A video begins with a blurred fast motion shot of a person. Then a woman wearing a red, white, and blue striped blouse and dark pants appears on camera holding a microphone that has \"BIS TV\" printed on it. She's standing in a large, brightly lit convention hall, where several people in business casual attire can be seen walking around and standing in groups. There are several screens and booth displays visible in the background. The video then cuts to a wide aerial shot of the convention hall, which has a large red shipping container at the left edge of the frame, and many booths are visible in the middle, with more people walking around. The shot returns to the woman with the microphone, then cuts to a shot of her interviewing a man with a light beard, wearing a red and white plaid shirt and blue lanyard, who is sitting at a red counter. The bottom of the screen has text that says \"Omar Ganiev Defence Group\". During the interview, a couple of screens in the background display different images. The video returns to a close up on the man while he speaks to the interviewer.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where was the man interviewed in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Sitting on a chair.",
+ "B. Standing behind the table.",
+ "C. Standing in the center of the stage.",
+ "D. Standing in front of the table."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Which items are present at the booth of the man wearing a plaid shirt in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. There is a table at the booth with several books and some documents.",
+ "B. There are a large number of brochures and small gifts at the booth.",
+ "C. There are laptops, a projection screen, and a microphone from BIS TV at the booth.",
+ "D. There is a giant display and a robot at the booth."
+ ]
+ }
+ },
+ "poaiRUGB": {
+ "video_id": "poaiRUGB",
+ "video_duration": "240s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person is seen assembling a picture frame. They start by attaching a clear plastic sheet onto a white picture frame using thin, clear plastic connectors. They use their fingers and a yellow-handled snipper to secure the plastic pieces into the frame. The person repeats this process, cutting more thin pieces and attaching them to the remaining corners of the frame until all corners have been secured. A painting of three pink flowers is positioned next to the frame on a wooden surface.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the final state of the gasket in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Attached to the picture frame after being cut.",
+ "B. Attached to the glass panel after being cut.",
+ "C. Cut and set aside.",
+ "D. Attached to the painting after being cut."
+ ]
+ }
+ },
+ "SBSdVtTm": {
+ "video_id": "SBSdVtTm",
+ "video_duration": "112s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows two American flags and a metal railing. There is also a sign that says \"HUMANITY FIRST\". A man in a black jacket with \"Yang 2020\" on it is also visible, as well as a large crowd of people. The video then shows a woman in a black hat, brown jacket and orange vest, who is talking with a man in a dark coat with a red and grey striped scarf. They are seen talking and gesturing with their hands. They are inside a white tent. Other people are visible in the background. The video cuts to a view of a sign that says \"HUMANITY FIRST\", then shows a person's arm and fist raised. The video then continues showing the man and woman inside the tent talking. The woman touches her chest then her face, and speaks. The man gestures with his hands, then touches his own chest, and nods. They each speak some more, then the man claps twice. The woman then speaks, and they both nod. They are in a tent.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What Arabic numerals are displayed on the back of the black-hoodied, black-baseball-capped male's sweatshirt at the beginning of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. 2020.",
+ "B. 2030.",
+ "C. 2022.",
+ "D. 2021."
+ ]
+ }
+ },
+ "vrmNbcjP": {
+ "video_id": "vrmNbcjP",
+ "video_duration": "199s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a blue t-shirt and a dark-colored baseball cap is in a workshop and holds a wooden plank in their hand, pointing to the nails in it. They place this and other planks on a table, and then use a hammer to break the planks apart. The individual then stands near a pile of wooden planks, and demonstrates a black nail gun, which they then connect to a yellow air hose. They use the nail gun to secure a single plank in a clear plastic bin, placing additional nails into it. Next they place another piece of wood into the bin, and connect these two pieces of wood by placing nails into them with the nail gun. Finally, they use the nail gun to connect a third board to the previous two.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the man wearing a hat doing at the beginning of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Dismantling a tray.",
+ "B. Using a nail remover.",
+ "C. Using a claw hammer to pry out nails.",
+ "D. Making a tray."
+ ]
+ }
+ },
+ "AJNgNHxE": {
+ "video_id": "AJNgNHxE",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Soccer",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a rugby match. There are two teams, one with yellow and blue jerseys and the other with light blue and white jerseys. The players are running around and tackling each other on a green field. There are several scrums and one player carries the ball to the goal line. He falls to the ground and appears to score. The teammates of this player congratulate him with hugs and high fives. A referee holding a flag signals the scoring of the point. He then walks off screen, and the video ends by focusing on two players on the yellow and blue team. One is touching the other player's head. The second player is touching his own head. The time in the top left corner of the screen changes throughout the video, starting at 12:48 and progressing to 13:25. The team scores are also displayed on the top left, showing that at the beginning, one team has 8 points and the other has 0.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "In the latter part of the video, why does the referee walk to the sideline?",
+ "answer": "A",
+ "candidates": [
+ "A. A basket is scored.",
+ "B. A ball-handling violation occurs.",
+ "C. The players are involved in a conflict.",
+ "D. The quarter ends."
+ ]
+ }
+ },
+ "RlPqMTFR": {
+ "video_id": "RlPqMTFR",
+ "video_duration": "95s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a dark screen. It then transitions to an outdoor scene showing a bald person riding a horse in a grassy field. There are train tracks to the left and trees in the background. Another person is visible standing near the train tracks. The bald person gets off the horse and pulls out a knife. A circular menu appears in the bottom-left corner and a circular menu of weapon options appears in the middle. The camera shifts to a close-up of the bald person walking towards the other person. The image turns black and white and the circular menus appear again. The bald person begins to grapple with the other person. Both people are visible as they go to the ground in a grappling struggle, then they get up and are grappling again. The bald person then holds the other person from behind with the knife held to their neck and they struggle and move around. The other person runs away, and the bald person gives chase, while using the knife and a rifle as weapons. The bald person then tackles and grapples with the other person to the ground again, and the image switches back to black and white. The bald person is now seen chasing the other person on the train tracks and they fight again. They are then seen running through the field, with the bald person tackling the other person to the ground once again, where they proceed to fight. The camera angle then switches and the bald person is running, then pulling up a weapon selection menu. The last scene shows the other person running away, with the bald person giving chase.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the bald man holding a knife at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Because the bald man wants to peel an apple to eat.",
+ "B. Because the bald man is holding the knife for self-defense.",
+ "C. Because the bald man is using the knife to cut grass.",
+ "D. Because the bald man wants to shave the man wearing white."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Did the person wearing a white shirt in the video drop their hat on the ground at the start of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. I don\u2019t know.",
+ "C. Yes."
+ ]
+ }
+ },
+ "SXFXLiZc": {
+ "video_id": "SXFXLiZc",
+ "video_duration": "226s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a shot of the bridge of a spaceship. Crew members are seated at consoles. One crew member, wearing a blue and white uniform, stands with their back to the camera, looking out a large window. The names \"BRIAN 'PROFESSOR' LEWIS\" and \"DIGBY TATHAM-WARTER\" appear on screen. The standing crew member turns and looks toward the camera with a concerned expression. The video then cuts to a close-up of the crew member, now seated at a console, appearing distressed. Another crew member, a woman with dark hair, sits next to him and raises a finger as if making a point. She then speaks directly to the camera. The first crew member interacts with a holographic display, pointing and gesturing at a tactical map. The video cuts back to the first crew member, now standing and looking concerned. The view changes to a screen displaying an incoming transmission. A pixelated image of a pale, alien face appears, labeled \"EXECUTOR KUL-LERN.\" The video cuts back to the first crew member, their arms crossed, with a worried look on their face. The incoming transmission screen changes, displaying multiple alien images. There's Kul-Lern again, a robotic figure with a large, flat head covered in vegetation, a creature resembling a bipedal mushroom with a red cap and multiple eyes, a white, feathery creature with multiple eyes, and a three-eyed yellow, multi-limbed creature with golden markings. The first crew member watches the transmission with growing concern. As more alien portraits appear, the crew member puts a hand to their face in disbelief. The robotic figure's portrait is highlighted. The first crew member continues to react to the images on the screen, looking increasingly distressed and gesturing emphatically. The yellow, multi-limbed alien is highlighted. Next, a picture of a person in a gray blanket wearing horns appears on screen, labelled \"STONE GOLEM OF MAJESTIC STEWARDSHIP\". The person is shown talking. He continues to talk, becoming increasingly animated. As the person talks, the first crew member shakes their head. More portraits are displayed, including one of a person with ram horns wrapped in a blanket. The video ends with a shot of the Door Monster logo.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Which type of alien is described as humanity's first friend after the advent of spaceflight?",
+ "answer": "A",
+ "candidates": [
+ "A. An alien with large eyes and a mushroom-shaped head.",
+ "B. An alien with a round head and a moss-covered surface.",
+ "C. A yellow, insect-like alien with multiple limbs and eyes.",
+ "D. An alien with white and purple feathers and multiple eyes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "In the video, how many alien beings did humans have screen conversations with?",
+ "answer": "C",
+ "candidates": [
+ "A. Five.",
+ "B. Eight.",
+ "C. Seven.",
+ "D. Six."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What do white and purple-feathered, multi-eyed aliens accuse humans of doing?",
+ "answer": "C",
+ "candidates": [
+ "A. Breaking their defensive pact.",
+ "B. Making extremely rude comments.",
+ "C. Hiring mercenaries to attack their undefended colonies.",
+ "D. Blocking their only hyperspace route to the rest of the galaxy."
+ ]
+ }
+ },
+ "QiEecPFq": {
+ "video_id": "QiEecPFq",
+ "video_duration": "542s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman and a young boy sit on a low table in front of a couch. The woman wears glasses and a black t-shirt with the word \"Blessed\" printed on it. The boy wears a black t-shirt with a white graphic. There are two small glass jars containing an orange liquid and a clear plastic bowl between them on the table.\u00a0 Behind them, on the couch, are orange cushions and a black blanket or throw with \u201cRaiders\u201d written in white. The woman opens a small red bag of chips. She and the boy each take a chip from the bag and eat it. They look at a phone on the table. They eat more chips from the bag, putting the empty wrappers into the bowl. They each pick up a jar and drink from it. They look and speak directly to the camera. The woman points to a \u201cSubscribe\u201d button overlaid on the lower left of the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did they rest during the competition?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. One.",
+ "C. Four.",
+ "D. Three."
+ ]
+ }
+ },
+ "slVWVVQU": {
+ "video_id": "slVWVVQU",
+ "video_duration": "463s",
+ "duration": "6-8min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A pair of hands holds a piece of fishing wire with a small metal loop at the end. On the table are a dead fish, a wooden skewer, a multi-pronged toothpick-like tool with a plastic piece at the top, a knife with a blue handle, and a pair of scissors. The scene changes to a man sitting at a table. He is wearing a blue, white, and black collared shirt. Behind him are shelves with various items. He holds the fishing wire with the metal loop in one hand and makes gestures with his other hand. On the table are a glass of water, a fish on a cutting board, a knife, a pair of small scissors, and a small pink sticky note. The scene changes back to the hands and tools on the table. One hand picks up the fish and gently presses on its underbelly. The other hand picks up a tool made of a skewer and a toothpick-like object woven together with fishing wire. The scene cuts back to the man. He holds the same woven tool. The camera returns to the hands and tools. The skewer is placed on the cutting board. The toothpick tool is also placed down. The hands are now picking up the fish and removing the woven tool from inside its mouth. The camera cuts back to the man. He holds the fish. The scene changes to the hands. One picks up the skewer. The other hand holds the fish by the tail and the skewer is inserted through the fish\u2019s mouth. Back to the man again. He holds the skewer, which is now poking out of both ends of the fish. The camera returns to the table with the tools. The pliers cut off one end of the skewer that is sticking out of the fish\u2019s tail. The scene cuts back to the man holding the fish impaled on the skewer. Cut to the hands. Black tape is being wrapped around the skewer and the fish. Cut to the man holding the fish and the black tape. Back to the hands, which now hold pliers and a hook attached to some fishing wire. The point of the hook is placed against the tail end of the skewer. The pliers crimp the hook in place. Cut back to the man with the fish and tools. Cut to the hands. One hand uses the knife to cut slits along the fish\u2019s belly. Cut to the man and the fish, which now has several parallel slits cut into it. Cut to the hands. The knife is now used to carefully separate the meat from the skin, beginning from one of the incisions in the belly. Cut to the man. He is continuing to separate the fish meat from the skin. Cut to the hands holding pliers and the fish. The pliers are removing the skewer from the fish. Cut to the man, who continues to peel the skin away from the fish, which is now skinless on one side and only attached at the head and tail.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order of events depicted in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The sardine is cut, then the stick is inserted into the sardine, then fishing line wraps around the stick, then cotton thread wraps around the sardine.",
+ "B. Fishing line wraps around the stick, then the sardine is cut, then the stick is inserted into the sardine, then cotton thread wraps around the sardine.",
+ "C. The sardine is cut, then fishing line wraps around the stick, then the stick is inserted into the sardine, then cotton thread wraps around the sardine.",
+ "D. Fishing line wraps around the stick, then the sardine is cut, then cotton thread wraps around the sardine, then the stick is inserted into the sardine."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many tools were used to make fish bait in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. One.",
+ "C. Four.",
+ "D. Three."
+ ]
+ }
+ },
+ "FHxtBJyb": {
+ "video_id": "FHxtBJyb",
+ "video_duration": "65s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A boxer in black trunks and orange gloves lands a left hook to the head of his opponent, who is wearing white trunks with red gloves. The opponent stumbles back towards the ropes. The boxer in black continues his attack with a combination of punches, pushing the opponent against the ropes. The opponent attempts to defend but absorbs more blows. He leans forward onto the ropes, covering up. The referee steps in to check on the opponent, then steps back and lets the match continue. The boxer in black throws more punches, connecting with both left and right hooks. The opponent continues to defend, but gets hit by several more punches to the head. The referee stops the fight and checks on the opponent in white trunks again. The referee steps back, lets the fight resume, and the opponent in black lands a few more punches. The referee steps in again, stops the fight, and talks to the opponent in white. The fight resumes again briefly with the boxer in black landing more punches, before the referee steps in and stops the fight to check the other boxer one last time. The fight resumes, and the boxer in black throws several more punches at the opponent, who primarily covers up. The boxer in black lands several body shots. The bell rings, ending the round, and the boxer in black trunks walks towards his corner.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What technique did the boxer wearing green gloves use to defeat their opponent?",
+ "answer": "B",
+ "candidates": [
+ "A. An uppercut.",
+ "B. A left hook.",
+ "C. A combination of punches.",
+ "D. A sweeping kick."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the man wearing white gloves likely to do next?",
+ "answer": "D",
+ "candidates": [
+ "A. Have a meal at a restaurant.",
+ "B. Participate in the next match.",
+ "C. Give an interview.",
+ "D. Tend to an injury."
+ ]
+ }
+ },
+ "gjqkjtuG": {
+ "video_id": "gjqkjtuG",
+ "video_duration": "148s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A white bird with a crest on its head is seen inside a car, then outside in a grassy area with multiple grave markers. The bird moves from the hand it's on, up the arm, and back down. The bird looks in various directions. It\u2019s then seen back on the car, and eventually back inside the car. There are some instances where colorful static is shown.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "Based on the video, what is the woman's likely occupation?",
+ "answer": "A",
+ "candidates": [
+ "A. A parrot trainer.",
+ "B. A paranormal blogger.",
+ "C. A race car driver.",
+ "D. A cemetery caretaker."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the parrot screaming in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It does not want to stay in the car.",
+ "B. It does not want to be taken home by its owner.",
+ "C. It has found a new toy.",
+ "D. It is reacting to the strange atmosphere in the cemetery."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "At the end of the video, where did the parrot move from and to?",
+ "answer": "A",
+ "candidates": [
+ "A. It moved from the man's hand to the car window.",
+ "B. It flew from the tombstone to the woman's hand.",
+ "C. It flew from the car window to the big tree.",
+ "D. It moved from the car window to the man's hand."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the location of the scene at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A cemetery.",
+ "B. The park.",
+ "C. Inside a car.",
+ "D. A campground."
+ ]
+ }
+ },
+ "WUIUugQN": {
+ "video_id": "WUIUugQN",
+ "video_duration": "176s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a blue t-shirt and black shorts walks on a sidewalk past a multi-family house. The scene changes to a street corner where two men are sitting on the sidewalk next to a dark-colored SUV. Several men are standing around them, including a police officer and men in red and blue shirts. The man in the blue shirt films the scene and shows his badge. He walks closer to the men on the sidewalk. One of the men standing near the SUV holds a can. The man in the red shirt talks on a cell phone. A shirtless man runs by on the sidewalk. The man filming turns the camera to show traffic backed up on the street. He turns the camera back to the men on the sidewalk and the men standing nearby. The camera pans to follow the shirtless man running down the sidewalk. The camera focuses on a navy blue Mazda sedan parked on the street. The man in the blue shirt points to something off-screen. A Dodge Durango SUV drives down the street. The camera follows a gray Honda Pilot down the street as other vehicles follow behind. The camera pans across the street to show other vehicles. A black Chrysler sedan is visible. The man in the blue shirt focuses the camera on the scene with the police officer and the men on the sidewalk. He walks along the sidewalk past the stopped traffic.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the police do to the photographer at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Allow them to leave.",
+ "B. Place them in the police car.",
+ "C. Put them in handcuffs.",
+ "D. Point a pistol at them."
+ ]
+ }
+ },
+ "sOXjkfcp": {
+ "video_id": "sOXjkfcp",
+ "video_duration": "274s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with gray hair and a blue tank top is sitting in a chair and holds a lottery ticket. He gives a thumbs up. Next, a man with a black cap is speaking to the camera. The screen fades to a dark purple color. Then two red angled rectangles are presented with a moving circular image inside them, followed by a title card reading \u201cScratch Life.\u201d Next, the same man with a black cap is speaking to the camera. He is followed by the man with the gray hair, who is speaking to the camera. The man with gray hair is holding a lottery ticket, and a coin is shown scratching off the covered parts of the ticket. The numbers being scratched off are being revealed. There are now two men in the shot. One man, with the black cap, appears next to the man with the gray hair. One man is shown looking at the camera after the lottery ticket is completely scratched off. The man with gray hair shakes his head and laughs. The man with the black cap is looking at the ticket.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the highest possible prize that can be won with this scratch-off ticket?",
+ "answer": "B",
+ "candidates": [
+ "A. $100,000.",
+ "B. $1,000,000.",
+ "C. $500,000.",
+ "D. $2,000,000."
+ ]
+ }
+ },
+ "ydeDAksp": {
+ "video_id": "ydeDAksp",
+ "video_duration": "83s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Two people are kneeling on a white carpet and packing a suitcase. Clothing in various colors are scattered around the suitcase on the carpet. The man picks up a gray shirt with a plaid pattern, and the woman reaches into the suitcase, placing items inside. The man then places a pile of dark shirts on the suitcase, and the woman places a yellow top into the suitcase. Next, the camera shows a close up of a power outlet and a set of USB ports in an airport. A hand is seen inserting a charging cable into a phone, then the other end of the cable is plugged into one of the ports. The camera cuts to a person holding a tablet and a stylus, using the stylus to interact with the tablet. The background is blurry, and contains other people and what appear to be various signs. Next, the video shows a camera bag open on the floor. A hand is seen placing items inside the bag, including a lens, the camera body, and another electronic device, before closing the bag. The next shot shows a young boy sitting in a plane seat, looking at a handheld device. The camera cuts to three men sitting in airplane seats. The man closest to the camera puts on a set of headphones. The next shot shows a woman using her phone. The phone has a green screen, and sits on top of a suitcase with various travel items in the suitcase. The camera cuts to a person sitting on a stone structure, using a cell phone, with a red suitcase beside her. The next shot shows a man in a forest, looking at his cell phone. Next, there are a number of SIM cards on a white background. The next shot shows a close up of a cell phone that is displaying a VPN app that is loading. The final scene is of a woman taking a photo of a landscape with a cell phone.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many phones appear in this video?",
+ "answer": "D",
+ "candidates": [
+ "A. Five.",
+ "B. Seven.",
+ "C. Eight.",
+ "D. Six."
+ ]
+ }
+ },
+ "XYAYaULv": {
+ "video_id": "XYAYaULv",
+ "video_duration": "91s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a series of comic book panels. In the first panel, a man in a dark coat and a hat holds a glowing orb in his hand. He has a mustache and is also holding an umbrella. Two speech bubbles are visible. In the next panel, three people are visible. Two of them, a man and a woman, are suspended in the air, with a symbol behind each of them. A third person, wearing a dark hood and ornate armbands, is below, facing them. The following panel depicts a man with wings, dressed in a gold and olive costume, striking another man in the face. A speech bubble with an onomatopoeia is visible. Next is a panel with two versions of a red-haired woman; one is solid and the other is partially transparent, and has wings. A speech bubble is also present. In the next image, a man with bone-like wings and a mask of some sort is holding a winged helmet in his hands. There are speech bubbles. The following panel shows this same man fighting a group of winged warriors. There is an onomatopoeia in a speech bubble. Next is a panel that shows the masked man from before, but in a stony form with bone-like structures extending from his back. The stony figure from the previous panel is now surrounded by a cloudy white material. Then is a panel that shows many winged figures in gold and reddish-brown costumes, surrounded by a yellow glow, with the two figures from the second panel below them. The next panel depicts the male and female figures from the second panel lying next to each other on the ground, face-to-face. In the following panel, the same two figures from the second panel are speaking to one another. The woman is wrapped in bandages. The man is bare-chested. There are three speech bubbles. Then is a panel that shows the same two figures walking away in the distance, and there are speech bubbles. The setting is a ground littered with white skulls. Next, two panels show the male and female figures amongst white clouds and white skulls. The man is kneeling in the second panel. In the next panel, a man with long hair is speaking with another speech bubble, next to a large light burst. Then, there's an image of the same man, crying. A speech bubble is visible. In the following panel, the female figure, now with wings, extends her hand. There's a speech bubble and another panel with the two figures holding hands. The man is kneeling. Next is a close-up of the red-haired woman and a speech bubble is present. In the following panel, the two figures from the previous images are facing one another and there are two speech bubbles. Then there are two panels showing the figures, now with wings, holding each other, and the male figure is seen in profile. Speech bubbles are present. In the next image, the two figures are surrounded by a yellow light. There is a speech bubble. In the final panel, there are two images. In the left panel the same two figures are surrounded by a yellow-white light, and in the right panel they are hugging and wearing their winged costumes from previous panels. There is a speech bubble.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many golden avatars appear on the bearded man and the red-haired woman in the scene where they fall to the ground?",
+ "answer": "C",
+ "candidates": [
+ "A. Fifteen.",
+ "B. Thirteen.",
+ "C. Fourteen.",
+ "D. Twelve."
+ ]
+ }
+ },
+ "CCshqfiZ": {
+ "video_id": "CCshqfiZ",
+ "video_duration": "102s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A chef stands with his back to the camera in a restaurant kitchen. Another chef is visible in the background working with plates.\u00a0A montage of quick cuts shows garlic and green onions being chopped, a woman with dark hair holding her fingers to her ears, a woman covering her face with a wooden spoon and holding a whisk, and a pair of hands shredding something. A woman with a parrot on her shoulder crosses her arms in front of a black background. A montage shows a different woman with long light brown hair flexing her arm, looking serious, and holding a spoon with a parrot on her shoulder. The two women stand side-by-side with crossed arms. Two women in matching black hoodies stand at a kitchen counter with various foods in glass bowls in front of them. Parrots can be seen on perches in the background and on the women's shoulders. One woman begins preparing food on a white cutting board.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What genre of music is featured at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Classical music.",
+ "B. Folk music.",
+ "C. Pop music.",
+ "D. Rock music."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Where does the parrot eventually end up on the shoulders of the women?",
+ "answer": "D",
+ "candidates": [
+ "A. On the blonde woman's left shoulder.",
+ "B. On the black-haired woman's left shoulder.",
+ "C. On the black-haired woman's right shoulder.",
+ "D. On the blonde woman's right shoulder."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What happens to the music when two women are present in the kitchen?",
+ "answer": "D",
+ "candidates": [
+ "A. The music's volume increases.",
+ "B. The music stops.",
+ "C. It is unclear.",
+ "D. The music's volume decreases."
+ ]
+ }
+ },
+ "fWlmhNhg": {
+ "video_id": "fWlmhNhg",
+ "video_duration": "150s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a game being played. There is a grey interface on the left, and the game is on the right. At the beginning, a cartoon image of a white can with sunglasses and a red horn appears on the screen, along with falling cartoon images of dollar bills, and a message appears saying \"New Subscriber! theoriginalsev\". This image transitions to a black and white image of a man with a hat and glasses with falling dollar bills and the words \"Dat Lyte\" and \"Latta Bitch\". This image is then repeated, but this time with 3 images of the man and falling dollar bills. These images then change to show an in-game view of two cars. One of them starts to move in the game and is followed by a figure with a blue hat and backpack. The player then enters an in-game inventory menu with a number of options before a cartoon image of an animal appears. The player then moves into a field where several other characters and vehicles can be seen. The player gets on the top of one of the vehicles, and then a second figure can be seen on a car. The player shoots their gun and a black and white image appears with the words \"Dat Lyte\", and \"Rekt\". This is followed by images of a man with red glasses and the words \"Rekt\", and then the game play continues. The player is in a field with other characters, and a message on the screen says \"Get\", before gameplay resumes. The camera follows the player, and several other figures are seen. The player shoots at a vehicle and then a black and white image appears with the word \"Rektboth\" across it. This image is followed by black and white images with the words \"Dat Lyte\" and \"Rekt\", and an image of a man with red glasses with the word \"Rekt\" across it. Another image of a man with red glasses and the word \"Rekt\" is shown, then gameplay continues. The player is shown moving in the field and then the word \"WASSUP\" is displayed. The player is shown in the field and then shoots at another character. The player shoots again, and then a black and white image appears with the words \"1 Rekt\". The black and white images repeat and then the player is shown in game view, shooting another player. A text box with the words \"Thanks for the donation\" appears, with text about who donated. The player is near a vehicle, and then shoots at another figure. The camera is angled from above, and follows a vehicle, and another player. A black and white image of a man with glasses and a hat appears, followed by another man with red glasses, and the word \"REKT\". The game shows the player shooting the other player again and again. The player is moving towards a camper and shoots at it and a small house next to it. The player is now behind a camper and shoots at an enemy. A black and white image of the game appears with a car, then another player is seen on top of the car. The player shoots and kills the figure on top of the car and falls on the ground. The image turns to black and white, and the game displays \"H1Z1\" in red letters, the player is shown on the ground, along with another player on the top of the vehicle. The game displays the number of kills and vehicles destroyed and says \"new subscriber! Xalamance\", before the game screen goes back to the action and a player on the vehicle is shown.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What was the number on the road sign the jeep collided with?",
+ "answer": "D",
+ "candidates": [
+ "A. Number 26.",
+ "B. Number 23.",
+ "C. Number 24.",
+ "D. Number 25."
+ ]
+ }
+ },
+ "cmjFadrC": {
+ "video_id": "cmjFadrC",
+ "video_duration": "149s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A hand is shown painting a partially assembled model house with a dark brown paint. The hand applies a thin coat of paint to all visible sides of the house. A small glass jar and a can of paint are visible on the table next to the house. Next, a hand is shown using a ruler to cut several strips of cardboard. The hand then uses a green cloth to wipe a dirty surface, followed by a hand shown folding strips of cardboard along the scored lines, before cutting them into small squares. The hand is then shown applying glue to the underside of small squares before sticking them to a larger piece of cardboard, and then doing the same to the outside of one of the houses, using a black marker to make vertical lines over these cardboard squares. Then, a hand is shown examining a container of paint, which is a grey color, before using a brush to apply the grey paint to one of the model houses, also adding dark grey to the rooftops.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many floors are there in the house model made by the man in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. Three.",
+ "D. One."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the man's next step after cutting the corrugated cardboard strips?",
+ "answer": "B",
+ "candidates": [
+ "A. Color the cardboard strips.",
+ "B. Apply glue to the roof.",
+ "C. Paint the roof.",
+ "D. Attach the roof to the model."
+ ]
+ }
+ },
+ "BRvMYKmO": {
+ "video_id": "BRvMYKmO",
+ "video_duration": "465s",
+ "duration": "6-8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A man in a suit and tie stands amidst a throng of people, many of whom are holding microphones, cameras, and smartphones towards him. He speaks, gesturing with his hands. The crowd around him is densely packed, with individuals jostling for position. A woman in a red blazer stands near him, also holding a microphone. People in the crowd hold phones up, some appearing to record video, others taking photographs. The man continues to speak, looking at different people in the crowd and occasionally adjusting his tie. The background includes trees and what appears to be a street or road. The sun is shining, creating strong contrasts of light and shadow across the scene.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times was the short-haired woman in a red dress asked a question and then answered?",
+ "answer": "C",
+ "candidates": [
+ "A. Once.",
+ "B. Twice.",
+ "C. Three times.",
+ "D. Four times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the color of the tie worn by the journalist in the gray suit in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Blue.",
+ "B. Silver-gray.",
+ "C. Black.",
+ "D. Red."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What device is the bald man using to ask a question?",
+ "answer": "D",
+ "candidates": [
+ "A. A boom microphone.",
+ "B. A walkie-talkie.",
+ "C. A phone.",
+ "D. A microphone."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the likely profession of the man answering the question in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A member of parliament.",
+ "B. A prominent lawyer.",
+ "C. A company owner.",
+ "D. A football club manager."
+ ]
+ }
+ },
+ "GLeoqUaW": {
+ "video_id": "GLeoqUaW",
+ "video_duration": "63s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A large, two-legged robot walks forward. It then appears to explode. A green robot is shown standing on a brown, sandy ground, then its weapons are shown on the sides and top of its chassis. A red number 31 is displayed on the front of it. A text box that reads \"FLIGHT MODE\" appears. The green robot is shown walking in a city area with grey and red buildings. The robot then launches up in the air and explodes. A white robot with purple-tipped guns is shown in a snowy city area. A green robot is shown flying in the air in a city area with a blue beam of light going to the ground. A red robot is shown on fire in the city. A green robot is shown firing its guns at another robot on a city street. The screen shows a pause symbol, then a rewind symbol. A green robot fires its guns and a blue beam comes out of it. A green robot is shown flying in a city. A red robot is seen exploding while walking on a road. A green robot is shown flying in the city area with multiple other robots. A brown robot with a laser is shown firing it on a desert-like surface. A green robot is shown flying in a city. A red shape is visible on a building in the background. A green robot is shown firing its guns while flying and walking.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order in which Robot Muramitz defeated the other robots? (a) The shield-raising robot, (b) the spider-like robot, (c) the gray robot with a cannon on its head, (d) the yellow robot walking on the ground, (e) the small robot moving on yellowish soil.",
+ "answer": "B",
+ "candidates": [
+ "A. (c)(d)(a)(b)(e).",
+ "B. (d)(a)(c)(b)(e).",
+ "C. (b)(d)(a)(c)(e).",
+ "D. (e)(d)(a)(c)(b)."
+ ]
+ }
+ },
+ "XZBOvwNp": {
+ "video_id": "XZBOvwNp",
+ "video_duration": "136s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a large outdoor festival with a stage resembling a dragon's head. Many people are shown dancing, some wearing masks and costumes. There are also aerial views of the festival grounds, showing tents and a large crowd gathered. There are some shots of a person holding a microphone and interviewing several people. Fireworks and light shows are also visible at night. Multiple national flags are shown. Some of the people have red markings on their faces. One person holds a sign.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Is the interviewee in the video male or female?",
+ "answer": "A",
+ "candidates": [
+ "A. All are boys.",
+ "B. One girl and two boys.",
+ "C. All are girls.",
+ "D. One boy and two girls."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many times does the female host appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Three.",
+ "D. Four."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At which point in the video does the man make a heart gesture?",
+ "answer": "A",
+ "candidates": [
+ "A. The beginning of the video.",
+ "B. The middle part of the video.",
+ "C. It does not appear.",
+ "D. The latter part of the video."
+ ]
+ }
+ },
+ "JSOKEFZj": {
+ "video_id": "JSOKEFZj",
+ "video_duration": "84s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "Two young men stand on a stage under blue spotlights. Behind them, a large screen displays \"We Are Two Different People Tour\". Both wear matching gold and light blue, shiny bomber jackets, white t-shirts, dark pants, and white sneakers. They each hold a microphone. One man speaks into his microphone while the other listens, occasionally nodding or gesturing. The man speaking walks back and forth across the stage as he talks. The camera zooms in on each speaker as they talk. The background screen changes to light blue, and text appears on it. The man speaking points at the screen, and an image appears next to the text. The other man walks to join him, looking at and gesturing to the screen. The first man resumes pacing and talking. He claps his hands together. The second man joins him at center stage, and they continue talking into their microphones, gesturing. One touches his head. They move apart again. The first speaker points, and the background screen returns with an image.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who in the video says, \"Do you want us to mock a fictional character?\"",
+ "answer": "B",
+ "candidates": [
+ "A. The audience.",
+ "B. The comedian on the right.",
+ "C. The host.",
+ "D. The comedian on the left."
+ ]
+ }
+ },
+ "UxOeBDOa": {
+ "video_id": "UxOeBDOa",
+ "video_duration": "128s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a car driving on a highway during the day. Other vehicles are also visible on the road. The sky is overcast. There is green vegetation along the sides of the road and on the embankment. Road signs are present as well. In a small inset at the bottom right of the video, two people are shown inside a car. The person on the left is wearing a blue sweatshirt and a black mask. The person on the right is wearing a dark jacket, and a blue mask. The person on the left gestures with their hands several times in the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, who stated, \"He slowed down because you turned on your turn signal in time, as if you were indicating that you were changing lanes?\"",
+ "answer": "D",
+ "candidates": [
+ "A. The man wearing a black mask in the driver's seat.",
+ "B. The man wearing a blue mask in the passenger seat.",
+ "C. The man wearing a blue mask in the driver's seat.",
+ "D. The man wearing a black mask in the passenger seat."
+ ]
+ }
+ },
+ "nJtNQlLj": {
+ "video_id": "nJtNQlLj",
+ "video_duration": "247s",
+ "duration": "4-6min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video displays a series of screen captures from what appears to be a video game. The first few frames show various game menu settings, including video, controller, and audio options. Later frames depict a first-person perspective of a character within the game's environment. The character is seen building structures, shooting at other characters, and moving around the map, which includes green grass, hills and some trees. Finally the video returns to showing the game menu options again, mostly focusing on controller settings, before ending.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What value did the man in the video ultimately set the ADS LOOK HORIZONTAL SPEED to?",
+ "answer": "C",
+ "candidates": [
+ "A. 0.13.",
+ "B. 0.10.",
+ "C. 0.14.",
+ "D. 0.12."
+ ]
+ }
+ },
+ "EhikGjbZ": {
+ "video_id": "EhikGjbZ",
+ "video_duration": "87s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Camping",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a dark blue scene and the words \u201cWINTER CAMPING\u201d shown in a blurred white. The scene cuts to black, then transitions to a man with a beard and glasses, shot from the chest up. He is wearing a blue coat and is outdoors, and the sky is visible. A caption below the man reads \u201cDrew\u201d. The scene changes to a gray tent with two white ropes on either side, in a field of snow, surrounded by trees. A hashtag on the screen reads \u201c#SnowCamp2019\u201d. The camera pans to the right, showing a red shovel with a black handle, and various poles in a white bucket. The camera pans to a light gray Jeep in a field of snow, where a person is digging snow around it. A caption below reads \u201cKara\u201d and \u201c#MIAdventureDog\u201d. The person digs around the front of the Jeep. A brown and tan dog appears and sniffs the snow. The person then moves to the side of the Jeep, as the dog walks around, and a caption reads \u201cTom\u201d. The person waves, still shown in front of the Jeep. The camera pans back to the gray tent, then a man wearing a black coat, red hat, and yellow gloves exits the tent. A caption below reads \u201cDick\u201d. The man is shown smoking. The scene changes back to the person digging snow around the Jeep and the dog. The person is shown digging snow around the Jeep and the dog. The camera is on the Jeep and the person digging snow, and a dog is present. The camera moves away from the Jeep to the right, where a black mesh trailer is shown. The camera pans to the right to show a silver truck stuck in the snow, pulling a trailer with a blue tarp over it. Then the camera is back on the person and Jeep in the snow. The camera then shows the person digging snow in front of the Jeep, as well as the dog. The man wearing a black coat, red hat, and yellow gloves is shown using a yellow tool in front of the gray tent. The camera is then back on the person and Jeep in the snow, as well as the dog. The scene cuts to the man with a beard from the beginning, shot from the chest up and outside. The scene then transitions to a Jeep with headlights on driving on a snow covered road at night. A caption below reads \u201cTrail ridin\u2019 with Tom\u201d. The Jeep is shown multiple times driving down the snowy road. The scene then shows 3 individuals digging snow around the Jeep with shovels, with headlights on them at night.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Which skill are the characters in the video most likely to have in common?",
+ "answer": "A",
+ "candidates": [
+ "A. Driving vehicles.",
+ "B. Weather observation.",
+ "C. Starting a fire and cooking.",
+ "D. Vehicle repair."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the man wearing a blue and white knitted hat shoveling snow in the first half of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. He wants to dig out his car.",
+ "B. He wants to build a snowman.",
+ "C. He wants to clear space to start a fire.",
+ "D. He wants to clear space to set up a tent."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the most likely first thing this group of people will do next?",
+ "answer": "C",
+ "candidates": [
+ "A. Begin a jungle adventure.",
+ "B. Wait in place for other friends.",
+ "C. Drive the car to a location with signal reception.",
+ "D. Warm up by making a fire around the vehicles."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many types of cars appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. One type.",
+ "B. Two types.",
+ "C. Four types.",
+ "D. Three types."
+ ]
+ }
+ },
+ "ECbnUEnr": {
+ "video_id": "ECbnUEnr",
+ "video_duration": "133s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view from inside a race car, showing the track ahead. The car has a red, orange, yellow, and blue paint scheme with a \"Toyota\" logo and several other logos. The view changes to an overhead shot of a dark-colored race car traveling on the track. The car passes another red car. Another dark-colored race car appears, and the two dark cars appear to race each other. Another camera shot shows the dark car passing a green one on a curve. Then it shows the dark car making its way down the track alone. The scene cuts to several cars making their way down the track, with a crowd present on the side of the track. Another view from the side shows several cars coming to a pit stop for service, and the teams tending to the cars. One car is serviced by team members in white, and then is pushed out of the pit stop area. Another overhead view shows several cars going around a track, and the camera focuses on a green car as it moves. The view changes to a shot of two green cars next to each other, and then the camera follows one of the green cars as it races down the track. The camera shifts to an overhead view of the track as several cars go around, and then it zooms out to an aerial view of the entire track with a graphic stating the location of the race as West Allis, Wisconsin.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How long did Michael Andretti remain in the pit area?",
+ "answer": "D",
+ "candidates": [
+ "A. 12.3 seconds.",
+ "B. 12 seconds.",
+ "C. 13 seconds.",
+ "D. 13.3 seconds."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did you see any injured staff members not wearing helmets?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "XYiHQWXx": {
+ "video_id": "XYiHQWXx",
+ "video_duration": "110s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man stands in front of a milling machine, holding a small, dark, irregularly shaped object in his left hand. He brings the object closer to the drill bit of the machine. A close-up shows the drill bit making contact with the object. The man then holds a similar object, slightly larger and with two circular openings, and repeats the process of bringing it close to the drill bit, making contact. The drill bit spins against the object. The man then picks up a rectangular, metallic piece and again brings it close to the rotating drill bit, this time from a different angle. The process is repeated, with the camera focusing on the drill bit touching the metallic piece. He switches to a cylindrical metallic piece, placing it horizontally on the machine's platform, and touches it with the rotating drill bit.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the purpose of the machine tools shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Cut metal.",
+ "B. Smelt metal.",
+ "C. Heat metal.",
+ "D. Polish metal."
+ ]
+ }
+ },
+ "cJegsNat": {
+ "video_id": "cJegsNat",
+ "video_duration": "180s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Bowling Ball",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a close-up of a person wearing a red baseball cap and red jersey with white lettering. He has a beard. He walks past a catcher wearing a grey helmet. They both walk along the edge of the field. The man gestures with his open hand. The man stops in front of an umpire wearing a black baseball cap and black shirt with white lettering. The two men briefly shake hands. The camera angle changes to a man wearing a white jersey with black lettering and a black baseball cap. He is on the opposite team and is giving his glove to an umpire. The camera angle changes to the first man as he is speaking to the umpire. The umpire is looking at the man. There is an insert shot of their hands shaking, with the middle finger of the man in red jersey extended during the handshake. The camera angle changes to a different view, focusing on the umpire. The man in the red jersey can be seen behind the umpire as he speaks to him and walks to the team bench. The man in the red jersey comes closer to the umpire and yells at him as the umpire faces him. The video shows a closer angle of the umpire, and the man in red is further away. A group of men are near the team bench as one man in red comes toward the man in the red jersey. They hold him back. One of the men is wearing a tan uniform shirt. The man in red is yelling and moving his arms. Then the video shows an umpire in black standing on the baseball field. He moves his hands as if he is cleaning something off. The video shows a close up of the man in red jersey standing by the team bench. The man in the red jersey is again speaking with the umpire. The video shows close-ups of both the umpire and the man in the red jersey, each taking up most of the frame. The video ends showing the man in the red jersey standing next to his teammates, speaking to the umpire again.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why did player number 40 have a conflict with the referee?",
+ "answer": "C",
+ "candidates": [
+ "A. The referee verbally provoked the player.",
+ "B. The player's words angered the referee.",
+ "C. The player believed the referee was unfair.",
+ "D. There was an old grudge between the player and the referee."
+ ]
+ }
+ },
+ "NJeocnZO": {
+ "video_id": "NJeocnZO",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a figure standing in a teal glow against a black background. The figure has spiky white hair and is wearing dark clothing with a teal vest. The next shot shows this figure standing in the teal glow, and a second figure, also with spiky white hair, sitting by a yellow fire. The second figure is wearing a cream-colored outfit with a teal vest. There is a zoom in on the face of the second figure. The next scene shows a teal-blue sky. Then the scene transitions to the figure in the cream outfit sitting by a fire with teal arcs swirling above him. The caption \"Kakashi spirit is released\" is at the bottom of the screen. The scene transitions again to a figure with spiky white hair, a black mask, and a cut over one eye. This shot is followed by a scene of the figure on the ground with two figures standing above him. The three figures are on a rocky terrain. Next, there is a close-up of a figure with spiky red hair and the white-haired, masked figure. A caption of \"Kakashi rebirth\" appears at the bottom of the screen. The final shot of the video shows the three figures in a field with a large white, slug-like creature. The caption \"Kakashi rebirth\" remains on the bottom of the screen.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes does the long-haired, red-brown-haired male character experience in the latter half of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. From excitement to anger.",
+ "B. From excitement to sadness.",
+ "C. From anger to sadness.",
+ "D. From sadness to happiness."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the masked white-haired male character and the white-haired male character wearing a hairband beside him at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Friends.",
+ "B. Grandparent and grandson.",
+ "C. Brothers.",
+ "D. Father and son."
+ ]
+ }
+ },
+ "YnuJrEOl": {
+ "video_id": "YnuJrEOl",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A soccer ball rests on a field near the touchline. A player's foot, clad in a bright yellow cleat and blue sock with \"Hajduk\" written on it, is placed next to the ball. The view switches to a wider shot of the field. Players in white and blue are positioned around the center circle. The players in the dark uniforms pass the ball amongst themselves, moving forward. A scoreboard shows \"Hajduk Split 1 - 0 Everton\". The score is from a goal by Rado\u0161evi\u0107 in the 43rd minute. The time is 45+ minutes, near the end of the first half. A player in white prepares for a free kick. He kicks the ball toward the goal. The goalkeeper in yellow dives to his right, but the ball sails wide of the goal. The goalkeeper retrieves the ball from near the corner flag and takes a goal kick. The players in dark uniforms celebrate a goal by hugging each other. The Everton fans in the stands cheer and raise their arms. The view changes to a high-angle, wide shot of the entire stadium, brightly lit at night. The players are small figures on the field below. The players in white and blue move the ball upfield. They are close to the opponent's goal. The view changes again, showing a closer shot of the goal. The player in white shoots. The goalkeeper dives but the ball hits the post and goes in. The goalie lies on the ground near the post. The goalkeeper walks towards the net. He jumps and catches the ball. He falls down. A player in dark clothes kicks the ball in the air, then kicks it with his left foot. He kicks the ball again. It travels high above the field. The goalie dives to his left, and fails to catch the ball, which enters the goal. A player in blue kicks the ball. Another player in a dark uniform falls to the ground trying to block it. The players in blue chase after the ball. The goalkeeper catches the ball near his goal line. The Everton players hug each other, celebrating the goal.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the audience pull open the blue banner?",
+ "answer": "B",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the beginning of the video.",
+ "C. Throughout the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, when the man wearing the number 18 blue jersey kicks a critical goal, is the man wearing the number 23 white jersey positioned to his left or right?",
+ "answer": "C",
+ "candidates": [
+ "A. On the right rear.",
+ "B. On the left rear.",
+ "C. On the left.",
+ "D. On the right."
+ ]
+ }
+ },
+ "yhHaTuLg": {
+ "video_id": "yhHaTuLg",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows several different scenes. In one scene, three individuals are seated on a brown leather couch. There are two decorative pillows with a brown, tan, and black pattern behind them. The person on the left is wearing a red jacket with a white undershirt, black pants, and black shoes. The person in the center is wearing a long-sleeved red shirt with a dark floral pattern, black pants, and is also wearing a watch. The person on the right is wearing a blue jacket, dark pants, and white sneakers. All three individuals are talking. In another scene, a single woman with blonde hair is seated on a chair; she is wearing a dress with a brown and black cheetah print pattern. She is holding a black rectangle with white lettering on it. There is a purple background behind her with \"BOHEMIAN RHAPSODY\" in golden lettering along with a faint golden pattern. In another scene, a woman with blonde hair is seated on a chair. She is wearing a long-sleeved white dress with a black heart pattern on it. There is a purple background behind her with \"BOHEMIAN RHAPSODY\" in golden lettering along with a faint golden pattern. A purple box appears with white letters that say \"Lucy Boynton\". In another scene, the same woman with blonde hair in a cheetah print dress is seated on a chair, holding the black object. To the right side of the screen is a partial view of a second individual. In the background is a movie poster for \"BOHEMIAN RHAPSODY\". Lastly, there is a close-up of an older woman. She has white hair and is wearing a red hat with a red flower on it. She is also wearing a red blazer with a pearl necklace and pearl earrings. She has a smile on her face.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, what is the position of the woman wearing leopard print relative to the three men?",
+ "answer": "A",
+ "candidates": [
+ "A. The woman in leopard print is directly in front of the three men.",
+ "B. The woman in leopard print is to the right of the three men.",
+ "C. The woman in leopard print is to the left of the three men.",
+ "D. The woman in leopard print is behind the three men."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many times does the woman wearing a red hat appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. Two.",
+ "D. Five."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the woman in the heart-print dress speak with an incredulous tone in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Because she answered correctly.",
+ "B. Because she was puzzled by the man's response in the video.",
+ "C. Because she didn't believe the host's answer.",
+ "D. Because she guessed the answer incorrectly."
+ ]
+ }
+ },
+ "PAqziTZB": {
+ "video_id": "PAqziTZB",
+ "video_duration": "539s",
+ "duration": ">8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Celebrity Interviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman with long, dark hair is speaking in what appears to be a convention center or large public space. She wears a black top and a teal lanyard. She gestures with her hands while speaking. People walk in the background. There are signs and banners visible in the background, some featuring images of a woman. Text in a non-English language is superimposed over the lower part of the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the woman crying in bed?",
+ "answer": "C",
+ "candidates": [
+ "A. Because she is not allowed to practice her religion.",
+ "B. Because she is a victim of domestic violence.",
+ "C. Because she has no personal rights.",
+ "D. Because she has no money to spend."
+ ]
+ }
+ },
+ "CBquZJCC": {
+ "video_id": "CBquZJCC",
+ "video_duration": "67s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A light-skinned man wearing a black hooded sweatshirt with white and blue designs is in the center of the frame. He is also wearing eyeglasses and a watch. His mouth is open while he speaks and his hands are open, gesturing outward. In the background is a wooden bookshelf containing several books, a model car, and a model of a large vehicle. In the foreground is a microphone stand.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the car model on the bookshelf relative to the award certificate?",
+ "answer": "B",
+ "candidates": [
+ "A. The car model is in front of the award certificate.",
+ "B. The car model is to the right of the award certificate.",
+ "C. The car model is behind the award certificate.",
+ "D. The car model is to the left of the award certificate."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the man in the video make the OK gesture with his hands?",
+ "answer": "D",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. Throughout the video.",
+ "C. In the middle of the video.",
+ "D. Towards the end of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What colors are featured on the characters in the yellow book cover located at the upper left part of the bookshelf?",
+ "answer": "B",
+ "candidates": [
+ "A. Yellow, green, blue, and red.",
+ "B. Orange, green, blue, and red.",
+ "C. Orange, green, yellow, and red.",
+ "D. Orange, black, blue, and red."
+ ]
+ }
+ },
+ "oovwWWMl": {
+ "video_id": "oovwWWMl",
+ "video_duration": "82s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Historical Analysis",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. The scene changes to show three 6x6 Rubik's Cubes on a table in front of a person wearing a light-colored shirt with a small image. The person moves their hands and picks up the middle cube, moving the different colored faces. The person continues to manipulate the cube. The person finishes the cube and places it back on the table, and the video transitions to the next scene, which shows a man in a black suit against a white background, his hands on his face, making an animated shocked expression. The scene transitions to a view of a red, white, and blue 6x6 Rubik's Cube with a black background. The scene changes again to a view of a 6x6 Rubik's Cube with a green face and a gold or tan side. The cube is on a black background. The scene changes to a stacked view of a 6x6 Rubik's Cube with mixed colors on a black background. The scene transitions to a view of a 6x6 Rubik's Cube with a varied color design, on a black background. The scene shows a 6x6 Rubik's Cube in the foreground with a landscape of city lights and the Acropolis in the background. The video shifts to a shot of the 2005 competition results, with a red circle around one of the award-winning cubes. The scene changes to a shot of two 6x6 Rubik's Cubes next to each other, with a statement on the image. The next scene is the logo for eBay. The scene transitions to a white and black 6x6 Rubik's Cube on a white background. The scene transitions to a 6x6 Rubik's Cube with red, yellow and blue colors; it is displayed on a stand with a grey background. The video transitions to a shot of a man sitting in a chair holding a Rubik's Cube, with several other cubes on the table in front of him. The video changes to a single image of a man in a white t-shirt. The scene transitions to two men with two cubes circled in red. The video shows a close-up of two hands that are touching a speed timer and a solved cube. 
The video shows hands manipulating a rubik's cube next to a speed timer that is counting the seconds. The person continues to manipulate the cube and the timer continues to count the seconds.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What happens to the timer displayed in front of the man during his Rubik's cube tutorial just before the video ends?",
+ "answer": "B",
+ "candidates": [
+ "A. The timer screen malfunctions and goes dark.",
+ "B. The timer starts running.",
+ "C. The timer screen begins to flicker.",
+ "D. The timer remains stationary."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Which size Rubik's Cube tutorial is shown at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. 5\u00d75.",
+ "B. 3\u00d73.",
+ "C. 6\u00d76.",
+ "D. 7\u00d77."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will the video on solving a 7x7 Rubik's Cube eventually receive an honorary nomination?",
+ "answer": "C",
+ "candidates": [
+ "A. It will not win an award.",
+ "B. I am not sure.",
+ "C. It will win an award."
+ ]
+ }
+ },
+ "sLEmCdKl": {
+ "video_id": "sLEmCdKl",
+ "video_duration": "121s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A video game is being played with a pink bird-like figure flying over an area with green grass and trees. A white worm-like creature is in the grass. Text appears at the bottom of the main viewing area, saying \"What will Talonflame do?\". The pink bird is shown flying around, then a text box appears saying \"It's sup\". Then the pink bird disappears and a white worm-like creature spins around in the grass, then lies on its side. The pink bird reappears and a text box says \"Talonflame is damaged by the recoil\". Then the main viewing area is shown to be black. Then a video game character is walking in a forest-like area. Then a text box appears saying \"Repel's effect wore off! Use another?\". Another text box says \"Angel used a Super Repel\". The character is shown walking in the forest area again. Then the main viewing area is black again. Then the main viewing area shows many white squiggles falling down in front of a black background. Then tall green grass appears. Then a white worm-like creature appears with blue sparkling stars around it. Then text appears saying \"A wild Caterpie appeared!\". The character is shown running through the forest. Then the main viewing area is all white. Then a pink bird is shown flying in the air over the green grass and trees and sparkling. Then text appears saying \"What will Talonflame do?\" The pink bird is shown flying again. Then the main viewing area switches to a close up of the white worm-like creature. Then a text box appears saying \"Angel used a Luxury Ball\". The main viewing area switches to a close up of the white worm-like creature being enveloped by a sphere of light. Then a ball with blue, white and yellow appears floating on the grass. Text appears saying \"Gotcha! Caterpie was caught!\". The pink bird appears in the main viewing area again and a text box says \"Talonflame gained a boosted 16 Exp. Points!\". Then the main viewing area turns black. 
Then the main viewing area is white with a white worm-like creature in the center. Text appears saying \"Give a nickname to the caught Caterpie?\". Then the main viewing area turns black. Then the video game character is shown walking in the forest-like area again. Then the main viewing area turns black. Then the main viewing area shows a screen with multiple stats and characteristics for the white worm-like creature. Then the main viewing area turns black. Then it turns white. Then it is black. Then the main viewing area is a forest-like area with the character walking through it again. The main viewing area turns black. Then it shows many white squiggles falling in front of a black background again. Then a white worm-like creature is shown again in front of green grass. The video ends with the video game character walking in the forest again.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many insect-like Pok\u00e9mon appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Zero.",
+ "C. Two.",
+ "D. One."
+ ]
+ }
+ },
+ "UVTWcmEM": {
+ "video_id": "UVTWcmEM",
+ "video_duration": "104s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A young boy types on a laptop keyboard. He swivels around on a rolling cart with the laptop and a video camera. He picks up headphones and places them on his head. He then picks up the camera and aims it toward two girls. The perspective switches to the camera's viewfinder, showing the two girls. One girl points at the camera. They both lean into the camera with excited expressions. The perspective switches back to the boy, who operates the video camera. The view changes to a digital timer, which reads 24 hours, 6 minutes, and 55 seconds. The girls walk in front of a vehicle prop. The timer changes to 21 hours, 24 minutes, and 57 seconds. A woman enters the scene with a laptop. The camera's viewfinder shows the girls again. A TV screen above them displays a man. The man waves at the screen. The perspective changes back to the boy. The digital timer is shown again and advances a few seconds. Three men in yellow hardhats enter the room. One of the girls pulls a small object from her ear. One man gestures and the men leave. The digital timer advances. The perspective changes to the viewfinder as the original two girls appear. The boy films them as they stand by the vehicle prop and the TV screen. A man appears on the small TV screen and makes an angry face. The timer is visible again as it advances a few seconds. The boy adjusts the camera, and a woman approaches him. The man on the TV gets closer to the screen. The perspective changes again to the camera viewfinder. Three men in yellow hardhats walk into the scene. They walk over to the equipment cart. One of the men gestures with his arm, and the two girls react with surprise. The men and the women leave. The two girls stand next to the man in the hardhat. The camera viewfinder displays the two girls and the man in the hardhat. One of the girls picks up a bundle of wires. The boy continues to film as the two girls and the woman with the laptop unplug equipment. 
The group walks towards a staircase. The camera follows them. The group walks up the staircase. The boy strains to carry the cart.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Where will the girls in the video likely stream their next live broadcast?",
+ "answer": "A",
+ "candidates": [
+ "A. The hall.",
+ "B. A bedroom.",
+ "C. The stairs.",
+ "D. Outdoors."
+ ]
+ }
+ },
+ "soDBpPyd": {
+ "video_id": "soDBpPyd",
+ "video_duration": "117s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Tech Reviews",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person with light blonde shoulder-length hair sits behind a white table, facing the camera. They are wearing a dark blue shirt with a white logo. In the background, there is a black pegboard wall with tools and books on shelves. The person's arms are resting on the table. At times, a close-up of the inside of a cell phone is shown, during which a gray, flat tool is used to pry open a part of the phone, and then a screwdriver is used to remove a part. Another close-up of a long circuit board is shown and then part of the circuit board with a large silver component. The person's hands remain in a resting position on the table when the phone parts are shown.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the woman in the video likely to do next?",
+ "answer": "A",
+ "candidates": [
+ "A. Discuss the phone's back cover.",
+ "B. Replace the screen of the iPhone 4S.",
+ "C. Introduce the A6 chip.",
+ "D. Analyze the performance of the battery."
+ ]
+ }
+ },
+ "QzQmWjnJ": {
+ "video_id": "QzQmWjnJ",
+ "video_duration": "429s",
+ "duration": "6-8min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a close-up of a person wearing a pink hooded sweatshirt and a black backpack facing the camera. A second person wearing a white cardigan over a pink top with blue jeans is seen to their right, smiling and waving. The second person is also wearing a necklace and has dark hair. Both individuals are facing forward with the camera recording their view. The background is a mix of tan mulch, concrete benches and structures, and buildings with many windows. The camera moves slightly. The second person touches their hair and smiles. The first person is shown from the side, and they are talking while smiling. The camera view shifts back and forth between the first person and the second person. At one point, the second person is presented with a blue overlay text of the name, \"Leslie\". A pause button and mouse pointer appear briefly on the screen. In a following sequence, the second person is shown on camera alone. Some white, hazy lines overlay the scene. Text saying, \"she doesn't know who i am\", is shown. The video then changes to show both individuals, with the first person saying, \"so my main goal is to tell her that we used to date hopefully she reacts in a,\" with the second person touching her hair and smiling. A winking emoji is overlaid over the two individuals. Text saying \"i need to be friends with her i do it\" is displayed over the video. The first person says, \"you trynna sit down\", and the second person laughs. The first person says some additional dialogue. The video then changes to show only the second person. Text saying \"*wtf r u saying*\" is displayed. The camera shifts from the first person to the second person who is seen making hand gestures. The video then changes back and forth between the two people talking and gesturing. Text saying \"i am\" is shown with an image of the second person. The video then changes to show the second person holding up a book. Text saying \"reading\" is shown. 
Both people continue to speak and make gestures, the camera moving between them. At one point, the first person is shown with a filter of small red hearts over their face. The filter also appears on the second person's face. They both continue talking. The second person touches her hair several times. The first person points at the second person. The video shows more hand gestures and talking from both individuals, with the camera changing between the two people. The camera view briefly switches to show other people and then back to the two people. The video then shows text saying \"just over here.\" The second person is shown holding a book. The first person again speaks some dialogue and the second person adjusts her hair. The video changes between each person as they speak. The first person uses hand gestures while talking and the second person adjusts her hair and speaks back. The video then displays the text, \"nice\", with an image of the second person. A second text is displayed after, \"Cute.\" The video view transitions to show both individuals. The video then displays, text saying, \"i need to be friends with her i do it\". The first person then says, \"You trynna sit down :)\" with the second person laughing and touching her hair again. The video then shows both individuals walking and talking while the camera moves between them. They both gesture with their hands while talking. Text reading \"Beautiful\" and \"Scholar\", along with \"(Perfect Smile :)\" is shown. Red hearts are overlaid over their faces as they talk. The second person is shown touching her hair once again. More talking occurs. A ring image is briefly overlaid in the right-hand corner of the screen. The video shows the two people talking, gesturing with their hands, and making faces. A few more filters and emojis are shown during the video. They continue walking and talking. 
The first person is shown briefly, then the video moves to the second person who is holding a microphone and speaking. The first person then appears again and says \"isnt it?\". A book is held up by the second person. The two individuals then walk together with an arm around each other and the second person uses a book to cover her face a few times. Then the first person begins to walk without the second person. The camera focuses on their faces. More talking occurs between the two people. The video then shows the first person\u2019s face and it goes black and white. The second person is still next to them, with only their face showing in color. More conversation continues. The video then cuts to show the first person holding up the microphone to the second person's mouth. Then the camera view is switched to show the second person speaking while holding the microphone. The first person takes a microphone from the second person and then hands over a book. The second person then connects the wire to their pants and makes other hand gestures. The first person holds up the book as the second person connects the wire. Text is displayed saying, \u201ci gotchu :)\u201d and \u201cill hold the book.\u201d More talking between the two is shown with camera movement. The video then shows the second person holding the microphone with the first person holding the book. They both continue speaking while walking. The camera transitions between them while they speak and the text, \"im a gentleman, i know how to treat, what can i say :)\", is displayed. The second person laughs and speaks. The first person then speaks, walking forward. They continue walking as the camera shifts between them while they talk. A black object is held in the second person\u2019s hand while she is talking. The video changes the view several times. Both people are shown with a filter of red hearts. The camera continues to transition between the two people. 
Text saying \u201cyou look cute in all angles :)\u201d is displayed on the screen. The two continue walking. The video shows a variety of views. The video displays text saying \u201cNo.\u201d Then text displays saying, \u201cShe doesn\u2019t know its her\u201d and another winking emoji. The second person is shown holding a book while walking. More dialogue is shown between the two individuals as they walk. The video then shows the second person holding a microphone and the first person taking the microphone and holding up the book. More dialogue between them is displayed. Text is displayed saying, \u201ccan you help me?\u201d The camera shows the second person attaching the wire to their shirt as they walk and talk. The video ends with the second person walking, holding the microphone.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the woman and the man in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Former girlfriend from junior high school.",
+ "B. Friends from before college.",
+ "C. Female classmate from junior high school.",
+ "D. Current girlfriend."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the boy do to the girl at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Carried the girl.",
+ "B. Held the girl\u2019s hand.",
+ "C. Kissed the girl.",
+ "D. Hugged the girl."
+ ]
+ }
+ },
+ "zYjYCvZS": {
+ "video_id": "zYjYCvZS",
+ "video_duration": "107s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Hiking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a close-up of a hand holding a gold-colored coin with an image of a building and a figure on it. The camera then shows a full tree with green leaves on a lawn, then switches to an aerial view of the tree. Next, a patch of grass is shown, followed by a hole in the ground. A hand places something into the hole. Then the hand retrieves something from the hole. The hand is shown holding a silver coin with a raised image of a face on one side. The other side of the coin has \"ONE DIME\" written on it. A close-up of a man's face is then shown as he speaks. Finally, the video ends with close-ups of the man's hand holding the same silver coin.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "From which direction does the sound of the car driving come at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From the road in the foreground.",
+ "B. From the road in the background.",
+ "C. From the road on the left side of the screen.",
+ "D. From the road on the right side of the screen."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many coins were dug up in total in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. One.",
+ "C. Two.",
+ "D. Four."
+ ]
+ }
+ },
+ "PvGklMfp": {
+ "video_id": "PvGklMfp",
+ "video_duration": "15s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person holds and shows several Pok\u00e9mon cards. The person has dark hair and is wearing headphones. The person's image is in a blue square overlaid in the top right of the video. As each card is displayed, the person in the blue square is seen eating a small chocolate bar. The first card has a purple character on a green background and the words \"Poison Gas\" and \"Sticky Hands\" on it. The second card has a blue character on a blue background and the words \"Hydrocannon\" and \"Rocket Tackle\" on it. The person appears happy with each card displayed.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many attempts did the man make before obtaining a rare card?",
+ "answer": "B",
+ "candidates": [
+ "A. Four attempts.",
+ "B. Two attempts.",
+ "C. One attempt.",
+ "D. Three attempts."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Which character did the man first summon?",
+ "answer": "D",
+ "candidates": [
+ "A. Summons Squirtle.",
+ "B. Summons Dark Blast Turtle.",
+ "C. Summons Pikachu.",
+ "D. Summons Shadow Suicune."
+ ]
+ }
+ },
+ "sitcCmod": {
+ "video_id": "sitcCmod",
+ "video_duration": "53s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two children open a brown gate and walk through. One waves. A man in a dark coat carries a large board, while another man climbs a ladder. The first man hands a solar panel up to the man on the ladder. The video says \"Across Uganda, engineers like Simon are installing solar panels\". Text then appears that says \"$12.5M invested\", \"This project means more than a million people\", \"240,000 solar panels will soon have access to reliable, affordable electricity for the first time\". Children in green uniforms run past. A man stands in the center as kids run towards him. The text \"Welcome to Citizens Boarding Primary School!\" is shown. Children run through a gate into a school yard. A woman in a flowered dress is standing between several rows of children sitting at desks. The text says \"The solar panel powers the new lights. It enables children to do their homework after dark\". The woman then says \"No more dirty kerosene lamps, no more dangerous candles\". \"Cleaner, cheaper energy is lighting up our school\". \"No more worries!\". Children play on a slide and swings. A child in a green uniform looks at the camera as the text \"Investing for a greener world\" is displayed.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the possible identity of the girl wearing a qipao in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A principal.",
+ "B. A teacher.",
+ "C. An engineer.",
+ "D. A student."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What are the two men doing at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Repairing the roof.",
+ "B. Teaching a class to students.",
+ "C. Fixing the ladder.",
+ "D. Installing solar panels."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times can footsteps be heard in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four.",
+ "B. One.",
+ "C. Three.",
+ "D. Two."
+ ]
+ }
+ },
+ "pONGpJvw": {
+ "video_id": "pONGpJvw",
+ "video_duration": "35s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Basketball",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a basketball game. There are people wearing white and yellow uniforms, and others wearing green and light blue uniforms. The players dribble a basketball and move on the basketball court. The video includes a digital timer that counts down. There are people visible in the background and their hands are raised. There is a logo visible in the corner of the screen. There is also a website address shown at the bottom of the video. A person in a white and yellow jersey with the number 45 on the back raises their hands above their head.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What jersey number is worn by the player who made the three-point shot?",
+ "answer": "D",
+ "candidates": [
+ "A. 12.",
+ "B. 1.",
+ "C. 25.",
+ "D. 45."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What impact did James's goal have on the outcome of the game?",
+ "answer": "D",
+ "candidates": [
+ "A. The game ended in a draw.",
+ "B. The team lost the game.",
+ "C. The team won the game.",
+ "D. The score was tied."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times are the passes made in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. One.",
+ "B. Three.",
+ "C. Zero.",
+ "D. Two."
+ ]
+ }
+ },
+ "pPwZpcEL": {
+ "video_id": "pPwZpcEL",
+ "video_duration": "131s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with white text that reads \"bloopers\" against a black background. The next frame has white text reading \"brady and I tried to kidnap the host...\" and the frame after that says \"it didnt work out lol\". The next scene is a video game view in which multiple characters are shown descending from the sky under a dark blue disc that is presumably a parachute. Two characters land on the snow-covered ground in front of some buildings. One character picks up a chest, and then begins running. The view shifts to various angles as the character battles other players, builds walls and stairs, and is eliminated and then resurrected. A couple of minutes later, the view changes to a higher perspective from the sky, where a character is dropping down toward the ground using an umbrella. The character lands, and the view shifts to show battles from multiple perspectives, then showing a different player being knocked down and then resurrected. At about 5 minutes into the video the view shifts to a first-person perspective of a different player as they approach a building, and another player is shown jumping out of a window. In this first person view, the player is eliminated, and then resuscitated. Finally the video ends with a color-bar image, then an image with two squares, the words \"SUBSCRIBE AND ENJOY\", and a logo.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the game character fly without using any equipment?",
+ "answer": "C",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the beginning of the video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "How do the game characters interact with enemies in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Tie the fallen enemy to a chair.",
+ "B. Carry the fallen enemy on their back.",
+ "C. Threaten the enemy with a hammer.",
+ "D. Threaten the enemy using a helicopter."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What weapon does the game character obtain from the chest in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Rifle.",
+ "B. Shotgun.",
+ "C. Submachine gun.",
+ "D. Pistol."
+ ]
+ }
+ },
+ "SFZTNRvC": {
+ "video_id": "SFZTNRvC",
+ "video_duration": "259s",
+ "duration": "4-6min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video opens with a menu for \"Goddess Statues,\" which gives details about what the statues are used for. A bearded man is visible in the upper-right corner. The menu fades into black, and then a large archway appears, glowing with blue light. The figure of a person in blue clothing can be seen in front of it. After the figure leaves the archway, it moves toward the edge of a cliff, where it encounters two skeletal figures. The figure moves past them. The figure glides through a canyon with a large, brown paraglider. It lands and begins attacking a large rock formation, which yields rock salt, flint, amber, and topaz. It continues traveling over the landscape. It runs across a path and encounters a pack of creatures. Then a black wolf-like creature appears. The figure then encounters a group of larger, armored creatures. There are several clashes and the figure attacks a black, skeletal creature. It then moves through the landscape, slashing at a tree and revealing some apples and wood. It passes a large red, organic mass. The figure then proceeds to engage a large white-armored creature, and then encounters a black horse-like creature that transforms into a skeletal figure. Finally, it faces a centaur-like creature, with a shield and a flaming sword. The figure attacks the centaur using bombs and a sword. After the centaur is defeated, it drops loot, and the figure collects it. At the very end, the man's face returns to the screen with a slightly different facial expression. The text \"26 hits\" appears throughout the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video display the enemy's remaining health?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I don't know."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "At the end of the video, where do the yellow flames originate?",
+ "answer": "B",
+ "candidates": [
+ "A. The sword of the game character changes from blue flames to yellow.",
+ "B. The enemy's sword emits yellow flames when it swings.",
+ "C. Yellow flames appear when the enemy dies.",
+ "D. I don't know."
+ ]
+ }
+ },
+ "XVFUvjXa": {
+ "video_id": "XVFUvjXa",
+ "video_duration": "77s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A series of animated scenes, framed like photographs with white borders, play against a blurred golden background. The first scene features a grinning man with green hair and round glasses perched on his head. He wears a red and pink outfit with star-shaped decorations and a blue collar. A large, glowing gold hand appears to be reaching forward from his perspective. The second scene depicts the same green-haired man in a bright yellow suit adorned with stars and a dark red rose pinned to the chest. He wears large, gold, star-shaped earrings and has his arms outstretched, a wide grin on his face. He stands in front of a crowd within a large stadium or arena with purple lighting and gold lion statues. The third scene showcases a group of diverse characters on a red surface. They appear to be in swimwear; some wear hats, and one is a skeleton. They're all grinning and appear excited. The fourth scene shows a group of characters in white outfits, some with various accessories and exaggerated facial expressions. They appear to be inside a dimly lit room. The fifth scene features the same green-haired man from the first scene, now in a pink suit, reaching towards a large, melting ball of gold. The sixth scene displays a fight sequence between the green-haired man in his pink suit and a smaller character with dark hair, goggles, and a black outfit. The man in pink throws a punch with a glowing gold fist, while the other character seems to be dodging. The seventh scene depicts a tall, thin character in a black suit with purple and red accents, laughing with his arms spread wide against a bright yellow, fiery background. Two other figures are partially visible behind him. The eighth scene showcases a large, muscular character with a bronze or green-tinged body, accented by glowing pink lines. The background is dark blue or black. The ninth scene continues to feature the large, muscular figure. Now, swirling lines of orange and red surround him as a smaller, dark-haired figure appears to be attacking him. The final scene shows a close-up of a face with bright yellow eyes, a bruised and scratched appearance, and gritted teeth. The background is dark. Then it shows, from the character\u2019s view, two large, dark arms reaching towards them, swirling with gold and white energy. Lastly, another character with green hair and a black suit with light-colored accents is yelling. He\u2019s falling backwards, possibly being knocked down, against a dark, speckled background.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What weapon did Luffy use to defeat Tesoro?",
+ "answer": "C",
+ "candidates": [
+ "A. A pistol.",
+ "B. A bow and arrow.",
+ "C. A rocket launcher.",
+ "D. A shotgun."
+ ]
+ }
+ },
+ "RTJtlbjq": {
+ "video_id": "RTJtlbjq",
+ "video_duration": "83s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A white, polygonal bear runs over a snow-covered hill toward a snowman with a black top hat. Snow is falling in the blue background. The view changes to a desert landscape with a yellow ground and two green figures resembling the number 4. The view changes again to a green landscape with two trees and a small pond, as well as white text spelling out \"PHYSICAL GEOGRAPHY\". Below the text, a gray hippopotamus is in the pond. Then, the video shifts to a shot of a man with light hair and a red, yellow-striped collared shirt, talking. The wall behind him has photos, papers, and numbers hanging from it. The man is then shown speaking several more times. After the man speaks a few times, a map of Europe is shown with a blue river, and orange and red circles. The man is then shown speaking again. The view shifts to a map with the names \"Mures\" and \"Olt\", along with thin blue lines. The man speaks again and then a map of Romania is shown next to the man speaking. Then the man is shown again. After the man's image is a picture of a fox, then a bear walking in shallow water, then a wolf walking in snow, then a lynx, then a large group of sheep, and then the man is shown speaking again. Then, text that reads \"from *\" is shown with the man talking and is then removed. Several images of the man talking are shown, and then an image of a person holding a tricolor flag is shown, as well as an image of a mountain top with a tricolor flag and signpost. After these images, the man speaks again, and then a map of Romania is shown. The man is shown talking again, and then a fox, a bear, a wolf, a lynx, and sheep are shown. Then the man speaks, and an image with a white border of a toddler in a white bonnet and white gown, being held by two adults, is shown beside him. Then the man is shown talking again, and then a yellow mound of polenta is shown, followed by a bowl of wrapped cabbage. The man is shown talking, and then an image of ground meat, called \"mici (littles)\", with yellow mustard on a white plate is shown. The man talks several more times to conclude the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is Prince Charles's connection to Romania as mentioned in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Prince Charles owns a house in Romania.",
+ "B. Prince Charles owns cars in Romania.",
+ "C. Prince Charles has friends in Romania.",
+ "D. Prince Charles has relatives in Romania."
+ ]
+ }
+ },
+ "VVDgqVcV": {
+ "video_id": "VVDgqVcV",
+ "video_duration": "109s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A man with a beard, wearing a blue t-shirt with the words \"Miyagi-Do\" and an image on it, is in front of a camera. He makes various hand motions while talking. Then, there are several shots of an image of a blue-green, red and gold dragon. Next, a shot of the same dragon on a pink background appears. The man reappears in front of the camera holding a book entitled \"The Odyssey.\" He continues talking and making hand gestures. The video then displays an image of a blue ocean horizon, which then turns red. Next, there is a shot of a group of purple sheep. Finally, the man is once again in front of the camera talking and making hand gestures. The video ends with shots of paint cans and brushes of varying colors. First, six grey cans are shown. Then, the third and second cans turn red and purple respectively, while the rest remain grey. Next, the first and second cans turn yellow and green respectively, while the rest remain the same. Then, a can turns blue.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occur in the ocean in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. From maroon to blue.",
+ "B. From green to blue.",
+ "C. From blue to green.",
+ "D. From blue to maroon."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Does the volume of the man's voice in the video change?",
+ "answer": "A",
+ "candidates": [
+ "A. No.",
+ "B. I don't know.",
+ "C. Yes."
+ ]
+ }
+ },
+ "szwjDXVg": {
+ "video_id": "szwjDXVg",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man wearing a pink t-shirt and a camouflage baseball cap stands in front of a flat top grill. He holds a wooden bowl with chopped celery, onions, and a pat of butter, and a clear bowl with chopped white fish. He inverts the wooden bowl onto the grill, depositing the vegetables and butter. He adds the fish to the grill. Next, he adds ground chicken to the grill, followed by minced garlic. He displays a bag containing a brown spice blend and sprinkles some onto his open palm. He sprinkles the spice blend onto the vegetables and chicken. He mixes the ingredients on the grill with metal spatulas. He then adds cooked rice from a zip-top bag to the grill. Soy sauce is drizzled over the rice. He presents a bottle of Dat'l Do-It hot sauce and adds it to the rice. He follows this with a bottle of Death By Jerk hot sauce, which he adds to the rice. He cracks two eggs onto the grill. After they cook briefly, he scrambles them and mixes them into the rice. Next, he pushes the rice and egg mixture aside and adds more white fish to the grill, topping it with the spice blend. He adds soy sauce to the fish. Finally, he uses a spatula to move some of the fish onto the rice mixture.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the first three white letters visible on the spice bottle in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. SPI.",
+ "B. KUK.",
+ "C. DEA.",
+ "D. DAT."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where does the butter first appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. On top of the table.",
+ "B. On top of the shrimp.",
+ "C. On top of the chicken.",
+ "D. On top of the side dish."
+ ]
+ }
+ },
+ "VkUODAaD": {
+ "video_id": "VkUODAaD",
+ "video_duration": "104s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man wearing a grey hooded sweatshirt and glasses is speaking to the camera. Behind him a woman is in a hospital bed and lying down. Then the camera moves to show the lower portion of a person wearing blue gloves and a blue smock holding a newborn baby with a dark head. The words \u201cKensley Lynn Green Born: May 14, 2019\u201d appear on screen, followed by \u201c3lbs 12 oz\u201d. The person wearing the blue gloves is then seen handing the baby, who is wrapped in a white and blue blanket, to the woman who was in the bed. The woman, also wearing a grey top and an identification bracelet, then holds the baby. Another person wearing blue gloves assists her. The baby is then seen being examined with a red light device. The woman is seen holding the baby in her arms.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What does the woman feel after giving birth to the baby in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Feels happy.",
+ "B. Feels relaxed.",
+ "C. Feels pain.",
+ "D. Feels unwell."
+ ]
+ }
+ },
+ "NtLEWSry": {
+ "video_id": "NtLEWSry",
+ "video_duration": "52s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman wearing a sparkly, form-fitting gown and a crown is being crowned by a man in a tuxedo. They are on a stage with several other women in gowns, also wearing sashes and holding bouquets of roses. A man with a sash that says \"Miss Arab USA\" stands to the side. The camera angle changes to show other contestants, each with a sash indicating the country or region they represent (Morocco, Palestine, Algeria). The scene shifts again to the full stage. A group of women in formal gowns stand together as a man in a tuxedo walks across the stage. \"Miss Arab USA\" is projected on the backdrop. The man walks off stage. The scene returns to the close-up shot of the two finalists and the man with the \"Miss Arab USA\" sash. The woman being crowned touches her chest with her hand. The woman crowned Miss Arab USA 2022, Marwa Lahlou (as shown on the overlay), now holds a trophy and bouquet, along with another woman wearing a sash reading \"Morocco.\u201d Confetti and small pyrotechnics go off around them as the other contestants clap. The previous Miss Arab USA crowns Ms. Lahlou. More confetti falls, obscuring the image briefly. Sparklers shoot upwards behind the contestants. Ms. Lahlou is embraced and congratulated by people on stage. The scene changes. Now, a woman wearing a \"Syria\" sash and a light blue gown is being crowned as Miss Congeniality while sparklers continue to erupt. People around her clap and take photos.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many beauty contestants are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 13 beauty contestants.",
+ "B. 12 beauty contestants.",
+ "C. 14 beauty contestants.",
+ "D. 11 beauty contestants."
+ ]
+ }
+ },
+ "imToxGQR": {
+ "video_id": "imToxGQR",
+ "video_duration": "78s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A soccer game is in progress. Players in red and yellow uniforms move across the field. A player in red passes the ball to a teammate near the goal. The player kicks, and the ball goes into the net. Players in red celebrate. The camera cuts to players in yellow, who appear disappointed. The game continues, and a player in yellow dribbles the ball past a player in red. The player in yellow is tackled and falls to the ground. The ball goes out of bounds. The camera shows the goal, which has fallen over. The goalie in yellow scrambles to get up. The game restarts with a free kick for the yellow team. The ball is kicked into the penalty area but is headed away. Play continues with both teams moving the ball across the field. A player in red receives the ball in the penalty area and shoots. The goalie dives and the ball hits the net. The players in red celebrate again. The game continues, and both teams pass the ball amongst teammates. A player in red kicks the ball towards the goal, the goalie dives, but fails to catch it, and it goes into the net. A player in red celebrates. Play continues, a player in red makes another shot towards the net, but the goalie stops it. A player in red receives the ball near the corner of the field and passes it towards the penalty area. A player in red kicks the ball. It hits the post. Play continues, and a player in red sends the ball back into the field, where a teammate kicks it towards the net, and it hits the net again. A player in red celebrates.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What are the emotional shifts experienced by Manchester United players?",
+ "answer": "B",
+ "candidates": [
+ "A. From excitement to nervousness and then to disappointment.",
+ "B. From nervousness to disappointment and then to excitement.",
+ "C. From disappointment to nervousness and then to excitement.",
+ "D. From excitement to disappointment and then to nervousness."
+ ]
+ }
+ },
+ "reAitcrB": {
+ "video_id": "reAitcrB",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with a beard and glasses, wearing a blue blazer over a light blue shirt, stands behind a red podium with two microphones. The text \u201cLET\u2019S TRY TO STAY FOCUSED ON ONE FALSE CLAIM AT A TIME,\u201d appears below him, then \u201cGET BOGGED DOWN. OKAY SO YOU SAID YOU ACTUALLY STARTED BY SAYING THAT BLOCKING PUBERTY, WHICH IS CHEMICAL CASTRATION OF A CHILD, JUST ONE EXAMPLE. YEAH. YEAH. IS LIFE SAVING? IT CAN BE. HOW DO YOU KNOW? BECAUSE THERE ARE CHILDREN WHO HAVE BEEN SUICIDAL THAT HAVE HAD THESE INTERVENTIONS MADE. HOW DO YOU KNOW HOW MANY CHILDREN HAVE HAD THAT DONE? PERSONAL CHILDREN? I HAVEN'T MET ENOUGH TRANS CHILDREN TO ANSWER THAT BUT I KNOW YOU ARE MAKING RESEARCH EXISTS, YOU'RE MAKING THIS STATEMENT, YOU'RE STATING IT AS A FACT IN THERE MUST BE A REASON. THERE MUST BE SOMETHING THAT YOU ENCOUNTERED THAT LED YOU TO THIS RATHER RADICAL CONCLUSION THAT TO SAVE A CHILD'S LIFE, YOU HAVE TO CHEMICALLY CASTRATE THEM. THAT IS WHAT YOU JUST SAID. THE DRUGS WE GIVE TO KIDS FOR PUBERTY BLOCKERS, SO CALLED PUBERTY BLOCKERS, ORIGINALLY ARE CANCER DRUGS USED TO TREAT PROSTATE CANCER. THEY'RE ALSO USED TO CHEMICALLY CASTRATE SEX OFFENDERS AND CHILD RAPISTS. NOW, WE GIVE THEM TO PHYSICALLY HEALTHY KIDS. YOU\u2019RE SAYING THAT WE NEED TO.\u201d In front of him, below the frame, is a woman with blonde hair wearing a tie-dye shirt and holding a microphone, speaking. Behind them are 5 or more other people. A watermark reading \u201cPROVENMENTOR\u201d is visible in a circle on the woman\u2019s shoulder.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotion is displayed by the woman wearing a blue shirt in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Sad and upset.",
+ "B. Anxious and uneasy.",
+ "C. Happy and joyful.",
+ "D. Angry and furious."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a golden object at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. I'm not sure.",
+ "B. No.",
+ "C. There is a golden earring and a golden stud."
+ ]
+ }
+ },
+ "eapYeyoW": {
+ "video_id": "eapYeyoW",
+ "video_duration": "140s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video opens with five people standing outside of a building. The person in the center of the group of five is standing behind a clear podium speaking into a microphone. A yellow horizontal bar is superimposed on the lower portion of the screen which states her name as \"Nayo Harvin\" and her title as \"Owner | Sweet GG's Bakery.\" The person to her left has a black mask on their face and holds a rectangular object. The second person to the left of the speaker wears a purple shirt and sunglasses. The two people to the right of the speaker wear suits. The woman at the podium gestures with her right hand while speaking. The man in the purple shirt then steps up to the podium and speaks, while the woman at the podium steps to the right. The yellow bar at the bottom of the screen changes to state \"Devon Harvin | Owner | Sweet GG's Bakery.\" The man in the purple shirt speaks while gesturing with his right hand. The camera then cuts to an interior shot of the bakery. A few people are inside the store at the counter. The yellow bar at the bottom of the screen then states \"Sweet GG's Bakery\" followed by \"Sweet GG's provides cake slices, cookies, cupcakes, and treats.\" The scene remains the same, but the bottom of the screen then changes to \"Sweet GG's provides services for weddings and catering events\". The yellow bar remains and states \"Ribbon Cutting for Sweet GG's Bakery located at 631-B Harden Street.\" The video shows a ribbon cutting with multiple people holding scissors to a long red ribbon. The video then cuts to a black screen followed by a screen showing an aerial view of a city. A light blue bar appears at the bottom with white text that states, \"Public Relations, Media & Marketing www.columbiaSC.gov | PIO@ColumbiaSC.gov | (803) 545-3020 | #WeAreColumbia\". Social media icons are below this.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the chronological order of events at the cake shop shown in the video? (a) The cake shop owner gave a speech, (b) customers bought cakes, (c) the cake shop owner cut the ribbon.",
+ "answer": "A",
+ "candidates": [
+ "A. (a), (b), (c).",
+ "B. (b), (c), (a).",
+ "C. (c), (a), (b).",
+ "D. (a), (c), (b)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who said \"we're so thankful for welcoming us\" in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A woman wearing a black short-sleeve shirt.",
+ "B. A woman wearing a black long-sleeve shirt.",
+ "C. A man wearing a light purple shirt.",
+ "D. A man wearing a blue suit."
+ ]
+ }
+ },
+ "nlfaBeRz": {
+ "video_id": "nlfaBeRz",
+ "video_duration": "51s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Tech Reviews",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with light skin and a light beard and hair, wearing a gray patterned sweater and glasses, is shown holding a black camera lens in their hand. They manipulate the lens with their hands. The person speaks while looking at the lens, then puts the lens down, continuing to look straight ahead. The video then shows a black camera with a lens attached to it on a dark surface, with a black lens cap next to it. Text appears that reads \"M42 to Canon FD\". The video returns to showing the person with light skin and a light beard and hair, wearing a gray patterned sweater and glasses, speaking again while holding a black camera lens. Next, a person with light skin, wearing a red puffer jacket, holds a dark-colored cell phone in their hand while in front of a mirror. Text appears that reads \"Paso Robles\". The video returns to showing the person with light skin and a light beard and hair, wearing a gray patterned sweater and glasses, speaking again. Finally, a dog with black and white fur walks through fallen orange leaves while a person runs off screen.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man use to film the Thanksgiving video?",
+ "answer": "B",
+ "candidates": [
+ "A. Canon CCD.",
+ "B. M42-threaded lens.",
+ "C. Canon camera.",
+ "D. Phone."
+ ]
+ }
+ },
+ "QNkWMgsF": {
+ "video_id": "QNkWMgsF",
+ "video_duration": "77s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Book Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a person with light skin, red lips and their hair in a bun. They are wearing a dark green shirt and a gold necklace with a bar pendant. There is a bookshelf behind them. At first the person is looking at the camera and their mouth is moving. Then, they lift their right arm and pull a paper bag into view, shaking the bag to display the contents, which are books, then they lower their arm and the bag. The person looks to their right. They move their mouth again, and briefly their left eye is closed. After that, they show each of the books in the paper bag individually. The cover of each book is visible. Finally, they raise their right index finger and move it before looking up and talking again.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many books did the woman display?",
+ "answer": "D",
+ "candidates": [
+ "A. Nine books.",
+ "B. Eleven books.",
+ "C. Eight books.",
+ "D. Ten books."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times did the woman say, \"I got the Sookie Stackhouse series\"?",
+ "answer": "D",
+ "candidates": [
+ "A. Six.",
+ "B. Four.",
+ "C. Three.",
+ "D. Five."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Is there a change in the volume of the woman's voice in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I don't know.",
+ "B. No.",
+ "C. Yes."
+ ]
+ }
+ },
+ "eRbZdGnA": {
+ "video_id": "eRbZdGnA",
+ "video_duration": "60s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen containing the text \"VISUALS BY KEINO\" and a graphic of a camera with the text \"@KEINO.AK\" below. The screen then transitions to a dimly lit room where several people are gathered. Most of them have their faces covered with masks or hoods. Some are wearing tracksuits. A close up of one individual\u2019s face mask reveals the outline of a stylized skull. Throughout the video, various individuals are seen making hand gestures. The people are shown walking down a hall, then standing together. One of the people is holding a box with a cake inside. The words \"Happy Birthday\" are visible on the cake. The camera angle changes as they continue moving, and the video has occasional distorted effects. A man in a blue, white, and orange FILA tracksuit is seen moving in various settings, and the video ends with the group of people standing in a room, moving around.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the overall emotional feeling conveyed by the song in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Somber and steady.",
+ "B. Noble and refined.",
+ "C. Violent and dark.",
+ "D. Gentle and soft."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Is there someone wearing a Spider-Man mask in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I don't know.",
+ "B. No.",
+ "C. Yes."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Through what does the video make the song and reality more realistic and coherent?",
+ "answer": "A",
+ "candidates": [
+ "A. A cake.",
+ "B. Dance.",
+ "C. A mask.",
+ "D. Music."
+ ]
+ }
+ },
+ "UgglZVsK": {
+ "video_id": "UgglZVsK",
+ "video_duration": "38s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing red and white sports clothing runs onto a green field with white lines, with white smoke and flames rising up around them. Then, a person wearing a grey suit shakes hands and hugs a person wearing a grey suit and red hat, holding a red sports jersey. A person wearing a red and white sports jersey, red shoulder pads, and white pants holds a sports helmet outside of a building. Then, a close up of the face of a person wearing a red headband and black markings on the face. Next, a person wearing a red and white sports uniform is running on a green field, with many other people also in sports uniforms around them. The person continues running on a green field and into an endzone, then among a group of people in similar uniforms. After this, a person wearing a red headband speaks with other people wearing sports clothing in a building, holding hands and raising them in a circle.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Who scored during the game shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Geno Smith.",
+ "B. Jalen Hurts.",
+ "C. Kirk Cousins.",
+ "D. Julio Jones."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Do the Indiana Pacers score in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes, they do.",
+ "C. No, they do not."
+ ]
+ }
+ },
+ "DnCdfWNW": {
+ "video_id": "DnCdfWNW",
+ "video_duration": "166s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with long dark hair is sitting on an orange couch. The person is wearing a long-sleeved black shirt. The person moves their hands and arms. The person points at different areas of their face. A picture of another person appears next to the person on the couch. There is white text appearing above the person on the couch. A split image of an eye with and without make-up, and close-up of adhesive is shown. The person makes different facial expressions throughout the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the woman wearing black in the video express a preference for Asian noses?",
+ "answer": "B",
+ "candidates": [
+ "A. They are neat and refined.",
+ "B. They are petite and cute.",
+ "C. They are straight and well-defined.",
+ "D. They have distinct lines."
+ ]
+ }
+ },
+ "pZUGAWYx": {
+ "video_id": "pZUGAWYx",
+ "video_duration": "45s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A dark-haired anime-style man with a teal accent on his collar faces a green-haired woman. White bars obscure parts of the image. The scene transitions to a room engulfed in flames, where the woman points a gun at the man. The flames intensify, filling the screen with orange and red hues. The scene cuts to black, followed by white text on a gray background. The scene returns to the burning room, focusing on the woman from the waist up. She makes a circular gesture with her thumb and index finger. The scene cuts to red text on a black background. The scene returns to the woman in the burning room, who now looks distressed and wears a white mask over her eyes. The scene shifts back to the man in the burning room, partially obscured by flames and text. The scene fades to black and white text with the numbers zero through seven appears on the screen. The scene returns to the burning room with the man, now grinning, his face partially covered with blood.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the black-haired woman's final decision in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Sacrifice her life and give away her wealth.",
+ "B. Give away her wealth and keep her life.",
+ "C. Sacrifice her life and keep her wealth.",
+ "D. Keep her wealth and sacrifice her life."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurs in the background music throughout the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The background music remains unchanged throughout the video.",
+ "B. A male voice appears at the end.",
+ "C. The background music disappears in the middle of the video.",
+ "D. The video has no music at the beginning."
+ ]
+ }
+ },
+ "EDHOCYPN": {
+ "video_id": "EDHOCYPN",
+ "video_duration": "310s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A young man sits in the driver's seat of a Volkswagen car and adjusts his face mask. The video cuts to a Costco Wholesale store exterior, where a masked man pushes a shopping cart in the parking lot. Inside the store, the man picks up a bag of \"Silver Hills Sprouted Power 20 Grain\" bread, points at it, and nods. He then picks up a bag of zucchini, examines it, and adds two bags to his cart. Next, he holds up a bag of French green beans and puts one in his cart. He picks up a bag of \"Mann\" brand \"Caulilini\" (baby cauliflower), points at it, and then puts one in his cart. He picks up a bag of cauliflower, inspects it, and adds one to his cart. The video then shows him reaching for a container of organic spring mix lettuce and adding two to his cart. He examines a container of arugula and puts it in the cart. He picks up a container of baby spinach, holds it up, then returns it to the shelf. He picks up a container of \"Earthbound Farm Organic Mighty Mix\" salad and adds it to the cart. He reaches for \"Drieger Farms\" blueberries and adds them to the cart. He then looks at a display case containing various cuts of meat, picks up a package of salmon fillets, and adds it to the cart. He examines and then adds a container of lean ground beef to the cart, followed by ground turkey. Next, he picks up and inspects a package of chicken breasts from the refrigerated display case and adds it to the cart. He picks up a carton of \"Kirkland\" egg whites, shows them to the camera, and places them in the cart. He then picks up a large container of kimchi, points at the label, and places it in the cart. The man walks down a snack aisle and picks up a bag of \"Cheetos\" and places it in his cart. He continues down the snack aisle to a shelf containing \"Miss Vickie's\" kettle chips. He selects a bag of the \"Sea Salt & Malt Vinegar\" flavored chips. He continues pushing the shopping cart along refrigerated aisles and takes three tubs of cottage cheese and places them in the cart. Finally, he picks up a box of frozen Greek yogurt bars from the freezer aisle and places it in the cart.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did the man in white buy a soft-shelled turtle?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. I don't know."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What might the hobbies of the man in white in the video be?",
+ "answer": "C",
+ "candidates": [
+ "A. Cooking.",
+ "B. Shopping.",
+ "C. Exercising.",
+ "D. Driving."
+ ]
+ }
+ },
+ "BRBWNDmc": {
+ "video_id": "BRBWNDmc",
+ "video_duration": "87s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A paper cutout of a person wearing a brown hat and a brown vest over a green shirt is holding a yellow piece of paper shaped like a crescent with points. The paper person is attached to a chair leg. In the following shots the paper person's clothing includes a green backpack. Next the paper person is holding an orange book which reads, \"Survival Manual\". Then the paper person again holds the yellow paper. In the next shot, a piece of paper is opened. It has a drawing of a TV on the inside, and the text reads, \"If you see two guys on top of your cliff, about to watch TV\". In the next shot, a page is turned, and the new text reads, \"Disguise yourself as a remote control and let them notice you\", and a drawing of a remote control is next to this text. The paper person is then holding a paper remote control, and in the following shots, the paper person moves while holding the remote control. Next, two Lego figurines are standing next to a blue Lego object with two levers on top of it. Then a shot of a sand dune, and then more shots of the paper person holding the remote control, with the yellow crescent. In the following shots, the person is holding both the remote control and the yellow crescent. Next the two Lego figurines appear again, and then a hand is shown placing the Lego figures next to the remote control and the paper person. In the following shots, the hand is manipulating the figures. Then the two Lego figures are shown without a hand in the frame. Finally, the two Lego figures are shown separately. In the last shots, the two Lego figures are standing together again.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, who suggested disguising oneself as a remote control to attract attention?",
+ "answer": "C",
+ "candidates": [
+ "A. A boy wearing a red hat, a jacket, and green pants.",
+ "B. A boy with brown hair wearing a yellow shirt.",
+ "C. A boy with black hair wearing a brown cowboy hat.",
+ "D. A boy with brown hair wearing an orange jacket and blue pants."
+ ]
+ }
+ },
+ "kstmjYAV": {
+ "video_id": "kstmjYAV",
+ "video_duration": "101s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Racing",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a track and field race. Initially, one female athlete wearing a red jersey with the letters MAS and a black head covering walks towards a starting block on a running track. Then other athletes appear, each standing at a different starting block and holding a baton. All the athletes crouch in a starting position. A person raises their arm, and all athletes begin running. The camera follows the runners as they move along the track. Throughout the video, the runners hold a baton. As the race continues, time stamps appear in the bottom right hand corner of the screen, going up sequentially by tenths of a second. Eventually, the first runners cross a finish line, marked by a series of vertical numbers on the ground. Finally, the first runner to cross the finish line slows and continues walking, with the other runners just behind.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Which part of the video features three athletes standing on the track preparing for a race?",
+ "answer": "D",
+ "candidates": [
+ "A. Every part of the video.",
+ "B. The middle part of the video.",
+ "C. The final part of the video.",
+ "D. The beginning part of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the athlete in lane 6 relative to the athlete in lane 5 in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Right rear.",
+ "B. Right front.",
+ "C. Front left.",
+ "D. Left rear."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Which country does the athlete who won the championship in the video represent?",
+ "answer": "D",
+ "candidates": [
+ "A. Singapore.",
+ "B. Indonesia.",
+ "C. Thailand.",
+ "D. Malaysia."
+ ]
+ }
+ },
+ "hlfiUgUZ": {
+ "video_id": "hlfiUgUZ",
+ "video_duration": "197s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person with long, dark, braided hair is speaking while making gestures with their hands. They are wearing a dark shirt underneath a light purple cardigan. Behind them is a white bookshelf with various items and a white table with red boxes below it. The screen also displays the text: \u201cTHE PHOEBE WAY\u201d and \u201cPRINCIPLE OF DESCENT\u201d or \u201cPRINCIPLE OF BIRTHPLACE\u201d and \u201cat least 8 years\u201d. There are also social media icons on the bottom of the screen.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the woman in the video have a watch on her right hand?",
+ "answer": "A",
+ "candidates": [
+ "A. No, she has a watch on her left hand.",
+ "B. It cannot be determined.",
+ "C. Yes.",
+ "D. Not applicable."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is it not possible to hold dual citizenship for both Ghana and Germany?",
+ "answer": "C",
+ "candidates": [
+ "A. Because Ghana does not allow dual citizenship.",
+ "B. Because Germany does not allow dual citizenship.",
+ "C. Because it does not meet the requirements for two EU countries.",
+ "D. Because the relationship between the two countries is poor."
+ ]
+ }
+ },
+ "VAmZSHYh": {
+ "video_id": "VAmZSHYh",
+ "video_duration": "187s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a person wearing a blue plaid shirt and a straw hat standing outdoors among various green plants. Then the scene transitions to a person with a beard, wearing a white t-shirt standing in front of white doors, speaking directly to the camera. The video then shows a person's hand touching two small green fruits on a green stem that extends from a brick wall. The hand then touches and examines several pale yellow flowers on the same plant. The camera then pans up, showing the plant\u2019s leaves and stems. It then shows a few leaves of the plant. It then moves to show the full view of the plant with a few green fruits and flowers. The camera then shows a few green fruits and then two green fruits and a black rubber protective layer that covers part of the plant's trunk. The camera then zooms into the dark protective layer to show small brown ants and a small brown lizard. It shows the plants with leaves, flowers, and small green fruits. It then shows some of the surrounding area, including other plants, and a lake with a line of trees. The video then returns to the person\u2019s hand examining one of the plants and leaves. The camera then focuses on the plants growing against a brick wall. The camera then shows the leaves, green fruit and a few flowers on the plant. Finally it zooms into the dark protective layer to show the brown ants and the lizard again.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence for manually pollinating papayas, as demonstrated in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Touch the stamen of a male flower with a finger, collect pollen, and quickly transfer it to the pistil of a female flower.",
+ "B. Remove a female flower, detach its petals, and brush its pistil against other open flowers.",
+ "C. Remove a male flower, detach its petals, and brush its stamen against other open flowers.",
+ "D. Remove a male flower and brush its stamen against other open flowers."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the papaya tree in relation to the lake in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. On the right.",
+ "B. In front.",
+ "C. Behind.",
+ "D. On the left."
+ ]
+ }
+ },
+ "uqvShHCQ": {
+ "video_id": "uqvShHCQ",
+ "video_duration": "461s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a digital painting program. The canvas is white, initially with a light pink tulip image in the center. The video begins with selections in the tool bar and menus. Brushes are selected from the side menus, each brush changes the appearance of the tulip. The brushes used include \"Loaded Thick and Wet\", \"Loaded Oily Knife\", \"Dry Trail Off\", \"Soft Thin Bristle\", \"Light Texturizer\" and \"Scribe\". With each brush the colors and appearance of the tulip are altered. Finally, the \"toggle tracing layer\" option is turned off, revealing the painting without the underlying image, followed by a zoom out and the video ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Has the color of the flower in the video changed?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "pTDIYhFf": {
+ "video_id": "pTDIYhFf",
+ "video_duration": "144s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A hand points at a circuit diagram on a smartphone screen. A USB cable is plugged into the bottom of the phone. The camera then moves to show a circuit board with various components, including wires, transistors, and a heat sink. A container of small electronic components is shown. The view returns to the circuit board, and a hand touches a component on it. The camera then focuses on a circular arrangement of copper wire coils on a piece of cardboard. The coils are connected with blue connectors and radiate outwards. A hand holds wire strippers, stripping the insulation off a piece of white wire. The view goes back to the circuit board. The hand touches the heat sink again. The video cuts to a lead-acid battery on the table. The view returns to the copper coil arrangement on the cardboard. A small plastic cup with a metal ball inside is placed in the center of the coils. The stripped white wire from earlier is brought to the base of the cup, and one of the stripped ends of the wire is touched to the metal coil. The hand then places the lid on the cup. The hand moves the cup, with its lid, around the coil. The hand lifts the cup, with its lid, revealing the metal ball inside. The hand continues to move the cup around the coil, sometimes lifting it to show the ball.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "In the video, what type of container does the experimenter place the round iron ball into?",
+ "answer": "C",
+ "candidates": [
+ "A. A ceramic cup.",
+ "B. A glass cup.",
+ "C. A plastic cup.",
+ "D. A metal cup."
+ ]
+ }
+ },
+ "tcpSBZVd": {
+ "video_id": "tcpSBZVd",
+ "video_duration": "103s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a close-up of a person's face. The background is blurry. Then it cuts to a white police car and a man with a red shirt talking on a cell phone. A few seconds later the video cuts to a magazine about police vehicles being flipped through, showing various photos and text. After the magazine pages the video cuts back to the police car, then to a police motorcycle, then it shows a table with many weapons laid out on it including handguns and rifles. After the weapons on the table, the video cuts to a website, then a map, then some text, then finally, different forms of body armor, ending on a dog mannequin wearing a vest. The last scene is a few pages of text and photos from a magazine.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order do the following five topics appear in the video? (a) Rent and price changes of motorcycles and their quantities, (b) Promoting cars to the police department, (c) Rental and repair of firearms, (d) Exhibitions of bulletproof vests companies and prices of bulletproof vests for police and police dogs, (e) RISS.NET, a national law enforcement information sharing system, and its funding sources.",
+ "answer": "C",
+ "candidates": [
+ "A. (d) (b) (c) (a) (e).",
+ "B. (a) (c) (e) (d) (b).",
+ "C. (b) (a) (c) (e) (d).",
+ "D. (a) (c) (b) (e) (d)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the first three letters of the main title of the RISS.NET website shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Log.",
+ "B. Reg.",
+ "C. Ris.",
+ "D. Res."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the model dog shown at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. White.",
+ "B. Brown.",
+ "C. Yellow.",
+ "D. Black."
+ ]
+ }
+ },
+ "mNrLABFh": {
+ "video_id": "mNrLABFh",
+ "video_duration": "32s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A group of seven individuals is standing together against a dark background. A single individual with a red and white head and a gray and white upper body suit is sitting against a light blue wall. The group of seven individuals is standing together against a dark background. A single individual with a red and white head and a gray and white upper body suit is sitting against a light blue wall. The group of seven individuals is standing together against a dark background. The individual with the red and white head and a gray and white upper body suit is turned, facing towards the group of seven individuals. A hand in a yellow glove touches the shoulder of the individual with the red and white head. An individual with white hair and a green and yellow jacket approaches the individual with the red and white head. The individual with the white hair leans in close to the individual with the red and white head. The individual with the white hair has her eyes closed. The individual with the red and white head has his head turned to the side. An individual with a yellow and blue mask extends three sharp claws toward the face of the individual with the red and white head. The individual with the red and white head looks back as the individual with a yellow and blue mask approaches him. The individual with the red and white head throws a punch, hitting the individual with the yellow and blue mask. The individual with the red and white head is upside down. The individual with the yellow and blue mask extends his claws forward towards the individual with the white hair. The individual with the red and white head is no longer wearing a gray suit. A group of the seven individuals is tangled in white nets. The individual with the red and white head stands in front of a green wall. The individual with the red and white head kicks the green wall. The individual with the red and white head is running away through a doorway.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who called out Spider-Man's name in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A man wearing a blue shirt.",
+ "B. A woman wearing a red top.",
+ "C. A man wearing blue gloves.",
+ "D. A woman wearing yellow gloves."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the door shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Yellow.",
+ "B. Blue.",
+ "C. Green.",
+ "D. White."
+ ]
+ }
+ },
+ "OqgKmMdb": {
+ "video_id": "OqgKmMdb",
+ "video_duration": "192s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person's face, with red lines drawn on it, indicating the outlines of a large, open mouth and furrowed brows, is shown in close-up. A hand holding a makeup sponge applies bright green paint within the drawn outlines, covering the forehead, cheeks, nose, chin, and parts of the mouth area. The person\u2019s eyes are closed. The hand then applies green paint to the person's right cheek and jawline, blending it outwards. The hand moves back to the face, applying green paint to the areas around the eyes and forehead, blending it with the paint already applied. White paint is then applied to the area inside the red outline of the mouth, creating the appearance of teeth. The hand then uses a thin brush to apply dark gray or black paint to the edges of the white teeth, adding definition. Dark gray or black paint is also used on the person's eyelids, around their eyes and eyebrows, creating a dramatic, shadowed effect. A dark gray or black brow is painted over the red outline above the eyes. Green paint is used to blend the dark eye shadow into the green base color. Black lines are added to the sides of the nose and mouth, enhancing the cartoonish effect. The bottom lip and chin are further darkened with the dark gray or black paint. Small black lines are added between the white teeth to complete the look. Black is also added to the hairline. The video ends with the final result of a completed comic book-style Hulk face paint design.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many different tools were used in the makeup application?",
+ "answer": "D",
+ "candidates": [
+ "A. Six.",
+ "B. Four.",
+ "C. Three.",
+ "D. Five."
+ ]
+ }
+ },
+ "wrGZFvPJ": {
+ "video_id": "wrGZFvPJ",
+ "video_duration": "383s",
+ "duration": "6-8min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a black screen, followed by a green rectangle with rounded corners appearing in the center of the screen. Inside the rectangle are the words \"Montr\u00e9al\" above the word \"Python\" in grey text, and a small white logo of a snake on the upper right side of the rectangle. The rectangle then moves to the upper left corner, then, in the upper right corner of the screen appears the text \"53 : Sanctified\" above the word \"Terabit\" in a smaller font. These words are in green on the black background. Then under these words, in white text on the black background, appears the text \"Datacratic's Machine Learning Database (MLDB) A LIGHTNING TALK BY Nicolas Kruchten\". Under that text, in the lower left corner of the screen, appears a green rectangle containing a movie clapperboard with the words \"with a little help from\" in small white text. Next to the clapperboard is a circular penguin logo with the words \"Savoir-faire LINUX\" under it, and next to that, a small calendar icon with the number \"13\" on it along with the text \"AVR 2015\". The clapperboard, penguin, and calendar logos are in white, grey, and green, and this set of logos appears on multiple lines. The background is black and the same format is repeated in the video. The scene then transitions to show a video frame with 3 distinct sections. The large center section has a view of a person speaking before a crowd of people sitting in chairs. The right and left sections show the sides of the room where the people are seated. The video transitions to showing the large central video frame, which is now a view of a single man speaking before a crowd of people sitting in chairs. Behind the man is a white projection screen on a tripod. Another woman is briefly seen approaching the man at the front to adjust the computer on the table. The scene transitions again, now showing a small floating window on the right side of the screen displaying a list of file directories. A man can be seen on the left of this floating window. The video focuses on the man speaking, the floating file directory window, and the white projection screen behind the man. The floating file directory window changes to show a colorful scatter plot of points, then changes to display several lines of text containing computer code. The video returns to a full screen view of the man speaking in front of the white projection screen and the crowd. Then, the video returns to the floating window, which is again showing computer code, before again showing a full screen view of the presenter. The view then returns to the floating window on the right showing an illustration of a boat, then a page of data, and then ends on a full screen of the sponsors and logos.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "From the video's perspective, where are the viewers seated relative to the man wearing glasses?",
+ "answer": "B",
+ "candidates": [
+ "A. Behind him.",
+ "B. In front of him.",
+ "C. To his right.",
+ "D. To his left."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the second way the man in the video describes for using Python with MLDB?",
+ "answer": "B",
+ "candidates": [
+ "A. Provide an SDK to build plugins for MLDB using Python.",
+ "B. Create data visualizations and plot sub-panel maps.",
+ "C. Store data in a structured format.",
+ "D. REST APIs can encounter network latency issues but can still execute."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "At what point in the video is PyTanic used to demonstrate the calculation of survival odds in the Titanic disaster?",
+ "answer": "A",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. In the middle of the video.",
+ "C. Throughout the video.",
+ "D. At the beginning of the video."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does collective clapping occur in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Two.",
+ "D. One."
+ ]
+ }
+ },
+ "IIsRSTmT": {
+ "video_id": "IIsRSTmT",
+ "video_duration": "39s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Editorials",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with two people sitting behind a desk in a news studio. A man in a gray suit and blue tie speaks directly to the camera, gesturing with his hands, while a woman with blonde hair in a black jacket looks to the side. A sign with the letters \"KOVR 13\" is visible in the background. The video cuts to a woman with blonde hair in a light gray blazer and the caption \"WAL-MART to Welfare?\". She is looking at the camera, speaking directly to the viewer. The video then transitions to a woman with dark hair in a black jacket and blue top, with a building in the background. She is also speaking directly to the viewer. Next, there is an exterior shot of a Wal-Mart store with cars in the parking lot in front. The camera then cuts to an interior shot of a person standing at a cash register. Another person stands behind the cash register, interacting with the first individual. The camera then shows a document titled \"Hidden Cost of Wal-Mart Jobs\" with the subtitles \u201cWal-Mart Public Costs\u201d, \u201c$86 Million a Year to California Taxpayers\u201d, \u201cUp to $25 Million more to County Taxpayers\u201d, and \"Health Care, Income Tax Credits, Housing Subsidies, Food Stamps\u201d.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many news anchors appear in the video reporting on the Walmart incident?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. One.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In Walmart, what is the position of black male staff members in relation to black female staff members?",
+ "answer": "C",
+ "candidates": [
+ "A. Directly in front.",
+ "B. To the left and front.",
+ "C. To the right and front.",
+ "D. Directly behind."
+ ]
+ }
+ },
+ "ZKFLtpFZ": {
+ "video_id": "ZKFLtpFZ",
+ "video_duration": "30s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A cartoon with a stage and audience is displayed. A person with blonde hair and a red shirt walks onto the stage, then another person with short brown hair and a green shirt walks onto the stage. The person in the red shirt exits the stage. The person in the green shirt walks to the center and begins to sing into a microphone. The camera zooms into the person singing. The camera shows the person singing with a range of expressions and the person continues to sing, then the camera zooms out to show the audience which consists of several cartoon characters and faceless purple figures. The audience is shown, then a title card is displayed that reads \"Thieves Haven\" and beneath that reads \"Written Animated and Directed by SuperstarYT\". That card then shows the cast credits which read \"Thief 1 (Jerry)...........GrantsVoice, Thief 2 (Phillip).....Andrew Whelbig, Thief 3 (Tom)......Luke And Friends\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the last three letters of the first line of yellow text that appears at the end of the video after the man's performance?",
+ "answer": "C",
+ "candidates": [
+ "A. Ryt.",
+ "B. Tde.",
+ "C. Ted.",
+ "D. Tby."
+ ]
+ }
+ },
+ "cWJmSamq": {
+ "video_id": "cWJmSamq",
+ "video_duration": "32s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A black screen with white text on it that says, \"The automatically controlled operation of an apparatus, process, or system by mechanical or electronic devices that take the place of human labor,\" and below this text, \"SOURCE: www.merriam-webster.com\". Then, a large orange robot arm moves in the center of the image. There are blue tubes attached to the arm and a metallic structure on its base. The robot arm picks up flat panels. A wide shot of a large industrial space shows multiple machines and one person walking in the distance. A different robot arm with a welding tool on the end is shown welding together pieces of metal. Sparks fly around the welding tool.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does robotic welding appear?",
+ "answer": "D",
+ "candidates": [
+ "A. Throughout the entire video.",
+ "B. At the beginning of the video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many robots appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Eight.",
+ "B. Seven.",
+ "C. Six.",
+ "D. Nine."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What type of hat is the man in the video wearing?",
+ "answer": "D",
+ "candidates": [
+ "A. Dustproof cap.",
+ "B. Sun hat.",
+ "C. Hard hat.",
+ "D. Baseball cap."
+ ]
+ }
+ },
+ "JQOitYHu": {
+ "video_id": "JQOitYHu",
+ "video_duration": "82s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins focused on a leaflet with the headline \"GREEN ACTION #noTTIP SAVE THE NHS\". The leaflet features a photo of a woman with shoulder-length light brown hair, and text about the Transatlantic Trade and Investment Partnership. The camera then zooms out and transitions to a close-up of another, seemingly identical, leaflet. After a few seconds, the leaflet vanishes, leaving a white background with the words \"GREEN ACTION #noTTIP SAVE THE NHS SCOTTISH GREENS\" in green and purple lettering. Next, the image of a woman with short blond hair and purple-framed glasses appears superimposed on the leaflet image. She speaks, and the hashtag #noTTIP is displayed in purple over her left shoulder. The background is blurred. The scene changes to show the woman sitting at a round table with a checkered tablecloth. Behind her is a window with white curtains, through which a sunroom or patio is visible. An elephant statue stands on a table to her right, and a dark wooden chair is visible behind her. The scene then cuts to the exterior of Dungavel House Immigration Removal Centre. Two signs identify the building. A chain-link fence topped with barbed wire surrounds the property. The video returns to the woman seated at the table. The final scene shows a group of people sitting at a table on a stage in a large room. A person is in the foreground, facing the stage, and holding some filming equipment. The image on a tablet attached to the equipment shows a similar view of the panel.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point does Dungavel House Immigration Removal Centre appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. It appears in the middle of the video.",
+ "B. It appears at the beginning of the video.",
+ "C. It appears at the end of the video.",
+ "D. It appears throughout the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "From what direction does the applause in the video originate?",
+ "answer": "C",
+ "candidates": [
+ "A. The applause comes from the stage.",
+ "B. The applause comes from the right side of the frame.",
+ "C. The applause comes from the left side of the frame.",
+ "D. The applause comes from directly in front of the frame."
+ ]
+ }
+ },
+ "axWWyKlS": {
+ "video_id": "axWWyKlS",
+ "video_duration": "206s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A cartoon lion sits on a rock by a body of water, his reflection visible in the water. A cartoon sheep with a red bell around its neck walks by. The scene shifts, with a number of white sheep drinking from a river. A cartoon lion and sheep are by the water\u2019s edge, then the lion lies down. The scene shifts to a wolf silhouetted against a full moon. The lion, looking scared, is sitting with his paws in front of his face. A group of sheep run through a forest at night. A wolf is shown attacking a sheep, followed by the sheep running away from the wolf. The lion lies down again. A group of sheep is standing in a circle with the lion in the center of the circle. Silhouetted wolves are in the background, along with the title: The End. A Walt Disney Production appears.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What happened when the big lion and the lambs were drinking water by the riverside?",
+ "answer": "D",
+ "candidates": [
+ "A. The lambs invited the big lion to take a bath in the river.",
+ "B. The big lion and the lambs laughed and played together by the riverside.",
+ "C. The big lion threw the lambs into the river.",
+ "D. The lambs ambushed the big lion from behind and knocked him into the river."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "Why did the lion's eyes suddenly turn red?",
+ "answer": "C",
+ "candidates": [
+ "A. Because he injured his eyes while fighting with a wolf.",
+ "B. Because he saw a beam of red light, which reflected in his eyes.",
+ "C. Because he heard the mother sheep calling for help.",
+ "D. Because he cried after seeing the sheep being dragged away by the wolf."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does a lion being carried by a group of lambs appear?",
+ "answer": "B",
+ "candidates": [
+ "A. It does not appear in the video.",
+ "B. At the end of the video.",
+ "C. At the beginning of the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "LQaZJHfT": {
+ "video_id": "LQaZJHfT",
+ "video_duration": "128s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a person with a white beard and glasses, wearing a gray suit jacket, blue shirt, and patterned tie standing and gesturing in a room with blue curtains, stacked boxes, and a whiteboard with writing on it. At another point, the video cuts to a different person wearing glasses and a blue shirt, with a badge on their chest. They are standing in a room with other people. The video then cuts back to the first person, who continues gesturing, standing in front of the whiteboard. They then put a hand into their suit jacket and then take it back out, followed by more gesturing. The video then cuts back to the second person from earlier, who then smiles, and the video cuts back to the first person, who is then shown reaching down in front of them and touching their clothing. Then, the first person moves their arms while talking and gesturing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Is the man on the podium keeping his left hand in his pocket?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes, he is.",
+ "B. None of the above.",
+ "C. No, he is not.",
+ "D. It is unclear."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many people wearing glasses appear in the video footage?",
+ "answer": "B",
+ "candidates": [
+ "A. Five.",
+ "B. Four.",
+ "C. Three.",
+ "D. Two."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the man in blue clothing relative to the man in white clothing in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. To the left of the man in white clothing.",
+ "B. To the right of the man in white clothing.",
+ "C. In front of the man in white clothing.",
+ "D. Behind the man in white clothing."
+ ]
+ }
+ },
+ "VCOQFaXM": {
+ "video_id": "VCOQFaXM",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A tray of graham crackers is shown. A silver pot is on a black stovetop with a yellow liquid and two rectangular pieces of butter. The liquid begins to bubble around the butter. The mixture begins to turn brown and become more viscous. Finely chopped pecans are added to the pot, and stirred with a light gray spoon. The mixture is then poured over the graham crackers, which are arranged in a single layer in a silver baking pan. The pecans are distributed evenly over the graham crackers with a light gray spoon.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Is sugar present among the ingredients shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes, it is.",
+ "B. I am not sure.",
+ "C. No, it is not."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the bowl holding the walnuts in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. White.",
+ "B. Brown.",
+ "C. Blue.",
+ "D. Gray."
+ ]
+ }
+ },
+ "vAhsFDMV": {
+ "video_id": "vAhsFDMV",
+ "video_duration": "27s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a red background, and two text labels appear at the top, 'Phagocytes' and 'Lymphocytes', each above a different white object. The 'Phagocytes' object is an irregular shape with a blue, multi-lobed interior. The 'Lymphocytes' object is a solid white shape with a single, large blue circle inside. Next, a larger white circle is placed over the two shapes, briefly obscuring them. Then, the background changes to light beige with a single small red dot at the center. The background then changes to a lighter beige color, and a red circle appears, then a white irregular shape with a blue multi-lobed interior forms within the red circle. Then, the background changes to red, and a white shape appears with the label 'Multi-lobed Nucleus' and a line pointing to the blue multi-lobed part. A second line and the text label 'Cytoplasm' appears pointing to the white part of the shape. Then, a white bone shaped object appears with a white cell with a blue multi-lobed interior. The bone object moves horizontally and the background changes to red and white stripes. The cell object moves along the red and white striped background. The cell then shifts shape and a small green object appears next to it, then the cell encloses the small green object. The cell is then against a red background with a green object inside. A new image of a cut-away section of skin appears, with a nail being driven in. Green objects start to appear and then a white cell with a blue multi-lobed interior appears. More green objects and white cells appear. The white cells move toward the green objects. Eventually the white cells enclose the green objects.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many lobes are typically present in the nucleus of a phagocyte?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. One.",
+ "C. Three.",
+ "D. Two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What state will phagocytes be present in at the wound site after they engulf and kill a large number of bacteria?",
+ "answer": "B",
+ "candidates": [
+ "A. Green pus.",
+ "B. Yellow pus.",
+ "C. Black pus.",
+ "D. White pus."
+ ]
+ }
+ },
+ "ouMTVzbe": {
+ "video_id": "ouMTVzbe",
+ "video_duration": "100s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Software",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a blue screen with small white square shapes and a dark blue rectangle with white text that reads \":> Qt\". This changes to a blue and orange gradient background with a dark bar at the bottom of the screen. A blue and white icon appears in the bottom left corner and, when clicked, a dark gray menu is displayed. The menu has multiple white text options, with a white gear icon at the top left. \"Dolphin\" is highlighted in the list, and then, after being selected, a file management window pops up. The window has a gray frame, and the white text 'Home--Dolphin' is in the top bar. It displays eight folder icons and a gray bar at the bottom with text that reads \"8 folders 12.9 GiB free\". The dark gray menu appears again beside this file management window. The menu is selected, and the dolphin window disappears. The screen goes to the blue background with squares and a dark rectangle with white text that reads \"Kubuntu\" with a white gear logo to the left. This blue background appears with an arrow cursor with a white circle when clicked, a magnifying glass icon appears, then a triangle logo, and then a mobile phone logo with a white \"K\" and gear logo in the middle. Then a blue and white window appears with white text labels on the left and a white keyboard graphic in the center. \"Continue\" is selected, and the screen changes to a white window with black text options and a vertical checklist on the left side of the window. The \"Minimal installation\" text box is selected and then \"continue\" is selected. Another window appears, with blue and white icons on the left and white text options with gray text below them on a white background, the top title reads \"Featured\". After a few moments of staying on this screen, a white window appears that has black text category options listed down the left with several gray text options listed in the center. 
Some of the options in the list are highlighted and then a blue and white video player appears, with a white \"K\" and gear logo in the middle, followed by a black rectangle, white text that reads \"Plasma 5.12 LTS\" and lastly a playlist window with a gray frame, a blue music note icon on the top left, and multiple album cover options on the left and a blank right window with multiple small icons at the bottom.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the likely profession of the man speaking in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A teacher.",
+ "B. A lawyer.",
+ "C. A product manager.",
+ "D. A worker."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is explained in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Software update content explanation.",
+ "B. System update content explanation.",
+ "C. Game update content explanation.",
+ "D. Wallpaper update content explanation."
+ ]
+ }
+ },
+ "XoIjTcyc": {
+ "video_id": "XoIjTcyc",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A woman with purple hair stands indoors. Behind her are two dark sofas, two matching dark cabinets, and a small stack of books. Two lit candles are placed atop the cabinets. There is a window to the left, with a nighttime sky visible. The scene changes to a nighttime view of multiple houses and falling snow. A white van with a yellow bumper is parked. A red delivery truck with the logo \"Conk-Cola\" enters from the right. The scene zooms in on the front windshield of the red truck, revealing two women inside. The woman on the left has pink hair, and the woman on the right has brown hair. The woman on the right points towards the woman on the left. The scene returns to the view of multiple houses and falling snow. The red delivery truck is in the foreground, with the white van parked in the background. The red truck moves from the right to the left, while a stream of white light appears underneath it. The background houses are now blurred. The scene shifts to a shot of the red truck now flying in the sky, leaving behind a fiery trail. The background changes to a grey cloudy sky with falling snow. The red truck with a black outline flies in the night sky with a full moon in the background. The scene zooms in on the front windshield of the red truck, now showing four women inside. There is a pink-haired woman on the left, a woman with short blue hair next to her, a woman with long purple hair, and a woman with long brown hair on the right. There is an image of a landmass, with a blue lightning effect on the left, a purple lightning effect on the right, and a woman with short red hair on the left and a woman with long purple hair on the right. There are several different images of women standing in front of various backgrounds. 
The backgrounds are: the Capitol building with snow, a desert area with a pool, the Christ the Redeemer statue in Rio de Janeiro, the Tower Bridge in London with snow, and Krakow City Square with a black and white cow. There are more images of women in front of cityscapes: the Sydney Opera House, and multiple tall skyscrapers. There is an image of four women next to a Christmas tree on a gray background with pink polka dots, followed by a holiday greeting that says \"Merry Christmas\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Is there any change in the volume of the dialogue between characters in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The closer the character is to the camera, the louder the volume.",
+ "B. No change.",
+ "C. The volume of the character increases with the background music.",
+ "D. The farther the character is from the camera, the louder the volume."
+ ]
+ }
+ },
+ "lCjyNkPF": {
+ "video_id": "lCjyNkPF",
+ "video_duration": "135s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Diagonal blue and white stripes transition to a screen that reads \"ECLIPSE IN ARKANSAS\" with a partial solar eclipse in the background. The screen transitions again to a woman in a purple shirt standing in front of a river with a bridge in the background. She holds a smartphone in her hand. The video cuts to people looking up at the sky wearing special glasses decorated with American flags. A small child adjusts their glasses with help from a man behind them. A complete solar eclipse is shown against a dark sky. A group of people gathers in a grassy field in front of a building. Some wear eclipse glasses, others look through makeshift viewers. A map of Arkansas is shown with a highlighted area, pinpointed near Little Rock. The video cuts to a man in a beige, button-down shirt standing in front of a wooded area. The video returns to a grassy field with rows of trees. Then another grassy field is shown, surrounded by spring-green deciduous trees. The video cuts to a smartphone screen with a photo of a smiling man wearing an orange beanie. The text \"ONX APP USED FOR PUBLIC LAND,\" \"TOTAL SOLAR ECLIPSE,\" \"ON THE PHONE,\" \"TYLER COLLINS,\" and \"ARKANSAS, APP USER\" is displayed. A close-up map of Bell Slough Wildlife Management Area is shown, indicating a road and various points of interest within the area. Then, a screen recording of the ONX app is shown, displaying trails and geographical information, followed by trail details for a route between Lake Sylvia and Winona Lake and Highway 9. The video returns to the man in beige, in the wooded area. It cuts back to a grassy area between rows of trees. A white pickup truck pulling a trailer drives slowly past. The video cuts to another grassy park, with a \"BIKE ROUTE\" sign and a person walking. The video returns to the woman in purple by the river, holding a smartphone in her hand.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order in which the video introduces the software's functions?",
+ "answer": "A",
+ "candidates": [
+ "A. Locate the land entrance, provide a map, navigate through the camp, and find the parking lot.",
+ "B. Provide a map, navigate through the camp, find the parking lot, and locate the land entrance.",
+ "C. Navigate through the camp, find the parking lot, locate the land entrance, and provide a map.",
+ "D. Provide a map, navigate through the camp, locate the land entrance, and find the parking lot."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who said, \"I hope they stay clean\"?",
+ "answer": "A",
+ "candidates": [
+ "A. A fair-haired man.",
+ "B. No one said this.",
+ "C. A woman in purple.",
+ "D. A man on the phone."
+ ]
+ }
+ },
+ "spSVfQOC": {
+ "video_id": "spSVfQOC",
+ "video_duration": "186s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with red hair, wearing a black shirt, stands next to a table. One person is sitting at the table and wearing a black hooded cape with a red collar. Two other people in similar black capes stand nearby. The person at the table grabs a piece of paper and points at it while talking to the person standing. The person standing shakes their head. The person sitting at the table takes out a piece of white cloth and spits into it. The person standing looks unhappy. Then, another person wearing a similar black hooded cape points a finger at the other hooded cape wearer and spits food out of his mouth. The first person at the table places a finger on the face of the person standing. The second person wearing a hooded cape points a finger. The person standing walks away to a booth and puts their hand on the shoulder of a person in a white shirt and black tie, while two hooded cape wearers face him on each side. The scene switches to a car with two people inside, with one wearing a hat and the other wearing sunglasses. They look at binoculars and then discuss what they've seen. The scene switches back to the booth, where the two people in hooded capes are leaning towards the person in the white shirt and tie while yelling at him. The camera switches to an outside view of the restaurant, where the person with red hair can be seen in the window. The car and its occupants are also visible in the street. The scene then switches back to the inside of the car. Finally, the scene is back in the restaurant at the booth, where the red-haired person walks to the table and the two hooded cape wearers face each other and laugh, putting their hands on the shoulders of the person in the white shirt and tie.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotion does the vampire express towards the female server in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Shock.",
+ "B. Embarrassment.",
+ "C. Sadness.",
+ "D. Calmness."
+ ]
+ }
+ },
+ "jhAfmpdM": {
+ "video_id": "jhAfmpdM",
+ "video_duration": "247s",
+ "duration": "4-6min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person is standing in a television studio in front of a large monitor and two smaller monitors on either side. The person is wearing a dark colored dress and light colored shoes. The large monitor displays several weather graphics, changing throughout the video. The smaller monitors also display graphics which change throughout the video. The person gestures with their hands, occasionally pointing at the monitor. There are also numbers appearing at the bottom of the video screen that change periodically.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the weather forecast for Tokoroa in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. There will be heavy snow.",
+ "B. There will be strong winds.",
+ "C. There will be sunshine.",
+ "D. There will be a thunderstorm."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the connection between Tauranga and Mt Maunganui in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The minimum temperature is the same.",
+ "B. Both the minimum and maximum temperatures are different.",
+ "C. Both the minimum and maximum temperatures are the same.",
+ "D. The maximum temperature is the same."
+ ]
+ }
+ },
+ "OlYnyYRv": {
+ "video_id": "OlYnyYRv",
+ "video_duration": "39s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A frying pan is placed on a lit gas stove. Olive oil is poured into the pan. The camera pans to a bowl of chopped white onions, and the cook turns off the burner under the frying pan. The onions are added to the pan. The burner is turned back on under the pan, and the onions are stirred with a spatula. A clove of garlic is placed on a cutting board and the root end is sliced off. The peel is removed. The clove is smashed with the flat of a knife and the remaining peel is removed. The peeled garlic cloves are placed in a garlic press and crushed directly into the frying pan with the onions. The onion and garlic mixture is stirred. The contents of the pan are then transferred to an Instant Pot or pressure cooker.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence in which the woman adds the ingredients to the pot in the video? (a) Add oil to the pot, (b) add onion granules to the pot, (c) add garlic paste to the pot.",
+ "answer": "B",
+ "candidates": [
+ "A. (a) then (c) then (b).",
+ "B. (a) then (b) then (c).",
+ "C. (b) then (a) then (c).",
+ "D. (c) then (b) then (a)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the video show a garlic press?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes.",
+ "B. None of the above.",
+ "C. No.",
+ "D. I am not sure."
+ ]
+ }
+ },
+ "hSwBQgNN": {
+ "video_id": "hSwBQgNN",
+ "video_duration": "84s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with an orange background and white text that reads \"ON THE NEXT ALL NEW\" followed by the text \"Kidding\" that also appears in white. Below these words appears the word \"SHOWTIME\". The scene then changes to a woman in a red t-shirt and blue jeans standing inside what appears to be a dining room. She walks towards a table and sits down, putting her face in her hands. A man in a plaid shirt and dark jacket appears in front of her and speaks. The woman puts her hands on the table. The man then places his hand on her arm and she then takes his hand. The woman then begins to speak and gestures with her hands as she is speaking. The man gets up and moves to the left of the frame. The woman is then shown again sitting at the table, speaking, and gesturing. The man then moves to the right side of the frame and then appears on the left side of the frame again. The woman is shown again speaking and gesturing. The man appears in front of her again, and the woman continues to speak and gesture with her hands.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurs in the woman's voice at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Gradually increases.",
+ "B. Increases first, then decreases.",
+ "C. Decreases first, then increases.",
+ "D. Gradually decreases."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many table lamps are in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. Four.",
+ "C. One.",
+ "D. Three."
+ ]
+ }
+ },
+ "FyEqOPGL": {
+ "video_id": "FyEqOPGL",
+ "video_duration": "45s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Software",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with dark hair is visible against a light-colored background, with the word \"myBURGER\" written over the person's torso in white text. Then, blurry scenes of a kitchen area with multiple people behind a counter are shown. The video cuts to a close up of a person with dark hair and a light-colored shirt with a black jacket talking. The scene then transitions to a shot of several tables in a restaurant. The video then cuts back to the person in the black jacket talking. A laptop with a calendar on the screen is shown. The video again shows the person in the black jacket talking, followed by a very blurry shot of a table. The scene cuts to a slightly more focused shot of a table with condiments. The video then shows red chairs stacked up. The person in the black jacket is then shown holding white cups behind a counter. The person behind the counter is partially visible, wearing a white shirt and dark cap. The scene cuts again to the person in the black jacket talking. The video shows several yellow jars of mustard. The video then shows two halves of a burger on a white surface with various toppings. A shot of text that reads \"Flippin' good Burgers\" is shown over a wall.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What type of shop does the clean-shaven man run in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A hamburger restaurant.",
+ "B. A coffee shop.",
+ "C. A sushi restaurant.",
+ "D. A milk tea shop."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Did a smartphone appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "How much time does the man in the video spend each week scheduling for the store?",
+ "answer": "C",
+ "candidates": [
+ "A. 4 hours.",
+ "B. 1 hour.",
+ "C. 2 hours.",
+ "D. 3 hours."
+ ]
+ }
+ },
+ "WdfFGPrJ": {
+ "video_id": "WdfFGPrJ",
+ "video_duration": "117s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A wrestler in black and white tights lifts another wrestler in black tights and slams him to the mat. The wrestler in black and white tights kneels and taunts the wrestler in black tights. The wrestler in black tights gets up and is Irish whipped into the ropes, bouncing off and hitting the wrestler in black and white tights. The two wrestlers exchange blows. The wrestler in black and white tights whips the other wrestler into the ropes, and then throws him towards the turnbuckle, and kicks him in the head. The wrestler in black and white tights bounces the wrestler in black tights\u2019 head off the top turnbuckle repeatedly, and whips him into the ropes again. The wrestler in black and white tights lifts and slams the other wrestler to the mat again, and pins him, but the referee counts to two before the wrestler in black tights kicks out. The wrestler in black and white tights pulls the other wrestler up and throws him outside the ring. The wrestler in black and white tights climbs the turnbuckle and jumps, knocking the other wrestler to the ground outside the ring. The wrestler in black and white tights throws the wrestler in black tights back inside the ring and pins him again, with the referee getting a two count. The wrestler in black and white tights sets up another slam, picks up the other wrestler again, and slams him to the mat once more. He pins him again and the referee counts to three.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "In the video, what final technique did the man with long black hair use to force his opponent into a defensive position?",
+ "answer": "B",
+ "candidates": [
+ "A. A downward axe kick.",
+ "B. A Sicilian storm.",
+ "C. An Irish whip.",
+ "D. A knee strike."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "At the start of the video, what unusual event happens to the short-haired man who had been in a dominant position?",
+ "answer": "B",
+ "candidates": [
+ "A. The man with long, black hair knocks him to the ground.",
+ "B. The man with long, black hair counterattacks and throws him out of bounds.",
+ "C. The referee issues a warning.",
+ "D. The man with long, black hair chokes him."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "During the referee's count in the video, what was the positional relationship between the two contestants?",
+ "answer": "C",
+ "candidates": [
+ "A. The man with long black hair is positioned to the right of the short-haired man.",
+ "B. The man with long black hair is positioned to the left of the short-haired man.",
+ "C. The short-haired man is positioned above the man with long black hair.",
+ "D. The man with long black hair is positioned above the short-haired man."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video, how many times does the referee clap to start or stop the timer?",
+ "answer": "B",
+ "candidates": [
+ "A. Two times.",
+ "B. One time.",
+ "C. Three times.",
+ "D. Zero times."
+ ]
+ }
+ },
+ "BmoicFLe": {
+ "video_id": "BmoicFLe",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Editorials",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video starts with a white screen, then transitions to a blue background with text that says \"A MESSAGE FROM\" above an oval graphic that has the image of the White House and text below saying \"THE WHITE HOUSE WASHINGTON\" and below that, \"student debt forgiveness program\". The screen then transitions to a man with a beard wearing a grey suit, a blue shirt, and a patterned tie, talking against a blue or green background. He is gesturing with his hands. Text appears at the bottom of the screen, synchronized with his speech. Throughout the video, the background alternates between blue and green, as do his hand gestures and facial expressions. Near the end of the video, a blue rectangle with a document and an image of Joe Biden appears above him. Text continues to appear across the screen along with the man's speech. Finally, the screen becomes entirely black with the word \"reason\" in white and orange text.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the celebratory action occur?",
+ "answer": "A",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. Throughout the entire video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "xUvewnRm": {
+ "video_id": "xUvewnRm",
+ "video_duration": "146s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man plays the drums. Another man plays a dark-colored electric guitar. A third man sings into a red microphone. The video cuts between these three men as they play and sing. The drummer wears a black t-shirt with an orange design and dark trousers. He wears headphones and plays a drum kit with Zildjian cymbals. The guitarist wears a gray t-shirt and dark trousers. He is seated and plays the guitar while sitting in front of a Marshall amplifier. A brown acoustic guitar leans against the wall behind him. The singer wears a dark green t-shirt with Roman numerals and the number 19 in red. He stands in front of the microphone, occasionally gesturing with his right hand. At one point, the video switches to black and white while the singer is shown, but the other musicians remain in color. Later, the screen splits, showing the singer on the left and the guitarist on the right. The singer is now in black and white, while the guitarist remains in color. The split screen returns to the individual shots of each musician. The video ends with the three musicians playing and singing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "How many different types of instruments produce sound in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. One.",
+ "B. Three.",
+ "C. Two.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the man wearing a necklace start playing the guitar?",
+ "answer": "A",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. At the end of the video.",
+ "C. Throughout the entire video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotions are conveyed by the song in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Dependence and fear.",
+ "B. Tension and anger.",
+ "C. Happiness and excitement.",
+ "D. Pain and sadness."
+ ]
+ }
+ },
+ "SCtoLDSp": {
+ "video_id": "SCtoLDSp",
+ "video_duration": "78s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Historical Analysis",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with a patterned sleeve uses their hands to wind a thin cord onto a wooden device. A liquid from a gold container is poured into a black mug. Hands use a utensil to scoop a green mixture from a bowl onto a wooden plate with flat bread. A tan cup is sitting on a tan plate. Three people wearing crowns and royal-like garb are visible, with an additional person kneeling in front of them. A person in a red robe and hood holds up a red banner, among a crowd of people seated in pews. Two people engage in a mock battle using long sticks. Two people engage in a mock battle with sword-like objects, one person using a shield. A person pulls a glove onto their hand, while another adjusts the helmet on their head. A person puts on a padded head piece. Two people engage in a mock battle, with one person running away. Several people are engaged in a mock battle. A large group of people with shields and weapons walks towards the camera. Two people engage in a mock battle. Two people engage in a mock battle, with one person swinging a stick at the other\u2019s shield.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "Why would a soldier in full armor kneel?",
+ "answer": "C",
+ "candidates": [
+ "A. Kneeling provides a faster movement speed.",
+ "B. A leg injury prevents standing.",
+ "C. Kneeling allows for an attack on an opponent's abdomen.",
+ "D. Lack of leg strength prevents a standing attack."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the Queen's mood in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Joyful and melancholic.",
+ "B. Happy and pleased.",
+ "C. Depressed and sad.",
+ "D. Sad and discouraged."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What does the soldier with a red scarf do to the soldier with a green shield in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Swings a sword toward the leg.",
+ "B. Swings a sword toward the abdomen.",
+ "C. Blocks with a shield.",
+ "D. Swings a sword toward the arm."
+ ]
+ }
+ },
+ "tOqzQbIM": {
+ "video_id": "tOqzQbIM",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a 3-panel comic strip. The first panel shows two adults standing behind a child seated at a table, one adult touching the child's arm. Inset is a circular frame containing the same two adults, close together, with a red heart in the middle. The second panel shows the same two adults with the child not in the scene. The adults' postures appear different from the first panel. The third panel shows the two adults from the first panel with the child in the foreground, with both hands to its face. The video then transitions to black and white drawings in a horizontal layout. The first panel contains no drawings. The second panel shows an adult. The next panel depicts the two adults standing next to each other. In the following panel, the adult couple is standing next to a child. The next panel shows the same adult couple and child standing next to a vertical pole. The pole is then changed to one holding a sign showing a movie clapper board. Two more poles holding signs appear: one with a heart, and one with nothing. The blank one is then changed to show a brain. The child in the panel is seen with its arms outstretched in the next panel. The child is seen in the next panel with both adults watching. In the next panel the adult couple is watching the child, and the scene zooms out a little bit. A drawing of an adult face appears below the two-panel layout, followed by the whole adult body in the next panel. In the following panel the adult is seen looking at a wristwatch with a symbol showing an unequal sign next to it. Another figure is drawn next to the first. The next panel is the same, except the figure who looked at the watch is now gesturing in a manner of speaking, with the figure next to it remaining unchanged. A different scene shows one of the original adults pointing while a thought bubble comes from its head. Another figure is being pointed at. The next panel shows the same two adults, with the one from the previous frame now seated with a newspaper while the child is seen standing in the corner. The next panel is blank. The adult figure is then seen with its hands outstretched. The adult is then seen holding a blank sign, and then a series of signs, one on top of the other. A brain is shown on the top of the signs, and is then placed in the middle of a set of signs that were previously blank. A heart is then placed on top of this. The adult is then seen breaking two batons while a heart and clapper board are shown on the ground next to the brain. The final panel shows an adult looking pleased. The second to the last panel shows the adult with a jacket, and this adult is walking away.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many distinct behavioral markers are shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. Four.",
+ "D. One."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual action did the man ultimately take?",
+ "answer": "C",
+ "candidates": [
+ "A. Break the heart symbol.",
+ "B. Break all three behavior symbols.",
+ "C. Break the skull symbol.",
+ "D. Break the symbol with a movie clapper."
+ ]
+ }
+ },
+ "ZPCkqxeU": {
+ "video_id": "ZPCkqxeU",
+ "video_duration": "111s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Close-up shots of German players' faces. The German and Brazilian coaches shake hands. German players give each other high fives. A Brazilian player (#11) and a German player (#16) shake hands with the referees before the kickoff. The game starts. A Brazilian player (#9) passes the ball to a teammate, who gets tackled by a German player (#10). Marcelo (Brazil, #6) kicks the ball towards the goal, but it goes out. A German player (#8) places the ball for a corner kick. The corner is taken and the ball is headed in by M\u00fcller (Germany, #13). German players celebrate the goal while Brazilian fans react with shock and disappointment. Marcelo is shown with a worried expression. German players take another corner kick. Close-up shot of a Brazilian fan's face with the Brazilian flag painted on. The German team takes another corner kick. The ball bounces off the Brazilian goalkeeper's hands, but Klose (Germany, #11) kicks it into the goal. German players celebrate the goal. The Brazilian players appear dejected. A close-up is shown of a Brazilian fan holding a replica World Cup trophy. The German coach, Joachim L\u00f6w, looks pleased. A Brazilian fan with face paint is shown looking sad. The German team scores a third goal. A Brazilian fan puts her hand to her mouth in shock. German players celebrate, while the Brazilian goalkeeper sits on the field, looking dejected. A Brazilian fan cries. German players celebrate after their team scores another goal. A Brazilian player walks on the field, looking defeated. Brazilian fans in the stands appear sad and distressed. One fan holds up a Neymar mask. The scoreboard shows the score as 0-5. German players hug as the video shows the goals again in quick succession.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What was the final score in the match between Germany and Brazil?",
+ "answer": "C",
+ "candidates": [
+ "A. 4-1.",
+ "B. 0-5.",
+ "C. 5-0.",
+ "D. 1-4."
+ ]
+ }
+ },
+ "ObyfFIZa": {
+ "video_id": "ObyfFIZa",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video opens with a view of a multi-story building with many balconies. Several people are standing in front of it. The camera zooms in on a vase of pink flowers. A man in a black and white jacket walks by. Next, three people are standing by a white fence. A blonde woman in a red dress gestures with her hand. A man and a woman in black stand to her left, and another woman stands to her right. There is a brief shot of a video camera that is recording this scene. Next, two women in black dresses are holding VR headsets. A third man is standing behind them in a green jacket. Then a woman with a colorful headdress hugs a man in a suit. Then, four people in formalwear are posed on a blue background. They pose again, then a single woman in a black dress walks up to pose. A woman in a red dress talks to someone offscreen. Then, a woman in a purple dress addresses the camera. Then, a logo of SABC 3 is displayed. A montage of many images is then displayed on a grid. The next shots are of a beach and then of people in silhouette. Then there is a shot of food being made, followed by text on a green background that reads \u201cFollow the Leader,\u201d followed by a shot of three men on an ATV. Next, a woman bounces on a bed. She then stands up and walks to a desk. Two men are seen taking photos. A close up of a phone is displayed with an image on it of a woman in a purple dress. Several people in formalwear walk through the background. Then there is a shot of a camera, then a woman in a purple dress is interviewed. Another woman in a black dress is interviewed. She has a blue necklace on. The first woman in the purple dress is interviewed again. Another woman in the black dress with the blue necklace is interviewed again.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does SABC3 programming begin to play?",
+ "answer": "C",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. At the beginning of the video.",
+ "C. In the middle of the video.",
+ "D. Throughout the video."
+ ]
+ }
+ },
+ "LcWvKWDr": {
+ "video_id": "LcWvKWDr",
+ "video_duration": "98s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A Fortnite Brightness Calibration screen is shown, with the brightness slider set at 60. The default value is 0.50. The screen blurs and then focuses on the main lobby screen. The player's chosen character is an Elite Agent skin. The text overlay reads, \"first time on after getting hacked\". A message at the bottom of the screen states, \"ITEM DISABLED. Due to an issue, we've temporarily disabled the following item: Zapper Trap.\" The player opens the menu and scrolls through the friends list and joinable parties. The player then opens the settings menu and then returns to the lobby. The \"Play\" button is selected, followed by the \"Battle Pass\" tab. Then the \"Career\" tab and \"Profile\" sub-tab are selected. The player's statistics are briefly displayed: Account level 717. Underneath, Season X progress shows 100% Match XP Boost and 80% Match XP Boost for friends. Initial stats show 0 wins, 0 eliminations, and 0 matches played, but refresh after a few seconds to 14 wins, 651 eliminations, and 369 matches played. In Duo mode, there are 14,713 score points and 3,009 players outlasted. The \"Change Flag\" button is clicked. The Flag Select screen appears. The player scrolls through the flags, landing on Argentina (currently selected), Australia, Egypt, Mexico, then back to Argentina. More flags are scrolled through, including Japan, Wales, the United States, and England. The player returns to the profile screen with the updated stats showing wins as 14, eliminations as 651, and matches played as 369. Switching between \"Solo,\" \"Duo,\" \"Squad,\" and \"LTM\" changes the displayed stats. In Solo, the stats are: 240 wins, 7,070 eliminations, and 4,189 matches played. In Duo, there are 621 Top 5s and 1,096 Top 12s. In Squad, there are 190 wins, 667 Top 10s, 1,391 Top 6s, 9,424 eliminations, and 5,871 matches played. The player navigates to the \"Item Shop\" tab. The featured item is \"Sledge,\" an outfit for 1,200 V-bucks. The player clicks the \"Support a Creator\" button at the bottom right. The text \"Kiwiking76_YT\" is entered into the field and accepted. The Item Shop screen reappears. The player goes back to the profile screen and clicks LTM. The wins counter cycles through 309, 190, 240, then rests at 309, while other stats also change. Duo is selected, showing 240 wins. Squad is selected, showing 309 wins, 634 Top 3s, 1,032 Top 6s, 6,961 eliminations, and 4,476 matches played. The player goes to the \"Locker\" tab and selects a different outfit, the \"Merry Marauder.\" The Black Shield back bling and Scorcher harvesting tool are also selected. The player then selects the AC/DC harvesting tool. The player navigates back to the main lobby and to the \"Battle Pass\" tab, then back to the \"Locker\" tab. The Merry Marauder outfit is chosen again. The second style option for this outfit, lacking the mask, is chosen and saved. The player returns to the main lobby screen and then to the \"Career\" tab.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many national flags are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Thirty-six.",
+ "B. Thirty-five.",
+ "C. Thirty-seven.",
+ "D. Thirty-eight."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the first three letters below the team symbol in the top left corner when the video starts?",
+ "answer": "B",
+ "candidates": [
+ "A. PUR.",
+ "B. PAR.",
+ "C. MAN.",
+ "D. MOR."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Which game mode is shown in the video with 240 victories?",
+ "answer": "D",
+ "candidates": [
+ "A. Three-player mode.",
+ "B. Team mode.",
+ "C. Single-player mode.",
+ "D. Two-player mode."
+ ]
+ }
+ },
+ "hpSLslts": {
+ "video_id": "hpSLslts",
+ "video_duration": "596s",
+ "duration": ">8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man with short brown hair and a beard speaks directly to the camera. He wears a dark blue t-shirt. Behind him, a framed map hangs on a light gray wall, and a white mug sits on a white surface to the right. Indonesian subtitles appear at the bottom of the frame throughout the video. He speaks with varied expressions, nodding and gesturing with his head and eyebrows. The lighting is consistent and bright, focused on his face.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What religious sect does the man in the video follow?",
+ "answer": "C",
+ "candidates": [
+ "A. Islam.",
+ "B. Buddhism.",
+ "C. Atheism.",
+ "D. Christianity."
+ ]
+ }
+ },
+ "vRZKqHCC": {
+ "video_id": "vRZKqHCC",
+ "video_duration": "23s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person is shown wearing a light blue jacket over a light blue collared shirt and a red tie, and red pants. The letters \"W\" and \"L\" are shown in green and red. Then a swirling blue circle and the word \"swablu\" appears, followed by 5 pokeballs on the left. Then the word \"Japan\" appears, followed by an animated blue bird-like figure with white clouds around it. A box appears below it, partially obscured by the words \"cover \u30c1\u30eb\u30c3\u30c8\". The word \"caught\" is then shown, and a blue box with information including a date of 12/31/2060 and the location Casseroya Lake appears, as well as the phrase \"it's mischievous!\". The word \"in\" is shown over the box. Then a yellow smiley face is shown over the person's face. Next, the person is shown holding a black and yellow pokeball in their hands, and looking down at it. Then the word \"swablu?\" appears in blue text. A sparkling blue streak appears going from the pokeball. Next, the word \"Hopip\" in pink appears, and a pink figure with yellow antennae appears. A hexagonal chart is then shown, as well as the word \"Chlorophyll\", followed by the words \"ending Hop to\", and then \"RADISH\". Then the word \"giving\" appears, followed by a light brown, four-legged animal-like figure. The word \"definitely\" appears over it, and the words \"upgrade\" and \"over Rockruff\" are shown. The word \"something\" is shown over an animated white burst of light, followed by the image of a black and yellow pokeball. The letter \"I\" appears, and then a gray figure with a black marking on its back appears, followed by the word \"Clodsire\", and the phrase \"ent over Clodsire\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Which Pok\u00e9mon was traded first in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Rockruff.",
+ "B. Krookodile.",
+ "C. Swablu.",
+ "D. Skiploom."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Can the Tribal Chief win against Jianzi Cao in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Can.",
+ "C. Cannot."
+ ]
+ }
+ },
+ "GFajeGMh": {
+ "video_id": "GFajeGMh",
+ "video_duration": "101s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a dark shot of a person facing away from the camera, watching a city with many fires. The scene changes to a medium shot of a person facing the camera, who speaks. The scene then changes to a medium shot of a person with long dark hair, who is looking down, seemingly sad. The scene then changes to another medium shot of the person who was facing the camera, who also speaks again. His eyes then glow white, briefly. Then, his shoulder is touched by someone. The scene changes to the person in the previous scenes with long hair, and he says something with the subtitle \"I'm tired of this emo boy\". The same character looks at another character in a \u201cstare down\u201d, then he seems to look away. Emojis of black hearts and skulls appear, and a devil emoji. The camera cuts back to the character staring forward. The subtitles then say \"*is emo\". The scene cuts to an aerial shot of a dark silhouetted figure climbing the side of a structure. Then the camera cuts back to a person facing forward, looking over a crowd, with fires in the background. The scene shifts to the same character again, now standing at the bottom of a triangular structure, with 2 other silhouetted figures standing on either side of the structure behind him. He has a different hat on. Then the scene changes to the same character talking to a woman. Then the camera cuts to a scene of the same character walking into a room with many people, some of whom are sitting on benches. The camera shows the character walk into the middle of the room, and turn to another character. The words \"*sick of Druig\" are on the screen. The character's eyes then glow white. Then an older man's eyes glow white and the camera shows he is holding a black object. The camera moves to two people who are looking and talking to each other. One of the characters has devil emojis by his ear. The camera shows the previous character turn away from another character, and then walks away, with the subtitles \"*starts a fight for no reason\". Then the camera cuts to him speaking again. Then the camera shows one of the characters close up looking to the side. The other character looks to the side also. The scene cuts to the first character again with his eyes glowing. The camera cuts to other characters. The scene shifts to an aerial shot of one of the characters on the ground, crawling on the ground and then jumping into a fight, with the words \u201cis extra for no reason\u201d. The scene cuts to the long-haired character again speaking to the previous character, who says \u201cbut Sersi, I\u2019m emo\u201d. They continue to look at each other, and she briefly takes his hand. The screen briefly cuts to text \u201cdon\u2019t toucha me\u201d. Then the camera shows both characters facing each other with a crowd in the background. The camera then cuts to an eye close up shot with the person\u2019s eyes glowing white. Then the camera shows that person raising their arm and in the background it looks like a crowd of people moving. Finally, the video shows the long-haired character and others walking through a forest at night, with the subtitle \"What's happening?\"",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the man in the black T-shirt do to the man in the black leather jacket?",
+ "answer": "B",
+ "candidates": [
+ "A. The two men embraced.",
+ "B. He pulled the other man's shoulder.",
+ "C. He gave the other man a push.",
+ "D. The two men held hands."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual action did the man in the gray suit take in the prison?",
+ "answer": "B",
+ "candidates": [
+ "A. Bit his own body.",
+ "B. Threw an object.",
+ "C. Hit his own leg.",
+ "D. Slapped himself."
+ ]
+ }
+ },
+ "yRoeAmwb": {
+ "video_id": "yRoeAmwb",
+ "video_duration": "129s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Biology",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a solid dark green screen. White text appears on the green screen that reads: \"Dissecting grasshoppers and tissue preparation\". The screen then switches to a shot of a table with various tools, materials, and three beakers. Two of the beakers are labeled with \"70% EtOH\". There is a yellow rectangular container, a small wooden block with pins in it, a dissection mat, a clear plastic bag and a printed page. The printed page has a picture of a grasshopper. The screen then zooms in on a kit of tools. The screen returns to a wide shot of the table, and then zooms in on a plastic bag with a grasshopper inside. Then zooms in on the printed page with the diagram. Then zooms out to a wide shot of the table. The screen zooms in as a hand with a blue glove uses forceps to take a brown object out of the bag and place it on the dissection mat. The brown object looks like a grasshopper. The blue gloved hand uses forceps to manipulate the object. The screen then pans to the three beakers. Then the blue gloved hands pick up the beaker labeled with \"70% EtOH\" and pour it into the larger middle beaker. The gloved hand then picks up the other \"70% EtOH\" beaker and pours it into the same middle beaker. The gloved hand then appears in frame with the brown object on the dissection mat, and uses scissors and forceps to manipulate the object. The gloved hand then places two pins through the object onto the dissection mat and uses scissors and forceps to manipulate the object. The gloved hands continue to use scissors and forceps to dissect the object pinned on the dissection mat. The video ends with a shot of the dissected object with pins and the dissection tools.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence for dissecting a grasshopper as demonstrated in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Prepare the tools, rinse with ethanol, fix with an insect pin, cut open the exoskeleton, remove the hind legs and wings, and remove the digestive system.",
+ "B. Prepare the tools, remove the hind legs and wings, rinse with ethanol, fix with an insect pin, cut open the exoskeleton, and remove the digestive system.",
+ "C. Prepare the tools, rinse with ethanol, remove the hind legs and wings, fix with an insect pin, cut open the exoskeleton, and remove the digestive system.",
+ "D. Prepare the tools, rinse with ethanol, fix with an insect pin, remove the hind legs and wings, cut open the exoskeleton, and remove the digestive system."
+ ]
+ }
+ },
+ "sAXFSued": {
+ "video_id": "sAXFSued",
+ "video_duration": "63s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+      "video_caption": "The video begins with a shot of the interior of a multi-story building with a rounded atrium and escalators. The camera pans to show the various levels of the building. A person is then shown from the back, walking through the building. Two people are then seen walking with one person's arm around the other. The camera returns to show the interior architecture. An aerial view of a multi-story building is shown with people walking around each level. A man is shown speaking while inside a car. Next, a shot of industrial robotic arms working inside a factory is shown. A dark shot with motion blur is followed by a shot of a building, viewed through vertically oriented blinds. A close up shot of a wall with repeating hexagonal shapes is shown. Next, a mechanical arm and small parts are shown in close up. Another shot of the interior of a multi-story building with escalators is shown. A series of shots of buildings under construction with cranes overhead are shown. Next, a person wearing a dark suit, tie, and glasses speaks. A vehicle is shown driving on the street at night, followed by an aerial view of city highways and skyscrapers at night. A man in a suit is shown speaking. Two people are shown from the back, looking out of a window with a view of a city. A wide shot of a city is shown from above. A wide shot of a city street with many people walking and driving is shown. A woman wearing a necklace is shown speaking. She is shown from the shoulders up, then the camera zooms out to show more of her surroundings as she continues to speak. A close up shot of a glass building is followed by a low angle shot of various skyscrapers. A shot of a gold temple tower with birds flying by is shown, then a shot of many gold and white towers. A shot of people gathered around a tour guide at a gold pagoda is shown. A shot of two people in dark red robes is shown, followed by a shot of people walking in the rain with umbrellas.
A group of people in dark red robes walk in a line with metal bowls. A shot of a person standing in a boat on a body of water, next to a mountain range is shown. A woman is then shown speaking. Next, a dark shot of a person, partially silhouetted is followed by a view of the water. A person in a boat is shown from the back. A shot of a train platform with a train and people boarding is shown, then a conductor waves. A shot of a field of rice is shown. Then a woman with curly hair and a scarf is shown speaking. Rice terraces are shown. A satellite orbiting the Earth is shown. A blurry close up shot of a piece of grass is shown. A person wearing a green backpack is shown walking along a dirt path. A closeup of a yellow bag with a logo. Two people stand in a warehouse. A person in work clothes stands next to a machine, and then a machine is shown assembling packages on a conveyor belt. A tall building with curved architecture is shown.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What are the final positions of the two tower cranes in the middle of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. One is on the right, and the other is slightly to the right of center.",
+ "B. One is on the left, and the other is slightly to the left of center.",
+ "C. One is on the left, and the other is slightly to the right of center.",
+ "D. One is on the right, and the other is slightly to the left of center."
+ ]
+ }
+ },
+ "SUUJhalo": {
+ "video_id": "SUUJhalo",
+ "video_duration": "143s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with a beard and tattoos is working on a mannequin head with long light brown hair. The person uses clips to section the hair. They then use a thin comb to section off pieces of hair and place foil underneath them. Using a brush, the person applies a substance to the hair sitting on the foil. They fold the foil up and place it into the hair. This process is repeated around the mannequin's head. The person uses the comb to separate out more pieces of hair, and places foil underneath before applying the substance. Once all the hair has been processed, the person holds the hair up to demonstrate the result. Finally, they use a comb to pull down the strands of the processed hair.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man in the video perform the \"chemical treatment with foil wrap\" action on hair?",
+ "answer": "D",
+ "candidates": [
+ "A. 4.",
+ "B. 6.",
+ "C. 3.",
+ "D. 5."
+ ]
+ }
+ },
+ "WjLBpBzr": {
+ "video_id": "WjLBpBzr",
+ "video_duration": "269s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person attaches a lure to a fishing rod. The lure is initially white, then changes to orange. The person opens the door of a white car and casts the fishing line with the orange lure. The lure is changed again, this time to one that is translucent orange with a lead weight. The line is cast towards a small pond in a residential area, near a playground and a food truck. The lure is changed a few more times, cycling through different sizes and colors (orange, double-tailed grub, grub with lead weight). The person casts the line into the pond several times from the bank, reeling it in each time. The lure is changed again. The fishing rod bends as the person reels in a fish. The person pulls the fish, a largemouth bass, onto the grass bank and removes the hook from its mouth. The fish is released back into the water. Another lure is put on the hook, and the person casts again. The person reels in another fish and releases it. The process of changing lures and casting continues. Another largemouth bass is caught, unhooked, and released. The process of casting and reeling continues, with another fish being caught and released. The person continues casting and reeling.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "After the man catches the second fish in the video, does bait remain on the fishing rod?",
+ "answer": "C",
+ "candidates": [
+ "A. Bait does not remain.",
+ "B. It is not possible to determine.",
+ "C. Bait remains."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the white transparent object placed on the car in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. On the driver\u2019s seat.",
+ "B. On the back seat.",
+ "C. On the passenger\u2019s seat.",
+ "D. Inside the trunk."
+ ]
+ }
+ },
+ "TSTXDJpG": {
+ "video_id": "TSTXDJpG",
+ "video_duration": "243s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+      "video_caption": "A silver tea kettle is poured into a clear blender pitcher. Then, a tan bag is opened and its contents are poured into the blender. A small amount of liquid from an orange bowl is also poured into the blender. Next, a black container is opened, and a small, light-brown wooden stick is used to extract a dark, shiny substance. This substance is then added to the blender. The blender is then placed on the base, the blend speed is adjusted, and then it is turned on. The resulting liquid is poured into a clear jar. A white liquid is added, and a lid is placed on the jar. Then, the jar is held up for the camera. At another point in the video, a different black container is opened and held up for the camera, and text appears next to it. A portion of the video also shows a graphic with a black background and text in yellow and white. A white liquid is then poured into the jar, followed by the jar being moved out of view, and a man appears to be speaking into the camera, and he gestures towards the jar.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order of making the drink as shown in the video? (a) Pour in hot water (b) Add Shiraji (c) Add coffee powder (d) Add milk and ice cubes (e) Mix well (f) Pour in maple syrup",
+ "answer": "C",
+ "candidates": [
+ "A. (c) (b) (a) (e) (f) (d).",
+ "B. (a) (b) (e) (d) (f) (c).",
+ "C. (a) (c) (b) (f) (e) (d).",
+ "D. (a) (b) (c) (f) (e) (d)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What does the man in the video add to the blender?",
+ "answer": "A",
+ "candidates": [
+ "A. Water, coffee powder, Lipton, and maple syrup.",
+ "B. Water, honey, milk, and ice cubes.",
+ "C. Water, coffee powder, Lipton, and milk.",
+ "D. Water, coffee powder, milk, and ice cubes."
+ ]
+ }
+ },
+ "ndMGQClm": {
+ "video_id": "ndMGQClm",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video starts with a diagram of the \"Cross vs Self Attention\" in a neural network. This is followed by a chart with the letters of the alphabet, along with their pronunciation in the International Phonetic Alphabet. This is followed by another chart showing Kannada script, with the Roman alphabet transcription and pronunciation in the International Phonetic Alphabet. Next, a diagram appears illustrating how queries, keys and values interact in an attention mechanism. This is followed by a diagram showing how keys and a query interact to generate an attention score. Next, a diagram shows how attention scores are combined with values to generate a weighted value. Finally, the last diagram shows the sequence of operations of a single attention head. Some elements of the diagrams are circled in pink.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What numbers are used to label the value vectors displayed at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. 1, 0, 2.",
+ "B. 0, 1, 1.",
+ "C. 4, 4, 1.",
+ "D. 1, 2, 3."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is used to represent the query vectors in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Orange.",
+ "B. Blue.",
+ "C. Purple.",
+ "D. Green."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many times does the query vector appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. One.",
+ "D. Two."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order do the following four events occur in the video? (a) Proposal of the translation task hypothesis, (b) Introduction to the fundamental principles of attention mechanisms, (c) Explanation of self-attention mechanisms, (d) Explanation of cross-attention mechanisms.",
+ "answer": "B",
+ "candidates": [
+ "A. (b) (a) (c) (d).",
+ "B. (c) (a) (d) (b).",
+ "C. (a) (c) (b) (d).",
+ "D. (d) (c) (a) (b)."
+ ]
+ }
+ },
+ "rZqxSijA": {
+ "video_id": "rZqxSijA",
+ "video_duration": "115s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A black screen displays a white logo reading \"VISUALS BY TP FILMS.\" A young Black woman wearing a gray Nike hoodie and gray sweatpants stands before a microphone on a stand, rapping. She is outdoors, in what looks like an alleyway between rows of blue garage doors. Buildings are visible in the background. Various images appear superimposed on the video, timed to the lyrics. These include a floating hoodie and sweatpants set, the Central Cee logo, two police officers, praying hands, a cell phone, Usain Bolt running, a maroon tracksuit, a white sleeveless dress, a red comb, three stick figures, a cartoon image of Nick and Andrew from the show *Big Mouth,* a lighter, metal bars, flames, the Snapchat logo, a pistol, a tall building, handcuffs, and a container of what looks like chewing gum. The lighting changes throughout the video, shifting from a dark, overcast feel to brighter, more saturated lighting.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotion does the singer convey at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Joy.",
+ "B. Confidence.",
+ "C. Sadness.",
+ "D. Anger."
+ ]
+ }
+ },
+ "IUnlAgRl": {
+ "video_id": "IUnlAgRl",
+ "video_duration": "56s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+      "video_caption": "A light-skinned person wearing glasses and a light blue shirt is in the first few frames, with a dark background with a pattern. Then a hand holding a thermometer is shown in front of a background of cracked ground and blue sky. There are multiple views of the thermometer, before a map of the Pacific off Central America appears with a temperature scale shown, and the map is shown multiple times with different dates. The map changes to a graphic with blue, green, yellow and red colors, and the dates \"Jan 97\" and \"Jun 97\" are shown on two separate versions of this graphic. The light-skinned person wearing glasses and a light blue shirt is shown again with a dark background with a pattern, before a chart labeled \"SOI Index\" with blue and red lines is shown. Then the person wearing a light blue shirt is shown once again with the same dark background with a pattern, before graphs of the same map of the Pacific Ocean with different dates and a second chart with anomalies are shown multiple times. The person wearing a light blue shirt is shown once more with the same dark background with a pattern and then it fades to black, and then to white text with credits and then a white UCAR logo and black background, which then fades to black.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What were the likely consequences of the El Ni\u00f1o phenomenon in 2015?",
+ "answer": "D",
+ "candidates": [
+ "A. Pacific Ocean water flows into the Atlantic Ocean.",
+ "B. Global temperatures decrease.",
+ "C. There is no significant change.",
+ "D. Global temperatures increase again."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "Based on the sea surface temperature map for one week in September 2014, what can be inferred about the pattern of temperature changes?",
+ "answer": "B",
+ "candidates": [
+ "A. High-temperature waters gradually move toward the center.",
+ "B. High-temperature waters gradually move toward the western coast.",
+ "C. High-temperature waters gradually move toward the eastern coast.",
+ "D. High-temperature waters gradually dissipate."
+ ]
+ }
+ },
+ "VwTQDgiO": {
+ "video_id": "VwTQDgiO",
+ "video_duration": "121s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman with long dark hair, wearing a light blue button up shirt stands in a kitchen. In front of her are five small patterned bowls. The woman moves her hands while talking. The camera zooms in on a bowl containing sliced bananas and white yogurt. The woman is holding the bowl and shows the contents to the camera. The camera shows the woman and the kitchen again. She picks up a blender and opens its lid. The woman uses a spoon to add banana and yogurt mixture to the blender. She then adds more to the blender. She moves the lid of the blender to a closed position, and then holds it and turns it on. The contents of the blender are liquified. The blender is turned off. The woman uses the blender to pour the liquified mixture in the center bowl. The camera shows the woman scooping the mixture from the center bowl into a small bowl. She pours a powdered spice from a glass container over the mixture and the bowls on the sides. She holds a small bowl containing what appears to be the mixture. The woman smiles, holds the bowl up and shows it to the camera.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video, after the blender initially failed to mash the bananas, how many times did the woman stir the mixture with a spoon?",
+ "answer": "B",
+ "candidates": [
+ "A. Nine times.",
+ "B. Ten times.",
+ "C. Twelve times.",
+ "D. Eleven times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did the woman in the video thaw the banana?",
+ "answer": "A",
+ "candidates": [
+ "A. I don't know.",
+ "B. No.",
+ "C. Yes."
+ ]
+ }
+ },
+ "HHGCZvII": {
+ "video_id": "HHGCZvII",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Celebrity Interviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two individuals are seated at a desk with two computer monitors, a keyboard, and other equipment. One of the individuals is wearing a dark beanie, white headphones, and a light-colored long-sleeved shirt. The other person is wearing a baseball cap, and a red and blue jacket. The person wearing the baseball cap is gesturing with their hands. A third individual enters the frame, wearing a dark leather jacket and black pants. They pick up a phone that is sitting on the desk, and they appear to engage with one of the people sitting. The person wearing a long-sleeved shirt stands up and appears to be talking with hand gestures. Another person, wearing a light-colored denim jacket and a baseball cap, walks behind the first person. The first person walks and motions towards the second. After that, both the person in the light-colored denim jacket and the person in the long-sleeved shirt exit the frame.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes does the black man wearing a brown hat in the video display?",
+ "answer": "B",
+ "candidates": [
+ "A. From anger to confidence and then to impatience.",
+ "B. From confidence to impatience and then to anger.",
+ "C. From impatience to confidence and then to anger.",
+ "D. From anger to impatience and then to confidence."
+ ]
+ }
+ },
+ "cfACAMJg": {
+ "video_id": "cfACAMJg",
+ "video_duration": "55s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with glasses wearing a blue suit and light blue striped tie is talking. The scene changes and now shows a man wearing glasses and a black suit sitting and holding papers while talking. To his right a man wearing a blue suit and light blue striped tie is sitting. A man wearing a dark vest and a white shirt is sitting to the right of that man, and to the right of that man a fourth man wearing glasses and a blue suit sits. All four men are sitting on metal stools. In front of each person is a clear plastic bottle with a white cap. The scene changes back to the first man in the blue suit and tie, now facing a man with white hair. The scene changes back to the four men sitting, then again to the man in the blue suit and tie. Then the scene changes to the fourth man who is wearing the glasses and blue suit. He is moving his hands while talking. The scene shifts to a stylized image of a large courtyard area with trees and many people milling around. Another scene is shown of the courtyard and shows some people sitting at tables. Another stylized image is shown with some people walking, sitting, and standing around plant areas. The scene changes back to the four men on stools, then to the white-haired man with papers, then to the same man in the blue suit and tie. The scene changes to the man with the vest holding paper and gesturing, while talking. The scene moves to show all four men on the metal stools in front of an audience. The scene shows a large stylized building, then an exterior view of a building with staircases going up and down, then an interior view of the same staircase, and then another view.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why was this park created?",
+ "answer": "D",
+ "candidates": [
+ "A. Because this was his first stop in the United States.",
+ "B. Because there is an artificial lake here.",
+ "C. Because the air pollution in this area is severe.",
+ "D. Because this area has a large number of young people."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What could be the occupation of the short-haired man wearing a blue suit in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Designer.",
+ "B. Journalist.",
+ "C. Actor.",
+ "D. Host."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many squares featured in the video have advantages?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Four.",
+ "D. Three."
+ ]
+ }
+ },
+ "aQKLDzDr": {
+ "video_id": "aQKLDzDr",
+ "video_duration": "135s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A black tablet displays a blue and yellow globe surrounded by light blue lines extending outward, resembling rays of light. The image changes to a white screen inside the tablet that has the logo \"IBM WATSON\" in the top left corner, and several grey boxes with lines of text below it. A loading bar is displayed below the list of boxes. The screen displays three grey bars with a taller green bar behind them and the text \"Answer #1 77% confid\", and the text \"88%\" above the green bar. The text changes from \"77% confid\" to \"88% confidence\". The screen shows a black stick figure with glasses sitting at a desk and a black coffee mug with steam rising, a white rectangle to their left with the text \"Professionals\" in black letters and a projector hanging down from its corner. The words \"White-collar workers\" appear under \"Professionals\" then \"Low-skill workers\" appears under \"White-collar workers\". The scene changes to a person with short blonde hair wearing a grey and black shirt, with a beaded necklace, in front of a bookcase and a light brown door. The person is moving their hands while speaking. The scene changes to a computer screen with 3 tabs, one displaying the front cover of a book, one a blog post from federicopistono.com and the other a webpage, the blog and webpage contain lists of data. The scene returns to the person from before, who is still gesturing with their hands while speaking. The screen displays a chart with the heading \"Occupation\" on the left and \"Number of Workers\" on the right, listing several job types. The chart shows more job types and numbers in rows. The chart displays more job types and numbers in rows. The scene returns to the person, still gesturing with their hands while speaking. The screen shows the same job title chart as before. The scene changes to a webpage with the heading \"MIT Technology Review\" with the text \"Report Suggests Nearly Half of U.S. 
Jobs Are Vulnerable to Computerization\", the text is highlighted in light blue, and then a line of text in the article is also highlighted light blue. The webpage changes to one with the heading \"Computerizable Jobs\", with a graph showing occupations on the left and a dark blue bar corresponding to each job title on the right, a second table is to the right with the heading \"Bottlenecks to Computers\". The screen changes to the heading \"Bloomberg\" with the text \"Half of Europe's Jobs Threatened by Machines in Echo of U.S. Risk\" in blue. A picture of a person in a grey shirt is in the middle of the page and there is a robotic arm working on a car, several lines of text are highlighted in light blue on the webpage. The scene returns to the person, still moving their hands while speaking. The scene shows the person talking while gesturing with their hands.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "According to the video, which jobs are most likely to be replaced by automation in the future?",
+ "answer": "B",
+ "candidates": [
+ "A. Legal assistant.",
+ "B. Loan officer.",
+ "C. Taxi driver.",
+ "D. Information clerk."
+ ]
+ }
+ },
+ "WydusXAx": {
+ "video_id": "WydusXAx",
+ "video_duration": "117s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+      "video_caption": "The video shows a split-screen of two women competing side-by-side on an obstacle course. Initially, a scoreboard is displayed on the left side of the screen, showing \"Roxanne Edrington\" in first place with a time of 10.49 seconds. On the right side, \"Kelly Loufbourrow\" is shown with a time of 13.00 seconds. Both women are standing on starting platforms. The scoreboard disappears, a timer begins counting down from 1:30, and both women start simultaneously. They run across a series of short padded platforms, then jump and grab a rope swing. Both swing towards a long, suspended, padded beam. Kelly lands on the beam first and begins to move across. Roxanne lands slightly later. They both traverse the beam hand over hand, Kelly maintaining her lead. After crossing the beam, both women drop to the ground, landing on a large padded mat, and the timer stops. The final times are displayed: Roxanne finishes at 28.40 seconds, while Kelly finishes at 28.91 seconds. The video then immediately replays the portion showing them on the rope swing, traversing the beam and dropping to the padded mat, without the scoreboard. During this replay, the timer counts down, showing both women completing this section in around 10 seconds. The replay ends when they both land on the mat.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What is the primary reason for the audience's applause at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. To mock the athletes.",
+ "B. To celebrate the victory of the athlete on the right.",
+ "C. To celebrate the victory of the athlete on the left.",
+ "D. To encourage and motivate the athletes."
+ ]
+ }
+ },
+ "UHMICBNU": {
+ "video_id": "UHMICBNU",
+ "video_duration": "74s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. Then the letters \"EMALS\" appear in white, with various images displayed within each letter. The text \"electromagnetic aircraft launch system\" appears in a smaller font beneath the \"EMALS\" letters. Below this, the text \"voice of Capt. John F. Meier, commanding officer, aircraft carrier Gerald R. Ford (CVN 78)\" appears. A series of blurred, overlapping images are shown. The video then shows a close-up of a metallic track, with two metal plates in the center, and two rows of rectangular black squares on either side of the plates, followed by another close-up of a section of the metallic track. Then, a long shot of a metallic track with red rectangular objects on either side, followed by a person wearing a blue camouflage uniform, a white hard hat, and dark shoes, who is standing on a grey surface and waving with one hand. The text \"INSTALLED\" appears over this. Then, the text changes to \u201cINSTALLED & TESTED\u201d. Then a red rectangular object moves down a metallic track, followed by a long shot of a metallic track with a red object moving along it, while a group of people stand in the background. Next, a gray jet is seen on a runway, with six people in green or yellow vests in front of it. The jet then travels along the runway, kicking up dust. A graph comparing \u201cSteam\u201d and \u201cEMALS\u201d in terms of \u201c29% MORE ENERGY\u201d appears, followed by shots of gray jets on a runway, with people nearby. Text stating \"REDUCE the WEAR & TEAR\" is shown, followed by a jet launching and a view of the ocean, where an object is seen flying through the air above the water. Then, the text \"REVOLUTIONIZE THE WAY THAT WE LAUNCH THE AIRCRAFT TODAY AND INTO THE FUTURE\" appears, followed by more close-ups of the metallic track, followed by footage of various types of aircraft taking off, and the perspective of something moving along a metallic track.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What type of propulsion might the airplane shown in the video use?",
+ "answer": "C",
+ "candidates": [
+ "A. Jet propulsion.",
+ "B. Spiral propulsion.",
+ "C. Fuel-powered engine.",
+ "D. Steam power."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What does the person wearing a white hard hat do in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Prohibits the driver from starting the vehicle.",
+ "B. Instructs the driver to land.",
+ "C. Directs the driver on which direction to proceed.",
+ "D. Commands the driver to start the vehicle."
+ ]
+ }
+ },
+ "ncHhQojO": {
+ "video_id": "ncHhQojO",
+ "video_duration": "261s",
+ "duration": "4-6min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A man in a suit is at a lectern speaking. Text saying \u201cPresidente del Gobierno Rajoy\u201d is at the top left of the image. The man raises his hand and counts on his fingers. He then returns to looking at the documents in front of him. On the right, a yellow square appears with an image of a building and lines above it. The text \u201cClick To Visit \u2018Visual Politik\u2019\u201d is below it. The man in the suit is seen pointing his finger. The yellow square with the building icon is present on the right of the image. The next scene shows a man wearing glasses and a blue patterned shirt speaking. He moves his hands as he speaks. The man in the patterned shirt is seen speaking again. His expression changes from a smile to a more serious one. The scene shifts again to show the man in the patterned shirt speaking, with a more animated expression. Two people appear in a different scene. A woman with short blonde hair and a gray jacket is on the left. A man with glasses is on the right. The man is touching his chin with his hand. The scene returns to the man in the patterned shirt. He is speaking and gesturing with his hands. Another scene appears. A woman is shown behind a Twitter box. The text on the Twitter box is in Spanish. The scene returns to the man in the patterned shirt, speaking and gesturing with his hands. A man with a tie, light-colored shirt and suspenders appears sitting at a desk covered in papers and other objects. The scene transitions to the man in the patterned shirt. He is speaking and gesturing with his hands. A map of Spain is superimposed on his body. An arrow is pointing to the region of Catalonia. A Catalan flag is in the bottom right. The man in the patterned shirt is speaking and gesturing with his hands. A scene with many people holding red and yellow striped flags appears. The scene shifts to the man in the patterned shirt, who is speaking. The scene returns to the man in the patterned shirt and he speaks. Then, a blue square with white text reading \u201cPP\u201d is shown on the left of him. Next, a cartoon of a man with glasses is shown beside the man in the patterned shirt, with the blue \u201cPP\u201d logo. The man in the patterned shirt speaks and gestures with his hands. The man in the patterned shirt is again speaking while the video shifts to a different scene. In this scene, a man with glasses is speaking at a podium, with many red and yellow striped flags behind him. The man in the patterned shirt speaks. The man in the patterned shirt speaks and gestures with his hands. The man in the patterned shirt speaks again with slightly different hand gestures. A man wearing glasses looks directly at the camera. The man rests his chin on his hand. The scene switches to the man in a patterned shirt who is speaking. The man in the patterned shirt speaks again, gesturing with his hands. A scene of many people in the streets appears. The text \u201c2015 Spanish General Election\u201d appears in the top left. The text disappears to reveal a dark blue background. A light blue square with white \u201cPP\u201d text and a red square with white \u201cPSOE\u201d text appear. Two cartoon images of men appear below the logos, with the text \u201cPartido Popular\u201d and \u201cSocialistas\u201d above them. Black \u201cX\u201d marks appear over the logos and the cartoon images. A cartoon of a man with a raised fist is shown on the left with text \"(We can)\" above. A purple square with text \u201cPODEMOS\u201d is shown behind him. In the top right, an image of a man in a suit at a lectern with text \u201cC\u2019s\u201d below, \u201cCiudadanos\u201d below that, and then \u201cCitizens\u201d below that is also present. Red arrows point from the logos to the images. A logo with \u201cC\u2019s\u201d and \u201cCiudadanos\u201d appears next to the man in the patterned shirt. He is speaking and gesturing. The man in the patterned shirt speaks and gestures with his hands. The logo with \u201cC\u2019s\u201d and \u201cCiudadanos\u201d appears next to the man in the patterned shirt. The logo with \u201cC\u2019s\u201d and \u201cCiudadanos\u201d appears next to the man in the patterned shirt again. He is speaking and gesturing with his hands. A purple circle logo with \u201cPODEMOS\u201d under it appears next to the man in the patterned shirt. He is speaking and gesturing. A cartoon image of a man with his fist raised is next to him. A different scene shows the man with a beard, with his fist raised. The purple circle logo with \"PODEMOS\" appears, while a cartoon image of a man with his fist raised is next to the man in a patterned shirt. He speaks and makes hand gestures. The man in the patterned shirt is speaking. He claps his hands together, then makes a wide gesture with his arms. The man in the patterned shirt looks directly into the camera and continues to speak. The logo with \u201cPP\u201d is shown next to the man, and a red square with the text \u201cPSOE\u201d appears next to him on the right. The man in the patterned shirt is shown gesturing while a red square with \u201cPSOE\u201d text and a blue square with \u201cPP\u201d text are next to him. The text is replaced with other party logos. An image of a ballot box with a gold star is shown. A cartoon of a hand is shown placing a piece of paper in the ballot box. The image transitions to a screen with text showing vote percentages. A blue \u201cPP\u201d logo with the text \u201c28.7%\u201d appears on the left. The percentage has additional text below saying \"down 16.3% from 2011\". A red \"PSOE\" logo with the text \"22%\" is on the right with the text \"down 6.8% from 2011\". Below them is a \u201cC\u2019s\u201d logo with \u201c13.9%\u201d text, and a purple logo with text \u201c20.7%\u201d. Underneath those, a photo of a room with many people is shown. The scene returns to a dark blue screen. The man in the patterned shirt is speaking. He makes gestures with his hands as he talks.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What was the narrator's mood during the first half of the video, when elephant pictures were displayed?",
+ "answer": "B",
+ "candidates": [
+ "A. Excited and thrilled.",
+ "B. Angry and agitated.",
+ "C. Sad and regretful.",
+ "D. Calm and neutral."
+ ]
+ }
+ },
+ "PartcFAw": {
+ "video_id": "PartcFAw",
+ "video_duration": "142s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The title card displays the words \"The Countryballs Show\" in black, red, and yellow text. Below the title, a black monitor and a circular character with the white and red flag of Poland with large white eyes and a wide red smile are shown. A grassy landscape with trees in the background is then shown. A circular character with the blue and yellow flag of Ukraine, white eyes and a black moustache, appears. Its mouth is slightly open, as if in shock. A black silhouette of a person riding a bike crosses the top of the screen in the distance. The same scene is repeated several times with different circular or rectangular characters appearing in the foreground at different times, and the black silhouette of the person on the bike crossing the top of the screen each time. A rectangular character with the light blue flag of Kazakhstan with a yellow sun appears. Its mouth opens and closes. A circular character with the red flag of Kyrgyzstan with white eyes appears next to a black smartphone. Two circular characters with the green flags of Turkmenistan and Uzbekistan respectively, white eyes, and open, screaming mouths appear next to each other. The circular character with the flag of Ukraine reappears. The video ends with the circular character of Ukraine blinking with his mouth slightly agape.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many times does the phone appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. One.",
+ "C. Two.",
+ "D. Zero."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the mood of the Ukrainian player change in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Neutral, shocked, puzzled.",
+ "B. Neutral, puzzled, shocked.",
+ "C. Confused, neutral, shocked.",
+ "D. Shocked, neutral, confused."
+ ]
+ }
+ },
+ "ZDntqINX": {
+ "video_id": "ZDntqINX",
+ "video_duration": "117s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video has a red background with silhouettes of guns, martini glasses, women, cars, hats, and a circular symbol. There are various white rectangular panels which appear. The first panel includes text in a small font. The second panel has the large number 007 printed on it. The third panel shows a portrait of a man. The fourth panel is a graphic with images of various products. The fifth panel is a black and white image of a man in what appears to be a jail cell. The sixth panel is an image of two globes with the text \"The Mandela Effect\" over top. The seventh panel is an interior image of a machine with a man in a hard hat standing in front of it. The eighth panel is an abstract graphic with a light blue vertical column and grids on each side. The ninth panel has a Volkswagen logo. The tenth panel has text that reads \"Shared Memory\" and \"Reality\" as well as two versions of the Volkswagen logo. The eleventh panel shows a short video clip of a man in a red jacket. The twelfth panel is a red background with white lines. The thirteenth panel displays a woman with blonde braided pigtails. The fourteenth panel shows a search results page with text and a few images. The fifteenth panel displays multiple comments on a forum. The sixteenth panel shows a different set of comments on a forum. The seventeenth panel displays comments on a forum. The eighteenth panel displays more comments on a forum. The nineteenth panel displays a character description for \"Dolly\" with a picture and the word \"braces\" circled in red.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the narrator in the video feel both uneasy and somewhat excited?",
+ "answer": "A",
+ "candidates": [
+ "A. Because there is disagreement about whether something will happen.",
+ "B. Because there is no disagreement about something.",
+ "C. Because there is interest in whether something will happen.",
+ "D. Because there is doubt about whether something will happen."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "At the beginning of the video, why do some people believe the girl wore braces?",
+ "answer": "B",
+ "candidates": [
+ "A. Because everyone around them was talking about the girl wearing braces.",
+ "B. Because they saw the girl wearing braces in their teens.",
+ "C. Because a newspaper mentioned the girl's braces.",
+ "D. Because the girl in the movie has braces."
+ ]
+ }
+ },
+ "RkwSAmse": {
+ "video_id": "RkwSAmse",
+ "video_duration": "91s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Fitness",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black and gold patterned martial arts outfit stands on a blue mat, holding a sword. They move their arms, then begin a series of movements with the sword that involves lunging and rotating with one leg bent and the other outstretched. They also jump and spin, moving the sword around in a circular motion and extending their arms outwards. There is also a white and red fabric or feather attached to the sword. The person continues this sequence of movements and poses with the sword, before walking off camera. A scorecard appears at the end with numbers and the person's name. The background is a light yellow wall with logos on it.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the woman perform a running motion at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Yes, she does.",
+ "B. I do not know.",
+ "C. None of the above.",
+ "D. No, she does not."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video replay, how many actions did the woman perform?",
+ "answer": "C",
+ "candidates": [
+ "A. Six.",
+ "B. Eight.",
+ "C. Seven.",
+ "D. Five."
+ ]
+ }
+ },
+ "xdjBoZmB": {
+ "video_id": "xdjBoZmB",
+ "video_duration": "334s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with dark hair wearing a dark long-sleeved shirt appears from behind the driver seat of a car, looking towards the camera and smiling. The view of the car's interior is then shown without the person, and a man wearing a blue jacket and with dark hair walks up to the driver side of the car, gets in, and closes the door. The man remains seated for a moment, and then lifts the car's sun visor up and adjusts it. He moves his head and looks towards the passenger's side of the vehicle, turns to look forward, and then looks at the driver's side mirror. He talks, then touches his head and puts his head in his hands. Then he lifts the sun visor and adjusts it again. The man then looks and speaks towards the passenger side of the vehicle, and gestures with his hand. He reaches over the passenger seat, and then takes his jacket off and puts it on the seat behind him. He is now wearing a dark t-shirt. He then puts his hand on the steering wheel and looks toward the passenger's side. He then continues to speak and uses a variety of hand gestures while doing so, while looking to the passenger seat. The man shifts positions in the driver seat. Someone places a hand up from the passenger side. They gesture, then move to put a hand on his face. Another hand also comes to touch his face. They wave at him, and then grab his arm. The man laughs and then shifts, and begins to speak and gesture again. The man takes his jacket off the back seat and shifts to get his arm through it, and shifts to put the jacket on. He appears to adjust his shirt, then looks and talks towards the passenger seat again. The scene then ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How did the man's mood change after hearing that the woman was not pregnant?",
+ "answer": "D",
+ "candidates": [
+ "A. From happiness to anger.",
+ "B. From anticipation to disappointment.",
+ "C. From happiness to disappointment.",
+ "D. From tension to relaxation."
+ ]
+ }
+ },
+ "oSmjZCEK": {
+ "video_id": "oSmjZCEK",
+ "video_duration": "188s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. A partially visible gray object with black dots on its surface enters the frame from the left. Then a circular dark object surrounded by a sandy colored surface comes into view. The circular object appears to descend towards the lower-left, while the sandy surface becomes more visible, eventually showing cloud cover moving from right to left. The surface becomes clearer, and a sandy landscape appears, gradually becoming the primary focus. The landscape has various indentations and is eventually covered by a dark, cloudlike covering. At the end of the video, the screen is dark.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Based on the video, is the spacecraft's heat shield still functional?",
+ "answer": "A",
+ "candidates": [
+ "A. The heat shield is not functional.",
+ "B. The framework is functional.",
+ "C. The components are functional.",
+ "D. The heat shield is still functional."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the spacecraft's motion in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It is descending counterclockwise.",
+ "B. It is descending in a straight line.",
+ "C. It is ascending counterclockwise.",
+ "D. It is descending clockwise."
+ ]
+ }
+ },
+ "HneKjDJf": {
+ "video_id": "HneKjDJf",
+ "video_duration": "98s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with long dark hair and glasses sits holding an orange acoustic guitar. They are wearing a light blue t-shirt. The individual strums the guitar and sings with their head slightly tilted down. Occasionally, they look up and smile while continuing to play and sing. At one point, they briefly stick their tongue out playfully. The background of the room is slightly blurred, featuring a dark cabinet, shelves, wall decorations, and a light fixture. The video ends with a black screen that displays the social media handle \"@JOeGRAFIA\" in gold lettering, followed by a smiley face \":)\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What method does the man use to play the guitar in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Strumming.",
+ "B. Plucking the strings.",
+ "C. Fingerpicking.",
+ "D. Strumming with a pick."
+ ]
+ }
+ },
+ "frjYQjOw": {
+ "video_id": "frjYQjOw",
+ "video_duration": "228s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows several scenes featuring multiple women, most frequently in red dresses with a black hat and sunglasses, sometimes wearing white pearl necklaces. They are seen walking in lines, dancing on a stage, striking poses and singing into a microphone. One woman in a red dress is also shown in a series of interview settings speaking directly to the camera and gesticulating. There are also group photos shown of women in various clothing. There are several shots of a large balloon with a red elephant image. The end of the video includes a shot of many women on a stage, with the majority in red dresses and holding a pose with their hands in a triangle shape. There is a small logo that reads \"Sister Circle Live\" in the bottom left of most of the scenes. In several scenes, people are shown holding phones, presumably taking photos or videos.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action did the women in red dresses perform when entering the scene?",
+ "answer": "A",
+ "candidates": [
+ "A. Placed hands on shoulders.",
+ "B. Held hands.",
+ "C. Clapped their hands.",
+ "D. Raised both hands."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What does the narrator do with the hat in the latter part of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Gives it to the person next to them.",
+ "B. Adjusts the hat's position.",
+ "C. Takes it off and throws it to the audience.",
+ "D. Takes it off and puts it back on."
+ ]
+ }
+ },
+ "ZtpGLXyM": {
+ "video_id": "ZtpGLXyM",
+ "video_duration": "125s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Fitness",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black and gold outfit stands on a blue mat. They begin to perform a series of movements, holding a long metal object. The person moves across the mat, jumping and kneeling, continuing their routine with the metal object. They kick the object, dropping it to the floor. The person continues the routine, spinning and moving the object, then ending with a series of movements while squatting and lunging. The person stands up and moves towards a different area in the arena, then stops to look forward with their arms crossed. A screen appears showing a score of 9.110.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man with the knife in the video perform a jumping kick and a squatting motion?",
+ "answer": "A",
+ "candidates": [
+ "A. Four times.",
+ "B. Three times.",
+ "C. One time.",
+ "D. Two times."
+ ]
+ }
+ },
+ "YUwaAuCO": {
+ "video_id": "YUwaAuCO",
+ "video_duration": "540s",
+ "duration": ">8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a shot of a pink and yellow sunset sky. The camera pans down to reveal the Toronto skyline. A billboard appears in front of the city, initially blank, then displaying the words \"Dream\" and \"Abroad,\" separated by a small airplane icon. A location pin is displayed next to the word \"Abroad.\" The scene transitions to a man with a beard wearing a dark tie-dye hoodie with \"Hollister\" written across the chest. He sits in front of a wall with several framed prints and a tall floor lamp. A potted plant is visible to the right. A small shelf with plants in pots sits above and behind his left shoulder. An Instagram logo appears to the left of the speaker. It then transforms into an Instagram profile for \"DreamersAbroad,\" featuring a profile picture of a couple. Various news headlines appear below the speaker. The headlines relate to changes to the H-1B visa program and immigration policies. A still image of former president Donald Trump holding up a signed document appears briefly. Later, a still image of Joe Biden at a podium with \"Office of the President Elect\" in the background is displayed. There is a period where a section of a document is displayed, highlighting information about a final rule regarding H-1B visas. Periodically, subscription prompts appear on screen. First, an Instagram \u201cFollow\u201d button shows up underneath the \u201cDreamersAbroad\u201d profile. Towards the end of the video, a YouTube subscription prompt with a thumbs-up icon, a red \"SUBSCRIBE\" button, and a notification bell appears below the speaker. A mouse pointer clicks the \"SUBSCRIBE\" button and the bell icon, turning both blue and changing the button text to \"SUBSCRIBED.\" Another subscription prompt with the YouTube logo, a \"SUBSCRIBE\" button, and a notification bell then appears. A mouse pointer clicks the \"SUBSCRIBE\" button and the notification bell. The words \"Like,\" \"Comment,\" and \"Share,\" along with their corresponding icons, then appear below the prompt. The video ends with a black screen with a blue shape and the words \"SUBSCRIBE FOR MORE VIDEOS LIKE THIS\" and \"Watch related videos-\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the video mention the H1B visa?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. I am not sure."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "When was the video posted that shows \"Trump Extends Pandemic Immigration Bans Through March\"?",
+ "answer": "B",
+ "candidates": [
+ "A. 2021/11/1",
+ "B. 2020/10/1",
+ "C. 2020/11/1",
+ "D. 2021/10/1"
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video is \"Biden pledged to change immigration\" mentioned?",
+ "answer": "A",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. At the end of the video.",
+ "C. In the middle of the video.",
+ "D. Throughout the entire video."
+ ]
+ }
+ },
+ "xPBlkeSA": {
+ "video_id": "xPBlkeSA",
+ "video_duration": "19s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video starts with a black screen. Then the words \"Welcome to Las Vegas\" appear in gold letters above a city skyline at night. A beam of blue light is visible above a pyramid-shaped structure in the foreground. The view zooms out slightly and a banner that says \"Westgate Las Vegas Resort & Casino\" appears. Then two gold-colored buildings come into view at the left and right of the image, with the words \"February 29th\" above the building on the right. The words \"One Night Only\" appear in gold. Then, \"Angelina Jordan\" appears, followed by \"The Barefoot Princess,\" and finally \"The Queen of Jazz,\" in a similar style.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is the entire triangular building shown in the video illuminated?",
+ "answer": "B",
+ "candidates": [
+ "A. It cannot be determined.",
+ "B. Only the top is illuminated.",
+ "C. The entire building is illuminated.",
+ "D. Half of it is illuminated."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What type of background music is playing in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Jazz music.",
+ "B. Rock music.",
+ "C. Classical music.",
+ "D. Pop music."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the name of the hotel that appears at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Aria Resort Hotel.",
+ "B. Sigert Hotel.",
+ "C. Bellagio Hotel.",
+ "D. Wynn Hotel."
+ ]
+ }
+ },
+ "WONAEPSw": {
+ "video_id": "WONAEPSw",
+ "video_duration": "34s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Biology",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a shot of several plants with long, pointed, green and yellow leaves in a garden setting. Text overlays the bottom of the screen. A kaleidoscope effect is then applied to the video, with images of a bee-like insect on the left side and a white and brown flower on the right, mirrored on a vertical axis. The kaleidoscope effect disappears. Then, a single bee-like insect is shown flying towards a white and brown flower. Text overlays the bottom of the screen. The insect is shown close up hovering near the flower. The shot cuts to a cluster of yellow flowers. A small insect is on one of them. Text overlays the bottom of the screen. The shot cuts to an insect on a bright yellow flower, then to an insect on a white flower with green leaves. The camera focuses on the insect as it moves on the flower. The shot cuts and applies a kaleidoscopic filter to the image. The filter disappears, and the camera focuses on a white and pink flower with 2 bee-like insects on it. The camera focuses on the insects on the flower. Text overlays the bottom of the screen. A bee-like insect is shown on a yellow flower. The insect moves on the flower. Text overlays the bottom of the screen.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "According to the video, what factor resulted in the plant population exhibiting greater strength and resilience?",
+ "answer": "A",
+ "candidates": [
+ "A. Genetic mutations resulting from pollination.",
+ "B. Human intervention.",
+ "C. Genetic recombination.",
+ "D. Chromosomal variation."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Where in the video do rapeseed flowers appear?",
+ "answer": "C",
+ "candidates": [
+ "A. Rapeseed flowers appear at the beginning of the video.",
+ "B. Rapeseed flowers appear at the end of the video.",
+ "C. Rapeseed flowers appear in the middle of the video.",
+ "D. Rapeseed flowers appear throughout the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many different colors of flowers are shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. One.",
+ "D. Four."
+ ]
+ }
+ },
+ "CJhFZWte": {
+ "video_id": "CJhFZWte",
+ "video_duration": "482s",
+ "duration": ">8min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man sits in front of a display case of Yu-Gi-Oh! cards and eight booster packs of *Gladiator's Assault*. He holds up various cards to the camera, including a \"Cloudian - Ghost Fog\", \"Chamberlain of the Six Samurai\", a face-down card, \"Detonator Circle A\", \"Golden Warrior\", another face-down card, and \"Infinity Dark\". He picks up a booster pack, opens it, and begins to remove cards, continuing this process for the remaining packs. He reveals various cards from the packs.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the value of the ultra-rare card revealed at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. $0.38.",
+ "B. $0.64.",
+ "C. $1.44.",
+ "D. $2.29."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is the Patreon platform present in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Not present.",
+ "B. I am not sure.",
+ "C. Present."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times is the card package opened in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Thirty times.",
+ "B. Fifty times.",
+ "C. Forty times.",
+ "D. Twenty times."
+ ]
+ }
+ },
+ "IqEFjWJh": {
+ "video_id": "IqEFjWJh",
+ "video_duration": "84s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with short auburn hair and a blue cardigan over a tan top is speaking to someone whose back is to the camera. The camera cuts to show a building with white columns and large windows, with an American flag waving on the front lawn and cars in front of it. Text that reads \"Tuesday\" and \"Martes\" appears in the middle of the shot. The camera cuts to show a classroom, where the woman in the blue cardigan walks forward toward a man and woman holding hands and standing together in the classroom. The woman in the cardigan is holding a book. The man waves. The man and woman walk toward the woman in the cardigan. The camera focuses on the couple. The camera moves back to the woman in the cardigan. The camera cuts to show the couple, then the woman in the cardigan. The camera shows the couple again, who are now sitting down in two chairs. The camera cuts to the woman in the cardigan, then back to the couple. The woman in the cardigan speaks, then the camera shows the man and woman. The man is laughing and then the woman is speaking. The woman in the cardigan speaks again. She mentions a test and that one of the questions was \"2 + 2\". She says \"Danny\" answered 22. The camera focuses on her again. The camera cuts back to the couple. The man and woman are smiling, then the camera cuts to the woman in the cardigan again. She says \"That's not right.\" The woman in the cardigan continues to speak. The woman and man look at each other and then back to the woman in the cardigan. She continues to speak. The woman with the cardigan says \"Who says so? Math says so.\" The woman from the couple says \"Are you calling my son stupid?\" The woman with the cardigan replies: \"No, of course not!\" The man from the couple says \"Who are you to say his answer is wrong?\" The woman from the couple says \"She's right.\" The camera returns to the woman in the cardigan. She says \"Thank you.\" The woman from the couple adds \"...If we were in Nazi Germany.\" The woman in the cardigan says, \"You're not being serious\" and adds \"You know very well what 2+2 is.\" The man from the couple says, \"So you've got it all figured out?\" then says \"You know it all.\" The woman in the cardigan exclaims \"Ahhhhh\". The couple yells \"Ahhhhh\" and the man asks \"What's wrong with her, is she retarded?\". The woman in the cardigan says \"I'm sorry but I can't continue with this conversation.\" The man states \"We're going to complain about you to the principal.\" The woman in the cardigan replies, \"Danny is a free thinker\". The woman from the couple approaches the woman with the cardigan and says \"I'll take your job.\" The woman in the cardigan says \"No, not if you can't even add 2+2.\"",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, what is the positional relationship between the woman with long hair and the woman with short hair?",
+ "answer": "D",
+ "candidates": [
+ "A. The woman with long hair is seated behind the woman with short hair.",
+ "B. The woman with long hair is seated to the right of the woman with short hair.",
+ "C. The woman with long hair is seated to the left of the woman with short hair.",
+ "D. The woman with long hair is seated across from the woman with short hair."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the boy who incorrectly answered the math problem in the video and the short-haired woman?",
+ "answer": "A",
+ "candidates": [
+ "A. They are teacher and student.",
+ "B. They are supervisor and subordinate.",
+ "C. They are friends.",
+ "D. They are colleagues."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What would cause a long-haired woman to hit a short-haired woman?",
+ "answer": "D",
+ "candidates": [
+ "A. A short-haired woman hits a long-haired woman first.",
+ "B. A short-haired woman insults a long-haired woman.",
+ "C. A short-haired woman spits on a long-haired woman.",
+ "D. A short-haired woman mocks a long-haired woman."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "How does the man's emotional state change throughout the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Disappointed, excited, calm.",
+ "B. Calm, disappointed, happy.",
+ "C. Happy, calm, excited.",
+ "D. Calm, excited, angry."
+ ]
+ },
+ "task4": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What does the long-haired woman do next after hitting the short-haired woman in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Go to the principal's office to drop out of school.",
+ "B. Go to the principal's office to apologize.",
+ "C. Go to the principal's office to deliver a letter of commendation.",
+ "D. Go to the principal's office to complain."
+ ]
+ }
+ },
+ "sAjQHCif": {
+ "video_id": "sAjQHCif",
+ "video_duration": "117s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man wearing a short-sleeved tan shirt, gray shorts, and dark sunglasses stands on a boat. He gestures with his hands and then picks up a wooden stick. He reaches into a compartment and begins to use the stick to apply white paint onto the floor of the boat. He continues to apply paint with the stick and paint roller. Then the camera shows him holding a can of paint, speaking, and gesturing with his other hand. A blue text overlay appears with the words \u201cWWW.TUFFCOATMARINE.COM\u201d.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is a drum brush shown in the middle of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "After applying the coating in the video, was the blue can opened?",
+ "answer": "A",
+ "candidates": [
+ "A. No.",
+ "B. I am not sure.",
+ "C. Yes."
+ ]
+ }
+ },
+ "UApNafiJ": {
+ "video_id": "UApNafiJ",
+ "video_duration": "77s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two robots, one resembling a shopping cart and the other with a lifting mechanism, compete on a split field, one side red and the other blue, separated by a ramp in the center. Blue bins are stacked on a raised platform in the center, and smaller bins are scattered on both sides. The scoreboard displays \u201cAutonomous Mode\u201d.\u00a0 The robots move forward; the lifter-bot reaches the center platform first and knocks off a stack of bins. The cart-bot bumps into the fallen bins. \u201cAutonomous Mode\u201d switches to \u201cRemote Control Mode.\u201d The cart-bot is entangled with bins. The lifter-bot continues collecting and depositing bins on its side of the field. The cart-bot manages to move slightly, dragging some bins, but mostly stays stationary, wedged in place. The lifter-bot continues its task, its score increasing. The lifter-bot uses its arm to deposit bins, and eventually drives onto the central ramp. It remains there as the match ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unexpected event is depicted in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Robot number 45 loses communication.",
+ "B. Robot number 65 loses communication.",
+ "C. Robot number 292 loses communication.",
+ "D. Robot number 16 loses communication."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "What occurred when the horn sounded?",
+ "answer": "D",
+ "candidates": [
+ "A. The blue team's robot pushed most of the boxes into the red team's area.",
+ "B. The red team's robot pushed most of the boxes into the blue team's area.",
+ "C. The red team's robot pushed most of the boxes into the red team's area.",
+ "D. The blue team's robot pushed most of the boxes into the blue team's area."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What characteristics can be inferred about Robot 292 from the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It has very heavy mass.",
+ "B. It has very tall guardrails.",
+ "C. It has high power.",
+ "D. It has very fast speed."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How long after the competition began did all the robots stop moving?",
+ "answer": "D",
+ "candidates": [
+ "A. At 17 seconds after the competition began.",
+ "B. At 5 seconds after the competition began.",
+ "C. At 3 seconds after the competition began.",
+ "D. At 10 seconds after the competition began."
+ ]
+ }
+ },
+ "FumqOnyk": {
+ "video_id": "FumqOnyk",
+ "video_duration": "39s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a soccer game in progress. There are two teams, one in red uniforms and one in yellow and green uniforms. The players are running on a green grass field. At different points in the video, different players from both teams have the ball. There is a white soccer goal at either end of the field with a net on each of them, and a player in blue is near the net acting as a goalie. There is a running track encircling the field. In the background, there are rows of seats which are predominantly yellow and blue. At the start, there are fewer people visible in the stands. Later on, they are more populated. A few times, one of the players has a shot towards the goal. In one case the goalie blocks the shot, in another the player on the yellow and green team scores. In one section, the camera zooms in closer on the player dribbling the ball towards the goal. In several shots, there are additional people visible near the field who aren't players, and are wearing different colored clothing. In one instance, a person is sitting on the ground and appears to be filming something. There are also orange signs along the track with white lettering on them.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Which team does the goalkeeper wearing a blue uniform belong to in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The Freedom Team.",
+ "B. I am not sure.",
+ "C. The Dwarf Team."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual scenes were shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A person in a red jersey knocked over a person in a yellow jersey.",
+ "B. A person in a blue jersey knocked over a person in a red jersey.",
+ "C. A person in a yellow jersey knocked over a person in a red jersey.",
+ "D. A person in a yellow jersey knocked over a person in a blue jersey."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the goalkeeper in pink do with the ball in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Caught the ball.",
+ "B. Kicked the ball back onto the field.",
+ "C. Kicked the ball into the goal.",
+ "D. Kicked the ball outside the goal."
+ ]
+ }
+ },
+ "fKbkMuLU": {
+ "video_id": "fKbkMuLU",
+ "video_duration": "36s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Soccer",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a football game taking place on a green field. Players in teal and white uniforms are on one team, and players in white and red uniforms are on the opposing team. A referee in a black and white striped shirt is shown with his arms extended sideways. The score is displayed on a banner at the bottom of the screen. The camera view shifts to different angles throughout the video. Players run around, and at one point, a player on the white and red team attempts to catch the ball, but appears to drop it. The video ends with a player in teal walking off the field while someone places a medical bag over his shoulder. A person in grey clothing carries blue rectangular objects as they walk away.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does player number 17 lose control of the ball in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 3.",
+ "B. 2.",
+ "C. 0.",
+ "D. 1."
+ ]
+ }
+ },
+ "rroySvPr": {
+ "video_id": "rroySvPr",
+ "video_duration": "33s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a light gray background and three gauges. Each gauge is a semicircle with a black needle pointing to a value, and has color coded areas for \"POOR\", \"LOW\", \"FAIR\", \"GOOD\", and \"EXCELLENT\". In the center of the frame, the text \"Credit Score\" appears below the middle gauge. Below that is text saying \"Here's how a bank could use the Toolkit:\".Next, there is a wireframe cube with blue dots on the corners and rotating gears in the center of each face. To the right of the cube is a blue speech bubble containing the text \"Why was my loan application rejected?\" with \"Customer\" below it. Then another speech bubble appears below the first one, this one saying \"How can I have confidence in the AI's recommendations?\" with \"Loan Officer\" below it. Then a third speech bubble appears, saying \"How can I debug and improve my machine learning models?\" with \"Data scientist\" below it. Below these speech bubbles is the text, \"Different audiences require different kinds of explanation.\" Below that the text \"When a customer's loan application is rejected\" appears.The scene changes to show a cartoon person with red hair and a yellow shirt with their hands crossed in front of them. To their right a hand appears holding a document with a red stamp that says \"REJECTED\". Text appears below that says \"a bank's data scientist can use an algorithm from the Toolkit\".The scene changes again to show a hand holding a tablet. The tablet shows several graphs with a table of data appearing next to the tablet on the right. The text \"a bank's data scientist can use an algorithm from the Toolkit\" remains, with additional text below that saying \"to check whether the AI is properly weighing risk factors.\"Then another scene appears showing a cartoon person sitting at a desk with a laptop. To their left is a table of data. Some of the rows in the table are highlighted in blue. 
The text \"Meanwhile, another algorithm can help a loan officer\" appears below the table, followed by the text \"analyze whether customers with similar scores repaid or defaulted on their loans.\"A final scene shows a cartoon person with red hair and a yellow shirt sitting at a desk with a laptop. To their right is a menu with three cartoon people icons at the top and various text. Below the screen is the text \"A customer service tool built using a different explainability algorithm\" followed by the text \"could explain risk factors and prioritization\" and \"and suggest how to improve the customer's credit score.\"",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How many people with similar credit scores to the loan users did the loan officer analyze?",
+ "answer": "D",
+ "candidates": [
+ "A. Five.",
+ "B. Two.",
+ "C. Four.",
+ "D. Three."
+ ]
+ }
+ },
+ "yUoZdCvl": {
+ "video_id": "yUoZdCvl",
+ "video_duration": "135s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A man with glasses and a black shirt is facing the camera. He picks up a light blue, plush bunny-ear headband and puts it on.\u00a0The scene cuts to two red gummy candies shaped like bears.\u00a0The man reappears on screen, still wearing the headband.The scene cuts to an image of a green box against a green background. The box features images of small characters, and identical versions of these characters surround the box as enamel pins. The title on the box says \"Five Nights at Freddy's POPGOES Collector's Edition\" and \"YouTooz\" is visible on the bottom left corner.\u00a0The man reappears, still with the headband.\u00a0A short clip of a boy in a blue shirt running and then facing the camera follows, with the words \"Wait a minute, who are you?\" overlaid on the clip.Next, a clip shows a white, plush, humanoid toy with a party hat, a red nose, and mismatched eyes - one closed in a wink, the other covered by a bandage.The scene transitions back and forth between close-ups and a full view of the plush toy.A split screen follows, one side showing a partial view of the man\u2019s arm and, in the background, several boxes seemingly containing a green character. On the other side is a box with the word \"DREADBEAR\" on top and a cartoon bear figure below it.The split screen then switches to two different plush toys on a shelf. The first toy is a tan rabbit with purple details, and the second one is a small white blob with floppy ears, blush, and glasses.\u00a0The man reappears on screen, still wearing the headband.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where is the man wearing black clothing located in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. He is presenting toys in a shopping mall.",
+ "B. He is playing with toys at an amusement park.",
+ "C. He is presenting toys at home.",
+ "D. He is providing a product introduction in a factory."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the most likely next purchase for the man wearing black clothing?",
+ "answer": "B",
+ "candidates": [
+ "A. A Help Wanted plushie.",
+ "B. A Glitchtrap doll.",
+ "C. A Dreadbear plush toy.",
+ "D. A badge."
+ ]
+ }
+ },
+ "gnXmJAIJ": {
+ "video_id": "gnXmJAIJ",
+ "video_duration": "504s",
+ "duration": ">8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A man with short brown hair, wearing a dark gray suit jacket and a light blue and white vertically striped collared shirt, sits in front of a blurred image of a clock tower. He speaks to an unseen person off-camera. In the bottom right corner is a watermark that reads \u201cUNIVERSITY of OTAGO.\u201dThe video then cuts to a wider shot, which shows two men sitting in chairs on a small stage, facing an audience of approximately 20 people, seated in rows of chairs facing the stage. Between the two men on stage are two cups and what appears to be a small book. Behind them is a large backdrop image of a university campus, with the same clock tower from the previous shot more prominently visible. A red-haired woman stands off to the side, to the right of the stage, facing the seated men.\u00a0The video then cuts back to the first man, again shown from the shoulders up and speaking to an unseen person off-camera. The clock tower background remains the same.The man with glasses in the second shot speaks.The video then cuts back to the first man, who continues to speak.The second man then speaks again.The video returns to the first man.The second man speaks.\u00a0The video cuts back to the first man, who touches his right hand to his chest as he speaks.The second man, holding a coffee cup in his right hand, speaks.The video returns to the first man.The second man speaks.The video shows the first man rubbing his nose with his right hand before resuming speaking.The second man speaks.The video cuts back to the first man.The audience is shown watching two men sitting in chairs on a stage. One of them has a disposable coffee cup in hand. To the left of the audience is another man with a video camera. 
He pans the camera leftward and then back to the right, to face the men on stage.\u00a0A large wall screen can be seen to the left, with writing displayed on it.A young woman with red hair and glasses, wearing a black cardigan over a white blouse with a floral-patterned scarf, speaks.\u00a0The video returns to the first man. He raises his right hand from his lap to gesture with it as he speaks.\u00a0The video cuts to the second man, holding the disposable coffee cup in his right hand as he speaks.The video returns to the first man, who continues to speak and use hand gestures.The video then cuts to the second man speaking and gesturing.\u00a0The video cuts back to the first man, who occasionally gestures with his hands and looks downward briefly.The video cuts to the second man speaking and gesturing with his right hand.The video returns to the first man.The video cuts back to the second man, speaking and gesturing.\u00a0The video cuts back to the first man.The audience, camera operator, and screen are again shown from the side.\u00a0The video cuts to the woman with red hair, who speaks, using small hand gestures.The first man speaks, gesturing.The video then returns to the red-haired woman, who speaks and gestures.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why are many voters undecided?",
+ "answer": "C",
+ "candidates": [
+ "A. They are taking a wait-and-see approach regarding the new government's capabilities.",
+ "B. They have doubts about the government's credibility.",
+ "C. They have doubts about the government's ability.",
+ "D. They believe all governments are equally capable."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who recently stated, \"I cannot underestimate the severity of what is happening in Tauranga?\"",
+ "answer": "D",
+ "candidates": [
+ "A. A man wearing a black suit.",
+ "B. A Black person wearing a black shirt.",
+ "C. A Black person wearing a gray hoodie.",
+ "D. A man wearing a gray suit."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What are the potential consequences following an oil spill?",
+ "answer": "D",
+ "candidates": [
+ "A. The chamber of commerce is dissolved.",
+ "B. The chamber of commerce increases prices.",
+ "C. The chamber of commerce suspends operations.",
+ "D. The chamber of commerce continues normal operations."
+ ]
+ }
+ },
+ "qRnAbFZc": {
+ "video_id": "qRnAbFZc",
+ "video_duration": "126s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Book Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person is in the center of the frame, wearing a maroon, long-sleeved shirt with white stripes across the chest. A jar with a white lid is held in one of their hands, then a book titled \"Ready Player One\" is held in one of their hands. Their hands are frequently raised as they talk. The person is positioned in front of a bookshelf filled with books, figurines and various jars. There is also a bed, visible behind them, as well as string lights.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the occupation of the woman in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Merchant.",
+ "B. Beauty blogger.",
+ "C. Writer.",
+ "D. Film and television actress."
+ ]
+ }
+ },
+ "LjisVSMY": {
+ "video_id": "LjisVSMY",
+ "video_duration": "403s",
+ "duration": "6-8min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two men are in a boxing ring, both wearing red boxing gloves. One man is wearing red shorts, the other is wearing animal print shorts. A man in a white shirt and black pants is also in the ring, acting as the referee. The men throw punches at each other throughout the video. The referee intervenes and stops the fight several times. The video shows shots of the men\u2019s corners and of the crowd. One man\u2019s trainer has a green shirt. One man raises his hands in the air at the end of the fight.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video, how many times does the man wearing red shorts fall?",
+ "answer": "C",
+ "candidates": [
+ "A. Four times.",
+ "B. Three times.",
+ "C. Twice.",
+ "D. Once."
+ ]
+ }
+ },
+ "jyfioJBq": {
+ "video_id": "jyfioJBq",
+ "video_duration": "116s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A machine with a moving arm and drawing implement is drawing on a piece of white paper. The machine draws a dashed line first. The machine then draws the shape of a safety pin next, followed by the outline of a pair of underpants with a polka dot pattern on it. Then the machine draws a picture of a cartoon bird holding a candle and a tool. The text 'HackadayPrize2016' is displayed below the drawing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the printer appear to print in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 10000.",
+ "B. 11000.",
+ "C. 13000.",
+ "D. 12000."
+ ]
+ }
+ },
+ "uicnrvxy": {
+ "video_id": "uicnrvxy",
+ "video_duration": "367s",
+ "duration": "6-8min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a first person perspective of a video game with a person in the top left corner. The video game features a player in a street like setting. The player appears to shoot other players in the game with different weapons. Many yellow markers show up in the game. A player character with a green suit and backpack appears. The game includes text such as, \u201cCAPTURED,\u201d \u201cRELOADING,\u201d \u201cNORMAL KILL,\u201d and \u201cUAV Assist.\u201d The game also shows various locations on a small map in the top right corner of the screen. The video ends with the word \u201cVictory\u201d in large orange letters and a list of names.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What type of firearm is the game character holding in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A machine gun.",
+ "B. A pistol.",
+ "C. A sniper rifle.",
+ "D. A rifle."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What can be inferred about the broadcaster\u2019s rank in the game?",
+ "answer": "A",
+ "candidates": [
+ "A. God of War Rank.",
+ "B. Platinum Division.",
+ "C. Gold Division.",
+ "D. Diamond Rank."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the game character throw the grenade?",
+ "answer": "A",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. At the start of the video.",
+ "C. In the middle of the video.",
+ "D. Throughout the entire video."
+ ]
+ }
+ },
+ "iAKQvJrs": {
+ "video_id": "iAKQvJrs",
+ "video_duration": "128s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person stands at a table with a green cutting mat, paint and art supplies on it. The person touches a large plastic jug, and then touches something at their wrist. They move the plastic jug from the table to the ground. The person touches the table with a paint brush. Then, the person places a framed wooden panel on the table, and picks up a piece of paper and places it on the wooden panel. The person then places the artwork on top of the paper. The camera shifts to show the person measuring the paper on a grid mat, and cutting away excess paper with a utility knife. After the paper is cut, the artwork is placed on the paper. The person places white strips of material on the edges of the canvas, using their hands to push down the strips, and then removes the excess strips with a paint scraper. The video transitions to a black screen with text. A logo appears on a black background.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the scraps cut by the woman relative to the artwork?",
+ "answer": "A",
+ "candidates": [
+ "A. On the right side of the painting.",
+ "B. On the left side of the painting.",
+ "C. Above the painting.",
+ "D. Below the painting."
+ ]
+ }
+ },
+ "xgIzdIRy": {
+ "video_id": "xgIzdIRy",
+ "video_duration": "34s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A young man with shoulder-length blond hair sits in the driver's seat of a car. He wears a white baseball cap and a gray hoodie featuring a Polo Bear graphic. He adjusts his hat and rubs his eye. He picks up a small Pizza Hut box and presents it to the camera, opening it slightly to reveal a single slice of cheese pizza inside. He brings the box closer to his face and inhales the aroma. He then opens the box fully, revealing the whole slice. He closes the box and sets it down, before picking up a larger Pizza Hut box. He opens it to reveal a whole cheese personal pan pizza. He closes the box, rotates it, and opens it again from the side. He tips the box towards the camera, then sets it down. He looks towards the passenger seat and reaches for something, revealing a bag of frozen strawberries, a bag of red grapes, a banana, and an apple on the seat. He brings the larger pizza box back up, inhales the aroma again, then puts it aside. He then adjusts his ear and leans down towards the footwell area, out of view. He comes back into view, picks up something from the footwell, and places it on the dashboard on the passenger side. He leans down towards the footwell again.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the food's position relative to the man in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The food is in front of the man.",
+ "B. The food is to the man's left.",
+ "C. The food is to the man's right.",
+ "D. The food is behind the man."
+ ]
+ }
+ },
+ "ZikbEtrf": {
+ "video_id": "ZikbEtrf",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with dark curly hair is speaking and holding a microphone while standing in what appears to be a space station. There are various pieces of equipment visible, including some panels lit with pink light. Text in Italian is displayed at the bottom of the screen. The video then cuts to a person with long blonde hair, wearing a blue jumpsuit and seated against a black background. They are speaking and gesturing with their hands. The video then cuts back to the person with dark curly hair, speaking and holding a microphone in the space station. It cuts to another person in a white space suit, floating inside the space station. The video then cuts back to the person with blonde hair speaking against a black background. It then cuts to a view of Earth from space with a satellite component visible on the left side of the screen. The video continues to show this view of Earth from space with the satellite component. It then cuts to the person with blonde hair against a black background, who is speaking and gesturing with their hands. Then, a flat, disc-shaped animation of Earth with a sky above it is shown. The video cuts back to the person with dark curly hair in the space station. This sequence is followed by a view of Earth from space with a satellite component. Then, there is a closer view of Earth, without the satellite component, followed by the same Earth view again. Then the view zooms in on Earth, and then it appears to slowly rotate. The Italian captions are present throughout the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video contain a map?",
+ "answer": "B",
+ "candidates": [
+ "A. It is unclear whether a map is present.",
+ "B. No, a map is not present.",
+ "C. Yes, a map is present."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "According to the man in the video, what object is impossible to observe from a near-Earth orbit perspective?",
+ "answer": "D",
+ "candidates": [
+ "A. The Moon.",
+ "B. The universe.",
+ "C. The Equator.",
+ "D. The Earth."
+ ]
+ }
+ },
+ "qfMumTLA": {
+ "video_id": "qfMumTLA",
+ "video_duration": "42s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman and a girl are in a kitchen, preparing a soup. They begin by adding chopped onions and leeks to a pot on a portable burner. Next, chopped potatoes are added to the pot using a wooden cutting board and spatula. The woman then adds what appears to be salt to the pot, followed by a large amount of broth. The girl stirs the pot's contents with a wooden spoon.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the little girl in the video add ingredients to the pot?",
+ "answer": "D",
+ "candidates": [
+ "A. Four times.",
+ "B. Five times.",
+ "C. Two times.",
+ "D. Three times."
+ ]
+ }
+ },
+ "OXokWgFQ": {
+ "video_id": "OXokWgFQ",
+ "video_duration": "136s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video begins with a black screen, and then the text \"Mat4yo - Good Looks Cover\" appears. After this, a person with short brown hair wearing a grey and black striped jacket over a black t-shirt with a white graphic appears in the center of the frame. They make various hand gestures while appearing to speak, often raising their hands to their head. A cartoon graphic of a paperclip with googly eyes and a piece of lined paper is displayed briefly in front of the person, as is a cartoon graphic of a red and yellow stylized \"S\" logo over the person's chest. Throughout the video, the background is a cluttered area with a brick wall and many bicycles hanging on it, along with other random objects such as a red ladder, a black bag, and a green baby carriage. The person continues to move their hands around, sometimes touching their face or chest, and moves their body as well. Occasionally, a white or slightly orange object can be seen in the person's hands. At some points they clench their fists. They also gesture towards the background. At the end, the person is waving their arms, and they bring their hands to their hair.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the man make the gesture for 11?",
+ "answer": "A",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. Throughout the video.",
+ "C. At the end of the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "eCxaufgy": {
+ "video_id": "eCxaufgy",
+ "video_duration": "87s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a close up shot of a bear character with brown fur on the left half of its face and grey fur on the right half of its face. There's a red lens flare across the face. The face reveals its teeth. A large \"2\" appears in the lower right corner of the screen. The image fades to black. Then there are a series of images of text that appear to be from a Reddit post, showing a discussion about the topic shown in the first shot of the bear. Then there are clips from what appears to be the same game as the original bear character shot, showing what appears to be a security guard's view of a room with a black computer screen and some paper on a desk, as well as a small light on the ceiling. The screen displays times in the AM, going from 5 AM, to 6 AM. Then a paycheck is shown, then the video fades to black. A yellow chicken character's face appears, with a large, wide open mouth. The image changes to an abstract image of wavy blue, green, red, and yellow. The Reddit post is shown again. There are several more game clips, including a security guard view, dark and shadowy with some character's eyes in the image, the security guard view again, the prize corner with the security guard view, the dark shadowy view, the prize corner view, the dark and shadowy view, the security view of the desk in the room, the dark shadowy view, and then the final view of a yellow chicken character in the room and the shadowy view once more. There is a scene with an internet browser of search results related to the topic at hand shown between the security view clips.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times is the term \"Stingray virus\" mentioned in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. Five.",
+ "D. Two."
+ ]
+ }
+ },
+ "zDdFrCpN": {
+ "video_id": "zDdFrCpN",
+ "video_duration": "151s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A lion lies on the ground looking to its right. A lioness and a lion lie next to each other, looking ahead, partially hidden by foliage and a tree trunk. A tiger lies on the ground next to a rocky wall. It opens its mouth, yawns and sticks its tongue out. A gray and white cat sits facing the camera. A humpback whale breaches, almost fully out of the water. A chameleon sits on a branch. Two bearded dragons sit on a rock; another smaller one peers from behind the rock; another one appears in the bottom-left corner of the frame. A spectacled caiman sits on a riverbank with its mouth slightly open. A transparent, ghostly image of the same caiman, colored green and red, walks past the real caiman. A snake is coiled on a piece of wood. A crocodile rests on a riverbank, looking upward. A red-eyed tree frog sits on a plant stem. A frog sits on a log. A fire salamander sits on a log with moss and mushrooms. A blue-and-yellow macaw sits on a branch; several red and green parrots appear blurry in the background. Four African penguins stand or lie on a sandy beach near the water. A grey crowned crane stands in the grass. A flock of birds flies low over the water near a beach. A vulture flies over dry grassland. An ostrich walks through dry grassland. An emu stands looking at the camera. Two emperor penguins stand on a snowfield. A betta fish swims in dark water; tiny bubbles appear in the background. Various colorful fish swim around a coral reef. A goldfish swims against a black background. A cooked lobster lies on a blue wooden surface. Two raw fish lie on a dark wooden surface. Many green leaves form a continuous surface. The Earth rotates, viewed from space. Green, leafy trees surround a path in a park; sunlight streams through the trees. A field of white daisies sways gently in the wind. A catnip plant with green leaves and some small buds appears. An orange and black butterfly sits on an orange flower. Two ants walk on a surface. A bee collects pollen from a pink flower. A beetle sits on a flower petal. A chain of weaver ants forms a bridge between two leaves. A katydid sits on a log. A bear sits at the edge of a forest. Yellow weaver birds sit in and on their nests in a tree. A large herd of elephants walks in a line. A wildlife photographer lies on the ground, taking photos; a cheetah cub sits on his back. A deer and a doe stand in a field of yellow flowers. Several seedlings sprout in rows from dark soil, gradually growing taller. Green fern leaves grow on a rocky surface near a stream. Purple flowers grow densely on a hillside. Orange flowers grow on a plant; mountains appear in the background. A thorny desert plant is shown in its natural environment; next to it is its dry seed pod on a white background. A woman wearing a straw hat takes pictures of purple flowers with a camera. A man stands on the roof rack of a jeep in a snowy forest, looking through binoculars and then taking pictures with a camera. A circular diagram with three sections rotates slowly.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a penguin in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. I am not sure.",
+ "C. Yes."
+ ]
+ }
+ },
+ "OoPvjAwP": {
+ "video_id": "OoPvjAwP",
+ "video_duration": "164s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a light pink long sleeve shirt is holding a white and orange toy assault rifle. The person's hands are visible and they are wearing a bracelet. Another person wearing a blue shirt and black pants walks into the frame, holding a similar toy assault rifle. The first person aims their toy rifle at the other person, then the other person puts down their toy rifle. The second person proceeds to disassemble and reassemble their toy rifle. They then load and unload different ammo magazines. Then they appear to demonstrate how the rifle shoots by aiming and firing at the floor and walls. The video cuts to the second person looking into the camera, still holding their rifle. The camera is then positioned from the other person's POV. The camera focuses on the second person who again disassembles and reassembles their toy rifle. The camera changes to a wider view, where we can see that both individuals are in a room with many toy guns on the walls. The second person, wearing a blue shirt and a black hat, walks towards the camera, and the first person aims a toy rifle at them. The second person then reaches out with his fist for a fist-bump, which the first person returns. The video continues, showing the second person reassembling their toy rifle and then holding it in front of them while looking at the camera. The video cuts to the same scene but zoomed out, and the second person starts to disassemble and reassemble their toy rifle again, and points it towards the camera. The video then stops. The video continues with the second person holding the toy rifle in front of their body. The view cuts back to the person wearing the light pink long sleeve shirt, but is slightly farther away. The second person is holding their toy rifle and smiling at the camera. The camera then moves to the second person, holding a different toy rifle. The view widens and the second person is holding the toy rifle with the packaging in front of them. They show off the toy rifle and then smile, as they load and unload the magazine. The video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual incident occurred when the woman fired the gun for the first time in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The gun jammed with a bullet.",
+ "B. The woman fell.",
+ "C. The magazine was inserted backward.",
+ "D. The gun was missing parts."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many different colors do bullets come in?",
+ "answer": "D",
+ "candidates": [
+ "A. One.",
+ "B. Four.",
+ "C. Two.",
+ "D. Three."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "On which platform did the man in a blue shirt and the woman speaking eventually post the video?",
+ "answer": "A",
+ "candidates": [
+ "A. TikTok and YouTube.",
+ "B. Bilibili.",
+ "C. TikTok.",
+ "D. YouTube."
+ ]
+ }
+ },
+ "kHnDJhcy": {
+ "video_id": "kHnDJhcy",
+ "video_duration": "107s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Astronomy",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video starts with a view of a night sky with visible stars through a windscreen, with trees visible at the bottom of the frame. The view of the night sky expands to show more stars and some clouds. Then, the video shows a close-up of the sun as a red sphere. The text \"The Sun has a lifetime of about 10 billion years\" appears on the screen. Then, a diagram appears with a line going from left to right, with numbered markers and several white dots and arrows, along with a stylized image of a nebula. A marker on the line is labeled \"Sun\". The label \"Age (billion years)\" appears at the bottom right of the diagram. The text \"HIP 102152 is a star located 250 light-years from Earth\" appears on the screen. A diagram of stars of different sizes appears with a horizontal line labeled \"Age (billion years)\" and a marker labeled \"HIP 102152\". The video shows a woman with dark hair wearing a green cardigan and white top, sitting in a dark room in front of a display of the solar system with a red sun depicted. The same woman is shown again in close-up, with the red depiction of the sun visible in the background. The video shows a night sky over a rocky terrain with a car in the frame and text that reads \"Stars much smaller than the Sun die as white dwarfs\". The video shows a different rocky terrain at night with a lone silhouette in the foreground and the Milky Way in the sky. The video then shows a view of a spinning galaxy with text that reads \"Stars much bigger than the Sun become supernovae\". A view of a yellow and blue nebula appears that is rotating with the colors alternating between yellow and blue and then becomes a sphere of yellow. The view widens to show a grey galaxy with a bright center. Then, a view of a star field with a cluster of blue stars is shown. Next, the video shows rocks floating in space. Finally, the Milky Way appears as a panoramic view, followed by a close up of the woman from earlier in the video again speaking. The video shows a white screen with the text \"CO-PRODUCED BY THE CONVERSATION SBS\" and then the text \"PRODUCER/CAMERA/EDITOR RYAN BLAIR\", and the text \"ADDITIONAL VISION NASA EUROPEAN SOUTHERN OBSERVATORY MARK KOCHTE IAN NORMAN\", and then the video ends with multiple cartoon speech bubbles and the words \"THE CONVERSATION\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is it said that we are made of stardust?",
+ "answer": "B",
+ "candidates": [
+ "A. Because stardust contains unique substances crucial for the origin of life, directly facilitating its formation.",
+ "B. Because elements created by supernova explosions form stardust, which may eventually contribute to the formation of life.",
+ "C. Because stardust consists of fundamental particles in the universe, directly composing our bodies.",
+ "D. Because stardust is a form of dark matter in the universe, indirectly influencing the evolution of life."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "Please arrange the following events in the sequence described in the video: 1. The sun becomes a red giant. 2. The sun's core hydrogen is depleted. 3. Stars much smaller than the sun become white dwarfs and die. 4. Stars larger than the sun become supernovae and explode.",
+ "answer": "D",
+ "candidates": [
+ "A. 3, 4, 1, 2.",
+ "B. 2, 1, 3, 4.",
+ "C. 4, 3, 2, 1.",
+ "D. 1, 2, 3, 4."
+ ]
+ }
+ },
+ "CsysKoiH": {
+ "video_id": "CsysKoiH",
+ "video_duration": "57s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A group of young men are gathered in a hotel room. Some are sitting on a bed, others on a couch. Two men are having a conversation on the bed. The scene shifts to a close-up of a person with long dark hair and sunglasses perched on their head. They are speaking and gesturing with their hand. The scene cuts to a man with glasses talking. Then back to the person with sunglasses. The scene changes again to the man with glasses, this time he is smiling. Cut to another young man in a dark blue shirt, also speaking. The man with glasses appears again, now gesturing excitedly with his hands. The scene cuts back to the person with dark hair and sunglasses. They are smiling and have a fist raised to their face. The man with glasses speaks again. The person with sunglasses smiles widely. The man with the glasses continues to speak. The scene returns to the wider shot of the group in the hotel room. One person is covering their mouth with their hands in surprise. Another person with blond hair has their hand over their mouth. A person wearing a baseball cap looks surprised. The scene quickly cuts between several of the men, showing their reactions. Many are covering their mouths or have wide-eyed expressions. Some of the men get up and start to leave the room. One person puts their hand on the doorframe as they exit. The remaining men also get up and leave the room.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who just said, \"Three minutes?\"",
+ "answer": "C",
+ "candidates": [
+ "A. A man with brown hair, wearing a black short-sleeved shirt and black pants.",
+ "B. A man with blond hair, wearing a white short-sleeved shirt and light-colored jeans.",
+ "C. A man with black hair, wearing a white short-sleeved shirt and gray shorts, is sitting on a square stool.",
+ "D. A man with black hair, wearing a white short-sleeved shirt and reading glasses."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many men wearing glasses appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Three.",
+ "D. Four."
+ ]
+ }
+ },
+ "KNfJPiuY": {
+ "video_id": "KNfJPiuY",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a screen recording of gameplay from the online game Old School RuneScape. At the start, two player characters are standing near a stone wall. The character on the right has an orange cape and is equipped with a staff. The character on the left has red and silver armor. They appear to be in combat. The orange-caped character drinks from a potion and then teleports. The player then runs to a purple pool and then to a room with other characters in it. Then, the orange-caped character returns to where he previously was and engages in combat with 3 other characters. These characters include one in full white and blue robes, another in red and silver armor, and one in grey and yellow with a small blue sword. There is a lot of movement between the characters in this scene, and they each appear to hit the other with weapons. The grey and yellow character is defeated and drops an item. The video then focuses on the orange-caped character fighting the others. Eventually, there are a couple more characters there. Finally, there is a screen with the Grand Exchange guide prices up.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the purpose of the blue elliptical item in the lower right corner of the backpack icon in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Instantly restores the character's mana.",
+ "B. Teleports to the main city.",
+ "C. Teleports to the home.",
+ "D. Instantly restores the character's health."
+ ]
+ }
+ },
+ "yActfXiW": {
+ "video_id": "yActfXiW",
+ "video_duration": "54s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A hand holds a yellow and black object and a person presses a button on it. A green light appears at the tip of the object, it is then shown near the wires of a white lamp. The lamp is turned on. The yellow object is touched to the white wires, then the yellow object shows a red light. A metal file is used to file down the prongs of a white power plug. A black power splitter is held, and then plugged in to a black power cord, then the black power cord is plugged in to a yellow power box. The yellow object is again used near the wires of the white lamp, a red light appears, and the words \"Not good!!!!\" are shown.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What change in state does the table lamp undergo in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The lamp remains on throughout.",
+ "B. The lamp changes from on to off, then back to on.",
+ "C. The lamp remains off throughout.",
+ "D. The lamp changes from off to on, then back to off."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "After using a converter plug, what action did the man take with the plug?",
+ "answer": "B",
+ "candidates": [
+ "A. He polished the plug terminals.",
+ "B. He inserted the plug in reverse.",
+ "C. He inserted the plug normally.",
+ "D. He checked the plug's current."
+ ]
+ }
+ },
+ "LGVtxjFQ": {
+ "video_id": "LGVtxjFQ",
+ "video_duration": "181s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Two people, one wearing a green t-shirt and black pants and the other wearing a white t-shirt and beige pants, are standing outdoors. The person wearing the white t-shirt jumps onto a wooden bench while holding a skateboard. The person wearing the green t-shirt appears to be holding a phone and looking at the other person. Next, there's a blurry image followed by a shot of a person wearing a gray shirt and blue pants placing a bike in a bike rack. Then there's a shot of a highway with cars, followed by a nighttime shot of a city with traffic. A red double-decker bus appears in the street and then a woman with brown hair walks past a display of handbags. A woman with dark hair wearing glasses walks across a crosswalk in a crowd. A conveyor belt of e-waste moves, and a large orange crane is seen lifting e-waste into a truck. In the next scene, yellow and orange bell peppers are poured from a wok onto the stove and then a wide shot of produce for sale at a market. A blue garbage bin is emptied into a garbage truck and then there is a shot of discarded eggshells and a banana peel on the ground. The scene then cuts to a large room with many windows followed by a shot of a large house with a curved staircase. Next, there is a single wind turbine shown at sunset. This is followed by a digital thermostat and someone using a remote control on an air conditioner. Three items of clothing hang on a clothesline outside, and a blue airplane is shown taking off. The view then shows an airplane wing above the clouds followed by a single engine airplane. A blue train moves along the coast and then an overhead view of a cruise ship appears. Then, a house with a man riding a scooter in front of it is shown. The scene changes to a room with a large rug and then an interior shot of a window with Christmas decorations. A solar panel is in a field. A red car is shown driving on a road and through a puddle. Two cars drive on a curved road and the White House is shown. A gas station pump is displayed, and a hand presses a button labeled 87. A person mixes a bowl of fruit and then a herd of cattle walks through a snowy field. A close-up of various vegetables, including tomatoes, broccoli, and avocado is shown and then a shot of a farmer's market with packed plastic containers of produce. Four people sit around a table, and then a crowd of people are sitting at tables and watching a presentation. Three people are talking with each other. Finally, two people are riding horses on a dirt road.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What scene does the man film after another man finishes skateboarding?",
+ "answer": "C",
+ "candidates": [
+ "A. A woman is buying vegetables.",
+ "B. A woman is learning to ride a horse.",
+ "C. A man is riding a bicycle.",
+ "D. A child is riding a bicycle."
+ ]
+ }
+ },
+ "xjqCeIxz": {
+ "video_id": "xjqCeIxz",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with short brown hair is wearing pink sunglasses and a dark button-down shirt. They are facing the camera and speaking while standing in a hallway with a door and a red box on the left. Text appears on the screen, including \"The Only Comedian\" and \"SNL = Sorry Not Laughing.\" The text \"youtube.com/KvonComedy\" and \"K-vonComedy.com\" also appear. The video then cuts to a second person in a suit in front of three flags speaking from a podium. Text at the bottom of the screen reads \"PRES. BIDEN TALKS COVID-19 & VACCINES\". Question marks appear over the person's head. The text \"Book K-von LIVE or ZOOM\" appears. Finally, the video shows the first person again, this time wearing a blue suit and a red tie, on a background of blue with the text \"PODCAST\" and \"THE RIGHT SHOW WITH K-VON.\" The text \"Book K-von LIVE or ZOOM\" is overlaid.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotional tone of the man wearing red sunglasses at the beginning of the video when he mentions Biden?",
+ "answer": "B",
+ "candidates": [
+ "A. Sarcastic.",
+ "B. Humorous.",
+ "C. Disappointed.",
+ "D. Excited."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event occurred during Biden's speech in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. He incorrectly read the vaccination address.",
+ "B. He incorrectly read the consultation phone number.",
+ "C. He incorrectly read the postal code.",
+ "D. He incorrectly read the access URL."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the shirt K-VON is wearing at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. White.",
+ "B. Black.",
+ "C. Red.",
+ "D. Blue."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video is it mentioned that getting vaccinated does not require an appointment?",
+ "answer": "A",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. In the middle of the video.",
+ "C. Throughout the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "SVRlNNIO": {
+ "video_id": "SVRlNNIO",
+ "video_duration": "22s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Four people are sitting on a sofa and a chair around a coffee table. There is a Christmas tree in the corner of the room. The people are wearing Santa-themed headbands and a Santa hat. There are three wrapped gifts on the coffee table. The person to the left begins to open the largest wrapped gift and pulls out a bag of chips. A green rectangle with the word \"GREEDY\" appears above the four people's heads. A green rectangle with the word \"UNLUCKY\" appears above the woman with the bag of chips.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What is the emotional shift experienced by the woman wearing a Christmas hat before and after seeing the gifts in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. From happiness to anger.",
+ "B. From happiness to disappointment.",
+ "C. From anger to happiness.",
+ "D. From anger to disappointment."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the largest gift received by the woman wearing a Christmas hat in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A metal lunch box.",
+ "B. A watch.",
+ "C. Chips.",
+ "D. A spoon."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotion of the woman wearing a gray-green short-sleeved shirt in the video after she sees that her gift is a watch?",
+ "answer": "A",
+ "candidates": [
+ "A. Surprised.",
+ "B. Embarrassed.",
+ "C. Disappointed.",
+ "D. Angry."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the man wearing a red hat sitting in relation to the woman wearing a gray-green short-sleeve top in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. In front.",
+ "B. Behind.",
+ "C. On the left.",
+ "D. On the right."
+ ]
+ },
+ "task4": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man wearing a red hat slap his leg in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. One.",
+ "D. Two."
+ ]
+ }
+ },
+ "dJmzIiLX": {
+ "video_id": "dJmzIiLX",
+ "video_duration": "139s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Fashion",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with long dark hair is centered in a shot. They wear a black blazer, and gold-colored loop earrings. They have red lipstick on. The person is gesturing with their hands while looking at the camera. A door, a light switch, and part of a white dresser with boxes and other items are visible in the background. In subsequent shots, various gold colored rings and a black jacket on a hanger appear. Finally, the person is holding a black shopping bag with white text and a white flower on it. They are looking at and gesturing with the shopping bag.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Which Herm\u00e8s product is most popular among women?",
+ "answer": "C",
+ "candidates": [
+ "A. Shoes.",
+ "B. Bags.",
+ "C. Jewelry.",
+ "D. Clothing."
+ ]
+ }
+ },
+ "sIkhyPUW": {
+ "video_id": "sIkhyPUW",
+ "video_duration": "77s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows aerial views of a reddish, desert-like terrain, interspersed with closeups of machinery and what looks to be a rover. The camera zooms in and pans across the landscape, which features textured surfaces and craters. Text appears with changing measurements in kilometers and meters. Later, the video shows views from multiple cameras of what appears to be the lowering of a rover to the surface of the planet as seen from its landing mechanism and from other angles. The screen is sometimes split into multiple views. There is text that appears in an unknown language, as well as the word \"Perseverance\". The final section of the video shows people indoors raising their arms in the air and cheering. Some of them are wearing face masks. The logos for NASA and JPL are also shown.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence of operations for the Perseverance landing?",
+ "answer": "C",
+ "candidates": [
+ "A. Perform a yaw maneuver, separate the aft shell, receive signals from MRO, execute vertical takeoff and landing.",
+ "B. Receive signals from MRO, execute vertical takeoff and landing, separate the aft shell, perform a yaw maneuver.",
+ "C. Separate the aft shell, perform a yaw maneuver, execute vertical takeoff and landing, receive signals from MRO.",
+ "D. Receive signals from MRO, separate the aft shell, perform a yaw maneuver, execute vertical takeoff and landing."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the scene of people cheering in the video take place?",
+ "answer": "B",
+ "candidates": [
+ "A. Control room.",
+ "B. Office.",
+ "C. Equipment room.",
+ "D. Meeting room."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the man do after the vertical takeoff and landing procedure began?",
+ "answer": "B",
+ "candidates": [
+ "A. He performed a vertical takeoff and landing.",
+ "B. He received signals from the MRO.",
+ "C. He observed from the control room.",
+ "D. He conducted landing operations."
+ ]
+ }
+ },
+ "mYzgBAEo": {
+ "video_id": "mYzgBAEo",
+ "video_duration": "162s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A white and yellow rocket sits on a launch platform at night. Flames and sparks appear under the rocket engines, then the rocket launches, rising into the night sky. The scene shifts to a close-up view of the engines igniting and flames shooting out the bottom, and then again to a view of the whole rocket rising. The rocket rises higher against the night sky and passes in front of the sun. Finally, the view alternates between multiple different close up views of the rocket launching, with the view sometimes appearing inside the rocket nozzle, until the rocket is too distant to be seen. A white countdown timer appears superimposed on the lower right of the view, indicating the seconds elapsed during the launch.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes can be observed in the state of the ignition device in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Alternates between turning on and off intermittently.",
+ "B. Turns on during ignition.",
+ "C. Remains off continuously.",
+ "D. Remains on continuously."
+ ]
+ }
+ },
+ "QFCKFCFC": {
+ "video_id": "QFCKFCFC",
+ "video_duration": "64s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with long blond hair wearing a white beanie speaks to the camera. Text overlays read \"LIVE UKRAINE UPDATE\" and a Ukrainian flag emoji. The next scene shows a car driving down a multi-lane highway. A blue circle is drawn on the video in the sky, where small, distant objects are visible. The text overlay says \"SEEING MISSILES WITH MY OWN EYES!!\". The next scene shows a different woman wearing a grey hoodie and baseball cap, speaking to the camera. Text overlay reads \"In KYIV\" and a Ukrainian flag emoji. She is walking outside, cars are parked nearby, and trees and buildings are in the background. The scene changes to the inside of a supermarket. Two people in camouflage military uniforms are standing near the checkout counters. Other people are shopping. The shelves of a bread aisle appear mostly empty. A masked man walks down the aisle. Another aisle, with loose items sold from wooden bins, is also empty. A woman walks past a refrigerated deli counter.A blond woman holds a small child wearing a bright yellow t-shirt. The text overlay reads, \"Don't worry guys\" and then, \"We Ukrainians are a different breed\". The text then says \"No matter how small...\" A digitally-added red heart appears near the woman and child. The blond woman is shown again, now indoors in what appears to be a bedroom, speaking to the camera while wearing a white t-shirt. Finally, the same woman with the grey hoodie and baseball cap is outdoors again. She brings her hand to her forehead and cries.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video do soldiers appear shopping at a supermarket?",
+ "answer": "C",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. Throughout the video.",
+ "C. At the beginning of the video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action did the woman wearing a yellow top and the little girl wearing a white top perform in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The woman in the yellow top teaches the girl in the white top to protect Ukraine.",
+ "B. The woman in the yellow top teaches the girl in the white top to imitate a lion's roar.",
+ "C. The woman in the yellow top teaches the girl in the white top to imitate a tiger's roar.",
+ "D. The woman in the yellow top teaches the girl in the white top how to greet someone."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who says, \"The past 24 hours have been definitely very overwhelming\" in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A woman wearing a white hat.",
+ "B. A woman wearing a black hat.",
+ "C. A woman wearing a white shirt.",
+ "D. A woman wearing a yellow shirt."
+ ]
+ }
+ },
+ "VDZoojXo": {
+ "video_id": "VDZoojXo",
+ "video_duration": "90s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a medium shot of a woman with long brown hair speaking into a microphone. Behind her are several other people, some with their faces obscured by masks and scarves. One person is holding a green and white flag. The scene then cuts to a shot of a woman with brown hair speaking to the camera while walking along a city street. The text \"Natalia Marques, Peoples Dispatch\" appears in the lower left corner of the screen. She is followed by numerous other people walking in the same direction. Many of them are holding signs or flags. One sign reads, \"Stop the genocide!\". The video shows more of the crowd, with some people wearing masks and head coverings. Many people are holding signs with various messages and flags. One sign reads, \u201cOne is not free until we are all free.\" The woman continues walking and speaking to the camera. The scene then cuts to several people holding signs and flags while walking at night. One woman is speaking into a white and blue megaphone. There are several other people around her. More of the crowd of people are shown, including those holding flags and signs that read, \"Free Gaza\" and \"Palestine will be free.\" Some text on the screen reads \"Gaza we will liberate.\" The video shows more people walking at night while holding signs and flags.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the woman holding the megaphone shout the slogan \"When you're shopping, a bomb is exploding\"?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. One.",
+ "D. Three."
+ ]
+ }
+ },
+ "JwjmAXNq": {
+ "video_id": "JwjmAXNq",
+ "video_duration": "233s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a visual representation of cellular division, including both meiosis and mitosis. The meiosis section shows an initial cell with blue and red strands inside, which duplicate and then divide. The next division results in four cells, each with a subset of the original colored strands. The mitosis section shows an initial orange cell with blue and red strands inside. This cell duplicates its contents and divides into two identical orange cells each with identical blue and red strands. Further on in the video, a circular arrow and the words \"cell cycle\" appears, and is divided into multiple colored sections, with the names \"meiosis I\", \"meiosis II\" and \"cytokinesis\" shown. The rest of the video depicts more detailed visuals of meiosis stages. A round cell with inner blue and red strands replicates. Orange bar-shaped objects appear within the outer edge of the cell. Strands of yellow lines begin to radiate from the orange bars, while the internal blue and red strands condense to form paired, X-shaped chromosomes. The nuclear envelope disintegrates. The chromosomes attach to the yellow lines that radiate from the orange bars. A diagram showing the visual difference between chromosomes in meiosis and mitosis appears. The chromosomes arrange themselves in the center of the cell. The chromosomes separate. The single cell splits into two cells. A light white line forms and closes between the two cells. The final result is two cells each with an interior red and blue strand. These two cells repeat the same process to result in four cells. White lines separate the cells.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Which color represents the cells that successfully complete the G1 phase of interphase in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Light blue.",
+ "B. Light green.",
+ "C. Deep green.",
+ "D. Deep blue."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a single blue chromosome visible in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No, it does not exist.",
+ "B. I am not sure.",
+ "C. Yes, it exists."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "In the video, two yellow centrosomes are moving. What is this process called?",
+ "answer": "B",
+ "candidates": [
+ "A. Chromosome disassembly process.",
+ "B. Nuclear membrane disassembly process.",
+ "C. Chromosome binding process.",
+ "D. Nuclear membrane docking process."
+ ]
+ }
+ },
+ "rRLLFacJ": {
+ "video_id": "rRLLFacJ",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video starts with an image of multiple cartoon characters on a red background. There is a large face at the top with an open mouth full of teeth, and many smaller characters surrounding it.Next, there is a close up of one character, looking down at the screen. This character has short black hair, a yellow hat, and a blue and yellow shirt. The character makes a fist.Then, there is a back view of a bare chested figure with many red beads around their neck. There is a tattoo on the figure's left arm. The figure is tied up with rope and looks to be attached to a stone wall.The next image shows the first character running on an orange ground, with a straw hat on, wearing a light blue shirt and red pants.Then there is a close up of the first character looking to the right of the screen with wide eyes. They have a wet face with water droplets, and are wearing a light blue shirt.Next, there is an image of the first character with a large open mouth and fists raised. They are wearing a yellow tank top and red pants. Behind this character are many other characters, most in black and white striped clothing.The next image is a close up of a character with a red nose, clown makeup, and a blue head of hair, with yellow star-like designs and water droplets around their face.Next is a close up of a man with dark hair, a thin moustache, glasses, and a striped shirt, waving their right hand.Next is another close up of the same man with glasses, looking very serious or upset in front of metal bars.Then, there is a shot of the first character interacting with the man with glasses. The man with glasses looks angry with their mouth open, and the first character looks neutral, with fists together.Another close up shows the first character looking serious, with a wet face. They are wearing a blue tank top and have a straw hat on.The last image shows the first character with their mouth open and a fist raised again. 
They are wearing the same yellow tank top and red pants. There are many other characters behind this one.The second to last scene is a close up of the man with the glasses and the man with the red clown nose shaking hands.The last scene shows the man with the glasses and the man with the red clown nose jumping with their hands in the air. Many other characters are cheering behind them in black and white striped clothing.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the man wearing blue glasses in the group photo compared to the man wearing a yellow shirt in the middle of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. On the right.",
+ "B. On the upper left.",
+ "C. On the left.",
+ "D. On the lower right."
+ ]
+ }
+ },
+ "GsoanCMe": {
+ "video_id": "GsoanCMe",
+ "video_duration": "62s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a suit gestures with his hands while standing in front of a teal background. The word \"cabotage\" and its definition appear beside him. A map of the United States appears, and two red dots are placed on it, connected by a red dashed line. A paper boat with a Canadian flag appears on the line. A large red \"X\" appears over the boat and line. The Canadian flag is replaced with an American flag, and a green checkmark appears. The man continues gesturing. The word \u201cshipping\u201d appears, followed by \u201cstuff\u201d. A map of North America appears. The words \u201cHawaii,\u201d \u201cGuam,\u201d \u201cPuerto Rico,\u201d and \u201cMassatucky\u201d sequentially appear on the map. The man gestures as the map changes to show part of the USA. A paper boat with a Union Jack appears. A top hat and a bottle labeled with a skull and crossbones appear on the boat. The map changes to show Europe. Two more paper boats appear, one with a French flag, the other with an American flag. A pair of nunchucks appears next to the boats. A black and white photo of a man with a large mustache appears. The hashtag #HISTORY appears. The map is replaced with a globe. Three paper boats, with British, American, and French flags respectively, appear on the globe. The man gestures as the video ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What change occurs to the Earth ship icon in the video's final background?",
+ "answer": "A",
+ "candidates": [
+ "A. The icon changes from depicting American, British, and French ships to depicting only American ships.",
+ "B. The icon changes from depicting American, British, and French ships to depicting only British ships.",
+ "C. The icon remains unchanged.",
+ "D. The icon changes from depicting American, British, and French ships to depicting only French ships."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there an American flag in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I don\u2019t know."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What were the man's feelings when he described the murder of Archduke Franz Ferdinand?",
+ "answer": "D",
+ "candidates": [
+ "A. Disappointed.",
+ "B. Calm.",
+ "C. Embarrassed.",
+ "D. Excited."
+ ]
+ }
+ },
+ "AJHiXSlt": {
+ "video_id": "AJHiXSlt",
+ "video_duration": "189s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a suit walks down a hallway. He walks past a red couch, towards a group of people holding microphones and cameras. A police officer follows behind him. He continues walking down a red-carpeted hallway and enters a doorway. He walks outside, where there are more people with cameras. The video then cuts to another man in a suit, standing inside a large room with wood paneling and paintings on the wall, holding a binder. The next shot is of a check for $13,560 from Cassels Brock & Blackwell LLP to Nelligan O'Brien Payne LLP. The video cuts to a man in a suit with a striped tie and a yellow ribbon pinned to his jacket, sitting in a chair at a table. A man in a suit walks through a room with wood paneling and shakes hands with other men. The video cuts to a man at a podium in a large conference center, speaking to a seated audience. The backdrop is blue, and a Canadian flag hangs in the background. The man in the suit from the beginning of the video speaks to a woman holding a microphone, while a cameraman films him. The Conservative Party logo appears onscreen. The video then cuts back to the man from the table earlier, then back to the man who spoke to the crowd. A man in a suit walks between rows of chairs. Several people seated at tables clap. A bald man with glasses speaks into several microphones held by reporters surrounding him. He is outside, in front of green trees. He then walks in front of a Hyundai Santa Fe. Reporters approach the bald man and ask him questions. A man in a suit speaks into microphones at a podium. A blonde woman in a blazer and pants walks into a room, carrying a red notebook. A bald man in a suit stands in a room with wooden seating. Several people stand and chat in the room. A woman in black shakes hands with a man in the audience. She moves through the crowd and shakes hands with more people. A wide shot shows a vaulted room with red carpeting and a chandelier. 
A photograph of an email with a picture of a man appears. A man in a suit speaks at a podium in front of microphones. The bald man from the beginning gets into the driver\u2019s seat of a blue car, puts on his seatbelt, and adjusts the rearview mirror. The car drives away, through a parking lot. A man with blond hair and glasses wearing a suit speaks to the camera.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional shift does Mike Duffy display in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. His emotions shift from anger to sadness.",
+ "B. His emotions shift from sadness to stability.",
+ "C. His emotions shift from stable to indifferent.",
+ "D. His emotions shift from anger to stability."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did Mike Duffy talk about with someone at Royal Bank in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Borrow money.",
+ "B. Withdraw money.",
+ "C. Deposit money.",
+ "D. Repay money."
+ ]
+ }
+ },
+ "zatkNppT": {
+ "video_id": "zatkNppT",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Bowling Ball",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A cricket match is in progress. One player wearing blue throws the ball. A second player wearing black hits the ball. A third player wearing purple is also visible. The player in purple attempts to catch the ball and fails to do so. The player wearing blue throws their arms in the air and is later joined by others also wearing blue. The player in purple walks off the field. The words \"WICKET\" appear on screen. Several fans with flags can be seen in the audience. Finally, the words \"THANKS FOR WATCHING\" and \"SUBSCRIBE\" appear on a red screen.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What celebratory action did the pitcher perform with the first man who approached him after receiving the ball in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A handshake.",
+ "B. A hug.",
+ "C. A high five and a hug.",
+ "D. A high five."
+ ]
+ }
+ },
+ "cIghevlO": {
+ "video_id": "cIghevlO",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows two different scenes. In the first scene, there are two people: a woman and a man, sitting in front of a black background with the words \"NIGHT SCHOOL\" in white and yellow letters at the top left. The woman has dark braided hair in a high bun and is wearing a long floral print dress and light pink pants. She is wearing gold hoop earrings and several rings. The man is wearing dark sunglasses, a blue hoodie and matching pants and is wearing a gold wristwatch. In the first scene, the man and woman are talking and gesturing. In some of the frames, there are yellow and green boxes with text that read \"Tiffany Haddish\" and \"Kevin Hart\". In the second scene, there is one woman in front of the same background, who has light blonde hair and is wearing an orange and pink top. She is holding a black rectangle with \"MTV\" in white on it and is gesturing with her hands while speaking. In some of the frames, there is a pink box with \"Bex\" and a twitter handle.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man clap in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. One.",
+ "D. Two."
+ ]
+ }
+ },
+ "jVGpxuXi": {
+ "video_id": "jVGpxuXi",
+ "video_duration": "338s",
+ "duration": "4-6min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video starts with a Minecraft Beta 1.2 game. The player is inside a wooden house with glass pane walls. Other players are visible outside the house. The player places stone and smooth stone blocks on the floor.The player exits the house and the view pans to a small body of water, several torches, and a nearby player named Joko. Three dark-colored creatures with numerous appendages and a central eye move toward the player. They attack and the player dies, resulting in a \"Game Over\" screen.The player respawns in the house and opens their crafting table, accessing sulfur. The player exits and walks to a nearby hill, building a ladder upwards with cobblestone blocks. They jump and take fall damage, leaving them with half a heart of health.The player resumes climbing the cobblestone ladder, reaching a wooden platform at the top. They walk around the area, placing torches. They find sugarcane and bamboo plants next to a body of water. The player then builds a bridge of cobblestone across the water. The view changes as they look at the wooden structure they are standing on, then focuses on the player again as they approach the opposite shore.The player returns to the house, opens a double chest and places raw porkchops inside. They open the crafting menu and check various items: gray wool, leather, planks, and cobblestone.The player places wood planks on the floor, then constructs a fence near the crafting table. They cook a porkchop in the furnace. In the house's upper level, the player places fences, creating a small enclosure.They exit and venture out again, finding a cow and a sheep near the water and bamboo. They kill a spider using a sword. The player builds a bridge over the water towards a sand bank and another structure. They reenter the original house and place cobblestone blocks next to the entrance. They place more blocks and torches at the entrance. Two other players approach the house and join the player on the pathway. 
The player and Joko climb to the wooden platform, then jump down.The player climbs the cobblestone ladder again. After building with another player for a moment, the player jumps down, surviving the fall this time. They kill another spider. The player starts building a dirt tower. While building, a creeper approaches the dirt tower and explodes, killing the player again.The player respawns, cooks more pork, and starts building the dirt tower again. Another creeper appears and explodes, resulting in another death. A brief sequence of a chimpanzee's face appears, followed by the \"Game Over\" screen. The video ends with a distorted image and a logo of a green slime on a stick.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What game is the man playing in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Minecraft.",
+ "B. Honor of Kings.",
+ "C. Peacekeeper Elite.",
+ "D. Mini World."
+ ]
+ }
+ },
+ "azKgWqlb": {
+ "video_id": "azKgWqlb",
+ "video_duration": "237s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Art Exhibitions",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view of a man in a white shirt and dark pants on a ladder painting a large mural. The mural has a pink and purple background and contains a stylized image of a cup with green sunglasses, in front of a pair of tires. The next shot is a close up of metallic panel with the text \"TCO\" in large letters, and the words \"THE CHOSEN ONE\" fading in below. The next shot shows a woman holding a microphone and wearing an orange dress and a black hat standing in front of the mural. The scene cuts to a woman wearing a pink hooded vest, black hat and holding a microphone, standing in front of a chain link fence with graffiti behind her. The camera goes back to the women in the orange dress with the mural behind her. The women in the orange dress is shown with the hood of her vest up and then without the hood. The video then shows the man in the black cap painting a mural, then cuts to the woman in the orange dress with the mural behind her. The camera cuts to a man wearing a black beanie with a flag on it being interviewed. The scene cuts to an overview of the mural being painted with different people working. The camera cuts to a man in a gray shirt using a spray can on the mural. The camera cuts to a shot of a palm tree painted on the pink background. The camera cuts to a shot of a man painting the top of the wall while the sun shines brightly. The camera cuts back to the man with the black beanie. The scene cuts to a man painting, the man in the black beanie. Then the scene cuts to the mural with the man in the grey shirt painting in front. The video cuts back to the man with the black beanie being interviewed and the scene changes again to a shot of a man painting a pink section of the mural. The man in the black beanie is shown being interviewed again. The man is then seen walking in front of the finished mural that has a cup wearing sunglasses, in front of two tires and clouds. 
The scene cuts back to a man with the black beanie being interviewed. The camera shifts to a shot of a man walking towards the camera with a paint roller in his hand, then to the woman in the orange dress speaking into her microphone and the mural is behind her. The woman in the orange dress is shown walking. Next the woman is shown reaching up and holding a ladder for a man with a blue hat and white shirt. The scene goes to a close up of the woman in the orange dress with a pink microphone, and then to a man with the blue hat is shown with sunglasses being interviewed. The scene cuts to the woman in the orange dress with a black hat giving an interview, and the scene changes to the man with the blue hat painting and then back to the man being interviewed. The camera cuts to different areas where the women with an orange dress is being interviewed with a man next to her. The man with the blue hat is then shown spray painting the mural. The video then cuts to shots of a man walking while painting with a spray can. The scene shifts to the man in the blue hat as he moves the ladder. Numbers appear, counting from 6 to 10, interspersed in with the video, including on shots of a DJ. The words \u201cthe chosen one\u201d appears throughout the video. The camera shifts to different people being interviewed in front of the murals being painted. A shot of spray paint cans is shown. The scene changes to a shot of two men hugging in front of the murals. The woman in the orange dress is seen interviewed again. The scene shifts to shots of barbed wire against the sky. The woman is shown with the man in the grey shirt being interviewed. The scene shifts to an interview with the woman in the orange dress and black hat. The scene shifts to a mural with green and bone letter design, and a man is then shown next to the woman with an orange dress being interviewed. 
The scene changes to a shot of a mural saying \"Monster.\" The scene changes to the man in the black beanie with the women in the black hat and the next scene. The scene changes to a man with a fro being interviewed, followed by two men and a woman. A mural with purple, red and green shapes is shown. The video cuts to show a man holding a tee shirt, another mural, followed by a woman in the orange dress interviewing and the scene cuts to a mural. The camera moves to the woman in the orange dress and a man in the blue hat being interviewed. The camera shifts to a mural saying Monster. The video cuts to a group of men and women standing. The video shows spray paint cans. The scene cuts to the man painting the Monster mural and then to two men hugging again. The woman in the orange dress is interviewed, then shots of a metallic panel with Montana Cans written on it and the scene changes to a DJ, then to a man standing with a mural being interviewed. The woman in orange dress interviewing two men on a white sofa, followed by close ups of hands putting caps on spray paint cans. A woman holding a spray paint can is shown. The video goes back to the man in the blue hat being interviewed and then cuts to the woman in the orange dress with a black hat and a woman in a black dress holding a spray paint can. The scene shows three women holding up a spray paint can. The scene changes to the woman in orange interviewing the man with the black beanie standing next to the mural. The scene cuts to the woman in the black hat with a man with a spray paint can, and then back to three people holding spray paint cans in front of a mural. A man is shown throwing a can while spray paint cans. The camera cuts to different people holding up and spray paint cans. Then the video shows various people being interviewed. A man in a grey shirt is shown holding a spray paint can with the flame coming out. The video shows two people next to a mural. 
Different views of the mural saying Monster are shown. The camera cuts to different shots of graffiti and then two men next to it with the woman in the orange dress and black hat. The scene shifts back to the mural with a man standing in front. The scene shifts to a group of women standing by a bar, with a man being interviewed and also a man and women holding cans. The final shot is of a mural with a cup wearing sunglasses and tires. The video ends with a mural saying monster.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What is the purpose of asking the woman to collect paint cans in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. To serve as a memento of the competition.",
+ "B. To count the number of paint cans used by the contestants.",
+ "C. To tally the judges' votes.",
+ "D. To clean up the competition venue."
+ ]
+ }
+ },
+ "agztifcG": {
+ "video_id": "agztifcG",
+ "video_duration": "185s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a yellow, short-sleeved, floral patterned dress is talking and gesturing with their hands. They are wearing a small microphone clipped to their dress. The person's hair is long and curly, and they have noticeable eyeliner. In the background, there is a wall with framed pictures and a table, chairs, and plants. Animated images including a phone with a heart button and speech bubbles appear next to the person. The screen shifts to a pink background with a photo of the person and the text \"Michelle Manuel\" and descriptions below, including the words hair, makeup, lifestyle and cruelty free. The next scene shows the person wearing a pink robe in a room with furniture and picture frames, running their hands through their wet hair, and using a spray bottle on their hair. They then hold up a white bottle with black text that reads, \"best life, define & shine hydrating style cream\", as well as squeezing a small amount of the cream onto their hand and rubbing their hands together. Next they pour more of the cream from the bottle into their hands. The person then applies the hair cream to their hair, also using a blue comb. Next they hold up a white bottle with a blue cap which has the word \"ENLIVEN MOUSSE\" in black lettering. They apply the mousse to their hand and then to their hair, using the comb again. The screen then shifts to show the person wearing the yellow floral dress again, holding their hair.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many distinct items did the woman use while styling her hair in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Five.",
+ "B. Four.",
+ "C. Two.",
+ "D. Three."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What change is observed in the hanging fan behind the woman in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The fan remains still.",
+ "B. The fan changes from being still to rotating.",
+ "C. The fan changes from rotating to being still.",
+ "D. The fan rotates continuously."
+ ]
+ }
+ },
+ "EfWFFYJx": {
+ "video_id": "EfWFFYJx",
+ "video_duration": "307s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing black gloves uses a spatula and a scraper to move chopped vegetables on a flat top grill. The person then slices a smoked sausage into thin rounds on a wooden cutting board. The vegetables on the grill are pushed to one side and the sausage rounds are placed on the grill. The person places a stick of butter on the grill. As the butter melts, they spread it around the grill surface. More vegetables and sausage are added to the grill. The person then adds shrimp to the grill, pushing everything to different areas of the grill. A spice mix is sprinkled over the vegetables and shrimp. The shrimp is flipped and moved around the grill. The person then adds rice to the grill and spreads it out. A dark sauce is poured onto the rice. The rice is then mixed and chopped with spatulas. Finally, the shrimp, sausage and vegetables are combined with the rice.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video, how many times does the man stir-fry the shrimp after adding it to the pan?",
+ "answer": "A",
+ "candidates": [
+ "A. Nineteen.",
+ "B. Seventeen.",
+ "C. Sixteen.",
+ "D. Eighteen."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Before the rice is placed in the center of the video, where are the shrimp located relative to the sausage?",
+ "answer": "D",
+ "candidates": [
+ "A. Behind.",
+ "B. In front.",
+ "C. On the left.",
+ "D. On the right."
+ ]
+ }
+ },
+ "dfUnXXsc": {
+ "video_id": "dfUnXXsc",
+ "video_duration": "119s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A white, winged creature flies in a blue sky. This transitions to a bright, white scene. A fair-skinned woman with long blonde hair and sunglasses on her head is visible, with a red banner labeled \"Music\". A scene from a video game is then shown: three characters walk through an open, grassy area near a body of water. The game display has the words \"Music\" in red on the bottom left of the screen. A group of three characters walk along a path through a grassy area, with tall waterfalls in the background. This is followed by another scene from the video game, showing a group of three characters walking through a grassy area. Another video game scene then appears showing a character with red clothing and another character with white clothing walking on a path by a rocky cliff. Birds fly in the sky. Two characters walk along a grass patch, near a cliff, where a winged creature flies overhead. The camera changes to show three characters wading through water toward a shoreline. The camera changes to show the three characters wading through the water at a different angle. Three characters walk on a grassy patch towards a structure with glowing blue elements. A group of six characters standing on a beach at night talk. This cuts back to the woman, who makes an OK sign with her hand. A group of three characters walks across a long bridge. The three characters continue walking across a long bridge. Three characters walk through a snowy area with large, crystal-like formations. The woman appears and speaks, with no text on the screen. A group of three characters runs through a grassy area with a treasure chest in view. A map is shown with a group of markers, and moves around to show more of the map. The map view disappears. Three characters walk through a field. Three characters walk through a field and attack smaller creatures flying over a body of water. The characters attack one of the smaller flying creatures. 
Another attack is used on the flying creature. The scene cuts back to the three characters walking on a bridge from a different angle. The characters continue walking along the bridge. The characters walk on the bridge again, now further along the path. They continue to walk along the bridge. The screen cuts to white, then shows the woman, with the red banner \"Verdict\" in the lower left corner, speaking. Another battle scene from the video game shows three characters fighting a glowing blue creature in a grassy field with purple plants. The characters use attacks against the creature. Another attack is used. The woman speaks again, with no text. Another video game battle scene shows three characters fighting in a dark, enclosed location. One of the characters attacks with a light-blue beam. The same character attacks, then another character attacks with another beam. The characters continue the battle. A scene shows three characters fighting a larger, red creature. The characters attack the creature. Another attack is used. The characters continue to fight against a larger, beast-like creature. Another attack is used. The woman makes a hand motion and smiles. The woman speaks again, with no text on screen. She speaks again and places her hand on her chest. A scene shows three video game characters at a scenic location with a large, flat flying object overhead. Two characters stand at the same location and look at a larger, flying object. The video game characters run though a blue lit area with several green, glowing spheres. The characters run through the same area again. A screen showing the party members and their levels and statistics is displayed. Another screen is displayed showing the party members with different character art. A video game scene shows multiple characters, with one saying, \"We can easily watch for anything approaching.\" A character then says, \"OK, I'll carry her.\" The characters appear to be speaking to each other. 
A video game scene shows three characters in a well-lit area, with a text box saying \"Alcamoth Citizen\" \"I'm really\". The text box disappears. The video game character looks out over a large cliff with a red sky. The red sky is seen again. The woman speaks and smiles, with no text appearing. She then speaks again. She smiles once more. A video game character is seen running through a lush, green area with tall waterfalls in the distance. An item collection notification appears. The video game character's equipment and stats is shown on a different screen. Another video game character's equipment and stats are shown on screen. A character walks on a white, tiled area with multiple levels. A single character walks on the tiled area from a different angle. A video game scene appears showing the three characters fighting against a mechanical creature with multiple blue light beams. The characters continue their attack against the mechanical creature. The three characters are shown fighting a larger beast-like creature with chain attacks. The beast is attacked again. The woman speaks and smiles. A number appears and a character portrait is displayed, with the number also present. A group of characters are seen, with a rating of 9.5 out of 10 displayed over them. The woman is shown again, speaking, then the camera zooms to a close up of her eye. A battle scene in a video game shows characters fighting a large, winged creature. The woman appears again, with \"Ircha Gaming\" shown as a text logo. She is shown again speaking. She speaks again with \"Ircha Gaming\" as a logo on the bottom of the screen. The camera moves to her right side with her holding a game. The camera moves to her right side again, with the woman holding a game with \"Animal Crossing\" on the cover. The camera again moves to her right side, with the woman holding a game with \"Luigi's Mansion 3\" on the cover. 
She speaks again with a smiling expression, with the \"Ircha Gaming\" logo on the lower left. The woman appears, speaks, winks, and smiles, with \"Ircha Gaming\" as a logo on the lower left. A video game character with a pink and blue headdress and outfit is shown walking in an open area, with \"Incoming Walking Style\" and \"Ircha Gaming\" as a text logo on the screen. The character is shown again, walking towards other characters, with the logo on screen. The character continues her walk towards the other characters, with \"Ircha Gaming\" on the screen. A video game scene shows a character collecting an item at \"Vilia Lake\". The video game character runs through the same area, and a Landmark notification appears on the screen. The video game characters are seen running through a town square area with shops, with \"Ircha Gaming\" as a text logo on screen. The characters move to different vendors in the area. The characters continue to run through the area with \"Ircha Gaming\" as a logo.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the scene of the game character climbing the ladder appear?",
+ "answer": "A",
+ "candidates": [
+ "A. It appears in the early part.",
+ "B. It appears in the middle part.",
+ "C. It does not appear at all.",
+ "D. It appears in the later part."
+ ]
+ }
+ },
+ "DNBdAJkP": {
+ "video_id": "DNBdAJkP",
+ "video_duration": "111s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A blue rectangle appears, with the text \u201cRESPONSIBILITY\u201d in purple and \"Proving you can be trusted with what's expected of you\" in black. Two people, one with a pink shirt and another with a green shirt, stand behind the rectangle. The rectangle disappears and the two people stand in front of a white background. They each put their hands together in front of them, then bring their hands up. The man starts to gesture with his hands while talking. The woman moves her hands in front of her and holds them together. The man\u2019s hands come together in front of him. The woman looks at the man, and they both smile. The man continues to gesture while talking. They both look at each other and the man puts his hand on his chest. The man moves his hands together. The woman puts her hands together. They look at each other. The man gestures with his hands. They both smile again and look at each other. The man points at the camera with his right index finger. The man walks off the screen. The woman speaks with her arms out in front of her. She gestures with her hands again. The man appears again, carrying a black music stand. The man places the music stand in front of him, with the woman to his right. They both talk. The man picks up a black boombox, and they both talk again. The man hands the boombox to the woman. The man picks up a gold cymbal, while the woman holds the boombox. The man strikes the cymbal. The man grabs the boombox from the woman and pushes a button. They both start moving quickly and appear surprised. The man covers his ear while holding the cymbal and the woman covers her ear while holding the boombox. The man puts his hand on his chest. The woman speaks with her arms out in front of her. She gestures with her hands again. The woman touches her ear and makes a gesture with her hand.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who makes the hissing sound in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Both people make the sound.",
+ "B. The woman wearing green makes the sound.",
+ "C. No one makes the sound.",
+ "D. The man wearing pink makes the sound."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Where is the man injured in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. An arm.",
+ "B. The head.",
+ "C. A foot.",
+ "D. A hand."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What instrument does the man use in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A pair of crash cymbals.",
+ "B. A gong.",
+ "C. A cymbal.",
+ "D. A drum."
+ ]
+ }
+ },
+ "gvOrDmAW": {
+ "video_id": "gvOrDmAW",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A graphic appears, displaying \"zambidis\" vs \"mamu\". Two fighters are in a boxing ring; one wearing red shorts, the other black. They exchange punches and kicks. Zambidis kicks Mamu, sending him into the ropes. Mamu falls against the ropes as Zambidis approaches, then Zambidis kicks him again while he is down. The scene changes, showing a person in a hooded robe and mask walking between torches in a smoky area above the ring. The person removes their robe and mask, revealing a bald man with boxing gloves. He raises his arms. The fighter in black shorts re-enters the ring. He bows, raises a clenched fist, and bounces on his feet. The fighter in red shorts now appears above the ring, wearing gloves and his shorts. He descends, entering the ring. He throws punches and kicks, and bows. Zambidis and Mamu are back in the ring, this time both wearing boxing gloves. They circle each other, trading punches and kicks. Zambidis kicks Mamu in the side, then punches him in the head several times. Mamu delivers a few kicks to Zambidis. Zambidis kicks Mamu. Mamu kicks Zambidis. Zambidis lands a hard kick to Mamu's head.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who in the video just said, \"Let's continue?\"",
+ "answer": "A",
+ "candidates": [
+ "A. The commentator.",
+ "B. The referee.",
+ "C. The Black athlete.",
+ "D. The bald contestant."
+ ]
+ }
+ },
+ "ihGNvGiA": {
+ "video_id": "ihGNvGiA",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "Several stock cars race closely together on an oval track. Car number 95, predominantly white with orange and black markings, makes contact with the back of another car. Smoke appears. Car number 95 spins, hitting other cars and causing a chain reaction collision. Multiple cars crash into each other and the track wall. Flames and more smoke erupt from the wreckage. At least one car flips over. The lap counter in the video shows lap 41 of 110. The pile-up involves a number of cars, including numbers 95, 21, 12, 38, 10, 36, 22, and 4. Flames continue to burn intensely as cars collide and slide across the track. Sparks fly from the impact. Smoke billows across the track, partially obscuring the view. The wreckage comes to rest against the wall, with flames still present.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the color of the No. 21 car that was involved in the collision shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Yellow.",
+ "B. Black.",
+ "C. Red.",
+ "D. Blue."
+ ]
+ }
+ },
+ "qgyfIDLL": {
+ "video_id": "qgyfIDLL",
+ "video_duration": "298s",
+ "duration": "4-6min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a pink and white shirt holds a microphone. Around him are several people in athletic shirts. The man in the pink and white shirt holds the microphone up to the face of the person in the gray shirt. The person in the gray shirt bows their head. A woman in a floral dress touches the man in the gray shirt's back. The man in the gray shirt begins to bend forward and people in blue staff shirts help stabilize his arms. The man in the pink and white shirt continues speaking into the microphone. The man in the gray shirt falls backwards and is caught by the people around him in blue staff shirts. The man in the gray shirt is then carried by the people in the blue shirts as he is bent backwards. Eventually the man in the gray shirt falls to the floor, and he is surrounded by many people. A woman in a blue shirt kneels and puts a hand on his head and begins to raise her hand in the air. The man in the pink and white shirt is speaking and pointing. Eventually the man in gray shirt is helped to stand, and after speaking with him, the man in pink and white shirt puts his hand on the side of the head of the man in gray shirt and the man again bows his head. The man in gray shirt then stumbles and falls to the floor where he is again surrounded by the people in the blue staff shirts. He is again helped up. The man in the pink and white shirt is speaking with the man in gray and touches his head again, and the man in gray shirt falls to his knees. A woman in a blue shirt then kneels beside him. The man in the pink and white shirt is speaking, a woman in a floral shirt holds a microphone. The man in the gray shirt raises his hands in the air.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the man's emotional state change after he falls to the ground in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. He shifts from feeling relaxed to feeling sad.",
+ "B. He shifts from feeling excited to feeling sad.",
+ "C. He shifts from feeling excited to feeling relaxed.",
+ "D. He shifts from feeling relaxed to feeling excited."
+ ]
+ }
+ },
+ "HwyYDudd": {
+ "video_id": "HwyYDudd",
+ "video_duration": "180s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A woman wearing a head wrap and patterned top looks at the camera with a serious expression. The screen cuts to black with the words \"A FEW MONTHS LATER.\" The same woman sits on a low stool in a small blue room, picking through a tray of beans. Another woman with braids, wearing a black top and jeans, enters. The woman on the stool stands and hands the tray to the other woman, who exits with it. The first woman then sits on a couch in a living room, where she is joined by a visibly pregnant woman wearing a red t-shirt and head wrap. A man in a light blue sweatshirt sits next to the pregnant woman.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotional state of the woman wearing a black headscarf in the middle of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. She is happy.",
+ "B. She is excited.",
+ "C. She is frustrated.",
+ "D. She is angry."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who said in the video, \"Even my mother doesn't know everything about me.\"?",
+ "answer": "B",
+ "candidates": [
+ "A. A man in a white hoodie.",
+ "B. A woman in a red shirt.",
+ "C. A woman in a black headscarf.",
+ "D. A woman in a black vest."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the woman wearing a black vest and the woman wearing a red shirt in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Biological sisters.",
+ "B. Aunt and niece.",
+ "C. Cousins.",
+ "D. Mother and daughter."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the woman wearing a headscarf sitting in relation to the woman wearing red?",
+ "answer": "B",
+ "candidates": [
+ "A. Behind.",
+ "B. On the left.",
+ "C. On the right.",
+ "D. In front."
+ ]
+ }
+ },
+ "sGrGFmhX": {
+ "video_id": "sGrGFmhX",
+ "video_duration": "125s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A dark, slightly shaky video shows a person standing near a body of water at twilight. The water reflects a lighter band where the sky meets the horizon. The silhouette of trees and a few distant lights are visible on the opposite shore. The camera pivots, showing more of the water and less of the person. The camera moves again, focusing on the dark water with a faint predawn glow appearing on the horizon. Some lights are reflected on the water's surface in the distance. The camera pivots back to the figure, still silhouetted, now holding a fishing rod. The camera angle changes, providing a clearer view of the figure fishing at the water's edge. Then the scene shifts to two individuals on a concrete pier or embankment, also fishing. One holds a net, and the other works a fishing rod. The one with the net scoops something out of the water. The camera angle changes again, looking down at the net which holds a large silvery fish. The fish is removed from the net and laid on the concrete surface of the pier. The figure in the blue jacket then kneels, holding the fish up against the brightening dawn sky and smiles for the camera. The sky is a deep, clear blue, with a thin layer of mist over the water.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many people are fishing in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four.",
+ "B. One.",
+ "C. Three.",
+ "D. Two."
+ ]
+ }
+ },
+ "TfQivQLH": {
+ "video_id": "TfQivQLH",
+ "video_duration": "485s",
+ "duration": ">8min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a large audience seated in an auditorium listening to a speaker. The speaker, a woman with short gray hair, wearing glasses, a gray blazer, and a lime green shirt, stands next to a large white pad of paper.\u00a0She begins drawing on the pad. At the top, she writes \u201cDrawing your dreams into reality.\" She draws two large circles, labeling one \"Current State\" and the other \"Desired New Reality.\" Under \"Current State,\" she draws a person sitting at a desk with a stack of papers, another person standing nearby, and thought bubbles above them. Below this, she draws two overlapping circles representing a brain. Under \"Desired New Reality\" she draws a group of people with their hands raised. She colors in a sun above them with yellow marker. She draws a series of arrows connecting the \"Current State\" to the \"Desired New Reality\" labeling them \"See it,\" \"Believe it,\" and \"Action it.\" She colors in the brain drawing with pink, and draws a globe under the group of people. She continues to talk while drawing and gesturing towards the drawings. The audience is seen listening and watching.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the woman in the video color the objects on the canvas?",
+ "answer": "D",
+ "candidates": [
+ "A. To accentuate the details within the painting.",
+ "B. To make the painting's elements more easily discernible.",
+ "C. To enhance the aesthetic appeal of the painting.",
+ "D. To imbue the objects in the painting with emotion."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a table on the canvas in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No, there is not a table.",
+ "C. Yes, there is a table."
+ ]
+ }
+ },
+ "mDwMBvoz": {
+ "video_id": "mDwMBvoz",
+ "video_duration": "119s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video depicts a soccer match. Players in red and white uniforms move the soccer ball around the field, trying to get it past players in light blue and white uniforms. There are also players wearing yellow and black. At the bottom of the video, there is a soccer ball on the field. The video changes angles frequently. A white line divides the field horizontally into two halves, and a white circle is at the center. There is a soccer goal at either end of the field, with a player positioned in front of each goal. At one point, the action slows down, and the words \"Replay\" appear in the top-left corner. A close up of 3 players is shown at the end, in a different position on the field. The players are wearing, white and light blue, and red and white.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Which player (by jersey number) missed the shot in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Number 10.",
+ "B. Number 11.",
+ "C. Number 26.",
+ "D. Number 25."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "Who did the goalkeeper in the green jersey throw the ball to in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A player wearing a red jersey.",
+ "B. A player wearing a blue jersey.",
+ "C. A player wearing a black jersey.",
+ "D. A player wearing a white jersey."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why did the player wearing number 26 fail to score in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The contact point was incorrect.",
+ "B. The goalkeeper saved the ball.",
+ "C. Leaning back caused the shot to fail.",
+ "D. The opposing player intercepted the ball."
+ ]
+ }
+ },
+ "bFKPTAxR": {
+ "video_id": "bFKPTAxR",
+ "video_duration": "109s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Soccer",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a white and green uniform with the number 16 on it, including a white helmet with green and orange markings, looks to their left. Then the camera angle switches to a wide view of a football field, showing 11 players in white and green uniforms lined up on one side of the field facing 11 players in red uniforms on the other side of the field. The teams engage, a football is thrown, and a player in a white and green uniform catches the ball and is tackled near the endzone. The person in the white uniform gets up and runs into the end zone. Some people in white and green uniforms jump in the air and give each other high-fives and chest bumps. The video shows a view from behind a person in a white uniform with green accents as they are celebrating with their teammates. The video then shifts to a wide view of a football field with people in white and green uniforms at the 10-yard line facing people in red uniforms. The person in the red uniform picks up a football, runs, and is tackled in the endzone. Players in red uniforms begin jumping in the air and high fiving. Then the camera focuses on a player in the red uniform, including a red helmet and a face mask. The video returns to a wide shot of the field as players wearing red and white uniforms are lined up for another play. A person in a red uniform kicks a football and it goes over the goal post. The video shows a person in a white uniform with green accents running down the field, as other players surround him. The video shifts to a player in a red uniform picking up the ball, running and is tackled by players in white and green uniforms. Players in red uniforms run to join in the celebration. The camera then pans across a packed stadium of people. Players in red uniforms are then shown high-fiving and patting each other on the backs. The video then switches to show the XFL logo and the ESPN logo. A football play is shown with players in red and white uniforms facing off again. 
A ball is kicked and a player in red catches the ball. Another player in red is tackled by a person in the white uniform.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How did the Red Team players' emotions change during the match?",
+ "answer": "B",
+ "candidates": [
+ "A. Their emotions shifted from tense to disappointed.",
+ "B. Their emotions shifted from tense to excited.",
+ "C. Their emotions shifted from tense to subdued.",
+ "D. Their emotions shifted from tense to calm."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Which player number threw the ball to the ground after being intercepted by a White Team player in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Number 20.",
+ "B. Number 19.",
+ "C. Number 16.",
+ "D. Number 17."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What was the score between the two teams at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. 06:04:00.",
+ "B. 06:06:00.",
+ "C. 06:05:00.",
+ "D. 06:09:00."
+ ]
+ }
+ },
+ "SxXuQoJA": {
+ "video_id": "SxXuQoJA",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A cartoon character with brown hair and a purple hoodie with a green circle on it is shown in front of a window with yellow curtains and a wall divided horizontally into tan and maroon sections. The character gestures with her right hand. The video cuts to the character's face only with a straight mouth, then it shows a cartoon cookie walking up to the character. The video cuts back to the full character again, and she is gesturing with her right hand. The video then shows a simplified version of the character with a purple hood and a large grey hairclip over their head. Then the video cuts back to the full version of the character, again gesturing. The video cuts to the character with her eyes closed and her right arm raised. The video shows a white screen with a cartoon dragonfly image in the center of the screen. Then it shows a simplified version of the character sitting and a purple hood with a green circle next to her. A cartoon dragonfly flies onto the screen. The video then shows the simplified character with her arms raised and moving on the screen. The video cuts back to the full version of the character, and then it shows the white screen again with the cartoon dragonfly image. Then it shows the full version of the character. The character has her arms raised in the air, and then closes her eyes. The video cuts to a white screen with the text \"wait...\" in pink. The video cuts to another white screen with the text \"wait... can dragonflies even bite people?\" in pink. The video cuts to the full version of the character. Then the video cuts to a picture of a dragonfly holding a wasp. The video cuts to the full version of the character with arms folded across her chest. The video then cuts to a cell phone screen showing a google search for \"can dragonflies bite\". 
The screen includes a text section that reads: \"YES, if you catch one and hold it in your hand and carelessly allow its mandibles to reach your skin, it will bite as hard as it can in self defense. Very few dragonflies can even break the skin, but some of the big ones can do so and may induce an \"ouch\".\" The screen then includes a link to \"Do dragonflies bite or sting? - Worldwide Dragonfly Association\". The cell phone screen fades to white, and then shows the full version of the cartoon character again.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What happens to the audio in the video when the girl chases the dragonfly?",
+ "answer": "C",
+ "candidates": [
+ "A. The background music stops.",
+ "B. The background music becomes louder.",
+ "C. The background music starts playing.",
+ "D. The background music becomes quieter."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What happens to the girl's hat when the dragonfly flies by in the middle of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. It gets dirty.",
+ "B. It falls to the ground.",
+ "C. It is carried away by the dragonfly.",
+ "D. No change occurs."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Are there any bees shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Uncertain.",
+ "B. No, there aren't.",
+ "C. Yes, there are."
+ ]
+ }
+ },
+ "RYmbMlyz": {
+ "video_id": "RYmbMlyz",
+ "video_duration": "95s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A group of young soccer players are practicing on an indoor field. They wear two different uniforms: one light blue, the other dark blue/black. Four players in green and black goalie jerseys stand outside a square marked by white lines. Each corner of the square has two orange markers. Inside the square, players pass the ball amongst themselves. One dark-uniformed player holds the ball, while another player in the same uniform gestures. The player with the ball passes to the gesturing player. The coach, in red and black, gestures and talks to a young player wearing a light blue practice jersey. The practice resumes inside the square. Players pass the ball around within the square. One player passes to a teammate at the edge of the square. The receiving player kicks it back into the square, then receives a pass in return. The coach continues to interact and gesture with the player in the blue practice jersey off to the side. The practice continues in the square, with players passing and maneuvering. The coach looks on attentively. The ball goes out of bounds, picked up by one of the goalies. The goalie throws the ball to a teammate inside the square, and practice resumes. The coach interacts with another goalie outside the square. Practice continues with passing within the square. The coach gestures with a player in the blue practice jersey. The coach begins talking to the camera.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What did the blue team do after the green team served the ball?",
+ "answer": "A",
+ "candidates": [
+ "A. The blue team initiates a combination attack and then passes to the green team.",
+ "B. The blue team continuously attacks the green team.",
+ "C. The blue team first defends and then attacks.",
+ "D. The blue team continuously implements a defensive strategy."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the likely occupation of a person wearing red clothing?",
+ "answer": "C",
+ "candidates": [
+ "A. A referee.",
+ "B. An audience member.",
+ "C. A soccer coach.",
+ "D. A journalist."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "How does a person wearing red clothing likely feel when the Blue Team completes a coordinated attack?",
+ "answer": "D",
+ "candidates": [
+ "A. Angry.",
+ "B. Sad.",
+ "C. Shocked.",
+ "D. Proud."
+ ]
+ }
+ },
+ "kXEeTvhd": {
+ "video_id": "kXEeTvhd",
+ "video_duration": "39s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a series of still images, in sequence, with the following content: The first image shows the text \"Mariner 11\" in yellow text, against a background of stars and galaxies. The second image shows the text \"Mariner 11\" with a large red X drawn through it, as well as the text \"Voyager 1\", both against a background of stars and galaxies. The third image shows a rocket taking off, in front of a deep blue sky, with the NASA logo visible in the upper left corner. The fourth image shows the same rocket image, with the addition of a white speech bubble saying \"U mad, logic?\" The fifth image shows an image of the planet Jupiter on the left and a spacecraft to the right, both against a starfield background. The sixth image shows an image of Saturn against a black background. The seventh image shows a close-up image of a planet with a blue emission on one edge. The eighth image shows the same edge of the planet at a distance, surrounded by black space. The ninth image shows a white spacecraft on a black background with a yellow sign reading \"NEPTUNE or BUST!\". The tenth image shows a planet in front of a star field. The eleventh image shows the same planet, followed by a \"greater than\" symbol, followed by another image of a planet in front of a starfield. The twelfth image shows the NASA logo against a starfield. The thirteenth image shows a tombstone shaped gray object that says \"Planetary Mission\" next to the NASA logo and a starfield background. The fourteenth image shows the text \"Interstellar Mission\" in yellow letters against a starfield background. The final image shows a depiction of the solar system on the left and the spacecraft on the right with a speech bubble saying \"See ya!\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video show Saturn?",
+ "answer": "D",
+ "candidates": [
+ "A. None of the above.",
+ "B. It is not possible to determine.",
+ "C. No, it does not.",
+ "D. Yes, it does."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of Voyager 1 relative to the solar system in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. To the left of the solar system.",
+ "B. To the right of the solar system.",
+ "C. Below the solar system.",
+ "D. Above the solar system."
+ ]
+ }
+ },
+ "HFebsoAQ": {
+ "video_id": "HFebsoAQ",
+ "video_duration": "132s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a first-person perspective of a player moving through a dark, industrial-looking environment. The player character carries a weapon and points it in different directions, while also interacting with menus to manage their inventory and check weapon statistics. Other players appear in the distance, sometimes partially concealed by objects, or in a prone position on the ground. At one point the player fires their weapon at a prone enemy. The player moves through various indoor and outdoor areas, including a tunnel. The player character's gloved hands holding a rifle are prominently visible. At the end of the video, the scene transitions to a menu with the text \"ESCAPE FROM TARKOV\". The bottom left corner shows a webcam view of a person wearing headphones. There is also a chat window visible on the right side of the screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "What is the bullet capacity of the magazine used by the man in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Sixty.",
+ "B. Eighty.",
+ "C. One hundred.",
+ "D. Thirty."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Did any footsteps sound appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. I don\u2019t know.",
+ "C. Yes."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many consecutive shots were fired in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Eleven times.",
+ "B. Five times.",
+ "C. Seven times.",
+ "D. Nine times."
+ ]
+ }
+ },
+ "vYkSCnAM": {
+ "video_id": "vYkSCnAM",
+ "video_duration": "616s",
+ "duration": ">8min",
+ "domain": "Games",
+ "sub_category": "Sports Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A screen recording of the video game *Madden NFL 23* shows a player drafting their team. The player selects Delanie Walker (TE), T.J. Hockenson (TE), and Travis Kelce (TE) in Round 3. In Round 6, the player selects Bruce Matthews (LG), Alex Mack (C), and Larry Warford (RG). The player cycles through various screens, and the final team lineup is shown with the selected players filling their respective positions. The gameplay begins, and the Lions score a touchdown against the Eagles. After a successful extra point, the score is 7-6 in favor of the Lions. The Eagles then score a touchdown and extra point to take the lead 10-7, and then score a field goal to extend their lead to 12-10. The Eagles score another touchdown, bringing the score to 26-10.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the running speed of the man wearing a red shirt on the field in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Eighty-eight.",
+ "B. Seventy-eight.",
+ "C. Eighty.",
+ "D. Ninety."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is Gardner Minshew's position relative to Josh Allen in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Gardner Minshew is above Josh Allen.",
+ "B. Gardner Minshew is below Josh Allen.",
+ "C. Gardner Minshew is to the right of Josh Allen.",
+ "D. Gardner Minshew is to the left of Josh Allen."
+ ]
+ }
+ },
+ "qnozzLjg": {
+ "video_id": "qnozzLjg",
+ "video_duration": "63s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a person wearing a dark top and light pants, standing in front of a police car. The police car is dark with yellow lettering. Another person, wearing a white shirt, is visible in the foreground and appears to be the person filming. The person in the white shirt shifts position, and a second police vehicle, white in color, appears on the street. The person in the white shirt continues to move and adjusts their position, causing the camera to shift between a view of their body and the police officers. The camera angle moves from a street view, to an upward angle, briefly showing trees, a house, and sky. The person in the white shirt again adjusts the camera angle and briefly shows tattoos on their arm. Then the camera alternates views between the street, the police officers, the person filming, a few houses, and the sky. A police officer wearing a dark vest and light shirt is also visible. A portion of the video is filmed with the camera angled downwards. Later in the video, a person wearing a blue top is briefly visible. The video ends with a dark screen.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why would a white person wearing a black vest arrest a black person wearing a white short-sleeved shirt?",
+ "answer": "D",
+ "candidates": [
+ "A. The black person in a white short-sleeved shirt was fighting with others.",
+ "B. The black person in a white short-sleeved shirt committed robbery.",
+ "C. The black person in a white short-sleeved shirt violated mandatory housing regulations.",
+ "D. The black person in a white short-sleeved shirt violated traffic regulations."
+ ]
+ }
+ },
+ "AdSMqfHb": {
+ "video_id": "AdSMqfHb",
+ "video_duration": "99s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Chemistry",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A glass jar with a green lid is shown with a liquid inside that appears opaque and cloudy, and contains small particles. The jar is first shown next to a rock, and then moved to sit on a concrete ledge. A person's hands appear and place a small metal object inside the jar. The person\u2019s hands also appear to take off the green lid and place it back on the jar, and move the jar. The video shows the contents of the jar change, becoming more clear. Later, the jar's contents become more orange.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does a reddish-brown liquid appear in the bottle?",
+ "answer": "A",
+ "candidates": [
+ "A. The liquid appears at the end of the video.",
+ "B. The liquid appears at the beginning of the video.",
+ "C. The liquid appears throughout the entire video.",
+ "D. The liquid appears in the middle of the video."
+ ]
+ }
+ },
+ "jEYMBGsq": {
+ "video_id": "jEYMBGsq",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The letters \"JBER\" appear on a dark background, fragmenting and then reforming against a textured backdrop. A silhouetted soldier kneels, a jet flies overhead, and the full name \"Joint Base Elmendorf-Richardson\" is displayed. The image transitions to an aerial view of a metal structure. Then, the scene cuts to a hallway where an airman in a flight suit walks past a door labeled \"525th Squadron Commander, Room 217.\" Three men in flight gear walk and talk in the hallway. One man adjusts his helmet and oxygen mask. A fighter jet executes a sharp turn in a clear blue sky, leaving a vapor trail. A group of airmen in flight suits walks across a tarmac. Two airmen, wearing parachute packs, walk through the snow. In a hangar, Lt. Col. Clayton Percle, 525th Fighter Squadron Commander, speaks. Black and white images appear: a pilot in a cockpit, and a boy sleeping under a blanket with a Star Wars pillow. Lt. Col. Percle continues speaking. Silhouetted fighter jets fly against a sunset. Grainy black and white footage shows biplanes in flight, a pilot in an open cockpit, and Air Force personnel in formation walking past a modern building. A color image of a fighter jet parked on grass in front of a building is shown. Lt. Col. Percle is shown again, continuing his speech.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the man's father's occupation?",
+ "answer": "A",
+ "candidates": [
+ "A. An army pilot.",
+ "B. Army medical personnel.",
+ "C. Army logistics personnel.",
+ "D. A member of the navy."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the man in the fighter jet and the man being interviewed?",
+ "answer": "B",
+ "candidates": [
+ "A. They are the same person.",
+ "B. They are father and son.",
+ "C. They are teacher and student.",
+ "D. One is an idol, and the other is a fan."
+ ]
+ }
+ },
+ "kydZEXLL": {
+ "video_id": "kydZEXLL",
+ "video_duration": "15s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a map with red shapes moving across it. There are four clusters of shapes, one cluster is located near the land, and the other three are in the water. All of the shapes are moving towards the coast. The video then changes and shows two satellite-like objects with a large red beam coming from them pointed towards land, with three blue ships and a red shape also on the map below them. A smaller video appears on the screen, showing a military-style aircraft on the land. There is a second small video showing structures with doors and roofs. A dark banner with white text then appears across the bottom of the map.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Which part of the video shows a drone?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the end of the video.",
+ "C. At the beginning of the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "oVXVZPVk": {
+ "video_id": "oVXVZPVk",
+ "video_duration": "34s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person with long red hair, wearing a black jacket, a white top, and a silver necklace is centered in the frame and talking directly to the camera. The background includes a white wall with two framed pictures, one of a horse's head and one of three cartoon figures. In front of the person, on the right of the screen, is a counter top with a small pattern. Throughout the video, a series of text overlays appear at the bottom of the frame and a picture appears next to the person. The image shows two legs from the knee down, wearing white socks and black high heels. The image also includes text at the bottom. Later in the video, a second image appears that shows only one leg, sock, and shoe. During the video, the person gestures with their hands while speaking. Near the end, the person uses both hands to raise to a position at the sides of their head. There is a black lamp to the right side of the frame.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the woman make an \"OK\" hand gesture in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. One.",
+ "B. Zero.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "aXKlhzgv": {
+ "video_id": "aXKlhzgv",
+ "video_duration": "73s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Four figures in colorful armor sit on a couch. A fifth figure in silver armor approaches the door. The silver figure is followed by a figure in orange armor through the doorway. Then the view splits, showing two figures: a yellow figure and the orange one and then shows the pink one. Then the view splits again showing the yellow, pink, and blue figures. The red armored figure is seen on the couch as the yellow figure falls, then the red figure walks, and the blue one is running. There is then a view of the blue figure running with a filter effect. The pink armored figure is then seen running with a filter effect. Then a yellow figure is seen running with the filter effect and stops facing the camera. The yellow armored figure is seen in a wider shot moving their arms, then the red figure appears with their arms up, and the blue figure is running. Then a wide shot of the red figure is seen with arms up. Then the blue figure is seen running, and is seen spinning in the air. Then the viewpoint changes to look up at the gold ceiling and the blue figure falls and a small figure with wings is shown. The blue figure catches the winged figure. Then the blue figure is seen flying through a door. The blue figure stops and is seen lifting its arms. The video then cuts to a wide shot and the blue figure flies through it. The blue figure is seen floating and hits the yellow figure against a wall, breaking it. The yellow figure stumbles around and then falls on the ground. The pink figure is seen, then the video switches to a gun pov, with the pink figure facing forward. Two figures in similar pink and black armor shoot at each other with purple lighting. The two figures are seen talking, then running away. The red figure is seen walking into frame, then a close up of the mask of the red armored figure is shown, then a wider shot of them holding two weapons, and firing them with flaming projectiles. The flaming projectiles are seen in a closer view. The red armored figure is seen firing the flaming projectiles while moving through the room. The screen displays a top-down perspective of the red figure firing two weapons, then a close up view of the head of the red figure, and a last shot of the red figure shooting with the weapons towards the ceiling.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occur on the wall midway through the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The wall shows no changes.",
+ "B. The wall turns black.",
+ "C. The entire wall collapses.",
+ "D. Part of the wall cracks."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video is the sound of breathing heard?",
+ "answer": "D",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the beginning of the video.",
+ "C. Throughout the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "UJfJDqjT": {
+ "video_id": "UJfJDqjT",
+ "video_duration": "45s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Neymar, wearing a yellow Brazilian jersey, is shown in a close-up shot. The video cuts to a penalty kick. A Croatian player kicks the ball into the goal, while the Brazilian goalkeeper dives in the opposite direction. The video then shows a close-up of the Croatian player celebrating. Next, Marquinhos, a Brazilian player, takes a penalty kick. He kicks the ball, and it hits the left goalpost. Marquinhos bends over with his hands on his knees. The video shows the Croatian players running towards the center of the field to celebrate their victory while the Brazilian goalkeeper sits dejectedly on the field. The Croatian players hug and jump in celebration. One of them falls to his knees and raises his arms in the air.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, which direction was the Croatian goal scored relative to the goalkeeper?",
+ "answer": "D",
+ "candidates": [
+ "A. Directly above.",
+ "B. Directly in front.",
+ "C. Directly to the right.",
+ "D. Directly to the left."
+ ]
+ }
+ },
+ "beLFYVbE": {
+ "video_id": "beLFYVbE",
+ "video_duration": "120s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Editorials",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with light brown hair and glasses is wearing a dark suit, a white shirt, and a dark blue tie. He is looking at the camera, and his mouth is moving as if he is talking. His hands move during some of the video. The background is a light colored wooden wall to the left and a glass door with dark framing to the right. A window behind the door is visible, as is a stone building with an arched entryway. There is also a light blue scooter visible through the window. During some of the video, text appears at the bottom, overlaying the man. The text first says, \"Christopher Lamb of The Tablet on the Extraordinary Synod of Bishops in Rome\". The text later says, \"Marriage and family life. A priority concern in the pews\". At the end, the background is black and text appears in white, stating \"Pope Francis the Synod and the Church, Austen Ivereigh and Christopher Lamb were talking to Ruth Gledhill, @austenli, @ctrlamb, @ruthiegledhill.\"",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the narrator's emotional state change in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From curious to calm.",
+ "B. From curious to angry.",
+ "C. From calm to curious.",
+ "D. From angry to curious."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Where does the main noise in the video come from?",
+ "answer": "D",
+ "candidates": [
+ "A. A child coughing.",
+ "B. The sound of a crowd arguing.",
+ "C. The narrator's echo.",
+ "D. The sound of a crowd discussing."
+ ]
+ }
+ },
+ "PdpWrLsy": {
+ "video_id": "PdpWrLsy",
+ "video_duration": "88s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a soccer game. The soccer field is green with white lines. Players wearing blue and white uniforms are seen competing against players in grey and maroon uniforms. There is also a person in a yellow uniform. The players are running around the field. At times they kick a white and black soccer ball. The goal posts are white nets. In some shots, the video shows a large crowd of people sitting in the stands. They are clapping their hands. In one shot, a large screen is visible which displays a photograph of a fair-skinned person with light hair. In other shots, people in blue and white uniforms are seen embracing one another. In another shot, a person wearing a dark blue suit and tie is seen shouting.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "How many times did the goalkeeper in the green jersey block the blue team's goals at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Three.",
+ "D. I don't know."
+ ]
+ }
+ },
+ "xpyDxKTB": {
+ "video_id": "xpyDxKTB",
+ "video_duration": "100s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a woman speaking directly to the camera, while sitting in front of a solid brown background. The video is a recording of a Youtube live stream, and there are many Youtube controls visible, along with the title of the video and other channel information. She has long brown hair, a white top, and a ring visible on her left hand. She moves her hands around sometimes while talking. After some time the video switches to what appears to be a news article on the Reuters webpage. The title of the article is \"Netanyahu out, Bennett in as Israel marks end of an era\". After showing that article for a little while, it switches to a second article from the New York Times. The title is \"Israel\u2019s Parliament Approves New Government, Ousting Netanyahu\". The final portion of the video consists of a picture within that New York Times article. The picture is of a group of people within a room with many rows of chairs. Two men in the foreground are raising their hands.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What action did the author take in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The video was trimmed.",
+ "B. The watermark was removed.",
+ "C. The subtitles were edited.",
+ "D. The video was paused."
+ ]
+ }
+ },
+ "NSynVvko": {
+ "video_id": "NSynVvko",
+ "video_duration": "547s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman sits at a table with an open box in front of her. She picks up a small, rectangular package labeled \"The Gluten Free Bar\" and shows it to the camera. She sets it down on the table beside the box. Next, she picks up a small, foil-wrapped rectangular package from the box, opens it, and shows the contents - a small, rectangular bar - to the camera. She sets that on the table also. She repeats this process with several more food items from the box: a small, yellow pouch; a red and black package labeled \"Parm Crisps\"; a long, thin package; a pink and black pouch labeled \"Ketologie Keto Shake\"; a small bag of sunflower seeds; and finally, a bag of pecans. Each time, she shows the opened package and sometimes the food item inside to the camera before setting it down on the table beside the box.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the name of the woman's favorite snack mentioned while unboxing the items in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Black chokeberry.",
+ "B. Coconut lime protein bar.",
+ "C. Hickory-smoked snack sticks.",
+ "D. Keto drink shaker cup."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the woman in the video pay attention to the ingredient lists of each product?",
+ "answer": "C",
+ "candidates": [
+ "A. Verify that the product contains the ingredients as advertised.",
+ "B. Check if the product has expired.",
+ "C. Prevent overeating and potential weight gain.",
+ "D. Avoid consuming harmful ingredients."
+ ]
+ }
+ },
+ "HIxpBxST": {
+ "video_id": "HIxpBxST",
+ "video_duration": "57s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A white pick up truck is parked on a dirt road next to a muddy puddle. The truck then drives forward, kicking up a cloud of dust, and turns to the left, going up a dirt hill. The truck reaches the top and turns to go down the other side. It accelerates, kicking up a cloud of dust and then continues to drift to the right before turning to its right and making a larger turn, which also produces a cloud of dust. The video then cuts to the interior of the truck, where a shirtless man is visible, holding the steering wheel. The video ends with the man making a thumbs up gesture.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What vehicle operation is the man demonstrating in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Power performance demonstration.",
+ "B. Off-road capability demonstration.",
+ "C. Fuel efficiency demonstration.",
+ "D. Suspension performance demonstration."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What sound is captured in the audio when the vehicles descend the slope in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The sound of an impact.",
+ "B. The sound of a horn honking.",
+ "C. The sound of a door opening.",
+ "D. The sound of the engine stalling."
+ ]
+ }
+ },
+ "AtIFdhUj": {
+ "video_id": "AtIFdhUj",
+ "video_duration": "127s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A close-up shot shows a dark-colored bowl with a light-colored rim, filled with a simmering yellowish liquid. The scene changes to a kitchen, where a man in a blue and white checkered shirt stands behind a kitchen island. On the island are two of the same bowls from the previous shot, along with a bowl of croutons, a bundle of thyme, some onions on a plate, a block of cheese on a wooden cutting board, and a plant in a clear glass pot. The man sprinkles croutons from the small bowl into the two soup bowls, then shreds some of the cheese over the croutons and places more cheese slices on top of the soup. Next, he places the two bowls in an oven. After some time, he removes them, wearing green oven mitts. Now, the cheese on top of the soup is melted and bubbly. He sets the bowls back down on the kitchen island. Then he picks up a spoon and digs into one of the bowls, taking a bite of the cheesy soup and reacting with a smile. He stirs the contents of the bowl a few times with the spoon, then takes another spoonful and eats it.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Which cheese did the man use last?",
+ "answer": "D",
+ "candidates": [
+ "A. Parmesan cheese.",
+ "B. Roquefort cheese.",
+ "C. Proven\u00e7al cheese.",
+ "D. Swiss cheese."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What material is the handled bowl in the video made of?",
+ "answer": "A",
+ "candidates": [
+ "A. It is made of ceramic.",
+ "B. It is made of wood.",
+ "C. It is made of plastic.",
+ "D. It is made of glass."
+ ]
+ }
+ },
+ "VNxzNqDY": {
+ "video_id": "VNxzNqDY",
+ "video_duration": "117s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Sports Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a black screen. This transitions into a scene of a wrestling ring, with a man in a suit standing in the middle. There are portraits of two other people in the upper and lower corners of the screen. Text in a speech bubble appears above the man in the suit, followed by text in the same format. The scene changes to a wrestling match between two characters, with a referee standing off to the side. There is a view of the spectators in the background. The action is shown between multiple characters before cutting to an image of a gold championship belt, with multiple menu text options in a red colored font over the image. The scene changes back to wrestling matches between different characters. At one point a green chair appears, and is held by one of the characters. The video ends with a fight between multiple people in the ring.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man wearing purple pants do with the chair in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Used it to climb into the stage/ring.",
+ "B. Struck the referee on the head with it.",
+ "C. Gave it to the red-haired man to sit on.",
+ "D. Struck the red-haired man on the head with it."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man in purple pants punch the referee in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. One.",
+ "D. Three."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "When the video shows the game character's avatar, where is HAWK positioned in relation to ANIMAL?",
+ "answer": "C",
+ "candidates": [
+ "A. Below.",
+ "B. On the right.",
+ "C. On the left.",
+ "D. Above."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a steel chair in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No, there is not.",
+ "B. I am not sure.",
+ "C. Yes, there is."
+ ]
+ }
+ },
+ "uLGYkGSg": {
+ "video_id": "uLGYkGSg",
+ "video_duration": "601s",
+ "duration": ">8min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing black gloves reaches into the dashboard of a vehicle, where the stereo has been removed, and pulls out a small electrical connector. The person then holds up a wiring harness with multiple colored wires and connectors. They connect the wiring harness to the back of a new stereo unit. They then hold the stereo with attached wiring, position it near the dashboard opening, and connect a wire to the back of the stereo. They slide the stereo into the dashboard slot until it clicks into place. Then they use a screwdriver to tighten two screws on either side of the stereo faceplate, securing it to the dashboard. Finally, the person adjusts the stereo settings.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "After inserting the radio into the dashboard slot, what did the man do?",
+ "answer": "B",
+ "candidates": [
+ "A. He begins to tune the radio.",
+ "B. He secures the radio to the vehicle frame with screws.",
+ "C. He checks the radio's antenna.",
+ "D. He connects the radio's power cable."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "How does the sound from the radio change in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From a man's voice to music and then to noise.",
+ "B. From noise to a woman's voice and then to music.",
+ "C. From noise to music and then to a man's voice.",
+ "D. From music to noise and then to a woman's voice."
+ ]
+ }
+ },
+ "mRoiTfBN": {
+ "video_id": "mRoiTfBN",
+ "video_duration": "93s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Tech Reviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with red hair is holding a black box with the word \u201cRESPAWN\u201d printed in large white font with red accents. The box appears to have a picture of a chair on it. The perspective changes and the video shows a tunnel of some kind with the word \"ULTIMATE\" overlayed on it. Next, a man stands in front of what appears to be a network closet with a wall mounted router. The text \u201cULTIMATE TECH\u201d appears on the screen. The camera shows what appears to be many ethernet cables connected to various network devices with the text \u201cULTIMATE TECH HUB\u201d on the screen. There is a video game scene of two characters fighting with the same text \u201cULTIMATE TECH HUB\u201d on screen. Next, a hand is seen connecting a cable to a black router device with the text \"ULTIMATE TECH HUB\" on the screen. A hand is seen on a computer motherboard and the same text \"ULTIMATE TECH HUB\" is shown again. A person is shown with a black box that has a green light on it and is in front of an open box with network devices inside, with \u201cULTIMATE TECH HUB\u201d text on screen. A pair of hands is seen working on a computer motherboard and the text \"ULTIMATE TECH HUB\" is once again displayed. A person is shown pointing with his right hand at the network devices and router with the text \"ULTIMATE TECH HUB\" on the screen. The same person is pointing with his hands again but just the text \u201cTECH HUB\u201d is shown. A black and red box with \"RESPAWN\" on it is shown with a camera slowly panning up, then showing a price of $149.00 on top of the box, then the box is shown again, then finally the camera shows a chair that has red and black leather with a footrest extended on the floor. The camera pans around the chair. Then the camera is facing the front of the chair. The camera view changes again to show a person holding the black and red box, moving it toward the camera. The person then uses a knife or box cutter to cut the tape on the box. After opening the box, the person removes items. The camera focuses on various parts of the chair on the floor.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "In the video, how many colors does the man wearing a red short-sleeved shirt say the gaming chair is available in?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Five.",
+ "C. Three.",
+ "D. One."
+ ]
+ }
+ },
+ "cYpYInuM": {
+ "video_id": "cYpYInuM",
+ "video_duration": "73s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing glasses, a black apron, a white button-down shirt, and a blue tie is in a kitchen. The person is pouring the contents of a dark-colored pot into a blender. Then, the person uses a spoon to stir the contents of the pot. Next, the person adds a small amount of liquid from a measuring spoon into the blender. Then, the person begins chopping dark green leaves on a light-colored cutting board using a knife. The person places the chopped leaves into the blender. The person then cuts avocados into halves and removes their pits with a spoon and places the halves onto the cutting board.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of steps for the man to prepare the dish shown in the video? (a) Chop the mint leaves. (b) Add the mint leaves to the blender. (c) Add vanilla extract to the blender. (d) Place the avocado in the blender. (e) Cut the avocado lengthwise and remove the flesh.",
+ "answer": "D",
+ "candidates": [
+ "A. (e), (d), (c), (a), (b).",
+ "B. (a), (c), (b), (e), (d).",
+ "C. (e), (d), (a), (b), (c).",
+ "D. (c), (a), (b), (e), (d)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man in the video like mint?",
+ "answer": "B",
+ "candidates": [
+ "A. Because it helps promote digestion and absorption.",
+ "B. Because it is rich in polyphenols, which can improve mood.",
+ "C. Because it can be used to decorate food.",
+ "D. Because it adds a refreshing flavor."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "How did the man in the video remove the avocado's skin?",
+ "answer": "D",
+ "candidates": [
+ "A. A peeler.",
+ "B. The tip of a knife.",
+ "C. A spoon.",
+ "D. A hand."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occur in the state of the blender in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Remains stationary.",
+ "B. From stationary to running.",
+ "C. From running to stopped.",
+ "D. Continuous operation."
+ ]
+ }
+ },
+ "vEuDeXpS": {
+ "video_id": "vEuDeXpS",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows several clips of basketball games. In each clip, multiple people wearing basketball uniforms of different colors are seen playing basketball on a basketball court. Some clips show people jumping and throwing the basketball towards a basketball hoop, and sometimes the ball goes through the hoop. There are also clips showing the game score on a screen. At the end of the video, a person wearing a plaid jacket and dark hoodie walks onto the basketball court. There is also a text box at the bottom of the screen that says \"CHRIS WHITNEY\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "At the beginning of the video, what is the man wearing the white number 8 jersey doing to the man wearing the white number 12 jersey?",
+ "answer": "B",
+ "candidates": [
+ "A. Changing direction while dribbling.",
+ "B. Making an out-of-bounds pass.",
+ "C. Passing the ball hand-to-hand.",
+ "D. Setting a screen."
+ ]
+ }
+ },
+ "NevFMnKR": {
+ "video_id": "NevFMnKR",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A starting gate is shown with 8 numbered and colored slots. Several people in red vests stand behind it. A red light is lit above the gate. The slots open, and eight dogs wearing differently-colored wraps emerge. They run around a racetrack. A group of people watch from the other side of the track. The camera follows the dogs as they run along the track.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a dog in the video that the male commentator did not mention?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I don\u2019t know."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Which dog will win next?",
+ "answer": "A",
+ "candidates": [
+ "A. The dog in red.",
+ "B. The dog in black and white stripes.",
+ "C. The dog in green.",
+ "D. The dog in yellow."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the name of the dog wearing pink?",
+ "answer": "A",
+ "candidates": [
+ "A. Galactic Panthe.",
+ "B. Vare.",
+ "C. L'Ectrostomp.",
+ "D. Magellus Ferry."
+ ]
+ }
+ },
+ "zKfbqrBG": {
+ "video_id": "zKfbqrBG",
+ "video_duration": "111s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman wearing a blue tank top and black shorts, with a number '245' visible on the front of the tank top, is jogging on a paved road. In front of her, a light gray van is visible with a sign on the roof that says \"Carlsbad\". There are a number of cars driving behind the woman. A group of people are standing in front of a white stone wall, and palm trees are visible behind them. They are wearing blue shirts with \"USMS\" written on them, with variations in the sleeve length of the shirts. Three people are visible in a room, two women are seated at a table. The woman in the middle is holding a binder that says \"Team 216\". Several people are standing and seated in a room. The woman holding the \"Team 216\" binder is sitting on a desk. A group of people are sitting and standing in a room, with a pizza on the floor in the front of the frame. Two women are standing and talking next to cubicles. One woman is wearing an olive green shirt and khaki pants, and the other woman is wearing a black shirt with blue jeans. Three women are standing and talking in an office, next to several cubicles. Two of the women are wearing olive green and black shirts with khaki pants, and the third woman is wearing a dark blue shirt with blue jeans. Two women are looking at computer screens. One woman is wearing an olive green shirt and the other is wearing a dark blue shirt. Two women are standing and talking in a room. A large group of people are standing and cheering in a large hall. A woman is sitting in front of an American flag and a poster. She is wearing a dark blue jacket and an olive green shirt. Several blue drawstring bags with gray stripes are visible. The bags have a gold circular logo. Several people are standing in a room, some holding food. A woman is standing and talking in front of a group of people. She is holding a blue drawstring bag. A woman is standing in front of an American flag. She is wearing a dark blue jacket and a necklace. A group of women are standing side by side outside. They are wearing blue shirts and navy pants with the 'USMS' logo visible on the shirt. A close up shot shows the logo of \"US Marshals Women's Team\" on a blue shirt. A woman is holding a child. She and another woman are talking and standing in a room with other people. A poster with several pictures and names on it is visible. The title of the poster says, \"Our Fallen\". A close up shot shows a black rectangular object being held, with text engraved on it. Three people are standing on a gravel ground, with orange cones in the background. A woman is wearing a light blue tank top with a tag that says \"266\" on her back.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Was the pizza on the table eventually finished?",
+ "answer": "C",
+ "candidates": [
+ "A. Half of it is still left.",
+ "B. One quarter of it is still left.",
+ "C. No, it wasn\u2019t.",
+ "D. Yes, it was completely eaten."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In which direction relative to the red candle is the red water cup located when the video starts?",
+ "answer": "C",
+ "candidates": [
+ "A. On the left.",
+ "B. At the front.",
+ "C. At the back.",
+ "D. On the right."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "From which direction does the applause come when the woman in green speaks in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The applause comes from the person on the left side of the screen.",
+ "B. The applause comes from behind the speaker.",
+ "C. The applause comes from the person on the right side of the screen.",
+ "D. The applause comes from in front of the speaker."
+ ]
+ }
+ },
+ "GAfirOdo": {
+ "video_id": "GAfirOdo",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A woman with blonde hair is standing in front of a map. She is wearing a maroon dress. In the bottom left corner, there is a graphic labeled 'WEATHER ALERT' and another graphic labeled 'STORM WATCH'. Above the map is the text 'HOUR BY HOUR', along with a circle containing the time. The map shows a region with several labelled cities such as 'Rutland', 'Manchester', 'Boston', and 'Providence'. Initially the map has a single color over an area, then it shows areas of white, purple, and pink. In a few frames, a graphic of a snowplow appears in the top left, followed by the text 'Expected Snow Wednesday into Thu. A.M.' and areas of the map become labeled with text such as '0\"+', 'C-2\"', '2-5\"', and '5-10\"'. The woman is gesturing at different parts of the map.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occur in the animation within the model when the video starts?",
+ "answer": "B",
+ "candidates": [
+ "A. The area shrinks from small to large.",
+ "B. The area shrinks from large to small.",
+ "C. There are no changes.",
+ "D. The area becomes larger."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the woman in the video turn to look at the screen showing the map model?",
+ "answer": "B",
+ "candidates": [
+ "A. Six times.",
+ "B. Seven times.",
+ "C. Five times.",
+ "D. Three times."
+ ]
+ }
+ },
+ "lmJDMJgG": {
+ "video_id": "lmJDMJgG",
+ "video_duration": "509s",
+ "duration": ">8min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman and a man are sitting in the front seats of a yellow Jeep. The woman is wearing a yellow, red, and blue jacket and has long braids. She is looking at her phone. The man is wearing a black t-shirt and a black beanie. He is eating something from a small white cup. The woman shows the man something on her phone. They talk to each other. The man laughs. They continue talking. The woman gestures with her hands. The man eats from the cup again. They laugh together. The woman looks at the camera and talks while gesturing with her hands.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What did the man in the video do to prank the woman?",
+ "answer": "A",
+ "candidates": [
+ "A. Tricked her into thinking she wanted to cheat.",
+ "B. Mentioned a fake destination for the drive.",
+ "C. Showed photos of other women.",
+ "D. Did not share any fried chicken with her."
+ ]
+ }
+ },
+ "IgAcPnSN": {
+ "video_id": "IgAcPnSN",
+ "video_duration": "635s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A hand holds up a purple Twirl chocolate bar. Two young women sit on the floor in front of a couch. The woman on the right wears a light blue baseball cap and a gray Marvel hoodie. The woman on the left wears a black tank top. They each eat a piece of the Twirl. They then eat a macaron. They hold up small, unwrapped chocolates and show them to the camera. Next, the woman wearing the hat holds up a yellow Coffee Crisp bar. They both eat a piece of the Coffee Crisp. The person wearing the hat places her hands on her ears, and animated images of the Knight Bus from Harry Potter appear and drive by on the screen. Next, the woman wearing a hat holds a small, round chocolate-covered biscuit with white filling and takes a bite. She holds a small, unwrapped, golden, rectangular chocolate bar and eats it. Next, she holds up an orange package of Jaffa Cakes. She opens it and holds a Jaffa Cake, describing it to the other woman, and then takes a bite.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "In the video, what does the woman wearing a hat do with the first piece of candy she tries?",
+ "answer": "B",
+ "candidates": [
+ "A. She throws half of the candy away.",
+ "B. She breaks the candy in half.",
+ "C. She bites the candy in half.",
+ "D. She crushes the candy into crumbs."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "In the video, how many types of snacks does the woman display?",
+ "answer": "A",
+ "candidates": [
+ "A. Eight.",
+ "B. Five.",
+ "C. Seven.",
+ "D. Ten."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Which food received the highest score during the tasting process?",
+ "answer": "B",
+ "candidates": [
+ "A. Aero Milk is a chocolate bar with a bubbly texture.",
+ "B. Jaffa Cakes are named after the Israeli city of Jaffa.",
+ "C. Jammie Dodgers are a type of biscuit.",
+ "D. Cadbury Twirl is a chocolate-coated bar with flaky layers."
+ ]
+ }
+ },
+ "sYwryQCj": {
+ "video_id": "sYwryQCj",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Two hands are shown on a light-colored table, one hand is pushing a stack of bills to the other hand. The table also has a clipboard with papers, a pen, and eyeglasses. A man in a suit is shown speaking behind a podium, with a green marble wall behind him. A large orange tanker ship is shown sailing in the water with a city in the background. A man in a suit, with a pink tie, is shown speaking in front of two flags. A close up shot of several hundred dollar bills is shown. A digital animation of black oil drums is displayed over an outline of the world. A man in a black suit and tie walks on a runway in front of a marching band. A man in a suit is shown sitting in a chair, then standing and walking away from the chair. A man in a suit, with a purple patterned tie, walks through a doorway. A man in a suit and blue tie is shown speaking behind a podium with an \u201cINDIA-AFRICA FORUM SUMMIT\u201d logo on it. A man in a suit walks behind another man at a podium in front of people seated in the room. A man in a suit is shown at a podium pulling papers from a red folder. A tan building with a gold dome is displayed. The building has the words \u201cRIGGS NATIONAL BANK\u201d on the front. The word \u201cFINE\u201d is displayed in large letters and a red box over the building. Several men in suits are sitting at a table in an assembly hall with signs in front of them. One of the signs says \u201cEquatorial Guinea\u201d.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What action does the U.S. government take regarding Obi at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The U.S. government fines Obi and conducts a thorough investigation.",
+ "B. The U.S. government invites Obi to a major conference.",
+ "C. The U.S. government invites Obi to take office.",
+ "D. The U.S. government protects Obi's personal safety."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where was the man wearing a black tie in the video in 2003?",
+ "answer": "C",
+ "candidates": [
+ "A. In the room.",
+ "B. In the car.",
+ "C. At the airport.",
+ "D. In the office."
+ ]
+ }
+ },
+ "VcqxSZee": {
+ "video_id": "VcqxSZee",
+ "video_duration": "187s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A tall, dark, multi-tiered structure with two cylindrical towers is shown in the distance on a flat, dark-colored landscape with a grey sky above it. This structure is then shown at a closer angle, revealing that it has several platforms and a rocket attached to the front of the structure. The rocket has an orange tip and its body is a mix of white and gray, with the words \"United States\" visible on the side. A white substance begins to appear at the base of the rocket, first in a fine spray and then as a large plume, as well as flames. The rocket then detaches and starts ascending into a dark blue sky, leaving a bright white, fiery trail behind it. The camera continues to follow the rocket as it flies higher and higher, leaving a large cloud behind it, until finally the rocket is seen moving away as a white, blurred outline with three small lights attached to the side. The video ends with the same white object being shown very far away with several small lights on its surface.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What abnormality occurred in the video before the launch of 41-B?",
+ "answer": "C",
+ "candidates": [
+ "A. Authorization for guided convergence startup was not granted.",
+ "B. Authorization for starting with full throttle was not granted.",
+ "C. Authorization for 9-8 startup was not granted.",
+ "D. Authorization for automatic sequence start was not granted."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the launch process of an aircraft?",
+ "answer": "C",
+ "candidates": [
+ "A. Ignition, liftoff, separation of external fuel tanks, roll maneuver, and separation of solid rocket boosters.",
+ "B. Ignition, liftoff, roll maneuver, separation of solid rocket boosters, and separation of external fuel tanks.",
+ "C. Ignition, liftoff, roll maneuver, separation of external fuel tanks, and separation of solid rocket boosters.",
+ "D. Ignition, liftoff, separation of solid rocket boosters, roll maneuver, and separation of external fuel tanks."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "At what distance from the ground does the external fuel tank detach?",
+ "answer": "C",
+ "candidates": [
+ "A. 30 nautical miles.",
+ "B. 37 nautical miles.",
+ "C. 15 nautical miles.",
+ "D. 8 nautical miles."
+ ]
+ }
+ },
+ "JPXMaDMn": {
+ "video_id": "JPXMaDMn",
+ "video_duration": "67s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Swimming",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A swimming pool is shown with starting blocks at one end. People are in the stands to the left and right of the pool. Multiple flags are hanging above the stands. Eight swimmers position themselves on the starting blocks. They all dive into the pool and start swimming. A counter shows time in seconds at the bottom left of the video, as the swimmers move along their lanes. The swimmer in lane 4 finishes the race, and his time is shown. The same swimmer takes off his swim cap and goggles. Then he hugs a bald swimmer. A swimmer with a bald head holds on to the side of the pool and gives a thumbs up with his right hand. He smiles, looks around, and gives another thumbs up with his left hand. Then he looks to the left.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many gold medals has the bald man won?",
+ "answer": "D",
+ "candidates": [
+ "A. 1.",
+ "B. 3.",
+ "C. 4.",
+ "D. 2."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the brown-haired man and the bald man do after the competition?",
+ "answer": "A",
+ "candidates": [
+ "A. They are high-fiving.",
+ "B. They are hugging.",
+ "C. They are fighting.",
+ "D. They did nothing significant."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the hat that the man takes off at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Green.",
+ "B. Black.",
+ "C. White.",
+ "D. Blue."
+ ]
+ }
+ },
+ "dzfksxTY": {
+ "video_id": "dzfksxTY",
+ "video_duration": "207s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A cartoon character with green hair and large glasses is shown with a pink background with white and yellow circles. A title card appears with the text \"pinkfong 3 sing-along movie catch the gingerbread man\" and \"stage clip.\" The words \"ice cream song (movie ver.)\" also appear. The cartoon character's mouth is open. A musical note symbol appears in a pink box. Another cartoon character appears beside the first one, this one is pink with a crown. They are both standing on a multicolored platform with a dark background with white dots. The two characters walk in a line towards the viewer. Confetti is falling. They dance on the platform with a purple background. The camera pans out to show a large ice cream sundae in a pink dish. The two characters stand to the sides of the sundae. Confetti is falling. The camera zooms into the sundae. Text appears reading, \"oh!\" The characters then say, \"Ice! Ice! Ice!\", \"Ice Cream!\", \"Yummy yummy in my tummy!\", \"Ice cream! Ha!\" and then say, \"Are you ready to scream for ice cream?\" The camera zooms out to show the characters on a stage, the background is purple and confetti is falling. Text appears reading, \"yay!\", \"yum! yum!\", \"Yummy tasty sweet\" and \"tum! tum!\". The characters say, \"Tummy goes party\", and \"This is my favorite treat\". They then say, \"Ice! Ice! Ice cream!\" and \"When the day is hot and steamy, ice cream!\". They say, \"If you're happy and you know it, ice cream! Don't let it melt away! Lick it up!\". They say, \"Are you ready to scream for ice cream?\" The background is light purple with snowflakes. They say, \"Yay!\", and then dance with one another. They say, \"yum! yum!\", \"Yummy tasty sweet\", \"tum! tum!\", \"Tummy goes party\", \"This is my favorite treat\". They then say \"Ice! Ice! Ice cream!\", and \"Can I have more ice cream?\" The background is purple and confetti is falling. A blue title card appears with the text \"pinkfong 3 sing-along movie catch the gingerbread man\" and \"stage clip\" and \"baby shark.\" The two cartoon characters stand on a platform with a blue screen in the background. A musical note icon appears in a pink box. The two characters walk to the center. Text appears that says \"baby shark\". The two cartoon characters are next to each other and wave their hands. A yellow cartoon shark appears in the background. There is an underwater scene with plants. The two characters face each other and tap their hands. The background changes to show a yellow cartoon shark with a smiling face. The background changes to a yellow background with a red star and the text \"Mommy shark\". The background changes to a blue background with a pink shark and the text \"Mommy shark\". A green background appears with the text \"Daddy shark.\" The background changes to a blue background with a yellow shark and the text \"Grandma shark\". The background changes to a pink background with a yellow shark. The characters say \"grandpa shark,\" and \"Let's go hunt, doo doo doo doo doo doo.\" They say, \"Run away.\" then, \"Safe at last,\" then, \"It's the end!\" Various underwater scenes show different cartoon sharks. The video ends with one pink cartoon character and the option to subscribe to the channel. There is a search bar that appears with a YouTube logo. The words \"search pinkfong\" appear.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many different colors of overalls does the fox wear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. Zero.",
+ "D. One."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the possible relationship between the small yellow shark and the large blue shark in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Mother and son.",
+ "B. Grandfather and grandson.",
+ "C. Couple.",
+ "D. Father and son."
+ ]
+ }
+ },
+ "khNHbdJY": {
+ "video_id": "khNHbdJY",
+ "video_duration": "265s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a first-person perspective from inside an IndyCar cockpit during a race. The driver, wearing a DHL-branded helmet, is initially in 18th position and closely follows another car. The driver overtakes and moves to 17th. The camera shifts to show the view over the top of the car, looking down at the front wing. The driver overtakes other cars, moving up to 14th position. After a caution period, the race restarts. The driver continues to follow other cars closely. During the race, the red and black Bryant car spins out in front of the DHL car, almost making contact. The driver maintains control and continues racing. The camera angle changes back to the in-cockpit view. The driver removes a tear-off from their visor. Raindrops begin to splatter the visor. The race is stopped due to rain. The track becomes wet and dark. The race restarts, now at night under track lighting. The driver overtakes to 13th position. The driver races side-by-side with the orange McLaren car and pulls ahead into 12th. The race progresses, with the driver maintaining their position. The video ends with the driver crossing the finish line in 13th place.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the electronic \"didi\" sound play in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Eight.",
+ "B. Ten.",
+ "C. Seven.",
+ "D. Nine."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual incidents occurred during the racing competition?",
+ "answer": "A",
+ "candidates": [
+ "A. The race was paused due to rain.",
+ "B. A major car accident occurred, causing the competition to end ahead of schedule.",
+ "C. A race car crashed into the guardrail, leading to a pause in the competition.",
+ "D. The competition ended early due to rain."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "Is someone guiding another person on how to drive?",
+ "answer": "C",
+ "candidates": [
+ "A. No, no one is providing guidance.",
+ "B. I am not sure.",
+ "C. Yes, someone is providing guidance."
+ ]
+ }
+ },
+ "hXXqCVrL": {
+ "video_id": "hXXqCVrL",
+ "video_duration": "45s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a large room with many people sitting at desks. The room has green walls, and some of the walls have a pattern of white circles. Most of the people are clapping. One person, sitting at a desk labeled \"PHILIPPINES,\" picks up a red cloth and puts it to their face. A person wearing a striped black and white shirt with a blue lanyard and a red badge is standing next to them, as well as another person in a blue suit. The person from \"PHILIPPINES\" takes the cloth away from their face. The camera pans over the room and then the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, from which direction do the rhythmic applause and slogans seem to be coming?",
+ "answer": "C",
+ "candidates": [
+ "A. At the forefront of the screen, at the end of the passage.",
+ "B. From the door on the right side of the screen.",
+ "C. In the crowd at the front-left of the picture.",
+ "D. From the podium behind."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "At the beginning of the video, as the camera pans to the right, which country's name is displayed on the desktop of the empty seat?",
+ "answer": "A",
+ "candidates": [
+ "A. Suriname.",
+ "B. Nauru.",
+ "C. The Philippines.",
+ "D. Zambia."
+ ]
+ }
+ },
+ "NKMVutdT": {
+ "video_id": "NKMVutdT",
+ "video_duration": "228s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a series of still images, some with overlaid text or animated characters. The first images are of a hallway with yellow walls, a tan floor, and rectangular ceiling lights. The view shifts slightly from a blurred perspective to a more focused view. Then, a single brown chair appears against a yellow, patterned wall and tan floor. The video zooms in on the chair. Next, an animated character appears. The character is a fair-skinned individual with pointed ears and a red dot on their forehead. They are wearing a gray suit with a white shirt and a red tie. They are smiling and giving two thumbs up, with the text \"you probably already knew that\" beneath them. The next several images are of a dark hallway with water on the floor and concrete pillars. The perspective changes slightly between them. Then, a similar animated character appears, this time with a contemplative expression. The text \"and you should be in level 2\" is displayed beneath them. The following scenes are of a dark hallway with pipes along the ceiling, with one individual walking down the hallway. The video shows a green structure, and text appears: \"there's like a chance of entities chasing you and it's kind of hot and steamy\". The video changes back to the dark hallway and shows more of this structure. Then there is a quick shot of an exit door that reads \"FIRE EXIT\" on a green background with yellow arrows. The text \"open\" appears, followed by \"open it up and gon inside\". Then the animated character reappears, smiling and giving two thumbs up. More shots of dark hallways appear, then there is another animated character with a contemplative look. The text \"sort of dangerous\" appears below them. The video then goes through multiple shots of different locations. There is a rusty tunnel, then a hallway with prison bars on either side and a door further down the hall. There is a close up shot of a stainless steel door with a label that reads \"office sector\". 
There are more shots of the animated character looking happy, followed by contemplative, then a strange looking animated character with long arms and a smiling face appears. This is followed by one person walking down a bright tunnel. Then the animated character gives a thumbs up, text appears that reads \"this one is labeled office sector\", a shot of the stainless steel door appears with the text \"and you'll be sent to level 4\" below it. There is another character giving a happy smile, and a shot of office cubicles, and then an emergency supply kit. The animated character is contemplative and the text \"for what's coming later\" appears below them. There is a shot of an office with several desks and chairs, then a red door. The animated character is smiling, and the text reads \"and you'll walk directly into a level five\". Next, there is a shot of a dimly lit room filled with various items, followed by a large room with a single table in the center. A moth with glowing eyes appears, followed by the animated character looking contemplative. Next, there is a shot of an open elevator and the text \"and press a button that says 'Boiler Room'\". There is a shot of a hallway filled with pipes, followed by a boiler room. The animated character is smiling and waving. There are further shots of the boiler room, followed by a dark hallway, a group of wooden signs on the floor, and a black flashlight. There is a shot of a dark doorway, followed by the text \"when you find that wall...\". There are more dark hallways, and text that says \"and you should be sent to level 8\". A series of images shows a cave with water and stalactites. The animated character gives a thumbs up. More images appear, a very dark tunnel with a dark figure, and the animated character appears again. The video shows more of the dark tunnel and then says \u201cand you should be sent to level 9\u201d. 
The video shows a neighborhood of houses at night, a dark shot of a beach, with text that reads \u201crun into it and keep running\u201d. The animated character stands and poses again. Then, there is a very dark image of a beach, and a dimly lit shot of a wheat field. A pumpkin patch appears, and then the animated character is looking contemplative. More shots of the wheat and pumpkin fields are shown.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How do the protagonist's emotions change from level 4 to level 5 in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. From relaxation to fear and danger.",
+ "B. There is no change.",
+ "C. From happiness to disappointment and frustration.",
+ "D. From fear to relaxation and calmness."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "On the third level, why do players need to run in a straight line to find the door marked \"office sector\"?",
+ "answer": "B",
+ "candidates": [
+ "A. Because running in a straight line is the fastest way to reach the destination.",
+ "B. To make it easier to locate the correct door.",
+ "C. To conserve energy.",
+ "D. To avoid complex layouts and dangerous entities."
+ ]
+ }
+ },
+ "SjPnEQhK": {
+ "video_id": "SjPnEQhK",
+ "video_duration": "602s",
+ "duration": ">8min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with a shaved head, wearing a dark striped sweater and dark pants, stands on a stage and speaks to an audience. He holds a piece of paper in his hands and occasionally gestures with his hands. Two tall black stools stand behind him on the stage. A large screen behind him displays various graphical elements, including barcodes, stylized landscapes, and abstract shapes. The screen\u2019s visuals change periodically. The audience, mostly visible as dark silhouettes, sits in rows facing the stage. One audience member holds up a phone, apparently recording the speaker. The video then transitions to a series of text slides on a black background, each with a magenta and red barcode design at the bottom. The first slide displays a quote: \u201cIt's not radical to critique our current society. It's radical to build a better one.\u201d The second slide shows the heading \"OUR BELIEFS\" followed by bullet points: Institutions Can Be Redesigned for Better; Rigorous Methodology is Useful; Moral/Social Principles Are Most Important; Diversity is Essential; People Are Good. The third slide shows the heading \"OUR VALUES\" followed by: Openness; Earnestness; Social Innovation; Dignity. The next slide shows \"CONFERENCE DESIGN\", then \"Event Flow\", then: Day 1: Inspire and Connect; Day 2: Learn and Experience; Day 3: Catalyze Action. Another \"CONFERENCE DESIGN\" slide follows, with \"THEMES\" and horizontal bars labeled: Politics (orange); Space (green); Collective Action (purple); Identity (pink); Economy (blue). A further \"CONFERENCE DESIGN\" slide lists: Art Exhibits, Workshops, Games Night, Movie Screenings; Open dialogue/Unconference; Intentional Diversity/Respect for Diversity; Your Conference. 
The final slide shows \"COLLEGE FOR CREATIVE STUDIES\", \"RICHARD ROGERS, PRESIDENT\", the College for Creative Studies logo (a yellow square on a black background, with the college name in white), and a small image of the speaker in the bottom right corner. Various graphical elements continue to appear on the slide\u2019s background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many stools were visible on the stage in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Four stools.",
+ "B. Five stools.",
+ "C. Three stools.",
+ "D. Two stools."
+ ]
+ }
+ },
+ "ezJBgnEM": {
+ "video_id": "ezJBgnEM",
+ "video_duration": "116s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a wide shot of a snow-covered yard. There is a dark, thawed patch of ground in the center of the yard. In the background, there is a wooden fence, some bushes, and a tree. A hand appears in the frame and points to the thawed ground. A person wearing a blue shirt and black pants with black boots enters the frame. This person uses a long, silver implement to interact with the thawed ground. They place a silver, wrapped object on the ground and then use the implement to light it. Smoke begins to rise from the object, at first in a thin plume and then in a much larger cloud. The smoke then engulfs the object, obscuring it from view. The camera moves to show the smoke rising and drifting through the air, and up towards the rooftops of nearby buildings, with the smoke eventually billowing out of the frame, obscuring much of the visual. The camera then moves back down towards the initial object, which is still emitting smoke. The smoke recedes and the wrapped object is visible again. A flame appears on the object, which is now burning with a visible fire. The camera moves to show a close up of the object, revealing it to be a foil wrapped bundle. After a while, a hand appears with a small clump of snow. The hand touches the clump of snow to the wrapped bundle. The video concludes with shots of the snow-covered ground, the fire, and then the final shot is again of the bundle.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What happens in the middle of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A smoke grenade explodes.",
+ "B. A man throws snow at a smoke grenade.",
+ "C. The smoke grenade catches fire.",
+ "D. The smoke from the smoke grenade dissipates."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many types of sounds are produced when the smoke grenades burn in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Two.",
+ "D. One."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man in the video throw snow at the smoke grenade?",
+ "answer": "C",
+ "candidates": [
+ "A. Four.",
+ "B. One.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "UDWxKMwc": {
+ "video_id": "UDWxKMwc",
+ "video_duration": "212s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with long brown hair is typing on a laptop. The person's head is resting on their forearms, which are on a table. Then, the image changes to a different man yelling, with an image of the first person visible behind him. Then, the word \u201cHELP\u201d appears on the screen. Then a graphic showing two stick figures, one standing holding a sign saying \u201cHELP\u201d and one sitting appears. Next is a skeleton sitting at a desk looking at a monitor. A view of a skeleton hand holding a computer mouse appears. This is followed by a picture of a red and silver plane that says \u201cNEW YORK AIR\u201d on the side. Next is the same plane in the sky and then over an ocean. Then, an image of an ocean and a blue sky with clouds appears. This is followed by a picture of dark waves on a stormy day. Then there is the same ocean image again, but this time it is calmer with more sun. Then, a world map with red and orange airplane graphics all over it appears. The image changes to show the continent of South America, then Brazil, and then Africa. A map of South Africa appears. The word \u201cPOOF!\u201d appears in a cartoon bubble, and then, a GPS satellite graphic appears. Then an image of numerous laptops all showing the word \u201cERROR\u201d in red appears. This is followed by an image of a white and blue airplane. Next, the same plane appears in a nighttime scene. The word \u201cPOOF!\u201d appears in a blue cartoon bubble. The image of the same plane with the \u201cPOOF!\u201d graphic behind it appears. Then, the same plane is shown in the sky, followed by the word \u201cPOOF!\u201d in a red bubble. Next, a wizard graphic appears next to the word \u201cPOOF!\u201d. Then, an image of a yellow bicycle safety sign that says \u201cPLEASE LOOK THANK YOU\u201d appears. This is followed by a black screen that says \u201cPlease see links in the description for flight logs\u201d. 
A black screen that says \u201cThe planes disappear before the log entries show the gap.\u201d appears. A black screen appears. A graphic that contains words relating to computers and computing appears. A graphic with the word \u201cESTIMATE\u201d and many small empty circles appears. This is followed by the word \u201cMIRACLE\u201d in blue. A world map with orange and red graphics of airplanes on it appears with the word \u201cMIRACLE\u201d in blue over it. Then there is a graphic of a GPS satellite and finally a graphic of an airplane falling downwards. Then a world map with continents in different colors appears and the word \"MIRACLE\". Two side-by-side graphics of the northern and southern hemispheres of Earth appear with the word \u201cMIRACLE\u201d. A world map graphic of orange and red airplanes appears. The words \"Northern Hemisphere\" and \"Southern Hemisphere\" are placed on the graphic. A graphic of a map of the United States with a US flag design, with red and orange airplane graphics all over, appears. Then, the same graphic with just the US map and flag on it appears. Then an image of a family with two daughters with a US flag design on it is shown. The same family photo is shown again. Then an image of a family in an airplane is shown, followed by a view of Earth and finally, a map of the South Atlantic Ocean. A view of a world map with red and orange airplane graphics over a map appears with \"South Atlantic Ocean\" at the bottom of the graphic. Then, the same world map with orange and red airplane graphics appears. Lastly, an image of a white airplane flying appears.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the significance of multiple airplanes appearing on the map at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. It aims to emphasize the strength of the United States.",
+ "B. It aims to showcase the accuracy of GPS positioning.",
+ "C. It aims to show the large number of aircraft possessed by the United States.",
+ "D. It aims to better illustrate the map's layout."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Is a video displayed in the footage?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes, it is shown.",
+ "B. No, it is not shown.",
+ "C. I am not sure."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why do GPS signals and flight data reappear when an airplane is nearing its destination?",
+ "answer": "B",
+ "candidates": [
+ "A. The GPS system automatically repairs itself.",
+ "B. GPS signals are activated manually.",
+ "C. The GPS signal becomes too strong.",
+ "D. The GPS system is designed to function this way."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order are the events in the video arranged? (a) Focus on the coastlines of the Southern Hemisphere. (b) The red airplane no longer crosses the ocean. (c) The author's friend sends a link. (d) Track an airplane flying from Brazil to South Africa.",
+ "answer": "C",
+ "candidates": [
+ "A. (a) (b) (c) (d).",
+ "B. (a) (d) (b) (c).",
+ "C. (b) (a) (d) (c).",
+ "D. (d) (b) (a) (c)."
+ ]
+ }
+ },
+ "yTMqWGsi": {
+ "video_id": "yTMqWGsi",
+ "video_duration": "190s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with dark hair, a beard and wearing glasses and a black cap, is positioned facing the camera in front of a dark leather sofa. They are speaking into a black microphone on a boom arm. A shelf with books is visible in the background on the left. A framed picture is on the background right along with a YouTube logo plaque. The person moves their arms and hands while speaking. Various still images of news articles are displayed on the left of the screen along with the person. The view changes to show a person in the driver's seat of a car with a \"TMZ\" logo in the corner. Police car lights are shown in the background. The view switches back to the first person. This same person is shown from various angles. The video then transitions to a blue, abstract screen which shows a wireframe representation of a brain which rotates in the center of the screen. The view transitions again to show the first person from various angles, before showing him singing in a black-and-white segment. There is also a sequence showing the first person standing outside in a field followed by close-ups of a person singing in front of book shelves. Another man stands in front of a fountain at night, and another is shown next to a river. The video ends with a white screen, a circle, and text that reads \"SUBSCRIBE!\" and \"WATCH ANOTHER VIDEO!\" along with various images of merchandise.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotion does the bearded man in the video feel toward the singer?",
+ "answer": "B",
+ "candidates": [
+ "A. Complaint.",
+ "B. Gratitude.",
+ "C. Fear.",
+ "D. Love."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many different segments of music are there at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. 5.",
+ "B. 3.",
+ "C. 2.",
+ "D. 4."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Is there an alarm sound in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I don't know.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "hYdAPFtZ": {
+ "video_id": "hYdAPFtZ",
+ "video_duration": "302s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two wrestlers, one in blue and white, and the other in green and black, circle each other in a wrestling ring. The wrestler in blue lifts and slams the wrestler in green to the mat. The wrestler in blue then pins the wrestler in green, and the referee begins counting. The wrestler in green kicks out of the pin. Both wrestlers get to their feet, and the wrestler in green kicks the wrestler in blue several times, knocking the wrestler in blue to the mat. The wrestler in green then picks up the wrestler in blue and slams them into the ropes, throwing the wrestler in blue outside of the ring. The crowd cheers as the wrestler in green waits for the wrestler in blue to reenter the ring. The wrestler in blue reenters the ring and is immediately kicked several times and then picked up and slammed to the mat again. The referee begins counting the pin and the wrestler in green wins the match. A replay is shown of the final move.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the color of the bracelet worn by the man in the white and green clothing at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Red.",
+ "B. Blue.",
+ "C. Pink.",
+ "D. Green."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man wearing a white and blue shirt run off the wrestling mat in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Two.",
+ "B. Four.",
+ "C. Three.",
+ "D. One."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a green stop sign in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes, there is.",
+ "B. No, there isn\u2019t.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "RIomlfCb": {
+ "video_id": "RIomlfCb",
+ "video_duration": "78s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with short light brown hair is sitting and holding a light-colored electric guitar. The person is wearing a dark gray short-sleeved polo shirt and is using a red guitar pick. Their left hand is positioned on the neck of the guitar, and their right hand is strumming the guitar strings. There is a variety of objects in the background including a desk, a computer monitor, a keyboard, speakers, a bass guitar, and various guitar pedals. Text is overlaid showing different guitar chord diagrams. The diagrams have grid lines, black dots and a label such as \"Fsus2\", \"Dm7\", \"Csus2\", \"F5\", \"Fma7-3\", or \"Cma7\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What object is the man in the video holding?",
+ "answer": "C",
+ "candidates": [
+ "A. A piece of paper.",
+ "B. A key.",
+ "C. A guitar pick.",
+ "D. A coin."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man play the guitar?",
+ "answer": "B",
+ "candidates": [
+ "A. Ten.",
+ "B. Eight.",
+ "C. Seven.",
+ "D. Nine."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How many chord fingerings are mentioned in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four.",
+ "B. Five.",
+ "C. Six.",
+ "D. Seven."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional change does the man in the video experience?",
+ "answer": "C",
+ "candidates": [
+ "A. Calm to satisfaction.",
+ "B. Excitement to calm.",
+ "C. Excitement to satisfaction.",
+ "D. Satisfaction to excitement."
+ ]
+ }
+ },
+ "NidaXDbl": {
+ "video_id": "NidaXDbl",
+ "video_duration": "160s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Biology",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video opens on a white background with a green banner at the top and the text \"Aerobic Respiration\" in the center. A circular logo is at the top of the screen. The background transitions to a purple color, a cartoon female character with brown hair, a tan headband, a purple shirt appears, with the text \u201cfuse school\u201d underneath. Small purple circles appear on the screen. A cartoon drawing of a human skeleton with a highlighted red lung appears, and a bolt of lightning appears around the skeleton. A cluster of 7 cells, each with a red interior, appears. A tan forearm with red muscle appears, followed by a green and yellow blob with a pink line running through it, followed by a brown neuron. An artery appears on the screen with a small white sphere moving through it. A tan squirrel appears beside a green snake. A green plant in a pink pot appears beside the snake. The text \u201caerobic\u201d appears, and the text \u201canaerobic\u201d appears beside it. A white circle labeled O2 appears under \u201caerobic\u201d, and another circle labeled O2 appears under \u201canaerobic\u201d with a red strike through. The cell cluster appears with the text \u201caerobic\u201d above it. A white circle highlights the cell cluster, and a white circle labeled O2 appears beside it. The text \u201cGlucose + Oxygen\u201d appears. The text \u201cGlucose + Oxygen --> Carbon Dioxide + Water\u201d appears below it. The text changes to \u201cGlucose + Oxygen --> Carbon Dioxide + Water + Energy.\u201d A green circle with red dots inside appears beside a white circle labeled O2, and the text Glucose + Oxygen --> Carbon Dioxide + Water + Energy remains on the screen. Two circles labeled O2 appear, and a circle labeled CO2 appears, with the text Glucose + Oxygen --> Carbon Dioxide + Water + Energy remaining. The circle labeled CO2 is next to another circle labeled H2O. Two gray shapes with lightning bolts appear underneath the circles labeled CO2 and H2O. 
The text \u201cATP\u201d appears. A single cell and green mitochondrion appears with an arrow indicating direction from the cell to the mitochondrion. The text \"aerobic respiration\" appears on screen, as the mitochondrion changes position and a circle appears. A purple cartoon bird with visible bones and lungs appears. A drawing of a leaf cross-section appears with yellow arrows indicating light entering from the top, a blue arrow indicating O2 exiting the bottom, and a green arrow indicating CO2 entering the bottom. The text \"photosynthesis\" appears, followed by an arrow, followed by the text \"Carbon Dioxide + Water --> Glucose + Oxygen + light energy\". The text \"respiration\" appears on screen and the text \"Glucose + Oxygen --> Carbon Dioxide + Water + Energy\" appears. A green plant in a pink pot appears with a yellow sun behind it and yellow arrows pointed towards the plant. The plant disappears as O2 circles appear around it. The moon appears to the left of the plant, and the sun is still behind the plant. A sun appears behind a plant with a lightning bolt symbol. A plant appears in a pink pot with arrows representing light. The text \u201cGlucose + Oxygen\u201d appears. The text \u201cGlucose + Oxygen --> Carbon Dioxide + Water + Energy\u201d appears. The video ends with a video play symbol, and the words \u201canaerobic respiration\u201d underneath. The screen fades to white text on a white background that says \u201cfuse school\u201d, with names and contact info under, and the words, \u201cFree education for everyone in the world!\u201d and \u201cwww.fuseschool.org\u201d.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Where is the energy generated during cellular respiration primarily stored?",
+ "answer": "A",
+ "candidates": [
+ "A. In ATP.",
+ "B. Within cells.",
+ "C. Within the mitochondria.",
+ "D. Within the chloroplasts."
+ ]
+ }
+ },
+ "WopnmBAw": {
+ "video_id": "WopnmBAw",
+ "video_duration": "109s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person is seen sitting in front of a microphone. The person is wearing glasses and a blue shirt. The background shows shelves with objects. The person makes several hand gestures throughout the video, including putting up two fingers like a peace sign, pinching two fingers together, placing his hands behind his head, putting his hands to the side of his face, and placing his hands together as if in prayer. A graphic of a tweet appears that includes text stating that the \"FNAF Movie\" is coming October 27th.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "Did the man in the video see anything related to FNAF?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes.",
+ "B. I don't know.",
+ "C. No."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times is the name Jason Blum mentioned in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. 1.",
+ "B. 0.",
+ "C. 3.",
+ "D. 2."
+ ]
+ }
+ },
+ "AdZFlcRf": {
+ "video_id": "AdZFlcRf",
+ "video_duration": "171s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman is driving a car at dusk. Trees and street lights are visible outside. The camera focuses on the steering wheel, dashboard, and the road ahead. The woman adjusts something on the dashboard. The car approaches an intersection and continues driving. It is now darker outside. The driver adjusts something near the gear shift and continues driving. The camera focuses on the driver's face. The sun begins to rise as the car drives. The driver makes a quick motion with their hand. The car continues driving. The screen cuts to black. A man in a black hoodie walks towards a car, where a woman is standing. They walk towards a garage door. The camera focuses on the woman. Another man joins them. He and the woman stand close together, facing the camera, and smile. They continue talking, and the video cuts to the woman, wearing glasses, inside a kitchen with wooden cabinets.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will the woman do next if this video receives 100,000 likes?",
+ "answer": "A",
+ "candidates": [
+ "A. She will create a video of driving a car with 1,000 horsepower on the road.",
+ "B. She will thank everyone who has supported her.",
+ "C. She will go for another drive with the man wearing a baseball cap.",
+ "D. She will invite others to a meal."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Where does the scene featuring two men appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the entire video.",
+ "B. In the middle of the video.",
+ "C. At the end of the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "SfGRRaFc": {
+ "video_id": "SfGRRaFc",
+ "video_duration": "60s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A young man wearing a dark baseball cap and a green hoodie stands in front of a black Chevrolet truck in a garage. He gestures with his hands while looking at the camera. He then opens a cardboard box, removing a long, rectangular LED light bar. He holds the light bar up, displaying it to the camera. Next, he pulls a plastic-wrapped bundle of cables and a small remote control from the box, holding each item up individually. Finally, he holds up a gray rectangular block. The man moves closer to the truck's front grill and continues gesturing.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What was the man's motivation for creating this video?",
+ "answer": "A",
+ "candidates": [
+ "A. The video was sponsored by a company.",
+ "B. He made the video to document his life.",
+ "C. He wanted to create the video himself.",
+ "D. He made the video at someone else's request."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, what is the order in which the boy presents the company's package contents?",
+ "answer": "A",
+ "candidates": [
+ "A. Lighting fixtures, wires, remote controls, other items.",
+ "B. Lighting fixtures, remote controls, wires, other items.",
+ "C. Lighting fixtures, other items, remote controls, wires.",
+ "D. Lighting fixtures, wires, other items, remote controls."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "At the start of the video, where is the car positioned relative to the male student?",
+ "answer": "B",
+ "candidates": [
+ "A. To the right rear.",
+ "B. To the left rear.",
+ "C. Directly ahead.",
+ "D. Directly behind."
+ ]
+ }
+ },
+ "EXTEEJJq": {
+ "video_id": "EXTEEJJq",
+ "video_duration": "544s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Two hands hold up the back of a product package depicting a red poppy illustration and text information. The package is placed on a black cutting mat. The hands then position a piece of white cardstock into a stamping tool on the cutting mat. Next, a hand uses a black ink pad to stamp a floral image onto the cardstock. Following this, the cardstock is removed from the tool, and watercolors are applied to the stamped floral image. Pink is used for the petals, and green for the stem and leaves. A small round section at the flower\u2019s center is painted with yellow. Finally, the words \u201cso grateful\u201d are stamped beneath the painted flower.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order does the person in the video color the parts of the flower?",
+ "answer": "C",
+ "candidates": [
+ "A. Petals, pistil, stem.",
+ "B. Pistil, petals, stem.",
+ "C. Petals, stem, pistil.",
+ "D. Pistil, stem, petals."
+ ]
+ }
+ },
+ "XTMmDfSM": {
+ "video_id": "XTMmDfSM",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a person in two different outfits, a long-sleeve black dress and a long-sleeve leopard-print top with dark pants. In both outfits, the person stands and makes various hand and mouth motions. Some of the mouth motions create an exaggerated effect on the person's face. The background shows either orange curtains or a couch with pillows. A TikTok logo appears in the lower right-hand corner of the frame throughout the video.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotion of the woman wearing a black top at the start of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Excited.",
+ "B. Angry.",
+ "C. Disappointed.",
+ "D. Enthusiastic."
+ ]
+ }
+ },
+ "XodGuGxC": {
+ "video_id": "XodGuGxC",
+ "video_duration": "214s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A dark blue anglerfish with a glowing lure swims toward a yellow baby shark wearing a chef's hat, who peeks out from a purple shark's mouth. The scene transitions to a pink kitchen where two Baby Sharks are baking. A large, purple shark is outside the window looking in. A tall, layered cake decorated with strawberries and whipped cream spins into view. The scene transitions to a display of cakes: a tiered cake, a star-shaped sponge cake, a rainbow loaf, a round chocolate cake, two cupcakes, and a heart-shaped cake. A purple shark-shaped eel is next to the cakes. The scene transitions to a display of food on plates: grapes, sliced bananas, carrots, bell peppers, sliced strawberries, and seaweed salad. The eel is still there. The scene shifts to a display of condiments and decorations: honey, hot sauce, a crab decoration, candies, macarons, and seashells. The eel is still there.In the pink kitchen, an orange baby shark chef is shown on the left while cooking ingredients are added to a blue bowl on the right: a mixing bowl, butter, milk, sugar, whipped cream, and finally, green seaweed salad. The orange chef shark and a yellow chef shark are in the kitchen. The yellow chef shark is imagining two decorated cupcakes. One cupcake has green frosting with a pink crab. The other has green frosting and a blue and yellow octopus. The purple shark outside the window holds a credit card. The yellow chef shark swipes the card at the cash register. The image of the dollar bill on the wall is replaced with a picture of the two decorated cupcakes. The cupcakes now appear on the counter along with all the other cakes: a tiered cake, a star-shaped sponge cake, a rainbow loaf, a round chocolate cake, two cupcakes, a heart-shaped cake, and a cake stand. The cupcakes are placed on the cake stand. The yellow chef shark decorates the cupcakes with green frosting, a crab, and an octopus. 
The scene shifts back to the anglerfish and the yellow chef shark at the shark's mouth. The yellow chef shark gives the anglerfish the cupcakes. The anglerfish cries and its angry expression changes to a happy one. The anglerfish eats the cupcakes and several carrots and pieces of seaweed. A pink badge showing the shark's open mouth is stamped on the image of the cupcakes, which is pinned to a board. The scene returns to the anglerfish approaching the shark's mouth, where the yellow chef shark is waiting.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Was a shark visible in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. No, a shark was not visible.",
+ "B. Yes, a shark was visible.",
+ "C. It is not possible to determine from the provided information.",
+ "D. This option is not applicable."
+ ]
+ }
+ },
+ "OVggzCbo": {
+ "video_id": "OVggzCbo",
+ "video_duration": "123s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person wearing a striped shirt and gold bracelets pours oil from a red measuring cup into a black pan. The person adds green chili pieces, a dried red chili piece, cumin seeds, and leaves to the oil, and mixes with a spoon. The person then adds chopped white onions to the pan and mixes them with the other ingredients, sauteeing until lightly browned. Chopped red tomatoes are added to the pan and sauteed until mushy. Yellow and brown spices are added and mixed in, followed by a green herb and ginger. The person adds a cut light green gourd to the mixture, mixing all ingredients, and then adds water and mixes again. The pan is covered.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What does the woman do with a red spoon at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. She adds light soy sauce to the pot.",
+ "B. She adds water to the pot.",
+ "C. She adds dark soy sauce to the pot.",
+ "D. She adds oil to the pot."
+ ]
+ }
+ },
+ "BPHbwvCS": {
+ "video_id": "BPHbwvCS",
+ "video_duration": "508s",
+ "duration": ">8min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows multiple people dressed in soccer uniforms playing soccer on a green field. Some of the people are wearing yellow shirts with blue shorts, some are wearing white shirts with black shorts, and some are wearing white shirts with red stripes and black shorts. One person is wearing a black shirt with black shorts. In the background are stands filled with people. There is a white net and some tall white poles on either side of the field. There is a white ball moving around the field. The people kick the ball and try to score. Sometimes they fall on the ground. Sometimes the camera zooms in and out, or moves left and right along with the action on the field.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many goal attempts did the team in the red and white striped uniform make in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Nine.",
+ "B. Three.",
+ "C. Five.",
+ "D. Seven."
+ ]
+ }
+ },
+ "TNHqwIzS": {
+ "video_id": "TNHqwIzS",
+ "video_duration": "118s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen that transitions to show the word \"bits\" in blue. The text \"frank woodley's\" appears above the word \"bits\" in white.Two men wearing grey suits are on stage. One of the men wears a burgundy hat. The other has a microphone. The man without the microphone claps his hands together then points at his own chest. He then makes more gestures with his hands and walks to the right.The man with the microphone walks to the left and bends down by a large floral arrangement. The man without the microphone walks towards the center of the stage.The man with the microphone walks to the center of the stage, and the man without the microphone picks up a thin rod and walks to the right of the frame.The words \"Good Bye\" are projected onto a blue background. The two men stand on the stage on either side of the projection. There are floral arrangements on either side of the stage.A toy red race car with a flag attached is shown on a black background. It moves around and then comes to a stop at the lower edge of the frame.The two men are back on stage. The man without the microphone picks up a black remote control. The toy car moves around again.The men are back on stage. The man without the microphone holds the remote control. The man with the microphone appears to be speaking.The man with the microphone picks up a hammer. He kneels and appears to use the hammer to break the toy car.The man without the microphone picks up a toy red car with a flag. He walks to the left edge of the frame and then exits. The man with the microphone walks towards the left of the frame, holding a hammer.The man with the hat is standing by the broken toy car and holds a long rod. The other man walks into the frame holding the hammer and then exits. The man with the hat continues to stand holding the rod.The man without the hat returns holding the rod and appears to sweep the broken toy car off stage. The man with the hat is standing still. 
The man without the hat walks to the left edge of the frame.The man with the hat is shown swinging a stick. The other man holds the rod. The man with the hat is holding the remote control and points it at the other man. He then moves closer to the other man, holding the remote. The other man is holding a rod.The man without the hat is holding a rod and sweeping the broken car off stage. The man with the hat stands with the remote. They then change positions. The words \"youtube.com/woodleyyoutubular\" are projected onto a black background.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the final condition of the car?",
+ "answer": "B",
+ "candidates": [
+ "A. The car is burned to ashes.",
+ "B. The car is shattered.",
+ "C. The car flies up.",
+ "D. The car is split in two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What happened after the car was damaged?",
+ "answer": "B",
+ "candidates": [
+ "A. The person with the red hat swept it to the side.",
+ "B. The man with the red tie swept it to the side.",
+ "C. The person with the red hat then took out another small car.",
+ "D. It remained in place."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times was the man wearing a red tie poked by the remote control's antenna?",
+ "answer": "D",
+ "candidates": [
+ "A. Five.",
+ "B. Seven.",
+ "C. Four.",
+ "D. Six."
+ ]
+ }
+ },
+ "qaAoJRxq": {
+ "video_id": "qaAoJRxq",
+ "video_duration": "53s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A red and purple character is on a grassy hill next to a large brown figure. The character is standing near the edge of the hill, then jumps and floats over to a large dark cave opening, and then jumps out of the cave onto a long rope. They then swing from the rope to the opposite hill, and begin to shoot projectiles at the large brown figure with a black and blue object. The brown figure has a bar across the bottom that states \u201cGolem Head\u201d followed by two numbers. The character fires a number of white, golden, and green projectiles at the brown figure. The brown figure breaks apart, and a smaller floating brown object rises above it. The character continues to shoot at the smaller floating brown object, as well as the rest of the brown figure, eventually destroying it. The text \u201cGolem has been defeated!\u201d appears on the screen. The character then appears in the menu screen. They have 4278 Boomerang Bullets.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What occurs after the game ends?",
+ "answer": "C",
+ "candidates": [
+ "A. More than 300 bullets are missing from the original stock.",
+ "B. The bullet inventory remains unchanged.",
+ "C. There are more than 300 additional bullets beyond the original stock.",
+ "D. All bullets are depleted."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the small figure jump down from a height?",
+ "answer": "A",
+ "candidates": [
+ "A. Four.",
+ "B. Five.",
+ "C. Twice.",
+ "D. Three."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual issue occurred in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The boomerang bullet has a flaw that prevents it from causing damage.",
+ "B. A flaw in the boomerang bullet makes it disappear without reason.",
+ "C. The boomerang bullet has a flaw that causes it to jam easily.",
+ "D. The boomerang bullet has a flaw, causing the bullet to return."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a small monster wearing a mask in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "HRXUIIaw": {
+ "video_id": "HRXUIIaw",
+ "video_duration": "82s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a woman with long blonde hair in multiple settings. In the first setting, she is sitting in a white chair, wearing a black long-sleeved shirt, and talking while gesturing with her hands. There are shelves behind her with a football helmet, a pair of gloves, and other objects on them. In another scene, she is standing indoors by a table with two other women. There are plates and glasses on the table. In a different scene, a group of people are sitting around a table in a room with brick walls. The woman with the long blonde hair is also sitting at this table. In another scene, the woman is outside, along with a woman wearing a black knit hat and a man wearing a hooded jacket, and is putting on a pair of gloves. They appear to be standing in front of a food truck. The woman is then seen handing out water and food to people. In other scenes, she is interacting with other people in front of the food truck and handing out food containers. There are also scenes of the woman sitting in a white chair, and talking while gesturing with her hands. In one of the scenes, she is standing next to a man wearing a blue hooded sweatshirt and interacting with people in front of a food truck.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "Who did the blonde woman in the video give the drinking water to?",
+ "answer": "A",
+ "candidates": [
+ "A. An elderly person.",
+ "B. A young woman.",
+ "C. A young man.",
+ "D. A child."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Did the community workers wear disposable gloves before distributing food in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No, they didn\u2019t.",
+ "B. I am not sure.",
+ "C. Yes, they did."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Does the volume of dialogue between characters in the video change?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes, the blonde woman's voice suddenly becomes louder.",
+ "B. Yes, the man in black suddenly speaks louder.",
+ "C. Yes, the man in black suddenly speaks in a softer voice.",
+ "D. Yes, the blonde woman's voice suddenly becomes softer."
+ ]
+ }
+ },
+ "akqxFCst": {
+ "video_id": "akqxFCst",
+ "video_duration": "615s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Fashion",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with long, blonde hair and tattoos on her arms sits in a white chair in a bedroom. She faces the camera and speaks while gesturing with her hands. Behind her is a bed with a brightly colored comforter and several stuffed animals. A white dresser stands against the wall to her left, and on top of it sits a black plaque or box. A large, crying emoji cushion rests on the floor to the right of the chair. The woman opens a cardboard box in front of her. She pulls out and displays several makeup palettes and other cosmetics, holding each item up to the camera individually while continuing to speak. The palettes vary in size, color, and design. One features a kaleidoscopic pattern, another is teal, one is pink and white, and another has a primarily neutral color scheme. She also shows a small bottle with a dropper top. The video occasionally cuts to close-up shots of the products she is holding.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What minimum purchase amount qualifies for free shipping as mentioned in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. $15.",
+ "B. $25.",
+ "C. $19.",
+ "D. $17."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many lipsticks does the golden-haired woman take out in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Two.",
+ "D. One."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the color of the first lipstick introduced by the golden-haired woman in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Orange-brown.",
+ "B. Metallic.",
+ "C. Maple red.",
+ "D. Yellow-brown."
+ ]
+ }
+ },
+ "pLwSgcKL": {
+ "video_id": "pLwSgcKL",
+ "video_duration": "113s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Racing",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a black and white image of a person holding a pen and notepad. This transitions to a black screen with the text \"IN THE 1980'S AND 90'S\", which is followed by \"JAMAICAN SPRINTERS\", both displayed with a horizontal line underneath. A series of still images then begin, first showing three runners on a track with a yellow arrow pointing from the left runner to the right runner and a green arrow pointing towards a line on the track. This is followed by a mosaic effect displaying a still image of a runner on a track. Next, an overhead view of a race is shown, with the times \"9.83\" and \"9.79\" overlaid at the bottom. The view shifts to a close-up of a person in a red track uniform on a track with the numbers 159 visible, again with times on the left and right of the shot. This is followed by a montage of images showing the same runner, now including a shot of the runner kneeling on a track. Next, the video cuts to a still image of the same runner in a blue and black top and gold chain, followed by the same image with various colors overplayed on the image. Then, a still image of a woman in a yellow track uniform is displayed running, with the numbers \"453\" visible. Next, a fast-paced shot of a runner on a track with limited lighting. Following that, a still image of the same runner in a yellow uniform is shown running on a track. The video then displays an image of a runner at the starting blocks. An overlapping effect is applied to display a row of runners at the starting blocks, followed by another overplayed image of a woman running with the numbers \"3478\" visible. A still image of the same woman in a track uniform is then shown, followed by a sequence of short clips of her running. Following this, a table is shown with listed times for races, with one time circled in yellow. A new race is then shown, with four runners on a track. 
This clip is followed by an image of the same list of listed times with \"Merlene Ottey\" highlighted in yellow. The video then displays a black and white shot of a runner, then shows the name \"Raymond Stewart\" in yellow text at the bottom. This is followed by a table of names, along with their corresponding country. The video shifts to an overhead view of a race with a green arrow pointing towards one runner. This is followed by a still image of a man with the text \"Raymond Stewart\" underneath and the Jamaican flag, along with a race number of \"46\" visible. Next, the video displays two runners at the starting blocks, with a yellow line highlighting the runner on the left. Following this a montage of various images is displayed, ending on a still image of Raymond Stewart. The video then displays a view of a race with \"Raymond Stewart\" in yellow text underneath along with his winning time. The video then displays a mosaic effect of the runner wearing an orange track uniform with the number 15 on his chest, followed by an image of a magazine with a man wearing a yellow uniform with the number 245 on his chest. The magazine is titled \"The Green Machine\". The video then displays another shot of Raymond Stewart, his winning time and country shown below him. The video ends with a view of a track race with another time visible.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "What is the race number of the female athlete wearing a yellow vest and having dark skin, as seen in the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 454.",
+ "B. 456.",
+ "C. 453.",
+ "D. 458."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the competition number of the black athlete wearing a red suit in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 156.",
+ "B. 157.",
+ "C. 159.",
+ "D. 160."
+ ]
+ }
+ },
+ "ZwGINCbd": {
+ "video_id": "ZwGINCbd",
+ "video_duration": "78s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Physics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a side-by-side comparison of two 3D shapes, both with a gradient of purple, pink and orange, one is a smooth sphere and the other is a more complicated icosahedron shape with many protrusions. A title card underneath these two shapes reads \"Idea: frequency partition\".The video then presents three squares with diagrams: on the left, a diagram of a bumpy circle with a red rim and blue center with a squiggly line around it, labeled \"Composed\". To the right of this is an equals sign, followed by a second diagram of a smooth circle with a red rim and blue center, labelled \"Base SDF\". To the right of the second diagram is a plus sign, and a third diagram with a spiky circle with a blue rim and red center labeled \"Displacement\". The title card reading \"Idea: frequency partition\" appears below these three diagrams as well. The diagrams change to feature a wavy line inside a square frame, instead of a circle. The wavy line in \"Base SDF\" has one peak and valley, while the line in \"Displacement\" has multiple peaks and valleys. The other two squares are the same as the prior frame.The video then changes to text at the top which reads \"Na\u00efve approach: direct sum\", followed by a bullet point: \"Without constraints high-frequency signal causes optimization problems\". Below this is an image of two dragon-like figures. The figure on the left is labeled \"naive approach\", and the figure on the right is labeled \"ground truth\". The figure on the left appears smoother and less defined than the one on the right.The video changes to text at the top which reads \"Solution: Implicit Displacement Field\", followed by a check mark and the text \"High-frequency signal is dependent on the low-frequency base.\". To the right are two diagrams, both with a flat black horizontal line. The first has a light blue wavy line on top with dots on the peaks and valleys. 
The second has a jagged light blue line directly below the flat horizontal line.Below this are two images: on the left is a grayscale bird's-eye view of a circular object with small circles in it, on the right is a 3D image of a cube with geometric designs on each of its faces. A url is placed below the left image.The text and check mark is the same on the following slide. The two diagrams are replaced with a new diagram: a flat black horizontal line with a light blue squiggly line around it. Dots are on the peaks and valleys. Two dotted lines are on the diagram with the text 'x' and 'y' where the dotted lines meet the squiggly line. The text \"implicit displacement\" and \"implicit inverse displacement\" are labelled with their respective colors, pink and green.This diagram is then replaced with two images of the same dragon-like figure. The dragon appears to have purple, pink and orange highlights. The implicit displacement information remains the same.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "After proposing a solution involving implicit displacement fields, what was the woman's mood?",
+ "answer": "A",
+ "candidates": [
+ "A. Satisfied.",
+ "B. Sad.",
+ "C. Disdainful.",
+ "D. Angry."
+ ]
+ }
+ },
+ "yhJYgCql": {
+ "video_id": "yhJYgCql",
+ "video_duration": "90s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Fashion",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman with long brown hair, wearing a beige sports bra and a cream cardigan, holds a black plastic mailing package. She gestures with her hands while looking at the camera. Cut to a man in a red robe dancing. Cut back to the woman with the package. Cut to three images of women in workout attire. Cut to a man with short gray hair and a white shirt, his mouth open in surprise. Cut back to the woman with the package. She runs a finger along the sealed edge of the package. Cut to a split screen featuring a close-up of gray leggings and a woman with blonde hair in a ponytail. Cut back to the woman holding the black package. She holds up two fingers, then three, then four. Cut to a woman in pink lingerie dancing on a stage with other women in similar outfits. Cut to a man in a black leotard standing in profile in a dance studio. Cut back to the woman with the black package, who holds up three fingers, then four. She continues to gesture and look at the camera.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the woman likely to do next?",
+ "answer": "B",
+ "candidates": [
+ "A. She will go out with friends.",
+ "B. She will try on the leggings.",
+ "C. She will exercise at the gym.",
+ "D. She will discard the black packaging bag."
+ ]
+ }
+ },
+ "nXqyZqfJ": {
+ "video_id": "nXqyZqfJ",
+ "video_duration": "320s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a red durag and a red, white, and blue sweater with a cartoon Santa Claus is standing in a bedroom. They walk towards a bed with a grey blanket on it. Another person wearing a red and green shirt sits up from under the covers and pulls the covers off the first person. Both people are then seen sitting up on the bed under the covers. Then, the screen reads \"ONE HOUR LATER...\". The same two people are now seen sitting up under the covers. The first person is seen talking. Another person wearing a black shirt and a red baseball cap walks into the frame. The first person is seen arguing with the third person. The first person then throws a pillow at the third person. The first person then sits up on the bed. The first person then argues with the third person. The first person walks towards the third person. The first person is seen embracing the third person. The first and second people are seen talking with the third person. The first and third person then embrace each other. The first and second people are then seen on the bed, and the first and third person are then seen next to each other. The third person then walks around the bed. The first person is seen gesturing and talking to the third person. The three people are then seen embracing each other. The video ends with the three people standing next to each other.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "When was the video recorded?",
+ "answer": "C",
+ "candidates": [
+ "A. During Christmas.",
+ "B. After Christmas.",
+ "C. Before Christmas.",
+ "D. I don't know."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is the woman wearing red and green clothes in the video sporting a manicure?",
+ "answer": "C",
+ "candidates": [
+ "A. She does not have a manicure.",
+ "B. She has a black manicure.",
+ "C. She has a white manicure.",
+ "D. I don't know."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotion is the man in black experiencing when he discovers the two women in bed?",
+ "answer": "A",
+ "candidates": [
+ "A. Feeling dissatisfied.",
+ "B. Feeling surprised.",
+ "C. Feeling sad.",
+ "D. Feeling indifferent."
+ ]
+ }
+ },
+ "uWRZUyRK": {
+ "video_id": "uWRZUyRK",
+ "video_duration": "88s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A cartoon figure with brown hair in pigtails and a purple hooded outfit is seated at a white table. She blinks and her face changes. There's a change in setting to a room with a wooden floor and a wall with a window that has green curtains. The figure is now standing. She waves her hand. The scene changes back to a white background, with a black and white cartoon figure who has frizzy hair and glasses. She is holding up a book with a drawing of a figure on it. The scene changes back to the room. The figure waves her hand and looks at her hand. She raises one hand, moves it toward her chest, and looks surprised. A black and white cartoon figure is standing behind the purple-hooded figure and holds a book open in front of the purple figure. The scene changes back to the room with the window. The figure clenches her fists and looks confused. The figure has one arm extended and looks confused. The scene changes back to the white background, showing the figure being watched by two figures. The figure is standing in the middle, while the two figures on the side have their mouths open. The purple-hooded figure stands on the white background while three figures with green faces and a black and white figure are on the side. The scene changes to the room, and the figure looks confused. The purple-hooded figure bounces on a white toilet. The scene changes back to the figure standing in the room with the window. The figure stands in the middle of the white background and has two figures bouncing on a toilet.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What do the \"established rules\" mentioned at the beginning of the video refer to?",
+ "answer": "C",
+ "candidates": [
+ "A. It is not appropriate to show anxiety in the classroom.",
+ "B. Movement is not allowed in the classroom.",
+ "C. The toilet is very dirty. Do not go there.",
+ "D. You must listen to the entire story read by the teacher before leaving."
+ ]
+ }
+ },
+ "ixHocOdh": {
+ "video_id": "ixHocOdh",
+ "video_duration": "238s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman and a child are sitting at a kitchen counter. The child picks up a cookie and breaks it in half. The woman holds up a large, round cookie to the camera. The child picks up a pair of scissors and cuts open a package with another cookie inside. The woman removes the cookie from the package. The child picks up a knife and uses it and their hand to tear another package open. The child picks up a cookie and breaks off a piece. They then place the piece in their mouth. The camera moves to show the child alone at the counter. The camera angle changes to an overhead shot of a cutting board on the counter, with several large, round cookies on it. There\u2019s a partially opened pink box with the word \u201cBROOKI\u201d on it, and pink packaging with the same word on the counter. The camera briefly shows the woman and child from the previous angle. Back to the overhead shot, a hand pulls up the pink box to reveal more cookies on the cutting board. The hand lowers the box again. Two hands, the woman\u2019s and the child\u2019s, begin cutting the cookies into smaller wedges using knives. A hand holds up a piece of a cut cookie to the camera. Another hand holds a different piece of cookie to the camera. Both pieces have different fillings. The hands continue cutting the cookies. A black and white cat\u2019s tail flicks into view. One hand picks up a piece of cut cookie and breaks it into smaller pieces. The hands continue cutting the cookies.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What is the primary reason the two individuals are eating cookies in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. They are determining the taste of the cookies in order to operate a bakery.",
+ "B. They have a strong personal preference for eating cookies.",
+ "C. They are participating in a cookie flavor evaluation competition.",
+ "D. They are filming a food video to showcase the cookies."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a cat in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "FuQhvhrv": {
+ "video_id": "FuQhvhrv",
+ "video_duration": "123s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman smiles at the camera in a room decorated for Christmas. Two stacks of double-stuffed chocolate cookies are shown with crushed candy canes sprinkled around them. The camera zooms in on the cookies. A box of devil's food cake mix is shown. Cake mix is poured into a glass bowl. Eggs are cracked and added to the mix. The mixture is stirred with a wooden spoon. The cookie dough is rolled into small balls. The dough balls are placed on a baking sheet. A glass is dipped into sugar and then pressed onto each dough ball. The cookies are baked. The baked cookies are placed on a wire rack to cool. A bowl of cream cheese frosting is mixed using a hand mixer. The frosting is stirred. Frosting is spread onto a cookie. Another cookie is placed on top to form a sandwich. Crushed candy canes are sprinkled around the finished cookies. The final product is displayed.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video contain a Christmas tree?",
+ "answer": "D",
+ "candidates": [
+ "A. No, it does not.",
+ "B. None of the above.",
+ "C. I am not sure.",
+ "D. Yes, it does."
+ ]
+ }
+ },
+ "pDPZJlgD": {
+ "video_id": "pDPZJlgD",
+ "video_duration": "115s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black t-shirt and glasses stands in front of a white board. The whiteboard has the word \"Thermodynamics\" written at the top, along with the words \"Type of Systems\" and then \"Open\", \"Closed\", \"Isolated\", and \"Adiabatic\". The person holds a white marker with a red cap and then draws a box on the whiteboard with an opening on one side, and an exit on the other side. He adds the word \"Min\" to the opening, and \"Mout\" to the exit. Then, he draws a dotted outline around the whole box and adds \"Q in\" below the box, and \"E in\" and \"E out\" on two sides of the box. The person gestures with his hands, then holds a white marker and a red marker in his hands.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the action of placing the entire hand on the whiteboard occur?",
+ "answer": "D",
+ "candidates": [
+ "A. Throughout the entire video.",
+ "B. In the latter half of the video.",
+ "C. In the first half of the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "HRmgilBi": {
+ "video_id": "HRmgilBi",
+ "video_duration": "117s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A group of four people wearing winter clothing and backpacks, and using skis and poles, move across a snow covered surface. They are followed by a snowmobile with two people onboard. The scene shows a large, open space covered in snow, with trees and distant mountains. The next scene shows three people inside, looking at a map. One person, wearing a park ranger uniform, is pointing at the map with a pen. The following scene shows the hands of two people pointing at and marking a map. The video then shows the group of skiers moving across a snowy landscape while the snowmobile kicks up snow behind them. Next is a shot of three bison in a snowy field with trees in the background. A single person wearing red walks across the screen in front of the bison. There are two more people using ski poles behind this person. A large brown bear walks through a snowy forest with trees and downed branches. Two people move through a snowy forest. There are ski poles visible. Then, a group of four people walk through a snowy forest wearing backpacks and using poles. Then, text that reads \"go.nps.gov/BearSpray\" is shown over a shot of snowy trees. There is a close-up of a bison laying down in the snow, and then a shot of a large bull elk walking next to a river. A moose is seen walking in the water, and then, the first group of people is seen skiing alongside a river with bison in the background. The next shot is from behind the group of skiers with their backpacks, with bison visible in the distance. Four people are standing and a man in a blue coat takes a picture of something in the snow behind them. A coyote yawns and then walks in the snow. Skis move over a snow-covered log in a forest. Two people look at a map in the snow. The video shows a snowy landscape and then shows the four skiers walking across a snowy landscape alongside a river. The scene changes to show a group of four people walking next to a river and across a snow-covered log. 
A shot of water moves under icy formations. A person wearing skis moves through a river, followed by a shot of a river with ice formations on its banks. A single tree stands in the middle of a snow-covered island in the middle of a river. The final scene is of a steaming geyser. A person walks through a geyser and then the four skiers cross in the background of another geyser.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many types of animals appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Six.",
+ "D. Five."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the skier lose direction?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the entire video.",
+ "B. At the end of the video.",
+ "C. At the beginning of the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "CDqMRwCg": {
+ "video_id": "CDqMRwCg",
+ "video_duration": "101s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a red text sign that says \"BUCKTOWN BAZAAR\" with a green wrapped gift image. Next, a woman with dark hair and glasses is holding a cut apple pie. There are other pies, a water bottle and a red table cloth on the table behind her. Then, a painting of a field of yellow and purple flowers comes into view. A shot of two women standing together in front of clothes hung on racks is shown next. Next, a wicker basket of flower shaped barrettes and a price sign is displayed. The next shot features the same barrettes on top of a cutting board with a carrot next to it. The video then displays a wire rack of colorful stuffed monster toys. Then there is a view of a room filled with people and tables displaying art pieces. A shot of a gold frog statue wearing necklaces and other jewelry displayed on a table comes next. Three dress shaped displays with necklaces and bracelets are shown. A woman stands behind a table with a decorated pink Christmas tree and various ornaments. The next view shows a person with a black gloved hand holding black yarn. A table of small dolls and crafts is displayed as well. A orange and white stuffed cat is shown laying down on the table. A wooden table is seen with leather couches in the background. The gold frog statue and jewelry display is shown again. Two women pose with various colorful pillows and fabrics. A wooden bench is shown on a wooden floor, with a table and chairs in the background. An image of a bird on paper with orange designs is next, and then finally, a red stuffed toy with black hair and a red nose is shown being held in someone's hands.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What animal in the video is entirely golden in color?",
+ "answer": "A",
+ "candidates": [
+ "A. A frog.",
+ "B. A cat.",
+ "C. A tadpole.",
+ "D. A bird."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the two women in the fourth picture of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Event organizers and planners.",
+ "B. Event organizers and community staff.",
+ "C. Event organizers and invitees.",
+ "D. Community staff and invited guests."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a handcrafted carrot in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "OVzQixuI": {
+ "video_id": "OVzQixuI",
+ "video_duration": "32s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a soccer game. There are two teams, one in red and white striped jerseys and one in white and blue striped jerseys. The soccer ball is kicked around the field between players. At one point a group of players and a referee gather in the center of the field, then disperse. Later in the video, there is a close up of a person wearing a blue suit and a striped tie speaking against a black background with the word \u201cEquality\u201d in white letters in the background. Following that, the video shows the soccer game again. Players are running around the field and moving the ball. Some people are wearing red jackets on the sidelines.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual action did the man wearing the red number 10 jersey perform on the soccer field in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. He accidentally kicks the soccer ball toward the stands.",
+ "B. He accidentally kicks the soccer ball toward an opponent.",
+ "C. He intentionally kicks the soccer ball toward the stands.",
+ "D. He intentionally kicks the soccer ball toward an opponent."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of the following events as they occur in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The man in the red #10 shirt kicks the football towards the stands, the audience shouts racist remarks and makes monkey noises, the man in the red #10 shirt leaves the field, and the man in yellow clothes holds the arm of the man in the red #10 shirt.",
+ "B. The man in the red #10 shirt kicks the football towards the stands, then leaves the field, the audience shouts racist remarks and makes monkey noises, and the man in yellow clothes holds the arm of the man in the red #10 shirt.",
+ "C. The audience shouts racist remarks and makes monkey noises, the man in the red #10 shirt kicks the football towards the stands, the man in yellow clothes holds the arm of the man in the red #10 shirt, and the man in the red #10 shirt leaves the field.",
+ "D. The audience shouts racist remarks and makes monkey noises, the man in the red #10 shirt kicks the football towards the stands, the man in the red #10 shirt leaves the field, and the man in yellow clothes holds the arm of the man in the red #10 shirt."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What is the man in yellow doing while holding the arm of the man wearing the red jersey numbered 10 in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Criticizing his behavior.",
+ "B. Approving of his action.",
+ "C. Asking for an explanation.",
+ "D. Comforting him."
+ ]
+ }
+ },
+ "fMnKIgvu": {
+ "video_id": "fMnKIgvu",
+ "video_duration": "133s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video is split into two panels. The left panel shows clips of various scenarios. The first clip shows a person wearing a black hoodie and a red and black plaid jacket, walking in a street at night with other people blurred in the background. The next clip shows a woman with glasses in a short black top and short white skirt standing and talking with the person in the jacket from the previous clip. In the next clip, the person wearing the jacket is holding a microphone up to the woman in the black top and white skirt. The scene cuts to other people blurred in the background and the woman in the black top and white skirt, pointing. The scene cuts again, showing a different woman wearing a black top with a red bra, and the person in the jacket from before. The scene cuts to show a man in a gray suit with a white shirt and a black tie. In the next clip, the man in the gray suit appears to be dancing. The scene cuts to show two clips side-by-side of women in work out clothing, in gym settings. The next scene displays a woman with a backpack walking in a store, and another clip of a person holding a box in a store. The scene then displays a woman in a green vest and blue pants in a store, and a clip showing a woman working at a cash register. The scene then shows a woman wearing blue, walking down an aisle in a store. A clip then displays a woman in a blue vest taking out their cell phone and then showing it to the camera. In the last few scenes, a person is lying down and another person is standing in the aisle of a store.The right panel of the screen shows a man wearing a pink hooded sweatshirt, and he rests his chin in his palm, at different moments, looks directly into the camera. His expression varies throughout the video. The background behind the man shows a dimly lit room with a desk and dark furniture.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What exercise is the woman doing in the second segment of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Sit-ups.",
+ "B. Push-ups.",
+ "C. Yoga.",
+ "D. Plank exercise."
+ ]
+ }
+ },
+ "seevAxqN": {
+ "video_id": "seevAxqN",
+ "video_duration": "351s",
+ "duration": "4-6min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man in the audience claps and cheers. The judges clap along to the music. A puppet sings. A costumed singer, the Flower, takes the stage. The Flower's costume consists of a large purple rose for the head and a green, sparkling, floor-length gown and green gloves. The stage is decorated with a large moon, winter trees, and lampposts. Fog covers the floor of the stage. The Flower sings. A blonde female judge listens intently. A male judge with dark hair places his hands together. The Flower continues singing. A dark-haired female judge gestures with her hands. The Flower sings. A bearded male judge watches and nods his head to the music. The Flower sings, and simulated snow falls on the stage. The dark-haired female judge sings along. The Flower sings. Audience members watch and clap along. The bearded judge smiles. The Flower sings. The camera zooms in on the Flower's face. The Flower finishes her performance. A man in a tuxedo approaches the Flower with a microphone. The Flower sings a few lines of \"Eye of the Tiger\" with another masked singer, who is dressed in a leopard costume with large white ruffles and a gold-trimmed dress. The blonde judge dances in her seat. The Flower sings a few lines with the Leopard. The man in the tuxedo walks off the stage, and the Leopard walks off the stage. The Flower and the Leopard stand on stage. The Flower sings a bit of \"Eye of the Tiger.\" The blonde judge claps. The Flower sings some more. The judges stand up and applaud. Audience members cheer and applaud. The host stands next to the Flower, who continues to hold the microphone. The host hugs the Flower. The blonde judge claps. The man with dark hair points excitedly at the stage. The host removes the Flower mask to reveal Patti LaBelle. The judges react with surprise and excitement. Patti LaBelle stands next to the host and speaks. The judges clap and cheer. Patti LaBelle sings. The host dances next to Patti LaBelle, who sings. 
A subscribe button appears in the lower right corner, with a hand clicking it, followed by a notification bell button.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What genre of music is performed in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Jazz music.",
+ "B. Country music.",
+ "C. Pop music.",
+ "D. Classical music."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many songs were performed in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Four songs.",
+ "B. One song.",
+ "C. Three songs.",
+ "D. Exhaling causes the hips to move forward or inward, drawing the ribs downward and essentially towards the pelvis."
+ ]
+ }
+ },
+ "IOTnjLAP": {
+ "video_id": "IOTnjLAP",
+ "video_duration": "146s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A yellow pencil is used to draw a grid on a piece of white paper, with a rectangle divided into thirds both vertically and horizontally. In the center of the topmost rectangle of the grid, a drawing of a man's head is present, with sketch-style lines. The pencil is then used to add lines to the drawing, including the bottom of the man's chin and neck, shoulders, and a line that roughly denotes the left side of his torso. Finally, the pencil is used to add shading and depth to the man's face.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, what is the correct order in which the author draws the elements?",
+ "answer": "A",
+ "candidates": [
+ "A. Square grid, neck, shoulder.",
+ "B. Neck, square grid, shoulder.",
+ "C. Square grid, shoulder, neck.",
+ "D. Shoulder, square grid, neck."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What serves as the starting guideline for drawing the neck in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Shoulder.",
+ "B. Eye area.",
+ "C. Face.",
+ "D. Ear."
+ ]
+ }
+ },
+ "GDSviMqM": {
+ "video_id": "GDSviMqM",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a black screen. A Rotten Tomatoes web page appears, with a play button on the movie \"Star Wars Episode I: The Phantom Menace\". The webpage then disappears to reveal the Rotten Tomatoes logo, followed by a \"What is the Tomatometer?\" explanation. The screen changes to the Walt Disney Pictures logo with a castle in the background. This is followed by a black and white diagram of a person wielding a sword and an arrow pointing to the ground next to a checkmark. A word cloud appears with \"drama\" in large red letters. The screen is black again, followed by an image of a person resting under a tree with a gun beside them. Next, an image of shelves of DVDs and books appears. Lastly, the video shows the word \"GOOD\" in white lettering, followed by an image of a man in a suit with his thumb down.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Where in the video is the Tomatometer score mentioned?",
+ "answer": "D",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. At the beginning of the video.",
+ "C. Throughout the video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, on the Rotten Tomatoes rating index image, which side of the tomato is the cauliflower located?",
+ "answer": "D",
+ "candidates": [
+ "A. The cauliflower is above the tomato.",
+ "B. The cauliflower is to the left of the tomato.",
+ "C. The cauliflower is below the tomato.",
+ "D. The cauliflower is to the right of the tomato."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the scene featuring an elderly man with long white hair appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. On the Rotten Tomatoes website.",
+ "B. Beside a bookshelf.",
+ "C. Under a tree.",
+ "D. In front of a Disney castle."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the rating of the movie 'The Phantom Menace' as shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 0.60.",
+ "B. 0.55.",
+ "C. 0.59.",
+ "D. 0.53."
+ ]
+ }
+ },
+ "zXxXuZIN": {
+ "video_id": "zXxXuZIN",
+ "video_duration": "42s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A number of small, remote-controlled cars are racing around a track. The track is black with white lines and a dirt center area. The cars are various colors, including red, blue, yellow, black, and white. There is a red bar at the top of the screen that shows the current race information. There are two adults visible on the right side of the screen. One adult is wearing a blue shirt and a baseball cap.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the narrator's emotion change throughout the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The narrator remains consistently excited.",
+ "B. The narrator starts calm, becomes excited after the results are revealed, and then returns to calm at the end of the competition.",
+ "C. The narrator remains consistently calm.",
+ "D. The narrator starts calm, becomes excited after the results are revealed, and then becomes sad at the end of the competition."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "Which car owner's name correctly matches its color?",
+ "answer": "D",
+ "candidates": [
+ "A. Tim owns a blue car.",
+ "B. Mike owns an orange car.",
+ "C. Rich Terwilliger owns a yellow car.",
+ "D. Sean Short owns a red car."
+ ]
+ }
+ },
+ "RQIsiXWV": {
+ "video_id": "RQIsiXWV",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video starts with a purple background displaying a staircase with two red curtain-like pillars with skull decorations. There are also two candles on each side of the stairs and pumpkin decorations above the candles. In the background there is a window with a moon. A girl in a yellow fairy costume is on the stairs, next to a person in a Frankenstein costume. The words \"Halloween is almost here\" appear. Then, a girl in a red witch costume and a person in a black vampire costume appear on the stairs. The words \"and I can hardly wait\" appear. Next, a person in a white skeleton pirate costume and a yellow shark character appear. The words \"Halloween is almost here\" appear. Then a white, cloud-like figure with spikes appears and the words \"and I can hardly wait\" appear. Next, the group of figures appears on the staircase. The words \"and I can hardly wait\" appear. Then the figures appear playing musical instruments on a dark red carpet. The words \"When Halloween is here\" appear. The same figures play the musical instruments on a dark red carpet. The words \"we will have a jolly time\" appear. The same figures are on a dark red carpet. The words \"we will have a jolly time\" appear. The same figures are on the dark red carpet again, with the words \"Halloween is almost here\" appearing. The same figures are on the dark red carpet with the words \"Halloween is almost here\" appearing. The same figures are on the dark red carpet with the words \"and I can hardly wait\" appearing. The same figures are on the dark red carpet with the words \"and I can hardly wait\" appearing. Two figures appear on the dark red carpet with the words \"and I can hardly wait\". Then, the group of figures are displayed with the words \"one more time!\". The same group of figures appear, surrounded by orange pumpkin decorations. The group of figures are displayed again. The same group is displayed again, with the words \"Halloween is almost here\" appearing. 
The same group is displayed again with the words \"and I can hardly wait\" appearing. The same group is displayed again with the words \"and I can hardly wait\" appearing. A close up of a witch-costumed character appears, with the words \"Halloween is almost here\" appearing. The same group of characters appears with the words \"Halloween is almost here\" appearing. The same group is displayed again, with the words \"and I can hardly wait\" appearing. The same group is displayed again, with the words \"and I can hardly wait\" appearing. The video then displays a set of glowing eyes in a dark blue background. The words \"When Halloween is here\" appear. The glowing eyes appear again, with the words \"When Halloween is here\" appearing. The glowing eyes appear again, with the words \"we will have a jolly time\" appearing. The glowing eyes appear again, with the words \"we will have a jolly time\" appearing. A group of characters are displayed in front of a dark blue background with a large cave-like structure and musical notes appear. The words \"Halloween is almost here\" appear. The same group is displayed again, with the words \"Halloween is almost here\" appearing. The same group is displayed again, with the words \"and I can hardly wait\" appearing. The same group is displayed again with the words \"and I can hardly wait\" appearing. A close-up of the pirate skeleton appears with the words \"Who has the best\" appearing. The same skeleton appears with the words \"Halloween costume?\" appearing. The same skeleton appears with the words \"Halloween costume?\" appearing. The group of characters appears again. The same group appears again with the words \"Me, me!\" appearing. The same group appears again. The same group appears again.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "Who appears after Dracula and the witch?",
+ "answer": "A",
+ "candidates": [
+ "A. A pink fox.",
+ "B. There are no other characters.",
+ "C. A skeleton pirate and a yellow shark.",
+ "D. A sprite and a cartoon monster."
+ ]
+ }
+ },
+ "XsdgxzNx": {
+ "video_id": "XsdgxzNx",
+ "video_duration": "57s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A Pinterest screenshot shows a blonde woman with a side braid. A woman with blonde hair pulled back in a low ponytail and tattoos on her neck speaks to the camera in a kitchen. The camera pans to a blue and black tripod and several cardboard boxes. The woman with the neck tattoos touches her hair, then the camera focuses on blonde hair extensions laid out on a countertop. A tube of Not Your Mother\u2019s Blonde Moment purple hair mask, a hair coloring brush, hair clips, sunglasses, a soda can, and more items are on the counter. A brunette woman wearing a black sleeveless turtleneck picks up and flips a section of the hair extensions. She puts on and adjusts a blonde wig with bangs. The camera pans across the kitchen island, showing various products, and blonde hair extensions lying on a beach towel. The tattooed blonde picks up a plastic bag containing a bundle of blonde hair extensions and holds it up. The brunette puts bobby pins in a section of hair extensions, then tries on and adjusts the blonde wig.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What might the blonde girl\u2019s profession be?",
+ "answer": "A",
+ "candidates": [
+ "A. A wig maker.",
+ "B. A designer.",
+ "C. A hairstylist.",
+ "D. A wig model."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the blonde girl ask the brunette girl to do at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Remove the bangs.",
+ "B. Wear black bangs and display them.",
+ "C. Wear the entire wig and display it.",
+ "D. Wear white bangs and display them."
+ ]
+ }
+ },
+ "twcSOVfY": {
+ "video_id": "twcSOVfY",
+ "video_duration": "518s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a grey shirt with a red graphic on it is standing in front of a white boat, and points at a circular black outlet with his right index finger. The person looks into the camera, then continues to show details of the boat. He points at a red light fixture and then moves along the boat. The camera cuts to a view through the boat of a large, open building in which other boats are present. Next, the video shows the interior of the boat in more detail. The person walks through the boat's interior and points out details like a black faucet, drawers, a bed, and a black electronic instrument panel. The person opens a refrigerator door and then shows the interior of a closet. The video cuts to show the exterior of the boat, and the person walks around the boat's deck. The camera cuts to an angle of the boat's front, then inside through the boat's windows. The person is sitting down inside the boat, facing the camera. The video cuts to show the boat interior again. Finally, the person is seated facing the camera, and moves his hands in a clapping gesture, with a purple message and a subscribe button appearing on screen.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "While presenting the yacht's kitchen in the video, what did the man wearing a gray short-sleeved shirt do with the cardboard?",
+ "answer": "B",
+ "candidates": [
+ "A. He tore up the cardboard.",
+ "B. He moved the cardboard aside.",
+ "C. He covered something with the cardboard.",
+ "D. He discarded the cardboard."
+ ]
+ }
+ },
+ "HAOfTctP": {
+ "video_id": "HAOfTctP",
+ "video_duration": "124s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video starts in a parking lot with cars and trees visible. A person wearing a black shirt and light blue shorts stands next to a black car, holding a phone. A shopping cart is seen in between two parked cars. Then, a person wearing a dark blue shirt and blue pants walks by, pushing a shopping cart. The person holds up a package of meat. The person continues walking and gesticulates with their hands while talking to the person holding the phone. The person in the dark shirt gets in a dark blue pickup truck and sits in the driver's seat. The person continues talking and looking out the open door of the vehicle. Then the video pans up to show the sky with trees and parked cars in the background. The sky has a criss-cross cloud pattern. The video then pans back down to the first person wearing a dark blue shirt and blue pants who points up towards the sky and then gets in the vehicle. Then the person is shown in the driver's seat of the dark blue truck. Then the video pans to the person from behind, and then back to show them walking, holding a shopping cart. The video shows the person gesturing and holding the shopping cart.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many shopping carts are visible at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. One.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Where does the car horn sound originate in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The car belonging to the man wearing blue.",
+ "B. The street on the left side of the screen.",
+ "C. The car belonging to the girl wearing black.",
+ "D. The car of the man holding the phone."
+ ]
+ }
+ },
+ "nXtWmHKD": {
+ "video_id": "nXtWmHKD",
+ "video_duration": "55s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A white Mercedes car drives slowly along a road. Several other Mercedes cars are parked on the side of the road, next to a large beige industrial building. A grassy field is visible to the right side of the road. The camera view switches to the interior of the silver car now driving slowly behind the white car. A man with sunglasses is sitting in the driver's seat. He's wearing a white short-sleeved shirt, blue jeans, and a seatbelt. He holds the steering wheel. The camera view switches back to show the rear of the two moving Mercedes cars. The silver car pulls up beside the white car, which slows to let it pass. The camera returns to the interior of the silver Mercedes. The driver is operating something on the console. The view switches to the front of the silver car, showing the Mercedes logo and grill. The camera focuses on the front license plate which reads: \"BSAN-011,\" with \"ONTARIO\" written smaller above it. The view returns to the driver in the front seat, showing him operating controls on the console. He keeps both hands on the steering wheel and his eyes on the road. Another white car remains visible behind him.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, where does the Mercedes-Benz park in relation to the white E-Class when it finds a parking spot?",
+ "answer": "A",
+ "candidates": [
+ "A. To its left front.",
+ "B. Directly in front of it.",
+ "C. To its left rear.",
+ "D. Directly behind it."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What action is the man in the video attempting by shifting into drive and pressing the accelerator?",
+ "answer": "A",
+ "candidates": [
+ "A. To bring the vehicle to a safe stop.",
+ "B. To manually adjust the vehicle's position.",
+ "C. To make the vehicle turn.",
+ "D. To increase the vehicle's speed."
+ ]
+ }
+ },
+ "pbLdCStB": {
+ "video_id": "pbLdCStB",
+ "video_duration": "73s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with a black head covering and a black and white checkered jacket sits at a table, speaking. The scene transitions to an outdoor shot of two large metal gates, one dark brown and one rusty red. The subtitles read \"singing: Give me fire in my life, keep me burning.\" A woman's head appears in the gate. She opens the gate and comes out, wearing a dark blazer and grey skirt. Another woman wearing a light blue top and white skirt runs up to the gate. The woman in the blazer touches the woman in the blue top. Both women are at the gate, with the woman in the blazer holding a phone and books. The woman in the light blue top has her hands on the gate. The subtitles read \"Jesus, are you trying to run into the church?\" The woman in the blazer then calls the woman in the blue top a \"harlot\". The woman in the blue top says \"Don't touch me, blood of Jesus.\" The woman in the blazer continues to scold the woman in the blue top, using the phrase, \"the blood of Jesus.\" The woman in the blue top gestures and says \"I sprinkle the blood of Jesus on top of you, what you want here?\" The woman in the blazer says \"you harlot!\" and the woman in the blue top replies \"I'm not a harlot, what are you? It's my boyfriend.\" She adds, \"He wants me to commit the third abortion for him. I disagreed. He's beating me, Jesus! That's why I came to church to cool off.\" The woman in the blazer replies with \"[speaks in tongues] you want to cool off?\" Then she says, \"you want to cool off inside the church like the church is a club.\" The camera shows both women, and then the woman in the blazer gestures and says \"The Holies of Holies, dressed like this, you're even wearing waist beads.\" She then says, \"[speaks in other tongues].\" The woman in the blue top continues to cry, then she says \"Go home and change your harlot dress.\" She adds, \"Ma, I don't have a home, I used to co-habit with my boyfriend\". The woman in the blazer replies with, \"Cohabit!\" and speaks in another tongue.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the boyfriend of the woman wearing the light blue top do to her in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. He wants to divorce her.",
+ "B. He wants her to resign from her job.",
+ "C. He wants her to sever ties with her family.",
+ "D. He wants her to have a third abortion."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes does the lady wearing a black top exhibit in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. From disdain to surprise.",
+ "B. From happiness to anger.",
+ "C. From happiness to surprise.",
+ "D. From disdain to anger."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the waist chain worn by the woman in the light blue top in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Green.",
+ "B. Blue.",
+ "C. Pink.",
+ "D. White."
+ ]
+ }
+ },
+ "sFrfMYZp": {
+ "video_id": "sFrfMYZp",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A bald man sits on a tree stump by a campfire, playing an acoustic guitar. Another man sits on an orange cooler, watching him. The man on the cooler looks up, surprised. The guitarist glances back and hoists his guitar over his shoulder, then turns back with a startled look. A large, hairy creature is standing behind the bushes. The man on the cooler's face turns to fear. A smaller image of a man with a beard and captain's hat appears superimposed over his shoulder, looking shocked. The guitarist walks towards the creature, and the man on the cooler calls out to him, saying \"Harry! Harry, be careful!\". The creature lifts a large rock over its head. The creature growls as he heaves the rock. The man playing the guitar looks confused. The creature drops the rock and advances toward them, roaring. The man on the cooler scrambles to his feet, yelling. The guitarist continues to play. The creature swipes at the man on the cooler, knocking him down. It then swings at the guitarist, sending him flying backwards. The creature picks up the orange cooler and hurls it into the bushes. It then runs towards the two men and swipes at them again, sending them both tumbling down a short embankment. They both land on the ground. The creature roars. The two men get up and start running away from the creature through the trees. The creature kicks the man on the cooler in the back, and he falls forward onto a log. The other man trips over him and falls as well. The creature falls on its side after stepping on a large blue bag, then sits up, shaking its head. One of the men helps the other one up. They stare, terrified, at the creature as it starts toward them again.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How did the man wearing the gray shirt's emotions change upon seeing the wild people in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. He remains excited.",
+ "B. His emotion shifts from excitement to panic.",
+ "C. His emotion shifts from panic to excitement.",
+ "D. He remains panicked."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "When the wildman first appears in the video, in which direction is he relative to the man wearing the gray shirt?",
+ "answer": "D",
+ "candidates": [
+ "A. To his left.",
+ "B. To his right.",
+ "C. Behind him.",
+ "D. Directly in front of him."
+ ]
+ }
+ },
+ "EZleJXpF": {
+ "video_id": "EZleJXpF",
+ "video_duration": "247s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a table, with the left column labeled \"CONCEALER\" and the right column labeled \"COMPLAINT\". The first row reads \"PINK\" and \"DULLNESS\", the second row reads \"YELLOW\" and \"PURPLE\", the third row reads \"GREEN\" and \"REDNESS\", the fourth row reads \"ORANGE\" and \"BLUENESS\", and the fifth row reads \"PURPLE\" and \"SALLOW\". Two hands are visible in the lower right portion of the screen. The video then cuts to a person with long brown curly hair and large gold hoop earrings, in front of a glittery background, wearing a white shirt. The person moves their hands, and makes facial expressions. The video then cuts to a still of several makeup products, labeled \"ORANGE/PEACH\", in front of a glittery background. The video cuts back to the person talking, and then a still image appears of 4 lines of an orange-colored makeup on a person's arm. The video cuts back to the person talking in front of a glittery background, making hand gestures. Then, the person holds up a stick of makeup, and applies it to their hand. The video then cuts to the person applying the same color makeup under their eyes, around their nose, and on their chin with the stick, then uses a makeup brush to blend in the color. At the end, the person is once again talking in front of the glittery background and demonstrates using the makeup stick.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order for applying concealer as demonstrated in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. First, assess the areas that need coverage, then choose a shade that matches the skin tone, next blend the product with a concealer brush, and finally use a corrective color stick to apply color.",
+ "B. First, assess the areas that need coverage, then choose a shade that matches the skin tone, next apply color using a corrective color stick, and finally blend the product with a concealer brush.",
+ "C. First, choose a shade that matches the skin tone, then assess the areas that need coverage, next blend the product with a concealer brush, and finally apply color using a corrective color stick.",
+ "D. First, choose a shade that matches the skin tone, then assess the areas that need coverage, next apply color using a corrective color stick, and finally blend the product with a concealer brush."
+ ]
+ }
+ },
+ "dbPFQIbG": {
+ "video_id": "dbPFQIbG",
+ "video_duration": "113s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman and a man are standing outside in front of a house with a brick facade. A white pickup truck is visible in the background. The woman is wearing a black jacket with a white fur lining, and the man is wearing a gray hooded sweatshirt. They are holding what appears to be a machete, with a serrated blade. The woman uses the machete to cut at the base of some tall dried plants. She uses the machete to sever the stems of the plants from the base. The man helps to clear the debris by hand, and also uses the machete to cut at the plants. At one point the man puts a small piece of the plant stem in his mouth. The woman also does this.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the girl taste the bananas?",
+ "answer": "B",
+ "candidates": [
+ "A. Four times.",
+ "B. Three times.",
+ "C. Twice.",
+ "D. Once."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence of events in the video? (a) The boy tosses the banana tree trunk into the air and waves the machete. (b) The boy tastes the banana tree for the first time. (c) The girl gets the machete stuck while cutting the banana tree. (d) The girl tastes the banana tree for the first time. (e) The girl steps on the banana tree.",
+ "answer": "B",
+ "candidates": [
+ "A. (a) (b) (d) (c) (e).",
+ "B. (c) (e) (b) (d) (a).",
+ "C. (b) (d) (c) (e) (a).",
+ "D. (b) (e) (d) (c) (a)."
+ ]
+ }
+ },
+ "ARTRikze": {
+ "video_id": "ARTRikze",
+ "video_duration": "525s",
+ "duration": ">8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen containing white text. The text states: \"You are only required to identify yourself to the police if you have been legally detained. In order to be legally detained, there has to be reasonable suspicion that you have committed or are about to commit a crime.\" Then another text box appears, stating: \"Not having ID at a federal building is not an actual criminal act (even if it was a requirement to enter that building). They also say in the video that he had not been trespassed from the property, which is actually public property. In addition, they never told him he had to leave. In fact, they illegally prevented him from doing so.\" Finally, another text box appears, stating: \"Therefore, (as they themselves admitted several times) the man in this video was not at any time being investigated for having committed a crime. That means he was not legally detained nor was he obligated to identify himself to them (regardless of whether they are \"the local police\").\" The video then transitions to a scene outside a building. A police officer wearing a tan uniform with dark pants is seen at the corner of a building. He then turns to his right and another police officer walks up. The first officer points and moves to his right while the second walks into the scene. A man wearing a suit with a light colored tie walks by. The video then shows the first officer walking along a paved area where three men are sitting on a low wall. The video then switches to a front facing perspective of one of the men in a black leather jacket and blue jeans. Another man wearing a plaid shirt is also visible. As the man in the black jacket is speaking, another man appears from behind the camera, pointing his finger and placing it near the lens. The video continues with the man in the black jacket speaking. A police officer then appears on screen and moves to the other side of the scene, turning to face the seated men. He motions with his hand as if counting. The camera perspective then switches again to another officer moving into the scene as the original camera perspective of the men is briefly visible before the new officer blocks it from view. The camera angle then shows one of the seated men being pulled and grabbed by the arms by the officers, and placed into the back of a police vehicle. The video continues to show the police officers at the scene. The video switches to an inside view of the police vehicle as the seated man is placed into it. One of the police officers puts his hands on the seated man's shoulder and attempts to strap him in. The video continues showing the man now inside the vehicle as the police officer tightens the man's seat belt. The police officer then closes the vehicle's door and the video ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the man in a black leather jacket handcuffed and taken away by the police at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Because he deliberately provokes trouble.",
+ "B. Because he insults the police.",
+ "C. Because he obstructs the police investigation.",
+ "D. Because he refuses to cooperate and provide identification."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What event occurred in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Police took away a man suspected of committing a crime.",
+ "B. The man was taken away by police in violation of regulations after refusing to provide his information.",
+ "C. The man was taken away by police after refusing to leave a building owned by others.",
+ "D. The man was taken away by police after insulting them."
+ ]
+ }
+ },
+ "bwyVshUk": {
+ "video_id": "bwyVshUk",
+ "video_duration": "526s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a blue hooded jacket and black pants with a yellow stripe is in a bedroom. He opens a suitcase and places clothing inside, including a green shirt, a white object, and a yellow object. Then he puts a dark colored object in, followed by more objects, including a pair of gloves. He then turns and pulls a blanket off a person lying in a bed who has a blue cloth around their forehead. The man walks around the bed, reaching into the suitcase. The person in the bed gets up, pulling off the covering over their face. They are wearing a white tank top and pink shorts and have a blue headband. They pick up a phone and look at it. They stand, facing the man, and begin arguing. The man points, and the two continue talking, gesturing with their hands. The man touches the person\u2019s shoulder and then their arm. The woman brushes his hand away and begins to pace around the room. The man continues putting objects in the suitcase while the person walks in a circle on the bed, eventually moving to the floor. The man grabs more items from the dresser, and the woman begins picking up items from the bed. The man turns to face the camera and makes gestures with his arms. The man holds both hands to his chest and then makes hand gestures towards the camera. The man claps his hands together and makes hand gestures toward the camera. The woman walks over to the dresser, holding a water bottle and a small black object. They both continue to argue. They then both are visible in the view of the camera for the rest of the video, talking to the camera. The camera shifts to show just the man, with the woman in the blurred background of the shot, as he continues speaking. At the end of the video, only the man's face is visible. The view then cuts to the ceiling. The view then shifts back to the man, as he continues to speak to the camera. He then adjusts his hood and then begins talking directly to the camera again. The man and woman are both visible in the camera view for the remainder of the video. They both speak to the camera, gesturing with their hands.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, what is the position of the vanity table relative to the Black woman wearing a white tank top while she is lying in bed?",
+ "answer": "B",
+ "candidates": [
+ "A. On her right.",
+ "B. On her left.",
+ "C. In front of her.",
+ "D. Behind her."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the woman lying in bed do to the black man's luggage at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The woman closed the suitcase.",
+ "B. The woman pulled clothes out of the suitcase.",
+ "C. The woman carried the luggage off the bed.",
+ "D. The woman kicked the luggage."
+ ]
+ }
+ },
+ "bBXioxmu": {
+ "video_id": "bBXioxmu",
+ "video_duration": "142s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with dark wavy hair, wearing a gray v-neck shirt, is speaking and gesturing. He is shown against a dark gray background. An Instagram post shows a picture of six men standing outdoors against a concrete wall. They are all shirtless. Four men are seated around a small table, each wearing headphones, and holding a microphone. A man with dark wavy hair, wearing a gray v-neck shirt, is speaking and gesturing. He is shown against a dark gray background. A shirtless man with a shaved head and a reddish beard does push-ups using a wooden structure outdoors. He wears blue shorts. A man in a black tank top is using a weight machine in a gym. There are other people in the background. A man with dark wavy hair, wearing a gray v-neck shirt, is speaking and gesturing. He is shown against a dark gray background. The image displays the words, \"The Game Changers Official Trailer\" in white, with green accents, on a dark, textured background with an arm and a portion of a shoulder. A man with dark wavy hair, wearing a gray v-neck shirt, is speaking and gesturing. He is shown against a dark gray background. A man with a dark curly haircut is sitting outside on a lawn, shirtless, wearing black shorts and white socks. He is flexing and holding a green object in his left hand. A woman with long brown hair, wearing a green and white striped shirt, with a blue lanyard around her neck is speaking. She is in front of a light blue background. A red combine is harvesting crops in a field. A green tractor is positioned to the right of the combine. A man with dark wavy hair, wearing a gray v-neck shirt, is speaking and gesturing. He is shown against a dark gray background. A man is looking at the camera, wearing a dark blue shirt with a black headband. A woman is sitting on a blue track in a yellow and green athletic suit, with a blue and red flag wrapped around her. A man with dark hair is pointing with his right hand, while wearing a bright blue jacket. Two men, with beards, are shown side-by-side with an orange 'VS' separating them.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the television relative to the man wearing the gray hat in the second group photo?",
+ "answer": "B",
+ "candidates": [
+ "A. The television is in front of him.",
+ "B. The television is to his right.",
+ "C. The television is to his left.",
+ "D. The television is behind him."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotion of the man with long curly hair in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Sad.",
+ "B. Joyful.",
+ "C. Optimistic.",
+ "D. Condemnatory."
+ ]
+ }
+ },
+ "sRwyXMlA": {
+ "video_id": "sRwyXMlA",
+ "video_duration": "75s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two shirtless individuals wearing boxing gloves are in a boxing ring with red, white, and blue ropes. A third individual in a light green shirt and black pants is also in the ring. The two boxers are engaged in a physical bout, throwing punches at each other. The third individual in a green shirt intercedes and separates the two. He holds one boxer by the shoulders with each hand. The boxer in the white shorts appears to be breathing heavily. The referee holds the boxer in white by his shoulders while he appears to speak to him. The boxer in gold shorts walks away. The boxer in white then resumes the bout, and both fighters throw punches. The boxer in white shorts is pushed against the ropes and holds up his black-gloved hands to his face. The boxer in gold punches the other boxer, who falls against the ropes. The referee comes between the two again. He raises his arm and the boxer in gold shorts' arm is raised as well. The individual in gold then has a green and gold belt put around his waist, as well as a shiny silver trophy, and the individuals in the ring appear to congratulate him.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "After the boxer wearing gold shorts wins the championship in the video, where is the referee positioned relative to him?",
+ "answer": "C",
+ "candidates": [
+ "A. Directly in front.",
+ "B. Directly behind.",
+ "C. On the left side.",
+ "D. On the right side."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the referee signal to pause the game?",
+ "answer": "C",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. One.",
+ "D. Zero."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "In the video, how does the boxer in golden shorts deliver a heavy blow to the boxer in white shorts?",
+ "answer": "D",
+ "candidates": [
+ "A. Jab.",
+ "B. Overhand punch.",
+ "C. First a straight punch, then an uppercut.",
+ "D. First an uppercut, then a straight punch."
+ ]
+ }
+ },
+ "DpxxFRul": {
+ "video_id": "DpxxFRul",
+ "video_duration": "435s",
+ "duration": "6-8min",
+ "domain": "Games",
+ "sub_category": "Sports Game",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows gameplay footage of an American football video game. The initial view is from behind the line of scrimmage and then shifts to show the entire field. Men in uniform begin by standing at a line, and run into each other. Throughout the video there are multiple plays with men in uniform running or throwing a ball, with the camera following the action. There is often a graphic overlay showing the play setup. The video ends with a menu appearing that says \"Your peer has conceded the game\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What was the commentator's mood after Player 88 was tackled?",
+ "answer": "B",
+ "candidates": [
+ "A. From excitement to calm.",
+ "B. From calm to excitement.",
+ "C. From calm to disappointment.",
+ "D. From calm to anger."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many goals were successfully scored in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Zero.",
+ "B. Two.",
+ "C. Three.",
+ "D. One."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why might the opponent leave the game?",
+ "answer": "A",
+ "candidates": [
+ "A. The opponent started swearing after the narrator scored and stormed off angrily.",
+ "B. The opponent achieved the final victory.",
+ "C. The opponent no longer wanted to continue playing the game.",
+ "D. The opponent suddenly had something urgent to attend to and left."
+ ]
+ }
+ },
+ "bdCPCAOf": {
+ "video_id": "bdCPCAOf",
+ "video_duration": "301s",
+ "duration": "4-6min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man with short, dark hair and a beard sits in a dimly lit room, playing an acoustic guitar. He wears a dark olive-green T-shirt and a patterned guitar strap. A condenser microphone stands before him, attached to a tripod. Behind him are cream-colored curtains, a glowing salt lamp on a table, and an out-of-focus oval mirror on the wall to the left. He looks down at his guitar, smiling slightly as he strums, then looks directly at the camera. He continues playing and singing, his eyes occasionally closing as he performs. The camera focuses briefly on his hands as he plays a short guitar riff. Then, it returns to the man\u2019s face as he sings with more passion, sometimes looking slightly upward. He grins, then continues playing a guitar riff. Again, the camera briefly shows his hands on the fretboard and then returns to the man. He closes his eyes briefly, plays, and sings with increasing intensity, sometimes tilting his head back. He sings a long, high note, and another brief guitar riff follows. He sings with gusto, grinning and looking upward. He looks down at his instrument as he plays and then delivers more vocals, looking upward. Another guitar solo follows. He performs more vocals with passion. Then, the video shows another guitar riff and then more vocals. Another short guitar solo is played. He sings, occasionally closing his eyes. He plays another short guitar riff. He delivers more powerful vocals with his head slightly tilted back, eyes closed at times. After the next guitar solo, he sings powerfully with emotion and grins while continuing to strum and sing.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Is the performer standing while singing?",
+ "answer": "C",
+ "candidates": [
+ "A. I do not know.",
+ "B. No, the performer is not standing.",
+ "C. Yes, the performer is standing.",
+ "D. None of the above."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the progression of emotions experienced by the performer?",
+ "answer": "D",
+ "candidates": [
+ "A. From calm to excitement and then to joy.",
+ "B. From excitement to calm and then to joy.",
+ "C. From excitement to joy and then to calm.",
+ "D. From joy to excitement and then to calm."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion is conveyed at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I have a desire to kill someone.",
+ "B. Reflecting on one's position in history and humanity's role.",
+ "C. I want to be like Jesus.",
+ "D. I am a man of refined taste."
+ ]
+ }
+ },
+ "zwkbEZwG": {
+ "video_id": "zwkbEZwG",
+ "video_duration": "499s",
+ "duration": ">8min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A Hearthstone game is in progress. The player's hand contains several cards, including \"Ice Block\". The opponent's board has three minions: a 5/6 minion, a 1/1 minion, and a 2/3 minion. The player casts \"Book of Specters\", drawing cards and damaging the opponent's minions. Next, the player uses \"Cabalist's Tome\", adding three random Mage spells to their hand. The player plays \"Sorcerer's Apprentice\", reducing the cost of their spells by one. The player then plays four copies of \"Khadgar\", increasing the number of minions summoned by spells. The player casts \"Arcane Missiles\", launching multiple missiles at the opponent's minions. The missiles multiply due to Khadgar's effect, filling the board. The player then attacks the enemy minions, eliminating them. Finally, a \"Fireball\" card is shown in the player's hand. The scene changes to a victory screen, showing level 14 \"Raid Leader\" rank.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the woman wearing a white sweater leave her seat?",
+ "answer": "D",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. She is present throughout the entire video.",
+ "C. At the beginning of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional shift does the woman wearing a white sweater experience at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Her mood changes from agitated to elated.",
+ "B. Her mood changes from calm to tense.",
+ "C. Her mood changes from frustrated to tense.",
+ "D. Her mood changes from nervous to excited."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the woman in the white sweater, sitting in the video, stand up and cover her face at the end?",
+ "answer": "C",
+ "candidates": [
+ "A. Because she is very excited about failing the game.",
+ "B. Because she is very sad about failing the game.",
+ "C. Because she is very excited about winning the game.",
+ "D. Because she is very upset about failing the game."
+ ]
+ }
+ },
+ "sWrqyBmV": {
+ "video_id": "sWrqyBmV",
+ "video_duration": "36s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with pink braided hair and a patterned dress walks down a sandy path. Then a white substance is thrown on her, she reacts by touching her face. The camera cuts to a person wearing a black baseball cap and white tank top, and holding a green bucket. This person looks towards the right and then looks toward the camera. The video cuts back to the person with pink hair, who is rubbing her face and then opens her mouth. The video cuts back to the person with the green bucket, who looks off to the side and uses their hands to cover their face. The video cuts to the person with pink hair, who opens her mouth and looks to the right. She then uses hand gestures as if she is speaking. The video cuts to the person with the baseball cap who then uses hand gestures as if speaking. The video cuts back to the person with pink hair who uses different facial expressions and hand gestures as if speaking. The video then cuts back to the person with a baseball cap, who appears to be pointing at something. The video shows the person with pink hair then looking and gesturing toward her hand. Finally, the person with a baseball cap moves towards her and the video ends.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the blue basin's position relative to the man in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. It is in front of the man.",
+ "B. It is behind the man.",
+ "C. It is to the left of the man.",
+ "D. It is to the right of the man."
+ ]
+ }
+ },
+ "LVLxFBNY": {
+ "video_id": "LVLxFBNY",
+ "video_duration": "167s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Software",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a yellow hoodie sits at a desk with a black tablet with keyboard, a white stylus and a red phone. They gesture with their hands as they speak, then they pick up and open the tablet, turning it around in their hands to demonstrate its features. The screen shows various apps, then a white background with handwriting. The person uses the stylus on the tablet screen. Later in the video they change into a dark red hoodie, and the tablet screen shows a checklist. The person picks up their red phone and displays it, then places it on the black tablet keyboard. They continue to use the tablet, flipping it around. At the end of the video they hold the black tablet cover in their hands, show a peace sign, and the screen switches to a final slide with social media handles.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the man in the video put a protective case on the phone?",
+ "answer": "A",
+ "candidates": [
+ "A. No, he does not.",
+ "B. None of the above.",
+ "C. Yes, he does.",
+ "D. It is not possible to determine."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "According to the video, what activities does the man mention using a tablet for?",
+ "answer": "D",
+ "candidates": [
+ "A. Drawing, taking class notes, or searching the internet for information.",
+ "B. Writing papers, taking class notes, or drawing.",
+ "C. Writing papers, drawing, or searching the internet for information.",
+ "D. Writing papers, taking class notes, or searching the internet for information."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man place his phone between the keyboard and the tablet?",
+ "answer": "B",
+ "candidates": [
+ "A. To make it more convenient to use the phone.",
+ "B. To prevent the keyboard from overlapping with the tablet.",
+ "C. The man uses the tablet and keyboard as a phone stand.",
+ "D. To adjust the angle of the keyboard."
+ ]
+ }
+ },
+ "xLprucUA": {
+ "video_id": "xLprucUA",
+ "video_duration": "67s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A news anchor in a suit and tie stands in front of a screen that reads \u201cWORKERS FIRED\u201d and \u201cA DAY WITHOUT IMMIGRANTS\u201d. The video cuts to a close up shot of clasped hands. The camera pans down to show a group of people standing together. It pans back up, showing the lower halves of their bodies. Most are wearing jeans or shorts. The shot changes to a two-story building with a curved awning over the entrance. The sign on the building reads \"I DON\u2019T CARE BAR AND GRILL.\u201d The camera zooms in on the sign. Then the video shows a freestanding sign that says \u201cBAR AND GRILL\u201d with a smaller sign underneath that says \"OPEN\" in red letters. The camera cuts back to the group of people and shows close up shots of clasped hands. The next shot shows a large, circular, yellow sign attached to a pole. In black letters, it reads \u201cTDC\u201d and below that \"BAR AND GRILL.\u201d The camera pans across the building again, pausing on a large garage door style window. The video cuts back to the group of people, focusing on their clasped hands. A screenshot of text messages appears. The first message reads, \u201cYou and your family are fired.\u201d The response reads, \"Ok tank you.\" The next message reads, \"And thank you.\u201d Finally, a longer message is displayed: \u201cI hope you enjoyed your day off and you can enjoy many more. Love you.\u201d The video cuts back to a shot of clasped hands and then to the TDC Bar and Grill sign. The video ends with another shot of clasped hands.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What is the reason for the individual in the video's stance on immigration policies?",
+ "answer": "B",
+ "candidates": [
+ "A. To protest unfair dismissals.",
+ "B. To demonstrate that immigrants need employment.",
+ "C. To reduce competition for jobs from immigrants.",
+ "D. To advocate for more vacation time for immigrants."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What are the people, whose faces are not visible in the video, doing?",
+ "answer": "C",
+ "candidates": [
+ "A. They are searching for employment.",
+ "B. They are enjoying a vacation.",
+ "C. They are protesting immigration policies.",
+ "D. They are demanding compensation from their employer."
+ ]
+ }
+ },
+ "AFfsMhIF": {
+ "video_id": "AFfsMhIF",
+ "video_duration": "56s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Astronomy",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with long brown hair is visible, wearing a yellow shirt with the text \u201cS2HAC3\u201d in a multi-color font. They are speaking and gesturing with their hands. In the background are a telescope, a poster on the wall, and a keyboard. The video then transitions to a view of space with a dark background and a field of red and white stars. There is some text overlayed on the bottom of this image, \"Untracked at 85mm f.l. 568x2 sec.\" Next an upward pointing arrow and a downward pointing arrow appear on either side of a bright red nebula, then the arrows disappear. This fades into the same image of space without the text or the arrows. The camera then pans out and the space image continues, it becomes more zoomed out so that it is more vertically oriented. A graphic of old star map is shown in the background, with a faded version of the previous space image overlayed. Finally the video transitions back to the initial shot of the person in the yellow shirt, who is still speaking and gesturing with their hands.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a dynamic image of a nebula in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes.",
+ "B. Uncertain.",
+ "C. No."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the image of the constellation Cygnus appear?",
+ "answer": "B",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the end of the video.",
+ "C. Throughout the entire video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "ETcECmaq": {
+ "video_id": "ETcECmaq",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a first-person perspective of a Minecraft game. At the beginning of the video, there is a text overlay that says \"WITH NO ITEMS\". There is a red bar at the top with \"Time Left\" written on it and a red bar at the bottom with 10 hearts. The player is standing in front of a lavafall. Then the text overlay changes to \"already started\", \"and I'm trapped\", \"by looking around,\", \"you'll find a\", \"and ice,\", \"which won't\", \"unless there's\", \"underneath it\", and \"there's also\". There is a red subscribe button on the left side of the screen. The cursor then clicks the button and it changes to a \"Subscribed\" button. The screen shows a barrel screen with nothing in it and text over it says \"with seemingly\" and \"nothing inside,\" \"look carefully,\", and \"you'll find an\" and a light grey item is selected with the text \"Light Grey\" on it. The text overlay continues with \"invisible glass pane...\" \"now if you\" and \"and break the ice,\" \"it'll turn to water,\" \"and you have to\" \"exit before it's\" and \"absorbed by\" and \"next,\" \"grab the sponge,\" \"and mushrooms\" \"around the room,\" \"then tower up\" \"the lava conceals\" \"a hidden hopper\" \"with these items,\" and \"and a crafting table\". The inventory menu appears with a bowl, flowers, iron ingot, and a bucket. Text overlays continue with \"with the\" \"flowers and bowl,\" \"a suspicious stew\" \"with fire resistance,\" and \"and a bucket\". The player drinks a stew, then swims through a lavafall, then text overlays continue with \"then extinguish yourself\" \"in the powdered snow\" \"there's a candle\" \"above ice,\" and \"a hidden snowball\" \"in an item frame\" The player throws a snowball \"through the lava\" \"to light the candle,\" and the video ends.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion is primarily conveyed in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Tense.",
+ "B. Relaxed.",
+ "C. Sad.",
+ "D. Frightening."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many objects in the video are related to snow?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. Four.",
+ "D. One."
+ ]
+ }
+ },
+ "xILTYmgI": {
+ "video_id": "xILTYmgI",
+ "video_duration": "276s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two Muay Thai fighters, one with darker skin wearing white shorts and one with lighter skin wearing red shorts, face off in a ring. The fighter in red shorts performs the wai kru pre-fight ritual. The fighters tap gloves at the center of the ring. Round 1 begins. The fighter in the red shorts lands a leg kick. The two fighters trade leg kicks. The fighter in red throws punches followed by a leg kick. The fighter in red lands several more leg kicks. He throws a punch combination followed by another leg kick. He lands a head kick, and the fighter in white falls to the mat. The referee checks on the fighter in white, who is lying on his back. Medical staff enters the ring and attends to the fighter in white, administering smelling salts and applying pressure to his chest. The fighter in red exits the ring and celebrates by raising an Isuzu-branded banner. Medical personnel continue to attend to the fighter in white. The fighter in red re-enters the ring to be declared the winner. The two fighters embrace.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video do a man in black armor and a man in white armor embrace?",
+ "answer": "D",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. In the middle of the video.",
+ "C. Throughout the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "How did the man in black armor defeat the man in white armor?",
+ "answer": "B",
+ "candidates": [
+ "A. A front kick.",
+ "B. A right hook.",
+ "C. A left hook.",
+ "D. An elbow strike."
+ ]
+ }
+ },
+ "tJcEYROv": {
+ "video_id": "tJcEYROv",
+ "video_duration": "479s",
+ "duration": "6-8min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a soccer match on a green field, with white lines marking the boundaries. Players wearing red shirts and blue shorts and players in yellow shirts and blue shorts are visible on the field. There is a goal at either end of the field with nets. The players are running around the field and kicking a soccer ball. A referee dressed in black is also visible on the field. The camera shifts between different angles and points of focus of the field and zooms in on some of the player's faces as they move on the field. The background shows crowds of people and sponsor signs around the field. Some players are seen falling on the field and other players are seen raising their arms in the air.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many successful goals are made in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three.",
+ "B. One.",
+ "C. Two.",
+ "D. Zero."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What are the roles of two individuals wearing black on a soccer field?",
+ "answer": "B",
+ "candidates": [
+ "A. Referee and coach.",
+ "B. Goalkeeper and referee.",
+ "C. Coach and goalkeeper.",
+ "D. Coach and referee."
+ ]
+ }
+ },
+ "sPjBAKUu": {
+ "video_id": "sPjBAKUu",
+ "video_duration": "56s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video features a person holding a microphone and talking. They are wearing a red and black plaid shirt over a black t-shirt with a white design on it, dark pants, and white shoes with orange swooshes. They are wearing a silver necklace, and earrings. At times, the background is blue and purple with white horizontal stripes. At other times, the background is a darker color. There are a few quick shots of other people; one person is wearing a white shirt, one is wearing a dark outfit with sequins, one is wearing a pink outfit with sequins, and two are wearing black shirts with vertical white stripes. The bottom of the screen contains the text \"AfricanComedy.com.au\" in yellow, as well as a graphic of the continent of Africa in green, red, and yellow lines. The text \"Emo Majok - Australia's Got Talent\" is visible in yellow text at the top of the screen. The person with the microphone is sometimes standing and sometimes speaking in close up. They occasionally gesture while speaking. At one point they are standing on a stage that is lit in blue and white light, with a few round orange lights along the back.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order does the man in the video tell the following stories?",
+ "answer": "B",
+ "candidates": [
+ "A. A white person asks for change - a black town experiences poverty - a black person seeks water.",
+ "B. A black town experiences poverty - a white person asks for change - a black person seeks water.",
+ "C. A white person asks for change - a black person seeks money for water - a black town experiences poverty.",
+ "D. A black town experiences poverty - a black person seeks water - a white person asks for change."
+ ]
+ }
+ },
+ "tlpGEaqj": {
+ "video_id": "tlpGEaqj",
+ "video_duration": "82s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a close up shot of the back of a gray car, showing the license plate, taillights, the car's logo, and a silver fish emblem. The trunk of the car is then opened. A person in a long sleeved white shirt and grey shorts is standing next to the car. There is a person lying in the trunk, with their hands tied. The person in the trunk talks to the person standing, then begins putting their hands over their face. The person in grey shorts points inside the trunk. The person in the trunk brings his tied hands to his mouth and then laughs. He then asks to be punched. The person in the grey shorts closes the trunk. The person in the trunk once again asks to be punched. The person in the grey shorts opens the trunk again and the person in the trunk puts his hands over his face. They engage in conversation, then the person in the trunk hands the person in the grey shorts a small white object. The person in the trunk continues talking to the person in the grey shorts, who remains outside the trunk. The person in the trunk then ties his own hands together and begins laughing, continuing the conversation with the person standing. The person in the trunk then says \"No!\" The person in the trunk continues the conversation with the person in the grey shorts, and then begins tying his own hands. The person in the trunk then sits up and moves his hands around. The person in the trunk talks to the person in the grey shorts and begins laughing. The person standing then laughs. The person in the trunk once again moves his hands around. The video then shows the person in grey shorts close the trunk and the video ends with a final shot of the back of the gray car.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the man wearing black untie his restraints?",
+ "answer": "D",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. Throughout the video.",
+ "C. At the end of the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "IxOInqGB": {
+ "video_id": "IxOInqGB",
+ "video_duration": "420s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with dark hair pulled back, wearing a navy blue and light pink top with thin straps and a gold necklace, looks at the camera. Behind her, a large painting with vertical strokes of green, blue and white hangs on the wall. A screen displaying the Dough It Well website appears in the lower right corner of the screen. On the site, a cinnamon roll with white icing and some slices of bread are visible along with the text \"crave-worthy PROTEIN\".\u00a0The screen with the website disappears, and the woman holds up a package of Daring Foods lemon & herb pieces. She brings the package closer to the camera to show the nutritional information, ingredients, and cooking instructions on the back.\u00a0The screen then fades to white. A light green brush stroke appears in the center, and the words \u201cEAT LAUGH LIFT\u201d appear written over it in black. The word \"with Ghia\" is written underneath in a handwritten-style font. A cartoon drawing of a gorilla looking over its right shoulder appears to the right of the text. The gorilla image breaks into small pieces and then disappears, along with \u201cwith Ghia\u201d. The green brushstroke fades, leaving only the text \u201cEAT LAUGH LIFT\u201d. The screen fades to white.\u00a0The woman reappears, picking up a large mint green box with the word \"daring\" printed on it in white. She opens the box which contains some crinkled brown paper filling, a smaller white box with the word \"daring\" on it, dry ice packs, and a paper information sheet.\u00a0The woman opens the white box which contains several packages of the Daring Foods product. She opens one package.\u00a0The scene cuts to a close-up of a white plate containing a portion of green beans and two separate portions of the cooked Daring Foods product. One portion is light brown, and the other is darker, with a green herb sauce on it. The woman uses a fork to separate a piece from the light brown portion.\u00a0The woman picks up the package and shows the contents, several beige pieces of the uncooked product.\u00a0The woman then shows cooked and uncooked pieces on forks held next to each other to demonstrate the change in appearance after cooking.\u00a0The woman uses a knife and fork to cut and eat the lighter piece on the plate, and then takes a bite of the darker piece with sauce on it.\u00a0 She continues eating from the plate, alternating between the two portions of the product.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What flavors of chicken did the woman eat last in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Cajun and lemon herb flavors.",
+ "B. Cajun, lemon herb, and original flavors.",
+ "C. Cajun and original flavors.",
+ "D. Lemon herb and original flavors."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many times does the INS advertisement appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. One.",
+ "B. Four.",
+ "C. Three.",
+ "D. Two."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the woman's role in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The owner of Daring Foods.",
+ "B. The founder of a dessert brand.",
+ "C. A food critic.",
+ "D. An employee of Instagram."
+ ]
+ }
+ },
+ "zolnZkPk": {
+ "video_id": "zolnZkPk",
+ "video_duration": "167s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person is standing behind a food cart. They take a small pile of brown material on a wooden board and light it with a lighter. The wooden board is then placed into a small stove. Another wooden board is placed into the stove. More pieces of wood are placed in the stove. The stove is now lit with a visible flame. People stand around the cart and are served. There is a large pan with many wooden skewers in it and it is steaming. The person at the food cart adds a liquid from a black pan into the larger steaming pan. A person scoops skewers out of the pan and dips them in a smaller pan. More wooden skewers with food on them are pulled from the steaming pan. The person places some of the skewers into a box. The person also places skewers in bowls for customers. Another person ladles liquid into a bowl. Another person hands out bowls of food. More wooden skewers with food on them are added to the box. A person is scooping food from the large steaming pan.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order are the five steps presented in the video? (a) prepare the ingredients, (b) put the skewers in a bowl, (c) pour in the broth, (d) serve to the guests, (e) cook in the pot.",
+ "answer": "D",
+ "candidates": [
+ "A. (a) (b) (c) (d) (e).",
+ "B. (a) (b) (c) (d) (e).",
+ "C. (a) (c) (d) (b) (e).",
+ "D. (a) (e) (b) (c) (d)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "At the beginning of the video, from which direction is the sound of cars in the background coming?",
+ "answer": "B",
+ "candidates": [
+ "A. The sound of cars is coming from the street in front of the screen.",
+ "B. The sound of cars is coming from the street on the right side of the screen.",
+ "C. The sound of cars is coming from the street behind the screen.",
+ "D. The sound of cars is coming from the street on the left side of the screen."
+ ]
+ }
+ },
+ "wcPxfnqB": {
+ "video_id": "wcPxfnqB",
+ "video_duration": "546s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A cement truck pours wet concrete onto a gridded, framed area. Several workers, wearing boots and some in orange shirts, use long-handled tools to spread the concrete evenly within the frame. The frame sits on a bed of sand, and some pipes protrude from the concrete surface. A worker uses a bull float to smooth the surface. A power screed is then used to further level and smooth the concrete. Additional concrete is poured from the truck and spread. Workers continue to smooth and finish the surface with hand tools and bull floats.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many concrete trucks are shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. One.",
+ "B. Three.",
+ "C. Four.",
+ "D. Two."
+ ]
+ }
+ },
+ "exCyzNQV": {
+ "video_id": "exCyzNQV",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "Two individuals are sitting in the front seats of a moving car. The person in the driver's seat is holding the steering wheel. A swirling, tie-dye-like graphic appears with the text \"#5 WON'T LET ANYONE talk or play music.\" The driver makes various disapproving facial expressions and hand gestures as the passenger attempts to talk and play music from her phone. The passenger makes faces and gestures as if singing and dancing along to the unheard music. At one point, the passenger places her hand over her mouth as if shocked into silence. She then rubs her neck. She continues making exaggerated movements with her head and arms. She points to her cheek, then to her nose. The video ends with both individuals in the car, with the driver continuing to drive.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "In the video, what did the young girl do when the woman took out her phone?",
+ "answer": "D",
+ "candidates": [
+ "A. The young girl shouts at the woman.",
+ "B. The young girl looks at the woman.",
+ "C. The young girl does nothing.",
+ "D. The young girl stops the woman with her hand."
+ ]
+ }
+ },
+ "vVeAtqeZ": {
+ "video_id": "vVeAtqeZ",
+ "video_duration": "66s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a first-person view from inside a car. The camera is positioned behind the steering wheel, facing the car's dashboard and the road ahead. The dashboard contains two gauges, and a screen in between, with illuminated indicators. A hand with a light skin tone can be seen interacting with the buttons on the steering wheel. The legs of the person are visible below the steering wheel and are covered in green fabric. The car is on a road and surrounded by other vehicles. The car drives along various roads, some with buildings and trees alongside, and then an open highway with tall buildings in the background. A hand then picks up a clear plastic bottle with a white cap. The bottle appears empty and the person then puts the bottle into the cup holder. A hand then pours a yellow fluid into the bottle. The hand is then seen repeatedly shaking the bottle. The hand then proceeds to place the yellow liquid filled bottle back into the cupholder. The car is seen driving again and the hand is shown moving on the steering wheel and the dashboard. The video shows a bottle filled with a yellow liquid being held and moved around again, with various views of the interior of the car and driving on various roads.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man drink water in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Zero.",
+ "C. One.",
+ "D. Two."
+ ]
+ }
+ },
+ "fodTXEid": {
+ "video_id": "fodTXEid",
+ "video_duration": "125s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two people are sitting in chairs facing the camera. The person on the left has blonde hair pulled up into a bun, wears a patterned sweater, a light colored shirt underneath and blue jeans. The person on the right has curly brown hair, a gray sweater, and a white shirt underneath, as well as blue jeans. In between them there is a fireplace with flames, and above it a white mantle. On the mantle is a small gnome figurine. On the floor to the right is a gray gnome figurine with a pointy hat. To the left is a lamp with a light tan shade. There is a series of close ups of other things that appear in the video. The first is a bowl of rice pudding with a wooden spoon in it, on a gray and white striped towel, with a small glass bottle of milk, a jar of cinnamon, and two cinnamon sticks. The second is four christmas stockings hanging from hooks, two with black and white stripes, and two smaller ones with red accents. The third is a collection of Christmas presents wrapped in red and white wrapping paper. The video shows several more images of people and Santas, as well as other images of the two original people talking, and a cartoon of a dog character attempting to make popcorn.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the family planning to do at 3 PM on Christmas afternoon?",
+ "answer": "C",
+ "candidates": [
+ "A. The family gathers to have dinner together.",
+ "B. The family gathers to prepare dinner.",
+ "C. The family gathers to watch a Disney movie collection.",
+ "D. The family gathers to open gifts."
+ ]
+ }
+ },
+ "KLcjGfCO": {
+ "video_id": "KLcjGfCO",
+ "video_duration": "354s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view of a white, vertically paneled exterior wall with a rectangular window and a wood panel below. The window is then opened from the bottom outward. The wood panel underneath folds down to form a serving counter. A man wearing a grey t-shirt and a baseball cap appears in view. The man opens a door to the tiny house and enters the space. The camera follows the man as he walks through the interior, panning from the floors, to the kitchen, and then a washroom, followed by a bedroom area. The man shows the different cabinets and areas, and finally poses to the camera in the middle of the tiny house. The man then waves and the video ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What professions might a man or woman hold?",
+ "answer": "C",
+ "candidates": [
+ "A. A carpenter.",
+ "B. An electrician.",
+ "C. An interior designer.",
+ "D. A cleaner."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a washing machine operating in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. I don't know.",
+ "B. No, there isn't.",
+ "C. Yes, there is."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will not happen after the video ends?",
+ "answer": "B",
+ "candidates": [
+ "A. Install a sink and toilet in the bathroom.",
+ "B. Install an oven in the kitchen.",
+ "C. Add a shoe rack to the stairwell.",
+ "D. Place a cat litter box in the corner of the living room."
+ ]
+ }
+ },
+ "qJrSTwKy": {
+ "video_id": "qJrSTwKy",
+ "video_duration": "86s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person in a black t-shirt and light-colored pants is using a leaf blower with an attached hose and collection bag to blow a pile of leaves into a green bin on wheels. The leaf blower is mostly black with some green parts. The bin has a black cover on top. The ground is concrete and there is some grass visible on the left side of the frame. The person moves the leaf blower over the pile of leaves. The leaves are mostly brown, with some lighter colored and yellow ones mixed in. In another scene, a person is shown attaching a dark, cylindrical attachment to the end of the leaf blower. In a subsequent scene, this person is shown removing parts from the leaf blower using various tools, including a screwdriver. In another scene, the person is using a black saw to cut a black plastic part of the leaf blower.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What occurs after a vacuum cleaner becomes clogged?",
+ "answer": "B",
+ "candidates": [
+ "A. The man buys a new vacuum cleaner hose and nozzle.",
+ "B. The man takes the vacuum cleaner to the table for repair.",
+ "C. The man throws away the vacuum cleaner.",
+ "D. The man buys a new vacuum cleaner."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many screwdrivers were used to address the vacuum cleaner blockage?",
+ "answer": "A",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Two.",
+ "D. One."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What causes a vacuum cleaner to become blocked?",
+ "answer": "B",
+ "candidates": [
+ "A. The vacuum cleaner's hose is old.",
+ "B. The angle of the vacuum cleaner's hose is too large during use.",
+ "C. The vacuum cleaner's hose is damaged.",
+ "D. There is a lot of dust in the vacuum cleaner's hose."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the man cut the connection between the vacuum cleaner and the hose?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the end of the video.",
+ "C. At the beginning of the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "FgDkeBfL": {
+ "video_id": "FgDkeBfL",
+ "video_duration": "172s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person rides a dirt bike, then the video cuts to a close-up of the bike, showing a chain around the front tire. The person holds up a gray and black exhaust pipe. The video then shows the full dirt bike. The exhaust pipe is removed. The video cuts to a cardboard box being opened to reveal a black exhaust pipe. The person takes the exhaust pipe out of the box and examines it. The video cuts back to the dirt bike, and the person kicks the kickstart. The video cuts to a close-up of the dirt bike with the new exhaust pipe installed. The person sits on the bike and kicks the kickstart. The video then shows a first-person view of riding the dirt bike on a residential street.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual occurrence is observed in the video regarding the exhaust pipe?",
+ "answer": "C",
+ "candidates": [
+ "A. The exhaust pipe emits a very faint sound, along with popping noises.",
+ "B. The exhaust pipe is missing.",
+ "C. The exhaust pipe emits a loud sound, along with banging noises.",
+ "D. The exhaust pipe is broken."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the man test the sound-dampening effect of the muffler?",
+ "answer": "A",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Throughout the video.",
+ "C. At the beginning of the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "eUpfEGpJ": {
+ "video_id": "eUpfEGpJ",
+ "video_duration": "273s",
+ "duration": "4-6min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a shot of five parked aircraft on an airfield under a clear sky. The video then shifts to show a military vehicle on a cracked, desert-like plain. A fire is visible in the distance. Then, a view of the same desert area shows two fires in the distance. The shot then changes to a view from a flying aircraft above a desert region near a coastline. The view changes again, and we see an aircraft flying above a desert. Next, we see a close-up view of a different plane with weapons and bombs attached to it. We then see multiple aircraft flying above the desert. Next, a close up shows the interior of a cockpit with a person in it. The shot changes again to show an aircraft flying above a desert. The perspective shifts to an overhead view of an area with multiple plumes of black smoke. Next a shot of an aircraft flying over a beach shows another explosion in the distance. The scene changes and an aircraft is shown firing a missile. Next, we see a close up of an aircraft firing a weapon with the ground below in view. The shot changes to an aerial view of several fires below. The perspective changes and we see an aircraft flying near a coast and then turns toward the desert. A shot of an aircraft drops a payload. Next, we see a truck on the desert. Next, there is a distant view of an aircraft. Then, we see a truck explode. The perspective switches to the side and we see a person running. Next, we see a close-up of the aircraft with the ground and more explosions in view below. Finally, there are several shots of the aircraft flying and dropping payloads with an explosion on the ground shown. The video then shows another perspective of explosions on the desert and then a person crouches down near the explosions. The video then shows the aircraft flying over a coast. We see several views of an aircraft with the coast and desert in the background. A close-up of the cockpit interior is then shown, with a person visible. 
The video ends with aerial views of the aircraft and ground with multiple explosions in the distance.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order of appearance for the following military equipment? (a) Tank. (b) A-10 Warthog. (c) F-16 Fighting Falcon.",
+ "answer": "A",
+ "candidates": [
+ "A. (b) (a) (c).",
+ "B. (a) (c) (b).",
+ "C. (a) (b) (c).",
+ "D. (b) (c) (a)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Does the volume of the human voice change in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. There is no change.",
+ "B. Yes, the human voice gradually decreases when the background noise is louder.",
+ "C. Yes, the background noise overrides the human voice.",
+ "D. Yes, the human voice gradually increases when the background noise is louder."
+ ]
+ }
+ },
+ "iAzZyIxc": {
+ "video_id": "iAzZyIxc",
+ "video_duration": "162s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man in a blue shirt walks along a dirt path that forks into two directions through a wooded area. He pauses and turns around. Another man walks in from the same direction in a blue shirt and pants. They stop and face each other as a woman wearing a red and black plaid shirt walks in from behind them and towards them. The camera focuses on the woman as she talks. Then it cuts to a view of her and the man in the blue shirt as they talk to each other. The camera cuts to the man, and then back to a close-up of the woman. Then back to a shot of both. The camera cuts back to the woman. Then the camera cuts to a wide shot of the woman and man, and the man goes down to one knee in front of the woman. The man on his knee talks and waves his arms. Two more men come in from the side in the distance. The man on the ground gets up and motions around the ground and the woman walks away. The other two men approach the man. The two men look around at the man. The man points at the other man and gestures. Then the camera focuses on the two other men as they talk. Then it shifts to the first man, who makes exaggerated hand gestures and appears to be shouting. The woman is no longer in view. The man in the black shirt continues to make hand gestures. The camera cuts to a wide shot of the man and the two other men. Then it zooms to the man in the black shirt. He continues to gesticulate. The camera cuts to a wide shot of the forest. The video goes black.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When was the confession of the man in the blue suit discovered?",
+ "answer": "D",
+ "candidates": [
+ "A. The discovery occurs at the end of the video.",
+ "B. The discovery occurs at the beginning of the video.",
+ "C. The discovery occurs throughout the video.",
+ "D. The discovery occurs in the middle of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the woman and the man in red?",
+ "answer": "A",
+ "candidates": [
+ "A. Strangers.",
+ "B. Friends.",
+ "C. Lovers.",
+ "D. Colleagues."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "In the video, how many times does the man in the blue suit make the \"wuu wuu wuu\" sound?",
+ "answer": "C",
+ "candidates": [
+ "A. Eight times.",
+ "B. Five times.",
+ "C. Seven times.",
+ "D. Six times."
+ ]
+ }
+ },
+ "vmcbeOpl": {
+ "video_id": "vmcbeOpl",
+ "video_duration": "273s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Tech Reviews",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video displays a screen with multiple open browser tabs and a small window with a person visible. The tabs show various websites, some displaying text and code examples, others showing online forums or social media posts. The person in the window sometimes talks and gestures. The content of the websites on display varies, with one page showing a network diagram. At one point the video shows an image of text that says \"log4shell\" with a logo. One tab shows text of program code. There is also one tab that displays an issue tracker. A Minecraft game is shown in one of the split windows. In one split screen, a person wearing glasses covers their face with their hands.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is men's attitude towards the log4 vulnerability?",
+ "answer": "A",
+ "candidates": [
+ "A. Neutral.",
+ "B. Like.",
+ "C. Fear.",
+ "D. Dislike."
+ ]
+ }
+ },
+ "YLMMLsAq": {
+ "video_id": "YLMMLsAq",
+ "video_duration": "60s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Celebrity Interviews",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a man in a blue suit and tie speaking to the camera. Behind him, the words \"THE BEAT WITH ARI MELBER\" are displayed against a skyline background. The video then transitions to a split screen. On the left side of the screen, a different man in a suit walks to a podium in front of several US flags, and other people standing in the background. On the right side of the screen, the first man continues to speak to the camera. The scene then transitions to another split screen with the man still speaking on the right side. On the left side, two men sit at a desk with a world map on the wall behind them. One of the men at the desk gestures as the other laughs. The scene then changes again, with the first man still on the right. On the left side a black and white art image is displayed. The scene changes once more and displays a church with people inside, and then a woman in a white robe walking down stairs and a man in a jacket entering a doorway. Finally, the video shows the book cover of a book titled \"BOB HONEY WHO JUST DO STUFF\" by Sean Penn. Finally, the video concludes with the first man in a blue suit and tie once again talking to the camera. Then the scene switches to the man in the denim shirt.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Where in the video is the book's cover shown?",
+ "answer": "C",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the beginning of the video.",
+ "C. At the end of the video.",
+ "D. Throughout the entire video."
+ ]
+ }
+ },
+ "Whtviqco": {
+ "video_id": "Whtviqco",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen for a few seconds. Then, a young man with blonde hair and a dark jacket enters the frame. He appears to be speaking and gesturing with his hands. Another student with black hair can be seen seated at a desk behind him. The camera shifts to a wider view of the classroom, showing a projector screen with a quiz-like interface, a whiteboard with writing on it, a teacher seated at the desk, and several other students. The camera then focuses on a laptop screen, showing multiple blank colored squares. The camera then quickly pans across the room to the students seated at desks. It shifts back to a different laptop screen with a loading screen, and then back to the first young man who is still talking and gesturing. It pans once again to show the other students and back to the first man. The camera focuses on the same laptop as before. The loading screen on the computer shows a progress bar and finally a screen that says \"Correct\" with a check mark, then it shows a question page for a quiz. The camera pans across the class again as students appear to be answering questions on their individual laptops. The camera then focuses on a man in a blue jacket and a student in front of him, before panning across more students in the classroom. The camera focuses on another student who is smiling and sticking his tongue out. The camera then pans back over the classroom. The projector screen shows a quiz question with a diagram of a sock. It shifts back to the laptop showing multiple colored squares, then to a different computer screen showing a loading screen, and a screen displaying the word \"Correct.\" The camera shifts between various students and their laptops, and then returns to the first man, who is now in very close frame, speaking with his mouth open and then laughing. The camera shifts once again to different students in the classroom. The video ends with a student in a black jacket speaking to the camera. 
He is seen near other seated students and a screen with the text \"Sup Guys.\" Text appears on the screen saying \"So we got our\" and \"Exam results\". Text on the screen also says \"*Mumbles like a fucking retard*\", and \"Not achieved Not achieved Not achieved\". The text on screen also says \"Nah Nah\". The man continues to speak and the text on screen reads \"I did better than Sean's too but...\" and \"By one point\". The camera pans to a view of the classroom, then back to a male in a blue jacket who is near other students, and text on screen reads \"But you got D\". The camera moves, and a male voice is heard saying \"Yeah but you only got 24.\" The camera pans again and the text on screen reads \"But you got \"D\" Ha Ha Ha\" and then \"You got the \"D\" in your butt\" and \"Proceeds to Roast Jason instead\". The video ends showing a man in a black jacket and a camera panning towards other students.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the blonde boy and the boy with glasses in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. They are brothers.",
+ "B. They are classmates.",
+ "C. They are father and son.",
+ "D. They are colleagues."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How did the emotional state of the blonde boy change in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. From anger to excitement.",
+ "B. From excitement to anger.",
+ "C. From excitement to sadness.",
+ "D. From sadness to excitement."
+ ]
+ }
+ },
+ "nZObLuty": {
+ "video_id": "nZObLuty",
+ "video_duration": "117s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a close-up of a painting of a monkey. The camera slowly zooms out, revealing more of the painting. The monkey is looking slightly down and to the left. The painting is highly realistic and detailed, capturing the individual hairs of the monkey's fur.The video then transitions to a close-up of a hand drawing white whiskers on a painting of a cat using a white pastel pencil. The hand is holding the pencil lightly and making delicate strokes. The cat in the painting has dark fur and a light-colored face.Next, the video shows a split screen with a young man on the left and a painting of a cat on the right. The man is wearing a navy blue shirt and is looking at the camera. The cat in the painting has gray and brown tabby stripes and green eyes. It looks straight ahead.Then, a close-up of two paintings of dogs is shown. The dogs are both light brown and white Cocker Spaniels with long fur. The dog on the left has a blue tag on its collar, and the dog on the right has a light brown collar. A montage of art supplies follows, showing graphite pencils, oil paint brushes, and soft pastels and pastel pencils. A hand holding a light blue soft pastel sponge tool is then displayed. Next, a hand is seen drawing a dog's mouth and tongue, using a light brown pastel pencil on a dark-colored area. The dog's nose is also visible.Following this, a hand colors the eye of a partially completed leopard drawing with a white pastel. The leopard has black spots on a tan background.Then, a hand uses a pastel pencil to add shading to the face of a young man in a portrait. 
The man has medium brown hair, blue eyes, and light skin.Next, masking tape is being carefully peeled from the edge of a painting of the two Cocker Spaniels.The video then shows a screenshot of a webpage listing art materials, followed by lists of specific pastel colors, including those from Unison Colour, Schmincke, Faber-Castell Pitt, and Caran d'Ache.Next, a painting of a German Shepherd dog is shown. The dog is black and brown with its tongue out. It has a collar with a silver tag. The background of the painting is a gradient of light blue and light pink.The video ends with the same young man shown earlier, standing next to a painting of a tabby cat on an easel. A hand is then drawing the cat's whiskers with a white pencil. A different hand is painting details on a barn owl painting with a thin brush. Finally, a website screenshot shows a range of animal portrait tutorials available.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What change is observed in the tabby cat's whiskers during the second half of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The whiskers become thinner.",
+ "B. The whiskers become thicker.",
+ "C. The whiskers become whiter.",
+ "D. The whiskers become longer."
+ ]
+ }
+ },
+ "qdbdmlfm": {
+ "video_id": "qdbdmlfm",
+ "video_duration": "35s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person in dark armor touches their cheek with their hand. Then, the same person is shown walking up a set of stairs as flaming particles appear to be falling around them. The person is standing on a platform with flaming particles flying around. A person in dark clothing is shown holding and shooting a bow. An arrow appears to pierce a figure's shoulder. A person is shown falling backward after being pierced by a spear. A person is firing a bow while three other figures are in the background, one of whom is kneeling. The same person in dark armor appears through a smoky lens. The same person in armor is shown from a low angle. A person is shown walking up stairs as flaming particles fall. Then, a person is shown with fire on each side of them. A gauntlet appears to be activating. The same person from earlier appears wearing a gauntlet and surrounded by particles. Two different colored explosions appear. The same figure from earlier is shown falling through debris. A person is shown walking on a structure with sparks flying. Then, the person is shown in a room with red and blue explosions with particles floating around them.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order do the following events occur: a man touches his injured cheek, a man wearing glowing gloves fights enemies, an archer kills enemies?",
+ "answer": "C",
+ "candidates": [
+ "A. An archer kills enemies, a man wearing glowing gloves fights enemies, a man touches his injured cheek.",
+ "B. A man wearing glowing gloves fights enemies, an archer kills enemies, a man touches his injured cheek.",
+ "C. A man touches his injured cheek, an archer kills enemies, a man wearing glowing gloves fights enemies.",
+ "D. A man touches his injured cheek, a man wearing glowing gloves fights enemies, an archer kills enemies."
+ ]
+ }
+ },
+ "HmwnfVdp": {
+ "video_id": "HmwnfVdp",
+ "video_duration": "548s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A woman with long dark hair pulled back in a ponytail is shown against a textured, mauve backdrop. She wears a dark gray t-shirt. She gestures with her hands. She holds up a rose-gold tube of L\u2019Oreal Paradise mascara and a pink and teal tube of Maybelline Baby Roll mascara. She looks down and then back at the camera. She rubs her closed eyes with her fingers and then rubs her hands together. She makes circular motions with her hands in front of her face.\u00a0She picks up an eyelash curler and uses it on her right eyelashes, looking down at a small, round mirror. She touches her eyelashes with her fingers and repeats the curling process on her right lashes. She shows a tube of Baby Roll mascara and the applicator brush in her open palm.\u00a0She begins applying the Baby Roll mascara to her right eyelashes.\u00a0 She shows the finished look on her right eye. She opens the L\u2019Oreal Paradise mascara and shows the applicator brush against her open palm. Next, she applies the L\u2019Oreal Paradise mascara to her left eyelashes, looking down at the small mirror. She holds the NYX Professional Makeup Matte Bronzer under her face. She puts down the bronzer.She holds up both mascara tubes for comparison. She looks down, smiles, and then looks up at the camera. She places the two mascara tubes at her chin while looking directly at the camera. She touches her lips and then makes hand gestures as if talking about each tube. She looks down and smiles, looking back up at the camera. She winks and holds her hand in front of the camera.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there an eyelash curler in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. Uncertain.",
+ "C. Yes."
+ ]
+ }
+ },
+ "xjoAfcbt": {
+ "video_id": "xjoAfcbt",
+ "video_duration": "169s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two small, boxy robots, made of Lego bricks and other components, move around a square arena. The arena has a gray floor and transparent walls, with Lego structures and colorful blocks placed inside. A timer on a laptop screen in the background shows a countdown from 150 seconds.\u00a0The robots move somewhat erratically, occasionally bumping into the walls or Lego structures.\u00a0 One robot, positioned on the left side of the arena, attempts to pick up or manipulate some of the yellow Lego blocks within the arena. The other robot, on the right, approaches a structure made of red, yellow, and blue Lego bricks located near the top center of the arena. It appears to try to either push the structure or dislodge blocks from it.Several people in referee uniforms watch the robots attentively. On either side of the arena, there are spectators and participants, some of whom are dressed in costumes, including crowns and capes. People move around in the foreground and background throughout the video. As the timer counts down, the robots continue to maneuver in the arena, trying to manipulate the blocks and structures. More and more people start to walk around in front of the camera near the end of the video, obstructing the view of the arena.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Does the content on the white screen in the background of the video change?",
+ "answer": "C",
+ "candidates": [
+ "A. The screen ultimately displays a large icon of the winning team.",
+ "B. The content on the screen does not change.",
+ "C. The time displayed on the screen changes.",
+ "D. The screen transitions to the next page."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the robot move the blue base?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. Four.",
+ "C. Three.",
+ "D. Five."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What activity is the man wearing orange clothes and blue pants doing in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Holding a remote control to operate a robot.",
+ "B. Holding an orange block and stacking it with others.",
+ "C. Managing security work for the competition in the background.",
+ "D. Standing nearby and watching the game."
+ ]
+ }
+ },
+ "rQtBWHmE": {
+ "video_id": "rQtBWHmE",
+ "video_duration": "52s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows multiple clips. The first clip shows a group of people wearing military camouflage, walking in an urban street.The second clip is a close up of a man with a short beard, wearing a camouflage military jacket.The third clip shows a man in a black hat and jacket, indoors, speaking into a camera.The fourth clip is a close-up of a man wearing a blue uniform with two badges and a dark shirt.The fifth clip shows a man with a beard wearing a camouflage jacket and cap, standing in front of cars.The sixth clip shows a group of people in what appears to be a military procession, all saluting. One person is wearing a black uniform and cap, while others wear a mix of camouflage and red or blue berets.The seventh clip shows a man with a beard, sunglasses, and a camouflage jacket and hat, standing in a field.The eighth clip shows a man with a beard wearing a black jacket speaking in front of a black background.The next series of clips are combined with text overlays, with a white rectangle containing \u201cJALSOMINO DJ SHY FEAT. MARSHAL\u201d and \u201cGR257\u201d visible in each. A logo with a yellow \u201cG\u201d with a yellow outline is also in all these frames. Many of these clips feature a border of text that reads \u201cBECAUSE\u201d. In some clips the border reads \u201cCAUSE\u201d. These clips feature: a soldier in a snowy area, the man with a beard in the black jacket, and two soldiers in a muddy field, and a close-up of a bearded man who appears to be crying. The video concludes with two more clips of soldiers running with weapons in a muddy, snowy field and the man with the beard in the black jacket.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Has the volume of the background music in the video changed?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes, the volume of the background music increases at the end of the video.",
+ "B. No, the volume of the background music does not change.",
+ "C. No, the volume of the background music remains consistently low.",
+ "D. Yes, the volume of the background music decreases at the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order are the three scenes introduced in the video? (1) An interview clip (2) Photos of different uniforms (3) A dynamic music video",
+ "answer": "B",
+ "candidates": [
+ "A. (1)(2)(3)",
+ "B. (2)(1)(3)",
+ "C. (1)(3)(2)",
+ "D. (2)(3)(1)"
+ ]
+ }
+ },
+ "cEvmJINw": {
+ "video_id": "cEvmJINw",
+ "video_duration": "116s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person's hands use a narrow tool to loosen the soil around a small plant with two green leaves, in a small black container. They lift the plant, which has soil and roots around it, out of the container. Next, the person creates a hole in the soil of a small gray container, using their finger. The person then plants the small green plant in the hole, and covers the plant's base with soil using their fingers. The person then takes a small plant from the black container with multiple sprouts. The person uses their finger to inspect and move the plants in the soil. They then use the narrow tool to push the soil around the plant in the small black container. The person holds the plant with two hands and rotates it to show all sides of the pot. The person then loosens the soil from around a plant, and holds up the plant with its roots visible. The person rotates the plant to show all sides. Next, the person puts small brown pellets into their palm and then drops them into a pot. The person then covers the pellets and small plant with more soil. The person gently presses the soil around the plant with their fingers. Next, the person holds a large green leaf with purple blotches on it. The person points at the leaf and then holds up a potted plant. The person rotates the plant to show the full size of the plant and leaves. The person again displays a close-up view of the green and purple leaf.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many different shapes of flower pots are shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. One.",
+ "D. Two."
+ ]
+ }
+ },
+ "ojiUlHwg": {
+ "video_id": "ojiUlHwg",
+ "video_duration": "65s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Soccer",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows several people wearing athletic clothing, helmets, and shoulder pads on a green field. In the first few frames a man wearing a blue shirt, blue shorts, and a cap throws a football. There are at least four other men standing in the background. The scene then cuts to a different angle, where several more people are visible, all wearing athletic clothing and a few helmets. One man walks to the left and there are more people standing in the back. The video cuts again to a person wearing red and number 17, as he walks past an orange cooler and a golf cart with another person also walking beside the cart. Next, several football players wearing white jerseys and helmets line up, with one player in red and number 17 in front. One person is heard saying, \"Let's go race,\" and another, \"Hope you brought your track shoes today boy.\" Then more players line up with number 17 in red. The video shows the men, including number 17, being instructed by another man and more people holding orange coolers. The players then practice football plays including passing the ball and tackling. The video ends with the number 17 player and another player walking off the field. A logo with a blue circle containing a \"T\" inside is overlaid onto the scene.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual action is observed in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Player 17 on the Red team touches their own helmet.",
+ "B. Player 17 on the Red team hits player 5's helmet.",
+ "C. Player 17 on the Red team touches a teammate's helmet.",
+ "D. Player 17 and player 5 on the Red team are fighting."
+ ]
+ }
+ },
+ "kLGzAaya": {
+ "video_id": "kLGzAaya",
+ "video_duration": "77s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "An opened hard disk drive is shown, revealing the internal platter and actuator arm. A hand briefly touches the platter. Then, using a screwdriver, the hand removes four screws on one side of the hard drive. Text appears, labeling a part as \"Voice Coil Actuator,\" and describing its function. The hand then removes four more screws from the opposite side of the drive. The video transitions to a split screen. On one side, the hard drive is visible; on the other side, a diagram depicting the Lorentz Force with the formula F = q[E + (v x B)] appears, with N and S magnets shown and a yellow arrow indicating downward force. The video then focuses on the copper coil within the actuator arm of the hard drive and shows how it moves in relation to a magnetic field. Two upward yellow arrows appear in the diagram, indicating force in that direction. The hand removes a piece from the hard drive. The view returns to the full hard drive, where the actuator arm moves across the platter. A diagram illustrating the Lorentz Force appears again, alongside the hard drive. Now the screen splits into three parts. The hard drive remains on the left. In the middle is the diagram of a ferromagnetic core and a coil. On the right is an illustration of six rectangles with arrows pointing to the right. The coil section is highlighted in orange and then white, as if receiving an electrical signal, and the rectangles with arrows appear to magnetize in response. Then, a diagram appears under the rectangles, representing an electrical signal. As the coil in the diagram passes over each section of the rectangles, the signal spikes. The numbers \"010111\" are shown, synchronized with the position of the coil over the magnetized areas.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the order of voltage arrangement when the head moves through a region where polarity changes in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 0, 1, 0, 1, 1, 0.",
+ "B. 0, 1, 0, 1, 1, 1.",
+ "C. 0, 1, 0, 1, 0, 1.",
+ "D. 0, 1, 0, 0, 1, 1."
+ ]
+ }
+ },
+ "oOxmuyhE": {
+ "video_id": "oOxmuyhE",
+ "video_duration": "98s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video features three different individuals, each with different hair colors and styles, and different make-up. The first individual has bright blue hair and a black shirt. They have make-up on their face that appears as dark lines on their cheeks and chin. The second individual has blonde hair and is wearing a pink long-sleeved shirt and a silver necklace with a cross pendant. They appear to have dark make-up around their eyes. The third individual has long, messy reddish-brown hair, and is wearing an orange hooded sweatshirt. They also appear to have dark make-up around their eyes and a single dark line coming from the corner of one eye. Throughout the video, the individuals are seen speaking, and often gesturing with their hands. The background of each video clip changes, but usually consists of either a white wall, or a room with shelves, posters, and other decorations. There are several instances of color filters appearing over the video clips. In one instance, a camera is held up to the screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times did the man in the red hoodie speak in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Six times.",
+ "B. Eight times.",
+ "C. Seven times.",
+ "D. Nine times."
+ ]
+ }
+ },
+ "ApxeTuWo": {
+ "video_id": "ApxeTuWo",
+ "video_duration": "130s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person is visible in the video wearing a grey hooded jacket. Initially, their hair is loose and unstyled. They lift and move their hands through their hair. Later, they put their hair up, then spray it with product, then use a curling iron and spray it with product again. Finally, they style their hair with their hands. In the final frames, two social media icons appear followed by a username.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the purpose of the spray used in the video before styling the hair with curls?",
+ "answer": "B",
+ "candidates": [
+ "A. Change the hair color.",
+ "B. Make the hair look shiny.",
+ "C. Make the hair appear voluminous.",
+ "D. Straighten the hair."
+ ]
+ }
+ },
+ "juDBxemv": {
+ "video_id": "juDBxemv",
+ "video_duration": "85s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person wearing a blue jacket and khaki pants dances briefly. The video then cuts to a chessboard analysis screen. The white knight moves from f3 to g5. An orange arrow appears, showing the knight could capture the black pawn at f7. A second orange arrow then appears from g5 to h8, indicating the knight also threatens the black rook. An animated platypus then appears from the bottom right and bumps the black rook with its tail. An orange arrow then appears from h8 to g6, indicating the white knight may capture the black bishop. Then the black pawn at f7 moves to f6. An orange arrow then appears from d8 to f6, showing the black queen captures the f6 pawn. Then the white rook at c1 moves to f1, putting the black king in check. The interface then indicates this is an excellent move, and the evaluation changes from +2.04 to +1.38. The software also switches from \"Next Key Move\" to \"Next Move\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the result of the game played by the white pieces in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Defeat.",
+ "B. I am not sure.",
+ "C. Victory.",
+ "D. A draw."
+ ]
+ }
+ },
+ "MYvfaxKB": {
+ "video_id": "MYvfaxKB",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Chemistry",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A clear glass Erlenmeyer flask sits on a black surface. It contains a clear liquid. A hand enters the frame, and rotates the flask in place. The image then zooms in on the flask, where a white solid is visible within the liquid. The liquid then becomes cloudy. The cloudiness increases, and bubbles are visible within the liquid. The image then zooms out to show the flask as well as a lit candle on a small glass dish next to it. A hand enters the frame, inserting a clear glass tube into the flask opening. The tube is removed, and a small puff of smoke appears briefly at the top of the flask. The image zooms back in on the flask, where some of the liquid is now pink in color. The hand then picks up the flask and turns it upside down. The image then zooms in on the flask with the pink liquid in it. Bubbles are visible within the pink liquid.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of experimental steps depicted in the video? (a) Add calcium metal to water, (b) add phenolphthalein indicator to the solution to detect calcium hydroxide, (c) collect hydrogen gas and test its flammability.",
+ "answer": "C",
+ "candidates": [
+ "A. (b), (a), (c).",
+ "B. (a), (b), (c).",
+ "C. (a), (c), (b).",
+ "D. (c), (a), (b)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What change did the candle undergo in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It remained extinguished throughout.",
+ "B. It changed from being extinguished to being lit.",
+ "C. It changed from being lit to being extinguished.",
+ "D. It remained lit throughout."
+ ]
+ }
+ },
+ "vTfSSGMC": {
+ "video_id": "vTfSSGMC",
+ "video_duration": "99s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows drone footage of a park with multiple sports fields and a baseball stadium. There are people walking around, and several bounce houses, and food trucks are visible. There are also images of people taking pictures with props at a photo booth. A ribbon-cutting ceremony is shown at a gate leading into the stadium. A fire truck is also shown. Several individuals speak to the camera. The video includes the text, \"November 16, 2017\" and \"Ting presents Welcome to Ting Park\". The text \"Ting Stadium\" appears near the entrance and on a large video screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many types of sports facilities are showcased in the video taken at the park?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. Four.",
+ "C. Three.",
+ "D. Two."
+ ]
+ }
+ },
+ "ZbktCcGp": {
+ "video_id": "ZbktCcGp",
+ "video_duration": "479s",
+ "duration": "6-8min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen displaying a watermark logo in the bottom right corner. This transitions to a man with dark hair and a mustache, wearing a pink button-down shirt, sitting on a black chair against a light yellow wall. Text appears on the left of the screen: \"Chandrayaan - 2 Misssion Objective.\" The man speaks and gestures with his hands. His hand gestures consist of pointing, forming small circles with his index finger and thumb, giving thumbs up, and opening his palm. The text on the screen disappears. The man is then seen holding the wires of a small lapel microphone in his hands. The background then changes to a black and white image of two astronauts on the moon's surface, their bodies and surroundings silhouetted. The time code \"165;07;58;19\" appears in the bottom right of the screen. This is followed by an image of gloved hands manipulating an object inside a laboratory setting, a large gray rock is visible on the table. Two individuals wearing white lab coats and white head coverings are seen behind the equipment. A close-up image of a gray porous rock appears. The video returns to the man in the pink shirt with the same background, and he continues speaking and gesturing with his hands. His hands then continue to open, close and make small circular motions. The background then changes to a landscape of the moon, showcasing craters and mountain like formations. A variety of images of the surface of the moon appear, some in closeup, some showing different angles. The video returns to the man in the pink shirt, and he continues to speak and use his hands to demonstrate various motions, including showing the motion of a projectile, and the movement of a rocket. This is followed by the text, \"Why finding water on Moon is important?\" on a black background. This transitions to the man, now wearing a blue shirt, standing in front of a set of stairs, and surrounded by green bushes. He continues to speak and gesture. 
He points, gives thumbs up and demonstrates the motion of a rocket launch with his hands. This is followed by a digital image of the interior of what looks like a space station. It has rows of plants with fruit visible on the vine. The image changes to a view of the interior space from the doorway, and a green sprout is visible on a tag that reads \u201cFood Product/Production\u201d. There is then a digital rendering of the earth in space, and a small object is seen orbiting the planet. The text \"MODULE ENTERS EARTH TRANSFER ORBIT\" is displayed on the bottom of the screen. The image moves to a view of the moon, with the object moving in different orbits around the moon. This is accompanied by text on the screen reading, \u201cMODULE GETS CAPTURED IN THE MOON\u2019S ORBIT\u201d. The view changes to the module flying over the moon, and this is accompanied by the text, \u201cMODULE INITIATES LUNAR ORBIT\u201d. The image changes to a view of the module separating, with the text \u201cVIKRAM SEPARATES FROM THE ORBITER\u201d. The image then returns to a view of the moon with two objects orbiting it. This is followed by a rendering of the lunar module landing on the moon. The text \u201cVIKRAM IN A DEBOOSTING PHASE\u201d appears, and is followed by the text \u201cVIKRAM SOFT LANDS ON THE MOON\u201d, with the lunar module on the surface of the moon. This is followed by the text \u201cPRAGYAN COMMENCES EXPLORATION OF THE LUNAR SURFACE,\u201d and a rover is seen driving on the moon. It stops and the text \u201cMISSION CONTROL INSTRUCTS PRAGYAN TO STOP AND DETERMINE ELEMENTAL COMPOSITION OF LUNAR ROCKS AND SOIL USING THE APXS PAYLOAD\u201d appears. The rover's arm lowers, with the text \u201cMISSION CONTROL ENGAGES THE LIBS PAYLOAD TO DETERMINE ELEMENTAL COMPOSITION OF THE LUNAR SURFACE\u201d. The video transitions to the man in the pink shirt again, and he continues to speak and gesture. He is seen pointing, opening and closing his hand, and moving his palms in small motions. 
A digital image of the moon again appears, with three objects orbiting around it. The video then returns to the man, and he continues speaking and gesturing, he makes a circle with his fingers, moves his hand in a swiping motion, moves his hands in a wave and then gives the ok sign with his thumb and index finger. The last images are a simulation of the lunar module on the surface of the moon, displaying the days that the module can be active. Then a side view of the moon and its orbit is followed by a shot of a lunar module flying over the moon. The video returns to the man, with the same background as before, and he continues to make various hand motions while speaking, which includes waving, clapping, and forming small circles with his fingers. The video returns to the image of the earth orbiting the moon, and then ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the Earth appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three times.",
+ "B. Five times.",
+ "C. Four times.",
+ "D. Six times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the temperature in degrees Celsius on the morning side of the moon as shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. -230\u00b0C or 100\u00b0C.",
+ "B. -230\u00b0C or 125\u00b0C.",
+ "C. -200\u00b0C or 125\u00b0C.",
+ "D. -230\u00b0C or 150\u00b0C."
+ ]
+ }
+ },
+ "jkgthfBj": {
+ "video_id": "jkgthfBj",
+ "video_duration": "459s",
+ "duration": "6-8min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a Clash of Clans village layout. The player taps on various structures, bringing up an information box at the bottom of the screen. The structures shown are: Mortar (Level 6), Hidden Tesla (Level 2), Dark Elixir Storage, Archer Tower (Level 8), Gold Storage (Level 11), a second Archer Tower (Level 8), Cannon (Level 9), Barracks (Level 8), Dark Elixir Drill (Level 2), and finally, again, the Hidden Tesla. Throughout the video, purple walls enclose various sections of the village. Several resource collectors and decorative items are also visible.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the effect of a military zone on a therapist?",
+ "answer": "C",
+ "candidates": [
+ "A. A military zone enhances a therapist's damage output.",
+ "B. A military zone accelerates a therapist's level progression.",
+ "C. A military zone protects therapists.",
+ "D. A military zone increases a therapist's speed."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What method did the man use to resist the large troops?",
+ "answer": "B",
+ "candidates": [
+ "A. Arrange the cannons in a triangular formation.",
+ "B. Arrange the Wizard Tower in a triangular formation.",
+ "C. Arrange the cannons in a square formation.",
+ "D. Arrange the Wizard Tower in a square formation."
+ ]
+ }
+ },
+ "GzwXqFTK": {
+ "video_id": "GzwXqFTK",
+ "video_duration": "69s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A table is set with plates, red cups, and a banner reading \"HAPPY BIRTHDAY\". There are three plush toys on the table: one is a brown bear, one is a black and white figure wearing a chef's hat and apron, and one is white. The black and white figure is facing the camera. The camera angle changes and focuses on the plush toys. The black and white figure\u2019s mouth opens and closes. The view changes to show a white plush lamb, a black toy figure, and a figure in red overalls. The view is focused on the lamb plush. The figure in red overalls reaches up toward the top of the frame. A pi\u00f1ata is shown on the left and in the foreground, with several plush toys in the background. A wooden stick is being held by a white plush toy, who strikes the pi\u00f1ata with the stick. The camera blurs. The camera view shifts to reveal a red plush toy with a black eye patch, who hits the pi\u00f1ata. The camera view shifts to a black toy figure, who is holding a toy gun. The pi\u00f1ata is struck by a flash of light. A drink is poured. The camera view shifts to show a monkey plush toy with an open mouth. The view changes again to show the monkey plush with an open mouth.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the fox's emotional state in the video after hitting the pot?",
+ "answer": "B",
+ "candidates": [
+ "A. Sad.",
+ "B. Frustrated.",
+ "C. Defeated.",
+ "D. Negative."
+ ]
+ }
+ },
+ "aYzqKOHH": {
+ "video_id": "aYzqKOHH",
+ "video_duration": "424s",
+ "duration": "6-8min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man wearing a black jacket, grey pants, black boots, and a camouflage hat with a GoPro attached is fishing on the grassy bank of a waterway underneath a large overpass. He reels in a small fish, holds it up to the camera briefly, then removes it from the hook and releases it back into the water. He continues fishing, reeling and casting several times. He walks along the bank, occasionally adjusting his line. The camera perspective shifts to show the line going into the water. He reels in another small fish, removes it from the hook, and displays it to the camera before releasing it. He continues walking along the bank, casting his line. He sets up a camera on a tripod by the water's edge. He casts his line and places the rod on the ground before retrieving a fishing net.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man do with the fish he caught?",
+ "answer": "D",
+ "candidates": [
+ "A. He uses the fish as bait.",
+ "B. He roasts the fish and eats it.",
+ "C. He puts the fish in the bucket.",
+ "D. He throws the fish back into the lake."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many different types of fishing tools did the man use in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Three types.",
+ "B. Two types.",
+ "C. Four types.",
+ "D. One type."
+ ]
+ }
+ },
+ "USOlibxF": {
+ "video_id": "USOlibxF",
+ "video_duration": "49s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Celebrity Interviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. The scene then changes to a large room with many people seated at round tables. There is a stage in the background that has large blue signs reading \"Critics Choice Awards\". A person wearing a green dress is on stage. There are small white stars around the stage background. The person on stage is holding a microphone. The video cuts to a closer shot of the person wearing the green dress, then a shot of a person in the audience wearing a brown dress. This repeats several times. The video ends on the person wearing the green dress.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What happens after the video ends?",
+ "answer": "C",
+ "candidates": [
+ "A. A woman in a green dress presents an award to America Ferrera.",
+ "B. A woman in a green dress presents an award to Margot Robbie.",
+ "C. Margot Robbie presents an award to America Ferrera.",
+ "D. America Ferrera presents an award to Margot Robbie."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did America Ferrera do to the woman wearing a green dress in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Gave two cheek kisses to the woman wearing a green dress.",
+ "B. Blew a kiss to the woman in a green dress.",
+ "C. Blew two kisses to the woman in the green dress.",
+ "D. Gave a cheek kiss to the woman in the green dress."
+ ]
+ }
+ },
+ "tZXQNqko": {
+ "video_id": "tZXQNqko",
+ "video_duration": "202s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A pair of hands hold a phone with a dark screen and an orange pattern in the center. The hands use their fingers to navigate the phone's interface. On a nearby wooden surface, there are two black boxes, one of which has \"Samsung Galaxy Note 10+\" printed on it, and a black card with a red and gray design that says \"Star Wars.\" The person opens an app, then opens a messaging app, then the phone's settings app. Inside the settings app, the person navigates to display, then sound and vibration settings. The phone displays a screen with various notification sound options. The person navigates back to the phone's main screen, where they remove and insert an orange stylus from the bottom of the phone. A menu with several circular icons appears on the screen. Later in the video, a black phone with a silver graphic is shown. The phone is placed next to an orange stylus on a wooden surface. Another phone, in a blue casing, is shown on a wooden surface. Later, a man is shown in front of a wall with posters and headphones. The man gestures with his hands and speaks. The man's shirt is red and has the words \"Star Wars\" printed on it. The video then returns to the phones and the wooden surface, and the objects are moved into different positions. A set of earbuds is revealed in an open case. Then, the video switches to an overhead shot of the phone and accessories, as well as a box. The video shows a phone being removed from a box. Later, the video switches to another shot of a phone with a case, and then another close up shot of a phone with a silver design.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did any actors from Star Wars appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes, images of the actors appeared.",
+ "B. No, they neither appeared nor were mentioned.",
+ "C. No, they were just mentioned.",
+ "D. Yes, the actors themselves appeared."
+ ]
+ }
+ },
+ "fuUbyjUL": {
+ "video_id": "fuUbyjUL",
+ "video_duration": "183s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a person in a white chef's jacket sitting at a table in a restaurant and talking. Then it cuts to a shot of the same person walking through the restaurant, and then to a shot of the same person preparing food. There are other people visible in the restaurant, some wearing face masks and most are wearing black aprons. Then the video cuts back to the person sitting at the table and talking. Then it cuts to a shot of the outside of the building with a large yellow sign that says \u201cRestaurant Peng Heong Hakka Paikut\u201d. Then it cuts to the person at the table again. Finally, there is a shot of four people standing in front of the restaurant and giving a thumbs up, before ending with a shot of food being prepared. The sign above them reads the same text as the exterior sign, but in Chinese.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "In the group photo at the end of the video, what is the relationship between the woman and the third man from the left?",
+ "answer": "C",
+ "candidates": [
+ "A. Brother and sister.",
+ "B. Husband and wife.",
+ "C. Mother and son.",
+ "D. Father and daughter."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man do to the pork?",
+ "answer": "D",
+ "candidates": [
+ "A. Cut the pork into chunks.",
+ "B. Threw away the pork.",
+ "C. Ate the pork.",
+ "D. Slice the pork into thin pieces."
+ ]
+ }
+ },
+ "zdcvGZEb": {
+ "video_id": "zdcvGZEb",
+ "video_duration": "315s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing black clothing with yellow accents stands next to a bicycle on a paved lot. They gesture with their hands. The person walks the bicycle across the lot. A black motorcycle with two antennas enters the frame and stops. A point of view shot from the front of the motorcycle shows a paved lot with a bicyclist standing in the distance and also shows the control panel of the motorcycle. The person riding the motorcycle gestures with a gloved hand. They start the motorcycle and proceed to drive in circles in the parking lot. The point of view shifts to third person, and continues to show the motorcyclist driving in circles. They make several gestures with both hands. Finally, the video transitions to footage of two motorcycles riding on a highway, with a graphic showing the words \"Boots & Jeans Riders\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the man adjust the camera?",
+ "answer": "C",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. At the end of the video.",
+ "C. In the middle of the video.",
+ "D. Throughout the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the man in the video doing while pushing the bicycle?",
+ "answer": "B",
+ "candidates": [
+ "A. Teaching how to ride a bicycle.",
+ "B. Demonstrating how to relax the arms for making a small turn.",
+ "C. Teaching how to ride a motorcycle.",
+ "D. Demonstrating how to relax the arms for making a large turn."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Which sound is absent in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The sound of a woman speaking.",
+ "B. The roar of an airplane.",
+ "C. The sound of a man speaking.",
+ "D. The roar of a motorcycle."
+ ]
+ }
+ },
+ "PrmYaoGM": {
+ "video_id": "PrmYaoGM",
+ "video_duration": "121s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A young girl holds a white sign with \"GO Billy\" written on it, with four rectangular flags and polka dots on the sign as well. People are sitting around her in an indoor stadium. A man with a cowboy hat is walking in a dirt arena. He is wearing a dark jacket with light markings on it. A man in a cowboy hat is riding on a bull in the dirt arena, as two other men in white shirts stand nearby. The bull bucks and jumps, and the man rides it. The same man is seen getting off the bull and walking away. Another man in a cowboy hat and dark jacket is seen walking in the arena. The man in the dark jacket is shown standing in the arena. He raises one finger in the air while looking into the stadium. The man is shown riding a bull, which is bucking and jumping. Another man is riding on a bull, which is bucking and jumping. Another man is shown riding a bull, with multiple shots of him riding. A man is seen leaning against a yellow gate, looking at his hands. The man is holding onto a rope. A man is shown riding a bull, with several shots. Four men are seen sitting at a table with a red tablecloth. They are all wearing cowboy hats and blue shirts with white markings. A man in a cowboy hat is shown looking into the crowd. He is seen with his fist near his chin. The man in the cowboy hat is smiling and looking into the crowd. The man is shown smiling again. A man is seen riding on a bull. A man on a bull is bucking and jumping. Another man is riding on a bull, which is bucking and jumping. Three men in cowboy hats and shirts are seen next to a camper. A man is shown riding on a bull, as another man stands nearby. The bull is shown bucking and jumping. The man falls off of the bull. The man is shown on the ground. A man is shown sitting on the ground near a yellow barrier. A man is seen in the stands, clapping. The man in the stands is smiling. A man is seen on the ground. People are in the stands waving small American flags. A man is seen riding a bucking bull. A man in a cowboy hat is riding on a bull. The man is on a bull with the name \"Billy Etbauer\" appearing on screen. The man continues to ride the bull. The man continues to ride the bull. The man continues to ride the bull. The man continues to ride the bull. The man continues to ride the bull. The man continues to ride the bull. The man continues to ride the bull. The man continues to ride the bull.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times is the horse taming scene shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Eleven.",
+ "B. Twelve.",
+ "C. Ten.",
+ "D. Thirteen."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a stationary horse in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Uncertain.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "SMybCtHR": {
+ "video_id": "SMybCtHR",
+ "video_duration": "144s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with long brown hair, wearing a white dress, is sitting on a couch. There are gray pillows behind them, and the couch is woven. The person is holding a piece of paper. The video then cuts to a close up of a white rose boutonniere on a dark blue suit. The scene changes to show someone holding a white flower bouquet. The person with long brown hair is shown again, holding a piece of paper. Then, a person with blonde hair and a dark blue suit is shown standing outside by some pillars and trees. The long haired person in a white dress is shown from the back, with a white veil, walking in between white pillars. The person is shown again sitting on a couch, reading a piece of paper. Then, the camera moves to a shot of two people embracing in front of a wooden wall, one in a white dress with a veil, and the other in a dark blue suit. The long haired person is shown from the back walking along the white pillars towards the blonde haired person in a dark blue suit. The camera then moves to a shot of both of them in front of the white pillars. The camera cuts to a view of a person in a dark suit with their back towards the camera. The person with blonde hair is then shown. The video moves to a shot of a tree branch with the sun peeking through the leaves. Then, 11 people are shown walking and smiling, including the people from the previous shots. The camera then moves to a view of the blonde haired person, with his back towards the camera, and a view of the top of a building that reads \"ST.PAUL'S METHODIST EPISCOPAL CHURCH\". A shot inside the church shows wooden pews decorated with white flower arrangements. Then the camera moves to show the blonde haired person in a dark blue suit. The long haired person in a white dress is shown, being walked down the aisle with another person with a dark suit on. The blonde haired person is shown again. The camera then moves to the long haired person with a white dress and a veil, standing in front of a person in a black suit with a red tie. The two people then kiss. The video cuts to the long haired person in the white dress, reading a piece of paper. It then shows a shot of the long haired person being walked down a set of stairs with the blonde haired person with the dark blue suit. The two people in white and dark blue then walk out the church with a group of people around them. The two people kiss, with the long haired person in the white dress and a veil and the blonde haired person in the dark blue suit. The video then moves to show the blonde haired person in the dark blue suit reading a piece of paper. It shows another shot of the two people walking into an event center, then a shot of tables set up in the center. The two people are shown slow dancing. The video then cuts to close ups of roses and a table setting. The long haired person is shown again, reading the piece of paper. The two people are shown holding hands together, showing their rings. The long haired person is shown again, reading a piece of paper. The two people are shown kissing in front of a building. The long haired person is shown again reading a paper. The two people are shown kissing in front of a wooden wall, with their hands on the side of their faces.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At which point in the video does the wedding ceremony take place?",
+ "answer": "B",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the beginning of the video.",
+ "C. Throughout the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "vZBufTMQ": {
+ "video_id": "vZBufTMQ",
+ "video_duration": "42s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two men are in a gym with exercise equipment. The man closer to the camera wears a blue t-shirt and black shorts with white shoes, initially doing a standing exercise with dumbbells and then transitioning to a seated exercise machine. He holds on to two handles on the machine while looking at the camera. He speaks, uses his hands while speaking, and then appears to grimace and continues using the seated exercise machine. The second man wears a white shirt and black shorts with black and white shoes, initially using a cable machine and then appears in the background while the other man is seated. After a while, the second man also appears seated on the same exercise machine while wearing a black t-shirt and shorts, with athletic socks and brightly colored shoes, in order to use the machine. The words \"3 x 50 CALF MACHINE OF CHOICE\" appear on screen over the scene. A mirrored wall reflects the men. There is an array of gym equipment around them.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the person wearing black clothing use the equipment in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. In the middle of the video.",
+ "C. At the end of the video.",
+ "D. Throughout the entire video."
+ ]
+ }
+ },
+ "czKSerfK": {
+ "video_id": "czKSerfK",
+ "video_duration": "266s",
+ "duration": "4-6min",
+ "domain": "Games",
+ "sub_category": "Sports Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a person in the top right corner of the frame with headphones, a red shirt and a grey chair. The video also shows multiple instances of a video game screen depicting various menu options, images of digital cards and pop ups. The game is identified as FIFA 22 Ultimate Team. The video also shows a screen sharing application interface at multiple instances.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "How does the male voice change in the video when the result of the in-game lottery is revealed?",
+ "answer": "C",
+ "candidates": [
+ "A. It becomes quieter when high-performing players are present and louder when low-performing players are present.",
+ "B. It becomes quieter when the desired star player appears and louder when the undesired star player appears.",
+ "C. It becomes louder when the desired star player appears and quieter when the undesired star player appears.",
+ "D. It becomes louder when high-performing players are present and quieter when low-performing players are present."
+ ]
+ }
+ },
+ "JOqCgckc": {
+ "video_id": "JOqCgckc",
+ "video_duration": "53s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. Then a blue, blurry rectangle appears briefly against a dark blue background. The word \"COPS\" appears on the screen with a textured, faded appearance, over a background of blurry, horizontal lines. The word \"COPS\" is superimposed on an outdoor scene of three people. One person, wearing a grey hooded sweatshirt and blue jeans, is being handcuffed by an individual wearing a black police uniform. A third individual wearing a dark shirt and light pants is standing nearby. The person in the hooded sweatshirt is approached by the police officer. A struggle ensues and the officer grabs the person by the arms. The person in the hooded sweatshirt is then forcibly guided by the police officer. He attempts to flee and the police officer chases him. A second police officer joins the chase. Both officers pursue the individual through a street area and a yard. The first police officer then catches the individual who is then brought to the ground. He is handcuffed with the assistance of the second police officer.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the numbers on the columns of the blue house featured at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 234.",
+ "B. 324.",
+ "C. 423.",
+ "D. 432."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "At the beginning of the video, what unusual action does the man in deep gray short sleeves take as he passes by the arrest scene?",
+ "answer": "A",
+ "candidates": [
+ "A. Raises both hands.",
+ "B. Assaults a police officer.",
+ "C. Helps a black man escape.",
+ "D. Crouches and covers his head."
+ ]
+ }
+ },
+ "zAyUgInO": {
+ "video_id": "zAyUgInO",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A mountain biker speeds down a dirt trail. A person wearing a headset sits in front of multiple computer monitors showing data and video feeds of the biker. The biker navigates a steep, grassy decline, followed by a winding, lower section of the trail. The biker leans into turns, kicking up dust. An aerial shot shows the winding path from above. The biker continues down a dusty, exposed section of trail. Another shot shows the biker from behind as they traverse a rocky patch. More computer screens display data like heart rate, strain, and torque. The biker continues along a grassy, high-altitude trail with mountains in the background. A close-up shows the biker's front tire bouncing over rocks. More data screens appear. The biker speeds through a wooded area. The video feed of the biker is augmented with a heads-up display showing various metrics. The person at the computer speaks into a microphone. A final aerial shot of the biker is shown with a heads-up display labeled \"Rider.\"",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional shift does the instructor demonstrate in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From distressed to sad.",
+ "B. From sad to distressed.",
+ "C. From nervous to happy.",
+ "D. From happy to nervous."
+ ]
+ }
+ },
+ "TntjPbep": {
+ "video_id": "TntjPbep",
+ "video_duration": "184s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person's hands hold and manipulate a metallic object, initially using a red pen-like tool. Then, the hands take the tool away. The person's hands then grasp a black hose and wipe it with a tan cloth. Next, the red tool is picked back up and a pointed end of the tool is pressed down to make marks on a metallic surface. Small metallic debris is visible around the marks on the metallic surface, and the hands then appear to wipe the surface. The hands are seen holding two tools, then the hands select one and appear to use it on the metallic surface, which results in more small metallic debris, which the hands then wipe off the metallic surface. The person's hands are then visible using the red tool on a different part of the initial metallic object. The hands are then seen using a black drill bit held in a yellow drill, and it is used on a metallic part of the original object. The drill is then removed, and this process is repeated several times. After several drilling actions the interior of one of the parts of the object is seen in close-up and contains a golden metallic fluid. The hands appear to touch this area, and then the video ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "How does the silver disc's shape change in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It changes from being split open to having a conical point, then to having a hole.",
+ "B. It changes from having a conical point to having a hole, then to splitting open.",
+ "C. It changes from having a conical point to splitting open, then to having a hole.",
+ "D. It changes from having a small conical point to having a large conical point, then to having a hole."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the two people speaking in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Colleagues.",
+ "B. Friends.",
+ "C. Strangers.",
+ "D. Teacher and student."
+ ]
+ }
+ },
+ "ajjyxmLe": {
+ "video_id": "ajjyxmLe",
+ "video_duration": "83s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a person with blue hair and a yellow t-shirt. The person makes various hand gestures, at times putting their hands together or using their fingers to gesture while they talk. They move within the frame and the background alternates between a living room and a kitchen setting. The person holds a cell phone with a pop socket on the back for a portion of the video. At one point a cat walks into the scene in the background.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What changes occurred in the woman's emotions as shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. From excitement to determination.",
+ "B. From confusion to sadness.",
+ "C. From excitement to confusion.",
+ "D. From distress to determination."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "In the video, how many times does the sound of clapping occur?",
+ "answer": "A",
+ "candidates": [
+ "A. Nine.",
+ "B. Ten.",
+ "C. Seven.",
+ "D. Eight."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the woman doing with the phone in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Showing her phone interface.",
+ "B. Flipping through her weight loss plan.",
+ "C. Searching for how to quickly lose weight.",
+ "D. Introducing her phone to everyone."
+ ]
+ }
+ },
+ "xNweadvS": {
+ "video_id": "xNweadvS",
+ "video_duration": "398s",
+ "duration": "6-8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a framed drawing hung on a wall, then transitions to a split-screen view. On the left side of the split screen, the framed drawing is displayed. On the right side of the split screen, a hand with red painted nails is shown holding a black pen. The hand then begins to sketch a drawing in the notebook, using black, yellow, and purple pens. The sketch begins with a few angled lines, then the hand begins to draw the outlines of buildings. The hand adds details such as brick patterns to the buildings, and draws smoke emerging from the top of some of the buildings. Finally, the hand begins to use a paint brush and watercolors to paint in the drawing. The painted in drawing is shown at multiple angles.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the blogger likely to do next in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Create a video about pens.",
+ "B. Create a video about drawing techniques.",
+ "C. Create a video about sketchbooks.",
+ "D. Create a video about pen ink cartridges."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many drawing pens appear in the first half of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two.",
+ "B. Five.",
+ "C. Three.",
+ "D. Four."
+ ]
+ }
+ },
+ "YazDdXou": {
+ "video_id": "YazDdXou",
+ "video_duration": "550s",
+ "duration": ">8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Historical Analysis",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video presents text of C\u00e6dmon's Hymn in Northumbrian and West Saxon versions. Subsequently, the runic symbol thorn and its variations are displayed alongside the word \"born,\" and a close-up of the thorn character in a manuscript is shown. Images of \"Ye Olde\" shops follow. The word \"r\u00fan\" (Old Norse) and \"run\" (Old English) are shown, along with their meaning and etymological roots. A medieval illustration of robed figures, some with pointed hats, and a king is displayed. A page from a rune poem and the Armanen runes appear. A page of Fraktur text is presented. Led Zeppelin symbols and the Krupp logo are depicted. A cannon, a catapult or trebuchet, and a drawing on a stone resembling a figure holding a laptop and a cell phone, are shown alongside the Bluetooth logo.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What was the reason for adopting the rune \"W\" in early Old English manuscripts?",
+ "answer": "D",
+ "candidates": [
+ "A. To avoid confusion between the vowel and consonant \"w\".",
+ "B. To avoid confusion between the vowel and consonant \"p\".",
+ "C. To avoid confusion between the vowel and consonant \"n\".",
+ "D. To avoid confusion between the vowel and consonant \"u\"."
+ ]
+ }
+ },
+ "SpHfgnus": {
+ "video_id": "SpHfgnus",
+ "video_duration": "484s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Pranks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with curly hair and brown clothing is in the driver's seat of a car. Another person is briefly visible in the passenger seat wearing a white shirt, before the curly-haired person takes up most of the screen. The curly-haired person is holding several small, pale objects in her hand. She puts her hands down, and continues speaking. She winks and then puts her hands in front of her. She then turns her head. She and another person are in the front of a car. Both are eating and holding food containers. The second person is also wearing a white shirt. The curly-haired person holds a large white bowl with green vegetables and a light-colored grain. The second person holds a half-eaten sandwich. The curly-haired person looks away and covers her mouth with her hand. She spits a few small pale objects into her hand. The other person watches her with a confused look. She pushes her car's sun visor up and spits into it, then she takes a tissue from the visor and cleans her mouth. She throws the objects in the trash bag. She cleans the area with a tissue, then she continues talking. She laughs, and smacks the man lightly in the face. She raises a finger. The two people laugh and talk to one another before both look forward in the car. The person with curly hair then points with her hand and starts talking again while the man looks at his phone. The curly-haired person begins talking to the camera, before pulling her hands through her hair. She then waves her hands in front of her before the video cuts.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video contain any fast food items from Chick-fil-A?",
+ "answer": "C",
+ "candidates": [
+ "A. None of the above.",
+ "B. No, it does not.",
+ "C. Yes, it does.",
+ "D. It is not clear."
+ ]
+ }
+ },
+ "BVHuJecL": {
+ "video_id": "BVHuJecL",
+ "video_duration": "46s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a full body view of a 3d avatar of a white-haired character wearing a red and white jacket, a blue shirt underneath, and a red plaid skirt with black boots, standing in front of a blurred screen of what appears to be a web browser window that is showing images of various Pokemon. The character moves her arms and body while text appears on screen. The text includes: \u201cFIILLLLian gets TROLLED and BULLIED by chat\u201d, \u201ci feel like\u201d, \u201ctyping my name\u201d, \u201cwith two L\u2019s\u201d, \u201cshould be an instant ban\u201d, \u201clike\u201d, \u201cTHE DISRESPECT\u201d, \u201ci have been streaming\u201d, \u201cTHREEEEEEE years\u201d, \u201cand people are still adding\u201d, \u201cextra L\u2019s in my name\u201d, \u201cI GET IT\u201d, \u201csometimes i mess up\u201d, \u201csometimes i take an L\u201d, \u201cbut who doesn\u2019t\u201d, \u201cwe all have our days\u201d, \u201ci don\u2019t deserve\u201d, \u201cto be\u201d, \u201cbullied for thaaat\u201d, \u201ci don\u2019t deserve that\u201d, \u201cyou understand\u201d, \u201c\u201d, \u201csup Filian\u201d, \u201cPH?\u201d, \u201chow about i get a PHD\u201d, \u201cin kicking your nuts\u201d, \u201cinstructions unclear\u201d, and \u201cfourteen L\u2019s\u201d. Another 3d avatar can be seen in the upper right hand corner, wearing a blue hat and yellow clothing, partially visible in the right hand of the frame throughout the video. A third, partially visible 3d avatar, with white hair and dark clothing, is visible in the upper left hand corner of the frame during portions of the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many roles are voiced by the female narrator in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Four.",
+ "B. One.",
+ "C. Three.",
+ "D. Two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the girl\u2019s profession in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Broadcaster and viewer.",
+ "B. Broadcaster and student.",
+ "C. Viewer and student.",
+ "D. Singer and student."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What changes occur in the girl's emotions in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. From discouraged to excited.",
+ "B. From calm to excited.",
+ "C. From calm to discouraged.",
+ "D. From excited to calm."
+ ]
+ }
+ },
+ "kpKJbacm": {
+ "video_id": "kpKJbacm",
+ "video_duration": "400s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Fashion",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a close up of a woman with dark, curly hair, wearing a tan-colored top, smiling. The scene cuts to a different woman in a red shirt with the word \"Abercrombie\" on it. She wears a black and red headscarf and large hoop earrings. She gestures with her hands, putting one on her chest and opening her mouth as she talks. The scene cuts to the words \u201cElevateStyles.com\u201d in a white font, then returns to the second woman, who continues to gesture with her hands, and picks at her fingernails with her fingers. She places her hand on her chest again, then gestures with her hands and moves her hands to her face. The second woman introduces a black curly wig and then touches the inside of the wig with her hands. She picks at the wig, looks inside of it again, and picks at a small clip inside the wig. The second woman then puts the headscarf over her face, and takes it off. She holds the wig with her hands, and begins to put it on her head. Then, she uses a comb to make adjustments to the placement of the wig. Then, she uses a dark ball to press around the edge of the wig, and uses a toothpick to make adjustments to the front of the wig near her hairline. The second woman then uses a brush to make further adjustments to the wig, and uses her fingers to touch her hair. Then, she smooths the hair down with a brush, then uses a dark spray bottle on the hair. She brushes and styles the wig further. She also begins to pick at individual strands of hair. The second woman looks at the camera with the styled wig on. She moves her fingers through the hair, and then gestures with her hands as she talks. She shows a final view of the wig, which now sits completely on her head, and adjusts the hair with her fingers. The scene cuts to a white background showing social media icons next to a rainbow graphic, a desktop computer screen, a cell phone screen, text, and an illustration. The last image shows white text on a black background which reads \u201cThanks 4 Watching\u201d.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What type of music is playing in the middle of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Classical songs.",
+ "B. Rock songs.",
+ "C. Pop songs.",
+ "D. Children's songs."
+ ]
+ }
+ },
+ "fGXbLpbs": {
+ "video_id": "fGXbLpbs",
+ "video_duration": "64s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A group of six people are standing in a room, all facing forward. Most are wearing name tags. The first person in the foreground is a girl with long red hair wearing a green shirt and earrings. Next is a girl with blonde hair wearing a pink shirt, a boy with brown hair wearing a blue shirt with white and red accents, and a girl with brown hair wearing a patterned shirt. In the background, two people can be seen. The scene then changes to a different room. A man wearing a baseball cap and a white and black jacket is walking toward the camera. He raises a hand and then says something. The scene changes to show the first group of people again, then to a room where several people are getting food. The man in the cap is speaking to another man with white hair who is wearing a yellow shirt. The man in the cap then speaks to another person. Next, the boy with brown hair in the blue shirt and the man in the cap are standing on a raised platform. The two speak for a moment. The man in the cap then says something, and the boy in the blue shirt starts to move. The two speak again, then a car is shown on the platform. The boy in the blue shirt climbs on the car. A car jumps in the air, and he jumps off of it. The two have another short conversation. The boy then jumps in the air. The man in the cap then starts to walk.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "How does the volume of the man wearing the blue hat change in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The volume increases as he speaks.",
+ "B. The volume is particularly loud when he speaks to the person wearing a white hat.",
+ "C. The volume is particularly loud when he announces the class assignments.",
+ "D. The volume remains constant."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event occurred when the man wearing a blue hat and the man in a blue short-sleeved shirt were standing on the stage?",
+ "answer": "A",
+ "candidates": [
+ "A. A car fell from the sky and crushed the boy in a blue short-sleeved shirt; after standing up, he was launched into the air by a pedal.",
+ "B. The boy wearing a blue short-sleeved shirt turned into a flying car.",
+ "C. A boy in a blue short-sleeved shirt was hit and sent flying by a car.",
+ "D. A car fell from the sky and crushed the boy in a blue short-sleeved shirt; after standing up, he was launched into the air by a pedal."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "At the beginning of the video, what did the girl in green do to make the man in the blue hat angry and shout?",
+ "answer": "B",
+ "candidates": [
+ "A. The girl in green interrupted the man in the blue hat while he was speaking.",
+ "B. The girl in green refused the man in the blue hat's request to demonstrate superpowers.",
+ "C. The girl in green used her superpowers to injure the man in the blue hat.",
+ "D. The girl in green insulted the man in the blue hat."
+ ]
+ }
+ },
+ "vVpFKAFw": {
+ "video_id": "vVpFKAFw",
+ "video_duration": "209s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A white and gray Citro\u00ebn DS3 rallycross car, number 7, races around a dirt track. It follows other cars, kicking up dirt and splashing through puddles. A black Ford Fiesta, number 95, with red and white lettering, loses control, spins nearly 180 degrees, its rear end sliding across the wet track. The car recovers and continues racing. The view changes to a wider shot of the track, showcasing the surrounding green fields and distant hills. Several other rallycross cars, including buggies, navigate the course. The focus returns to the Citro\u00ebn DS3, as it speeds along a grassy section of the track. It approaches another rallycross car with red, yellow, and orange markings and the word \"Goody\" prominently displayed. This red Citro\u00ebn then takes center stage, driving smoothly along the track. Next, a blue and yellow Ford Fiesta with the words \"parts4cars\" on the side appears, navigating the curves and kicking up dirt. The car accelerates away from the camera. The camera shifts back to the wider view of the track. The white and grey Citro\u00ebn DS3 and a yellow Ford Fiesta race side-by-side, jostling for position. As they approach a turn, they appear to make contact, with the Citro\u00ebn nudging the Ford. The blue Ford Fiesta returns, speeding along the paved track as it drifts around bends, kicking up dust. The camera remains on this car for a prolonged period, following it as it navigates the mixed-surface track. Another Ford Fiesta, number 95, reappears, drifting through corners, the camera now close behind it. The view switches back to the blue and yellow Ford Fiesta as it approaches a turn marked with a Cooper Tires sign and yellow poles. Finally, the clip ends with the Cooper Tires logo, followed by the Motorsport UK British Rallycross Championship and 5 Nations Trophy logos.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, what was the relative position of Car 111 and Car 1 as they approached the finish line?",
+ "answer": "D",
+ "candidates": [
+ "A. Car 111 did not cross the finish line.",
+ "B. Car 111 is behind Car 1.",
+ "C. Car 1 did not cross the finish line.",
+ "D. Car 111 is ahead of Car 1."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What number is displayed on the window of the red and yellow car?",
+ "answer": "B",
+ "candidates": [
+ "A. 95.",
+ "B. 172.",
+ "C. 72.",
+ "D. 75."
+ ]
+ }
+ },
+ "BrKHwVmX": {
+ "video_id": "BrKHwVmX",
+ "video_duration": "172s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person is shown applying makeup. They first hold a powder compact and a brush, and then use the brush to apply powder to their face. They then show a small container with red makeup inside. They touch their face with their hands. They then use a pink makeup palette to apply blush with their fingers. They use their finger to apply some makeup to their lips. They hold a brown eyebrow pencil and apply makeup to their eyebrows. They apply makeup to their chin and below their eyes. They then use the same pencil to apply beauty marks to their face. They style their hair, touch their face, and speak to the camera. A floral wreath hangs on a door in the background. To the right, a mannequin wearing a pearl necklace is visible along with a peach-colored garment. Various still images are displayed in the video: one shows a woman with a white hairstyle wearing a blue dress, another shows a diagram of beauty marks, and another shows a woman with a large white hairstyle wearing a light blue dress with a black bow around her neck. The person in the video is wearing a pink dress.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "What brand of lipstick did the girl in the video use?",
+ "answer": "B",
+ "candidates": [
+ "A. Tom Ford.",
+ "B. I don't know.",
+ "C. M.A.C.",
+ "D. YSL."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why do girls shape their eyebrows to be thin, long, and arched?",
+ "answer": "B",
+ "candidates": [
+ "A. Because this makes them seem more amusing.",
+ "B. Because this makes them look smaller.",
+ "C. Because this gives them a more authoritative appearance.",
+ "D. Because this makes them appear more enthusiastic."
+ ]
+ }
+ },
+ "mEgRXdxw": {
+ "video_id": "mEgRXdxw",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a black screen. The words \"10 Matches That Deserved to Main Event WrestleMania\" then appear in white and red text. Then the words \"Number 10\" appear in white above red text that reads \"The Rock vs. Hulk Hogan (WM 18)\". Then an image of 3 people is displayed. The woman has long brown hair and is wearing a black and white print top. The man on the left is holding a silver belt and is wearing a gold belt. The man on the right has shorter brown hair and is shirtless with black wristbands and white hand tape. Text below reads \"Undisputed Championship\" above \"Chris Jericho\" and \"Triple H\". The image transitions to a man in green shorts holding another man in dark shorts in a headlock. A referee in a black and white striped shirt is standing next to them. The man in green shorts is replaced by an image of two men standing facing each other. The man on the left has a black t-shirt and a black head covering with white lettering. The man on the right is shirtless and has dark wristbands. The image transitions to a close-up of the man on the right. The image changes again to show a man in green shorts holding a man in black and white shorts. The image transitions again to show a man in green shorts holding a metal chair, as another man in dark shorts extends his leg. The image transitions again to show a close-up of the two men from before, the one on the left with a black head covering and t-shirt, and the other in a black shirt. The image then transitions to a close-up of the man on the right. The image is replaced by a poster of the 2 men facing off against each other, with the text \"Icon vs. Icon\", \"Wrestlemania\" and \"The Absolute Best Ever\". Then, a man with blonde hair is shown jumping toward the right side of the image, and another man in dark shorts is on the ground on the right.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the speaker's emotion evolve throughout the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The emotion shifts from doubt to anger.",
+ "B. The emotion shifts from excitement to anger.",
+ "C. The emotion shifts from excitement to regret.",
+ "D. The emotion shifts from anger to regret."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "In the middle of the video, how does the person wearing green pants defend against a kick from the person wearing black pants?",
+ "answer": "D",
+ "candidates": [
+ "A. Use a steel plate to block the attack.",
+ "B. Use a wooden plank to block the attack.",
+ "C. Use a forearm to deflect the attack.",
+ "D. Use an iron chair to withstand the attack."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "At the end of the video, is anyone wearing purple clothing?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "VOYYwdhd": {
+ "video_id": "VOYYwdhd",
+ "video_duration": "510s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with light colored hair is having their hair washed in a black sink. Another person is washing the light colored hair in the sink. The person washing the hair is wearing multiple bracelets, a watch, and has tattoos on their arm. The person having their hair washed is wearing an earring. Then, the video shows the person having their hair washed sitting in a salon chair and a person is drying the light colored hair with a hair dryer. The person drying the hair has tattoos on their arm and is wearing a watch. Next, the video shows three tubes of hair color that have the words \u201cColor Intensity\u201d written on them. A hand is seen dispensing two different colors of hair dye in a bowl. A person is using a brush to apply a purple colored dye to the light colored hair. The person applying the hair dye has tattoos on their arm and is wearing a watch and gloves. The same person is seen using the brush to apply the purple dye to the light colored hair until the entire head of hair has been coated in purple dye. The person is also using a comb to work the dye through the hair. Then, the person is using their gloved hands to spread the dye through the hair. Next, the person is shown using a hair styling tool to curl the person's hair. Finally, the person with the newly dyed and styled purple hair is shown smiling. There are 2 different people seen throughout the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the video recommend using dark dye and leaving it on the roots for a while?",
+ "answer": "A",
+ "candidates": [
+ "A. To ensure the color appears more balanced.",
+ "B. To make the roots less likely to grow black hair.",
+ "C. To help the root hair color last longer.",
+ "D. To make the color more noticeable."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many bottles of hair dye are visible on the table in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four bottles.",
+ "B. Six bottles.",
+ "C. Seven bottles.",
+ "D. Five bottles."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What change occurred to the hair dye bottle held by the man wearing headphones in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Ultimately, the bottle was put into the drawer.",
+ "B. Ultimately, the bottle was given to another man.",
+ "C. Ultimately, the bottle was placed on the table.",
+ "D. Ultimately, the bottle was thrown into the trash can."
+ ]
+ }
+ },
+ "vEwHYQtH": {
+ "video_id": "vEwHYQtH",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Three correctional officers stand near a white Isuzu truck. One officer opens the truck's door. Another officer inspects a compartment on the side of the truck. An officer opens a door to a holding cell inside the truck, revealing metal bars and a small bench. Another officer locks and unlocks the cell door. Inside an office, an officer points at a map displayed on a large screen. Another officer sits at a computer. The first officer then uses a computer to view a different map. Two officers escort a handcuffed man in a green shirt toward a transport van. They pat him down before putting him in the van. A white Mercedes van exits a secure facility, followed by a white pickup truck. The camera focuses on the shoulder patch of a Queensland Corrective Services uniform. A male officer with a goatee drives a vehicle. The vehicle passes by buildings surrounded by a chain link fence. Inside the truck, a female officer looks at a monitor displaying camera feeds of the holding cells. Another officer hands her some paperwork. Another white truck similar to the first one is shown, followed by a view of the back of the truck as it drives away.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the license plate number of the vehicle driven away by the police officer at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. QJJS61.",
+ "B. QGGS61.",
+ "C. QJJS62.",
+ "D. QGGS62."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What is the policewoman wearing a watch and the bald male officer attempting to do while speaking in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Adjust the vehicle and prepare to remain in place.",
+ "B. Analyze the map and prepare to alter the plan.",
+ "C. Analyze the map and prepare to remain in place.",
+ "D. Adjust the vehicle and prepare to alter the plan."
+ ]
+ }
+ },
+ "CSJJJqEk": {
+ "video_id": "CSJJJqEk",
+ "video_duration": "636s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man in a black hoodie stands in front of a patterned backdrop. A young girl looks through a kaleidoscope. A hand grips a steering wheel while driving past brick buildings. The man from the first scene stands outdoors, holding a small rectangular prism and a camera on a tripod. He crouches by a puddle, examining the prism. The camera focuses on the puddle through the prism. White flowers in a field come into focus. The man holds the prism up to the camera lens. He leans against a metal fence, holding a glass ball and the camera. He holds just the glass ball, gesturing with his other hand. He looks through the camera viewfinder. He stands next to a red-haired woman by the fence, adjusting the glass ball in front of the camera lens. He directs the woman into a pose, then looks at the camera and the prism. He gestures towards the camera and tripod.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why should a crystal ball with a smooth surface not be chosen for photography?",
+ "answer": "B",
+ "candidates": [
+ "A. It blurs the resulting image.",
+ "B. It does not produce a distortion effect.",
+ "C. It reduces the angle of view.",
+ "D. It makes the colors in the photograph more uniform."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What action did the man perform with the longer prism?",
+ "answer": "A",
+ "candidates": [
+ "A. He used it to photograph flowers.",
+ "B. He used it to photograph vehicles.",
+ "C. He used it to photograph trees.",
+ "D. He used it to photograph a woman."
+ ]
+ }
+ },
+ "RDfUlbkA": {
+ "video_id": "RDfUlbkA",
+ "video_duration": "250s",
+ "duration": "4-6min",
+ "domain": "Culture & Politics",
+ "sub_category": "Celebrity Interviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Four people are sitting at a desk. Each person has headphones and a microphone in front of them. The person on the far left has on a white jacket with red sleeves, blue jeans, and a black shirt. They are looking down at the desk. The person next to them is wearing a black hooded sweatshirt with a pink design, and dark glasses. They are looking towards the camera. The person next to them is wearing a dark colored jacket with a yellow letter \"G\" on the front, and a black hat. They are looking towards the camera with a slight tilt of their head. The person on the far right has a black zip-up sweatshirt and a black beard, and is looking toward the microphone. Each person is also on a separate chair and the table is dark colored. There are two black computer monitors, one a large screen and the other is smaller and angled on the table. Behind them are shelves containing various objects. The backdrop behind the desk is light grey with a repeating white stylized \"T\" in a square. The bottom of the video has a red banner with text in it.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the man wearing sunglasses seated in relation to the man wearing a hat?",
+ "answer": "A",
+ "candidates": [
+ "A. On the right.",
+ "B. In front.",
+ "C. Behind.",
+ "D. On the left."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What are the two men in the video doing with headphones on in front of microphones?",
+ "answer": "B",
+ "candidates": [
+ "A. Making a phone call.",
+ "B. Chatting.",
+ "C. Broadcasting live.",
+ "D. Singing."
+ ]
+ }
+ },
+ "tlPbMKTJ": {
+ "video_id": "tlPbMKTJ",
+ "video_duration": "124s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A black screen is replaced with a red, grey, and black radial pattern. White circles, rectangles, and triangles appear and vanish from this pattern. This transitions to a white circle that fades in and out, and then spins, with small white triangles around it. White shapes such as circles, squares, stars, and triangles move around a set of three white dashes in the center of the frame. The words \"Everything Science\" appear in white letters against a black background. Then, an aerial shot of a field with several wind turbines is shown under a clear blue sky. This is followed by an aerial shot of a dark-colored car driving down a road next to a lake. The car continues down the road as the shot angle changes. An aerial shot of a city with multiple skyscrapers is displayed under a blue sky with a haze. This city scene repeats, then transitions to the math problem \"2+2\" written in black on a white background. Next, a blue schematic of space appears next to the math problem. The view shows the schematic on one half of the screen and the aerial view of the city on the other half. This transitions back to the city view and then to an image of a flowing river surrounded by green forest. Then, the word \"Forests\" in white letters appears in the middle of the river scene. The video shows several views of a forest from different perspectives. A definition of \"Carbon Sink\" is shown in white text over the forest background. Then, a view of a river from a cliff with the sun setting is shown, with the text \"~15%\" superimposed. Next, a fire is shown consuming wooden boards. The shot changes to an aerial view of a dark-colored car driving down a road in the middle of a forest. This view of the car driving through the forest is shown again. Then the camera transitions to an interior view of a forest. The words \"CO2\" are shown repeatedly along with a downward facing white arrow. Next, a man wearing a black t-shirt and gloves is shown planting a small tree as another man stands behind him. The text 49/20 MIL is overlaid at the top right, and then it switches to 50/20 MIL. A field with several people planting trees is shown. The text 53/20 MIL, 73/20 MIL, and 89/20 MIL is displayed on the screen as the camera shows more people planting trees. A person wearing a white jacket is shown in a field next to a split screen that shows a city skyline. A water bottle falls slowly in front of the city skyline. A dog with black and white fur runs on a field. The camera shows a man wearing a hat and sweatshirt holding a shovel in front of the same field. The camera then shows aerial views of a snow-covered mountain. A person uses a chainsaw on a tree stump. A map of the world is shown with several trees and barn graphics that change location. The flowing river in a forest appears again. Then, two men are shown sitting in a field, with the text 300/20 MIL displayed. The man on the left gestures with his hands as he speaks. The man on the right has his hands clasped in front of him. A tweet about planting trees is displayed. The men are shown talking again, then a screenshot of a website showing a tree-planting campaign is shown, with the text \"*now 15 million\" superimposed. This screenshot of the website then repeats twice.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many types of carbon sinks are discussed in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Six types.",
+ "B. One type.",
+ "C. Three types.",
+ "D. Two types."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why do some countries convert forests into farmland?",
+ "answer": "D",
+ "candidates": [
+ "A. Because some countries have no need for forests.",
+ "B. Because other countries have lower carbon emissions.",
+ "C. Because other countries use alternative methods to reduce carbon emissions.",
+ "D. Because converting farmland back into forests would reduce the food supply."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How much does it cost to plant 20 trees, as shown in the final chart in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. $15.",
+ "B. $20.",
+ "C. $25.",
+ "D. $10."
+ ]
+ }
+ },
+ "BeWavfZy": {
+ "video_id": "BeWavfZy",
+ "video_duration": "100s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The Xbox logo appears on a screen. Soldiers' feet walk through a smoky, fiery battlefield. A Hunter creature lumbers in the background. Soldiers run through the scene, firing weapons. A close-up shows a soldier's face as they fire a gun. Another soldier runs through the smoke. A bright light appears in the sky above the clouds. Soldiers run and fire their weapons. A close-up of a soldier shouting while firing a large gun is shown. Another soldier runs through the smoke, looking behind them. A soldier is briefly shown being thrown back by an explosion. Soldiers fall through the air amid explosions and debris. Soldiers fire their weapons behind cover, and another large explosion occurs. Multiple soldiers are propelled into the air amid debris and explosions. A man in military dress uniform with a grey beard and a cap stands watching the battle. He faces left, looking concerned. A red armored figure runs through the scene, firing a weapon. An off-road vehicle drives quickly through the battle scene. The red armored figure runs beside the vehicle and shoots a weapon. As the vehicle drives, two missiles launch skyward from its roof. The missiles fly upwards trailing fire, then crash down on the ground. The vehicle drives on and is blown high up into the air by an explosion. A large mechanical walker walks through the scene, firing blasts. A flying ship is shown against the sky. The walker fires blasts at the ship. A large, multi-legged creature is shown moving through the air above the scene. It unleashes blasts of energy down at the ground. Another set of missiles launches and streaks towards the ground, exploding. A large figure is on top of a platform, looking at the flying ship in the sky. Another flying ship enters from the right, and the figure uses some sort of laser weapon on it. A close-up of the laser weapon shows it attached to a robotic hand and arm. A heavily armored, alien creature with a skull-like face and a large beard appears in the center, looking determined. The creature fires a shoulder-mounted weapon. A figure falls through the smoky air, and an explosion sends debris flying. The man in uniform looks on grimly. He stands in the same position. The armored alien creature appears again in the foreground. The man in uniform watches. An armored soldier stands at the side of the scene. Other soldiers run and jump across the ground. The man in uniform remains still. The alien creature with the beard appears again in the center of the scene. The uniformed man continues to watch. Another armored soldier approaches him and stands near him. The soldier shoots a weapon. The scene changes to an overhead view of debris and wreckage falling slowly. Soldiers run out of a mechanical walker. They are followed by several more armored figures. The alien creature appears again. The man in uniform stands watching, unmoved.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, what is the correct chronological order of the following events? (a) Soldiers fire their weapons. (b) A shockwave propels soldiers into the air. (c) A tank topples a mech. (d) Soldiers from opposing sides engage in combat.",
+ "answer": "B",
+ "candidates": [
+ "A. (d)(a)(c)(b).",
+ "B. (d)(a)(b)(c).",
+ "C. (a)(d)(b)(c).",
+ "D. (c)(a)(b)(d)."
+ ]
+ }
+ },
+ "ypiDRYAF": {
+ "video_id": "ypiDRYAF",
+ "video_duration": "173s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Inside a wrestling ring, a wrestler in a black shirt and shorts has another wrestler, wearing a black shirt and pants, in a headlock. A referee in a black and white striped shirt stands nearby. The wrestler in the headlock struggles to escape, then falls backward, pulling the other wrestler down with him. The wrestlers get back on their feet and face each other. They lock hands, then fall to their knees and push against each other. They get back on their feet again. The larger wrestler lifts the smaller wrestler and throws him over his shoulder, slamming him onto the mat. Both wrestlers crawl towards the center of the ring on their hands and knees. They rise to their knees and lock hands, then get back on their feet. They grapple near the ropes, then the larger wrestler picks up the smaller one and throws him over the top rope, out of the ring. The smaller wrestler lands hard on the floor outside the ring, then gets up and climbs back into the ring. The wrestlers grapple near the ropes again. The larger wrestler Irish whips the smaller wrestler, who runs and bounces off the ropes. The larger wrestler tries to catch him for a body slam, but the smaller wrestler counters with a head scissors takedown. Both wrestlers are on their knees in the center of the ring. The larger wrestler lifts the smaller wrestler and slams him onto the mat. The larger wrestler pins the smaller wrestler. The referee counts, but the smaller wrestler kicks out. The larger wrestler attempts another pin, but the smaller wrestler kicks out again. The larger wrestler lifts the smaller wrestler, who tries for a sunset flip pin, but is unsuccessful. The larger wrestler has the smaller wrestler in a headlock. The wrestlers grapple near the ropes again. The larger wrestler picks up the smaller one for a piledriver and throws him onto the mat. The smaller wrestler lies flat on the mat. The larger wrestler attempts a pin. The referee counts the pin.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What was the position of the man wearing the black and white striped referee shirt on the field at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. He was standing on the field.",
+ "B. He was lying on the field.",
+ "C. He was kneeling on the field.",
+ "D. He was sitting on the field."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many men are standing on the field at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. One.",
+ "B. Three.",
+ "C. Four.",
+ "D. Two."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the video show a third wrestler?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes, it does.",
+ "C. No, it does not."
+ ]
+ }
+ },
+ "jaYvnGus": {
+ "video_id": "jaYvnGus",
+ "video_duration": "162s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Fitness",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows an Instagram page with a grid of six photos and videos. The first video, on the top left, shows a man wearing a black t-shirt with the words \"GASP INC. UTILITY ITEM\" and a variety of packaged snacks laid out on a table in front of him. The second video, in the top center, shows two people riding horses. The third video, on the top right, shows a man lifting a weight. The fourth video, in the bottom left, shows a man working out on a weight machine. The fifth video, in the bottom center, shows a close up of a man's face. The final video, on the bottom right, shows a man working out using a cable machine. The video continues on to a full screen version of the third video from the grid. The video shows a man in a green shirt picking up a black dumbbell from a row of dumbbells. Another man wearing a black sleeveless shirt is standing behind him, on the right side of the video. The first man begins to do dumbbell bicep curls, and then continues with overhead presses and reverse curls, and various other movements with the dumbbell. The second man is seen walking to a stool and sitting on it, continuing to watch the first man. The video then shows different angles of the first man doing the same exercises. At one point, the reflection of the first man in a mirror is shown. The scene then transitions back to the Instagram grid. The video continues on to show a full screen version of the sixth video from the grid. It shows a man sitting on a weight bench, with a weight bar in front of him. A woman is standing behind him. The video then shows a dog laying on the ground in the foreground and then shows the man doing bench presses with the weights. The video then transitions to another full screen version of a video. This time it shows a man in a black shirt and camouflage pants performing shoulder workouts using dumbbells, with a man in a tank top standing off to the side. The video shows different camera angles of the man performing his exercises. The video then transitions to a single image of a man wearing a black tank top with the letters \"RSP\" on it, with gray pants. The video transitions to another single image. It shows a close up of a man's face, with a black tank top with the letters \"4PM\" on it. The video then transitions to a close up video of the man with the \"4PM\" tank top, then transitions to the man from the previous full screen video exercising again with dumbbells. The video ends with the man in the black t-shirt performing the same exercises he was doing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What changes are made to Branch Warren's training methods in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Emphasize proper execution of movements.",
+ "B. Focus on the sensation of the muscles.",
+ "C. Increase the intensity of training.",
+ "D. Adjust the pace of training."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "Branch Warren's shift from focusing less on perfect form to placing greater emphasis on executing movements correctly reflects what kind of emotional transition?",
+ "answer": "D",
+ "candidates": [
+ "A. From passion to happiness.",
+ "B. From passion to attentiveness.",
+ "C. From attentiveness to passion.",
+ "D. From happiness to passion."
+ ]
+ }
+ },
+ "qdcmWLYw": {
+ "video_id": "qdcmWLYw",
+ "video_duration": "213s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Astronomy",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person with long brown hair and a patterned blue button-down shirt is standing against a black background. On the right is a dark object that is partially obscured, with some red cables showing. The person makes various hand gestures. A box on the left of the screen shows either three full moons, an image of a purple and white nebula, or an image of a canyon and sky. In later parts of the video, boxes containing images and text appear on the left, showing diagrams related to angular size and field of view. At the end of the video, an image of a website appears. The website has a dark image of the night sky, and buttons and data related to the image. The website also shows a pop-up window, where data can be added for various parameters.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video show a telescope?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes, a telescope is placed next to the long-haired man.",
+ "B. No, a gaming console is placed next to the long-haired man.",
+ "C. No, a video camera is placed next to the long-haired man.",
+ "D. No, a camera is placed next to the long-haired man."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What are the functions of the web applications demonstrated in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A device compatibility database.",
+ "B. Calculation of the field of view for cameras and telescopes.",
+ "C. A platform for community sharing.",
+ "D. Guidance and educational resources."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the angular diameter of the Orion Nebula at its widest point, expressed in terms of the full moon's angular diameter?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. One.",
+ "D. Three."
+ ]
+ }
+ },
+ "LohsmXXt": {
+ "video_id": "LohsmXXt",
+ "video_duration": "124s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A motorboat with two men is shown near the shore of a river. The boat is close to a rocky, partially submerged outcrop and some debris or driftwood in the water. The scene changes to show two men on the boat holding a fish with an orange lure in its mouth. Next, the view is from inside the boat and shows the fish being released back into the water. Three men are now shown on the boat, two fishing and one searching through a tackle box. One of the fishermen pulls on his rod while the other looks on. The third man selects an item from the tackle box. The view shifts to two men fishing. One raises and lowers his fishing rod repeatedly. The perspective then switches to the shore, showing a man fishing as the boat drifts nearby. A second man walks over and the two confer briefly. The second man casts his line into the water. The first man is then seen holding a large fish by its mouth as he stands at the edge of the water. He releases the fish back into the river. The man in the camo shirt appears and then a final shot shows him holding a very large blue catfish.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Why did the fishermen decide to release the catfish back into the river?",
+ "answer": "D",
+ "candidates": [
+ "A. The law required the release.",
+ "B. The catfish were too old.",
+ "C. The catfish were too large.",
+ "D. They enjoyed the fishing experience."
+ ]
+ }
+ },
+ "KqaSuyFQ": {
+ "video_id": "KqaSuyFQ",
+ "video_duration": "113s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A hand is seen drawing on a white surface. A woman is drawn sitting on the ground holding a mug, and a cauldron is drawn on a fire next to her. There are plants near the cauldron and a text label next to it that says \"Ayahuasca\". Another drawing is next to the woman that depicts a man in a white coat and a third is of a floating figure. There is a drawing of Albert Einstein in a circle with text reading \"Albert Einstein 1879 - 1955\" underneath it. More text is written above the floating figure, above the man in the white coat, and next to Einstein. A woman holding a mushroom is drawn, sitting on top of a drawing of the earth. A small drawing of a planet with rings and a swirling pattern are added near the woman holding the mushroom. A drawing of a man in a business suit is added, holding a mask in front of his face, a swirling pattern inside the area of his head. Text is written above this character as well. A hand draws a new drawing depicting an injured person in a hospital bed, a dog, a frog, mushrooms, and 2 people in the background. Text bubbles are drawn over the bed-ridden person and above the man in the suit. The hand draws a rectangle, then adds a monitor inside it. The hand continues to draw, adding animals and other items around the monitor. Several short clips are shown on the monitor with various cartoon figures, including a woman in a jungle, shapes, a boy and a man, an injured man on a bed, a man and a woman doing a hand motion, and a woman floating in space. The hand adds a drawing of a man and another person sitting by a fire, with a speech bubble between them, and a floating block-like figure. The hand adds more speech bubbles above the two people, and a figure is added next to the block-like figure. The hand adds additional text to the speech bubbles, then adds a drawing of two people planting a tree, three cars, a stream, and a skull and bones on the side of the road. Trees and plants are also added. 
A block of text is shown in the top left of the screen that says \"Watch AfterSkool\". Text below says \"Subscribe to AfterSkool\". The hand adds the drawing of the floating blue block-like figure with its hands in a praying motion. The hand adds more drawings to the scene from the \"Watch AfterSkool\" text. The hand stops drawing and the video ends.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between parapsychology and after-school activities?",
+ "answer": "C",
+ "candidates": [
+ "A. They are competitors.",
+ "B. They are part of the same program.",
+ "C. They are partners.",
+ "D. They are unrelated."
+ ]
+ }
+ },
+ "yTgaQrSl": {
+ "video_id": "yTgaQrSl",
+ "video_duration": "125s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A group of adults and children sit at a table. Some people are clapping. A child is holding a small stuffed animal. A man in a black robe sits behind a desk. People are taking photos and videos with phones. A group of people pose for a photo. A group of people pose for a video. A group of people stand in a row. There are two white boxes on a peach colored background. A blue butterfly moves across the bottom of the peach background. There are four logos with the word \"Watch\" on them in the peach background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video do the children clap?",
+ "answer": "D",
+ "candidates": [
+ "A. The children clap throughout the video.",
+ "B. The children clap at the end of the video.",
+ "C. The children clap at the beginning of the video.",
+ "D. The children clap in the middle of the video."
+ ]
+ }
+ },
+ "ezTQtfgR": {
+ "video_id": "ezTQtfgR",
+ "video_duration": "263s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a fluorescent orange vest, black snow pants, and a black beanie unloads a black and silver snowmobile from the back of a pickup truck using a ramp. He walks the snowmobile off the ramp and stands facing the camera. He puts on a pair of brown work gloves. Next, he takes off the work gloves and beanie, revealing sunglasses underneath. He walks toward the ramp and begins to load the snowmobile back onto the pickup truck. He then walks around to the back of the truck and unloads the snowmobile again. He takes off the orange vest to reveal a black t-shirt and black and white snow pants. After that, he walks around to the side of the snowmobile and mounts the snowmobile, then drives it forward through the snow-covered ground, making tracks in the snow, passing two parked cars: a white Ford pickup truck and a white SUV. He then dismounts and walks toward the camera. He takes off his sunglasses and places a black helmet on his head, then takes a small black bag off the seat of the snowmobile. He pulls his gloves up higher on his wrists, opens the red coat he is wearing, revealing an orange and black undershirt, and puts the red coat on the ground beside the black bag. He pulls a black mask over the lower part of his face and walks toward a large mound of snow, where he remounts the snowmobile. He then drives the snowmobile around the snowy area and makes multiple passes over the snow mound, sometimes jumping off the top of it. Finally, he drives to the top of the mound, stops the snowmobile, sits down on it facing the camera, adjusts his gloves, and sits still.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color are the gloves worn by the man in the orange shirt in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Orange.",
+ "B. Green.",
+ "C. Yellow.",
+ "D. Black."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, after the man in green appears, from which direction relative to the man in orange does he engage in conversation?",
+ "answer": "C",
+ "candidates": [
+ "A. Behind.",
+ "B. To the right.",
+ "C. To the left.",
+ "D. In front."
+ ]
+ }
+ },
+ "KZuOWXuo": {
+ "video_id": "KZuOWXuo",
+ "video_duration": "118s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Nine colorful, cartoon monsters and a baby are lying in a large, red and gold bed, under a light pink blanket decorated with stars. A large number 9 is displayed above the bed's headboard. A cartoon scientist with a blowtorch approaches the bed. The monsters and baby roll over, and one brown monster falls off the bed and into a dark area with gears and machinery. The scientist incinerates the fallen monster with the blowtorch.Now eight monsters and the baby are in the bed, and the number above the headboard changes to 8. The scientist approaches with a spray gun and sprays green gas at the bed. The monsters and baby roll over, and an orange and purple monster falls off the bed. This monster falls into a hole and lands next to a container with a skull and crossbones. A gray hose is attached to the container and sprays gas that causes the monster to swell.Seven monsters and the baby are now in the bed, the number above changing to 7. The scientist approaches wielding a large mallet. The monsters and baby roll over, and a pink monster falls out of bed. The pink monster lands in the dark mechanical area, and a large white boxing glove punches the monster into a blue vortex with radiating lines. The monster emerges from the vortex with a traffic cone on its head.This pattern repeats, with the number of monsters decreasing each time. A different monster falls out of bed and is subjected to a different fate by the scientist. The scientist uses a variety of objects and weapons, including a blowtorch, spray gun, mallet, magnet, net, spring-loaded boxing glove, spinning wheel, lever-operated claw, flamethrower, giant stamp, and vacuum cleaner. The monsters fall into various settings, including the mechanical area, holes that lead to toxic gas, and vortexes. Finally, only the baby is left in bed.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, what is the baby's position on the bed?",
+ "answer": "B",
+ "candidates": [
+ "A. The second on the left.",
+ "B. The first on the left.",
+ "C. The second on the right.",
+ "D. The first on the right."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the video show 10 objects appearing on the bed?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No.",
+ "C. Yes."
+ ]
+ }
+ },
+ "OAjztvQU": {
+ "video_id": "OAjztvQU",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A pre-cooked lasagna in a clear plastic container sits on a dark baking tray. The plastic film covering the lasagna is partially peeled back. The container is then placed into a pot of boiling water, partially submerged. The pot is on a stovetop burner. A hand places the lid on the pot. On a wooden cutting board, a hand breaks apart a head of broccoli into smaller florets. The broccoli is then added to the pot with the boiling water and lasagna container. A glass lid sits nearby on the stovetop. More broccoli florets are added to the pot. Using a large spoon, the broccoli and carrots in the pot are stirred. The pot continues to steam on the burner.\u00a0At a dining table set with plates, cutlery, and placemats, a person serves portions of lasagna onto two plates. The lasagna is placed on the plates using a spatula. Steam rises from the lasagna and tray. A plate with cooked carrots and broccoli sits on the table. The person serving the food is wearing a dark and light plaid long-sleeved shirt or pajama top. Another person, wearing a light gray t-shirt, sits at the table, his plate already containing a serving of lasagna. A woman wearing a dark long-sleeved shirt serves cooked carrots and broccoli onto her plate. She is sitting across from the other person at the table.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What was the woman doing when the man appeared in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. She was plating the second serving of lasagna.",
+ "B. She was eating lasagna.",
+ "C. She was preparing broccoli and carrots.",
+ "D. She was plating the first serving of lasagna."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Was cheese shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. No, it was not shown.",
+ "B. None of the above.",
+ "C. It is not clear.",
+ "D. Yes, it was shown."
+ ]
+ }
+ },
+ "iMvSPDgJ": {
+ "video_id": "iMvSPDgJ",
+ "video_duration": "163s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "Sports Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows American football games. In the first game, two teams in dark green and white, and white and blue are on the field. Players from both teams move and run with the ball. At one point, a dark green player catches the ball and a player from the other team tries to tackle him. Then a referee raises his hands. After this, a dark green player walks across the field, then another dark green player, and then a white and blue player. Later, a football player wearing a blue and white jersey and blue knit cap sits on a bench, then two coaches, then a dark green player in a black cap, then a coach in a grey jacket and a dark green baseball cap, then another man in a blue hoodie. Following this, multiple people are seen walking onto the field. In the second game, two teams wearing white and red and green and white are on the field. Players move the ball, and at one point, a player wearing green and white is tackled, and several white and red players surround him. Next, a player in green and white wearing a white helmet with black stripes walks across the field. Then multiple white and red players are visible. After this, a green and white player is seen catching a ball and then falling. He is surrounded by players in white and red. The green and white player has fallen to the ground, and the players are no longer touching him. Next, a white and red player is shown close-up. Then the next play starts. There are more instances of players tackling each other. Finally, multiple people are seen in the audience, a green and white football player removes an object from his ear, and a referee is visible.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "Sanchez was repeatedly intercepted. How did the fans' emotions change?",
+ "answer": "B",
+ "candidates": [
+ "A. From disappointment to anger.",
+ "B. From disappointment to renewed hope.",
+ "C. From anger to renewed hope.",
+ "D. From disappointment to despair."
+ ]
+ }
+ },
+ "sDlVtQWL": {
+ "video_id": "sDlVtQWL",
+ "video_duration": "280s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a wide shot of a stage filled with people, with red lighting and some flames. A person in white is on a raised platform. The scene transitions to a boxer with short dark hair, bare-chested, with white teeth showing. The boxer is shown again with red boxing gloves. The boxer is shown again with leopard print shorts that say \"Prince\" on the front. The boxer is then shown getting ready in the ring. Then, the video shows another boxer walking into the ring, surrounded by people. A graphic on the screen reads \"Round 1 2:04\". The boxers engage, throwing punches in a boxing ring. A referee in a light blue shirt and dark pants is also visible in the ring. A graphic on the screen reads \"Round 2 2:22\". The boxers continue to fight through several rounds, with the graphic at the bottom changing from \"Ronde 2\" to \"Ronde 3\" to \"Ronde 4\". Then the graphic on the screen reads \"Round 5\" and a text box reading \"Prince Naseem Hamed v Marco Antonio Barrera IBO Featherweight Title\". The graphic on the screen reads \"Round 6 2:19\" then \"Round 7 2:38\" with a \"Glenn McCrory's Scorecard\" displayed at the bottom with a running count of the match. The video continues showing the boxers in the ring with the graphics at the bottom indicating rounds 8,9,10,11, and 12 with \"Glenn McCrory's Scorecard\" displayed in rounds 10. The boxer in leopard print is shown collapsing, then the second boxer raises his arms in celebration while the referee looks at him. The video ends with text overlaid on the screen with one of the boxer's names.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What instrument is primarily used in the warm-up music played as the contestants enter at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Gong.",
+ "B. Guitar.",
+ "C. Electronic drums.",
+ "D. Bass."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "As the game was nearing its end, what move did the man in blue shorts make in response to his opponent's attack that impressed the commentator and warranted a replay?",
+ "answer": "A",
+ "candidates": [
+ "A. Crouch and lower the head to perfectly dodge.",
+ "B. Raise arms to block and counter with a left hook.",
+ "C. Quickly turn and deliver a right hook.",
+ "D. Throw a straight punch at the opponent."
+ ]
+ }
+ },
+ "wrcmvvKx": {
+ "video_id": "wrcmvvKx",
+ "video_duration": "127s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video depicts an animated sequence of stylized, painted images. The initial scene presents a landscape with a body of water, dark trees in the foreground, and a structure in the distance. Birds fly in the sky. The scene shifts to an interior shot with three people in white outfits and blue helmets, with red accents on their collar. Their backs face the viewer. The view then changes to a profile shot of a person with headphones, holding a cylindrical object. This individual is shown repeatedly with slight variations in the background, sometimes with additional people in the frame. Next, a room filled with rows of seats and many figures, is shown with two large, rectangular screens at the front. The scene returns to the initial landscape, now with some superimposed elements, such as transparent figures and a white circle in the sky. The profile shot with headphones then reappears, now with additional figures overlaid. The view shifts to a close-up of one person in a space suit with a red band at the neck. The landscape with the distant structure appears again, now with fewer clouds and birds. Hands touch a surface. A close-up of a face appears with a dark background and stylized lines. There is another wide shot of the room with seats and figures. A close up of a dark figure, a dark shape appears with a white line through it. The scene transitions to a depiction of a rocket launching with smoke and fire, and water in the foreground. The video then returns to close-ups of the individuals in space suits, then another shot of the room with seats and figures. A scene of a landscape is displayed, with superimposed elements. A rocket is depicted launching into space, and then from a side view, followed by a view looking down from the sky towards earth. The final scene is a black screen with text, a logo, and a Facebook symbol.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, when does the music start?",
+ "answer": "B",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Throughout the video.",
+ "C. At the end of the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "fXxdDOYK": {
+ "video_id": "fXxdDOYK",
+ "video_duration": "64s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A short intro sequence shows a blue background with repeating logos featuring a cartoon drawing of a man's head. The words \"Kuya TEN Learnings 1\" are superimposed over a yellow circle.The main video begins with a toddler in a light teal tank top and coral shorts, walking toward the camera in a dimly lit room. She has blonde hair and appears to be smiling. She approaches a light-colored carpeted area in front of a wooden bed frame with a sage green blanket draped over it. The wall behind the bed has framed artwork hanging on it. A dark wooden dresser is visible to the left. An open doorway is in the background.The toddler continues to look at the camera, smiling and talking. Her smile fades slightly, and she starts to cry while facing the camera. A hand enters the frame from the right side, offering a single booger to the crying toddler. The child cries and turns away slightly from the hand. The toddler briefly looks back towards the hand and partially covers her mouth with her hands, continuing to cry. The hand continues to present the booger to the child. The crying child swipes at the hand with her right hand, and then swipes at the hand with her left hand. She continues to look at the hand while crying, then takes the booger and puts it in her mouth. She continues crying and rubs her nose with her right hand. The child turns and walks away from the camera towards the bed and the open door. The video ends with a shot of the empty space in front of the bed.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotion is the woman expressing when she pinches the little girl's nose in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Sadness.",
+ "B. Happiness.",
+ "C. Calmness.",
+ "D. Guilt."
+ ]
+ }
+ },
+ "ZHpdCvom": {
+ "video_id": "ZHpdCvom",
+ "video_duration": "128s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows several clips of a soccer game, with multiple players visible on a green field. The players wear jerseys of red, yellow, green, and blue. There are white soccer goals with white netting on the field. A white soccer ball is visible. At the bottom of the screen, a red bar displays the score of the game. The background shows spectators in the stands. Several players are seen celebrating a goal by jumping on each other.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the player wearing the red jersey with number 7 signal for everyone to calm down?",
+ "answer": "C",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. Throughout the video.",
+ "C. At the beginning of the video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What stage of a soccer match is shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A penalty shootout.",
+ "B. Overtime.",
+ "C. The first half, during an offensive play.",
+ "D. The second half, during an offensive play."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the score at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. One to two.",
+ "B. Two to two.",
+ "C. Two to one.",
+ "D. Two to zero."
+ ]
+ }
+ },
+ "kSzBsOEt": {
+ "video_id": "kSzBsOEt",
+ "video_duration": "370s",
+ "duration": "6-8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A hand with fair skin is visible holding a wooden tool, applying it to a white paper surface. The tool is then put down. A white eraser is visible near the left edge of the paper. The hand begins using a pen, outlining a border of a page. The hand continues drawing the image, using both a pen and a pencil. A ruler is used to draw straight lines. Some of the lines appear to be guidelines for the image. A gray piece of paper is laid on top of some parts of the drawing to prevent the pen from drawing over those parts of the paper. The hand moves and adjusts the paper and a ruler. There are now two drawings next to each other. The hand picks up and moves each individual page of the drawing. The hand lays a new piece of paper over the drawings and then puts the drawing behind the blank paper. The hand moves the drawing and the blank paper and makes adjustments. The paper is removed. The hand then moves the whole paper out of view.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many tools were used during the painting process?",
+ "answer": "D",
+ "candidates": [
+ "A. Five.",
+ "B. Two.",
+ "C. Three.",
+ "D. Four."
+ ]
+ }
+ },
+ "bkCDyIyn": {
+ "video_id": "bkCDyIyn",
+ "video_duration": "74s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Two tigers are behind a barred enclosure. A lion and a lion cub are in tall grass. The lion stands up and moves around. A white tiger is lying on grass. A group of four lion cubs are lying in grass. A white tiger and her cub are on grass in front of a stone wall. Two tigers and a lion are on grass. A tiger is sitting on grass with a lion lying down. The lion gets up and the tiger walks away. The tiger walks toward the camera. A lion is standing in a field, and the video shows a close up of his face. Two lions are walking in a field. A tiger walks through tall grass. The tiger walks through the grass. A tiger enters a shallow water source and briefly grapples with another tiger. A blue moon is in the sky. The video transitions to a tiger lying on its back as a lion stands over it. The lion is on its hind legs. A tiger and two lions are on grass. Two lions and a lion cub are in tall grass. A white tiger runs into water. A tiger yawns with trees in the background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the scene of a lion attacking a tiger occur?",
+ "answer": "C",
+ "candidates": [
+ "A. The scene occurs in the middle of the video.",
+ "B. The scene does not occur in the video.",
+ "C. The scene occurs at the end of the video.",
+ "D. The scene occurs at the beginning of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What is the style of the background music in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Melancholic.",
+ "B. Gentle.",
+ "C. Fast-paced.",
+ "D. There is no music."
+ ]
+ }
+ },
+ "zNpGxAXZ": {
+ "video_id": "zNpGxAXZ",
+ "video_duration": "43s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man in a light brown suit stands on stage near a clear plexiglass box. Another man, wearing a black leather jacket and a dark red shirt, bends down to pick up a glass pitcher. He shows the pitcher to the man in the suit, then he holds the pitcher through a vertical slit in one of the box's sides, so that Steve can grab the handle of the pitcher from the other side of the plexiglass. The man in the jacket claps his hands and points toward the pitcher, which starts to visibly vibrate inside the box. Steve holds onto the handle. The pitcher vibrates more intensely and deforms as if it were melting.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man wearing black do to the glass box?",
+ "answer": "D",
+ "candidates": [
+ "A. He places the glass box on the sofa.",
+ "B. He breaks the glass box.",
+ "C. He places the glass box on the ground.",
+ "D. He places the glass box on the shelf."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the kettle relative to the glass box?",
+ "answer": "B",
+ "candidates": [
+ "A. To the left of the glass box.",
+ "B. In the middle of the glass box.",
+ "C. Above the glass box.",
+ "D. In front of the glass box."
+ ]
+ }
+ },
+ "WNijOwao": {
+ "video_id": "WNijOwao",
+ "video_duration": "99s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A child wearing a pink jacket and holding a gray stuffed elephant stands next to a basket of apples and another basket of yellow bananas. The child walks forward and out of frame. The scene changes to a man holding the camera, with a living room as the background. He lifts his arm to touch his hair. Then the scene shifts to show a different view, and a second child in a white shirt and pink skirt is sitting on the floor. The man lifts the first child wearing a pink skirt onto his shoulders. The man moves the child on his shoulders, then allows her to jump down off his shoulders. The man then lifts the child onto his shoulders again, and he moves around with her on his shoulders. He then lifts the first child off his shoulders and holds her upside down, then lifts her to shoulder height again, then puts her on the floor. He moves the second child onto his shoulders. He then picks up the first child and then carries her onto the sofa. He touches his hair. The scene shows the living room again with a the man and both children, and then the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man place the little girl, who was wearing a light pink skirt, on the sofa?",
+ "answer": "B",
+ "candidates": [
+ "A. Three times.",
+ "B. Twice.",
+ "C. Once.",
+ "D. Four times."
+ ]
+ }
+ },
+ "WybMuwJr": {
+ "video_id": "WybMuwJr",
+ "video_duration": "109s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Basketball",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a basketball game. Players in yellow jerseys and players in blue jerseys are seen running on the court while a basketball is being passed. At times the players jump to shoot the ball into the hoop. The game is taking place in a gymnasium with an audience seated in bleachers. Text appears on screen indicating the score and the current point totals for one team throughout the game. At the end of the video, the players are grouped together at the side of the court and are seen talking to adults, and clapping.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the whistle sound appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. One.",
+ "B. Six.",
+ "C. Four.",
+ "D. Two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What celebratory actions are performed by the player wearing yellow number 21 and the player wearing yellow number 30 at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Both bump fists.",
+ "B. Both clap their hands.",
+ "C. Both touch shoulders.",
+ "D. Both clap their hands."
+ ]
+ }
+ },
+ "FimBwrRZ": {
+ "video_id": "FimBwrRZ",
+ "video_duration": "201s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a sandy, rocky landscape. The camera view shifts between different perspectives of the terrain, and the video also features some aerial views. A small object is seen flying above the ground. There is a graphic overlay that depicts the flight trajectory of the object on the terrain. The camera angle alternates between a view of the ground, a view of the object from below and a view of the object from above, as it travels. The video also shows some tire tracks in the sand and an area of rock formations.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many instances of black and white images are shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Eight instances.",
+ "B. Eleven instances.",
+ "C. Six instances.",
+ "D. Ten instances."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the complete flight trajectory of the helicopter appear?",
+ "answer": "A",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. At the end of the video.",
+ "C. In the middle of the video.",
+ "D. Throughout the video."
+ ]
+ }
+ },
+ "xhFyjGtM": {
+ "video_id": "xhFyjGtM",
+ "video_duration": "69s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A young girl wearing a red and white long-sleeved top and black pants stands in a room with toys. She moves her arms up, down, and then to the sides. Yellow and red patches appear over the girl and then fade out. Then, black and white headshots of six different people are shown over a yellow and red background. A blue line with white cloud shapes that have text on them appears over a light blue background. Two men appear over the blue background with yellow shapes and the word \"PROUD.\" The girl is back, standing in the same room as before. An illustration of two faces in the water appears and then a group of 15 men is displayed. A red illustration of a battle appears with white and red text that reads \"BATTLE OF BATOCHE.\" The battle scene fades away and the blue line with white cloud shapes appears again. A black and white photo of six people with the word \"DEFEATED\" underneath appears and then the girl is shown again. She is followed by two animated, illustrated images of two boats, and a black and white image of seven people. Black bars cover the mouths of the people, and then the blue line with cloud shapes returns. The line moves and a black and white photo of two people appears and it reads \"SHAMED\" and then is replaced by another photo with seven people with the word \"SHAMED\" underneath. The young girl reappears. The blue line graphic is back again, and then it is followed by a black and white illustration of three people. Over a multi-colored background, a black and white image of 4 people appears and then the word \"HIDDEN.\"",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the little girl point to herself with both hands?",
+ "answer": "C",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. One.",
+ "D. Two."
+ ]
+ }
+ },
+ "kcQoGqah": {
+ "video_id": "kcQoGqah",
+ "video_duration": "128s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a shot of two men standing beneath a covered walkway. The man on the left wears a black t-shirt with a white graphic, a black hat, and has a beard. The man on the right wears a black t-shirt with a white graphic, a black hat, glasses, and has a long beard. The man on the left claps his hands together and smiles. The camera view changes to a first-person view from a motorcycle. The motorcycle is red, and the person's hands, along with tattooed forearms, are seen holding the handlebars. The motorcycle moves slowly, then accelerates into a parking lot, onto a city street, and then onto a highway. The motorcycle continues along the highway until the end of the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man twist the throttle in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Five times.",
+ "B. Three times.",
+ "C. Four times.",
+ "D. Six times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurs in the volume of the background music in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The volume remains unchanged.",
+ "B. The volume decreases from loud to soft, then fades out, and then increases from soft to loud.",
+ "C. The volume gradually increases.",
+ "D. The volume gradually decreases."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the motorcycle overtake the car?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the end of the video.",
+ "C. At the beginning of the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "JUVbJeqW": {
+ "video_id": "JUVbJeqW",
+ "video_duration": "435s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A bearded man sits barefoot on a reddish-brown mat on a fallen log in a green field. A copper thermos and a small, dark-colored teapot sit on the log beside him. He unwraps and breaks off a piece of a dark brown, rectangular object that resembles a brick of tea. He places the piece into the teapot and pours hot water over it from the thermos. He pours tea from the teapot into a small white cup. He sips the tea. He eats a small piece of something light-colored from a napkin. He pours more hot water into the teapot and then into his cup. He sips from the cup again and eats another small item from the napkin. He continues sipping tea and eating from the napkin. He gets up and walks away.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What change occurred in the man's mood during his first tea tasting?",
+ "answer": "C",
+ "candidates": [
+ "A. From helplessness to anger.",
+ "B. From sadness to anger.",
+ "C. From excitement to anger.",
+ "D. From excitement to joy."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man do with the tea cakes at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Used them as a picnic mat.",
+ "B. Used them as fertilizer for plants.",
+ "C. Kicked them like a ball.",
+ "D. Played frisbee with them."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What did the man do after drinking the first bowl of tea?",
+ "answer": "D",
+ "candidates": [
+ "A. Poured the tea on himself.",
+ "B. Drank all the tea.",
+ "C. Shared the tea with friends.",
+ "D. Spit the tea out."
+ ]
+ }
+ },
+ "JRoanSKY": {
+ "video_id": "JRoanSKY",
+ "video_duration": "104s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man sits in a chair in front of a desk, wearing a light blue tank top and gesturing with his hands. Behind him, a white desk and shelves display various items, including anime figures, posters, and what appears to be a movie poster. The video then cuts to a comedian on stage, wearing a black shirt and dark-colored pants. He holds a microphone and makes exaggerated gestures and expressions while facing the audience. The background is a deep blue with faint purple curtains visible. The words \"For Example\" are superimposed on the video. The video cuts back to the first man, who continues gesturing as if mimicking or reacting to the comedian's performance.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional change does the man wearing a blue sleeveless top experience at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. He goes from being excited to angry.",
+ "B. He goes from being excited to helpless.",
+ "C. He goes from being excited to thrilled.",
+ "D. He goes from being happy to excited."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where is the man wearing a black shirt in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. At the music festival.",
+ "B. In the opera house.",
+ "C. At the concert.",
+ "D. In the comedy club."
+ ]
+ }
+ },
+ "vOxFIOjn": {
+ "video_id": "vOxFIOjn",
+ "video_duration": "39s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black shirt and black pants is seen pulling a white sheet out of a bathroom. The person then enters the bathroom, leaving the sheet behind, and then leaves. A person with long blonde hair wearing a dark jacket is seen holding up a large bag in a bedroom. They are then seen moving about the room. A young child is seen close to the camera. They are wearing a yellow t-shirt, a red beaded necklace, and a backpack with various colors. The child then moves into a bedroom, where an adult is seen walking near a window. A person wearing a black and gray baseball cap and a checkered shirt is seen looking at the camera, and then moves out of the shot. Another person with short brown hair wearing a black shirt appears, puts their hands on their cheeks, and then puts their hands over their face. That same person then looks at the camera and moves their hands near their face. They then make a gesture with their palm towards their throat. The person wearing a black and gray baseball cap and a checkered shirt is seen again, in a different part of the room. The person with short brown hair wearing a black shirt is seen again, in a different part of the room. They are seen gesturing. The video shows a close-up shot of a cellphone on a counter, along with other objects. The person with short brown hair is then shown again, putting their hands over their eyes, and then gesturing near their face.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many backpacks are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. Four.",
+ "D. Five."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes does the woman wearing a black short-sleeved shirt experience at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From excited to angry.",
+ "B. From excited to calm.",
+ "C. From calm to excited.",
+ "D. From calm to angry."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order did the man in the video say the names? (a) Cody. (b) Olivia. (c) Boston.",
+ "answer": "C",
+ "candidates": [
+ "A. (b), (a), (c).",
+ "B. (c), (b), (a).",
+ "C. (a), (b), (c).",
+ "D. (a), (c), (b)."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the family planning to do next in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. They are going to buy coffee.",
+ "B. They are going to the zoo.",
+ "C. They are going to board a plane.",
+ "D. They are going to eat."
+ ]
+ }
+ },
+ "oFwAeqwH": {
+ "video_id": "oFwAeqwH",
+ "video_duration": "35s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A hand draws a padlock, then a tree. The padlock and tree are each encircled. A cartoon boy is drawn between them. The boy is thinking of the lock. The hand draws a cartoon girl between the boy and the tree. The girl is thinking of the tree. The drawings fade, and the hand begins to draw a new scene. A cartoon boy is drawn with an angry expression and a hand up in a \u201cstop\u201d gesture. A stack of boxes is drawn next to him. The boy is thinking of the boxes falling over and cartoon children laughing. A cartoon girl stands on the other side of the boxes, smiling, and looking at the boy. She is thinking of cartoon faces cheering. The hand draws another stack of boxes, slightly askew. Cartoon children are climbing and playing on the boxes. More cheering cartoon faces are visible in thought bubbles above the boxes. The angry boy remains unchanged. Finally, a triumphant cartoon figure appears with raised arms, as if in celebration.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "After completing the vault, what was the man's position relative to the vaulting horse?",
+ "answer": "D",
+ "candidates": [
+ "A. Above it.",
+ "B. Behind it.",
+ "C. Below it.",
+ "D. In front of it."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will the woman in the video successfully vault over the horse?",
+ "answer": "C",
+ "candidates": [
+ "A. No, she will not vault over it.",
+ "B. No, she will fall.",
+ "C. Yes, she will successfully vault over it.",
+ "D. No, she will get stuck on the vault."
+ ]
+ }
+ },
+ "RlonNikG": {
+ "video_id": "RlonNikG",
+ "video_duration": "124s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A cursor moves across a circle diagram with radial text and numbers. Then a light-green arrow starts at the center of the diagram and is stretched to the number 3. Then a line appears that is perpendicular to the arrow, and it is extended in both directions, so that it looks like an \"L\". Then the arrow is repositioned to show a long, straight green arrow and a short, black arrow. The black arrow rotates around the middle of the circle diagram, eventually intersecting the green arrow. Finally a black line is traced across the diagram so that it now looks like a \u201cT\u201d intersecting an \"L\" to form an incomplete \u201cH.\u201d",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many types of sounds are heard in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. One.",
+ "C. Four.",
+ "D. Three."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the man discussing in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The use of clocks.",
+ "B. Opinions on sunrise and sunset.",
+ "C. The use of the Heavenly Stems and Earthly Branches calendar system.",
+ "D. The use of the lunar calendar."
+ ]
+ }
+ },
+ "OZvFdfDY": {
+ "video_id": "OZvFdfDY",
+ "video_duration": "283s",
+ "duration": "4-6min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a screen capture of gameplay of what appears to be an underwater video game, along with a smaller video capture of a person wearing blue headphones. The game begins with a loading screen featuring text describing that a shark's energy is slowly depleting and that it must eat to stay alive. Below this, there are two progress bars, one in green and one in red, a pause button icon and text that reads \"PERSONAL HI-SCORE: 5550\" and \"BEST SURVIVAL TIME: 03:59\". The game then loads, and the screen shows an underwater scene. A gray shark is visible, along with various smaller fish, bubbles and what appears to be ancient ruins. The shark is then controlled by the player, with the shark eating smaller fish and earning points. Various text alerts show up on the screen when the shark eats specific fish or activates certain game modes such as \"Binge Bonus\", \"Sushi Roll\" or \"Gold Rush\". As gameplay continues the shark eats an increasing number of small fish and the surrounding environment changes to include more items such as people on jet skis, boats, and cages in the water. The points increase over time and game levels change. The small video capture in the upper right corner shows the person wearing blue headphones talking and reacting to the game as the gameplay progresses.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the shark evolve to level 6?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the entire video.",
+ "B. At the end of the video.",
+ "C. In the middle of the video.",
+ "D. At the beginning of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What was the final score achieved by the man?",
+ "answer": "C",
+ "candidates": [
+ "A. 29,855.",
+ "B. 30,000.",
+ "C. 34,215.",
+ "D. 31,560."
+ ]
+ }
+ },
+ "YHLsmzHR": {
+ "video_id": "YHLsmzHR",
+ "video_duration": "98s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Chemistry",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a white lab coat is in a lab. They turn on a faucet and fill a white plastic container with water, then they place a glass tube with bulbs in it into the water. They adjust the tubing connected to the tube. They put the tube into a beaker and turn on a faucet to fill it with water. They manipulate the tubing and then use a blue pipette to add a small amount of clear liquid into the clear tubing that is in the stand. The person then looks at a phone, where a stopwatch is running. The time increments. Then the person writes in a notebook on the counter. Then, they walk away. The person's hand then adjusts the glass tubing and then they place the tubing into the clear liquid in the plastic container. The tubing is left in the water.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the time required for the liquid meniscus to fall from the upper scale mark to the lower scale mark on the viscometer?",
+ "answer": "A",
+ "candidates": [
+ "A. 30.16 seconds.",
+ "B. 28.26 seconds.",
+ "C. 29.16 seconds.",
+ "D. 27.18 seconds."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the final state of the sample in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. There is no change.",
+ "B. It has increased in quantity.",
+ "C. It has completely evaporated.",
+ "D. It is reduced in amount."
+ ]
+ }
+ },
+ "VGOlxsVY": {
+ "video_id": "VGOlxsVY",
+ "video_duration": "56s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a man in a white shirt holding a microphone and a tablet. He is standing on a stage. There is a drum set in the background. The scene then changes to an overhead view of a city. The word \"RISE\" is superimposed on the image. Then, the video changes to a close up of a man with white eyes, against a black background. The video then shows a man wearing a red plaid shirt, holding a microphone, speaking to a group of people, one of whom is using a walker. The video then shows the same man speaking to a person who is using a walker at night, with people standing and walking in the background. Then the man is seen speaking to a crowd of people at night. There are many people waving their hands in the air. The scene then changes to a man and two woman hugging. The next clip shows a woman with white hair covering her face with her hand. Another woman in a pink dress is standing near her. The video then shows a woman pointing her finger at the camera. A man in a white shirt is standing behind her. Then the video shows a man speaking with his arms raised to a crowd. The video then switches to a view of a large crowd of people at night, with flood lights on. The scene then shifts to a view of a woman with a microphone in her hand, as well as three men. Then the video shows a man speaking to a crowd at night, while people watch him. The video then shows a very large crowd of people at night, illuminated by flood lights. There are two more similar views of large crowds at night. The video then changes to a stage with a large sign that says \"The Stand.\" People are raising their hands toward the stage. Then the video shows a crowd at night with many people raising their hands and clapping. The next clip shows two men in a field at night, one of whom is using a cane. There is another view of a large crowd at night with a stage lit up. Then the video switches to a close up of a woman with her hands raised and another two people standing around her, all with their arms raised. Next is a close up of a man with his hands raised, and a woman with her hands raised behind him. The video then changes to a view of a crowd with umbrellas, and a stage in the background. In another similar view, people in the crowd have their arms raised. Then the video shows a close up of a man in a red shirt with his eyes closed and his hands raised. Then, there is a group of people standing with their hands raised, and a few people kneeling down. There are a couple of digital clocks on a table in front of them. The video then shows a man on stage addressing a crowd at night, with his fist raised in the air. Then the video switches to an overhead view of a large crowd raising their hands at night. Another shot shows the crowd at night, with a stage in the background. The next shot is similar, but from further away. The video then shows the words \"FESTIVAL OF MIRACLES\" superimposed on a dark and cloudy background. The text is repeated, but with a different font. The text then begins to break apart and float into the distance. The video then shows the text \"arbminstries.com.\" The text \"COMING TO,\" \"LOS ANGELES\" \"FESTIVAL OF MIRACLES CRUSADE\" \"NOVEMBER 9th - 15th 2020\" \"REGISTER NOW ANKITRAMBABU.ORG/EVENTS\" appears next against a black background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the man give a speech?",
+ "answer": "D",
+ "candidates": [
+ "A. At the conclusion of the video.",
+ "B. In the middle of the video.",
+ "C. At the start of the video.",
+ "D. During the entire video."
+ ]
+ }
+ },
+ "LOKPlsoQ": {
+ "video_id": "LOKPlsoQ",
+ "video_duration": "240s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Fitness",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows several individuals participating in what appears to be a weightlifting competition. The video begins with a person squatting a barbell with large weights on each side, with several spotters standing around them. Text overlays the video, indicating the weight being lifted, in kilograms, followed by lbs. Next, the video shows the same person bench pressing weights, with spotters near them. Then, the same person is seen lifting weights off the floor, once again with spotters. Other people are shown performing bench presses with spotters, as well. The person with blonde hair and black shirt who lifted weights is shown drinking from a plastic cup, and carrying a bag, followed by their award ceremony. At the end they are shown holding a trophy. Text overlays the video repeatedly showing various weight amounts in kilograms and lbs. The people in the video are wearing black, orange, and white clothing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the blonde man participate in the squat weightlifting competition?",
+ "answer": "C",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. In the middle of the video.",
+ "C. At the beginning of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What happened to the barbell during the squat weightlifting competition in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. The weight gradually becomes heavier.",
+ "B. The weight remains unchanged.",
+ "C. The weight changes from light to heavy and back to light.",
+ "D. The weight becomes lighter."
+ ]
+ }
+ },
+ "zlfuNBPt": {
+ "video_id": "zlfuNBPt",
+ "video_duration": "114s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a soccer field with several people in it. The field is green and has white lines on it. There are also yellow lines that create four rectangular areas in each corner of the field. There is a white circle in the middle of the field. The people are wearing either white or yellow tops. The people move around the field, some walking and others running. There is also a soccer ball moving on the field. Text appears at the top of the screen at various points, which says \"Four Corners Possession game\". Text also appears at the bottom of the screen throughout, which explains the actions and intentions of the players. These texts include \u201cGood support angle & distance,\u201d \u201cUse spare target player,\u201d \u201cProvide target to switch play,\u201d \u201cSwitch play to corner,\u201d \u201cMake angle to support,\u201d \u201cReceive with open body,\u201d \u201cSwitch play to corner player,\u201d \u201cFind space to receive the ball,\u201d \u201cPass simply to spare player,\u201d \u201cTurn out of tight area,\u201d \u201cPass simply and accurately,\u201d and \u201cTurn into space.\u201d The background appears to be a wall in blue and white, trees and buildings. In a few parts of the video, a title page appears, which includes information about the video, such as \u201cFind a Friend - Progression 4\u201d or a graphic with a soccer ball and text that says \u201cGrassRootsCoaching.com\u201d.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the six letters that appear below the subtitle when the player in yellow kicks the ball in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Player.",
+ "B. Use.",
+ "C. Target.",
+ "D. Spare."
+ ]
+ }
+ },
+ "TRmFtiSv": {
+ "video_id": "TRmFtiSv",
+ "video_duration": "167s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a black suitcase with papers and a prescription bottle placed on top. The video then transitions to a news anchor sitting in front of a blue screen with a photo of a man in the upper right. After this, several black and white images of crowds of people holding signs are shown. Then, police officers are shown pushing people down and swinging batons. The video cuts to a black man standing outside of a building, who is then followed by a cut to a scene inside a church where a woman is interviewing two men. There are close up shots of the three people in this interview, and then a close up shot of a black man's face is shown. There are more black and white photographs shown of people, followed by the three people in the church being interviewed, interspersed with close up shots of the men, and then of the woman. The video also shows several black and white pictures of the man seen in the beginning in other scenes as well as with his family. There are also several scenes of one of the two men that was interviewed at the church, standing outside holding a sign. The video concludes with a news anchor sitting in front of a blue screen with a banner at the bottom.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What are the people in the video, who are holding white cardboard signs, doing?",
+ "answer": "A",
+ "candidates": [
+ "A. They are participating in a protest.",
+ "B. They are participating in an election campaign.",
+ "C. They are participating in a celebration.",
+ "D. They are participating in a product promotion event."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video show a car in motion?",
+ "answer": "C",
+ "candidates": [
+ "A. None of the above.",
+ "B. Yes.",
+ "C. No.",
+ "D. I am not sure."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual action did the bald man perform in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The right hand points to the cheek, and the left hand touches the railing.",
+ "B. The left hand swings outward while the right hand points to the top of the head.",
+ "C. The right hand swings outward while the left hand points to the top of the head.",
+ "D. The right hand reaches out to touch the railing, and the left hand points to the cheek."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is written on the white board hanging around the neck of the man in the red shirt at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I AM A MAN.",
+ "B. I AM A MAN.",
+ "C. I AM A MN.",
+ "D. I AM A MA."
+ ]
+ }
+ },
+ "aVANFBEl": {
+ "video_id": "aVANFBEl",
+ "video_duration": "38s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Several race cars are on a track. Most of the cars are dark blue and red and white. They are racing one after the other along the track. The camera angle changes from an over the shoulder view, to a more bird\u2019s-eye view, to an inside the cockpit view. There is a graphic showing lap numbers and competitor names throughout the video. The cars are driving on a grey and tan paved track. Some cars have headlights turned on. There is a barrier on the side of the track that alternates red and white stripes. At times there are also green stripes. There is an area with brown grass next to the race track.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does car number 37 successfully overtake the leading car?",
+ "answer": "A",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. At both the beginning and end of the video.",
+ "C. At the beginning of the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "CYsCvMVu": {
+ "video_id": "CYsCvMVu",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A first-person view from a person riding a dirt bike is shown. The person is wearing a red and black jacket with white graphics and light-colored gloves. The bike is primarily white with black and light-blue accents. The setting appears to be a dirt track surrounded by tall green trees. There are various red and white markers on the ground. Text appears at different times, stating \"OUTSIDE LINE,\" \"INSIDE LINE,\" \"FULL SPEED OUTSIDE\" and \"FULL SPEED INSIDE.\"",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occurred to the motorcycle while it was moving in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A dive.",
+ "B. Swaying left and right.",
+ "C. A sharp turn.",
+ "D. A bounce."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the red rectangular object located on the motorcycle in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. At the back.",
+ "B. Below.",
+ "C. In front.",
+ "D. On both sides."
+ ]
+ }
+ },
+ "uARzgfjZ": {
+ "video_id": "uARzgfjZ",
+ "video_duration": "176s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two shirtless men are in a wrestling ring, while a man in a striped shirt and black pants acts as referee. One man has black pants with red markings, while the other has white pants with red and blue markings. The man with black pants uses a series of wrestling moves against the man in white pants, throwing him to the ground and kicking him. The man in white pants then climbs a turnbuckle, jumps on the man in black pants, and hits him with more wrestling moves. The man in white pants then pins the man in black pants to the ground, and the referee counts to 3 before declaring him the winner. The winner then receives a gold belt and raises his arms in the air. The camera zooms out and shows the crowd in the arena before focusing again on the winner, who holds the belt in the air with one arm, while he walks around the ring with the belt and kneels.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What did the man in black pants do after elbowing the man in white pants in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Delivered a straight punch attack.",
+ "B. Did a flip and press.",
+ "C. Performed a side somersault press.",
+ "D. Executed a shoulder throw."
+ ]
+ }
+ },
+ "NZilfWOK": {
+ "video_id": "NZilfWOK",
+ "video_duration": "103s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with an overview of a crowded arena. The camera zooms in to show a side-by-side comparison of two men, one with dark hair and the other with short, dark hair. The image is labeled \"TALE OF THE TAPE\" and \"SHOWTIME CHAMPIONSHIP BOXING\". Below each of the men's photos are the names \"PAULIE MALIGNAGGI\" and \"ADRIEN BRONER\", respectively. The information between the two photos includes the categories AGE, HEIGHT, WEIGHT, and REACH. A yellow circle highlights the number \"32\" under \"PAULIE MALIGNAGGI\" in the age category. Next, the camera shows a wide shot of the ring, located in the center of a packed arena. The two men from the previous shot are in the boxing ring. The man on the left wears white shorts with \"PAULIE\" printed on them, with red gloves and white shoes with blue accents. The man on the right wears gold shorts, black gloves, and green shoes. A man in a white shirt with a black bow tie stands in the ring between them. The two men begin to fight, throwing punches at each other. The fight continues, with the two men moving around the ring, punching, and at times clinching. The fight is shown at different angles, and with changes in the distance of the shots. There is a large crowd of people surrounding the boxing ring, and the word \"CORONA\" is visible in several places around the edge of the ring. The time remaining in the round and the round number are displayed at the bottom of the screen. At one point the display says \"ROUND 1 of 12\", then changes to \"ROUND 2 of 12\", and then \"ROUND 3 of 12\". The video ends with the two men fighting in the boxing ring.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point does the action of the man in red gloves being pushed against the ropes by the man in black gloves occur?",
+ "answer": "D",
+ "candidates": [
+ "A. It does not occur.",
+ "B. At the beginning of the video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will the man wearing black gloves win the match?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No.",
+ "C. Yes."
+ ]
+ }
+ },
+ "ztiZcOox": {
+ "video_id": "ztiZcOox",
+ "video_duration": "372s",
+ "duration": "6-8min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows gameplay from a first-person perspective in a shooter game. Initially, a screen displays the players on two teams, with names and character icons. The \"Kill Confirmed\" text is at the top of this screen. The user is on the BLACK OPS team. The video then shows a class selection screen with various options such as \"Custom 2,\" \"Custom 3,\" \"Custom 4,\" \"Custom 5,\" and other premade options such as \"Vanguard,\" \"Breach,\" \"Sentinel,\" \"Overwatch\" and \"Sharpshooter.\" A gun named \"Vesper\" is shown in the custom loadout selection. Then, it shows another class selection screen, this time with a gun called \"VMP\" on \"Custom 4.\" The next part of the video shows a scoreboard, with both teams currently at zero points. There is a countdown timer on screen that starts at 8 seconds. Two player characters are standing on a patterned tiled outdoor space in what appears to be a modern city. The perspective changes to the first-person view of a player character. The character is holding a gun. This character runs and moves through various outdoor areas and indoors. There are other player characters that move and are shot at in the video. The characters have different colored clothing. One player is wearing white armor. There are buses, cars, and structures visible. The game has a kill count and score display on the screen. There is also a radar display. The video shows indoor scenes where there are couches, chairs, shelves, counters and plants. There are other items such as pictures, tables and lamps in the interiors. At one point there is an orange glow on the scene. The video then shows an outdoor area with green vegetation and different colored trees. The video then shows a long red graphic moving on the screen. Player characters are seen running, jumping, and falling. There are also glowing effects shown in the video as the player character fires their weapon. A scoreboard appears at various times in the video with scores and names. There are also short pop-up texts about player performance such as \"Triple Kill,\" \"Revenge,\" \"Double Kill,\" \"UAV Assist,\" \"Kinetic Armor Ready,\" \"R.A.P.S. Ready,\" \"Wraith Ready,\" and \"Double Time.\" The word \"Victory\" is displayed on a screen with a \"100\" score for BLACK OPS versus the rival team's score of \"42.\" The screen shows a match bonus score.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a stationary car in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. Cannot be determined."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the final score of the two gaming teams at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 100:55.",
+ "B. 100:50.",
+ "C. 100:51.",
+ "D. 100:53."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the man open the map?",
+ "answer": "B",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. In the middle of the video.",
+ "C. Throughout the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "lxwKCyMv": {
+ "video_id": "lxwKCyMv",
+ "video_duration": "34s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man rides a blue electric fat-tire bike on a paved path overlooking a river. He rides onto a pedestrian bridge with red and green lanes, joining other cyclists and walkers. The video shows a close-up of the man's hand twisting the throttle. Next, the man rides the bike on a paved road lined by trees and a low stone wall. The perspective shifts to an aerial view as he continues cycling alongside the wall, a river visible on the other side. The video shows the bike's rear wheel with a 7-speed Shimano system. The man resumes riding on the paved path along the river, passing parked cars. Finally, a close-up showcases the bike's handlebars and hydraulic disc brakes.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What bicycle component is shown in the final scene of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Hydraulic disc brake system.",
+ "B. Freewheel.",
+ "C. Cyclocomputer.",
+ "D. Wheelset."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What allows the bicycle in the video to achieve smooth and easy riding solely through pedaling?",
+ "answer": "C",
+ "candidates": [
+ "A. It uses a hydraulic disc brake system.",
+ "B. It employs a gear shifting system.",
+ "C. It incorporates a Shimano seven-speed gear system.",
+ "D. It utilizes a five-speed gear system."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video show a stationary bicycle?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. Not applicable.",
+ "D. Cannot be determined."
+ ]
+ }
+ },
+ "iBDaeoKE": {
+ "video_id": "iBDaeoKE",
+ "video_duration": "309s",
+ "duration": "4-6min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A group of people are standing indoors in what appears to be a rehearsal or performance space. There is a man at the front on the left with an acoustic guitar. To his right, a woman plays what appears to be an upright bass. Behind them are rows of people standing, some of whom are singing. Some people are holding pieces of paper. At one point, the woman holding the upright bass raises the neck of the instrument. Some of the people toward the back are wearing hats and other colorful garments. There are 2 microphones in front of the man with the guitar. In later segments, a man in a light blue shirt steps forward to sing at a microphone next to the guitarist, then another woman sings into the microphone in front of the bassist. Then, a younger woman with long hair steps forward to sing at a microphone near the man with the guitar. During the video, the people behind the man with the guitar can be seen raising their voices in song, and a woman is seen holding a sign. This sign changes several times. The man with the guitar can be seen making animated facial expressions. There is a second man that can be seen clapping and raising his hands.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion does the song performed in the video convey?",
+ "answer": "A",
+ "candidates": [
+ "A. Anger toward government actions.",
+ "B. Support and approval of government actions.",
+ "C. Prayers and hopes for the government.",
+ "D. Resignation regarding government actions."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many people gradually joined the chorus in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Five.",
+ "B. Four.",
+ "C. Six.",
+ "D. Three."
+ ]
+ }
+ },
+ "ORptRjJV": {
+ "video_id": "ORptRjJV",
+ "video_duration": "191s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A muscular, shirtless character with blond hair, named Alex, performs a victory animation. He raises his arms and flexes, then performs a powerful slam attack. The screen displays \"Player 1 Win\" and \"Airhead.\" The opponent is \"Entember,\" and the score is 1 win for Airhead, 0 for Entember. A message appears: \"Your opponent is currently in a menu. Please wait...\" Round 1 starts. Alex, wearing khaki cargo pants and gloves, faces a smaller, black-clad opponent with a masked face and a pointed headpiece, possibly Viewtiful Joe. They fight, with Viewtiful Joe performing a series of fast kicks and flips, some of which connect with Alex. Alex counters with powerful punches and kicks. Alex lands a hit, knocking Viewtiful Joe to the ground. Viewtiful Joe performs various attacks involving white ribbons or streams trailing from his limbs. He lands several kicks, but Alex absorbs many of the blows. Alex delivers a final kick, knocking out Viewtiful Joe. Round 2 begins. The \"FIGHT!\" sequence initiates. The fight is similar to Round 1, with fast attacks from Viewtiful Joe and heavy hits from Alex. Alex hits Viewtiful Joe and executes a rapid series of attacks, which knock Viewtiful Joe down. Viewtiful Joe attempts several kicks and spinning attacks, but Alex counters, knocking Viewtiful Joe to the ground. He then delivers a final blow, resulting in a K.O. The screen displays \"Alex Wins.\" Alex performs his victory animation. He raises his arms, roars, flexes, then performs the slam attack again. The result screen shows \"Player 1 Win,\" \"Airhead\" as the victor, with 2 wins, 0 losses.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What fighting technique does the man wearing brown pants use to defeat his opponent at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A cross lock.",
+ "B. A low kick.",
+ "C. An elbow strike.",
+ "D. A straight punch."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the last four letters displayed on the screen following the white buffering animation in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Your.",
+ "B. Wait.",
+ "C. Oppo.",
+ "D. Menu."
+ ]
+ }
+ },
+ "tDjkSGlP": {
+ "video_id": "tDjkSGlP",
+ "video_duration": "103s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a series of black and white images. The first few images display a street with several cars and many signs, including signs for 'Williams Drugs', 'Liquor', and 'Esquire'. The next few images show dilapidated houses with debris scattered on the ground. The following images show an outhouse and refuse. After that a large white sign with the text \"WE WANT WHITE TENANTS IN OUR WHITE COMMUNITY\" is shown, with two American flags mounted on the top. The next few images show a person entering a door with clothes hanging nearby and debris strewn on the ground. The video then shows a street with many cars, and an explosion in the middle of the street with fire and smoke billowing out. The next image shows several men standing and two men fighting, or otherwise engaged in a physical interaction. The following set of images depict many people walking in a street and one man with glasses in the foreground. The next image displays construction equipment in a large, flattened area filled with rubble. The video then depicts an open space with a man and wheelbarrow on a road. The following two images show empty lots with power lines in a grid pattern, and several cars on the roads. The following images show men around a map and men around a large image. After this a man smoking a cigar sits in a chair. This is followed by a map with a highlighted area. The video then displays a sign that says \"Lafayette Redevelopment Project\" next to damaged houses. The second to last image is a detailed architectural design. Finally, the last images show people holding signs that say \"FREEDOM IS EQUAL HOUSING TOO\" and a street with billowing smoke and fire, and then people in the street with damage and debris.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why are the protesters in the video demonstrating?",
+ "answer": "B",
+ "candidates": [
+ "A. Because they oppose the reconstruction of the area.",
+ "B. Because they were forcibly removed from the Black neighborhood.",
+ "C. Because of racial discrimination.",
+ "D. Because they are in conflict with white individuals."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order of events depicted in the video? (a) The election of the mayor, (b) a suicide incident and revelations of dishonorable activities, (c) riots erupt, (d) an investigation into gambling with increased racial tensions, (e) blacks launch protests demanding equal housing, (f) Lafayette Park and Lafayette Towers are built, (g) construction of Lafayette Park and Lafayette Towers is completed, (h) neighborhood demolition takes place.",
+ "answer": "B",
+ "candidates": [
+ "A. (c), (b), (a), (d), (h), (f), (g), (e).",
+ "B. (b), (d), (c), (a), (h), (f), (g), (e).",
+ "C. (b), (a), (h), (d), (e), (f), (g), (c).",
+ "D. (a), (b), (c), (d), (e), (f), (g), (h)."
+ ]
+ }
+ },
+ "HfpWAlmq": {
+ "video_id": "HfpWAlmq",
+ "video_duration": "83s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A wheelbarrow filled with dirt is in the foreground. In the background, a trench has been dug in the dirt, and concrete blocks and buckets are scattered nearby. Concrete blocks have been arranged along the trench. A steel tamping tool is standing upright. More concrete blocks have been placed, lining the trench. Wooden pallets have been arranged to make walls within the area framed by the concrete blocks. A tarp has been draped over a frame supported by white pipes. The completed structure has walls made of pallets and a white and yellow tarp as the roof. A yellow ladder is inside. A bucket and other tools are nearby.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the color of the waterproof cloth shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Light red.",
+ "B. Light purple.",
+ "C. Light yellow.",
+ "D. Light brown."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many tires are behind the concrete blocks at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Three.",
+ "D. Four."
+ ]
+ }
+ },
+ "CLULzPOm": {
+ "video_id": "CLULzPOm",
+ "video_duration": "64s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Art Exhibitions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person in a gray dress is shown wading in water between two gray walls. A seagull is shown with its wing extended, then two seagulls are shown with one opening its mouth. A person in a white coat is shown holding a yellow structure on the side of a building. Subtitles read \"My body is a conduit to connect different places\". A person in a black dress is shown in front of a building. Subtitles read \"My body is a conduit to connect different places\". Then subtitles read \"I am intentionally foreign\". A person in a white dress is shown holding up a gray garment in a dark space. Subtitles read \"because I didn't come here to assimilate, I came here to work, to find who I am and what is my positionality to the viewers.\" A person in a blue sweater is shown speaking in front of a window. Subtitles read \"There's a certain poetic oddness\". A person in a dark coat walks on a street. Subtitles read \"that I claim\". A person in a blue sweater is shown speaking in front of a window. Subtitles read \"and the bottom line, that's my aesthetic.\" A person in a white coat carries a bouquet of plants. Subtitles read \"Each piece has its own reason to have come to the world.\" A person in a light coat is shown raising their arms on a bridge. Subtitles read \"Each piece has its own reason to have come to the world, and that's what I want to share.\" Subtitles then read \"Otake's First Museum Solo Exhibition\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order of events in the video: a woman bending by a drain and birds playing in the sky?",
+ "answer": "A",
+ "candidates": [
+ "A. A woman bends by the drain first, then birds play in the sky.",
+ "B. Birds play in the sky first, then a woman bends by the drain.",
+ "C. Only a woman bending by the drain appears; birds playing in the sky do not appear.",
+ "D. The two events occur simultaneously."
+ ]
+ }
+ },
+ "sAtaReHj": {
+ "video_id": "sAtaReHj",
+ "video_duration": "93s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a large crowd, many of whom are holding up phones to record. A woman wearing a light green shirt covers her mouth. Another woman has sparkly makeup on her cheek. Two women are taking a selfie with a phone. A woman with a microphone is adjusting a small child's hair on a stage. She is next to a woman wearing a white shirt and another woman wearing a black and white floral dress. A man in a blue shirt puts his arm around a woman with a microphone. A woman with long dark hair is talking into a microphone, with pink and teal balloons behind her. Another woman is wiping her eyes. There are also images of a woman putting on makeup, followed by close-up images of an eye shadow palette. A woman\u2019s fingers touch makeup on an eye shadow palette. Two women are walking in front of a balloon archway. A woman is holding a palette of makeup. A man and a woman are holding microphones and speaking to two children on a stage. One of the children is wearing a yellow skirt, and another a blue skirt. The crowd is seen holding up their phones. Two children on the stage are moving their arms. The two women are seen walking, and the woman on the left gives an ok sign with her hand. Two women are seen laughing in front of a digital display. A woman with a short bob is speaking. Two women take a photo with a phone in front of a balloon archway. A woman with long dark hair is talking directly to the camera, while another is off to the side, and then another woman wearing a white top is seen. Three women are seen laughing, one is holding a phone, and another one holds her hand up. Another woman is seen talking, then she is seen with two other women, laughing. There are several brief images of people smiling and laughing while at the event. Two people are seen hugging on stage in front of pink and teal balloons, and one woman is wearing a blue dress. Two other women are seen posing, with photos in the background. The woman from the opening is in a room with a woman in the background. There is also a brief image of the woman in the room. The woman is seen talking with the caption \"I enjoy the videos especially the kids' because they are so goofy.\" The video ends with another short clip of a child raising her arms on stage.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Does the volume of the background music in the video change?",
+ "answer": "D",
+ "candidates": [
+ "A. No, there is no change.",
+ "B. No, there is no background music.",
+ "C. Yes, it starts to get louder when the human voice appears.",
+ "D. Yes, it starts to get quieter when the human voice appears."
+ ]
+ }
+ },
+ "bYlwRrXe": {
+ "video_id": "bYlwRrXe",
+ "video_duration": "235s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A white helicopter is shown with its doors open. A man wearing a black shirt and sunglasses is standing on a tarmac. Another person approaches wearing a white shirt over a tan top and light green shorts, holding a green bottle. The man raises his hands in the air, smiling. The person in the shorts touches their hair and smiles. A third person enters the shot, wearing a white shirt and a patterned hat, and points toward the helicopter. All three people are then seen inside the helicopter. The two passengers are wearing headsets with microphones. The pilot, wearing a black shirt, is in the front left seat. The video shows aerial views of a city, including a mountain with a large white sign reading \"Hollywood,\" and a white building with a dome. The three people in the helicopter are shown talking and laughing. One passenger touches the face and shoulder of the other passenger. One of the passengers holds up a phone and speaks into it. The view switches back to the tarmac. One of the passengers is shown holding a photo in a silver frame. The view switches again to the inside of a building, where the same person is holding the frame and then a piece of paper. The three people are again shown in the building with the pilot offering a water bottle to one of the passengers. Finally, two of the passengers are shown outside at dusk standing in front of a vehicle. The people are laughing and talking to each other. Text boxes with social media handles appear below each person. A \"subscribe\" button with a bell icon also appears. A brief purple screen with other social media icons and \"watch next\" labels appears as the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where do the two people land after descending in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A lounge.",
+ "B. A jet bridge.",
+ "C. A helipad.",
+ "D. A terminal."
+ ]
+ }
+ },
+ "dEPpziVY": {
+ "video_id": "dEPpziVY",
+ "video_duration": "345s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A rotating globe zooms in on Sudan. A cityscape with trees and a mosque is shown. The word \"Sudan\" is displayed. The video transitions to a close-up of a woman's face. She has brown hair, green eyes, and a light complexion. A second woman's face appears, partially overlapping the first. This woman has darker skin, dark lipstick, and large hoop earrings. She's wearing a patterned top and has her fist raised. A third woman's face is shown. She has long, straight, dark hair, is wearing a white top, and is smiling. The video returns to the rotating globe, which now zooms in on Morocco. A sprawling cityscape with buildings of varying heights and colors is revealed. The word \"Morocco\" is displayed. A woman's face is partially visible against the cityscape. She has curly brown hair and large gold earrings. Another woman's face, with dark makeup and wavy brown hair, appears briefly. The map transitions to South Africa, zooming in on a cityscape with lit-up buildings and a tall tower at dusk. The word \"South Africa\" is displayed. A woman's face appears briefly. She has dark hair and eyes, and is wearing hoop earrings. A woman wearing a bikini top is shown in front of a rocky beach and ocean. The video transitions back to the globe, which moves towards Ghana. A busy roundabout in a city with cars, buses, and motorcycles driving around it is shown. A woman's face appears, partially overlapping the cityscape. She is wearing glasses and an orange and green patterned top. Another woman with a large afro appears briefly. A woman is holding an award. She is wearing black slacks, a black blazer, a black shirt, and a gold belt. A woman with short hair and a gold coin necklace is displayed. The rotating globe moves towards Nigeria. Coastal towns, cities, and beaches are shown. A woman is topless, covering her breasts with her arms and wearing a \"love\" necklace. The word \"Nigeria\" is displayed. The globe rotates towards Angola. A coastal town with sandy-colored houses and a large bay is shown. The word \"Angola\" is displayed. A woman with long, straight dark hair is displayed. She is smiling. A woman is lying on some rocks by the beach. She is wearing a denim crop top and bikini bottoms. A nude blonde woman is seated with her back to the viewer, with a teddy bear sitting behind her. A woman in an orange two-piece set with long sleeves is facing away from the camera. The globe moves to Somalia. A crowded beach with cliffs and lush greenery is displayed. The word \"Somalia\" is displayed. Three women are standing close together, laughing and smiling. The one on the left is wearing a silver dress, the one in the middle has a light-colored striped top and skirt, and the one on the right is wearing a white dress. The globe rotates to Eritrea. A city at night is shown, with lights illuminating the streets and buildings. The word \"Eritrea\" is displayed. A woman with long, curly brown hair and wearing a strapless top is on a beach. A group of women in green jackets and white shirts are smiling on an airplane. A woman with dark hair and dark eyes appears, wearing dark lipstick. Then the video returns to the globe. Images are interspersed with images of African women. At the end of the video, a globe with a crown on top rotates with fire in the background. The words \"Passport Kings\", \"Don't forget, thumbs up and subscribe now!\" are shown. Screenshots of flight prices from Google Flights are displayed. The words \"Travel Hacking has been around forever for those that knew how!\" are overlaid on a video of an airplane wing.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, what is the order in which the man presents the women from these ten countries? (a) Nigeria. (b) Morocco. (c) South Africa. (d) Somalia. (e) Sudan. (f) Ghana. (g) Angola. (h) Eritrea. (i) Ethiopia. (j) Cape Verde.",
+ "answer": "C",
+ "candidates": [
+ "A. (h)(i)(a)(c)(b)(j)(d)(f)(g)(e).",
+ "B. (d)(b)(j)(h)(i)(a)(c)(e)(f)(g).",
+ "C. (e)(b)(c)(f)(a)(g)(d)(h)(j)(i).",
+ "D. (g)(a)(h)(b)(i)(f)(d)(c)(j)(e)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, what is the geographical position of Nigeria relative to South Africa?",
+ "answer": "D",
+ "candidates": [
+ "A. Nigeria is located west of South Africa.",
+ "B. Nigeria is located north of South Africa.",
+ "C. Nigeria is located northeast of South Africa.",
+ "D. Nigeria is located northwest of South Africa."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many countries' women are featured in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Nine.",
+ "B. Ten.",
+ "C. Eleven.",
+ "D. Eight."
+ ]
+ }
+ },
+ "BNIYmdvo": {
+ "video_id": "BNIYmdvo",
+ "video_duration": "87s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a red fire truck driving down a road, with several people riding in it. The camera focuses on the group of people as the fire truck is driving towards Cinderella's castle. The view changes to inside the car and Mickey Mouse can be seen in the front seat, waving to those around him. The view changes to a green double-decker bus, with many people riding inside waving their arms. The view changes to two people, a man with a blue button down shirt and a woman with a blue and white patterned shirt. They are standing together waving. The camera goes back to the green double-decker bus with some of the people inside waving at the camera. The camera shows Minnie Mouse and a woman in a pink shirt and white shorts walking up to the bus. The camera changes view and shows the woman in the blue and white patterned shirt waving. The camera shifts to show the same woman standing with the man from before as a photographer is taking pictures behind them. The view changes to show four people inside the red fire truck and the view then shifts again to a group of people standing in front of Cinderella's castle, with two men taking pictures. The camera changes to show a woman walking, while Mickey Mouse and some young children pose in front of a red fire truck. The view switches to two young children and a woman in a blue and white patterned shirt hugging in front of a red fire truck, while a woman in a black shirt and sunglasses and a woman in a red and white shirt pose in the background. A group of people can then be seen hugging in front of Cinderella's castle with Mickey Mouse standing in the background. The camera then focuses on one of the children and the woman in the blue and white patterned shirt holding hands. The camera shows the woman and children walking, with Mickey Mouse in the background. The camera shows a woman hugging two children in front of Cinderella's castle while a person takes pictures. The view shows another set of people hugging and the view shifts to a woman in a pink shirt hugging a woman in a blue and white shirt. The next shot is of some people looking up and hugging each other, and then the camera focuses on a gray mouse statue. The camera then focuses on two young girls and a woman in a blue and white patterned dress looking up at something. The view then changes to show the same two young girls riding an elephant ride while two other girls watch. The view shifts again to show a man, woman, and young child riding in an elephant. The man waves at the camera. The elephant ride is shown from far away, with people riding it. The camera zooms into one of the elephants on the ride as a group of people rides. Then the camera shifts to the couple and child inside their own elephant, followed by a final shot of the trio on the elephant ride. The video ends with the screen fading to black.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "Who in the video is most likely the child of the woman wearing a blue floral dress?",
+ "answer": "D",
+ "candidates": [
+ "A. A little girl wearing red.",
+ "B. A little girl wearing a blue dress.",
+ "C. A little boy wearing blue.",
+ "D. A little girl wearing a pink dress."
+ ]
+ }
+ },
+ "ZvoIAWnB": {
+ "video_id": "ZvoIAWnB",
+ "video_duration": "78s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Chemistry",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A slide with the word \"Calibration\" and mathematical equations is shown. This slide is then partially obscured and another scene is shown. Two glass bottles with thermometers and blue labels reading \u201c1.0 M HCl\u201d and \u201c1.0 M NaOH\u201d are shown, along with a black cylindrical beaker, all on top of a white paper. The video then highlights the labels \"T initial\" near the base of each bottle, and then the labels disappear. A gloved hand reaches into the frame and removes the thermometer from the bottle on the right. This action is repeated with the thermometer in the bottle on the left. The right bottle is then removed. The left bottle is then removed and its contents are poured into the cylindrical black beaker. A thermometer in the black beaker, and a clear glass bottle on its right are shown, then the thermometer is moved to focus on the scale, which begins at 20 and reaches 40, where the liquid level inside is below the 25 mark. The thermometer moves out of the frame.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the calorimeter relative to the glass bottle containing acid in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Behind it.",
+ "B. In front of it.",
+ "C. On the right side.",
+ "D. On the left side."
+ ]
+ }
+ },
+ "pQoiSoQw": {
+ "video_id": "pQoiSoQw",
+ "video_duration": "206s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a loading symbol in the center of a black screen, with a video player at the bottom. The scene then changes to a wide shot of an auditorium with green benches. Many people are seated, and some are standing. The next shot shows a man in a grey suit and red tie, standing in the aisle, speaking to the camera. He is gesturing with his hands. This is followed by a closer shot of the same man. The scene changes to a shot of three men in suits sitting on a green bench. The man in the middle has a blue tie. The scene changes to a closeup of the man in the grey suit with the red tie speaking. This is followed by a wide shot of the auditorium again, with the same man standing in the aisle. The scene then changes to a man in a black robe with a white shirt and multicolored tie speaking to the camera, gesturing with his right hand. The scene cuts back and forth between shots of the man in the black robe, the man in the gray suit, and various wide shots of the auditorium. Then, a man in a gray suit is shown standing and speaking in front of a section of seated people. The video concludes with more alternating shots of the man in the black robe speaking, the man in the grey suit speaking, and a wide shot of the auditorium with the people seated.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the boy wearing a red striped tie leave in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Throughout the video.",
+ "C. At the start of the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "mKFZqIgr": {
+ "video_id": "mKFZqIgr",
+ "video_duration": "105s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person wearing a black hooded sweatshirt with white stripes on the sleeves, red pants, white shoes, and a black baseball cap is holding a black camera with a fuzzy microphone attached to the top and facing the camera while in a bathroom. They then walk through a glass door into a dimly lit room. A Christmas tree is on the left side of the shot. The person walks into a gym and begins doing squats with their arms straight out in front of them. The person then sits down on the floor in a modified pose. The person then puts a black padded wrap around the center of a barbell. The person then uses the wrapped barbell to do a hip thrust exercise on a bench.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Is a dumbbell shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No, there is not.",
+ "C. Yes, there is."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the woman wearing a black hat doing while lying sideways on the ground in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Doing hip thrusts.",
+ "B. Performing hamstring curls.",
+ "C. Engaging in glute activation exercises.",
+ "D. Taking a break."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "At the beginning of the video, where is the lady wearing a black hat located?",
+ "answer": "B",
+ "candidates": [
+ "A. Staircase.",
+ "B. Bathroom.",
+ "C. Hall.",
+ "D. Gym."
+ ]
+ }
+ },
+ "DhzComZN": {
+ "video_id": "DhzComZN",
+ "video_duration": "71s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a red and white jacket is speaking. Several people with blond and brown hair appear around him. The man gestures with his hands. The scene cuts to a wide shot of a green field with yellow cones and soccer balls on it. A group of people stand together, and one wears a green vest. The man in red and white gestures at the group and walks toward the left side of the screen. Some of the people in the group hold their heads, and then all disperse, running in different directions. Several of them kick soccer balls. The scene cuts to an aerial view of the field. The people with soccer balls run toward different parts of the field. One kicks a ball toward the goal. The ball goes through the goal. The scene cuts to a side shot of a soccer ball going through the goal, then back to the man in the red and white jacket, who gestures with his hands while speaking. Another person walks by behind him, near a soccer goal.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where are the four football frames positioned on the football field in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. On the left, at the front, and on the right.",
+ "B. On the left, at the front, and at the back.",
+ "C. On the left, at the back, at the front, and on the right.",
+ "D. On the left, at the back, and on the right."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the goalkeeper's shirt?",
+ "answer": "C",
+ "candidates": [
+ "A. Yellow.",
+ "B. Red.",
+ "C. Green.",
+ "D. Black."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What type of training is the coach instructing the players to perform in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Dribbling drills.",
+ "B. Penalty kick training.",
+ "C. Shooting drills.",
+ "D. Passing drills."
+ ]
+ }
+ },
+ "jGFctzCh": {
+ "video_id": "jGFctzCh",
+ "video_duration": "504s",
+ "duration": ">8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a purple and pink gradient background with white dots and two large, glowing pink shapes. The text \"Dramatic Parrot\" appears in silver metallic font with pink highlights, centered on the screen. The background then changes to a blurry, pale blue and white, before focusing on a book with the title \"WRECK THIS JOURNAL\" taped to the front with light-yellow tape, written in black text. A hand with dark blue nail polish rests on the book, and a light blue banner appears on the screen, at the bottom, with \"@DRAMATICPARROT\" written on it and an Instagram logo to the left. The hand opens the book to a white page with instructions in black text and a black arrow pointing to the word \"INSTRUCTIONS.\" The video then shows another page that reads \"ADD YOUR OWN PAGE NUMBERS\" in large black letters, along with a list of materials. Another black page appears that reads \"LEAVE THIS PAGE BLANK ON PURPOSE\" in white letters. The hand is then shown turning pages of the book. A page is shown that says \"INFUSE THIS PAGE WITH A SMELL OF YOUR CHOOSING\" with a small black cloud with white words printed inside. The hand opens another page that says \"GIVE AWAY YOUR FAVORITE PAGE\" with a black cloud and white words printed inside. The hand turns more pages before opening to a page that has several lined areas on which to write. A person begins writing on the page with a black pen, then a gray pen and another black pen, and continues filling out each of the lines. A page that says \"CRACK THE SPINE\" is displayed, and a blue colored pencil adds wavy lines. These lines are then shaded in with gray and a darker blue pencil. Another page is displayed that reads, \"STAND HERE. (WIPE YOUR FEET, JUMP UP AND DOWN.)\" and a figure is sketched with a black pen; a rabbit drawing is then added to this illustration and the image is colored in with a red, pink, and blue colored pencil. 
On the next page is the text \"POKE HOLES IN THIS PAGE USING A PENCIL,\" and multiple tube-shaped and pencil-shaped figures are drawn in black pen with a few filled in with colored pencils. The page after that says, \"DRAW FAT LINES AND THIN. PUSHING REALLY HARD WITH THE PENCIL.\" Several curving, vertical blue lines are added, then the area is filled in with marker in shades of green, pink, red, and purple, and then outlined in black pen. The next page says, \"THIS PAGE IS FOR HANDPRINTS OR FINGERPRINTS. GET THEM DIRTY THEN PRESS DOWN,\" and multiple handprints appear in blue and green ink. The hand then covers itself in yellow paint and a handprint appears on the page along with splotches of green and red paint. The hand covers itself in yellow, then red paint, then another handprint is made before the hand is wiped clean. The next page says, \"(COLOR THIS ENTIRE PAGE)\" with an outline of a geometrical shape. The shape is then colored in with a variety of markers, with an orange pen drawing an outline of a unicorn above. The following page reads, \"THROW SOMETHING A PENCIL, A BALL DIPPED IN PAINT,\" a target is drawn, and several splatter marks appear on the black circle. The next page says, \"PRESS LEAVES AND OTHER FOUND THINGS,\" and an outline of a tree is drawn with falling leaves and then colored in using an orange marker. The following page reads, \"SCRATCH THE PAGE USING A SHARP OBJECT,\" and a person draws a crescent moon, a star, and other figures on the black page. The next page displays, \"HOLES IN THIS PAGE USING A PENCIL\" and several different colored pencils are displayed. Finally, the person draws a cup and decorates it with paint and marker. The video ends with the \"Dramatic Parrot\" text appearing again in a purple and pink space scene. There are social media icons on the screen. The text says, \"NEW ART VIDEOS EVERY THURSDAY AND SATURDAY.\"",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many pages does the woman paint in the book shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Ten.",
+ "B. Eight.",
+ "C. Four.",
+ "D. Six."
+ ]
+ }
+ },
+ "CBvEypvb": {
+ "video_id": "CBvEypvb",
+ "video_duration": "31s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A first-person view of a virtual world is shown, with a hand holding a small gray object. A blue and purple rectangular plane is visible. A small inset shows an animated character with blue hair and a blue and white jacket. The view changes to show two cartoonish airplanes. A multicolored platform is seen in the distance. The inset character remains in view. The view changes to show a person's computer screen displaying chat windows and video information. The animated inset character remains visible. The computer screen disappears, and a view of two cartoonish airplanes and a multicolored platform is visible. The inset character is visible. The viewpoint moves along the multicolored platform and a different character, with black hair, a black jacket, and white shorts appears. The inset character remains visible. The black-haired character sits on the back of a red and white airplane. A first-person view appears inside the airplane's cockpit, with the multicolored platform in view. The inset character remains visible. The airplane moves along the platform, eventually reaching the end and falling off. The view switches to third person and shows the airplane falling through the air. The inset character remains visible. The airplane travels along an open area of water. The view changes to a multicolored platform, then back to the airplane. The inset character remains visible. The view changes to an open area of water with small landmasses and one larger island with a rocky face. The airplane flies in this area. The inset character remains visible. The view changes to show a blue, still image of the inset character.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What does the man do with the plane in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Place it behind the white plane.",
+ "B. Inspect the parts underneath the plane.",
+ "C. Throw the plane away.",
+ "D. Smash it against the white plane nearby."
+ ]
+ }
+ },
+ "SVNyPTsP": {
+ "video_id": "SVNyPTsP",
+ "video_duration": "338s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a solid black screen. The screen changes to a wide shot of a gray, empty road, with a blue tint over the entire frame. This is followed by a similar shot but in regular color showing a long, straight gray road with a double yellow line down the middle, sparse shrubs on either side, and a dark rocky hill in the distance. Two motorcycles appear in the distance traveling towards the camera. The motorcycles get closer, and it\u2019s shown that the riders are wearing helmets. The motorcycles pass the camera and drive away. The view then changes to a wide shot of the road again, with a large sun flare in the distance. After that, the words \u201cBOOTS & JEANS RIDERS\u201d appear over the road, superimposed on top of an image of boots and jeans. Then it shows a similar wide view of the road at night. The next section of the video features two individuals and a motorcycle, all in close-up view. A man wearing sunglasses and a gray shirt is front and center, with a woman to his left who is wearing glasses and a blue bandana. The man gestures with his index finger and speaks, and a graphic appears over them saying \u201cSPECIAL-K PATTERN\u201d. The view shifts to a first-person view from the motorcycle, with several orange cones and white lines visible on the gray parking lot. The man's arm and hand are visible as he gestures to the left and right. The view briefly switches to a full view of the motorcycle, then back to first-person. Then the man\u2019s hand is visible again, pointing at various orange cones, and lines marked on the ground. Then directional black arrows are added to the video. The next section of the video is a side-by-side comparison of the motorcycle rider performing a maneuver. The top view shows a close-up, first-person view from the rider\u2019s perspective. The bottom view is a wide shot of the rider and motorcycle from a distance. Both views show the rider following the pattern laid out by the cones and lines. 
The words \u201cFirst Run\u201d appear briefly over the split screen. The maneuver is performed again, and the words \u201cSecond Run\u201d appear over a wide shot of the bike as it moves through the pattern. Both the first and second runs are shown from the driver\u2019s perspective and from a wide view. The camera changes to a first-person view and shows another rider making the same maneuver, before changing to a side-by-side split screen of both riders moving through the pattern at the same time. The video cuts to a view of the road, then to the close-up of the first rider again, who gestures and gives a thumbs up, while the words \u201cSubscribe\u201d appear below him. He does some more hand gestures and is seen talking, before the video fades back to the \u201cBOOTS & JEANS RIDERS\u201d graphic over the boots and jeans image.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the most likely location depicted in the latter half of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A soccer field.",
+ "B. A car training area.",
+ "C. A parking lot.",
+ "D. A motorcycle training area."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many laps did the man complete on the motorcycle in the training video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four laps.",
+ "B. Two laps.",
+ "C. One lap.",
+ "D. Three laps."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "At the beginning of the video, what is most likely the identity of the black man wearing sunglasses?",
+ "answer": "C",
+ "candidates": [
+ "A. Client of the training ground.",
+ "B. Coach at the training ground.",
+ "C. Owner of the training ground.",
+ "D. Host exploring the training ground."
+ ]
+ }
+ },
+ "CjwcRdDj": {
+ "video_id": "CjwcRdDj",
+ "video_duration": "122s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a series of cartoon panels. The first panel shows a man holding a baby, with a woman standing next to him. The second panel shows a boy running in a forest as two figures crawl on the ground behind him. The third panel shows a person holding a tray with a cake and a drink on it, and a plate with food and a cloth napkin next to it. The fourth panel shows a boy who looks scared, as a man and woman tend to a crying baby in the background. The next series of panels shows a man holding an axe to a piece of wood, with a boy looking on. The next panel shows the man pointing with the axe, and the next panel shows a boy holding an axe and a hedgehog. The final panel shows a boy in the lower left frame holding wood blocks and a piece of paper, and in the lower right panel a man holding an axe. The next series of panels shows a man holding a bucket of water and a man with a rake standing by a shed. A third panel shows a boy lying next to some sheep, then a fourth panel with the same boy and a man petting a sheep. The final panel shows the same scene with another boy helping, and then a man walks to the shed with wood boards. The next series of panels shows a man's face, then the man in a collared shirt. The next panel shows the man talking to another man, and the next panel shows that man arranging some long plant stalks. The next panel shows that same man next to a boy, and finally in the last panel the two men are next to a sheep. The final series of panels shows a man who looks distressed, and then in the next panel he appears to be smelling something. The next panel shows him hugging a person, with a boy watching. In the next panel he is shown hugging that person as the boy says \"whoa\" next to the hugging figures, then in the next panel the boy is shown sleeping. 
The next panel shows the same boy waking up, and the final panel shows the boy swinging a garden tool. Throughout the video there are frames that fade to white with the words \"sprouts schools\" written in purple on them, with other small shapes surrounding the writing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "After the younger brother gave a sheep to their father, what unusual actions did the black-haired boy exhibit?",
+ "answer": "C",
+ "candidates": [
+ "A. Took the sheep away.",
+ "B. Slammed the door and stormed out.",
+ "C. Killed his younger brother.",
+ "D. Killed the sheep."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "On Father\u2019s Day, what did the boy with black hair do for his father?",
+ "answer": "A",
+ "candidates": [
+ "A. Gave his father dry grass.",
+ "B. Gave his father furniture.",
+ "C. Helped his father fetch water.",
+ "D. Gave his father a sheep."
+ ]
+ }
+ },
+ "pGMoopEh": {
+ "video_id": "pGMoopEh",
+ "video_duration": "164s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows various scenes of a city. It starts with an aerial view of a city with mountains in the background. Then, it shows a close-up of a bridge, an overhead view of the same bridge with cars driving, and an outlined drawing of that same area with pink highlights. The pink highlights turn into a 3D model of a building and then the building grows and twists upwards. The video then shows several photographs and short video clips of the same building taken from different locations and at various times of day. This building is a tall, white, geometric structure. There are also images of the building during different phases of construction, and a few images that include people, cars, and other buildings nearby. The video also includes several scenes of a large decorative chandelier hung underneath a bridge. Finally, there is a scene of a building with fog and a scene of a runner in front of the buildings. In the lower left of the video, there is also a small video of a person with reddish-blonde hair wearing a black jacket. This person is seen gesturing with his hands as he looks up at the various scenes in the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What art forms related to the project did the architect mention?",
+ "answer": "B",
+ "candidates": [
+ "A. Painting and sculpture.",
+ "B. Street art and video art.",
+ "C. Theater and film.",
+ "D. Music and dance."
+ ]
+ }
+ },
+ "ToBqGXFo": {
+ "video_id": "ToBqGXFo",
+ "video_duration": "38s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a building with multiple floors, a fence, and trees nearby. The scene transitions to the interior of a room where a person says \"Increase the volume\". Then a second person appears, and they are in a bedroom. They say \"My song!\". Both start dancing, and laugh. Then another person is seen in a doorway, looking at their phone. Then the second person says \"Aahh! Show them, show them, show them\". The person in the doorway enters and starts to dance. The first person says \"What's up with all the dance moves?\" The person in the doorway says \"What's up? Where are you girls going?\". One of the people says \"We are going to the party of the year.\" The person in the doorway says \"Really!\" and then the other person responds \"Yeah\" and then says \"Roy. Let me even call him to confirm the time for the party\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times did the woman with a ponytail in the video say, \"Show it to them!\"?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. Four.",
+ "C. One.",
+ "D. Three."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who just said, \"Let me even call him to confirm the time for the party.\"?",
+ "answer": "A",
+ "candidates": [
+ "A. The woman with a ponytail.",
+ "B. The person filming the video.",
+ "C. The woman in a red shirt.",
+ "D. The woman with a bun."
+ ]
+ }
+ },
+ "nAwmJsux": {
+ "video_id": "nAwmJsux",
+ "video_duration": "139s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man wearing a green jacket and black hat is standing next to a table. He looks around, then moves some items on the table. Another man is sitting on the opposite side of the table. He's wearing a plaid shirt and black hat. There are a number of objects on the table and in the background, including several guns. The two men gesture with their hands as they appear to be talking to one another. The man in the plaid shirt takes a gun from the table and hands it to the other man. The man in green examines the gun, then hands it back to the man in the plaid shirt. The man in green is handed another object, and examines it. The two men continue to gesture with their hands as they appear to be speaking to one another.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the man in red do to the man in olive green?",
+ "answer": "C",
+ "candidates": [
+ "A. Handed four handguns to the man in green military clothing for selection.",
+ "B. Handed three handguns to the man in green military clothing for selection.",
+ "C. Handed two handguns to the man in green military clothing for selection.",
+ "D. Handed five handguns to the man in green military clothing for selection."
+ ]
+ }
+ },
+ "IqrJTNjs": {
+ "video_id": "IqrJTNjs",
+ "video_duration": "51s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Tech Reviews",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows several different computer screens and displays in various settings: a shot of a server room with rows of computer server racks lit with blue and red lights; two men in a room with many hanging wires in a server room; a black background with white text stating \"That's Linux!\"; a desktop monitor with a variety of application icons showing; a television screen showing a multicolored graphic; an interior view of a car with two people, one of whom is holding a tablet in their hands; a car dashboard screen showing a cartoon penguin; a close-up of a Samsung phone with a dark screen with a purple and blue graphic; two Samsung tablets on a table; a large screen on a wall in a room with furniture; a woman using a touch screen; a woman in uniform spraying and wiping down a surface; a red coupon machine with text on the screen; an airplane's entertainment screen showing language options; an ATM screen displaying a Windows logo; and an ATM screen displaying a blue error screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the brand of the tablet shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Huawei.",
+ "B. Xiaomi.",
+ "C. Samsung.",
+ "D. Apple."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a penguin icon representing the Linux system in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "YTQpYvwq": {
+ "video_id": "YTQpYvwq",
+ "video_duration": "67s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video opens on a dark, blurry scene showing figures moving around a body on the floor. Then, it cuts to a close-up of various items strewn on a surface, including what appears to be a shell casing, a ruler, and some crumpled paper. Next, a grieving woman wearing a baseball cap and a cloth around her neck is shown looking down at framed photos of a young man on a table decorated with flowers. The table also displays an assortment of items including a passport and another framed photo. The scene transitions to two women standing outside holding framed pictures. One woman holds a photo with the words \u201cRISE UP\u201d written at the bottom. The other woman has an album open showing multiple photos of the same person at different ages. Then, a woman with glasses and dark hair speaks to the camera, standing in front of a wall with maps. The video cuts back to a dark scene of police officers in tactical gear handcuffing a man sitting on the ground. Next, it shows a different dark scene of a police officer holding a large gun standing over two bodies lying on the ground. Another person is sitting nearby with their head in their hands, and another sits next to them appearing distraught. A motorbike is visible in the background. The woman with glasses is shown again speaking to the camera in front of the maps. The video cuts to a poster showing a drawing of a person lying in a coffin. Text in a foreign language is visible above the image. The video ends with the same woman continuing to speak to the camera, followed by a shot of a document with circled names on it. Text in a foreign language is at the top of the document.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the reason that human rights defenders were red-tagged in Cebu?",
+ "answer": "D",
+ "candidates": [
+ "A. They assisted a drug suspect in escaping.",
+ "B. They exposed police corruption.",
+ "C. They participated in illegal protests.",
+ "D. They were considered terrorists."
+ ]
+ }
+ },
+ "vOvDtAfq": {
+ "video_id": "vOvDtAfq",
+ "video_duration": "116s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a room with two aquariums, one on the left and one on the right. There are two stacks of boxes in the middle of the room. A hand points to the bottom box of the stack, then the video moves to focus on the top two cardboard boxes. The hand opens a drawer and removes a red box cutter. The video then moves to the other set of boxes and the person places the box cutter on the top of the box and cuts the tape. The person opens the top of the box. Inside, there is a white box with blue, grey, and red text. The person removes the box, revealing that it is a Ziss brine shrimp hatchery. The person then touches the aquarium to the left. There are numerous small fish swimming in the tank. The video then shows the stack of boxes again. The person uses the box cutter to cut the tape on the top box, opens the top of the box, and shows the inside.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "How will the man in the video use the two incubators?",
+ "answer": "B",
+ "candidates": [
+ "A. Both incubators are used as backups.",
+ "B. One incubator is used for incubation while the other is kept as a backup.",
+ "C. One incubator is used, and the other is returned.",
+ "D. Both incubators are used simultaneously for incubation."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Has fish food appeared in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes.",
+ "B. I am not sure.",
+ "C. No."
+ ]
+ }
+ },
+ "IJYuQkJu": {
+ "video_id": "IJYuQkJu",
+ "video_duration": "92s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a blurry image of lavender flowers with the word \"Planning\" superimposed. It then transitions to several shots of a small, light yellow house with a grassy yard. The grass is shown at different stages of growth, from overgrown weeds to neatly mowed lawns. A brick wall borders part of the property. Next, a hand holds a smartphone displaying a sun path tracking app. The app shows the sun's trajectory across the sky, overlaid on a view of trees. Several small white flags are shown staked in a weedy patch of ground near the house. The perspective shifts to various angles around the yard, showing weeds, a tree stump, and gardening tools leaning against the house. A concrete basin is situated next to the brick wall. A view of the street in front of the house follows, showing parked cars and a wet, muddy road after rain. A pile of rocks is visible in the foreground. This scene transitions into a depiction of a small stream or creek flowing through a wooded area. A screenshot of the Calscape website appears, showing information about the Western Sycamore tree. The website displays images of the tree, a range map, and plant details. Then, a search for \"Sage\" is entered on the Calscape website, showing various sage species native to California. The zip code \"91109\" is entered, refining the search results. The Hummingbird Sage (Salvia spathacea) is selected, and its details are displayed. The video returns to the house, showing rain falling on the yard through a window. Raindrops cover the glass. The rain is also viewed from a different window, showing a red barrel and white flags in the yard. A puddle forms in the dry grass next to the house. Hand-drawn plans for the yard are then shown. The plans depict the layout of plants, a patio, pathways, and other features. Different zones are marked with colored highlighters. A second plan focuses on the stormwater retention system, including rain barrels, bioswales, and catch basins. The BeWaterWise.com website appears next, with sections on rebates, classes, gardens, and more. The \"Turf Replacement Program\" page is shown, highlighting the program's requirements and application process. The SoCal WaterSmart rebate portal website is then displayed, showing the turf replacement application page. Finally, the video concludes with the hand-drawn stormwater retention plan overlaid onto a view of rocks and soil in the yard, which transitions to a close-up of a dry creek bed lined with rocks and mulch, newly planted with native plants. Several potted plants are shown on the ground, and two red rain barrels collect water from a downspout.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "How does sunlight affect the east and west sides of a house differently?",
+ "answer": "C",
+ "candidates": [
+ "A. The west side is significantly drier than the east side.",
+ "B. The level of dryness is the same on both sides.",
+ "C. The east side is significantly drier than the west side.",
+ "D. The east side is more humid than the west side."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order of events in the video? 1. Develop an irrigation plan. 2. Observe the natural runoff of rainwater on the property. 3. Apply for the lawn removal reimbursement program. 4. Use the Sunseeker application.",
+ "answer": "C",
+ "candidates": [
+ "A. 1, 2, 4, 3.",
+ "B. 2, 4, 1, 3.",
+ "C. 4, 2, 1, 3.",
+ "D. 4, 2, 3, 1."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Did a red bucket appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No.",
+ "C. Yes."
+ ]
+ }
+ },
+ "dvIqvgUs": {
+ "video_id": "dvIqvgUs",
+ "video_duration": "495s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with a blue and orange head wrap sits in a chair in front of a dresser. She holds a magenta wig. She picks up a black spray can and sprays something on the inside of the wig cap. She then puts on the wig. She adjusts the wig on her head and styles it with her fingers. She picks up a brush and brushes the wig. She continues to talk while adjusting and styling the wig.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many wigs is the woman in the video displaying on her head?",
+ "answer": "B",
+ "candidates": [
+ "A. Three wigs.",
+ "B. One wig.",
+ "C. Two wigs.",
+ "D. Four wigs."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the three letters displayed in the upper right corner of the initial on-screen text in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. div.",
+ "B. pre.",
+ "C. par.",
+ "D. sis."
+ ]
+ }
+ },
+ "ZFWqlMcH": {
+ "video_id": "ZFWqlMcH",
+ "video_duration": "120s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A black and white animated figure with large ears moves from the right side of the frame towards a dark, silhouetted building on the left. Then a different black and white animated figure is seen in the frame on the left of the first figure, and both are visible in the frame. Three cartoon characters are visible on a colored background; two are behind a third cartoon figure which has a visible scythe. One cartoon character is shown on a blue background, moving forward. A drawing of 5 cartoon characters, standing side by side, is shown on a white background. One animated figure is shown with a blue and white color scheme, looking down. More animated figures are shown in the background behind it. A man is shown in a restaurant with a shake and the word \u201cSHAKE\u201d in all capital letters on a card in front of him. Two cartoon characters are visible; one is white with blue accents with its face obscured by its hands. A blue and white cartoon figure is shown behind a window. Inside a room, the same blue and white figure is shown on a bed. The same blue and white figure is shown reading a book in a room, with another white and black figure in the background. The same blue and white figure is shown looking in a mirror, with the room in the background. The same blue and white figure is shown sitting on the edge of a bed in a room with a white and black figure in the doorway in the background. A red circle is visible on the figure\u2019s chest. The text \u201cDON\u2019T WORRY SON, IT LOOKS GREAT!\u201d is visible on the screen. A person\u2019s face is shown close up with the text \u201cDON\u2019T WORRY SON, IT LOOKS GREAT!\u201d above it. Two red circles highlight two spots on a blue and white figure, with the text \u201cI KNOW THEYRE NOT SLEEVES, BUT I REALLY DON\u2019T KNOW WHAT TO CALL THEM\u201d below it. Three sketches of a blue and white figure are shown on a white background. Two drawings of blue and white figures are shown on a white background. A blue and white animated figure is shown holding an object with a visible \u201cNick\u201d logo. A black and white animated character is shown in a blue square that is attached to an arm-like object. Three sketches of a blue and white figure are shown on a white background. Three additional sketches of a blue and white figure are shown on a white background. Two more sketches of a blue and white figure are shown on a white background. A black and white robot illustration is shown with buildings in the background and an object that is shaped like a flying saucer above it. A black and white image is shown of tall buildings with a larger building in the center. Two sketches of a blue and white figure are shown on a white background. Three sketches of a blue and white figure are shown on a black background. Two different sketches of a blue and white figure are shown on a white background. A blue and white figure is shown walking. A blue and white figure is shown looking down, with one of its hands in a fist. The same blue and white figure raises one hand in the air. The same blue and white figure is shown pointing up. Two drawings of a blue and white figure are shown, with written descriptions. A 3D animated figure, with a blue and white color scheme, is shown in front of a black background. A game cover with various animated characters and the words \u201cNicktoons Attack of the Toybots\u201d is shown on a black background. A 3D animated figure with a blue and white color scheme is shown in front of a black background, in front of a game cover. A 3D animated figure with a blue and white color scheme is shown in front of a black background. A see-through effect is shown with a 3D figure in the center, with several cartoon characters behind it.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "In the video, does the background music change?",
+ "answer": "D",
+ "candidates": [
+ "A. The video starts with no background music, shifts to calm background music in the middle, and ends with very loud background music.",
+ "B. The video starts with calm background music, shifts to very loud background music in the middle, and then returns to calm.",
+ "C. The background music changes from calm to eerie and finally becomes intense.",
+ "D. The video starts with no background music, transitions to very loud background music in the middle, and then returns to calm."
+ ]
+ }
+ },
+ "emzncDEV": {
+ "video_id": "emzncDEV",
+ "video_duration": "182s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Fitness",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. Then a metal sign with the word \"MUTANT\" and a biohazard symbol appears on the ground, surrounded by debris and smoke. Then a group of people appear in the distance before the scene changes. Now the walls of a building are shown with black-and-white posters of men on the wall above the light fixtures. The scene then changes to a view of a room with weightlifting equipment. A woman wearing a red top with the word \"MUTANT\" on it performs exercises using dumbbells while a man behind her offers guidance. The camera cuts to show various views of her using weights in the room. There are other people in the gym as well, and the words \"OLYMPIA MUSCLE & FITNESS\" are seen on the wall of the room. Text appears at the bottom of the screen saying \"LOLA MONTEZ\", with \"FIT_LOLAMONTEZ\" underneath it and a biohazard symbol to the right. The woman moves and talks for a few moments. Then she performs more exercises with dumbbells, and then text appears on the screen reading \"MUTANT SHOULDERS\" and a workout plan. The scene then changes to show the sign with the word \"MUTANT\" again with social media information and the text \"Leave Humanity Behind!\" on the wall nearby.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the woman independently lift and lower the barbell in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. Six.",
+ "C. Five.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did the woman in the video change the barbell?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "PPybzref": {
+ "video_id": "PPybzref",
+ "video_duration": "42s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A brown horse with a black mane is shown being led by a person wearing blue jeans and a red vest over a black long-sleeved shirt. The person is holding the horse's lead. The horse and person are walking on grass with other foliage in the background. Then there is a horse race. Several horses and jockeys race on a grass course. There are white railings along the edge of the course. There are spectators in the background. One horse is circled in lime green for some time and is shown as a dark brown horse ridden by a jockey in red and white. Towards the end, two horses are shown running close together, one ridden by a jockey in purple and white and the other by a jockey in red and white. A yellow ambulance and a white timing board with the number 3:25:43 are also visible.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What time appears on the timer when the first contestant crosses the finish line?",
+ "answer": "A",
+ "candidates": [
+ "A. 3 minutes and 25 seconds.",
+ "B. 4 minutes and 10 seconds.",
+ "C. 3 minutes and 15 seconds.",
+ "D. 4 minutes and 1 second."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What position did the man in red achieve at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Fourth place.",
+ "B. Second place.",
+ "C. Third place.",
+ "D. First place."
+ ]
+ }
+ },
+ "pfpuNdzt": {
+ "video_id": "pfpuNdzt",
+ "video_duration": "129s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A harness race is shown from an elevated, wide-angle perspective. A white car pulls a starting gate with eight horses behind it. As the horses gain speed, the car pulls away and the starting gate swings open, marking the official beginning of the race. The horses, each pulling a two-wheeled sulky with a driver, race counter-clockwise around an oval dirt track. The camera follows the horses as they progress. The horses remain relatively bunched together initially. As the race continues, the horses spread out along the track. On-screen graphics display the race number, location, date, time, and horse positions. Number 2 maintains an early lead, briefly overtaken by Number 4 and then regaining its position at the front of the pack. Number 2 leads through much of the race with Numbers 3, 4, and 5 close behind in alternating positions throughout the race. In the final stretch, Number 2 and Number 4 race neck and neck but number 2 remains ahead throughout the stretch. The video ends shortly before the horses reach the finish line.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "According to the video, how many participants competed in this horse race?",
+ "answer": "C",
+ "candidates": [
+ "A. Seven.",
+ "B. Eight.",
+ "C. Nine.",
+ "D. Twelve."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What number match is shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The 6th match.",
+ "B. The 10th match.",
+ "C. The 12th match.",
+ "D. The 8th match."
+ ]
+ }
+ },
+ "XGZPhlDK": {
+ "video_id": "XGZPhlDK",
+ "video_duration": "99s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a charcoal grill with hot coals and a black grill grate. A gloved hand places a small, light-colored wood chunk onto the coals. The hand adjusts a vent on the side of the grill. A large cut of meat, a tomahawk steak, is placed on the grill grate over the coals. The grill lid is closed. After a period of time, the lid is opened to reveal the steak, which has darkened in color. A meat thermometer is inserted into the steak and the temperature is monitored as it cooks. Using tongs, some coals are moved around under the steak, and the steak is flipped over. The thermometer is inserted again, and the temperature continues to be monitored as the steak cooks on the second side.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the steak placed on the grill in the first half of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The steak is placed in the center of the grill.",
+ "B. The steak is placed below the grill.",
+ "C. The steak is placed on the right side of the grill.",
+ "D. The steak is placed on the left side of the grill."
+ ]
+ }
+ },
+ "VnTDRejl": {
+ "video_id": "VnTDRejl",
+ "video_duration": "226s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Fashion",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view of several packages wrapped in white paper with the words \"Ever New Melbourne\" printed on them, along with a black bag also featuring that name. The camera pans across the packages and then lingers on the black bag. The video then focuses on a hand opening one of the packages with an orange bow and removing a small pink card. The hand also removes a card with a picture on the front of it. The camera zooms in on this card, then reveals the back of the card before returning to the packages. The camera then shows a person wearing a white off-the-shoulder dress with a pale purple purse walking on a staircase, followed by the same person wearing a black dress with white polka dots and a red purse. The person is then shown wearing a black and white dress with a black purse. Finally the person is shown wearing a black jumpsuit with a red purse. The person walks across the staircase in various poses.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "For what type of occasion is the white dress shown in the video appropriate?",
+ "answer": "A",
+ "candidates": [
+ "A. For meeting parents.",
+ "B. For wearing at home.",
+ "C. For a date.",
+ "D. For work."
+ ]
+ }
+ },
+ "hLaXEfNE": {
+ "video_id": "hLaXEfNE",
+ "video_duration": "40s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A pair of bare feet steps onto a black bathroom scale. The display on the scale lights up green then changes to blue, showing the number 68.5, with \"kg\" underneath. The video then cuts to a man holding a weight plate. He is wearing a white tank top and has a bracelet on his left wrist. In the background are two blurry figures. The video cuts again to a man in a red tank top walking in a gym. He has a medium complexion and short brown hair. In the background, other figures can be seen. The video cuts again to a mirrored image of the same man wearing a white tank top and a black cap, over which the text \"WORK\" \"WORKOUTS\" appears. The video cuts again to a grassy indoor soccer field. Three men are visible in the background, two are standing, one is in the foreground. The man in the foreground is holding a soccer ball. Text across the scene says \"FAZECLAN\". The video cuts again to two men cheering, again with \"FAZECLAN\" text over them. The video cuts again to a blurry image of the man from the first gym scene. The video then cuts to him working out with a cable machine and wearing earbuds. The video cuts again to a close-up of a man and a woman in helmets and blue shirts. The man says \u201cwelcome!\u201d Then the video cuts to black twice. The final section of the video shows the man from the gym scenes in a kitchen. He's wearing a white t-shirt and a black cap. He counts on his fingers, gestures, makes different expressions and moves his hands.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What fitness equipment is used in the video of the man exercising?",
+ "answer": "D",
+ "candidates": [
+ "A. A resistance band.",
+ "B. A treadmill.",
+ "C. A bench press rack.",
+ "D. A dumbbell."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the latter part of the video, where is the ladder positioned relative to the man inside his house?",
+ "answer": "A",
+ "candidates": [
+ "A. On the rear left side.",
+ "B. Behind him.",
+ "C. On the left.",
+ "D. On the right."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will the man eventually reach his ideal muscle mass?",
+ "answer": "C",
+ "candidates": [
+ "A. He will not reach it.",
+ "B. It is uncertain.",
+ "C. He will reach it."
+ ]
+ }
+ },
+ "yaduLHSo": {
+ "video_id": "yaduLHSo",
+ "video_duration": "112s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A group of people raise glasses in a dimly lit room filled with candles. A man in a vest holds a woman\u2019s hands. A man watches security camera footage of figures moving through dark rooms. He speaks to the woman in the white dress. A gloved hand interacts with a gramophone. A long, dark hallway is lit by candles and a hanging lamp. A hanging lamp illuminates a curved staircase. The woman in white walks down the staircase. She holds a small object in her hands. A man sits in a chair with his head in his hands. A woman with short, styled blonde hair looks directly at the camera. The woman in white hides as people search for her. A man in a tuxedo and a woman in a dark dress look worried. A woman looks off to the side with a concerned expression. Another woman with short blonde hair looks in the same direction. The woman in white hides in a dumbwaiter.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "At the beginning of the video, where is the candle located on the table?",
+ "answer": "C",
+ "candidates": [
+ "A. Directly above.",
+ "B. On the right.",
+ "C. In the center.",
+ "D. On the left."
+ ]
+ }
+ },
+ "AnXSaLKm": {
+ "video_id": "AnXSaLKm",
+ "video_duration": "52s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a soccer match. Players wearing teal and white uniforms are moving the ball across a green soccer field. The goal is visible with a net and a player in a yellow uniform is standing near the goal. At some point, two players wearing white uniforms are jumping in the air and putting their arms up. Then multiple players are clustered around the soccer ball moving around and kicking it.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "How does the goalkeeper save the football in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Heads it away.",
+ "B. Blocks with hands.",
+ "C. Kicks it with the right foot.",
+ "D. Blocks with the body."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the atmosphere change before and after the goal in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Tense to Ordinary.",
+ "B. Tense to Heartbroken.",
+ "C. Tense to Excited Celebration.",
+ "D. Tense to Irritated."
+ ]
+ }
+ },
+ "chFRlQLS": {
+ "video_id": "chFRlQLS",
+ "video_duration": "52s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Two girls perform a dance routine on a paved area outdoors. They begin in a crouched position, then stand and transition through a series of synchronized moves. Their choreography incorporates arm movements, steps, and turns, sometimes mirroring each other and other times performing complementary actions. The dance includes moments of quick, energetic steps and poses, as well as slower, more fluid movements. They dance facing forward, backward, and sideways, their hair often moving with the choreography. The background includes a grassy lawn, trees, benches, and a multi-story residential building.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video, how many times does the girl wearing black shake her head?",
+ "answer": "D",
+ "candidates": [
+ "A. Once.",
+ "B. Three times.",
+ "C. Zero times.",
+ "D. Twice."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What are the occupations of the two girls in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. They are dance students.",
+ "B. They are dance teachers.",
+ "C. They are dancers.",
+ "D. They are dance professors."
+ ]
+ }
+ },
+ "rifzyIRi": {
+ "video_id": "rifzyIRi",
+ "video_duration": "65s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two figures in white spacesuits are shown attached to an object in space. The background shows a planet. One of the figures appears closer and is shown working with a small tool. Then, in a dark scene, a single green light is visible, followed by the appearance of a white oval object with red and green lights. The scene shifts to a grey sky with two parachutes and an object underneath, moving down. The scene shifts again to a white, oval object on a boat in the water, then a shot of a space station and, lastly, a black screen with a blue logo and web address.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many unit parachutes make up the parachute shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Four.",
+ "B. Five.",
+ "C. Six.",
+ "D. Three."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What are the two astronauts doing outside in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Cleaning the spaceship.",
+ "B. Separating the spacecraft.",
+ "C. Connecting to the space station.",
+ "D. Repairing the spaceship."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Did a speedboat appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. Uncertain."
+ ]
+ }
+ },
+ "wYSrMEKF": {
+ "video_id": "wYSrMEKF",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Historical Analysis",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a split screen with 6 different angles, displaying sleeping characters in one window, and a bald man with a cap in the center, walking in front of a window overlooking space. The next scene shows a man in military attire walking in front of a holographic globe. The scene cuts to a military vehicle flying over a city, firing a weapon. A video call pops up in the corner. The scene cuts to three people looking at a holographic globe on a table. Then the screen shifts to the woman on the right, then back to a city, where a military vehicle fires a weapon at an enemy on the ground. A purple shield comes up from the ground. There are two people in a forest, one holding a futuristic device. The scene cuts to a large figure in metal armor walking over two defeated enemies. The scene cuts to a figure being beamed down onto a platform.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action did the tall, silver-armored man take while holding the woman in red by the hand?",
+ "answer": "C",
+ "candidates": [
+ "A. He flees.",
+ "B. He helps her up.",
+ "C. He unlocks something.",
+ "D. He strikes her."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where are the man wearing a baseball cap and the woman in a white lab coat located relative to the dome?",
+ "answer": "A",
+ "candidates": [
+ "A. Above the dome.",
+ "B. I am not sure.",
+ "C. Inside the dome.",
+ "D. Beneath the dome."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the video feature plasma-based Rhino tanks?",
+ "answer": "A",
+ "candidates": [
+ "A. No, it does not.",
+ "B. Unsure.",
+ "C. Yes, it does."
+ ]
+ }
+ },
+ "TdFGthgw": {
+ "video_id": "TdFGthgw",
+ "video_duration": "60s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Celebrity Interviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. Then, a man in a dark suit appears in the center of a stage with wood paneling and red curtains. Text reading \"Christina Hendricks\" appears over him. A skeleton dressed in a suit and adorned with beads appears; text reading \"JUNE 3RD 2013\" is superimposed over the skeleton. The man in the dark suit reappears, and then a woman in a black dress walks into the scene from the right. The woman and the man shake hands. The woman and the man walk across the stage and then the woman goes over to the side of the stage where two armchairs are. The man walks over to a desk and picks up papers. The woman sits in one of the armchairs while the man sits behind the desk. The two individuals then talk to one another. Text reading \"THE JAYLENO FLY\" appears on the screen. The woman and the man continue talking and gesturing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the action of \"a person sitting on a sofa\" take place?",
+ "answer": "B",
+ "candidates": [
+ "A. From the beginning to the middle of the video.",
+ "B. From the middle to the end of the video.",
+ "C. Throughout the entire video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, in which direction is the pink cup relative to the woman?",
+ "answer": "D",
+ "candidates": [
+ "A. In front.",
+ "B. To the left.",
+ "C. Behind.",
+ "D. To the right."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "Did the man in the video lightly kiss the woman?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. I am not sure.",
+ "C. No."
+ ]
+ }
+ },
+ "ZnTsagyn": {
+ "video_id": "ZnTsagyn",
+ "video_duration": "121s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two fighters, one in blue trunks and the other in red trunks, circle each other in a boxing ring. The fighter in blue throws a combination of punches. The fighter in red lands a punch. The fighter in blue throws a kick, and the fighter in red lands a punch. The fighter in blue throws another kick. They exchange punches. The fighter in red lands a right hook, knocking the fighter in blue down. The referee begins the count, and the fighter in blue struggles to get up. The fighter in blue gets up at the count of eight. The referee checks on him and allows the fight to continue. The fighter in red lands a combination of punches. The fighter in blue kicks. They exchange punches and kicks. The fighter in blue throws a leg kick, and the fighter in red lands a hard right hook that knocks the fighter in blue down again. The fighter in blue lies on his side, with one arm draped over the bottom rope. The referee counts him out, waving his hands over the fighter in blue's body at the count of ten. The referee raises the hand of the fighter in red. People enter the ring to congratulate the fighter in red. The fighters embrace.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion does the man in red express by kneeling in the latter part of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. He is feeling sad.",
+ "B. He is angry.",
+ "C. He is excited.",
+ "D. He is feeling regretful."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did the referee blow a whistle in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. None of the above.",
+ "B. No, the referee did not.",
+ "C. Yes, the referee did.",
+ "D. It is not clear."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Was a whistle sound present in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. None of the above.",
+ "B. Yes, it was.",
+ "C. No, it was not.",
+ "D. I do not know."
+ ]
+ }
+ },
+ "iUspVOQu": {
+ "video_id": "iUspVOQu",
+ "video_duration": "99s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black hoodie with a colorful logo is shown sitting on a bed with white bedding. They are inside a vehicle with windows. The person stands and moves to the front of the vehicle. The camera shows a bed with an eyeglasses design on the bedding, a book and stuffed monkey, all on the bed. The person is sitting on the bed by the window. The camera shows a lantern and heater. The person is again seen on the bed. The camera moves to a wall with a guitar hanging on it, then pans to the bed and the person sitting on it. The camera pans to the view outside a window. The camera is back inside showing the person by the bed. The camera pans over to a sink and an appliance. The person is seen in the front of the vehicle, then the stairs and then back to the person talking. The person moves to the driver\u2019s section and looks at the camera. The camera shows the driver\u2019s section.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where is the scene located at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. At the driver's door.",
+ "B. At the door of Jojo's bedroom.",
+ "C. At the doorway to the man's bedroom.",
+ "D. At the doorway to their daughter's room."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the last bedroom located in the bus as shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Below the driver's cabin of the bus.",
+ "B. Next to the toilet in the bus.",
+ "C. Above the driver's cabin of the bus.",
+ "D. Next to the shower room in the bus."
+ ]
+ }
+ },
+ "SCDOZbro": {
+ "video_id": "SCDOZbro",
+ "video_duration": "95s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person wearing a dark jacket over a plaid shirt is shown, speaking directly at the camera. The background is a building interior with columns and a white piano. The scene transitions to a graphic with the words \"BRAWL NEWS\" overlayed on a globe design. A second person is then shown speaking in front of a microphone in a small framed box with a blue background with black lines, while the globe design from the prior scene is on the left side of the screen. The screen transitions again to a circular logo with the letters \"KT\" inside it, and the words \"KAIZO\" below the logo. The video returns to the first person, wearing the same clothing, speaking directly at the camera in front of a window. This person then laughs. Lastly the video transitions back to the second person from before speaking in front of a microphone in a small framed box with a blue background with black lines and the globe design on the left side of the screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the waiter walk into the frame?",
+ "answer": "D",
+ "candidates": [
+ "A. Three times.",
+ "B. Two times.",
+ "C. Zero times.",
+ "D. One time."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the microphone relative to the man in the news broadcast scene in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Above the man.",
+ "B. To the left of the man.",
+ "C. To the right of the man.",
+ "D. Below the man."
+ ]
+ }
+ },
+ "KATPzeYA": {
+ "video_id": "KATPzeYA",
+ "video_duration": "131s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman with long red hair pulled into a ponytail is in the foreground, wearing a pink top and a dark-colored, horizontally striped sweater and a white headband. She is holding up various objects to the camera. She presents a wrapped Frozen-themed electric toothbrush, a Skittles candy cane, a Slinky in a box, a small, round, red container with a cartoon character image, a small container labeled \u201cNum Noms\u201d, a small container labeled \u201cMy Little Pony\u201d, a small set of Princess stickers, a small package of Princess stickers, a package of Trolls blind-bag figures, and a package of bath markers. Behind her is a bed with pillows and a blanket on it. There is a sign on the wall behind her, and also a white lamp. There is a baby bassinet in the background. There are various packages visible on the bed behind her. She frequently moves her hands, and she sometimes makes pointing gestures.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many distinct items did the woman present in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Six items.",
+ "B. Seven items.",
+ "C. Five items.",
+ "D. Eight items."
+ ]
+ }
+ },
+ "pXXalwjw": {
+ "video_id": "pXXalwjw",
+ "video_duration": "121s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A bald man wearing a gray t-shirt is in the driver's seat of a car, facing forward, and looking toward the camera. The camera angle shifts slightly a few times but primarily captures the front of the man's head and upper torso. At one point the man opens his mouth wide, but no other actions take place in this scene. The scene changes to focus on a car's tire and nearby suspension parts. A blue spring and a silver \"TOTAL CHAOS FABRICATION\" bar are visible. The scene changes to show the side of a car tire, with the focus changing to a close-up. The camera moves to show a bare hand touching the tire. The view changes again, focusing on a gray car bumper panel. A hand is shown pointing toward the bumper. The hand then repeatedly touches the panel. The view widens, and the same hand uses a heat gun on the bumper. Then the person is shown using a hammer and a heat gun to bend the bumper panel. The view changes, and the hand and hammer are used to bend the panel, while the camera shifts again. The person continues using the heat gun and hammer, while the camera moves into a close-up view of the panel being hit. The camera changes to a view of the damage on the bumper panel, and a hand touches the bent panel. The camera then moves to show the damage from below and to a view of the ground.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What type of music is played near the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Country music.",
+ "B. Hip-hop music.",
+ "C. Rock music.",
+ "D. Pop music."
+ ]
+ }
+ },
+ "guJYKZtb": {
+ "video_id": "guJYKZtb",
+ "video_duration": "580s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a hand holding a blue can of Nivea Body Mousse. A woman with a colorful head wrap and a black shirt appears, facing the camera, moving her hands in a variety of gestures. At times, she uses her hands to point down, and at other times she opens and closes her palms. She looks up, down, and sideways. She then picks up the blue can of Body Mousse, and points at it. She spins the can around and gestures at the label. She then touches the can top and shakes it. She applies some cream to her arm, rubbing it in with her other hand. She points to the tube, and back to her arm, and proceeds to rub the cream in with her hands. The woman picks up the tube of lip balm and shows it to the camera. She opens it, pushing the balm up, and then places it down. She returns to holding the can, turning it around in her hand and removing the cap, then shaking the can. She holds the can up and dispenses the product into her hand. She shows the cream to the camera, and moves it around with her fingers. She applies the cream to her arm and proceeds to rub it in. A text box appears over her face, stating: \"Thanks For Watching\", then a hand comes into the shot obscuring the text. At the end, the screen turns to a pure white background with only the words \"Thanks For Watching\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occurred to the body mousse applied to the woman\u2019s arm in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It changes from a solid to a gas.",
+ "B. It changes from a liquid to a gas.",
+ "C. It changes from a liquid to a solid.",
+ "D. It changes from a solid to a liquid."
+ ]
+ }
+ },
+ "OtZWpbON": {
+ "video_id": "OtZWpbON",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A soccer game is in progress. A player in a blue and red jersey passes the ball to a teammate. The teammate attempts a bicycle kick, but misses. The ball goes out of bounds. The players reset. A player in a blue and red jersey gets the ball and passes to a teammate in the penalty box. The teammate shoots and scores. The player who scored celebrates with his teammates. The game restarts. A player in black and white passes to a teammate. The teammate passes it back. A player in black and white dribbles the ball down the field, passes to a player in blue and red, who then passes to another player in blue and red. The player in blue and red runs and shoots but misses.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action occurred between player number 10 and player number 11, both wearing red and blue soccer uniforms?",
+ "answer": "C",
+ "candidates": [
+ "A. Player number 10 assists player number 11 in scoring a goal.",
+ "B. Player number 10 prevents player number 11 from scoring a goal.",
+ "C. Player number 11 assists player number 10 in scoring a goal.",
+ "D. Player number 11 prevents player number 10 from scoring a goal."
+ ]
+ }
+ },
+ "tHHgIQdS": {
+ "video_id": "tHHgIQdS",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman in a dark military uniform runs across a metal gantry. Smoke and orange light fill the background. She kneels by a shirtless man lying on the gantry. She holds his face in her hands, her expression distraught. The scene shifts to a man in a light vest and tie, standing at a control panel with dials and gauges. A mustached man in a military uniform, wearing glasses, points emphatically with one arm. The woman continues to look at the shirtless man with concern. The man in the vest gestures with open hands. The woman looks at a futuristic pod nearby. Three blue vials are inserted into the pod's base. The woman climbs into the pod, and it closes. The pod lifts into the air. The man in the vest pulls a lever at the control panel. The pod launches into the sky, leaving a beam of light behind it. The shirtless man on the gantry opens his eyes. The pod continues to ascend.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action did the bearded man take towards the man in military uniform?",
+ "answer": "C",
+ "candidates": [
+ "A. He orders him to activate the machine.",
+ "B. He orders him to retreat.",
+ "C. He refuses the orders given to him.",
+ "D. He orders him to protect the woman."
+ ]
+ }
+ },
+ "xemYLyCF": {
+ "video_id": "xemYLyCF",
+ "video_duration": "67s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video opens on a black screen with the BBC logo in the top left corner. A man wearing a green jacket and orange life vest appears, standing on a boat. He is holding his hand up and moving it around. In the background is a white lighthouse in the middle of the sea. The man talks while moving his hands. He gives a thumbs up. The man continues to talk while moving his hands, and he makes a small fist with one hand. The camera moves to a wider angle showing the white lighthouse in the water, with a small boat in the distance. The camera pans to focus on the water around the lighthouse. The camera moves back to the man, who is now moving his hands again while talking. The camera focuses on the water around the base of the lighthouse. The camera zooms out and again shows the white lighthouse in the water, with a small boat in the distance. The camera is at a different angle this time. The camera is now looking down on what appears to be the top of a large metal structure in the water. The camera again moves to a wider angle showing the white lighthouse in the water, with a small boat in the distance. The camera focuses on a man on a white boat. The man is wearing a green jacket and orange life vest, and he is standing on a white boat. In the foreground is a black object. The camera zooms out showing the man on the boat from a wider angle. The camera moves to an angle where the man on the boat can be seen with the lighthouse in the background. He is using a remote control device. The video shows the lighthouse again, then the man operating the remote control from the boat, and the lighthouse again, and the man with the remote control. The camera moves to an angle showing the lighthouse from afar.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the drone take off?",
+ "answer": "B",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the end of the video.",
+ "C. Throughout the video.",
+ "D. At the beginning of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Why is a lighthouse often built in the vast open sea?",
+ "answer": "B",
+ "candidates": [
+ "A. To avoid too many people standing on it.",
+ "B. To prevent it from being submerged during high tides.",
+ "C. To make the sea look more appealing.",
+ "D. To better guide ships."
+ ]
+ }
+ },
+ "lkFZqErL": {
+ "video_id": "lkFZqErL",
+ "video_duration": "599s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a light green background. In the center is the title \"KHICHI Beauty\" in dark green letters. Underneath the title, a silver stylized line scrolls from left to right. Next, a montage of pictures appears, showcasing the steps involved in preparing and applying the facial mask. In one picture, a woman with painted acrylic nails blends ingredients in a bowl with a wooden chopstick. Another image shows a blender containing a pinkish substance. Finally, there's a picture of a woman applying a yellow mask to her upper lip area. The backdrop is a corkboard with several images attached to it, including a world map. A woman with brown shoulder-length hair and purple eyeshadow sits against a white background. Her acrylic nails are painted in various colors. A \"Subscribe\" button is displayed underneath her, along with a cartoon hand pressing it and a bell icon next to it. The video then transitions to a top-down shot of a woman's hands preparing the ingredients. Her multicolored acrylic nails are visible. She holds a red onion on a white cutting board. Next to the board are a tomato, a bowl of gram flour, and another bowl containing turmeric powder. A small blender sits near the board. She cuts the tomato in half and then chops it into smaller pieces. Next, she peels the onion, cuts it in half, and chops it. Both the tomato and onion pieces are put into the blender, and the mixture is blended until it becomes a pink liquid. This pink mixture is poured into an empty bowl. The woman adds gram flour to a clean bowl using a metal spoon, then adds turmeric powder using another metal spoon. Using a wooden chopstick, she blends the gram flour and turmeric powder together until a thick, yellow paste starts to form. She then adds the pink blended tomato and onion mixture to the yellow paste and mixes them together with the chopstick. The woman, with minimal makeup and colorful acrylic nails, applies the yellow paste to her upper lip, creating a mustache shape. She keeps the mask on for an unspecified time, and then wipes it off with a cotton pad. Finally, the green title screen reappears with the title \"KHICHI Beauty\", scrolling silver line, and subscription information.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "After applying the facial mask, does the woman with light brown curly hair in the video still have visible hair on her face?",
+ "answer": "A",
+ "candidates": [
+ "A. No, there is no visible hair.",
+ "B. Yes, there is still visible hair.",
+ "C. It is not clear from the video."
+ ]
+ }
+ },
+ "kdoJfMHs": {
+ "video_id": "kdoJfMHs",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man with curly dark hair and a beard appears in a yellow hoodie. He then switches to a bright blue hoodie. He speaks to the camera, gesturing with his hands. He points, claps, rubs his chin, and raises a finger. He switches back to the yellow hoodie, continues speaking and gesturing, including bringing his hands together in a prayer-like gesture and giving a salute. He switches back to the blue hoodie, makes surprised expressions by putting his hands on his head and widening his eyes, then back to the yellow hoodie. He continues switching between hoodies, speaking and using various hand gestures throughout the video, maintaining eye contact with the camera.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many instances of audience cheering are present in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Five instances.",
+ "B. Eight instances.",
+ "C. Seven instances.",
+ "D. Six instances."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "In what ways do the two men express their love for religion?",
+ "answer": "C",
+ "candidates": [
+ "A. Participating in question-and-answer sessions.",
+ "B. Reciting religious doctrines.",
+ "C. Engaging in a rap battle.",
+ "D. Reading famous quotes aloud."
+ ]
+ }
+ },
+ "nYpgTvdw": {
+ "video_id": "nYpgTvdw",
+ "video_duration": "69s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Editorials",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person is centered in front of a computer screen displaying a web page. The person has short dark hair and a dark beard, and is wearing a dark blue collared shirt and a dark neck tie. The person's mouth moves, and they look around while speaking. At times the person also points their finger. The web page has text on it that includes \"Cornell Law School,\" \"Legal Information Institute,\" \"18 U.S. Code \u00a7 1028A - Aggravated Identity Theft.\" There is also a TikTok icon and username at the bottom right.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What changes are expected in cases of identity theft after the video is published?",
+ "answer": "A",
+ "candidates": [
+ "A. Cases will decrease.",
+ "B. Cases will first increase, then decrease.",
+ "C. Cases will remain unchanged.",
+ "D. Cases will increase."
+ ]
+ }
+ },
+ "GbfPBflF": {
+ "video_id": "GbfPBflF",
+ "video_duration": "219s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with a shaved head and a beard sits in front of a microphone. There are two microphones in the video. The man is wearing a black shirt. Text appears on screen that says \"BECAUSE YOU'RE RIGHT, THERE'S TOO MUCH BIAS IN THE WORLD.\" The scene cuts to two men standing and facing the same direction, while performing a sequence of actions. One man is wearing a red jacket and red pants, and the other is wearing a brown robe. Text appears on the screen which says \"THEN A VIDEO OF ANDREW TATE PRAYING WENT VIRAL\" with the text appearing across a red overlay. The actions of the two men are depicted. The scene shifts back to the two men performing the sequence of actions, this time with a blue tone overlay. The video returns to the two men performing the sequence of actions without a color tone overlay. The scene shifts to text on the screen, with a red tone overlay, that says \"THERE WAS MUCH SPECULATION THAT HE HAD ACCEPTED ISLAM\". This scene is repeated multiple times. Text appears on the screen that reads \"A FEW DAYS LATER, ANDREW TATE OFFICIALLY ANNOUNCED HIS CONVERSION\" multiple times. The scene shifts back to the man with a shaved head and a beard, who is now wearing sunglasses and a blue shirt, speaking. Text appears on the screen that says \"THIS IS ALL HARAM, BROV.\" This sequence of the man speaking and the text appearing on the screen is repeated multiple times. The scene shifts to the text on screen reading \"WAIT, ARE YOU MUSLIM?\". The text then changes to \"I'M MUSLIM, YEAH!\". The scene shifts to text appearing on the screen that reads \"WELCOME TO THE NEW MUSLIM TATE\" multiple times. The scene changes to two split images, one with a red tone, and one that is grayscale, both images depict the man with a shaved head and beard talking, with the text \"IN A MORE RECENT PODCAST, ANDREW HIMSELF EXPLAINS WHY HE CONVERTED\" being placed between the two images. This sequence is repeated multiple times. The scene changes to the man with a shaved head and beard talking while wearing sunglasses. Text appears on screen that reads \"I WAS BORN IN A CHRISTIAN COUNTRY, I WAS RAISED AS A CHRISTIAN, AND I'VE ALWAYS BEEN VERY RESPECTFUL OF ISLAM, AND IT'S BECOME MORE AND MORE OBVIOUS TO ME AND MORE AND MORE PERTINENT THAT ISLAM IS THE LAST RELIGION ON THE PLANET\". The next scene is an image of an open book with purple beads lying alongside of it. Next, a hand holding a blue book is depicted, before the scene changes back to the man with a shaved head and beard speaking into a microphone. Text appears on the screen that says \"I FEEL LIKE THERE'S NO OTHER RELIGION, PEOPLE SAY TO ME, WHY DID YOU CONVERT? I SAID, I DON'T REALLY FEEL IT AS A CONVERSION, IT'S ALMOST LIKE I KNEW GOD WAS REAL, AND NOW I'VE BECOME RELIGIOUS\". The scene changes to a man wearing a black shirt and gray pants walking. The text \"AND ALSO ISLAM VERY CLOSELY REFLECTS MY PERSONAL BELIEFS\" is displayed on the screen. The scene changes to two men sitting in a car while they are talking. The text \"THROUGH MY PERSONAL LIFE, I'VE LEARNED THAT IF YOU DON'T HAVE STANDARDS AND YOU'RE NOT A STRONG PERSON WHO'S PREPARED TO DEFEND HIS IDEAS, YOU'LL BE CRUSHED.\" appears on the screen. The scene changes back to the man with a shaved head and beard, speaking into a microphone with the text \"GOD TO ME IS SOMETHING TO BE FEARED, GOD TO ME IS SOMEONE THAT PEOPLE ARE AFRAID TO MOCK, GOD TO ME IS SOMEONE THAT YOU HAVE TO GO OUT OF YOUR WAY TO PROVE SOMETHING TO, GOD TO ME HAS RED LINES, LIKE, GOD TO ME REPRESENTS THE ISLAMIC FAITH. THE CHRISTIAN GOD, TO ME, I DON'T SEE GOD. TO ME IT WAS THE ONLY LOGICAL CHOICE IN THE END.\" The scene shifts to the man with a shaved head and beard speaking, with an overlay effect. The text reads \"THERE WAS A TIME WHEN I WAS ATHEISTIC, AND THE REASON I AM NOW SO ABSOLUTELY CERTAIN THAT GOD IS REAL IS BECAUSE I'VE SEEN EVIL, I'VE SEEN SHAYTHAN, I'VE SEEN IT. WHEN YOU SEE ENOUGH EVIL, YOU REALIZE THAT THERE MUST BE AN EQUAL AND OPPOSITE FORCE, AND THERE ARE PEOPLE OUT THERE IN THE WORLD TODAY DOING THE WORK OF THE DEVIL, GENUINE DEMONS, WHO ARE TRYING TO DESTROY THE BASELINE MORALITY THAT'S INSIDE OF ALL OF US. WE'RE ALL BORN WITH SOME KIND OF MORALITY AND THEY'RE TRYING TO DESTROY IT. AND THIS IS WHY PERHAPS I FOUND GOD THE WAY I DID, BECAUSE I UNDERSTOOD ALL THESE THINGS FIRST, AND THEN I SAW THE QURAN, AND IT CONFIRMED SO MANY THINGS FOR ME, EVEN THE CONVERSATIONS I'VE BEEN HAVING SO FAR, SO MANY THINGS HAVE BEEN CONFIRMED AND IT'S AMAZING THE KNOWLEDGE THAT'S INSIDE OF IT, WHICH IS SO APPLICABLE TODAY FOR AN OLD BOOK, RIGHT? IT'S SUPPOSED TO BE OLD, BUT IT SEEMS SO TIMELESS, BUT IT'S TRULY AMAZING. I THINK WHEN YOU HAVE NEGATIVE EXPERIENCES, AND I'M TALKING ABOUT TRULY NEGATIVE EXPERIENCES, I'VE HAD A FEW WHERE YOU COME CLOSE TO LOSING YOUR LIFE, OTHER PEOPLE WHO'VE COME CLOSE TO LOSING THEIR LIVES, NOBODY'S AN ATHEIST AT THE END, MY FRIEND. ATHEISTS STOP BEING ATHEISTS QUICKLY WHEN THE GAME GETS REAL! SO ALL OF US INNATELY UNDERSTAND IN GOD, THE MOST ATHEIST MAN ALIVE WILL PRAY IF HE REALLY NEEDS HELP. AND ALSO, IF WE'RE GOING TO TALK ABOUT HAPPINESS, WE'RE HERE TO BE HAPPY AND WE'RE HERE TO ENJOY OURSELVES, IT'S THROUGH GOD THAT YOU'RE HAPPY AND ENJOY YOURSELF. IT'S NOT THROUGH THAT BLACK HOLE OF HEDONISM, IT'S NOT THROUGH ENDLESS P**** AND SUPERCARS AND BOOZE AND... IT'S NOT THOSE THINGS. YOU CAN FIND ANY MAN WHO GOES DOWN THAT HOLE AND TELL ME, ARE THEY HAPPY? THE ANSWER IS, NO! WHO'S THE MAN WHO'S HAPPIEST? THE MAN WHO IS PRAYING. IF YOU REMOVE GOD FROM SOCIETY COMPLETELY, IF YOU CLICK YOUR FINGERS AND HE'S GONE, AND IF YOU COMPLETELY REMOVE RELIGION FROM EARTH, YOU'LL SEE THE DEVIL APPEAR PRETTY QUICKLY, YOU'LL SEE MASS BREAKDOWN OF SOCIETY, YOU'LL SEE RAPE, YOU'LL SEE MURDER, YOU'LL SEE THEFT, AND THEN YOU'LL UNDERSTAND THAT GOD IS A REAL FORCE. I JUST TRULY RESPECT AND UNDERSTAND THAT GOD IS REAL, AND I TRULY RESPECT AND UNDERSTAND THAT, WITH MY OWN EYES, WHAT I'VE OBSERVED, THE TRUEST WAY TO SHOW RESPECT TO GOD IS THROUGH ISLAM.\" The scene shifts to the man with a shaved head and a beard talking, with an overlay effect. Text reads \"MAY ALLAH KEEP OUR NEW BROTHER IN ISLAM, ANDREW TATE, STEADFAST ON THE RIGHT PATH\" and shows the two men in the scene praying. The scene changes to text on a black background that reads \"AFGHAN'S ARE SELLING THEIR ORGANS TO FEED THEIR CHILDREN! MILLIONS OF CHILDREN NEED YOUR HELP! DONATE YOUR ZAKAT & CHARITY NOW FOR MULTIPLIED REWARDS! WWW.LAUNCHGOOD.COM/SUPPORTSCHOOLMEALS $160 FEEDS SCHOOL MEALS FOR A CHILD FOR AN ENTIRE YEAR!\" This text appears multiple times.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does Andrew Tate make the praying gesture?",
+ "answer": "C",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Throughout the entire video.",
+ "C. At the beginning of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many different sounds are in the audio?",
+ "answer": "C",
+ "candidates": [
+ "A. Five.",
+ "B. Three.",
+ "C. Six.",
+ "D. Four."
+ ]
+ }
+ },
+ "lizhjFne": {
+ "video_id": "lizhjFne",
+ "video_duration": "62s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A figure with a white head and torso, red pants and green gloves spins in place three times. The figure then walks through an area with red trailers. The video shows a chat log with one message of \"Piggy Ugly.\" The figure moves through an area with a door. A second character with a yellow giraffe head appears on a balcony. The first figure passes a third person with brown hair. The giraffe figure picks up the white figure and the two disappear through a door. The video displays another chat log which has some text along with another \"Piggy Ugly\" message. The video cuts to the white figure in an isolated location standing still and turning slowly. The video again shows a multicolored static screen. A blurry area with the white figure walking in it is shown, and the figure says \"I'm coming for you\". The video shows a blurred area with multiple characters, including the white figure. The text \"#ENDGANG\" is displayed. The video then cuts to a man in a black shirt and yellow gloves in front of a work bench, holding up a metal can, which he shakes. Finally the video displays the white figure in another isolated location, standing still and then turns. The video ends with multicolored static.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "When the bald man reappeared, what was the color of the food truck in front of him?",
+ "answer": "B",
+ "candidates": [
+ "A. Yellow.",
+ "B. Blue.",
+ "C. Red.",
+ "D. Black."
+ ]
+ }
+ },
+ "toBEOsqF": {
+ "video_id": "toBEOsqF",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Two characters in green and red outfits walk along a road. One character is pulling the other in a blue cart. They go to a building. One of the characters enters the building while the other waits outside. The second character then enters. They are both briefly shown using emotes. The first character then leaves through the back, followed by the second. The first character then enters a small adjacent building and opens a door. The camera then moves to behind the character who was following and shows him walk into the main building through an opening, then to the doorway of the small building. The first character then reenters the main building, then the second character reenters. The two characters interact with each other, facing off using emotes and weapons. The first character then walks away. The first character then is shown opening a door, and going outside to an area with a fenced in area.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the man in the cape do to the man with the tomato-shaped head?",
+ "answer": "C",
+ "candidates": [
+ "A. Introduced the restaurant's layout and design.",
+ "B. Tried to engage the other person in a friendly sparring match.",
+ "C. Attacked and drove the other person away.",
+ "D. Took out a pizza and offered it to the other person."
+ ]
+ }
+ },
+ "KtxYIObV": {
+ "video_id": "KtxYIObV",
+ "video_duration": "56s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A soccer game is taking place on a green field. The majority of players are wearing blue jerseys, while others are wearing light blue and maroon. At one end of the field is a white net, and there are white lines drawn throughout the field. The ball is passed around between the players. A player in yellow makes a diving save, stopping the ball from entering the net. A second player wearing blue scores a goal and raises both arms in the air. Two more players in blue jerseys join him. The camera pans to the audience in the stadium where people are celebrating. The camera switches back to the players on the field. More players in blue join in the celebration and group together. The field is shown from a wide angle with players spread throughout. Players continue moving the ball between each other. Players in blue jerseys celebrate again. The ball is passed in the center of the field and a player in maroon heads the ball into the goal. The player in yellow jumps to try and block the ball. The ball enters the net. The camera shows a player in blue being hugged by two other players in blue.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What might be the profession of the man wearing a black jacket with letters in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Referee.",
+ "B. Coach.",
+ "C. Photographer.",
+ "D. Reserve player."
+ ]
+ }
+ },
+ "MwapBRIc": {
+ "video_id": "MwapBRIc",
+ "video_duration": "68s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a first-person perspective of a wooden ship deck, with cannons visible along each side. There is a tall central mast and a series of ropes connected to it. The ship appears to be on the ocean, with islands and other distant landmasses visible. The camera moves from one side of the deck to the other, showing more of the ship. A sword is visible in the foreground in several shots, and then later a hand holding a sword is visible. At one point, the camera shows a view from a cannon, looking out over the water. The camera continues to move around the ship deck, showing different angles of the cannons, mast and ropes. The video ends with the camera panning around the ship before returning to a first-person perspective.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Which part of the video shows the person holding the ship's wheel?",
+ "answer": "C",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. Throughout the entire video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the character perform a sword stabbing attack in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Eight.",
+ "B. Fourteen.",
+ "C. Ten.",
+ "D. Twelve."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the cannon used for in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Attacking the opponent's pirate ships.",
+ "B. Attacking the monster in the sea.",
+ "C. Grabbing the cannon to move quickly.",
+ "D. Cancelling the penalty for missed sword thrusts."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What type of music appears in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Heavy metal music appears.",
+ "B. Rock music appears.",
+ "C. Pop music appears.",
+ "D. Classical music appears."
+ ]
+ }
+ },
+ "VYnCEnoD": {
+ "video_id": "VYnCEnoD",
+ "video_duration": "393s",
+ "duration": "6-8min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black jacket, white pants, and a black top hat is riding a brown horse in a sandy arena. The arena has a white, wavy border and is surrounded by advertisements. There are also potted green plants and small, white geometric structures with letters on them. The horse and rider perform a series of movements that include walking, trotting, and cantering. There are multiple segments of text appearing in the lower-left corner of the screen displaying a number, a percent sign, a phrase describing the movement, and a number. The video ends with the camera panning to show a large crowd of people in an elevated seating area who are clapping. The rider then tips their hat to the camera.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the contestant's final score?",
+ "answer": "C",
+ "candidates": [
+ "A. 82.5.",
+ "B. 81.196.",
+ "C. 82.783.",
+ "D. 83.37."
+ ]
+ }
+ },
+ "ygcWTUrg": {
+ "video_id": "ygcWTUrg",
+ "video_duration": "79s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A three-dimensional, animated character with a cowboy hat and outfit appears with arms raised. The camera angle shifts to look upwards at a blue sky with clouds, then to two figures in a dimly lit area, one figure with dark hair pulled into a ponytail, and a second with dark hair. The camera angle changes to three figures, a figure with short brown hair and a flower hair accessory, a figure with dark hair pulled into a ponytail, and a figure with short brown hair, all three are facing each other. The camera angle switches to a close up of a figure with short brown hair and a flower hair accessory, then a close up of this figure's face again. The camera angle changes again to a close up of a purple lunchbox with a sun design and a flower design. There is a black screen then a new scene with a classroom with people at tables and adults standing around. The scene changes to show hands in front of a blurry, lit up background, followed by a scene of a carousel and fair ride at night. The camera angle changes to show a close up of the face of an animated character with a cowboy hat, then the camera angle changes to the side of the head of the same character. The camera view changes to an animated character with a cowboy hat looking down, then a dark screen. Red text on a black background appears, reading \"A KILLER NEW SERIES,\" then the view changes to the side of the head of an animated character with a cowboy hat in an outdoor area. The camera view switches to the animated character in an indoor setting looking at a shelf. Then, the camera angle switches again to an animated character with short brown hair, who is holding an object and looks surprised. The scene switches to an animated character with a cowboy hat, who is positioned looking down into a red grill. Then, the camera angle changes to an animated character with short brown hair standing outside with an animated character with a cowboy hat on the ground. 
Then, the scene changes to the same animated character with short brown hair standing in an outdoor area with several different toys on the ground. The view then shifts to an animated character in a dark setting being lowered, followed by the same character on a television screen with the text \u201cGAME OVER.\u201d The scene switches to the face of the animated character with a cowboy hat in an indoor area, then to hands and feet holding an object, then back to a close up of the face of the animated character with a cowboy hat. The scene shifts to an outdoor road at night with a vehicle driving toward the camera, then to a shop window with a \"CLOSED\" sign, then to a fair scene at night with multiple figures. The scene shifts to a character with a cowboy hat inside of a vehicle, then the scene shifts to a door with an animated figure behind the door. A scene then switches to an animated figure running in a dark indoor space, followed by the view of a lit up indoor space. The camera view shifts again to a dimly lit room with a figure standing in the doorway. The scene shifts to a girl with a small purple dress in a brightly lit space, then the view changes to a figure peeking through blinds. The scene changes again to a figure with short brown hair and a purple dress holding a hat. Then, the camera angle shows an indoor space with two figures looking away from the camera. The scene shifts to a dimly lit area with a figure with short hair and an orange shirt. The scene shifts to a car driving on a road at night. Then, an animated character with a cowboy hat is shown next to a hand. The scene then changes to an indoor area with shelves holding items and a figure in a doorway. The camera switches to the face of an animated character with a cowboy hat looking up. Then, the camera shifts to the animated character with a cowboy hat walking in a dark area. Another scene of an animated character with short brown hair standing alone in a dimly lit area. 
A figure in an orange outfit is shown hugging themselves in a dark area. Then, a dimly lit image of hay bales is shown. An indoor area is shown with figures behind objects. Then, a figure with short brown hair in blue pants is shown picking up an object. Then, the screen fades to black before \"Disney Pixar Woody\" is shown in red text on a black background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does a relatively long, piercing sound effect occur at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Five times.",
+ "B. Six times.",
+ "C. Four times.",
+ "D. Three times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the doll wearing brown pants and black leather shoes doing with the glass box?",
+ "answer": "B",
+ "candidates": [
+ "A. Release the doll from the box and let it escape.",
+ "B. Pack the doll into a box and take it away.",
+ "C. Pack the doll into a box and trap it.",
+ "D. Pack the doll into a box and throw it away."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What kind of atmosphere does the entire video create?",
+ "answer": "B",
+ "candidates": [
+ "A. A mysterious and suspenseful atmosphere.",
+ "B. A tense and thrilling atmosphere.",
+ "C. A romantic and warm atmosphere.",
+ "D. A joyful and lively atmosphere."
+ ]
+ }
+ },
+ "SohqvGtQ": {
+ "video_id": "SohqvGtQ",
+ "video_duration": "122s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A stylized animation of a mountain range shows the sun setting and a crescent moon rising. The words \"Evening Routine\" appear in yellow.\u00a0The video then cuts to a woman sitting at a vanity.\u00a0 She is looking at herself in a mirror and waves to the camera. On the vanity, there are makeup brushes in a glass jar, two makeup sponges in another glass jar, a hairbrush, and a circular magnifying mirror.The woman picks up a bottle of Human Nature Sunflower Beauty Oil and displays it to the camera. She then pumps some oil onto her fingers and applies it to her face.Next, the woman shows a tube of Cosrx Salicylic Acid Daily Gentle Cleanser to the camera. She washes her face off-screen and then pats it dry with a light blue washcloth.She picks up a bottle of anti-acne mild astringent with vitamins C and E, opens it, pours some onto a cotton pad, and applies it to her face.\u00a0The woman then displays a bottle of green tea BHA pore tightening serum, applies it to her face, and rubs it in.She then shows the camera a large container of Nature Republic Aloe Vera 92% Soothing Gel. She applies the gel to her face. She waves goodbye to the camera.The text \"Once a week\" appears in pink. A woman with a white headband has two lemon slices over her eyes.The woman at the vanity then shows a container of St. Ives Apricot Scrub Blemish Control, opens it, and displays the scrub inside.\u00a0Next, sheet masks in individual packaging are shown to the camera.Finally, the text \"Once a month\" appears in blue. The video displays a picture of a person receiving some kind of facial treatment at a spa. A gloved hand holds a device to the woman's forehead.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "What does the woman in the video use to style her hair?",
+ "answer": "B",
+ "candidates": [
+ "A. A curling iron.",
+ "B. I am not sure.",
+ "C. A hairdryer.",
+ "D. A comb."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video is aloe used?",
+ "answer": "C",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Throughout the entire video.",
+ "C. At the end of the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "glJBTyCu": {
+ "video_id": "glJBTyCu",
+ "video_duration": "132s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Two female fighters, one with braided hair and a black sports bra and shorts, the other with curly hair and a dark sports bra and patterned shorts, are in a boxing ring. The referee stands between them, signaling the start of round 3.\u00a0 The fighter in the black sports bra throws a punch, and the two fighters grapple, exchanging punches and knee strikes. The fighter in patterned shorts pushes the other fighter against the ropes and delivers knee strikes to her midsection. The referee separates them, and they resume fighting. The fighter in black lands a punch, and the fighters clinch again, with the fighter in patterned shorts landing a knee.\u00a0 They separate briefly, exchanging punches and kicks. They clinch again against the ropes, exchanging knees, with the fighter in black sports bra and shorts pushing her opponent along the ropes. They separate, circle, and the fighter in patterned shorts delivers a kick. The fighters continue to exchange punches, kicks, and knees, clinching several times against the ropes. The fighter in the patterned shorts is pushed against the ropes and bends over slightly, seemingly exhausted, but recovers and continues fighting. The round ends with the fighters clinching. The referee separates them and raises the arm of the fighter with the curly hair.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Where in the video does the referee separate the two contestants by hand?",
+ "answer": "A",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. In the middle of the video.",
+ "C. Throughout the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is a coach likely to do when holding a boundary rope?",
+ "answer": "C",
+ "candidates": [
+ "A. Separate the athletes to begin the competition.",
+ "B. Adjust the athlete's movements.",
+ "C. Pull the athlete back inside the boundary.",
+ "D. Regulate the athlete's emotional state."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Which instrument's sound can be continuously heard in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Erhu.",
+ "B. Suona.",
+ "C. Guzheng.",
+ "D. Pipa."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What move did the curly-haired mohawk use to deliver a powerful strike to the opponent?",
+ "answer": "A",
+ "candidates": [
+ "A. A straight punch.",
+ "B. A left hook.",
+ "C. A combination punch.",
+ "D. A hook punch."
+ ]
+ },
+ "task4": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What round of the competition is shown at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Round 1.",
+ "B. Round 4.",
+ "C. Round 3.",
+ "D. Round 2."
+ ]
+ }
+ },
+ "ZtrwQQkV": {
+ "video_id": "ZtrwQQkV",
+ "video_duration": "656s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Yellowish-orange smoky ink is shown on a white background. Text appears on the screen reading \u201cMondo Sakura with Copic Bokeh Background\u201d and underneath \u201csandy allnock.\u201d Next a clear plastic package appears with a flower design on it along with the words \u201cClear Stamps, YOU beautiful, Inside & Out.\u201d The package is on a white surface and a pair of hands with red painted nails is holding it on either side. A flower-shaped stamp is placed on a device with a red and gray frame. The stamp is pressed down on a piece of paper. A cell phone appears next to the device and is showing a Google search for \u201cSakura flower.\u201d The screen shows different photos of flowers and the presenter's finger scrolls through them. A few areas on the stamped image are painted light pink with the label \"RV52\" over it. Other parts are filled with light pink under labels of \"RV02,\" \"RV10\" and \"RV21.\" Some areas are painted brown under the label \u201cRV32.\u201d The image is then completely filled in with various shades of pink and brown with corresponding color numbers labeled \u201cRV10,\u201d \u201cR56,\u201d \u201cRV11,\u201d \u201cRV39,\u201d \u201cYG63,\u201d \u201cE49,\u201d \u201cRV52,\u201d \u201cE04,\u201d and \u201cRV34\u201d. A clear liquid is applied on top in a circular motion. The card, along with the plastic packaging of the stamp, is placed together. The screen shows a yellow tinted frame with the words \u201csandy allnock\u201d and a yellow geometric shape. The screen then changes to show \u201csandy allnock\u201d with various gray social media icons.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many colors does the woman use to paint the cherry blossoms in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Nine colors.",
+ "B. Ten colors.",
+ "C. Five colors.",
+ "D. Seven colors."
+ ]
+ }
+ },
+ "aWSpUGYj": {
+ "video_id": "aWSpUGYj",
+ "video_duration": "77s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with gray hair and glasses sits at a table. They wear a green cardigan and a pearl necklace. They look down and frown. The same person, now with blond hair and black hoop earrings, sits in a chair and says \u201cPshh\u201d. The first person speaks and says \u201cShe looks like a croquette.\u201d The second person says \u201cYou look like a stuffed potato ball.\u201d The first person says \u201cYou look like a stuffed potato ball\u201d. The second person says \u201cBecause I\u2019m sitting next to you.\u201d The second person says \u201cSkeletor.\u201d The first person says, \u201cI only starve for the Lord.\u201d The second person looks down and fiddles with a patterned bag, then with a clear bag. The first person looks at the camera, frowns, and says, \"To remind.\" The second person puts dark smudges on their forehead and says \u201cMy Lent. I don't need a reminder to not break my Lent.\u201d The first person says \u201cI don\u2019t need a reminder to not break my Lent.\u201d The second person says \u201cThat is how focused I am in my love of Christ.\u201d The first person repeats the same phrase. The second person says \u201cI\u2019ll have you know that I have sacrificed a lot this lent.\u201d The first person says \u201cYour sense of fashion?\u201d The second person looks down and fiddles with the clear bag. 
The first person says \u201cWhat a sacrifice.\u201d The second person says, \u201cI sacrificed the WhatsApp.\u201d The first person says \u201cI sacrificed my soap operas.\u201d The second person says, \u201cMy sandals.\u201d The first person says \u201cPalmolive soap.\u201d The second person says, \u201cCharmin toilet paper.\u201d The first person says, \u201cBrushing my dentures.\u201d The second person says \u201cChacal de la Trompeta fan fiction.\u201d The first person says \u201cChacal de la Trompeta fan fiction.\u201d The second person says, \u201cAvocados.\u201d The first person says, \u201cCuban bread.\u201d The second person says \u201cCompression socks.\u201d The first person says, \u201cPeaches.\u201d The second person says, \u201cAll of my medication.\u201d The second person coughs. The first person coughs. They say, \u201cRaaagh! Ugh!\u201d The second person coughs. A dog is also visible at one point, laying in a dog bed.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action by the woman in green clothing initially rendered the woman in the leopard-print shirt speechless?",
+ "answer": "C",
+ "candidates": [
+ "A. Asks the other person to watch a five-minute movie.",
+ "B. Offers the other person only green vegetables to eat.",
+ "C. Says the other person resembles a \"filler dad.\"",
+ "D. Buys the other person an inexpensive bracelet."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "How does the woman in green express her emotions when discussing faith with the other person in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. An uneasy expression.",
+ "B. A disdainful expression.",
+ "C. An excited expression.",
+ "D. A terrified expression."
+ ]
+ }
+ },
+ "ppeQAzxG": {
+ "video_id": "ppeQAzxG",
+ "video_duration": "34s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person wearing a helmet with a blue and yellow stripe is looking out from behind a gray wooden box. A military vehicle with a Russian flag is shown moving forward. A person wearing a helmet is holding a gun, and the person is in focus. Multiple large wheels of a vehicle are seen moving forward on dirt. The wheels move past a rectangular opening, behind which a white object is visible. Two people wearing helmets and holding rifles are shown exiting from a vehicle. A military vehicle with people near it is shown moving across a field. A cannon is seen firing, producing a white smoke. Multiple military vehicles are seen in a trench. Two people wearing helmets stand behind a large weapon. Multiple people wearing helmets are running near a military vehicle. Military vehicles are seen from inside a plane. People are shown running near a car with weapons in their hands. A tank is seen firing.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What weapon is shown at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A rocket-propelled grenade.",
+ "B. A machine gun.",
+ "C. A missile.",
+ "D. A tank."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At which point in the video do the soldiers gather?",
+ "answer": "D",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. In the middle of the video.",
+ "C. Throughout the entire video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "mOJeFwbT": {
+ "video_id": "mOJeFwbT",
+ "video_duration": "101s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Sports Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a first-person perspective of a race car on a race track. The car starts on a white grid with a person in white overalls visible in the distance. The race car is yellow and red. The video transitions to a rear view of the race car moving on the track as another car is visible in the distance. Then a side view of a race car is shown with a yellow helmeted driver and the words \"Tom Coronel\" and \"DHL\" on its body. A man is shown sitting in a black racing chair in front of three monitors showing a race track. He has headphones on and is holding a black steering wheel with colorful controls. The video then transitions to a top-down view of the red and yellow race car on a white surface. The view transitions back to the rear of the car on the track. A shot of a man's hands is shown holding a steering wheel with colorful buttons and dials. The view transitions back to the rear of the car on the track. The video then shows the man in the chair in front of the three monitors talking. Then it transitions to a close-up of two feet in white and black shoes on two car pedals. Another view shows the black racing chair with the words \"Eat my Dust\" on a yellow backdrop. The view transitions to the man driving a simulated race car while sitting in the black racing chair in front of the three monitors. Then the view shifts to the red and yellow race car on the track. A text overlay says \"The Brabham BT44\" and the car is visible in its side view. The view transitions back to the man's hands holding the steering wheel. A close up view shows his watch. Then it transitions back to a view of his feet on the pedals and the three monitors showing the simulated race. The view then shows a simulated race car on the track. Then it goes back to showing the man in the chair in front of the three monitors talking, then back to the simulated race cars on the track. The car models include numbers 22, 68, 7, 10, 32, 11, 18, 7, 40, 98, 3, 58, 16, and 5. 
The video ends with the simulated race cars crashing on the track, while text overlays show broadcast information.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video contain a red race car numbered 28?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes, it does.",
+ "B. No, it does not.",
+ "C. I do not know.",
+ "D. None of the above."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Which car is mentioned as the first to appear in the driver lineup?",
+ "answer": "B",
+ "candidates": [
+ "A. The yellow No. 31 race car.",
+ "B. The green No. 22 race car.",
+ "C. The orange No. 7 race car.",
+ "D. The blue No. 98 race car."
+ ]
+ }
+ },
+ "PuzSTnxO": {
+ "video_id": "PuzSTnxO",
+ "video_duration": "40s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a first-person perspective of a video game. The player character is wearing dark and gold clothing with glowing blue accents. The player is in a variety of environments, including a structure with orange walls, a platform with a tan sky background, and a sandy desert area. The player character is seen firing multiple different weapons which emit blue or white beams. Some of the enemies are robots that appear to be black. Other enemies are dark figures with glowing blue accents. The player uses abilities that create glowing blue explosions. A box with text and an image of a blue crystal appears at different times in the video. The text \"TAVIUS PLAYS\" appears in the top of the video and \"FOLLOW FOR MORE!\" appears on the bottom of the video at certain points.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Based on the video, which country is the game most likely developed in?",
+ "answer": "C",
+ "candidates": [
+ "A. France.",
+ "B. Russia.",
+ "C. The United States.",
+ "D. China."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In what scenario can players use the skill of the exotic chest armor?",
+ "answer": "B",
+ "candidates": [
+ "A. On a rock.",
+ "B. Under a pot.",
+ "C. Underwater.",
+ "D. On top of a bridge."
+ ]
+ }
+ },
+ "lUTkThaS": {
+ "video_id": "lUTkThaS",
+ "video_duration": "45s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A woman opens a door and enters a living room. A man enters behind her. A man and a woman stand in a dark room between curtains. He touches her face, and she looks surprised. He kisses her on the cheek, and she looks down. A woman in a black dress opens a door and walks through. A man and woman ride in a black convertible with the top down. The woman has her arms raised. A man sits in a blue racecar while another man offers him a helmet. The race car drives down a track. A man in a black shirt walks through a crowded club. A man and a woman face each other close together in the club. A woman with dark hair sits at a table as two other people reach out and hold her hands. A man with short dark hair leans in and talks to a woman with short blond hair. A young girl with blond hair looks up and smiles while a woman and a man each place a hand on her head. A woman with dark hair looks on, then she bites her thumbnail nervously. A blond woman in a tight, short-sleeved shirt lifts a baseball cap. A man with dark hair and an olive green t-shirt talks to someone off-screen. A woman takes a shower. Another woman with long dark hair stands in a living room looking off-screen. She looks sad.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action occurred between the man and woman in the bar scene located in the middle of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The woman kisses the man.",
+ "B. The man kisses the woman.",
+ "C. The woman embraces the man.",
+ "D. The man holds the woman's chin with his hand."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Is it likely that the bearded man and the dark-haired woman will end up in a relationship?",
+ "answer": "B",
+ "candidates": [
+ "A. It is not known.",
+ "B. It is unlikely.",
+ "C. It is likely."
+ ]
+ }
+ },
+ "IbwpPyJk": {
+ "video_id": "IbwpPyJk",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A light-colored, winged object is seen in a light blue sky. It moves across the frame and comes into sharper focus. The view of the object changes to show the underside of the object. The object moves in the sky, above a ground area, including a body of water. The object moves toward the ground, and its landing gear is visible. The object touches down on a ground surface, and a parachute deploys from the rear. The object continues moving along the ground.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video show a deceleration parachute?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes, it does.",
+ "B. Not applicable.",
+ "C. No, it does not.",
+ "D. It is uncertain."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "In the video, if the airplane's landing gear malfunctions during landing, what is the most likely outcome?",
+ "answer": "C",
+ "candidates": [
+ "A. The plane experiences a bumpy and erratic landing.",
+ "B. The correct outcome is uncertain.",
+ "C. The plane is unable to complete the landing.",
+ "D. The plane achieves a normal, perfect landing."
+ ]
+ }
+ },
+ "YTLkKphC": {
+ "video_id": "YTLkKphC",
+ "video_duration": "69s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with the number 2 in orange, shown on a black background with glitch effects. The number 2 is then shown in the same style, but with the letters \"a\", \"v\", and \"n\" also appearing in the background, which also has glitching effects. The next shot is of four people riding motorcycles around a dirt track, with grass in the center and stands full of people in the background. A label at the top of the video reads \"SGP HEAT 12 LAP 1\". Then the shot cuts to an alternative view of the same scene, and a label appears at the bottom that reads \"2011 TORUN SGP HEAT 12\" The four motorcyclists move around the track. A figure in a grey shirt and pants stands in the foreground. The camera zooms in on the motorcyclists as they continue around the track. The four motorcyclists approach the viewer. The riders are wearing helmets in the colors white, blue, red, and yellow. As they continue to race around the track, the camera angle changes several times. There are several different kinds of flags visible. The text \"SGP HEAT 12 LAP 2\" appears at the top of the screen. There are more views of the motorcyclists riding around the track, with a man in grey shirt and pants standing in the foreground. There is one moment where two of the motorcyclists bump together. The text at the top of the screen changes to \"SGP HEAT 12 LAP 3\". The camera angle follows the motorcyclists as they ride around the track. A text box in the corner of the video lists the names \"T. GOLLOB\", \"G. HANCOCK\", \"D. WARD\", and \"C. HARRIS\". The text at the top of the screen changes to \"SGP HEAT 12 LAST LAP\". The four motorcyclists continue around the track. The video shows a black and white checkered flag being waved over the finish line. The video then shows a motorcyclist with a yellow helmet riding forward, and then raising his fist in the air.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the boy with the white helmet and the boy with the blue helmet do to the boy with the yellow helmet?",
+ "answer": "A",
+ "candidates": [
+ "A. They pushed the boy with the yellow helmet toward the outer fence, attempting to eliminate him.",
+ "B. They trapped the boy with the yellow helmet between them, attempting to eliminate him.",
+ "C. The boy with the blue helmet kicked the boy with the yellow helmet.",
+ "D. The boy with the white helmet kicked the boy with the yellow helmet."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "Are there any surprising actions in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Two athletes are pinning another athlete against the railing.",
+ "B. A motorcycle explodes.",
+ "C. A contestant falls down.",
+ "D. Two contestants are fighting each other."
+ ]
+ }
+ },
+ "LjlMUVxH": {
+ "video_id": "LjlMUVxH",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Art Exhibitions",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a person in a black shirt and pants standing in a room. The walls, floor, and ceiling of the room are off-white, with red string stretching across the space. A ladder, tripod with a camera, and other objects are visible on the floor. Later, the view changes to a room where two ladders stand on either side of the view. Red string lines the ceiling. Another person in a light shirt and pants appears in the room with the red string.Next, a woman in a black shirt and a white necklace is shown. There is a man in a suit standing behind her. There is an orange background. The video shifts to a dark background. A lacy-looking dress is shown on the left-hand side of the screen. The background is dark.Four photos are presented. The photos are in black and white and depict a woman with wet hair, who appears to be in a bathtub. A bright, hazy light is shown. It is a white light with blue tinges.Next, a black and white image of what appears to be hands covered in mud appears on the screen.The woman in the black shirt and white necklace reappears, against the same orange background, speaking. Then the text \u201cChiharu Shiota\u201d appears, with the word \u201cEmbodied\u201d directly beneath it, in red. The video shifts to a room with two people in it, looking at framed pieces of artwork. The video shows a person lying in a mass of red and white string. The video transitions to show a stone building with trees in the background. A large clear structure is seen to the right. Three red dresses are hung in front of the building.The woman in the black shirt and white necklace reappears, against the same orange background, and speaks.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why would the Art Gallery of South Australia seek to become a branding element for both the nation and the city?",
+ "answer": "C",
+ "candidates": [
+ "A. To supplant the position of other galleries.",
+ "B. To generate revenue.",
+ "C. To exhibit the finest contemporary art.",
+ "D. To achieve national recognition."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What type of background music is being played in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Wild electronic music.",
+ "B. Energetic hip-hop music.",
+ "C. Relaxing light music.",
+ "D. Sad and melancholic music."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occur to the red woven net shown at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. It shows no change.",
+ "B. It becomes denser.",
+ "C. It becomes sparser.",
+ "D. It becomes more integrated."
+ ]
+ }
+ },
+ "eTGldVzX": {
+ "video_id": "eTGldVzX",
+ "video_duration": "545s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A partially constructed water feature is shown, with a black pond liner, a black plastic box with a round knob, large rocks, and a section of black plastic grid. A man points at the camera. The view changes to a suburban street with parked cars. The man walks toward the backyard, pointing at a lawnmower. In the backyard, two workers are building the water feature. The man places his hand over the black box. One worker uses an orange hand truck to move a large rock. The workers place the rock onto the grid. Several large rocks rest on the pond liner. The man points at two rocks, then points towards the background. A worker walks around the water feature, which now has more rocks and gravel. Water flows down the rocks, forming a small waterfall. The water continues to flow. The man gives a thumbs-up sign.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "How many tons of stones did the team of the man wearing a baseball cap in the video use?",
+ "answer": "C",
+ "candidates": [
+ "A. 0.5 tons.",
+ "B. 1 ton.",
+ "C. 1.5 tons.",
+ "D. 2 tons."
+ ]
+ }
+ },
+ "MdgfKBIX": {
+ "video_id": "MdgfKBIX",
+ "video_duration": "141s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a point-of-view perspective of a motorcycle driving on the Pertamina Mandalika Circuit in Indonesia. The sky is clear and blue, with some wispy clouds. The track is dark grey asphalt with red and white curb markings and green runoff areas. A small map of the circuit is superimposed in the bottom right corner, with a red dot indicating the motorcycle\u2019s position as it moves. The motorcycle smoothly navigates the curves and straightaways of the track. The surrounding landscape consists of green hills, trees and grandstands, with sponsor signage visible along the track. The video ends as the motorcycle is still on the track.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "How does the race car engine sound change in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The engine sound increases when entering a straightaway from a curve.",
+ "B. The engine sound decreases when entering a curve from a straightaway.",
+ "C. The engine sound increases when entering a curve from a straightaway.",
+ "D. The engine sound remains unchanged when entering a straightaway from a curve."
+ ]
+ }
+ },
+ "SmbyAToS": {
+ "video_id": "SmbyAToS",
+ "video_duration": "92s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person's hand is seen touching and pointing at various parts of several small trees with green leaves, most of which appear to be fig trees. The trees are in pots and are located inside a greenhouse. There is a wooden shelf in the greenhouse and two black pots and a small white rectangle are on the shelf. Outside the greenhouse, green grass and other vegetation are visible.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many varieties of fig trees are introduced in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Two kinds.",
+ "B. Four kinds.",
+ "C. One kind.",
+ "D. Three kinds."
+ ]
+ }
+ },
+ "hjDAQYuL": {
+ "video_id": "hjDAQYuL",
+ "video_duration": "54s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person in a pink long-sleeved shirt and gray pants is standing in a hallway. The person is then shown wearing a black shirt with text and a pink heart on it, standing by a doorway. The person is shown back in the pink long-sleeved shirt with a second person coming into view in the hallway. The second person is wearing a blue shirt with a rainbow on it. Both people then enter a room and jump up and down. The person in the pink long-sleeved shirt is then shown at a glass coffee table, and then back in the doorway with the second person. The person in the pink long-sleeved shirt is shown again at the coffee table.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What might they do at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. They might play a game called \"Finance Academy.\"",
+ "B. They might go to bed to sleep.",
+ "C. They might play with electronic devices.",
+ "D. They might go eat something."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Is there any change in the volume of the woman wearing pink in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The sound gradually diminishes.",
+ "B. When communicating with her mom, she speaks loudly; with others, she speaks softly.",
+ "C. Yes, the sound gradually increases after she goes downstairs.",
+ "D. There is no change."
+ ]
+ }
+ },
+ "UntGDPRn": {
+ "video_id": "UntGDPRn",
+ "video_duration": "200s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Soccer",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a series of American football plays, featuring players in various uniforms. At different points, a text box is overlaid, giving information about the player Mekale McKay. The first sequence shows players in white and grey uniforms in an outdoor setting. The following sequence shows plays taking place on a green field, with players in black and white uniforms opposing a team in white and dark blue uniforms. The sequence transitions to show a player in a black and red uniform with the number 2, and players in white and red uniforms. Further on, a player in a black uniform with the number 8 is shown with a white and red team. These sequences are interspersed with the various text boxes. Finally, the video shows a series of other plays, with teams in a variety of uniform colors, including purple and white, black and white, white and navy blue, white and green. At various points the ball is thrown, or players attempt to tackle other players.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is Mekale McKay's height?",
+ "answer": "C",
+ "candidates": [
+ "A. 6 feet 5 inches.",
+ "B. 6 feet 6 inches, 455 pounds.",
+ "C. 6 feet 4 inches.",
+ "D. 6 feet 6 inches."
+ ]
+ }
+ },
+ "kPytZKON": {
+ "video_id": "kPytZKON",
+ "video_duration": "330s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video displays a screen capture of a presentation involving several graphs, along with a small video of a light-skinned man in a light blue button-down shirt at the bottom right of the screen.The presentation begins with a graph titled \"Greenhouse Gas Net Emissions\" with a grey line that increases from 2000 to 2100. A blue line is then added to this graph, which rises at the same rate for the first 20 years or so and then plateaus under the original grey line.The text \"(Removing the economic damage function to illuminate differences in energy policies and modeling)\" appears on a blue background, below a partial circular shape at the top.The screen displays a new graph, titled \"Coal Primary Energy Demand,\" with a grey line rising from approximately 100 to 300 between the years 2000 and 2080. A blue line is added, increasing at the same rate initially and then tapering off underneath the original grey line.A new graph appears, titled \"Oil Primary Energy Demand,\" again showing a grey line and a blue line. These both begin at approximately 150 and then rise to approximately 260 around 2040, after which they decrease slightly. The blue line tapers off underneath the grey line after peaking around 2040.The next graph, titled \"Renewables Primary Energy Demand,\" displays a grey line slowly increasing from approximately 0 to 250 from 2000 to 2100. A dotted blue line is added, rising steeply at the same rate around 2030 and then rising more steeply at approximately 2040 until reaching 450 by 2100.The graph \"Net Land Use Emissions without Bioenergy\" is displayed. A grey line begins around 3, decreases to approximately 2.5, and then levels out horizontally from approximately 2018 to 2080. A blue line is added. 
This line begins at approximately 5 around 2005, sharply increasing, peaking at approximately 7 in the 2050's and decreasing slowly to a point slightly above the grey line near 2080.The graph \"Electric Share of Total Capital - Transport (Percent)\" is shown. A grey line gradually increases from approximately 1% at 2000 to approximately 35% around 2080. A blue line is added. The blue line rises at the same rate initially and then more sharply until it reaches approximately 50% around 2080.The text \"Comparison to IPCC AR6 Synthesis\" is displayed on a blue background.The graph \"IPCC Scenarios: Figure SPM.5 in the AR6 Synthesis Report (2023)\" is then shown. This graph contains a cluster of lines in the upper left corner and shaded areas going down and to the right. The text of the graph mentions \"Nationally Determined Contributions,\" \"Limit warming to 2C,\" and \"Limit warming to 1.5C\".The graph is then shown with the text \"Greenhouse Gas Net Emissions in the En-ROADS Baseline (2023) compared to the \"implemented policies\" range in the IPCC AR6 Synthesis Report (2023)\" written on top, with the En-ROADS Baseline highlighted in blue, along with the 63, 69, and 73 Gigaton lines highlighted in blue.The video then transitions to show only the man in the bottom right of the screen, who moves his hands and talks.The video then displays the graph \u201cGreenhouse Gas Emissions\u201d with three solid lines in green, red, and black, which all follow a similar path, slightly diverging in certain areas between 2000 and 2100. The black line increases at the end, while the red and green lines plateaus.Finally, the \u201cGreenhouse Gas Emissions\u201d graph is shown again, this time with a thick blue line. This line has some initial variation in the beginning and then increases before leveling off, remaining between the green and red lines between 2020 and 2070. After 2070, the blue line rises to a point above both the green and red lines. 
The man in the bottom right corner of the screen is visible during this section of the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did the video include an animated presentation of the document?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes, it appeared.",
+ "B. No, it did not appear.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "vfvIxLYc": {
+ "video_id": "vfvIxLYc",
+ "video_duration": "419s",
+ "duration": "6-8min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The words \"GAMING RULES!\" appear in brown text on a textured grey background, followed by \"Presents\" in a faded smaller red text below. A black silhouette of a horse with a rider appears. The words \"CEPHALOFAIR\" and \"GAMES\" are shown below the silhouette. A man appears, seated at a table. On the left side of the screen there is a box with the word \"GLOOMHAVEN\" printed on it. The \"GAMING RULES!\" logo appears in the upper-left corner. The man is still present, and appears to be speaking. The screen zooms in on a light green card labeled \"Scoundrel\" showing an illustration of a figure. White and black sheets of paper are visible, along with a map which is displayed on the table. A red dotted line is drawn on the map, connecting a city to a cave. Another red dotted line is displayed on the map. A white piece of paper with text labeled \"Human Scoundrel\" is shown. Numbers are marked off on this paper. Another piece of paper with the text \"Prosperity\" is shown. X's are marked off on this piece of paper. The camera switches back to the man, seated at a table. The camera shifts to a wooden floor. A light green card labeled \"Scoundrel\" is shown, next to a deck of similar light green cards labeled \"Scoundrel\". Additional white pieces of paper are displayed on the wooden surface. The level on this card is marked as '1'. A character sheet is displayed. The camera moves back to display the card labeled \"Human Scoundrel\". The words \"XP Notes\" and \"Gold Notes\" are shown with numbers next to them. The camera moves to show cards with images of items, including Boots of Striding, Winged Shoes, Hide Armor, Leather Armor, Cloak of Invisibility, Eagle Eye Goggles, Iron Helmet, Heater Shield, Piercing Bow, War Hammer, Poison Dagger, Minor Healing Potion, and Minor Stamina Potion, and Minor Power Potion.The \"Leather Armor\" and \"Minor Stamina Potion\" cards are brought into focus, along with a health level scale. 
The camera then displays a card with a black book on its cover. A card labeled \"Greed is Good\" and a card labeled \"Seeker of Xorn\" are shown. Another card with the heading \"Party Sheet\" is displayed, and numbers are written down. The numbers on the chart begin at +20 and go down to -20. A banner appears with the words \"Ability Cards\". Multiple light green cards are shown. Text is highlighted on the cards, and groupings of similar cards are displayed. A single card labeled \"Flanking Strike\" is highlighted. A light green card labeled \"Flintlock\" and a card labeled \"Open Wound\" are highlighted. The camera zooms in on a card labeled \"Scoundrel\" with the numbers 9, 8, 7, 6, 5, 4, 3, 2, and 1 on a chart. The screen displays the image from the opening of the video, followed by the words \"The World\". A map is displayed. The camera zooms in on areas of the map labeled \"Gloomhaven\", \"Still River\", and \"Corpsewood\". The map highlights two areas, labeled \"Black Barrow\" and \"Barrow Lair\" and a dotted red line is displayed from \"Gloomhaven\" to \"Black Barrow\" and then to \"Barrow Lair\". A card labeled \"Road\" is displayed with an option to \"Run from the howling to safety\" or \"Let the wolves come\". The reverse side of the card shows choices with different starting scenarios, one that begins with Poison, and the other that begins with three damage. A card is then shown, and a small black book is highlighted in the corner. The camera shifts back to the man, seated at a table. A book card is highlighted, and a light blue box encircles the small book logo. The man appears, seated at a table. The camera shifts to the wooden surface, showing an open scenario book. A scenario is displayed with pictures of a room, and labels for Bandit Guard, Bandit Archer, Living Bones, and Treasure. An achievement for \"First Steps\" is shown. The map is displayed with the words \"Global Achievements\" above it. 
A single green banner is highlighted next to a string of dotted lines. The camera shifts to the map once more. \"East Road\" and \"Sergeant's Kiss River\" are highlighted. A 'Prosperity' sheet with x's marked is shown. Then, a group of cards labeled 015-021 are shown with a sword logo. An image of the opening is shown once more, followed by the text \"Visiting Gloomhaven\". The camera zooms in on a map showing a labeled \"Gloomhaven\" and \"Merchant's Bay.\" A character sheet is superimposed on the map. The character sheet is displayed, with a blue circle around the number 2, and then again around the level 1. A light green \"Scoundrel\" card is displayed. A character sheet is displayed. The text \"Replace one -1 card with one +1 card\" is marked with a check box. A variety of cards are shown, including Boots of Striding, Winged Shoes, Hide Armor, Leather Armor, Cloak of Invisibility, Eagle Eye Goggles, Iron Helmet, Heater Shield, Piercing Bow, War Hammer, Poison Dagger, Minor Healing Potion, Minor Stamina Potion, and Minor Power Potion. A card is shown with text and choices, labeled \"CITY\". The man is then shown, seated at a table, and then appears to present a white envelope with a black jellyfish logo on it. A \"Subscribe\" banner is shown. The man appears, and the camera moves to show a final screen showing \"GAMING RULES!\" with contact info and credits.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How many gold coins are required for the leather armor shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Fifteen.",
+ "B. Twenty-five.",
+ "C. Ten.",
+ "D. Twenty."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video is it mentioned that the new role must choose one card from two task cards?",
+ "answer": "C",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. In the middle of the video.",
+ "C. At the beginning of the video.",
+ "D. Throughout the entire video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "When the video displays urban equipment, where is the hammer positioned relative to the leather armor?",
+ "answer": "D",
+ "candidates": [
+ "A. The hammer is to the left of the leather armor.",
+ "B. The hammer is to the right of the leather armor.",
+ "C. The hammer is in the upper right of the leather armor.",
+ "D. The hammer is in the lower left of the leather armor."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the video state that new cards replace existing ones each time a new card is added?",
+ "answer": "C",
+ "candidates": [
+ "A. Because this allows the cards to level up faster.",
+ "B. Because this increases the card's attack power significantly.",
+ "C. Because the number of cards must match the number on the character pad.",
+ "D. Because the number of cards must match the number on the campaign board."
+ ]
+ }
+ },
+ "hKOuixda": {
+ "video_id": "hKOuixda",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video is in black and white and shows a man with glasses, wearing a suit and bow tie. He is speaking and gesturing with his hands. In the lower right corner of the video, there is a smaller video inset of a woman with short hair wearing a patterned top. The background of the main video has light and dark vertical panels and a potted plant with large leaves. The man frequently adjusts his glasses and moves his hands in various gestures, including pointing, spreading his hands wide, and bringing them together in a clapping motion. The video zooms into the man's face several times throughout the sequence. At one point, he leans forward and looks down. He also appears to be adjusting his cufflinks.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the man stick out his tongue?",
+ "answer": "A",
+ "candidates": [
+ "A. The beginning of the video.",
+ "B. The end of the video.",
+ "C. The middle of the video.",
+ "D. It does not appear."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What is the overall emotional atmosphere in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Serious and dignified.",
+ "B. Tense and exciting.",
+ "C. Sad and repressive.",
+ "D. Humorous and absurd."
+ ]
+ }
+ },
+ "OXXybXmN": {
+ "video_id": "OXXybXmN",
+ "video_duration": "56s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a gray striped shirt stands in a kitchen with brown and beige mosaic tile backsplash. He opens a clear plastic rectangular container containing rice. He opens a cabinet above the counter, revealing various bottles and jars, some appearing to be spices or sauces. He removes several items and arranges them on the counter. The items include a jar of honey, a bottle of hot sauce, a small container with a yellow spice, a large container of a brown spice, and a bag of flour or sugar. He then picks up three small, clear spice jars with orange lids, followed by two larger bottles, one with a red substance and another with a brown substance. Next, he moves to the stove, where a frying pan is placed on a burner. He turns on the burner, and a blue flame ignites. He picks up raw chicken pieces on a wooden cutting board, positions a knife, and begins cutting the chicken.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the last three letters of the black word printed on the butter box shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. bon.",
+ "B. and.",
+ "C. ban.",
+ "D. ond."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What does the man in the video do with the yellow knife?",
+ "answer": "C",
+ "candidates": [
+ "A. Cut garlic with the yellow knife.",
+ "B. Cut chicken with the yellow knife.",
+ "C. Scrape butter with the yellow knife.",
+ "D. Scrape chicken fat with the yellow knife."
+ ]
+ }
+ },
+ "ItysBTus": {
+ "video_id": "ItysBTus",
+ "video_duration": "501s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A blonde child is wearing a blue dress and talking. She is sitting in an area decorated with artificial leaves, and gray and white fur-like pillows. A cardboard cut out of a cartoon character stands next to her. She is using a tablet in a white stand with a red rectangular device attached to its top, and a booklet to play a game, coloring characters from a film. She is later joined by another child, who is wearing a purple costume and they both speak to the camera and demonstrate drawing using the tablet. The screen shows a drawing of a flower and other images. Both children are shown again and the video features product placements for toy sets. A drawing pad and an electronic device are used. The drawing is shown to the camera. The children speak again and show more games on the tablet screen. Various graphics, promotional text, and a YouTube subscription prompt appears. A final shot of a collage of the children ends the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did a green reflector appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. No, there isn\u2019t.",
+ "B. I am not sure.",
+ "C. Yes, there is."
+ ]
+ }
+ },
+ "QRgUeIzo": {
+ "video_id": "QRgUeIzo",
+ "video_duration": "69s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Celebrity Interviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two people are sitting next to each other on a set of bleachers in what appears to be a gymnasium. The person on the left is wearing a red and black basketball jersey with the word \u201cIndonesia\u201d and the number \u201c33\u201d on it. He also has a white hooded jacket on and is wearing a brace on one knee. The person on the right is wearing a white T-shirt with a graphic on it, and is holding a microphone. They are talking to each other, and at one point the person on the right holds up the mic to the person on the left. They laugh and clap together. Another person in a white tank top and red shorts joins them and the person wearing red and black touches fists with him. He hugs the person in the red and black, and both laugh. The person in the white tank top walks away. The person in red and black stands up and walks out of the frame. The person with the white T-shirt continues talking into his microphone while in a seated position. He then points forward in the direction of the camera.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "How many goals did the person wearing the number 8 jersey score in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. One.",
+ "B. Zero.",
+ "C. I am not sure.",
+ "D. Two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the tattooed man do to the man in black at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A high five.",
+ "B. A fist bump.",
+ "C. A hug.",
+ "D. A handshake."
+ ]
+ }
+ },
+ "rLJqXzvY": {
+ "video_id": "rLJqXzvY",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows two individuals. The first person is wearing a pink shirt and a silver necklace. The second person is wearing an olive green shirt and black pants. Both individuals are seated in what appears to be a room. The first individual is facing toward the left side of the frame and speaks at multiple points. The second individual speaks and gestures while seated on a bed. The first individual then gets up and walks off camera.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What is the reason for the argument between the man and woman in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The woman cheated on the man.",
+ "B. The man deceived the woman's mother.",
+ "C. The woman deceived the man's mother.",
+ "D. The man cheated on the woman."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What does the woman plan to do after the video ends?",
+ "answer": "C",
+ "candidates": [
+ "A. The woman plans to chase the man.",
+ "B. The woman plans to change her clothes.",
+ "C. The woman plans to drink water.",
+ "D. The woman plans to refute the man's words."
+ ]
+ }
+ },
+ "CaFDEonl": {
+ "video_id": "CaFDEonl",
+ "video_duration": "94s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a crowd of people. Some people are holding long, thin sticks or rods, as well as flags of various colors including white, red, and green. One person is wearing a yellow hat. The crowd is seen from various angles, sometimes close up, sometimes further away. At one point in the video, a man in a maroon shirt and a scarf is shown talking to a camera, while in the background are many of the people seen earlier. Also seen in the video are a series of trucks driving down a road, and a group of people sitting in an area with a white building. Throughout the video, a banner across the lower part of the screen includes text which mentions \"Imran Khan\" and \"Lahore.\"",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "At the end of the video, what is the position of the man with a white beard and wearing red clothing, relative to the truck?",
+ "answer": "C",
+ "candidates": [
+ "A. The man is located on the right side of the truck.",
+ "B. The man is located on the left side of the truck.",
+ "C. The man is located in front of the truck.",
+ "D. The man is located behind the truck."
+ ]
+ }
+ },
+ "DgzVnMss": {
+ "video_id": "DgzVnMss",
+ "video_duration": "43s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen with the text \"Star Trek: Discovery\" on the bottom left and \"CBS All Access\" on the bottom right.The scene changes to two spaceships flying through space.The next scene is inside a spaceship where a man in a blue uniform looks to his left at a large screen showing circular blue graphics with the text \"COMMUNICATIONS\" at the top.The next scene shows a person with ridged skin and a dark blue uniform walking past a door-like opening.The following scene displays an area with metal bars where a bright light is visible.The light continues to pulsate from the area between the bars.Next, the person with ridged skin and another person in a dark blue uniform stand in front of a doorway.Then a man in a yellow uniform stands in a room with blue paneled walls and looks down.He then shakes hands with the person with ridged skin as the other person in dark blue watches.The man in yellow is next shown looking off camera.The man with ridged skin and the other person in dark blue appear once again in front of the doorway.The video then displays the side of a spaceship moving past the camera.Next, several pairs of feet are seen walking in a hallway.Then, several people wearing different color uniforms are shown walking in a hallway.Two people are seen walking down a corridor with panels that seem to be windows on one side.The underside of a spaceship flies close by, revealing several circular shapes on the undercarriage.Multiple people wearing different colored uniforms are seen in a ship hallway.The scene then shows a bright nebula in space with red objects floating through it.Next, inside a spaceship, three people are looking at a large screen with red graphics.The man in the yellow uniform is next shown looking off camera, his face partially visible.A woman in a dark blue uniform is shown looking off camera, her face partially visible.The man with ridged skin is shown looking off camera.The man in the yellow uniform 
and a woman in a dark blue uniform are then both shown looking off camera.A woman with a metal piece on her head is looking off camera.Two spaceships are seen flying in space.The screen is then overtaken by a bright white light.The text \"DISCOVER\" then appears on the screen.A ship is seen falling downwards, it is angled diagonally.A spaceship is shown from below. The ship fires a red light beam from the side.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the \"thud\" sound occur?",
+ "answer": "A",
+ "candidates": [
+ "A. Seven.",
+ "B. Five.",
+ "C. Six.",
+ "D. Eight."
+ ]
+ }
+ },
+ "TKNRvJTF": {
+ "video_id": "TKNRvJTF",
+ "video_duration": "45s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A character with yellow spiky hair, wearing a red cape, a purple shirt and white pants runs through a grassy field. There is a red and white building in the background, and a man in a black shirt, wearing headphones, is visible in a smaller box in the upper-right corner of the screen. The character is followed by a pink pig-like figure wearing a red dress and holding a brown stick. They move to a white platform with a red figure on top of it. The pink figure uses the stick on the red figure, which gets damaged and objects fly around, including a blue box and a small round tan object. The first character then runs on a brown field, past an orange figure, and two benches. The character continues to run, with the pink and red figure now on the platform in the background. The character runs toward the two figures. The character and pink figure then approach the red figure on the platform. The pink figure attacks the red figure again. They are now all standing on the platform.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color are the parts shown in the video that need to be placed on the pink pig's head?",
+ "answer": "D",
+ "candidates": [
+ "A. Yellow.",
+ "B. Red.",
+ "C. Green.",
+ "D. Pink."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the pink pig in the video do with the mallet?",
+ "answer": "C",
+ "candidates": [
+ "A. Smashed the large red robot into pieces.",
+ "B. Knocked the parts out of the hands of the little person in the red robe.",
+ "C. Broke apart the little person wearing the red robe.",
+ "D. Scattered the parts buried in the sand."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a skateboard in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No, it does not exist.",
+ "C. Yes, it exists."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the little person in the red robe find the pink part in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Next to the large robot.",
+ "B. On the sand.",
+ "C. On the grass.",
+ "D. On the cement floor."
+ ]
+ },
+ "task4": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What change occurred to the large red robot at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A blue pig's head appeared on top.",
+ "B. A pink pig's head appeared on top.",
+ "C. Knocked down by a pink pig with a cudgel.",
+ "D. Pushed over by a person in a red outfit."
+ ]
+ }
+ },
+ "anrYZzmH": {
+ "video_id": "anrYZzmH",
+ "video_duration": "497s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video starts with a view of people walking down a staircase in a shopping area. The camera then moves upward, briefly showing a person's back and beige clothing. Subsequently, the camera focuses on various clothing items in a bustling market setting. A woman examines a pair of jeans, followed by shots of other shoppers browsing through racks of colorful children's clothes, jeans, and other garments. A man in a red and black plaid shirt holds up a pair of jeans. Hands sift through stacks of folded clothing. The camera pans across different clothing displays, including striped boxer shorts, jackets, and more jeans. A man in a red shirt displays a bright yellow packable rain jacket. The camera continues to showcase various clothing items, including stacks of jeans, a man holding black jeans, a blue and white checkered dress shirt, and folded khaki and gray pants.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the name of the market featured in the video where the man purchases clothing?",
+ "answer": "C",
+ "candidates": [
+ "A. Bangladesh Affordable Market.",
+ "B. Dhaka Brand Discount Market.",
+ "C. Nurjahan Market.",
+ "D. Dhaka Central Market."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many pairs of pants does the man in the video buy?",
+ "answer": "C",
+ "candidates": [
+ "A. 3.",
+ "B. 2.",
+ "C. 4.",
+ "D. 1."
+ ]
+ }
+ },
+ "dipaCxSc": {
+ "video_id": "dipaCxSc",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man and woman are sitting on a blue couch. They are both wearing light colored clothing. The man is holding a black television remote in his hands. The scene changes to a television screen. The screen has a circular red loading icon in the bottom right corner. A white internet router is visible in the bottom left corner of the screen. The scene changes back to the man and woman. The man is wearing a peach colored button down shirt. The woman is wearing a white sweater with small light colored polka dots. The man and woman look towards the television. The scene changes back to a close up of the woman and man. The scene transitions back to the man and woman sitting on the couch and a third person appears. The third person is a child, wearing a pink and black sleeveless dress. The scene changes back to a close-up of the man and woman, then transitions to a close up of the child and then a close up of the man.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event occurred near the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The television screen suddenly started flickering.",
+ "B. A strange-looking creature suddenly appeared.",
+ "C. A girl wearing black floral-patterned clothing suddenly appeared.",
+ "D. The lights suddenly went out."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the emotional state of the man in the video change?",
+ "answer": "D",
+ "candidates": [
+ "A. The man's emotion changes from sadness to astonishment.",
+ "B. The man's emotion changes from confusion to excitement.",
+ "C. The man's emotion changes from excitement to sadness.",
+ "D. The man's emotion changes from boredom to fear."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "At the beginning of the video, what is the man doing with the remote control?",
+ "answer": "B",
+ "candidates": [
+ "A. The man adjusts the volume of the movie with a remote control.",
+ "B. The man taps his leg with a remote control.",
+ "C. The man changes the TV program with a remote control.",
+ "D. The man turns off the TV with a remote control."
+ ]
+ }
+ },
+ "ympAiTqy": {
+ "video_id": "ympAiTqy",
+ "video_duration": "227s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two boxers, Regis Prograis in black and gold trunks, and Terry Flanagan in blue, green, and white trunks, circle each other in a boxing ring. Prograis throws a jab, followed by a right hook that Flanagan ducks under. Flanagan attempts a jab, but Prograis blocks it. Prograis throws a left hook to the body, then another jab. The two exchange jabs. Prograis lands a right hook that makes Flanagan stumble backward. Prograis follows up with a combination of punches, cornering Flanagan against the ropes. Flanagan attempts to fight back, but Prograis lands another combination. Flanagan covers up as Prograis throws more punches. The two continue to exchange punches, with Prograis landing the majority of them. Prograis lands a left to the body, followed by a right hook that sends Flanagan stumbling into the corner again. Flanagan throws a jab, which Prograis blocks. Prograis continues landing body shots. Flanagan circles away and puts his hands up defensively. Prograis lands a straight left hand to Flanagan\u2019s face just before the bell rings to end the round.The beginning of round 8 starts with both boxers exchanging punches in the center of the ring. Prograis lands several combinations. Flanagan attempts a few punches, but is mostly on the defensive. Prograis lands a powerful body shot that drops Flanagan to one knee. The referee starts the count. Flanagan struggles to get back on his feet. He attempts to stand at the count of 8, but loses his balance and takes a knee again. The referee counts to 10 and waves off the fight. Prograis raises his hands in victory. Flanagan is helped to his feet by the referee and a cornerman. The referee raises Prograis\u2019 arm, declaring him the winner by TKO.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "In the video, how did the boxer wearing black lettered shorts knock out the boxer wearing green lettered shorts?",
+ "answer": "A",
+ "candidates": [
+ "A. A left straight punch.",
+ "B. A downward elbow strike.",
+ "C. A left hook.",
+ "D. A right hook."
+ ]
+ }
+ },
+ "FmMNGQnw": {
+ "video_id": "FmMNGQnw",
+ "video_duration": "74s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black and white shot from a first-person perspective on a motorcycle. The handlebars with two gauges and a camera mounted on the right side are visible in the foreground. In the distance, two white semi trucks are parked to the left, and various other buildings and trees can be seen. The video then moves to show only the handlebars, gauges, and camera, again in black and white. The video then changes to a color shot, still from the perspective of the motorcycle driver, with the driver's arms, which have tattoos, and the black motorcycle helmet, which has a camera mounted to it, visible in the frame. The background shows trees and a blue sky. The video then switches back to a shot from the motorcycle's handlebars, and the scenery shows green trees, the road, and various cars. At times, the view is obstructed by what appears to be tall grasses blowing in the wind. The video concludes with the camera pointed at the driver wearing a black motorcycle helmet, with hair visible below the helmet and a black t-shirt, and with a blue sky and roadside trees as a background. The driver's arms and tattooed hands are again visible.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What music or background sounds are present in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Only a man's voice and the sound of a motorcycle are heard.",
+ "B. Initially, there is no music, followed by only a man's voice and the sound of a motorcycle.",
+ "C. Initially, there is music, followed by only a man's voice and the sound of a motorcycle.",
+ "D. Initially, there is no music, and then classical music is introduced."
+ ]
+ }
+ },
+ "NLlSAEOa": {
+ "video_id": "NLlSAEOa",
+ "video_duration": "462s",
+ "duration": "6-8min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman in a sequined dress sings on a stage extending into the audience. She dances and gestures with her hands as she sings, moving down the stage extension and back to the main stage. A band plays behind her, featuring a drummer, saxophone players, guitarists, and other musicians. Bright, predominantly blue, stage lights illuminate the performers. Security personnel and members of the audience are visible in the foreground. The audience claps along and some raise their hands. At one point, the woman is joined by a man with a guitar playing next to her on stage as she sings. Other band members sway and dance to the music as they play their instruments. A large screen behind the stage shows an unclear image. The woman walks along the stage and interacts with the band members.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does a display screen appear in the first half of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. No, it does not appear.",
+ "B. Yes, it does appear.",
+ "C. Not applicable.",
+ "D. It is not clear."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the black woman first stop singing and start dancing?",
+ "answer": "D",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. Throughout the entire video.",
+ "C. In the middle of the video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "jglJKNMn": {
+ "video_id": "jglJKNMn",
+ "video_duration": "130s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows an empty room filled with many tables and blue chairs. There are bottles of water, and sometimes juice, on the tables. The video then shows a person placing stacks of badges on a table, and then placing blue lanyards next to the badges. The video then moves to a crowded space where many people are talking and standing around a few tall tables with white cloths. There are banners with the word \"Nextcloud\" in blue. People are wearing badges and blue lanyards. Three people stand in front of a blue banner, and another person is seen taking a photograph of them using their phone. There is a presentation going on in the large room seen at the beginning of the video. The room is now full of people, many of whom are facing towards a projection screen. Throughout the video, multiple people are seen speaking to the camera, and many others are seen walking around the event.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, what is the sequence in which the interview subjects appear, excluding the woman with long brown hair wearing a deep-necked blue outfit?",
+ "answer": "B",
+ "candidates": [
+ "A. A brown-haired bearded man in a brown suit, a gray-haired bearded man in a gray shirt, a gray-haired bearded man in a blue suit wearing glasses, a short blonde overweight man in a black crewneck, a black-haired bearded man in a gray suit, a brown-haired woman with hair combed back, a black curly-haired man in a burgundy shirt, a black-haired man wearing glasses in a light blue shirt, a black-haired man with hair combed back in a black shirt, a black-haired man wearing glasses in a green stand-up collar blue suit, a black-haired man wearing glasses in a plaid shirt.",
+ "B. A brown-haired bearded man in a brown suit, a gray-haired gray-bearded man in glasses and a blue suit, a gray-haired gray-bearded man in a gray shirt, a short blonde overweight man in a black crewneck, a black-haired bearded man in a gray suit, a black curly-haired man in a burgundy shirt, a brown-haired woman with hair combed back, a black-haired man in glasses and a light blue shirt, a black-haired man with hair combed back in a black shirt, a black-haired man in glasses wearing a green stand-up collar blue suit, a black-haired man in glasses wearing a plaid shirt, a gray-haired gray-bearded man in a blue suit.",
+ "C. A man with gray short hair and gray beard wearing a gray shirt, a short blond and slightly overweight man wearing a black crewneck shirt, a man with black short hair and a beard wearing a gray suit, a man with gray short hair and gray beard wearing glasses and a blue suit, a woman with brown short hair combed back, a man with black hair and glasses wearing a light blue shirt, a man with black short hair combed back wearing a black shirt, a man with black short hair and glasses wearing a green stand-up collar blue suit, a man with black curly hair wearing a burgundy shirt, a man with black short hair and glasses wearing a plaid shirt, a man with gray short hair and gray beard wearing glasses and a blue suit.",
+ "D. A brown-haired man with short hair, mustache, and brown suit, a gray-haired man with short hair, beard, and gray shirt, a gray-haired man with short hair, beard, and glasses wearing a blue suit, a black-haired man with short hair, beard, and brown suit, a brown-haired woman with short hair styled back, a black-haired man with curly hair and wine-red shirt, a short blonde man with a slightly overweight build and black round-necked sweater, a black-haired man with glasses and light blue shirt, a black-haired man with short hair styled back and black shirt, a black-haired man with glasses and green standing-collar blue suit, a black-haired man with glasses and checkered shirt, a brown-haired man with short hair, mustache, and brown suit."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the video, who mentioned \"ethical AI\"?",
+ "answer": "B",
+ "candidates": [
+ "A. A woman with long brown hair, wearing a white blouse and a red scarf.",
+ "B. A woman with short brown hair styled back.",
+ "C. A woman with her hair tied up, wearing a white blouse and skirt, and holding a microphone.",
+ "D. A woman with long brown hair, wearing a deep-collared blue blouse."
+ ]
+ }
+ },
+ "qmPWncaV": {
+ "video_id": "qmPWncaV",
+ "video_duration": "60s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with a beard and mustache wearing a yellow hoodie is visible. Over the course of the video, different cartoon images are superimposed over their head. The person moves their hands and face, and appears to be speaking. Text appears on screen in a series of steps, some of which are also highlighted. The text starts as \"Detective: SPONGEBOB\" and continues to add additional text fields with the following format: \"Victim:\", \"Found Body:\", \"Blaming Everyone:\", \"Last Seen W/\", \"Locked Up:\", \"Who Did it:\", and \"Witness:\". The corresponding information to the right of each text field changes over the course of the video. The background includes a black door, a string of white lights, and a picture of 3 pill containers.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is Mr. Krabs's role in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The victim.",
+ "B. The discoverer.",
+ "C. The suspect.",
+ "D. The detective."
+ ]
+ }
+ },
+ "DxdbDgnm": {
+ "video_id": "DxdbDgnm",
+ "video_duration": "73s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a split-screen view. The left side of the screen displays a web browser interface showing a list of commit messages. The right side of the screen displays a video of an outdoor scene. Initially, the video shows a patch of ground, a tree, a paved area and a section of the side of a building with people walking. The view shifts to show a white device connected with a grey cord to a small circuit board with LED lights. The device and circuit board are resting inside a cardboard box. The view is adjusted slightly, showing the device and circuit board more clearly. The outdoor scene video begins to show people walking by, then a person riding a bike. The video continues to show people moving through the outdoor scene. The camera view of the outdoor scene appears slightly unsteady. The man in the bottom right corner of the screen remains relatively still with his hands gesturing at different points.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the detection accuracy of the equipment during peak hours?",
+ "answer": "C",
+ "candidates": [
+ "A. Very high.",
+ "B. Low.",
+ "C. High.",
+ "D. Very low."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What type of vehicle is the person wearing a hat riding on campus in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A bicycle.",
+ "B. An electric bike.",
+ "C. A scooter.",
+ "D. A balance scooter."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What changes will occur in the equipment shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The weight decreases.",
+ "B. The display speed of symbols increases.",
+ "C. The interface becomes more visually appealing.",
+ "D. The accuracy improves."
+ ]
+ }
+ },
+ "JQWwYtEm": {
+ "video_id": "JQWwYtEm",
+ "video_duration": "60s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man wearing a dark blue shirt with a gold logo is standing on a sandy area with a city visible in the distance.Next, a group of six men wearing military attire are carrying a large brown cylinder. Then, a group of seven men wearing helmets and red life vests are carrying a boat over a beach toward the water. After that, a group of men in a boat are paddling through choppy water.The video cuts back to the first man on the beach, and then to footage of men performing crunches in the dark, and also walking through water in the dark. Then, the first man is shown on the beach again. The video cuts to a man wearing camouflage clothing handling a rifle in a dark room. Then, the first man is shown again on the beach. Next, the video shows two men in dark military attire in the water holding weapons, and then underwater footage of a man swimming. After that, two men are shown swimming in a cavern with weapons. Then, a man in military attire is walking across dry, sandy terrain while holding a rifle, followed by an image of a tan military vehicle moving on a dusty road. The video then cuts to a close up of a man holding a rifle while wearing a camouflage uniform.The next shot shows several men in the water, with their rifles raised.After this, the first man is shown punching a padded target while wearing red boxing gloves.The video cuts to a group of men in a boat on a body of water, and then to a group of men running on a rocky landscape at night.Next, a helicopter is shown as a man is seen rappelling from it.The video cuts to a military vehicle moving on a dusty road, and then to the first man running through a dark room and aiming a pistol.Lastly, the video shows a rotating golden logo on a black background. The video ends with a still image of the logo.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "In what ways does SEAL training differ from other military training?",
+ "answer": "C",
+ "candidates": [
+ "A. Focuses on underwater combat capabilities.",
+ "B. Stresses survival capabilities in extreme environments.",
+ "C. Emphasizes the comprehensive development of individual abilities.",
+ "D. Concentrates on physical fitness and mental resilience."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video do the soldiers land on the roof?",
+ "answer": "C",
+ "candidates": [
+ "A. Throughout the entire video.",
+ "B. At the end of the video.",
+ "C. At the start of the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "uBojWcVi": {
+ "video_id": "uBojWcVi",
+ "video_duration": "274s",
+ "duration": "4-6min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "The video shows four musicians playing instruments and singing. In the leftmost frame is a man playing an electric guitar. He wears over-the-ear headphones and a red hoodie. The second frame, in the center, shows a man playing drums. He wears a baseball cap and a dark T-shirt with the word \"Zildjan\" on it. A third man, in the third frame, plays the saxophone. He is bald and wears a black T-shirt and black and gold patterned pants. He stands in front of what appears to be a home bar, with led lights underneath it, and behind a drumset. In the rightmost frame, a man is singing into a microphone. He wears over-the-ear headphones and a red hoodie. The ceiling of the room has an LED galaxy-patterned light on it. Gray acoustic panels line the walls. Several pieces of artwork hang on the walls behind the saxophonist and drummer. The drummer plays on a black Zildjan drum set, and near him are two floor monitors, lying on their sides. A dark flat-screen television is mounted on the wall above the home bar, showing blurred colorful patterns. The floor is covered with an intricate-patterned rug, in shades of red, brown, and gold. As the video continues, a light moves across the screen of the flat-screen television. The colors of the galaxy ceiling light change colors. The singer points and gestures with his right hand, and moves his head.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence in which the electronic drum sounds, saxophone sounds, electric guitar sounds, and human voice appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Electronic drum sounds, saxophone sounds, human voice, electric guitar sounds.",
+ "B. Electronic drum sounds, saxophone sounds, electric guitar sounds, human voice.",
+ "C. Human voice, electric guitar sounds, saxophone sounds, electronic drum sounds.",
+ "D. Electric guitar sounds, human voice, saxophone sounds, electronic drum sounds."
+ ]
+ }
+ },
+ "ELPmPLpF": {
+ "video_id": "ELPmPLpF",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video opens with a grey textured background and a title graphic that reads \"TRANS THEORIES\". A dark humanoid robot appears on screen in an urban setting, standing near a light blue car and a demolished vehicle. The robot's headlights glow, and smoke is coming from behind the vehicle. A yellow and black robot jumps out of the background and then blue cars are crushed by debris. A metallic grey robot with red glowing eyes and sharp teeth appears. The scene transitions to a damaged city street with many cars and people, as part of the street breaks away. Two robots are fighting. There is an explosion with orange fire and debris flying. A gray robot walks past debris and a blue car. The scene transitions to a damaged gray robot under an archway. Then an aerial shot of the city is shown. A gray robot falls off of a building, then a robot with silver and gold parts jumps onto the clocktower of another building. Then there is another shot of a gray robot flying in the air and crashing into a building. A city street is shown with robots running and fighting. A group of people is fleeing. The scene transitions back to the first dark humanoid robot standing next to a blue car and a demolished vehicle in an urban setting. The title graphic \"TRANS THEORIES\" reappears. The scene transitions to a blue and red robot in a damaged city scene as it throws an object to a silver object. Then, a yellow robot is shown jumping in a city street next to a gray robot. The scene changes to people surrounding various robots as they are standing in a street. Finally, a red and blue robot stands in front of an orange sky.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What does Megatron do to the Autobot Jazz in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Megatron carries Jazz's body and hands it over to the leader.",
+ "B. Megatron assists Jazz in evacuating the crowd.",
+ "C. Megatron tears Jazz in half.",
+ "D. Megatron provides combat support to Jazz."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of the four events shown in the video? (a) Jazz and the ambulance Autobot assist in evacuating the crowd. (b) Ironhide lifts Jazz and hands him over to the leader. (c) Megatron fights Jazz on the rooftop. (d) Jazz is torn in half.",
+ "answer": "B",
+ "candidates": [
+ "A. (c) (d) (b) (a)",
+ "B. (a) (c) (d) (b)",
+ "C. (a) (b) (c) (d)",
+ "D. (c) (d) (a) (b)"
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the jazz dance move of sitting on the car occur in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Two.",
+ "B. Four.",
+ "C. One.",
+ "D. Three."
+ ]
+ }
+ },
+ "PixeDgOL": {
+ "video_id": "PixeDgOL",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a man with a beard wearing a white skullcap and a light-colored vest with a logo on it. He is speaking directly to the camera while moving his hands. He is indoors, with what appears to be a wooden wall and ceiling in the background. At one point a second person can be seen in the background wearing headphones and a light-colored shirt or vest. The man refers to a \"hole in the wall\". The wall has markings on it. He appears to cry at one point. There is a rug on the floor that can be seen.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the religious role of the man in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. An Islamic Mullah.",
+ "B. An Islamic Muezzin.",
+ "C. An Islamic Khatib.",
+ "D. An Islamic Imam."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the man say \"click the link\" in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. Zero.",
+ "D. One."
+ ]
+ }
+ },
+ "pInQchQi": {
+ "video_id": "pInQchQi",
+ "video_duration": "413s",
+ "duration": "6-8min",
+ "domain": "Tech & Science",
+ "sub_category": "Tech Reviews",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with long dark hair and eye makeup faces the camera. They are wearing a gray sweater. They hold up a tube and then rub a cream on their face.\u00a0Next, they pick up a small bottle and squeeze a small amount of liquid onto the back of their hand. They dab their fingers in the liquid and apply it to their face.\u00a0They then use a sponge to blend in the makeup. Finally, they apply lipstick.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many types of cosmetics did the woman in the gray sweater use in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Six.",
+ "B. Three.",
+ "C. Four.",
+ "D. Two."
+ ]
+ }
+ },
+ "bLnoaJeS": {
+ "video_id": "bLnoaJeS",
+ "video_duration": "98s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a dark, blurry shot. Then there is a view from within a boat at night, where a person wearing a light-colored hat and dark jacket is visible. The person is handling fishing rods and preparing equipment. The camera shows a lit-up section on the boat. Later, a bright light sprays a mist in front of the camera, obscuring the view briefly. Then, through the mist and darkness, a dim light can be seen from what appears to be a distant object. The video then shows the view from the boat towards the ocean, where fishing rods are visible. The camera is angled towards the water, where the ocean is seen splashing. There are several fishing poles in view. The video shows a view from within the boat, and the silhouette of a person can be seen in the corner of the shot. The video then focuses on hands preparing bait. After the bait is prepared and placed on the floor of the boat, another piece of bait is handled. Finally, there is a view of a screen displaying sonar.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, what is the spatial relationship between the head meat and the bari fish?",
+ "answer": "B",
+ "candidates": [
+ "A. The head meat is wrapped by the bari fish.",
+ "B. The head meat wraps around one end of the bari fish.",
+ "C. The head meat is beneath the bari fish.",
+ "D. The head meat is in the middle of the bari fish."
+ ]
+ }
+ },
+ "hIwjTYLc": {
+ "video_id": "hIwjTYLc",
+ "video_duration": "76s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Tech Reviews",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "A pair of light brown gloves holds a long black expandable baton. The baton is expanded and then collapsed. The gloves form a fist, and the fists moves and rotates. The gloves then make contact with a black punching bag, and the gloves form a fist again. The gloves rotate again, and they come into contact with the punching bag several more times. The gloves are then visible at a further distance from the punching bag, and the gloves rotate and form a fist several more times. The video ends with a view of the gloves from a distance and a last rotation of the gloves in the shape of a fist.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man strike the punching bag forcefully?",
+ "answer": "C",
+ "candidates": [
+ "A. Six times.",
+ "B. Eight times.",
+ "C. Nine times.",
+ "D. Seven times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many sandbags can be seen in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Two.",
+ "B. One.",
+ "C. Zero.",
+ "D. Three."
+ ]
+ }
+ },
+ "WuRVRWDk": {
+ "video_id": "WuRVRWDk",
+ "video_duration": "106s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Hiking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black hooded wetsuit with \"Hollis\" written on it is standing in a river. He is holding a small, dark knife with a blue handle. Then, he is holding a snorkel mask with a blue mouthpiece. He puts the snorkel and mask on. He has black gloves on. He moves around in the river. He gives a thumbs-up gesture. He points with his gloved hand. He briefly looks up towards the trees. He continues to move through the water. There is a wide view of a river with sunlight coming through the trees. Another view of the river is shown. A man stands in front of a store called \"Wilson Diving\" with \"Victoria, BC\" written on the screen. The man in the wetsuit is seen diving in the river. The camera shows the man\u2019s hands moving rocks on the bottom of the river. He uses a tool that appears to be a screwdriver to move debris. The camera shows a small, round object under a rock and someone using a long white plastic tool to try to reach it. A fish swims in view. The person uses his hands to move the gravel around. The person uses a white squeeze bottle with a long nozzle. There is a person wearing a black wetsuit, black mask, and snorkel. The person is holding a small white bottle in his black gloved hand. The camera goes underwater and there are small fish and gravel and rocks on the bottom of the river. The last clip is from an animated show where a person is wearing a hat and appears to pick up a gold coin. The person in the wetsuit is seen using the plastic tool under the water again. The person is seen using his hands to lift rocks again. The man in the wetsuit is holding the white bottle. The person is underwater again using a white squeeze bottle with a black nozzle. There are rocks and small objects on the bottom of the river. The last shot is of a hand moving some of the rocks.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many voices can be heard in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Two.",
+ "D. One."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who just said, \"This 8mm diving suit is exceptionally warm\"?",
+ "answer": "C",
+ "candidates": [
+ "A. The man wearing blue diving goggles.",
+ "B. The man wearing red diving goggles.",
+ "C. The man wearing green diving goggles.",
+ "D. The man wearing black diving goggles."
+ ]
+ }
+ },
+ "oZhZNksv": {
+ "video_id": "oZhZNksv",
+ "video_duration": "75s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A boxer with blond hair is standing over another boxer who is down on the mat. The referee begins the count, gesturing with his hand. The downed boxer attempts to stand up, using the ropes for support. He rises to one knee then stands, and the referee checks on him. The referee wipes the boxer\u2019s gloves on his shirt and gestures for the fight to continue. The boxers resume fighting. They grapple briefly. The blond boxer lands a series of punches, and the other boxer is knocked down again. The referee begins another count. The downed boxer falls against the ropes and stays down as the referee counts. The referee waves his hand to stop the fight. The blond boxer raises his arms in victory. He continues to celebrate as the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Does the video include the sound of a bell ringing?",
+ "answer": "D",
+ "candidates": [
+ "A. None of the above.",
+ "B. No, it does not.",
+ "C. I am not sure.",
+ "D. Yes, it does."
+ ]
+ }
+ },
+ "cTHqBAuR": {
+ "video_id": "cTHqBAuR",
+ "video_duration": "82s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "A man wearing a yellow shirt with black stripes and black shorts walks towards the left side of the screen. The score displayed in the top left corner is BAL 2-0 LEI. The screen transitions to a blue background with white text that reads \"PRE-SEASON 2021.\" The screen transitions to a shot of a man wearing a yellow shirt with black stripes and black shorts running towards a soccer ball. The man kicks the ball towards the goal, where a man in a dark shirt and shorts kneels to defend. The ball goes into the net. The man in the dark shirt walks away from the goal. The screen transitions to the blue background with white text \"PRE-SEASON 2021.\" A man wearing a yellow shirt with black stripes walks from right to left on the screen. A man wearing a dark shirt and shorts with the number 12 on the back walks from left to right on the screen. The man in the dark shirt and shorts continues walking away from the camera. The man in the dark shirt continues walking. The shot transitions to a wider shot of a soccer field. A man wearing an orange shirt and shorts stands to the left of the soccer goal, while a man in dark shirt and shorts stands slightly to the right of the goal and a man in blue shirt and shorts is standing to the right of the screen. The man in the blue shirt and shorts begins running towards the soccer ball, which is near the middle of the screen and kicks the ball. The man in the orange shirt dives to try and block the ball, but the ball goes into the net. The score is now BAL 2-1 LEI in the top left corner. A man wearing a blue shirt with white text on the front walks from the right side of the screen to the left. The screen transitions to a close up of the same man with a blue circle border. The screen transitions to the wide shot of the soccer field, where a man in blue shorts and shirt kicks the ball. The man in orange shorts and shirt dives to try to block the ball, but the ball goes into the net. The referee raises a flag. 
The screen transitions to blue background with the text \"PRE-SEASON 2021\" in white text. A man wearing a blue shirt walks across the screen from right to left. A man wearing an orange shirt with the name Russel and number 24 on the back walks across the screen from right to left. The man in orange shirt continues walking. A man wearing a yellow shirt walks across the screen. A man wearing the same yellow shirt bends down with the soccer ball. The man in yellow shirt stands. The screen shows a wide shot of the soccer field where the score is BAL 2-1 LEI. A man in yellow shorts and shirt and a man in orange shorts and shirt each stand on opposite sides of the field. A man in dark shirt and shorts stands between the two. The man in yellow shorts and shirt kicks the soccer ball into the net. The score is now BAL 3-1 LEI. A man in a yellow shirt and shorts is seen from the shoulders up walking from left to right on the screen. The screen shows three men sitting, wearing white shirts. A man in yellow shorts and shirt helps a man in a blue shirt up off the ground.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video, how many goals does the player wearing yellow score?",
+ "answer": "C",
+ "candidates": [
+ "A. Zero.",
+ "B. One.",
+ "C. Two.",
+ "D. Three."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order of appearance in the video for the goalkeeper in the black jersey, the athlete in the blue jersey, and the athlete in the yellow jersey?",
+ "answer": "B",
+ "candidates": [
+ "A. The athlete in the yellow jersey, the athlete in the blue jersey, and the goalkeeper in the black jersey.",
+ "B. The athlete in the yellow jersey, the goalkeeper in the black jersey, and the athlete in the blue jersey.",
+ "C. The goalkeeper in the black jersey, the athlete in the blue jersey, and the athlete in the yellow jersey.",
+ "D. The goalkeeper in the black jersey, the athlete in the yellow jersey, and the athlete in the blue jersey."
+ ]
+ }
+ },
+ "CpaihgLc": {
+ "video_id": "CpaihgLc",
+ "video_duration": "72s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view from a moving boat, showing the water and a trail of white foam. In the distance, there are two large ships on the horizon. The view switches to a closer view of just the water, and then to a person swimming in the water. The scene cuts to a woman sitting on a boat, wearing a dark blue polo shirt with \"HONG KONG 2013\" and a flower design on it, and then cuts back to the ocean. The scene cuts back to the person swimming, and then a man is seen looking into the water from the side of a boat. The scene switches back to the woman on a boat, wearing the same clothing as before. The video then shows a group of people standing on a boat and looking at the water. The scene cuts to a man in the water next to an orange circular flotation device. A man on a boat hands down a white rope to the man in the water. The scene then switches to the woman sitting on a boat again and the video ends with a man on the side of a boat helping a man with a blue top out of the water.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video show a rope ladder being climbed?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No, it does not.",
+ "C. Yes, it does."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence for rescuing a woman from the water as shown in the video? (a) The passenger notices the woman who fell into the water. (b) The ship lowers a rope ladder. (c) A life buoy is thrown. (d) The woman in the water is hit by waves and is about to be sucked under the ship. (e) The crew uses a ladder to rescue her.",
+ "answer": "D",
+ "candidates": [
+ "A. (a) (b) (c) (e) (d).",
+ "B. (a) (b) (c) (d) (e).",
+ "C. (a) (d) (b) (c) (e).",
+ "D. (a) (c) (b) (e) (d)."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where was the woman positioned on the boat when she fell into the water in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. At the bow of the boat.",
+ "B. At the bottom of the boat.",
+ "C. On the left side of the boat.",
+ "D. At the stern of the boat."
+ ]
+ }
+ },
+ "rFzJdvhc": {
+ "video_id": "rFzJdvhc",
+ "video_duration": "99s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman with blonde hair wearing sunglasses and a denim jacket walks on a path next to a man wearing a grey t-shirt and backpack. The woman points to the path ahead. The camera then faces the path and moves forward. The woman spins around on the path with outstretched arms. The camera points up at the trees.A young man in a yellow raincoat talks to the camera while a young woman in a blue raincoat stands behind him and makes faces. The woman and man talk to the camera while wearing raincoats. The woman presses the button for an elevator. She stands inside the elevator as it goes up. The man in the yellow raincoat and black Adidas slides stands in the elevator.The man and woman stand on a city sidewalk in the rain wearing raincoats. The woman dances on the sidewalk in the rain. Water flows out of a pipe onto the street.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the man in the video say his style is inspired by swimmers?",
+ "answer": "B",
+ "candidates": [
+ "A. In the forest.",
+ "B. In an elevator.",
+ "C. In the living room.",
+ "D. On the road."
+ ]
+ }
+ },
+ "XRKONruE": {
+ "video_id": "XRKONruE",
+ "video_duration": "644s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Fashion",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A woman with shoulder-length, wavy, light brown hair and smoky eye makeup sits at a desk. She wears a black, cold-shoulder top and a dark headband with a bow.\u00a0She holds various makeup items and displays them to the camera. She gestures with her hands while looking directly at the camera. Occasionally, she shows swatches of makeup on the back of her left hand.\u00a0Behind her, a desk is visible with decorative items such as a two-tiered stand, candles, flowers, a wooden \"@\" symbol with lights, a light blue candle holder, and a wooden sign that reads, \"Let your faith be bigger than your fear\". \"Lipgloss Leslie\" is written in light teal cursive letters in the bottom right corner of the frame.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What object in the video is shaped like a brush?",
+ "answer": "C",
+ "candidates": [
+ "A. Eyebrow pencil.",
+ "B. Lip brush.",
+ "C. Mascara.",
+ "D. Eyeliner pen."
+ ]
+ }
+ },
+ "fpMbEArc": {
+ "video_id": "fpMbEArc",
+ "video_duration": "134s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A light-purple animated figure with a stitched patch on its torso stands at a light-green podium with a geometric design. The figure gestures with its hands and raises them. It then brings a hand to its face in a thinking pose. The figure shrugs, points a finger, and resumes gesturing.A photograph of a seated man holding a colorful, geometric, spherical object in front of his face appears. The object obscures his face. He wears a blue shirt and khaki pants and sits against a wooden wall with a door and a snowboard nearby.A complex, circular, geometric pattern appears on a black background. The label \"E8\" is in the bottom right corner.The animated figure returns, making gestures at the podium.Close-up images of coral appear. One is orange and pink, the other is orange and white.The animated figure uses a welding torch to apply flame to a brown rock covered in pink mushroom-like organisms.A purple cartoon creature with a smiling face and flower-like appendages appears in a dark-blue ocean. Green and red circles flash around parts of the creature.The animated figure gestures at the podium. Mathematical equations appear on a black background.Five metallic, spherical objects, encircled by thin rings, float in space.The close-up image of orange coral with circular openings reappears.A crowd of people walks across a crosswalk. Another crowd walks down a busy city street.The purple creature in the ocean reappears.The animated figure returns to the podium and gestures, raises a finger, and then resumes gesturing. Mathematical equations reappear on a black background.Five metallic spheres encircled by thin rings float in space again.A blue question mark appears.The animated figure returns to the podium and gestures.A table labeled \"Forces (bosons)\", \"Matter (fermions)\", \"Second generation\", and \"Third generation\" appears, filled with symbols and text.Two circles connected by thin yellow lines appear on a black background. 
The scene brightens to white.A blue question mark appears again. The animated figure returns to the podium and gestures.A rainbow-colored geometric pattern develops on a black background, growing in complexity as more lines and shapes are added. Circles, triangles, and hexagons form around a central point.A partial seven-circle pattern on a white background is overlaid by a six-pointed star outlined in dark red.The complex, rainbow-colored geometric pattern on the black background reappears.The image of the complex, white circular geometric pattern labeled E8 reappears.The rainbow-colored geometric pattern on a black background returns, with seven white circles superimposed over it. The pattern grows more complex with additional lines. More white circles are added over the pattern, then white hexagons.The two-circle diagram with yellow connecting lines reappears on the black background. A large bright circle fills the screen.A blue question mark appears.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where does the man in the video place the model?",
+ "answer": "D",
+ "candidates": [
+ "A. On the table.",
+ "B. On his left hand.",
+ "C. On his head.",
+ "D. On his right hand."
+ ]
+ }
+ },
+ "LkBpZUDg": {
+ "video_id": "LkBpZUDg",
+ "video_duration": "237s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A character wearing a grey beanie, scarf, and green jacket is visible on the left of the screen and another character is visible on the right of the screen. A dark screen with text is superimposed over them. Next a character in a blue dinosaur suit is shown walking down a dark hallway, holding a rifle. This same blue dinosaur suit character walks down a hallway towards the camera with a prompt on the screen reading \"Get the camera back\" and \"Kill zombies to free the survivor.\" The character walks through a doorway. The character walks through a crowd of red-colored zombies on a city street while driving a yellow car. The blue dinosaur suit character drives the car through more crowds and then through a building. The blue dinosaur suit character is shown running through another dark area. A map is visible. The character then runs through a plaza fighting an enemy labeled as \"Calder,\" before then entering a room and looking around. The character runs through a doorway to a ledge overlooking the plaza, runs around on the ledge and then grapples up and equips the exo-suit. Next, the character runs to a different part of the map. The character is shown walking and running in the exo-suit through a building. The map is briefly shown. The character exits the building in the exo-suit and reenters another building. The character in the exo-suit is shown in a different part of the building and fights a man in a dark suit repeatedly.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the relative position of the game character in relation to the soda machine in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The character is to the left of the soda machine.",
+ "B. The character is above the soda machine.",
+ "C. The character is in front of the soda machine.",
+ "D. The character is to the right of the soda machine."
+ ]
+ }
+ },
+ "EHwWoWYE": {
+ "video_id": "EHwWoWYE",
+ "video_duration": "100s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A seasoned piece of meat sits in an aluminum tray. A gloved hand briefly touches the meat. The scene cuts to an outdoor setting where a man in a teal shirt and black gloves uses meat claws and a knife to shred the meat. A young girl in a peach shirt and a man in a dark blue shirt watch. A jar of barbecue sauce sits on the white table next to the tray. The man opens the jar and pours the sauce over the shredded meat. He uses the claws to mix it in. He takes a bite of the meat, and the girl takes a bite as well. A child runs by in the background.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What type of content creator is the man wearing a light blue short-sleeved shirt and gloves in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Food exploration.",
+ "B. Culinary instruction.",
+ "C. Food review.",
+ "D. Mukbang."
+ ]
+ }
+ },
+ "QjosqNSt": {
+ "video_id": "QjosqNSt",
+ "video_duration": "165s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The text \"METHOD 2\" and \"PASTELS\" appears on a black background. A hand holds a box of pastel sticks, then takes a brown pastel stick and scores it with a silver utility knife, making powder. A hand holding a jar of brown powder. A hand holding a light brown toy horse dips the toy horse's hooves in the powder. The toy horse is held over a paper plate, with a green towel underneath, while a hand uses a paintbrush to apply the brown powder to the horse's body. The image transitions to a hand wearing a blue glove holding a toy horse. A hand holding a paintbrush applies a dark color to the toy horse's head. Then a hand uses a blue towel to scrub the toy horse's body. Then a hand using a wooden stick applies a dark color to a light brown toy horse, making a pattern on the side of the horse's body. The image fades into a brown toy horse with a white pattern on its side. A hand is shown holding a box with a metal pencil sharpener. The hand then takes a pink colored pencil and uses the sharpener. A hand holding a wooden stick uses it to apply a dark pigment to a toy horse's body. A hand holds a pink pencil and draws spots on a toy horse's flank. Then, using a small brush, it applies white paint spots to a grey toy horse. Then it takes a pink colored pencil and applies it to a grey toy horse's tail. The hand then uses a sharp pointed tool to draw white spots onto the side of the grey toy horse.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What item does the woman in the video use to enhance the spots?",
+ "answer": "C",
+ "candidates": [
+ "A. Chalk.",
+ "B. Watercolor pencils.",
+ "C. Acrylic paint.",
+ "D. Pencil."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the white plate relative to the blue towel in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The white plate is below the blue towel.",
+ "B. The white plate is to the left of the blue towel.",
+ "C. The white plate is above the blue towel.",
+ "D. The white plate is to the right of the blue towel."
+ ]
+ }
+ },
+ "lPvewDdi": {
+ "video_id": "lPvewDdi",
+ "video_duration": "76s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows multiple people and puppets speaking, with different backgrounds. A woman with blonde hair and a black t-shirt with a rose on it is speaking next to a rocket made of twigs. Two men are sitting in front of a painted town background. A man with dark hair is standing next to a puppet with black and white hair. A tan puppet with a red scarf is partially visible in a pink structure. The two men are speaking again in front of the town background. A man with dark hair is standing next to a seal puppet wearing a spacesuit. A woman with blonde hair is speaking in front of a painted landscape background. A fluffy, dark gray puppet with a pink bow is in a white tub. The same blonde woman is speaking again in front of the painted landscape. A man with gray hair is speaking in front of a tan sofa with a yellow stove behind him. A man with dark hair is holding a ukulele puppet. The same man is speaking next to the ukulele puppet again. The same man is holding the ukulele puppet for a final time. The man with gray hair is speaking again. A woman with dreadlocks is speaking in front of a barrel and trees. The same woman is speaking again. The man with dark hair is sitting down and playing the ukulele puppet next to a dark gray fluffy puppet. The same man is walking next to a pink eiffel tower structure, where the tan puppet from earlier is visible. The man from earlier is standing next to a gray puppet that pops up. The man is standing next to a puppet with a dark face, the tan puppet and a purple, fluffy puppet. A woman with blonde hair is speaking in front of green cactuses. The woman is speaking again. A woman with dark, curly hair is speaking in front of a colorful town background. The woman is speaking again. Two puppets are present, one with a mustache and a chef hat who is holding a covered dish. The same woman is speaking again. The chef puppet takes the cover off the dish, revealing a head with a red apple in it. 
The woman with dark hair is speaking again. A woman with dark hair is speaking again. A woman with dark, curly hair is speaking again.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "At the beginning of the video, where is the glasses-wearing man seated in relation to the bearded man?",
+ "answer": "C",
+ "candidates": [
+ "A. Opposite.",
+ "B. On the right.",
+ "C. On the left.",
+ "D. Behind."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the middle of the video, how many types of instruments appear?",
+ "answer": "C",
+ "candidates": [
+ "A. Four types.",
+ "B. Two types.",
+ "C. One type.",
+ "D. Three types."
+ ]
+ }
+ },
+ "clyiRhHB": {
+ "video_id": "clyiRhHB",
+ "video_duration": "53s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a man walking on a glacier, then he is shown speaking, with the title \"Donald Wray, Outdoor Expert\". Then, the first man is shown walking in a narrow channel in the glacier. Next, a man with a beard is shown in a transparent box while water falls around him and then is covered in foam. Then a digital clock is displayed showing 01:00:00, with the name \"Wim Hof\" and a Netherlands flag below it. Then a man is shown in a transparent box filled with ice, with medical tubes running to his nostrils and head, then with a bandage on his forehead. Then, the man emerges from the ice wearing light grey shorts, and with medical tubes attached to his chest and torso, with his arms in the air. Next, two men and a woman stand beside a large rectangular container filled with foam, with digital clock showing 1:43:36. Then a man, also Wim Hof, is shown receiving a certificate from two men in a studio setting. Then a man with a beard is shown close up, and another man is seen immersed in ice in a transparent container. Then the man is shown wearing yellow shorts and walking in snow with his bare feet. Then the man is shown walking in a snowy landscape, then he is shown walking along a narrow ridge on a snowy mountain, and finally standing next to signs on a mountaintop. Then a close-up view of a person's hands rubbing is shown, and then the video shows a person swimming under a layer of ice. Then the video shows a man sitting on the snow after emerging from the water, and he is being tended to by other people. Finally, a man with the title \u201cBrett Winterbottom, Outdoor Expert\u201d is shown speaking, and then a man is shown swimming in a lake.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence in which Wim introduces his Disney records?",
+ "answer": "B",
+ "candidates": [
+ "A. Climbing mountains, swimming, running a marathon, and handling ice cubes.",
+ "B. Handling ice cubes, running a marathon, climbing mountains, and swimming.",
+ "C. Running a marathon, handling ice cubes, swimming, and climbing mountains.",
+ "D. Handling ice cubes, climbing mountains, swimming, and running a marathon."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In which scenario does a bearded man cross his arms over his chest?",
+ "answer": "A",
+ "candidates": [
+ "A. A large glass container filled with ice cubes.",
+ "B. A large glass container filled with ice water.",
+ "C. A snowy mountain.",
+ "D. A frozen lake."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the bearded man spread his arms excitedly on the street in the middle of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Because he successfully reached the top of the mountain.",
+ "B. Because he won first place in diving.",
+ "C. Because he ran the half-marathon barefoot in the shortest time.",
+ "D. Because he recovered from a cold after being chilled by ice cubes."
+ ]
+ }
+ },
+ "XoquFTEn": {
+ "video_id": "XoquFTEn",
+ "video_duration": "51s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman wearing a black tank top and gray leggings is shown doing different exercises with gym equipment. She starts by lifting a barbell with weights while lying on a bench. Then she is seen doing shoulder presses on a machine. Next, she does lateral raises with two dumbbells. Then a man wearing a black tank top and a white hat does lateral raises with two dumbbells, followed by a man doing tricep press downs. The woman is then shown doing tricep press downs on a machine. She finishes by doing rope upright rows. Other people can be seen working out in the background of some of the shots. Text appears on the screen describing the workout type, set count, and repetition count.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many dumbbells does the man in the red shirt pick up in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Three.",
+ "D. Zero."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "Which exercise shown in the video requires reaching failure at 15 pounds?",
+ "answer": "B",
+ "candidates": [
+ "A. Cable row.",
+ "B. Lateral raise.",
+ "C. Machine shoulder press.",
+ "D. Tricep pushdown."
+ ]
+ }
+ },
+ "LrpeulHs": {
+ "video_id": "LrpeulHs",
+ "video_duration": "40s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video opens with a close up shot of a laptop, headphones, a mini keyboard, and an audio interface on a black table. The camera pans to a suspended microphone. Then, a pair of wooden over-ear headphones is centered in the frame. Text that reads \"DAY 01 // REHEARSE & REPEAT\" is overlaid on the screen. A person with dark hair and glasses wearing a red t-shirt appears, followed by a shot of a band with four members in a recording studio. In this shot, one person is holding a stand-up bass, another is using a laptop and other instruments, the third is standing with their hands in the air, and the last is off to the side. A cut to the two people from the previous shot shows the person in a red shirt looking away and the other person pointing and looking toward the same direction. Then, the person in the red shirt looks toward the camera while the other person touches the brim of their hat. The video cuts to a woman with black hair playing drums. The video cuts again to a man playing a silver trumpet. Two people are seen next playing keyboard instruments. A cut back to the two members from earlier. The person in a black shirt points at the camera. Then, a close-up shot of a hand playing a piano, followed by two people who are playing the bass and drums, respectively. Next, the video shows a man playing a trumpet again in front of a keyboard. There is a shot of someone using a laptop at the same time as someone plays the trumpet. The man with the black hat is also seen playing the keyboard. The person in the red t-shirt is writing with a pen. Then the person in the red shirt is seen looking toward the camera. The video cuts to a full view of the band, with the man in the black shirt facing away from the camera with his hands up. A shot of a parking lot with \"FAIRFAX RECORDINGS\" painted on the asphalt appears. 
Then the exterior of a building with the same name and the number \"15456\" is shown, followed by a shot of a wall full of framed gold records.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What instrument is the woman wearing an orange short-sleeved shirt playing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Congas.",
+ "B. Drum set.",
+ "C. Bass.",
+ "D. Guitar."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who recently said, \"So I never thought I would become someone like this?\"",
+ "answer": "B",
+ "candidates": [
+ "A. A man wearing a black shirt.",
+ "B. A man wearing a red shirt.",
+ "C. A man wearing a yellow shirt.",
+ "D. A man wearing a blue shirt."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What do the man in the red short-sleeve shirt and the man in the black short-sleeve shirt do in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. High five.",
+ "B. Hug.",
+ "C. Fist bump.",
+ "D. Shake hands."
+ ]
+ }
+ },
+ "fBtZklQN": {
+ "video_id": "fBtZklQN",
+ "video_duration": "210s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows multiple scenes of what appears to be a beauty pageant. In the first scene, a woman wearing a silver crown, a gold sequin dress, and an orange sash that says \u201cSec\u201d is holding a black disc in a red box. Three other women are standing behind her to the right, all wearing silver sequin dresses. The next scene shows a woman wearing a silver crown and a green sequin dress. She has a sash that says \"Winner of The Miss\". She is holding a microphone. The background behind her has multiple national flags including a UK flag. The next scene shows the same woman, still holding a microphone, and with the same flags behind her. The background changes to a blue, abstract design with the same flags. The next scene shows the same woman in the green sequin dress speaking into the microphone. Another woman in a light blue dress and dark hair is in front of her holding a document in a dark case. There are multiple other women standing in the background. The next scene shows two women wearing a silver crown and orange sashes, holding a dark disc in a red box. The following scene shows the same two women with the discs in the boxes. The scene then changes to the woman wearing the green sequin dress speaking into a microphone, against a background of the flags. The next scene is the same woman speaking into the microphone, with a background of flags and other images. The scene again switches to the same woman speaking into the microphone with a background of more flags. The next scene shows a stage with a number of women in beige or gold dresses and a few in other colored dresses. Some women are holding up flags. The camera stays at the same angle showing more of the pageant participants on stage. The scene changes to show the woman in the green sequin dress speaking into the microphone with the same flags in the background. The camera shifts back to the stage with a number of women in beige or gold dresses and a few in other colored dresses. Some women are holding up flags. The scene changes to show the woman wearing the green dress speaking into a microphone. The following scene shows the woman in the green dress speaking into the microphone, with more of the flags in the background. The scene shifts to show the same woman wearing the green sequin dress and a crown, talking to a woman in a blue sequin dress. More women are standing in the background. The next scene is the woman in the green sequin dress talking to the woman in the light blue dress. More women can be seen standing in the background. The scene changes to the same two women, still speaking to each other. The camera then returns to the stage with the women in the gold dresses and a few other colored dresses. The scene shifts back to the stage showing the women in their gold dresses. The next clip is of the laptop being used to view the video. The next view is of the same video being played on the laptop screen and then the screen is turned towards the viewer. The video then shows the full stage with the women in the beige, gold and other colored dresses. The video then shows two women holding hands with a series of flags behind them. The next scene shows the stage with the women and then two women are holding hands against a background of more flags. The video shows the woman with the crown in a dark dress with more women in the background. The video shifts to the stage with a number of women on the stage and some holding flags. The screen is zoomed in showing two women holding hands with the flags in the background. The video is then zoomed out showing the women holding hands and more flags in the background. The screen zooms out again showing the women holding hands and the stage with the women. The next scene is of the woman in the green dress talking to a woman in a light blue dress and a crown. The scene shifts again to the woman in green talking to a woman in light blue and more women are in the background. The scene switches again to the same two women with other women in the background. The camera moves to the full stage and then the focus is zoomed in on two women holding hands on the stage. The next video shows the woman with the crown and dark green dress walking. The screen then shows more of the pageant. The video shows a woman in a golden dress and a red sash walking towards another woman with a microphone, then embracing her. The camera angle shows a wide shot of the stage with a number of women in different colored dresses. The next scenes show various perspectives of the stage and of women in different colored dresses. The next scene shows various pageant participants and then two women in a yellow and gold dress. One is being crowned and holding a prize. The scene shows the woman in gold and red holding up her prize, and being crowned. Another woman, wearing a green outfit, is to her right. The scene is repeated. The woman in red and gold is again seen with the crown and a prize. The other woman in green is next to her. The women are seen on stage together from more perspectives. The scene shows women on the stage, and then is focused on a woman with a dark crown and a white banner. The woman with the silver crown is seen from another angle. The final scene is of the woman with the silver crown and her prize.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What do the Black woman and the white woman in yellow dresses do on stage in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Look into each other's eyes.",
+ "B. Hug each other.",
+ "C. Stand side by side.",
+ "D. Hold each other's hands."
+ ]
+ }
+ },
+ "IzIiNatP": {
+ "video_id": "IzIiNatP",
+ "video_duration": "55s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A man with a baseball cap, tank top, and multiple necklaces stands in front of a wall with other people behind him. He makes various hand gestures, including pointing at the camera, clapping, and making other motions with his hands. The cap is white with \"LA\" and what appears to be an assault rifle in black. His tank top is grey and white with the word \"Hollywood\" printed on it. He has tattoos on both arms and his chest. The men behind him wear a variety of clothing, including plaid shirts, t-shirts, and at least one man in a tuxedo. One man wears a black shirt with the words \u201cThuggish Ruggish\u201d on it. Another shirt has the words \u201cWe are still here 2015.\u201d A logo is visible in the upper right corner, and the words \u201cHollywood Kid\u201d appear across the bottom of the screen initially, later replaced by \u201c#AEC2015\u201d and the logo. The man continues rapping and gesturing throughout the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many gold necklaces appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Five.",
+ "B. Three.",
+ "C. Two.",
+ "D. Four."
+ ]
+ }
+ },
+ "zhSaWdPv": {
+ "video_id": "zhSaWdPv",
+ "video_duration": "62s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person in a black wetsuit is paddling on a surfboard in choppy water. The person is then seen kneeling on the surfboard, also in the water. The person in the wetsuit is then shown sitting on the surfboard on a beach. A group of people wearing dark-colored clothing are working on several dirtbikes in a desert-like area. A close-up of a person with braided hair is shown. The person is then shown riding a dirtbike on the same desert-like terrain. A different shot shows a person on a dirtbike, with mountains in the background. The camera zooms in on the wheels of the dirtbike moving through the desert-like ground. An aerial shot shows a dirtbike moving across the white, desert-like expanse. A person is stacking wooden blocks in a lab setting. A robotic arm moves a block from a table and places it on top of the block tower. The person then smiles while the robotic arm moves blocks. The same person is then shown with a group of people in lab coats watching a number of robotic arms perform actions on a table. A close up of a robotic arm moving a circuit board is shown, followed by a close up of the same person holding a circuit board, followed by more shots of the robotic arm. A person wearing a spacesuit and backpack is standing on a rocky surface, with mountains in the background at sunset. A close up of the same person in the space suit is shown. A full view of the space suit person looking out over the rocky surface at sunset is shown, followed by more views of this person. A person in a dark outfit is running across the same white, desert-like expanse. The same person in a white baseball cap is then shown with mountains in the background. Three people are then shown walking through tall grass towards the mountains. The person in a white cap is then shown crouching down with a camera, taking photos. A close up of this person with a camera is shown. A full view of the same person, wearing a white cap and light sweatshirt, is shown in front of the mountains. Two people are shown walking together through a produce market. A close up of the person with braided hair eating a piece of fruit is shown. More views of the two people in the produce market are shown. A sunrise over silhouetted mountains is shown. A dome-shaped building is shown. A person is walking along a railing, overlooking the mountains at sunset. A close up of this same person looking out over the mountains is shown. The inside of a dome-shaped building with a large telescope is shown. A person with braided hair is standing on a platform, and then at a computer. The United States capital buildings are shown in the distance. People are shown walking up the stairs of a building. An interior view of a governmental room with people sitting and a person speaking at a podium is shown.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "How many apples did the woman in the video eat?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. I don't know.",
+ "C. Two.",
+ "D. Three."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the woman go cycling on the salt flats in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Because she wants to see how fast she can go with just two wheels and a battery.",
+ "B. Because her friends want to go.",
+ "C. Because it is part of a learning task assigned by the school.",
+ "D. Because her family wants to go."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "After the woman finishes her speech at the end of the video, in which direction are the camera flashes relative to the woman?",
+ "answer": "B",
+ "candidates": [
+ "A. To the left of the woman.",
+ "B. In front of the woman.",
+ "C. To the right of the woman.",
+ "D. Behind the woman."
+ ]
+ }
+ },
+ "xKpMnpqF": {
+ "video_id": "xKpMnpqF",
+ "video_duration": "537s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with dark hair and light eyes, wearing a black and white patterned bandanna, is visible, facing the camera. The person speaks and moves their hands to touch their face or adjust their hair and bandanna. In the background, there are several framed pictures, a dark lamp on a shelf, and a gray mannequin. The person applies makeup to their face using several cosmetic applicators, including several brushes, a pencil, and a tube. The colors of the makeup include beige, brown, white, pink, and dark brown. At the end of the video, the person shows several social media logos, as well as their name and a subscribe request. The person points with two hands towards the words, then they move their hands away, adjust their hair and bandanna, and then look towards the camera, speaking.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there any eye cream in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Not sure.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "kfaFhuht": {
+ "video_id": "kfaFhuht",
+ "video_duration": "483s",
+ "duration": ">8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with an animated graphic featuring yellow and black angled stripes and the text \"MICHAEL DENICOLA\". A series of black and white sketches then appear in rapid succession, showing a figure being drawn. The figure has a long cape and appears to be holding a sword. The sketches are followed by a still image of a person with dark hair in a dark shirt and a light colored, textured jacket. This person is wearing black headphones with a microphone, and also appears to be wearing glasses. This is followed by a black and white static image. Several still images appear after this, which appear to be stills from the sketch animation and various reference material for it. The video concludes with more stills of the person in the headphones and an outro that shows icons for Instagram and Twitter.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why did the man in the video choose a specific silhouette?",
+ "answer": "D",
+ "candidates": [
+ "A. Because the silhouette looks the coolest in the thumbnail.",
+ "B. Because he likes the storyline of *The Mandalorian.*",
+ "C. Because the creator wanted to highlight the details of the lightsaber.",
+ "D. Because Luke Skywalker appears in *The Mandalorian.*"
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times is the video muted?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Six.",
+ "C. Seven.",
+ "D. Five."
+ ]
+ }
+ },
+ "XdGDILDo": {
+ "video_id": "XdGDILDo",
+ "video_duration": "66s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A bar graph appears, showing a long tail distribution. The x-axis is labeled \"Poorest\" on the left and \"Richest\" on the right. A bracket encompasses the flatter part of the graph, near the \"Poorest\" label, with the words \"6% OF WORLD WEALTH.\" Then, a figure of a person appears near the right side of the graph, next to the number \"3\". This number changes to \"300\". Then, another figure of a person appears near the left side, as the \"300\" remains. Next to the figure on the left, \"3,00\" appears. The figure on the left is then replaced by an image of the Earth, and the \"3,00\" changes to \"3,000,000,000\". The screen goes black. A man in a suit speaks at a podium. The podium has a sign reading: \"FT Business of Luxury Summit.\" He gestures with his hands as he speaks. The screen cuts out and transitions to a YouTube video of the same man speaking at the same podium. The title of the video is \"Business of Luxury Summit 2015 D1 Opening Keynote - Johann Rupert\". The video plays for a brief moment. The screen then transitions to the Konoz learning platform. A video titled \"Robots Will Steal Your Jobs\" is playing, showing a man in a red t-shirt giving a TEDx talk. The video plays for a brief moment, then is paused and minimized. The mouse clicks on the next video listed: \"Humans Need Not Apply\". The view changes to the \"My Milestones and Goals\" page on Konoz. This displays three milestone goals in euros: \u20ac500, \u20ac1000, and \u20ac2000 per month, with descriptions for how achieving each milestone would support video production. Below these goals, the words \"MORE...\" appear, followed by the phrases \"The sky is the limit\" and \"To the moon\". The view transitions to Federico Pistono's teacher profile, which includes a photo of a man, a brief bio, and donation options. 
The mouse clicks on the \"Support this teacher\" button, opening a donation page asking \"How much do you want to give per month?\" The amount in the box is adjusted using the up and down arrows, first to \"7,\" and then to \"8.\" The screen then transitions to a page displaying four donation tiers, with rewards listed for each. The mouse hovers over and then clicks the third tier, labeled \"CREDITS\" at $20.00 per month. The screen fades to white. The following words appear in succession, in black text on the white background: \"@FEDERICOPISTONO,\" \"FIND ALL THE LINKS BELOW,\" and finally, \"SUPPORT ME ON KONOZ :D,\" with a small animated rainbow cat next to the last line.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "In what year did Rupert deliver the speech?",
+ "answer": "B",
+ "candidates": [
+ "A. 2016.",
+ "B. 2015.",
+ "C. 2013.",
+ "D. 2014."
+ ]
+ }
+ },
+ "JAXeCXBs": {
+ "video_id": "JAXeCXBs",
+ "video_duration": "473s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a man wearing a tan t-shirt looking directly at the camera, and gesturing with his hand in front of a white wall with shelving and objects. The text \"Unlimited access to classes\" is shown over the video. The man continues talking and gesturing. The video then cuts to a small inset showing a man looking at some cards and the text \"Jeff Staple\". The man wearing a tan t-shirt speaks and gestures and then the video cuts to a scene of three men sitting at an outdoor table with various food and drinks. The man on the left is wearing a green t-shirt, the man in the middle is wearing a blue v-neck shirt, and the man on the right is wearing a white t-shirt and a black baseball cap. There are five clear plastic cups with liquid in front of them. The video then shows closeups of the various foods including fried chicken pieces and a green dipping sauce, tater tots with a sauce, nachos, and a sandwich. There are close up views of the food preparation, including frying the chips for the nachos, adding cheese to them, assembling the chicken sandwiches, and preparing tater tots. There are also views of the three men eating. They hold the food up to the camera and one of them gives some of the food to another. They each drink from the various cups. Text appears naming the food, such as \"Lu Rou Nachos,\" \"Mapo Tofu Tater Tots\" and the drinks as \"Apple A La Mofo,\" \"Taro Sweet Milk,\" \"Carrot Matcha Latte,\" and \"Strawberry Basil Ginger Limeade\". The three men share the food. The video then has an inset of a woman standing in front of multiple 100 dollar bills, then back to the men at the table. One man flexes his bicep muscle and a picture of 85 degrees is shown. The video then shows more views of the men talking and gesturing, then more shots of the food being prepared including a salad with cabbage and carrots, then back to more footage of the men eating. 
The video concludes with a view of the three men, giving thumbs up signs, the man in green pointing to the bottom of the screen and the man in white asking \"what is another dish Bopomofo should serve?\" and \"check the description,\" before pointing to the bottom of the screen and the man in green again pointing at the bottom of the screen, and showing \"$25 amazon gift card\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who says \"Thank you, Jason\" in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The man in the blue shirt.",
+ "B. The man in the white shirt.",
+ "C. The man in the green shirt.",
+ "D. The man in the yellow shirt."
+ ]
+ }
+ },
+ "rpedSOsB": {
+ "video_id": "rpedSOsB",
+ "video_duration": "39s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with close-up shots of white and pink blossoms on bare branches. The camera then pans up to show leaves on a branch against a slightly cloudy sky. The camera next focuses on a man with grey hair wearing a light blue shirt and grey vest, standing in front of a row of trees. The video then shows a computer screen displaying a data analysis graph with blue bars. This is followed by a shot of a man wearing an orange shirt with a dark collar and stripes across the chest, standing in front of trees. Then, the camera shows a tall white structure with a solar panel attached to it, against a blue sky with clouds. The camera pans out to show a larger field with dirt patches, trees in the distance and wooden crates. The video goes back to the grey-haired man in the light blue shirt and grey vest, in the same location. Following this, an aerial view of rows of green trees and brown dirt strips between them is shown. Another aerial view is then shown of land with a meandering river and square plots of land adjacent to each other. Then, there is an aerial view of green rows of trees and buildings with white roofs and parking lots. Next, the video goes back to the man in the grey vest and light blue shirt, followed by a shot of green trees against a clear blue sky. A white drone flies into the frame, and the camera follows its movement. A close-up of a phone shows a map image of green rows of trees. The screen is touched, and then a view of a man with a grey vest and a light blue shirt is again shown. Finally, a dirt road is shown with trees along both sides.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "Is the moisture content shown in the bar graph on the webpage in the video continuously decreasing?",
+ "answer": "A",
+ "candidates": [
+ "A. No, according to the chart, the moisture level increases, then decreases, followed by another increase and decrease.",
+ "B. No, according to the chart, the moisture level decreases, then increases, followed by another decrease and increase.",
+ "C. Yes, the chart indicates that the moisture level is continuously decreasing.",
+ "D. No, the chart shows that the moisture level is continuously rising."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the man do using satellite images?",
+ "answer": "C",
+ "candidates": [
+ "A. Count the types of tree crowns.",
+ "B. Measure the size of the tree trunks.",
+ "C. Measure the size of the tree crowns.",
+ "D. Count the number of tree crowns."
+ ]
+ }
+ },
+ "jQFYdVAg": {
+ "video_id": "jQFYdVAg",
+ "video_duration": "60s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video displays gameplay footage of a mobile game. Initially, a menu is shown, highlighting a feature called \"Supply Converter\" within a game interface. The menu lists categories like \u201cConsumables\u201d, \u201cThrowable\u201d, \u201cAttachment\u201d, \u201cMelee Weapon\u201d and \u201cOther\u201d. The text \"Throwables, attachments, consumables, and supplies can all be converted\" appears on the screen. Next, a description of how to use the Supply Converter is shown, stating \"Every 2 types of items can be converted into 1 of 6 fixed items\" and \"which item you get is up to luck\". The video then shows a game scene where two avatars are visible in an outdoor environment. A menu with the text \"Supply Converter\" and plus symbols appears, and a hand icon appears to tap the question mark icon. This brings up the rules of the supply converter. The video then returns to the game scene. The video then proceeds to show gameplay, with additional pop-up menus appearing and players interacting with the game items. The text \"players can now prioritize searching for attachments, ammo, Shop Tokens, and fuel\" is superimposed. Next, a text box displays a menu with different items and a \"Convert\" button, followed by a \"Collect\" button. The text \"There's a chance to convert them into a Flare Gun\" and \"Hustle at the start of the match\" appears. The video returns to a scene with two avatars in view with text that says, \"and once you reach the mid-game, you'll be able to get Ghillie Suits, Lv. 3 Helmets, and more by converting\". One of the avatars is standing over a helmet which is circled in yellow. More gameplay is shown and the text says, \"Going the traditional route of getting stronger weapons is also not a bad idea\", followed by \"but be sure to mark out useful supplies for your teammates.\" In a game menu a list is shown, consisting of \"Spetsnaz Helmet\", \"Backpack\", and \"Kar98K\". 
A final text box appears stating, \"That way, everyone will have the greatest chance of getting the items they want!\". Finally, another game menu is shown, with the text, \"Lastly, the Supply Converter also has a chance of giving some mysterious items\", followed by, \"We\u2019ll leave that to the players to discover for themselves!\". A question mark appears in a blue box.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the flare gun obtained through conversion appear?",
+ "answer": "D",
+ "candidates": [
+ "A. At the beginning.",
+ "B. At the end.",
+ "C. It does not appear at all.",
+ "D. In the middle."
+ ]
+ }
+ },
+ "AAWgrzYx": {
+ "video_id": "AAWgrzYx",
+ "video_duration": "131s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows several distinct scenes. The first is a close up of a person picking fruit from a stall, with a light green and blurry overlay. The next scene is a wide shot of a field with large, light gray stones scattered throughout, under a cloudy sky. Then, there are street scenes with several buildings, with a variety of businesses on the ground floor, and white vehicles driving by. The next scene shows three men standing on a rooftop, with the man in the center wearing a leather jacket. Then, an older woman is shown in a greenhouse, trimming a plant and inspecting various potted plants. After that, a woman is shown spraying a small potted plant with an orange spray bottle. Then, there is an indoor scene of a large clothing store with racks of clothing and multiple people. Then, a close up view shows a rack of baby clothes. A blue awning with the letters ATM is shown in close up. A person enters a doorway with an ATM inside. Then, the view shows a woman standing at the ATM, pushing buttons, and then removing an item. An older woman with white hair is then shown against a blurry green background. Two people are shown walking down a sidewalk outside a building. Next, a close up of a hand pushes buttons on an ATM. An older woman is shown rubbing her neck, with a pained expression. A close up of a clenched fist is shown. Finally, two people appear to wrestle over a bag.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is Winnie most likely to do next?",
+ "answer": "A",
+ "candidates": [
+ "A. Withdraw money from an ATM.",
+ "B. Go shopping at the mall.",
+ "C. Take a walk in the park.",
+ "D. Report to the police station."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action did the person wearing black take toward the other woman at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Puts a coat over the woman.",
+ "B. Hands the bag to the woman.",
+ "C. Is robbing the woman's bag.",
+ "D. Embraces the woman."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the weather like at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Gloomy.",
+ "B. Cold.",
+ "C. Humid.",
+ "D. Sunny."
+ ]
+ }
+ },
+ "UOotLKca": {
+ "video_id": "UOotLKca",
+ "video_duration": "75s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Basketball",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a basketball game. Players in blue and white uniforms are on the court. They are running and dribbling the ball. The ball is orange. There is a referee with a black and white stripped shirt. There is an audience watching the game. There is a scoreboard on the screen showing the scores of Memphis and Tulane. The video shows multiple clips of the game, along with ESPN logos and other graphic designs. The video also has \"Rewind\" and other timestamps.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "When shooting a three-pointer, which player passed the ball to the man wearing the white jersey with the number 12?",
+ "answer": "B",
+ "candidates": [
+ "A. 13.",
+ "B. 24.",
+ "C. 10.",
+ "D. 21."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does applause occur in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. One.",
+ "B. Two.",
+ "C. Three.",
+ "D. Four."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the identity of the man standing by the basketball court wearing black and white stripes?",
+ "answer": "D",
+ "candidates": [
+ "A. Spectator.",
+ "B. Player.",
+ "C. Referee.",
+ "D. Coach."
+ ]
+ }
+ },
+ "nOzvkDZR": {
+ "video_id": "nOzvkDZR",
+ "video_duration": "87s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with long dark hair is shown applying makeup to a small child who is seated in a black folding chair. The woman is wearing a blue blazer and a black top. The child is wearing a white shirt with pink and blue graphics and has blue ribbons in her hair. The background is made up of a gold sequin backdrop. The woman first applies yellow paint to the child's forehead using a white applicator. She then uses a small black brush to draw lines on the child's face. She also applies pink paint to the child's cheeks. The woman uses a white applicator and her hand to apply what appears to be powder. The child occasionally moves her head, tongue, or hands.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order is the makeup applied in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Apply highlighter to the forehead, draw a spiderweb on the left side, apply highlighter to the cheeks, draw a spiderweb on the right side.",
+ "B. Draw a spiderweb on the left side, apply highlighter to the forehead, apply highlighter to the cheeks, draw a spiderweb on the right side.",
+ "C. Draw a spiderweb on the left side, draw a spiderweb on the right side, apply highlighter to the forehead, apply highlighter to the cheeks.",
+ "D. Apply highlighter to the forehead, apply highlighter to the cheeks, draw a spiderweb on the left side, draw a spiderweb on the right side."
+ ]
+ }
+ },
+ "uasWxMUF": {
+ "video_id": "uasWxMUF",
+ "video_duration": "299s",
+ "duration": "4-6min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video starts with an avatar standing in front of a red building, with text above reading \"ROBLOXIA HIGH SCHOOL.\" The avatar has a green face, black wings with orange flames, and is holding an egg in each hand. A menu appears with a list of items that are equipped to the avatar, such as \u201cGoldfork Troll,\u201d \u201cFester Eggs Headbow,\u201d and \u201cNinja Bandana\u201d. A window appears asking the user if they would like to \u201cWear Roblox Avatar?\u201d followed by \u201cCancel\u201d and \u201cWear\u201d options; \u201cWear\u201d is selected. The scene changes to a sandy area with an avatar wearing a red shirt with a golden hairstyle. A side panel appears with a list of clothing and accessory items, which are swapped out one by one, including a green cylinder, then a green leaf, then a dark creature, followed by a pink hairstyle. A window appears with options to select a body part to customize: All, Head, Torso, Left Arm, Right Arm, Left Leg, and Right Leg are shown. Next, there are options to customize the face of the avatar. The color of the avatar is changed from white to gray and then to black. Then the color of the clothing is customized, followed by a window prompting a URL or item id for an item. Next the menu shows an options window to choose \u201cKid Boy, Kid Girl, Boy, Girl, Man, or Woman.\u201d Options to change the head size, height, and width are shown. A list of avatars is shown, and one with a black body with white stripes is selected. Then multiple accessories are cycled through, such as glasses, ghost-like accessories, head decorations, and a pineapple on top of the head. The avatar is then customized with red horns on its head, followed by black headphones. Finally, the avatar stands with a white ghost-like face with red markings on the eyes and is shown in a variety of locations. 
The action menu is then shown with options like \u201cDab\u201d, \u201cRaise Hand,\u201d and \u201cWave.\u201d A title screen appears with text that reads: \"SUBSCRIBE!\" and below, \"(OR I'LL TELL MY MOM!)\"There is an image of a bear on a human figure's shoulders.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "After deciding on the puppet's gender, how many patterns did the author click to create the other body parts?",
+ "answer": "A",
+ "candidates": [
+ "A. Five.",
+ "B. Three.",
+ "C. Six.",
+ "D. Four."
+ ]
+ }
+ },
+ "NwckHeoi": {
+ "video_id": "NwckHeoi",
+ "video_duration": "66s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "A person shreds cooked chicken breast into a metal bowl. They then add chopped pickles. Next, they mix the chicken and pickles with a fork. Two hard-boiled eggs are cracked and added to another metal bowl. Ground beef is added to a square black frying pan on the stove. The eggs are poured over the ground beef in the pan. The eggs are scrambled with a spatula, and then mixed with the beef. Finally, sauce is added to the beef and eggs.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of cooking steps shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Prepare the avocado, spray oil, fry venison, add eggs, mix avocado and eggs, fry eggs.",
+ "B. Add eggs, spray oil, fry venison, prepare the avocado, mix avocado and eggs, fry eggs.",
+ "C. Prepare the avocado, add eggs, spray oil, mix avocado and eggs, fry venison, fry eggs.",
+ "D. Add eggs, prepare the avocado, spray oil, fry venison, mix avocado and eggs, fry eggs."
+ ]
+ }
+ },
+ "LWkHXQqR": {
+ "video_id": "LWkHXQqR",
+ "video_duration": "55s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a person with short, curly hair and glasses, wearing a black top and black scarf, speaking at a podium with a microphone. In the upper right corner of the frame, there is a succession of images displayed over the course of the video. The first image shows an orange female symbol with the words \"MIND THE GAP\" in a blue bar across the middle of the symbol. The second image shows a photo of two people holding signs that read, \"END VIOLENCE AGAINST WOMEN\". The third image shows a photo of people holding signs that read \"SILENT + FORGOTTEN NO MORE\" and \"JUSTICE FOR MISSING & MURDERED ABORIGINAL WOMEN\". The fourth image displays a webpage from the Huffington Post with the headline, \"Violence Against Women Canada: 67% Of Canadians Know A Woman Who Has Experienced Abuse\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the speaker lower his head in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Twenty times.",
+ "B. Twenty-one times.",
+ "C. Twenty-two times.",
+ "D. Nineteen times."
+ ]
+ }
+ },
+ "KpOVEykG": {
+ "video_id": "KpOVEykG",
+ "video_duration": "29s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man in a red shirt sits in front of a screen displaying a FIFA video game. The screen shows a \u201cPlayer Pick\u201d option with the text \"1 of 5 87+ BASE OR WC HERO.\" The first player card partially reveals an Australian flag and a Premier League logo. The card fully reveals Harry Kewell, 87 LW, with 90 PAC, 85 SHO, and 88 PAS. The man puts his hand to his face. The next card begins to reveal itself. The US flag appears. The card fully reveals Tim Howard. The man keeps his hand to his face. The card that appears next has a Portuguese flag. It reveals Ricardo Carvalho, 89 CB, with 80 PAC, 74 DRI, 57 SHO, 92 DEF, 73 PHY, 83 PAS. The next card is green and has snowflakes on it. It reveals Rudi V\u00f6ller with 86 DRI, 47 DEF, 85 PHY. The man keeps his hand on his head. A final player pick reveals the Brazilian flag. The card reveals Lucio, 89 CB, with 72 PAC, 77 DRI, 57 SHO, 91 DEF and 89 PAS. The screen then displays Carvalho's card with \"QUICK SELL NOW\" in the foreground.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many different countries were represented by player cards at the beginning of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Five.",
+ "B. Six.",
+ "C. Three.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What number is shown in the upper-left corner of the first card played by the player on the left?",
+ "answer": "C",
+ "candidates": [
+ "A. 85.",
+ "B. 88.",
+ "C. 87.",
+ "D. 86."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order do the player cards for these five countries appear in the video? (a) Brazil. (b) Portugal. (c) Germany. (d) USA. (e) Australia.",
+ "answer": "D",
+ "candidates": [
+ "A. Brazil, USA, Germany, Portugal, Australia.",
+ "B. Brazil, USA, Germany, Australia, Portugal.",
+ "C. Australia, USA, Germany, Brazil, Portugal.",
+ "D. Australia, USA, Germany, Portugal, Brazil."
+ ]
+ }
+ },
+ "ONhTzzPU": {
+ "video_id": "ONhTzzPU",
+ "video_duration": "90s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows gameplay footage of a video game. In the top right corner, there is a small, rectangular inset video of a person with dark hair and a beard, who appears to be talking. The main part of the screen shows a map of the game world, and then switches to showing the game from an in-game character's perspective, and then back to a map view. At one point, the character appears and then begins to glow. Later, the in-game character is standing on a circle, where the character then glows again, followed by a transition of bright lights that moves through the ceiling. After another black screen, the game shows a loading screen that has the text \"Master Mode\" with a text tip below. This is followed by a dark screen, then another shot of the game with a bridge, where the game character again glows, before returning to the map view. The map view then zooms in and out, and the user pins several locations on the map. This pattern is repeated, where the gameplay footage of the game is interrupted by loading screens, and the player selects areas on the map. The video contains frequent cuts, moving between different locations within the game. The user appears to be showing off locations within the game.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of steps to enter bug mode?",
+ "answer": "C",
+ "candidates": [
+ "A. Leave Divine Beast ruins, then go to unfinished Divine Beast ruins, then enter Trial Sword.",
+ "B. Leave Divine Beast ruins, then enter Trial Sword, then go to unfinished Divine Beast ruins.",
+ "C. Enter Trial Sword, then go to unfinished Divine Beast ruins, then leave Divine Beast ruins.",
+ "D. Go to unfinished Divine Beast ruins, then enter Trial Sword, then leave Divine Beast ruins."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will the man in the video do after finding the gem?",
+ "answer": "C",
+ "candidates": [
+ "A. Show how to reach Hyrule Castle.",
+ "B. Show how to find Korok seeds.",
+ "C. Show how to exit bug mode.",
+ "D. Show how to obtain the Hylian Shield."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the man's position relative to the treasure chest in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. To the left of the treasure chest.",
+ "B. To the right of the treasure chest.",
+ "C. In front of the treasure chest.",
+ "D. Behind the treasure chest."
+ ]
+ }
+ },
+ "JOtPXFUQ": {
+ "video_id": "JOtPXFUQ",
+ "video_duration": "54s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a black screen displaying the title \"NOTHING TO HIDE\" in white font, with the subtitle \"AN ANTI-STEALTH GAME WHERE YOU ARE YOUR OWN WATCHDOG.\"The scene transitions to a 2D side-scrolling platformer game. A red-haired female character stands on a platform. Text instructions on the upper right corner of the screen explain the controls for picking up and dropping objects: space key or clicking a circular button. Yellow triangular objects with eyes are positioned on the platform's edge. The character picks up one of these objects and throws it down into the darkness below. She proceeds to jump down to the next platform. The character moves left and activates a wall with a large eye. The eye's gaze follows the character as she jumps between platforms. She repeats the process of throwing the yellow triangle and jumping to lower levels.The environment changes, and new text appears: \"RE: DEVICE INACTIVE ON CARPET. IT'S NOT A BUG, IT'S A FEATURE. (ISSUE MARKED AS CLOSED).\" More triangular objects are visible. The character continues to traverse the level by jumping and throwing the triangles.She enters a new area with a sign reading \"SIDEWALK: REVOLUTIONARY NEW WAY TO AVOID USING YOUR LEGS.\" A moving sidewalk transports her. She picks up papers with symbols on them. In another area, the words \"THE WALL THE WALL\" repeat across the screen. She picks up another triangle. Text displays \"ERROR: 'TARGETED AUDIENCE' CANNOT BE FOUND > this wall is meant for NOBODY\".She continues jumping between platforms and picks up a light blue, teardrop-shaped object, which seems to give her a burst of upward movement. The backdrop changes again, featuring a dark red surface and the text \"WATCH WHAT YOU HEAR\" and later \"WATCH WHAT YOU SAY\" alongside a faceless figure getting progressively redacted.She proceeds through the level, picking up triangles as before. 
Text appears in a speech bubble: \"Guess I'm running away to Canada, then.\" The words \"THEY SEE EVERYTHING\" partially appear before the character moves on. She enters another area with the message: \"SOME EYES WILL ACTIVATE SOME SIDEWALKS ONLY WHEN THEY SEE SOMEBODY.\" After passing this area, she enters a grey corridor with potted plants. She continues to move along the platforms.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What did the red-haired girl do after reaching an impassable intersection?",
+ "answer": "D",
+ "candidates": [
+ "A. She took a different route.",
+ "B. She remained standing still.",
+ "C. She forced her way through.",
+ "D. She returned to use tools to help her get through."
+ ]
+ }
+ },
+ "hFBBiEBQ": {
+ "video_id": "hFBBiEBQ",
+ "video_duration": "141s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video starts with a white cloud outline with text inside stating, \"How to narrow a topic?\" on a green background. The scene changes to a green background with a pen in the bottom right corner. A line drawing of a person with short hair and overalls appears. A hand appears and traces a rectangle shape to the right of the line drawing. The hand leaves, and then the hand reappears and writes \"Environmental Science\" inside the rectangle. The hand writes below the rectangle \"Write a research paper on a topic dealing with a current environmental issue. Your paper should not exceed five pages, and should be based on a topic that strongly interests you\". The hand then writes \"No Problem!\" The hand draws a light bulb, and then writes \"The Effects of Deforestation on Gorillas in Africa\". The hand writes \"No Problem!\" with a red 'X' through \"Problem\". The hand then draws a question mark and writes below it \"Fit into a five-page paper?\". The hand draws an arrow shape and writes the word \"solution\" inside and then writes \"add more context\". The scene changes to a green background and the hand writes \"Geographical Context\". The hand draws a line underneath \"Geographical Context\" and then reveals a map of Africa. The hand writes \"Specific Geographical Region or Country?\" and then \"Historical Context\", and draws a line underneath the word \"context\". The hand then draws a timeline. Below the timeline the hand writes \"Specific Year(s), Decade(s), Era?\". The hand then writes \"Biological Context\". Below the text \"Biological Context\" the hand writes \"Specific Sub-species of the Gorilla?\". The hand then draws a shape of a V. The hand writes \"Current Effects of Deforestation on the Mountain Gorilla in South-West Uganda\". The words \"Biological context\", \"Historical Context\", and \"Current effects of Deforestation on the Mountain Gorilla in South-West Uganda\", are underlined with an arrow pointing from each of them. 
The words \"Geographical Context\", \"Current effects of Deforestation on the Mountain Gorilla in South-West Uganda\", and \"Biological Context\", are underlined.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Where does the biological background appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. In the middle of the video.",
+ "C. Throughout the entire video.",
+ "D. At the beginning of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order are the three backgrounds for adding more context arranged? (a) historical background. (b) biological background. (c) geographical background.",
+ "answer": "D",
+ "candidates": [
+ "A. (a) (c) (b).",
+ "B. (c) (a) (b).",
+ "C. (a) (b) (c).",
+ "D. (c) (b) (a)."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where does the author typically place historical context relative to the geographical background in narration?",
+ "answer": "C",
+ "candidates": [
+ "A. Top left corner.",
+ "B. Top right corner.",
+ "C. Bottom right corner.",
+ "D. Bottom left corner."
+ ]
+ }
+ },
+ "EfInlDKp": {
+ "video_id": "EfInlDKp",
+ "video_duration": "76s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a series of different people speaking to the camera, each against a background of red curtains. The first person is a man wearing a cap, glasses, and a red t-shirt. Then a woman with dark curly hair wearing a dark top and earrings appears. Next is a bald man with a black jacket, a dark shirt and earrings. A woman with long brown hair and a scarf follows. A woman with long brown hair, a white shirt and a small red object on her chest appears after that. Next is a man with dark hair, wearing a white shirt, and a dark vest. The man with the cap and t-shirt appears a second time, then there is a woman with dark curly hair wearing a black shirt. She is followed by the man with dark hair wearing a dark vest, then the woman with dark curly hair from before. Then a fair haired man with a green hooded jacket and a red shirt. The woman with long brown hair and the scarf appears next. Then another dark curly haired woman. Following that, the fair haired man in the green hooded jacket is back on screen. Then another dark haired man. The fair haired man appears again. The man in the dark jacket is back on the screen. Then the man with dark hair wearing the dark vest appears. Then the man with no hair appears a second time, then the man with dark hair appears a second time. He holds up his index finger. The man with no hair appears, and he holds up his hands. The dark haired woman holds a burning piece of paper, reading \u201cDEBT\u201d, in her hand. The fair haired man has his fists in the air, cheering.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many women's voices are in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. One.",
+ "B. Three.",
+ "C. Four.",
+ "D. Two."
+ ]
+ }
+ },
+ "FNZUsiBl": {
+ "video_id": "FNZUsiBl",
+ "video_duration": "184s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A hand holds a clear glass bowl of brown lentils. On a wooden cutting board next to the bowl are chopped vegetables: a whole head of garlic, a piece of ginger, some small chili peppers, a larger green pepper, an onion, chopped pumpkin, and chopped scallions. Behind the bowl, there are some ripe yellow plantains sitting on a small woven mat, a jar of an orange substance, and a part of a blender or food processor is visible.The lentils are poured from the bowl into a pot of boiling water on the stove. Several peeled garlic cloves are added to the pot, and the lentils are stirred.Another pot is placed on the stovetop next to the one with lentils. Oil is poured into it, and chopped ginger is added and allowed to sizzle. Next, diced onion, scallions and diced pumpkin are added. The contents of the pot are stirred with a spatula.After a while, the cooked lentils are poured into the pot with the pumpkin and other vegetables. Next, some more chopped ginger, a whole small green pepper, and a spice mix are added. A light brown liquid is poured from a glass jar, followed by dark brown liquid, possibly soy sauce. Coconut milk is poured from a small white container. The mixture is stirred with a spatula and brought to a boil. A lid is placed over the pot.A pot with a brown stew is shown on the stovetop, being stirred with a large metal spoon. A clear glass bowl is brought to the pot, steam coming out of it. The contents of the bowl, diced vegetables and cooked meat in a brown sauce, are added to the pot with the stew.A woman with braided hair pulled back in a ponytail tastes the stew from a spoon.A close-up shot of a white plate shows cooked white rice, a cooked lentil dish in a brown sauce, pieces of brown stewed meat, and some shredded lettuce, sliced cucumber and tomatoes.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Where in the video does the action of \"pouring oil into the pot\" occur?",
+ "answer": "D",
+ "candidates": [
+ "A. The action occurs in the middle of the video.",
+ "B. The action occurs throughout the video.",
+ "C. The action occurs in the second half of the video.",
+ "D. The action occurs in the first half of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "Before adding lentils to the pot, what state might the water in the pot be in?",
+ "answer": "C",
+ "candidates": [
+ "A. Superheated water.",
+ "B. Boiling.",
+ "C. Liquid.",
+ "D. Near boiling."
+ ]
+ }
+ },
+ "hvlBGRmK": {
+ "video_id": "hvlBGRmK",
+ "video_duration": "293s",
+ "duration": "4-6min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a text overlay, \"The Original Hinsons, The Glory Days, 1980\". An ensemble of eight people is shown on a set. A man with a guitar on the left side of the screen and a woman wearing black are holding microphones. In front of them is a light-colored piano. Two men are standing in front of the piano also holding microphones. Behind the piano, a drum set, a keyboard and another guitar are visible. The camera then focuses on the guitar being played on the left, then changes to a close-up on the hands of another guitarist. The camera then pulls back to show a male guitarist on the left, and a woman with a microphone to the right of him. A title card appears that says \"Hallelujah Meeting, Written By: Ronny Hinson, Kenny, Ronny, Larry & Chris Hawkins - 1980\". The full group is again shown on set. The four in front are holding microphones and are standing, while the four people behind them are sitting or standing while playing instruments. The four in front are each shown in individual close-up while singing. The camera moves to different band members before returning to the front group. One man in a dark suit raises his arm while holding a microphone, and a man in a light suit kicks his leg out. The camera focuses on a blonde woman with a microphone, before cutting to the group again. The male singer in a dark suit raises his arms while speaking into a microphone. Various band members are focused on while they are singing or playing. The male singer in the dark suit waves his arms and the male singer in the light suit bows while moving forward and back. The other two are visible while singing, and the blonde woman waves one hand. The camera zooms in on the male singer in a dark suit and then to the male singer in a light suit, then to other band members while they sing, and finally to the full ensemble. The camera focuses on the guitar on the right of the screen being played. 
The male singer in a dark suit raises one arm before the video closes.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the man wearing a gray suit holding?",
+ "answer": "D",
+ "candidates": [
+ "A. A violin.",
+ "B. A saxophone.",
+ "C. A microphone.",
+ "D. A guitar."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the most likely relationship between the blonde woman and the man in the black suit shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. They are married.",
+ "B. They are partners.",
+ "C. They are friends.",
+ "D. They are in a romantic relationship."
+ ]
+ }
+ },
+ "RHKYUqjE": {
+ "video_id": "RHKYUqjE",
+ "video_duration": "66s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows five different people on a video call, divided into five separate panels. Each person is in their own room. The person in the top left panel has dark hair and a black shirt with a white graphic on it. They are wearing headphones and sitting in a red and black chair. The person in the top middle panel has dark curly hair, is wearing a black hooded sweatshirt, and is leaning forward with their hand to their chin. The person in the top right panel is wearing a white tank top with a black bandana. They are in front of a wall with various skateboards hanging on it. The person in the bottom left panel is wearing a black hat and headphones. They are sitting in a room with a white door in the back. The person in the bottom right panel is wearing a white t-shirt and a red baseball cap. They are in a black chair and are talking. A chat log is on the right side of the screen. The number 18,645 is at the top of the screen. As the video progresses, the person in the middle panel puts a purple hat on their head. The person on the bottom left is sometimes leaning on their hand or with both hands on their mouth. The person in the top right appears to move around in their space. There are many chat messages that appear on the right side of the screen throughout the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many instances of loud laughter occur in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three instances.",
+ "B. Four instances.",
+ "C. Two instances.",
+ "D. One instance."
+ ]
+ }
+ },
+ "gAfTRfVs": {
+ "video_id": "gAfTRfVs",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video starts with a white screen with vertical film lines. Then a person with dark hair appears briefly in a dark room before the image changes to a close up of a man with dark hair and a black shirt. The background shows some blurred items on a wall and has a slight blur. The image stays on this man for several frames before changing to a grayscale view of a street with multiple people walking and various shops. The view changes to another blurred view of a dark room with light sources, then to a shaky image of a dark space with light sources in the distance and a blurry figure. This then shifts to a shaky image of a white plane with dark figures on its sides in a white background before switching to a view of a white object or structure that is angled with a slightly textured looking floor. The next few frames show a dark room with an object moving and the last image in the room is a chair in an empty space. These are followed by various blurry and shaky abstract images, the image then focuses on a portrait of a person in a round picture frame, before the person in the frame seems to disappear. The portrait shifts again before a shaky view of a room with two people in it appears. The next view shows one person sitting on a white chair with the other on the floor, then a view appears of another person walking barefoot toward a window. Then a view of a man in a suit looking toward the camera. This image shifts to a blurred image of a face. Then a figure is seen sitting in a dark room. These images are followed by shaky views of an outdoor wooded area with dark figures before shifting to a scene of people in a building. The next shot shows an empty street and two people in the distance, followed by blurred images. After that the image is black before showing a close up of a face with tears. This is followed by several shaky images of rooms and walls. The last image shows a dark room with an object that appears to be moving.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the relative position of the man sitting on the sofa to the corpse lying on the ground?",
+ "answer": "C",
+ "candidates": [
+ "A. The man is in front of the corpse.",
+ "B. The man is to the left of the corpse.",
+ "C. The man is behind the corpse.",
+ "D. The man is to the right of the corpse."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event occurred after a chair appeared in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A clown jumps down from a streetlamp.",
+ "B. A giant spider is running around the room.",
+ "C. A huge monster is destroying the city.",
+ "D. The person in the photo starts to move."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a monster in the video that is taller than the building?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man in white appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. One.",
+ "C. Four.",
+ "D. Two."
+ ]
+ }
+ },
+ "zzvOZDpT": {
+ "video_id": "zzvOZDpT",
+ "video_duration": "122s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a shot of a crowded street scene, with a large tan archway visible in the background. Vehicles and people move in the street. The next shot is of a town with two hills in the distance. Then, a green road sign with white lettering is shown. The sign contains two lines of text in two different scripts. Next, a woman wearing a green and blue saree walks with several people around her. One woman touches the first woman's face. The video then cuts to a shot of a large group of people. Next, the video shows a clock tower surrounded by vehicles and people. The camera switches to a large crowd of people waving their arms. Then, the same woman in the saree is shown walking with men on either side of her. A static image of a woman in a green, blue, and pink saree is shown with her hands in a prayer position, followed by other similar images with text. Then a photo of a man wearing a white shirt is shown. This is followed by another similar series of images with text. A woman in a green and blue saree walks through a crowd of people. Several people offer gifts and money to the woman. The video then switches to the same woman now wearing a pink and orange saree. A man is shown with a religious marking on his forehead touching her. Then the woman in the green and blue saree walks past several people. The video shows a flag that is green, white, and blue. The flag has images of several people on it. Next, several men and a woman are shown sitting on a platform. One of the men speaks into a microphone. The woman in the green and pink saree also speaks into a microphone. Then the video shows the woman speaking with a group of other women. Lastly, static images with text appear.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What type of background music is featured at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Classical music.",
+ "B. Pop music.",
+ "C. Ethnic music.",
+ "D. Modern symphonic music."
+ ]
+ }
+ },
+ "ftlJGLUz": {
+ "video_id": "ftlJGLUz",
+ "video_duration": "96s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man sits in the driver's seat of a car, wearing a dark blue t-shirt and dark shorts. He has short brown hair and is wearing a seatbelt. His right hand rests on the gearshift, and his left hand gestures as he speaks. He turns the key in the ignition, and the camera shifts to show the view out of the front windshield as he drives. The sun is bright, creating lens flare. Buildings, trees, and hills are visible as he drives. He changes gears, steers, and continues driving. A red pickup truck passes the car from the opposite direction.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What English letter appears in the white area on the left side of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The letter N.",
+ "B. The letter A.",
+ "C. The letter V.",
+ "D. The letter M."
+ ]
+ }
+ },
+ "TofYlzwA": {
+ "video_id": "TofYlzwA",
+ "video_duration": "392s",
+ "duration": "6-8min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a split-screen view of two men in a video call. The man on the left wears a dark t-shirt with the phrase \"I CAN'T BREATHE\" printed on it and a teal Miami Dolphins baseball cap. He has headphones around his neck and sits against a background featuring a shelf with various sports memorabilia, including trophies, balls, helmets, and framed photos. The man on the right wears a black t-shirt with the words \"GET IN TROUBLE,\" \"GOOD TROUBLE,\" and \"NECESSARY TROUBLE\" stacked vertically in rainbow colors. He sits against a purple backdrop. Both men rest their chins on their hands.A short clip plays in split-screen showing a dark-skinned woman with curly hair and large gold hoop earrings looking down and a light-skinned bald man with a beard talking on a cellphone against a gray-blue background. Text on the screen reads \"To: Unknown\" and \"Think the competition's out of product. I'll keep an eye on him.\"The video then cuts back to the two men in the call. The man on the right lifts his head from his hand, looks directly at the camera, and speaks. The man on the left places his chin back in his hand. The man on the right continues to speak, making expressive gestures with his hands. The man on the left lifts his head from his hand again, looks at the camera and nods.The man on the right picks up a black cylindrical object with a yellow detail and drinks from it. He puts the object down and continues to speak, making more hand gestures. He picks up the cylindrical object again, drinks from it, puts it down, and continues to speak and gesture.The man on the right then brings a red, white, and blue, striped, insulated bottle into view. He begins to talk about the bottle, gesturing towards it. He picks it up, removes the lid, and drinks from it. He then replaces the lid, puts it down, and continues to speak, often gesturing with both hands. The man on the left leans back slightly and looks downwards. 
The man on the right continues to talk while tapping his fingers rhythmically on the desk, making varied hand gestures. The man on the left covers his face with one hand. The man on the right continues to speak and gesture, sometimes using one finger to point towards his face.Text then appears on the screen overlaying the purple backdrop. The overlay displays Twitter, Patreon and Facebook handles. The man on the right picks up the striped bottle again and displays it to the camera. He removes the lid and drinks from it, laughing, then puts the bottle down and claps his hands together, still laughing. He gestures towards the overlaid social media handles on the screen, then waves and claps again, smiling.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times was water consumed in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. Four.",
+ "C. Three.",
+ "D. One."
+ ]
+ }
+ },
+ "NxExKaog": {
+ "video_id": "NxExKaog",
+ "video_duration": "565s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A woman with long, dark, wavy hair and large hoop earrings sits against a backdrop of city lights. She wears a white t-shirt and holds a styrofoam head. The head has a face drawn on it. An image of two similar heads appears in the upper right corner.\u00a0The woman gestures and speaks, looking at the camera. The image changes to a spool of black thread and curved needles. Next, the woman holds a black spandex dome cap. An image of the cap\u2019s packaging is shown. The woman then displays a mesh dome-style wig cap, its packaging appearing beside it. She holds a black weaving cap, and the corresponding packaging appears. The woman stretches the cap and continues speaking. Another image shows a mannequin head wearing three different wig cap styles. The woman holds the black weaving cap against the styrofoam head. Finally, an image appears showing two black wig combs.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Which type of wig cap does the woman present at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A knitted wig cap.",
+ "B. A silk wig cap.",
+ "C. A lace wig cap.",
+ "D. A dome wig cap."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is it not advisable to use very cheap pins?",
+ "answer": "C",
+ "candidates": [
+ "A. They tend to break easily.",
+ "B. They can easily prick you by mistake.",
+ "C. They are prone to falling apart.",
+ "D. They are difficult to sew with."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What does the woman in the video use to sew the wig cap?",
+ "answer": "D",
+ "candidates": [
+ "A. A round needle.",
+ "B. A thick needle.",
+ "C. A brooch.",
+ "D. A safety pin."
+ ]
+ }
+ },
+ "WvrPpvLg": {
+ "video_id": "WvrPpvLg",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A dark colored tracked vehicle drives towards the camera on a dirt path, kicking up dust. The camera zooms in to show the tracks. It then pans to a side view of the vehicle, showing several wheels on its side track. The camera pans up to show a man sitting on top of the vehicle. He is wearing a dark colored shirt and blue pants. He is touching the top of the vehicle. A name and subtitle appear on the screen: \u201cNICHOLAS MORAN AKA THE CHIEFTAIN\u201d. The man moves a metal rod and touches several other parts of the vehicle. Then he points to a compartment of the vehicle. He places his hands on the rim of the compartment and appears to climb into it.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order in which the man in black introduces the external structure of the Strv 103 tank?",
+ "answer": "B",
+ "candidates": [
+ "A. Commander compartment, smoke grenade launcher tube, independent rotating turret, label.",
+ "B. Commander compartment, independent rotating turret, smoke grenade launcher tube, label.",
+ "C. Smoke grenade launcher tube, commander compartment, independent rotating turret, label.",
+ "D. Smoke grenade launcher tube, commander compartment, label, independent rotating turret."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Why does the sound of the tank in the video gradually increase in volume?",
+ "answer": "A",
+ "candidates": [
+ "A. The tank is moving from far away to nearby.",
+ "B. The tank has just started moving.",
+ "C. The tank is moving from nearby to farther away.",
+ "D. The tank has run out of fuel."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Did the tank driver in the video wear a hat?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No.",
+ "C. Yes."
+ ]
+ }
+ },
+ "oAxHLMtJ": {
+ "video_id": "oAxHLMtJ",
+ "video_duration": "43s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view inside a building, there is a person with orange clothing seen in the right of the frame, a large dark colored figure can be seen outside. The view changes slightly and the large figure outside of the building walks closer, this view is then obscured by fog. Then the view changes, there is a colorful object in the bottom right corner of the screen, with a large structure and trees in the background. Then large, bipedal creatures can be seen outside of the structure, moving around. The camera shifts to show only the legs and lower torso of one of the creatures. Then the view shifts again and shows the creature stepping through the doorway. The view changes to show more creatures. Then a red rectangle is around part of the screen at the top of the view, and there is also a rectangle around the arm of the orange clothed person. The view changes to see large creatures with trees in the background, then again to see more of them. Then a large creature is seen walking by a doorway. Then water can be seen spraying through the building. The view switches to see a red stop sign on the ground in the room. Fire appears outside of the doorway, the view moves slightly. Then there is a view of a metal grated walkway and the top of a structure. Then a colored orb object is seen with a moon, this moon is also in a red rectangle. The view shifts to look up at some of the structure with trees visible on the horizon. The view changes again to see a metal grated walkway with a moon and text on the upper left hand side, then a view of a structure again, and the view darkens significantly, with a moon visible. Then the view has a moon and some trees visible, with a creature in the distance. The view changes to show another creature walking by a structure. Then the view shows more creatures in the background and a body of water. A large structure is then seen on fire, then the fire is seen spreading. 
Then fire can be seen outside of a building with trees in the background. Then the fire continues to spread. Then the view darkens and fades out.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the protagonist's emotion at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Discouraged.",
+ "B. Anxious.",
+ "C. Afraid.",
+ "D. Excited."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the monster repeatedly return in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Because the monster discovered a way to enter.",
+ "B. Because the monster is startled by the egg thrown by the protagonist.",
+ "C. Because the monster is attracted by the teammates.",
+ "D. I don't know."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many different voices are present in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. Five.",
+ "D. Three."
+ ]
+ }
+ },
+ "SqwzeFVx": {
+ "video_id": "SqwzeFVx",
+ "video_duration": "93s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "There are two white bowls, one containing a white powdery substance and the other containing an orange liquid. A hand takes a dark, cylindrical object and rolls it in the white powder. The hand then picks up the now-coated object and dips it in the orange liquid, using a spoon to assist. The object is then placed into a pot of hot, bubbling oil, using a pair of metal tongs. The oil continues to bubble, and a hand, wearing a clear glove, uses the tongs to turn over several of the cooking objects. The cooked objects are then placed on a plate lined with a white paper towel. The plate is shown with several of the cooked objects on it. The plate with the cooked objects is shown with two small white dishes containing a red sauce and a white sauce respectively. A knife and fork cut one of the cooked objects to show the inside. A fork picks up a section of the cooked, cut object.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why are the skewers fried three times in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. To make the skewers light and fluffy.",
+ "B. To ensure the meat filling is fully cooked.",
+ "C. To ensure the egg coating is fully cooked.",
+ "D. To use up the oil in the pan."
+ ]
+ }
+ },
+ "jgyVAQLz": {
+ "video_id": "jgyVAQLz",
+ "video_duration": "34s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Aqualad, Robin, and Speedy run through a dark cave, seemingly pursued by figures in the background. The scene changes to a woman in a dark purple outfit with long black hair, surrounded by more shadowy figures with glowing green staffs. She looks sternly ahead. The scene shifts again, now showing the woman facing Aqualad, Robin, and Speedy in the cave.\u00a0Next, a close-up shows Robin\u2019s face; he looks to the side with a slight smirk. The video then transitions to comic book panels. The first shows Robin speaking to a silhouetted Batman. The following panels depict a conversation between them, followed by panels illustrating Robin\u2019s emancipation from Batman.\u00a0 He removes his mask, revealing a bruise on his cheek, and walks away.The scene changes to another comic panel, where a young man in a white uniform and hat carries a stack of plates in a kitchen setting. A smaller panel in the top left corner shows a street scene with a line of people outside a building. Another man in a similar uniform speaks to him in the larger panel.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order do the events occur in the video? (a) Colleagues recommended Grayson to work at Eddie's place. (b) Grayson found a job and housing. (c) Grayson decided to go independent. (d) Grayson had a dispute with Batman.",
+ "answer": "D",
+ "candidates": [
+ "A. (d), (a), (b), (c).",
+ "B. (d), (c), (a), (b).",
+ "C. (d), (b), (c), (a).",
+ "D. (d), (c), (b), (a)."
+ ]
+ }
+ },
+ "HuRReUbF": {
+ "video_id": "HuRReUbF",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Soccer",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a shot of a crowd of people. Most are wearing orange. Some are clapping or raising their hands. The screen displays \"TEXAS 7 MARYLAND 24\", as well as \"2nd 6:51\". The score for Texas increases to 13 as the shot of the crowd is still visible. Then the shot changes to a football player wearing a white uniform with an orange logo and helmet, running on a football field, then it switches to two football players on a football field, both wearing the same white uniform with orange logos, and helmets.The camera shows a football field with one team wearing red uniforms and another wearing white uniforms. A man in a black and white striped shirt is present. The score is now \"TEXAS 16 MARYLAND 24\" and the time is \"2nd 0:24\". The time elapses as a man in a white uniform catches a football, falls to the ground, and a man in a red uniform comes close, followed by a man in a white uniform walking toward the end of the field. A touchdown graphic appears.A man in a red uniform is then shown up close, from the back, with \"TOUCHDOWN\" on a graphic. Then two men, in white uniforms, high five each other, then the camera shows three men in white uniforms. The score is now \"TEXAS 22 MARYLAND 24\". The screen then shows a wide shot of the football field again, this time with a \"3RD & 9\" graphic and a time of \"4:28\". The teams scrimmage again, with a man in a white uniform running down the field to score a touchdown.A graphic that says \"TOUCHDOWN\" appears over the bottom of the shot. The time is \"3rd 4:22\". The graphic then moves and then shows a man in a white uniform running down the field, with an orange graphic and the word \"TOUCHDOWN\". The game is then shown again from an overhead view, showing players in white and red uniforms scrimmaging. The screen displays \"1ST & GOAL\" with a time of \"3rd 4:19\" which then switches to \"3rd 4:16\". The screen displays \"TOUCHDOWN\" again, and the video ends.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "How did the teammates of the number 9 player, who was wearing white, celebrate after he scored at the beginning of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Jump up and bump shoulders.",
+ "B. Tap heads together.",
+ "C. Chest bump and hug.",
+ "D. Raise both hands and clap."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the final score shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 7 to 24.",
+ "B. 28 to 24.",
+ "C. 22 to 24.",
+ "D. 16 to 24."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "At the end of the video, which player number passed the ball to the referee?",
+ "answer": "B",
+ "candidates": [
+ "A. Number 9.",
+ "B. Number 21.",
+ "C. Number 5.",
+ "D. Number 56."
+ ]
+ }
+ },
+ "pOHHFLcP": {
+ "video_id": "pOHHFLcP",
+ "video_duration": "173s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. Then, the scene changes to show seven people seated around a table in a room with light-blue walls. One person at the table speaks, followed by another. The scene changes to focus on two people at the table. One speaks to the other. The scene shows the group of seven around the table again, and the one who had spoken before speaks once more. Then the scene changes to show two people at the table, and one is speaking to the other. The scene changes to show the group of seven around the table. Next the scene changes to show three people in a row. The scene changes back to two people at the table, one speaking to the other. The scene then changes back to show the full table with all seven again. The scene shifts to show one man at the table, speaking, and then shifts back to two people, and one is speaking to the other. The scene changes to show one person at the table, then two people, one speaking to the other. Then, the scene changes to one man speaking, then the group of seven, then three people in a row, then one man speaking. Then the scene changes to show all seven people at the table, then one man speaking to the other, then the group of seven again. The scene shifts to one man speaking, and then shifts to all seven around the table. Then, the scene shows one man speaking to the camera, then shifts to the group around the table. Then, the scene focuses on one man at the table speaking, then the whole group of seven. Then the camera goes back to one man speaking, and shows six people, and three new ones who enter the room. After that the camera goes to a man speaking, then the whole group. The camera then goes to the one man speaking to the camera, then the whole group. The camera focuses on one man speaking to the camera, then the whole group again. The scene shifts to two men speaking, then the whole group, then the camera shows one man speaking to the camera. 
The scene changes to show the whole group, and then one man is speaking to the camera, then the scene changes to all of the group, and then back to one man speaking. The scene shifts to one man who had been at the table, and he walks into the center of the scene, then the scene goes to one man and three women and the man at the table who is talking and then back to the three women. Finally, the scene shows the one man being held back by two other men. The camera then shifts to two women speaking.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does Urch expose Wisdom's actions in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the entire video.",
+ "B. At the beginning of the video.",
+ "C. At the end of the video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the name of the chairman of SMART in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Urch.",
+ "B. Wisdom.",
+ "C. Denilson.",
+ "D. Starky."
+ ]
+ }
+ },
+ "FzpXdzZR": {
+ "video_id": "FzpXdzZR",
+ "video_duration": "142s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with reddish-brown hair and a light-colored, long-sleeve shirt is in the video. At the start, her hair is parted down the middle. She gathers the hair at the front of her hairline with her hands and pulls it forward and up. She then places a hairpiece, with a fringe and the same hair color as her own hair, along the front of her hairline. She moves her hands across the hairpiece and her head. She removes the hairpiece. She then picks up a hair comb and uses it to separate the hair along her hairline. She sprays a hair product on a section of her hair, pulls it straight, and then holds a hairdryer to that hair section with a hair brush as she blows air onto it. She repeats this process on different sections of the hair along her hairline. She then takes the comb to the same section of hair, pulls the section down, and uses scissors to cut the hair. The person uses her finger to hold the hair in place as she cuts. After she is done cutting this hair section, she uses a comb on it again.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many tools did the woman use to trim her hair in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Five.",
+ "B. Four.",
+ "C. Three.",
+ "D. Six."
+ ]
+ }
+ },
+ "iJVfpGDI": {
+ "video_id": "iJVfpGDI",
+ "video_duration": "611s",
+ "duration": ">8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A hand with long, neon yellow nails uses a black marker to draw a cartoon surfboard on white paper. First, the top outline of the surfboard is drawn, followed by a curved line at the bottom, representing the water's surface. More curved lines are added to depict waves. Two large, closed circles are drawn on the surfboard for eyes, followed by smaller circles within them to suggest light reflection. Blush marks are added to the surfboard's cheeks, and a smiling mouth is drawn below the eyes. Vertical lines are added to the face of the surfboard to give it texture. A simple sun with triangular rays is drawn in the top right corner. The surfboard is colored in with pink marker, the sun with yellow and orange, and the wave with blue. White is used to highlight the waves and the shark fin in the water. Light blue highlights the bottom of the wave.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order in which the following four elements are drawn in the video? (a) The sun (b) A surfboard (c) Waves (d) A shark",
+ "answer": "D",
+ "candidates": [
+ "A. (a) (b) (c) (d).",
+ "B. (a) (c) (d) (b).",
+ "C. (b) (a) (c) (d).",
+ "D. (b) (c) (d) (a)."
+ ]
+ }
+ },
+ "KPtjttvw": {
+ "video_id": "KPtjttvw",
+ "video_duration": "202s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Camping",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person is initially shown sitting in front of a desk, then a black and red image of a mountain with birds flying overhead and a tent is shown. Then the image of a person on a motorcycle is superimposed onto that first image. The image without the motorcycle is shown again, and then a spool of red rope is shown. Next, 10 black tent stakes are displayed on a desk. The stakes are 30cm long. A person's hands are visible as they manipulate one of the stakes, then they pick up a red spool of rope and tie one end of it to one of the stakes. Then, the person measures out a length of the red rope using a yellow measuring tape. The person continues to tie each of the stakes to the red rope and, finally, holds all of the stakes with red ropes attached to them. An image of the tent and mountain with flying birds in black and red is shown again, followed by an image of a person on a motorcycle superimposed onto the first image, followed by the first image again. A cat is then shown laying on a desk, followed by a black screen with white text saying \"No animals were harmed in this video -Koonie the cat\", and then a black screen with white text saying \"Thanks For Watching Traveling Tiv Adventures\" is shown.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order for tying camping spikes shown in the video? (a) Display the newly purchased spikes. (b) Describe the material and purchasing options for the nylon rope. (c) Measure the nylon rope with a ruler. (d) Knot the nylon rope to the spike. (e) Explain the purpose of tying the nylon rope to the spike.",
+ "answer": "D",
+ "candidates": [
+ "A. (a) (b) (e) (d) (c).",
+ "B. (a) (c) (e) (d) (b).",
+ "C. (a) (b) (c) (d) (e).",
+ "D. (a) (e) (c) (d) (b)."
+ ]
+ }
+ },
+ "WVCNxHrI": {
+ "video_id": "WVCNxHrI",
+ "video_duration": "92s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens with a blue discord logo with the word \"communi\" moving to the center. Then the word \"mate\" slides into view, completing the word \"Communicate\", with an image of a pen and the YouTube logo above the lettering. Another blue discord logo appears, then the phrase \"Thicc Blind\" is shown in blue text on a dark background. Next, a drawing of two buildings labeled #general-chat and #art-channel is shown on a dark background. The #art-channel building has a sign that says \"No Tracing!\", \"No NSFW!\", and \"No saying 'this is so bad' when it is actually amazing\". Inside the #art-channel building, an individual with blue skin, pink hair, a white hoodie, and two ears is sketching on an easel. Then, the image on the easel is shown, of a purple figure lifting up a green zombie. A person with brown hair and a black shirt appears and says \"zombie is lookin mighty thicc\". The brown haired character continues to be shown with various closeups of their face, and various comments about \"thiccness\" are shown. The blue individual says \"he's not thicc.\" The brown haired character continues to comment on \"thiccness\", stretching their arm over to point out the zombie's rear. The blue character says \"king I swear to god stop calling my drawings...\" The blue character is then shown with a fire background. Then a white duck like character and a new character that is dressed as a doctor appear along with the words \"Thicc Blind\". Then the blue character is shown with what appears to be large red strips coming out of their mouth. Then the word \"thicc\" is shown and the blue character has what appears to be red hair. Another character appears and gives a thumbs up, with some text saying \"thicc is good dont be ashamed.\" A purple character appears. Then the words \"meds is thicc blind\" are shown. A character with a skull mask appears. 
That character then uses an item, presumably a pencil, to erase part of the zombie's body that had been mentioned previously. The blue character has their head in their hands. A character comments \"wait how many times has this happened\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the apparent emotion of the white-clothed cat in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Happy.",
+ "B. Calm.",
+ "C. Angry.",
+ "D. Bored."
+ ]
+ }
+ },
+ "NZpvxjrC": {
+ "video_id": "NZpvxjrC",
+ "video_duration": "362s",
+ "duration": "6-8min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black racing suit and a black and white helmet sits in a race car. The inside of the car is mostly white and there is a black mesh screen directly in front of the person. A timer in the top left of the screen counts down. The view then changes to an overhead view of a race track. Several race cars of different colors are driving on the track. The view then changes to a close-up of a red and white car with a large black number 38 on it. The view then returns to the overhead view of multiple cars on the track. The view changes again to the front of the pack of cars. The view changes to one of three men standing on a platform looking over the racetrack. One man is holding a tablet. The view shifts back to the overhead view of the cars on the racetrack. One of the cars emits a plume of smoke, and the view then focuses on a single red and white car on the track. The view changes again to show a white car with black markings on it, with several other cars around it. Then the view is an overhead of a group of cars driving, followed by one of a few cars driving, and then a single car, from a low angle behind the car. The view changes again to show many race cars going around a curve on the track, from a high vantage point near the wall of the track. The view then switches to the dash cam view from inside a car as it goes around the track. The view changes back to the low shot of cars rounding the track. The view changes to the rear of the pack of race cars rounding the curve of the track. The view changes again to show a high angle shot of the track as several cars approach the finish line. The view shifts to a dash cam view inside a different race car. The view shifts once more to show a black and white race car as it drives around the track. The camera shows a race car that crashed. Then it shows a shot of a female reporter holding a microphone and talking to a man wearing a white racing uniform and a black hat. Then the video shows the Fox Sports logo.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "At what time remaining did the first race car start?",
+ "answer": "D",
+ "candidates": [
+ "A. When there are 1 minute and 45 seconds left.",
+ "B. When there are 1 minute and 40 seconds left.",
+ "C. When there are 1 minute and 30 seconds left.",
+ "D. When there are 1 minute and 35 seconds left."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the positional relationship between car number 38 and car number 17 in the first half of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Car 38 is consistently behind car 17.",
+ "B. Car 38 is consistently in front of car 17.",
+ "C. Car 38 starts in front of car 17 and later moves behind.",
+ "D. Car 38 starts behind car 17 and later moves in front."
+ ]
+ }
+ },
+ "RLZqjaDn": {
+ "video_id": "RLZqjaDn",
+ "video_duration": "67s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Astronomy",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a dark blue image of a star system with the planet Pluto at the center. Labeled above or below the planets are: Nix, Styx, Hydra, Kerberos, and Charon. A series of concentric white circles form a line around Pluto and the labeled planets. Then, a diagram of the planet Pluto in red and gray tones appears on a rectangular graph with latitude and longitude labels. Next a view of the planet Pluto in black and white appears, with the planet positioned on the right side of the frame, against a completely black background. Then, a series of three bulleted points written in a white font appear next to a gray planet on the right side of the frame, still against a black background. Next, a gold spaceship in a cartoon style is depicted against a dark background. The spaceship is close to a round planet with brown and white streaks. The last image that appears is another diagram of the same system of planets orbiting Pluto. A yellow line with timestamps at 10:00, 11:00, 12:00, 13:00, 14:00, 15:00, and 16:00 indicates the New Horizons trajectory. Finally, a close up of the planet Pluto in various brown and white tones, shown as a sphere. This image is against a completely black background.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What time was New Horizons' closest approach to Charon in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 13:00.",
+ "B. 11:50.",
+ "C. 12:00.",
+ "D. 12:10."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What topic might the video creator discuss next week?",
+ "answer": "B",
+ "candidates": [
+ "A. Information about dwarf planets.",
+ "B. Spacecraft and other celestial bodies similar to Pluto.",
+ "C. The New Horizons spacecraft's flyby of Pluto.",
+ "D. The history of the Hubble Telescope."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there an artificial satellite in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. I don't know.",
+ "B. No.",
+ "C. Yes."
+ ]
+ }
+ },
+ "sjrJkIgJ": {
+ "video_id": "sjrJkIgJ",
+ "video_duration": "455s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a person performing various exercises on a black mat. The person is wearing light blue athletic clothing. They begin lying on their back, raising their head while keeping their feet in the air for one minute. Then, they lie flat on their back and alternate moving their legs in a bent position for one minute. Next, they keep their knees bent and arms to the side, and move their legs upward for one minute. Then, they perform a bicycle-like motion with their legs while doing crunches for one minute. Following this, they sit up, and rotate the torso side to side while keeping their knees bent. For the final exercise, the person lies flat on their stomach, lifts their head and legs, and moves them up and down for 30 seconds. A timer counts down as the person performs each exercise, and a \"rest\" label appears between the exercises. In the corner of the screen there is a preview of the next exercise, and text that reads \"Up Next: [name of exercise]\". The background includes white cabinets, a fireplace, potted plants, and light gray floors.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence of fitness exercises shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Reverse crunches, bicycle crunches, Russian twists, leg raises, superman pose, sit-ups, leg kicks.",
+ "B. Sit-ups, reverse crunches, bicycle crunches, Russian twists, leg raises, superman pose, leg kicks.",
+ "C. Sit-ups, leg kicks, reverse crunches, bicycle crunches, Russian twists, leg raises, superman pose.",
+ "D. Bicycle crunches, Russian twists, leg raises, superman pose, sit-ups, leg kicks, reverse crunches."
+ ]
+ }
+ },
+ "PIRxLdhc": {
+ "video_id": "PIRxLdhc",
+ "video_duration": "105s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Four robots are inside a rectangular arena that is marked with white lines and purple circles. Around the perimeter of the arena, there are ten people holding what appear to be remote controls and watching the robots in the arena. Two are wearing short-sleeved black shirts. One person is wearing a white shirt and a white cap. Another person is wearing a long-sleeved red and black shirt. Two robots on the left side of the arena are predominantly black and red and are moving yellow disc-shaped objects around the arena. One robot on the right side of the arena is predominantly white and blue and is moving blue disc-shaped objects around the arena. Another robot on the right side of the arena is predominantly black and gray and is moving yellow disc-shaped objects around the arena. The number 60 appears in a red box in the bottom left corner of the screen and is replaced by an 80, then a 180. The number 86 appears in a blue box in the bottom right corner of the screen, then it is replaced by a 66, then a 96, then a 116. The words \"Driver Control\" and a countdown starting at 1:45 and ending at 0:03 appear at the top of the screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times is the blue platform bumped in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Once.",
+ "B. Twice.",
+ "C. Three times.",
+ "D. Zero times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the overall emotion conveyed in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Anxiety.",
+ "B. Anger.",
+ "C. Sadness.",
+ "D. Calmness."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Did the blue team's platform change in the middle section of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It changed from being uneven to balanced.",
+ "B. It remained unchanged and balanced.",
+ "C. It remained unchanged, with one side higher than the other.",
+ "D. It changed from being balanced to having one side higher than the other."
+ ]
+ }
+ },
+ "OiokDDAn": {
+ "video_id": "OiokDDAn",
+ "video_duration": "91s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A character wearing a brown cloak with gold designs and a blue headdress walks through a stone doorway into an area with blue crystal structures. The character walks past a wall of ice into another area with stone pillars. The character walks through a field with blue flowers and climbs a small hill. The character is then riding on a brown horse, first through a field and then a forested area. The character then is walking on a stone structure, before climbing the side of it, and jumping down. The character approaches a figure with a white head and a blue robe, in an area with exposed roots and a candelabra. There are also a couple of piles of books. The character then walks down a stone hallway. The character is then shown on a map. A menu pops up on the map, and then the screen fades to black. The character appears at the bottom of a set of stone stairs leading to a castle. The character walks up the stairs, and then enters the castle. The character approaches an NPC sitting in a chair. Then the character approaches two NPCs behind a cage. Then the character is shown on another map. The character walks down a dark hallway. The character steps into a circular chamber with two torches on either side. The character stands over an interactable object. A menu pops up asking if the character wants to be summoned to assist Sorceress Sellen. The character is then shown fighting a figure in red armor, along with a figure in a black robe. The character defeats the figure in red armor and receives a reward screen showing that they received the eccentric's hood, eccentric's armor, eccentric's manchettes, and eccentric's breeches.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where are the ruins located in the Rise of Lani?",
+ "answer": "B",
+ "candidates": [
+ "A. In the front.",
+ "B. On the left.",
+ "C. In the back.",
+ "D. On the right."
+ ]
+ }
+ },
+ "KgkBHYYk": {
+ "video_id": "KgkBHYYk",
+ "video_duration": "102s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Three people are present on a stage. Two of the people are standing at black lecterns. The person in the middle is sitting at a small table covered with a black cloth. The stage is lit and there is a banner in the top left corner that says \u201cIDEABATERS.\u201d The background has yellow curtains and a dark curtain behind the table. A crowd of people is seated in front of the stage, filling the theater seats. The video is shot from an elevated position.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotions did people exhibit when the video described the king's death?",
+ "answer": "B",
+ "candidates": [
+ "A. Gloomy and melancholic.",
+ "B. Joyful and relaxed.",
+ "C. Sad and sorrowful.",
+ "D. Frenzied and delirious."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What does the woman on the stage use the bell for in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Prompt her partner to speak.",
+ "B. Signal the start of the next part of the story.",
+ "C. Indicate a request for applause from the audience.",
+ "D. Provide a timely reminder during key moments."
+ ]
+ }
+ },
+ "yMhAdUta": {
+ "video_id": "yMhAdUta",
+ "video_duration": "53s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A young man in a suit stands in a kitchen, behind a counter, next to a sink. He looks at the camera. Three young adults, two women and one man, sit at a table, looking at the camera. The woman on the left wears a light pink-colored long-sleeved shirt, the woman in the middle wears a black turtleneck, and the man on the right wears a beige-colored t-shirt. One of the women stands in the kitchen, wearing a purple shirt and a black apron. She holds a sieve over a red bowl and sifts flour into it. Next to her on the counter, there are some other ingredients: sticks of butter, two measuring cups, and something wrapped in parchment paper. Another woman stands in the kitchen, wearing a white t-shirt and a vertically-striped pink and white apron. She cracks two eggs into a metal bowl containing other ingredients and a whisk. Next to the bowl, on the counter, there is a bottle of vanilla extract. The flour sifting process is shown in a close-up. A metal saucepan with red liquid sits on a stove burner over a blue flame. A person's hand stirs the liquid with a whisk. A metal bowl contains a bubbly purple liquid that is being mixed with a whisk attached to an electric mixer. The two women are back in the kitchen. One woman pours flour into a glass jar. Another woman mixes something in a bowl using a whisk. One woman holds a sieve in her right hand and stirs a mixture in a bowl near the sink with her left hand. Another woman stands to her right and holds an open bag. The sifting process is shown again.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the person in the video doing?",
+ "answer": "B",
+ "candidates": [
+ "A. Chatting.",
+ "B. Participating in a baking competition.",
+ "C. Cooking dinner together.",
+ "D. Conducting a baking lesson."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the cooking sequence shown in the video? (a) Beat the eggs (b) Mix the ingredients using a kitchen appliance (c) Cook and stir the boiling ingredients in the pot (d) Stir the materials by hand (e) Sift the flour (f) Sift the flour again.",
+ "answer": "B",
+ "candidates": [
+ "A. (E) (C) (D) (F) (A) (B).",
+ "B. (D) (A) (E) (C) (B) (F).",
+ "C. (A) (E) (C) (B) (F) (D).",
+ "D. (D) (E) (C) (B) (A) (F)."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the act of sifting flour appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. One.",
+ "B. Five.",
+ "C. Zero.",
+ "D. Two."
+ ]
+ }
+ },
+ "IMkQYZio": {
+ "video_id": "IMkQYZio",
+ "video_duration": "471s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Unboxing Videos",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two men are in a room with yellow walls. One man, wearing a green hoodie, unwraps a sword. The other man, wearing a maroon sweater, holds a microphone. The man in the green hoodie removes the sword from its packaging. He unsheathes the sword. The other man takes the sword and holds it, pointing it toward the first man. The first man takes the sword back. He sheathes the sword. The man in the maroon sweater takes the sword again. He holds it up, pointing it towards the camera. The scene changes to outdoors, in a backyard with a wooden fence. The man in the maroon sweater swings the sword. He stops, looks at the camera, and speaks. He holds the sword with both hands, showing it to the camera. He swings the sword again.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What was the bald man doing with the blade while the man with short hair was removing the scabbard packaging?",
+ "answer": "A",
+ "candidates": [
+ "A. Wiping the knife.",
+ "B. Waving the knife.",
+ "C. Observing the knife.",
+ "D. Putting down the knife."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video do the two people begin organizing the knives?",
+ "answer": "B",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. At the beginning of the video.",
+ "C. In the middle of the video.",
+ "D. Throughout the entire video."
+ ]
+ }
+ },
+ "rzAqzwvL": {
+ "video_id": "rzAqzwvL",
+ "video_duration": "289s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a brief, blurry shot of a person walking near some off-road vehicles in a wooded area. Then, a series of sponsor logos appear: Southern Land Riders Presents, H Craft Customs, Performance SXS Bushings, RCV Performance Products, GDR Gatekeeper Design LLC Racing, ZRP Zollingerracingproducts.com, and All Things 2 UTV. Next, a rocky, dirt path uphill through the woods is shown, with trees and some rock formations on either side. A bald, light-skinned man appears briefly, looking at the camera. Then, a short clip from the animated movie Monsters, Inc. is shown: Sulley, a large, furry, blue monster, roars at a screaming, light-skinned child in bed. The child, also light-skinned, is wearing light blue pajamas. Following this, a small, open, black off-road vehicle drives up the rocky hill. The camera angle is from behind the vehicle. The vehicle slowly bumps up the incline. The video then cuts to a view from another off-road vehicle further down the hill, watching the black vehicle climb. The black vehicle climbs most of the way up and stops near the top. The video cuts back to the original view from behind the black vehicle, which continues to slowly ascend. The video cuts again to the lower viewpoint, showing other off-road vehicles parked. The black vehicle reaches the top. The camera pans over to a man in a black beanie and grey sweatshirt, and a woman with dark hair and a black headband, sitting in a stationary off-road vehicle. The woman shows the man a video on her phone. The video shows the black vehicle climbing the hill. A person wearing a helmet and dark jacket appears and speaks with the man and woman. The person climbs into a muddy, purple off-road vehicle. The video cuts back to the black vehicle climbing the rocky hill from the original viewpoint, then to the lower viewpoint, showing a different person now recording the climb. After a final cut to the original viewpoint, the black vehicle nears the top again. The camera follows as other off-road vehicles begin to ascend as well. The rocky terrain becomes more prominent, and the vehicles continue their ascent in succession.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Was the driver transported by ambulance after the rollover?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes, the driver was.",
+ "B. No, the driver was not.",
+ "C. None of the above.",
+ "D. It is unknown."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "After reviewing the video, what action did the driver take with the vehicle?",
+ "answer": "C",
+ "candidates": [
+ "A. The vehicle is inspected.",
+ "B. The vehicle is turned off.",
+ "C. The vehicle is started.",
+ "D. The vehicle is wiped clean."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When did the rollover accident occur?",
+ "answer": "A",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. At the start of the video.",
+ "C. Throughout the video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "In the video, what did the companions do to the man wearing a black hat after he successfully went uphill for the first time?",
+ "answer": "A",
+ "candidates": [
+ "A. They showed him a video of climbing the slope.",
+ "B. They gave him a cup of water.",
+ "C. They tapped him on the shoulder.",
+ "D. They gave him a hug."
+ ]
+ },
+ "task4": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What was the sound of the vehicle's engine during the first climb?",
+ "answer": "B",
+ "candidates": [
+ "A. It increased from soft to loud.",
+ "B. It alternated between getting louder and softer intermittently.",
+ "C. It decreased from loud to soft.",
+ "D. It remained consistently loud."
+ ]
+ }
+ },
+ "IhfjbVPi": {
+ "video_id": "IhfjbVPi",
+ "video_duration": "132s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two wrestlers, one in a dark green singlet and the other in a light blue singlet, face each other on a green wrestling mat. A referee stands between them. The wrestlers circle each other briefly. They come together and lock up, the wrestler in blue getting a hold of the green wrestler's right arm. The green wrestler attempts to break free. They separate momentarily and circle again. The blue wrestler shoots in for a takedown and gets the green wrestler down to the mat. The wrestlers struggle on the mat, the blue wrestler working for control. The blue wrestler gains a dominant position, maneuvering to get on top of the green wrestler and eventually pinning his shoulders to the mat. The referee slaps the mat, signaling a pin. The blue wrestler gets up and walks away. The referee helps the green wrestler to his feet. A person in grey sweatpants walks across the mat carrying a yellow foam cylinder. The blue wrestler walks off the mat. The green wrestler also leaves the mat.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the man wearing a gray striped shirt, the person wearing blue, and the person wearing green?",
+ "answer": "B",
+ "candidates": [
+ "A. The coach and the athletes.",
+ "B. The referee and the athletes.",
+ "C. The doctor and the athletes.",
+ "D. There is no relationship."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times is a whistle heard in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Once.",
+ "B. Three times.",
+ "C. Four times.",
+ "D. Twice."
+ ]
+ }
+ },
+ "FPLryJoQ": {
+ "video_id": "FPLryJoQ",
+ "video_duration": "88s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view of five light-purple, worm-like objects called \"mitochondria,\" surrounded by a field of small, bright yellow-orange dots. The words \"white blood cell,\" \"cytoplasm,\" and \"magnification 250,000x\" are visible in the upper left of the video. In the second scene, a close-up of smaller objects appears with the text, \"magnification 7.1 million x\". There are several red objects, and some blue, surrounded by a background of many more objects of various colors. The red and blue objects move around. The phrase \"free radicals\" is shown, as well as \"sentinel proteins\". In the next scene, several red objects combine to form a ring, with various other objects attaching to the ring. The phrase \"adaptor proteins\" is shown. The video then shows other colored objects attaching to the ring, and the phrase \"protein activation\" is shown. Finally the objects are seen moving around.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many sentinel proteins form a circle?",
+ "answer": "D",
+ "candidates": [
+ "A. Eight.",
+ "B. Five.",
+ "C. Six.",
+ "D. Seven."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What happens to the protein after it is activated in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It spins around.",
+ "B. It connects with other proteins.",
+ "C. It is swallowed.",
+ "D. It separates."
+ ]
+ }
+ },
+ "lwlOxUvO": {
+ "video_id": "lwlOxUvO",
+ "video_duration": "26s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a soccer game on a green field. Players wearing blue and white/black striped uniforms are running across the field. A player in a blue uniform is dribbling a soccer ball across the center of the field. He passes the ball to another player wearing a blue uniform who kicks it towards the goal. The ball goes into the net. The player who kicked the ball raises his arms and is joined by his teammate, both of whom hug each other. The scene transitions to a closeup view of the player who scored the goal celebrating.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color jersey is worn by the player who scored a goal in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Blue and white stripes.",
+ "B. Orange.",
+ "C. Purple.",
+ "D. Blue."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "In the video, to which player did player number 15 pass the football?",
+ "answer": "A",
+ "candidates": [
+ "A. Number 2.",
+ "B. Number 15.",
+ "C. Number 10.",
+ "D. Number 17."
+ ]
+ }
+ },
+ "snKDWzoM": {
+ "video_id": "snKDWzoM",
+ "video_duration": "92s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a screen recording of Adobe Illustrator. A diamond shape made of green lines is displayed in the center of the artboard. The \"Swatches\" panel is opened on the right side. The cursor clicks on the swatches panel menu, and selects \"Open Swatch Library\", then \"Other Library\". A file selection dialog opens up. The cursor navigates to and selects \"Herinneringsgroepen\" in the OneDrive folder. The dialog box closes without selecting any file. The cursor then selects the \"Live Paint Bucket\" tool from the toolbar on the left. It clicks outside the diamond shape, removing the surrounding rectangle that was previously visible. The cursor clicks on the swatches menu again and opens the same file selection dialog as before. This time it selects the \"Aan\" file in the \"Herinneringsgroepen\" folder, and the dialog box closes. The cursor uses the \"Live Paint Bucket\" tool to fill the triangular facets of the diamond with different shades of blue. Once all facets are filled, the green lines disappear, leaving a smooth-looking diamond. The cursor then clicks on the black selection tool in the toolbar, causing a bounding box to briefly appear around the diamond. The zoom tool is selected from the toolbar and used to zoom out, revealing the full artboard. The \"Layers\" panel is opened on the right, showing three layers: one for the conversation, one for the \"2D Diamond\", and one for the \"Dark Diamond\". The visibility of the \"2D Diamond\" layer is toggled off and then on again by clicking on the \"eye\" icon next to the layer name.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many instances of coloring occur in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Ten instances.",
+ "B. Nine instances.",
+ "C. Seven instances.",
+ "D. Eight instances."
+ ]
+ }
+ },
+ "EICpqnoP": {
+ "video_id": "EICpqnoP",
+ "video_duration": "329s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a circular frame showing internal organs, including two large, dark reddish-brown organs. A light-colored sac-like structure with visible veins is between the larger organs. Two surgical tools with metal tips are introduced into the frame. The tools manipulate the light-colored sac-like structure, separating some tissues. A third tool, with a white shaft and a black tip, enters the frame and assists in separating the tissues. The two metal tools and one of the other instruments cut some of the remaining tissue. After further separation of the tissues, the white instrument places a plastic-like bag around part of the structure. The white instrument then helps push the entire structure into the bag, and then seals the bag shut. The video then transitions to a brown surface with some white spots. Then, a white surgical tool is shown briefly near the bottom of the frame, and the video becomes unclear. In the final scene, a purple screen shows text and contact information.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many tools were used during the surgical procedure shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Seven.",
+ "B. Eight.",
+ "C. Six.",
+ "D. Nine."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What conditions might the human body experience in the future after surgery?",
+ "answer": "D",
+ "candidates": [
+ "A. Cardiovascular and pulmonary functions deteriorate.",
+ "B. Respiratory function worsens.",
+ "C. Fertility decreases.",
+ "D. Digestive function weakens."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the doctor's specialty in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. An internist.",
+ "B. An orthopedic doctor.",
+ "C. A surgeon.",
+ "D. An anesthesiologist."
+ ]
+ }
+ },
+ "oYxSmMSM": {
+ "video_id": "oYxSmMSM",
+ "video_duration": "183s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a light gray blazer and dark gray pants stands behind a small round table draped in white cloth. On the table are several containers: a clear plastic jar with liquid, a clear glass, a small brown plastic bottle with a white cap, a rectangular prism made of clear plastic with writing on it, and a clear beaker. A crumpled blue glove lies on the table near an amber plastic container with an orange cap. The man puts on the blue glove. He picks up the small amber container and pours some orange liquid from it into the clear plastic jar with liquid. The liquid in the jar turns purple. He places the amber container back on the table. He picks up the small brown plastic bottle with the white lid, opens it, and pours the dark liquid from the small brown bottle into the glass. He places the small brown bottle back on the table. He picks up the clear jar with the purple liquid and pours it into the glass with the brown liquid. He puts the clear jar back on the table. He removes the blue glove. He picks up the small white container next to the amber one and pours some white powder from it into the glass with the mixed purple and brown liquids. He places the small white container back on the table and places the now-empty blue glove on top of it. He picks up the rectangular prism made of clear plastic and uses it to scoop up some of the mixed liquid from the glass and transfer it to the beaker. He then pours the liquid from the glass into the beaker as well. He picks up the glass and drinks the remaining liquid in it. He picks up the small brown plastic bottle again and closes the lid. He puts it back down on the table. He moves the beaker closer to himself. He picks up the clear glass, which is now empty, and places it upside down on top of the beaker. He then pours the liquid from the beaker into the upside-down glass. He lifts the beaker and places it next to the glass, which is still upside down and now filled with liquid. He walks away from the table.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the small bottle relative to the water cup in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The small bottle is behind the water cup.",
+ "B. The small bottle is in front of the water cup.",
+ "C. The small bottle is to the right of the water cup.",
+ "D. The small bottle is to the left of the water cup."
+ ]
+ }
+ },
+ "UubibIcM": {
+ "video_id": "UubibIcM",
+ "video_duration": "119s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A dark-colored Honda Pan-European motorcycle is parked on a brick driveway. The camera pans around the motorcycle, showcasing its features, including saddlebags, a top case, and a windscreen. The camera then focuses on the rear of the motorcycle, showing the license plate \"AF52 VRJ\" and the top case. A hand with tattoos enters the frame and opens the top case, revealing paperwork inside. The hand then moves to the front of the motorcycle, demonstrating the levers and controls. Finally, a man with a beard and tattoos appears, standing next to the motorcycle. He gestures towards the motorcycle. The video then shows the hand interacting with side panniers and top box, showcasing storage, before again showing documentation stored within the top box.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "In the video, which part of the motorcycle does the tattooed man with floral arm tattoos open for the second time using a key?",
+ "answer": "D",
+ "candidates": [
+ "A. A tail box.",
+ "B. A fuel tank.",
+ "C. A pocket.",
+ "D. A side pannier."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color was the motorcycle supposed to be in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Purple.",
+ "B. Green.",
+ "C. Blue.",
+ "D. Black."
+ ]
+ }
+ },
+ "lTNvBqgb": {
+ "video_id": "lTNvBqgb",
+ "video_duration": "176s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with a dark complexion and dark hair is seated on a table. There is a garment rack on the left with two dresses hanging from it, a dark-colored dress in the front, and a purple dress behind it. There is a window behind the person, and two framed pieces of art on the wall to the right of the window. To the person's right, there is a bookshelf, with a laptop on it, a golden piggybank to the right of the laptop, and other items on the lower shelves. A large quantity of clothes are piled on the table in front of the person, along with a white sheet of paper and a black object. The person picks up a dark-colored dress from the rack, holds it up, folds it and places it on the table. The person then picks up a tan-colored garment from the table, and places it on the shelf, they pick it up again and places it into a black duffel bag on the table. The person packs more items into the duffel bag, including clear plastic bags, and finally zips the bag closed. The person holds up an empty clear plastic bag, then puts it on the table, then the person holds up a full clear plastic bag and puts it into the duffel bag, then the person holds up a clear plastic bag with a red stripe and places it in the bag, and then a final clear bag with a zipper is put into the duffel bag and the bag is zipped closed.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the plug ultimately placed in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It is placed outside the shopping bag in the suitcase.",
+ "B. It is placed on the bed.",
+ "C. It is placed on the table in the back.",
+ "D. It is placed inside the shopping bag in the suitcase."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the woman display the e-reader?",
+ "answer": "B",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. At the end of the video.",
+ "C. Throughout the entire video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "JXrYLgVg": {
+ "video_id": "JXrYLgVg",
+ "video_duration": "109s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a person riding a bicycle through a city. The video begins on a street with cars parked on either side of the road. There are white crosswalk lines on the road and a street sign that says \"34th Avenue, Queens (departure)\". The video pans down the street and shows parked cars on either side and an elevated train track. The video shows a train moving along the elevated tracks. Then the video shows a group of people playing a ball game in a fenced in area. The video then moves to a bike lane next to parked cars. The video shows a person riding a bicycle in a bike lane. The video shows several people riding bicycles in a bike lane. The video shows a person with a bike helmet and glasses talking to the camera. The video shows several people walking and riding bicycles on a bridge. The video shows cars driving on the bridge below the bike and pedestrian lane. The video then shows cars and trucks moving in traffic under an overpass. The video shows a person jogging on a city street with a bike lane. The video shows a cable car moving through the air. The video ends with a shot of a cable car moving over water.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "In the latter part of the video, from which direction does the sound of a car reach the cyclist while they are riding on the bridge?",
+ "answer": "A",
+ "candidates": [
+ "A. The left of the cyclist.",
+ "B. Below the bridge.",
+ "C. Above the bridge.",
+ "D. The right of the cyclist."
+ ]
+ }
+ },
+ "sKzAjbxS": {
+ "video_id": "sKzAjbxS",
+ "video_duration": "304s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with a gray beard is standing in a kitchen, gesturing with his hands as he speaks to the camera. He is wearing a black chef's coat with camouflage patterned sleeves. There is a wooden cutting board in front of him on which is a piece of meat. He picks up the meat and gestures with it, then places it back down. He gestures again, and then points to the cutting board where some text has appeared. He then places a cut of meat onto a hot pan, and pours a liquid from a bottle into the pan. He then places another cut of meat onto the pan. He uses a utensil to move meat in the pan, then holds up a jar and shakes it to sprinkle seasoning into the pan. He picks up a small green stalk, and then uses a knife to chop it, then places it on a wooden cutting board. He then chops a piece of white material, then chops a green material, then places more of the white material into the pan. He then puts red material into the pan and adds a white granular material to the pan. He then uses a utensil to move materials in the pan. A white plate containing a light-colored grain is placed in front of the man, then he takes a cut of the meat from the board, and puts it on the white plate. He places another cut of the meat on top of the white plate. He then adds the green stalk he had chopped earlier on top of the other items. Finally, the text \"SPORTING CHEF\" appears on the screen.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did a spring onion appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes.",
+ "B. I don't know.",
+ "C. No."
+ ]
+ }
+ },
+ "CpXAoeKK": {
+ "video_id": "CpXAoeKK",
+ "video_duration": "99s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A character wearing yellow and grey clothing and a gold helmet walks toward a rock face. The character climbs the rock face. The character jumps and appears to be gliding. The character lands on a rock formation. The character looks at something to their right. A double rainbow is visible in the sky. The character runs along a path before standing still. A camera view of the character is shown, then the character is shown running again. The character uses an item to look at a large dragon creature in the sky, then puts the item away. The character walks and jumps across a rock formation, then stands still. The character is shown on a path and is standing near the edge. A large dragon flies in the sky. The character jumps off the edge using a paraglider. They fly towards the dragon and shoot it. The character is shown collecting a glowing item after jumping off of the dragon. The character is shown paragliding and lands on the ground near a fire. The character is shown standing by the fire, and then they are shown again at an altar with blue light coming up from the floor. A map is shown, and then the character is shown at the altar again, then paragliding. The character builds a fire on a small hill. Then the video shows the character sitting near a fire in the dark, and using their paraglider. The character is then shown gliding alongside a dragon that is breathing fire. The dragon disappears and the character lands on the ground.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "At what time in the game does the man control the character to fly toward the orange-red dragon?",
+ "answer": "D",
+ "candidates": [
+ "A. 6:30 a.m.",
+ "B. 6:25 a.m.",
+ "C. 6:20 a.m.",
+ "D. 6:15 a.m."
+ ]
+ }
+ },
+ "CXcpoczR": {
+ "video_id": "CXcpoczR",
+ "video_duration": "90s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with a beard and glasses is sitting in a chair wearing a white tank top and black shorts. A child is partially visible to the left side of the frame. The person moves from a sitting to a standing position with two dumbbells in each hand, lifting them straight up over their head. The person then stands bent over slightly with one dumbbell in one hand and moves it from low to high in a single arm row motion. The person continues lifting dumbbells over their head, while seated. They end with hands on hips, then continue lifting a dumbbell in both hands over their head. A slide with white lettering on a black background saying \u201cDay 1 vs Day 90\u201d with measurements of \u201cArms\u201d, \u201cChest\u201d, \u201cLegs\u201d, and \u201cTotal inches gained\u201d appears. Two side by side images then appear of the person standing, facing the camera, then a side view of the person and a back view of the person. The side-by-side images are replaced with more side-by-side images of the person. The video ends with text on a green background reading \u201cJoin my team and learn to \u201cDo what ever it takes\u201d to Beast Up\u201d and then with white text on a black background with social media links.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does a sword appear at the beginning of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Uncertain.",
+ "B. Yes.",
+ "C. No."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many different dumbbell weights appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 1.",
+ "B. 5.",
+ "C. 3.",
+ "D. 2."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order in which the man trains his body parts in the video? (a) chest, (b) shoulders, (c) legs, (d) arms",
+ "answer": "D",
+ "candidates": [
+ "A. Legs, chest, shoulders, arms.",
+ "B. Arms, shoulders, chest, legs.",
+ "C. Chest, shoulders, legs, arms.",
+ "D. Chest, arms, shoulders, arms."
+ ]
+ }
+ },
+ "fCdLksnj": {
+ "video_id": "fCdLksnj",
+ "video_duration": "47s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A spiky animal with a long tail hangs upside down from a wooden beam. It has dark eyes, a pinkish nose, and white spikes with dark tips. It grasps a yellow and black patterned piece of fabric and chews it. The words \"COENDOU PREHENSILIS\" appear on screen. The animal walks on the floor, holding a red object between its feet. A dark spot appears next to it on the floor. Three people sit at a table. One is a woman with red hair and the other two are men, one in an orange shirt and the other in a black shirt. The spiky animal sits on the table in front of them. It walks toward the man in the orange shirt. It walks on the floor again, nibbling at the red object it holds with its feet. It is next to a green sphere and a straw bale and some floral decorations in the background. It walks up to a red apple and touches it with its nose. It is shown standing on its hind legs with a woman with red hair. She touches the top of its head with her hand. She holds out her hand with food and the animal eats it. A man in a red and blue checkered shirt and black glasses speaks with a bookshelf behind him. A woman in a blue shirt stands near a table. The spiky animal is on the table. She makes a gesture with her hands. The spiky animal hangs upside down from a beam. Someone holds a piece of food up and the animal takes it with its mouth. It walks near a pair of jean-covered legs and rests on them. It walks across the floor, its long tail curling in the air. It approaches a hand and takes a brown piece of food. A man in a checkered shirt speaks. The spiky animal is shown taking another piece of food.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times is a Brazilian porcupine shown eating in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Four times.",
+ "B. Two times.",
+ "C. Five times.",
+ "D. Three times."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What abnormality is there in the appearance of the Brazilian porcupine at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Its right hind leg is injured.",
+ "B. Its tail is injured.",
+ "C. Its left hind leg is injured.",
+ "D. Its front paw is injured."
+ ]
+ }
+ },
+ "hFSjJhYE": {
+ "video_id": "hFSjJhYE",
+ "video_duration": "83s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a motocross track with a dirt surface. A number of people can be seen standing and sitting in the surrounding grass fields. The track has a large dirt hill that motocross bikes jump over. There are signs with company names and logos, and large letters that spell \"GOPHERDUNES\". There are several people seen wearing bright-colored clothing, such as a man wearing a red shirt and a woman with a striped shirt. Motocross bikes with different color schemes can be seen throughout the video, racing around the track and jumping over the hill. There are yellow posts around the dirt surface. Some people are under tents and other people can be seen walking in the fields.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the spatial relationship between the staff wearing red and the black team flag?",
+ "answer": "C",
+ "candidates": [
+ "A. The staff in red are to the right of the black team flag.",
+ "B. The staff in red are behind the black team flag.",
+ "C. The staff in red are to the left of the black team flag.",
+ "D. The staff in red are in front of the black team flag."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many motorcycles are shown in the video on the track?",
+ "answer": "A",
+ "candidates": [
+ "A. 38.",
+ "B. 37.",
+ "C. 35.",
+ "D. 39."
+ ]
+ }
+ },
+ "rwpLCxtf": {
+ "video_id": "rwpLCxtf",
+ "video_duration": "180s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "swimming",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a swimming competition taking place in an indoor pool. Initially, eight swimmers wearing swim caps and goggles are poised on starting blocks. They simultaneously dive into the water, and begin swimming butterfly stroke. The camera angle shifts between overhead, underwater, and views of individual swimmers, tracking their progress across multiple laps of the pool. At various points throughout the race, the screen displays a timing clock and other text overlays. In the final lap, one swimmer, in a black cap with the text \u2018IRON MILAK\u2019 on the front, appears to finish first and looks toward the timer as he nears the pool edge and holds the wall. The screen shows a listing of individual results including times and points, and a team result. The video concludes with close ups of two swimmers and then an overhead perspective of the empty pool.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When do the athletes in the video clap their hands?",
+ "answer": "C",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. Throughout the entire video.",
+ "C. At the end of the video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, where is the man wearing a white name badge around his neck standing in relation to the woman wearing blue shorts?",
+ "answer": "D",
+ "candidates": [
+ "A. Behind.",
+ "B. In front.",
+ "C. To the left.",
+ "D. To the right."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many swim caps were present in the pool?",
+ "answer": "D",
+ "candidates": [
+ "A. Seven.",
+ "B. One.",
+ "C. Five.",
+ "D. Zero."
+ ]
+ }
+ },
+ "OcjhEchR": {
+ "video_id": "OcjhEchR",
+ "video_duration": "134s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with short brown hair wearing a blue striped shirt is shown placing an orchid onto a piece of bark on top of a white towel with a blue stripe on a black table. Then, the person sprays the bark with water from a blue bottle. The person uses a white stick to apply glue into crevices in the bark, then adds some of the moss-like substance out of a metal bowl to the bark crevices. The person applies glue to the roots of the orchid, and then pushes more of the moss-like substance around the roots to secure the plant to the bark. The person then uses glue and more of the moss-like substance to secure the orchid to the bark.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many bottles of glue appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Five bottles.",
+ "B. One bottle.",
+ "C. Three bottles.",
+ "D. Two bottles."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the spatial relationship between the folded towel and the man?",
+ "answer": "B",
+ "candidates": [
+ "A. The folded towel is on the man's head.",
+ "B. The folded towel is on the right side of the man.",
+ "C. The folded towel is on the left side of the man.",
+ "D. The folded towel is under the man."
+ ]
+ }
+ },
+ "pgZCIaZZ": {
+ "video_id": "pgZCIaZZ",
+ "video_duration": "32s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows three people, two seated and one in the background of a large screen.The person in the foreground on the left is wearing a green jacket, a navy top, and blue jeans. The person in the foreground on the right is wearing a red suit and a white shirt. Both are seated on a white couch. There is a white round table in front of them with a vase of pink flowers and a red mug. The background shows a large screen with a man in white shorts, no shirt, and a necklace, sitting at a desk with a computer and other items on it. The video shows clips of this picture interspersed with clips of the three people speaking.There are also clips of a child in a black and white shirt with their arm raised. This image appears on the screen in the background with the two people on the couch sitting in the foreground.There is a logo in the bottom center of the screen with a red H. There is a subscribe button at some points.There are white subtitles at the bottom of the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What is the progression of Kevin Hart's emotions in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. From embarrassment to excitement.",
+ "B. From excitement to disappointment.",
+ "C. From excitement to embarrassment.",
+ "D. From disappointment to embarrassment."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of events in the video? (a) A photo of Kevin Hart is shown. (b) Kevin Hart comments that Johnson looks like a detective and a DJ. (c) A photo of Johnson is shown.",
+ "answer": "B",
+ "candidates": [
+ "A. (c) (a) (b).",
+ "B. (c) (b) (a).",
+ "C. (a) (b) (c).",
+ "D. (a) (c) (b)."
+ ]
+ }
+ },
+ "feVRMJfh": {
+ "video_id": "feVRMJfh",
+ "video_duration": "141s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Bowling Ball",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A baseball game is shown, with several individuals visible on the field, and in the stands. At the start of the video, a player in a yellow and blue jersey with the number 23 is standing on first base. A player in a red and white uniform stands in the batter's box. A person in white pants and a black short-sleeved shirt is walking from the center of the field towards first base. There are many individuals seated in the stadium stands. The batter attempts to hit the ball, but it appears to be caught. The person in white pants and the black shirt is now seen walking off the field towards a dugout. The yellow and blue team begins to celebrate near first base, and the video cuts to show several of them in a group hug near second base. \"Simeon Wolverines Win!!\" appears as a text overlay on the field.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What could be the identity of the man wearing a black mask and black uniform in the infield area shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Base umpire.",
+ "B. Pitcher.",
+ "C. Plate umpire.",
+ "D. Infielder."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many balls appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. Five.",
+ "C. Three.",
+ "D. Four."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man in a red No. 23 shirt kneel on the ground with his hands supporting him at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Because he was too tired from the competition.",
+ "B. Because the team won the competition.",
+ "C. Because he was injured during the competition.",
+ "D. Because the team lost the game."
+ ]
+ }
+ },
+ "IEQaDTBh": {
+ "video_id": "IEQaDTBh",
+ "video_duration": "88s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "swimming",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing black swim trunks walks to the edge of a swimming pool and onto a starting block. The person's bare foot is on the blue surface of the starting block. Their hands are positioned on the front edge. Other swimmers are on starting blocks in the background. They then jump into the water. The view then shows multiple people swimming in separate lanes in a pool. The view shifts to a swimmer close up, who then grabs onto the lane rope. The swimmer removes their swim goggles and cap. They look up toward the side of the pool.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between Rafael Munoz and Lars Frolander in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Competitors.",
+ "B. Mentor and apprentice.",
+ "C. Friends.",
+ "D. Teammates."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the positional relationship between Lars Frolander and Almeida during the sprint phase in the second half of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Lars Frolander is positioned in front and to the left of Almeida.",
+ "B. Lars Frolander is positioned behind and to the right of Almeida.",
+ "C. Lars Frolander is positioned in front and to the right of Almeida.",
+ "D. Lars Frolander is positioned behind and to the left of Almeida."
+ ]
+ }
+ },
+ "MMdhPBKH": {
+ "video_id": "MMdhPBKH",
+ "video_duration": "239s",
+ "duration": "2-4min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen, then transitions to a display of a stage backdrop with the words \"HAPPY BIRTHDAY BOB\" and \"50 Stars Salute Your 50 Years! with NBC\". The scene then shifts to an elevated shot of a room with many people seated at round tables, each with white chairs. The stage has a curved backdrop with geometric designs in pink, gold, and black. Two people in black suits walk onto the stage. One man in a black tuxedo begins speaking with a microphone while standing center stage. The man then turns to face the camera and gestures with his hands while speaking. He walks around on the stage as he continues to speak, moving closer to the left and right edges. Confetti-like specks appear briefly as a man in a black tuxedo continues to speak with a microphone on stage. The scene shows a seated group, including a red-haired woman wearing a light blue dress, and the man in the tuxedo on stage. The man on stage then claps his hands briefly, and speaks with hand gestures, walking back and forth. The scene transitions back to the group of seated people at tables, and then the man on stage again. The man on stage turns to his left and raises his hand while speaking. The scene then shows a woman in a pink top laughing with two other people. The man in a tuxedo walks around the stage while speaking. The scene then briefly returns to a shot of the seated crowd. The man on stage claps his hands again, then the shot transitions to a close up of him speaking with a microphone. He walks around and gestures with his hands while speaking. He takes a few steps to the right and left while continuing to speak. He turns his body from left to right several times while gesturing and speaking. He then moves back to center stage. The man is speaking with his hands moving at the wrist. He turns toward the right side of the stage as he speaks. The man continues to move around as he gestures with his hands. 
He points his index finger while speaking, and then the camera shot transitions back to the group at tables as they laugh. The man in the tuxedo turns his body to the left as he talks. He moves to the left edge of the stage while talking. He uses both hands to gesture while facing towards the left of the stage. The man moves his body from left to right as he talks. He then looks up towards the light at the top of the stage as he holds the microphone in his hand. The man walks around the stage as he continues to talk. The man in the tuxedo makes his way back to the center of the stage. He looks back over his shoulder to the left of the stage as he continues to speak, and then turns to the camera. He holds his arms at his sides as he moves to the left edge of the stage. The man in the tuxedo walks from left to right while he talks. A seated group is shown as one man with glasses laughs. The man in a tuxedo continues to speak while gesturing with his hands. He turns his body from right to left as he speaks. He then touches his hand to his chest while talking. The man on stage faces the right of the stage, and then the camera shot transitions to the group at the tables. The man on stage makes his way towards center stage while talking, and then moves back to the left. He continues to gesture as he speaks. He raises his arm while speaking on stage, then the camera transitions back to the crowd at the tables. The man on stage points his right index finger, and the camera pans across the stage. The shot transitions to a wider shot of the seated people at tables, and the man in a tuxedo on stage is seen walking to the center again. He briefly turns his body to the left while speaking. The man then walks to the right edge of the stage. The man gestures as he continues to speak, and the camera focuses on the seated people. He returns to the center of the stage and walks back and forth as he talks. He then walks to the left edge of the stage, and then back to the center. 
He gestures with his hands as he talks. The camera shot transitions back to a shot of the man on the stage, and then back to a wider shot of all seated people. The man on stage continues to speak while gesturing. The man is seen walking towards the center of the stage as the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does Jackie Gleason run?",
+ "answer": "C",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the beginning of the video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "How many people did Jackie Gleason tease in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. Four.",
+ "D. Five."
+ ]
+ }
+ },
+ "BIKoaIDz": {
+ "video_id": "BIKoaIDz",
+ "video_duration": "103s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Fashion",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a person with dark, curly hair, wearing a pink top and hoop earrings. They are speaking directly to the camera, and moving their hands as they do so. Behind the person is a white shelf with various objects, and a white bureau. On a few frames, a phone screen appears on the right side of the video, showing an instagram feed. Towards the end of the video, a black screen appears with text that reads 'Recap' and a white font with numbers that display numbered rules. These frames also show side by side images of the same person wearing different outfits. On the last few frames, the same person is talking and gesturing, with a banner appearing that displays 'SuppleChicTV' and the view count, a thumbs up button, and a subscribe button. Towards the end of the video, 2 other video thumbnails are displayed on the upper right side of the video.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How many views did the video of the woman receive?",
+ "answer": "D",
+ "candidates": [
+ "A. 1,729,460 views.",
+ "B. 1,729,550 views.",
+ "C. 1,725,440 views.",
+ "D. 1,729,440 views."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many outfits does the woman in the video change into?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. Five.",
+ "C. Three.",
+ "D. Four."
+ ]
+ }
+ },
+ "NokNpYzx": {
+ "video_id": "NokNpYzx",
+ "video_duration": "85s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Biology",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A man wearing a black blazer and a light purple tie is standing on the left of the screen. There is text on the right of the screen, and at the top of the screen that reads \"NATURE OF MUTATIONS\". The background appears to be a pale blue. Next, the screen changes to show four diagrams of pairs of colored rods. Underneath each diagram is written one of the following: Deletion, Duplication, Inversion, and Translocation. A red line appears under \"Deletion\" and a red circle appears in the diagram above it. Then a red line appears under \"Duplication\" and a red circle appears in the diagram above it. A red line appears under \"Inversion\" and then under \"Translocation\", and red curves are drawn in the diagram above them.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How many types of sequences do the chromosomes in the video have after mutations occur during replication?",
+ "answer": "A",
+ "candidates": [
+ "A. Two.",
+ "B. One.",
+ "C. Four.",
+ "D. Three."
+ ]
+ }
+ },
+ "IyUJliBE": {
+ "video_id": "IyUJliBE",
+ "video_duration": "191s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Biology",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing blue gloves is assembling a gel electrophoresis apparatus on a white table. The apparatus consists of a clear plastic container with a green top. They remove the green top, then insert a green comb. A clear liquid is injected into the container, then the person connects the electrophoresis apparatus to a power supply. The apparatus appears to run, and a close-up reveals the apparatus submerged in liquid and connected to wires. Next, an animated sequence introduces the step of transferring proteins onto a membrane. After this, the person is seen preparing a membrane for the transfer, placing it first in a container marked \u201cmethanol\u201d, then using tweezers to move the membrane to a container marked \"transmembrane buffer\". They use a smartphone to set a timer, then use a green scraper to spread liquid across a clear piece of plastic. A blue dye appears on the plastic and they use a green tool to spread the dye. Finally, the person is using tweezers to transfer paper from one container to another, with various other pieces of paper and containers on the table.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many pieces of absorbent paper were immersed in methanol?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Six.",
+ "C. Five.",
+ "D. Seven."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What is the reason for adding a rinse buffer between the fixtures in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. To cool the experimental equipment and ensure proper operation.",
+ "B. To check for leaks in the electrophoresis tank.",
+ "C. To clean the fixtures and prevent impurities from affecting subsequent experimental results.",
+ "D. To dilute the sample for easier subsequent handling."
+ ]
+ }
+ },
+ "CjRvaRGV": {
+ "video_id": "CjRvaRGV",
+ "video_duration": "389s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A rectangular, orange baking pan sits on a counter. It has a small amount of liquid in the base. A dark-skinned hand holding a red spoon begins spooning a mixture containing ground meat and red sauce into the pan. The spoon spreads some of the mixture along the bottom. Yellow lasagna noodles are placed on top of the meat sauce, covering most of it. The lasagna noodles are arranged, filling the bottom of the pan. The ground meat sauce is again added to the pan, covering the pasta. A hand then places sliced pepperoni in the baking pan over the meat sauce and pasta. Finally, a bag of shredded mozzarella cheese is shown, and the cheese is spread over the pepperoni. A red shaker sprinkles something over the cheese. A hand then pats down the cheese. A green shaker sprinkles more seasonings on the cheese. A hand pushes the cheese into the dish.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many different types of ingredients are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Five.",
+ "B. Three.",
+ "C. Four.",
+ "D. Six."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes happen to the pancake in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It is melted.",
+ "B. It is divided into five parts.",
+ "C. It is broken into pieces.",
+ "D. It is crushed."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a sausage in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Uncertain.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "bNxJCjMq": {
+ "video_id": "bNxJCjMq",
+ "video_duration": "219s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a countdown timer overlayed on a static outdoor scene. The scene shows a tall, narrow, light-colored rocket on a launchpad, along with a dark structure beside it. In the background are a few buildings and a tall cylindrical tower. A small graphic appears in the top right of the screen and the NASA logo is displayed in the bottom right. The countdown number decreases from 27 to 1. At 21 seconds, white vapor appears around the base of the rocket. At 16 seconds, the rocket launches with a burst of orange flame. The video shifts to a view from behind the rocket as it rises vertically against a backdrop of gray sky. It then alternates between this and a shot from further away showing the launch site. At about 0:24, the rocket ascends through the clouds and into a darker sky. The sky eventually becomes completely dark, but the rocket is still illuminated. From 1:29, the video presents two side-by-side views. The left view shows the rocket continuing to move away, leaving an orange trail, while the right view shows a cockpit or capsule interior with digital displays showing an image of the Earth. The left view switches to an image from inside the rocket. The video concludes by showing the rocket separating.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What does the number inside the first circle from left to right in the bottom left corner of the screen represent after the spaceship takes off in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Atmospheric pressure.",
+ "B. Gravity.",
+ "C. Time.",
+ "D. Speed."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "What happened after the spaceship performed the second ignition in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. I don\u2019t know.",
+ "B. Shut down.",
+ "C. Exploded.",
+ "D. Continued flying."
+ ]
+ }
+ },
+ "uPXjTLNV": {
+ "video_id": "uPXjTLNV",
+ "video_duration": "507s",
+ "duration": ">8min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man stands in front of a soda dispenser, gesturing towards it. He is in a room filled with various food service appliances. Behind him are two large, stainless steel ice cream machines with digital displays. To the right of these are smaller machines, some on counters, others freestanding, and posters advertising \"Dolphin Appliances,\" featuring images of ice cream and candy floss machines. The soda machine he stands before is silver and has a purple sign that reads \"Dolphy Soda Shop.\" It has multiple dispensers and displays images of various drinks. Several silver canisters are placed around the soda machine, labeled with names like \u201cSprite,\u201d \u201cJeera Masala,\u201d \u201cMango,\u201d \u201cKalyo,\u201d \u201cChina,\u201d and \"Mint Jaljeera.\" The man points to different parts of the machine and the canisters as he talks, his hand motions indicating explanation. He opens a side panel of the soda machine briefly. He then returns to his original position, gesturing towards the soda machine and the canisters. Two phone numbers are displayed along the bottom of the screen.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "What is the number on the seventh soda bottle in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 6",
+ "B. I am not sure.",
+ "C. 8",
+ "D. 7"
+ ]
+ }
+ },
+ "OoWIxCdw": {
+ "video_id": "OoWIxCdw",
+ "video_duration": "118s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two men sit on a couch. The man on the left wears a white t-shirt and glasses. He holds a laptop. The man on the right wears a black t-shirt. They speak to each other while looking at the laptop. At one point, they look intently at each other. The man in the black t-shirt holds his hand to his chest. The man in the white t-shirt points at the laptop screen. He adjusts his glasses. The man in the black t-shirt holds his hand over his mouth. The man in white talks animatedly, using hand gestures. The man in the black t-shirt raises his hands in exasperation. He covers his face with one hand. He holds his chin. The man in white talks, pointing upwards. The man in black holds a smartphone. He types on the phone. He brings the phone to his ear and speaks. The man in white listens, then talks on the phone. The man in black laughs. The man in white returns the laptop to the man in black. The screen fades to black, showing white text and icons for Facebook and Instagram with the handle \"charlesthefrench\". Below these are the words \u201cWATCH another video here\u201d and \"CLICK on my face to subscribe\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, what is the order in which the man wearing a white short-sleeved shirt and the man wearing a black short-sleeved shirt attempt to recover their passwords? (a) Click \"Forget Password.\" (b) Receive the password via SMS. (c) Answer the questions. (d) Seek help from their ex-girlfriend.",
+ "answer": "D",
+ "candidates": [
+ "A. (b) (d) (a) (c).",
+ "B. (b) (a) (c) (d).",
+ "C. (c) (b) (d) (a).",
+ "D. (a) (c) (b) (d)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "From the viewer's perspective in the video, what is the position of the man wearing the white short-sleeved shirt relative to the man wearing the black short-sleeved shirt?",
+ "answer": "D",
+ "candidates": [
+ "A. He sits in front of the man wearing the black short-sleeved shirt.",
+ "B. He sits behind the man wearing the black short-sleeved shirt.",
+ "C. He sits to the right of the man wearing the black short-sleeved shirt.",
+ "D. He sits to the left of the man wearing the black short-sleeved shirt."
+ ]
+ }
+ },
+ "WccSarjU": {
+ "video_id": "WccSarjU",
+ "video_duration": "49s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Software",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video displays a series of screens, each featuring text, graphics, and images. Initially, the screen shows numerical data tables with headings labeled \"Today\", \"Yesterday\", \"Last 7 Days\", and \"Last 30 Days,\" with each column containing monetary values. The text \"PLUS - We're Already Using It In Our Own Business To Great Success...\" is visible at the top. An orange banner then appears with text: \"Get Started With This Revolutionary New Software Today!\" followed by images of computer screens, a smartphone, and a tablet all displaying the \"ProfitClass\" logo, with an additional button that reads \"Click Here To Grab ProfitClass\". A countdown timer is displayed. The screen transitions to show various graphics and text related to online course creation and marketing. This includes graphic depictions of web interfaces with icons and text. Images of people are displayed alongside the text \"In Fact, Take A Look At All These Famous People Who Are Selling Their Own Courses Online.\" Various images and animations with text highlighting various features and benefits follow. Later, a graph shows revenue numbers from 2015 to 2022. The video continues with more graphics of computer screens, phone screens, keyboards, dollar bills, and multiple animated images of people using devices. The video concludes with a screen that has the orange banner again, along with product images of the software and the same countdown timer.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will happen to the user base of Profit Class next?",
+ "answer": "C",
+ "candidates": [
+ "A. The number of users will decrease.",
+ "B. The number of users will initially increase, then decrease.",
+ "C. The number of users will increase.",
+ "D. The number of users will initially decrease, then increase."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How does the Profit Class assist users in monetizing their products and services?",
+ "answer": "D",
+ "candidates": [
+ "A. Leasing service model.",
+ "B. Data algorithm model.",
+ "C. Membership-based profit model.",
+ "D. Content payment model."
+ ]
+ }
+ },
+ "tQqMFbqn": {
+ "video_id": "tQqMFbqn",
+ "video_duration": "33s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man with a beard is wearing a navy blazer and white collared shirt, and appears to be talking. The scene then switches to several people in camouflage military uniforms standing in a field while aiming rifles. The scene changes again to show several people in camouflage military uniforms and a person in tan clothing standing in a line near brown target silhouettes while one person aims a rifle at the silhouettes. Next, a large black military hovercraft is on a beach near palm trees, with waves in the background. Following this, the video shows a large black military hovercraft with military vehicles on it on a beach, with several people in military uniforms standing near the hovercraft. Next, a green military vehicle is seen moving from right to left with two people wearing helmets inside, followed by a shot from inside another green military vehicle, showing one person in a helmet looking forward. A large military helicopter is shown in flight above a dusty field, with several people on the ground. Finally, a green military vehicle is seen moving onto a beach, followed by two people in camouflage military uniforms walking near a field.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the weather like at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Snowy.",
+ "B. Overcast.",
+ "C. Rainy.",
+ "D. Sunny."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many targets are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Twelve.",
+ "B. Eleven.",
+ "C. Thirteen.",
+ "D. Ten."
+ ]
+ }
+ },
+ "gReEaMRG": {
+ "video_id": "gReEaMRG",
+ "video_duration": "92s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with an outdoor scene during what appears to be a light snowstorm. The camera faces down a dirt road flanked by trees. A mailbox is in the foreground. The video cuts to a woman with short grey hair and glasses, sitting on a red couch. She is wearing a brown sweater and dark leggings, and is gesturing with her hands. The scene switches to an outdoor wooden deck, where many people are standing, some seated on benches. The camera then pans to a group of people sitting on a rock overlooking a body of water. Next, the scene shows a woman with short dark hair speaking to a large group of seated people inside a wooden structure. Following this, a group of people are sitting around a table working on the construction of something involving small parts of what appear to be birds. The video then cuts to an outdoor puppet show where a man is reading off a piece of paper next to a puppet stand. The camera switches back to the woman with grey hair on the red couch, talking and gesturing. The video then shows another room with a group of people looking at a sculpture with a man explaining something. The video returns to a dark wooden structure where a man is giving a lecture while standing in front of a white screen. Then the camera cuts again to the woman with grey hair on the red couch talking and gesturing. The scene changes to a group of people outdoors seated or standing on a wooden deck. A woman with an apron is talking to a group of people while standing under a tree. The video then cuts to a large group of people seated at long tables in a field, with a band on a small stage in front of them. The video shows a shot of a white angular building in a field, followed by a view of it with people seated around. The camera focuses again on the woman with grey hair on the red couch talking and gesturing. 
The next scene shows a group of people standing together outdoors in the dark, followed by a view of a larger group of people standing in front of a wooden structure. Next the video shows two people near what appears to be a bonfire. The final sequence of the video shows the bonfire burning with the people observing it. The last cuts show a close-up of the woman with grey hair, and finally a shot of sparks and ashes rising in the dark.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, in front of the fire, where is the woman wearing glasses positioned in relation to the man in black?",
+ "answer": "C",
+ "candidates": [
+ "A. In front.",
+ "B. Left side.",
+ "C. Right side.",
+ "D. Behind."
+ ]
+ }
+ },
+ "UKPLuuYC": {
+ "video_id": "UKPLuuYC",
+ "video_duration": "634s",
+ "duration": ">8min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A screen recording of the mobile game *Clash of Clans* is shown. A person wearing a blue hoodie can be seen on the left side of the screen, watching the gameplay. The game screen shows a Town Hall level 8 base. The player opens the in-game shop and views the \"Treasure\" tab. Then they switch back to the main base view and tap on the laboratory. They open the upgrade window for the Inferno Tower, and cancel. The player then opens a menu to construct or upgrade defenses, then goes back to the base view. They click on the X-Bow, which is currently being upgraded. They click on an Inferno Tower and select the \"Info\" button, then tap on another Cannon. Next, they click on the \"Attack\" button, initiating a raid on another player's base. After briefly observing the other base, they go back to their own village. The Dark Elixir drill is selected, and its information panel opens. Then the player selects a Dark Barracks and returns to the main village view.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the final count of green gems?",
+ "answer": "A",
+ "candidates": [
+ "A. 218.",
+ "B. 302.",
+ "C. 250.",
+ "D. 151."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did the man upgrade his town hall to level 11?",
+ "answer": "D",
+ "candidates": [
+ "A. I am not sure.",
+ "B. None of the above.",
+ "C. Yes, he did.",
+ "D. No, he did not."
+ ]
+ }
+ },
+ "jsxscTPD": {
+ "video_id": "jsxscTPD",
+ "video_duration": "217s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Celebrity Interviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with long brown hair and a man with short hair are sitting across from each other at a round table. The woman is wearing a black top. The man is wearing a black athletic jersey with gold stripes on the sleeves. The background is black. On the table is a red card with white text on it. There are also five glass bottles on a shelf behind the man. The woman is holding the red card. At several points, the woman places the red card face-down on the table. She sometimes looks at a red mobile phone she is holding. During the video, the two speak to each other. The woman picks up two bottles and gives one to the man. They clink bottles together.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between Thibaut Courtois and Eden Hazard in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Referee.",
+ "B. Teammates.",
+ "C. Coach.",
+ "D. Opponents."
+ ]
+ }
+ },
+ "gjCMwBZS": {
+ "video_id": "gjCMwBZS",
+ "video_duration": "25s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man and a girl walk on a sidewalk. The man is wearing a white cap, a long light-colored robe, and blue and red sneakers. He is holding a book. The girl is wearing a light blue denim jacket, black pants, and brown sandals. A black car approaches from behind and stops beside the man and the girl. A man wearing sunglasses and a black baseball cap, sitting in the driver's seat, speaks to the man and the girl on the sidewalk. The man in the car is holding a red energy drink can. The man and the girl on the sidewalk speak in response. The man in the car speaks and then he and the girl in the car clap their hands together. The man and the girl on the sidewalk speak to each other.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the girl with her hair down positioned in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Sitting next to the man wearing black clothes.",
+ "B. Sitting on the lap of the man wearing black clothes.",
+ "C. Sitting on the lap of the man wearing a white long robe.",
+ "D. Sitting next to the man wearing a white long robe."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes does the man wearing a white long robe undergo in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. From happiness to anger to calm.",
+ "B. From calm to anger to happiness.",
+ "C. From anger to calm to happiness.",
+ "D. From calm to happiness to anger."
+ ]
+ }
+ },
+ "nZFMFiMC": {
+ "video_id": "nZFMFiMC",
+ "video_duration": "452s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a man wearing a plaid shirt and a baseball cap, sitting at a desk in front of a computer and speaking to the camera. He makes a variety of hand gestures. He is also holding and showing different camera equipment. The scene then transitions to a video editing program, and then back to the man holding a camera. Then the scene transitions to an outdoor scene showing a person riding a one-wheeled scooter. Then there are several shots of the person riding the one-wheeled scooter, interspersed with indoor shots of a living room and shots of the same man sitting in a computer chair. Then there are some graphic animations and close-ups of a camera lens. Then there's a shot of two men in the same room. Then a man is seen using a drone controller. Then there are a series of shots of a large, white house and surrounding areas. The scene transitions to an orange K logo. The video then returns to the man sitting at the desk and making hand gestures, showing one of his fingers, making a fist, and then waving his hand at the camera. Then the man is using and showing different camera equipment. Then there are animated triangles, the text \"1. SHUTTER SPEED\", followed by the man holding a camera and showing various camera equipment. Then he shows a camera sensor, and then shows a series of rapid hand motions. Then the man shows a camera lens, and the video shows a view from inside of the lens. There are animated arrows and the text \"2. ISO\", followed by the man making hand gestures, and showing the camera, then a graphic animation with an overlay that states \"IGNORE OVEREXPOSURE\". Then there's an animation showing the text \"3. APERTURE\", then the man does various hand motions and shows the inside of a camera lens, then there are drawings, with the man showing different parts of the diagram. The scene transitions back to the man making hand gestures, showing different camera equipment, and the K logo.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man wearing a plaid shirt in the video adjust the lights to be brighter?",
+ "answer": "C",
+ "candidates": [
+ "A. Because the room is too dark, making it difficult to see clearly.",
+ "B. To test the maximum aperture capability of the camera.",
+ "C. To compensate for the darkening of the image caused by adjusting the shutter speed.",
+ "D. To create more motion blur effects."
+ ]
+ }
+ },
+ "YslZiyOL": {
+ "video_id": "YslZiyOL",
+ "video_duration": "568s",
+ "duration": ">8min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video starts with a view of Earth from space. A golden logo appears, rotating, and gradually reveals the letters \"ICNA\" beneath it. The logo changes color to yellow, then to green and yellow. Next, the scene changes to a large audience sitting in an auditorium. Many of the women in the audience are wearing hijabs in various colors. They clap and wave their hands. A man with a beard, wearing a red jacket and white robe, is shown speaking on a stage in front of a light blue screen. He is gesturing with his hands. The backdrop displays various logos, including one that reads \"HUMANITY.\" To the lower left, an overlay displays text: \"BOONAA MOHAMMED,\" \"Dubbed as the 'Voice of a Generation',\" \"ICNA-MAS Convention\". The camera angle switches back and forth between the speaker and the audience. There are potted plants visible between the audience and the stage. The camera zooms out on the audience, revealing the scale of the crowd. The overlay on-screen changes: the text remains the same, but the circle logo is now green and contains Arabic script. The camera zooms in slightly on the audience, then returns to the speaker. The speaker gestures with his hands, walks from side to side, and occasionally sips from a water bottle. The backdrop changes; a large poster next to him reads: \"50 YEARS OF SHARING ISLAM SERVING HUMANITY,\" \"MARCH 30 - APRIL 1,\" \"EASTER WEEKEND,\" \"BALTIMORE,\" \"SHARE YOUR GEMS & EXPERIENCE WITH US! #ICNA2018,\" and various social media icons. The backdrop changes again, showing various colorful logos, most prominently \"ICNA.\" The camera continues to switch between the speaker and the audience. Some audience members hold up cell phones, seemingly recording the speech. The audience is shown clapping. The video ends with a black screen displaying the text: \"Volunteer & Support,\" \"WWW.ICNA.ORG,\" \"find @ICNA,\" with Facebook, Twitter, and YouTube icons. Additional text reads \"Copyright \u00a9 2012 ICNA. 
All Rights Reserved. Any reproduction without written permission is prohibited.\"",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the word displayed in the center of the large screen behind the man in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. BOONAA MOHAMMED.",
+ "B. ANNIVERSARY.",
+ "C. SHARING ISLAM.",
+ "D. YOUTH."
+ ]
+ }
+ },
+ "afvclsRh": {
+ "video_id": "afvclsRh",
+ "video_duration": "86s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with short curly brown hair and a black hooded sweatshirt appears in front of a dark backdrop. A smaller inset image in the upper left corner shows a rocky, grassy coastline and water. The same person is then shown, with the inset image still in the upper left corner. Four individuals are shown in an outdoor setting, standing in front of a wall, with plants along the bottom. Three individuals wear masks, and all wear jackets or sweaters. The individual in the center is wearing a red jacket and a red hair covering. A dark red screen appears, with white text reading \u201cWHAT WE\u2019VE ACHIEVED\u201d. A white icon of a thermometer with an Earth globe appears on the right side of the screen. The text and icon then appear against a bright red background. The screen then returns to a dark red background, with text and icon visible. A bright red screen shows the text \u201cCOLLABORATIVE LEARNING\u201d in white, followed by more white text explaining the concept. A photograph on the right side shows multiple people sitting on a floor, writing and drawing on papers. The screen shows the same text with a different camera angle of the photograph. A bright red screen shows the text \u201cIDEAS LAB\u201d in white, followed by more white text explaining the concept. A photograph on the right side shows two people holding up large drawings, in an indoor setting. The photograph is updated to show a different camera angle, and then a new person is added to the shot, then another one, with all of them holding up large drawings. A bright red screen shows the text \u201cLOCAL ACTIONS\u201d in white, followed by more white text explaining the concept. A photograph on the right side shows multiple people working in a garden, planting or tending to plants. 
The screen shows the same text with a different camera angle of the photograph. A bright red screen shows the text \u201cPERSONAL & SCHOOL PLEDGES\u201d in white, followed by more white text explaining the concept. A photograph on the right side shows several individuals holding up written pledges. The screen shows the same text with a different camera angle of the photograph. The screen then transitions to a dark red background. A classroom setting with several people seated at tables, and one person standing and speaking is shown. All the people in the classroom wear masks. The image is then superimposed over itself several times, and the view changes to show several individuals standing in a row, holding small white signs. They all wear masks and school uniforms, and one has brightly colored hair. Several individuals are shown standing in an indoor setting, with papers spread out on a table. All the people present wear masks, and some have their hands together as if applauding. The setting changes to show the same group with a board game on the table. The scene is then obscured with the setting being superimposed over itself and then switches to a setting in a library with four people around a game board. The scene is then obscured again. Multiple people are shown in front of a brick wall, as one person in a white lab coat speaks while gesturing. All the individuals wear masks, and some wave to the camera. The individual in the lab coat gestures to the camera again. A screen showing a video conference call with multiple individuals in different locations is shown, with several people waving. A person holds up a sign that says \u201cTHANKS FOR THE OPPORTUNITY TO MAKE CHANGE\u201d. The view then fades to black, and then text appears against a dark red background reading \u201c1.5 MAX! REAL WORLD CLIMATE EDUCATION.\u201d A thermometer with a globe appears to the right of the text, and more text appears at the bottom describing the organization.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does a man making a peace sign towards the camera appear?",
+ "answer": "D",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Does not appear.",
+ "C. At the beginning of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Who is the second speaker?",
+ "answer": "A",
+ "candidates": [
+ "A. A little girl.",
+ "B. A little boy.",
+ "C. A woman.",
+ "D. An elderly professor."
+ ]
+ }
+ },
+ "BKTuuqOX": {
+ "video_id": "BKTuuqOX",
+ "video_duration": "138s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with light skin and short, light-brown hair, wearing a green collared shirt, is centered in the frame. The person\u2019s facial expressions change throughout the video, and they frequently gesture with their hands while speaking. Behind them is a window with closed wooden blinds and greenery visible outside. To the left of the person, on a surface, are several tools including an orange handled tool. At certain points of the video, a banner with the Australian flag appears at the bottom of the frame, along with text. The text includes, in order: \u201cFair Go\u201d, \u201cFar Out\u201d, \u201cHooroo\u201d, \u201cG\u2019Day\u201d, \u201cServo\u201d, \u201cGet on the Horn\u201d, \u201cTrue Blue/Fair Dinkum\u201d, \u201cBarbie\u201d, \u201cBiff\u201d, \u201cMungin\u2019\u201d, \u201cShit-shaped\u201d, \u201cTwo Balls in a Sack\u201d, \u201cHard Yakka\u201d, \u201cPiss-up\u201d, \u201cWoop Woop\u201d, and \u201c[Going] Walkabout\u201d. Additionally, there are three brief scenes showing images other than the person. First there is an image of a white truck driving on a dirt road, with a similar Australian flag banner. The second image is of several packages of meat snacks hanging from hooks, with a similar Australian flag banner. The third image is a close-up of a plate of food consisting of a main course, a side of chips, and a salad, again with a similar Australian flag banner. Another image is presented at one point: a red map of Australia. Finally, there is an image of a case of beer with a similar flag banner.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video is alcohol mentioned?",
+ "answer": "C",
+ "candidates": [
+ "A. It is mentioned at the beginning of the video.",
+ "B. It is not mentioned.",
+ "C. It is mentioned at the end of the video.",
+ "D. It is mentioned in the middle of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the meaning of the red map that appears at the end?",
+ "answer": "D",
+ "candidates": [
+ "A. To explain the term \"going walkabout.\"",
+ "B. To explain the term \"Barbie.\"",
+ "C. To explain the term \"Hard Yak.\"",
+ "D. To explain the term \"Woop Woop.\""
+ ]
+ }
+ },
+ "RoIHxoNv": {
+ "video_id": "RoIHxoNv",
+ "video_duration": "605s",
+ "duration": ">8min",
+ "domain": "Games",
+ "sub_category": "Sports Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two masked wrestlers, one in a predominantly silver and blue outfit and the other in blue and yellow, circle each other in a wrestling ring. The wrestler in silver and blue grabs the other wrestler and takes him down, then appears to attempt a submission hold. The referee watches closely. The wrestlers stand up, and the one in silver and blue takes the other one down again, this time flipping him over. After the referee counts two slaps on the mat, the wrestler in blue and yellow gets back on his feet, but he is quickly taken down again by the wrestler in silver and blue. The wrestler in silver and blue pins the other wrestler\u2019s shoulders to the mat while the referee counts the pin. The wrestler in silver and blue wins the match and raises his arms in victory.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the man in silver-blue pants and the man in the striped shirt in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Athlete and coach.",
+ "B. Teammates.",
+ "C. Opponents.",
+ "D. Athlete and referee."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What does \"MANTEQUILLA\" mean in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. It is the name of the man wearing black clothes.",
+ "B. It is the name of the man wearing a striped shirt.",
+ "C. It is the name of the man wearing yellow and blue pants.",
+ "D. It is the name of the man wearing silver-blue pants."
+ ]
+ }
+ },
+ "bvQEVvCH": {
+ "video_id": "bvQEVvCH",
+ "video_duration": "83s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "Greyhounds are in starting boxes. The boxes open and the greyhounds run out. The dogs in red and light blue jackets take an early lead. The dogs race around the first bend, with the red dog maintaining the lead. The red dog continues to lead along the back straight. The camera follows the dogs as they run along the back straight and around the far bend. The dogs bunch up going into the home straight, the red dog is still just in front. The dogs approach the finish line; the red dog wins, with the light blue dog in second. The dogs cross the finish line. The camera shows the results board displaying numbers 8, 6, 3, 7, and 4 in the top 5 positions, a time of 29.77, and a speed of 17.30 m/s.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many dogs competed in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Six.",
+ "B. Eight.",
+ "C. Five.",
+ "D. Seven."
+ ]
+ }
+ },
+ "qEUGsBFo": {
+ "video_id": "qEUGsBFo",
+ "video_duration": "40s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a red rectangle on a blue background. This is followed by a blue banner with white lettering that says \"BETTER TOGETHER,\" placed over two red rectangular shapes on the same blue background. Four characters appear, each in their own panel. The first panel has a grey rabbit holding a carrot. The second panel has a woman in red and black attire with a bat. The third panel shows a man with a black cape and cowl, and the fourth panel shows a yellow, dog-like creature. Below each character are the character's name and the level. Text reads \"WAITING FOR OTHER PLAYERS\" and a countdown. Then, a man with a bald head and black shirt is shown, standing with open palms and a \"MULTIVERSUS\" logo in the background. The next image reads \"2 VS 2\" in black lettering on a background that is half blue and half red. Then, the video shows 4 animated characters fighting on a platform. The background includes a large screen and a silhouetted dinosaur. A woman with red hair and pink attire uses a laser. A character with grey hair and a black cape moves. Then, a grey silhouette of the cape-wearing character is seen in a cloud of white smoke. A character with a red shirt and blue shorts is seen on a wood platform, next to a woman in blue clothing holding a sword. The characters are displayed against a blue sky with a boat above a green house. The two characters are seen again. The video shows a fight scene with various characters in front of waterfalls. Two brown objects move on a wooden platform. The man with a bald head and black shirt is seen again. A fight scene shows several characters on a platform in front of a waterfall, including the red shirt character from before. Another fight scene is seen on the platform with the dinosaur silhouette. The man with a black shirt and bald head is seen once more. Finally, the words \"WINNER YOUR TEAM\" appear in white lettering on a blue background, and two animated characters stand side by side.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the final state of the box in the third scene in the middle of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Destroyed.",
+ "B. Closed to open.",
+ "C. Hidden.",
+ "D. Open to closed."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Did Harley Quinn's team ultimately win the competition in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes, Harley Quinn's team won the competition.",
+ "B. I am not sure.",
+ "C. No, the Arya Stark team won the competition."
+ ]
+ }
+ },
+ "eDWlxbmr": {
+ "video_id": "eDWlxbmr",
+ "video_duration": "151s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man wearing a gray v-neck t-shirt with the word \u201cHawaii\u201d on it speaks against a blurred green background. An image of an Asian man wearing a brown suit jacket, a white collared shirt, and a green and white striped tie sits at a table wearing a headset with a microphone. He is smiling and looking at the camera. There is a blurry crowd of people seated in what appears to be stadium seats behind him. The next image shows the same man, now wearing a blue suit jacket and a blue paisley tie, smiling at the camera next to an older white man with short white hair wearing a black suit jacket, a white collared shirt, and a black bow tie. The older man is also smiling. Blurry figures can be seen behind them. The next image shows a split screen; on the left, the Asian man is wearing a gray suit jacket and a light blue collared shirt. There is an ESPN logo on the bottom left of the image. On the right is a portrait of Robert E. Lee. The next image shows a statue of a man on horseback. A white church with a green steeple is in the background. The next image shows the Asian man in the headset again, this time against a yellow blurry background, looking and speaking towards something off camera.\u00a0The next image shows the Asian man standing next to a much taller dark-skinned man. Both are wearing suit jackets and ties, and they are each holding ESPN microphones.\u00a0Text then appears: \u201cFrom ESPN PR: \u2018We collectively made the decision with Robert to switch games as the tragic events in Charlottesville were unfolding, simply because of the coincidence of his name. In that moment it felt right to all parties. It's a shame that this is even a topic of conversation and we regret that who calls play-by-play for a football game has become an issue.\u2019\"The next image shows a dark-skinned woman in a red sleeveless dress standing in front of a television screen. 
The image on the television screen shows the Asian man holding an ESPN microphone.The man in the gray \u201cHawaii\u201d shirt appears again. A Fox News Alert flashes onscreen beneath a Tweet that reads: \u201cFun fact: ESPN pulls an announcer from a broadcast because of his name. Robert Lee. He is Asian. Don\u2019t think this is going to stop.\u201d Next to the Tweet is a picture of the Asian man smiling at the camera next to an older man.\u00a0The man in the \u201cHawaii\u201d t-shirt speaks again. A photo of the Asian man and a blonde woman sitting at a table with headsets on is shown. They are smiling at the camera. A red curtain is behind them. Blurry red stadium seats can be seen behind the curtain.\u00a0The man in the \u201cHawaii\u201d t-shirt appears and continues speaking. The word \u201cXING\u201d appears on a black banner across the screen. The man continues speaking, before the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What announcement did ESPN make in the video regarding Robert Lee?",
+ "answer": "B",
+ "candidates": [
+ "A. How Robert Lee was targeted by ESPN.",
+ "B. Why Robert Lee was terminated from his position.",
+ "C. How Robert Lee abused his position to steal confidential information.",
+ "D. What Robert Lee's future plans are."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many bracelets is the woman wearing black in the video wearing on her wrist?",
+ "answer": "D",
+ "candidates": [
+ "A. Four.",
+ "B. Three.",
+ "C. Two.",
+ "D. Five."
+ ]
+ }
+ },
+ "qUgVFWdG": {
+ "video_id": "qUgVFWdG",
+ "video_duration": "163s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen, and then displays a light gray logo with several geometric shapes including two connected letter \u201cG\u201ds, and then some parallel lines representing \u201cwings\u201d on either side. Text reading \"Gravity Skateboards\" appears in front of the logo. The text changes to \"Presents\", and then to \"Slide Tips\". Then, the scene changes to a shot of two men standing on a road. One man is wearing a white shirt and red shorts and holding a skateboard. The other is wearing a brown shirt and light gray shorts. He drops his skateboard to the ground. The text \"Check Slides\" appears over the shot. The scene cuts to the man in the brown shirt, now riding his skateboard down the road. The scene shows a dog running alongside him and then the man performs several tricks on his skateboard, including a \"power slide\". The scene cuts back to the two men standing next to each other on the road. The man in the brown shirt performs a series of skateboard tricks, dropping his board several times. The man in white shirt is standing in the background, looking on. The scene then cuts to several different shots of the men performing skateboard tricks down the road. In some instances, there are other people seen in the background. The names \"Sergio Yuppie\", \"Will Jolly\", \"Jesse Parker\", and \"Robbie Lyons\" flash on the screen with the associated skater. The video then ends with a black screen and a gray logo appears followed by the text \u201cGravity\u201d in white letters with the logo on either side, and below it, the text \"Skateboards\", and then \"Productions\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Which part of the video shows four people practicing skateboarding together?",
+ "answer": "D",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. Throughout the video.",
+ "C. At the beginning of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What are the man in the brown T-shirt and the man in the white T-shirt trying to do while interacting on camera?",
+ "answer": "D",
+ "candidates": [
+ "A. Debate who can complete the action in the shortest time.",
+ "B. Discuss the history of skateboarding.",
+ "C. Teach the audience how to perform tricks on a skateboard.",
+ "D. Teach the audience techniques for sliding."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who said, \"one more trick like a check slide\", in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A man in a white T-shirt.",
+ "B. A man in a brown T-shirt.",
+ "C. A man in a black T-shirt.",
+ "D. A man in a blue T-shirt."
+ ]
+ }
+ },
+ "uOGnffeq": {
+ "video_id": "uOGnffeq",
+ "video_duration": "131s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Academic Lectures",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video starts with four animated dinosaurs wearing green hats and green neck scarves inside a cave. The cave has several large mushrooms, and plants with long fern-like leaves. There is also a decorative branch with hanging yellow lights.The dinosaurs face the viewer while some text appears at the bottom of the screen.The dinosaurs move around, and then the camera moves closer to a wall inside the cave with the letters of the alphabet displayed. Some of the letters are partly covered by a stone-like material. As each letter is revealed, the name of a dinosaur is shown.The letters and names are shown in order: A and Ankylosaurus, B and Brachiosaurus, C and Compsognathus, D and Deinonychus, E and Elasmosaurus, F and Fukiraptor, G and Gallimimus, H and Hagryphus, I and Iguanodon, J and Jaxartosaurus, K and Kentrosaurus, L and Lambeosaurus, M and Megalosaurus, and N and Nodosaurus.A group of five dinosaurs is then shown again as the wall of letters and dinosaurs is shown again. The dinosaurs are shown from a distance, then closer again as the remaining letters of the alphabet are displayed.The letters and names are shown: O and Oviraptor, P and Protoceratops and a pteranodon is also mentioned, Q and Quetzalcoatlus, R and Rhamphorhynchus, S and Spinosaurus, T and Tyrannosaurus, U and Utahraptor, V and Velociraptor, W and Wannanosaurus, X and Xenotarsosaurus, Y and Yangchuanosaurus, and Z and Zephyrosaurus.The video ends with the dinosaurs in a cave in front of the wall of alphabet letters and dinosaurs again, with some text displayed.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "At the end of the video, how many glowing insects fly away?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Six.",
+ "D. Five."
+ ]
+ }
+ },
+ "MZjJczML": {
+ "video_id": "MZjJczML",
+ "video_duration": "248s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The words \"Drive Impressions\" appear on a black screen. The video transitions to a view of a tree-lined road, and a green Alfa Romeo Giulia Quadrifoglio drives toward the camera. The perspective shifts to the interior of the car, showing a man with a baseball cap and dark t-shirt driving. The camera switches to a low angle, showing the road ahead from the front of the car as it drives through the countryside. Fields and trees are visible on both sides of the road.The camera returns to the driver's perspective inside the car, and after a while switches back to the front of the car, showing the road.\u00a0 This switching of camera angles happens several times.\u00a0 One shot shows the road from a low angle near the right front wheel, highlighting the green four-leaf clover emblem on the car. Another low angle shot focuses on the exhaust pipes as the car moves.Cut to footage of the same green Alfa Romeo Giulia Quadrifoglio driving aggressively around a race track. The car drifts through corners, emitting smoke from the tires.Back on the tree-lined country road, the driving and interior perspectives alternate again. This time, a rear view of the car driving is shown.\u00a0 The rear perspective follows the side of the car near the right rear wheel, again showing the clover emblem.\u00a0The words \"Final Thoughts\" appear on a black screen. The video then shows close-up, detailed views of the green Alfa Romeo parked by a lake: first the front headlight, then the hood with vents and the front grille, then the engine, the underside of the hood, the front again, the four-leaf clover emblem, the trunk and Alfa Romeo logo, and finally the entire side of the car.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Which sound does not appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The sound of a car engine.",
+ "B. The sound of a car horn.",
+ "C. Background music.",
+ "D. A man's speaking voice."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Does the video show a car taking a sharp turn at high speed?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. I don't know."
+ ]
+ }
+ },
+ "vuKmMrUI": {
+ "video_id": "vuKmMrUI",
+ "video_duration": "72s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two people are in a kitchen. One person wearing a navy blue shirt with a white graphic on it is holding a small green object, which they open and from which they remove a small piece of white paper with a number on it. This person also holds a small orange object. The second person is wearing a long sleeved white shirt and is kneeling down. The person in white gets up and takes a round white object and puts it in their mouth and blows on it. The person in the navy shirt is also manipulating a small blue object. The person in white removes the blown-up white object from their mouth and does something with its edge. The person in the navy shirt is also doing something with a small blue object and then places their hand on the counter. Both people blow up large white and blue objects that are round and made of rubber. The person in white manipulates the edge of the white object and then both people hold the blown-up objects.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the black boy do with the balloon?",
+ "answer": "B",
+ "candidates": [
+ "A. The balloon exploded on the second attempt.",
+ "B. The balloon started to leak on the second attempt.",
+ "C. He blew the balloon to its maximum size at once.",
+ "D. He was unable to inflate the balloon successfully."
+ ]
+ }
+ },
+ "aBmCXbvX": {
+ "video_id": "aBmCXbvX",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A woman with a black bow headband speaks in profile view, then faces the camera while gesturing. She is wearing a white top and a colorful beaded necklace. She is looking out a window in profile view. A pair of hands type on a laptop keyboard showing lines of code on the screen. The woman continues speaking in profile view. Abstract animation shows blue and purple oblong shapes connected by white dots, forming a wavy line, which rearranges into a logo shape. A bright blue cloud icon appears and fills with small white letters and numbers. A second cloud icon appears alongside and also fills. The woman continues speaking. Close up shot focuses on the reflection of a computer monitor in a person's eyeglasses. Two hands swipe across tablet screens, one screen showing icons and the other displaying text. The woman types on her laptop keyboard. She sits on a purple couch across from a man wearing a gray t-shirt, and shows him the laptop screen. They speak. The word \"ENCRYPTED\" appears on screen against a white background. An abstract animation depicts a flow between cloud icons and a black diamond shape with a hazard symbol. The cloud icon appears again, and a dark circle moves across the screen. Two black hemisphere shapes are divided vertically, with binary code along the flat sides. Icons of a person, medical form, and credit card appear between them, labeled \"Financial Accounts,\" \"Health Records,\" and \"Consumer Data,\" respectively. Hazard symbols appear on either side of the icons, and lines connect them. The word \"Healthcare Records\" appears alongside a heartbeat icon inside a heart shape. The words \"Personal Banking Data\" appear alongside a line chart. Abstract animation shows two purple lines emanating from a point and merging into one. Three circles containing simplified line diagrams are labeled \"AI,\" \"Machine Learning,\" and \"Data Science.\" The woman continues speaking in profile view.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, what is the order in which the following three topics are presented? (A) The advantages of International Business Machines, (B) The mission of the scientists, (C) The content of the research.",
+ "answer": "C",
+ "candidates": [
+ "A. (C) (A) (B).",
+ "B. (A) (B) (C).",
+ "C. (B) (A) (C).",
+ "D. (A) (C) (B)."
+ ]
+ }
+ },
+ "kHsSCiNB": {
+ "video_id": "kHsSCiNB",
+ "video_duration": "79s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two people are inside a ring surrounded by four ropes. One person is wearing black shorts and black arm bands. The other person is wearing black pants and a black vest with red markings. A referee wearing a black and white striped shirt is also inside the ring. There is a crowd in the background. The person in the shorts and arm bands does a variety of wrestling moves on the person in the pants and vest. The person in the pants and vest also does wrestling moves on the person in shorts and arm bands. At one point the person in pants and vest falls to the floor of the ring. The person in shorts and arm bands wins and stands above the person in pants and vest. Both are lying on the floor. The referee kneels near them on the ring floor. The person in the vest lays on his back with one arm across his head.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "How did the man wearing the black and red vest defeat the man wearing black shorts?",
+ "answer": "C",
+ "candidates": [
+ "A. The man wearing the black and red vest quietly circled behind his opponent and pushed him down from behind.",
+ "B. The man wearing the black and red vest knocked his opponent down with a side kick.",
+ "C. The man wearing the black and red vest suddenly charged at his opponent and executed a takedown.",
+ "D. The man wearing the black and red vest suddenly charged at his opponent and applied a cross lock."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the man in the black and red vest and the man in the striped shirt in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Teammates.",
+ "B. Teammate and referee.",
+ "C. Teammate and coach.",
+ "D. Opponents."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who struck the wrestling platform at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A man wearing black shorts.",
+ "B. A man wearing a black and red tank top.",
+ "C. No one struck the wrestling platform.",
+ "D. A man wearing a striped shirt."
+ ]
+ }
+ },
+ "zusXBQwI": {
+ "video_id": "zusXBQwI",
+ "video_duration": "180s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows several scenes of people gathering in different locations. Initially, there's a view of people walking along a waterfront with city buildings in the background. Then, there's a group of people holding a large banner that reads \"CLIMATE JUSTICE\". The video then shows individuals holding signs, some speaking into a microphone. Another scene depicts a group sitting at tables indoors. Later, a person is interviewed, followed by a close-up of a banner that reads \"CLIMATE JUSTICE YOUTH SUMMIT\". A different person is then seen speaking into a microphone, followed by two other individuals doing the same. The video shows a young person holding a slingshot and a water bottle while speaking into a microphone. Several people are seen in an indoor setting, at tables with items on the table in front of them. The video then shows a person with long hair and a cap on. A woman with dreadlocks speaks directly into the camera, followed by a group holding signs that say 'CLIMATE JUSTICE' and other phrases. There is a shot of a sign that has 'SOS' with a globe in place of the 'O'. A number of people are seen on city streets and parks holding various signs and banners with messages related to climate and climate justice. A few police officers are seen walking amid a large crowd of people. Towards the end there are scenes of an indoor gathering with individuals at tables, and a woman speaking at a podium. At the end, a number of different signs are shown including one that says \"URGENCY!\" and \"If you don't act like adults we will\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the identity of the middle-aged woman wearing a blue suit in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Protester.",
+ "B. Artist.",
+ "C. Opponent of the event.",
+ "D. Government leader."
+ ]
+ }
+ },
+ "PbwydzpV": {
+ "video_id": "PbwydzpV",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man in white and purple uniform is kneeling on the grass with his head in his hands. The camera cuts to a shot of a crowd of people with red flags, with their hands raised in the air. A man wearing a red and black striped jersey walks through a crowd of players. The camera pulls back to show a soccer field with men in white and red uniforms moving the ball across the field. The camera pulls in closer as the man in white runs towards the goal. The camera shows a view from the end line of the field as the ball travels toward the goal where the goalie is attempting a save. The ball enters the goal. The crowd is shown raising their hands in the air. The man in red and black has both arms raised, and he is smiling. Two men in red and black jerseys hug. A man and woman in red and black jerseys are shown holding a black phone. The man in red and black flexes his fist and looks down. A family wearing red and black jerseys is shown. The soccer game resumes and the men in white and red are in motion. The ball flies towards the goal. A player in red and black kicks the ball into the goal. A man in a red and black uniform is shown celebrating a goal and the crowd in the background is clapping. A man in a green uniform is looking to his left and raising his hand. The man in the red and black uniform is seen with both arms raised. Two players in red and black are seen hugging. The camera shows a close up of a man in a red and black uniform high fiving a member of the crowd. The players on the red and black team are all lined up and waving to the fans. The camera cuts to a large crowd clapping with their hands in the air. The camera focuses on a man walking away from the camera in a dark coat.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the red team's player number 11 score a goal?",
+ "answer": "B",
+ "candidates": [
+ "A. The goal occurs at the end of the video.",
+ "B. The goal occurs in the middle of the video.",
+ "C. The goal occurs at the beginning of the video.",
+ "D. The goal occurs throughout the entire video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What action does player number 11 take immediately after intercepting the ball from the white team in the middle of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The player passes the ball to a teammate behind them.",
+ "B. The player passes the ball to a teammate in front of them.",
+ "C. The player passes the ball to a teammate on their left.",
+ "D. The player kicks the ball into the goal."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the role of the bald man in yellow clothing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A goalkeeper.",
+ "B. A referee.",
+ "C. A security officer.",
+ "D. A judge."
+ ]
+ }
+ },
+ "zpbEcSbm": {
+ "video_id": "zpbEcSbm",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Fashion",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man in a dark suit, a light blue shirt, and a gray-and-white striped tie is standing against a gray background. White text appears superimposed on the video. The man moves his hands and speaks. A close up of the cuff and pocket of a jacket with white x-shaped stitching. A close up of a pocket of a jacket with white loose stitches. Another close-up of a pocket of a jacket with white loose stitches. The man is shown again and speaks. A navy blue pinstriped suit on a mannequin. A rack with several suits hanging. A man is shown again and speaks. Scissors with a black handle appear against a white background. The scissors are shown near a cuff with x shaped stitching. The man is shown again. He points toward the right chest pocket of his jacket and puts his fingers in the pocket and then reaches into the back pocket of the jacket and then speaks again. A man wearing a blue suit and sunglasses stands outside against a brick building. The man is seen outside against a grey metal railing wearing a pinstripe suit. A man wearing a light grey suit, a blue shirt, and a patterned tie is shown outside with a dark suitcase in hand. The man is shown again against the gray background wearing a navy suit, light blue shirt, and patterned tie, he moves his hands and speaks.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the photo of a man with short brown hair wearing a suit appear?",
+ "answer": "D",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the beginning of the video.",
+ "C. Throughout the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "HkrzDkLp": {
+ "video_id": "HkrzDkLp",
+ "video_duration": "516s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman with two braids is sitting in front of a blank canvas on an easel. A smaller video of Bob Ross painting plays in the bottom right corner. As Bob Ross paints, she follows his instructions and paints along on her own canvas. She holds up a fan brush, and then dips it into a palette of various paints. She paints Indian yellow onto the canvas, spreading the paint with horizontal strokes. She then dips the brush in Alizarin Crimson and paints horizontal lines below the yellow. She blends the two colors. She dips the brush in Phthalo blue, paints strokes across the top of the canvas, and blends the blue with the other colors. She uses a clean brush to blend the colors further, dabbing the canvas vertically. She dips the fan brush in titanium white and adds highlights to the yellow part of the canvas, creating the impression of a glowing sun or clouds. She uses a palette knife and titanium white to scrape clouds across the painting. She mixes phthalo green and cad yellow on the palette, dips a thin brush into the mix, and creates small evergreen trees on the canvas. Using the same colors with the fan brush, she begins creating a hillside below the trees. She uses the palette knife with phthalo blue and mixes it with white, before dragging it across the bottom of the canvas to create a lake. She touches up areas of her painting with white and Alizarin Crimson. She dips the fan brush in Van Dyke brown and begins adding shading to the hillside and trees. She uses her finger to add a highlight to the lake. She adds more trees. Using a clean fan brush, she continues adding highlights to the lake and shading to the bottom of the hillside and bottom of the trees. She wipes off her paint-covered hands with a cloth.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What action did the woman in the video take immediately after applying yellow paint?",
+ "answer": "C",
+ "candidates": [
+ "A. A splash of red paint is added on top of the yellow and blended.",
+ "B. A splash of purple paint is added below the yellow and blended.",
+ "C. A dash of red paint is added below the yellow and blended.",
+ "D. A touch of purple paint is added above the yellow and blended."
+ ]
+ }
+ },
+ "wbTSBNTa": {
+ "video_id": "wbTSBNTa",
+ "video_duration": "103s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows eight people sitting around a rectangular table covered in a purple tablecloth. There are papers, glasses of water, small dishes of food, and coffee cups on the table. Two people are standing in the background, one holding a camera. The first few shots focus on the group around the table, then zooms in on a man in a blue shirt who is gesturing with his hands. The video then returns to the full table view. Later in the video, it focuses on individuals around the table. At the end of the video, the background becomes orange and text appears that reads \"Smithsonian Cooper-Hewitt, National Design Museum\" and \"The Business of Design\" along with two smaller window views that each contain videos from different parts of the larger video.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the man wearing a black cardigan sweater sitting in relation to the woman wearing a green top in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Directly across from the woman wearing a green top.",
+ "B. Diagonally across from the woman wearing a green top.",
+ "C. To the left of the woman wearing a green top.",
+ "D. To the right of the woman wearing a green top."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who says in the video, \"I think what our user experience and our user designers have done is said this needs to apply.\"?",
+ "answer": "A",
+ "candidates": [
+ "A. A short-haired blonde woman wearing a green top.",
+ "B. A man wearing a blue shirt.",
+ "C. A bald man wearing a gray top.",
+ "D. A short-haired blonde woman wearing a green top."
+ ]
+ }
+ },
+ "DDHXqqEV": {
+ "video_id": "DDHXqqEV",
+ "video_duration": "52s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man wearing a white t-shirt and a light blue cloth wrapped around his waist is seen bending over and working with a stick in an outdoor setting surrounded by trees and plants. He appears to be creating a barrier of sorts. Subtitles on the video read \"I will kill him\". He then says \"Kuddus!\". Another man, wearing a black t-shirt comes up to him. The second man touches the first man and asks \"What happened?\". The first man again is seen working on his structure. He replies \"What happened?\". The man in the black t-shirt asks \"Why did you slap my son?\". The man in the white t-shirt grabs a wooden stick and begins to speak to the other man, while holding the stick. He says \"Your son came to my land\". The man in the black t-shirt asks \"Is that your land?\" The man in the white t-shirt responds, \"My grandfather's land.\" The man in the black t-shirt asks \"Who told you?\" The man in the white t-shirt responds \"Yes, Your father is my father. My son can go your land. That is also my land.\" The man in the white t-shirt then says \"No, it's not your land. It's only my own land\". The man in the black t-shirt responds \"You never can come to my land.\" The man in the white t-shirt responds \"I am your elder brother\". The man in the black t-shirt says, \"You are elder brother, but I don't respect you. Because your son is very stupid. And you are also very stupid\". He then asks again, \"Why did you slap my son?\". The man in the white t-shirt answers \"Because, he came to my land\". The man in the black t-shirt says \"What's the problem? That is my grandfather's land.\" The man in the white t-shirt responds \"It's my own land because I live there.\" Then he says, \"Got it. From this day I won't talk to you\". The man in the black t-shirt says \"Okay fine. You no need to talk to me. Go!\". The man in the black t-shirt points his finger up in the air and says, \"There is no relation between two brothers\".",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the man in the black T-shirt do to the man in the white T-shirt in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Pulled him with both hands.",
+ "B. Pushed him to the ground.",
+ "C. Hugged him with both hands.",
+ "D. Hit him with a stick."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who says \"What happened?\" in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The man in a black T-shirt.",
+ "B. The man in a red T-shirt.",
+ "C. The man in a blue T-shirt.",
+ "D. The man in a white T-shirt."
+ ]
+ }
+ },
+ "tQBLkmtB": {
+ "video_id": "tQBLkmtB",
+ "video_duration": "93s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "People are walking down a city street. Some are carrying signs. There are multiple white police vehicles with flashing lights. Some police officers are walking, and one is riding a scooter. Multiple yellow taxis are visible. Some people are taking pictures with their phones. A group of police officers surrounds a person in a red jacket. The police are holding hands and facing outwards, encircling the person in red. Some of the people are wearing hats. There are several buildings with signs and awnings.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What occurred in the video that caused the crowd to protest?",
+ "answer": "A",
+ "candidates": [
+ "A. Economic difficulties leading to higher unemployment rates.",
+ "B. Racial discrimination.",
+ "C. Environmental issues.",
+ "D. Conflicts in values."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Are two protesters arrested in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. No, only one protester was arrested.",
+ "B. I am not sure.",
+ "C. Yes."
+ ]
+ }
+ },
+ "yxBiWjLy": {
+ "video_id": "yxBiWjLy",
+ "video_duration": "154s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a black screen, then transitions to a shot of a helicopter silhouetted against a starry night sky. The video then shows two helicopters, with text appearing briefly stating \"Active personnel 2,035,000 (2019)\". There is a shot of the sun rising behind the silhouette of a helicopter and the ground. There are tanks moving across a desert landscape at sunset with people on and alongside them. The video cuts to inside a dark plane. The view shifts to an exterior shot showing a black jet aircraft taking off into the air with text appearing on the video stating \"2015, the PLA is the world's third-most powerful military\". Next, a dark, large cargo plane flies through the sky. The view changes to the interior of this plane with many dark figures with red lights all around. A close up on 3 individuals in dark helmets and goggles also appears, with text stating \"52 frigates\". Then a man walks towards a grey jet and then a shot of a man in a dark jump suit closing the cockpit to the jet. The video shows the jets back facing the viewer. Next, a grey jet takes off from a tarmac, and the view changes to the interior of a plane as a person is exiting. A dark large cargo plane is shown dropping crates from its back. Then, a shot from inside the plane shows someone jumping from the plane. Next, text appears on the screen saying \"Air Force has amassed 1,232\". Then, several people in dark military uniforms are standing with rifles in a grassy terrain. A light blue jet is seen flying in the air, with text stating \"3,500 tanks\". The video shows many jets flying through the sky, with text stating \"281 attack helicopters\". Then, there is a shot of a person in a plane's cockpit with a text stating \"spend $237 billion on its armed forces in 2020\". The video shows a grey jet moving through the sky and then the interior view of the jet cockpit with a pilot. A light colored jet takes off from a carrier ship. 
The video shows the view from the cockpit of a jet as it takes off. The video cuts to show a dark grey cargo area. Then, a jet pilot is seen in a cockpit. The view changes to show a jet flying through a light blue sky. The video then shows a dark screen with a silhouette of a head and legs appearing. Next, a person wearing a black military suit with a rifle stands still. There is a close-up of that person and his rifle. The video shows a close up of the rifle with some of its features in clear view. Then there is a shot of a person walking towards a jet taking off. Next, two people in dark blue jump suits and helmets are seen in front of a jet aircraft. A group of men are walking in blue military jump suits. There is a shot of several dark attack helicopters taking off. The video shows many dark attack helicopters flying through a light blue sky. The shot changes to a silhouetted helicopter flying through a darker sky. Then a jet pilot is shown in the cockpit. The video shows a boat moving rapidly through dark blue water and then jet skis moving rapidly through the water. There is a close up of one boat moving quickly. Then, there is a shot of two people firing a machine gun inside of a building. The video then shows a convoy of black cars traveling along a desert road. There are light blue planes seen in the air. The video shows an explosion. People in dark military suits with rifles are marching, and then standing in front of a fire. The video shows a large dark silhouette of a helicopter and then people in military outfits with rifles walking by. There is a large dark blue cargo plane taking off and then a shot of several grey jets taking off on a runway. Then, a dark drone is shown taking off in the sky. A boat in large waves is shown and then another ship moving through the water. Then, a light grey ship is seen moving on the water with a red flag on it. A light grey ship moving on water with a clear view of the whole ship is shown. 
Then, a very large ship with some smaller ships following it is seen on the water. The shot switches to many ships moving in a line on the water. A light grey ship is moving on the water. Next, a grey ship with a big number 999 is shown. The video shows a dark grey ship on the water. Finally, a black screen appears and an explosion is seen. People in dark military suits with rifles are marching. The video ends showing a silhouette of a person holding a rifle.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the chronological order of the events depicted in the video: fighter jet formation flight, helicopter night landing, and naval ship fleet sailing?",
+ "answer": "B",
+ "candidates": [
+ "A. Naval ship fleet sailing, fighter jet formation flight, helicopter night landing.",
+ "B. Helicopter night landing, fighter jet formation flight, naval ship fleet sailing.",
+ "C. Fighter jet formation flight, helicopter night landing, naval ship fleet sailing.",
+ "D. Helicopter night landing, naval ship fleet sailing, fighter jet formation flight."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the scene shown after the tank drives through the desert in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A helicopter takes off.",
+ "B. A battleship is docked.",
+ "C. A helicopter lands.",
+ "D. A fighter jet takes off."
+ ]
+ }
+ },
+ "jnmNBWfP": {
+ "video_id": "jnmNBWfP",
+ "video_duration": "174s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A competition is taking place within a large room, with a large crowd of people seated in bleachers in the background. In the foreground, there is a competition area enclosed by glass barriers. In the center of this area there are two structures, one blue and one red, both shaped like long rectangular trucks with vertical towers at each end. There are round, orange balls distributed on the floor of the competition area and within the structures. Several people in various colored shirts are present along the edge of the playing field, with at least one in a black and white striped shirt. There are also several wheeled robots within the arena that are moving around. Throughout the video, these robots collect the orange balls and move them around the playing field, with some appearing to move the balls into the tall structures. A scoreboard is visible along the top of the screen, displaying numbers associated with each side of the competition. There are also several instances of different camera angles and points of view that are overlaid on the original video footage.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the score appear as 44 to 36, with the blue team leading the red team?",
+ "answer": "B",
+ "candidates": [
+ "A. The score appears at the beginning of the video.",
+ "B. The score appears at the end of the video.",
+ "C. The score appears in the middle of the video.",
+ "D. The score appears throughout the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "At the beginning of the video, what is the position of the man wearing yellow relative to the man raising his right hand?",
+ "answer": "B",
+ "candidates": [
+ "A. To the right.",
+ "B. To the left.",
+ "C. In front.",
+ "D. Behind."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video show a robot that is stuck?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. No, it does not.",
+ "C. Yes, it does."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "With 76 seconds remaining on the countdown in the video, what was the score for the blue team?",
+ "answer": "C",
+ "candidates": [
+ "A. 31 points.",
+ "B. 30 points.",
+ "C. 29 points.",
+ "D. 28 points."
+ ]
+ }
+ },
+ "JCdlVjpN": {
+ "video_id": "JCdlVjpN",
+ "video_duration": "134s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A figure in a white outfit and pink earmuffs walks on a snowy path through trees. The figure runs towards a ski lift, bumping into a post. The figure says \"I didn't see that\" then \"I just ran into a\". The figure then says \"THANK YOU VERY MUCH\" with a sad face emoticon. The video shows the same ski lift area again, and a chair arriving. The figure boards the chair lift and travels up a hill with the text \"IT TOOK AROUND 20 MINUTES TO GET TO THE TOP\" displayed on screen. The figure exits the chairlift, then it is shown from behind walking up a snowy incline. The figure turns and says \"y'all DIDN'T see NOTHING\" with a smiling face emoticon. The figure snowboards down a hill and falls. A black and white image of the figure lying on the ground is shown, before showing the figure again, standing up. The video shows the figure skiing down the hill, falling down, and lying in the snow. It shows the figure again next to a tree with the text \"amelia after breaking her bones\" with a smiling face emoticon. The figure appears again in a different location, walking next to a barrier and text appears saying \"instagram: @iicottencloud tiktok: @iicottencloud\" along with a Bible verse. The figure is shown from behind walking to another ski lift. The figure is shown again on the snow walking near a barrier.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a stationary cable car in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Not sure.",
+ "B. Yes, there is.",
+ "C. No, there isn't."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does the girl\u2019s emotion change in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. From angry to tense, and finally excited.",
+ "B. From tense to joyful and calm, and finally excited.",
+ "C. From angry to tense, and finally joyful and calm.",
+ "D. From joyful and calm to tense, and finally excited."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the game character take a cable car in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the end of the video.",
+ "C. Throughout the entire video.",
+ "D. At the beginning of the video."
+ ]
+ }
+ },
+ "Rrgdcuwk": {
+ "video_id": "Rrgdcuwk",
+ "video_duration": "82s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a person sitting in the driver's seat of a car. They are wearing glasses and a blue jacket. They pick up and inspect a black camera lens, holding it up to the camera. They then appear to screw something onto the lens. The person then puts the lens down, looks at the camera, and speaks. The scene transitions to a view of a sidewalk on a street. Two people are walking along the sidewalk. Then, the person who was in the car is walking along the sidewalk with another person. They both smile and look at the camera. The username @KARINADION is shown on the screen, with an Instagram logo.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video show a stationary car?",
+ "answer": "B",
+ "candidates": [
+ "A. None of the above.",
+ "B. Yes, it does.",
+ "C. It is uncertain.",
+ "D. No, it does not."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What type of lens does the man primarily intend to use for today's filming, according to the video?",
+ "answer": "C",
+ "candidates": [
+ "A. An 18-55mm zoom lens.",
+ "B. A 56mm f/3.2 portrait lens.",
+ "C. A 56mm f/1.2 portrait lens.",
+ "D. A 24mm f/1.2 portrait lens."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "From which direction is the car sound in the background of the video coming?",
+ "answer": "D",
+ "candidates": [
+ "A. The car sound is coming from the street on the right side of the screen.",
+ "B. The car sound is coming from the street in front of the character.",
+ "C. The car sound is coming from the street behind the character.",
+ "D. The car sound is coming from the street on the left side of the screen."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What do they plan to do after the video ends?",
+ "answer": "D",
+ "candidates": [
+ "A. A woman will take a photo of a man in the coffee shop.",
+ "B. A woman will take a photo of a man in the record store.",
+ "C. A man will take a photo of a woman in the coffee shop.",
+ "D. A man will take a photo of a woman in the record store."
+ ]
+ }
+ },
+ "YlLOSDav": {
+ "video_id": "YlLOSDav",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with two people in front of a window, with a large, rough, red object behind them. The person on the left has reddish hair and is wearing a white long-sleeve shirt with a black tie. The person on the right has blonde hair, is also wearing a white long-sleeve shirt and a black tie, and is smiling. The video then cuts to a view of a bearded person with short hair, wearing a dark-colored long-sleeve shirt, and brown jacket. This person moves out of the shot and the camera moves toward a woman with long brown hair, wearing a black top. A light grey sign with a purple A on it is visible in the background. The woman holds her hands out toward the viewer, and then a person's hand comes into frame and points a pink object toward her. The woman takes the object and the video continues to follow her as she takes a small bag and begins to open it, and then she opens a second small package. She then opens a card-sized object and looks at several more. Finally, a list of names on a black background is shown at the end of the video.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the woman's profession depicted in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A horror film director.",
+ "B. A horror movie actor.",
+ "C. A comedy director.",
+ "D. A comedian."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "When does the woman open the card pack in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the start of the video.",
+ "C. At the end of the video.",
+ "D. In the middle of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "How do women feel about the jester card and the bat card?",
+ "answer": "D",
+ "candidates": [
+ "A. Initially dislike the jester card but feel happier when the bat card appears.",
+ "B. Initially feel anticipation for the bat card but become more excited when the jester card appears.",
+ "C. Initially dislike the bat card but become more excited when the jester card appears.",
+ "D. Initially feel anticipation for the jester card but become more excited when the bat card appears."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many cards are shown facing the camera in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. Two.",
+ "D. One."
+ ]
+ }
+ },
+ "xkJDkBng": {
+ "video_id": "xkJDkBng",
+ "video_duration": "88s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Fitness",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a split screen with several different cuts.On the left, there is a gymnast wearing a maroon leotard with white accents, doing a floor routine. The routine includes a forward roll, a bridge, leg stretches, and several dance moves. She performs several flips, including a tuck, and then joins her teammates, also wearing maroon leotards with white accents, who are standing near the gymnastics mat. The gymnasts and coaches jump up and down together, and then embrace. The right side of the split screen displays a rotating series of images. In the first images, there is a vertical graphic of award titles and dates. It reads \u201cHonors Awards\u201d, then \u201c9 Career Perfect 10\u2019s on Floor 2024\u201d, \u201cAcademic All Big Ten 2020-2021\u201d, \u201cAcademic All Big Ten 2021-2022\u201d, \u201cAcademic All Big Ten 2022-2023\u201d, \u201cAcademic All Big Ten 2023-2024\u201d, \u201cAll-Big Ten First Team 2021\u201d, \u201cAll-Big Ten First Team 2023\u201d, and \u201cAll-Big Ten First Team 2024\u201d. This graphic then changes into an image of a woman in a sheer red and black gymnastics outfit posing on the gymnastics mat, then later changes into a photo of two women in bathing suits. Throughout the video, a woman\u2019s face is visible in the bottom right corner. She wears a purple and blue patterned top.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did Maya Hooten do with her teammates after the competition?",
+ "answer": "A",
+ "candidates": [
+ "A. They are celebrating.",
+ "B. They are performing.",
+ "C. They are watching the game.",
+ "D. They are taking photos."
+ ]
+ }
+ },
+ "NFgNDAFH": {
+ "video_id": "NFgNDAFH",
+ "video_duration": "104s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man is sitting in front of a dark blue backdrop with white stars, wearing a dark gray shirt with white lettering. On his left is a white shelf with several objects including two figurines of a man with a shield and a figurine of a man in a blue uniform with red accessories, several telescopes, an animal skull, and other objects. On his right is a framed picture of a colorful space nebula. Text appears on the screen that says \u201cStarhopper Failed Hover Test.\u201d Then, a wider shot of an outdoor area appears with a body of water in the background and several white structures near the center of the frame, with a yellow flame emanating from a metal pipe on the right. As the shot progresses, a structure in the center begins to lift off the ground with a burst of yellow flame and white smoke from underneath, with a sunlit sky in the background. Then, a circle appears on the top right corner of the video, containing a black and white photo of a woman with blonde hair and a white shirt. Her name appears below in the video text, along with other text describing her image. The scene goes dark briefly, then returns to the same outdoor shot with the structure on the ground, then goes to a second attempt at lifting off, with more white smoke and orange flame. The scene goes dark again and changes back to a close-up of the first man, where he is again seated in his chair. A large white box appears on the screen over the man's face, which is an image of a tweet. The tweet says \u201cPC (chamber pressure) high due to colder than expected propellant\u201d. The man's face is on the screen again, while a second outdoor scene appears where a structure again tries to lift off, expelling white smoke and orange flame. The man's face then appears once more.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What abnormal event occurred after the rocket ignition in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The rocket accelerates skyward.",
+ "B. The rocket remains stationary.",
+ "C. The rocket exploded.",
+ "D. The rocket caught fire."
+ ]
+ }
+ },
+ "GSLoYRyv": {
+ "video_id": "GSLoYRyv",
+ "video_duration": "80s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a close-up of some green leafy plants against a light-colored wall. The camera pans right revealing more of the plants and a white surface in the background. A hand comes into frame, then the camera moves right and upward, showing more of the plants and some blue berries on them. The view transitions to show the plants in more detail, as well as a grassy area and road in the distance. The camera angle changes a number of times, moving around the plants and toward a road and building in the distance. The camera moves close into the leaves, and then pans down and moves along a support pole, finally settling on the blue berries again. A hand comes into view, picking berries. A person with a green t-shirt and a neon green hat picks more berries, and gathers them into their hand.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "In the video, what will the man do with the blueberries next?",
+ "answer": "A",
+ "candidates": [
+ "A. Eat them.",
+ "B. Juice them.",
+ "C. Discard them.",
+ "D. Break them apart."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a lemon tree in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. I don't know.",
+ "C. Yes."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the man in the video doing with the fruit in his hand?",
+ "answer": "C",
+ "candidates": [
+ "A. Showing the color of the blueberries.",
+ "B. Displaying the size of the strawberries.",
+ "C. Displaying the size of the blueberries.",
+ "D. Counting the number of blueberries."
+ ]
+ }
+ },
+ "JIPcLtVo": {
+ "video_id": "JIPcLtVo",
+ "video_duration": "67s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view of a city street. Several cars and an ambulance are visible on the road. The camera then focuses on a tall, gray building. The view transitions to an interior kitchen setting, where a person in a red t-shirt with light-colored sleeves and blue jeans is wearing oven mitts and appears to be taking a pizza out of the oven. The camera angle changes and now includes another person wearing a plaid jacket and blue jeans. The person in the red t-shirt holds up the pizza and seems to be talking to the other person. They then appear to start fighting with the pizza between them. The person wearing the plaid jacket is then shown eating some kind of food. The person in the red t-shirt is holding oven mitts. A view is shown of a third person lying in bed, with a bandaged head and a cast on one leg. A person in a blue-green shirt is seen standing next to the bed and adjusting the bedding. This person then walks to a security panel on a wall and touches the buttons. The camera then returns to the person in bed and then to the security panel. Then it shows the person in the bed again before the person in the blue-green shirt reappears, followed by the person in the red shirt and then the person in the plaid jacket, who is holding a bouquet of flowers. The person in the blue-green shirt speaks before the person in the red shirt and the person in the plaid shirt both come into the room. The person in the blue-green shirt looks surprised, then the other two people look at each other. Finally, all three people are visible.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Where in the video does the ambulance appear?",
+ "answer": "C",
+ "candidates": [
+ "A. The ambulance appears at the end of the video.",
+ "B. The ambulance appears throughout the entire video.",
+ "C. The ambulance appears at the beginning of the video.",
+ "D. The ambulance appears in the middle of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What are the man in the red shirt and the blonde woman doing in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A blonde woman tells the man in the red shirt that Freddie was hit.",
+ "B. A blonde woman and a man in a red shirt are arguing.",
+ "C. A blonde woman and a man in a red shirt are playing a game.",
+ "D. A blonde woman is questioning the man in the red shirt."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What will the woman in the blue shirt do next with the flower?",
+ "answer": "D",
+ "candidates": [
+ "A. Drop it on the ground.",
+ "B. Place it in a vase.",
+ "C. Throw it into the trashcan.",
+ "D. Soak it in bleach."
+ ]
+ }
+ },
+ "CJrXMdiM": {
+ "video_id": "CJrXMdiM",
+ "video_duration": "119s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A green figure is present throughout the video in the bottom left corner, holding a white stick. A flat green surface with white markings and pink edges is displayed. A white and pink spherical object is visible as well as a boat with brown masts. The flat green surface changes into a light colored surface with brown lines resembling a map. The light colored surface changes to a dark surface with light lines. A white light illuminates the center of the surface where the boat is. The dark surface is displayed with the boat in the middle and two cube shapes with maps and question marks on them appear on either side of the boat. A cone shape with a map and question mark then appears next to the cube shape to the right of the boat. The light colored surface is once again displayed with the boat in the middle. The boat moves towards the top left of the surface. A globe is then displayed with the boat sitting on top of it, both are on a dark background. The globe then has a green mesh over it and a human-like figure appears on top of it. The globe is shown with yellow snail-like figures on it. The globe disappears and the green surface is then shown with one snail-like figure on it. The edges of the green surface turn to blue light, then green, then red and yellow light. There are many of the snail-like figures spread out on the green surface. The green surface is then shown as a flat surface, a curved surface, a cylinder, a C shape, a donut, a curved surface, a cylinder, and a curved surface again. Finally, the green surface is once again shown with one snail-like figure on it, with its edges turning to blue light, then green, then red and yellow light, then having its light diminished.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many animated characters are in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. One.",
+ "D. Two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Which of the following options does not represent a change in the plane where the snail resides?",
+ "answer": "A",
+ "candidates": [
+ "A. Cone.",
+ "B. Square.",
+ "C. Torus.",
+ "D. Cylinder."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the snail in the video understand that the Earth is a sphere?",
+ "answer": "B",
+ "candidates": [
+ "A. Because the snail lives on a flat, limited, and cyclic surface.",
+ "B. Because the snail possesses highly advanced portal technology.",
+ "C. Because the snail sees itself repeating in a grid pattern.",
+ "D. Because the snail can view images of an infinitely similar space."
+ ]
+ }
+ },
+ "YahvfLTI": {
+ "video_id": "YahvfLTI",
+ "video_duration": "124s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a wide outdoor shot of a rocky terrain with tall, jagged rock formations, and three people, two in long brown robes and one in light clothing, standing near a vehicle on wheels. The scene then shifts to an interior area with a stair case, the wheeled vehicle and a metallic robot standing with its arms out. A man in a long tan coat and cap holding what appears to be a weapon approaches, passing behind the robot. The scene shifts again to show a woman with light blonde hair and a sparkly light-colored top standing in front of the robot. The robot appears to turn toward her. The next shot shows a control room with three people in tan clothing and a black haired person walking around several control consoles. The next scene shows three people in a round room, looking at a large round screen with glowing lights moving from the center of the screen outwards. Then the scene shifts to show a long hall with a row of floating spheres and a very large humanoid robot in the foreground. This changes to a different view of the hall, where two people are walking on the lower level below the floating spheres. Then there is a shot of 3 people, two in light colored clothes and a third with dark clothing, with the third person pointing towards the top of the frame. The next series of shots shows different views inside a metallic shaft with metallic spheres and green electrical currents flowing by the camera. Another shot shows a dark night time exterior scene, and then a view of the bottom of a vehicle. A woman and two men can be seen running up stairs inside a space station, and then a person with dark clothing opens a round hatch and a person in a tan uniform carries items in an opening. The next shot shows a control room with many lit panels and then a view of an opening filling with red light and smoke. The final two characters, a man and woman in space suits, look towards the glowing opening. The final shots show a series of burning shapes and then some text including \"FAR AND AWAY\" followed by \"THE MOST PROVOCATIVE AND UNUSUAL ADVENTURE FILM YOU'VE EVER SEEN\". Then \"FORBIDDEN PLANET AN M-G-M PICTURE\" is shown, followed by \"STARRING WALTER PIDGEON, ANNE FRANCIS, LESLIE NIELSEN and THE ROBOT\" followed by a wide angle shot of the landscape and then the text \"TWO YEARS IN THE MAKING IN CINEMASCOPE\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What type of business robot is featured in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Customer service assistant robot.",
+ "B. Production and service robot.",
+ "C. Intelligent question-answering robot.",
+ "D. Emotional support robot."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where is the scene most likely taking place in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. An alien planet.",
+ "B. I am not sure.",
+ "C. Earth.",
+ "D. Inside a spacecraft traveling to outer space."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What changes occur to the spaceship's entrance in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. No change.",
+ "B. Covered in graffiti.",
+ "C. Melted.",
+ "D. Thrown away."
+ ]
+ }
+ },
+ "vfowhIxa": {
+ "video_id": "vfowhIxa",
+ "video_duration": "408s",
+ "duration": "6-8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A hand reaches into a container and takes out several markers. A book is opened, and then a hand attaches a metal clip to the top of the book's first page. A hand uses a blue marker to draw a hat. A hand uses a blue marker to draw a globe. A hand uses a red marker to draw a book. A hand uses a red marker to draw a chair. A hand uses a purple marker to draw a pump bottle. A hand uses a purple marker to draw a slice of pizza. A hand uses a purple marker to draw some paper clips. A hand uses a purple marker to draw some clothing hangers. A hand uses a red marker to draw a pencil. A hand uses a purple marker to draw some art supplies. A hand uses a red marker to draw a milk box, a soda can, some cookies, and a lollipop. A hand uses a blue marker to draw a shoe. A hand uses a blue marker to draw a popsicle. A hand uses a red marker to draw a popsicle. A hand uses a purple marker to draw a cloud and two strawberries. A hand uses a blue marker to draw a knife. A hand uses a purple marker to draw a cup. A hand uses a red marker to draw a hamburger. A hand uses a green marker to draw some sunglasses. A hand uses a green marker to draw a cell phone. A hand uses a green marker to draw some Lego bricks. The book is shown with all the drawings on the pages, then the book is closed and put aside.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "What action is the man performing at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Five actions are shown.",
+ "B. Three actions are shown.",
+ "C. Two actions are shown.",
+ "D. Four actions are shown."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the man in the video drawing simple sketches?",
+ "answer": "C",
+ "candidates": [
+ "A. To teach how to draw using the fewest lines.",
+ "B. To use a newly purchased marker.",
+ "C. To solve the problem of exaggerating details.",
+ "D. Because drawing simple sketches is his hobby."
+ ]
+ }
+ },
+ "ETSVxgro": {
+ "video_id": "ETSVxgro",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black hat, black coat and grey pants sits in a chair inside of a tent and holds a book. A coiled white rope lies near the chair. The person looks around, then opens and closes the book, then sets the book down next to him, and picks it up again, opening the book. The person looks to his right. The person\u2019s face fills the screen briefly and then shifts to an individual carrying a large grey rock and then to a person running while carrying something in the arm, in a mountainous area. The person is shown walking in the same mountainous area, facing away from the viewer while holding a bow. The person then turns to face the viewer.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the color of the clothing worn by the man who ran past the man wearing black?",
+ "answer": "C",
+ "candidates": [
+ "A. Black.",
+ "B. Red.",
+ "C. Brown.",
+ "D. Yellow."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional shift does the man in black clothing experience while reading?",
+ "answer": "C",
+ "candidates": [
+ "A. From tension to happiness.",
+ "B. From tension to relaxation.",
+ "C. From relaxation to tension.",
+ "D. From happiness to tension."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man in black raise his head after the man in brown runs past?",
+ "answer": "A",
+ "candidates": [
+ "A. Because someone is hunting birds with a gun.",
+ "B. Because he is looking at a snake on the grass.",
+ "C. Because a woman is asking him a question.",
+ "D. Because a man is going to shoot him."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In which direction does the man in black clothes see the snake when he lifts his head?",
+ "answer": "C",
+ "candidates": [
+ "A. Behind.",
+ "B. On the left.",
+ "C. On the right.",
+ "D. In front."
+ ]
+ }
+ },
+ "eTHAzLvI": {
+ "video_id": "eTHAzLvI",
+ "video_duration": "176s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a split-screen view of a CNN news segment. On the left is a woman with shoulder-length brown hair, wearing a red jacket. On the right is a man wearing a dark suit jacket, white shirt, and red tie with an American flag pin on his lapel. He occasionally gestures with his hands while speaking. The lower third of the screen displays a CNN logo in the bottom right corner, \u201cLIVE\u201d next to it, and various text throughout. Initially, the banner reads \u201cONE-ON-ONE WITH GOP CANDIDATE VIVEK RAMASWAMY\u201d with \"2024 RACE\" and \"CNN PRIMETIME\" in smaller text. Later, the banner changes to \"RAMASWAMY CALLS FOR 'MAJOR CONCESSIONS' TO RUSSIA.\" At another point the banner displays \u201cVivek Ramaswamy (R) Presidential Candidate\u201d. The text in the bottom left corner changes periodically. The background behind each person shows a stylized cityscape at night. The woman mostly listens attentively, with occasional small nods or changes in facial expression. The man is more animated, gesturing and varying his tone of voice. He points his index finger on several occasions. He also occasionally touches his chest or tie.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the woman in the red blazer question the man in the blue suit about his position on Russia?",
+ "answer": "A",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. In the middle of the video.",
+ "C. At the end of the video.",
+ "D. Throughout the video."
+ ]
+ }
+ },
+ "vzCNoUrf": {
+ "video_id": "vzCNoUrf",
+ "video_duration": "224s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video is a screen capture of someone playing a first-person shooter game called Call of Duty Mobile. In the top left corner, a small box shows a person wearing a black baseball cap, headphones, and a dark shirt. Below this box is a graphic showing the game logo. The main portion of the screen displays a menu showing the various weapons and customisation options for the player's in-game character. The character is a person wearing a light-orange outfit, a face mask, and a bullet-proof vest. Then, the in-game view switches to first person as the character navigates a map and engages with other characters. Various weapons are used. In several instances, the screen turns red as the character is hit. Text flashes onscreen indicating points are being gained or that a player has been killed. In one instance, a bottle-shaped object is thrown. The video ends with a series of screens, indicating the outcome of the match, experience point progress, and upgrades earned. The video concludes with a final screen featuring a logo and a list of names in white letters.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "In the video, who is the top performer in this competition?",
+ "answer": "B",
+ "candidates": [
+ "A. Ambulance.",
+ "B. Call of Duty.",
+ "C. Oasis.",
+ "D. Broadcaster."
+ ]
+ }
+ },
+ "etzYvfJi": {
+ "video_id": "etzYvfJi",
+ "video_duration": "87s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a horse race. In the left part of the video, horses and their riders are shown at the starting gate, then the horses run down the track. The horses are ridden by people wearing different colored clothing. In the right part of the video, a man is shown wearing headphones and looking through binoculars. There are papers and a pen on a desk in front of him. As the video plays the camera angle shifts on both the left and right sides to follow the events. The camera follows the horses as they move down the track. The man with the binoculars adjusts them as needed. At the end of the race the camera on the left shows the horses crossing the finish line. The camera on the right focuses on the view from his window.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "From which side did Horse No. 5 overtake Horse No. 7 in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Horse No. 5 overtakes Horse No. 7 from the left side.",
+ "B. Horse No. 5 is always in front of Horse No. 7.",
+ "C. Horse No. 5 does not overtake Horse No. 7.",
+ "D. Horse No. 5 overtakes Horse No. 7 from the right side."
+ ]
+ }
+ },
+ "nsKNcrlt": {
+ "video_id": "nsKNcrlt",
+ "video_duration": "33s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a blue chef's shirt places a wooden cutting board on a red and white checkered tablecloth. Several ingredients and utensils are already present on the table. The person picks up a clear glass bowl of seasoned chicken pieces with tongs. The shot changes angle, and now two people are visible. One wears a chef\u2019s shirt and holds the bowl of chicken. The second person wears a white apron with a red Safeway logo. They talk as the person in the chef\u2019s shirt mixes the chicken with tongs. The person in the apron takes the tongs. The person in the chef's shirt gestures to the background, where people are gathered and cars are parked. The person in blue walks to a grill and opens it, revealing the grill grates. Using a pink-and-white checkered cloth, they wipe the grates. They pick up tongs and use them to place a piece of chicken on the grill. The shot changes to a side angle, and more chicken is placed on the grill. The shot returns to the overhead angle. More chicken pieces are placed on the grill. The shot changes back to the wider angle featuring both people. The person in blue holds the bowl while the person in white grills the chicken. The person in white uses a pink checkered cloth to move the chicken on the grill. They both smile and gesture toward the background.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, where is the chicken placed on the grill?",
+ "answer": "D",
+ "candidates": [
+ "A. The chicken is in the center of the grill.",
+ "B. The chicken is evenly distributed across the entire grill.",
+ "C. The chicken is on the right side of the grill.",
+ "D. The chicken is on the left side of the grill."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order of steps for preparing the roasted chicken shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Add marinade to the chicken, open the oven, place the chicken inside, and flip the chicken.",
+ "B. Add marinade to the chicken, open the oven, place the chicken inside, and close the oven.",
+ "C. Inspect the marinated chicken, open the oven, place the chicken inside, and flip the chicken.",
+ "D. Inspect the marinated chicken, open the oven, place the chicken inside, and close the oven."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the woman in the video laugh?",
+ "answer": "A",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. Five.",
+ "D. Three."
+ ]
+ }
+ },
+ "ZmFDZDnX": {
+ "video_id": "ZmFDZDnX",
+ "video_duration": "123s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man in a dark blue suit points towards the center of a stage. The camera pans to another individual standing on the stage, who is wearing a red, patterned shirt and dark pants. The individual holds an acoustic guitar and stands at a microphone. The individual plays the guitar while singing. Behind him, other individuals are shown playing various instruments. The background displays a backdrop of clouds and fire imagery, lit with various colors. The camera zooms in on the singer\u2019s face, followed by a wide shot of the individual and the band behind him. The camera pans to another individual, a judge, who appears to be listening. The camera returns to a full view of the performance stage. As the song ends, the camera zooms in on the singer\u2019s face. A brief shot of the band is shown followed by a wide shot of the stage, with the singer, a judge in the background, and the instrumentalists on either side. Another wide shot of the stage is shown, followed by a tight shot of the singer. The camera pans and zooms out to show the singer on stage in front of judges, and then to three individuals in the audience. The video ends with a wide shot of the stage, band, and audience. The singer is at the center.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What is the most likely overall emotional tone of this song?",
+ "answer": "B",
+ "candidates": [
+ "A. Hateful and indignant.",
+ "B. Healing and affectionate.",
+ "C. Joyful and upbeat.",
+ "D. Somber and sorrowful."
+ ]
+ }
+ },
+ "sZzmvOze": {
+ "video_id": "sZzmvOze",
+ "video_duration": "109s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Soccer",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows an outdoor scene at a sporting event. There is a large crowd of people, many of whom are wearing red. On the field, a number of people are standing on a raised, red platform with \"AFC CHAMPIONS\" printed on it. There is confetti falling from the sky. There are several close up views of individuals. One man has on a hat that says \"CONFERENCE CHAMPIONS.\" He is wearing a light gray shirt with a red logo. He is interviewed by a man in a black jacket and is holding a trophy, which he then hands to another man. Another man in a red jacket raises his arm in the air and is also wearing a patterned scarf and ski goggles on his hat. Confetti falls from the sky in the various different scenes.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who says, \"We look forward to seeing you and your brother square off\", in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A man wearing a red headband.",
+ "B. A man wearing yellow gloves.",
+ "C. A man wearing a red jacket.",
+ "D. A man wearing a blue jacket."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the man wearing a hairband while speaking in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. In front of the man wearing a blue jacket.",
+ "B. To the left of the man wearing a blue jacket.",
+ "C. To the right of the man wearing a blue jacket.",
+ "D. Behind the man wearing a blue jacket."
+ ]
+ }
+ },
+ "HNZSIMiY": {
+ "video_id": "HNZSIMiY",
+ "video_duration": "111s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Camping",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A tan-colored vehicle with a black top is parked on a paved surface near a silver car. A person walks towards the camera, wearing a gray shirt. He touches his hair, raises his hand and talks. He has short dark curly hair, and stubble on his face. He puts on a pair of mirrored blue sunglasses and talks some more. The camera moves to a shot from above showing the same tan and black vehicle and the silver car, which are parked near a house with a dark roof and brick exterior. The person wearing the gray shirt is moving things around by the car. The view switches to inside the garage. The man, now wearing khaki shorts, is moving boxes and bins around, from left to right. He picks up and sets down various items, then picks up a box and a smaller blue item. The camera moves to a view of the backyard, where the man walks holding a black box. The view switches to inside the vehicle, where he places a large blue container, and puts a black container under a wooden platform. The camera is positioned under the platform as the man places a dark bag in the space. Then the view moves to the back of the vehicle where the man sets down a dark bin. The view moves to the garage, and the man moves an item from inside the garage and moves it closer to the car in the driveway. The man picks up various boxes and containers and puts them on top of bins outside the garage. The video transitions to a shot of an interior space with a black couch, white pillows, and a picture with a series of circles.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence for a man to prepare for a trip?",
+ "answer": "D",
+ "candidates": [
+ "A. Load camping gear, organize the vehicle, load items into the vehicle, start the vehicle, check the vehicle.",
+ "B. Check the vehicle, load camping gear, load items into the vehicle, organize the vehicle, start the vehicle.",
+ "C. Check the vehicle, load camping gear, organize the vehicle, load items into the vehicle, start the vehicle.",
+ "D. Start the vehicle, check the vehicle, load camping gear, organize the vehicle, load items into the vehicle."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What might happen at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A man is playing games on the sofa.",
+ "B. A man drives away in a white car.",
+ "C. A man drives away in a yellow car.",
+ "D. A man is cooking in the kitchen."
+ ]
+ }
+ },
+ "AHvABhxR": {
+ "video_id": "AHvABhxR",
+ "video_duration": "29s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a person wearing glasses, a black blazer, a light-colored shirt and tie, centered in front of a blue background with lines creating square panels. There is a small GTV logo in the upper left corner. There is a time stamp in the upper right corner. The person appears to be speaking to the camera. The next scene shows 4 people standing in a row in front of a white building. They are all wearing blue short-sleeve shirts with writing on the front and back, black pants and baseball caps, with the last person wearing a black face mask. The next scene shows a person in tan pants and jacket, a black helmet with a white stripe, and black boots descending a rope from a second story window on the building. They are attached to a black cylindrical object. The next scene shows the same person along with 2 other figures in windows, descending from the building with a black cylinder at the end of a rope. The following shot shows a firefighter in an orange helmet and black jacket, pants and boots, attached to a red stretcher at the building wall, being pulled up by a rope. The next scene shows a wide shot of 2 lines of people, a red cone, some equipment and a white tent. A few people are standing on a set of stairs. The next scene shows a firefighter in yellow helmet, jacket, and gloves, and black pants, kneeling in the road with some equipment. The next scene shows 2 firefighters carrying a dark-colored object on their shoulders with a red frame. The next shot is a split screen with the 2 firefighters carrying the object on the left, and 2 firefighters performing CPR on a dummy on the right side. The next scene shows 3 figures on a small stage in front of a large banner with writing on it. The figure in the center is wearing a black bra top and tan pants. The two figures on either side are shirtless and wearing tan pants. The next scene shows a wide shot of the same stage with a crowd seated in front of them, and a person standing to the left with a paper in their hand. The next scene shows 4 firefighters kneeling near equipment at the front of a red fire truck.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many types of sports events are shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Six.",
+ "C. Seven.",
+ "D. Five."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the identity of the bald man at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A firefighter representative.",
+ "B. The referee.",
+ "C. The host.",
+ "D. A journalist."
+ ]
+ }
+ },
+ "UmvvVWdk": {
+ "video_id": "UmvvVWdk",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a first-person perspective of a video game. The player is holding a black gun and is in an outdoor environment with green grass, some trees, and a couple of white houses. There are several blocky characters with a variety of skin tones, some with yellow shirts, some with white shirts, and one with a brown shirt. Some characters are on the ground with red stains around them. The player aims the gun at various characters and it appears they are being hit by the player. The player moves around the map shooting at more characters. A chat log appears on the left side of the screen and text appears at the bottom of the screen. There are white blocky shapes in the sky.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What changes occur in the sound when the player kills a game character in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Gradually decreases.",
+ "B. Decreases first, then increases.",
+ "C. Increases first, then decreases.",
+ "D. Gradually increases."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What item does the brown-bodied character use to attack the character holding a gun?",
+ "answer": "B",
+ "candidates": [
+ "A. A stone pickaxe.",
+ "B. A wooden stick.",
+ "C. A rifle.",
+ "D. An iron sword."
+ ]
+ }
+ },
+ "xmHjHCiU": {
+ "video_id": "xmHjHCiU",
+ "video_duration": "109s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Book Reviews",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person with light blonde, braided hair and black framed glasses is centered in the video. They are wearing a gray tank top and have a black string bracelet on their right wrist. They are standing in front of a white bookshelf filled with books and a white ceramic teapot. There are several strands of small white lights draped around the bookshelf. The person's hands move throughout the video. They start by moving their hand in front of their braid, then waving with their hand open, then touching their hair, and then showing a single finger, followed by two fingers, then three fingers. They repeat a similar hand motion by showing four fingers, then five fingers, then one, then two, then three, then a series of hand gestures, before ending with an open hand wave.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Where is the woman in the video most likely to film her next video?",
+ "answer": "A",
+ "candidates": [
+ "A. None of the above.",
+ "B. An offline club.",
+ "C. A new residence.",
+ "D. Her current residence."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the girl laugh in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Two.",
+ "B. One.",
+ "C. Three.",
+ "D. Four."
+ ]
+ }
+ },
+ "nBkuXaRl": {
+ "video_id": "nBkuXaRl",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A split-screen shows a man in a suit in an elevator and a group of people dancing outdoors. Then, Donald Trump and Kanye West walk together through a lobby, followed by a man sitting on a couch gesturing with his index fingers, and a news anchor speaking on television. Four news anchors are shown on a split-screen, and then a man peels an asparagus stalk at a table. A man speaks at a podium on a stage, then a parade float in the shape of a turkey passes by, with two people on top. Two women stand near a grave in a cemetery, one holding a bowl and the other sitting. Additional people arrive at the gravesite. Three women walk down a path, followed by a split-screen of women from the cemetery scene and close-ups of their legs and clothing. Back at the cemetery, the standing woman offers food from the bowl to the sitting woman. A man plays an accordion on a grassy hilltop, then another man plays an accordion while sitting on hay bales in the back of a truck. A man sings facing sideways, and a woman dances in a dimly lit room. She lies on a bed, then engages in intimate acts with another woman and a man. A man in a suit poses against a cityscape background, followed by a man in a bathrobe embracing a woman in a bikini on a boat. They embrace, then a man and a woman are shown in an office, the man touching the woman's back. A rounded blue and orange building appears, then the man from the office sits at a desk. Finally, two men in suits stand in an office holding a large piece of bread.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In the video, where is the woman wearing sunglasses located?",
+ "answer": "C",
+ "candidates": [
+ "A. In a studio.",
+ "B. In an office.",
+ "C. In a cemetery.",
+ "D. In a car."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "Where did Valerian Manzat work prior to the events depicted in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. He worked in an office at a park.",
+ "B. He worked in an office at a restaurant.",
+ "C. He worked in an office at a prison.",
+ "D. He worked in an office at a construction site."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Which musical instrument is shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A violin.",
+ "B. An accordion.",
+ "C. A piano.",
+ "D. A cello."
+ ]
+ }
+ },
+ "GyPkHQnC": {
+ "video_id": "GyPkHQnC",
+ "video_duration": "545s",
+ "duration": ">8min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A player runs through a house and exits through a door. Outside, they jump and deploy a glider, flying down a hill. They land and pursue another player, firing a rifle. They switch to an assault rifle and continue firing. They approach a group of buildings, constructing a wooden structure, then brick stairs. They collect metal, then eliminate the other player. They run, construct a brick wall by a tree, and revive their teammate. They build a staircase of brick and ice blocks, climbing upward.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the man wearing a blue cloak doing with a black rifle in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. He strikes an enemy's head with the rifle butt.",
+ "B. He hunts camels with the rifle.",
+ "C. He hands the rifle to a teammate.",
+ "D. He shoots at enemies with the rifle."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many men are speaking in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. One.",
+ "D. Two."
+ ]
+ }
+ },
+ "vSEsziRM": {
+ "video_id": "vSEsziRM",
+ "video_duration": "70s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A light-colored batter is cooked in circular shapes on a flat black grill top, which is covered with a yellow potato-based mixture. The batter is then topped with various chopped vegetables and an orange shredded topping. A person in a white t-shirt and a camouflage hat receives money from a customer and places food into a white bag for them. Later, the same person fills a small plastic food container with a light orange liquid, and tops it with chopped green lettuce. They also add some of the orange shredded topping. The man then stands next to a silver food cart in what appears to be a park. On the side of the cart is an award for street food. The text \"In 2016 he released the Organic Baby Spinach Dosa, which is still off the menu\" appears over a shot of him topping multiple small discs of batter with various food items. He counts money and then turns to face the camera smiling. There is a final shot of the man standing in front of the food cart again, followed by a black screen with text that states \"Video by Jessica Leibowitz\" and then another screen with text that states \"Produced by Jessica Leibowitz Alex Tween\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many pancakes did the man in the video make?",
+ "answer": "D",
+ "candidates": [
+ "A. Five.",
+ "B. Three.",
+ "C. Six.",
+ "D. Four."
+ ]
+ }
+ },
+ "QPpZoWwt": {
+ "video_id": "QPpZoWwt",
+ "video_duration": "42s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with the word \"UPROXX\" in large white text over a black background, followed by the text \"Pop Brixton.\" There is a red line below \"UPROXX.\" The scene then switches to an aerial shot of a construction site. A crane is on the left, and several shipping containers are present. Some people are visible on the ground. A semi-truck pulls into the frame on the right, carrying more containers. The video then shows a blue shipping container being lifted into the air by the crane. Next, the camera moves to show a building made from red shipping containers. Some of the containers have windows. String lights are hung up between some of the containers. The scene then moves to a ground-level shot of an entrance to a building made from containers. There is a sign reading \"CONTAINER\" in white text with \"RECORDS\" below it, and an arrow to the right. The video then cuts to an aerial view of a patio with people gathered around. There are several yellow picnic-style tables, as well as an open tented area in the back. The next scene is a street in which a number of cars are moving, along with a person on a scooter. A text overlay reads, \"Brixton is a working class borough of London with a large Caribbean community.\" The video switches to a shot of a person speaking with the text overlay, \"ADRIAN LUCIE. BUSINESS OWNER | MAMA'S JERK @ POP BRIXTON.\" This is followed by a doorway with the text overlay \"BRIXTON VILLAGE\" above it. Then there is a view of the outside of a storefront, and the security gate is down. Text on the gate reads \"BRIXZTON BIZNESS CAFE.\" The video then moves to a shot of an indoor market with many closed stalls and some people walking through. This is followed by a black and white shot of a damaged building with smoke coming out of it. Some people are standing in the foreground and some firefighters are visible in the background. A text overlay reads, \"Before the recent process of gentrification, Brixton was infamous for its long history of crime, high unemployment and racial tension.\" The next scene shows an open-air market with a number of stalls. This is followed by a scene of a person speaking inside a building with a text overlay reading, \"TENDAI CHAGWEDA HOST | IMPACT HUB @ POP BRIXTON.\" The video then switches to a black and white scene of people on the street in the daytime, with a damaged building in the background. The video then shows people walking through an open-air market, followed by a scene of a street with people walking by. Finally, there is a black and white shot of a street with a group of police officers with dogs, facing away from the camera. A text overlay reads, \"Brixton's unsavory reputation came to a head in the famous Brixton race riots of the 1980's.\"",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the most likely setting of the scene at the beginning of the video, where containers are stacked?",
+ "answer": "B",
+ "candidates": [
+ "A. A cargo transfer station.",
+ "B. An undeveloped wasteland.",
+ "C. An outdoor parking lot.",
+ "D. A dumpsite."
+ ]
+ }
+ },
+ "HeoexUFx": {
+ "video_id": "HeoexUFx",
+ "video_duration": "218s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "swimming",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "People are swimming laps in a pool with lane markers. Some of the people diving into the pool are wearing swim caps of varying colors. Spectators can be seen standing behind a white barrier on the edge of the pool, many wearing short-sleeved shirts. There are also orange cones and blue and gold pennant banners at the edge of the pool.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the man wearing black shorts do with the athlete from lane 3 after they reached the finish line at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A fist bump.",
+ "B. A handshake.",
+ "C. Clap hands.",
+ "D. A hug."
+ ]
+ }
+ },
+ "jAOzofbK": {
+ "video_id": "jAOzofbK",
+ "video_duration": "108s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen, then cuts to a woman wearing glasses and a black shirt, who gestures with her hands as she speaks. The video then displays the white text \"IDENTIFYING THRIPS\" on a green background with a white circular icon with a figure inside and a magnifying glass around it. The woman reappears and gestures with her fingers. The text \"WHAT'S THE DAMAGE\" appears on a green background. The woman reappears and speaks and gestures. The video then cuts to a green leaf that has dark spots and discolorations on its surface and shows a few tiny insects moving on it. The woman reappears speaking and gesturing. Then a close up shot is shown of a hand gently touching a green and white leaf. This is followed by a shot of a large green leaf with brown spots and discolorations. The woman is again shown gesturing and speaking. A close up is shown of a leaf with a small dark worm-like insect on it. The woman then reappears and continues to speak and gesture. A green background is shown with the white outline of an insect in a circle and the text \"LOOK FOR THRIPS\". The woman is again shown. The video then cuts to a green background that shows a white insect and the numbers 0.5 and then 0.5-2.5 next to it. Then a hand holding a magnifying glass is shown focusing on small leaves. Close up shots of discolored leaves with damage and insects follows. The woman is then shown speaking and gesturing. A close up is shown of a dark insect on a green leaf, with text that indicates the size is 1.0mm. The woman reappears and continues to speak and gesture. Finally the video displays the word \"BREATHE\" in white text on a green background, and then cuts to the woman.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the giant water lily appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. I am not sure.",
+ "C. Yes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Where does the plant Victoria Amazonica appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. Throughout the video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "riMjPFRz": {
+ "video_id": "riMjPFRz",
+ "video_duration": "149s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person's hands are shown arranging thin slices of peeled apples over a circle of pie dough on a wooden board. Another, smaller, circle of dough sits to the side. The apples are taken from a glass bowl to the upper right. After arranging the apples to cover the dough, the hands sprinkle a white substance, likely sugar, from a small white bowl. Next, the other circle of dough is lifted and placed over the apples. The top layer of dough is adjusted and pressed around the apples and the bottom layer. The excess dough around the edge of the pie is trimmed and then crimped, forming a decorative edge. A fork is used to make small holes across the surface of the top dough. A knife is then used to trim any remaining excess dough hanging from the pie's edge. The pie is brushed with a liquid, presumably an egg wash, from a small white bowl using a white and black silicone brush. A cut of the baked pie is then shown on a white plate, revealing the cooked apple filling inside. The pie crust is golden brown and the small holes made with the fork are visible. The pie is sliced with a knife.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the primary function of the egg mixture shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. It increases moisture.",
+ "B. It enhances color and sweetness.",
+ "C. It contributes to protein content.",
+ "D. It adds fragrance."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many steps are involved in the food preparation process depicted in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Six.",
+ "B. Seven.",
+ "C. Eight.",
+ "D. Five."
+ ]
+ }
+ },
+ "OfMRXaWI": {
+ "video_id": "OfMRXaWI",
+ "video_duration": "102s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a person with long blond hair, wearing sunglasses on their head, a black jacket, and a choker necklace, sitting at a table with a blue cloth covering it. There are bookshelves behind them filled with books and video games. They are talking and gesturing with their hands. The next scene is a man in a brown shirt and a black scarf, standing on a wooden bridge. Red splatters appear on his body. In the background there are other figures. The following scene is of multiple people wielding long poles and moving in unison. They are in a sandy area with small buildings in the background. The next few scenes are of a person fighting other people in a video game. The person's character is wearing dark clothing with white hair, and the setting is a grassy area with trees and buildings. Red blood splatters are present. The word \"COMBAT\" is displayed on the screen. The video shows menus from the game. The character's weapons and combat styles are displayed. These include a polearm, an oxtail blade, and a sabre. The person in the video game is fighting other video game characters, using swords, and a bow and arrow. The settings include a rocky area, a bamboo grove, and wooden structures. The person is seen moving through the environments and fighting other video game characters. The video returns to the person with blond hair at the table, who is talking and gesturing with their hands. Their hands are now in front of them, palms up. The game play continues in a different setting with a wooden floor, and then to a grassy environment. The person with white hair continues to fight other video game characters with swords. The video returns to the person at the table, who continues to talk and gesture with their hands. The video game footage continues with the person with white hair, walking along roofs, and fighting on roof tops. The video returns to the person at the table, who is talking and moving their hands. The video shows the person with white hair fighting in various environments, including grassy fields, and a rocky area. The video returns to the person at the table. They have their hands together in front of them, palms facing each other and they are talking with a smile. The video game is shown, with the character running along elevated wooden pathways and fighting other characters. The scene shows explosions and a different character falling backwards off a structure. The video returns to the person at the table, who is talking and moving their hands.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the white-haired game character standing on the rooftop in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The character is observing enemies.",
+ "B. The character is avoiding enemies.",
+ "C. The character is silently eliminating enemies.",
+ "D. The character is conserving stamina."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the character class of the white-haired female game character in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A marksman.",
+ "B. An assassin.",
+ "C. A samurai.",
+ "D. A mage."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What does the white-haired game character do with the hookshot in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Resist an attack.",
+ "B. Hook onto the trees.",
+ "C. Grapple the enemy.",
+ "D. Hook the eaves."
+ ]
+ }
+ },
+ "RNKMeiDV": {
+ "video_id": "RNKMeiDV",
+ "video_duration": "83s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A blue toy train travels along an orange track, then through two orange uprights before crossing a green bridge. The blue train then enters a toy boat which travels across a blue track. Another orange toy train with a white face moves along a track. A red toy train with black lettering saying \u201cSalty\u201d goes along a blue track. Two toy trains, one blue with a white face and a second orange with a white face, travel along a green track. Three toy trains, one green with a white face, one yellow with a white face, and one black with a white face, are positioned horizontally. Seven toy trains, a mix of colors, move along the surface. Three toy trains, one green with a white face, one yellow with a white face, and one black with a white face, are positioned horizontally. A black toy train travels along a yellow track. A black toy train travels along a yellow track, then a grey track, an orange track, and a blue track. A green building is visible in the background. A black toy train travels along the blue track past orange \u2018fire\u2019 pieces. Another train travels in the background. A yellow toy train goes along an orange track, then through the uprights to a boat. The boat carries it along a blue track to a green track. The yellow train goes along a green track then a brown slide. The yellow train then passes through an orange cave. The train moves onto a blue track, then an orange track, a grey track, and finally a green track. The yellow train passes through an orange doorway. A green train travels in the foreground, pulling a green container with several white boxes. The green train travels along a grey track with green edging. The train travels on a grey track with a green container and white boxes. The train passes through an orange bridge. The green train passes through a beige doorway and crosses another grey track. The train approaches a 'finish' line and then stops. A variety of different toy trains are positioned horizontally on the surface.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is diesel engine number 10 in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Blue.",
+ "B. Yellow.",
+ "C. Black.",
+ "D. Green."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many motorcycles are shown in the video participating in the team relay challenge?",
+ "answer": "C",
+ "candidates": [
+ "A. Six.",
+ "B. Nine.",
+ "C. Twelve.",
+ "D. Three."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, what was the order of appearances in the first round of the diesel engine team competition?",
+ "answer": "C",
+ "candidates": [
+ "A. Blue locomotive, red locomotive, black locomotive.",
+ "B. Red locomotive, black locomotive, yellow locomotive.",
+ "C. Black locomotive, yellow locomotive, green locomotive.",
+ "D. Yellow locomotive, green locomotive, black locomotive."
+ ]
+ }
+ },
+ "BCTaQQCs": {
+ "video_id": "BCTaQQCs",
+ "video_duration": "325s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two hands reach into a green bin containing a mixture of shredded paper, dark soil, leaves, and food scraps. The hands dig into the mixture, revealing numerous red composting worms. The hands then pick up some of the compost and worms, and spread the mix around the bin. More food scraps, including a moldy pineapple, lemons, and potatoes are added to one side of the bin and mixed into the existing compost with the hands. Leaves are added on top. Finally, more shredded paper bedding is spread evenly across the top of the bin, covering the food scraps and compost mixture.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In which setting is the pest bag shown in the video located?",
+ "answer": "A",
+ "candidates": [
+ "A. A basement.",
+ "B. A fruit store.",
+ "C. A living room.",
+ "D. A garden."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the final state of the fruit shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The woman crushes the fruit and then places it on the soil.",
+ "B. The woman peels the fruit and then places it on the soil.",
+ "C. The woman crushes the fruit and then buries it in the soil.",
+ "D. The woman peels the fruit and then buries it in the soil."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the woman in the video doing with the bucket of fruit?",
+ "answer": "C",
+ "candidates": [
+ "A. Using it as fertilizer for the soil.",
+ "B. Burying it in the soil for fermentation.",
+ "C. Using it as food for insects.",
+ "D. Using it as a shelter for insects."
+ ]
+ }
+ },
+ "xkwKYtgH": {
+ "video_id": "xkwKYtgH",
+ "video_duration": "210s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Astronomy",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a black background. In the upper left, there is the word \"44 days\" in purple text. To the left of center, there's an image of a galaxy. On the right side of the frame, there is a man with grey hair wearing a black suit and white collared shirt. To the right of his head, there's a dotted yellow line shaped like a parabola. The video then shifts and a black screen displays multiple smaller, bordered images. A man is seen standing at an open doorway with text that reads, \"HAMUY.\" Two men are standing next to a car with text reading \"SUNTZEFF SCHOMMER.\" A single man has a portrait with the name \"PHILLIPS\". Another man has a portrait with the name \"ANTEZANA.\" Two men have portrait images next to each other with text that reads, \"SMITH\" and \"AVILES WISCHNJEWSKY.\" Finally, another man is seen sitting at a desk with the text \"MAZA\". The video shifts again. On the left side of the frame is a man wearing a light shirt and holding a dark object, standing in front of an off-white wall. To the right of center is a photograph of a large grey machine. The man with the black suit is present on the right side of the frame. The video then changes to display another image, which has a graph in the top half. The horizontal axis reads \"DAYS SINCE 1990 SEPTEMBER 30 UT\". The man in the black suit remains on the right side of the frame. The video then displays three distinct light bursts with dotted line parabolas going beneath each. The man in the black suit remains on the right side of the frame. The video shifts to display a white background with a graph and the text \"DISTANCE\" along the vertical axis and \"REDSHIFT\" along the horizontal axis. The man in the black suit remains on the right side of the frame. The video shifts and displays the words \"Ideas and Technology\" in blue text against a black background. The man in the black suit remains on the right side of the frame. The video then shifts to display a black-and-white image of a round metal object with four square openings. The man in the black suit remains on the right side of the frame. The video shifts again to display four square images of nebulas. The man in the black suit remains on the right side of the frame. The video changes to display two white structures with gold interior lights in front of a pink sky, with the man in the black suit remaining on the right side of the frame.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the beginning of the video, what is the position of the man in the black suit relative to the man in the blue background image?",
+ "answer": "B",
+ "candidates": [
+ "A. Behind.",
+ "B. To the right.",
+ "C. To the left.",
+ "D. In front."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Which technological advancements have made the study of supernovae possible?",
+ "answer": "D",
+ "candidates": [
+ "A. New modern digital detectors and larger telescopes.",
+ "B. Only larger telescopes.",
+ "C. Traditional analog detectors.",
+ "D. Only new modern digital detectors."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How do the emotions of the Chilean team change while studying celestial bodies?",
+ "answer": "C",
+ "candidates": [
+ "A. From despair to calmness.",
+ "B. From despair to hopelessness.",
+ "C. From sadness to excitement.",
+ "D. From enthusiasm to disappointment."
+ ]
+ }
+ },
+ "AnXFZkNE": {
+ "video_id": "AnXFZkNE",
+ "video_duration": "363s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two avatars stand in front of a pastel background with trees and a rainbow. A text overlay reading \u201cAUGUST & AKEILA\u201d appears. More avatars gather in the scene, some with small horns and star-shaped markings around their eyes. Text overlays indicate names including \"LIA & LAILA\". Two avatars are holding a smaller, doll-like version of themselves. An outdoor scene appears showing a tent with strings of lights and balloons. Many avatars are seated in rows of white chairs while others stand in the scene. The scene transitions to a view of a stage with a podium, balloons, and signs with the text \u201c2022\u201d. An avatar named \u201cFatima\u201d stands at the podium. Avatars are shown walking on stage and are congratulated by others. Some avatars are throwing their graduation caps in the air. A text overlay reads \u201cTHE END\u201d. Avatars are seen gathered inside a kitchen, with pizza boxes and cakes on a table.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "In the video, how does the woman in the pink dress assist the little baby after the baby is blocked by the crowd?",
+ "answer": "C",
+ "candidates": [
+ "A. She has the little baby sit down.",
+ "B. She carries her away.",
+ "C. She carries her to the front of the stage.",
+ "D. She has the little baby stand up."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What did the woman in the pink gown do to the cake at the graduation ceremony in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. She threw the cake.",
+ "B. She gave the cake to a baby.",
+ "C. She ate the cake.",
+ "D. She sliced the cake."
+ ]
+ }
+ },
+ "flkVtGXO": {
+ "video_id": "flkVtGXO",
+ "video_duration": "341s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a soccer game. Players wearing light blue uniforms and white and red uniforms are seen on a green field. At various points a black and white ball is seen moving among them, with many players kicking the ball. At multiple points, a player wearing a yellow or green uniform is seen near the goal, and a black-clad referee is also seen on the field. At points a player in red and white is seen raising his arms. The score is shown on the upper left of the screen, changing at different points throughout the video, as do logos and advertising. There is a large crowd of spectators visible at different points throughout the video.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "How long does the competition last in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Approximately 75 minutes.",
+ "B. Approximately 65 minutes.",
+ "C. Approximately 85 minutes.",
+ "D. Approximately 55 minutes."
+ ]
+ }
+ },
+ "azaRjLcu": {
+ "video_id": "azaRjLcu",
+ "video_duration": "39s",
+ "duration": "<1min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A close-up shot of an eye with blonde eyelashes is shown. The camera then moves down to show an individual with blonde hair and blue clothes standing in front of trees. The camera changes again and shows a circular, metallic head emitting a bright golden light. A circle of golden light appears behind the head before it zooms out and shows an individual wearing blue clothes falling downward with golden light coming out of their palms. The shot then changes to an individual with blonde hair in blue clothes with their hand forward and light emanating from their palm. This shot is followed by a close-up of an individual with blonde hair and black eyes with a metallic part on their shoulder. The next shot is of two individuals fighting, with one having white hair and bandages. This is followed by a close-up shot of a metallic object firing light towards an individual with white hair and bandages. The next shot shows an individual with blonde hair and black eyes standing in front of blue sky and white clouds with a metallic object on their shoulder. This shot is followed by two individuals fighting on a rocky terrain. The camera then moves to show a close-up shot of a person's butt that has red marks and an individual with long blonde hair playing a game in front of a television. The camera changes again to show an individual with a bald head looking forward. A shot of an individual with blonde hair wearing blue clothing and a dark metallic arm and metallic armor is then shown. Following this is an individual with white hair and bandages standing next to another individual in green clothing laying on the ground. The shot changes to an individual in a blue jacket with dark metallic arms, emitting light from their arm. The camera zooms in on the individual with metallic arms. A shot of a white-haired individual getting hit by a metallic object and emitting light is next. The next shot is a fast shot of a blonde-haired individual getting hit by a metallic object. The final shots are of the same two individuals fighting and moving through green foliage.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What material is Genos' body primarily composed of?",
+ "answer": "A",
+ "candidates": [
+ "A. Metal.",
+ "B. Rock.",
+ "C. Rubber.",
+ "D. Organic tissue."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What could be the possible outcome of Genos's battle?",
+ "answer": "B",
+ "candidates": [
+ "A. Genos escapes in the end.",
+ "B. Genos is killed.",
+ "C. The opponent escapes in the end.",
+ "D. Genos kills the opponent."
+ ]
+ }
+ },
+ "lqXhbIhc": {
+ "video_id": "lqXhbIhc",
+ "video_duration": "235s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A white rectangle is shown in the center of a grey workspace. The menu tabs are visible at the top, and toolbars are present on the left and right. The word \"place\" is shown at the end of one of the options on the drop-down menu under the file tab. A pop-up window appears with a list of folder names and a scrolling function, then the file name \u201cworld-map-large\u201d is selected. After a world map with different colors is imported into the white rectangle, the visibility of the land layer is turned off. After this, the world map is converted into a green solid color. A blue rectangle is drawn on top of the green world map, then sent to the back. After this, an ellipse is drawn, and a clipping mask is made. The visibility of the underlying layer is switched off, and a new layer is created. An ellipse is drawn and filled in with blue. Then, using the pen tool, it is slightly altered on one side to create a dividing line. Then the three dimensional tool is selected and the object is revolved. A map art window is opened and a previously made green world map is overlaid onto the blue sphere. A second blue circle is drawn in another part of the white rectangle. Using the pen tool, one of the sides is altered and a three dimensional revolve is made, then the map art is set to none.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "At the beginning of the video, how many dropdown options are listed under the File menu?",
+ "answer": "C",
+ "candidates": [
+ "A. 21.",
+ "B. 19.",
+ "C. 22.",
+ "D. 20."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is China on the world map displayed at the beginning of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Blue.",
+ "B. Magenta.",
+ "C. Orange.",
+ "D. Green."
+ ]
+ }
+ },
+ "SHZUpPGc": {
+ "video_id": "SHZUpPGc",
+ "video_duration": "63s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with a close-up of a woman with long brown hair. She is wearing a dark-colored shirt. The background appears to be a room with picture frames on the wall and shelves with books. The video then transitions to a close-up of a pair of hands on a blue cutting mat. The hands are holding two small rectangular photos, one next to the other, with images of a landscape. A blue pencil, a clear ruler, and other tools are also on the mat. The hands position the photos on a cream-colored square card. Using the pencil, they draw a line along the edge of the photos and the top of the card. They use the ruler and the pencil to draw a horizontal line below the photos. They then use a clear T-ruler and the pencil to draw a vertical line on the edge of the cream-colored card. The T-ruler is repositioned, and a second vertical line is drawn on the side. The hands pick up the photos, which are moved to the upper-left corner of the cutting mat, leaving the card in the center. The video ends with the hands next to the card and the photos.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At which part of the video does the optical center point appear?",
+ "answer": "B",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the end of the video.",
+ "C. At the beginning of the video.",
+ "D. Throughout the entire video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "At the end of the video, where is the optical center point positioned relative to the image?",
+ "answer": "B",
+ "candidates": [
+ "A. The optical center point is in the upper left corner of the image.",
+ "B. The optical center point is in the lower right corner of the image.",
+ "C. The optical center point is in the upper right corner of the image.",
+ "D. The optical center point is in the lower left corner of the image."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the woman in the video using a ruler to do?",
+ "answer": "B",
+ "candidates": [
+ "A. Measure the size of the photo.",
+ "B. Find the optical center point of the image with a ruler.",
+ "C. Draw on the paper with a ruler.",
+ "D. Cut the paper with a ruler."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What could be the occupation of the person in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Handcraft creator.",
+ "B. Math teacher.",
+ "C. Image measurer.",
+ "D. Painter."
+ ]
+ }
+ },
+ "PZqHNgap": {
+ "video_id": "PZqHNgap",
+ "video_duration": "46s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A video of a Senate hearing shows Senator John Cornyn writing on a piece of paper while Senator Ted Cruz looks on. Then, Barack Obama appears, speaking directly to the camera against a backdrop of foliage and a softbox light. He wears a light blue, button-down shirt. He gestures with his hands as he speaks. The scene cuts back to Obama, now in front of a blue curtain backdrop. He holds a tablet and scrolls through it with his finger. Text appears on the screen that reads \"Excuse #3\" repeated four times. Below the repeated text are the words, \"My vote doesn't matter.\" Obama is shown again, speaking to the camera against the blue backdrop. He gesticulates while speaking. The scene then transitions to a large crowd of people at a concert or festival. They are standing and moving to music, with lights illuminating them from above. Obama reappears against the blue backdrop, continuing to speak and gesture. An image of an elderly woman peering at a laptop screen through her eyeglasses is shown briefly. She lowers her glasses to see better. Obama appears one last time against the blue backdrop and continues speaking and gesturing.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What was Obama's third cited reason?",
+ "answer": "B",
+ "candidates": [
+ "A. The ease of obtaining loans.",
+ "B. An individual vote does not matter.",
+ "C. Issues raised by members of Congress.",
+ "D. Voting in performance-based programs."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video include a tablet computer?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. Not applicable.",
+ "D. It is not clear."
+ ]
+ }
+ },
+ "XRHkcodm": {
+ "video_id": "XRHkcodm",
+ "video_duration": "81s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The first panel shows a bed with two figures laying in it. An alarm clock is on a night stand. In the second panel, a woman sits on the edge of the bed and several shapes fly around her. In the third panel, a woman lays in bed with a sun figure, a light, and a rooster. The fourth panel shows a woman at a computer and a moon with hands is at an alarm clock. The text \"SOURCE: HABENAUER ET AL., 2009\" is shown at the bottom right of this panel. The following panels are in a grid of four, with 1 panel on top and three below. The first panel shows a man with glasses, looking to the right. The next panel shows a man in bed pointing at a pillow. The next shows the same man in bed looking at his wrist, with the image of houses outside. The next shows the same man, now a drawing of a woman is also present. Next, the drawing of the bed with eyes attached to the legs appears. Then, a broken heart is present above the bed in addition to a calendar. The image of the woman appears at the bottom left in bed, and below that, the bed has eyes on the legs and a smiling mouth drawn below the pillow. Next, the drawing of the woman's head appears, followed by a full drawing of the woman holding a purse and looking down. Following this, four heads are drawn in a circle, and a drawing of the woman looking down is below that. Finally, there are four people drawn together below the circle, with the woman seated at a desk below. The following panels are also in a grid of four. The first shows a figure looking out of the panel. Next, the figure is slumped over a book. Next, a brain appears floating and blowing out a paper airplane. Next, the figure appears to be reading a book. Following that, the brain is reading a book. The figure then appears to look into an open book, with another set of eyes next to them. Next, a confused looking brain is drawn. Finally, a drawing of the figure is below this and she has her eyes closed, with additional image of her in bed, holding a small item, with a brain also present and working at a computer. The next panels are a grid of four, with a \"2.\" in the top left corner of the first panel. The first panel shows a head looking to the right, with three different images surrounding it. In the next panel, the same head is looking right with a mirror in front of it. The mirror shows a different figure. Below that is a drawing of a figure lying in bed, with an elephant below this. Finally, the figure in bed has drawn lines above it, a cloud image is near the head, and a figure with a hat and lines flies past. Below that is the figure lying in bed, a drawing of her holding different facial expressions, and a drawing of a confused figure in front of a mirror. The last panels are a grid of three with \"3.\" in the top left corner of the first panel. The first panel shows the woman lying in bed. An arrow points to a drawing of her head. An arrow points to her standing up and another arrow points to a drawing of her in bed with the sky outside. The arrow points to a drawing of a man and woman. Below that, there is a drawing of a person working at a desk, with another drawing of a woman below it. Below that, a person is taking a small item in a drawing, with another drawing of a man, a woman, and another person below that, with the man holding a picture with a line on it.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many images featuring a brain appeared in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. Four.",
+ "C. Three.",
+ "D. Two."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order are these four events introduced in the video? (a) The formation of a vicious cycle, (b) Schools not adapting to changes, (c) Consequences of teenage sleep problems, (d) Connection to the rising suicide rates among teenagers.",
+ "answer": "B",
+ "candidates": [
+ "A. (b) (a) (d) (c).",
+ "B. (b) (d) (a) (c).",
+ "C. (a) (b) (d) (c).",
+ "D. (a) (d) (b) (c)."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video contain a cartoon of the school?",
+ "answer": "B",
+ "candidates": [
+ "A. It does not exist.",
+ "B. It exists.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "csXkKkde": {
+ "video_id": "csXkKkde",
+ "video_duration": "33s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Camping",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with a light-colored top, dark hair, and a beard walks in the foreground while another person with dark hair pulled into braids, a red top, and a gold necklace walks in the background. There is a small child walking in the background as well. A video clip is overlaid in the top left corner of the frame that shows the scene in front of the camera. The scene depicted in the video clip shows a tan horse standing next to a wooden post. A hand is seen in the video clip, stroking the horse's face. In the background of the frame, a fenced off area can be seen. Then, the focus shifts to show only a tan horse standing next to a wooden post, with another brown horse in the background. The tan horse lowers its head and rubs it against the post. The video then changes to show a brown horse and another horse walking out of a barn. The brown horse rubs its head on a wooden pole.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many horses appear at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Five horses.",
+ "B. Four horses.",
+ "C. Three horses.",
+ "D. Two horses."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, what is the position of the blue box in relation to the yellow horse?",
+ "answer": "D",
+ "candidates": [
+ "A. The blue box is to the right of the horse.",
+ "B. The blue box is behind the horse.",
+ "C. The blue box is to the left of the horse.",
+ "D. The blue box is in front of the horse."
+ ]
+ }
+ },
+ "AUFrqnzf": {
+ "video_id": "AUFrqnzf",
+ "video_duration": "73s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video starts with a scene depicting several trees on a green and brown textured ground. Text that says \"Get some wings!\" is overlaid on the image. The scene changes to show a night time sky and a figure in a purple hat, and another three figures standing on the ground. Text overlaid on the image says \"Konah the Witch Doctor has arrived!\". The scene changes to another night time image, with a small, brown house on the left of the screen, and a small figure with wings appearing in the center of the screen with some floating blue and white shapes. The figure moves towards the small house and enters it. The figure stands in front of another figure in the house. A speech bubble appears over the figure in the house. A game menu appears on the left of the screen and a different menu appears on the right. A cursor moves within the game menu. A menu box highlights some text that says \"Leaf Wings\", and gives a description of the object as \"Equipable, can be worn in vanity slots, allows flight and slow fall, buy price: 4 gold 75 silver\". The box is closed and the figure with wings moves outside of the house. The figure with wings flies upwards, with the word \"Weeeeee\" in white text overlaid on the black background. The figure continues to fly and move through the scene, occasionally being hit by other figures, objects, and projections of various colors. The video ends with a red and blue background, and three white shapes of two rectangles and a circle.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "After entering the next level of the game, can it be easily completed?",
+ "answer": "C",
+ "candidates": [
+ "A. Yes, because a top-up was made.",
+ "B. No, because the enemies in the next level are stronger.",
+ "C. Yes, because it provides damage reduction and improves mobility.",
+ "D. No, because the equipment is not strong enough."
+ ]
+ }
+ },
+ "gIORoOwO": {
+ "video_id": "gIORoOwO",
+ "video_duration": "39s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a drawing of two people. One person has glowing goggles and is positioned in the foreground and has a black object on their right shoulder and an object in their hands. The second person, in the background, is also holding a weapon. A white line emerges from the side of the weapon, directed to the right side of the screen. The drawing is in grayscale. The image moves to another grayscale drawing of a single individual holding an object in each hand. A white line appears, beginning from the top left and extending to the middle bottom of the screen. The drawing then changes to two grayscale figures. The figure on the left is kneeling while holding a weapon, and the figure on the right is jumping up off the ground while holding an object. The drawing is then replaced by a figure standing beside a door. The figure is holding a weapon and there is a black flag hanging from the side of the door. A thin white light comes from the right of the screen, heading towards the flag. The image changes to a person with short brown hair wearing a camouflage uniform. The background behind the person is solid black. The image changes to a black background with white text. The text reads, \"Payne and his teammates moved to the roof of the burning building, where the second team had called for help. They continued to receive constant fire from the enemy to their west and from the building directly below them.\" The image changes back to the man speaking against a solid black background, then changes to a blurry grayscale image, showing three people on the top of a building, one standing and two crouched. Then the video returns to another grayscale drawing of three people on the top of a building. One person is firing a weapon to the left side of the screen, and the other two people are moving to the right side of the screen.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What are the soldiers in the video likely to do next?",
+ "answer": "C",
+ "candidates": [
+ "A. Leave the battlefield.",
+ "B. Go upstairs to rescue their teammate.",
+ "C. Go downstairs to rescue their teammates.",
+ "D. Extinguish a fire."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What abnormality is shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The soldiers in the video are under suppressive fire from enemies to the east.",
+ "B. The building in the video is on fire.",
+ "C. A soldier is hit by a grenade in the video.",
+ "D. A person is struck in the head by a bullet in the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times were guns fired by all the people in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four times.",
+ "B. Six times.",
+ "C. Five times.",
+ "D. Three times."
+ ]
+ }
+ },
+ "SheJYukC": {
+ "video_id": "SheJYukC",
+ "video_duration": "97s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A person in a black hooded sweatshirt and a black baseball cap is holding a white string in both hands. The string is initially straight, but the person manipulates it to form a loop with a silver ring tied in. After tying the ring in, the person holds the string with the ring tied on it horizontally between his teeth. The string is now looped, and has a small knot in it. The person then holds the string out to show the ring. Next, the view changes to show hands tying a lure on fishing line. The hands make a loop, pass the line through, and then tighten the knot several times. Then the hands hold up the lure attached to the line.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the likely occupation or role of the man in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A crafts content creator.",
+ "B. An outdoor survival content creator.",
+ "C. A fishing equipment store owner.",
+ "D. A fisherman."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many hooks are on the bait in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two.",
+ "B. Four.",
+ "C. Three.",
+ "D. Five."
+ ]
+ }
+ },
+ "lIjEMAHz": {
+ "video_id": "lIjEMAHz",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A bald man with a beard, wearing an olive green t-shirt and a beaded necklace speaks to the camera in what appears to be a home gym. Behind him, a Belgian Malinois dog sits on a small platform atop a trash can. The dog mostly remains still, occasionally shifting its weight or looking around the room. An American flag and various framed pictures hang on the wall behind them. To the right of the frame, a whiteboard displays handwritten notes and a rack holds dumbbells. A brief cut shows an Instagram profile for \"elliotthulse\", with a profile picture, follower counts, and a bio. Text overlaid on this image prompts viewers to DM the word \"KING\". The video then cuts back to the man speaking, with the dog still visible in the background.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "After the audience sends a message to the man, what is most likely to happen next?",
+ "answer": "A",
+ "candidates": [
+ "A. He receives a detailed reply.",
+ "B. He receives a simple reply.",
+ "C. He does not receive a reply.",
+ "D. He receives an impolite reply."
+ ]
+ }
+ },
+ "BMquacHu": {
+ "video_id": "BMquacHu",
+ "video_duration": "324s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video opens with a white silhouette of a person throwing, with the words \"ARETE THROWS NATION\" and a trademark symbol underneath. The scene changes to a man, wearing a black baseball cap and a long-sleeved white shirt with a blue logo, and gray pants. He is standing on a circular concrete platform, with four green cones positioned around it. In the background, there is a green net and a fence. The man gestures with both hands, pointing his index fingers upwards, and then towards the ground. Yellow text stating \"STAY ON THE GROUND\" appears on the screen. Below this, in a black box, the words \"HEAD COACH ERIK JOHNSON AreteThrowsNation.com\" are shown next to a silhouette logo. The man begins demonstrating various techniques, moving his arms and legs. An arrow appears on the screen, showing the direction of movement he is demonstrating. Sometimes, an outline is placed on his body to highlight what he is doing. The man is shown demonstrating in different ways, and at one point, a yellow starburst appears to briefly cover his body. He then picks up a round object from the platform and moves it in the air in front of him. The man raises both hands, and a yellow spark appears on the screen. In another moment, two yellow arrows are placed above his feet, moving upwards. The scene quickly changes to a close up shot of the man's face, wearing black sunglasses. The image is distorted with green digital lines and code. The man starts demonstrating moves again. A red x appears on the screen. The man then rotates to demonstrate the moves, with a green line following around his legs, and a green checkmark on his right foot. A yellow text appears in a black box: \u201cFOCUS ON THE \u2018TRANSFERS\u2019 FROM OUR PILLAR 5 TO 6 DRILLS.\u201d The man then shows a move that involves a rapid leg swing. The scene changes to footage of other athletes, a woman in a black and white tank top, a man in a black tank top, another man in blue shorts and a blue shirt, and another in yellow shirt and dark pants, all throwing a discus. The video continues showing the man teaching, and the scene suddenly flashes to an explosion on the platform, and the image then goes to static. The scene returns, with the man standing calmly on the platform. Yellow text appears, and the scene fades out, and then back in, with \"STAY ON THE GROUND\" text appearing. The man continues to demonstrate, and the words \u201cSOME DRILLS ARE MADE FOR VERY ADVANCED ATHLETES\u201d appear on the screen. The scene shows the man demonstrating more techniques. The scene switches to more athletes throwing a discus, with text stating their names \"SANDRA PERKOVIC,\" \"ROBERT HARTING,\" \"GERD KANTER,\" \"FEDRIC DACRES,\" \"ANDRIUS GUDZIUS,\" \"TOM WALSH,\" \"RYAN CROUSER,\" and \u201cDARRELL HILL.\" Text appears, and the man demonstrates more moves. The video then shows a split screen, with the man demonstrating and the athletes throwing the discus. Text appears on the screen \u201cTHE MORE TIME YOU\u2019RE ON THE GROUND THE MORE TIME YOU CAN APPLY THE FORCE\u201d. The man continues to demonstrate, and more arrows appear on the screen. Text then appears that reads \u201cDO DRILLS THAT MIMIC THE THROWS.\u201d The scene switches to the man holding a white sphere in his hands. The scene moves to more techniques. Text appears stating: \"LEARN THE 6 PILLARS OF THE THROWING CHAIN REACTION SYSTEM www.AreteThrowsNation.com.\u201d The man gestures to his head, and then to his body. Text appears briefly stating: \u201cL R A T E 6 L A R S O T E T H W I N C H A R N S S T\u201d and \u201cW A eteTrow N in co\u201d. He gestures with his hands. Then the text \u201cLIKE COMMENT ARETE THROWS NATION TV\u201d and a red \u201cSUBSCRIBE\u201d button appear. The man taps his head with an index finger, a bell icon, and a thumbs-up icon appear. Two stacked, vintage-style television sets appear, with green digital static on them. The video closes with a black background with diamond-plate pattern and the logo on the left, with a large text \u201cSUBSCRIBE NOW TO\u201d and a small text \u201cFOLLOW US: [Facebook icon] ARETE THROWS NATION [Instagram icon] ARETE THROWS NATION\u201d, and a red youtube symbol with text saying \u201cARETE THROWS NATION TV\u201d between the two vintage televisions.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many incorrect throwing motions did the man demonstrate in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. One.",
+ "B. Four.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "CUjwkSXV": {
+ "video_id": "CUjwkSXV",
+ "video_duration": "52s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man with stretched earlobes and a goatee, wearing a black jacket and a white t-shirt, speaks to the camera in a living room. He walks towards the camera and then performs a short dance routine. The scene cuts to two men sitting at a round wooden table. One man has a beard and wears a black long-sleeved shirt, the other wears a red baseball cap and a white t-shirt. They appear to be having a conversation. The video cuts back to the first man dancing in the living room. The two men at the table react to the dance. The first man continues dancing, putting his hood up at one point. The two men at the table mimic the dance moves. The dancer spins around. The two men also spin around in their chairs. The dancer then performs more moves, including some arm rolls. The men at the table continue mirroring the dance. The dancer steps back and forth across the carpet. The two men copy this move. The dancer faces the camera and cups one hand as if holding something. He then blows at his hand. The men at the table mimic this action. The dancer performs more dynamic arm movements, which the two men at the table copy enthusiastically. The dancer pulls his hood down and continues dancing. The two men continue to mimic the dance moves from their seats. The words \"THE END\" appear on the screen twice, superimposed over the scene of the two men sitting at the table. They continue to look at each other as the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What genre of music plays at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Light instrumental music.",
+ "B. Rock music.",
+ "C. Melancholic music.",
+ "D. Classical music."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why did the two other men in the video react with disbelief towards the man wearing the black coat?",
+ "answer": "D",
+ "candidates": [
+ "A. The man in the black coat did not invite them to eat.",
+ "B. The man in the black coat did not play games with them.",
+ "C. They did not want to play games with the man in the black coat.",
+ "D. The man in the black coat expressed that certain matters are too ignorant."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional shift does the man wearing a black coat exhibit in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. He transitions from disappointed to excited.",
+ "B. He transitions from angry to excited.",
+ "C. He transitions from frustrated to excited.",
+ "D. He transitions from calm to excited."
+ ]
+ }
+ },
+ "nixkDWiN": {
+ "video_id": "nixkDWiN",
+ "video_duration": "488s",
+ "duration": ">8min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man in a dark gray, short-sleeved shirt with Harley-Davidson patches works on a motorcycle engine. He touches parts of the engine and uses a red and black digital torque wrench on bolts. The motorcycle is primarily black and orange. The engine is exposed, with many wires and components visible. In the background, other motorcycles and a tool chest are partially visible. Other mechanics work in the background. The man holds up two black engine gaskets, one larger than the other. He continues to work on the engine using the torque wrench.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In what order does the man in the video tighten each bolt? (a) Loosen 360 degrees. (b) Tighten to 20-30 pounds. (c) Rotate 90 degrees. (d) Tighten to 9-11 pounds. (e) Tighten to 25-27 pounds.",
+ "answer": "B",
+ "candidates": [
+ "A. (a) (c) (b) (d) (e).",
+ "B. (b) (a) (d) (e) (c).",
+ "C. (d) (c) (b) (a) (e).",
+ "D. (a) (b) (c) (d) (e)."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the distribution map of bolt installation shown at the beginning of the video, where is the No. 2 bolt located in relation to the No. 3 bolt?",
+ "answer": "B",
+ "candidates": [
+ "A. On the left.",
+ "B. Below.",
+ "C. On the right.",
+ "D. Above."
+ ]
+ }
+ },
+ "tAdEntxB": {
+ "video_id": "tAdEntxB",
+ "video_duration": "191s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a single person centered in the frame, facing the camera and holding a small, dark microphone in one hand. The person has light skin, a mustache, and is wearing eyeglasses, a black baseball cap, and a white long-sleeved t-shirt with red graphics. The background appears to be a room with walls decorated with numerous album covers and a \"I WANT TO BELIEVE\" poster. The person gestures with both hands while speaking and remains stationary. Throughout this first section, the lighting appears to shift slightly between blue and purple hues. Subsequently, the video displays white text on a black background, with some of the text highlighted in orange. The text appears on a white, rectangular box, and then transitions back to the person in the same original location in the room, who continues speaking with gestures, and the lighting in the background changes between blue and purple. Then there is a second transition to a white box, that repeats the same format of text as the first white box, this time displaying different text highlighted in orange. The video then transitions back to the person speaking with hand gestures and remains stationary as before in the same position, the lighting shifts back and forth between blue and purple hues as before.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "In the video, is the lamp behind the man always illuminated?",
+ "answer": "B",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes, it is always illuminated.",
+ "C. No, it is not always illuminated.",
+ "D. None of the above."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order in which the man in the video presents his views on e-cigarettes? (a) Research background, (b) Specific findings, (c) Discussion and recommendations regarding scientific evidence and countering misleading information, (d) Research outcomes.",
+ "answer": "D",
+ "candidates": [
+ "A. (a) (c) (b) (d).",
+ "B. (a) (b) (c) (d).",
+ "C. (b) (a) (d) (c).",
+ "D. (a) (d) (b) (c)."
+ ]
+ }
+ },
+ "RmuFGvqp": {
+ "video_id": "RmuFGvqp",
+ "video_duration": "146s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Physics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video starts with an image of a blue swirled galaxy and a dog in the center of the galaxy. There is red text in different parts of the image. The next scene is of a white board with a hand writing the text \"Sir Nevin Kia\" using an orange marker. The hand places a black-and-white cutout of a person next to the text and draws a sword. Then the hand writes \"vs. Black Hole\" and \"Event\" followed by additional text below. The hand then draws over the cutout multiple times. The hand writes the text \"Gravitational Redshift\" and draws an orange line with a small stick figure to the left. The text \"Why U so Red?\" is written above the line. The hand then places the cutout down and moves it around multiple times. The next scene is a close up image of a movie poster that has \"The Troll Hunter\" in yellow text. Above the poster is the text \"Norway's biggest state secret will be revealed October 29th\". There is a cutout of a person on the upper right side of the poster. The next scene is of a plate of pasta with a cutout of a person in the center of the image. The video continues back to the white board with the black-and-white cutout and a hand writing the text \"Singularity\" next to a red dot. There is an orange circle around the word \"Bwaha\". The hand moves the cutout of a person and wipes away the red dot. The final scene is a rotating image of three books. The hand wipes the board and then makes a scrunching motion with the white board eraser.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "According to the video, what happens to the appearance of a paper character as it approaches the edge of the observer's field of view?",
+ "answer": "B",
+ "candidates": [
+ "A. It gradually brightens and appears red.",
+ "B. It gradually darkens and appears red.",
+ "C. It gradually darkens and appears blue.",
+ "D. It gradually darkens and appears blue."
+ ]
+ }
+ },
+ "VZkaKVFv": {
+ "video_id": "VZkaKVFv",
+ "video_duration": "64s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Physics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A white circle appears, surrounded by a thin white line, against a black background with small yellow dots. The circle turns into a solid orange circle with a title \"Massive main sequence star\" appearing above it against the same black and yellow dot background. A second orange circle appears to the right of the first with the title \"Average main sequence star\" above it. A horizontal line appears below each orange circle with the text \"20 million\" and \"1000 million\" respectively. The right most orange circle disappears. The first orange circle turns red with a yellow border and the words \"Super Red Giant\" appear on the left. Inside the red circle, layers of different color appear from the inside out, including red, orange, yellow, green, light blue, dark blue, and magenta. The names of elements appear next to the colors. The multi-colored circle returns to a solid red color with a yellow border. Two white horizontal lines, and two diagonal red lines appear on top of the red circle. Two teal diagonal lines appear on the sides of the red circle. The red circle appears with a cloudy appearance. A burst of white covers the screen. An elliptical shape with light blue, magenta, and grey circles with blue dots appears against the black star background with the word \"Supernova\" on the top left corner. The words \"heat\" and \"light\" appear on the right side of the elliptical shape, and then \"temperature\" and \"pressure\" appears on the right side of the shape. The shape disappears against the black star background.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the main sequence lifespan of the massive stars shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. 10 million years.",
+ "B. 100 million years.",
+ "C. 2 million years.",
+ "D. 20 million years."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence of changes in the state of giant stars shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. First, it explodes to release energy, then expands into a super red giant, and finally collapses.",
+ "B. First, it expands into a super red giant, then explodes to release energy, and finally collapses.",
+ "C. First, it collapses, then expands into a super red giant, and finally explodes to release energy.",
+ "D. First, it expands into a super red giant, then collapses, and finally explodes to release energy."
+ ]
+ }
+ },
+ "oqeBvQRJ": {
+ "video_id": "oqeBvQRJ",
+ "video_duration": "128s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A frying pan and a saucepan are placed on a stovetop. Chopped red onions sit on a cutting board behind the pans. A hand turns on the stovetop burner under the frying pan. The hand pours oil from a red bottle into the frying pan. A spoonful of minced garlic is added to the oil. The chopped onions are added to the frying pan and mixed with the garlic using a blue spatula. The onions and garlic are stirred and cooked.Ground beef is added to the pan and broken apart with tongs, then stirred with the spatula. A glass salt shaker is held over the pan, then set aside. The hand shakes a red bottle over the pan. The ground beef continues to be stirred and cooked. Sliced mushrooms are added to the pan and stirred in with the meat.The camera moves, showing a woman in a white shirt in a kitchen. The camera moves back to the stovetop. A can of Campbell's Cream of Mushroom Soup is held up, opened, and the contents are poured into the frying pan. The soup is stirred into the meat and mushroom mixture with the spatula. A spoon is used to scoop sour cream from a container into the frying pan. The sour cream is stirred into the mixture with the spatula.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many bottles of cream of mushroom soup are shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. One bottle.",
+ "B. Four bottles.",
+ "C. Three bottles.",
+ "D. Two bottles."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the final state of the water in the right-hand pan shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. It is in a boiling state.",
+ "B. It is in a calm state.",
+ "C. It is in a bubbling state.",
+ "D. It is in a solidified state."
+ ]
+ }
+ },
+ "BOaVJnLe": {
+ "video_id": "BOaVJnLe",
+ "video_duration": "60s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a headset sits in front of multiple computer screens displaying flight and terrain data. The scene shifts to a large screen showing a map with two aircraft icons and a flight path. A man in a military uniform speaks in profile view, with a map visible in the background. A row of individuals wearing headsets sit at computer consoles. The perspective changes to a drone in flight above the clouds under a bright moon. The camera focuses on the drone\u2019s body as it moves. Back at the computer screens, a map shows two aircraft approaching a target, marked by a circle. A different drone flies above the clouds, releasing three missiles. On the map, the aircraft converge on the target, after which the target is marked as destroyed. Another drone flies over the clouds, and the camera view descends to show missile bays opening and releasing missiles. The drone banks and turns. The scene cuts back to the large screen where the target is marked as destroyed, then to the man in uniform. The first drone appears again, flying above the clouds, and rotates to show its underside and open missile bays. Three missiles are released, after which the drone plummets toward the ground. The large screen shows the target destroyed. The man in the military uniform speaks again.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "Who is instructing others on what to do?",
+ "answer": "C",
+ "candidates": [
+ "A. A man with black hair is instructing a bald man on how to pilot a fighter jet.",
+ "B. A bald man is instructing a man with black hair on how to operate a computer.",
+ "C. A man with black hair is instructing a bald man on how to operate a drone and bomb the target.",
+ "D. A bald man is guiding a man with black hair along the correct path to the destination."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "Which of the following is the correct sequence for using a missile to destroy a target as shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Threat detection, target confirmation, data analysis, drone flight, missile launch, target destruction.",
+ "B. Drone flight, threat detection, data analysis, target confirmation, missile launch, target destruction.",
+ "C. Threat detection, drone flight, target confirmation, data analysis, missile launch, target destruction.",
+ "D. Data analysis, threat detection, target confirmation, drone flight, missile launch, target destruction."
+ ]
+ }
+ },
+ "eyENnlBD": {
+ "video_id": "eyENnlBD",
+ "video_duration": "114s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A cartoon of a flower pot with a face and arms is on the right of the screen. The flower pot has a long white beard and a plant with purple flowers is growing out of it. The flower pot is holding a wand with a star on the tip and has sparkles around it. Text appears on the left of the screen that reads \"Lavender and Magic\". Next, there is a shot of an empty hotel hallway, with a flower pot with a plant growing out of it on the right, now brandishing a knife. The flower pot has a speech bubble that reads, \"Come on out demons! I dare you.\" After this, there is a shot of a bedroom, with a cartoon flower pot with a lavender plant floating on top, and the same purple flowers hanging off the bed posts. Next, there are shots of close ups of leaves and then the same cartoon flower pot, with a plant growing out of it, is shown next to the leaves. After that, there is a shot of a laptop that is on fire, followed by a close-up of an incense stick. Next, there is a shot of a house, with a cartoon flower pot floating on the side of it, and then there are some close-up shots of various flowers, followed by a scene from a cartoon movie of a fairy village, and then a shot from a movie with a group of people. After this, there is a shot of a leprechaun character, followed by a close up shot of a white mug next to a pile of lavender. The next scene shows a man in a suit, with the cartoon flower pot next to him, on the right of the screen. This repeats with a slightly different looking flower pot, and then changes to a black screen with floating purple hearts with text that reads \"Lavender & Love\" next to the cartoon flower pot with two hearts for eyes. The man in the suit reappears and is speaking, with the scene then switching to a shot of a movie with two people kissing, and the same cartoon flower pot in the corner. After this, the man in the suit reappears. Next is a shot from a painting with five people. 
On the right of the painting, the people are holding a cup with lavender in it. The scene switches to a dark screen with burning candles. Then a cartoon flower pot with a face appears with a circle of candles in the background, followed by a purple bottle of perfume and then a heart made of candles. The man in the suit reappears. Then the video shows some lavender stems, and the man in the suit waving his hand. The scene changes to a shot of mountains with a lake in the foreground, with a cartoon flower pot on the right and a pillow that has a face on it that looks like it is floating in the air. The same scene appears again, with the flower pot having heart eyes. Then there is the same shot with a flower pot next to a pillow with a heart appearing in a thought bubble above the flower pot. The man in the suit reappears with a series of facial expressions.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where should lavender be placed on the bed to provide protection for the home?",
+ "answer": "B",
+ "candidates": [
+ "A. At the foot of the bed.",
+ "B. On the bedposts.",
+ "C. Under the pillow.",
+ "D. At the head of the bed."
+ ]
+ }
+ },
+ "dGrTPQhY": {
+ "video_id": "dGrTPQhY",
+ "video_duration": "390s",
+ "duration": "6-8min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man wearing glasses and a black t-shirt faces the camera. The camera pans to a paved area where people have gathered under tents. A small black remote control car enters the frame, moving from right to left. It crosses the paved area and drives off screen, leaving tire marks on the pavement. The camera pans to show a static airplane display in the background. It then pans further to show a billboard with the words \"BIG DADDY DON GARLITS Museum\" written on it. Next, the camera pans to a building with a sign reading \"SMITH GARAGE\" and then to more tents. The camera returns to focus on the paved area. A black remote control car drives by and is then followed by a blue remote control truck. The camera focuses on the area of the tents as various remote control vehicles drive on the pavement: a small black car drifting and spinning and a small bright red dragster.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where is the man located?",
+ "answer": "A",
+ "candidates": [
+ "A. The racetrack finish line.",
+ "B. The racetrack maintenance area.",
+ "C. The racetrack starting line.",
+ "D. The racetrack spectator area."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the man in the black baseball cap doing to the race car in the middle of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Maintain the wheels.",
+ "B. Check the condition of the race car.",
+ "C. Clean the race car.",
+ "D. Cool down the race car."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What effects does a change in weather have on the man?",
+ "answer": "B",
+ "candidates": [
+ "A. Unable to watch the game.",
+ "B. Live broadcast is interrupted.",
+ "C. Unable to return home.",
+ "D. Equipment is damaged."
+ ]
+ }
+ },
+ "qqIiwBMF": {
+ "video_id": "qqIiwBMF",
+ "video_duration": "212s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a close-up of sushi rolls on a black slate plate. The scene then changes to a man in a brown sweater standing in a kitchen setting. He appears puzzled, scratching his head and gesturing with his hands. Next, a plate of soba noodles topped with seaweed flakes is shown, followed by a man in a suit examining a book. The video then displays various food items on a table in the kitchen setting, including oatmeal, mashed potatoes, toast, stir-fried meat, kimchi, roasted chicken, steak, rice, and more stir-fried meat. A painting of Napoleon on a horse against a backdrop of a burning building is shown. More images of food follow, including a seafood stew with lobster, roasted chicken, lobster in a red sauce, a chicken stew, and cavalry soldiers on horseback. The word \"NAPOLEON\" is superimposed on the image of the burning building. The soldiers are followed by a scene of more soldiers marching through a snowy landscape, then a close-up of chicken stew in a pan. The video cuts to a painting depicting Napoleon amidst cheering soldiers. A close-up of a man in a bicorne hat with a group of people follows, succeeded by close-ups of raw chicken pieces being arranged in a dish and then cooked in a pan with sauce. Images of soldiers marching through a village at night, more soldiers advancing, soldiers from an elevated position looking down at more soldiers below, and chicken stew appear. The video then shows a tomato being sliced, brandy being poured into a snifter, and a painting of Napoleon eating a meal with bandaged hands.\u00a0A battle scene with smoke and fallen soldiers is shown, followed by the words \"AN,\" \"MARCHES,\" and \"STOMACH\" appearing on the screen one after the other. Rows of preserved food in glass jars are displayed, followed by a battle scene on a snowy plain and soldiers marching through the snow. 
Images of cooked duck with oranges on a plate, \"CANARD A L'ORANGE\" written on the screen, and a knife slicing cooked duck in sauce follow. An orange is sliced on a wooden board, orange juice is squeezed using a citrus juicer, and the juice is poured into a glass measuring cup.\u00a0A painting of soldiers dining at a table and Napoleon on a horse amidst his soldiers are shown, along with images of lobster, \"LOBSTER THERMIDOR\" written on the screen, and Napoleon sitting at a table with a lobster on his plate. The video proceeds to show lobster tail being cooked, lobster meat being extracted with kitchen shears, cooked and stuffed lobster on a plate, lobster tails with parsley garnish, lobster tails being filled with a creamy sauce, and finally stuffed and baked lobster tails garnished with parsley and lemon wedges. A painting of Napoleon eating soup with other dishes on a table is shown, chicken meat seasoned with spices on a plate, various cooked chicken dishes, and chili in a bowl. Vegetable soup is ladled from a pot into a bowl, and chicken soup in a metal bowl is shown. A Roman mosaic depicting a feast and a variety of breakfast dishes on plates are displayed: eggs, salad, an omelette, meatballs, sliced roast, ribs, beans, salad, chicken wings, and more. Napoleon is shown eating a meal, then having a green soup, along with other foods on the table. Napoleon eats a dessert, Mille-feuille (Napoleon cake) is shown on a plate, cooked meat dishes, roasted chicken, raw meat, cooked poultry and a roasted pig are shown. Napoleon is depicted drinking champagne at a table laden with food, seated with men in military uniforms, also drinking,\u00a0and finally lying on a bed as though deceased while a man reads nearby. The video concludes with an image of a cup of tea in the kitchen setting.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the preparation process for the special-grade lobster appear?",
+ "answer": "B",
+ "candidates": [
+ "A. The process appears at the beginning of the video.",
+ "B. The process appears in the middle of the video.",
+ "C. The process appears throughout the video.",
+ "D. The process appears at the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "According to the video, what food did Napoleon typically enjoy for dinner?",
+ "answer": "B",
+ "candidates": [
+ "A. Buckwheat noodles and mackerel.",
+ "B. Japanese cuisine.",
+ "C. Chicken or beef.",
+ "D. Oatmeal, toast, and bacon."
+ ]
+ }
+ },
+ "mqnRjvVr": {
+ "video_id": "mqnRjvVr",
+ "video_duration": "145s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A woman with blonde hair, wearing a green shirt, is shown in a kitchen. She looks at the camera and speaks. A pan of red sauce with chopped green vegetables is shown, someone is seen adding more chopped green vegetables to it. The woman is seen again, looking at the camera and speaking. The woman is seen again, this time with a child next to her. Text reading \"Grow your own!\" is superimposed over the video. A hand is shown planting small plants in a garden bed. A man with dark hair, wearing a black shirt, is shown in a kitchen looking at the camera and speaking. A man is shown kneeling next to wooden planks. The man with dark hair, wearing a black shirt is shown again in a kitchen. A hand is seen making a \"rock on\" gesture over several raised garden beds. A hand is shown planting small plants in a garden bed. A close-up shows small purple plants. The man with dark hair, wearing a black shirt, is shown in a kitchen again. A hand is shown watering small plants. The man with dark hair, wearing a black shirt, is shown again in a kitchen. A woman is seen, tending to a vertical planter with green plants. The man with dark hair, wearing a black shirt, is shown again in a kitchen. A man in a plaid shirt is seen opening a box. The man in the plaid shirt is then seen carrying part of a vertical planter in a kitchen. A person in a gray shirt is seen setting up a vertical planter. The man with dark hair, wearing a black shirt, is shown again in a kitchen. The man is seen, placing white plant cups into the planter. The man with dark hair, wearing a black shirt, is shown again in front of the planter. The planter and a person's hand is shown in a close-up, then the man is shown again. A person is shown, picking leaves off the planter. The man with dark hair is seen, in front of the planter. A close-up shows the planter, along with some text. The man is seen, in front of the planter. A child is seen, approaching and looking at the planter. 
The man with dark hair is seen, in front of the planter. Text reading \"BONUS*\" with some other text is superimposed over the video. The man with dark hair is seen, in front of the planter.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What is the man's ultimate purpose for introducing the planting tower?",
+ "answer": "B",
+ "candidates": [
+ "A. To promote the advantages of planting towers.",
+ "B. To sell the planting tower.",
+ "C. To share planting experience.",
+ "D. To teach assembly skills."
+ ]
+ }
+ },
+ "SrvJcxKC": {
+ "video_id": "SrvJcxKC",
+ "video_duration": "84s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a volleyball match. The first segment displays a team of volleyball players wearing white and black uniforms playing against a team wearing black uniforms on a basketball court with volleyball lines and a net. A digital display at the top shows the score, reading \"Corsairs Set Point, Set 4, 25-18.\" The players on the court move about, jumping, raising arms and hugging each other. The next scene shows a different section of the match, where the displayed information on the court reads \"Beacons 5-0 Run, Set 5.\" The score at the bottom of the image is \"UMass Boston 7, UMass Dartmouth 5, Set 5, 2-2.\" The volleyball is passed back and forth across the net between the two teams. The next scenes show additional exchanges and plays on the court. The score at the bottom of the image is updated throughout. In the final sequence of the video the score reads, \"UMass Boston 14, UMass Dartmouth 7, Set 5, 2-2.\" A caption \"Final\" appears in the middle of the court along with a display that says \"3 Beacons\" over \"2 UMass Dartmouth.\" The team in black uniform runs toward the center of the court and huddles together, celebrating, and then the teams line up at the center of the court near the net to shake hands with one another.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What action does the player wearing the black jersey with the number 19 perform on the ball at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The player uses a two-hand forearm pass.",
+ "B. The player catches the ball in the palm of their hands.",
+ "C. The player attempts a spike while blocking with two hands.",
+ "D. The player performs a two-hand serve."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times do the players wearing black uniforms attack in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three times.",
+ "B. One.",
+ "C. Five times.",
+ "D. Eight times."
+ ]
+ }
+ },
+ "bOvIJFzJ": {
+ "video_id": "bOvIJFzJ",
+ "video_duration": "115s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a red jacket, brown pants, and a dark beanie walks on a frozen lake toward a small wooden ice fishing setup with a red flag. The person kneels by the fishing hole and removes a small black item from their pocket. They then adjust something on the fishing setup. The person lowers something into the hole in the ice. A second person, wearing dark pants and jacket, approaches and kneels by the hole. They reach into the hole and pull out a fish, a northern pike. The second person holds the fish while facing the camera. The second person removes the hook from the pike's mouth. They then lower the fish back into the hole, head first, and release it.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What action did the man in the video take after catching the fish?",
+ "answer": "B",
+ "candidates": [
+ "A. He places it in a bucket.",
+ "B. He releases it back into the water.",
+ "C. He stuns it by hitting its head.",
+ "D. He cooks it on the spot."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What type of fish was caught in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Crucian carp.",
+ "B. Grass carp.",
+ "C. Snakehead fish.",
+ "D. Sea bass."
+ ]
+ }
+ },
+ "XWuxbcDK": {
+ "video_id": "XWuxbcDK",
+ "video_duration": "56s",
+ "duration": "<1min",
+ "domain": "Culture & Politics",
+ "sub_category": "Cultural Explainers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video opens on a black screen with text and icons in the top left and bottom right corner. The video then shows an aerial view of two tall buildings. The shot then transitions to a statue of three people on a large rock, shot from eye-level, then shifts to a low-angle shot of two people running up stone steps toward the statue. The video shifts to a low-angle shot of a city, and then to a field with leafy plants where two people shake hands. Then the two people briefly hug. Next, a man wearing a black t-shirt and khaki pants stands in a field with his hands raised and another man with dreadlocks is behind him playing a guitar. Another man sits behind them. After several shots of the same scene, the man in the black t-shirt moves closer to the camera and shakes another man's hand. Then, a man wearing shorts stands looking over the edge of a cliff. Then, a long shot of a bridge with two points that come together and are angled over a body of water. The shot is repeated. Next, the video shows a man eating with a street in the background, followed by a long shot of a beach with waves. The scene cuts to an aerial view of the same beach with a river running into it. This scene is repeated. The video then transitions to a train crossing a bridge, and shows the same scene again. It then cuts to a waterfall surrounded by foliage. Then it cuts to a lone boat on a body of water. The next shot shows two men standing in water with long sticks. Then there is a shot of a beach with a city in the background. Then an aerial view of a city with a beach in the foreground. The video shifts to a couple walking in water, followed by a shot of a train at a station. The next shot shows a building with people standing on the street in front of it. The video then shows cars on a city street, then a man is seen standing in front of a building with a fence in front of him. He is seen with his arms raised. The video shows the same scene again. The scene changes to an aerial view of a city and then to an aerial shot of a beach. This shot is repeated and then the video shifts to an aerial shot of a beach with people. Then, the man is seen standing by a city street and pointing. The man puts his hands on his face and looks at the camera. The man is seen sitting in the doorway of a van, and then a man closes the door. The same shot is repeated. The man is then seen doing pushups on a rocky surface. This is repeated several times. Then, a man is seen standing on a cliff edge. The video ends with a map of Africa.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video is the man swimming in the water?",
+ "answer": "B",
+ "candidates": [
+ "A. At the beginning of the video.",
+ "B. In the middle of the video.",
+ "C. Throughout the entire video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What might the hobbies of the black man in the video be?",
+ "answer": "C",
+ "candidates": [
+ "A. Pilgrimage.",
+ "B. Singing.",
+ "C. Traveling.",
+ "D. Photography."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many types of transportation appear in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. Three.",
+ "D. Five."
+ ]
+ }
+ },
+ "dACVDmUZ": {
+ "video_id": "dACVDmUZ",
+ "video_duration": "47s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Three people are standing on a stage, the woman to the left, a man in a blazer in the middle and a second man to the right, in front of a white backdrop. A large sign is standing to the left side of the three people, with the text 'TEDx Asheville' vertically displayed. The woman has her blonde hair in a pony tail and is wearing a dark shirt and blue jeans, and is clapping with both of her hands. The first man is wearing glasses, and a blue and white striped shirt under a black blazer and tan colored pants. The second man is wearing a black shirt, with grey pants and a black belt. The first and second men touch palms. They each rub their arm and clap their hands, then they move their hands toward each other and touch them again. The three of them clasp their hands together, then each hold their hands out to the side, palms facing forward. The three of them continue the hand gestures. The camera zooms in on the three of them, then cuts to a wide angle view of an audience. The video returns to a closer view of the three people on stage as they continue the hand gestures.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the bald man with glasses forcefully twist the fingers of the woman with her long golden hair tied up, whose hands are crossed?",
+ "answer": "A",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. Throughout the video.",
+ "C. At the beginning of the video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "PMlyNmbX": {
+ "video_id": "PMlyNmbX",
+ "video_duration": "109s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Physics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A young person wearing a blue and yellow baseball hat and a light-colored shirt is seated in what appears to be a sports stadium or arena. A woman with blonde hair and glasses sits next to them. The same woman is shown in a different scene in a classroom, with a man wearing a blue shirt that says \"Physics\" passing in the background. A young person wearing a blue \"Berkeley\" sweatshirt pulls a table cloth off of a table and some items stay put. A person is shown touching a metal rod with a weight suspended from a string. A person is shown touching a \"Danger High Voltage\" box. A person is shown holding a white ball and then letting it swing in a pendulum motion. A young person is shown spinning in a chair while holding white weights, and then spinning a bicycle wheel while seated in a chair. A young person is shown near a vacuum tube. A person is shown with a colored thermal image projected behind them. A black cube is shown levitating over a circle. A person is shown inflating and then freezing a balloon. A person is shown putting a pipe over a balloon. A person is shown holding a silver tube which is emitting smoke. An audience is shown in what looks like a lecture hall as boiling water is shown spraying steam. A person is shown holding a metal rod near sparking electrical connections. A person is shown holding a light tube that lights up. A young person is shown touching a metal orb that makes their hair stand on end. A person is shown using electromagnets. A person is shown touching a plate that levitates. A green laser light is shown traveling through a tank. An infrared thermal image of an audience is shown on a screen. The video ends with a blue screen and yellow text \"Berkeley University of California.\"",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What is the purpose of the professor shining the green light on the glass in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. To absorb light.",
+ "B. To display colors.",
+ "C. To measure the temperature.",
+ "D. To measure the glass."
+ ]
+ }
+ },
+ "BQkzFmeP": {
+ "video_id": "BQkzFmeP",
+ "video_duration": "126s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Family Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black hoodie and another person wearing a white top and black jacket are seated next to each other holding a baby. The person in the black hoodie is holding the baby in their hands, and both people are looking at the baby. A third person wearing a black hoodie and grey shirt washes their hands at a sink in a kitchen. They proceed to use a spray bottle and wipe down a counter. They place their hands on their chest and then sit on a couch. The first person wearing a white shirt and black jacket approaches them and passes the baby to them. The first person moves to a chair next to the couch. The third person holds the baby and talks to it, and then starts to cry and wipes their face. The first person in the white shirt and black jacket hugs them. The third person holding the baby looks at and talks to the camera. The first person looks and smiles at the camera and gives a thumbs up. The third person continues to look and talk to the camera. The third person passes the baby to the first person again.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many people in the video hugged the baby?",
+ "answer": "D",
+ "candidates": [
+ "A. Four.",
+ "B. Zero.",
+ "C. One.",
+ "D. Three."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the woman with long brown hair in the video compared to the young woman with long black hair?",
+ "answer": "B",
+ "candidates": [
+ "A. To her right.",
+ "B. In front of her.",
+ "C. To her left.",
+ "D. Behind her."
+ ]
+ }
+ },
+ "TLDGACjL": {
+ "video_id": "TLDGACjL",
+ "video_duration": "270s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view from inside a moving car. The driver, a person wearing a white long-sleeved shirt, a dark vest, and a dark baseball cap, holds the steering wheel. The driver can be seen talking. The view changes, showing a rotating circular tunnel-like structure with several bolts or screws protruding outward and a black disk with white writing in the center, rotating within it. The video shows the same driver again, now driving in the city and continuing to talk, sometimes gesturing with their hands. Then the video shifts to show the car's side mirror and the road behind. The camera then shifts to show the dashboard, showing a speed limit sign, the car\u2019s display, and a camera mounted at the top of the car\u2019s windshield. The view shifts again to show the car's front view of traffic. The video changes to show a close-up of the car\u2019s display, then the person's hand touching the car\u2019s display, then again a front-view of traffic. Next the camera shows a person's hand shifting the gear stick. Another front view of traffic is shown, then a mobile phone displaying a map app. The video changes to show two feet in blue shoes pressing the car pedals. Again, a view through the windshield shows the traffic on the road, then the driver\u2019s feet operating the pedals, then a view of the cars through the side window, and finally a view through the car's side mirror, revealing a black car driving at a faster speed. Next, the car and a white car are shown from an outside point of view, and this sequence is repeated several times, before the driver is shown again from the inside of their car. The driver can be seen talking. A few more times, the point of view switches back and forth between the driver inside the car and the traffic on the road. The video concludes with a shot of the car's interior and exterior, the driver opening and getting out of the car, and a view of the car's license plate.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What gear is the car that the man is driving in the video in?",
+ "answer": "B",
+ "candidates": [
+ "A. Fifth gear.",
+ "B. Sixth gear.",
+ "C. Fourth gear.",
+ "D. Seventh gear."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In what type of environment was the man driving in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Urban highway.",
+ "B. Regular road.",
+ "C. Unrestricted-speed highway.",
+ "D. Tidal flow lane."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many sports cars appear in total in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. 0.",
+ "B. 2.",
+ "C. 1.",
+ "D. 3."
+ ]
+ }
+ },
+ "LjpQKRih": {
+ "video_id": "LjpQKRih",
+ "video_duration": "40s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with glasses and a baseball cap sits in a black chair at a white desk, gesturing with his hands. A microphone and computer are visible on the desk. The scene changes to a point-of-view shot of someone swinging on a swing set in a park-like setting with trees and grass. The camera then shows a person holding a camera in snowy terrain, looking down at the flip-out screen. Next, a hand drops a small white Insta360 camera into a clear drinking glass in a stainless steel sink. The perspective shifts to inside the glass as water is poured in. The scene transitions to a white watering can pouring water into a potted plant. The perspective becomes a point-of-view shot from inside the watering can's spout as water flows out. A man is then seen through a gate holding a round metal grill. The initial man from the desk reappears, followed by a shot of a hand placing the Insta360 camera into a dishwasher filled with plates and other dishes. The camera is placed near a stack of light blue/grey bowls. The dishwasher is then shown from a side angle, and a hand reaches in to adjust the contents. Finally, the point of view is from inside the dishwasher as a person in a floral shirt adds dishes.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, what is the sequence in which the Insta360 Go 2 camera captures the items at home?",
+ "answer": "C",
+ "candidates": [
+ "A. Grill, watering can, glass cup, kitchen utensils.",
+ "B. Grill, watering can, kitchen utensils, glass cup.",
+ "C. Glass cup, watering can, grill, kitchen utensils.",
+ "D. Watering can, glass cup, grill, kitchen utensils."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the man in the video film the watering can?",
+ "answer": "A",
+ "candidates": [
+ "A. To display different camera angles.",
+ "B. To enrich the scene with visual material.",
+ "C. To demonstrate filming techniques.",
+ "D. To show the act of watering."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "In the fourth scene filmed by the man in the video, is there anything on the grill?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes, there is something.",
+ "C. No, there is nothing."
+ ]
+ }
+ },
+ "PVdCHTGR": {
+ "video_id": "PVdCHTGR",
+ "video_duration": "176s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Nutrition & Health",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a light brown baseball cap with the words \"LIVE FIT.\" on it, a black sleeveless shirt with a white logo, and black shorts is shown. They have a cross earring in their left ear, and they have tattoos on their left arm. This person is adding weight plates to a barbell, secured on a rack, using a green resistance band. Then they sit on a weight bench and perform a bench press exercise, lifting the barbell up and down several times. They take a break between sets, speaking to the camera while standing, and then resume the bench press exercise. At one point during the bench press exercise, they wear over-the-ear white headphones. They can also be seen in different locations in the gym, including one with a green turf floor. They have a water bottle with a pink lid on the ground next to them during the bench press exercise on the black floor. There are several other pieces of workout equipment in the background, as well as a basketball court, other people, exercise balls, and a set of vintage vehicles.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Why does the voice of the Black man in the video change?",
+ "answer": "C",
+ "candidates": [
+ "A. His distance from the camera increases.",
+ "B. The change is deliberately made to create atmosphere.",
+ "C. Increased weight requires concentrated attention.",
+ "D. The background music masks his voice."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What might the man do next?",
+ "answer": "B",
+ "candidates": [
+ "A. Take a bath.",
+ "B. Relax his muscles.",
+ "C. Go for an aerobic run.",
+ "D. Perform anaerobic weight training."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many dumbbell plates remained on the dumbbell at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Four.",
+ "B. Six.",
+ "C. Eight.",
+ "D. Ten."
+ ]
+ }
+ },
+ "eFWXQpXN": {
+ "video_id": "eFWXQpXN",
+ "video_duration": "510s",
+ "duration": ">8min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A bald man with a beard and glasses sits at a picnic table outdoors with a woman. There is a rock, a stress-strain curve diagram, crayons, paper, and a banana-shaped stress reliever on the table. The man picks up the rock and the diagram changes to depict force and deformation. The man attaches a piece of wood to a rope hanging from a ladder. He pulls the wood downwards and lets it go. Back at the table, he gestures to the diagram, which now shows the elastic and plastic regions. He picks up a white cloth, stretches it, and then crumples it. The diagram now shows stress and strain. He holds a piece of black material, stretches it, and releases it. He then holds two crayons and bends them, one breaking. He speaks while the woman draws on a piece of paper. The scene cuts to a woman standing against a starry background, explaining the terms \"Elasticity\" and \"Elastic Limit/Yield Strength.\"",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What happens to the socks in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The socks remain unchanged.",
+ "B. The answer is unclear.",
+ "C. A forceful pull causes the socks to deform.",
+ "D. The socks are pulled apart."
+ ]
+ }
+ },
+ "FkGWIiLo": {
+ "video_id": "FkGWIiLo",
+ "video_duration": "545s",
+ "duration": ">8min",
+ "domain": "Tech & Science",
+ "sub_category": "Science Explainers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man with dark hair and a white, short-sleeved, collared shirt sits in a chair outdoors. Behind him to the left, a bush with many yellow flowers is visible. Behind him to the right stands a tall, thin, dark pole with a brown woven decoration near the top. The man speaks directly to the camera. Words and phrases appear on the screen intermittently as he speaks, including \u201cFatigue,\u201d \u201cShortness of Breath,\u201d a screenshot of a clinical trial description, \u201cOxygen Requirement,\u201d \u201cICU Weakness,\u201d \u201cCognitive + Brain Effects,\u201d and a screenshot of an academic paper title and excerpt. The man gestures with his hands as he speaks.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What topic is the man in the video discussing?",
+ "answer": "A",
+ "candidates": [
+ "A. The symptoms of COVID-19.",
+ "B. The global impact of COVID-19.",
+ "C. The causes of COVID-19.",
+ "D. The viral characteristics of COVID-19."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotional state of COVID-19 patients after being discharged from the hospital?",
+ "answer": "D",
+ "candidates": [
+ "A. Sad and distressed.",
+ "B. Melancholy and downhearted.",
+ "C. Angry and irritable.",
+ "D. Excited and joyful."
+ ]
+ }
+ },
+ "oyxUBQwj": {
+ "video_id": "oyxUBQwj",
+ "video_duration": "56s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with four animated characters displayed in a video conference style format. The top left screen shows a character, labeled \u201cP. Spark\u201d, sitting on a purple background with a black rectangle in the background and a video game box next to him. The top right screen shows a blue character labeled \u201cGary\u201d. The bottom left screen shows a yellow character labeled \u201cSarah\u201d. The bottom right screen shows a black-haired character labeled \u201cAmilyta\u201d sitting in a dark room. The next screen is entirely black, followed by a screen with white text that reads \u201cDON\u2019T ADMIT TO DOING SOMETHING ILLEGAL\u201d with a blue skull icon with crossbones on the left, followed by the same screen zoomed in. The following screen shows five animated characters in a video conference style format. The top left screen shows a character, labeled \u201cOaxezem\u201d, with a monitor displaying stick figures in the background. The top middle screen shows a character labeled \u201cMatt\u201d with a room and window in the background. The top right screen shows a character labeled \u201cPapi\u201d with a table and shelves in the background. The bottom left screen shows a character labeled \u201cCrayCray\u201d with a door and shelves in the background. The bottom right screen shows a character labeled \u201cGlass\u201d with a bed in the background. A green box highlights each character's screen for a couple of seconds each in the order of Glass, Papi, Matt, Glass, Papi, Matt, Glass, and then Matt. Matt\u2019s screen then displays a cartoon of himself eating a container of milk. Then Matt\u2019s screen shows his character holding a box of crackers, and Glass appears to speak into a microphone. Then a drawing appears on Oaxezem\u2019s screen, and then Papi appears to hold a magnifying glass to his eye. Then Matt\u2019s character\u2019s screen shows his face changing, and Oaxezem\u2019s character\u2019s screen has an error message displayed. Then Matt\u2019s character\u2019s screen shows him with tears on his face. Oaxezem\u2019s character\u2019s screen displays a shattered screen, then Matt\u2019s screen is black, then Papi\u2019s screen is tan colored. Then CrayCray\u2019s screen shows him flailing, then the screen goes black. The next screen is a zoom conference with three characters in view: a yellow character with headphones labeled \u201cCreek\u201d on the top left, a black-haired character labeled \u201cGabe\u201d in the top right, and an alien character in the bottom left labeled \u201cUbi\u201d. Green text that reads, \"Like, Comment, Subscribe!\" is in the center top of the screen. Gabe and Ubi make various movements, and then the screen shows an end screen of a zoom meeting.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In a five-person video chat, what is the person in the upper-left corner doing?",
+ "answer": "C",
+ "candidates": [
+ "A. The person is listening to the chat and not actively doing anything.",
+ "B. The person is lying in bed asleep.",
+ "C. The person is holding a controller and playing a game.",
+ "D. The person is sitting behind a table."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "In a video call involving five individuals, how many participated in the discussion?",
+ "answer": "D",
+ "candidates": [
+ "A. Five.",
+ "B. One.",
+ "C. Four.",
+ "D. Three."
+ ]
+ }
+ },
+ "RBVOffxi": {
+ "video_id": "RBVOffxi",
+ "video_duration": "229s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view of a table laden with various dishes: two plates of a salad made with pale green noodles and orange pieces, one plate of cooked leafy green vegetables, one plate of a dark colored stew-like dish, one plate of pieces of dark-red meat in sauce, a plate of light green leaves, two small bowls of a dark liquid, one small bowl with green chilis, a clear jar of multi-colored straws, two bamboo containers with white rice, a clear glass of water with a straw and some spoons and forks in dishes. Two hands appear on the bottom left and a person is briefly visible at the top right. The camera moves to focus on the two people at the table and they gesticulate while looking at the camera and they appear excited. The camera angle changes and it can be seen that there are other people eating at other tables in the room, and that there is a refrigerator in the back. The two people at the table begin eating, and the camera focuses on a plate of the dark red meat and sauce while a piece is picked up with a fork, then a spoonful of sauce is added to the plate. Then the camera moves back to the two people and they continue eating, and a dark, stew-like dish is brought into view. One of the two people picks some of the stew up with a spoon and eats it. Then a piece of white rice is held up and eaten, and then more rice is dipped into the dark stew. Then the camera moves and we see a woman, smiling at the camera, who is in an interior space with various objects. A scene of the two people from earlier is layered over this woman. The camera moves to a close-up view of the various dishes, including a bowl with the salad made of white noodles, another bowl with a dark stew and other dishes. The video then shows two people eating sticky rice from a bamboo container. Then they continue eating and the camera moves to show a wide angle view of the dining area, and of the exterior of the restaurant. Finally, the camera goes to a still shot of what looks like a shrine within the restaurant.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the woman in the white top doing as the woman in the green apron walks toward the back in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Eating.",
+ "B. Smiling.",
+ "C. Drinking water.",
+ "D. Looking at a phone."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a motorcycle in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. I don't know."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man eat in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. Five.",
+ "C. Six.",
+ "D. Four."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where does the video end?",
+ "answer": "C",
+ "candidates": [
+ "A. A store.",
+ "B. A Buddha hall.",
+ "C. A temple.",
+ "D. Cultural Street."
+ ]
+ },
+ "task4": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the bald man in the video doing with the bamboo basket and glutinous rice?",
+ "answer": "B",
+ "candidates": [
+ "A. Checking the color.",
+ "B. Dipping it in soup.",
+ "C. Forming it into a ball.",
+ "D. Absorbing oil."
+ ]
+ }
+ },
+ "EOrWuzrp": {
+ "video_id": "EOrWuzrp",
+ "video_duration": "240s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a split screen, one side a man wearing a blue jacket speaking, labelled \"STUDIO\", the other side a woman speaking, labelled \"WELLINGTON\". The background of the woman shows a landscape with water and a sunrise. A banner at the bottom of the screen reads \"'Urgent action' needed to address climate change following floods - Swarbrick\". There is also a Candy Crush Saga ad in the corner. The text on the right side of the screen reads \"Climate Change Urgent Actions Needed\". The individuals on screen are engaged in speaking with changing facial expressions. Various short lines of text appear on the news banner under the individuals, along with date and time stamps.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "What emotion does the entire video convey?",
+ "answer": "B",
+ "candidates": [
+ "A. Gentle and soothing.",
+ "B. Anxious and urgent.",
+ "C. Dark and gruesome.",
+ "D. Light and joyful."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the most likely occupation of the man in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. TV host.",
+ "B. Stand-up comedian.",
+ "C. TV journalist.",
+ "D. Radio host."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the first action the affected countries might take?",
+ "answer": "C",
+ "candidates": [
+ "A. Change leadership.",
+ "B. Develop land.",
+ "C. Improve transportation.",
+ "D. Build houses."
+ ]
+ }
+ },
+ "fIGnSkpJ": {
+ "video_id": "fIGnSkpJ",
+ "video_duration": "139s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view of an empty asphalt path in the middle of a forest of bare trees. Green grass and some leafy undergrowth are visible along either side of the path. The camera moves forward along the path, following it through a bend. The next shot shows a pair of feet in athletic shoes walking on a similarly textured ground surface. The camera shifts to another shot of the asphalt path winding through the woods. A shot of two dogs comes next, a brown one and a lighter tan one, near a metal fence. The video switches to a woman with blonde hair wearing a gray baseball cap and a gray long-sleeved shirt, who appears to be speaking and is indoors. A second person with brown hair and a mustache enters the frame and walks past behind her. She moves to a different part of the room and holds up a small grey rectangle, then she moves closer to a window. She holds up a different grey color card. The woman moves to a different, larger room, and holds up a grey-green rectangle. She is indoors wearing a gray zip up jacket, and a gray hat. She holds a piece of white paper then she walks into another room with a grey hat. She takes protein powder out of a black container. She is wearing a gray jacket, and has long blonde hair. She uses a white scoop to add the protein powder to a clear container. Next, she pours honey from a jar into the container. The next shot shows the contents of the container being blended, then a brown liquid is poured from the blender into two glasses. She holds her hands up with the glasses of liquid in front of her. The last portion of the video shows the woman wearing a gray hat, and a puffy camouflage jacket. She is holding and looking through a notebook. She is still in an indoor space. She points at something in the notebook. Then she puts the notebook away.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Where does the blender appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the end of the video.",
+ "C. Throughout the entire video.",
+ "D. At the beginning of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color does the woman in the video want to change the window decoration to?",
+ "answer": "B",
+ "candidates": [
+ "A. Deep blue.",
+ "B. Light brown.",
+ "C. Dark green.",
+ "D. Black."
+ ]
+ }
+ },
+ "wVBKUWgw": {
+ "video_id": "wVBKUWgw",
+ "video_duration": "60s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man and a woman wearing headsets and military uniforms are in a dark room with slatted walls. The man sits in a chair while the woman stands behind him. The scene changes to a first person view of a soldier in a helmet with goggles, pointing a rifle forward. In the corner of the screen, a circle shows a green outline of a floorplan of the room they are in. The text \"Enhanced Survivability in Dangerous Operations\" is displayed on the screen. The scene changes to a second person perspective, this time the rifle being pointed toward the camera. The setting is now in a hallway. The soldier in the helmet moves forward through a door, onto a rooftop, where a person stands with hands raised. The first person view returns to point the rifle at the person. The scene changes back to a dark room where a man sits in a chair looking at several monitors, a woman is standing behind him to his left. Some of the monitors show digital schematics and geographic maps. The scene switches back to the first person view, where the soldier walks through a door to the rooftop and the rifle is pointed at a woman standing with her arms raised. The scene changes to show a soldier holding a rifle and looking at it closely. The scene changes to a black and white overhead satellite view of a city with yellow circles overlaid on various locations. The camera zooms out and rotates as it shows different views of the Earth, and then some text appears over a CGI image of the planet: \"Tactical Augmented Reality systems will revolutionize how Soldiers access information and conduct operations.\" Then the screen fades to black.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many people are holding guns in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Four.",
+ "B. One.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "IMSnbvOa": {
+ "video_id": "IMSnbvOa",
+ "video_duration": "201s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a person with a light complexion and brown hair that is braided. They are wearing a pink collared shirt. They are standing in front of a blue background, with a black microphone in front of their mouth. They are shown speaking. The video then cuts to a wider view with the person speaking while sitting at a table with three other people. The video also includes text overlay in the upper left that reads \"VICE News\" and at the bottom that reads \"Source: United Nations\". The video continues to show the person at various angles, sometimes speaking, sometimes not.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, what is the position of the girl wearing the pink shirt relative to the woman wearing glasses?",
+ "answer": "A",
+ "candidates": [
+ "A. The girl in the pink shirt is to the left of the woman wearing glasses.",
+ "B. The girl in the pink shirt is behind the woman wearing glasses.",
+ "C. The girl in the pink shirt is in front of the woman wearing glasses.",
+ "D. The girl in the pink shirt is to the right of the woman wearing glasses."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between leaders and the younger generation as mentioned in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The younger generation depends on the support of leaders.",
+ "B. Leaders rely on the younger generation to solve problems.",
+ "C. Leaders depend on the younger generation to cultivate values.",
+ "D. The younger generation relies on leaders to solve problems."
+ ]
+ }
+ },
+ "pnxTBfek": {
+ "video_id": "pnxTBfek",
+ "video_duration": "82s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Camping",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows two individuals, each with a beard, standing outdoors near two vehicles. The first individual wears a dark jacket and a beanie. He is shown mostly in close-up, speaking towards the camera. At one point, he gestures with his hand and his mouth opens and closes as if speaking. The second individual wears a camouflage jacket with a yellow panel on the chest and hood, and a white hood lining. He is shown inside the back of an open white vehicle. The first individual appears in frame again, on the side of the white vehicle, alongside the second individual. The second individual is then shown holding a clear plastic bottle. He puts a black cap on the bottle and appears to demonstrate the device. The first individual is shown again near the second individual and next to the white vehicle. The second individual appears again holding the same clear plastic bottle and a cap with a nozzle, while speaking towards the camera. He gestures with his hands and moves the cap and nozzle on the bottle. He then shows the bottle and nozzle up close to the camera. The video shows the first individual, then a close up of the second individual, then both of the men in frame together, followed by multiple close-up shots of the second individual as he shows the plastic bottle with the cap and nozzle to the camera.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, in which direction is the black SUV relative to the man wearing yellow camouflage?",
+ "answer": "D",
+ "candidates": [
+ "A. Behind.",
+ "B. In front.",
+ "C. On the left rear side.",
+ "D. On the right rear side."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where is the man wearing camouflage clothing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. On the grassland.",
+ "B. Deep in the mountains.",
+ "C. On the snowy ground.",
+ "D. Outside the house."
+ ]
+ }
+ },
+ "vozVJzmH": {
+ "video_id": "vozVJzmH",
+ "video_duration": "173s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a blue shirt and black pants is standing near a robot. They are holding an orange liquid in a glass. The person smiles then walks away and stands behind a couch. Then a person wearing a black and white striped long sleeved shirt and blue jeans stands at a counter, looking away from the viewer. The person walks away then back towards a white object on the floor and removes a white cloth, revealing a black motorcycle. Then the person in the blue shirt walks towards a container of white objects and appears to be removing one, and walks toward the person in the striped shirt, and they both walk toward the middle of the room. The person in the blue shirt then gets on the motorcycle. The person in the striped shirt is standing near the couch and appears to be speaking. The person in the blue shirt gets off the motorcycle, then the person in the striped shirt speaks and then they walk towards each other. The person in the blue shirt crosses their arms then the person in the striped shirt moves their arms up and down and then puts a hand on the other person's arm. The person in the blue shirt stands with crossed arms and the person in the striped shirt is speaking. Then a person with black hair is shown sitting at a desk. A person with blonde hair, wearing a black and white striped shirt underneath a brown jacket walks through a door. Then the person with blonde hair is shown speaking to the person with black hair who is seated at a counter. The person with black hair is then shown laying their head on the counter while the other person stands by them. Then the person with blonde hair is shown walking towards a door, which she opens, revealing two men in black tuxedos. The person with blonde hair then walks back into the room. The person with black hair walks over to the counter and picks up a tissue. A person wearing a military uniform walks in the front door. Then the two men in tuxedos, the person with blonde hair, and the person in the pink dress are shown in the same frame. The person in the military uniform moves into the room with a bag and a hat, and they are then shown hugging the person in the pink dress. The two men in tuxedos and the person with blonde hair are also in the same frame. Then the person in the blue shirt is shown wiping their mouth, and then hugs the person in the military uniform. Then all of the other people from the room join in the frame, then the scene changes to people dancing and goofing off, some in hats.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what moment did the man in a black suit appear?",
+ "answer": "C",
+ "candidates": [
+ "A. After the music started.",
+ "B. After a man in a military uniform entered the room.",
+ "C. After someone rang the doorbell.",
+ "D. After someone screamed."
+ ]
+ }
+ },
+ "zOWIYSyZ": {
+ "video_id": "zOWIYSyZ",
+ "video_duration": "268s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A video game simulation of a boxing match between Deontay Wilder and Tyson Fury begins. Wilder, in red shorts, and Fury, in gold, circle each other. Wilder throws several jabs, while Fury focuses on blocking. The two boxers clinch briefly. They exchange more punches, with Wilder landing a few body shots. Fury backs away, maintaining his guard. Wilder continues to press forward. Fury throws a combination, but Wilder avoids most of the punches. Wilder lands a hard right hand, staggering Fury. Wilder follows up with a flurry of punches, and Fury falls to the canvas. The referee begins the count, but Fury does not get up, resulting in a knockout victory for Wilder. Wilder raises his arms in celebration.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What was the outcome of the competition between the athlete wearing red shorts and the athlete wearing yellow shorts?",
+ "answer": "D",
+ "candidates": [
+ "A. I do not know.",
+ "B. The competition ended in a tie.",
+ "C. The athlete wearing yellow shorts won.",
+ "D. The athlete wearing red shorts won."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the athlete wearing yellow pants first fall?",
+ "answer": "A",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the beginning of the video.",
+ "C. Throughout the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the video show a referee stopping a player from competing?",
+ "answer": "B",
+ "candidates": [
+ "A. I don't know.",
+ "B. No, it does not.",
+ "C. Yes, it does.",
+ "D. None of the above."
+ ]
+ }
+ },
+ "jqJqIhFx": {
+ "video_id": "jqJqIhFx",
+ "video_duration": "39s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Artificial Intelligence",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with four people near a red and black vehicle, and a white trailer with \"Matador Motorsports\" written on it. The vehicle is on a trailer, and one of the people has the hood up. The next shot shows the car on three wheels with a man crouching down and working on it. There is a red tool chest with a variety of stickers on it nearby. Next, the video shows a close-up of a tire with \"Hoosier\" written on it and other tires underneath. The next shot is a close-up of a logo on the vehicle. The logo includes \"SAE International\", a \"Formula SAE\" logo, and a car number, \"25\". Another shot shows a box with a yellow \"Race Car Xing\" sign and \"G-Force Racing Gear\" stickers. Next, a man with a beard and a black shirt is in front of a car. He is identified as \"Morgan Ryan\" and \"Electronic Vehicle Project Manager\". The same man gestures with his hands while sitting. The next shot shows two people in front of computer screens, looking at engineering drawings on the screens. One of the people is pointing at the screen. Another man looks at a computer screen showing an engineering drawing of a machine. The same person points at a section of the design. The next shot shows a person with a dark shirt looking at a laptop screen with an engineering drawing of a pipe. The video shows a wide shot of a vehicle being worked on in a shop with a \"SuperFlow CycleDyn\" platform and \"Hoosier Racing Tire\" and \"Welcome Race Fans\" banners. The next shot shows a man identified as \"Jacob Vorzimer\" and \"Internal Combustion/Project Manager\" with a black shirt. Next, a person is shown wearing protective goggles and working with a grinder. The next shot shows a person taping together pieces of pipe with writing on it. A person uses a power drill on a piece of metal. The last shot shows a person wearing protective gloves working with a grinding wheel and metal.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many polyvinyl chloride pipes are in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Eight.",
+ "B. Nine.",
+ "C. Ten.",
+ "D. Seven."
+ ]
+ }
+ },
+ "wpcKVUMn": {
+ "video_id": "wpcKVUMn",
+ "video_duration": "201s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman's hands place a small glass bowl of crushed red peppers next to chopped garlic and a whole lemon. The red pepper mixture is then scooped and placed in a pan with sizzling oil. Cooked pasta is then mixed into the pan with a fork. A woman in a blue shirt stands in a kitchen, gesturing with her hands. The camera focuses on the ingredients: a clove of garlic, a bowl of olive oil, a bowl of crushed red peppers, and half a lemon, all sitting on a wooden cutting board. The woman picks up the garlic and begins chopping it with a chef's knife. She sweeps the chopped garlic into a pile. She then picks up the bowl of crushed red peppers. The woman pours dried pasta from a blue box into a pot of boiling water. She stirs the water. She then pours olive oil into a separate pan on the stove. She adds the chopped garlic to the pan. The crushed red peppers are added to the pan and mixed with the garlic and oil using a wooden spoon. She zests the lemon over the pan. The pasta is drained using a colander and then added to the pan with the pepper and garlic mixture. The woman mixes the pasta, then adds salt. She squeezes lemon juice into the pasta, mixing continuously. Finally, she tastes the pasta. The plated pasta sits on a wooden cutting board, along with a lemon, a small wooden bowl and lid, and the box of pasta. The camera zooms in on the dish and then blurs, transitioning to a view of the kitchen counter.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At which part of the video does the woman add noodles to the soup?",
+ "answer": "B",
+ "candidates": [
+ "A. It does not appear.",
+ "B. The end of the video.",
+ "C. The middle of the video.",
+ "D. The beginning of the video."
+ ]
+ }
+ },
+ "fWOewqfr": {
+ "video_id": "fWOewqfr",
+ "video_duration": "551s",
+ "duration": ">8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Drawing Tutorials",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A sheet of white paper rests on a green surface. A black marker lies on the left side of the paper and several crayons are stacked on the right side of the paper. The text \"Supplies\" appears briefly above the white paper. The paper is then rotated so it is aligned with the screen. A hand enters the frame from the bottom of the screen and then exits the screen to the left. The hand then returns to the paper with the black marker, and begins drawing the outline of a sun, starting with a circular shape on the left side of the paper, and drawing small triangles along the edge of the circle to represent sunbeams. The word \"Sun\" is then written in the circle. The hand continues drawing dotted lines around the sun, to create an orbit, then draws small circles representing planets along the dotted lines. Below each of the small circles, a label for each planet is written, starting from the inside of the orbit with \"Mercury\", then \"Venus\", then \"Earth\", \"Mars\", \"Jupiter\", \"Saturn\", \"Uranus\", \"Neptune\", and finally \"Pluto\". The labels are not uniformly spaced. The hand then uses various colors of crayons to fill in the sun and each planet with different colors of crayon, and then uses various colors to add color to the orbital lines, before returning to the original view of the paper and drawing. The drawing then disappears and a teal shaped sign appears on the black background, it states \"Sign up now! Become a Subscriber Patty Fernandez Artist\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the sun painted in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Red.",
+ "B. Golden.",
+ "C. Yellow.",
+ "D. White."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is Mars's position relative to Earth as shown in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Mars is in the lower-left corner relative to Earth.",
+ "B. Mars is in the upper-right corner relative to Earth.",
+ "C. Mars is in the upper-left corner relative to Earth.",
+ "D. Mars is in the lower-right corner relative to Earth."
+ ]
+ }
+ },
+ "NnqNmOmP": {
+ "video_id": "NnqNmOmP",
+ "video_duration": "339s",
+ "duration": "4-6min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a screen recording of The Battle Cats mobile game. On the stage select screen, the player selects the \"Flappy Cat (Insane)\" level and presses the \"Attack!\" button. The game transitions to the battle screen, a grassy field with a stone tower on the left and a forest in the background. A white, blocky cat unit moves towards the right. A black, floating cat with bat wings appears from the tower and advances towards the right. More cat units are deployed, including a white sheep-like creature with antlers. An orange cat with a cannon appears on the right side of the screen. The various cat units fight against the black, floating cat and the white, antlered sheep. The battle continues with more cat units being deployed, some with unique appearances like a lizard-cat and a muscular, tank-top-wearing cat. The cat units eventually defeat the enemy units.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the maximum level that Worker Cat can reach?",
+ "answer": "C",
+ "candidates": [
+ "A. Level 6.",
+ "B. Level 7.",
+ "C. Level 8.",
+ "D. Level 9."
+ ]
+ }
+ },
+ "eLUQfzyU": {
+ "video_id": "eLUQfzyU",
+ "video_duration": "34s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Astronomy",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Three people are sitting at a round wooden table with papers spread out. The person on the left is wearing a dark shirt and has glasses on. The person in the center is wearing a patterned button-up shirt. The person on the right is wearing a light-colored shirt. The person on the right points at the papers. A hand holding a pen circles something on the papers. The person on the right looks directly at the camera. The person in the center is pointing at something on the papers. The person in the center is now writing on the papers.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What are the professions of the three men shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Journalist.",
+ "B. Astronomer.",
+ "C. Scientist.",
+ "D. Teacher."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Which planet shown in the video is closest to Earth?",
+ "answer": "C",
+ "candidates": [
+ "A. The second planet.",
+ "B. The first planet.",
+ "C. The third planet."
+ ]
+ }
+ },
+ "iaGPYkUk": {
+ "video_id": "iaGPYkUk",
+ "video_duration": "84s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Film Trailers",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A young, blonde, cat-eared girl looks to the side, as a hand rests on her shoulder. Text appears, reading \"A Boy Attempts The Impossible\". A large, multi-armed statue sits upon a lotus flower pedestal, as fire and smoke rage in a town below. A large, winged dragon-like creature with red hair roars amid the fire and smoke. Three characters, the cat-eared girl, a dark-haired man, and a pink-haired girl, look upwards with serious expressions. The reptilian eye of the dragon-like creature zooms into view. A robed figure sits on a throne, with a fire burning behind. A purple and teal light engulfs the figure, as it transforms into a reptilian humanoid. The humanoid smirks. Text appears, reading \"In A World Consumed By Darkness\". The cat-eared girl runs in a dark environment, with a shadowy, red-eyed figure lurking behind. Purple trails follow behind the girl as she runs. Close up on the girl's face as she clenches her fist with a determined look. Text appears, reading \"To Prove Himself Worthy\". The girl, now wearing a red cape and a crown, walks towards a bright arched doorway down a red carpet. The girl's silhouette brightens as she gets closer to the door. Seated on a throne and wearing royal garb, the girl rests her hand over a green jewel. She lifts a silver crown with green jewels and places it on her head. From a balcony, the girl now addresses a large crowd in the rebuilt town, standing between the dark-haired man and the pink-haired girl. Several winged creatures fly across a clear blue sky. Text appears, \"And A True King Is Born\". The title, \"Ni no Kuni II Revenant Kingdom\", appears in teal and gold.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the chronological order of events in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The boy gains magic, the boy becomes king, and a colossus attacks the town.",
+ "B. The boy gains magic, a colossus attacks the town, and the boy becomes king.",
+ "C. A colossus attacks the town, the boy becomes king, and the boy gains magic.",
+ "D. A colossus attacks the town, a boy gains magic, and the boy becomes king."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Video Emotions",
+ "question": "Which scene in the video best conveys a sense of tension and dramatic emotion?",
+ "answer": "A",
+ "candidates": [
+ "A. Flames and smoke fill the air as the characters display fear and determination when the giant statue attacks the town.",
+ "B. A glow appears around the boy's feet as he gains magical power.",
+ "C. The characters show surprise and confusion when the villain appears.",
+ "D. The characters celebrate their victory when the boy becomes king and establishes his kingdom."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many dragons appear in the sky at the end of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four.",
+ "B. Five.",
+ "C. Six.",
+ "D. Seven."
+ ]
+ }
+ },
+ "OKZHkcZs": {
+ "video_id": "OKZHkcZs",
+ "video_duration": "139s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a patterned gray and white button-up shirt is standing in a kitchen, holding a gray container. The person places the container on top of a silver machine, then removes it, then places a smaller gray container onto the machine. The person holds a small glass container, pours its contents into a clear bowl with pale yellow contents, and stirs the mixture using a white spatula. A silver machine and bowl are visible on the counter next to the glass bowl. The person is then seen assembling the silver machine by placing a gray circular piece into it. The person pours the contents of the clear bowl into the machine. The person then places a clear lid on the machine, and an additional gray piece inside of it, then the person removes the machine\u2019s lid. The person then uses a gray paddle to remove and scrape pale yellow contents from the machine into a clear rectangular bowl. The person then pours the pale yellow contents of the gray container into the clear bowl, using a white spatula to push more of the contents of the gray container into the clear bowl. The person then tastes the pale yellow contents of the container with a small white spatula. The person then makes hand gestures while speaking and smiling.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where is the ice cream maker positioned relative to her when she is stirring the ice cream?",
+ "answer": "C",
+ "candidates": [
+ "A. To her left.",
+ "B. Behind her.",
+ "C. In front of her.",
+ "D. To her right."
+ ]
+ }
+ },
+ "mCJgBZHe": {
+ "video_id": "mCJgBZHe",
+ "video_duration": "76s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a man with a white beard and a tan hat speaking in front of a hexagonal patterned wall, he is wearing overalls and a white shirt. At some points, the video shows a different man wearing a grey patterned button up shirt and glasses, speaking in front of a dark background. Finally, the video shows a man in a light jacket and patterned shirt speaking, with a woman with dark hair in a black dress next to him. These scenes alternate throughout the video. At one point a person is shown with cucumber in their mouth and two circular objects over their eyes. The man with the white beard is shown for a majority of the video. A final clip shows two people standing together, one of them is the man with the white beard.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence of events mentioned by the white-bearded man in the video? 1. Describes Jim Carrey's performance, 2. Talks about the premiere of *Ace Ventura*, 3. Meets Courteney Cox, 4. Sees Chris Rock in the lobby.",
+ "answer": "C",
+ "candidates": [
+ "A. 4, 2, 1, 3.",
+ "B. 2, 4, 1, 3.",
+ "C. 2, 3, 1, 4.",
+ "D. 4, 2, 3, 1."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "In the video, how many times does the man with a white beard mention Jim Carrey's performance?",
+ "answer": "C",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. Three.",
+ "D. One."
+ ]
+ }
+ },
+ "CoxIKJjr": {
+ "video_id": "CoxIKJjr",
+ "video_duration": "71s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Climate Change",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a light-colored cap, a blue jacket, and a dark t-shirt sits on a wooden bench in front of green foliage and speaks. A black box with the text \"Jon Graham, Rochester, VT\" is displayed in the bottom left corner of the frame. The scene shifts to an overhead view of a swirling cloud formation. The camera shows the person sitting on the bench again, speaking and gesturing with their hands. The scene then changes to a close-up of a newspaper headline reading \"A NIGHT OF RISING WATER, Thousands lose power as storm sweeps up East Coast.\" The person sitting on the bench is shown again speaking and gesturing. This scene changes to a street view in the rain, with a man walking through standing water. Then, the camera returns to the person on the bench, who is speaking and gesturing. A river overflowing with brown water is shown, followed by the person on the bench again. Finally, the camera returns to a view of the flooded river, with various debris being carried along the current.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "In which part of the video does the bicycle appear?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. In the middle of the video.",
+ "C. At the beginning of the video.",
+ "D. At the end of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What abnormalities are present in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Vehicles have their headlights on during the day.",
+ "B. The clouds in the sky form a spiral.",
+ "C. The water in the river is very turbid.",
+ "D. The rapid water flow floods the road."
+ ]
+ }
+ },
+ "uqXqCITR": {
+ "video_id": "uqXqCITR",
+ "video_duration": "38s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person is fishing at the edge of a pond. The water is disturbed as a fish is caught on the line. The line is pulled in, bringing the fish closer to the shore. The person reels in the fish. The person pulls the fish out of the water. It is a snook. The person holds the snook for the camera, smiling. The person displays the inside of the snook\u2019s mouth. The fish is released back into the water. The person continues fishing. The fish is hooked again, pulled to the shore and released.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many fish appear in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Two fish.",
+ "B. Zero fish.",
+ "C. One fish.",
+ "D. Three fish."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "In the video, in what direction is the tongue sole fish located relative to the man wearing a green T-shirt?",
+ "answer": "D",
+ "candidates": [
+ "A. Directly in front.",
+ "B. To the left.",
+ "C. To the right front.",
+ "D. To the left front."
+ ]
+ }
+ },
+ "LcXJYSwd": {
+ "video_id": "LcXJYSwd",
+ "video_duration": "224s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Art Exhibitions",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person with gold nail polish is holding open a sketchbook with a pencil drawing of a person wearing a dress with a sword. Next, a drawing on a page with an illustration of a person in a hat carrying a bag that says \"Monster Hunter\" with a small dog is shown. The person is surrounded by two ghost-like beings and a speech bubble above one of the ghosts says: \"kill me it would be a kindness\". Below that illustration is a second illustration with a speech bubble reading \"I can't, it's too dangerous\" and a second speech bubble reading \"I just want spaghetti\". The video returns to the first drawing, and the person begins to use a black marker to outline the pencil drawing. They then use a red marker to color in the hair, a blue and black marker to color the shirt, and a light blue marker to color in a dress. Next, an orange marker is used to color in the shoes, the dress and sword are highlighted in yellow, and a light green marker is used to color in the background. The person then uses a light blue eraser-type tool to rub out the background. The person uses a black pen and various pieces of paper to add details to the drawing. They finish the drawing by using a white pen to fill in the details. Then, the video displays the words \"Skillshare\" in white font over a blue background. It then returns to the drawing, which is now fully colored.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many blades fell out of the transparent box in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Nine.",
+ "B. Eight.",
+ "C. Ten.",
+ "D. Seven."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a pen shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes.",
+ "B. No.",
+ "C. I don't know."
+ ]
+ }
+ },
+ "tsmINTwh": {
+ "video_id": "tsmINTwh",
+ "video_duration": "228s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with blonde hair wearing a black long-sleeved shirt and black pants walks from the left of the frame to the center of the frame, while a second person with yellow hair wearing a blue shirt and blue shorts walks from the background to the center of the frame. Both people pass a green rectangular object, which is on the floor, before the person wearing black picks up a green piece of material. The person wearing blue kicks the green material and the person wearing black continues to move it, then the camera cuts to the person with blonde hair sitting on the floor with a dog nearby. The blonde person uses scissors to cut the green material, then opens a package, and opens another package, then starts placing straws onto a green piece of material on the floor. The person in black then smooths out the straws with their hands, and continues to apply straws to other parts of the material. After a while the person in black looks at the camera. After another cut, the person in black is standing and wearing a green structure with many pink straws around the opening. They look around and start to move in a jerky fashion. Then they begin moving around in the house. At some points the green material is seen attached to the person's back. Then the person interacts with a dog by putting their green material on top of it. At another point the person interacts with two dogs sitting on a couch, as another dog watches from the ground. The scene cuts to the same person still in the same green material, standing in the same area. The scene then cuts to a person with yellow hair putting their face into the pink straws of the green material from the other person. Then the scene cuts back to the first person in the green material, who now has pink material sticking out in multiple directions, and again looks around while moving. Then it switches back to the first person's face, being seen from behind the green structure. Then the scene ends on this person facing the camera, wearing the green material, and walking around.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many dogs appear on the sofa in the middle of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. One.",
+ "D. Three."
+ ]
+ }
+ },
+ "YnpyrAWh": {
+ "video_id": "YnpyrAWh",
+ "video_duration": "215s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Physics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a dark jacket and a blue tie stands in front of a whiteboard. There are multiple diagrams and equations written on the whiteboard. The person writes \"a\" on the whiteboard, followed by \"i=\", and then writes \"0.33 A\". The person then writes \"b\", followed by \"\u03bcr?\", then the text \"B=\u03bcH\", then writes \"\u03bcr\u03bc0=B/H\", then \"\u03bcr\u03bc0 = 0.8/120\", and finally \"=>\u03bcr= 0.8/(120*4\u03c0*10^-7)=5252.305\". Next to the equation, the person writes \"saturation \u03bcr\". The person then draws a graph with one straight line, one curved line, and some rectangular lines on it. The person is holding a white piece of paper in their left hand and a marker in their right hand, which is used to write on the whiteboard.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "What homework did the teacher assign to the students?",
+ "answer": "B",
+ "candidates": [
+ "A. Chemistry.",
+ "B. I am not sure.",
+ "C. Physics.",
+ "D. Math."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What subject is the teacher teaching in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Mathematics.",
+ "B. Chemistry.",
+ "C. Biology.",
+ "D. Physics."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order in which the following actions occur? (a) Write down the formula, (b) Solve the problem, (c) Explain the physical meaning of the formula, (d) Illustrate saturation on a graph.",
+ "answer": "D",
+ "candidates": [
+ "A. (c) (a) (b) (d).",
+ "B. (a) (c) (d) (b).",
+ "C. (a) (d) (c) (b).",
+ "D. (a) (c) (b) (d)."
+ ]
+ }
+ },
+ "yAPZpTGz": {
+ "video_id": "yAPZpTGz",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a landscape with mountains in the background. Initially, there are three stick figures in the foreground, one blue, one orange, and one green. The green figure is initially standing with its arms crossed. The green figure then extends one arm. The camera focuses on the blue figure, who shifts its arm position several times and briefly displays a peace sign. The camera shifts to the orange and green figures, with the green figure holding a small orange circle in one hand. The orange figure grasps the green figure's hand. The camera shifts to the blue figure again, who remains stationary. Finally, the camera pans to the landscape without any figures.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video do the green and orange cartoon characters agree with the ruling?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. In the middle of the video.",
+ "C. At the end of the video.",
+ "D. At the beginning of the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What does the blue cartoon character do to the orange and green cartoon characters in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Offers an opinion.",
+ "B. Shakes hands with them separately.",
+ "C. Makes a \"V\" gesture while taking a photo.",
+ "D. Gives a command."
+ ]
+ }
+ },
+ "rxmrsQwS": {
+ "video_id": "rxmrsQwS",
+ "video_duration": "113s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Two fighters are circling in a boxing ring. One fighter, wearing camouflage shorts, lands a kick to the other fighter's body. The fighter in white shorts attempts a kick but is caught by the fighter in camouflage shorts, who then trips him to the ground. The fighter in camouflage shorts gets on top of the fighter in white shorts and lands punches. The referee steps in and stops the fight. The fighter in camouflage shorts gets up and celebrates. He does a cartwheel in the ring, raises his arms, and then climbs up on the ropes to hug a man outside of the ring. He steps back into the ring and is presented with a belt and a trophy. He raises both above his head, smiling. The other fighter, now standing, is given a plaque.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Are there two motorcycles visible in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. None of the above.",
+ "B. No, there are not.",
+ "C. I am not sure.",
+ "D. Yes, there are."
+ ]
+ }
+ },
+ "scihXtvG": {
+ "video_id": "scihXtvG",
+ "video_duration": "46s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The words \"the next day!\" appear in white text on a red background. Two young women stand close together, smiling at the camera. The woman closest to the camera wears a black sleeveless top with a sweetheart neckline edged in black lace. She has blonde hair styled in a bob. She wears a thin necklace with a small pendant. The other woman has long brown hair and wears a patterned, sleeveless top. People in various outfits walk behind them. The background suggests they are outdoors, with trees and buildings visible. A red carpet is visible on the ground. The scene changes to show a large group of photographers with cameras and lighting equipment, all focused on a red carpet. Several people walk on the red carpet toward a backdrop with the words \"Billboard Music Awards\". The backdrop also features various logos including Kia, ABC, Samsung Galaxy, and T-Mobile. The people pose for photos against the backdrop. The view returns to the two women. The blonde woman looks at the camera, holding a blue clutch purse. She walks forward on the red carpet. Bright blue open-toed shoes are visible on her feet as she walks. More people continue walking on the red carpet. The blonde woman continues talking to the camera, the view occasionally shifting downward to show her outfit or the blue clutch.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the likely chronological order of the following actions? (a) Sharing feelings about an activity, (b) Sharing the setting or environment, (c) Introducing companions, (d) Sharing what one is wearing.",
+ "answer": "D",
+ "candidates": [
+ "A. (a), (c), (b), (d).",
+ "B. (d), (c), (b), (a).",
+ "C. (b), (c), (a), (d).",
+ "D. (c), (a), (b), (d)."
+ ]
+ }
+ },
+ "ufbeRvcM": {
+ "video_id": "ufbeRvcM",
+ "video_duration": "257s",
+ "duration": "4-6min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video starts with a view of a red car door. A blue text overlay with \"BADGES DON'T GRANT EXTRA RIGHTS\" and \"COPIBLOCK.ORG\" appears at the bottom of the frame, along with a cartoon image of a policeman. A second text overlay says \"THIS VIDEO WAS SUBMITTED VIA COPIBLOCK.ORG/SUBMIT. THE MAN BEHIND THE CAMERA IS SIMPLY WAITING FOR HIS PARTNER TO COME OUT OF THE STORE, WHEN THE POLICE SHOW UP.\" Then the view changes to a parking lot, where a policeman in a dark blue uniform with a badge is looking at the camera, with the same text overlays on the bottom of the frame. The policeman speaks directly to the camera. A few seconds later, he picks up his walkie-talkie and speaks into it, looks around, and then continues looking directly at the camera and speaking. After a few seconds, his face is partially obscured by a hand. The policeman lowers his walkie-talkie, touches his face, and speaks while gesturing with his hand. The view turns slightly to the left so that a store is more visible in the background. The camera focuses on the policeman as he adjusts his uniform and begins looking off to the right side of the frame. He speaks, touches his face, adjusts his walkie-talkie, and gestures with his hands. A black car can be seen behind the policeman. As the video continues, the camera shifts to show a stop sign and more stores behind the policeman. The policeman then looks directly at the camera and speaks while making gestures, touches his nose and then his mouth, continues talking, and adjusts his walkie-talkie. The view changes again to show the store directly behind the policeman. The policeman continues to look directly at the camera, gesturing while he speaks, and then adjusts his belt, and looks down. He adjusts his belt again. Then he looks off to the right side again, and the camera view begins moving around to the left. He adjusts his belt again. The view moves to a different angle, showing more of the parking lot behind the officer. Then the view changes to show more of the stores and parking lot. The policeman is visible on the left side of the view. The policeman then moves further towards the center of the view and speaks to the camera while gesturing. He again begins moving slightly to the left while looking off camera to the right, and talking. He again adjusts his uniform and walks slightly to the left, again looking off camera to the right and talking while gesturing. The camera continues to move around the policeman. Then a second police officer wearing sunglasses appears from the right side, followed by a third police officer who is wearing a black shirt and glasses. All three police officers are now visible. The second officer gestures with her hands while talking, and the third officer stands still. The second and third officers both have \"POLICE\" written on their uniforms. Then the third officer begins to speak to the camera. The policeman continues to speak and gesture with his hand. The second officer continues to look directly at the camera and adjust her uniform. The third officer walks toward the camera and begins speaking. The camera moves to show the second and third officers as they stand next to each other and speak while looking at the camera. Then, the second officer puts her arms up and looks around. The third officer continues to speak while gesturing. The camera view shifts to show the policeman as he is speaking and moving his hand to the side while looking off to the right. The second officer is visible on the right, and then the third officer becomes visible on the right as well. They continue to speak and gesture. All three officers are now visible again in the parking lot in front of the store. The second officer adjusts her sunglasses. The second officer speaks. The third officer raises his hand to his ear. The camera shifts to show the second officer in front of the stores. The third officer adjusts the equipment around his neck. The second officer again puts her arm up to her ear. The policeman gestures with his hand while looking off to the right side of the camera view. The view continues moving back and forth between the policeman and the second and third officers. The video ends with a view of the policeman, while the second and third officers are no longer visible.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who said, \"If you don't identify himself, put him in handcuffs.\"?",
+ "answer": "A",
+ "candidates": [
+ "A. A Black man.",
+ "B. A White woman without sunglasses.",
+ "C. A White woman wearing sunglasses.",
+ "D. A White man."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times is the ID card mentioned in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 5.",
+ "B. 6.",
+ "C. 4.",
+ "D. 3."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why are people moving around here?",
+ "answer": "D",
+ "candidates": [
+ "A. He is prying open a car door.",
+ "B. He works here.",
+ "C. He is stealing something.",
+ "D. He is waiting for someone."
+ ]
+ }
+ },
+ "bNUTmztw": {
+ "video_id": "bNUTmztw",
+ "video_duration": "62s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Biology",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A brown, textured blob with a smaller, orange sphere attached by a thin line moves along a straight line. The line changes colors, and another line extends from the brown blob forming a curved, single line. Two lines that are parallel to each other appear to peel open, with a colorful interior visible. A small red marker appears on the straight line. A short yellow line appears below the curved line and shows that the curved line detaches from the straight line.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the last three letters of the word inside the yellow box that appears at the end of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Ter.",
+ "B. Tor.",
+ "C. Ien.",
+ "D. Ion."
+ ]
+ }
+ },
+ "TMXEYCdJ": {
+ "video_id": "TMXEYCdJ",
+ "video_duration": "80s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The first shot shows a woman with long brown hair and glasses, wearing a white t-shirt and jeans, sitting and playing a small brown ukulele. The next shot shows a man wearing a brown t-shirt and headphones, playing a black and white electric guitar. This is followed by a shot of a man with dark curly hair and a beard, wearing a light blue shirt, sitting on a couch. The man is talking and looking to the right side of the screen. Next, the video shows a man wearing a brown t-shirt sitting on a dark blue couch and talking. Another shot shows the first woman, again, this time only her head and shoulders visible, and she is talking. The next shot shows a person wearing a black t-shirt with white text and a watch on their left wrist, handling a package of toilet paper rolls. This is followed by a shot of the man with dark curly hair and a beard talking. The next shot is of the first woman, this time with her head in her hands. After this is a clip of a man playing a black and white electric guitar. The next shot shows a black screen with white text that says \"Featuring...\". Another shot shows the person handling the toilet paper rolls with the text \"Angelastic\" overlaid on the video. Next we see the man in the brown shirt with \"Spektugalo\" text overlaid. This is followed by the man with curly dark hair with \"MJPM\" text overlaid. Next, we see a man wearing a green t-shirt and headphones with \"Theanderblast\" text over the video. A man wearing a blue shirt has the text \"skyen\" overlaid on his shot. The next shot shows a person wearing dark glasses and orange headphones with the text \"elliomeg\". The first woman is on the screen with \"Colleenky\" overlaid on the video. The following shot is of a black screen with white text that says \"with Bry (on piano) and additional vocals by runtime1point0\". Finally, there is a black screen with white text that says \"Happy Birthday JoCo!\".",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What guitar playing technique is used by the electric guitarist in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Fingerstyle.",
+ "B. Strumming.",
+ "C. String picking.",
+ "D. Plectrum strumming."
+ ]
+ }
+ },
+ "PSNeTBjt": {
+ "video_id": "PSNeTBjt",
+ "video_duration": "603s",
+ "duration": ">8min",
+ "domain": "Culture & Politics",
+ "sub_category": "Art Exhibitions",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A woman with short, curly blonde hair wearing glasses and a black t-shirt with the words \"This Is What a Feminist Looks Like\" stands in an art studio. She picks up a metal stamp from a wooden box containing many similar stamps. She points at a canvas showing a light-colored painting with darker markings. She holds and manipulates a textured, painted metal object in her hands. She holds another painted metal object, seemingly a face, close to the camera. She points towards a wall with floral wallpaper on it, then holds a piece of what appears to be the same wallpaper in her hand. She gestures towards a painting of a person in a white coat. She holds a painting tool and gestures towards a canvas depicting a child. She points towards a palette covered with paint. She holds a brush and looks at the canvas.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many birds is the woman wearing a black shirt holding in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Four.",
+ "B. Two.",
+ "C. Five.",
+ "D. Three."
+ ]
+ }
+ },
+ "MMroctFh": {
+ "video_id": "MMroctFh",
+ "video_duration": "393s",
+ "duration": "6-8min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two people wearing black wrestling attire and a referee wearing a black and white striped shirt are in a wrestling ring. The person on the left has a visible tattoo on his right shoulder and the person on the right is bald. The match starts with both facing each other and throwing punches. They then grapple, and the person on the right performs a move on the person on the left, throwing them to the ground. The person on the right then stands over the person on the left and performs a move, putting their legs on the left person's neck. The person on the right then picks up the left person and jumps up to the ropes and then throws the left person down. The video then replays some of the moves in slow motion. The match ends with the right person putting his legs on the left person\u2019s neck again while lying on the ground, and the referee indicates the match is over. A screen reads \u201cWINNER THE ROCK\u201d. Then the video shows the person on the right celebrating before ending with an animated logo and image.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What type of setting is depicted in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A real-life setting.",
+ "B. None of the above.",
+ "C. A movie setting.",
+ "D. A game setting."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what timestamp in the video does the tattooed man get hit on the head and fall to the ground?",
+ "answer": "B",
+ "candidates": [
+ "A. Throughout the video.",
+ "B. At the beginning of the video.",
+ "C. In the middle of the video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "qQEIKCTx": {
+ "video_id": "qQEIKCTx",
+ "video_duration": "130s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a blurred view of two yellow bars in front of a red background. This quickly transitions to a menu screen with a blue and purple interface, showing various song options. The cursor moves, selecting the song \u201cLeaf - Mope Mope (Noodle Extensions)\u201d to play. The screen changes, showing a selection of game options, and the video then transitions to the gameplay. The view is now a first-person perspective inside a virtual space, showing two long beams of light, one blue and one orange, resembling lightsabers. Various shapes, including orange and blue cubes and geometric shapes appear and are sliced by the lightsabers. The background changes color and patterns over the course of the game, showing red and white geometric patterns, some resembling black umbrellas, and white dots. The game ends with a \"Level Cleared\" screen, a score display, a button to restart, and two red and blue cylindrical objects along with four white rectangles.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many cartoon pillows are located to the right of the blue lightsaber in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Six.",
+ "B. Four.",
+ "C. One.",
+ "D. Three."
+ ]
+ }
+ },
+ "LMYNAPat": {
+ "video_id": "LMYNAPat",
+ "video_duration": "51s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a shot of two men holding microphones, a green one on the left, and a red one on the right. The men are facing each other and there are a number of people in the background. A translucent graphic of a tweet is visible at the bottom of the screen with white text. The scene switches to an aerial view of three motorcycles racing on a track, with one of the bikes trailing the other two. The view changes to another angle and more views of the bikes racing on the track. A 'Chequered Flag' graphic is visible in the top left corner. The scene changes back to four men in an outdoor setting with a crowd of people behind them, all holding microphones. The two men holding the green and red microphones are also present in this shot. The men are standing in front of a blue truck with white text on the side.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who mentioned in the video that \"I feel like he rode like he was totally protecting Jorge Lorenzo\"?",
+ "answer": "A",
+ "candidates": [
+ "A. The man holding a red microphone.",
+ "B. The man holding a dark blue microphone.",
+ "C. The man holding a green microphone.",
+ "D. The man holding a black microphone."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the two people on the same motorcycle do in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The rear rider overtook the front rider.",
+ "B. The rider in front provided cornering protection to the rider in the back.",
+ "C. The front and rear riders rode side by side.",
+ "D. The rider in the back provided cornering protection to the rider in front."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a white and blue racing motorcycle in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. It does not exist.",
+ "B. I am not sure.",
+ "C. It exists."
+ ]
+ }
+ },
+ "DGZXvXQh": {
+ "video_id": "DGZXvXQh",
+ "video_duration": "41s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a close-up of a motorcycle exhaust system. Animated dark gray clouds appear, representing exhaust gases, and move through the header pipe connected to the engine. A white outline traces the path of the exhaust pipe. A red line appears, highlighting the divergent section of the exhaust, as the animated clouds move into the main muffler body. The clouds move further into the muffler, and then a red line appears, highlighting the convergent section, as the clouds compress. The view then changes to show the side of the muffler and engine. The white outline and gray clouds move along the length of the muffler. A red line again appears, highlighting a divergent section of the exhaust, as the clouds expand. The clouds continue to move toward the end of the muffler, and then a red line appears again, highlighting a convergent section of the exhaust, as the clouds compress.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many black clouds appear in the animation?",
+ "answer": "A",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. Four.",
+ "D. One."
+ ]
+ }
+ },
+ "BfmbCifT": {
+ "video_id": "BfmbCifT",
+ "video_duration": "532s",
+ "duration": ">8min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman with curly dark hair, wearing a pastel rainbow-striped two-piece outfit and a gold necklace, stands in front of a leafy backdrop. She gestures as she speaks. A pink triangle graphic appears over her torso, labeled \u201cTHE TRIANGLE THEORY OF LOVE.\u201d The triangle points are then labeled \u201cPASSION,\u201d \u201cCOMMITMENT,\u201d and \u201cCOMPANIONSHIP.\u201d A rapid sequence of video clips follows: a man climbing a pole in a bedroom; a dog running through a doorway; a man and woman kissing; a couple holding hands at sunset; a woman embracing a man, smiling; two young women embracing and laughing. The woman in the rainbow outfit returns, gesturing as she continues speaking. Briefly, a video clip of a man and woman at a computer desk appears, followed by a woman praying. The video then cuts to a color bar test pattern, followed by rapid image sequences of various people: a man smiling; a woman among track hurdles; a muscular tattooed man sitting; and the woman in the rainbow outfit again. The woman continues speaking. A split-screen appears with the woman on one side and a clip of someone\u2019s hands playing a stringed instrument on the other. The \u201cNO\u201d graphic appears in front of the woman. She continues speaking as various clips play, including the couple at sunset, two muddy vehicle tires, and the woman with the dark curly hair, now wearing a black shirt and a gold chain necklace, sitting and talking. A man with braided hair is visible in one of the inset videos, and the braided-hair man and woman sit on a sofa in a three-panel image. The woman in the rainbow outfit returns, gesturing as she continues speaking.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the woman wearing a gold chain necklace discussing in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Theories of attitudes toward love are being discussed.",
+ "B. Theories of relationship dynamics in love are being discussed.",
+ "C. Theories of love styles are being discussed.",
+ "D. The triangular theory of love is being discussed."
+ ]
+ }
+ },
+ "ygQSYVai": {
+ "video_id": "ygQSYVai",
+ "video_duration": "483s",
+ "duration": ">8min",
+ "domain": "Sports",
+ "sub_category": "Fishing",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A hand holds a fishing rod over a shallow, clear stream. The streambed is visible, composed of rocks and sand. The hand casts the line. The view follows the line as it travels downstream. The camera moves along the bank of the stream, showing different sections of water and surrounding foliage. The water flows around rocks and small sandy areas. The hand adjusts the fishing rod and line. The camera goes underwater briefly, showing a hand reaching into the stream, disturbing the sandy bottom. The hand emerges from the water. The camera continues to follow the movement of the fishing rod along the stream. The water gets deeper and faster flowing over rocks. The camera goes underwater again, close to the rocky bottom. A man wearing a black shirt, suspenders, and a baseball cap is shown from the shoulders up in front of the stream and surrounding trees.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many distinct species of fish are shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two species are shown.",
+ "B. Four species are shown.",
+ "C. One species is shown.",
+ "D. Three species are shown."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What happens after the man catches the fish?",
+ "answer": "A",
+ "candidates": [
+ "A. The man releases the fish back into the stream.",
+ "B. The man immediately roasts the fish.",
+ "C. The man puts the fish into the bucket.",
+ "D. The man feeds the fish to the cat."
+ ]
+ }
+ },
+ "HfpYwkJi": {
+ "video_id": "HfpYwkJi",
+ "video_duration": "63s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Short Films",
+ "audio_class": [
+ "Event"
+ ],
+ "video_caption": "Three cartoon characters are running in different directions. One has a pirate hat, the second is wearing a Santa hat, and the third is wearing a headband. An island and surrounding water are shown. The characters walk through a jungle. They walk on flat land, then among some dome-like structures. Some dome-like structures are being broken apart. A turtle character walks around. The three cartoon characters look shocked. The characters are shown with tears on their faces. They are shown in the foreground of many turtle characters. The character in the headband is shown with a square object on its back and hands, while a square object is shown in the air near turtle characters. The character with the headband is shown holding and displaying a rectangular object. The three characters are shown together again. Some turtle characters are shown. The character with the headband is shown moving its arms and legs. The character in the headband is shown running.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video contain a stone?",
+ "answer": "D",
+ "candidates": [
+ "A. It is not clear.",
+ "B. None of the above.",
+ "C. Yes, it does.",
+ "D. No, it does not."
+ ]
+ }
+ },
+ "Kauavmph": {
+ "video_id": "Kauavmph",
+ "video_duration": "82s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a close-up of a person wearing a camouflage military uniform, with a logo featuring a blue and yellow emblem in the top left corner. The person is crouching and seems to be examining the inside of a blue room. Another person in camouflage is next to them. The video pans, showing another angle of the room. The camera shifts focus to the helmet of one of the camouflaged individuals. It reads \"\u0421\u0411\u0423\" in white letters. There is a small yellow and blue flag on their helmet. Another person, wearing a black mask and gloves, appears in the background. The video shifts back to the camouflaged person's helmet. The camera shifts to another angle, focusing on the helmet of a different individual. The camera shifts again, focusing on a person's sleeve which has a yellow, green, and silver emblem on it. A bare-chested person appears in the background, with two other people sitting on a bed behind them. The video shifts to a still shot of three people sitting on a split-screen television interview: two people on the left-hand side and one person on the right-hand side with other people in the background. The bottom part of the screen has a white rectangle on which text appears. The video shifts, showing the person on the right-hand side of the television interview standing at a lectern. The camera shifts back to the same still shot of three people sitting on a split-screen television interview. It then shifts to one man on the left side of a television interview, and another person on the right side of the television interview in a smaller box. Then the video shows the person standing at the lectern from a different camera angle with people applauding in the background. The video shifts to the original two men on the split screen again, then shows the other man alone. The video shifts to one man on a television interview wearing a suit, and text at the bottom of the screen, with an image of a church appearing on the screen.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will the U.S. government impose sanctions on Zelensky next?",
+ "answer": "A",
+ "candidates": [
+ "A. No, it will not.",
+ "B. None of the above.",
+ "C. Yes, it will.",
+ "D. It is not known."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the woman in the blue blazer do to the man in the olive green long-sleeve shirt in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The woman in the blue blazer kissed the man in the olive green long-sleeve shirt.",
+ "B. The woman in the blue blazer argued with the man in the olive green long-sleeve shirt.",
+ "C. The woman in the blue blazer hugged the man in the olive green long-sleeve shirt.",
+ "D. The woman in the blue blazer shook hands with the man in the olive green long-sleeve shirt."
+ ]
+ }
+ },
+ "kZdVtcAx": {
+ "video_id": "kZdVtcAx",
+ "video_duration": "54s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "World News",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a series of clips with different settings and shots of military vehicles. The first few seconds show a tan-colored vehicle, with a bush with red berries in the foreground. This vehicle then drives through a wooded area. A larger, darker green vehicle then drives over a dirt path in the woods. Next there are multiple shots of a field covered in snow. In these shots, various explosions occur. Following this, a tan and green tank is shown leaping over a mound of dirt. After this, a tan tank drives over a dirt path in a dry landscape. These are followed by several more shots of explosions on a snow-covered field, filmed from different perspectives. The video then shows a number of yellow and gray shells. Finally, a series of shots shows a tank driving through a snow-covered field from a bird's eye view. In these shots there are explosions happening, and houses are visible.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the final state of the T-90 tank?",
+ "answer": "B",
+ "candidates": [
+ "A. The entire armor is destroyed.",
+ "B. The turret control system is destroyed.",
+ "C. The armored track is broken.",
+ "D. It is in perfect condition."
+ ]
+ }
+ },
+ "UzfRSfAv": {
+ "video_id": "UzfRSfAv",
+ "video_duration": "169s",
+ "duration": "2-4min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman with shoulder-length brown hair wearing a white t-shirt with pink trim and the number \"666\" in pink on the front is shown sitting at a table. She gestures with her hands as she speaks. The background consists of various pieces of equipment on a workbench and a tool board on the wall. She has a small microphone clipped to her shirt. The scene shifts to a man wearing a dark suit and tie standing beside the same woman. She is now wearing denim overalls with a white shirt, safety glasses, and a red bandana. The woman appears to be adjusting equipment on a table. The scene transitions to show a large white and black machine. The same woman appears to be examining and pressing buttons on the machine. The scene cuts to the man holding a red fire extinguisher while smiling, while the woman smiles in the background. The scene shifts again to show the woman adjusting and then opening the large machine. She places a thin, light-colored rectangular sheet into the machine and presses buttons on the control panel, then closes the lid. The camera moves to show the inside of the machine and a beam of red light cutting letters into the rectangular sheet. The beam stops and cuts a perimeter around the carved words. The woman then opens the lid of the machine and removes the sheet. The woman and man hold up the rectangular sheet together which has the words \"Stay Awesome\" cut into it. The scene then shifts back to the woman sitting at the table. She is again gesturing with her hands as she speaks.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the last three letters engraved on the board by the laser in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. oma.",
+ "B. ome.",
+ "C. sem.",
+ "D. som."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the wooden board in the video relative to the laser cutter?",
+ "answer": "C",
+ "candidates": [
+ "A. The wooden board is in front of the laser cutter.",
+ "B. The wooden board is on the right side of the laser cutter.",
+ "C. The wooden board is below the laser cutter.",
+ "D. The wooden board is on the left side of the laser cutter."
+ ]
+ }
+ },
+ "NeJkUzDA": {
+ "video_id": "NeJkUzDA",
+ "video_duration": "404s",
+ "duration": "6-8min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A gloved hand holds a light-colored cloth, inside a building with tile floors and wooden stairs. Then, a first-person view shows someone in military-style clothing laying prone in tall grass. The camera shifts, showing them standing near a wooden fence. They move to an opening in a brick wall and look out. A figure in gray clothing stands in tall grass, then a second figure in darker clothing appears to their left. The viewpoint shifts again, showing a close-up of the two figures standing side-by-side. The figure in darker clothing turns towards the camera and is shown to be wearing a black mask. They shift around and a menu overlay is shown, followed by the dark-clothed figure being seen from the knees up. The camera perspective turns and shows the masked figure speaking to the gray-clothed figure. A menu prompt appears, \"Check person's pulse.\" The gray-clothed figure holds up a hand. The view moves to show a roadway and then the camera returns to the gray-clothed figure, moving towards the masked figure. Another scene displays the two figures, now inside a building, with what appears to be an orange hat on the floor. One figure stands still while the other searches their inventory and moves items. The view then returns to show both figures, and one of them takes a pill, then picks up a red hat. Both figures can be seen, then they touch their hands, and a flag is briefly shown over them. The camera view then changes several times, showing one or both of the figures. A final shot shows one figure holding a rifle.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why do they need Marsh's boots?",
+ "answer": "D",
+ "candidates": [
+ "A. Because they like boots.",
+ "B. Because they cannot find their boots.",
+ "C. Because they do not have boots.",
+ "D. Because their boots are damaged."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is Marsh's mood while singing?",
+ "answer": "A",
+ "candidates": [
+ "A. Joy.",
+ "B. Excitement.",
+ "C. Anger.",
+ "D. Sadness."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "How does Marsh's emotion change throughout the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Shifts from being tense to humorous.",
+ "B. Moves from feeling relaxed to suppressed.",
+ "C. Changes from happiness to anxiety.",
+ "D. Transitions from being lively to silent."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "On which side of the room is the red hat on the floor?",
+ "answer": "A",
+ "candidates": [
+ "A. On the right.",
+ "B. In the middle.",
+ "C. On the left.",
+ "D. At the bottom."
+ ]
+ },
+ "task4": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "Is there any change in the volume of the dialogue between characters in the room within the video?",
+ "answer": "D",
+ "candidates": [
+ "A. No change.",
+ "B. The volume change is minor.",
+ "C. There are changes; the volume decreases, and the clarity fades.",
+ "D. The volume increases with the character's emotions, becoming louder during excitement and when near the microphone."
+ ]
+ }
+ },
+ "PGRPfkNv": {
+ "video_id": "PGRPfkNv",
+ "video_duration": "623s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A woman with long brown hair and a plaid shirt sits in front of a blue sparkly backdrop. She gestures with her hands while speaking directly to the camera. She holds up various skincare products, one at a time, including a round, dark-colored jar, a light green squeeze tube, a white squeeze tube with purplish-pink text, a white pump bottle, and a clear bottle with a brown pump top. Between showing each product, her hands are down and out of frame. She looks up briefly a few times while talking. Her gaze remains focused on the camera for the duration of the video.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video show cleansing oil?",
+ "answer": "D",
+ "candidates": [
+ "A. Uncertain.",
+ "B. Not applicable.",
+ "C. No.",
+ "D. Yes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the woman in the plaid shirt doing in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Introducing the detailed ingredients of cosmetics.",
+ "B. Introducing cosmetic brands.",
+ "C. Discussing skincare methods for oily and clogged skin.",
+ "D. Introducing techniques for using cosmetics."
+ ]
+ }
+ },
+ "LGaGpppj": {
+ "video_id": "LGaGpppj",
+ "video_duration": "75s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video is split into two sections. On the left, an audience sits watching a video game competition. On the right, gameplay footage is displayed on a screen. Initially, the screen on the right is blue. Then, a video game starts, showing a small raft with a figure on it, paddling in a body of water. Text appears on the screen that says \u201cSTART\u201d. The figure paddles the raft forward, away from a small island with trees and enters a river or stream. The game prompts the player with the instruction \u201cRotate to Paddle\u201d. The raft proceeds down the river, passing by several islands with trees and bushes. As the raft approaches the finish line, the game\u2019s background changes color. When the raft crosses the finish line, the word \u201cFINISH\u201d appears on the screen. The screen then goes dark, followed by a results screen displaying the ranking of the players and their scores. The audience and competitors cheer and clap. One audience member stands and raises their arms in celebration. The screen on the right fades to black again.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "In the final ranking changes of the video, did the third-place ranking move up by one position?",
+ "answer": "C",
+ "candidates": [
+ "A. None of the above.",
+ "B. I don't know.",
+ "C. No, it did not.",
+ "D. Yes, it did."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What change in the audience's emotions occurred when the boat in the video game first collided with the shore and then restarted?",
+ "answer": "C",
+ "candidates": [
+ "A. The audience remained in a state of excitement.",
+ "B. Their emotions changed from excitement to disappointment.",
+ "C. Their emotions changed from disappointment to excitement.",
+ "D. The audience remained in a state of disappointment."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many stones does the boat pass in the game footage shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Three.",
+ "B. Two.",
+ "C. One.",
+ "D. Four."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What roles do the four men in the front row at the bottom left of the video play?",
+ "answer": "C",
+ "candidates": [
+ "A. They are photographers.",
+ "B. They are audience members.",
+ "C. They are contestants.",
+ "D. They are referees."
+ ]
+ }
+ },
+ "JXoaOVzs": {
+ "video_id": "JXoaOVzs",
+ "video_duration": "189s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with a blue satin bonnet on her head appears on screen. The screen changes to a textured beige background. A small photo of a woman with long, highlighted brown hair is in the upper left corner. A pink rectangular shape with small floral graphics is slightly to the right of and above the photo. Text appears: \u201cPatti Cake,\u201d \u201cBeauty,\u201d \u201cHair | Fashion,\u201d \u201cLifestyle,\u201d \u201cWE ARE NOT INVISIBLE,\u201d \u201cMature Woman,\u201d and \u201c@PattiCakeGal.\u201d Three smaller photos appear. One is similar to the original photo; one depicts a woman wearing a red, furry top or scarf; the third photo is of a woman lying down in a red top. The screen fades to black, and then returns to the woman with the bonnet, in what appears to be a bathroom with light brown tile. She holds up a small, clear bottle filled with a yellow liquid. The label reads \u201cFARMACY CLEAN BEE\u201d and lists it as an \u201cultra gentle facial cleanser.\u201d Text on screen reads \u201cFARMACY CLEAN BEE\u201d and \u201c@PattiCakeGal.\u201d The woman holds up a white tube with dark gray text, and the cap pointing down. The label reads \u201cSKIN&CO ROMA,\u201d \u201cTRUFFLE THERAPY,\u201d \u201cPolishing Micro-Exfoliating Truffle Face Gommage,\u201d \u201cGommage Micro-Esfoliante Con Estratto Di Tartufo,\u201d and \u201cFACE GOMMAGE Made in Italy.\u201d Text on screen reads \u201cSKIN & CO Polishing Micro-Exfoliating Truffle Face Gommage\u201d and \u201c@PattiCakeGal.\u201d The woman holds up her left palm, pours a small amount of clear yellow liquid into it, and spreads it on her face and neck. She then holds up the white tube, squeezes some of the white cream onto her fingers and begins to spread it on her face and neck over the yellow cleanser.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "Does the music play continuously throughout the video?",
+ "answer": "B",
+ "candidates": [
+ "A. None of the above.",
+ "B. Yes.",
+ "C. No.",
+ "D. I am not sure."
+ ]
+ }
+ },
+ "FfnbBeum": {
+ "video_id": "FfnbBeum",
+ "video_duration": "434s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a black shirt, gray shorts, and a black baseball cap opens a glass double door and walks inside. The camera follows this person as they walk through several empty rooms. Some of the rooms have gray floors, off-white walls, white trim along the baseboards and a white doorframe. One of the rooms has a white ceiling fan hanging from the ceiling. In another room, there are three children sitting on the floor using tablet devices. The camera moves to a person wearing a yellow t-shirt with a design on it. The person wearing the yellow t-shirt does not move. The camera moves to the person in the black shirt who walks around some more empty rooms. This person is holding a mobile device in a handheld mount. The person in the black shirt walks into a bathroom, which contains a toilet, sink and a handrail. The person in the yellow t-shirt appears in the bathroom, and then walks out again, followed by the person in the black shirt. This person then walks into an empty room with a yellow ladder. The person in the black shirt then points toward a door, and then walks through it. The person in the yellow t-shirt again appears, followed by the person in the black shirt. The camera then shows three men standing outside next to some glass doors, facing the camera. The men raise their hands and then give thumbs up. The camera is then held by the person in the black shirt as the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point does the child appear in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. At the end of the video.",
+ "B. At the beginning of the video.",
+ "C. Throughout the entire video.",
+ "D. In the middle of the video."
+ ]
+ }
+ },
+ "AHrZYufL": {
+ "video_id": "AHrZYufL",
+ "video_duration": "45s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A figure wearing a red suit and a mask is shown looking to the side, then looks forward while slightly moving their thumb up. A figure wearing a black hat and a black vest is facing away from the viewer. The figure in the black hat is holding up a drawing of a cartoon wolf-like figure. The figure in the red suit looks directly forward with large eyes and a slightly opened mouth, then looks down and away with their mouth agape. The figure in the black hat holds up the same drawing of a cartoon wolf-like figure. The figure in the red suit looks directly forward with narrowed eyes and a closed mouth, then the figure in the black hat holds the same drawing of a cartoon wolf-like figure up again. The figure in the black hat puts their hand over their face briefly. Then the figure in the black hat is shown briefly bringing a flat piece of paper in front of the face of the figure in the red suit, who has a look of disgust, then a look of surprised confusion on their face. The drawing is briefly zoomed in on where the letters \u201cSPY\u201d can be seen written on it. The figure in the red suit closes their eyes and opens their mouth. The figure in the black hat holds the drawing of the cartoon wolf-like figure up again. The figure in the black hat is shown with a wide mouth and then is shown with a finger pointing up. The figure in the red suit has their arms crossed, then is shown with their mouth agape. The figure in the black hat is shown holding the drawing again and then has a finger pointing up. The figure in the red suit is shown with their hands behind their back. The figure in the black hat is shown covering their face with one hand. The figures are shown with several cartoon wolf-like characters, four of which are accompanied by text, arranged next to each other. The figure in the black hat is shown holding the drawing of the cartoon wolf-like figure again. The figure in the red suit is shown in the middle of the screen with a slightly open mouth, then with a slightly different expression. The figure in the black hat holds their hands up to their face with their mouth open and then they stretch their arms out. A thought bubble with a smaller image of the figure in the red suit inside of it is briefly shown above the figure in the black hat. The figure in the black hat holds the drawing up once more, then the drawing is briefly zoomed in on, and the letters \u201cSPY\u201d can be seen again. Then the two figures are shown facing each other. The figure in the black hat is holding the drawing, and is pointing at the figure in the red suit.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "In the video, what color is the signature on the portrait held by the man with dark skin?",
+ "answer": "A",
+ "candidates": [
+ "A. Black.",
+ "B. Brown.",
+ "C. Red.",
+ "D. Blue."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why is the man wearing a red mask in the video painting the picture held by the man with dark skin?",
+ "answer": "B",
+ "candidates": [
+ "A. Because he wants to make the man with dark skin angry.",
+ "B. Because he likes cute things.",
+ "C. Because he enjoys painting.",
+ "D. Because he likes the man with dark skin."
+ ]
+ }
+ },
+ "UUaIqmyE": {
+ "video_id": "UUaIqmyE",
+ "video_duration": "83s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "FPS Game",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows first-person gameplay footage of a video game. In the first clip, a message reading \"YOUR TEAM WON ROUND 4\" appears on screen, followed by text saying \"ENEMIES ELIMINATED\". The next clip begins with the player holding a red and white gun; the player looks around a room, and then there are explosions on the floor. The player aims down the sight of the gun and text on the screen reads \"FOUND BOMB A\". The player looks around several rooms and hallways and aims their weapon at several other players that appear on the screen. The video then transitions to an overhead view of the character selection screen, displaying the names and images of characters being used. This clip shows character portraits labeled 'Melusi', 'Alibi', 'Kapkan', and 'Thunderbird'. The video then ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What is the maximum number of bullets a magazine in the red machine gun shown in the video can hold?",
+ "answer": "C",
+ "candidates": [
+ "A. Thirteen bullets.",
+ "B. Forty-five bullets.",
+ "C. Thirty bullets.",
+ "D. Fifteen bullets."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the person holding the red machine gun in the video throw a grenade?",
+ "answer": "D",
+ "candidates": [
+ "A. Zero.",
+ "B. One.",
+ "C. Two.",
+ "D. Three."
+ ]
+ }
+ },
+ "LOEvHrcA": {
+ "video_id": "LOEvHrcA",
+ "video_duration": "211s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a close-up of a person with curly purple hair. The person is wearing a sheer purple top. The person touches their hair a few times and looks directly at the camera. The video then cuts to a shot of two white bottles with purple writing on a silver sparkly background. One of the bottles has printing visible, the other does not. A third bottle with pink writing is placed next to the first two, then a fourth bottle with blue writing, then a fifth bottle with teal writing, then a final bottle with yellow writing, so that there are now six bottles in total. The camera then zooms into the bottles to show the labels more closely, and a purple graphic saying \"GIVEAWAY\" appears on the screen. The video cuts back to the person with the curly purple hair. They are now wearing a blue jean jacket over a dark undershirt. They use their hand to touch their hair, and they look directly into the camera, making several different facial expressions. The person then holds a bottle with purple writing toward the camera and moves it around, gesturing with their other hand. The person holds the bottle up to their nose and then holds it out again. The person continues to gesture with their hands and touch their hair, looking toward the camera. The person gestures with both hands and puts their hands up on their head. The person also points to their hair and to themselves. The person holds the bottle towards the camera again, touching their hair with their other hand. The person continues to make gestures, then waves, and then presents their profile information @MzBiancaRenee in a banner near the bottom of the screen. The screen fades to black with white glittery graphics and says \"Bianca\", then \"Bianca Renee Today\", followed by a pink graphic with social media icons, the text \"MORE VIDEOS!\", and a Subscribe arrow. The video fades to black at the end.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video contain black spray?",
+ "answer": "B",
+ "candidates": [
+ "A. Yes, it does.",
+ "B. No, it does not.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "yQmxtqnA": {
+ "video_id": "yQmxtqnA",
+ "video_duration": "59s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen. Then, a screen with text and an image of Earth and a satellite appears. The text says \"SMAP AT WORK.\" A white beam comes from the satellite towards Earth. The video transitions to a scene of a rocket launching. Next is an image of the Earth and three objects in orbit, including a satellite. The rocket sections break away from the satellite in the following scenes. The satellite is then shown flying in front of the Earth. Finally, a panel opens on the side of the satellite and an arm extends from the satellite with an object on the end of it. The satellite then moves, revealing a flat antenna. The video ends with the satellite in orbit around the Earth with the antenna deployed.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the purpose of the rocket launch depicted in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. To measure global soil moisture levels.",
+ "B. To improve global signal transmission.",
+ "C. To conduct surveillance of other countries.",
+ "D. To monitor the rate of land desertification."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the final separation segment in the video and the satellite?",
+ "answer": "C",
+ "candidates": [
+ "A. Acts as a part of environmental monitoring.",
+ "B. Serves as a component for information transmission.",
+ "C. Serves as the last source of power.",
+ "D. Functions as a counterweight in the vacuum environment."
+ ]
+ }
+ },
+ "OwegJYIA": {
+ "video_id": "OwegJYIA",
+ "video_duration": "231s",
+ "duration": "2-4min",
+ "domain": "Culture & Politics",
+ "sub_category": "Art Exhibitions",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A pair of hands places a brown rectangular card to the left of a rectangular light blue box with drawings and the word \"artful.\" written on it in a dark color on a tan colored surface. The hands place themselves on top of the light blue box, and lift the brown card to reveal a card with a drawing of a rainbow on it. The hands pick up a sheet of paper that had been covered by the brown card with writing on it and flip it over so it faces the camera. The hands hold up the rainbow card next to the brown card and then turn the rainbow card over to reveal it has a slit cut into it. The brown card and rainbow card are moved away from the box. The hands place themselves on the left and right sides of the light blue box and tilt it and turn it so the top, bottom and sides of the box are shown. The hands open the top of the light blue box. The inside of the top of the light blue box is orange with white drawings on it. The hands lift off the top of the light blue box. The hands adjust a red material inside of the box and spread it open to reveal a white card inside of it. The hands remove the white card and open it to reveal a brown folder with red circular closures and strings inside. The hands then remove the brown folder and place the white card back inside the box. The hands undo the closures on the brown folder and remove a collection of pens and markers. The hands remove some of the pens and markers from the folder and hold them up. The hands remove the pen caps, showing the tips of some of the pens, then place them on the tan surface next to the box. The hands pick up an eraser and hold it up. The hands then pick up the sketch book and move it so both sides of the sketchbook are shown. Then, the hands shift the red packing material and remove a card with the word \"artful\" on it that is covered by a semi transparent sheet. The hands then flip through the card before closing the box. The video ends with the box being closed.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "After removing the red wrapping paper, where is the pen relative to the sketchbook?",
+ "answer": "B",
+ "candidates": [
+ "A. Below.",
+ "B. Above.",
+ "C. On the right.",
+ "D. On the left."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What image is on the card cover?",
+ "answer": "B",
+ "candidates": [
+ "A. A pencil.",
+ "B. A rainbow.",
+ "C. A plant.",
+ "D. A bottle."
+ ]
+ }
+ },
+ "ITACZeSj": {
+ "video_id": "ITACZeSj",
+ "video_duration": "519s",
+ "duration": ">8min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A man with graying hair, wearing a gray suit jacket and patterned shirt, holds a paper while standing in front of a microphone. The words \"Jimmy O'Brien\" appear on the screen. A man with dark hair, wearing a checkered jacket and black shirt, holds a mug while sitting in front of a microphone. The words \"Jake Storiale\" appear on the screen. The graying haired man releases what looks like smoke from his mouth, then shouts, then pulls at his shirt. The two men sit at a desk, both facing forward and holding papers, with a mug and other items on the table, and the word \"DUMB\" appears behind them. Then the words \"Weekly DUMB\" appear on the screen. The dark haired man sits at the desk while the logo \"Dugout Mugs\" and the code \"DUMB\" appear on the screen. The man puts on a gold crown. The graying haired man sits at the desk with a white paper in his hand while three cylindrical baseball bat mugs sit in front of him. The dark haired man is seen at the desk again. Then, the graying haired man is seen again at the desk. The dark haired man is at the desk again, then the graying haired man is at the desk. The two men sit at the desk again, and the dark haired man gestures with his finger. The dark haired man sits in front of a background of the word DUMB, then the words \"Weekly DUMB\" appear on screen with the dark haired man in the frame. Then, the dark haired man sits at the desk again. The graying haired man is at the desk with papers in hand. Then the man sits at the desk. Red stage curtains cover the frame, and the dark haired man emerges from the curtains, adjusting his jacket. The graying haired man sits at the desk with papers in hand. Then the man sits at the desk, again with the papers. A series of images and the two men in conversation appear onscreen with papers and other items on the table, interspersed with close-ups of the men\u2019s faces. One man is seen holding and looking at what appears to be a photo or drawing. Images of other men holding a donut, or on a basketball court, appear in small boxes on the screen. At one point, what appears to be a fire erupts on the desk and is quickly gone. The two men move items around on the desk and then strike the cylindrical baseball bat mugs together. The dark haired man stands behind the table and adjusts his chair before re-sitting. The two men again are at the table together with the cylindrical mugs. The words \"Also Lou Ellen\" appear on screen and then the dark haired man is seen holding a cylindrical bat mug and the words \"Code: Dumb\" with \"35% Off\" appear on screen. The dark haired man is seen again holding the bat mug and the screen displays \"Early Bird Father's Day Sale,\" \"Code: Dumb,\" and \"35% Off\" separately. Finally, the graying haired man is seen at the desk again holding a piece of paper. A text box and the words \"Nowhere to Go: Portable Toilet Inventory Wiped Out In Central Florida\" are displayed on the screen. The video ends with the dark haired man holding one of the baseball bat mugs.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the mood of the man in the green suit after seeing the man in the pink plaid suit pick his nose and wipe his hands on his clothes?",
+ "answer": "B",
+ "candidates": [
+ "A. Anger and disgust.",
+ "B. Surprise and happiness.",
+ "C. Sadness and resentment.",
+ "D. Calmness and annoyance."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a king's crown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No, it does not exist.",
+ "B. I am not sure.",
+ "C. Yes, it exists."
+ ]
+ }
+ },
+ "AxLVdyXs": {
+ "video_id": "AxLVdyXs",
+ "video_duration": "61s",
+ "duration": "1-2min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A man wearing glasses and a baseball cap raises his hands in the air and hugs a woman beside him. He pumps his fist and yells excitedly, while other spectators cheer in the background. The perspective shifts to a general view of the soccer field, showing players in blue and white uniforms. A player in blue kicks the ball towards the goal. It bounces once and goes into the net past the diving goalkeeper. The player in blue runs towards the corner flag, throws off his gloves, and celebrates with the fans, jumping into their outstretched arms. The crowd cheers wildly.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "How does the player wearing the blue number 24 jersey interact with the staff member wearing orange clothing?",
+ "answer": "D",
+ "candidates": [
+ "A. Lifts the staff member.",
+ "B. Tackles the staff member.",
+ "C. Holds onto the staff member.",
+ "D. Protects the staff member."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What jersey number is worn by the blue team player who scored?",
+ "answer": "A",
+ "candidates": [
+ "A. 24.",
+ "B. 14.",
+ "C. 33.",
+ "D. 10."
+ ]
+ }
+ },
+ "gzosccqg": {
+ "video_id": "gzosccqg",
+ "video_duration": "501s",
+ "duration": ">8min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a first-person perspective of a motorcyclist riding on a two-lane road with yellow dividing lines. There is another motorcycle ahead. There is a lake on the left side of the road and trees on both sides. As the motorcycles travel down the road, various signs are passed on the sides. There are a couple of sections where the road passes underneath bridges. At several points in the video the second motorcyclist passes people walking or riding bicycles on the side of the road. At one point a hand appears in the frame to point at something. The video then changes to the second motorcycle sitting parked. The camera then pans back to a person who is filming the video, who begins speaking. White text overlays several moments of the speaker talking. Then the camera angle switches again to a motorcycle, and also focuses on a new person, then switches again to a different person. The camera continues switching angles. After some further shots of motorcycles, a white screen with grey text appears, reading \u201cI AM SANDY\u201d followed by \u201cLEAN YOUR MOTORCYCLE LIKE A PRO!\u201d The person speaking reappears and there is additional text overlaid on the screen while he speaks. The camera angle continues switching between various shots of motorcycles, people, and roads, and ends with a person pointing at text with red arrows. There is text overlaid on the road and the motorcycle as well. Finally the words \u201cClick It\u201d appear on the screen.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a lake in the middle of the video?",
+ "answer": "B",
+ "candidates": [
+ "A. There is.",
+ "B. There is not.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "LoTmUgCS": {
+ "video_id": "LoTmUgCS",
+ "video_duration": "54s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person wearing camouflage clothing walks through trees. Two more people follow. All three individuals hold sticks and are climbing a dirt slope using a rope for aid. Two people wearing combat gear and helmets, one blue and one red, are fighting each other in a field, while others watch. A man with a black shirt speaks to the camera. Three people are walking on a road, two are carrying other people on their shoulders. A man wearing a camouflage helmet is looking up to the sky. A man sitting down rubs his eye, he is holding an object in his other hand. A person is on the ground performing push ups. A close up shows several people in green shirts, which display a graphic of a rifle and a sword along with the words \"One Mind, Any Weapon\". A man is standing before a group. Someone pours what appears to be sand over another person's head. A black object is seen on the floor, as well as several pieces of cloth. People lift large, round objects together. People pour sand on each other\u2019s heads. Three people shake each other's hands, then hug each other. A man is speaking in the field, then the video darkens, and a logo with a red and gold dragon appears.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the purpose of the soldiers cutting open sandbags in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. To prepare concrete.",
+ "B. To create fortifications.",
+ "C. To participate in a sandbag challenge.",
+ "D. The purpose is unknown."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the black man and the soldiers in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. The black man is a passerby.",
+ "B. The black man is a soldier.",
+ "C. The black man is an instructor.",
+ "D. The black man is the coach."
+ ]
+ }
+ },
+ "CWiWZgXx": {
+ "video_id": "CWiWZgXx",
+ "video_duration": "48s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "In the upper left corner of the video, a man is visible, wearing a dark t-shirt with a logo on the front, sitting in front of a blue wall with white text and an emblem. In the larger portion of the video, two people are on a grassy field near a body of water and several trees. At first, both are standing, but they begin to grapple each other. One person, wearing red shorts and a dark-colored shirt, ends up on the ground. The other person, wearing a light colored shirt and dark shorts, moves around and seems to be pinning the first person to the ground. A third person walks into the frame. They are wearing a dark plaid shirt, dark shorts, and light-colored shoes.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What did the man in black pants do to the man in red pants when he was knocked down?",
+ "answer": "D",
+ "candidates": [
+ "A. Applied a triangle choke.",
+ "B. Hugged tightly.",
+ "C. Rolled to dodge.",
+ "D. Struggled to get up."
+ ]
+ }
+ },
+ "hdIcAkxU": {
+ "video_id": "hdIcAkxU",
+ "video_duration": "168s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Tech Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a black screen that displays the white text \"8 HOURS LATER...\". The scene then changes to a medium shot of a person with curly dark hair with blonde highlights. The person has on a red top, and appears to have on makeup. Throughout the video, the person's eyes move, their mouth opens and closes, their hands move, and they speak. At one point, the person holds up two blue and white rectangles. A brown smudge is on both rectangles. After this, the person removes makeup from their face using one of the rectangles. The person then holds the rectangles up again, then speaks, then moves their hands around their face. The person is in front of a gray background. At the top of the video, a horizontal yellow stripe appears with text that reads \"@MzBiancaRenee\", above the person's head.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why does the woman in the video appear so upset at the end?",
+ "answer": "D",
+ "candidates": [
+ "A. Because she worked an eight-hour shift.",
+ "B. Because her face is covered in oil.",
+ "C. Because she only has two blotting papers and cannot use more.",
+ "D. Because the blotting paper removed her makeup."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What might the woman in the video do next?",
+ "answer": "B",
+ "candidates": [
+ "A. Go to sleep.",
+ "B. Switch to a new foundation.",
+ "C. Buy more oil-absorbing sheets.",
+ "D. Do a deep facial cleansing."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "What does the woman in the video use to wash her face?",
+ "answer": "B",
+ "candidates": [
+ "A. Facial cleanser.",
+ "B. I don\u2019t know.",
+ "C. Hands.",
+ "D. Towel."
+ ]
+ }
+ },
+ "rASmdleG": {
+ "video_id": "rASmdleG",
+ "video_duration": "264s",
+ "duration": "4-6min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a speedway motorcycle race. Initially, two racers with red and white uniforms and helmets are shown on a dirt track. The racers are on motorcycles, with logos on them. One has a yellow helmet, and the other has a white helmet. A screen shows an Italian flag with \"1 pt\" and a Polish flag with \"5 pts\". Next, a Polish flag is waved in the stands, held by a person with a black hoodie. There are two people in bright pink and green hats and a scarf that says \"Polska\". Next, a close-up shot shows one of the racers with a white and green helmet, wearing a white and red uniform with the Polish flag on the side. The racer gives a thumbs up. The camera then cuts to the same racer driving on the track. The camera then cuts to a marshal wearing a black jacket and hat. The racer is behind the marshal. Another close-up is shown of the racer and the marshal. The screen says \"Heat 20\". The scene then shows the starting gate with 4 racers in uniform. The racers begin the race, and the camera follows them around the track, showing the racers kicking up dust as they go. The screen shows different track placements with times and names, as well as \"Heat 20, Lap 1\". The camera continues to follow the racers around the track. The screen then says \"Heat 20, Lap 2\". The camera follows the racers around the track. There is then a final race placement of: 1: B. Zmarzlik; 2: K. Huckenbeck; 3: P. Dudek; 4: M. Smolinski. The camera again follows the racers around the track with the time at \"Heat 20, Lap 2\". The camera cuts to different viewpoints along the racetrack as the riders continue to race with \"Heat 20, Lap 3\". The camera follows the race to the finish line. The screen now shows 1: B. Zmarzlik; 2: M. Smolinski; 3: K. Huckenbeck; and 4: P. Dudek with the time being at \"Heat 20, Lap 4\". Next, one of the racers, in blue and white, is shown riding around the track, with the screen saying \"Poland: 3pts and Germany 3pts\". 
Next, a section of the crowd is shown with people waiving both Polish and German flags. The camera cuts back to a rider in blue and white and a German rider in black and yellow. Both riders meet in the center of the track. The screen says \"Final KO\" with the time at \"Heat 23, Lap 1\". Next is another race with 4 riders and another \"Final KO\" screen. Two riders are shown with a man in a black jacket in between them. The camera cuts to a closeup of a rider in a red helmet. The camera cuts again to a starting gate with four riders. The racers start on the track with \"Heat 23\". The camera pans around to follow the racers with the time at \"Heat 23, Lap 1\" and \"Heat 23, Lap 2\". The camera shows the different placements and times and the race proceeds with the time at \"Heat 23, Lap 3\". The camera continues to show the racers and the final times are: 1: M. Smolinski; 2: V. Milik; 3: K. Huc; and 4: E. Krcmar. The screen then shows a \"Germany: 3 pts\" and \"Czech Rep: 3pts\" score. A rider wearing red is shown riding around the track, and waves with his right arm. The video shows the same rider giving another rider a high-five on the track with the riders in uniform at the finish line. The camera cuts back to the riders on the track. One rider gives a thumbs up. The rider then waves on the track. The camera cuts again to the same rider who is now waving again while riding around the track. Lastly, a red, white, and blue screen is shown with the text \"FIM Speedway of Nations\" and a blue red and blue striped banner.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "Which country is the athlete from who scored twice in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Italy.",
+ "B. Germany.",
+ "C. Czech Republic.",
+ "D. Poland."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What color is the helmet worn by the first person to cross the finish line in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Yellow.",
+ "B. Red.",
+ "C. Blue.",
+ "D. White."
+ ]
+ }
+ },
+ "UHdeBzre": {
+ "video_id": "UHdeBzre",
+ "video_duration": "166s",
+ "duration": "2-4min",
+ "domain": "Tech & Science",
+ "sub_category": "Space Missions",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A bright yellow light appears to the right of the center. A graphic with white writing and numbers in the bottom left corner indicates a speed of 1899 and an altitude of 19.3, as well as \u201cSTAGE 2 TELEMETRY\u201d written below. A graphic with an upward curve and white writing is seen in the bottom center. The time \u201cT+00:01:27\u201d is displayed there. The light on the right appears to move upward and becomes more wispy with orange edges. The numbers on the bottom left of the screen increase throughout the video. Some small video clips are visible in a row at the bottom of the screen. The word \"Swipe\" is visible in the upper center of the video clip and a pause button appears below that. The time in the bottom center of the video is also changing throughout. A small box appears in the center of the screen with a time inside of it (-32:15). As the video progresses, the bright light transitions to a black and white smoky substance. The screen display changes to 3 video clips at the bottom. There is a long horizontal video clip in the middle and 2 small video clips on the left and right. All three clips are in the bottom of the screen. In the middle video clip, there appears to be some mechanism that's centered. The upper center of the screen shows a bright red light with a yellow center, which fades and moves upward throughout the video. The bottom three clips move up the screen, and eventually the center clip shows a white circular light and then some dark mechanism. As the video progresses, the upper center part of the screen shows a much smaller bright light. As the video progresses there are two points of light visible, one in the upper center, and one in the center of the three video clips in the bottom of the screen. Eventually, the video in the middle of the three clips shows the two fairing pieces separated from the main object. The two fairing pieces appear as very dim white semi-circles that separate in the middle video clip. 
Then, the video in the upper center of the screen displays a rocket engine. The rocket engine continues to be visible as the video continues. At the end of the video, a dark object is seen in the upper center with a bright white light in the bottom, and in the video clips at the bottom of the screen there are two bright white dots.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the state of the solid rocket boosters after separation?",
+ "answer": "D",
+ "candidates": [
+ "A. They fly towards the moon.",
+ "B. They are suspended in mid-air.",
+ "C. They continue ascending.",
+ "D. They begin returning to the ground."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the relationship between the central core and the side boosters?",
+ "answer": "B",
+ "candidates": [
+ "A. The central core controls the ignition of the side boosters.",
+ "B. The boosters and the central core operate independently.",
+ "C. The central core is responsible for bringing the side boosters back to Earth.",
+ "D. The side boosters and the central core shut down simultaneously."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual situation is shown in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. After losing its booster, the rocket operates normally.",
+ "B. The camera does not melt despite the high temperatures during rocket launch.",
+ "C. Rockets can be controlled from a distance of 10,000 meters.",
+ "D. The booster can still be located and landed after detachment."
+ ]
+ }
+ },
+ "YTJGbLPJ": {
+ "video_id": "YTJGbLPJ",
+ "video_duration": "89s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person with light skin and reddish hair is centered in the frame, speaking while gesturing with their hands. The person is wearing a gray shirt with a cartoon drawing on the front. A gold-colored lamp stands on the right side of the frame, with a lit candle in a glass holder in front of it. Behind the person, there is a fireplace and couch. The person holds a clear, plastic container containing several rolls of patterned tape, a pair of scissors, and a tape dispenser. The person touches the different items and moves them around.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the golden desk lamp in relation to the man in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The golden lamp is in front of the man to his left.",
+ "B. The golden lamp is behind the man to his left.",
+ "C. The golden lamp is in front of the man to his right.",
+ "D. The golden lamp is behind the man to his right."
+ ]
+ }
+ },
+ "QQJNBZhx": {
+ "video_id": "QQJNBZhx",
+ "video_duration": "97s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Cooking",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing black gloves cuts a tomato into slices on a white cutting board. Then they chop okra on the cutting board. They place the chopped okra in a glass bowl. The person is now wearing a dark blue apron and a plaid shirt with orange and yellow in it, in addition to the black gloves. They season the chopped okra in the bowl. They then rinse the okra under running water in a sink. They cover a pot with a lid and then add oil from a clear bottle into the pot. The person then uses a red spatula to transfer shrimp from a glass bowl into the pot. They then season the shrimp using pepper and salt from a wooden pepper grinder and a wooden container with a lid. They use metal tongs to stir the shrimp as they cook in the pot.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is a Black man most likely to do immediately after cooking shrimp?",
+ "answer": "B",
+ "candidates": [
+ "A. Eat the shrimp.",
+ "B. Cook okra.",
+ "C. Add more salt.",
+ "D. Clean the kitchen."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "How should okra be prepared during the ingredient preparation stage?",
+ "answer": "D",
+ "candidates": [
+ "A. Juice it.",
+ "B. Cut it into strips.",
+ "C. Dice it.",
+ "D. Slice it into segments."
+ ]
+ }
+ },
+ "tUCOtIVa": {
+ "video_id": "tUCOtIVa",
+ "video_duration": "97s",
+ "duration": "1-2min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a view of a website with a header that contains categories labeled \"Art,\" \"Bushes,\" \"Clothing,\" \"DIY Recipes,\" \"Flowers,\" \"Fossils,\" \"Fruits,\" \"Furniture,\" \"Materials,\" \"Photos,\" \"Posters,\" \"Songs,\" \"Tools,\" and \"Villagers.\" The \"Villagers\" category is selected, and the screen displays rows of digital representations of different animal characters, with names and prices underneath each image. The screen scrolls down through the list of characters. The video then transitions to a screen labeled \u201cnookazon\u201d with a \u201cnew feature release\u201d for \u201cTrade Chat,\u201d showing several cartoon characters at the bottom. After this, the video displays a 3D rendering of a character standing on a patch of grass, with a body of water and a sky in the background. This character closes their eyes and an effect appears on their body. The scene then transitions to an interior view with two characters standing on a patterned floor. One of the characters has a speech bubble next to it that says \u201cHeckin IDIOT.\u201d The video then returns to the website, specifically the account creation screen, which has options to sign up with discord, twitter, or a manual method using email. A username box is typed into with text reading \"test user\". The video then shows a link in the footer being selected, which opens a black screen with a message that the discord app has launched. The screen then shows discord with a series of channels on the left. The arrow points to the \"rules-you-must-read\" channel. The screen scrolls to display more text in the channel. An arrow points towards a prompt saying \u201cpress the below once to gain access to the rest of the channels\u201d. A box is then selected with the arrow in the discord interface. 
It then returns to the website, navigating again to the villagers section, then to the listing for a character named \u201cAudie,\u201d and a description that says \u201cAudie is a peppy wolf villager.\u201d Then, a small text box with the average price of \u201cAudie\u201d is highlighted, as well as the posted date dropdown. The view then scrolls down to show offers for \u201cAudie.\u201d One of the offers is selected and a small text box explains the average price, with the listing of the individual offer. A make an offer button is then selected with the arrow, revealing a window to input offers. The video transitions to a view of a shopping cart with canceled and denied notifications. The video then shows an arrow selecting the villager trade channel in Discord. The channel opens up to show a picture of a character with the text \"Audie is moving! LF NMT, please send me a dm for offers!\". A cursor is used to hover over the profile pic and send a message. Then, the video transitions to the website, showing a list of offers on Audie\u2019s listing page. An offer is selected with an arrow and the cancel button is selected. Finally, the screen shows notifications, including canceled and denied offers and then returns to the discord interface with the villager-trade channel selected, and a highlighted message from Olivia.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the author set up the account?",
+ "answer": "B",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the beginning of the video.",
+ "C. At the end of the video.",
+ "D. Throughout the video."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order for trading with villagers on Nookazon according to the video? (a) Set up a Nookazon account and add a Discord channel (b) Make an offer of 20 Nook Mile tickets (c) Make an offer of 30 Nook Mile tickets (d) Click on a name to start a private conversation with someone (e) Discover the \"Villager Trading\" channel",
+ "answer": "B",
+ "candidates": [
+ "A. (a), (c), (b), (e), (d).",
+ "B. (a), (b), (c), (e), (d).",
+ "C. (a), (e), (d), (b), (c).",
+ "D. (a), (e), (d), (b), (c)."
+ ]
+ }
+ },
+ "JKCcTdeB": {
+ "video_id": "JKCcTdeB",
+ "video_duration": "490s",
+ "duration": ">8min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a man wearing a white hoodie and black headphones on the right of the screen. A program is visible on the left side of the screen with menus and tabs. The man clicks on the plus sign on the left side of the screen, and then a menu appears with options such as \"advancement\", \"block\", and \"command\". He clicks on \"biome\", and an \"Enter Biome Name\" box appears. He types \"BandiLand\" and clicks on \"create new biome\". In the program, on the left side, color options appear. The man clicks on one of the color options, and a color palette appears. He clicks on multiple colors and then selects them. After, he scrolls down and selects boxes associated with stem, branch, vines and fruits. A box appears, and inside there are multiple block options from the game Minecraft. He clicks on one of the blocks, and then click \"use selected item\". After, he clicks on \"Fluid\" and creates one called \"Wated:DK\". The man then creates a block texture and draws a pattern in the texture image. After, he clicks on \"block\" for the texture type. The created texture is then selected. After, he clicks on \"food\", and creates one called \"GrowBerrys\". He also creates a texture, drawing a design in a color that's red, and then selects \"block\" for the texture type. He clicks \"food model\" and selects the 2d item texture, and then he clicks on \"select texture\". The screen goes black, and the man covers his mouth with his hand. He then uses a finger on his left hand to push his left eyelid up, as if he's looking more closely at the screen. Then, the man creates a \"BAndiOre\", \"BAndiOreBlock\", \"BAndiOreRecipe\", \"BAndiArmor\", \"BAndiArmorHelmetRecipe\", \"BAndiArmorBodyRecipe\", \"BAndiArmorLeggingsRecipe\" and \"BAndiArmorBootsRecipe\". He uses a color palette to select colors for these armor pieces, and then he selects the block from the menu as the texture. After this, he creates a WoodPack, and a ToolPack. 
After, he creates a \"Living Entity\" called \"Greg\". He then searches for Minecraft skins online, entering the names \"Notch\", \"Ban\", and \"Bandirue\". Then, he selects a texture for his entity named Greg. He then clicks on \"test world\" and the world loads. The game world shows a flat land with trees and grass. The man changes settings so he has a faster field of vision, and switches from the \"Creative\" setting to \"Survival\". He begins by creating crafting items by crafting \"WoodTreePlanks\". Then, he crafts a crafting table, and creates the tools he created with the program. He then shows that he has all of the armor pieces on. He then begins to mine into the ground with a pickaxe. After mining for some time, he goes back to the surface and picks the \"GrowBerries\" which he created. He then crafts cobblestone and his \"BAndiOre\" and armor parts. The man changes the world settings to Creative mode. He equips the armor, and then shows the \"BandiLand\", where he shows the trees, and the water texture. He shows the \"Greg\" entity, and then equips a sword, and kills another entity. The video ends by the man speaking to the camera.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video display a \"growing berries\" texture?",
+ "answer": "C",
+ "candidates": [
+ "A. No, it does not.",
+ "B. None of the above.",
+ "C. Yes, it does.",
+ "D. It is not clear."
+ ]
+ }
+ },
+ "NUhenCVt": {
+ "video_id": "NUhenCVt",
+ "video_duration": "300s",
+ "duration": "4-6min",
+ "domain": "Tech & Science",
+ "sub_category": "Engineering Projects",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person is standing in front of a set of shelves, and then walks toward a table. The person puts on gloves and starts assembling items on the table. The items consist of wooden parts, a clear plastic bag of small metal parts, and several larger metal components. The person inserts screws into the wooden parts. The person inserts a small, cylindrical white part, followed by a spark plug into one of the metal components. The person screws multiple small metal components together and puts them in the wooden structure. The person takes out the plastic parts from a clear bag and inserts them in the metal and wooden components. The person finishes assembling the components, and puts the finished item back on the table.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "Into how many parts was the nozzle disassembled in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Six.",
+ "B. Five.",
+ "C. Seven.",
+ "D. Eight."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Was a spring included during the installation of the nozzle in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. No.",
+ "B. Yes.",
+ "C. I am not sure."
+ ]
+ }
+ },
+ "AwurESFo": {
+ "video_id": "AwurESFo",
+ "video_duration": "82s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A woman with her hair in a bun sits in front of a shelving unit. A neon sign that says \"slay\" is visible behind her. She speaks to the camera, gesturing with her hands. She holds up and points at a small bottle of Olay Tone Perfecting Serum. The image changes to a before-and-after split-screen view of the woman\u2019s face. On the left (\u201cbefore\u201d) side, her hair is in braids, while on the right (\u201cafter\u201d) side, it\u2019s in a bun as before. The woman returns and continues speaking to the camera, gesturing with her hands.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Is it common for Black women to use facial masks?",
+ "answer": "B",
+ "candidates": [
+ "A. This question is not applicable.",
+ "B. No, it is not common.",
+ "C. Yes, it is common.",
+ "D. I do not know."
+ ]
+ }
+ },
+ "IynMlwan": {
+ "video_id": "IynMlwan",
+ "video_duration": "86s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with a black screen, which then transitions to a scene at a stadium. The scene shows two people standing beside each other holding a flag with green, white, and orange stripes. The person on the left is wearing a black hat, blue and white scarf, and black jacket, while the other person on the right is wearing a blue and white scarf, and a green jacket. The text \"THE HEARTWARMING STORY OF EVERTONIANS EOIN AND FRANCIS\" appears over the lower portion of the screen. The video then focuses on the individual on the left side of the screen, who gives a thumbs up. Next, the video switches to the person on the right and then returns to the person on the left. This pattern repeats several times. The video then transitions to a shot inside of a store, showing a person holding up a blue jersey with the number 10 on it. The word 'ROONEY' is also visible on the jersey. Next, the video shows a group of 4 people at the stadium. The individual on the far right side of the screen is one of the people holding the flag from previous scenes. The video then transitions to a shot of an individual, who is also from the original two people, shaking hands with another person. Then, it shows another individual shaking hands with the person from the original two people. The video transitions back to the two original people being interviewed. In the background, two other people can be seen walking along a path. Then a video is shown of 3 people, two from the original interview and a third person in the middle. One person is seen putting his arm around the other 2. The video transitions back to the interview. Then, a text box with the words \"Seamus Coleman\" and a message appears. The text is then removed. The video then shows one person from the original interview speaking and then cuts to the same interview with the other original person speaking. The video then shows the person who was not being interviewed, with his arm around the other original person. The video then shows a close-up of one of the original people who were interviewed as he shakes hands with another individual. The video ends with the original two people being interviewed again.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "Who is Francis's favorite player?",
+ "answer": "B",
+ "candidates": [
+ "A. Eoin.",
+ "B. Paddy.",
+ "C. Francis.",
+ "D. Seamus Coleman."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What happens next after Eoin describes Seamus Coleman's generous donation?",
+ "answer": "B",
+ "candidates": [
+ "A. Coleman appears in the video.",
+ "B. The video ends.",
+ "C. Francis begins crying.",
+ "D. They start discussing other topics."
+ ]
+ }
+ },
+ "moDlvufA": {
+ "video_id": "moDlvufA",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Talks",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man wearing a black t-shirt with a white and yellow design is on a teal couch, reclined and looking to his left. He has a silver watch on his left wrist. There is a black remote control on the couch next to him. The man sits up and puts his arms across his chest. He yawns and looks to his right. A man in a black hooded robe approaches him and says, \"I think you should get water. Sprinkle it on your eyes so you will be strong enough to continue the movie. Hmm.\" The first man responds, \"Wisdom will not kill me.\" He gets up from the couch. The video shifts to a close up of a TV screen, showing a person with a bloody face and black background. The scene on the TV changes to a rocky ground with fire in the background. The first man sits back on the couch. The video shows the man in the black hood again, and the first man is pouring water from a clear container into his hands. The first man drinks from his cupped hands and the video shifts back to the TV showing two figures embracing. The video shows the man holding a clear container and pouring water into his hands again. The video shows a close up of the TV showing a person in a dark setting with the caption [mournful music]. The man pours water into his hands a final time as the man in the hood says, \"Now you have more strength to finish up the movie.\" The video shifts back to the first man who says \"When Ola sees me tomorrow I will tell him twelve hours\". The video shows a hand holding the remote and then the TV showing various movie thumbnails. The first man asks, \"I sat on a movie. Which one should I watch?\" The TV shows more movie thumbnails and the man asks, \"Which one?\"",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the person dressed in black clothing and wearing a watch doing with water?",
+ "answer": "D",
+ "candidates": [
+ "A. Taking a bath.",
+ "B. Cooking.",
+ "C. Drinking water.",
+ "D. Washing their face."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the sofa that the person, wearing black clothes and a watch, is sitting on?",
+ "answer": "B",
+ "candidates": [
+ "A. Black.",
+ "B. Blue.",
+ "C. Red.",
+ "D. Gray."
+ ]
+ }
+ },
+ "fKoPGylX": {
+ "video_id": "fKoPGylX",
+ "video_duration": "267s",
+ "duration": "4-6min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "A selfie of a young woman with long brown hair wearing blue scrubs is shown. She is smiling and wearing a smartwatch on her left wrist. She is sitting in a car. Text appears over part of the image, along the bottom edge. The image transitions to different images of the same woman, including one where she is standing in front of a mountain vista, and another taken inside a vehicle. These are superimposed over the original image, which slowly fades out. A man with glasses and a beard is shown, sitting in what appears to be a recording studio in front of a microphone, looking at a tablet. The image transitions to a protest, where a person is holding a sign reading \"No Vaccine\" with a drawing of a syringe with red drops dripping from it. Various US and \"Make America Great Again\" flags are behind them. The image transitions back to the man in the recording studio, who is looking at the tablet and speaking.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "In the video, how many total times did the individuals mentioned post on Facebook?",
+ "answer": "A",
+ "candidates": [
+ "A. Three times.",
+ "B. One time.",
+ "C. Two times.",
+ "D. Four times."
+ ]
+ }
+ },
+ "CkgGXJXV": {
+ "video_id": "CkgGXJXV",
+ "video_duration": "58s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Sketches",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person is lying down on what appears to be a bed. They have facial hair and are wearing a yellow shirt with buttons and a blue and black checkered shirt over it. Another hand is seen placing a lightbulb in the person\u2019s mouth. The hand places pressure on the person's forehead and then removes the lightbulb from their mouth. The video then shows the person with the lightbulb removed from their mouth and then the person is partially obscured by another person's hand and arm, and their feet.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "What is the prankster doing first in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Puts a light bulb into their mouth.",
+ "B. Imitates a doctor checking someone's heart rate.",
+ "C. Pretends to act in a serious manner.",
+ "D. Lifts the eyelids of a sleeping person to mock them."
+ ]
+ }
+ },
+ "GHCelweo": {
+ "video_id": "GHCelweo",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Event Livestreams",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A woman and a man, both wearing sunglasses, are standing in front of a body of water. The woman is wearing a black collared shirt with a logo on the left breast. The man is wearing a black vest with a logo on the left breast, over a white collared shirt. The logo is a white square with an arrow symbol going up to the right. The word \"PRESTIGE\" is underneath the logo. The man gestures with his hands as he speaks. The camera moves to show several yachts moving in the water, with a helicopter flying above. The camera angle shifts, now showing the view from the rear of a boat as it moves through the water. The camera moves again to show a large yacht traveling through the water with people visible on the top deck. The camera is then again positioned at the front of a boat showing it moving through the water. The camera moves to show a side view of the yacht as it travels across the water. The woman and man are shown again talking in front of the water, the man gesturing with his hands as he speaks. A yacht with the text \"X70\" is then shown moving across the water, along with views from various angles, including an overhead and interior view. A man is seen walking through an interior space on the yacht. A man and a child are seen standing in front of several screens.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What are the last three letters of the model logo displayed on the yacht in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. TIG.",
+ "B. IGE.",
+ "C. PRE.",
+ "D. TGE."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "How many Prestige X60 yachts are there in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Two.",
+ "B. Four.",
+ "C. I don't know.",
+ "D. One."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "At the beginning of the video, how many yachts appear?",
+ "answer": "B",
+ "candidates": [
+ "A. Two.",
+ "B. Four.",
+ "C. Three.",
+ "D. One."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "In which scene does the long-haired lady with white hair appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. The deck.",
+ "B. The cabin.",
+ "C. The bow.",
+ "D. The stern."
+ ]
+ }
+ },
+ "cfHFfFWz": {
+ "video_id": "cfHFfFWz",
+ "video_duration": "105s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a close-up of a man wearing a dark head covering and light green clothing, standing in a field of plants with round, bulbous seed pods. He manipulates some of the seed pods with his hands. The camera shows a close up of his hand on the pods. The view changes to show the man with his back to the camera, and then the view changes again to another person in a white garment who is also picking seed pods. The view changes to show more of the field of plants, and a man in military garb walks toward the camera, with other similar figures in the background. There are soldiers and other men in the background in some frames. Then the video transitions to an area with a wall and a group of men, including three soldiers in military garb and several other people in civilian clothing. The view changes to a close-up of a soldier's face. The video shows an older man with a white beard in a white head covering and white robe. The view shows a reporter wearing a black vest, standing in the same plant field and speaking with his hands. Behind him, a man in dark clothing is in the background, working with the plants. The video shows another soldier, followed by the man in the dark clothing again. The video shows a pile of white and yellow bags. The video shows a view of a field of plants, and the video ends with a logo for MOX NEWS.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will U.S. soldiers compensate the farmer?",
+ "answer": "B",
+ "candidates": [
+ "A. No compensation is necessary.",
+ "B. Will.",
+ "C. Will not."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "Were the poppy fields ultimately destroyed?",
+ "answer": "D",
+ "candidates": [
+ "A. Yes.",
+ "B. The poppy fields are not mentioned.",
+ "C. It is not clear.",
+ "D. No."
+ ]
+ }
+ },
+ "MBEYmLNm": {
+ "video_id": "MBEYmLNm",
+ "video_duration": "42s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Movie Reviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A man wearing a gray suit walks into a dining room with a table set for a meal. Two men are sitting at the table. One man is wearing a striped suit and a red bow tie. The other man at the table is wearing a grey suit and patterned tie. A third man is standing with a brown suit and patterned tie. The man in the brown suit says, \"Heard you had a lucky streak.\" The man in the grey suit responds, \"Luck had nothing to do with it.\" The man in the striped suit says, \"A skilled player is what he means.\" The man in the brown suit says, \"That's what I hear. So what's all this about cutting me off? It's not that. It's a small house, we just can't handle your kind of action right now.\" The man in the striped suit says, \"Your credit's good with me.\" The man in the brown suit says, \"Why don't we get to know each other better?\" The man in the grey suit says, \"I don't think I like what I'm hearing.\" The man in the brown suit replies, \"Maybe you shouldn't be listening to the grown-ups' conversation.\" The man in the grey suit stands up from the table. The man in the striped suit says, \"Charlie, sit down. Charlie, Charlie.\" The man in the gray suit is seen moving behind the man in the striped suit. The man in the brown suit looks onward. The video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What material is the cup on the table made of?",
+ "answer": "A",
+ "candidates": [
+ "A. Glass.",
+ "B. Porcelain.",
+ "C. Metal.",
+ "D. Wood."
+ ]
+ }
+ },
+ "liiNxoPV": {
+ "video_id": "liiNxoPV",
+ "video_duration": "607s",
+ "duration": ">8min",
+ "domain": "Tech & Science",
+ "sub_category": "Tech Reviews",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A blonde-haired woman sits on a couch, holding a black and gold vaping device. She gestures with her hands while looking at the camera. She holds up the device, displaying its various sides. She then takes a puff from the device, exhaling a cloud of vapor. The camera zooms in on the device\u2019s screen as she presses buttons, cycling through a lock screen, a settings menu indicated by a gear icon, and other displayed information. She continues to hold and examine the device while speaking to the camera.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video include a user manual?",
+ "answer": "C",
+ "candidates": [
+ "A. No, it is not included.",
+ "B. I am not sure.",
+ "C. Yes, it is included."
+ ]
+ }
+ },
+ "KKqPKmuB": {
+ "video_id": "KKqPKmuB",
+ "video_duration": "119s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Storytime",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A grey and white tabby cat is walking on a red wooden floor, and a black robot vacuum is in the background. A white parrot with a yellow crest is perched on the back of a chair next to a glass table. Then, the cat stands facing a brown and white puppy. A woman is seen holding the puppy as it walks toward the cat. The woman bends down and pets the puppy. The woman picks up the puppy. The video then shows the puppy and a woman in front of a staircase. The video then shows a man holding and kissing a grey parrot. A woman is seen in the background. The video then shows a white parrot with a green scarf. Then a white and brown puppy is seen sleeping on a blue couch. Then a woman is seen holding a brown box, as a man in a black shirt stands next to her. The woman holds a puppy. The video then shows a black and white dog sitting on a gray rug, under a glass table. The woman is seen sitting on a blue chair in the background. The video then shows a woman with blond hair holding a brown and white puppy. A blue and green parrot is seen in the background. Then a man in a grey t-shirt pets a puppy as a green and yellow parrot perches on his shoulder. The video then shows the grey and white cat on a blue couch. Then, a woman is seen petting a white parrot on a glass table. Then, two white parrots are standing next to each other. The woman laughs. One of the parrots has a yellow crest. The parrots both move, then a woman says \u201cBye!\u201d and \u201cBuh-bye!\u201d. The woman holds two parrots on her arm and one of them flies off. The video shows the room full of animal cages, wooden perches, and boxes.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the puppy relative to the cat when the video starts?",
+ "answer": "A",
+ "candidates": [
+ "A. The dog is to the left of the cat.",
+ "B. The dog is behind the cat.",
+ "C. The dog is in front of the cat.",
+ "D. The dog is to the right of the cat."
+ ]
+ }
+ },
+ "rcFQZFdZ": {
+ "video_id": "rcFQZFdZ",
+ "video_duration": "72s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Auto",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing black rollerblades with large wheels and tan pants is shown rollerblading on a forest path. The person is then shown rollerblading on grassy hills and then on a path covered in leaves. A shirtless man wearing a skull mask is shown in the forest. The camera then pans upward to trees. The camera then goes to a path where a man in black pants, no shirt, and a skull mask is shown rollerblading. The person is then shown rollerblading down various forest paths.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "After hearing a bat's call, how did the man's emotions change?",
+ "answer": "B",
+ "candidates": [
+ "A. He became angry and disgusted.",
+ "B. He became calm and silent.",
+ "C. He became excited and enthusiastic.",
+ "D. He felt melancholic and lonely."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What type of life is mentioned at the end of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A red butterfly.",
+ "B. A brownish-gray butterfly.",
+ "C. A green butterfly.",
+ "D. A white butterfly."
+ ]
+ }
+ },
+ "mjNbKLpl": {
+ "video_id": "mjNbKLpl",
+ "video_duration": "42s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two fighters are boxing in a cage. One fighter has a shaved head and a beard, and wears yellow and black shorts. The other fighter has dark hair, a beard, and some tattoos on his chest and arms; he wears black shorts. They circle each other, exchanging punches. The fighter in black shorts lands a combination of punches, causing the other fighter to fall to his knees. The referee begins the count. The downed fighter remains on his knees, looking up at the referee. The referee reaches the count of ten, waves his hands to indicate the end of the fight, and separates the fighters.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times did the referee count down before the end of the game?",
+ "answer": "D",
+ "candidates": [
+ "A. Two.",
+ "B. Three.",
+ "C. Five.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What boxing technique did the short-haired man use to knock the bald man to the ground?",
+ "answer": "A",
+ "candidates": [
+ "A. A right hook.",
+ "B. A straight punch.",
+ "C. An uppercut.",
+ "D. A left hook."
+ ]
+ },
+ "task2": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual action did the short-haired man perform at the end of the competition?",
+ "answer": "B",
+ "candidates": [
+ "A. He vigorously pounded his chest.",
+ "B. He climbed over the fence.",
+ "C. He knelt on the ground.",
+ "D. He broke into a dance."
+ ]
+ }
+ },
+ "JEaovhwv": {
+ "video_id": "JEaovhwv",
+ "video_duration": "56s",
+ "duration": "<1min",
+ "domain": "Music",
+ "sub_category": "Remixes",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A group of people are shown in various settings. In one scene, a group of five people are standing or sitting in what appears to be an outdoor urban environment. In another, four of the same people are sitting around a desk inside, some with instruments, including a keyboard and a mixer. Several different indoor settings show various subsets of these four individuals. A close up of an electronic music mixing board is shown. In later scenes, these individuals are seen walking on the street and also performing on a large stage in front of large crowds of people. Different camera angles and perspectives are shown, and there are many instances where the video is collaged or fragmented. A sign is briefly displayed which reads \"The Four Harmonious Friends.\" Different musicians are seen using keyboards, mixers, and drums.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times did the man in gray carry the guitar on his shoulder?",
+ "answer": "A",
+ "candidates": [
+ "A. Once.",
+ "B. Zero times.",
+ "C. Three times.",
+ "D. Twice."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Emotion Change",
+ "question": "What emotional changes do the members of Rudimental experience in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. There is no emotional change.",
+ "B. They go from calm to excited.",
+ "C. They go from excited to calm.",
+ "D. They remain excited."
+ ]
+ }
+ },
+ "VvYSrBQE": {
+ "video_id": "VvYSrBQE",
+ "video_duration": "469s",
+ "duration": "6-8min",
+ "domain": "Daily Life",
+ "sub_category": "Travel",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "Two women sit on floor cushions at a low table in a room with beige walls and sliding doors. A steaming pot sits in the center of the red table, along with various small bowls and glasses. The woman on the left wears a dark blue Levi's sweatshirt, while the woman on the right wears a denim shirt. The woman on the left uses chopsticks to eat from a small bowl. The other woman picks up food from the central pot with her chopsticks. The woman on the left takes a drink from a glass. The other woman looks at her phone. They continue to eat and converse, their hands gesturing as they speak. The woman in the denim shirt picks up and sets down her phone several times. At one point, the woman in the Levi's sweatshirt rests her chin on her hand. They continue to interact and eat.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Does the video include a photograph of someone wearing a kimono?",
+ "answer": "D",
+ "candidates": [
+ "A. Not applicable.",
+ "B. No, it does not.",
+ "C. It is not clear.",
+ "D. Yes, it does."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "In the video, what is the order of topics discussed by the woman with long brown hair and the woman with her hair tied into a bun? (a) Discusses Japanese cuisine. (b) Recommends travel destinations. (c) Shares Japanese cultural experiences. (d) Paris protests.",
+ "answer": "A",
+ "candidates": [
+ "A. (a) (c) (b) (d).",
+ "B. (c) (d) (b) (a).",
+ "C. (a) (b) (c) (d).",
+ "D. (c) (b) (a) (d)."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "Where does the woman with long hair tied in a bun sit in relation to the woman with long brown hair in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. To the right of the woman with long brown hair.",
+ "B. Across from the woman with long brown hair.",
+ "C. Diagonally opposite the woman with long brown hair.",
+ "D. To the left of the woman with long brown hair."
+ ]
+ }
+ },
+ "iuvcIYkj": {
+ "video_id": "iuvcIYkj",
+ "video_duration": "102s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Skincare & Makeup",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A woman with dark skin and dark hair pulled up in a bun sits facing the camera. She is wearing a light gray tank top and large silver hoop earrings. She holds up a black and white makeup sponge, then a spray bottle, which she sprays onto a folded white tissue in her other hand. She picks up the sponge and begins dabbing it onto her face, starting with her forehead, then moving to her cheeks, chin, and nose. She brings her hand, now showcasing a blended swatch of foundation on the back, into view, before resuming applying the foundation to her face with the sponge, in similar dabbing motions across her forehead, temples, cheeks, chin, and nose. She holds the makeup sponge up close to the camera, showing the foundation now visible on its surface, before continuing to apply it to her face. After a while she pauses, then resumes dabbing foundation onto her face, focusing on blending. She occasionally adjusts her earrings while applying the foundation. Finally, she licks her lips, moves her tongue around inside her mouth, and rubs her chin with her finger.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is a woman's typical reaction after using a makeup sponge?",
+ "answer": "C",
+ "candidates": [
+ "A. She believes that using a sponge results in a more even and natural-looking makeup application.",
+ "B. She finds that a sponge does not provide adequate coverage.",
+ "C. She believes that sponges cause her to waste too much foundation.",
+ "D. She finds that applying makeup with a sponge is more comfortable than using a foundation brush."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "How does the woman's emotion evolve in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. From curiosity to confusion to agreement.",
+ "B. From agreement to confusion to curiosity.",
+ "C. From opposition to curiosity to confusion.",
+ "D. From curiosity to confusion to opposition."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will women continue to use makeup sponges in the future?",
+ "answer": "A",
+ "candidates": [
+ "A. No, they will not.",
+ "B. I don't know.",
+ "C. Yes, they will."
+ ]
+ }
+ },
+ "RJLdxvvg": {
+ "video_id": "RJLdxvvg",
+ "video_duration": "183s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Music Videos",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A pixelated skeleton character, Sans, with a red slash across his torso, stands on a light gray platform against a solid light gray background. The scene transitions to white, then to a black background with Sans in the center, holding a bone-like weapon. A text box below him reads \"* Sans is dead serious now.\" A battle menu appears at the bottom. The scene shifts to a battle screen. A red, oval-shaped battle area appears below Sans. The player's heart, a red pixelated heart, appears within a box in the battle area. Sans blocks an unseen attack, and the word \"BLOCKED\" appears above him. The red oval disappears, and the battle box remains. Sans stands above it. The heart turns blue. Two white bones appear from the sides of the box, almost touching the blue heart. Sans raises one hand. The scene flips, and the box is now above Sans, who is upside down. The blue heart remains within the box. White bones start appearing at the bottom edge of the box, moving upwards. Sans is right-side up again, above the battle box. Two white, skull-like objects with flames appear on either side of the box. The blue heart is within the box. Sans and the box rotate again, as before. More of the white, skull-like objects appear. Sans is right-side up. A large, white blade appears diagonally across the box. Two skull-like objects are also present. Sans is upside down again. The white blade stretches across the box, and the white skull-like objects increase in number. Sans is right-side up. More white, skull-like objects appear. Sans is upside down. The battle box shrinks, confining the blue heart. Vertical bones fill the inside of the box. Sans is right-side up, and the red oval battle area reappears below him. The scene flips, turning Sans upside down. The box shrinks again, as before. White skull-like objects reappear near Sans, as before. A long, white blade appears. Sans is right-side up. More white, skull-like objects appear, surrounding him. Sans is upside down once more. The same pattern of vertical bones in the shrinking box repeats. Sans is right-side up, facing a black screen. White bones and white skull-like objects start appearing. His left eye socket glows light blue, and then his right eye socket glows yellow. More white, skull-like objects appear, as before. Sans rotates again, now upside down. Multicolored vertical bars appear within the box. Sans is right-side up. Two white, diagonal beams cross through the battle box. More white, skull-like objects appear, as before. Sans stands above the battle box. More white beams start to crisscross the box, and white, skull-like objects fly around him. Sans remains still as his health bar, labeled \"KR,\" decreases. White skull-like objects appear. Sans sits down on the edge of the battle box. A bone appears in the box. Sans remains motionless as the video ends.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "At what point in the video does the skull flip?",
+ "answer": "A",
+ "candidates": [
+ "A. The skull flips at the beginning of the video.",
+ "B. The skull flips at the end of the video.",
+ "C. The skull flips throughout the video.",
+ "D. The skull flips in the middle of the video."
+ ]
+ }
+ },
+ "rDcsswVJ": {
+ "video_id": "rDcsswVJ",
+ "video_duration": "112s",
+ "duration": "1-2min",
+ "domain": "Games",
+ "sub_category": "Role Playing Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A person wearing a military-style helmet and clothing, and holding a rectangular object, stands near a dark vehicle with its headlights on. The vehicle has a damaged windshield. Two people are visible in the interior of the car. The driver is wearing a vest. Another person in the driver\u2019s seat looks over at them. Later, the perspective switches to a group of people exiting the back of the vehicle, all wearing military-style clothing. One of these people then says \u201cWhat\u2019s this?\u201d. One of the other people replies with, \u201cThis is the immediate future\u201d. This is followed by the voice saying \u201cStep away from the gate\u201d. Someone else then asks \u201cWhat?\u201d. The voice then says \u201cYou heard me\u201d. A man\u2019s voice responds with \u201cYou\u2019re crazy, this is my base.\u201d A different voice says \u201cIt\u2019s not a base. This is a sizeable covert facility and I admire it-\u201d The same voice then says \u201cSo, I\u2019m taking it. You boys have been relieved. Thank you for your service.\u201d One of the other people says \u201cNo, no, no, I don\u2019t take orders from you\u201d. Another person then responds \u201cDidn\u2019t Valeria say that?\u201d. He continues, \u201cNow that makes me wonder what else I don\u2019t know about your affiliation with a drug-lord?\u201d. Another person says \u201cWhat the f*** did you just say to me, pendejo...\u201d. The person is immediately cut off by another man's voice saying \u201cYou\u2019re out of line, Graves\u201d. The person being called Graves then says \u201cDon\u2019t do that. Don\u2019t\u2026 do that. No one needs to get hurt here\u201d. A person responds with \u201cAre you threatening us?\u201d, to which Graves says, \u201cSoldier, I don\u2019t make threats, I make guarantees, so let\u2019s not do this\u201d. Another person then says, \u201cI\u2019m calling Shepherd!\u201d. Graves says, \u201cGeneral Shepherd sends his regards. He told me y\u2019all wouldn\u2019t take this well.\u201d A man says, \u201cHe knows about this?\u201d, to which another man responds with, \u201cHe\u2019s put me in command of this operation from here on out\u201d. Graves continues, \u201cSo, y\u2019all need to stand down. It\u2019s time to let the pros finish this. And why the hell are we talking like this is some kind of negotiation? It\u2019s not. I\u2019ve got my orders and now you have yours.\u201d. A man then says, \u201cAnd who the f*** do you think you are, cabron?\u201d. The man continues, \u201cMy men are inside!\u201d. A person says, \u201cI\u2019m afraid not. Your men have been\u2026 retained.\u201d Another person yells \u201cCabron!\u201d. A fight breaks out and is cut short.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many types of weapon sounds are heard in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. One.",
+ "B. Three.",
+ "C. Two.",
+ "D. Four."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Source Localization",
+ "question": "Who said in the video, \"It's not. I've got my orders. And now you have yours\"?",
+ "answer": "C",
+ "candidates": [
+ "A. A man wearing a black helmet.",
+ "B. A man wearing a skull mask.",
+ "C. A man wearing a vest with the American flag.",
+ "D. A man wearing a vest with the British flag."
+ ]
+ }
+ },
+ "IgpVjjMh": {
+ "video_id": "IgpVjjMh",
+ "video_duration": "131s",
+ "duration": "2-4min",
+ "domain": "Games",
+ "sub_category": "Casual Game",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video game Among Us is shown. Initially, there are six players visible: \u201cWisefin\u201d, \u201cChary\u201d, \u201cTightcurve\u201d, \u201cNarcissa\u201d, \u201cSniffblaze\u201d, and \"You Know I\". Wisefin is yellow and wears a wizard's hat, Chary is red and wears a tall black hat, Tightcurve and Sniffblaze are both white and wear white hats, Narcissa is teal, and You Know I is pink. A red pet is following Chary. They walk through a white and grey hallway with a small blue section. Chary walks away from the others. Then, other players appear, including \"Thym\" in green and \"WiseBill\" in blue. Chary then walks into several different rooms, including the \"Locker Room\", where Wisefin appears; then the \"Balcony,\" where Bill is; then \"Gap Room;\" then \"Records,\" where the floor is blue; then \"Showers,\" where there are white papers on the ground; then \"Main Hall,\" where the floor is blue; then \"Engine Room,\" where the floor is grey; then \"Armory,\" where the floor is purple. After the Armory, Chary swipes a card at a machine. Chary and Bill are then shown on the \"Viewing Deck\" before Chary kills Bill. Chary and Bill are then shown in the following areas: the \"Viewing Deck,\" \"Kitchen,\" \"Hall of Portraits,\" \"Cargo Bay,\" and then the \"Gap Room\". In the Gap Room, a large red creature with sharp teeth appears behind Bill, then the video ends. Throughout the entire video Chary is followed by a small red pet.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Is there a red airplane at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Yes.",
+ "B. I am not sure.",
+ "C. No."
+ ]
+ }
+ },
+ "qxJMpupJ": {
+ "video_id": "qxJMpupJ",
+ "video_duration": "69s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "Bernie Sanders walks onto a stage in front of a cheering crowd, some of whom hold signs reading \"Bernie.\" He steps up to a podium. The scene cuts to a man in a cowboy hat sitting in a car and talking, then to the man walking through a large room towards a crowd of people. The scene cuts back to the crowd holding signs spelling out \"Bernie.\" News crews film the event. People in the crowd hold signs reading \"Bernie\" with a state flag of Iowa and the BernieSanders.com website address in the background. The scene cuts to the audience listening to Sanders speak at the podium. One audience member's hat has an anti-Trump patch. Sanders gestures with his hands while speaking. The scene cuts back to the audience listening, and then back to Sanders speaking at the podium. He raises his hand. Some of the audience wave small American flags. The scene cuts to the man in the cowboy hat walking away through the large room again, followed by a woman pushing a stroller.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the speaker's main message in this speech?",
+ "answer": "C",
+ "candidates": [
+ "A. The speech is about the global economy.",
+ "B. The speech is about world peace.",
+ "C. The speech is a call for people to unite.",
+ "D. The speech is about science and exploration."
+ ]
+ }
+ },
+ "ioCQOTeS": {
+ "video_id": "ioCQOTeS",
+ "video_duration": "50s",
+ "duration": "<1min",
+ "domain": "Daily Life",
+ "sub_category": "Daily Vlogs",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A small tan and white Chihuahua with a light blue collar looks at the camera. The dog walks on a leash along a large log next to an asphalt path. The dog jumps off the log and runs across a grassy area. The camera pans to show the surrounding park with trees, a parking lot, and a grassy field. The dog runs in the parking lot, then back onto the grass. The camera follows the dog as it runs across the field, then slows to a trot. The dog continues trotting across the field.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the man say \"mouse\" in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Four times.",
+ "B. Three times.",
+ "C. One time.",
+ "D. Two times."
+ ]
+ }
+ },
+ "HViWZOYm": {
+ "video_id": "HViWZOYm",
+ "video_duration": "214s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Boxing",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A wrestling match takes place on a red mat with a large white stylized wolf head in the center. Red bleachers filled with spectators surround the mat on three sides. Two wrestlers, one in a white singlet with red trim and the other in a black singlet with white trim, stand facing each other at the center of the mat. A referee in a black shirt and gray pants stands between them. The referee blows a whistle, and the wrestlers begin circling each other. The wrestler in white reaches down and grabs the wrestler in black around the legs, taking him down to the mat. The wrestler in white gets on top of the wrestler in black and attempts to pin his shoulders to the mat. The wrestler in black struggles to escape but is pinned almost immediately. The referee slaps the mat, signaling the end of the match. The wrestler in white gets up, and the referee raises his hand in victory. The wrestler in white then helps the wrestler in black up off the mat. The two wrestlers shake hands, and then the wrestler in white walks off the mat. The referee takes a knee next to the wrestler in black, who is still sitting on the mat. The wrestler in black shakes his head. The referee helps him to his feet and holds his arm up. He then walks off the mat. The wrestler in white returns to the mat and shakes hands with the wrestler in black. The wrestler in black leaves the mat. The wrestler in white turns to the bleachers and pumps his fist twice while looking at the crowd.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Interaction",
+ "question": "What action by the wrestler in white and the wrestler in black in the video would result in a warning for stalling?",
+ "answer": "C",
+ "candidates": [
+ "A. The wrestler in white wraps around the wrestler in black who is lying down.",
+ "B. The wrestler in white wraps around the wrestler in black who is prone on the ground.",
+ "C. The wrestler in white wraps around the wrestler in black who is standing.",
+ "D. The wrestler in white wraps around the wrestler in black who is sitting."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the wrestler wearing white appear to mount the wrestler wearing black in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Four times.",
+ "B. Five times.",
+ "C. Three times.",
+ "D. Two times."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Does the video show a scene where a wrestler dressed in black is on top of a wrestler dressed in white?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes.",
+ "C. No."
+ ]
+ },
+ "task3": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "Why do sponsors avoid placing chairs on mats in videos?",
+ "answer": "D",
+ "candidates": [
+ "A. Required by regulations.",
+ "B. Due to space limitations.",
+ "C. To maintain visual appeal.",
+ "D. To protect the mats."
+ ]
+ },
+ "task4": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "What change occurred in the commentator's voice during the second round in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. There was no change.",
+ "B. It gradually became louder.",
+ "C. It gradually became quieter.",
+ "D. I am not sure."
+ ]
+ }
+ },
+ "gZZTubXP": {
+ "video_id": "gZZTubXP",
+ "video_duration": "109s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Tech Reviews",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with black screen text that reads \"Cool things about this vape pen\". This changes to a medium shot of a person sitting on a beige couch. They have medium-length reddish-brown hair, wear a white shirt and a ring on their left hand, and have dark nail polish. They are holding a black pen-shaped object and manipulating it with both hands. Text that reads \"510 thread size\" appears superimposed to the right of the person. The person continues to manipulate the pen-shaped object. The camera focuses on the person as they speak, move their hands, and hold the pen-shaped object. Text overlays appear, such as \"3 heat modes\", \"low (3.2v)\", \"medium (3.5v)\", \"high (3.7v)\", and \"plus 'extended draw mode' (hits without holding button)\". The person presses the button on the pen-shaped object, puts the object to their lips, inhales, and exhales a white vapor. The video ends with the person holding the pen-shaped object and speaking.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What is the woman in the video doing with the vaporizer pen in her hand?",
+ "answer": "A",
+ "candidates": [
+ "A. She is reviewing the vaporizer pen.",
+ "B. She is selling batteries.",
+ "C. She is reviewing e-liquid.",
+ "D. She is promoting e-liquid."
+ ]
+ }
+ },
+ "ysTwCLxB": {
+ "video_id": "ysTwCLxB",
+ "video_duration": "66s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Biology",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video shows a woman with brown hair and a patterned shirt standing in front of wooden cabinets, next to a document with \"Scientific Reports\" at the top. It then cuts to a sandy area with about 10 small turtles crawling around. The scene switches to a blurry view of the woman, and then a clear shot of her holding a medium-sized turtle in a lab setting. The video next shows two silhouettes, one of a person with \"XX\" underneath and the other of a person with \"XY\" underneath. A cartoon turtle image is shown between the silhouettes, with text underneath that says \"XX or XY?\". Then the text and line below the turtle are crossed out and an exclamation mark appears above the turtle. The woman is next shown with the cartoon silhouettes and turtle overlaid on her. Next, a red thermometer graphic is shown. The screen then shows multiple turtles with \"FEMALE\" above, and an arrow towards the thermometer. Then \"MALE\" is added above another set of turtles on the other side of the thermometer, with arrows going from the thermometer to each set of turtles. Next, a circular infographic showing temperature changes from 1880-2020 is shown, with one set of turtles labeled \"FEMALE\" and the other \"MALE\". Different years between 1964 and 2020 are shown in the center of the infographic with the color changing. At the end of the sequence there is one lone turtle image on the male side. The video cuts to two people holding and interacting with a turtle, and then cuts back to the woman with the document image at her side.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "How many turtles are crawling on the sand at the beginning of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Ten.",
+ "B. Twelve.",
+ "C. Thirteen.",
+ "D. Eleven."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What did the scientists use in the video to coordinate efforts to save the endangered sea turtle species?",
+ "answer": "C",
+ "candidates": [
+ "A. Use of heaters.",
+ "B. Use of artificial selection.",
+ "C. Use of western yellowtail fish.",
+ "D. Relocation of habitats."
+ ]
+ }
+ },
+ "WROYbsBv": {
+ "video_id": "WROYbsBv",
+ "video_duration": "229s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "Home Improvement",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video begins with a shot of a white surface displaying multiple tools including a black and silver can of spray, a red and yellow screwdriver, a yellow and black utility knife, a green and black screwdriver, red and blue pliers, red and silver water pump pliers, and a tube of silver silicone. The video cuts to a close-up of a black and silver shower faucet with two handles. The lower handle is rotated counter-clockwise, and a drop of water appears. The handle is then removed, revealing the parts beneath. A finger is used to remove a black O-ring, followed by a brass ring. A finger pulls out a brass component. The brass component is then removed, separating a small grey plastic piece. The video then cuts to the complete faucet assembly and a metal shower head hose, and a finger touches each part before the cover plate is removed. The camera shows the exposed pipes within the wall. The can of spray is brought in and an attached red tube is used to spray the brass component. A finger touches the brass component, then some parts are moved back into position by hand and the can of spray is brought in again. The water pump pliers are used to loosen a part. The brass component is then pulled out of the wall, and a metal ring is placed over the brass part. A hand applies a lubricant to the brass part. The video cuts to the brass component and the tube of silicone on a grey surface. A finger touches the brass component and tube of silicone.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Attribute Reasoning",
+ "question": "What is the purpose of the handle at the bottom of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. Control the pressure of the valve.",
+ "B. Control the direction of the valve.",
+ "C. Control the flow of the valve.",
+ "D. Control the locked state of the valve."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "How does the man in the video remove the clamp?",
+ "answer": "B",
+ "candidates": [
+ "A. Use a wrench to rotate the screw.",
+ "B. Use a wrench to rotate the nut.",
+ "C. Use a screwdriver to rotate the screw.",
+ "D. Use a craft knife to rotate the nut."
+ ]
+ }
+ },
+ "FjwCSYRq": {
+ "video_id": "FjwCSYRq",
+ "video_duration": "89s",
+ "duration": "1-2min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "Two people walk side by side, one holding a bag. The words \"TOM CRUISE\" are written across the scene. The camera cuts to a man in a military uniform. Then the camera cuts to another man in a suit. A car explodes on a bridge. A car goes over a cliff, exploding. The words \"TOM CRUISE TOP GUN MAVERICK OFFICIAL TRAILER\" appear. Two people walk down a tree-lined road from behind. The words \"RAIN MAN\" appear. The words \"A FEW GOOD MEN\" appear. A man in a suit walks through a dark tunnel. A man in a black jacket speaks with another man. A man is speaking into a microphone, wearing a tuxedo, in black and white. Another man speaks into a microphone, wearing a tuxedo, in black and white. A man is speaking in black and white, wearing a tuxedo. Two women are laughing at a table, in black and white. A man is smiling, in black and white. A man is speaking in black and white, wearing a tuxedo and a medal. A man is looking forward, in color. The words \"MIMI ROGERS\" appear next to a woman and a man, in black and white. A man and a woman are pictured together, in black and white. A man and a woman are pictured together, in black and white. The words \"KATIE HOLMES\" are pictured next to a man and a woman, in black and white. A man and two women are pictured with two children, in black and white. A woman and a child are pictured together, in black and white. A man is pictured rubbing his face in black and white. Two men are in a vehicle. A man is hanging on a pole. A man is hanging on a pole inside a structure with metal supports. Several men are inside a vehicle. A band plays at a table in the dark. A woman is shown in front of a dimly lit room. A woman stands in front of a table of weapons. A man is holding a gun. A man rides a motorcycle in a field. A man rides a motorcycle up a rocky hill. A man is riding a motorcycle over snowy mountains. A man is falling from the sky. A man is riding a motorcycle over a mountainous landscape. The words \"MISSION: IMPOSSIBLE\" appear. The words \"MISSION: IMPOSSIBLE DEAD RECKONING PART ONE\" appear. The words \"JULY 2023\" appear with other information. The scene goes to black.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the setting of the scene where Cruz rides a motorcycle in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. A highway.",
+ "B. The jungle.",
+ "C. The racetrack.",
+ "D. The city."
+ ]
+ }
+ },
+ "NHlcCWBX": {
+ "video_id": "NHlcCWBX",
+ "video_duration": "67s",
+ "duration": "1-2min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "The video shows a street scene from a first-person perspective. The viewpoint is likely from a person riding a scooter or bicycle. The street has shops on both sides, cars parked along the curb, and traffic passing through. At the beginning of the video, a silver car and a black SUV are parked on the left. Two people are walking along the sidewalk ahead, and a woman in a dark jacket and blue bag passes on the left. The viewpoint turns into the street, showing a marked crosswalk. Three people on bicycles cross the street, two carrying delivery bags. The viewpoint then turns back to the sidewalk. A man in a green shirt and jeans walks into view, then pauses near a black truck parked on the street. He is joined by other people in the video. Several black and white photos appear as frames in the view, showing this man in different poses. The man in the green shirt and jeans holds what appears to be a small axe. The camera then pans further into the street, passing other pedestrians and bikes. The video ends with the camera panning up to a street sign.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Object Existence Recognition",
+ "question": "Did a bicycle appear in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. No.",
+ "B. I am not sure.",
+ "C. Yes."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Counting",
+ "question": "How many times does the man in the video say \"brother\"?",
+ "answer": "A",
+ "candidates": [
+ "A. Three.",
+ "B. Four.",
+ "C. One.",
+ "D. Two."
+ ]
+ }
+ },
+ "DangzaIx": {
+ "video_id": "DangzaIx",
+ "video_duration": "210s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Walkthroughs",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video displays a first-person perspective of a player using various weapons in different indoor and outdoor environments. The player is holding a variety of firearms, each equipped with scopes or sights, and they appear to be engaging in combat. The video displays a user interface including round timers, player scores, and ammunition counts. The player's view moves through hallways, doorways, and across rooftops. At times, a large block of white text appears across the video, and sometimes the player's perspective includes blood splatters on the screen. The video also includes scenes with other characters, who are visually distinct and move independently. The screen often displays the count of other players visible in the game, and also text indicating when a round is won.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What does the video recommend to improve aiming skills more quickly?",
+ "answer": "B",
+ "candidates": [
+ "A. Watch game replays to analyze aiming mistakes and their causes.",
+ "B. Like and subscribe to the channel.",
+ "C. Maintain patience and practice consistently.",
+ "D. Pursue low-difficulty kills."
+ ]
+ }
+ },
+ "FQwzqsrg": {
+ "video_id": "FQwzqsrg",
+ "video_duration": "322s",
+ "duration": "4-6min",
+ "domain": "Daily Life",
+ "sub_category": "Painting & Photography",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A hand holding a painting with yellow, brown and turquoise areas is visible. A hand places the painting near a white square container and a piece of white paper. A metal palette knife picks up yellow paint. A hand then uses a metal palette knife to spread yellow paint. A hand then squeezes a container that says \"Golden\" and yellow paint comes out onto a palette. Another container of \"Golden\" paint is squeezed out onto a palette. The metal palette knife spreads the two colors of paint. A hand adds yellow paint to a white square container. A paintbrush spreads the yellow paint in the container. A paintbrush is washed in a clear glass container of liquid. The paintbrush brushes the yellow paint in the container again. A paintbrush spreads a yellow color onto the paper. A hand uses a white material to smudge the yellow. A hand places a new piece of white paper underneath. A hand picks up a metal palette knife and touches it to a light orange color. A hand then scrapes paint into a clear liquid in a glass container. Then the hand picks up the canvas and uses a paintbrush to apply the yellow paint. Then the hand uses the paintbrush to apply yellow paint to the bottom of the painting. Next, the hand uses the paintbrush to spread the yellow paint onto the canvas. Then a hand picks up the painting and then wipes a part of it. Finally, a hand touches a canvas with colored paint sections.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What color is the roof of the house in the finished product shown by the man in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Brown.",
+ "B. Indian Yellow.",
+ "C. White.",
+ "D. Blue."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the man dip his hand in the water in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Eight.",
+ "B. Five.",
+ "C. Seven.",
+ "D. Six."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What is the texture of the paint applied to the canvas at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. A texture between milk and cream.",
+ "B. Rough texture.",
+ "C. Creamy texture.",
+ "D. Milk-like texture."
+ ]
+ },
+ "task3": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence of steps to complete the product shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. First, use a brush to dip in water and stir the paint to dilute it, then use a decorative brush to dip in the paint and apply it to the canvas, next use a palette knife to dip in water and stir the paint to dilute it, and finally, display the finished product.",
+ "B. First, use a brush to dip in water and stir the paint to dilute it, then use a palette knife to mix the water and paint, afterwards use a decorator brush to apply the paint to the canvas, and finally, display the finished piece.",
+ "C. First, use a palette knife to dip in water and mix the paint, then use a decorative brush to apply the paint to the canvas, next use a painting brush to dip in water and stir the paint, and finally, display the finished product.",
+ "D. First, use a palette knife to dip in water and mix the paint, then use a brush to dip in water and stir the paint, next use a decorative brush to apply the paint to the canvas, and finally, display the finished product."
+ ]
+ }
+ },
+ "mgDBXvBi": {
+ "video_id": "mgDBXvBi",
+ "video_duration": "170s",
+ "duration": "2-4min",
+ "domain": "Music",
+ "sub_category": "Covers",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A person with short brown hair is in a room. They are wearing a light blue short-sleeved shirt with a pattern of yellow and brown circles. Initially, the person is standing against a wall that has two black shelves with several objects on them, including a black bird, a white candle with a red and blue image, a small figure of a man with a beard and a small stack of wrapped candies. There is an additional artwork on the wall with a blue background, and an additional work of art on the wall with a pink background and a picture of a band. The person is moving their mouth and opening it widely. The person moves offscreen and returns holding a red guitar with a white pick guard. They proceed to a different section of the room with amplifiers on the floor. The person plugs a black cable into the guitar. The person, now wearing the guitar with a black strap, stands in front of the amplifiers and in front of a large window with blinds. The person speaks, sings, and moves while holding the guitar, they play the guitar and gesture with their hands. The person then moves through a doorway holding the guitar.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What type of music does the man sing in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. Pop music.",
+ "B. Independent music.",
+ "C. Folk music.",
+ "D. Rock and roll."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "From which direction relative to the sofa is the audio in the video coming?",
+ "answer": "C",
+ "candidates": [
+ "A. On the left.",
+ "B. In front.",
+ "C. On the right.",
+ "D. Behind."
+ ]
+ }
+ },
+ "mFYvLDne": {
+ "video_id": "mFYvLDne",
+ "video_duration": "55s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Tech Reviews",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A person wearing a light-colored jacket, dark hat, and dark glasses is holding a black handgun. The person aims the handgun at a target with a red bullseye and shoots. Then the person handles the gun, appears to take it apart, and then put it back together. There are boxes of ammunition and several parts of the disassembled gun on a table. The person is standing outside in an area with a dirt hill, several barrels, and a white sign on the ground.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "What is the weather shown in the video?",
+ "answer": "D",
+ "candidates": [
+ "A. It is raining.",
+ "B. The weather is windy.",
+ "C. The weather is overcast.",
+ "D. The weather is cloudy."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct sequence for disassembling the firearm shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Pull out the slide, pull the trigger to release the slide, remove the magazine, and remove the barrel.",
+ "B. Remove the magazine, pull out the slide, press the release pin, and remove the barrel.",
+ "C. Pull out the slide, press the release pin, pull the trigger to release the slide, remove the magazine, and remove the barrel.",
+ "D. Press the release pin, pull out the slide, pull the trigger to release the slide, remove the magazine, and remove the barrel."
+ ]
+ }
+ },
+ "NGHQpLWF": {
+ "video_id": "NGHQpLWF",
+ "video_duration": "62s",
+ "duration": "1-2min",
+ "domain": "Tech & Science",
+ "sub_category": "Software",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video begins with five program icons on a black background. The icons are labeled Synfig, Blender, Pencil, Tupi, and Gimp. A cursor hovers over Tupi and then clicks it. A new screen appears showing the Tupi program, where a window is open to create a new project. The window has project name, author, description, presets, dimensions, and options for background and fps. The 'OK' button is then clicked, and the Tupi main window appears with the animation canvas and exposure sheet. The dynamic BG mode is selected, and the shift length is set to 5. A new window opens to import an SVG file, where the 'landscape.svg' file is selected and imported. The background is set to static, and the window changes to the frames mode, where frames are added to the exposure sheet. Another SVG is imported, this time a car SVG file, which appears in the animation canvas. The car's object properties are adjusted so that it spins. The video ends by previewing the animation, where the car appears to be driving along in the landscape.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "What features does the software with a blue leaf icon provide?",
+ "answer": "C",
+ "candidates": [
+ "A. Design UI interfaces.",
+ "B. Create videos.",
+ "C. Write programs.",
+ "D. Create conceptual renderings."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "What are the six software programs in the video about?",
+ "answer": "A",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Painting.",
+ "C. NCodingo.",
+ "D. Game."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "Will Blue Leaf software be able to run on Windows in the future?",
+ "answer": "C",
+ "candidates": [
+ "A. No, it is a technical challenge.",
+ "B. No, developing a Windows version has not been considered.",
+ "C. Yes, it is in preparation.",
+ "D. It is unclear."
+ ]
+ },
+ "task3": {
+ "task_domain": "Recognition",
+ "task_type": "Temporal Localization",
+ "question": "Which part of the video introduces the BlueLeaf software?",
+ "answer": "D",
+ "candidates": [
+ "A. In the middle of the video.",
+ "B. At the beginning of the video.",
+ "C. Throughout the entire video.",
+ "D. At the end of the video."
+ ]
+ }
+ },
+ "TTsifOTn": {
+ "video_id": "TTsifOTn",
+ "video_duration": "40s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentary Profiles",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video opens on a black screen with the words \"FADE TO BLACK\" in white text. The screen then cuts to a grainy video of a concert stage bathed in green light. Text appears stating \"HETFIELD ACCIDENTALLY STEPPED INTO AN AREA\". The scene shifts to another angle of the stage, now dark, with a small burst of flame visible. Text overlays appear reading \"JUST AS THE CHARGES\". The next shot shows a large burst of flame erupting on stage, obscuring the band members, as text reads \"WERE DETONATED.\" Another burst of flame is shown, now orange, with text reading \"HETFIELD WAS BLASTED WITH\". The scene cuts to a black and white photo of James Hetfield, shirtless and wearing a baseball cap. He holds his arm and looks down as text appears: \"MOLTEN HOT FIRE AND SPARKS\". He continues to hold his arm as more text appears: \"HE QUICKLY FELL TO\" and in the next frame \"AND WAS HELPED OFF\" followed by \"BY HIS BANDMATES AND STAGEHANDS\". A lime green ambulance with flashing lights speeds down a street as text reads \"HE WAS RUSHED TO THE HOSPITAL WITH SECOND AND THIRD DEGREE BURNS\". A photo shows Hetfield lying down, looking at the camera, with visible burns on his arm and bandaged hand. Text reads \"ON HIS HANDS ARMS AND FACE\". The scene shifts quickly to Axl Rose singing on stage in a US flag shirt and bandana, with text appearing: \"INSTEAD OF TAKING THE STAGE AFTER METALLICA\u2019S UNTIMELY EXIT\". The next frame shows Guns N' Roses band members posing for a photo. Text reads: \"GUNS N ROSES MADE FANS WAIT FOR TWO HOURS BEFORE THEY TOOK THE STAGE\" and \"AND THEN ONLY PLAYED NINE SONGS\". The final scene shows grainy video footage of a crowd rioting, overturning concession stands, and spilling out into the streets. Riot police in helmets move towards the crowd. Text appears stating \"THE ENRAGED CROWD RIOTED OVER TURNING CONCESSION STANDS AND SPILLING OUT INTO THE STREETS WHERE THEY OVERTURNED A POLICE\".",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What caused the crowd in the video to overturn a police car and set multiple fires in the streets?",
+ "answer": "B",
+ "candidates": [
+ "A. The crowd had a conflict with the police.",
+ "B. A band member's injury caused the performance to be delayed.",
+ "C. The band did not appear.",
+ "D. The crowd was dissatisfied with social issues."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Event Recognition",
+ "question": "In the video, what did the band members do after the bearded man wearing a hat was injured?",
+ "answer": "B",
+ "candidates": [
+ "A. They insist on delaying the performance until the injured member can participate.",
+ "B. They send him to the hospital and delay the performance.",
+ "C. They insist on continuing the performance immediately.",
+ "D. They send him to the hospital and refuse to perform."
+ ]
+ }
+ },
+ "HLVTATdC": {
+ "video_id": "HLVTATdC",
+ "video_duration": "129s",
+ "duration": "2-4min",
+ "domain": "Sports",
+ "sub_category": "Football",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video depicts a soccer match being played on a green field with white lines. There are two teams, one wearing white and the other wearing red and blue. The match is taking place in a stadium with a large crowd visible in the stands. The game clock is visible in the top left corner, showing the elapsed time. At several points during the video, the ball is passed, dribbled, and kicked around the field by the players. Some players fall onto the grass. A goal is scored by the team wearing white. After the goal is scored, some players run towards the corner of the field near the goal. A replay of the goal is shown. A title appears on screen with a soccer team logo and the score, showing that the game is tied 1-1. The final scenes show players running down the field.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What was the final score between the team in white shirts and the team in blue shirts, as shown in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. 1 to 3.",
+ "B. 0 to 1.",
+ "C. 1 to 1.",
+ "D. 1 to 2."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the correct order of actions for the player wearing the number 21 white jersey scoring a goal in the video? (a) Player 7 in the white uniform passed to the 21st player in the white uniform, (b) the 21st player in the white uniform dodged the defense, (c) the 21st player in the white uniform dribbled into the penalty area, (d) the 21st player in the white uniform took the shot.",
+ "answer": "D",
+ "candidates": [
+ "A. (a), (c), (b), (d).",
+ "B. (b), (c), (a), (d).",
+ "C. (b), (a), (c), (d).",
+ "D. (a), (b), (c), (d)."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Is the goalkeeper in yellow blocking the ball in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. I don't know.",
+ "B. Yes.",
+ "C. No."
+ ]
+ }
+ },
+ "tygDimYF": {
+ "video_id": "tygDimYF",
+ "video_duration": "116s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Music"
+ ],
+ "video_caption": "A person with short brown hair, a beard, wearing a dark t-shirt and blue jeans is standing in front of a light colored wall. The person raises one hand, then brings their hand to their chest. The person points to their side, then puts a hand to their chest again. The person raises both hands, then looks down and picks up a gray laptop. A screen showing a Google search is displayed on the laptop. The person puts the laptop down and puts a hand to their chest. The person points one finger up, then raises both hands before putting them to their chest. The person waves three fingers. The person holds the laptop again, then waves a hand. The person puts a hand to their chest, then uses their index finger to point. The person is pointing and moving their hands. The person has their hands in a prayer position, moving their hands and bowing. The person waves a hand. The person is waving a flat hand, then places their hands together in a prayer motion. The person is raising one hand, then places a hand on their chest. The person points a hand and another finger. The person raises both hands, then points two fingers. The person moves their hands. The person points their fingers at the camera, then points at themselves, and places a hand on their chest. The person lifts their arms and moves their hands, then stands still and points a hand to the side, then bows their head, before placing their hands together in a prayer position. The person is still with their hands in a prayer position. The person is holding up a hand, and then moves their hands again. The person bows and sways, and finishes with hands together in a prayer position.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human-object Interaction",
+ "question": "What device did the man use to search for the term \"epidemiologist\"?",
+ "answer": "D",
+ "candidates": [
+ "A. A smart tablet.",
+ "B. A mobile phone.",
+ "C. A desktop computer.",
+ "D. A laptop computer."
+ ]
+ }
+ },
+ "XkPiWLSY": {
+ "video_id": "XkPiWLSY",
+ "video_duration": "423s",
+ "duration": "6-8min",
+ "domain": "Sports",
+ "sub_category": "Other Sports",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video shows a rowing race on a large body of water. There are multiple pairs of rowers in slim boats, identifiable by country flags and names displayed on large signs in the water. The camera angle changes throughout the video, sometimes focusing on pairs of rowers as they move, sometimes on the entire race. A small orange and white motorboat is visible near the left edge of the screen at various points in the video. The number 500, in a white font on a black rectangle, is visible on the water at certain points in the race, as is a smaller version of the same shape containing the number 1000. At the end of the race, the camera returns its focus to a pair of rowers from France, who remove their oars from the water. They are identified as the winners. Text on the screen identifies the location of the race as Bled Slovenia, at the 2011 World Rowing Championships. In the background, mountains and trees can be seen. There are people sitting in the stands, some waving small flags. Yellow buoys are floating in the water.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Attribute Recognition",
+ "question": "Which country is shown to win first place at the end of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Poland.",
+ "B. Canada.",
+ "C. Portugal.",
+ "D. Greece."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the order of countries crossing the finish line at the end of the video? (a) France, (b) Greece, (c) Norway, (d) Portugal, (e) Canada, (f) Poland",
+ "answer": "C",
+ "candidates": [
+ "A. (a) (b) (c) (e) (d) (f).",
+ "B. (a) (c) (b) (d) (e) (f).",
+ "C. (a) (b) (c) (d) (e) (f).",
+ "D. (a) (e) (b) (c) (d) (f)."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the position of the Canadian athlete relative to the Norwegian athlete in the video before the competition begins?",
+ "answer": "D",
+ "candidates": [
+ "A. On the left.",
+ "B. Behind.",
+ "C. In front.",
+ "D. On the right."
+ ]
+ }
+ },
+ "NqRQxLAi": {
+ "video_id": "NqRQxLAi",
+ "video_duration": "135s",
+ "duration": "2-4min",
+ "domain": "Daily Life",
+ "sub_category": "DIY & Crafts",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A blurry background fades to reveal a cream-colored towel on a light gray surface. A hand with pale, sparkly fingernails places a clear glass jar filled with a brown, viscous liquid onto the towel, and then places a stack of rectangular white strips next to the jar. The hand then places a light-brown wooden spatula on the towel. The jar is held in the hand and a spatula is inserted into the jar to scoop out some of the brown viscous liquid. The hand moves to position its forearm on the towel and uses the spatula to apply the brown viscous liquid onto the forearm. A white rectangular strip is placed over the liquid, and the hand rubs the strip firmly on the skin. The hand removes the strip quickly to reveal a mostly hairless forearm, with a few spots of the brown liquid remaining. The hand then repeats this procedure on the other arm, which is also mostly hairless. The hand then flattens out onto the towel and wiggles the fingers before a \u201cThanks for watching please subscribe\u201d message appears on the light gray surface.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Object State Change",
+ "question": "What state change does the sugar wax undergo in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The wax first hardens and then softens.",
+ "B. The wax hardens.",
+ "C. The wax softens.",
+ "D. The wax remains unchanged."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Object Counting",
+ "question": "In the video, how many sheets of wax paper did the dark-skinned woman wearing a silver bracelet on her right wrist use?",
+ "answer": "B",
+ "candidates": [
+ "A. Three sheets.",
+ "B. Two sheets.",
+ "C. One sheet.",
+ "D. Five sheets."
+ ]
+ },
+ "task2": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "What is the sequence in which the woman with dark skin, wearing a silver bracelet on her right wrist, removes hair using sugar wax in the video? (a) Cool the sugar wax, (b) Stir the sugar wax, (c) Apply the sugar wax, (d) Pull off the wax strip.",
+ "answer": "D",
+ "candidates": [
+ "A. (b), (a), (c), (d).",
+ "B. (a), (c), (b), (d).",
+ "C. (b), (c), (a), (d).",
+ "D. (a), (b), (c), (d)."
+ ]
+ }
+ },
+ "zjXNQMHj": {
+ "video_id": "zjXNQMHj",
+ "video_duration": "94s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Social Commentary",
+ "audio_class": [],
+ "video_caption": "A woman wearing a gray hoodie jacket, jeans, and pink sneakers stands at the counter of a store. She is holding several trading cards in her left hand. Behind the counter, a young man with glasses and a dark shirt stands facing the woman. Shelves filled with various merchandise are visible in the background. The woman and the store clerk continue to have a conversation. The woman makes gestures with her hand and then turns and walks away from the counter, moving further into the store. The camera follows her, passing a large plush Pikachu toy and a smaller Freddy Fazbear character plush toy. The camera pans around the store, showing various merchandise, including plush toys, collectibles, and displays. The camera turns back to the woman, who is now looking at another display. The camera then returns to the counter area. The man behind the counter reaches down and picks up a yellow shopping basket full of items and sets it on the counter. The woman continues walking around the store and then climbs on a tall display stand. The camera focuses on a display of Funko Pops and plushies. The woman climbs down, walks back toward the counter, picks up a crumpled piece of paper from the floor, and throws it at the clerk, hitting him in the head. The woman again walks away from the counter further into the store. She turns around, approaches a low table, and throws a large stuffed Pikachu toy to the floor. She does the same with a smaller Freddy plush toy that is on the table. She picks up something from under the table, possibly another stuffed toy, and throws it to the floor. She turns and walks away, pausing near a free-standing display unit, before walking toward a red display rack with gift cards. The woman walks back toward the counter area where the clerk is standing. She walks past him without stopping.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Relation Reasoning",
+ "question": "What is the likely relationship between the person wearing pink shoes and the man behind the counter?",
+ "answer": "C",
+ "candidates": [
+ "A. A chef and a customer.",
+ "B. Close friends.",
+ "C. A customer and a cashier.",
+ "D. Colleagues."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Causal Reasoning",
+ "question": "What might cause a person wearing pink shoes to argue with a cashier?",
+ "answer": "D",
+ "candidates": [
+ "A. The person wearing pink shoes believes the cashier made an error in the bill.",
+ "B. The person wearing pink shoes perceives the cashier's attitude as biased against people of lower socioeconomic status.",
+ "C. The person wearing pink shoes perceives the cashier as having a negative and unprofessional attitude.",
+ "D. The person wearing pink shoes believes the cashier is making transphobic remarks."
+ ]
+ },
+ "task2": {
+ "task_domain": "Reasoning",
+ "task_type": "Temporal Prediction",
+ "question": "What is the most likely next action of a person wearing pink shoes?",
+ "answer": "B",
+ "candidates": [
+ "A. Obtain the company's contact information and file a complaint with the police.",
+ "B. Obtain the company's contact information and file a complaint with an LGBTQ advocacy group.",
+ "C. Obtain the company's contact information and file a complaint with the Consumer Association.",
+ "D. Negotiate with the salesperson."
+ ]
+ }
+ },
+ "xgvXcjwX": {
+ "video_id": "xgvXcjwX",
+ "video_duration": "44s",
+ "duration": "<1min",
+ "domain": "Tech & Science",
+ "sub_category": "Astronomy",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video begins with an image of a cluster of rocks, followed by an infographic depicting the Oort Cloud. Next a square frame appears, containing an image of a blue object that looks like a comet with a bright white center. This image is labeled C/2013 A1, then Hubble Re, Hubble Release 14-090, and lastly Target: Comet Siding Spring is added. The frame appears in two different cosmic backdrops before transitioning to a view of a model of the solar system with an orbit depicted as a blue line, as well as two more framed views of rocks and a comet. Next, there are multiple views of the orbit, with the blue line coming closer and closer to the orbit of Mars. Eventually the blue line connects with Mars. The focus then transitions to a large blue object in the sky, labeled Siding Spring with a distance line labeled 88,000 miles, as well as an infographic showing the proximity between a comet and an Earth-like planet. The video concludes with an image of a comet in the night sky from the perspective of a planet. The text \"(Artist's Concept)\" appears in the bottom center of the screen.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the spatial relationship between Comet Siding Spring and Mars?",
+ "answer": "D",
+ "candidates": [
+ "A. It is unrelated to Mars.",
+ "B. It is located behind Mars.",
+ "C. It is located in front of Mars.",
+ "D. It is located near Mars."
+ ]
+ }
+ },
+ "qKVQfqnu": {
+ "video_id": "qKVQfqnu",
+ "video_duration": "23s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "A girl with long dark hair, streaked with green, smiles and holds out a Pok\u00e9 Ball. She wears a light purple short-sleeved shirt, a purple tie, and a pink and black armband. Another girl, with dark hair and a large purple hat, holds a phone to her ear with a distressed expression. Text boxes appear, with the first girl saying \u201cHolaaa amiga, ready for a battle?\u201d and the second girl saying, in succession, \u201cHello, 911?\u201d, \u201cA big tomboy keeps stalking me and proclaiming herself as my rival, send help! This is unbearable!\u201d, and \u201cWhen she hugs me her huge chest suffocates me. I am afraid for my life!\u201d A final text box appears by the second girl\u2019s head saying, \u201cGod, I wish that was me.\u201d The video zooms in on the second girl's increasingly frantic face as she grips her phone.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What emotion is the person on the phone expressing towards the woman in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Envy.",
+ "B. Anguish.",
+ "C. Sadness.",
+ "D. Happiness."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What is unusual about the girl with the ponytail in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The individual is female.",
+ "B. The individual is a hermaphrodite.",
+ "C. The individual is gender-neutral.",
+ "D. The individual is male."
+ ]
+ }
+ },
+ "tfHzebBL": {
+ "video_id": "tfHzebBL",
+ "video_duration": "54s",
+ "duration": "<1min",
+ "domain": "Sports",
+ "sub_category": "Motorsport",
+ "audio_class": [
+ "Speech",
+ "Music",
+ "Event"
+ ],
+ "video_caption": "A red and orange race car drives on a track. The camera view switches to show two cars racing each other on the track. The video shows the race from the perspective of the race car driver. The camera switches back and forth showing multiple cars racing. A person in a black racing suit and orange helmet stands on top of a race car. The camera shows fireworks in the night sky. A person in a black and red racing suit, and black and yellow cap, holds a silver trophy.",
+ "task0": {
+ "task_domain": "Reasoning",
+ "task_type": "Audio Change",
+ "question": "How did the commentator's voice change in the video?",
+ "answer": "B",
+ "candidates": [
+ "A. The volume initially decreases and then increases.",
+ "B. The volume gradually increases.",
+ "C. The volume initially increases and then decreases.",
+ "D. The volume gradually decreases."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Audio Recognition",
+ "question": "What animal's call is heard in the later part of the video?",
+ "answer": "D",
+ "candidates": [
+ "A. A tiger.",
+ "B. A cow.",
+ "C. A wolf.",
+ "D. A lion."
+ ]
+ }
+ },
+ "EPETrikf": {
+ "video_id": "EPETrikf",
+ "video_duration": "87s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video contains clips of what appears to be a late-night talk show. In the first segment, a man in a black suit sits in a red armchair and faces a woman seated in another red armchair. The woman is wearing a green dress and has dark hair. They appear to be in conversation. In the next series of segments, the man sits behind a desk and the woman is seated in a tan armchair. The woman is now wearing a black and white patterned short dress with a green design. The man gestures with his hands. The woman gestures with her hands. At one point she puts her hands on the desk in front of her. They continue to converse. In the next segment, a skeleton-like figure wearing a suit and tie stands behind a podium. This segment repeats a few times. The video returns to the man and woman at the desk and in the chair respectively. The woman claps her hands. The skeleton figure reappears. The man and the woman reappear again. The man reaches for an object next to him on the desk. Then, another woman is seated in a red armchair, facing the man. This woman is wearing a strapless red dress. They appear to be in conversation. The man gestures. The woman gestures. She places her hand on the desk. The man places his hand on the desk. The man gestures again, ending with a handshake between the woman's hand and his hand.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Action Counting",
+ "question": "How many times does the woman wearing the green dress tap the table in the video?",
+ "answer": "A",
+ "candidates": [
+ "A. Two times.",
+ "B. Three times.",
+ "C. One time.",
+ "D. Four times."
+ ]
+ }
+ },
+ "eHzbtCpG": {
+ "video_id": "eHzbtCpG",
+ "video_duration": "37s",
+ "duration": "<1min",
+ "domain": "Film & TV",
+ "sub_category": "Documentaries",
+ "audio_class": [
+ "Speech",
+ "Music"
+ ],
+ "video_caption": "The video consists of a sequence of frames with a four-panel layout. The word \"DAY\" appears at the top of the left-most panel.Initially, the panels are empty except for the \"DAY 3\" label in the top left corner. Subsequently, \"DAY 315\" replaces \"DAY 3\", and the left-most panel begins to fill with drawings of multiple mouse faces, accumulating into a large pile. Then a wire frame enclosure appears around the mice and a single mouse moves to the top right panel. A second mouse appears next to the first, and then they are both depicted with gray and blue coloration. A mouse in the bottom left panel is shown curled up, and another mouse appears near a watering device above it. Then, multiple mice are depicted around the water source. A graph appears in the bottom right panel and then a mouse is shown superimposed on top of the graph. The image then fades out. After the white transition, a single mouse is shown in the top left panel. This is followed by the introduction of another mouse in the same panel, and they are both given gray and blue coloration. In the top left panel, another white mouse appears, and then another gray mouse, leading to a total of four mice in that panel. A mouse in the top right panel looks towards the group of mice. In the bottom left panel, one mouse is shown attacking another. The mouse on the right in the top right panel is shown to have light-gray coloration. Finally, another light-gray mouse appears in the bottom right panel.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Event Sorting",
+ "question": "Which of the following represents the correct sequence of the development of the population issue in mice as described in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Competition for resources, slowing reproduction rates, the breakdown of community communication, and rapid reproduction.",
+ "B. Rapid reproduction, the breakdown of community communication, competition for resources, and slowing reproduction rates.",
+ "C. Rapid reproduction, competition for resources, slowing reproduction rates, and the breakdown of community communication.",
+ "D. Rapid reproduction, competition for resources, the breakdown of community communication, and slowing reproduction rates."
+ ]
+ },
+ "task1": {
+ "task_domain": "Understanding",
+ "task_type": "Spatial Relation",
+ "question": "What is the spatial relationship between the mouse with its head raised and the sitting mouse in the final frame of the video?",
+ "answer": "A",
+ "candidates": [
+ "A. On its left.",
+ "B. In front of it.",
+ "C. Behind it.",
+ "D. On its right."
+ ]
+ }
+ },
+ "iSnHpfOQ": {
+ "video_id": "iSnHpfOQ",
+ "video_duration": "43s",
+ "duration": "<1min",
+ "domain": "Performance",
+ "sub_category": "Stand-up",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "A man with dark hair and a mustache, wearing a dark jacket, is standing inside what appears to be a kitchen or restaurant. He talks to another man who is wearing a dark jacket and a collared shirt. The man with the mustache points at his head. The men then stand face to face near a serving area. The man with the mustache gestures with his hand and then points at his head again.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Human Emotions",
+ "question": "What is the emotion of the man on the left in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Excited.",
+ "B. Anxious.",
+ "C. Calm.",
+ "D. Angry."
+ ]
+ }
+ },
+ "KBVsSjRc": {
+ "video_id": "KBVsSjRc",
+ "video_duration": "69s",
+ "duration": "1-2min",
+ "domain": "Performance",
+ "sub_category": "Parodies",
+ "audio_class": [
+ "Speech",
+ "Event"
+ ],
+ "video_caption": "The video begins with a rotating earth globe with blue, red, and yellow hues, and the text \"PENEWS\" in white letters over a blue rectangle, framed by red stripes. The text \"VS\" is visible to the right of this rectangle. The scene then transitions to a person with dark, curly hair, a black hooded jacket, and a white shirt with black letters, moving erratically in a chair. The text \"PELO\" is visible in white letters on a red rectangle. The person is holding a microphone with a tan handle. The person continues to move while talking, at times raising and moving their hands. The video then returns to the rotating earth globe. The person is again moving in the chair, at times turning to one side. The person is seen with an image of a person with orange-blond hair in a yellow rectangle in the upper left. The person then holds up a yellow, plush duck toy, then moves the toy around. The person then moves their hand in front of the camera. The person then holds up their hand with an \"OK\" symbol. The person then brings their hand to their face. The person is shown again, with an image of a cartoon skeleton in a coffin to the left of them in a purple rectangle. The person makes a hand gesture in front of the camera, then moves their head and face around, speaking. The video ends with the person speaking and making hand gestures while sitting in the chair.",
+ "task0": {
+ "task_domain": "Recognition",
+ "task_type": "Anomaly Recognition",
+ "question": "What unusual event occurred during the middle of the video?",
+ "answer": "C",
+ "candidates": [
+ "A. A man unexpectedly produces some snacks during a live news broadcast.",
+ "B. A man unexpectedly produces a bottle of water during a live news broadcast.",
+ "C. A man unexpectedly produces a toy during a live news broadcast.",
+ "D. A man unexpectedly produces a phone during a live news broadcast."
+ ]
+ },
+ "task1": {
+ "task_domain": "Reasoning",
+ "task_type": "Hallucination",
+ "question": "Is Joe Biden present in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. I am not sure.",
+ "B. Yes, he is.",
+ "C. No, he is not."
+ ]
+ }
+ },
+ "MaTlBlze": {
+ "video_id": "MaTlBlze",
+ "video_duration": "88s",
+ "duration": "1-2min",
+ "domain": "Culture & Politics",
+ "sub_category": "Politics",
+ "audio_class": [
+ "Speech"
+ ],
+ "video_caption": "The video shows a news program with three people on screen. Initially, only one person is visible, a man with dark skin in a suit and tie, as a light briefly covers most of the screen. Then, the screen shows a split view with the man in the center. On the left is a woman with dark skin in a blue jacket with a light-colored scarf. On the right is a woman with light skin in a light blue top and glasses. All three are looking at the camera and speaking. The woman on the left appears to be in a different location than the other two, and at one point the shot changes to show her against a background of flags. Then, the split screen reappears. The man is sitting at a desk, while the two women appear to be in their own separate spaces. Text is displayed along the bottom of the screen with the title \u201cGuns in America\u201d, and a ticker with smaller text also moves across the bottom. The CNN logo appears in the bottom right corner of the screen and shows the word \"live.\" The time is shown in the bottom right at different intervals, and changes from 2:12 PM ET to 11:13 AM PT, which is also displayed to the left of the woman on the left.",
+ "task0": {
+ "task_domain": "Understanding",
+ "task_type": "Text and Diagram Understanding",
+ "question": "What is the minimum age for purchasing a semi-automatic rifle according to the new bill?",
+ "answer": "D",
+ "candidates": [
+ "A. 20 years old.",
+ "B. 19 years old.",
+ "C. 22 years old.",
+ "D. 21 years old."
+ ]
+ },
+ "task1": {
+ "task_domain": "Recognition",
+ "task_type": "Scene Recognition",
+ "question": "Where is the woman wearing the dark blue suit in the video?",
+ "answer": "C",
+ "candidates": [
+ "A. Medical facility office.",
+ "B. Police station.",
+ "C. Government office.",
+ "D. News studio."
+ ]
+ }
+ }
+}
\ No newline at end of file
diff --git a/eval_scripts/video-SALMONN2-testset/eval_video-SALMONN2-testset.sh b/eval_scripts/video-SALMONN2-testset/eval_video-SALMONN2-testset.sh
new file mode 100644
index 0000000000000000000000000000000000000000..d737dbdc6708f2ddef54ccc1b707b02cda67608f
--- /dev/null
+++ b/eval_scripts/video-SALMONN2-testset/eval_video-SALMONN2-testset.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+MODEL_PATH="path_to_AVoCaDO" # TODO
+OUTPUT_DIR="$1"
+
+mkdir -p "$OUTPUT_DIR"
+
+python eval_scripts/video-SALMONN2-testset/generate_caption.py \
+ --model_path "$MODEL_PATH" \
+ --save_path "$OUTPUT_DIR/model_caption.json"
+
+python eval_scripts/video-SALMONN2-testset/evaluation.py "$OUTPUT_DIR/model_caption.json"
diff --git a/eval_scripts/video-SALMONN2-testset/evaluation.py b/eval_scripts/video-SALMONN2-testset/evaluation.py
new file mode 100644
index 0000000000000000000000000000000000000000..ca1eb0a211396f649a9d32a326b0b40328b875ea
--- /dev/null
+++ b/eval_scripts/video-SALMONN2-testset/evaluation.py
@@ -0,0 +1,325 @@
+import time
+import copy
+import os
+import json
+import openai  # required by call_gpt41 below
+import concurrent.futures
+import threading
+from tqdm import tqdm
+import ast
+import random
+import re
+import sys
+from pathlib import Path
+
+try:
+    with open("apikey.txt", "r") as f:
+        api_key = f.read().strip()
+except OSError:
+    api_key = ''
+
+def call_gpt41(msg):
+ while True:
+ try:
+ response = openai.ChatCompletion.create(
+ model="gpt-4.1",
+ messages=msg,
+ api_key=api_key,
+ request_timeout=5)
+ break
+        except Exception as e:
+            # Retry on any API error (timeouts, rate limits, transient failures)
+            print(f"Request failed ({e}), retrying...")
+            time.sleep(5)
+
+ output_text = response['choices'][0]['message']['content']
+ return output_text
+
+
+script_dir = Path(__file__).resolve().parent
+events_file = script_dir / "video_salmonn2_test.json"
+LOCK = threading.Lock()
+
+summarize_prompt_system = '''
+### Task:
+A good video description is one that describes the various details in the video. Your task is to judge whether a video description is good or not. You will be provided with all the base events in the video, as well as a video description to be evaluated. You need to determine which video events are described in the given video description.
+
+### Instruction:
+Given the video description, besides the events described correctly, there might be events that are missed, described incorrectly, or hallucinated. You need to determine the number of missed events, incorrect events, and hallucination events.
+- Missed Event: An event from the given base events set whose central action, participants, and context are entirely absent in the caption to be evaluated, meaning the main occurrence is not conveyed in any form, even if paraphrased or summarized.
+- Incorrect Event: An event that is in the base events set and is mentioned in the caption, but described with significant factual inaccuracies.
+- Hallucination Event: An event mentioned in the caption that is not in the base events set and is not a plausible variation or inference from them; it is entirely fabricated or clearly inappropriate given the context.
+You are also required to list these events out. Note that there should be no overlap between Incorrect Events and Hallucination Events.
+
+### Input Format:
+- There are {event_num} base events in the video in total. All the base events in the video will be provided in list format, i.e. ["xxx", "xxx", ...]
+- The video description to be evaluated will be provided as well.
+
+### Output Format:
+Your output should strictly follow the Python dictionary format below, with nothing else:
+{{"Missed": x, "Incorrect": x, "Hallucination": x, "Missed Event": [...], "Incorrect Event": [...], "Hallucination Event": [...] }}
+'''
+
+summarize_prompt_system_2 = '''
+### Task:
+A good video description is one that describes the various details in the video. Your task is to judge whether a video description is good or not. You will be provided with all the base events in the video, as well as a video description to be evaluated. You need to determine which video events are described in the given video description.
+
+### Instruction:
+Given the video description, you need to determine the number of missed events, correct events, incorrect events and hallucination events. Since hallucination events are not part of the base events set, make sure that: "missed" + "correct" + "incorrect" = {event_num}
+- Missed Event: An event from the given base events set whose central action, participants, and context are entirely absent in the caption to be evaluated, meaning the main occurrence is not conveyed in any form, even if paraphrased or summarized.
+- Incorrect Event: An event that is in the base events set and is mentioned in the caption, but described with significant factual inaccuracies.
+- Hallucination Event: An event mentioned in the caption that is not in the base events set and is not a plausible variation or inference from them; it is entirely fabricated or clearly inappropriate given the context.
+Note that there should be no overlap between Incorrect Events and Hallucination Events.
+
+### Input Format:
+- There are {event_num} base events in the video in total. All the base events will be provided in Python list format, i.e. ["xxx", "xxx", ...]
+- The video description to be evaluated will be provided as well.
+
+### Output Format:
+Your output must strictly follow the Python dictionary format below, with nothing else:
+{{"Missed": x, "Correct": x, "Incorrect": x, "Hallucination": x}}
+'''
+
+summarize_prompt = '''
+#### Events In The Video
+{events_in_video}
+
+#### Video Description To Be Rated
+{cap_to_be_rated}
+
+Given the base events in the video and the video description, please count the missed, incorrect and hallucination events and list them out.
+'''
+
+summarize_prompt_2 = '''
+#### Events In The Video
+{events_in_video}
+
+#### Video Description To Be Rated
+{cap_to_be_rated}
+
+Given the base events in the video and the video description, please count the missed, correct, incorrect and hallucination events.
+'''
+
+seed = 2024
+
+def gpt_caption(ref, pred, summarize_prompt_system=None, summarize_prompt=None):
+ temp_query_system = summarize_prompt_system.format(event_num=len(ref))
+ temp_summarize_prompt = summarize_prompt.format(events_in_video=ref, cap_to_be_rated=pred)
+
+ msg = [
+ {"role": "system", "content": temp_query_system},
+ {"role": "user", "content": temp_summarize_prompt}
+ ]
+ completion = call_gpt41(msg)
+
+ return completion
+
+
+def reduce_repeated_words(text):
+    # Collapse degenerate repetition loops in model output: any unit of up to
+    # 49 characters repeated 5 or more times in a row is reduced to a single
+    # copy; longer units (50-99 characters) are collapsed from 4 repeats on.
+    pattern = "."
+    for i in range(1, 50):
+        p = pattern * i
+        text = re.sub(f'({p})' + r'\1{4,200}', r'\1', text)
+    for i in range(50, 100):
+        p = pattern * i
+        text = re.sub(f'({p})' + r'\1{3,200}', r'\1', text)
+    return text
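+# Hand-checked illustration of the collapsing rule above: a short unit has to
+# occur at least five consecutive times before it is reduced to one copy.
+if __debug__:
+    assert reduce_repeated_words("ab" * 6) == "ab"        # 6 copies -> 1
+    assert reduce_repeated_words("ab" * 4) == "abababab"  # 4 copies: unchanged
+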
+
+def gpt_extract(item):
+ try:
+ if isinstance(item['pred'], list):
+ item['pred'] = item['pred'][0]
+ if "<|im_end|>" not in item['pred']:
+ text = reduce_repeated_words(item["pred"])
+ else:
+ text = item['pred'].replace("<|im_end|>", "")
+ res = gpt_caption(item['events'], text, summarize_prompt_system=summarize_prompt_system, summarize_prompt=summarize_prompt)
+
+ miss = int(res.split('"Missed":')[1].split('"Incorrect"')[0].strip().replace(",", ""))
+ incor = int(res.split('"Incorrect":')[1].split('"Hallucination"')[0].strip().replace(",", ""))
+ hall = int(res.split('"Hallucination":')[1].split('"Missed Event"')[0].strip().replace(",", ""))
+
+ try:
+ miss_event = json.loads(res.split('"Missed Event":')[1].split('"Incorrect Event"')[0].strip()[:-1])
+ except Exception as e:
+ miss_event = eval(res.split('"Missed Event":')[1].split('"Incorrect Event"')[0].strip()[:-1])
+
+ try:
+ incor_event = json.loads(res.split('"Incorrect Event":')[1].split('"Hallucination Event"')[0].strip()[:-1])
+ except Exception as e:
+ incor_event = eval(res.split('"Incorrect Event":')[1].split('"Hallucination Event"')[0].strip()[:-1])
+
+ try:
+ hall_event = json.loads(res.split('"Hallucination Event":')[1].split('}')[0].strip())
+ except Exception as e:
+ hall_event = eval(res.split('"Hallucination Event":')[1].split('}')[0].strip())
+
+ item["cutted_pred"] = text
+ item["Missed"] = miss
+ item["Incorrect"] = incor
+ item["Hallucination"] = hall
+ item["Missed Event"] = miss_event
+ item["Incorrect Event"] = incor_event
+ item["Hallucination Event"] = hall_event
+
+ return item
+
+ except Exception as e:
+ return item
+
+def gpt_extract_2(item):
+ try:
+ if "<|im_end|>" not in item['pred']:
+ text = reduce_repeated_words(item["pred"])
+ else:
+ text = item['pred'].replace("<|im_end|>", "")
+ res = gpt_caption(item['events'], text, summarize_prompt_system=summarize_prompt_system_2, summarize_prompt=summarize_prompt_2)
+
+ miss = int(res.split('"Missed":')[1].split('"Correct"')[0].strip().replace(",", ""))
+ cor = int(res.split('"Correct":')[1].split('"Incorrect"')[0].strip().replace(",", ""))
+ incor = int(res.split('"Incorrect":')[1].split('"Hallucination"')[0].strip().replace(",", ""))
+ hall = int(res.split('"Hallucination":')[1].split('}')[0].strip().replace(",", ""))
+
+ assert miss + cor + incor == len(item["events"])
+
+ item["cutted_pred"] = text
+ item["Missed"] = miss
+ item["Incorrect"] = incor
+ item["Hallucination"] = hall
+ item["Correct"] = cor
+
+ return item
+
+ except Exception as e:
+ return item
+
+if __name__ == "__main__":
+    if len(sys.argv) != 2:
+        print(f"usage: {sys.argv[0]} <res_file_path>")
+        sys.exit(1)
+    res_file = sys.argv[1]
+    output_json = res_file.replace(".json", "_eval_results_gpt41.json")
+    log_path = res_file.replace(".json", "_eval_results_gpt41.log")
+    events_file = ""  # TODO: path to the events annotation JSON (entries with "video" and "events" keys)
+
+ with open(res_file, 'r', encoding='utf-8') as fp:
+ res_data = json.load(fp)
+
+ with open(events_file, 'r', encoding='utf-8') as fp:
+ events_data = json.load(fp)
+
+ map_dic = {}
+ for item in events_data:
+ map_dic[item["video"]] = item
+
+ for item in res_data:
+ video_path = item['id'][0].split("/")[-2] + '/' + item['id'][0].split("/")[-1]
+ if video_path in map_dic:
+ events = map_dic[video_path]["events"]
+ map_dic[video_path] = item
+ map_dic[video_path]["events"] = events
+
+ res_data = list(map_dic.values())
+ print(len(res_data))
+
+ total_result = []
+
+ for i in range(7):
+ with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
+ responses = list(tqdm(executor.map(gpt_extract, res_data)))
+
+ seed += 1
+ result = [r for r in responses if "Hallucination" in r]
+ ignore = [r for r in responses if "Hallucination" not in r]
+ try:
+ with open(output_json, 'w', encoding='utf-8') as fp:
+ json.dump(result, fp, indent=4, ensure_ascii=False)
+ except Exception as e:
+ print(e)
+ print(output_json)
+ print(len(result), len(ignore))
+
+ k = 0
+
+ while len(ignore) != 0 and k < 8:
+ with concurrent.futures.ThreadPoolExecutor(max_workers=min(100, len(ignore))) as executor:
+ responses = list(tqdm(executor.map(gpt_extract, ignore)))
+
+ seed += 1
+ result += [r for r in responses if "Hallucination" in r]
+ ignore = [r for r in responses if "Hallucination" not in r]
+ try:
+ with open(output_json, 'w', encoding='utf-8') as fp:
+ json.dump(result, fp, indent=4, ensure_ascii=False)
+ except Exception as e:
+ print(e)
+ print(output_json)
+ print(len(result), len(ignore))
+
+ k += 1
+
+ if len(ignore) > 0:
+ k = 0
+ print("Version 2 GPT request")
+ while len(ignore) != 0 and k < 3:
+ with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
+ responses = list(tqdm(executor.map(gpt_extract_2, ignore)))
+
+ seed += 1
+ result += [r for r in responses if "Hallucination" in r]
+ ignore = [r for r in responses if "Hallucination" not in r]
+ try:
+ with open(output_json, 'w', encoding='utf-8') as fp:
+ json.dump(result, fp, indent=4, ensure_ascii=False)
+ except Exception as e:
+ print(e)
+ print(output_json)
+ print(len(result), len(ignore))
+ k += 1
+
+ eer = sum([it['Missed'] + it['Incorrect'] + it['Hallucination'] for it in result]) / sum([len(it["events"]) for it in result])
+ left_r = sum([it['Missed'] for it in result]) / sum([len(it["events"]) for it in result])
+ inc_r = sum([it['Incorrect'] for it in result]) / sum([len(it["events"]) for it in result])
+ fan_r = sum([it['Hallucination'] for it in result]) / sum([len(it["events"]) for it in result])
+
+ print(f"EER: {eer}, MISS: {left_r}, INCORRECT: {inc_r}, HALLUCINATION: {fan_r}, Success_samples: {len(result)}")
+ print(f"{eer:.3f}, {left_r:.3f}, {inc_r:.3f}, {fan_r:.3f}, {len(result)}")
+
+ total_result.append([eer, left_r, inc_r, fan_r, len(result)])
+
+
+    sorted_result = sorted(total_result, key=lambda x: x[0])
+    highlight_result = sorted_result[3]  # median of the 7 runs when sorted by EER
+    print(f"EER: {highlight_result[0]:.3f}, MISS: {highlight_result[1]:.3f}, INCORRECT: {highlight_result[2]:.3f}, HALLUCINATION: {highlight_result[3]:.3f}")
+ with open(log_path, 'w', encoding='utf-8') as f_log:
+ f_log.write("=" * 80 + "\n")
+ f_log.write(" PERFORMANCE SUMMARY\n")
+ f_log.write("=" * 80 + "\n")
+ summary_line = (
+ f" EER : {highlight_result[0]:.4f}\n"
+ f" MISS : {highlight_result[1]:.4f}\n"
+ f" INCORRECT : {highlight_result[2]:.4f}\n"
+ f" HALLUCINATION: {highlight_result[3]:.4f}\n"
+ )
+ f_log.write(summary_line)
+ f_log.write("=" * 80 + "\n\n")
+
+ f_log.write("Full Results Table:\n")
+
+ header = f"{'EER':<18}{'MISS':<18}{'INCORRECT':<18}{'HALLUCINATION':<18}{'Success Samples':<18}"
+ separator = "-" * len(header)
+
+ f_log.write(header + "\n")
+ f_log.write(separator + "\n")
+
+ # Write the data rows
+ for sublist in sorted_result:
+ row_str = (
+ f"{sublist[0]:<18.4f}"
+ f"{sublist[1]:<18.4f}"
+ f"{sublist[2]:<18.4f}"
+ f"{sublist[3]:<18.4f}"
+ f"{sublist[4]:<18}"
+ )
+ f_log.write(row_str + '\n')
+
+ f_log.flush()
\ No newline at end of file
diff --git a/eval_scripts/video-SALMONN2-testset/generate_caption.py b/eval_scripts/video-SALMONN2-testset/generate_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..8d124f4e931e2b89c5138340645d473292c2f280
--- /dev/null
+++ b/eval_scripts/video-SALMONN2-testset/generate_caption.py
@@ -0,0 +1,92 @@
+import os
+import torch
+from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
+from qwen_omni_utils import process_mm_info
+import argparse
+import json
+from tqdm import tqdm
+from pathlib import Path
+
+VIDEO_MAX_PIXELS = 401408 # 512*28*28
+VIDEO_TOTAL_PIXELS = 20070400 # 512*28*28*50
+USE_AUDIO_IN_VIDEO = True
+os.environ['VIDEO_TOTAL_PIXELS'] = str(VIDEO_TOTAL_PIXELS)  # total pixel budget across sampled frames, read by qwen_omni_utils
+script_dir = Path(__file__).resolve().parent
+fin_path = script_dir / "video_salmonn2_test.json"
+video_dir = "" # TODO
+
+
+parser = argparse.ArgumentParser(description="Evaluate a model and save results.")
+parser.add_argument("--model_path", type=str, required=True, help="Path to the model checkpoint.")
+parser.add_argument("--save_path", type=str, required=True, help="Path to save the evaluation results.")
+args = parser.parse_args()
+
+model_path = args.model_path
+fout_path = args.save_path
+
+fin = open(fin_path, 'r', encoding='utf-8')
+fout = open(fout_path, 'w', encoding='utf-8')
+
+model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
+ model_path,
+ torch_dtype=torch.bfloat16,
+ device_map="auto",
+ attn_implementation="flash_attention_2",
+)
+model.disable_talker()
+processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
+
+
+def chat(file_path, prompt):
+ conversation = [
+ {
+ "role": "system",
+ "content": [
+ {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
+ ],
+ },
+ {
+ "role": "user",
+ "content": [
+ {
+ "type": "video",
+ "video": file_path,
+ "max_pixels": VIDEO_MAX_PIXELS,
+ },
+ {
+ "type": "text",
+ "text": prompt
+ },
+ ],
+ },
+ ]
+
+ text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
+ audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+ inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+ inputs = inputs.to(model.device).to(model.dtype)
+
+ text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, do_sample=False, thinker_max_new_tokens=2048)
+
+ text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+ model_generation = text.split("\nassistant\n")[-1]
+
+ return model_generation
+
+
+annotations = json.load(fin)
+updated_annos = []
+
+for idx, anno in tqdm(enumerate(annotations, start=1)):
+ video_path = os.path.join(video_dir, anno["video"])
+ prompt = anno["conversations"][0]["value"].replace("\n", "")
+
+ model_generation = chat(video_path, prompt)
+ out_data = {
+ "id": [video_path],
+ "prompt": prompt,
+ "pred": model_generation,
+ }
+ updated_annos.append(out_data)
+
+json.dump(updated_annos, fout, indent=4, ensure_ascii=False)
\ No newline at end of file
diff --git a/eval_scripts/video-SALMONN2-testset/video_salmonn2_test.json b/eval_scripts/video-SALMONN2-testset/video_salmonn2_test.json
new file mode 100644
index 0000000000000000000000000000000000000000..2480e93e726ed1c9cdfd0530e919081b947472c7
--- /dev/null
+++ b/eval_scripts/video-SALMONN2-testset/video_salmonn2_test.json
@@ -0,0 +1,23788 @@
+[
+ {
+ "video": "video/0.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man in a bright yellow shirt holding a clear plastic water bottle, pouring water into a tall, clear glass. The backdrop is a dark blue wall adorned with a starry pattern, and the text at the top of the frame reads, \"Things you have to look at twice to understand,\" accompanied by a thinking face emoji. Meanwhile, the audio reveals an inviting voice introducing the segment with, \"Things you have to look at twice to understand. Firstly, this bridge in Japan is not what you think it is. Viewed from afar, it looks insanely steep,\" all set against a background of eerie music. As the water fills the glass, the scene shifts to an image of a steep, wavy bridge with cars steadily driving across it. The man in the yellow shirt gestures towards the bridge to emphasize its unusual structure, with industrial buildings and a mountainous landscape forming the background.\n\nThen, we see the man in the yellow shirt standing directly in front of the steep background, undulating bridge, now shown from a different angle to further highlight its dramatic incline. The audio continues with the same man's voice explaining, \"almost like a cliff and you're probably wondering how cars are even driving on it. Well, once you get closer, it's just a normal looking bridge, although it does have a hill, but it's not that steep.\" The video zooms in on the bridge, capturing more details of the road and its surroundings, including road signs and utility poles, as the man in the yellow shirt continues to point towards the bridge. Explaining why the bridge looks so steep afar. The seamless blend of narration and background music continues to engage the viewer.\n\nTransitioning to another scene, the man in the yellow shirt partially obstructs the view of a girl standing in the background. Apparently her legs look wrongly. She's wearing a black outfit paired with a leopard print scarf and black boots. 
She stands on a grassy field, with several parked cars and trees in the background. The man in the yellow shirt gestures towards the girl, while the audio adds a layer of intrigue by saying, \"okay next once you see it you won't be able to unsee it you're probably wondering why this girl's legs are so skinny well they're not she's just holding a bag of popcorn that blends in with the grass.\" \n\nThe scene continues with the girl still standing in the grassy field, and immediately transitions to a black background where a grid of blurred faces appears. The man in the yellow shirt points towards these shifting images. Just then, the audio prompts an interactive challenge: \"Now, stare at the cross in the middle of these two pictures and focus really hard. You will see your favorite celebrities turn into monsters, and the longer you stare, the scarier it gets, and honestly, it's pretty spooky. \" as the photos continues to change, showing different pairs of celebrities and eerie music in the background amplifies the tension and mystery.\n\nThe background changes once more, showing a black grid of changing images that quickly dissolves into the scene of the man in the yellow shirt standing in front of what looks like an ocean. In the distance, a man clad in a black wetsuit is seen running on the water, causing splashes with each step. The video highlights this seemingly impossible action from various angles as the man in the yellow shirt points and gestures vigorously. The corresponding audio enhances this marvel with, \"Finally, right when you saw this, you're probably wondering how in the world is this man walking on water? Of course he's not actually walking on water, he dived and turned upside down.\" maintaining a tone of curiosity and suspense amidst the ongoing music.\n\nFinally, the last scene opens with the man in the yellow shirt still set against the blue oceanic background. 
This time, a man in a black wetsuit is swimming underwater, creating a stream of bubbles and splashes. The man in the yellow shirt continues to emphasize the swimmer's movement. The audio concludes with a calm and assured voice stating, as we observe the swimmer from various angles, showing a seamless loop of underwater actions to complete the focal activity. Throughout the video, the blend of visuals and narration creates a cohesive and captivating narrative, leaving viewers both intrigued and informed."
+ }
+ ],
+ "events": [
+ "Man in yellow shirt holding a clear plastic water bottle",
+ "Man pouring water into a tall, clear glass",
+ "Backdrop with dark blue wall and starry pattern",
+ "Text on top of frame reads 'Things you have to look at twice to understand' with a thinking face emoji",
+ "Inviting voice in audio introduces the segment",
+ "Eerie background music plays",
+ "Audio mentions a steep bridge in Japan",
+ "Scene changes to image of steep, wavy bridge with cars driving",
+ "Industrial buildings and mountainous landscape in background",
+ "Man gestures towards bridge to emphasize structure",
+ "Man in yellow shirt stands in front of bridge shown from a different angle",
+ "Audio explains perceived steepness of the bridge",
+ "Zoom-in on details of the bridge with road signs and utility poles",
+ "Man points towards bridge while explaining",
+ "Scene shifts to man in yellow obstructing view of a girl",
+ "Girl in black outfit with leopard print scarf and black boots stands in grassy field",
+ "Several parked cars and trees in background",
+ "Man gestures towards girl",
+ "Audio explains illusion of girl's skinny legs due to bag of popcorn",
+ "Girl still standing in grassy field",
+ "Scene transitions to black background with grid of blurred faces",
+ "Man points towards shifting images",
+ "Audio introduces interactive challenge with cross in the middle of pictures",
+ "Photos change showing pairs of celebrities turning into monsters",
+ "Eerie music in background",
+ "Background changes to man in yellow standing in front of ocean",
+ "Man in black wetsuit seen running on water in distance",
+ "Different angles highlight man seemingly running on water",
+ "Man in yellow gestures vigorously towards scene",
+ "Audio explains illusion of man dived and turned upside down",
+ "Scene with man in black wetsuit swimming underwater creating bubbles",
+ "Man in yellow emphasizes swimmer's movement",
+ "Audio concludes with explanation of underwater action",
+ "Various angles of swimmer shown",
+ "Seamless loop of underwater actions completes video narrative"
+ ]
+ },
+ {
+ "video": "video/1.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a low-angle view of a wooden staircase, with a warm, evening indoor lighting setting. A child in black and white patterned pajamas starts to descend from the top of the stairs, carefully holding onto the railing for support. The background features a dark curtain and some items on the floor near the stairs. \n\nAudio reveals a man speaking, \"`Yeah, I don't think you should never do that again. I thought you was gonna walk down the top steps. That was crazy. `\" followed shortly by a thud.\n\nThe child reaches the bottom of the stairs and stands holding onto the wall, while the scene remains consistent in lighting and background.\n\nThe video then transitions to showing the child positioned near the bottom of the stairs, still in the black and white pajamas and holding onto the wooden railing with one hand and a blue item with the other. The warm light from the right side continues to illuminate the scene with the dark curtain in the background. \n\nThe child starts to descend the stairs. As he reach the bottom and move out of the frame, the voice of the man continues, \"`Don't let that cake boost your head up no more. Did you hurt yourself? Yeah. `\"\n\nA child's voice responds, \"Yeah.\"\n\nIn the next scene, the focus returns to the staircase from the low angle. The child, in the same patterned pajamas, appears once again at the top of the stairs, holding both the railing and the blue item. The background elements, including the dark curtain and items on the floor, remain consistent, and the warm light bathes the scene.\n\nThe man's voice is heard repeating, \"`Yeah, I don't think you should never do that again, I thought she was gonna walk down the top steps.`\" After a thud, another man speaks briefly. The child then reaches the bottom of the stairs, holding onto the wall.\n\nIn the final scene, the child stands near the bottom of the stairs, holding the railing and the blue item, in the same warm, indoor evening setting. 
The child stands still for a moment before beginning the descent. The audio echoes, \"`That was crazy. Don't let that cake boost your head up no more. Did you hurt yourself?`\" and the child responds again, \"Yeah.\" The child continues down the stairs, eventually moving out of the frame.\n\nThe video concludes with the child fully reaching the bottom and disappearing from the view, maintaining the coherent indoor setting throughout."
+ }
+ ],
+ "events": [
+ "Low-angle view of a wooden staircase with warm, evening indoor lighting",
+ "Child in black and white patterned pajamas appears at the top of the stairs",
+ "Child carefully holds onto the railing for support while descending",
+ "Background features a dark curtain and items on the floor near the stairs",
+ "Audio: Man says, 'Yeah, I don't think you should never do that again. I thought you was gonna walk down the top steps. That was crazy.'",
+ "Audio is followed by a thud",
+ "Child reaches the bottom of the stairs and stands holding onto the wall",
+ "Scene remains consistent in lighting and background",
+ "Video transitions: Child near the bottom of the stairs holding the wooden railing with one hand and a blue item with the other",
+ "Warm light illuminates the scene from the right side",
+ "Child starts to descend the stairs again",
+ "Audio: Man says, 'Don't let that cake boost your head up no more. Did you hurt yourself?'",
+ "Child's voice responds, 'Yeah'",
+ "Child moves out of frame",
+ "Focus returns to staircase from low angle",
+ "Child appears at the top of the stairs again holding both the railing and the blue item",
+ "Background elements remain consistent (dark curtain, items on the floor)",
+ "Warm light continues to bathe the scene",
+ "Audio: Man repeats, 'Yeah, I don't think you should never do that again, I thought she was gonna walk down the top steps.'",
+ "Audio is followed by another thud",
+ "Another man speaks briefly",
+ "Child reaches the bottom of the stairs holding onto the wall",
+ "Final scene: Child stands near the bottom of the stairs holding the railing and the blue item",
+ "Same warm, indoor evening setting",
+ "Child stands still before beginning the descent",
+ "Audio: Echoes, 'That was crazy. Don't let that cake boost your head up no more. Did you hurt yourself?'",
+ "Child responds again, 'Yeah'",
+ "Child continues down the stairs and moves out of frame",
+ "Video concludes with the child fully reaching the bottom and disappearing from the view",
+ "Coherent indoor setting is maintained throughout"
+ ]
+ },
+ {
+ "video": "video/2.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man standing near a glass door, dressed in a black polo shirt with blue accents and gray pants. He wears a black watch on his left wrist and has a purple backpack strap over his shoulder. A black and brown dog stands near his legs. The man talks and gestures with his right hand, set against a background that reveals a green lawn and a parked car outside the door. The scene transitions to another man standing in the same location, wearing a black Nike t-shirt, red and black shorts with a Jordan logo, and a silver watch on his left wrist. He has his hands on his hips, observing the first man, adding a dynamic to the interaction. In the background, a voice is heard explaining, \"It only takes three seconds for even a half-dog fight for someone to lose an eye or something. \u201cThe video shifts to a man in a black t-shirt featuring a blue graphic on the back, with a red backpack strap over his shoudler, who is holding a leash and interacting with two black and brown dogs near the same glass door. The green lawn remains visible in the background. In the background, a voice continues explaining, \"So we have to let them know that that's unacceptable.\" Then a man with black shirt and a blue backpack strap over his shoulder comes out. He bends and push the two dogs with his one hand. A dog barks intermittently as the man continues to speak.The audio progresses, detailing: \"He doesn't need to be biting her in the face like that.\" Then the screen transitions to the man dressed in a black polo shirt with blue accents and gray pants who appears at the the beginning , he says\" I'm surprised she puts up with it for that long. Does she ever fire off on you and say, knock it off? \u201c Then a manlike voice answers, \u201dyeah\u201c .\n\nThe screen transitions to the glass door where the man in the black polo shirt with purple backpack strap continues to talk and gesture vigorously with his right hand. 
He occasionally glances towards the door, maintaining a conversation against a backdrop of a green lawn and a parked car outside. The dialogue is now heard, \" The progression of his gestures correlates with the emphasis in his speech.The man says\uff0c\u201cIt's our job to let him know that that's unacceptable. It seems a little cute, but it's not. She doesn't think it's cute. I basically bit him while he's biting her.\" When he mentions himself , he points himself, when he mentions \"he / her\" he points to the dogs. Then a manlike voice answerd \"No problem!\" . after that, the man continues talking, \"And I'm saying, knock it off. And he knocked it off.\" \n\nThe final scene features the same man in the black polo shirt standing near a white door with a black handle. A black and brown dog sits attentively near his legs, looking up at him, as he continues to gesture with his right hand. The background remains consistent with a white wall and a small control panel. This part of the conversation in the audio unfolds: \"It's nothing personal. I'm not mad at him. Don't bite your sister in the face.\" The visual captures the ongoing dynamics of the talking man and attentive dog, bringing cohesion to the overall narrative that underscores proper behavior and training of the dogs. The video closes with the consistent visual presence of the man gesturing and conversing near the door, with the loyal dog by his side."
+ }
+ ],
+ "events": [
+ "Video begins with a man standing near a glass door.",
+ "Man is dressed in a black polo shirt with blue accents and gray pants.",
+ "Man wears a black watch on his left wrist.",
+ "Man has a purple backpack strap over his shoulder.",
+ "A black and brown dog stands near the man's legs.",
+ "Man talks and gestures with his right hand.",
+ "Background reveals a green lawn and a parked car outside the door.",
+ "Scene transitions to another man standing in the same location.",
+ "Second man is wearing a black Nike t-shirt, red and black shorts with a Jordan logo.",
+ "Second man wears a silver watch on his left wrist.",
+ "Second man has his hands on his hips, observing the first man.",
+ "In the background, a voice explaining the potential dangers of dog fights.",
+ "Video shifts to a man in a black t-shirt featuring a blue graphic on the back.",
+ "This man has a red backpack strap over his shoulder.",
+ "He is holding a leash and interacting with two black and brown dogs near the glass door.",
+ "Background remains a green lawn.",
+ "Voice continues explaining proper behavior reinforcement to the dogs.",
+ "A man in a black shirt with a blue backpack strap comes out.",
+ "He bends and pushes the two dogs with one hand.",
+ "A dog barks intermittently while the man speaks.",
+ "The audio discusses inappropriate dog behavior (biting).",
+ "Screen transitions back to the original man in a black polo shirt with blue accents and gray pants.",
+ "Original man speaks about dog behavior and points to himself and the dogs while speaking.",
+ "Man says: 'It's our job to let him know that that's unacceptable.'",
+ "Man mentions how behavior might seem cute but is not.",
+ "Man continues explaining training dogs to not bite.",
+ "A manlike voice answers, 'No problem!'.",
+ "Man continues talking about how he corrected the dog's behavior.",
+ "Final scene features the same man standing near a white door with a black handle.",
+ "A black and brown dog sits attentively near the man's legs.",
+ "Man continues to gesture with his right hand.",
+ "Background shows a white wall and a small control panel.",
+ "Audio discusses not taking dog correction personally.",
+ "Man advises not to 'bite your sister in the face'.",
+ "Video closes with the man gesturing and conversing near the door, dog by his side."
+ ]
+ },
+ {
+ "video": "video/3.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a group of cyclists riding on a road under a clear sky with some clouds, showing a grassy area with a few trees and buildings in the distance. The main cyclist in focus, dressed in a blue cycling outfit with red and white accents and a white helmet, loses balance and falls, sliding on the road. A man yells \"No, Jerome!\" in the distance, followed by a loud whoosh, and another man speaking over a loudspeaker saying, \"Okay, let'`s break down how this happened. This is the last corner at the LV.`\" The scene then cuts to black.\n\nThe video resumes with the same cyclist, riding this time alongside another cyclist in an orange and white outfit. The group continues on the road, with more trees and a few parked vehicles in the background.\n\nA new clip starts with the group of cyclists navigating a turn, passing traffic cones and a street sign while a man in a high-pitched voice says, \"`So criterium field sprint is about to pop off in about 20 seconds, so we're all jostling for position. I'm at the end of the San Jose lead-out train, which has brought things up to about 37 miles an hour.`\" The roads are marked with yellow lines, and the cyclists maintain a steady pace, passing various landmarks, while the main cyclist is joined by another in a white and blue outfit labeled 'San Jose.'\n\nThe cyclists continue to ride closely together under a clear blue sky with scattered clouds. The man narrates, \"`Now, drafting at this speed is so important. The difference between a two-foot gap and a two-inch gap to the wheel in front of you is like hundreds of watts of power save.`\" The cyclists maintain their formation and steady pace on the road marked with yellow lines, in a straight line.\n\nNext, while cycling under the same clear sky, the narration continues, \"So Jerome puts himself in his high-risk, high-reward situation, only a couple of inches from my rear wheel, he ends up overlapping and paying the price. But Jerome's a tough dude. 
He bounced back surprisingly quick.`\" A loud bang follows, emphasizing the collision or mishap caused by this risky positioning.\n\nThe scene shifts to a fallen cyclist on the road, with another cyclist approaching and a bystander in a striped shirt possibly coming to assist. The eye-catching text 'Legend.' appears, underlining the cyclist's resilience or achievement. The narration informs viewers, \"`And he went on to win a bunch of races this season,`\" followed by the sound of a motorcycle engine and people conversing in the background.\n\nThe video concludes with a close-up of the cyclist standing near a fence, the word 'Legend.' displayed prominently at the top of the frame, reinforcing the cyclist's esteemed status. A man says\"I'm sorry dude that sucks. There'`s no collarbone, no head though. So since I just, you know, with some painful showers, Jerome replies' No I don`'t... I'`m not giving you an inch. Yeah,`\" providing insight into his endurance and resolve. Another man follows up with final commentary, wrapping up the narrative. The standing cyclist's blue and white outfit further accentuates their accomplished legend status.\n\nThe entire video spans through various phases of a cycling journey, marked by perseverance, camaraderie, tactical maneuvering, and marked achievements, underscored by consistent visual motifs of cycling attire and road elements."
+ }
+ ],
+ "events": [
+ "Group of cyclists riding on a road under a clear sky with some clouds.",
+ "Grassy area with a few trees and buildings in the distance.",
+ "Main cyclist in focus is dressed in a blue cycling outfit with red and white accents and a white helmet.",
+ "Main cyclist loses balance and falls, sliding on the road.",
+ "Man yells 'No, Jerome!' in the distance.",
+ "Loud whoosh sound.",
+ "Man speaking over a loudspeaker: 'Okay, let`s break down how this happened. This is the last corner at the LV.'",
+ "Scene cuts to black.",
+ "Video resumes with the same cyclist riding alongside another cyclist in an orange and white outfit.",
+ "Group continues on the road with more trees and a few parked vehicles in the background.",
+ "New clip starts with the group of cyclists navigating a turn, passing traffic cones and a street sign.",
+ "Man in high-pitched voice says: 'So criterium field sprint is about to pop off in about 20 seconds, so we're all jostling for position. I'm at the end of the San Jose lead-out train, which has brought things up to about 37 miles an hour.'",
+ "Roads are marked with yellow lines.",
+ "Main cyclist is joined by another cyclist in a white and blue outfit labeled 'San Jose.'",
+ "Cyclists maintain a steady pace, passing various landmarks.",
+ "Cyclists ride closely together under a clear blue sky with scattered clouds.",
+ "Narration: 'Now, drafting at this speed is so important. The difference between a two-foot gap and a two-inch gap to the wheel in front of you is like hundreds of watts of power save.'",
+ "Cyclists maintain their formation and pace on the road marked with yellow lines, in a straight line.",
+ "Narration continues: 'So Jerome puts himself in his high-risk, high-reward situation, only a couple of inches from my rear wheel, he ends up overlapping and paying the price. But Jerome's a tough dude. He bounced back surprisingly quick.'",
+ "A loud bang follows, indicating a collision or mishap.",
+ "Scene shifts to a fallen cyclist on the road.",
+ "Another cyclist approaches and a bystander in a striped shirt possibly coming to assist.",
+ "Eye-catching text 'Legend.' appears.",
+ "Narration: 'And he went on to win a bunch of races this season.'",
+ "Sound of a motorcycle engine and people conversing in the background.",
+ "Video concludes with a close-up of the cyclist standing near a fence.",
+ "Word 'Legend.' displayed prominently at the top of the frame.",
+ "Man says: 'I'm sorry dude that sucks. There`s no collarbone, no head though. So since I just, you know, with some painful showers,",
+ "Jerome replies: 'No I don`t... I`m not giving you an inch. Yeah,'",
+ "Another man follows up with final commentary."
+ ]
+ },
+ {
+ "video": "video/4.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man standing on a badminton court, holding a racket in his left hand and wearing a red t-shirt with 'TRME' written on it. He is explaining three tips of improve mash power, using his right hand gestures and holding up three fingers. Above his head, icons representing a ruler, a refresh symbol, and a stopwatch appear, indicating the topics he will cover. The audio starts with his voice saying, \"In this video we're going to give you three tips to improve your smash power.\" The scene transitions to a woman in a similar setting, also wearing a red 'TRME' t-shirt and a white skirt. She also holds a racket in her left hand. As she begins explaining, she gestures with her right hands and still holds a racket in her left hand. Text appears above her head reading \"Strike the shuttle \u22480.5 metres in front of you.\" The audio continues with the woman speaking and gesturing with her right hand and a different voice saying, \"Tip number one is to hit the shuttle approximately half a meter in front of you. This is because your power comes from rotating your body forwards into the shot.\" The woman demonstrates a badminton serve, with a red arrow to measure her from the ground to the air as she jumps, hitting the shuttlecock with her racket. Then, the woman performs a badminton serve in mid-action,with the racket raised and her body in motion, with a background still of the green walls and the wooden floor. \n\nThe scene transitions back to the man in the red 'TRME' t-shirt and white shorts on the same court, holding the racket and continuing with the demonstration, gesturing with his hands as he explains. The audio continues, \"And this is our second tip: you need to rotate. 
We see a lot of people just using their arm to create power, but this is wrong.\" The clip includes the man raising his left arm and showing a gesture of creating power, concludes with a close-up of the man explaining while holding the racket with his both hands, with a red sign 'X' appearing in the frame. Next, the video shows a man in a red t-shirt and red shorts performing a powerful jump smash on a badminton court, setting in a professional environment with banners from sponsors like Yonex, HSBC, and L'Equipe. He strikes the shuttlecock mid-air with great force with twice repeated actions. The audio describes the technique, \"You actually need to generate power from your legs, which enables you to rotate your hips, then torso and finally your shoulder.\" Text on the screen reads \"Rotate your legs, hips, core, shoulder, then arm\" with a circular arrow icon. The scene transitions back to the man a red t-shirt with 'TRME' written on it, standing on the badminton court with green walls and a wooden floor, elaborating on the technique with his hand gestures. And the audio continues with, \"Only then do you use other parts of your racket arm like your elbow, forearm, and wrist.\" \n\nThe video then focuses on the woman in the red 'TRME' t-shirt and white skirt on the badminton court. While holding a racket and gesturing with her hands. The audio elaborates, \"A third and final tip is to generate a fast racket swing speed, which is the ultimate goal of a powerful smash. And you want to try and start this swing as late as possible. If you start it too early, you'll lose your speed, and therefore your power.\" Text appears on the screen reading \"You need a fast racket swing speed at the contact point\" accompanied by a stopwatch icon. The woman demonstrates the technique, emphasizing a quick racket swing, with the camera occasionally widening to emphasizing the woman. 
\n\nThe scene transitions back to the man standing on the same court, who continues the explanation while holding a racket. The audio encourages, \"If you want to see our in-depth smash tutorial with lots more tips, then check out that video on our channel. And lastly please give this video a like, obviously smash the subscribe button, and we'll see on another video very soon.\" Text on the screen reads \"SMASH TUTORIAL! Badminton SMASH Tutorial - Improve Your POWER and...\" with an arrow pointing downwards. The man demonstrates a badminton smash, followed by a 'SUBSCRIBE' button appearing above his head with 'BADMINTON INSIGHT' and a bell icon. The video concludes with the man on the same court, with the text above his head fading out, still standing on the court while holding the racket."
+ }
+ ],
+ "events": [
+ "Man standing on a badminton court holding a racket in his left hand",
+ "Man wearing a red t-shirt with 'TRME' written on it",
+ "Man explaining three tips to improve smash power",
+ "Man using his right hand gestures and holding up three fingers",
+ "Icons representing a ruler, a refresh symbol, and a stopwatch appear above the man's head",
+ "Audio starts with man saying 'In this video we're going to give you three tips to improve your smash power.'",
+ "Scene transitions to a woman in a similar setting",
+ "Woman wearing a red 'TRME' t-shirt and a white skirt",
+ "Woman holding a racket in her left hand",
+ "Woman explaining and gesturing with her right hand while holding racket in left hand",
+ "Text appears above the woman's head reading 'Strike the shuttle \u22480.5 metres in front of you.'",
+ "Audio with a different voice saying 'Tip number one is to hit the shuttle approximately half a meter in front of you. This is because your power comes from rotating your body forwards into the shot.'",
+ "Woman demonstrates a badminton serve",
+ "Red arrow measures the woman's jump from ground to air as she hits shuttlecock",
+ "Woman performs a badminton serve in mid-action with racket raised and body in motion",
+ "Background shows green walls and wooden floor",
+ "Scene transitions back to man in red 'TRME' t-shirt and white shorts on same court",
+ "Man holding the racket and continuing demonstration",
+ "Man gesturing with his hands as he explains",
+ "Audio continues 'And this is our second tip: you need to rotate. We see a lot of people just using their arm to create power, but this is wrong.'",
+ "Clip shows man raising his left arm and showing a gesture of creating power",
+ "Close-up of man explaining while holding racket with both hands",
+ "Red sign 'X' appears in frame",
+ "Man in red t-shirt and red shorts performs a powerful jump smash on badminton court",
+ "Professional environment with banners from sponsors like Yonex, HSBC, and L'Equipe",
+ "Man strikes shuttlecock mid-air with great force, action repeated twice",
+ "Audio describes technique 'You actually need to generate power from your legs, which enables you to rotate your hips, then torso and finally your shoulder.'",
+ "Text on screen reads 'Rotate your legs, hips, core, shoulder, then arm' with circular arrow icon",
+ "Scene transitions back to man in red t-shirt with 'TRME' written on it standing on the badminton court with green walls and wooden floor",
+ "Man elaborating on technique with hand gestures",
+ "Audio continues 'Only then do you use other parts of your racket arm like your elbow, forearm, and wrist.'",
+ "Video focuses on woman in red 'TRME' t-shirt and white skirt on badminton court",
+ "Woman holding racket and gesturing with hands",
+ "Audio elaborates 'A third and final tip is to generate a fast racket swing speed, which is the ultimate goal of a powerful smash. And you want to try and start this swing as late as possible. If you start it too early, you'll lose your speed, and therefore your power.'",
+ "Text appears on screen reading 'You need a fast racket swing speed at the contact point' with stopwatch icon",
+ "Woman demonstrates technique, emphasizing quick racket swing",
+ "Camera occasionally widens to emphasize the woman",
+ "Scene transitions back to man standing on same court",
+ "Man continues explanation while holding racket",
+ "Audio encourages 'If you want to see our in-depth smash tutorial with lots more tips, then check out that video on our channel. And lastly please give this video a like, obviously smash the subscribe button, and we'll see on another video very soon.'",
+ "Text on screen reads 'SMASH TUTORIAL! Badminton SMASH Tutorial - Improve Your POWER and...' with arrow pointing downwards",
+ "Man demonstrates a badminton smash",
+ "'SUBSCRIBE' button appears above man's head with 'BADMINTON INSIGHT' and bell icon",
+ "Video concludes with man on same court, text above his head fading out while still standing on court holding racket"
+ ]
+ },
+ {
+ "video": "video/5.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a blank gray screen transitioning to a black screen displaying a rotating 3D logo that reads 'INSIDE edition' in white and black text. The logo rotates and then the scene shifts to a street flooded with water. Two women are carrying a third woman on their shoulders to help her navigate through the floodwater. The one being carried is wearing a black jacket and blue jeans, while one of the carriers on the left is dressed in a floral shirt, black pants, and pink shoes, and the other on the right is in an orange shirt with white text, blue jeans, and green shoes. In the background, a few people stand observing the scene, including a man in a red shirt and another in a white shirt with black patterns and a camera, with some buildings and a bicycle visible. The 'INSIDE edition' logo appears in the bottom left corner of the screen.\n\nA voice narrates over music: \u201cIt\u2019s a photo that seems to show ordinary people lending a hand to a flood victim. Or does it? The woman getting the free ride is a TV reporter who, get this, doesn't want to get her expensive shoes wet. Lydia Cumming is a 24-year-old reporter who hitched a ride with this couple while reporting on devastating floods in Mexico. That's her TV microphone in her right hand. Her press credentials dangle around her neck. As for those expensive shoes, they're nice and dry. The internet lit up with mocking memes like these. There's Lydia being carried to safety in Star Wars. And who knew she was on the Titanic? Twitter lit up too. Why doesn't she take off her shoes instead? Another post, you would think that wearing maybe rain boots in a flooded area would be common sense. Makes sense to us...\" The narration continues while the camera zooms in on this photo. Then the scene shows this photo from down to up. The woman being carried holds a microphone with a red windscreen and a blue logo. 
\n\nThe scene shifts to another reporter holding a microphone with a red windscreen and a blue logo and speaking with the subtitle \u201cLydia Cumming\u201d. The narration continues while the camera zooms in on this photo. Then a red arrow points to the microphone in the photo. Then a red arrow points to her work card. The camera zooms in on her shoes. The scene abruptly transitions to a desert-like environment with explosions and science fiction movie characters running in the background, including a small spherical robot. The three women in the photo appear on the left with the matting. The 'INSIDE edition' logo stays in the bottom left corner of the screen. The narration states, followed by a short musical interlude.\n\nNext, the scene transitions again, showing the three women appear in a different environment resembling a famous movie scene with dark, watery surroundings with the matting. Two characters are seen clinging to debris in the water. Soon, it shifts to a computer screen displaying a Twitter page; a tweet reads, \u201cWhy doesn't she take off her shoes instead?\u201d alongside the picture of the three women from the previous scenes.\n\nAs the video progresses, the focus is on another tweet stating, \u201cYou would think that wearing maybe rain boots in a flooded area would be common sense.\u201d \n\nThe final clip presents a closer view of the picture showing the two women carrying the third woman through the flooded street. The video then transitions to a black screen where a red line draws a rectangular shape, followed by the 'INSIDE edition' logo appearing in the center. Below the logo, the text reads 'LIKE US ON FACEBOOK,' 'FOLLOW US ON TWITTER,' 'Check Local Listings,' and copyright information.\n\n The final audio features a man speaking, followed by music playing. The video ends branding the 'INSIDE edition' and offers the essential directives to follow on social media and check local listings."
+ }
+ ],
+ "events": [
+ "Video begins with a blank gray screen transitioning to a black screen.",
+ "A rotating 3D logo that reads 'INSIDE edition' in white and black text is displayed.",
+ "Logo rotates and scene shifts to a street flooded with water.",
+ "Two women are carrying a third woman on their shoulders through the floodwater.",
+ "The woman being carried is wearing a black jacket and blue jeans.",
+ "One of the carriers on the left is dressed in a floral shirt, black pants, and pink shoes.",
+ "The other carrier on the right is in an orange shirt with white text, blue jeans, and green shoes.",
+ "In the background, a man in a red shirt and another in a white shirt with black patterns and a camera are seen, along with some buildings and a bicycle.",
+ "'INSIDE edition' logo appears in the bottom left corner of the screen.",
+ "A voice narrates over music: \u201cIt\u2019s a photo that seems to show ordinary people lending a hand to a flood victim. Or does it?\u201d",
+ "The narration reveals the woman getting the free ride is a TV reporter with expensive shoes she doesn't want to get wet.",
+ "The reporter is identified as Lydia Cumming, a 24-year-old.",
+ "Lydia holds her TV microphone with a red windscreen and a blue logo in her right hand.",
+ "Her press credentials dangle around her neck.",
+ "The narration comments on Lydia's nice, dry shoes.",
+ "The internet lights up with mocking memes featuring Lydia in various scenes.",
+ "The camera zooms in on the photo of the three women, highlighting Lydia's microphone and press credentials.",
+ "The scene abruptly transitions to a desert-like environment with explosions and science fiction movie characters running in the background, including a small spherical robot.",
+ "The three women in the photo appear on the left with the matting.",
+ "'INSIDE edition' logo remains in the bottom left corner of the screen.",
+ "Narration continues with a short musical interlude.",
+ "Scene transitions to the three women in a different famous movie scene with dark, watery surroundings.",
+ "Two characters are seen clinging to debris in the water.",
+ "Scene shifts to a computer screen displaying a Twitter page.",
+ "A tweet reads, \u201cWhy doesn't she take off her shoes instead?\u201d with the same picture of the three women.",
+ "Another tweet states, \u201cYou would think that wearing maybe rain boots in a flooded area would be common sense.\u201d",
+ "Final clip presents a closer view of the picture showing the two women carrying the third woman through the flooded street.",
+ "Video transitions to a black screen where a red line draws a rectangular shape.",
+ "'INSIDE edition' logo appears in the center with directives to 'LIKE US ON FACEBOOK,' 'FOLLOW US ON TWITTER,' 'Check Local Listings,' and copyright information.",
+ "Final audio features a man speaking, followed by music playing.",
+ "Video ends with 'INSIDE edition' branding and social media directives."
+ ]
+ },
+ {
+ "video": "video/6.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a young woman wearing white headphones and a black jacket with a pink t-shirt underneath, speaking into a microphone. In the lower left corner, there is a smaller video overlay showing another young woman with the text in English: \"nice tits, keep at it,\" beyond which a comment section includes remarks such as \"They're called being pretty and having a personality,\" \"Only a woman can be a woman,\" and \"You're a man.\" The dark background emphasizes the speaker, who appears engaged in responding to the overlay comments.\n\nThe audio reveals the woman in the smaller video saying, \"`Hey, can we stop using gendered compliments like as a people? Because first of all, um, that's a term used by and for black women.`\" She says with her head occasionally moving, while the young woman with pink t-shirt reacts with confusion, as indicated by she is frowning. \n\nThe video progresses with a close-up of the woman in the smaller video, with the video now widening. She continues to speak with\" Second of all, I am not a famale person, um I don't care. I don't give a single **** what I look like to you.\" The scene then back to the frame maintaining focus on the woman with white headphones who continues listening to the woman in the smaller video. The audio continues with, \"I am non-binary, and I'm really sick of getting comments that say you go girl nice job, sis like I don't care if they`'re compliments. They misgender me and they make me feel bad. I'm not a sis for so many reasons gender reasons, race reasons, and I don't want to keep making videos about this.`\" The next frame shifts to the woman in the pink t-shirt speaking into the headphone, \"`Why are you triggered about people being nice to you? This person had a kind comment. 
And you're going to lose your freaking mind because you didn'`t like the gendered compliment they used, get over it,\" in one shot showing a colorful thumb-up appearing in the frame.\n\nIn the final clip, the same visual elements are continued, with the woman still in white headphones, speaking. The overlay retains the comment: \"Here'`s My rule: If you have rules about being complimented, you aren't getting one.`\" The audio concludes with a candid exchange: \"` Somebody said, here`'s my rule. If you have rules about being complimented, you aren't getting one. Yeah, fine then. I just won't, I just won't talk to you. If you're making it that difficult, go kick rocks.`\" "
+ }
+ ],
+ "events": [
+ "Video begins with young woman wearing white headphones, black jacket, and pink t-shirt speaking into a microphone",
+ "Smaller video overlay in lower left corner shows another young woman",
+ "Text in English in overlay: 'nice tits, keep at it'",
+ "Comment section includes remarks: 'They're called being pretty and having a personality', 'Only a woman can be a woman', 'You're a man'",
+ "Dark background emphasizes the speaker",
+ "Speaker appears engaged in responding to overlay comments",
+ "Audio reveals woman in smaller video speaks: 'Hey, can we stop using gendered compliments like as a people? Because first of all, um, that's a term used by and for black women.'",
+ "Woman in smaller video moves her head occasionally",
+ "Woman in pink t-shirt reacts with confusion; she is frowning",
+ "Video progresses with close-up of woman in smaller video",
+ "Video widens to show more of the woman in smaller video",
+ "Woman continues to speak: 'Second of all, I am not a female person, um I don't care. I don't give a single **** what I look like to you.'",
+ "Scene returns to woman with white headphones listening",
+ "Audio continues: 'I am non-binary and I'm really sick of getting comments that say you go girl nice job, sis like I don\u2019t care if they\u2019re compliments. They misgender me and they make me feel bad. I\u2019m not a sis for so many reasons gender reasons, race reasons, and I don't want to keep making videos about this.'",
+ "Next frame shifts to woman in pink t-shirt speaking into microphone, 'Why are you triggered about people being nice to you? This person had a kind comment. And you're going to lose your freaking mind because you didn't like the gendered compliment they used, get over it.'",
+ "Colorful thumb-up appears in the frame",
+ "Final clip continues with woman in white headphones speaking",
+ "Overlay retains comment: 'Here\u2019s My rule: If you have rules about being complimented, you aren\u2019t getting one.'",
+ "Audio concludes with exchange: 'Somebody said, here\u2019s my rule. If you have rules about being complimented, you aren\u2019t getting one. Yeah, fine then. I just won\u2019t, I just won\u2019t talk to you. If you\u2019re making it that difficult, go kick rocks.'"
+ ]
+ },
+ {
+ "video": "video/7.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with three individuals standing on a baseball field, setting the scene for a sports challenge between Division 1 baseball and softball players. The woman on the left is wearing a blue 'McNeese Softball' shirt, the man in the middle is in a purple 'NSU Baseball' shirt, and the man on the right is dressed in a dark blue shirt with the text '13 is our lucky number Seniors 2013'. The background displays a grassy field with trees under a cloudy sky. Text overlays on the screen read 'D1 Baseball vs D1 Softball' and 'Have a D1 baseball pitcher pitch to a D1 softball player,' establishing the premise of the challenge. The audio features a man saying, \"All right, we're flipping the roles today. This time we got a D1 pitcher, Jason Ortiz, pitching to D1 softball, Caroline Settle. So I'll ask for it. Jason, how are you feeling? What's going to be the result?\"\n\nThe video transitions to a close-up of the man in the purple 'NSU' shirt and cap, then to the woman in the blue 'McNeese Softball' shirt, wearing a black helmet and walking towards the batting area. The fenced baseball field in the background has bleachers and a building. The text overlay reads 'Caroline's ABs,' indicating the focus is on Caroline's at-bats. The pitcher in the purple 'NSU' shirt prepares to pitch, while the batter readies herself. The audio features a man saying, \"Three straight strikes right here. Okay. Carol how are you feeling?\" and a woman whispering, \"I'm kind of scared.\" followed by the man replying, \"Yeah I figured. first pitch. yep.\"\n\nNext, the video continues the challenge, with the batter missing the three pitches thrown by the pitcher in the purple 'NSU' shirt. The text overlay says '1. Strikeout,' signaling Caroline's first strikeout. A close-up of the batter follows, with another text overlay stating, 'We let her use a softball bat to make it more fair.' 
The audio includes a man saying, \"Switching bats to the softball bat.\" accompanied by the sound of a bicycle bell ringing and a man laughing.\n\nThe challenge progresses with Caroline at bat again, missing another \nthree pitches. The grassy background with trees and a cloudy sky reappears. The text overlay updates to '2. Strikeout,' indicating her second strikeout. A close-up of the pitcher in mid-throw concludes this part. The audio has a man commenting, \"Oh no. Yeah, not a chance. Yup.\" followed by the sound of a ball bouncing and another man speaking.\n\nIn the subsequent clip, Caroline misses another three pitches, updating the text overlay to '3. Strikeout.' The fenced baseball field, bleachers, and building are visible in the background. The pitcher walks towards the batter, possibly to discuss or provide feedback. The audio captures a conversation where a man says, \"No. This is Carol's bonus swing. Jason's second about 10 miles on.\"\n\nFinally, the challenge concludes with another missed pitch by Caroline. The text overlay on-screen reads 'Follow and comment what we should do next!' in red, indicating a call for viewer engagement. As the pitcher prepares for another throw, the audio features a man saying, \"Now we're off. Okay. Yep, that's enough. All right, it's okay,\" with the background noise of wind blowing and footsteps walking on gravel, maintaining the focus on the ongoing challenge."
+ }
+ ],
+ "events": [
+ "Three individuals standing on a baseball field",
+ "Woman on the left wearing a blue 'McNeese Softball' shirt",
+ "Man in the middle wearing a purple 'NSU Baseball' shirt",
+ "Man on the right wearing a dark blue shirt with '13 is our lucky number Seniors 2013'",
+ "Grassy field with trees under a cloudy sky in the background",
+ "Text overlays: 'D1 Baseball vs D1 Softball' and 'Have a D1 baseball pitcher pitch to a D1 softball player'",
+ "Audio: Man says, 'All right, we're flipping the roles today. This time we got a D1 pitcher, Jason Ortiz, pitching to D1 softball, Caroline Settle. So I'll ask for it. Jason, how are you feeling? What's going to be the result?'",
+ "Transition to close-up of man in purple 'NSU' shirt and cap",
+ "Close-up of woman in blue 'McNeese Softball' shirt, wearing a black helmet and walking towards the batting area",
+ "Fenced baseball field with bleachers and a building in the background",
+ "Text overlay: 'Caroline's ABs'",
+ "Pitcher in the purple 'NSU' shirt prepares to pitch",
+ "Batter readies herself",
+ "Audio: Man says, 'Three straight strikes right here. Okay. Carol how are you feeling?'",
+ "Audio: Woman whispers, 'I'm kind of scared.'",
+ "Audio: Man replies, 'Yeah I figured. first pitch. yep.'",
+ "Batter misses three pitches thrown by the pitcher in the purple 'NSU' shirt",
+ "Text overlay: '1. Strikeout'",
+ "Close-up of batter",
+ "Text overlay: 'We let her use a softball bat to make it more fair.'",
+ "Audio: Man says, 'Switching bats to the softball bat.'",
+ "Sound of a bicycle bell ringing",
+ "Man laughing in the background",
+ "Caroline at bat again, missing another three pitches",
+ "Grassy background with trees and a cloudy sky reappears",
+ "Text overlay updates to '2. Strikeout'",
+ "Close-up of the pitcher in mid-throw",
+ "Audio: Man comments, 'Oh no. Yeah, not a chance. Yup.'",
+ "Sound of a ball bouncing",
+ "Another man speaking",
+ "Caroline misses another three pitches",
+ "Text overlay updates to '3. Strikeout'",
+ "Fenced baseball field with bleachers and building visible in the background",
+ "Pitcher walks towards the batter",
+ "Audio captures conversation: Man says, 'No. This is Carol's bonus swing. Jason's second about 10 miles on.'",
+ "Caroline misses another pitch",
+ "Text overlay on-screen: 'Follow and comment what we should do next!' in red",
+ "Pitcher prepares for another throw",
+ "Audio: Man says, 'Now we're off. Okay. Yep, that's enough. All right, it's okay.'",
+ "Background noise of wind blowing",
+ "Sound of footsteps walking on gravel"
+ ]
+ },
+ {
+ "video": "video/8.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "At the beginning of the video, a woman stands in a modern kitchen, holding a pineapple. She brings the pineapple close to the camera, revealing a label that says \"PINK PINEAPPLE.\" She then continues to hold the pineapple and places it on the counter.\n\nIn the audio, a woman says, \"So I was at Costco yesterday and I found pink pineapple. I never tried it before, so I thought it'd be fun to try it together. I think they removed the top so that way, you can't replant it.\" She stops to take a breath before continuing to speak. Text corresponding to her speech appears on the screen.\n\nNext, the woman holds the pink pineapple and starts cutting it with a large knife. She continues to cut the pineapple, hoping it tastes good. She struggles a bit with cutting the pineapple, and the text \"*SEND HELP*\" appears on the screen. Finally, she successfully cuts the pineapple in half, revealing the pink interior, and holds up the two halves for the camera.\n\nIn the audio, the woman says, \"it's so cute so this was $11.99 so kind of expensive so hopefully this tastes good.\" Sounds of crumpling and crinkling accompany her words. Text corresponding to her speech appears on the screen.\n\nThe woman then holds up the two halves of the pink pineapple, showing them to the camera. She brings one half closer to her face and smells it, then places it on the counter and begins cutting a small piece from one of the halves.\n\nIn the audio, the woman says, \"it's pink. It smells good. Smells really pineappley. First bite, let's see if this is worth.\" There are shuffling sounds before she speaks again. Text corresponding to her speech appears on the screen.\n\nThe woman holds a piece of the pink pineapple close to the camera. She takes a bite, chews thoughtfully, and continues eating. She appears satisfied with the taste.\n\nIn the audio, the woman says, \"And it's slightly sweeter. It's good. I don't know if it's worth the money, but not bad.\" You hear her talking while chewing and swallowing. Text corresponding to her speech appears on the screen.\n\nFinally, the video shows another woman joining in, also in the modern kitchen. She is holding a piece of pink pineapple and is chewing, contemplating the taste. The text on the screen reads, \"I think I prefer regular pineapples...\" which concludes the scene, indicating a preference for traditional pineapples over the pink variety.\n\nIn the audio, the woman says, \"I don't think it tastes kind of weird. Would you spend $12 on this?\" \"If it was bigger.\" \"I think I prefer regular pineapples.\" Another person responds in agreement. Text corresponding to her speech appears on the screen.\n\nOverall, the video presents a light-hearted and amusing story of trying an unusual pink pineapple, with interesting and reflective comments displayed and spoken throughout the clips."
+ }
+ ],
+ "events": [
+ "A woman stands in a modern kitchen, holding a pineapple.",
+ "The woman brings the pineapple close to the camera, showing a label that says 'PINK PINEAPPLE'.",
+ "The woman places the pineapple on the counter.",
+ "In the audio, a woman mentions she found pink pineapple at Costco and wants to try it together.",
+ "Text corresponding to her speech appears on the screen.",
+ "The woman holds the pink pineapple and starts cutting it with a large knife.",
+ "She struggles with cutting the pineapple.",
+ "Text '*SEND HELP*' appears on the screen.",
+ "The woman successfully cuts the pineapple in half, revealing the pink interior.",
+ "She holds up the two halves for the camera.",
+ "In the audio, the woman mentions the cost of the pineapple and hopes it tastes good.",
+ "Sounds of crumpling and crinkling accompany her words.",
+ "Text corresponding to her speech appears on the screen.",
+ "The woman holds up the two halves of the pink pineapple to the camera.",
+ "She brings one half closer to her face and smells it.",
+ "She places the half on the counter and cuts a small piece from it.",
+ "In the audio, the woman comments on the smell and prepares for the first bite.",
+ "Shuffling sounds are heard before she speaks again.",
+ "Text corresponding to her speech appears on the screen.",
+ "The woman holds a piece of the pink pineapple close to the camera.",
+ "She takes a bite, chews thoughtfully, and continues eating.",
+ "In the audio, the woman says the pineapple is slightly sweeter, but questions its value.",
+ "You hear her talking while chewing and swallowing.",
+ "Text corresponding to her speech appears on the screen.",
+ "Another woman joins in the kitchen, holding a piece of pink pineapple and chewing.",
+ "The text on the screen reads, 'I think I prefer regular pineapples...'",
+ "In the audio, the woman expresses a preference for regular pineapples, questions the value, and another person agrees.",
+ "Text corresponding to her speech appears on the screen."
+ ]
+ },
+ {
+ "video": "video/9.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video, sparing no detail. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a scene of a man on a boat, sailing in the ocean. The boat is moving forward, and several other boats with colorful sails are visible in the distance. The sky is clear and blue, and there is a rocky coastline to the right. A man's high-pitched voice narrates, \"Every morning, the local fishermen bring back their catch to sell on the side of the road. That's how most people get their protein in small Balinese fishing villages like this one.\"\n\nNext, the view shifts to a closer look at one of the boats with a striped sail splashing through the water, accompanied by background music. The scene then transitions to a large fish lying on the ground, indicating it has been caught. The fish is silver and shiny, with a large tail fin. A second man's low-pitched voice begins to speak as the frame moves to a bustling market with various stalls selling fresh produce and fish. People are seen walking around and shopping, as the market buzzes with activity. The first man's voice interjects again, providing details about the market scene.\n\nThe video then transitions to a close-up view of a market stall displaying various fish. A woman wearing red pants with white polka dots is handling some items near the fish arranged in a green plastic basin, indicating a busy market setting. \"And whenever I see roadside fish for sale,\" a high-pitched voice states, \"there's a good chance that those are the exact same types of fish that I'll see in the water when I go spearfishing later in the day.\" The voice trails off as the screen shows a person cutting a large fish on the ground, with a green basin and a ladle nearby.\n\nNext, an underwater scene unfolds featuring a diver in a camouflage wetsuit equipped with a GoPro camera. The diver swims close to the sandy ocean floor, followed by a street scene showing a wooden table displaying a large fish and some pumpkins. Ending the second clip, we see elements of an outdoor market or street vendor setting, complete with cracked pavement and a parked car.\n\nThe narrative continues with a close-up view of a large fish lying on a wooden table next to three pumpkins, reminiscent of the street vendor's setting. \"So, when I saw this giant trevally for sale this morning, I had high hopes that I might catch something similar with my speargun and lo and behold on my very first dive I came across what was basically the exact same fish,\" a man explains, set to background music. The focus zooms in and out on various angles of the fish and surroundings.\n\nThe scenery shifts back to the ocean during sunset, featuring a boat moving forward with two people on board. The calm and serene moment is accompanied by hues of orange and blue skies, contrasting with the earlier hustle and bustle of the market.\n\nThe underwater adventures continue as we see a rocky ocean floor with various corals and marine life in a bath of blue hues. A fish swims slowly near the surface, and suddenly, a black speargun with yellow accents appears, aimed at the fish. \"I almost landed on top of it. It wasn't afraid of me at all,\" the man confesses. The speargun remains steady, pointing at the dark-colored fish swimming near the rocks. The speargun fires but misses the fish. He concludes, \"In fact, I was so close to it that I didn't even feel the need to line up my gun properly,\" just as the sound of a gun being cocked is heard.\n\nAn ongoing underwater scene follows, keeping the speargun aimed precisely at the large fish moving across the rocky ocean floor. The speargun trails the fish effortlessly, maintaining a steady focus. \"Which in hindsight was pretty stupid because when I pulled the trigger on what should have been the easiest shot in the world I missed,\" the voice narrates, as the video continues to follow the calm, undisturbed movements of the fish with serene background music and a series of beeps rounding off the audio.\n\nFinally, the climax shows an underwater scene of a large fish swimming near the ocean floor. The video transitions to a magnificent view of the water's surface shot upwards, bathing everything in clear blue hues. Then, a figure swims near the surface, holding a speargun. The clear water and shimmering sunlight create a beautiful tableau. The narrative concludes, \"This fish was a massive target from point-blank range and I was so confident in my ability to hit it that I got lazy and paid the price.\"\n\nAs the scene ends, with the ocean surface bathed in the breathtaking purple and pink hues of the sunset, the serenity is pierced by an unexpected scream. The final text on the screen reads 'SUBSCRIBE' in bold pink letters, calling viewers to action."
+ }
+ ],
+ "events": [
+ "Man sailing on a boat in the ocean",
+ "Several boats with colorful sails visible",
+ "Clear blue sky and rocky coastline on the right",
+ "High-pitched man's voice narrates about local fishermen's morning catch",
+ "View shifts to a closer look at a striped sailboat splashing through water with background music",
+ "Large silver shiny fish on the ground, indicating it has been caught",
+ "Low-pitched man's voice begins to speak",
+ "Frame moves to bustling market with various stalls selling fresh produce and fish",
+ "People walking around and shopping in the market",
+ "First man's voice provides details about the market scene",
+ "Close-up view of a market stall displaying various fish",
+ "Woman in red pants with white polka dots handling items near fish in a green plastic basin",
+ "High-pitched voice states about roadside fish sale being similar to those found when spearfishing",
+ "Screen shows person cutting a large fish on the ground with a green basin and ladle nearby",
+ "Underwater scene unfolds with a diver in camouflage wetsuit with GoPro camera",
+ "Diver swims close to sandy ocean floor",
+ "Street scene showing wooden table displaying a large fish and some pumpkins",
+ "Outdoor market or street vendor setting with cracked pavement and a parked car",
+ "Close-up view of large fish lying on wooden table next to three pumpkins",
+ "Man explains about seeing the giant trevally for sale and catching similar fish on first dive",
+ "Focus zooms in and out on various angles of fish and surroundings",
+ "Scenery shifts back to ocean during sunset, featuring a boat moving forward with two people on board",
+ "Calm and serene moment with hues of orange and blue skies",
+ "Underwater adventures continue with rocky ocean floor and various corals and marine life",
+ "Fish swims slowly near the surface, black speargun with yellow accents appears",
+ "Man confesses almost landed on top of fish and it wasn't afraid",
+ "Speargun fires but misses the fish",
+ "Sound of gun being cocked is heard",
+ "Underwater scene follows with speargun aimed at large fish moving across rocky ocean floor",
+ "Speargun trails fish with steady focus",
+ "Man narrates about missing an easy shot",
+ "Calm, undisturbed movement of the fish with serene background music and series of beeps",
+ "Underwater scene of large fish swimming near ocean floor",
+ "View transitions to magnificent shot of water's surface with clear blue hues",
+ "Figure swims near the surface holding a speargun",
+ "Clear water and shimmering sunlight create a beautiful tableau",
+ "Narrative concludes about laziness and overconfidence causing missed shot",
+ "Ocean surface bathed in purple and pink hues of sunset",
+ "Unexpected scream pierces the serenity",
+ "Final text on screen reads 'SUBSCRIBE' in bold pink letters"
+ ]
+ },
+ {
+ "video": "video/10.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a man dressed in a maroon hoodie displaying a graphic of a red truck and a yellow arrow pointing toward it. The background features various model cars and a microphone. The man in the hoodie states in the audio with the text, \"This is the coolest truck you've never heard of, and it's stupid fast\", followed by background music. The scene transitions to a black muscle car from the 1960s parked in front of a contemporary building. The narrator continues in the audio with the text, \"Back in the 1960s, muscle cars were all the craze,\" as the camera shows the rear view of the car. The clip then shifts to an indoor showroom showcasing a silver Pontiac GTO and a Camaro. The narrator adds in the audio with the text, \"Including vehicles like the Pontiac GTO and the Camaro.\"\n\nThe scene continues in the showroom with a silver Pontiac GTO displayed on a rotating platform. The camera transitions to a black Camaro SS showcased similarly. The camera then zooms into a close-up of a chrome wheel. The scene shows the rear side of a black muscle car, likely a Mustang. The narrator explains in the audio with the text, \"But when the 1970s came along, there was an issue.\" The camera pans to the side view of the car, where the narrator continues in the audio with the text, \"Emissions regulations cut down the thriving muscle car trend.\" The final part of this clip presents the front side of the black muscle car as the narrator concludes in the audio with the text, \"And slowed production down massively.\"\n\nThe video progresses with a close-up of the front side of the black muscle car. The scene then switches to older, grainy footage of an orange truck with a white stripe. 
The narrator remarks in the audio with the text, \"That's until Dodge found a little loophole.\" The camera shows the truck's rear as the narrator adds in the audio with the text, \"Trucks didn't need to meet all the strict emissions regulations.\" The scene culminates with the truck driving under a large airplane on-screen while the narrator concludes in the audio with the text, \"So they built a muscle car in disguise.\"\n\nThe next clip introduces a close-up of the rear side of a red truck with yellow and white detailing, labeled 'LIL' RED TRUCK'. The narrator states, \"The LIL' Red Express was a beast.\" The truck drives on a grassy field, tires smoking, indicative of powerful acceleration. The scene transitions to a close-up of the truck's engine. The narrator explains in the audio with the text, \"It had a 360 police engine under the hood.\" The scene further details the truck's power, as the narrator continues in the audio with the text, \"Making 225 horsepower.\" The narrator concludes in the audio with the text, \"It could even thrash the 1978 Corvette in a drag race.\"\n\nThe video wraps up, returning to the initial setting with the man in the maroon hoodie with a yellow truck. The background is consistent with various model cars and a microphone. The narrator invites the audience in the audio with the text, \"For more car content, follow me,\" followed by the sound of a camera shutter, marking the end of the video."
+ }
+ ],
+ "events": [
+ "A man dressed in a maroon hoodie is displayed with a graphic of a red truck and a yellow arrow pointing toward it.",
+ "The background features various model cars and a microphone.",
+ "The man in the hoodie states in the audio, 'This is the coolest truck you've never heard of, and it's stupid fast,' followed by background music.",
+ "The scene transitions to a black muscle car from the 1960s parked in front of a contemporary building.",
+ "The narrator continues in the audio, 'Back in the 1960s, muscle cars were all the craze,' as the camera shows the rear view of the car.",
+ "The clip shifts to an indoor showroom showcasing a silver Pontiac GTO and a Camaro.",
+ "The narrator adds in the audio, 'Including vehicles like the Pontiac GTO and the Camaro.'",
+ "The scene continues in the showroom with a silver Pontiac GTO displayed on a rotating platform.",
+ "The camera transitions to a black Camaro SS showcased similarly.",
+ "The camera then zooms into a close-up of a chrome wheel.",
+ "The scene shows the rear side of a black muscle car, likely a Mustang.",
+ "The narrator explains in the audio, 'But when the 1970s came along, there was an issue.'",
+ "The camera pans to the side view of the car.",
+ "The narrator continues in the audio, 'Emissions regulations cut down the thriving muscle car trend.'",
+ "The final part of this clip presents the front side of the black muscle car.",
+ "The narrator concludes in the audio, 'And slowed production down massively.'",
+ "The video progresses with a close-up of the front side of the black muscle car.",
+ "The scene switches to older, grainy footage of an orange truck with a white stripe.",
+ "The narrator remarks in the audio, 'That's until Dodge found a little loophole.'",
+ "The camera shows the truck's rear.",
+ "The narrator adds in the audio, 'Trucks didn't need to meet all the strict emissions regulations.'",
+ "The scene culminates with the truck driving under a large airplane on-screen.",
+ "The narrator concludes in the audio, 'So they built a muscle car in disguise.'",
+ "The next clip introduces a close-up of the rear side of a red truck with yellow and white detailing labeled 'LIL' RED TRUCK'.",
+ "The narrator states, 'The LIL' Red Express was a beast.'",
+ "The truck drives on a grassy field, tires smoking, indicative of powerful acceleration.",
+ "The scene transitions to a close-up of the truck's engine.",
+ "The narrator explains in the audio, 'It had a 360 police engine under the hood.'",
+ "The scene further details the truck's power.",
+ "The narrator continues in the audio, 'Making 225 horsepower.'",
+ "The narrator concludes in the audio, 'It could even thrash the 1978 Corvette in a drag race.'",
+ "The video wraps up returning to the initial setting with the man in the maroon hoodie with a yellow truck.",
+ "The background is consistent with various model cars and a microphone.",
+ "The narrator invites the audience in the audio, 'For more car content, follow me,' followed by the sound of a camera shutter, marking the end of the video."
+ ]
+ },
+ {
+ "video": "video/11.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video, sparing no detail. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video is split into two screens. The top scene starts with a woman in a white sweater enjoying some food, with the text 'How Many Bites?' prominently displayed in red at the bottom of the frame. As a voice prompts, \"How many bites should it take to eat these foods?\", the bottom scene features a man in a black jacket and white cap speaking into a microphone, against a backdrop featuring a gaming chair and a neon sign. The voice continues, \"A chicken nugget is like two bites. You dip it in sauce and eat like 80% of it on the first bite and then you just finish it off.\" Next, a close-up of four golden-brown chicken nuggets appears, one nugget broken open to reveal the white meat inside, with 'Two Bites' text in red below. Returning to the man speaking, the audio mentions dipping food in sauce, and the frame shows a hand dipping a chicken nugget into a bowl of red sauce, with the text 'Two Bites' displayed again.\n\nThe transition to a close-up of sushi rolls, featuring ingredients like cucumber and avocado sprinkled with sesame seeds, is narrated by a voice stating, \"Sushi is one bite, always. Very few exceptions like nigiri.\" The text 'One Bite' appears in red at the bottom. The scene then shifts back to the man in the black jacket and white cap. The next frame showcases sushi nigiri held by chopsticks, with text again asserting 'One Bite.' The voice continues, \"Oreo is also one bite, after you dip it in milk of course. I know some people like to do like...\" as the bottom frame continues with the man speaking. Finally, it shows a stack of chocolate sandwich cookies, one partially bitten, and the text 'One Bite.' The clip concludes with a hand holding a split chocolate sandwich cookie, revealing the cream, accompanied by 'Subscribe if you do this' in red.\n\nThen, there's a close-up of a chocolate sandwich cookie with an exaggerated cream filling, with 'Subscribe if you do this' at the bottom. The voice says, \"Ooh. I think the first pizza roll is two bites because you have to check the temperature on it, but then it becomes one bite once you confirm if it's hot or not.\" Following this, golden-brown pizza rolls on a plate with a bowl of red sauce appear, with 'First Roll: 2 Bites' text displayed. The man continues speaking, followed by the same plate of pizza rolls with the text 'Rest of them: 1 Bite.'\n\nNext, it shows more golden-brown pizza rolls on a plate with sauce, and the text 'Rest of them: 1 Bite.' The voice resumes with, \"Hot dog. Maybe like five bites? I don't get how some people can eat munchkins in two bites. They're literally designed to be popped into your mouth. So one,\" along with a frame transition to three hot dogs with pickles, ketchup, and mustard, displaying '5 Bites.' Later the frame shows a box of assorted doughnut holes while he keeps speaking.\n\nThe scene cuts to a close-up of sugar-coated doughnut holes with 'One Bite' text. The voice says, \"can eat munchkins in two bites. They're literally designed to be popped into your mouth, so one.\" The camera shifts to a bowl of baby carrots with '3 or 4 Bites' displayed. He explains, \"Baby carrots. I like nibbling on those like a rabbit, so maybe like three or four...\" The video returns to the bowl of baby carrots with '3 or 4 Bites' text.\n\nFinally, the video features deviled eggs on a blue plate with a sprinkle of red seasoning, and 'One Bite' text at the bottom. The audio sums up, \"And then deviled eggs. Just one bite. You just gotta toss it in there,\" as the frame switches back to the speaking man against the same backdrop, and the sound of something being tossed in the air signals a playful completion, leaving the viewer with a sense of closure.\n\nThroughout the video, the consistent visuals of different bite-sized foods, paired with the man's upbeat, engaging commentary and the intermittent background music, create a dynamic and cohesive narrative centered around the enjoyment and portioning of various snacks."
+ }
+ ],
+ "events": [
+ "Top screen: Woman in white sweater enjoying food.",
+ "Text 'How Many Bites?' in red at bottom of the top scene frame.",
+ "Voice prompts, 'How many bites should it take to eat these foods?'",
+ "Bottom screen: Man in black jacket and white cap speaking into a microphone.",
+ "Background features a gaming chair and a neon sign.",
+ "Man says, 'A chicken nugget is like two bites.'",
+ "Close-up of four golden-brown chicken nuggets. One nugget is broken open to reveal white meat inside.",
+ "Text 'Two Bites' in red below the chicken nuggets.",
+ "Hand dips chicken nugget into a bowl of red sauce.",
+ "Text 'Two Bites' displayed again.",
+ "Voice states, 'Sushi is one bite, always. Very few exceptions like nigiri.'",
+ "Close-up of sushi rolls with ingredients like cucumber and avocado, sprinkled with sesame seeds.",
+ "Text 'One Bite' appears in red at the bottom.",
+ "Man in black jacket and white cap appears again.",
+ "Sushi nigiri held by chopsticks with text 'One Bite.'",
+ "Voice mentions, 'Oreo is also one bite, after you dip it in milk of course. I know some people like to do like...' ",
+ "Man speaks against background.",
+ "Stack of chocolate sandwich cookies, one partially bitten, with text 'One Bite.'",
+ "Hand holds a split chocolate sandwich cookie, revealing the cream.",
+ "Text 'Subscribe if you do this' in red.",
+ "Close-up of a chocolate sandwich cookie with exaggerated cream filling.",
+ "Text 'Subscribe if you do this' at the bottom.",
+ "Voice says, 'First pizza roll is two bites because you have to check the temperature, but then it becomes one bite.'",
+ "Golden-brown pizza rolls on a plate with a bowl of red sauce. Text 'First Roll: 2 Bites' displayed.",
+ "Man continues speaking.",
+ "Same plate of pizza rolls, text 'Rest of them: 1 Bite.'",
+ "Golden-brown pizza rolls on a plate with sauce, text 'Rest of them: 1 Bite.'",
+ "Voice says, 'Hot dog. Maybe like five bites? I don't get how some people can eat munchkins in two bites.'",
+ "Frame shows three hot dogs with pickles, ketchup, and mustard. Text '5 Bites.'",
+ "Frame shows a box of assorted doughnut holes.",
+ "Close-up of sugar-coated doughnut holes with text 'One Bite.'",
+ "Voice says, 'Can eat munchkins in two bites. They're designed to be popped into your mouth, so one.'",
+ "Camera shifts to bowl of baby carrots with '3 or 4 Bites' displayed.",
+ "Voice says, 'Baby carrots. I like nibbling on those like a rabbit, so maybe like three or four.'",
+ "Video returns to bowl of baby carrots with '3 or 4 Bites' text.",
+ "Deviled eggs on a blue plate with a sprinkle of red seasoning. Text 'One Bite' at the bottom.",
+ "Voice sums up, 'And then deviled eggs. Just one bite. You just gotta toss it in there.'",
+ "Frame switches back to the speaking man against the same backdrop.",
+ "Sound of something being tossed in the air, signaling playful completion.",
+ "Upbeat, engaging commentary and intermittent background music create dynamic, cohesive narrative centered around enjoyment and portioning of snacks."
+ ]
+ },
+ {
+ "video": "video/12.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a woman in a kitchen setting, wearing a black sleeveless top. She has long, straight brown hair, and the background includes white cabinets and a white door. Holding a bottle of Malibu Pineapple Rum, she points towards the camera as she starts to speak, \"So I don't drink alcohol but today I'm going to show you why I don't drink. Can't believe I'm doing this.\" She then proceeds to uncap the bottle and pour the rum into a glass jar filled with ice while upbeat music plays in the background.\n\nNext, as the scene continues in the kitchen, the woman adds coconut milk from a carton with a white and blue design into the glass jar. The conversation moves forward with her saying, \"This is Malibu Pineapple Rum, and now I'm adding coconut water. Might as well just go for it and make this delicious.\" She then picks up a bottle labeled 'Maraschino Cherry Juice,' pours it in, creating a layered effect, and adds two maraschino cherries to the drink. Finally, she places a metal straw into the jar and takes a sip, while commenting, \"I think I need to lay down now,\" as the background music continues. The woman's face turns red and she looks uncomfortable.\n\nThe video then transitions to the woman lying on a gray couch in a different room, draped with white and gray pillows. She appears to be resting with her hands covering her face. Her voice is heard saying, \"so my head's pounding, my face is obviously red and hot, and my heart is palpitating.\" The scene shows her looking visibly uncomfortable. The woman then sits up and looks towards the camera with one hand holding her head, set against a plain wall, indicating another shift in location from the kitchen to a more relaxed setting.\n\nFinally, the clip shows the woman, still in her black sleeveless top, sitting on the couch with the same white and gray pillows in the background. She talks and explains, gesturing with her hands, and says, \"My Asian gene does not let me metabolize alcohol. And this is what you call the Asian glow.\" The atmosphere appears calm, with her lying down on the couch, hand resting on her forehead, suggesting a moment of relaxation or contemplation. The gentle background music enhances this tranquil setting, rounding off the narrative of her explaining why she avoids alcohol."
+ }
+ ],
+ "events": [
+ "Woman is in a kitchen setting wearing a black sleeveless top.",
+ "Background includes white cabinets and a white door.",
+ "Woman holds a bottle of Malibu Pineapple Rum.",
+ "She points toward the camera and starts to speak.",
+ "She says she doesn't drink alcohol and will show why she doesn't drink.",
+ "She uncaps the bottle and pours rum into a glass jar filled with ice.",
+ "Upbeat music plays in the background.",
+ "Woman adds coconut milk from a carton with a white and blue design into the jar.",
+ "She mentions adding coconut water and says she might as well make the drink delicious.",
+ "She pours in Maraschino Cherry Juice, creating a layered effect.",
+ "She adds two maraschino cherries to the drink.",
+ "She places a metal straw into the jar and takes a sip.",
+ "She comments that she needs to lay down as her face turns red and she looks uncomfortable.",
+ "Scene transitions to a different room with a gray couch and white and gray pillows.",
+ "Woman is lying on the couch with hands covering her face.",
+ "Her voice mentions her head pounding, face being red and hot, and heart palpitating.",
+ "Scene shows her looking visibly uncomfortable.",
+ "Woman sits up and looks toward the camera with one hand holding her head.",
+ "Indicates a shift in location to a more relaxed setting.",
+ "Woman is sitting on the couch with the same white and gray pillows in the background.",
+ "She talks and explains, gesturing with her hands.",
+ "She says her Asian gene prevents her from metabolizing alcohol and mentions the Asian glow.",
+ "Atmosphere is calm with her lying down on the couch, hand resting on her forehead.",
+ "Gentle background music plays, enhancing the tranquil setting."
+ ]
+ },
+ {
+ "video": "video/13.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a child standing on a sidewalk at an intersection, wearing a gray shirt, black shorts, and purple shoes. On the screen, the text \"How to get a tooth out pt 2\" is displayed at the bottom. An audio of a man's voice is heard saying, \"In case you missed it, after shimmying into Walmart, Audrey realized she had a really loose tooth.\"\n\nThe scene transitions to a close-up of the child's mouth, showing a particularly loose tooth, while the man's narration continues, \"We tried a bunch of different ways to get it out, but that tooth wasn't budging.\" The video then shifts to the child sitting in a shopping cart inside a store, with someone in a blue vest walking in the background. The man explains, \"So I took matters into my own hands. I took my remote-controlled car and a strand,\" before it cuts off.\n\nIn the next scene, the man, now in a light green shirt and dark blue shorts, is kneeling on a sidewalk, tying a string to a red and white toy car. He continues narrating, \"I had one end of the car, and I took Audrey and tied that other end to her tooth.\" The video shows him standing next to a child in a yellow shirt and black shorts, both looking at the toy car on the ground, with the man holding a remote control. The toy car zooms forward as both the man and the child react excitedly. The man's voiceover continues, \"I pressed go, and no, it slipped right off the tooth.\"\n\nThe scene transitions to the man kneeling next to an orange bicycle with blue wheels, attaching something, while the child stands nearby. The narration resumes, \"So then I took her baby brother Braxton, I tied part of the string to Braxton's hand, and I took the other part of the string and put it on Audrey's tooth.\" The scene then shows the man checking the child's mouth, suggesting that the attempt might have failed again.\n\nContinuing in the same location, the man in the light green shirt adjusts something on the orange bicycle. The child in a yellow shirt watches attentively. He then helps a smaller child, dressed in a green outfit, onto the bicycle. The voiceover humorously adds, \"I said, ready Braxton and go! I forgot he couldn't ride a bike.\" As the smaller child pedals, the bike tips over, causing the child to fall. The scene changes to an indoor setting where the man playfully swings the child in the yellow shirt in a living room. The narration adds, \"Finally, I just went WWE style and threw her, tossing her on the couch and nothing would work. Actually, none of that happened.\"\n\nFollowing this, a table is shown with a partially eaten burger, fries, a bottle of chocolate milk, a smartphone with various app icons, a toy car, and a water bottle. The child in a purple shirt stands next to the table, looking at the items. The man narrates, \"The real story is I took her to McDonald's and got her a happy meal when she bit into her hamburger, boom, the tooth came out.\"\n\nThe final scenes show a close-up of the child's mouth, now missing the loose tooth, and then a broader smile indicating the successful extraction. The sound of the tooth falling out is heard as the visual confirms the child's exuberant smile, bringing the narrative to a joyful close."
+ }
+ ],
+ "events": [
+ "A child stands on a sidewalk at an intersection wearing a gray shirt, black shorts, and purple shoes.",
+ "Text on the screen reads 'How to get a tooth out pt 2'.",
+ "A man's voice says, 'In case you missed it, after shimmying into Walmart, Audrey realized she had a really loose tooth.'",
+ "Scene transitions to a close-up of the child's mouth showing a loose tooth.",
+ "The man's narration continues, 'We tried a bunch of different ways to get it out, but that tooth wasn't budging.'",
+ "The scene shifts to the child sitting in a shopping cart inside a store, with someone in a blue vest walking in the background.",
+ "The man explains, 'So I took matters into my own hands. I took my remote-controlled car and a strand.'",
+ "Next scene shows the man kneeling on a sidewalk, tying a string to a red and white toy car.",
+ "The man narrates, 'I had one end of the car, and I took Audrey and tied that other end to her tooth.'",
+ "The man stands next to a child in a yellow shirt and black shorts, both looking at the toy car on the ground.",
+ "The man holds a remote control and the toy car zooms forward as both react excitedly.",
+ "The man's voiceover continues, 'I pressed go, and no, it slipped right off the tooth.'",
+ "The scene transitions to the man kneeling next to an orange bicycle with blue wheels, attaching something.",
+ "The child stands nearby as the man narrates, 'So then I took her baby brother Braxton, I tied part of the string to Braxton's hand, and I took the other part of the string and put it on Audrey's tooth.'",
+ "The man checks the child's mouth, suggesting the attempt might have failed again.",
+ "Continuing at the same location, the man adjusts something on the orange bicycle.",
+ "The child in a yellow shirt watches attentively.",
+ "The man helps a smaller child in a green outfit onto the bicycle.",
+ "The voiceover humorously adds, 'I said, ready Braxton and go! I forgot he couldn't ride a bike.'",
+ "As the smaller child pedals, the bike tips over, causing the child to fall.",
+ "Scene changes to an indoor setting where the man swings the child in the yellow shirt playfully in the living room.",
+ "The narration adds, 'Finally, I just went WWE style and threw her, tossing her on the couch and nothing would work. Actually, none of that happened.'",
+ "A table is shown with a partially eaten burger, fries, a bottle of chocolate milk, a smartphone with various app icons, a toy car, and a water bottle.",
+ "The child in a purple shirt stands next to the table, looking at the items.",
+ "The man narrates, 'The real story is I took her to McDonald's and got her a happy meal when she bit into her hamburger, boom, the tooth came out.'",
+ "The final scenes show a close-up of the child's mouth, now missing the loose tooth.",
+ "The child displays a broader smile indicating the successful extraction.",
+ "The sound of the tooth falling out is heard.",
+ "Visual confirmation of the child's exuberant smile, bringing the narrative to a joyful close."
+ ]
+ },
+ {
+ "video": "video/14.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a man wearing a blue cap and a black hoodie, speaking into a microphone in a room decorated with a neon sign that reads \"BEAMS.\" The text 'Last Week I posted this video...' appears at the top of the screen, indicating a continuation from a previous video. This is a split-screen video where the upper part changes as the man speaks in the lower part. As the man talks, the background remains consistent with occasional decorations visible on the walls and a door in the backdrop. The audio reveals the man narrating: \"Last week I posted this video, let's see if you're psychic, so heads or tails, go ahead and pick. And it got over 100,000 comments, yet not a single person picked up on the hints throughout the video.\" Later, the man, now wearing a light blue cap and a black vest over a white shirt, continues speaking into the microphone. A large pink circle with the word 'TAILS' and a gray button labeled 'FLIP IT' appears on the upper part of the screen, and the circle's display then alternates between the word 'TAILS' in pink and the word 'HEADS' in purple. Above the circle, texts displaying '26K', '20.8K', and '21.5K' along with speech bubble icons indicating comments or interactions appear. Throughout this segment, the sound of music can be heard faintly in the background.\n\nThe text 'Let's break it down...' appears as the man, back in his blue cap and black hoodie, speaks into the microphone. The scene transitions once more to him in a light blue cap and black vest over a white shirt, gesturing animatedly. The large pink circle with 'TAILS' and the 'FLIP IT' button reappears, soon changing to display 'HEADS' in purple again. The audio plays the man's voice saying, \"So let's break it down. Let's see if you're psychic. Did you catch that? Look at what number I'm making with my hands. This will be important for later. So heads or tails, go ahead and pick. 
Here I touch the top of my hat to indicate that it's about to be heads.\" accompanied briefly by a musical instrument playing a melody.\n\nNext, in a light blue cap and black vest over a white shirt, the man continues speaking into the microphone. The background remains unchanged. As the video progresses, red circles appear to highlight the man's hand and then disappear, followed by an arrow appearing to point to the man's hat and then disappearing. Then the caption 'Subscribe if you're enjoying!' is displayed beneath the circle in a red box. The top part of the screen becomes a keypad of numbers from 1 to 10. In the end, only 6 is highlighted by a green background, and the other numbers are all on a red background and blurred. The audio matches the visuals with the man saying, \"Okay, heads, so pretty good. But here are three more randomized questions to check if you're truly psychic. So pick a number 1 through 10. All of a sudden, this question becomes a no-brainer. Good job, sixes.\" His speech is accompanied by the sound of tapping on the phone screen.\n\nTransitioning to a grid of numbers from 1 to 10 with numbers 2 and 6 highlighted in red, the man, still in a light blue cap and black vest over a white shirt, continues to speak and gesture. The audio captures the man commenting, \"All right, now pick a letter A through Z. Now check out which letter I'm making with my left hand. I even try to stretch it out so the B becomes more apparent.\" The scene then transitions to a colorful spinning wheel with letters on the upper part of the screen, and a red arrow points towards the man's hand, emphasizing his gestures. The background jingle plays before the man's voice returns.\n\nAs the spinning wheel continues in the background, the man remains engaged in his narration. The close-up of a brown door then takes over the upper part of the screen while he continues to speak. The man's speech relates, \"Congrats if you chose B. 
And then finally, what animal is behind this door? Really think about it. You'll quickly realize that it's going to be...\" Music softly underscores his narration.\n\nThe video concludes with a close-up image of the brown door, transitioning to a picture of a monkey hanging upside down from a branch. The text 'Subscribe for more!' is prominently displayed at the bottom in a red box. The final audio segment has the man revealing, \"That's going to be a monkey. So how'd you do?\" as music continues to play in the background.\n\nOverall, the video guides viewers through an interactive series of prompts, hinting at numbers, letters, and animals, intertwined with engaging audio cues and musical interludes, urging viewers to participate and reflect upon his psychic guesses."
+ }
+ ],
+ "events": [
+ "Man wearing a blue cap and a black hoodie speaks into a microphone.",
+ "Room is decorated with a neon sign that reads 'BEAMS'.",
+ "Text 'Last Week I posted this video...' appears at the top of the screen.",
+ "Split-screen video format is introduced.",
+ "Background remains consistent with decorations and a door visible.",
+ "Audio: Man narrates about posting the previous video and hints.",
+ "Man switches to wearing a light blue cap and a black vest over a white shirt.",
+ "Large pink circle with 'TAILS' and 'FLIP IT' button appears on the upper part of the screen.",
+ "Circle display changes between 'TAILS' in pink and 'HEADS' in purple.",
+ "Text displaying '26K', '20.8K', and '21.5K' with speech bubble icons appear.",
+ "Faint background music is heard.",
+ "Text 'Let\u2019s break it down...' appears on screen.",
+ "Man in blue cap and black hoodie talks again.",
+ "Scene transitions to man in light blue cap and black vest speaking animatedly.",
+ "Pink circle with 'TAILS' and 'FLIP IT' button reappears, then changes to 'HEADS' in purple.",
+ "Audio: Man narrates about breaking it down and hints.",
+ "Musical instrument plays a brief melody.",
+ "Man in light blue cap and black vest speaks as background remains unchanged.",
+ "Red circles highlight the man's hand, then disappear.",
+ "Arrow appears pointing to man's hat, then disappears.",
+ "Caption 'Subscribe if you're enjoying!' appears in a red box beneath the circle.",
+ "Top screen displays a keypad of numbers from 1 to 10.",
+ "Only number 6 is highlighted by a green background; other numbers are red and blurred.",
+ "Audio: Man speaks about heads and asks viewers to pick a random number. He praises 'sixes'.",
+ "Sound of tapping the phone screen.",
+ "Grid of numbers from 1 to 10 appears with numbers 2 and 6 highlighted in red.",
+ "Man continues speaking and gesturing.",
+ "Audio: Man narrates about picking a letter A through Z.",
+ "Audio: Man mentions stretching his hand to show a letter B.",
+ "Colorful spinning wheel with letters appears on the upper part of the screen.",
+ "Red arrow points towards man's hand emphasizing his gestures.",
+ "Background jingle plays.",
+ "Close-up of a brown door takes over the upper part of the screen.",
+ "Audio: Man narrates about choosing letter B and asks about the animal behind the door.",
+ "Soft background music underscores narration.",
+ "Close-up image of brown door transitions to a picture of a monkey hanging upside down.",
+ "Text 'Subscribe for more!' displayed at the bottom in a red box.",
+ "Audio: Man reveals the animal is a monkey, asks how viewers performed.",
+ "Music continues to play in the background."
+ ]
+ },
+ {
+ "video": "video/15.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a touching scene in a McDonald's restaurant where a man has taken around 14 or 15 underprivileged children to enjoy a meal with his first paycheck. The environment is bright with natural light streaming through the large windows, creating a welcoming atmosphere. A text overlay introduces the scene: \"BLESSED IMAGES.\" Beneath it, the text informs viewers, \"This man took around 14 or 15 poor kids to McDonald's with his first paycheck.\" The joy and excitement of the children resonate through the scene.\n\nThe next scene transitions to a girl in a light blue tank top, warmly cradling three baby squirrels against her shirt. The squirrels seem to find comfort and safety, reflecting the girl\u2019s care. A text overlay introduces the scene: \"BLESSED IMAGES.\" Beneath it, the text reads, \"My sister saved 3 squirrels from the hurricane and named them Alvin, Simon, and Theodore,\" followed by a crying emoji, underscoring the emotional connection.\n\nAudio reveals a man recounting these events, saying, \"This man took around 14 or 15 poor kids to McDonald's with his first paycheck. My sister saved three squirrels from the hurricane and named them Alvin, Simon, and Theodore.\"\n\nThe narrative continues with a peaceful scene where a man lies down with a light brown tabby cat nestled contentedly against his face and chest. The cat\u2019s closed eyes and the man's gentle smile illustrate a deep bond. A text overlay introduces the scene: \"BLESSED IMAGES.\" Beneath it, the text reads, \"My cat learned that the alarm sound means I wake up, and she snuggles on my chest right after. I\u2019ve been setting my alarm 30 minutes early every day to give her more happy time,\" highlighting the effort to ensure more quality time with the cat.\n\nThe audio captures a man speaking these sentiments: \"My cat learned that the alarm sound means I wake up and she snuggles on my chest right after. 
I've been setting my alarm 30 minutes early every day to give her more happy time.\" The transition to a man\u2019s voice adds warmth to the storytelling.\n\nShifting to a snowy backdrop, the video depicts a girl in red attire and her large light brown dog, which sports a red backpack. A baby goat peeks out from the backpack, adding a charming element. The text overlay reads, \u201cThis Girl And Her Dog Just Saved A Mom Goat With Her Newborn Baby.\u201d The scene exudes a sense of adventure and care.\n\nThis moment is followed by a more solemn but equally moving scene with the 'BLESSED IMAGES' text overlay. The camera focuses on flowers in a vase and a photograph of a young girl with her father at the beach. The text overlay reveals, \"My dad passed away when I was 16 from cancer and before he died, he pre-payed flowers so I could receive them every year on my birthday. Well this is my 21st birthday flowers and the last. Miss you so much daddy.\ud83d\udc9c\", deepening the narrative's emotional resonance. Visuals show a girl at the beach on her father's shoulders. The background setting of the serene beach reinforces the bittersweet memory.\n\nThe audio narrates, \"My dad passed away when I was 16 from cancer and before he died, he pre-payed flowers so I could receive them every year on my birthday. Well this is my 21st birthday flowers and the last. Miss you so much daddy.\" \n\nThe narrative then brings a joyful scene where a boy, dressed in a superhero costume, is being adopted. His family, also dressed as superheroes, surrounds him. The text overlay shares, \"Boy being adopted asks family to dress like superheroes for adoption day. They do.\" A woman with a camera captures the heartfelt moment in an indoor setting, likely a courthouse or adoption center.\n\nConcluding the series of 'BLESSED IMAGES,' the final clip returns to a heartfelt gesture from a mattress store. 
The text at the top reads, \"BLESSED IMAGES.\" Below, a tweet from Ferdanko (@623fer) reads, \"There's a mattress store that gives you a mattress for your dog when you buy a normal one.\" The video shows a small dog in a red sweater comfortably seated on a small mattress, with stacks of larger mattresses visible in the background, exuding a sense of comfort and thoughtfulness.\n\nThe audio ends with a man stating, \"There's a mattress store that gives you a mattress for your dog when you buy a normal one.\", wrapping up the touching compilation of stories demonstrating acts of kindness and love."
+ }
+ ],
+ "events": [
+ "Man takes around 14 or 15 underprivileged children to McDonald's with his first paycheck.",
+ "Environment: Bright with natural light in McDonald's restaurant.",
+ "Text overlay: 'BLESSED IMAGES' and 'This man took around 14 or 15 poor kids to McDonald's with his first paycheck.'",
+ "Joy and excitement of children resonate in the scene.",
+ "Girl in light blue tank top cradles three baby squirrels.",
+ "Text overlay: 'BLESSED IMAGES' and 'My sister saved 3 squirrels from the hurricane and named them Alvin, Simon, and Theodore' with a crying emoji.",
+ "Audio: Man recounts these events.",
+ "Man lies down with a light brown tabby cat on chest.",
+ "Text overlay: 'BLESSED IMAGES' and 'My cat learned that the alarm sound means I wake up, and she snuggles on my chest right after. I\u2019ve been setting my alarm 30 minutes early every day to give her more happy time.'",
+ "Audio: Man narrates this sentiment.",
+ "Snowy scene with girl in red attire and light brown dog with red backpack.",
+ "Baby goat peeks out from the backpack.",
+ "Text overlay: 'This Girl And Her Dog Just Saved A Mom Goat With Her Newborn Baby.'",
+ "Solemn scene with flowers in vase and photograph of young girl with her father at the beach.",
+ "Text overlay: 'My dad passed away when I was 16 from cancer and before he died, he pre-payed flowers so I could receive them every year on my birthday. Well this is my 21st birthday flowers and the last. Miss you so much daddy.'",
+ "Visuals of girl at the beach on father\u2019s shoulders.",
+ "Audio: Man narrates these events.",
+ "Boy in superhero costume being adopted.",
+ "Family also dressed as superheroes surrounds him.",
+ "Text overlay: 'Boy being adopted asks family to dress like superheroes for adoption day. They do.'",
+ "Woman with camera captures the moment in indoor setting, likely a courthouse or adoption center.",
+ "Final clip: Mattress store gives a mattress for dog when you buy normal one.",
+ "Text overlay: 'BLESSED IMAGES' and tweet from Ferdanko (@623fer).",
+ "Dog in red sweater seated on small mattress with larger mattresses in background.",
+ "Audio: Man states this information."
+ ]
+ },
+ {
+ "video": "video/16.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man holding a guitar in a dark room with computer monitors in the background. The man is wearing a sleeveless shirt, revealing tattoos on his arms. Text overlays the image in white and red letters, reading '3 GUITAR TECHNIQUE TRICKS THAT'. This text changes to 'DRASTICALLY CLEANED UP MY PLAYING!' The man begins to play the guitar, and the text 'TRICK #1 PICKING' appears, emphasizing a picking technique. The frame transitions to a close-up of the guitar strings and the man's hand with the text 'UNWANTED STRING NOISE' highlighting the focus on reducing noise. In the audio, the man states, \"Three guitar technique tricks that drastically cleaned up my playing. When I started playing fast alternate picking lines I had some issues with unwanted strings noise here and there\" followed by a guitar solo.\n\nNext, the scene continues with a close-up of the guitar strings and the text 'UNWANTED STRING NOISE' in red and white. The man is now wearing a camouflage-patterned sleeveless shirt. The man demonstrates a technique to reduce unwanted string noise, focusing on his hand movements. The video transitions to a black-and-white close-up of the hand with a red 'X' mark to illustrate an incorrect technique and returns to color with the text 'LIFTING MY FINGERS DID THE TRICK!' In the audio, the man explains, \"I could see that my picking hand fingers were sometimes scraping against the strings and as soon as I started lifting my fingers a little bit like that I got much better results!\" while playing the guitar.\n\nFollowing this, a close-up of the man's hands playing the guitar shifts the focus to his finger movements, continuing the demonstration. Text in red and white letters, 'BAD HABIT: PUSHING DOWN WITH YOUR FIRST FINGER,' identifies a common mistake. The explanation, 'SOUNDS AND FEELS VERY STIFF!', follows. 
This frame correlates with the audio where the man says, \"What also slowed me down was always pushing down with my index finger while I was playing with my other fingers.\" accompanied by music playing.\n\nIn the subsequent segment, the close-up of the man's fingers on the guitar strings continues. The scene shifts to the hand on the guitar with text suggesting 'TRY ADDING OPEN STRINGS TO YOUR WORKOUTS.' The man demonstrates this technique, focusing on hand movements, followed by 'MAKE ALL YOUR FINGERS WORK!' emphasizing full finger usage. The introduction of 'TRICK #3 SWEEPING' signifies another technique to be revealed. The man mentions in the audio, \"So what really helped me with developing finger independence was working on exercises like this with an open string in between each finger. I used to keep my palm anchored on the bridge like this,\" while demonstrating the guitar technique.\n\nNext, the video showcases a close-up of the person's hand playing the guitar, focusing on the sweeping technique. The man, still in a camouflage-patterned sleeveless shirt, demonstrates this with red arrows indicating the motion. The text 'FOLLOWING THE ARPEGGIOS HELPED' describes the benefit of following arpeggios. This is paired with audio where the man explains, \"But as you could see I end up with a completely different picking angle, and that means I got some really inconsistent results, so as soon as I loosened up my hand a little bit and I started following the arpeggios,\" with interspersed guitar playing.\n\nThe video concludes with a close-up of the man's hands playing the guitar and the text 'MUCH BETTER!', signifying an improvement. The man points to the guitar strings, explaining a technique, before the text 'BERNTH Join The Community TODAY!' and a 'SUBSCRIBE' button appear on screen. The button is clicked, changing to 'SUBSCRIBED' with the bell icon indicating notifications are enabled. 
The final audio includes the man promoting his YouTube channel: \"You can find the perfect technique exercises and some awesome guitar music on my YouTube channel so make sure to subscribe if that was helpful!\" followed by a guitar riff."
+ }
+ ],
+ "events": [
+ "A man holding a guitar in a dark room with computer monitors in the background.",
+ "Man wearing a sleeveless shirt, revealing tattoos on his arms.",
+ "Text overlay: '3 GUITAR TECHNIQUE TRICKS THAT'.",
+ "Text changes to: 'DRASTICALLY CLEANED UP MY PLAYING!'.",
+ "Man begins to play the guitar.",
+ "Text overlay: 'TRICK #1 PICKING'.",
+ "Close-up of the guitar strings and man's hand.",
+ "Text overlay: 'UNWANTED STRING NOISE'.",
+ "Audio: Man states, 'Three guitar technique tricks that drastically cleaned up my playing. When I started playing fast alternate picking lines I had some issues with unwanted strings noise here and there'.",
+ "Guitar solo starts.",
+ "Continuation of close-up on guitar strings, text 'UNWANTED STRING NOISE' in red and white.",
+ "Man now wearing a camouflage-patterned sleeveless shirt.",
+ "Man demonstrates a technique to reduce unwanted string noise.",
+ "Video transitions to a black-and-white close-up of hand with a red 'X' mark to illustrate incorrect technique.",
+ "Video returns to color with text 'LIFTING MY FINGERS DID THE TRICK!'.",
+ "Audio: Man explains, 'I could see that my picking hand fingers were sometimes scraping against the strings and as soon as I started lifting my fingers a little bit like that I got much better results!'.",
+ "Demonstration continues with close-up of man's hands on the guitar.",
+ "Text overlay: 'BAD HABIT: PUSHING DOWN WITH YOUR FIRST FINGER'.",
+ "Text overlay: 'SOUNDS AND FEELS VERY STIFF!'.",
+ "Audio: Man says, 'What also slowed me down was always pushing down with my index finger while I was playing with my other fingers.'.",
+ "Music playing in the background.",
+ "Close-up of man's fingers on the guitar strings continues.",
+ "Text overlay: 'TRY ADDING OPEN STRINGS TO YOUR WORKOUTS.'.",
+ "Man demonstrates the technique focusing on hand movements.",
+ "Text overlay: 'MAKE ALL YOUR FINGERS WORK!'.",
+ "Introduction of 'TRICK #3 SWEEPING'.",
+ "Audio: Man mentions, 'So what really helped me with developing finger independence was working on exercises like this with an open string in between each finger. I used to keep my palm anchored on the bridge like this.'.",
+ "Demonstration of the sweeping technique with close-up of man's hand playing the guitar.",
+ "Man still in camouflage-patterned sleeveless shirt.",
+ "Red arrows indicating sweeping motion.",
+ "Text overlay: 'FOLLOWING THE ARPEGGIOS HELPED'.",
+ "Audio: Man explains, 'But as you could see I end up with a completely different picking angle, and that means I got some really inconsistent results, so as soon as I loosened up my hand a little bit and I started following the arpeggios.'.",
+ "Interspersed guitar playing.",
+ "Video concludes with close-up of man's hands playing the guitar.",
+ "Text overlay: 'MUCH BETTER!'.",
+ "Man points to the guitar strings, explaining a technique.",
+ "Text overlay: 'BERNTH Join The Community TODAY!'.",
+ "Text overlay: 'SUBSCRIBE' button appears.",
+ "Button is clicked, changing to 'SUBSCRIBED' with bell icon for notifications.",
+ "Audio: Man promotes his YouTube channel, 'You can find the perfect technique exercises and some awesome guitar music on my YouTube channel, so make sure to subscribe if that was helpful!'.",
+ "Guitar riff follows."
+ ]
+ },
+ {
+ "video": "video/17.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a muscular man standing shirtless, showcasing his well-defined physique while holding dumbbells in both hands. As the man starts performing dumbbell exercises, bold red and white text appears and a man's voice says, \"KINGS, HERE ARE SOME OF THE BEST HIGH PROTEIN FOODS THAT WILL EASILY HELP YOU ADD 100 GRAMS OF PROTEIN TO YOUR DIET.\" with emojis of a plate with knife and fork, and 100 points.\n\nThe scene transitions to the man performing an exercise with a wooden stick and a weight attached to it. An image of a plate of raw chicken and an egg appears. The man continues his exercise, and the audio continues as the text \"STARTING OFF WE HAVE CHICKEN GIZZARDS, I DISCOVERED THESE RECENTLY,\" appears. The clip concludes with a close-up of a nutritional label showing the protein content and other nutritional information, accompanied by the audio and text, \"AND LOOK AT THE MACROS\".\n\nNext, the man is seen performing push-ups while holding onto a barbell, with various nutritional values overlaid on the scene. The text appears in bold red letters with the voice, \"THEY'RE SUPER. LOW IN CALORIES, AND THEY HAVE 80 GRAMS OF PROTEIN PER POUND.\" The scene shifts to the man performing a lying exercise on a bench. Overlaid texts read with the voice, \"THE CALORIES OF PROTEIN RATIO IS ABSOLUTELY INSANE.\" The man stands shirtless, holding dumbbells again, and the text with the voice, \"NEXT UP, WE HAVE SOMETHING I CAN'T LIVE WITHOUT ARE EGG WHITES TO PUT IT IN PERSPECTIVE.\" appears, coupled with emojis of a finger and a chick hatching.\n\nThe man now performs more dumbbell exercises. The scene transitions to a close-up of a carton of egg whites with the text and voice, \"YOU ONLY NEED TO CONSUME ONE CARTON OF EGG WHITES TO GET EXTRA 100 GRAMS OF PROTEIN IN YOUR DIET.\"\n\nThe man continues his exercise, and the text with the voice, \"NEXT UP, WE HAVE FIRM TOFU JUST TWO OF THESE BLOCKS WILL GET YOU OVER. 
HUNDRED GRAMS OF PROTEIN LIKE.\" appears with emojis of a chick hatching and a finger, followed by an image of a package of firm tofu.\n\nFinally, the man stands shirtless with a smartphone in one hand and the other hand behind his head. The text 'SUNRISE TOFU' appears on a package of tofu, which is displayed on the screen. The scene transitions to a close-up of the man flexing, the text in bold red and white letters with the voice, \"AND FOLLOW PART 2 KING.\" appearing with an emoji of a crown.\n\nThroughout the video, the muscular man\u2019s exercises and the informative texts aim to guide viewers on high-protein food options, synchronized with the detailed audio descriptions to ensure a coherent and engaging narrative."
+ }
+ ],
+ "events": [
+ "Muscular man standing shirtless holding dumbbells.",
+ "Man starts performing dumbbell exercises.",
+ "Bold red and white text appears with voice, 'KINGS, HERE ARE SOME OF THE BEST HIGH PROTEIN FOODS THAT WILL EASILY HELP YOU ADD 100 GRAMS OF PROTEIN TO YOUR DIET.' with emojis of plate, knife, fork, and 100 points.",
+ "Scene transitions to man performing an exercise with a wooden stick and weight attached.",
+ "Image of a plate of raw chicken and an egg appears.",
+ "The man continues his exercise.",
+ "Audio and text, 'STARTING OFF WE HAVE CHICKEN GIZZARDS, I DISCOVERED THESE RECENTLY,' appears.",
+ "Clip concludes with a close-up of a nutritional label showing protein content and other nutritional information.",
+ "Audio and text, 'AND LOOK AT THE MACROS,' appears.",
+ "Man performing push-ups while holding onto a barbell.",
+ "Various nutritional values overlaid on the scene.",
+ "Text in bold red letters with the voice, 'THEY'RE SUPER. LOW IN CALORIES, AND THEY HAVE 80 GRAMS OF PROTEIN PER POUND.'",
+ "Scene shifts to the man lying on a bench exercise.",
+ "Overlaid texts read and voice, 'THE CALORIES OF PROTEIN RATIO IS ABSOLUTELY INSANE.'",
+ "Man stands shirtless holding dumbbells again.",
+ "Text with voice, 'NEXT UP, WE HAVE SOMETHING I CAN'T LIVE WITHOUT ARE EGG WHITES TO PUT IT IN PERSPECTIVE.', appears with emojis of a finger and a chick hatching.",
+ "Man performs more dumbbell exercises.",
+ "Scene transitions to a close-up of a carton of egg whites.",
+ "Text and voice, 'YOU ONLY NEED TO CONSUME ONE CARTON OF EGG WHITES TO GET EXTRA 100 GRAMS OF PROTEIN IN YOUR DIET.'",
+ "Man continues his exercise.",
+ "Text with the voice, 'NEXT UP, WE HAVE FIRM TOFU JUST TWO OF THESE BLOCKS WILL GET YOU OVER. HUNDRED GRAMS OF PROTEIN LIKE.' appears with emojis of a chick hatching and a finger.",
+ "Image of a package of firm tofu appears.",
+ "Man stands shirtless with a smartphone in one hand and the other hand behind his head.",
+ "Text 'SUNRISE TOFU' appears on a package of tofu on the screen.",
+ "Scene transitions to a close-up of the man flexing.",
+ "Text in bold red and white letters with the voice, 'AND FOLLOW PART 2 KING.' appears with an emoji of a crown."
+ ]
+ },
+ {
+ "video": "video/18.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a shirtless man standing on a sandy beach, fringed by dense greenery. He speaks and gestures animatedly with his hands. Birds chirp in the background, enhancing the tranquil beach atmosphere as he says, \"I'm going to show you how to make fresh drinkable water from the salt ocean right now. I think we got it going here, guys. These can help us make fresh drinking water with this fire.\" The scene transitions to him sitting cross-legged on the sand, tending a small fire made of twigs and dry leaves, adding more sticks to make the flames grow. The final segment of this sequence focuses on the man's hand holding two glass bottles filled with sand, which he examines carefully.\n\nNext, the view zooms in on a hand holding two similar glass bottles filled with sand, maintaining continuity. Then the man, still shirtless, crouches by the water's edge, filling glass bottles, now clean of sand, with seawater. \"Grab one bottle. We want to fill it up with salt water. Now, this bottle here is filled with salt water, you can see right there.\" he explains, his voice accompanied by the sounds of water splashing and gurgling. Carrying the filled bottle, he returns to the sandy area with the fire and gestures to the camera with the bottle, continuing his explanation. The man takes a drink from the glass bottle and spits it out, saying \"Disgusting.\"\n\nThe continuity persists with the shirtless man crouching near the fire. The embers smolder as smoke wafts in the air. He then holds up two glass bottles-one filled with water and the other empty-explaining, \"Just like this and it's going to be connected to this bottle right here. As this bottle gets hot, it's going to create steam that's going to go into this bottle right here. 
And that is going to be fresh drinking water.`\" His voice is higher-pitched, and the background remains the sandy beach with dense greenery.\n\nThe visual narrative continues with a close-up of the man holding the two glass bottles by the extinguished fire. The lush greenery and sandy beach are still visible. He points and gestures to the bottles, explaining further. \"`You can see the steam slowly building up. So as that one cooks, it creates steam, which passes over to this empty bottle. This is what had all the salt water in it. It's all gone.`\" he says. The video shifts to a close-up of the man's hand pointing to two glass bottles lying among ashes and charred wood, and the mouth of one bottle is placed against the mouth of the other bottle. Steam passes from the bottle with seawater to another empty bottle. He continues to explain and gestures animatedly with his hands.\n\nFinally, a close-up shows the glass bottle with seawater all evaporated being held over the sandy area, with twigs and branches scattered around and dense greenery in the background. The shirtless man raises the bottle to show the water inside and takes a drink, showing the bottle to the camera, explaining visibly, \"Let's see if this thing is actually freshwater or if it's just salt. That's fresh water, man. Are you freaking kidding me?`\" His tone is loud and surprised. The man concludes by holding the bottle, with the serene beach and greenery forming the backdrop."
+ }
+ ],
+ "events": [
+ "Shirtless man standing on a sandy beach fringed by dense greenery",
+ "Man speaks and gestures animatedly with his hands",
+ "Birds chirp in the background",
+ "Man says, 'I'm going to show you how to make fresh drinkable water from the salt ocean right now. I think we got it going here, guys. These can help us make fresh drinking water with this fire.'",
+ "Scene transitions to man sitting cross-legged on the sand",
+ "Man tends a small fire made of twigs and dry leaves",
+ "Man adds more sticks to make the flames grow",
+ "Focus on man's hand holding two glass bottles filled with sand",
+ "View zooms in on a hand holding two similar glass bottles filled with sand",
+ "Man, still shirtless, crouches by the water's edge",
+ "Man fills glass bottles with seawater",
+ "Bottles are clean now",
+ "Man says, 'Grab one bottle. We want to fill it up with salt water. Now, this bottle here is filled with salt water, you can see right there.'",
+ "Sounds of water splashing and gurgling",
+ "Man carries filled bottle back to the sandy area with the fire",
+ "Man gestures to the camera with the bottle",
+ "Man takes a drink from the glass bottle and spits it out",
+ "Man says, 'Disgusting.'",
+ "Man crouches near the fire",
+ "Embers smolder as smoke wafts in the air",
+ "Man holds up two glass bottles, one filled with water and the other empty",
+ "Man explains, 'Just like this and it's going to be connected to this bottle right here. As this bottle gets hot, it's going to create steam that's going to go into this bottle right here. And that is going to be fresh drinking water.'",
+ "Man's voice is higher-pitched",
+ "Background remains sandy beach with dense greenery",
+ "Close-up of man holding two glass bottles by the extinguished fire",
+ "Lush greenery and sandy beach are still visible",
+ "Man points and gestures to the bottles",
+ "Man explains, 'You can see the steam slowly building up. So as that one cooks, it creates steam, which passes over to this empty bottle. This is what had all the salt water in it. It's all gone.'",
+ "Close-up of man's hand pointing to two glass bottles lying among ashes and charred wood",
+ "Mouth of one bottle placed against the mouth of the other bottle",
+ "Steam passes from bottle with seawater to another empty bottle",
+ "Man continues to explain and gestures animatedly with his hands",
+ "Close-up of glass bottle with seawater all evaporated being held over sandy area",
+ "Twigs and branches scattered around",
+ "Dense greenery in the background",
+ "Shirtless man raises the bottle to show water inside",
+ "Man takes a drink from the bottle",
+ "Man shows the bottle to the camera",
+ "Man explains visibly, 'Let's see if this thing is actually freshwater or if it's just salt. That's fresh water, man. Are you freaking kidding me?'",
+ "Man's tone is loud and surprised",
+ "Man holds the bottle with serene beach and greenery in the backdrop"
+ ]
+ },
+ {
+ "video": "video/19.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a simple black and white cartoon character sitting at a desk, looking at a computer screen with a bored expression. The screen is divided into a black upper part and a gray lower part. The cartoon character is on the gray lower part, while other content appears on the black upper part. The audio reveals a man saying, \"I always thought computers were better at math than people until I tried to add 0.1 to 0.2 and got an answer of 0.304.\" The text '0.1' appears on the upper part of the screen, and the character's expression changes to a more focused look. As the character gets more intensely focused, the text evolves to '0.1 +', causing the character to look surprised. When the text changes to '0.1 + 0.2', the character is visibly excited. Suddenly, a complex array of lights and switches appears, and the character, now holding a hammer, seems ready to smash the computer. The astonishingly incorrect result of '0.30000000000000004' is displayed in purple, and with a satisfied look, the character smashes the computer and throws the hammer away.\n\nThe scene shifts to a bearded man in a teal shirt standing in a doorway, holding a child's shoulder, with the door labeled '22801' in purple text. This scene then transitions back to the cartoon character, now with a red gasoline can, pouring it over the computer, with the content of the computer shown on the upper part of the screen displaying absurd JavaScript results ('0.1 + 0.2', '0.30000000000000004', '0.1 + 0.2 === 0.3', and 'false') and absurd Python results ('0.1 + 0.2', '0.30000000000000004'). As the audio continues with, \"JavaScript is just a toy language, so I figured it was a bug and tried the same thing in Python, but it also gave me the same wrong answer,\" the character lights a match and burns the computer, looking at the flame with contentment, then turns to leave.\n\nOn the upper part, a thoughtful man in a gray jacket and gold watch stands by a wall, and on the lower part a black screen with the bold text 'FLOATING POINT MATH' explains the concept. Subsequent frames discuss the limited space for floating-point math with binary numbers and the necessity to choose between range and precision. The visual content in quick succession shows a person looking through a telescope on the upper part or a microscope on the lower part, illustrating the trade-off in computing. The clip ends with a depiction of a traditional floating point (IEEE 754 style) diagram labeled '64 BIT floating point.' Meanwhile, a high-pitched voice explains floating-point arithmetic and its limitations due to memory constraints, against a background of soft music.\n\nThe video's narration now describes integer accuracy and decimal limits. The visuals reiterate the previously shown floating-point diagram, adding a corresponding moving red square to represent the movement of the decimal point. The audio continues, \"An integer is accurate up to 15 digits, and a max of 17 numbers after the decimal point,\" as the frames shift through scenes of individuals looking through telescopes and microscopes. As the discussion continues, binary numbers and base-2 and base-10 systems are shown visually, along with a sequence of \"0\" and \"1\" for base-2 and colorful beads on an abacus for base-10, explaining the differences between human and computer numerical systems.\n\nThe final segment introduces 'ROUNDING ERRORS' in bold red letters over a black screen. An IEEE 754 converter interface details the rounding errors inherent in floating-point conversion, showing specific values, binary representations, and conversion errors on screen. Accompanied by the narration, \"errors when all the memory has been used up, and that's why your computer sucks at math,\" the interface details remain the same. It transitions to an old computer monitor displaying faulty JavaScript code snippets ('0.1 + 0.2', '0.30000000000000004', '0.1 + 0.2 === 0.3', and 'false') with '8 Issues: \u274c 8' in the corner, conveying the ultimate message of computational rounding errors and limitations in succinct yet striking visual and audio coordination."
+ }
+ ],
+ "events": [
+ "Black and white cartoon character sitting at a desk, looking at a computer screen with a bored expression.",
+ "Screen is divided into a black upper part and a gray lower part.",
+ "Text '0.1' appears on the upper part of the screen.",
+ "Character's expression changes to a more focused look.",
+ "Text evolves to '0.1 +', causing the character to look surprised.",
+ "Text changes to '0.1 + 0.2', character gets visibly excited.",
+ "Complex array of lights and switches appears.",
+ "Character now holding a hammer, ready to smash the computer.",
+ "Incorrect result '0.30000000000000004' displayed in purple.",
+ "Character smashes the computer and throws the hammer away with satisfied look.",
+ "Scene shifts to a bearded man in a teal shirt holding a child's shoulder in a doorway labeled '22801' in purple text.",
+ "Scene transitions back to cartoon character with a red gasoline can pouring it over the computer.",
+ "Upper part of computer screen displaying absurd JavaScript and Python results (0.1 + 0.2, 0.30000000000000004, 0.1 + 0.2 === 0.3, false).",
+ "Character lights a match and burns the computer, looking at the flame with content.",
+ "Character turns over to leave.",
+ "Thoughtful man in a gray jacket and gold watch stands by a wall on upper part.",
+ "Bold text 'FLOATING POINT MATH' explains the concept on lower part.",
+ "Frames discuss limited space for floating-point math with binary numbers.",
+ "Visual content shows a person looking through a telescope or microscope illustrating trade-off in computing.",
+ "Depiction of traditional floating point (IEEE 754 style) diagram labeled '64 BIT floating point.'",
+ "High-pitched voice explains floating-point arithmetic and limitations due to memory constraints.",
+ "Narration describes integer accuracy and decimal limits.",
+ "Visuals reiterate floating-point diagram with a moving red square representing the decimal point movement.",
+ "Audio explains an integer's accuracy up to 15 digits and a max of 17 numbers after the decimal point.",
+ "Frames shift through scenes of individuals looking through telescopes and microscopes.",
+ "Visuals of binary numbers and base-2 and base-10 systems.",
+ "Sequence of '0' '1' for base-2 and colorful beads on an abacus for base-10.",
+ "Final segment with 'ROUNDING ERRORS' in bold red letters over black screen.",
+ "IEEE 754 converter interface details rounding errors in floating-point conversion.",
+ "Narration mentions errors when all memory is used up and why computers struggle with math.",
+ "Old computer monitor displays faulty JavaScript code snippets with '8 Issues: \u274c 8' in the corner."
+ ]
+ },
+ {
+ "video": "video/20.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video features a man wearing a red shirt participating in a TikTok challenge called \"Character to beat.\" The screen is divided into two sections: the upper portion has an orange banner that reads \"Character to beat:\" followed by an image of multiple anime characters. Below that, the text \"My character\" is displayed with categories like \"Height:\", \"Race:\", \"Abilities:\", \"Speed:\", \"PowerUp:\", and \"IQ:\", which are progressively filled in as the video advances.\n\nAt the start, the man introduces the challenge by saying, \"I gotta build an anime character that has to beat this character right here. Who we got? Yo, who do I gotta beat? This is gonna be nothing. Come on. What the f- Alright, come on. Let's go, bro. What's up?\" Various anime characters such as Vegeta, Kakashi, All Might, and SSJB Gogeta appear above the person's head. The background is a simple indoor setting with a gray wall and a strip of LED lights. The man responds to each character with enthusiasm.\n\nAs the challenge proceeds, the character image that has come to a stop features a muscular man with blue hair, who is bare-chested and wearing a vest. The text \"Height: 5'0\" gets filled in under the \"My character\" section. The man reacts, \"My height... I got this. No, I don't, bro. What's my height though? Meli, bro, you mad little. You know what, yo? Looks could be deceiving; the short characters would be the most OP sometimes.\" The backdrop remains the same.\n\nThe man says, \"But what's my race?\" The character image now stops on a bald character who is wearing a red jacket. Next, the text \"Race: Bald\" is filled in under the \"My character\" section. The man comments, \"Okay, I'm a bald human person. So far, my character is looking like an off-brand Krillin. But look, we're going to keep going. What's my abilities, though?\" Various characters with blue, orange, and no hair appear and the man continues reacting.\n\nThe character image now stops on a male character with black hair, who is wearing a dark blue T-shirt. The text \"Abilities: Tobio (volleyball hit)\" is added to the \"My character\" section. The man touches his face upon seeing the result. The man's voice can be heard saying, \"Turn me up. I get to serve a volleyball. What's my speed looking like, bruh? I'm done. There's nothing I can do. What's my speed, though?\"\n\nAs more characters appear, the character image now stops on a male character with black hair, who is wearing a white shirt and holding a book in his hand. The text \"Speed: Light\" is filled in. The man, in a state of deep sorrow and anger, covers his face with his hands, clearly distressed by the outcome. He says, \"It's worse? This is probably the worst character I ever made, ever.\" The challenge screen continuously displays the \"Character to beat: SSJB Gogeta.\"\n\nIn the final clip, various anime characters continue to appear, including ones with orange, blonde, black, and pink hair. The man consoles himself by saying, \"But what's my power-up? Maybe this could change the game. Maybe this could change...\" The character image now stops on a little girl character with long black hair. The text \"PowerUp: Nezuko\" appears under the \"My character\" section. Upon seeing the result, the man's face turns ashen, indicating his utter dismay and disappointment.\nThe man now has an image of a blonde character on his head, and he is holding a white flag made from a mask, signaling his surrender or defeat in a humorous or light-hearted manner. He says, \"If you like it, drop a follow. I'll give up. This is the white flag, bruh.\" A gust of wind is heard towards the end. The clip and the video conclude with the \"Character to beat: SSJB Gogeta\" banner and the image of multiple anime characters. The overall tone is playful and animated, keeping the audience engaged with the man's lively reactions and descriptions."
+ }
+ ],
+ "events": [
+ "Man wearing a red shirt participates in a TikTok challenge called 'Character to beat.'",
+ "Screen is divided into two sections with an orange banner that reads 'Character to beat:' followed by an image of multiple anime characters.",
+ "Below that, 'My character' text is displayed with categories: 'Height:', 'Race:', 'Abilities:', 'Speed:', 'PowerUp:', and 'IQ:'.",
+ "Man introduces the challenge, says, 'I gotta build an anime character that has to beat this character right here. Who we got?'",
+ "Various anime characters (Vegeta, Kakashi, All Might, and SSJB Gogeta) appear above man's head.",
+ "Background is a simple indoor setting with a gray wall and a strip of LED lights.",
+ "Man responds to each character with enthusiasm.",
+ "Character image stops at a muscular man with blue hair, text 'Height: 5'0' fills in under 'My character' section.",
+ "Man reacts, 'My height... I got this. No, I don't, bro. What's my height though?... Looks could be deceiving; the short characters would be the most OP sometimes.'",
+ "Man continues, 'But what's my race?'",
+ "Character image stops at a bald character with a red jacket, 'Race: Bald' fills in under 'My character' section.",
+ "Man comments, 'Okay, I'm a bald human person. So far, my character is looking like an off-brand Krillin. But look, we're going to keep going. What's my abilities, though?'",
+ "Various characters (blue, orange, and no hair) appear and man continues reacting.",
+ "Character image stops at a male character with black hair wearing a dark blue T-shirt, 'Abilities: Tobio (volleyball hit)' added to 'My character' section.",
+ "Man touches his face and says, 'Turn me up. I get to serve a volleyball. What's my speed looking like, bruh? I'm done. There's nothing I can do. What's my speed, though?'",
+ "Character image stops at a male character with black hair, white shirt holding a book, 'Speed: Light' filled in.",
+ "Man covers his face with hands in deep sorrow and anger, says, 'It's worse? This is probably the worst character I ever made, ever.'",
+ "Challenge screen continuously displays 'Character to beat: SSJB Gogeta.'",
+ "Man questions, 'But what's my power-up? Maybe this could change the game.'",
+ "Character image stops at a little girl character with long black hair, 'PowerUp: Nezuko' appears under 'My character' section.",
+ "Man's face turns ashen, indicating utter dismay and disappointment.",
+ "Man now has an image of a blonde character on his head, holds white flag made from a mask, signaling surrender or defeat humorously.",
+ "Man says, 'If you like it, drop a follow. I'll give up. This is the white flag, bruh.'",
+ "Gust of wind is heard towards the end.",
+ "Video concludes with 'Character to beat: SSJB Gogeta' banner and image of multiple anime characters.",
+ "Overall tone is playful and animated."
+ ]
+ },
+ {
+ "video": "video/21.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video features a comprehensive guide provided by a woman in a yellow hoodie on how to achieve optimum sleep.\n\nStanding in a bathroom, the woman addresses the camera with the text \"Here's how to have the best sleep of your entire life\" displayed at the top. You can hear her narrate the same tip as she gestures. The setting shifts to a darker room where she pulls down window blinds, emphasizing the importance of darkness with the text \"Make sure your room is dark.\" This is followed by the woman turning off a television, accompanied by the text \"Turn off any electronics.\" The scene then shows her adjusting a thermostat, with the recommendation \"And set your thermostat between 60\u00b0 and 70\u00b0\" displayed. The final part of this segment captures her speaking to the camera once more, with the text \"And then just try the 3-2-1 technique\" at the top.\n\nContinuing, the woman in the yellow hoodie provides more specifics on the 3-2-1 technique. As she stands in the kitchen closing a refrigerator, the text \"No more food for 3 hours before you sleep\" is shown. She follows this by writing at a desk, with \"No more work for 2 hours before you sleep\" displayed at the top. The scene transitions to the woman sitting on a bed using a red phone, highlighted by the text \"No more screen time for 1 hour before you sleep.\" The segment concludes with the woman lying in bed, covered with a white blanket, as the text \"And then once you're in bed\" appears.\n\nIn the next part, while lying in bed, the woman in the yellow hoodie continues to provide tips. The text \"You can use this military technique to fall asleep in just two minutes\" is displayed as she gesticulates. She emphasizes slow, deep breaths with the corresponding text \"You have to take slow deep breaths.\" The next advice pops up as \"And relax all your muscles from your face to your toes,\" with the woman remaining in bed. The final instruction in this segment reads \"And once you're fully relaxed just repeat a word or a phrase.\"\n\nAs she stays in the same position, the video shows the text \"Like 'sleep' or 'don't think'.\" The woman then sits up in bed, with a space-themed tapestry in the background, and the text \"And if you've been laying there for quite a minute now and you can't fall asleep\" appears. She touches behind her ear, guiding viewers to find a pressure point, with the text \"You're going to want to find this pressure point behind your ear that's a little sore.\" She continues rubbing the pressure point with the added tip \"And you want to rub that about 200 times\" displayed.\n\nThe final portion features the woman still sitting up and rubbing behind her ear, as the text \"You're going to fall asleep so fast\" is shown. She speaks to the camera once again. The scene transitions to her lying down while covered with a white blanket and advising \"And if you really wanna have an amazing sleep\" at the top. In the final piece of advice, the woman removes her yellow hoodie, showing a tattoo on her chest, along with the text \"Lose the pajamas and sleep naked.\"\n\nThroughout the video, audio narration synchronizes with the visual tips, providing a thorough, step-by-step guide to achieving the best sleep. Music often plays in the background when the narrator is not speaking, adding to the tranquil and instructional atmosphere of the guide."
+ }
+ ],
+ "events": [
+ "Woman in a yellow hoodie stands in a bathroom.",
+ "Text 'Here's how to have the best sleep of your entire life' appears at the top.",
+ "Woman gestures towards the camera while narrating the first tip.",
+ "Scene shifts to a darker room; woman pulls down window blinds.",
+ "Text 'Make sure your room is dark.' appears.",
+ "Woman turns off a television.",
+ "Text 'Turn off any electronics.' appears.",
+ "Scene shows her adjusting a thermostat.",
+ "Text 'And set your thermostat between 60\u00b0 and 70\u00b0' is displayed.",
+ "Woman speaks to the camera.",
+ "Text 'And then just try the 3-2-1 technique' appears at the top.",
+ "Standing in the kitchen, woman closes a refrigerator.",
+ "Text 'No more food for 3 hours before you sleep' is shown.",
+ "Woman writes at a desk.",
+ "Text 'No more work for 2 hours before you sleep' is displayed.",
+ "Scene transitions to woman sitting on a bed using a red phone.",
+ "Text 'No more screen time for 1 hour before you sleep' is highlighted.",
+ "Woman lies in bed, covered with a white blanket.",
+ "Text 'And then once you're in bed' appears.",
+ "Woman in the yellow hoodie continues to provide tips while lying in bed.",
+ "Text 'You can use this military technique to fall asleep in just two minutes' is displayed.",
+ "Woman emphasizes slow, deep breaths.",
+ "Text 'You have to take slow deep breaths.' is shown.",
+ "Next advice text 'And relax all your muscles from your face to your toes' appears.",
+ "Woman remains in bed as text is displayed.",
+ "Final instruction reads 'And once you're fully relaxed just repeat a word or a phrase.'",
+ "Scene shows text 'Like 'sleep' or 'don't think'.",
+ "Woman remains lying down.",
+ "Woman sits up in bed with a space-themed tapestry in the background.",
+ "Text 'And if you've been laying there for quite a minute now and you can't fall asleep' appears.",
+ "Woman touches behind her ear.",
+ "Text 'You're going to want to find this pressure point behind your ear that's a little sore.'",
+ "Woman rubs the pressure point.",
+ "Text 'And you want to rub that about 200 times' is displayed.",
+ "Woman continues rubbing behind her ear.",
+ "Text 'You're going to fall asleep so fast' appears.",
+ "Woman speaks to the camera again.",
+ "Scene transitions to her lying down, covered with a white blanket.",
+ "Text 'And if you really wanna have an amazing sleep' appears at the top.",
+ "Woman removes her yellow hoodie, showing a tattoo on her chest.",
+ "Text 'Lose the pajamas and sleep naked.' is displayed.",
+ "Audio narration synchronizes with visual tips throughout the video.",
+ "Background music plays when the narrator is not speaking."
+ ]
+ },
+ {
+ "video": "video/22.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a thrilling cricket match in progress. A bowler, dressed in a white uniform, is seen running towards the batsman to deliver a ball. The batsman, also attired in white, adopts a defensive stance, preparing to play. As the bowler releases the ball, the batsman attempts a shot but misses, resulting in the ball being caught by a fielder positioned behind him. The bowler then celebrates the wicket with exuberance, displaying a sense of triumph. This scene is accentuated by someone remarking over a loudspeaker, \"That's it. Beautiful bowling. Really nice from Moeen Ali. Good catch again from Ben Stokes.\" The crowd cheers and applauds, creating an atmosphere of excitement.\n\nThe focus shifts to a close-up of another cricket player in a white uniform and black cap, acknowledging the crowd. We see a group of fans in the audience holding a sign that reads \"ENG LAND,\" with the colorful letters representing their support for the team. The fans are animated, clapping and cheering. The camera returns to the cricket field where a group of players in white uniforms walk together, celebrating a successful play. The bowler delivers another ball to a batsman, who attempts a shot but is caught by a fielder, prompting another celebration from the bowler. A commentator describes the action over the loudspeaker, \"Just watch the way he just gets this a little bit higher, little bit of flight, little bit slower,\" as the crowd continues to cheer.\n\nThe scene transitions to a close-up of a player clapping, clearly in celebration. Two players walk together, their arms around each other, signifying camaraderie and shared joy. A group of players in white uniforms and black caps, some donning helmets, walk together, exchanging high-fives, further highlighting the celebratory mood. The action continues with a bowler delivering a ball to a defensively positioned batsman. The commentary intensifies with, \"Another one, and Moeen Ali's on a hat trick. This time, same again, Moeen Ali, outside edge, Ben Stokes,\" resonating through the cheers of an enthusiastic crowd.\n\nNext, we see a batsman in a white uniform and helmet trying to play a shot. The bowler delivers the ball, and the batsman misses. A fielder in white dives for a potential catch, adding to the tension. The umpire, dressed in a white jacket with 'ICC' and 'Fly Emirates' logos and wearing a white hat, stands observing the play. The focus then shifts to a fielding player in a white uniform and black cap, walking across the field, possibly indicating an end of play or a strategic change. The excitement builds with the commentator exclaiming, \"Oh! There we go! There we go! Surely! Surely! Surely!\" amidst the roar of the crowd.\n\nFollowing this, a player in a white uniform and black cap walks on the field, hinting at the conclusion of a play. Two players are then shown celebrating; one pats the other on the head, demonstrating their joy. A detailed graphical analysis overlays the screen, showing the ball\u2019s trajectory hitting the stumps, culminating in a successful wicket. A group of players in white uniforms celebrates this key moment, with excitement and camaraderie on full display. The commentator praises, \"They're going to reveal that, haven't you? Just to give him a chance. I think we'd figure that out. Well done. What a fantastic performance, and a great finale,\" as the crowd's applause and cheers form a lively backdrop.\n\nThe video culminates with a group of cricket players in white uniforms and black caps huddled together in celebration. They pat each other on the back and head, basking in the collective joy and unity of their achievement. The jubilant atmosphere is encapsulated by the commentator's closing words, while the crowd continues its rejoicing, creating a fitting end to the video highlighting team spirit and victory."
+ }
+ ],
+ "events": [
+ "Thrilling cricket match in progress",
+ "Bowler in white uniform running towards the batsman",
+ "Batsman in white uniform adopts a defensive stance",
+ "Bowler releases the ball",
+ "Batsman attempts a shot but misses",
+ "Ball is caught by a fielder positioned behind the batsman",
+ "Bowler celebrates the wicket with exuberance",
+ "Someone remarking over a loudspeaker, 'That's it. Beautiful bowling. Really nice from Moeen Ali. Good catch again from Ben Stokes.'",
+ "Crowd cheers and applauds",
+ "Close-up of a cricket player in a white uniform and black cap acknowledging the crowd",
+ "Group of fans in the audience holding a sign that reads 'ENG LAND'",
+ "Fans in the audience clapping and cheering",
+ "Group of players in white uniforms walk together celebrating",
+ "Bowler delivers another ball to a batsman",
+ "Batsman attempts a shot but is caught by a fielder",
+ "Bowler celebrates again",
+ "Commentator describes the action over the loudspeaker",
+ "Crowd cheers",
+ "Close-up of a player clapping in celebration",
+ "Two players walk together with their arms around each other",
+ "Group of players in white uniforms and black caps exchanging high-fives",
+ "Bowler delivers a ball to a defensively positioned batsman",
+ "Commentary intensifies",
+ "Batsman in white uniform and helmet tries to play a shot",
+ "Bowler delivers the ball, and batsman misses",
+ "Fielder in white dives for a potential catch",
+ "Umpire in a white jacket observes the play",
+ "Fielding player in a white uniform and black cap walks across the field",
+ "Commentator exclaims excitedly",
+ "Player in a white uniform and black cap walks on the field hinting at the conclusion of play",
+ "Two players celebrate; one pats the other on the head",
+ "Graphical analysis shows the ball's trajectory hitting the stumps",
+ "Group of players in white uniforms celebrates",
+ "Commentator praises the performance",
+ "Crowd's applause and cheers",
+ "Group of cricket players in white uniforms and black caps huddled together in celebration",
+ "Players pat each other on the back and head",
+ "Crowd rejoices",
+ "Commentator\u2019s closing words"
+ ]
+ },
+ {
+ "video": "video/23.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a small plastic bottle labeled 'Gallium Metal.' The label details its specifications: purity (99.99% min.), density (5.91 g/cm\u00b3), origin (China), and quantity (100g). The white bottle, which provides a stark contrast against the dark, plain background, is held by a hand with a light skin tone. The scene transitions to a top-down view showing the metallic contents inside the bottle, which is then placed on a flat surface and rolls slightly before stopping. \n\nAs the video unfolds, the audio reveals a man\u2019s voice explaining, \"Here I have some gallium, and at room temperature, it's completely solid.\" This is followed by a loud bang.\n\n The bottle is dropped onto a textured, dark surface, likely the floor of a vehicle, where it lies next to a black cable and a small black object. The hand picks up the bottle again, and the scene transitions to submerging it into a clear glass beaker filled with water. The bottle, partially filled with water, stands upright in the beaker as the clip concludes.\n\nDuring this segment, the audio provides crucial information: \"However, it has a super low melting point of only about 30 C, or 86 Fahrenheit. .\" A loud metallic clang is heard, followed by a male voice speaking.\n\nThe next scene displays a close-up of the bottle standing in the glass beaker. A hand begins pouring water from a kettle into the beaker, submerging the bottle completely. The video then shifts to a top-down view of the bottle inside the beaker, clearly showing the metallic contents settled at the bottom.\n\nAt this point, the audio explains: \"So, if I put it into a bowl of hot water, it should completely liquefy.\" The man speaks as the sound of water being poured into a bowl is heard, followed by the man's voice resuming.\n\nThe focus shifts to a top-down view of the submerged bottle. 
A hand reaches into the beaker, lifts the bottle out of the water, and tilts it to pour the metallic contents into the palm of the other hand. The liquid gallium, shiny and reflective, forms a pool that is gently touched and manipulated by the hand. The dark background accentuates the contrast with the white bottle and metallic gallium.\n\nAccompanying this detailed visual, the audio notes: \"It all eventually melted, and what I have now, kind of looks like mercury. Unlike mercury though, it's not nearly as toxic, and it's much safer to hold in my hand.\" The man's voice is notably high-pitched.\n\nThe video culminates with a close-up of the hand holding the shiny pool of liquid gallium. The fluid properties of the gallium are showcased as the hand touches and manipulates it, eventually transitioning to a wider view of the hand resting on the dark, plain surface. The fingers spread and shape the gallium, emphasizing its malleability and metallic luster. The hand continues to play with the liquid metal, underlining its unique characteristics.\n\nThe audio highlights a significant trait of gallium: \"However, it is a lot messier and it tends to stick to things and leave an annoying residue.\" The man's voice, distinct with a British accent, is monotone as he describes this aspect of the element.\n\nOverall, the video thoroughly documents the properties of gallium metal, with visual demonstrations and an informative dialogue working in sync to offer a comprehensive understanding of the element\u2019s characteristics."
+ }
+ ],
+ "events": [
+ "Close-up of a small plastic bottle labeled 'Gallium Metal.'",
+ "Label details: purity (99.99% min.), density (5.91 g/cm\u00b3), origin (China), quantity (100g).",
+ "White bottle against dark, plain background held by a hand with a light skin tone.",
+ "Transition to a top-down view showing metallic contents inside the bottle.",
+ "Bottle placed on a flat surface, rolls slightly before stopping.",
+ "Audio: Man's voice explains 'Here I have some gallium, and at room temperature, it's completely solid.'",
+ "Loud bang is heard.",
+ "Bottle dropped onto a textured, dark surface, likely the floor of a vehicle.",
+ "Bottle lies next to a black cable and a small black object.",
+ "Hand picks up the bottle again.",
+ "Scene transitions to submerging the bottle into a clear glass beaker filled with water.",
+ "Bottle stands upright in the beaker, partially filled with water.",
+ "Audio: 'However, it has a super low melting point of only about 30 C, or 86 Fahrenheit.'",
+ "Loud metallic clang is heard, followed by male voice speaking.",
+ "Close-up of the bottle standing in the glass beaker.",
+ "Hand pours water from a kettle into the beaker, submerging the bottle completely.",
+ "Top-down view of the bottle inside the beaker shows metallic contents settled at the bottom.",
+ "Audio: 'So, if I put it into a bowl of hot water, it should completely liquefy.'",
+ "Sound of water being poured into a bowl.",
+ "Audio: Man's voice resumes.",
+ "Top-down view of the submerged bottle.",
+ "Hand reaches into the beaker, lifts the bottle out of the water, tilts it to pour contents into palm of the other hand.",
+ "Liquid gallium forms a pool in the hand, shiny and reflective.",
+ "Hand gently touches and manipulates the liquid gallium.",
+ "Dark background contrasts with white bottle and metallic gallium.",
+ "Audio: 'It all eventually melted, and what I have now, kind of looks like mercury.'",
+ "Audio: 'Unlike mercury though, it's not nearly as toxic; it's much safer to hold in my hand.'",
+ "Man's voice is high-pitched.",
+ "Close-up of hand holding shiny pool of liquid gallium.",
+ "Hand touches and manipulates the fluid gallium.",
+ "Transition to a wider view of the hand resting on dark, plain surface.",
+ "Fingers spread and shape the gallium, showing its malleability and metallic luster.",
+ "Hand continues to play with liquid metal.",
+ "Audio: 'However, it is a lot messier and it tends to stick to things and leave an annoying residue.'",
+ "Man's voice, British accent, is monotone as he describes this aspect.",
+ "Video documents properties of gallium metal with visual demonstrations and informative dialogue."
+ ]
+ },
+ {
+ "video": "video/24.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a wooden table displaying the text \"$5 AMAZON POWER SUPPLY.\" A cardboard box with shipping labels is placed on the table. The scene transitions to an Amazon product page, showing an HP power supply priced at $5.49, last purchased on January 10, 2022. The view zooms in on the price and delivery details before shifting back to the cardboard box. A man uses a box cutter to open the box, revealing the power supply inside, wrapped in plastic.\n\nThroughout this sequence, a male voiceover narrates, \"This is the cheapest power supply I could find on Amazon, which I got for a little over $5. It's a 240W unit that's meant to go in a proprietary HP small form factor PC.\" Background music plays softly as the narration continues.\n\nThe next scene presents a close-up of the open cardboard box, showing the power supply unit surrounded by plastic air cushions. A pair of hands removes the unit, as the overlay text \"$5 AMAZON POWER SUPPLY\" reappears at the top of the screen. The power supply is placed on the wooden table, and the person displays various connectors: a 24-pin connector, a 4-pin connector, and other smaller connectors. A close-up emphasizes the details of these connectors.\n\nThe audio continues with the male voice explaining, \"But it still has all the standard ATX connections we would expect, like our 24 pin, 4 pin 12 volt, SATA, as well as a floppy drive connector, and this weird 4 pin fan header,\" while the background music continues.\n\nNext, a wooden table displays a motherboard placed on a piece of cardboard. The person's hands connect cables from the power supply unit to the motherboard, with the text overlay \"testbench\" showing on screen. Connectors, including a 24-pin and others, are connected to the motherboard. The view transitions to a close-up of the man plugging a power cable into a wall socket. 
The man uses a screwdriver to press a button on the motherboard, presumably to power it on.\n\nThe audio voiceover describes, \"I use our Pavilion test bench from a previous video to make sure that it was working. Everything seemed to be looking good until I tried to actually boot it up,\" as gentle background music plays.\n\nAnother scene shows the wooden table with the motherboard on cardboard. Hands connect cables from the power supply unit labeled \"testbench.\" The connectors are connected, transitioning to a close-up of the man plugging a power cable into a wall socket. The narration continues, \"Thinking that there might be something wrong with my test bench, I tried out a known working power supply, and it booted up just fine. So I thought maybe there was something I needed to do with that weird fan header or that this was just the result of buying a $5 power supply,\" with the background music persisting.\n\nThen, a view of the power supply unit on the wooden table displays colorful cables extending from it, with a window in the background. The man's hands connect a cable to the motherboard and then plug a power cable into the power supply unit. A close-up highlights a red switch labeled \"230\" and \"115.\" The man uses a screwdriver to press a motherboard button.\n\nThe narration follows, \"But while testing things out, I noticed that the input voltage switch was actually set to 230 volts instead of 115,\" as the initial narration gives way to background music and another man's voice.\n\nFinally, the wooden table holds a motherboard on cardboard, with the power supply unit and colorful cables in the background. A hand gives a thumbs-up gesture, suggesting a successful setup. The scene changes to a close-up of another motherboard labeled \"GIGABYTE,\" showing various components. 
The video concludes with a black screen displaying \"hardware haven\" in white letters.\n\nThe last piece of audio narration says, \"So after correcting that issue, it fired up just fine. I'll be using this cheap power supply for a different project later on, so be ready for that,\" as the music continues in the background."
+ }
+ ],
+ "events": [
+ "The video opens with a wooden table displaying the text '$5 AMAZON POWER SUPPLY.'",
+ "A cardboard box with shipping labels is placed on the table.",
+ "The scene transitions to an Amazon product page, showing an HP power supply priced at $5.49, last purchased on January 10, 2022.",
+ "The view zooms in on the price and delivery details.",
+ "The view shifts back to the cardboard box.",
+ "A man uses a box cutter to open the box, revealing the power supply inside, wrapped in plastic.",
+ "A male voiceover narrates about the cheap power supply on Amazon.",
+ "Background music plays softly as the narration continues.",
+ "The next scene presents a close-up of the open cardboard box, showing the power supply unit surrounded by plastic air cushions.",
+ "A pair of hands removes the unit as the overlay text '$5 AMAZON POWER SUPPLY' reappears at the top of the screen.",
+ "The power supply is placed on the wooden table.",
+ "The person displays various connectors: a 24-pin connector, a 4-pin connector, and other smaller connectors.",
+ "A close-up emphasizes the details of these connectors.",
+ "The male voice continues explaining about the standard ATX connections.",
+ "Background music continues.",
+ "A wooden table displays a motherboard placed on a piece of cardboard.",
+ "The person's hands connect cables from the power supply unit to the motherboard.",
+ "Text overlay 'testbench' shows on screen.",
+ "Connectors, including a 24-pin and others, are connected to the motherboard.",
+ "The view transitions to a close-up of the man plugging a power cable into a wall socket.",
+ "The man uses a screwdriver to press a button on the motherboard, presumably to power it on.",
+ "The audio voiceover describes using a test bench to ensure the power supply is working.",
+ "Background music continues.",
+ "Another scene shows the wooden table with the motherboard on cardboard.",
+ "The person's hands connect cables from the power supply unit labeled 'testbench.'",
+ "The connectors are connected, transitioning to a close-up of the man plugging a power cable into a wall socket.",
+ "The narration continues about potentially needing to do something with the weird fan header or the result of a $5 power supply.",
+ "Background music persists.",
+ "A view of the power supply unit on the wooden table displays colorful cables extending from it, with a window in the background.",
+ "The man's hands connect a cable to the motherboard and plug a power cable into the power supply unit.",
+ "A close-up highlights a red switch labeled '230' and '115.'",
+ "The man uses a screwdriver to press a motherboard button.",
+ "The narration mentions noticing the input voltage switch set to 230 volts instead of 115.",
+ "Background music continues and another man's voice is heard.",
+ "The wooden table holds a motherboard on cardboard, with the power supply unit and colorful cables in the background.",
+ "A hand gives a thumbs-up gesture, suggesting a successful setup.",
+ "The scene changes to a close-up of another motherboard labeled 'GIGABYTE,' showing various components.",
+ "The video concludes with a black screen displaying 'hardware haven' in white letters.",
+ "The final piece of audio narration mentions correcting the issue and the power supply working fine.",
+ "The narration ends with a note about using the cheap power supply for a different project later on.",
+ "Background music continues as the video finishes."
+ ]
+ },
+ {
+ "video": "video/25.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up shot of the ground, showing a few scattered leaves on a paved surface. The scene transitions to an outdoor parking lot where a man in a light beige sweatshirt inscribed 'BEAR COAST COFFEE' and black shorts points behind him. He gestures animatedly while talking. The background reveals trees and a clear blue sky. As the man continues to gesture, he occasionally points to his head and behind him. The camera then captures a gray SUV parked in the lot. The man walks towards the SUV, stands next to it, and continues his animated discussion.\n\nThe audio reveals the man saying, \"What is up, guys? Right behind me is a Volvo XC90, and I guarantee you did not know this car was this luxurious. I've always liked the way these SUVs look; you have this...\" Cars driving by can be heard in the background.\n\nNext, the focus stays on the same man standing next to the front of the dark gray SUV, specifically a Volvo. He bends down, pointing to the front grille. The background consistently shows a tree and a clear blue sky. As the discussion continues, he moves to the side of the vehicle, showcasing the headlight, side mirror, and other features. He then opens the driver\u2019s side door while still engaged in conversation.\n\nAuditory cues indicate that the man is saying, \"Bright, strong chrome grille, strong hood lines, and you have your Thor's hammer LED lights that even pulse with power. The most surprising part is once you open up the car...\" This is followed by the sound of the car door opening.\n\nThe narrative transitions to the man standing next to the open driver's side door. He points to various interior features on the inside of the door, including the handle and control buttons, while discussing them. In the background, a grassy area with trees and blue skies is visible. The man then sits inside the SUV, pointing to the sunroof and other interior features like the dashboard and seats. 
The man also touches the front passenger seat, highlighting its texture and design.\n\nThe audio captures, \"You have real wood, metal here, metal there, real leather, and it gets even better. To close my moonroof, all I do is swipe my finger like this. You have these super soft leather seats that are going to be...\" The background noise includes birds chirping.\n\nSeated in the driver\u2019s seat, the man continues to gesture and speak. He holds up his fingers and points to the dashboard and various controls on the center console, explaining their functionalities. The man touches the dashboard, points to a small speaker, and then highlights a crystal-like gear shift, explaining its design and use. Finally, he focuses on the infotainment screen displaying a menu.\n\nThe audio description includes, \"Heated, ventilated, and they massage you. You've got wood, more leather, a Bowers & Wilkins sound system, a glass crystal as a shifter. Your touchscreen is Apple CarPlay, Android Auto, and...\"\n\nMoving into a close-up of the SUV's interior, the camera focuses on the infotainment screen and dashboard, which displays a map and various controls. The man, seated in the driver\u2019s seat with hands resting on his lap, continues the narrative while occasionally gesturing. The scene shifts to the back seat, where he points out rear features like the seats and sunroof, continuing his explanation.\n\nThe audio provides, \"Hey Google, tell me a joke. What do you call a belt made of watches? A waste of time. In the second row, I have good legroom...\"\n\nConcluding the video, in the back seat, the man points to the window, sunroof, the rear of the front seat, and the rear console while explaining their features. 
He also adjusts the seat to demonstrate the functionality of the rear console and points to a child car seat, describing its features and security measures.\n\nThe audio captures, \"I can recline, my own sunscreen shade, a great view, my own heated seats, climate control, built-in cup holders, a built-in booster seat, and enough room for a car seat...\""
+ }
+ ],
+ "events": [
+ "Close-up shot of the ground showing scattered leaves on a paved surface",
+ "Outdoor parking lot scene",
+ "Man in a light beige sweatshirt and black shorts gestures and points behind him",
+ "Background includes trees and a clear blue sky",
+ "Man points to his head and behind him while talking",
+ "Camera captures gray SUV parked in the lot",
+ "Man walks towards SUV and stands next to it, continuing animated discussion",
+ "Audio: 'What is up, guys? Right behind me is a Volvo XC90, and I guarantee you did not know this car was this luxurious. I've always liked the way these SUVs look; you have this...'",
+ "Sounds of cars driving by in the background",
+ "Focus on man standing next to front of dark gray Volvo SUV",
+ "Man bends down and points to front grille",
+ "Background shows tree and clear blue sky",
+ "Man moves to side of vehicle, showcasing headlight, side mirror, and other features",
+ "Man opens driver\u2019s side door while continuing conversation",
+ "Audio: 'Bright, strong chrome grille, strong hood lines, and you have your Thor's hammer LED lights that even pulse with power. The most surprising part is once you open up the car...'",
+ "Sound of car door opening",
+ "Narrative shifts to man standing next to open driver\u2019s side door",
+ "Man points to various interior features on inside of door including handle and control buttons",
+ "Background shows grassy area with trees and blue skies",
+ "Man sits inside SUV, points to sunroof and other interior features like dashboard and seats",
+ "Man touches front passenger seat, highlighting texture and design",
+ "Audio: 'You have real wood, metal here, metal there, real leather, and it gets even better. To close my moonroof, all I do is swipe my finger like this. You have these super soft leather seats that are going to be...'",
+ "Sounds of birds chirping in the background",
+ "Man seated in driver\u2019s seat continues to gesture and speak",
+ "Man holds up fingers, points to dashboard and various controls on center console",
+ "Man touches dashboard, points to small speaker, highlights crystal-like gear shift",
+ "Man explains design and use of gear shift",
+ "Focus on infotainment screen displaying a menu",
+ "Audio: 'Heated, ventilated, and they massage you. You've got wood, more leather, a Bowers & Wilkins sound system, a glass crystal as a shifter. Your touchscreen is Apple CarPlay, Android Auto, and...'",
+ "Close-up of SUV's interior, camera focuses on infotainment screen and dashboard",
+ "Dashboard displays map and various controls",
+ "Man seated in driver\u2019s seat with hands resting on lap, continues narrative, occasionally gesturing",
+ "Scene shifts to back seat, man points out rear features like seats and sunroof",
+ "Audio: 'Hey Google, tell me a joke. What do you call a belt made of watches? A waste of time. In the second row, I have good legroom...'",
+ "Man in back seat points to window, sunroof, rear of front seat, and rear console",
+ "Man adjusts seat to demonstrate functionality of rear console",
+ "Man points to child car seat, describes features and security measures",
+ "Audio: 'I can recline, my own sunscreen shade, a great view, my own heated seats, climate control, built-in cup holders, a built-in booster seat, and enough room for a car seat...'"
+ ]
+ },
+ {
+ "video": "video/26.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens in a dimly lit recording studio where three men are gathered. The central figure, dressed in a white graphic t-shirt and striped cardigan, prominently displays a large shiny necklace with the text \"TRAP HOUSE.\" The man on the left holds a large microphone, hinting at an interview or recording session, while the one on the right, wearing a dark t-shirt with a red design, holds a drink. The backdrop showcases a guitar and a keyboard, suggesting a musical setting. As the central figure continues to show off the necklace, the man to the right gestures animatedly, perhaps emphasizing a point. The ambient lighting is low, punctuated by a subtle purple glow in the top right corner, enhancing the room's atmosphere. The scene closes with a close-up of the central figure now wearing large headphones, standing before a professional microphone setup, indicating the start or continuation of a recording.\n\nThe audio aligns as follows: \"You know what? You see us, we in the session turning up. You got 10 minutes. If you fuck up my nigga's vibe, I'ma fuck you up. So let's get this going like right now. Wow, this is incredible. I can't believe that we're here right now.\"\n\nThe scene transitions to a brighter setting still focused on the central figure with large headphones, continuing from the previous frame. The man remains in front of a professional microphone setup, his attire unchanged. Another man, in a dark jacket with a large microphone, observes closely, perhaps conducting the interview. The backdrop retains its dim ambiance. The focal man appears to speak or sing into the microphone while the observer attentively watches. This sequence emphasizes an ongoing recording or interview.\n\nThe audio captures: \"Shit. Real trap shit. Trapaholics. We make it look easy. Walk with your boys. Damn, son. 
Where'd you find this?\" Voices include a mix of speech and music.\n\nThe camera then zooms in on the central figure again, detailed with the same attire and accessories. The bright lighting highlights the professional microphone and another man in a dark jacket close by. The scene expands to reveal two more seated figures, both in dark t-shirts with red designs. The central figure continues engaging with the microphone, while the adjacent man continues observing. The climax of this segment comes with the central figure gesturing towards the observer, accompanied by on-screen text: \"Stop with the questions.\"\n\nAccompanying audio reveals: \"Okay, that's incredible. That makes everything better. How do you do it? I don't know, man. Where did you guys first meet?\" This interaction appears as a back-and-forth dialogue.\n\nNext, the main man, now holding a drink, maintains the same outfit. The dark-jacketed man with the furry microphone is still present, suggesting continuity in the interview or session. The dim studio ambiance persists. As the central figure sips his drink, the dark-jacketed man is speaking, and text appears on screen: \"You don't even gotta answer that if you don't want to,\" providing reassurance. The clip concludes with the main figure looking towards the speaker, while the background men are engaged in conversation.\n\nThe audio includes: \"Where did you guys meet and why do you do it? Uh... Hey, shh. I'm talking over here. Hey, hold on. Don't shush me. This is my session. I'm a news reporter. I'm getting a new story.\" The conversation flows between several men, each taking turns to speak.\n\nFinally, a close-up features three figures in the dimly lit studio. The leftmost man is in a dark t-shirt with a red design and a cap. The center figure, also in a dark t-shirt with a red design, gestures emphatically, while the rightmost man in a dark jacket holds the furry microphone. The lighting maintains a low, intimate atmosphere. 
They seem to have an argument. Abruptly, the scene cuts to color bars, signaling a possible technical interruption or the session's end.\n\nThe audio builds to a climactic exchange: \"You don't come in my face. Whoa, whoa, whoa, whoa, whoa. Who the f***?\" followed by a loud beep, signaling an intense and sudden disturbance.\n\nThrough these interconnected clips, the narrative reveals a dynamic recording session filled with emphatic interactions, overlaying an ambiance rich in musical context and spontaneous dialogue."
+ }
+ ],
+ "events": [
+ "Video opens in a dimly lit recording studio with three men gathered",
+ "Central figure wears a white graphic t-shirt, striped cardigan, and a large shiny necklace with text 'TRAP HOUSE'",
+ "Man on the left holds a large microphone",
+ "Man on the right, wearing a dark t-shirt with a red design, holds a drink",
+ "Backdrop showcases a guitar and a keyboard",
+ "Central figure continues to show off the necklace",
+ "Man on the right gestures animatedly",
+ "Ambient lighting is low with a subtle purple glow in the top right corner",
+ "Scene closes with a close-up of the central figure wearing large headphones, standing before a professional microphone setup",
+ "Audio: 'You know what? You see us, we in the session turning up... I can't believe that we're here right now.'",
+ "Scene transitions to a brighter setting",
+ "Central figure with large headphones in front of a professional microphone setup",
+ "Another man in a dark jacket with a large microphone observes closely",
+ "Backdrop retains dim ambiance",
+ "Central figure speaks or sings into the microphone",
+ "Observer attentively watches",
+ "Audio: 'Shit. Real trap shit. Trapaholics. We make it look easy. Walk with your boys. Damn, son. Where'd you find this?'",
+ "Camera zooms in on the central figure, detailing same attire and accessories",
+ "Bright lighting highlights the professional microphone",
+ "Another man in a dark jacket close by",
+ "Scene expands to reveal two more sitting figures in dark t-shirts with red designs",
+ "Central figure continues engaging with the microphone",
+ "Adjacent man continues observing",
+ "Central figure gestures toward the observer",
+ "On-screen text: 'Stop with the questions'",
+ "Audio: 'Okay, that's incredible. That makes everything better. How do you do it?... Where did you guys first meet?'",
+ "Main man now holds a drink, maintains same outfit",
+ "Dark-jacketed man with furry microphone still present",
+ "Dim studio ambiance persists",
+ "Central figure sips his drink while dark-jacketed man speaks",
+ "On-screen text: 'You don't even gotta answer that if you don't want to'",
+ "Audio: 'Where did you guys meet and why do you do it?... I'm getting a new story.'",
+ "Close-up of three figures in the dimly lit studio",
+ "Leftmost man in a dark t-shirt with a red design and a cap",
+ "Center figure, also in a dark t-shirt with a red design, gestures emphatically",
+ "Rightmost man in a dark jacket holds the furry microphone",
+ "Low, intimate lighting atmosphere",
+ "They seem to have an argument",
+ "Scene cuts to color bars",
+ "Audio: 'You don't come in my face. Whoa, whoa, whoa, whoa, whoa. Who the f***?' followed by a loud beep"
+ ]
+ },
+ {
+ "video": "video/27.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a man wearing a black t-shirt speaking into a microphone mounted on a stand. The background features a white wall and a white bookshelf filled with various books. The text '@ishansharma7390' appears in the bottom left corner, along with a 'Share' button at the bottom center of the screen. The man gestures with his hands while emphasizing points and picks up a book titled 'Homo Deus' by Yuval Noah Harari, flipping through its pages and occasionally looking at the camera.\n\nThe audio reveals: \"The average person only reads one book every single year. That is crazy to imagine. Most of us are not able to read through the complete book because first of all they just don't have the attention span through the complete book.\" A general description of the audio events of the current clip is: A man speaks.\n\nThe video continues with the man still speaking into the microphone. As he flips through 'Homo Deus', he emphasizes points with hand gestures. The visual elements and text on the screen remain the same. The audio continues: \"And secondly if you're a college student you might start saying people have many and you just don't see enough worth into reading these books what you can do.\" \n\nThe scene transitions, displaying a black screen with the text 'Ad-free with Music Premium' before showing a search bar where 'one up on wall street' is typed, producing search suggestions related to 'One Up on Wall Street' by Peter Lynch. A search result for the book, featuring a thumbnail and video duration of 2:10:22, appears next. 
The video then returns to the man in the black t-shirt continuing to speak into the microphone.\n\nIn the audio, the same man says: \"And this is what I started doing recently just go to youtube and search the name of the book and just write audiobook after that there would be a high chance that you would find that book available on YouTube as an audiobook.\"\n\nThe visual sequence maintains the setup with the man speaking into the microphone. The red text 'Listen 30mins a day' appears, followed by '3 Books/month' in red, then returning to the man speaking.\n\nThe audio continues: \"Listen to it, make sure that you listen to it at least 30 minutes every single day and you would then be able to read three books every single month.\" A general description of the audio events of the current clip is: A man is speaking, and then he takes a deep breath. He continues to speak, and then he takes another deep breath. He continues to speak, and then he takes...\n\nThe video transitions to showing '~36 Books/yr' in red at the bottom. Next, it shows a close-up of a notebook containing handwritten notes titled 'One Up On Wall Street - Peter Lynch'. The notes include several guidelines about investing.\n\nThe accompanying audio states: \"That is 36 books every single year. The normal man is reading one book every single year. You are reading one book in 10 days. This is huge. Make sure that you also make notes as much as you can.\" A man speaks quickly and loudly, and then pauses for a moment before continuing to speak.\n\nFinally, the video shifts back to the man in the black t-shirt, with the same background and text overlays.\n\nIn the audio, the man advises: \"That is what I do and just refer back to them whenever necessary. And that is how I read one book every single week.\" A general description of the audio events of the current clip is: The man is speaking while music is playing in the background.\n\nThe video concludes with a black screen stating 'Hit Like!' 
accompanied by a thumbs-up icon, and 'Follow @ishansharma7390 for more!' next to an Instagram icon."
+ }
+ ],
+ "events": [
+ "Man in a black t-shirt speaks into a microphone on a stand",
+ "Background: white wall and white bookshelf filled with various books",
+ "Text '@ishansharma7390' appears in the bottom left corner",
+ "Share button appears at the bottom center of the screen",
+ "Man gestures with his hands while emphasizing points",
+ "Man picks up a book titled 'Homo Deus' by Yuval Noah Harari",
+ "Man flips through the pages of 'Homo Deus'",
+ "Man occasionally looks at the camera",
+ "Audio: 'The average person only reads one book every single year. That is crazy to imagine...'",
+ "Man continues to speak into the microphone while flipping through 'Homo Deus'",
+ "Visual elements and text on screen remain the same",
+ "Audio: 'And secondly if you're a college student you might start saying...'",
+ "Scene transitions to a black screen with text 'Ad-free with Music Premium'",
+ "Black screen transitions to a search bar typing 'one up on wall street'",
+ "Search suggestions related to 'One Up on Wall Street' by Peter Lynch appear",
+ "Search result for the book with a thumbnail and video duration of 2:10:22 appears",
+ "Video returns to man in black t-shirt speaking into the microphone",
+ "Audio: 'And this is what I started doing recently just go to youtube...'",
+ "Visual sequence maintains setup with man discussing into microphone",
+ "Red text 'Listen 30mins a day' appears",
+ "Red text '3 Books/month' appears",
+ "Video returns to man speaking into the microphone",
+ "Audio: 'Listen to it, make sure that you listen to it at least 30 minutes every single day...'",
+ "Video transitions to '~36 Books/yr' in red at the bottom",
+ "Close-up of a notebook with handwritten notes titled 'One Up On Wall Street - Peter Lynch'",
+ "Notes include several guidelines about investing",
+ "Audio: 'That is 36 books every single year. The normal man is reading one book every single year...'",
+ "Man speaks quickly and loudly, then pauses before continuing to speak",
+ "Video shifts back to man in black t-shirt with same background and text overlays",
+ "Audio: 'That is what I do and just refer back to them whenever necessary...'",
+ "Man speaks while music plays in the background",
+ "Video concludes with black screen stating 'Hit Like!' with a thumbs-up icon",
+ "Text 'Follow @ishansharma7390 for more!' appears next to Instagram icon"
+ ]
+ },
+ {
+ "video": "video/28.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a hand with discolored fingers, showing a dark blue or black hue. The hand is held up in front of a mirror, reflecting the image. The background is a warmly lit room with wooden elements. A man in a light blue shirt captures the scene with a smartphone. On the screen, the text 'Doctor explains' is displayed in red. The hand moves slightly, showcasing different angles of the discolored fingers. The scene transitions to a more neutral indoor environment with a couch, where a man in a white outfit examines the same discolored hand.\n\nMeanwhile, the audio reveals the beginning of the story with a bell ringing, followed by a man saying, \"This is the story of Fahad Badar.\"\n\nThe scene shifts to the man in a gray jacket and black gloves holding a white banner with the text 'Qatar Cancer Society' against a backdrop of snow-covered mountains under a clear blue sky. Then, another man in a black and yellow mountaineering suit sits on a snowy peak, holding a maroon and white flag, likely representing Qatar. A close-up of a white oxygen cylinder with the German text 'Technischer Sauerstoff O2' and safety symbols suggests the use of oxygen at high altitudes.\n\nThe audio continues, explaining, \"Fahad is a Qatari mountaineering champion. He recently climbed to the summit of Pakistan's Broad Peak mountain, which stands at 8,000 meters. Unfortunately, at the top, he ran out of oxygen.\"\n\nNext, a man in a yellow and black mountaineering suit with climbing gear stands on a snowy slope, preparing to climb. Other climbers are visible in the distance, highlighting the high-altitude setting. The focus returns to the discolored hand from earlier clips, now held by the man in a light blue shirt in a warm-lit room with wooden elements. 
The scene transitions to a medical diagram showing the progression of frostbite, from normal skin to deep frostbite, with images of affected hands.\n\nThe audio informs us, \"And he had to spend the entire night at the top of the mountain without any supplemental oxygen. This led to deep frostbite, gangrene of his fingers, tissue death. This is the most severe stage.\"\n\nA detailed medical diagram appears, illustrating frostbite stages, labeled with terms like epidermis, dermis, and subcutaneous tissue against a warm background. The scene changes to someone in a black shirt explaining the diagram in a medical setting with an IV drip and other medical equipment visible.\n\nThe narration continues, \"The damage usually extends to the tissue underneath the skin causing paralysis and lack of sensation. He likely received emergency medical treatment with clot busters and medication to widen the blood vessels to improve blood flow in order to save some tissue.\"\n\nThe imagery moves to a close-up of an IV drip and medical equipment, emphasizing a clinical environment. Then, it returns to a close-up of the discolored hand, held up before a mirror in a warmly lit room with wooden elements, captured by a smartphone.\n\nThe final narration delivers the conclusion, \"Unfortunately, this appeared not to be successful, so he will likely have an amputation.\" The audio is accompanied by the sound of an amputation procedure, emphasizing the gravity of the situation.\n\nThe complete audio-visual narrative tells the tragic yet heroic story of Fahad Badar, a Qatari mountaineer who braved extreme conditions and suffered severe frostbite, eventually facing possible amputation."
+ }
+ ],
+ "events": [
+ "Close-up of a hand with discolored fingers in dark blue or black hue.",
+ "Hand held up in front of a mirror reflecting the image.",
+ "Background shows a warmly lit room with wooden elements.",
+ "Man in light blue shirt captures the scene with smartphone.",
+ "Text 'Doctor explains' displayed in red on the smartphone screen.",
+ "Hand moves, showing different angles of discolored fingers.",
+ "Scene transitions to neutral indoor environment with a couch.",
+ "Man in a white outfit examines the discolored hand.",
+ "Audio begins with bell ringing.",
+ "Man says, 'This is the story of Fahad Badar.'",
+ "Scene shifts to man in gray jacket and black gloves holding white banner with 'Qatar Cancer Society' against snow-covered mountains.",
+ "Another man in black and yellow mountaineering suit sits on snowy peak with maroon and white flag (probably representing Qatar).",
+ "Close-up of white oxygen cylinder with German text 'Technischer Sauerstoff O2' and safety symbols.",
+ "Audio explains, 'Fahad is a Qatari mountaineering champion.... at the top, he ran out of oxygen.'",
+ "Man in yellow and black mountaineering suit with climbing gear stands on snowy slope, preparing to climb.",
+ "Other climbers visible in the distance, highlighting high-altitude setting.",
+ "Focus returns to discolored hand held by man in light blue shirt in warm-lit room with wooden elements.",
+ "Scene transitions to medical diagram showing progression of frostbite from normal skin to deep frostbite with images of affected hands.",
+ "Audio informs, '...This led to deep frostbite, gangrene of his fingers, tissue death... most severe stage.'",
+ "Detailed medical diagram showing frostbite stages labeled with epidermis, dermis, subcutaneous tissue.",
+ "Scene changes to someone in a black shirt explaining diagram in a medical setting with an IV drip and other medical equipment visible.",
+ "Narration explains, 'Damage extends to tissue underneath skin causing paralysis and lack of sensation.... received emergency medical treatment with clot busters and medication to widen blood vessels.'",
+ "Close-up of IV drip and medical equipment in a clinical environment.",
+ "Return to close-up of discolored hand before a mirror in warmly lit room with wooden elements, captured by smartphone.",
+ "Final narration concludes, '...appeared not to be successful, so he will likely have an amputation.'",
+ "Sound of an amputation procedure emphasizes gravity of the situation."
+ ]
+ },
+ {
+ "video": "video/29.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a young woman with long brown hair tied back, wearing a light blue shirt with white polka dots. She stares at the camera looking scared and talks to someone behind it, set against a backdrop of green light. The top of the frame is labeled 'Label: FREAK' in red and white. The text at the bottom reads, 'Subject is Venessa Hart age 16 and it looks like her birthday is tomorrow!' The woman's voice introduces herself and mentions her upcoming birthday. The bottom text changes to, \"I'll get you a gift if you're a good listener today,\" followed by another change to, \"I'm turning 17 tomorrow, not 7!\" This indicates a conversation in which she is potentially being coerced.\n\nAfter a plain black frame, the text at the top reads 'Fiesty!' Then the same visual setting reappears, followed by 'If you don't behave though, there's consequences!' at the bottom. A man's voice issues a stern warning, indicating authority over her. After another plain black frame, the text changes to 'I know you have 3 powers and one of them is telepathy. *hands her photo*' as a photo is handed to her. After the black frame appears again briefly, the text reads, 'That's your foster mom, tell me where she is.' This suggests that any resistance comes with a price, and that he knows specific details about her abilities. Vanessa hesitates, reluctantly participates, then looks at the paper.\n\nThe scene transitions to Vanessa holding a piece of paper, with the setting unchanged. The bottom text asks, 'What do you see?' Vanessa closes her eyes and mutters, 'I see water...' The video shows a view of water and a woman in black pants walking on the beach, matching the text's declaration. Next, the frame shifts to a beach with trees, and the text now reads, 'And I see trees.' Vanessa repeats, 'She's at a beach,' indicating she is using her powers of perception. 
The woman's voice clarifies Vanessa's foster mother's location. \n\nBack in the initial setting, the text changes to 'Excellent, yes, we had her sent on a little vacation.' The screen then goes black and the girl reappears with red text implying that a subordinate is helping the man: 'I'll get her hooked up to the machine, boss.' When the scene returns, Vanessa has a device attached to her head, and a black pen hangs before the camera, which she is controlling with her telekinesis. She puts two fingers on her right temple and holds her left hand towards the camera, palm facing it. The man's awe is evident in his voice, marked by the text 'This is amazing!' in black. The following text in red reads, 'Telekinesis is such a cool power!' showing what the subordinate says.\n\nContinuing this tense atmosphere, the man commands again with text that reads, 'I wanna see her lift something heavier!' A short, frightened gasp is heard as a black screen appears. Then the man demands, 'I want you to lift this 500lb weight,' while Vanessa shakes her head, not believing she can do it. The text 'No, I can't!' in blue conveys her strained refusal. Nevertheless, the demand persists, 'I need to see how strong you are!', ultimately ending with a forceful 'Do it!' A close-up shows Vanessa putting her fingers on her right temple to try.\n\nThe background reappears and Vanessa holds two fingers to her temple, straining. The text in red, spoken by the subordinate, affirms her success: 'She's lifting it!' The scene blacks out with the stern command 'Okay, enough.' in black. The text reappears on the returning visual. Vanessa's nosebleed is hinted at by the subordinate's text, 'Uh boss her nose!' She looks at the camera when the boss says 'She's fine.' 
Finally, Vanessa faints.\n\nThrough these coherent clips, the video tells a fragmented narrative of Venessa Hart, who exhibits powers like telepathy and telekinesis and is being controlled by an unseen authority, possibly for their advantage. The setting, character, and ongoing conversation seamlessly stitch the narrative together."
+ }
+ ],
+ "events": [
+ "A young woman with long brown hair tied back, wearing a light blue shirt with white polka dots, stares at the camera.",
+ "The woman looks scared and talks to someone behind the camera.",
+ "The backdrop is a green light, and the top of the frame is labeled 'Label: FREAK' in red and white.",
+ "Text at the bottom reads, 'Subject is Venessa Hart age 16 and it looks like her birthday is tomorrow!'",
+ "The woman's voice introduces herself and mentions her upcoming birthday.",
+ "The bottom text changes to, 'I'll get you a gift if you're a good listener today,' and then to, 'I'm turning 17 tomorrow, not 7!'",
+ "An image of plain black appears with text at the top reading 'Fiesty!'",
+ "The same visual setting reappears, followed by text, 'If you don't behave though, there's consequences!'",
+ "A man's voice issues a stern warning, indicating authority over her.",
+ "Another plain black image appears with text reading, 'I know you have 3 powers and one of them is telepathy. *hands her photo*'",
+ "A photo is handed to Venessa.",
+ "Another black screen appears, with the text, 'That's your foster mom, tell me where she is.'",
+ "Venessa hesitates and reluctantly participates, looking at the paper.",
+ "Transition to Venessa holding a piece of paper.",
+ "Text at the bottom questions, 'What do you see?'",
+ "Venessa closes her eyes and mutters, 'I see water...'",
+ "Video shows a view of water and a woman in black pants walking at the beach.",
+ "Text declares the vision, then shifts to a beach with trees and reads, 'And I see trees.'",
+ "Venessa repeats, 'She's at a beach.'",
+ "Back to the initial setting, the text changes to 'Excellent, yes, we had her sent on a little vacation.'",
+ "Screen goes black with the text in red, implying a subordinate is helping the man, 'I'll get her hooked up to the machine, boss.'",
+ "Vanessa is now shown with a device attached to her head and a black pen hanging before the camera, controlled with telekinesis.",
+ "She puts her two fingers on her right temple and holds her left hand towards the camera, palm facing it.",
+ "The man's awe is marked by text, 'This is amazing!' in black.",
+ "Text in red appears, 'Telekinesis is such a cool power!' indicating what the subordinate says.",
+ "The man commands with text, 'I wanna see her lift something heavier!'",
+ "A short gasp is heard, followed by a black screen.",
+ "A demand reads, 'I want you to lift this 500lb weight.'",
+ "Vanessa shakes her head and responds, 'No, I can't!' in blue text.",
+ "The demand persists, 'I need to see how strong you are!' followed by a forceful, 'Do it!'",
+ "Close-up of Vanessa shows her placing fingers on her right temple to try.",
+ "Vanessa holds two fingers to her temple, trying hard.",
+ "Text in red affirms her success, 'She's lifting it!'",
+ "The scene blacks out with a stern command, 'Okay, enough.' in black.",
+ "Returning visual shows Vanessa with a nosebleed, hinted by text from the subordinate, 'Uh boss her nose!'",
+ "She looks at the camera while the boss says, 'She's fine.'",
+ "Vanessa faints."
+ ]
+ },
+ {
+ "video": "video/30.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man wearing a blue medical scrub top, embroidered with 'Dr. Mikhail Varshavski D.O. Family Medicine.' This man is energetically speaking and gesturing in front of a background featuring a shelf with various items and a neon sign. As the man talks, a yellow banner with the words 'DISEASES DOCTORS OFTEN MISS' in bold black letters appears. Subsequently, the words 'POLYCYSTIC OVARIAN SYNDROME' and 'PCOS' display in bold white letters on a dark background. The scene transitions to an image of a doctor consulting with a patient, alongside the text 'AFFECTS 5M WOMEN OF REPRODUCTIVE AGE' in bold white letters. The scene shifts again to this man in blue, who is still speaking to the camera.\n\nThe audio reveals, \"Diseases doctors often miss, episode two. Polycystic ovarian syndrome, or PCOS, is a hormonal condition that affects over five million women of reproductive age.\" A man and then a woman are heard speaking.\n\nContinuing, the man in blue scrubs goes on animatedly, with the background remaining constant. A graphic with 'ISSUES' in bold white letters on a dark background appears. Next, the screen shows an image of a woman with acne on her face and a transparent human body model, in which the blood vessels inside the model can be seen. An ovary with multiple cysts is shown labeled 'FERTILITY,' followed by images labeled 'APPEARANCE,' and 'CARDIOVASCULAR.' The man in blue scrubs reappears, accompanied by a yellow banner displaying 'TO BE DIAGNOSED.' \n\nThe audio explains, \"Women with PCOS produce higher than normal amounts of male hormones, and as a result, face issues with fertility, appearance, and even cardiovascular consequences. 
For a diagnosis, patients need two of the three Rotterdam criteria\".\n\nGraphics intercut, stating 'MUST HAVE AT LEAST 2 OF 3' in bold white letters, followed by images: a woman next to a calendar labeled 'INFREQUENT OR MISSED PERIODS,' a chemical structure labeled 'TESTOSTERONE' with 'INCREASES IN MALE HORMONES,' and an ultrasound image labeled 'Polycystic ovary' with 'POLYCYSTIC OVARIES ON ULTRASOUND.' The man continues in front of a neon sign.\n\nThe audio narration states, \"Which includes infrequent or missed periods, increases in male hormones, or polycystic ovaries on ultrasound. Too often it's overlooked or misdiagnosed by doctors simply because they're focusing on the symptom,\" repeatedly voiced in a high-pitched tone.\n\nSubsequently, the man in blue scrubs explains more details. The graphic transitions display 'IF LEFT UNTREATED' in bold white letters, followed by 'NEGATIVE EFFECTS' in bold red. Listed negative effects include: 'Mood,' 'Blood Pressure,' 'Blood Sugar' and 'Cholesterol'. An image of a person labeled 'Metabolic Syndrome' follows, with associated conditions: 'High Blood Glucose,' 'High Blood Pressure,' 'High Lipid Profile,' 'Large Waist Size,' and 'Low HDL Levels.' The final frame in this clip shows the individual continuing to speak, accompanied by a yellow banner with 'TO BE DIAGNOSED.'\n\nThe audio here elaborates, \"Rather than the root of the problem. In fact, if left untreated, PCOS can negatively affect mood, blood pressure, blood sugar, cholesterol, and even lead to metabolic syndrome.\" spoken quickly and excitedly.\n\nThe video transitions to the final sequence with the same man speaking assertively. Yellow banner text states 'THERE IS NO CURE,' followed by a blue banner close-up proclaiming 'THERE ARE OPTIONS!' The speaker emphasizes the availability of options while continuing to gesture. The video concludes with a yellow banner urging 'SUBSCRIBE!' 
as the individual encourages viewers to subscribe.\n\nThe audio concludes, \"While there is no definitive answer, There are medications that can help normalize hormones as well as increase the chances of pregnancy if desired. Subscribe for episode number three.\" The entire narration here is delivered in a high-pitched voice.\n\nThis tightly knit presentation educates viewers on PCOS, its diagnosis, impacts, and available treatment options, effectively combining both informative visuals and cohesive, explanatory audio."
+ }
+ ],
+ "events": [
+ "Man in blue medical scrub top speaking energetically",
+ "'Dr. Mikhail Varshavski D.O. Family Medicine' embroidered on scrub top",
+ "Background with a shelf, various items, and a neon sign",
+ "Yellow banner with 'DISEASES DOCTORS OFTEN MISS' in bold black letters",
+ "'POLYCYSTIC OVARIAN SYNDROME' and 'PCOS' in bold white letters on dark background",
+ "Image of a doctor consulting with a patient",
+ "Text 'AFFECTS 5M WOMEN OF REPRODUCTIVE AGE' in bold white letters",
+ "Scene shifts back to man in blue scrubs speaking to camera",
+ "Audio reveals: 'Diseases doctors often miss, episode two. Polycystic ovarian syndrome, or PCOS, is a hormonal condition that affects over five million women of reproductive age.'",
+ "Man and woman are heard speaking in the audio",
+ "Graphic with 'ISSUES' in bold white letters on a dark background",
+ "Image of a woman with acne",
+ "Transparent human body model showing internal blood vessels",
+ "Ovary with multiple cysts labeled 'FERTILITY'",
+ "Images labeled 'APPEARANCE' and 'CARDIOVASCULAR'",
+ "Man in blue scrubs appears with yellow banner 'TO BE DIAGNOSED'",
+ "Audio explains: 'Women with PCOS produce higher than normal amounts of male hormones, and as a result, face issues with fertility, appearance, and even cardiovascular consequences. For a diagnosis, patients need two of the three Rotterdam criteria.'",
+ "Graphics state 'MUST HAVE AT LEAST 2 OF 3' in bold white letters",
+ "Image of a woman next to a calendar labeled 'INFREQUENT OR MISSED PERIODS'",
+ "Chemical structure labeled 'TESTOSTERONE' with 'INCREASES IN MALE HORMONES'",
+ "Ultrasound image labeled 'Polycystic ovary' with 'POLYCYSTIC OVARIES ON ULTRASOUND'",
+ "Man continues speaking in front of a neon sign",
+ "Audio states: 'Which includes infrequent or missed periods, increases in male hormones, or polycystic ovaries on ultrasound. Too often it's overlooked or misdiagnosed by doctors simply because they're focusing on the symptom.'",
+ "Man in blue scrubs explains further details",
+ "Graphic transitions display 'IF LEFT UNTREATED' in bold white letters",
+ "Graphic with 'NEGATIVE EFFECTS' in bold red letters",
+ "Listed negative effects: 'Mood,' 'Blood Pressure,' 'Blood Sugar,' and 'Cholesterol'.",
+ "Image of a person labeled 'Metabolic Syndrome'",
+ "Conditions associated with Metabolic Syndrome: 'High Blood Glucose,' 'High Blood Pressure,' 'High Lipid Profile,' 'Large Waist Size,' and 'Low HDL Levels'.",
+ "Man continues speaking, accompanied by yellow banner 'TO BE DIAGNOSED'",
+ "Audio elaborates: 'Rather than the root of the problem. In fact, if left untreated, PCOS can negatively affect mood, blood pressure, blood sugar, cholesterol, and even lead to metabolic syndrome.'",
+ "Final sequence of man speaking assertively",
+ "Yellow banner text 'THERE IS NO CURE'",
+ "Blue banner close-up 'THERE ARE OPTIONS!'",
+ "Man emphasizes the availability of options while gesturing.",
+ "Video concludes with yellow banner urging 'SUBSCRIBE!'",
+ "Individual encourages viewers to subscribe",
+ "Audio concludes: 'While there is no definitive answer, There are medications that can help normalize hormones as well as increase the chances of pregnancy if desired. Subscribe for episode number three.'",
+ "Presentation educates viewers on PCOS, its diagnosis, impacts, and treatment options",
+ "Combination of informative visuals and explanatory audio"
+ ]
+ },
+ {
+ "video": "video/31.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a young man in a white tank top and light gray sweatpants entering a building through a glass door. The text overlay reads \"how hoopers be getting their first tat\" with a laughing emoji, indicating a humorous look at basketball players getting their first tattoo. The young man appears either nervous or excited, clapping his hands and rubbing his arms. The audio reveals him saying, \"Man, I'm about to get inky. I'm getting asleep for real, baby.\"\n\nThe scene transitions to him sitting in a chair inside a room decorated with vibrant posters of comic book characters. Another person, likely the tattoo artist, wearing a blue jacket and black gloves, is seated across from him. The tattoo artist says, \"So you're trying to get a tattoo, right?\" The young man continues to rub his arm, showing anticipation or anxiety, and responds, \"Actually, I ain't gonna lie, bro. My skin, my epidermis is kind of like tripping today. We can just do something small instead of something small...\"\n\nThe narrative unfolds as the young man remains seated in the room adorned with colorful comic book posters. The tattoo artist's hand touches his chest and stomach, saying, \"Game time, baby,\" seemingly trying to comfort the young man. The young man says, \"Actually, I ain't go out with my mom. I gotta go with my mom.\" A person in a beige shirt enters, placing a comforting hand on his shoulder as the young man looks up. The beige shirt person tells him, \"Sit down. Take it like a man.\"\n\nThe scene transitions to the man already getting the tattoo, with a pained expression indicating the discomfort.\n\nThe video then shows him after the tattoo is done. The young man, sitting and holding up his phone as if making a video call, says, \"Good eye, July. You see what's going on, man. Yes, sir.\" The tattoo artist, sitting next to him, says, \"Hey, bro. You got to stay still. My fault, bro. 
Yeah, my fault.\"\n\nThe video moves to the young man standing, showing his tattooed arm while examining it and interacting with the tattoo artist. The room\u2019s background with colorful posters remains consistent. Soon, he is in a car, proudly displaying his new tattoo wrapped in plastic. The audio features the subsequent conversation: \"Oh, it's kind of inky for you. Hey, Goliath, I'm going to have to book. My next appointment. I want that sleeve for real, man. You gotta get me right.\" The tattoo artist responds, \"$350, my guy.\" The young man remarks, \"Oh, you gotta get the fuck out. Yeah, let me, uh, get my wallet...\"\n\nIn the final scene, the young man continues to display his tattoo in the car, touching his hair and talking with a person in a beige shirt seated behind him, adding to the narrative of his first ink experience. As they react to the tattoo, the young man appears even more ecstatic and proud. The audio concludes with, \"We just got out the shop. You already taking pictures. Hey, bro. Get on my...\"\n\nOverall, the video humorously portrays a young basketball player\u2019s experience of getting his first tattoo, moving through stages of excitement, nervousness, and eventual pride."
+ }
+ ],
+ "events": [
+ "Young man in white tank top and light gray sweatpants enters building through a glass door",
+ "Text overlay reads 'how hoopers be getting their first tat' with a laughing emoji",
+ "Young man appears nervous or excited, clapping hands and rubbing arms",
+ "Audio of young man saying, 'Man, I'm about to get inky. I'm getting asleep for real, baby'",
+ "Scene transitions to young man sitting in a chair inside a room with vibrant comic book posters",
+ "Tattoo artist in a blue jacket and black gloves is seated across from young man",
+ "Tattoo artist says, 'So you're trying to get a tattoo, right?'",
+ "Young man continues to rub his arm and responds, 'Actually, I ain't gonna lie, bro. My skin, my epidermis is kind of like tripping today. We can just do something small instead of something small...'",
+ "Tattoo artist touches young man's chest and stomach, says, 'Game time, baby'",
+ "Young man says, 'Actually, I ain't go out with my mom. I gotta go with my mom'",
+ "Person in beige shirt enters, places comforting hand on young man's shoulder",
+ "Person in beige shirt tells young man, 'Sit down. Take it like a man'",
+ "Scene transitions to young man already getting the tattoo with a pained expression",
+ "Young man, sitting and holding up his phone, says, 'Good eye, July. You see what's going on, man. Yes, sir.'",
+ "Tattoo artist, sitting next to young man, says, 'Hey, bro. You got to stay still. My fault, bro. Yeah, my fault.'",
+ "Young man standing and examining his tattooed arm, interacting with tattoo artist",
+ "Young man in car, proudly displaying new tattoo wrapped in plastic",
+ "Audio of young man saying, 'Oh, it's kind of inky for you. Hey, Goliath, I'm going to have to book. My next appointment. I want that sleeve for real, man. You gotta get me right.'",
+ "Tattoo artist responds, '$350, my guy.'",
+ "Young man remarks, 'Oh, you gotta get the fuck out. Yeah, let me, uh, get my wallet.'",
+ "Young man continues to display his tattoo in the car, touching his hair",
+ "Young man talks with person in beige shirt seated behind him",
+ "Audio concludes with, 'We just got out the shop. You already taking pictures. Hey, bro. Get on my...'",
+ "The video humorously portrays a young basketball player\u2019s experience of getting his first tattoo"
+ ]
+ },
+ {
+ "video": "video/32.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video discusses various critical aspects of subwoofer installation and common mistakes that can significantly affect bass performance. \n\nStarting with a close-up of two large black subwoofers branded 'JL AUDIO' on a white surface, the text \"TOP 3 MISTAKES THAT RUIN BASS!\" appears prominently in bold white letters across the screen. As the camera shifts to different angles, a man's voice asks, \"Do you have a subwoofer in your car? Here are three mistakes that absolutely ruin the bass performance.\" The scene transitions to a close-up of another black subwoofer, which is being handled and rotated by a person, showcasing its connectors and structure. The person's hands are shown handling the black subwoofer against a light wooden surface.\n\nThe focus then shifts to the connectors on the back of another black subwoofer, where a person carefully attaches color-coded wires to the terminals. The subwoofer, installed in a car trunk, has the 'JL AUDIO' logo prominently displayed. The voiceover continues, \"Mistake number one is incorrect wiring polarity. Having one subwoofer wired out of polarity from another makes them work against each other and destroys the acoustic energy.\" The scene then transitions to a technical diagram, explaining phase in audio signals, showing 'In phase' with 'Constructive interference' and '180\u00b0 out of phase' with 'Destructive interference.' The audio explanation stresses the importance of proper phase alignment.\n\nNext, the video continues with the phase diagram, showing waveforms labeled 'In phase' and '180\u00b0 out of phase.' The background features audio equipment on shelves. The subsequent scene focuses on the back of another black subwoofer, with a close-up of its color-coded connectors. 
A 'DM-810' audio control device is shown, being adjusted by a person, emphasizing the critical role of digital signal processors and time alignment in maintaining phase relationships between subwoofers and mid-range speakers. The voiceover discusses, \"Much like indirect polarity, we also want to make sure we have a correct phase relationship between the subwoofer and our mid-range speaker. By using a digital signal processor and time alignment, we can ensure that the phase relationship between a subwoofer and a mid-range speaker is correct.\"\n\nThe camera then zooms in on a 'JL AUDIO' subwoofer installed in a car's trunk. Hands are meticulously placing another black subwoofer into a circular opening on a textured black surface using a power drill, ensuring a secure fit. The background reveals a workshop environment filled with tools. The speaker highlights, \"The next big mistake is not providing enough air volume for the subwoofer or using the wrong subwoofer box type. Every subwoofer has different parameters.\"\n\nFollowing this, the video shows a black 'JL AUDIO' subwoofer in a white cabinet, with the camera shifting angles to provide multiple perspectives. Then, it transitions to a scene where two black 'JL AUDIO' subwoofers are installed side-by-side in a car trunk. The voiceover continues, \"If you put it in a box that's too small, you're going to have poor performance, and if you put it in a sealed box when it's better designed to be in a ported box, you're not going to have as good performance.\"\n\nFinally, the focus returns to two 'JL AUDIO' subwoofers in a car trunk, with a shift to a single subwoofer with a red accent ring inside the car. Bold white text at the bottom asks, \"WHAT MISTAKES HAVE YOU SEEN?\" The voiceover invites feedback and further engagement, \"But I want to hear from you. What other mistakes have you guys seen? 
And if you want to see more of my subwoofer box builds, come on over to my channel.\" The video concludes with a person bending a piece of wood featuring multiple cuts, demonstrating a technique for creating curved surfaces in subwoofer enclosures, underscoring the significance of proper construction techniques."
+ }
+ ],
+ "events": [
+ "Close-up of two large black subwoofers branded 'JL AUDIO' on a white surface.",
+ "Bold white text appears on the screen: 'TOP 3 MISTAKES THAT RUIN BASS!'",
+ "Camera shifts to different angles while a man's voice talks about subwoofer mistakes.",
+ "Voiceover introduction: 'Do you have a subwoofer in your car? Here are three mistakes that absolutely ruin the bass performance.'",
+ "Transition to a close-up of another black subwoofer being handled and rotated by a person.",
+ "Showcase of the subwoofer's connectors and structure on a light wooden surface.",
+ "Focus shifts to connectors on the back of another black subwoofer.",
+ "Person carefully attaches color-coded wires to the subwoofer's terminals in a car trunk with 'JL AUDIO' logo.",
+ "Voiceover mentions: 'Mistake number one is incorrect wiring polarity. Having one subwoofer wired out of polarity from another makes them work against each other and destroys the acoustic energy.'",
+ "Transition to a technical diagram explaining phase in audio signals with 'In phase' and 'Constructive interference', '180\u00b0 out of phase' with 'Destructive interference.'",
+ "Audio explanation underscores the importance of proper phase alignment.",
+ "Continuation with the phase diagram, showing waveforms labeled 'In phase' and '180\u00b0 out of phase.'",
+ "Background features audio equipment on shelves.",
+ "Focus on the back of another black subwoofer with a close-up of color-coded connectors.",
+ "Showcase of 'DM-810' audio control device being adjusted by a person.",
+ "Voiceover discusses the importance of digital signal processors and time alignment: 'Much like incorrect polarity, we also want to make sure we have a correct phase relationship between the subwoofer and our mid-range speaker. By using a digital signal processor and time alignment, we can ensure that the phase relationship between a subwoofer and a mid-range speaker is correct.'",
+ "Camera zooms in on 'JL AUDIO' subwoofer installed in a car's trunk.",
+ "Hands placing another black subwoofer into a circular opening on a textured black surface using a power drill.",
+ "Background reveals a workshop environment filled with tools.",
+ "Voiceover highlights: 'The next big mistake is not providing enough air volume for the subwoofer or using the wrong subwoofer box type. Every subwoofer has different parameters.'",
+ "Scene shows black 'JL AUDIO' subwoofer in a white cabinet with multiple camera angles.",
+ "Transition to two black 'JL AUDIO' subwoofers installed side-by-side in a car trunk.",
+ "Voiceover continues: 'If you put it in a box that's too small, you're going to have poor performance, and if you put it in a sealed box when it\u2019s better designed to be in a ported box, you're not going to have as good performance.'",
+ "Focus returns to two 'JL AUDIO' subwoofers in a car trunk.",
+ "Shift to a single subwoofer with a red accent ring inside the car.",
+ "Bold white text at the bottom: 'WHAT MISTAKES HAVE YOU SEEN?'",
+ "Voiceover invites feedback and engagement: 'But I want to hear from you. What other mistakes have you guys seen? And if you want to see more of my subwoofer box builds, come on over to my channel.'",
+ "Conclusion with a person bending a piece of wood with multiple cuts, demonstrating creating curved surfaces in subwoofer enclosures.",
+ "Underscoring the significance of proper construction techniques."
+ ]
+ },
+ {
+ "video": "video/33.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a person dressed in a blue jacket and shorts, sitting on the ground against a concrete wall. A blue box filled with various snacks sits in front of them. The text 'PART 1' is displayed at the top of the frame. The person is approached by another man dressed in a black hoodie, black pants, and a black cap. The text 'Hey, how you doing man?' appears on the screen as the man in black initiates the conversation. The person on the ground continues arranging snacks in the box while the standing man comments on the snacks, saying, \"I see you're selling some chips.\" They both exchange nods and friendly words. The person sitting on the ground responds \"Yeah\". The standing man says, \"It's pretty cool man. I wanna give you something,\" while reaching into his pocket to get a wallet and handing money to the person on the ground, who accepts it with a grateful gesture. The standing man says \"This is for you\" as the same text appears on screen.\n\nIn the next scene, the person in the blue jacket is shown again, holding a banknote, looking at the man, and asking \"What's this?\" as the text reads \"What's this?\". The man in black stands next to them. The text 'That's $20 for you' appears on the screen as the person on the ground looks at the money. The person sitting on the ground continues talking, with the text 'Yea I see that you gotta take something' appearing on the screen. The person on the ground offers a snack from the box, but the standing man declines, waving his hand, putting his hands in his pockets, and saying, \"No, I'm good. I'm really good. I promise\".\n\nThe narrative continues with the same two individuals. The person on the ground is still holding the money and touches the box. The man in black tries to walk away but stops by the person sitting on the ground, steps back, and reaches toward the snack box. 
The person on the ground says \"No..No man you gotta take something\" as the text reads the same and puts the box toward the man standing. The man in black says \"Are you sure?\" as the text reads the same. The text 'Yeah' appears on the screen as the standing man picks up a snack, saying, \"I just take this one.\" He pushes the box toward the person on the ground. The person on the ground insists, \"You gotta take more than that, you handed me $20\" while holding the money. The standing individual, holding the snack, responds, \"You can keep the change.\"\n\nFollowing this, the person on the ground points at the snack in the man's hand and says, \"They're a dollar so you gotta take more\". The man in black asks \"You selling them for a dollar?\". The person on the ground repeats, \"I'm selling them for a dollar.\" The man in black then says, \"Alright, you know what, give me everything.\" The person on the ground responds \"Alright Cool!\" and gets out a white plastic bag from his pocket. As the person on the ground starts to gather the snacks into the white plastic bag, the standing individual crouches down to help and says \"I'll take it all\". The person on the ground, still emotional, says, \"Let me bag them for you. You gave me $20 and there's 12. Let me hand you change.\" The man in black says \"Sure\". They proceed to fill the bag with snacks. The person on the ground finds the change in a black bag on his left and gives the man in black some notes.\n\nThe final part of the video shows the same setup but with a white plastic bag partially filled with snacks. 
The man in black holds the money and says, \"I appreciate that, Thank you very much.\" The person on the ground continues to express gratitude, \"Thank you, man,\" with the man in black acknowledging, \"You're a very honest salesman.\" The scene concludes with a palpable sense of gratitude and emotion, with the person on the ground looking appreciative while the standing man continues to crouch next to them.\n\nThroughout all scenes, background music is continuously playing, complementing the mood of the interactions between the two men. Subtitles consistently provide context and direct dialogue in English, enhancing understanding for all viewers."
+ }
+ ],
+ "events": [
+ "A person in a blue jacket and shorts sits on the ground against a concrete wall.",
+ "A blue box filled with snacks is in front of them.",
+ "Text 'PART 1' is displayed at the top of the frame.",
+ "A man in a black hoodie, black pants, and a black cap approaches.",
+ "Text 'Hey, how you doing man?' appears on the screen as the man in black speaks.",
+ "The person on the ground continues arranging snacks.",
+ "The man in black comments on the snacks: 'I see you're selling some chips.'",
+ "They exchange nods and friendly words.",
+ "The person on the ground responds, 'Yeah'.",
+ "The man in black says: 'It's pretty cool man. I wanna give you something.'",
+ "The man in black reaches into his pocket, gets a wallet, and hands money to the person on the ground.",
+ "The person on the ground accepts the money with a grateful gesture.",
+ "The man in black says 'This is for you' and the text reads the same.",
+ "Next scene: The person in the blue jacket looks at the money and asks 'What's this?'",
+ "Text 'What's this?' appears on the screen.",
+ "The man in black stands next to the person on the ground.",
+ "Text 'That's $20 for you' appears on the screen.",
+ "The person on the ground continues talking, text 'Yea I see that you gotta take something' appears on the screen.",
+ "The person on the ground offers a snack from the box.",
+ "The man in black declines, waves his hand, puts hands in pockets, and says, 'No, I'm good. I'm really good. I promise'.",
+ "The person on the ground still holding the money touches the box.",
+ "The man in black tries to walk away, stops, steps back, and reaches toward the snack box.",
+ "The person on the ground says 'No..No man you gotta take something' and puts the box toward the man standing. Text reads the same.",
+ "The man in black says 'Are you sure?' and text reads the same.",
+ "Text 'Yeah' appears on the screen.",
+ "The standing man picks up a snack, says 'I just take this one'.",
+ "The standing man pushes the box toward the person on the ground.",
+ "The person on the ground insists 'You gotta take more than that, you handed me $20'.",
+ "The standing man responds 'You can keep the change' while holding the snack.",
+ "The person on the ground points at the snack and says 'They're a dollar so you gotta take more'.",
+ "The man in black asks 'You selling them for a dollar?'",
+ "The person on the ground repeats 'I'm selling them for a dollar.'",
+ "The man in black says, 'Alright, you know what, give me everything.'",
+ "The person on the ground responds 'Alright Cool!' and gets a white plastic bag from his pocket.",
+ "The person on the ground starts gathering snacks into the white plastic bag.",
+ "The standing man crouches down to help, says 'I'll take it all.'",
+ "The person on the ground, emotional, says, 'Let me bag them for you. You gave me $20 and there's 12. Let me hand you change.'",
+ "The man in black says 'Sure'.",
+ "They proceed to fill the bag with snacks.",
+ "The person on the ground finds change in a black bag on his left, gives the man in black some notes.",
+ "Final part: White plastic bag partially filled with snacks.",
+ "The man in black holds the money, says 'I appreciate that, Thank you very much.'",
+ "The person on the ground expresses gratitude, 'Thank you, man.'",
+ "The man in black acknowledges, 'You're a very honest salesman.'",
+ "The scene ends with both showing gratitude, with background music playing.",
+ "Subtitles consistently provide context and dialogue in English."
+ ]
+ },
+ {
+ "video": "video/34.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing any detail. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a close-up of a pack of 10 white canvases from 'Artist's Loft' placed on a kitchen counter, clearly indicating the size as 8 x 10 inches. A hand holds a tube of black acrylic paint, also from 'Artist's Loft,' squeezing the paint onto a white canvas in four separate lines. As the black paint is spread evenly across the canvas with a blue-handled brush, a female voice narrates: \"So I saw this video on TikTok of a girl who takes a canvas and she paints it black and she makes it look like a neon light so I thought I'd challenge myself and give it a go.\"\n\nTransitioning to a close-up of the now black-painted canvas on a metallic surface outdoors, one can see the grass and a house in the background. Another shot shows the canvas from a broader backyard view, including a tree and a fence. The edges of the canvas are painted black with a blue-handled brush, followed by a yellow pencil sketching a pineapple design on the black canvas. A fine brush then starts outlining the pineapple sketch with white lines, while the voice continues, \"So I paint my canvas black and then I leave it outside to dry for a bit and then I realized I forgot to put black on the sides so then I start painting the sides.\"\n\nA hand holding a fine brush carefully paints a white curved line on the black canvas, revealing a pineapple outline with a grid pattern. The camera shifts to a close-up of a paper plate holding blobs of yellow and light green paint, ready for the next step. The narration explains further: \"Then I outlined what I wanted to paint with pencil, which was a pineapple and then I go over the pencil part with white. I actually chose a pineapple because I'm allergic to pineapples and I thought it'd be funny and then I'll be able to have one in my room that I do not eat.\" A sound effect plays.\n\nContinuing from the previous setup, yellow and white paint are on the paper plate. 
The fine brush applies yellow paint over the white outline of the pineapple against the black background, highlighting the grid and adding green leaves on top. Detailing the meticulous painting process, the hand with light-colored nail polish traces the pineapple outline with deeper yellow paint. The scene shifts from this close-up to adding deeper green paint to the pineapple's leaves, defining the top leaves. The close shot of the canvas reveals the developing vibrant pineapple design, while the audio describes: \"Then I outlined it with yellow because that's what the girl in the video said to do. She said to do a light color of what you want the outline to be, and then a darker color. So that's what I did here. I actually really liked the green that kind of came out. That's like my favorite.\"\n\nThe final scene features a close-up of the black canvas texture and edges, transitioning to the workspace displaying the completed pineapple design: vibrant yellow outline, white grid pattern, and green leaves. Paint supplies lie around, indicating the colors used. The concluding close-up highlights the detailed pineapple against the black backdrop while the final narration adds, \"So this is like the finished kind of look, but I didn't like it, it wasn't clean enough, so then I kind of cleaned it up a little bit and I think I like it better but let me know what you think.\"\n\nThis sequence captures the chronological artistic process from blank canvas to finished pineapple artwork, intertwined with a casual, reflective narration from the artist."
+ }
+ ],
+ "events": [
+ "Close-up of a pack of 10 white canvases from 'Artist's Loft' on a kitchen counter, size 8 x 10 inches",
+ "Hand holds a tube of black acrylic paint from 'Artist's Loft'",
+ "Black paint squeezed onto a white canvas in four separate lines",
+ "Black paint spread evenly across the canvas with a blue-handled brush",
+ "Female voice narrates about seeing a TikTok video of a girl painting a canvas to look like a neon light",
+ "Close-up of the black-painted canvas on a metallic surface outdoors",
+ "Grass and a house visible in the background",
+ "Broader backyard view showing the canvas, a tree, and a fence",
+ "Edges of the canvas painted black with a blue-handled brush",
+ "Yellow pencil sketches a pineapple design on the black canvas",
+ "Fine brush outlines the pineapple sketch with white lines",
+ "Voiceover mentions realizing the sides were not painted black and painting the sides",
+ "Hand with fine brush paints a white curved line on the black canvas, revealing a pineapple outline with a grid pattern",
+ "Close-up of a paper plate holding blobs of yellow and light green paint",
+ "Narration explains the choice of painting a pineapple and the process",
+ "Sound effect",
+ "Fine brush applies yellow paint over the white outline of the pineapple against the black background, highlighting the grid",
+ "Green leaves added to the pineapple design with fine brush",
+ "Hand with light-colored nail polish traces pineapple outline with deeper yellow paint",
+ "Deeper green paint added to the pineapple's leaves",
+ "Close-up of the canvas revealing the developing pineapple design",
+ "Audio description about outlining with yellow and then a darker color as per instructions from the TikTok video",
+ "Close-up of the black canvas texture and edges",
+ "Workspace displaying the completed pineapple design with vibrant yellow outline, white grid pattern, and green leaves",
+ "Paint supplies indicating the colors used around the workspace",
+ "Final close-up of detailed pineapple against black backdrop",
+ "Final narration about cleaning up the design for a better outcome and inviting opinions"
+ ]
+ },
+ {
+ "video": "video/35.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a vibrant image celebrating the 35th anniversary of the Super Mario Bros. franchise. Iconic characters, including Mario, Luigi, Toad, Bowser, and Princess Peach, surround a large, colorful '35th' sign with \"`SUPER MARIO BROS.`\" written below it in yellow. Mario snaps his fingers.\n\nThe scene transitions into the classic 8-bit gameplay of the original Super Mario Bros. game. Against a bright blue sky with a white cloud and green hills, Mario, in red overalls and a red hat, runs from left to right on a ground of brown bricks. He jumps to hit a block with a question mark, releases a coin, and then stomps on a Goomba to earn points before moving forward, hitting more blocks, and collecting coins. Mario hits another question mark block, this time releasing a red and white mushroom. Beside a green pipe, as Mario collects the mushroom, he becomes bigger and upbeat music starts, accompanied by a man's voice narrating the game's history: \"`The original Super Mario Brothers, which made its debut on NES in 1985.`\"\n\nThe action continues as Mario jumps over the green pipe. The gameplay is shown on a handheld device labeled '`Game & Watch`' with a beige and red color scheme. The device features a directional pad on the left and red 'A' and 'B' buttons on the right. The narration explains, \"`can now be played on this special Game & Watch system. A plus control pad is included, so you can play the game smoothly,`\" while the music continues.\n\nNext, the focus remains on the '`Game & Watch`' device, showcasing more 8-bit gameplay. Mario runs, jumps over pipes, and encounters more enemies. The screen transitions reveal different settings: one with large '4:00' made of brown bricks and another showing '11:59' with sparkling effects and a night-time sky filled with stars. The narration continues, \"`Since it's a Game & Watch product, it also functions as a clock. 
And in honor of the game's 35th anniversary, it comes with 35 little touches to discover,`\" with music playing consistently throughout.\n\nThe imagery switches to the title screen of 'Super Mario Bros. 2' on the same '`Game & Watch`' device, displaying large orange letters and a castle in the background. Options for 'Mario Game' and 'Luigi Game' are visible below the title. The narrator mentions: \"`In addition, you can play the Super Mario Bros. The Lost Levels game on it,`\" against the backdrop of classic Mario theme music.\n\nA close-up of the handheld '`Game & Watch`' showcases a simple black and white game where Mario juggles two balls, with a score of '18' displayed. The narration continues, \"`\u2026as well as an updated version of the title Game & Watch Ball Starring Mario.`\"\n\nThe scene shifts to a promotional image of the 'Game & Watch: Super Mario Bros.' box, adorned in gold with the '`Game & Watch`' logo and '`Super Mario Bros.`' in red. The text 'Limited Production Available November 13' is prominently displayed. The narration states: \"`Game & Watch Super Mario Bros. will launch November 13th. The original Nintendo handheld gaming system is back in full-color liquid crystal display.`\"\n\nFinally, the video transitions to a red background with the white 'Nintendo' logo centered as the music plays in the background.\n\nThroughout, the visual and auditory elements intertwine to celebrate the legacy and upcoming release of the 'Game & Watch: Super Mario Bros.' in a cohesive and engaging manner."
+ }
+ ],
+ "events": [
+ "Video opens with vibrant image celebrating the 35th anniversary of Super Mario Bros. franchise",
+ "Iconic characters (Mario, Luigi, Toad, Bowser, Princess Peach) surround a large '35th' sign",
+ "'SUPER MARIO BROS.' is written below the '35th' sign in yellow",
+ "Mario snaps his fingers",
+ "Scene transitions into classic 8-bit gameplay of original Super Mario Bros. game",
+ "Bright blue sky with white cloud and green hills is shown",
+ "Mario in red overalls and red hat runs from left to right on brown brick ground",
+ "Mario jumps to hit a block with a question mark and releases a coin",
+ "Mario stomps on a Goomba to earn points",
+ "Mario moves forward, hits more blocks, and collects coins",
+ "Mario hits another question mark block and releases a red and white mushroom",
+ "Beside a green pipe, Mario collects the mushroom and becomes bigger",
+ "Upbeat music starts",
+ "Man's voice narrates the game's history: 'The original Super Mario Brothers, which made its debut on NES in 1985'",
+ "Mario jumps over the green pipe",
+ "Gameplay is shown on a handheld device labeled 'Game & Watch' with beige and red color scheme",
+ "Device features directional pad on the left and red 'A' and 'B' buttons on the right",
+ "Narration explains: 'can now be played on this special Game & Watch system. A plus control pad is included, so you can play the game smoothly'",
+ "Music continues",
+ "Focus remains on the 'Game & Watch' device showcasing more 8-bit gameplay",
+ "Mario runs, jumps over pipes, and encounters more enemies",
+ "Screen transitions reveal different settings: '4:00' made of brown bricks and '11:59' with sparkling effects and night-time sky filled with stars",
+ "Narration continues: 'Since it\u2019s a Game & Watch product, it also functions as a clock. And in honor of the game's 35th anniversary, it comes with 35 little touches to discover'",
+ "Music playing consistently throughout",
+ "Imagery switches to title screen of 'Super Mario Bros. 2' on same 'Game & Watch' device",
+ "Title screen displays large orange letters and a castle in the background",
+ "Options for 'Mario Game' and 'Luigi Game' are visible below the title",
+ "Narrator mentions: 'In addition, you can play the Super Mario Bros. The Lost Levels game on it'",
+ "Backdrop of classic Mario theme music",
+ "Close-up of handheld 'Game & Watch' showcases simple black and white game where Mario juggles two balls",
+ "Score of '18' is displayed",
+ "Narration continues: '\u2026as well as an updated version of the title Game & Watch Ball Starring Mario'",
+ "Scene shifts to promotional image of 'Game & Watch: Super Mario Bros.' box",
+ "Box adorned in gold with 'Game & Watch' logo and 'Super Mario Bros.' in red",
+ "Text 'Limited Production Available November 13' is prominently displayed",
+ "Narration states: 'Game & Watch Super Mario Bros. will launch November 13th. The original Nintendo handheld gaming system is back in full-color liquid crystal display'",
+ "Video transitions to a red background with white 'Nintendo' logo centered",
+ "Music plays in the background",
+ "Visual and auditory elements intertwine to celebrate legacy and upcoming release of 'Game & Watch: Super Mario Bros' cohesively and engagingly"
+ ]
+ },
+ {
+ "video": "video/36.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a detailed view of a bustling salad bar in a cafeteria or buffet, displaying a variety of fresh vegetables like shredded carrots, red cabbage, kidney beans, and lettuce, all arranged neatly in white trays with black tongs. Above the salad bar are jars of grains and oils. As the camera moves, it shifts focus to a dessert section featuring glasses filled with yellow pudding-like desserts topped with whipped cream, strawberries, and pastries, alongside other creamy delights and a large cake in the background. Further into the video, a fruit bar is shown, offering a vibrant selection of fresh fruits: green apples, oranges, watermelon, and pineapple, again arranged in white trays with black tongs.\n\nThe audio reveals a man's voice explaining, \"So what's food like below deck for crew? Well, it changes quite a bit. Today we've got a crepe section; sometimes it might be fried rice or pad thai or whatever, but this was pretty tasty.\" This narration, accompanied by background music, leads into a scene showing people gathered around a table piled with various toppings and ingredients for making crepes or pancakes. The table, covered in a white cloth, presents bowls filled with chocolate chips, sprinkles, and other toppings. The clip concludes with a person in a blue shirt and shorts serving themselves from the buffet table, underscoring the busy, well-stocked nature of this dining area.\n\nContinuing, the next segment zooms in again on the salad bar, reinforcing its variety with jalape\u00f1os, chia seeds, shredded cheese, raisins, sesame seeds, and sunflower seeds, all presented similarly in white trays with spoons. The scene transitions to another section of the buffet, showcasing trays of cold cuts labeled 'Mortadella' and 'Swiss Cheese.' The man's voice continues, \"`And then we've got typical salad bar stuff that doesn't really change. Look at all these things, jalape\u00f1os and fresh this and fresh that. 
`\" while he seems to be eating and drinking during his commentary.\n\nIn the following part of the video, attention is given to more cold cuts and cheeses, introducing trays labeled 'Edam Cheese' and 'Antipasti', followed by a reprise of the fruit bar scene with added details of kiwis and more apples. Additional hot dishes in stainless steel trays are also shown, making the buffet seem endless. The narrator mentions, \"`And there are cheeses and meats and just pretty much anything you could want at a buffet is what we've got down here. I could eat up at the buffet with like the guests.`\"\n\nThe camera then brings the focus to a variety of hot dishes, decorated with lemon slices and herbs, further capturing close views of attractive desserts topped with whipped cream, berries, caramel, and chocolate shavings. The descriptive audio complements this with, \"`But honestly, it's just easy to eat down here, and the food's really good, so I don't have any complaints. And on top of that, we've got so many desserts. I'm not a dessert person, but dude, so many, like very, very good desserts.`\"\n\nFinally, the video gives an intimate look at more desserts, such as cream puffs drizzled with chocolate, pie slices with whipped cream and fruit, and beautifully layered desserts in glasses. The scene broadens to show the dining area, populated with round tables and chairs, some of which are occupied by people enjoying their meals. The man's voice concludes, saying, \"`So, this is what it's like where we eat: just lots of tables, really, really plain and simple. And that's everything, guys. Bye,`\" with laughter from a woman faintly audible in the background, rounding out the video's informal yet comprehensive tour of the buffet."
+ }
+ ],
+ "events": [
+ "Detailed view of bustling salad bar in a cafeteria or buffet.",
+ "Showing variety of fresh vegetables: shredded carrots, red cabbage, kidney beans, lettuce.",
+ "Vegetables arranged neatly in white trays with black tongs.",
+ "Above the salad bar are jars of grains and oils.",
+ "Camera shifts focus to dessert section.",
+ "Dessert section features glasses filled with yellow pudding-like desserts topped with whipped cream, strawberries, and pastries.",
+ "Showing other creamy delights and a large cake in the background.",
+ "View of a fruit bar offering selection of fresh fruits: green apples, oranges, watermelon, pineapple.",
+ "Fruits also arranged in white trays with black tongs.",
+ "Man's voice explaining food variety for crew below deck.",
+ "Mention of crepe section, fried rice, pad thai, and other tasty meals.",
+ "Scene shows people gathered around table with ingredients for making crepes or pancakes.",
+ "Table covered in white cloth presents bowls filled with chocolate chips, sprinkles, and other toppings.",
+ "Person in blue shirt and shorts serving themselves from buffet table.",
+ "Zoom in again on salad bar with added variety: jalape\u00f1os, chia seeds, shredded cheese, raisins, sesame seeds, sunflower seeds.",
+ "Narrator comments on consistent salad bar items.",
+ "View of another buffet section with trays labeled 'Mortadella' and 'Swiss Cheese.'",
+ "Narrator eats and drinks during commentary.",
+ "Focus on more cold cuts and cheeses, trays labeled 'Edam Cheese' and 'Antipasti.'",
+ "Return to fruit bar scene with added details of kiwis and more apples.",
+ "Additional hot dishes shown in stainless steel trays.",
+ "Narrator mentions diverse buffet options: cheeses, meats, and endless variety.",
+ "Focus on variety of hot dishes decorated with lemon slices and herbs.",
+ "Close views of attractive desserts topped with whipped cream, berries, caramel, chocolate shavings.",
+ "Narrator praises quality and variety of desserts despite not being a dessert person.",
+ "More desserts shown: cream puffs drizzled with chocolate, pie slices with whipped cream and fruit, layered desserts in glasses.",
+ "View broadens to dining area with round tables and chairs.",
+ "Some tables occupied by people enjoying their meals.",
+ "Narrator describes dining area as plain and simple with lots of tables.",
+ "Narrator concludes with a farewell and laughter from a woman faintly audible in the background."
+ ]
+ },
+ {
+ "video": "video/37.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing any detail. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with three young men sitting together and conversing in an outdoor setting against a plain, light-colored wall. The man in the middle, wearing a maroon shirt and a red cap, narrates a story about a humorous incident, captured in subtitles: 'And it was the funniest thing.' The audio reveals more of the story, with the man saying, '`His dad said, look between your thighs.`' As the subtitle, '`His dad said...`' appears, the camera cuts to a close-up of the storyteller, showing his expression. It then cuts back to the group, with the man in the checkered shirt on the right reacting energetically, repeating 'Thighs' multiple times, with the final frame focusing on his animated gestures.\n\nThe scene transitions outdoors to a separate setting where a man in a yellow shirt with 'JAP' text and a red cap stands in front of a building. However, the focus shifts back to the man in the checkered shirt from the previous conversation, who continues to animatedly gesture while speaking. He tries to get a high five but ends up patting the man in the maroon shirt on the shoulder. When the camera returns to the group, the man in the maroon shirt stands up, signaling his departure with the lines, 'With that, I'm going to bounce. See you guys. Peace out, yo.' This note of departure is followed by handshakes and farewells, capturing the mood of camaraderie, and concludes with a close-up of the checkered shirt man, reacting to the departure.\n\nNext, the man in the maroon shirt and red cap extends a farewell handshake towards the man in the yellow shirt with 'JAP' text. With a quick cut to the checkered shirt man observing this, the scene captures the essence of departure and friendship. Another handshake is shown between the maroon-shirted man and a new, blue-shirted man with subtitles reading, 'See you man.' Amidst these interactions, the audio features background music and phrases like 'Bombay. Bombay. See you, man. Shoot. Shoot.' 
keeping up with the camera movement and adding layers to the casual farewell ambiance.\n\nThe clip then transitions to a lively scene where five young men, including familiar characters in the maroon, blue, checkered, and yellow 'JAP' shirts, are engaged in a synchronized dance routine. The delightful dance features jumps, kicks, and coordinated arm movements, filled with laughter and camaraderie, as the friends enjoy each other's company. The background music uplifts the entire moment with dialogue interjections, such as '`Fusion. Ha! Oh, my God. Oh.`' The scene encapsulates the spirit of youthful energy and fun.\n\nShifting to a quieter segment, the man in the checkered shirt is seen standing and conversing with the man in the maroon shirt. He then checks his watch, hinting at time awareness, before the camera captures the group once more engaged in animated discussion; gestures and playful banter fill the screen. The clip wraps up with the maroon-shirted man shaking hands with the checkered-shirted man, underscored by friendly farewell phrases, 'How long did it take? See you, bro. Take care.' The consistent background music adds to the upbeat atmosphere.\n\nThe final scene reiterates the farewell sentiments as the man in the maroon shirt shakes the checkered-shirted man's hand, touches his arm briefly, and says '`See you bro, take care,`' before walking away. The man in the yellow 'JAP' shirt and another man in a red shirt observe the farewell from nearby. Left standing alone, the checkered-shirted man looks at his hand and quips, 'Handshake ah?' then looks around, feeling slightly left out, and utters, 'For me, no choreography and all, huh? Something they did no?' with a puzzled gesture. The subtler background sound of chirping birds and his parting words 'Bombay?' add an emotional nuance to his thoughtful demeanor, captured against the light-colored wall background, bringing a pensive close to the sequence of interactions captured in the video."
+ }
+ ],
+ "events": [
+ "Three young men sit together and converse in an outdoor setting against a plain, light-colored wall.",
+ "The man in the middle, wearing a maroon shirt and a red cap, narrates a story about a humorous incident.",
+ "Subtitles read 'And it was the funniest thing.'",
+ "Audio reveals the man saying, 'His dad said, look between your thighs.'",
+ "Subtitle 'His dad said..' appears.",
+ "Camera cuts to a close-up of the storyteller showing his expression.",
+ "Camera cuts back to the group, man in the checkered shirt on the right reacting energetically and repeating 'Thighs' multiple times.",
+ "Final frame focuses on the animated gestures of the man in the checkered shirt.",
+ "Scene transitions to an outdoor setting where a man in a yellow shirt with 'JAP' text and a red cap stands in front of a building.",
+ "Focus shifts back to the man in the checkered shirt who continues to gesture while speaking.",
+ "Man in the checkered shirt tries to get a high five but ends up patting the man in the maroon shirt on the shoulder.",
+ "Camera returns to the group, and the man in the maroon shirt stands up and signals his departure with 'With that, I'm going to bounce. See you guys. Peace out, yo.'",
+ "Handshakes and farewells are exchanged, capturing the mood of camaraderie.",
+ "Close-up of the checkered shirt man reacting to the departure.",
+ "Man in the maroon shirt and red cap extends a farewell handshake towards the man in the yellow shirt with 'JAP' text.",
+ "Quick cut to the checkered shirt man observing the handshake.",
+ "Man in maroon shirt exchanges a handshake with a new man in a blue shirt, with subtitles reading 'See you man.'",
+ "Background audio features background music and phrases like 'Bombay. Bombay. See you, man. Shoot. Shoot.'",
+ "Scene transitions to five young men engaged in a synchronized dance routine, including familiar characters in maroon, blue, checkered, and yellow 'JAP' shirts.",
+ "Dance routine features jumps, kicks, and coordinated arm movements filled with laughter and camaraderie.",
+ "Dialogue interjections over the background music such as 'Fusion. Ha! Oh, my God. Oh.'",
+ "Scene shifts to a quieter segment where man in checkered shirt is seen conversing with the man in the maroon shirt.",
+ "Man in checkered shirt checks his watch, indicating time awareness.",
+ "Camera captures the group engaged in animated discussions, gestures, and playful banter.",
+ "Clip wraps up with the man in the maroon shirt shaking hands with the checkered shirt man and saying friendly farewell phrases, 'How long did it take? See you, bro. Take care.'",
+ "Consistent background music adds to the upbeat atmosphere.",
+ "Final scene: man in maroon shirt shakes checkered shirt man's hand and touches his arm lightly saying 'See you bro, take care.'",
+ "Man in maroon shirt walks away as man in yellow 'JAP' shirt and another man in a red shirt observe.",
+ "Checkered shirt man, left standing alone, looks at his hand and quips, 'Handshake ah?'",
+ "Checkered shirt man looks around, feeling left out, and says 'For me, no choreography and all, huh? Something they did no?' with a puzzled gesture.",
+ "Subtle sound of chirping birds in the background.",
+ "Parting words 'Bombay?' add an emotional nuance to the thoughtful demeanor of the checkered shirt man.",
+ "Man in checkered shirt stands against the light-colored wall background, bringing a pensive close to the sequence of interactions."
+ ]
+ },
+ {
+ "video": "video/38.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video titled \"EVERYTHING YOU WANTED TO KNOW ABOUT DRAWING A FACE FROM PHOTO REFERENCE (IN UNDER A MINUTE)\" begins with bold pink and white text on a blue background. The scene transitions to a close-up of a hand holding a red pencil, sketching an oval shape on a blank sheet of paper. Text appears at the bottom of the screen, guiding the viewer through the process: \"drawing heads today,\" followed by \"using photo reference,\" and \"Today we're drawing my good friend Santa J. Claus.\" The artist begins perfecting the elliptical shape, adding horizontal and vertical lines to indicate the center of the head.\n\nThe voice-over explains: \"Hello everyone, today let's practice drawing headshots with photos. Today we are going to draw my good friend Santa Claus. The first thing I'm going to do is start with an oval. Then I'm going to draw a horizontal and vertical line to represent the center of the head.\"\n\nThe scene continues with a close-up of the hand drawing a cross on the oval, then a circle over the oval to represent the nose. The text on the screen reads: \"Then the nose and two circles of eyeballs.\" The artist paints the eyeball in detail, making sure the viewer knows to paint the entire eyeball in the eye socket, not just the visible part. Further instructions: \"When I draw the eyeball, I draw the entire eyeball in the socket, not just the visible part. And then I add the eyelids, top and bottom. Don't forget the skin that overlaps the eyeball. It has weight and thickness.\"\n\nThe video progresses with the artist continuing to add details to the eyeball, emphasizing its weight and thickness. The text \"And please remember there's going to be a shadow underneath the eye\" and \"and above the eyelid\" appears as the artist shades these areas. 
Additional text prompts read \"And the eyelid itself casts a shadow on the eyeball\" and \"I usually draw the eyebrows at the top of the circle of the eyeball.\" The narration adds, \"And remember, there's going to be shadow underneath the eye and above the eyelid. And the eyelid itself casts a shadow on the eyeball.\" The scene then transitions back to the oval with two half-completed eyes. The hand then draws an eyebrow above the right eye. The voice then reads: \"I usually draw the eyebrows at the top of the circle of the eyeball. And for the nose, I usually start off with a circle as well with two half circles on each side.\"\n\nThe artist draws a circle for the nose, adding two half circles on each side. When the hand uses the eraser to erase the previously drawn face, the video stresses that erasers are part of the process: \"And if you make a mistake, that's what erasers are for. In my lifetime, I've probably drawn tens of thousands of faces, and I still make mistakes EVERY SINGLE TIME! But that's art.\" The audio complements this with: \"In my lifetime, I've probably drawn tens of thousands of faces, and I still make mistakes every single time. But that's art.\"\n\nThe artist continues to refine the face's sketch, adding details to the nose and mouth. The scene then transitions to a pen completing a new sketch of the same model. The video emphasizes imperfection with text: \"We're not copy machines. It's not automatic. So embrace the 'wonkiness' of your art.\" The artist begins painting the sketch, filling the hat with red paint. The text \"We're not going for perfection\" and \"We're making art. We're showing the world how we see it\" appears, reinforcing the artistic philosophy shared in the narration: \"We're showing the world how we see it. Wonkiness and all.\"
\n\nFinally, the artist refines the painting of a man with a white beard and red hat with the text '@Santajclaus' displayed. The voice then says: \"And remember, use photo reference, draw from models, go out into nature, use a stencil or a ruler, trace.\" As more intricate details are added to the hat and beard on a blue background, the text reads: \"Just make art and have fun.\" The audio concludes with: \"use photo reference, draw from models, go out into nature, use a stencil or a ruler, trace, use anything you want to make art - there's no rules, just make art and have fun.\"\n\nMusic plays softly in the background during the final scenes, enhancing the friendly and encouraging atmosphere of the tutorial."
+ }
+ ],
+ "events": [
+ "The video titled 'EVERYTHING YOU WANTED TO KNOW ABOUT DRAWING A FACE FROM PHOTO REFERENCE (IN UNDER A MINUTE)' begins with bold pink and white text on a blue background.",
+ "The scene transitions to a close-up of a hand holding a red pencil, sketching an oval shape on a blank sheet of paper.",
+ "Text appears at the bottom of the screen: 'drawing heads today,' 'using photo reference,' and 'Today we're drawing my good friend Santa J. Claus.'",
+ "The artist perfects the elliptical shape, adding horizontal and vertical lines to indicate the center of the head.",
+ "Voice-over explains: 'Hello everyone, today let's practice drawing headshots with photos. Today we are going to draw my good friend Santa Claus. The first thing I'm going to do is start with an oval. Then I'm going to draw a horizontal and vertical line to represent the center of the head.'",
+ "A close-up shows the hand drawing a cross on the oval, then a circle over the oval to represent the nose.",
+ "Text on the screen reads: 'Then the nose and two circles of eyeballs.'",
+ "The artist paints the eyeball in detail, emphasizing to paint the entire eyeball in the eye socket, not just the visible part.",
+ "Further instructions appear: 'When I draw the eyeball, I draw the entire eyeball in the socket, not just the visible part. And then I add the eyelids, top and bottom. Don't forget the skin that overlaps the eyeball. It has weight and thickness.'",
+ "The artist continues adding details to the eyeball, emphasizing its weight and thickness.",
+ "Text appears: 'And please remember there's going to be a shadow underneath the eye' and 'and above the eyelid.'",
+ "Additional text prompts: 'And the eyelid itself casts a shadow on the eyeball' and 'I usually draw the eyebrows at the top of the circle of the eyeball.'",
+ "The narration adds: 'And remember, there's going to be shadow underneath the eye and above the eyelid. And the eyelid itself casts a shadow on the eyeball.'",
+ "The scene transitions back to the oval with two half-completed eyes.",
+ "The hand draws an eyebrow above the right eye.",
+ "The voice narrates: 'I usually draw the eyebrows at the top of the circle of the eyeball. And for the nose, I usually start off with a circle as well with two half circles on each side.'",
+ "The artist draws a circle for the nose, adding two half circles on each side.",
+ "The hand uses the eraser to erase the previously drawn face.",
+ "The video stresses that erasers are part of the process: 'And if you make a mistake, that's what erasers are for. In my lifetime, I've probably drawn tens of thousands of faces, and I still make mistakes EVERY SINGLE TIME! But that's art.'",
+ "The audio complements: 'In my lifetime, I've probably drawn tens of thousands of faces, and I still make mistakes every single time. But that's art.'",
+ "The artist continues refining the face's sketch, adding details to the nose and mouth.",
+ "The scene transitions to a pen completing a new sketch of the same model.",
+ "The video emphasizes imperfection with text: 'We're not copy machines. It's not automatic. So embrace the 'wonkiness' of your art.'",
+ "The artist begins painting the sketch, filling the hat with red paint.",
+ "Text appears: 'We're not going for perfection' and 'We're making art. We're showing the world how we see it.'",
+ "The narration comments: 'We're showing the world how we see it. Wonkiness and all.'",
+ "The artist refines the painting of a man with a white beard and red hat, with the text '@Santajclaus' displayed.",
+ "The voice then says: 'And remember, use photo reference, draw from models, go out into nature, use a stencil or a ruler, trace.'",
+ "More intricate details are added to the hat and beard on a blue background.",
+ "Text reads: 'Just make art and have fun.'",
+ "The audio concludes: 'use photo reference, draw from models, go out into nature, use a stencil or a ruler, trace, use anything you want to make art - there's no rules, just make art and have fun.'",
+ "Music plays softly in the background during the final scenes, enhancing the friendly and encouraging atmosphere of the tutorial."
+ ]
+ },
+ {
+ "video": "video/39.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man, wearing a dark blue shirt, holding a piece of green cucumber above a beaker filled with yellow liquid. The man then places the cucumber slice into the beaker, submerging it in the yellow liquid, and subsequently adds more slices one by one from a white plate. The beaker is then placed inside a transparent vacuum chamber. The scene transitions to a close-up of the beaker, showing the cucumber slices floating in the yellow liquid, eventually focusing on the fully immersed cucumber slices. Meanwhile, the man's voice accompanies the footage, saying, \u201cJust sliced up some cucumbers. I'm gonna put them in here. Three, two, one. So you can see already, look at the air bubbling out of them,\u201d with the sound of a motor being heard.\n\nThe focus stays on the beaker filled with yellow liquid from a side view, in which several cucumber slices are floating. Measurement markings indicate a capacity of 250 ml on the beaker. A closer view reveals small bubbles forming around the cucumber slices, suggesting an ongoing reaction or change. The audio continues with the man explaining, \u201cSo one reason that pickling takes so long normally is because you're trying to replace all of the liquid and air that's in the cucumber with the brine vinegar pickling solution,\u201d while in the background, a motor hums steadily.\n\nThe video keeps its close-up view on the cucumber slices partially submerged in the yellow liquid, with bubbles indicating a continued interaction. Transitioning to a top-down view, the beaker brims with yellow liquid and overlapping cucumber slices. A wider shot brings back the measurement markings and shows a person\u2019s finger near the beaker, hinting at observation or interaction. Over this, the man\u2019s voice details, \u201cWhile using a vacuum chamber we can make that exchange go much faster. Let me show you. 
Now the goal here is to get all of the air out of these cucumbers,\u201d all while the motor hums.\n\nThe focus remains on the beaker filled with yellow liquid and cucumber slices. Bubbles are still visible around the slices and grow more and more numerous. A slight change in angle keeps the focus on these bubbles and the cucumber slices. The camera closes in, highlighting the increasingly intense reaction. In the next scene, the sudden disappearance of the bubbles indicates that the reaction has stopped. The man instructs, \u201cWe can possibly get let\u2019s let in the air and watch what happens. Three, two, one,\u201d with the background audio capturing the sound of air sizzling.\n\nA similar view continues to showcase the cucumber slices in the yellow liquid from a different angle. Then a mist appears on the inside of the beaker, and the cucumber inside shrinks and turns black. This repeated, detailed focus enhances the observation of the reaction between the cucumber slices and the liquid. Again, the man's voice narrates, \u201cSo do you see how dark they got? That's the brine vinegar solution just diffused into them, look how dark it got.\u201d\n\nThe final section starts with a close-up of the cucumber slices in the beaker of yellow liquid, maintaining a consistent measurement capacity indication of 250 ml. The scene transitions to a hand holding a single wet and shiny cucumber slice above the beaker. The man, identified by his dark blue shirt with a geometric design, holds both the beaker and the cucumber slice, then takes a bite, chewing the slice to offer a sensory conclusion. The man comments, \u201cSo look at that, look how it made a perfect pickle now. Okay, the moment of truth here. That's a good pickle,\u201d presenting the completion and approval of the experiment."
+ }
+ ],
+ "events": [
+ "Man in dark blue shirt holds a piece of cucumber above a beaker filled with yellow liquid.",
+ "Man places a cucumber slice into the beaker, submerging it in the yellow liquid.",
+ "Man adds more cucumber slices one by one taken from a white plate into the beaker.",
+ "Beaker is placed inside a transparent vacuum chamber.",
+ "Close-up of the beaker with cucumber slices floating in yellow liquid.",
+ "Focus on fully immersed cucumber slices in yellow liquid.",
+ "Man's voice: 'Just sliced up some cucumbers. I'm gonna put them in here. Three, two, one. So you can see already, look at the air bubbling out of them.'",
+ "Sound of a motor is heard.",
+ "Side view of the beaker showing floating cucumber slices.",
+ "Measurement markings show a capacity of 250 ml on the beaker.",
+ "Closer view reveals small bubbles forming around cucumber slices.",
+ "Man's voice: 'So one reason that pickling takes so long normally is because you're trying to replace all of the liquid and air that's in the cucumber with the brine vinegar pickling solution.'",
+ "Motor hums steadily in the background.",
+ "Close-up view of cucumber slices partially submerged in the yellow liquid with bubbles indicating interaction.",
+ "Top-down view of beaker filled with yellow liquid and cucumber slices.",
+ "Wider shot shows measurement markings and a person's finger near the beaker.",
+ "Man's voice: 'While using a vacuum chamber we can make that exchange go much faster. Let me show you. Now the goal here is to get all of the air out of these cucumbers.'",
+ "Motor hum heard in background.",
+ "Focus on beaker with cucumber slices and visible bubbles.",
+ "Change in angle emphasizing bubbles and cucumber slices.",
+ "Camera closes in on the intensive reaction.",
+ "Sudden disappearance of bubbles indicates reaction stopped.",
+ "Man's voice: 'We can possibly get let\u2019s let in the air and watch what happens. Three, two, one.'",
+ "Sound of air sizzling.",
+ "Cucumber slices in yellow liquid shown from different angles.",
+ "Mist appears inside the beaker.",
+ "Cucumber inside shrinks and turns black.",
+ "Detailed focus on cucumber slices and liquid reaction.",
+ "Man's voice: 'So do you see how dark they got? That's the brine vinegar solution just diffused into them, look how dark it got.'",
+ "Close-up of cucumber slices in beaker with 250 ml measurement marking.",
+ "Transition to hand holding a wet, shiny cucumber slice above the beaker.",
+ "Man in dark blue shirt holds the beaker and cucumber slice.",
+ "Man takes a bite of the cucumber slice and chews.",
+ "Man comments: 'So look at that, look how it made a perfect pickle now. Okay, the moment of truth here. That's a good pickle.'",
+ "Completion of the experiment."
+ ]
+ },
+ {
+ "video": "video/40.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a man holding a can of Rust-Oleum NeverWet Multi-Surface Liquid Repelling Treatment, specifically the 'STEP 2 Top Coat' version. The can is white with blue and black text, and the man is holding it with his left hand. To the left is a clear plastic container filled with a liquid, likely water. The background is a plain indoor setting. The audio reveals a man explaining, \"This stuff is called Never Wet. Whatever you spray it on becomes super hydrophobic. Here's what happens when I spray it on a paper towel.\" The scene transitions to a close-up of a hand pouring a clear liquid from a small cup onto a white paper towel on a wooden surface. The liquid forms a bead on the paper towel, demonstrating the water-repellent properties of the treated material. The man continues, \"You can see the water doesn't absorb into it at all.\" \n\nThen, the video develops through a close-up of the bead of water sitting on the paper towel, showing that it has not been absorbed. The man gently touches the paper towel, emphasizing its repellency as the audio continues, \"So let's see what happens if I spray Never Wet on water.\" The scene transitions to a clear plastic container with water placed on a wooden surface, shown from a slightly elevated angle. Black and yellow tape is applied to the table. The man wearing a dark blue shirt and jeans stands behind the container. The man asks, \"Can you make the water hydrophobic?\" \n\nThe scene then transitions to a close-up of a clear plastic container with water against a plain cardboard background. He then sprays a lot of base coat from the right into the water, followed by the top coat from the left. Text appears at the top: \"had to spray gently so it didn't...\" The audio has the man saying, \"So first we spray the base coat and I spray a lot of it over several hours and let it dry. And now I spray the top coat, let it dry for 12 hours. 
Now let's see if it works.\" A hand holding a small plastic dropper filled with a green liquid is shown next. The man releases a few drops into the water, forming small green beads on the surface, demonstrating the repellent properties again, followed by him saying, \"Okay, here we go. Let's see if this works. Yeah, beads up on there.\"\n\nThen the man releases more green drops on the water. The video moves forward with close-up views and transitions emphasizing the effectiveness of the Rust-Oleum NeverWet Multi-Surface Liquid Repelling Treatment. The audio matches these scenes, saying, \"So it looks like it worked best on the edges where I had most of the never wet.\" A man's arm, which is slightly tanned and hairy, appears.\n\nSubsequent scenes focus on the man's hand lifting a clear plastic sheet from the water, showcasing green beads floating on the surface and the sheet, with the audio explaining, \"So it didn't work really well. It never actually really dried, it just basically ends up with this skin of plastic on top.\" As the video proceeds, the man's hands carefully handle the plastic sheet, showing the green beads clinging to it.\n\nThe video effectively illustrates the super hydrophobic properties of the Rust-Oleum NeverWet Multi-Surface Liquid Repelling Treatment."
+ }
+ ],
+ "events": [
+ "Close-up of a man holding a can of Rust-Oleum NeverWet Multi-Surface Liquid Repelling Treatment, 'STEP 2 Top Coat' version.",
+ "Can is white with blue and black text.",
+ "Man holding can with left hand.",
+ "Clear plastic container filled with liquid (likely water) to the left.",
+ "Plain indoor background setting.",
+ "Audio: Man explaining, 'This stuff is called Never Wet. Whatever you spray it on becomes super hydrophobic. Here's what happens when I spray it on a paper towel.'",
+ "Scene transitions to close-up of a hand pouring clear liquid from a small cup onto a white paper towel.",
+ "Paper towel is on a wooden surface.",
+ "Liquid forms a bead on the paper towel, showing water-repellent properties.",
+ "Audio: Man says, 'You can see the water doesn't absorb into it at all.'",
+ "Close-up of the bead of water on the paper towel, showing non-absorption.",
+ "Man touches the paper towel gently.",
+ "Audio: Man says, 'So let's see what happens if I spray Never Wet on water.'",
+ "Scene transitions to clear plastic container with water on a wooden surface.",
+ "View is from a slightly elevated angle.",
+ "Black and yellow tape applied to the table.",
+ "Man in dark blue shirt and jeans stands behind the container.",
+ "Audio: Man asks, 'Can you make the water hydrophobic?'",
+ "Scene transitions to close-up of clear plastic container with water against a plain cardboard background.",
+ "He sprays a lot of base coat from the right into the water.",
+ "He sprays top coat from the left.",
+ "Text appears at the top: 'had to spray gently so it didn't...'",
+ "Audio: Man says, 'So first we spray the base coat and I spray a lot of it over several hours and let it dry And now I spray the top coat, let it dry for 12 hours. Now let's see if it works.'",
+ "Hand holding a small plastic dropper filled with green liquid shown next.",
+ "Man releases a few green drops into the water, forming small green beads on the surface, demonstrating repellent properties.",
+ "Audio: Man says, 'Okay, here we go. Let's see if this works. Yeah, beads up on there.'",
+ "Man releases more green drops on the water.",
+ "Close-up views and transitions showing effectiveness of treatment.",
+ "Audio: Man says, 'So it looks like it worked best on the edges where I had most of the never wet.'",
+ "A man's slightly tanned and hairy arm appears.",
+ "Scenes focus on the man's hand lifting a clear plastic sheet from the water.",
+ "Green beads are shown floating on the surface and the plastic sheet.",
+ "Audio: Man explains, 'So it didn't work really well. It never actually really dried, it just basically ends up with this skin of plastic on top.'",
+ "Man's hands carefully handle the plastic sheet showing green beads clinging to it.",
+ "Video illustrates the super hydrophobic properties of Rust-Oleum NeverWet Multi-Surface Liquid Repelling Treatment."
+ ]
+ },
+ {
+ "video": "video/41.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a bartender, clad in a black shirt, standing behind a bar counter surrounded by shelves brimming with various bottles of alcohol. He is seen energetically shaking a cocktail shaker. The text on the screen reads, \"What REALLY happens when you call THE NIGHT OF.\" Abruptly, the bartender answers a ringing phone. He says, \"JRs, how can I help you?\" The scene shifts to a man dressed in a gray blazer and a hat, standing in a room adorned with ornate decor, holding a phone to his ear. \"Feb 14th - 6:30pm\" appears on the screen. The man inquires if there are any tables available. Cutting back to the bartender, he affirms while continuing to shake the cocktail shaker. He says, \"We do actually, and people are sitting at them.\" Then he hangs up the phone angrily.\n\nThe narrative proceeds with the bartender, still behind the bar counter, now holding a phone with a blue and white polka-dotted handle in one hand and a copper cocktail shaker in the other. The man says, \"We've been booked for months, people, Jesus Chri...\" After he picks up the phone, his emotion transitions from anger to false pleasure: \"Thank you for calling, Johnny Roccos, how can I help you.\" The scene transitions to a man in a blue and white plaid shirt standing in front of a vintage car, holding a phone to his ear. He says, \"How long's your wait tonight?\" The bartender, ready to respond, holds the phone and wears a contemplative look. Then he hangs up the phone.\n\nThe story continues with the bartender still behind the bar counter, holding the same phone and cocktail shaker. The screen's text evolves to read, \"Longer than my patience is for that conversation.\" The bartender resumes shaking his cocktail shaker after placing the phone down. Then another call comes in. When he picks it up again, he begins saying, \"My name's Judd, how can I...\" before the scene cuts to a man in a light denim jacket, standing in a modern, well-decorated room with a staircase. 
The man, holding a phone, says he wants to make a reservation; his name is Jennifer and he wants to come with his friends. Returning to the bartender, now equipped with a black notebook and pen, he tells the caller to hold on while he checks availability. The audio captures these dialogues, underlined by intermittent phone rings. Then he whistles.\n\nFollowing this, the scene features the bartender, notebook and pen in one hand, phone held in the other. He says excitedly, \"All right, team, we've got another one. This one seems as optimistic as she does self-entitled. Bets are now open, five dollars a square, closest one to without going over wins the pot. Plus, she's friends with the owner.\"\n\nFinally, the bartender, notebook and phone in hand, remains behind the bar counter. He says, \"Place your bets! Bets are now open! How long will she last before she hangs up?\" The notebook he raises displays a chart titled 'V-DAY TIME ON HOLD' with a broken heart symbol and a column labeled 'CRAZY CUSTS NAME.' At the bottom, it encourages viewers to follow with a heart and cocktail glass symbol. The concluding scene has text reading \"Subscribe for daily\" with a laughing emoji. The backdrop audio features his voice mingling with playful background music.\n\nThe video integrates the unfolding events at the bar, punctuated by customer interactions, focusing on the dynamic and somewhat humorous behind-the-scenes look at managing a busy bar on a presumably special night."
+ }
+ ],
+ "events": [
+ "Bartender in black shirt behind bar counter shaking cocktail shaker",
+ "Shelves with various bottles of alcohol behind bartender",
+ "Text on screen: 'What REALLY happens when you call THE NIGHT OF.'",
+ "Phone rings, bartender answers: 'JRs, how can I help you?'",
+ "Scene shifts to man in gray blazer and hat holding phone",
+ "Text on screen: 'Feb 14th - 6:30pm'",
+ "Man asks if there are any tables available",
+ "Bartender affirms tables are available while shaking shaker: 'We do actually and people are sitting at them'",
+ "Bartender hangs up phone angrily",
+ "Bartender behind bar counter holding polka-dotted phone and copper cocktail shaker",
+ "Bartender speaks: 'We've been booked for months, people, Jesus Chri...'",
+ "Bartender answers phone with false pleasure: 'Thank you for calling, Johnny Roccos, How can I help you'",
+ "Scene transitions to man in blue and white plaid shirt holding phone in front of vintage car",
+ "Man asks: 'How long's your wait tonight?'",
+ "Bartender contemplative, then hangs up phone",
+ "Bartender behind bar counter holding phone and cocktail shaker",
+ "Text on screen: 'Longer than my patience is for that conversation.'",
+ "Bartender shakes cocktail shaker again after placing phone down",
+ "Another call comes in, bartender answers: 'My name's Judd, How can I...'",
+ "Scene cuts to man in light denim jacket holding phone in modern room with staircase",
+ "Man says his name is Jennifer and wants to make a reservation with friends",
+ "Bartender with black notebook and pen tells caller to hold on while he checks availability",
+ "Audio captures dialogues with intermittent phone rings",
+ "Bartender excitedly says: 'All right team we've got another one. This one seems as optimistic as she does self-entitled. Bets are now open five dollars a square closest one to without going over wins the pot plus she's friends with the owner.'",
+ "Bartender behind bar counter with notebook and phone",
+ "Bartender says: 'Place your bets! Bets are now open! How long will she last before she hangs up?'",
+ "Notebook displays chart titled 'V-DAY TIME ON HOLD' with broken heart symbol and column labeled 'CRAZY CUSTS NAME'",
+ "Chart encourages viewers to follow with heart and cocktail glass symbol",
+ "Text on screen: 'Subscribe for daily' with laughing emoji",
+ "Playful background music mingling with bartender's voice"
+ ]
+ },
+ {
+ "video": "video/42.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a conversation in a medical setting. One man in a pink shirt with red accents and a black helmet asks, \"Alright what's going on?\" as displayed in the subtitle, while a man's voice states, \"All right, what's going on?\" The same man, dressed in blue medical scrubs with a stethoscope, simulating a doctor, responds with, \"Well this patient needs to be admitted\" both in the subtitle and audio. A discussion ensues about whether the patient should go to the floor or the ICU, with the subtitle reading, \"I'm just not sure if it should be to the floor or the ICU, and I'm kind of afraid to ask.\" and the voice in the audio echoing, \"I'm just not sure if it should be to the floor or to the ICU, and I'm kind of afraid to ask.\" The first man looks to the left and assures, \"Don't worry I'll handle it,\" which aligns with the audio, \"Don't worry about it. I'll handle it.\"\n\nThe scene shifts to the man in the pink shirt with red accents and a black helmet, now holding a phone with his left hand and looking to the right. The subtitle says, \"Hi this is the emergency doctor, I have a patient in room 9 who needs to be admitted to the ICU,\" matched by the audio, \"Hi, this is the emergency doctor. I got a patient down here in room nine that needs admission to the ICU.\" This is followed by, \"Yeah, too sick for the floor, thanks,\" in both subtitle and audio, and finally, \"Hi this is the emergency doctor, I have a patient in room 9 I'd like you to see, needs to be admitted\" in the subtitle, with the audio repeating, \"Hi, this is the emergency doctor. I have a patient in room 9 I'd like you to see, needs to be admitted.\" as the man in the pink shirt with red accents holds a black phone with his right hand and speaks.\n\nThe dialogue progresses with, \"Yeah the floor is fine, doesn't need the ICU\" in the subtitle, echoed by the audio, \"Yeah, the floor is fine. Doesn't need the ICU. OK. Bye.\" The scene transitions to the man, now in blue scrubs and wearing glasses, walking and stating, \"Hey I just saw the patient in room 9, he doesn't need the ICU, the floor is fine,\" confirmed by the audio, \"Hey, I just saw the patient in room 9, he doesn't need the ICU, the floor is fine.\"\n\nThe video features the man in blue scrubs without glasses continuing the conversation: \"Wait are you talking about room 9? No, way too sick for the floor, he needs the ICU,\" and the audio reflects this confusion, \"Are you talking about the patient in room nine? Way too sick for the floor. The patient should go to the ICU.\" The clip concludes with a subtitle question, \"I'm sorry, who are you?\" paralleled by the audio question, \"I'm sorry. Who are you?\" The scene transitions to the man in blue scrubs and wearing glasses.\n\nIn the next clip, the man in blue scrubs without glasses identifies himself with the subtitle, \"I'm the hospitalist, who are you?\" and the audio says, \"I'm the hospitalist. Who are you?\" The scene shifts as the same man states, \"I'm critical care, this patient does not qualify for the ICU,\" followed by the audio confirming this: \"I'm critical care. This patient does not qualify for ICU.\" The man in the pink shirt with red accents and a black helmet smiles, zips his shirt, and looks down. He then puts on orange glasses. The text reads \"This patient requires a high level of nursing care, we can't handle that for the floor\" with the audio declaring, \"This patient requires a high level of nursing care, we can't handle that for the floor.\" The exchange ends with the individual in blue scrubs asserting, \"None of the patient's vital signs warrant admission to the ICU,\" mirrored by the audio, \"What are you talking about? None of the patient's vital signs warranted admission to the ICU.\"\n\nThe final clip takes place in a room with beige walls, framed pictures, and a beige armchair. The man in the pink shirt with red accents carries a bike and walks to the left to leave. From the previous conversation, the audio continues, \"You have open beds, just admit the patient,\" which the subtitle and audio both confirm. He exits the room as another speaker begins with \"We need those-\" matched by the audio, \"we need those.\" As the video ends, there is a sense of urgency about the patient's need for care, tying together the narrative about the appropriate placement of a patient within the medical hierarchy."
+ }
+ ],
+ "events": [
+ "Man in pink shirt with red accents and black helmet asks, 'Alright what's going on?'",
+ "Audio states, 'All right, what's going on?'",
+ "Man in blue medical scrubs with stethoscope responds, 'Well this patient needs to be admitted'",
+ "Audio echoes, 'Well this patient needs to be admitted'",
+ "Subtitle reads, 'I'm just not sure if it should be to the floor or the ICU, and I'm kind of afraid to ask.'",
+ "Audio echoes, 'I'm just not sure if it should be to the floor or to the ICU, and I'm kind of afraid to ask.'",
+ "Man in pink shirt assures, 'Don't worry I'll handle it,'",
+ "Audio echoes, 'Don't worry about it. I'll handle it.'",
+ "Scene shifts to man in pink shirt holding a phone with his left hand, looking to the right",
+ "Subtitle says, 'Hi this is the emergency doctor, I have a patient in room 9 who needs to be admitted to the ICU,'",
+ "Audio matches, 'Hi, this is the emergency doctor. I got a patient down here in room nine that needs admission to the ICU.'",
+ "Subtitle and audio state, 'Yeah, too sick for the floor, thanks,'",
+ "Subtitle and audio state, 'Hi this is the emergency doctor, I have a patient in room 9 I'd like you to see, needs to be admitted,'",
+ "Man in pink shirt holds black phone with his right hand and speaks",
+ "Subtitle reads, 'Yeah the floor is fine, doesn't need the ICU,'",
+ "Audio echoes, 'Yeah, the floor is fine. Doesn't need the ICU. OK. Bye'",
+ "Scene transitions to man in blue scrubs with glasses walking",
+ "Subtitle reads, 'Hey I just saw the patient in room 9, he doesn't need the ICU, the floor is fine,'",
+ "Audio matches, 'Hey, I just saw the patient in room 9, he doesn't need the ICU, the floor is fine,'",
+ "Man in blue scrubs without glasses continues the conversation",
+ "Subtitle reads, 'Wait are you talking about room 9? No, way too sick for the floor, he needs the ICU,'",
+ "Audio matches, 'Are you talking about the patient in room nine? Way too sick for the floor. The patient should go to the ICU,'",
+ "Subtitle question reads, 'I'm sorry, who are you?'",
+ "Audio question matches, 'I'm sorry. Who are you?'",
+ "Scene transitions to man in blue scrubs and glasses",
+ "Next clip: man in blue scrubs without glasses identifies himself",
+ "Subtitle reads, 'I'm the hospitalist, who are you?'",
+ "Audio matches, 'I'm the hospitalist. Who are you?'",
+ "Scene shifts as same man states, 'I'm critical care, this patient does not qualify for the ICU,'",
+ "Audio confirms, 'I'm critical care. This patient does not qualify for ICU,'",
+ "Man in pink shirt with red accents and black helmet smiles, zips his shirt, and looks down; wears orange glasses",
+ "Subtitle reads, 'This patient requires a high level of nursing care, we can't handle that for the floor,'",
+ "Audio declares, 'This patient requires a high level of nursing care, we can't handle that for the floor,'",
+ "Individual in blue scrubs asserts, 'None of the patient's vital signs warrant admission to the ICU,'",
+ "Audio matches, 'What are you talking about? None of the patient's vital signs warranted admission to the ICU,'",
+ "Final clip: room with beige walls, framed pictures, and a beige armchair",
+ "Man in pink shirt with red accents carries a bike and walks to the left to leave",
+ "Audio and subtitle both confirm, 'You have open beds, just admit the patient,'",
+ "An unseen speaker begins with, 'We need those-'",
+ "Audio portrays, 'we need those,'",
+ "Video ends with a sense of urgency about the patient's need for care"
+ ]
+ },
+ {
+ "video": "video/43.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a silky omelette being placed on a slice of bread. Throughout the video, the text 'SILKY OMELETTE' is displayed at the top of the screen. The text 'hi guys I wanna show you my favourite omelette that takes 45 seconds the pan... on the heat tap it crack it whisk it' appears on the screen. The omelette is lifted with a utensil. The scene transitions to a man in a white shirt holding a frying pan, gesturing and speaking. The scene then cuts to a close-up of a gas stove burner being lit. A close-up of the frying pan on the lit burner follows. The scene transitions to the man cracking two eggs into a bowl. Next, the video shows a close-up of a silky omelette being stirred in a white bowl with a fork.\n\nThroughout this sequence, the audio features the man's voice saying, \"Hi guys, I wanna show you my favourite omelette that takes 45 seconds. The pan... on the heat. Tap it, crack it, whisk it.\" Background music plays softly, adding to the lively atmosphere.\n\nThe scene switches to the man speaking. A close-up of the omelette being stirred in the bowl appears. The final frame shows yellow grated cheese and a grater on a white surface. The text 'this is something that my kids love something that I like to do for breakfast brunch lunch midnight munchies brilliant I've grated some nice cheddar cheese and I have some toast' appears on the frames.\n\nThe audio for this part features the man's voice saying, \"This is something that my kids love, something that I like to do for breakfast, brunch, lunch, midnight munchies, brilliant. I've grated some nice cheddar cheese and I have some toast.\" The same background music plays.\n\nIn the next scene, a slice of bread pops out of the toaster. The man catches the piece of bread with both hands. The man shows the piece of bread to the camera, turns, and puts it on the table. A close-up of toasted bread on a white plate appears. A close-up of a frying pan follows.
The man uses his hand to feel the heat. The man pours a little oil into the pan. Finally, he pours the whisked eggs into the hot pan. The text 'let's do this you can see that the pan has got nice and hot we're gonna go in with a little bit of oil we're going to go straight in with the eggs look at it' appears on the frames.\n\nIn this segment, the audio reveals, \"Let's do this. You can see that the pan has got nice and hot. We're gonna go in with a little bit of oil. We're going to go straight in with the eggs. Look at it.\" Background music continues to play.\n\nThe omelette is cooking in the pan. The man shakes the pan a little. The scene transitions to a close-up of grated cheese being added to the omelette. The video then shows the man holding the pan and folding the omelette with a spatula. The man speaks and gestures with his hand. The man rolls the omelette up with a spatula. The text 'it's like little perfect silky sheets the cheese goes in like this we're done now it's finished if you just flick it that is the most incredible non-stick' appears on the frames.\n\nThe audio complements the visuals with, \"It's like little perfect silky sheets. The cheese goes in like this. We're done now. It's finished. If you just flick it, that is the most incredible non-stick.\" The background music remains consistent.\n\nThe next frame transitions to a close-up of toasted bread on a white plate. The following frames show a hand picking the toast up and pouring the omelette on it. The scene then shows the man holding the frying pan. The man shows the finished omelette to the camera. The omelette is lifted with a utensil. The man has a ring on his finger. The text 'and I take my toast I wanna show you that we've created the most incredible texture like a croissant like puff pastry what you done is create potentially the most amazing omelette in the world look at that gooeyness and deliciousness!' 
appears on the frames.\n\nThe audio complements the visuals with, \"And I take my toast, I wanna show you that we've created the most incredible texture, like a croissant, like puff pastry. What you done is create potentially the most amazing omelette in the world. Look at that! Gooeyness and deliciousness!\""
+ }
+ ],
+ "events": [
+ "Close-up of a silky omelette being placed on a slice of bread",
+ "Text 'SILKY OMELETTE' displayed at the top of the screen",
+ "Text 'hi guys I wanna show you my favourite omelette that takes 45 seconds the pan... on the heat tap it crack it whisk it' appears on the screen",
+ "Omelette lifted with a utensil",
+ "Scene transitions to a man in a white shirt holding a frying pan, gesturing and speaking",
+ "Close-up of a gas stove burner being lit",
+ "Close-up of the frying pan on the lit burner",
+ "Man cracking two eggs into a bowl",
+ "Close-up of silky omelette being stirred in a white bowl with a fork",
+ "Audio: Man says, 'Hi guys, I wanna show you my favourite omelette that takes 45 seconds. The pan... on the heat. Tap it, crack it, whisk it.'",
+ "Background music plays softly",
+ "Scene switches to the man speaking",
+ "Close-up of the omelette being stirred in the bowl",
+ "Final frame shows yellow grated cheese and a grater on a white surface",
+ "Text 'this is something that my kids love something that I like to do for breakfast brunch lunch midnight munchies brilliant I've grated some nice cheddar cheese and I have some toast' appears on the frames",
+ "Audio: Man says, 'This is something that my kids love, something that I like to do for breakfast, brunch, lunch, midnight munchies, brilliant. I've grated some nice cheddar cheese and I have some toast.'",
+ "Background music continues to play",
+ "Scene: A slice of bread pops out of the toaster",
+ "Man catches the piece of bread with both hands",
+ "Man shows the piece of bread to the camera, then turns and puts it on the table",
+ "Close-up of toasted bread on a white plate",
+ "Close-up of a frying pan",
+ "Man uses his hand to feel the heat",
+ "Man pours a little oil into the pan",
+ "Man pours the whisked eggs into the hot pan",
+ "Text 'let's do this you can see that the pan has got nice and hot we're gonna go in with a little bit of oil we're going to go straight in with the eggs look at it' appears on the frame",
+ "Audio: Man says, 'Let's do this. You can see that the pan has got nice and hot. We're gonna go in with a little bit of oil. We're going to go straight in with the eggs. Look at it.'",
+ "Background music continues to play",
+ "Omelette cooking in the pan",
+ "Man shakes the pan a little",
+ "Close-up of grated cheese being added to the omelette",
+ "Man holding the pan while folding the omelette with a spatula",
+ "Man speaks and gestures with his hand",
+ "Man rolls the omelette up with a spatula",
+ "Text 'it's like little perfect silky sheets the cheese goes in like this we're done now it's finished if you just flick it that is the most incredible non-stick' appears on the frame",
+ "Audio: Man says, 'It's like little perfect silky sheets. The cheese goes in like this. We're done now. It's finished. If you just flick it, that is the most incredible non-stick.'",
+ "Background music remains consistent",
+ "Close-up of toasted bread on a white plate",
+ "Hand picks the toast up and pours the omelette on it",
+ "Man holding the frying pan",
+ "Man shows the finished omelette to the camera",
+ "Omelette lifted with a utensil",
+ "Man has a ring on his finger",
+ "Text 'and I take my toast I wanna show you that we've created the most incredible texture like a croissant like puff pastry what you done is create potentially the most amazing omelette in the world look at that gooeyness and deliciousness!' appears on the frame",
+ "Audio: Man says, 'And I take my toast, I wanna show you that we've created the most incredible texture, like a croissant, like puff pastry. What you done is create potentially the most amazing omelette in the world. Look at that! Gooeyness and deliciousness!'"
+ ]
+ },
+ {
+ "video": "video/44.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a close-up view of a toy gun placed on a textured surface. The gun is primarily white with orange accents on the magazine, grip, and parts of the barrel, and is equipped with a black scope on top. Hands appear and begin pointing towards different parts of the gun, possibly explaining its features. The voiceover enthusiastically states, \"Guys, this is one of the newest additions to my collection. It is actually a gel blaster and not an airsoft gun.\" The hands remove the orange magazine from the gun and show its components in detail.\n\nTransitioning to the second clip, the focus remains on the removed orange magazine. The hands manipulate the magazine's components, pointing to a small black button and other details. The audio explains, \"It comes with a 200 round high-capacity banana magazine, and here is how you load it right here. And then it also connects to the battery and gives you automatic feeding from the main compartment to the main feeding tube. You get the scope and laser combo. It doesn't have a reticle, but it's actually two times zoomed. And then you also get this laser right here.\" The video then focuses on the black scope attached to the toy gun, with the hands highlighting its features.\n\nThe third clip opens with a detailed view of the black scope, showing the lens and other parts of the scope from various angles. The video then reveals the full toy gun with the orange magazine reattached, lying on the textured surface beside two small plastic bags containing blue and orange beads. The hands point to the gun and the accessories, with the voice commenting, \"This gun retails for a hundred dollars and comes with a battery and a charger, and then you also get free gel blaster balls as well, blue and orange.\" The hands then display the black battery pack labeled '7.4V' and a black USB charging cable. The scene concludes with a close-up of the bags of beads.\n\nIn the fourth clip, the person holds the toy gun and aims it at a cardboard target attached to a wooden fence. The white gun with orange accents and a black scope is fired, and small projectiles hit the target. The audio remarks, \"Range the target is right there. We're going to see how well it can do.\" The camera zooms in on the target to show the impact points, followed by the sound of a power tool.\n\nContinuing to the fifth clip, the cardboard target with sections labeled 'A', 'B', 'C', and 'D' is shown in a close-up. The camera shifts to the person holding the toy gun and firing it towards the target, with projectiles hitting the labeled sections as well as a blue balloon floating nearby. The speaker announces, \"All right boys, this is the range test. All right guys, we were actually able to hit the target 50% of the time from the 75-foot range and at CQB, of course it's going to be able to hit the target every single time.\" The target gets hit multiple times, and the clip ends with a close-up of the toy gun held over a pool.\n\nThe final clip features the toy gun being held over a pool, with blue and green balloons floating on the water in the background. The person points out different parts of the gun, and the voice concludes with, \"This thing's not that bad for a hundred dollars; I recommend it, to be honest. It's a great gel blaster.\" The video closes with the person highlighting the gun's features and the orange magazine over the pool.\n\nThroughout the entire video, the visuals and audio are synchronized to provide a thorough and detailed presentation of the gel blaster toy gun, its features, and its performance, culminating in a personal recommendation by the speaker."
+ }
+ ],
+ "events": [
+ "Close-up view of a toy gun placed on a textured surface.",
+ "Gun is white with orange accents on magazine, grip, barrel, and equipped with a black scope on top.",
+ "Hands appear and point towards different parts of the gun.",
+ "Voiceover states, 'Guys, this is one of the newest additions to my collection. It is actually a gel blaster and not an airsoft gun.'",
+ "Hands remove the orange magazine from the gun and show its components in detail.",
+ "Focus on the removed orange magazine.",
+ "Hands manipulate the magazine's components, pointing to a small black button and other details.",
+ "Audio explains, 'It comes with a 200 round high-capacity banana magazine, and here is how you load it right here. And then it also connects to the battery and gives you automatic feeding from the main compartment to the main feeding tube. You get the scope and laser combo. It doesn't have a reticle, but it's actually two times zoomed. And then you also get this laser right here.'",
+ "Focus on the black scope attached to the toy gun, with hands highlighting its features.",
+ "Detailed view of the black scope, showing the lens and other parts from various angles.",
+ "Full toy gun with orange magazine reattached, lying on the textured surface beside two small plastic bags containing blue and orange beads.",
+ "Hands pointing to the gun and accessories.",
+ "Voice comments, 'This gun retails for a hundred dollars and comes with a battery and a charger, and then you also get free gel blaster balls as well, blue and orange.'",
+ "Hands display the black battery pack labeled '7.4V' and a black USB charging cable.",
+ "Close-up of the bags of beads.",
+ "Person holds the toy gun and aims it at a cardboard target attached to a wooden fence.",
+ "White gun with orange accents and black scope is fired, hitting the target.",
+ "Audio remarks, 'Range the target is right there. We're going to see how well it can do.'",
+ "Camera zooms in on the target to show impact points, followed by the sound of a power tool.",
+ "Close-up of the cardboard target with sections labeled 'A', 'B', 'C', 'D'.",
+ "Camera shifts to person holding toy gun and firing at the target.",
+ "Projectiles hit the labeled sections and a nearby floating blue balloon.",
+ "Speaker announces, 'All right boys, this is the range test. All right guys, we were actually able to hit the target 50% of the time from the 75-foot range and at CQB, of course it's going to be able to hit the target every single time.'",
+ "Target gets hit multiple times.",
+ "Close-up of toy gun held over a pool.",
+ "Toy gun is held over a pool with blue and green balloons floating on the water.",
+ "Person points out different parts of the gun.",
+ "Voice concludes, 'This thing's not that bad for a hundred dollars; I recommend it, to be honest. It's a great gel blaster.'",
+ "Video closes with person highlighting gun's features and orange magazine over the pool."
+ ]
+ },
+ {
+ "video": "video/45.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a boy sporting a red and blue plaid shirt, seated on a white couch against a beige background. This boy, who has medium-length, wavy brown hair, smiles and appears to be awaiting something or someone. The audio from the boy sets a lively tone as a conversation begins: \"Hope you guys are ready for board game night.\" An agreeable response from the man follows, \"Yes, sir.\" The boy continues, \"I brought 'Gotcha!,' because you know what you always say, 'Gotcha will getcha.'\"\n\nThe scene transitions to a cozy, well-decorated room. Here, two individuals are seated closely together, enhancing the intimate and inviting atmosphere. One woman, dressed in a teal outfit, listens attentively while the other, a man in a dark suit with a light blue shirt and a patterned tie, speaks with confidence. The audio continues seamlessly, \"I've said that before.\" \"You don't sound that excited.\" This exchange is light-hearted, but there's an undercurrent of anticipation.\n\nReturning to the boy in the plaid shirt, the scene reveals him holding a board game box prominently displaying the word \"Gotcha!\" in large red letters. There is visible enthusiasm as he speaks, evidently discussing the game, adding to the excitement conveyed in the audio. The video then expands to show a wider view of the room, showcasing a fireplace, a coffee table, and various decorative items. The boy in the plaid shirt now sits in a white armchair while the other two individuals sit across from him. The boy and the man remain engaged in conversation, emphasizing the social nature of the scene. The rest of the video cuts back and forth between four framings: 'the boy talking', 'the man talking', 'the man talking with the woman next to him nodding in agreement', and 'a panorama of the three people'.\n\nSwitching to a close-up of the plaid-shirted boy on the white couch, his expressive eyes and gesturing hands suggest a deeper dive into the context or rules of the game. The audio complements his expressions with a sequence of voices: \"Are we going to engage in a game of 'Gotcha'?\" A playful back-and-forth ensues, \"Yeah, you told me to come over to play this.\" A challenging rebuttal, \"That allegation is false,\" adds a touch of humor to the interaction. The scene transitions to a close-up of the man with gray hair, accentuating his composed demeanor against the backdrop of the well-decorated room. Returning to the plaid-shirted boy, the narration continues, amplifying the group's playful dynamics.\n\nA closer focus on the two individuals seated together reveals the intermittent dialogue: \"You told me to come to your state to play Gotcha\" and \"Everybody going to my state and spending more time trying to play 'Gotcha.'\" This line blurs the boundary between leisure and commitment. The exchange continues, \"I thought we had arranged to have a board game,\" to which the response is pragmatic, \"This is not an arrangement or an understanding. This is a marriage.\" The astonished reply, \"What?\" punctuates the scene, underscoring the complexity of their conversation and adding layers of meaning to the interactions.\n\nIn a smooth transition, the camera zooms in on the two individuals. The conversation delves into perceptions and expectations: \"Gotcha isn't a bad game or anything, right?\" Affirmation is swift, \"Of course it's not.\" Seeking resolution, the man in the dark suit contemplates, \"And let's take it from your point of view,\" followed by a pragmatic suggestion, \"Alright, then let's play so we can make this awkward situation go away.\" However, the resolution remains elusive as the comment, \"That won't make it go,\" reflects the underlying tension.\n\nThe narrative advances with alternating close-ups between the two discussing the game and the plaid-shirted boy on the white couch, emphasizing their speech and gestures. He expresses frustration and an ultimatum, \"All right, you know what? You guys, I'm leaving. We can play by ourselves.\" The response is immediate and layered with sarcasm, \"Oh, good. Now we can play. 'Gotcha.'\" The retort, \"Don't patronize me,\" adds a concluding note to their interactions.\n\nThe video wraps up with the boy in the plaid shirt finalizing his explanation, his facial expressions and tone indicating a conclusive remark. The cohesive yet complex interplay of visuals and dialogue encapsulates a narrative rich with anticipation, humor, and subtle tension, woven seamlessly across the scenes."
+ }
+ ],
+ "events": [
+ "Boy in a red and blue plaid shirt sitting on a white couch against a beige background.",
+ "Boy, with medium-length wavy brown hair, smiles and waits for someone.",
+ "Boy says, 'Hope you guys are ready for board game night.'",
+ "Man responds, 'Yes, sir.'",
+ "Boy continues, 'I brought 'Gotcha!,' because you know what you always say, 'Gotcha will getcha.''",
+ "Scene transitions to well-decorated room with two individuals seated closely together.",
+ "Woman in a teal outfit listens attentively.",
+ "Man in dark suit with light blue shirt and patterned tie speaks with confidence.",
+ "Man says, 'I've said that before. You don't sound that excited.'",
+ "Light-hearted conversation with an undercurrent of anticipation.",
+ "Scene returns to boy in plaid shirt holding a board game box displaying 'Gotcha!'",
+ "Boy shows visible enthusiasm discussing the game.",
+ "Scene expands to wider room view showing fireplace, coffee table, and decorative items.",
+ "Boy in plaid shirt now sits in a white armchair.",
+ "Two individuals sit across from the boy.",
+ "Boy and man engaged in conversation emphasizing social nature.",
+ "Frames alternate: boy talking, man talking, man talking with woman nodding, panorama of three people.",
+ "Close-up of boy on white couch gesturing with expressive eyes.",
+ "Boy says, 'Are we going to engage in a game of 'Gotcha'?'",
+ "Playful back-and-forth, 'Yeah, you told me to come over to play this.'",
+ "Challenging rebuttal, 'That allegation is false.'",
+ "Close-up of composed man with gray hair in the well-decorated room.",
+ "Scene returns to boy in plaid shirt narrating group's playful dynamics.",
+ "Focus on two seated individuals having intermittent dialogues.",
+ "Dialogue: 'You told me to come to your state to play Gotcha.'",
+ "Reply: 'Everybody going to my state and spending more time trying to play 'Gotcha.'",
+ "Man adds, 'I thought we had arranged to have a board game.'",
+ "Response: 'This is not an arrangement or an understanding. This is a marriage.'",
+ "Astonished reply, 'What?'.",
+ "Camera zooms in on two individuals discussing perceptions and expectations.",
+ "Man says, 'Gotcha isn't a bad game or anything, right?'",
+ "Affirmation: 'Of course it's not.'",
+ "Man contemplates: 'And let's take it from your point of view.'",
+ "Suggestion: 'Alright, then let's play so we can make this awkward situation go away.'",
+ "Response: 'That won't make it go.'",
+ "Alternating close-ups between two discussing and plaid-shirted boy emphasizing their speech and gestures.",
+ "Boy expresses frustration and ultimatum, 'All right, you know what? You guys, I'm leaving. We can play by ourselves.'",
+ "Sarcastic response: 'Oh, good. Now we can play. 'Gotcha.'",
+ "Boy retorts, 'Don't patronize me.'",
+ "Boy in plaid shirt finalizes his explanation with conclusive facial expressions and tone.",
+ "Video wraps up encapsulating anticipation, humor, and tension."
+ ]
+ },
+ {
+ "video": "video/46.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a close-up of hands tying a white string around a black metal post in a garden. The text \"Most affordable trellis!\" appears at the top of the frame. The man pulls the string tight and secures it around the post, with small green plants visible in the background. A voice enthusiastically asks, \"Are you ready for the cheapest way to trellis your vegetables? And it actually works really well.\" The speaker introduces himself, \"Hey, I'm Joshua with The Garden is Growing, and let me show you what I've been doing for the past couple of years.\"\n\nThe scene shifts to a wider view of the garden, showing the man wearing black shoes and shorts as he walks along a row of plants, securing the string to another post. He continues, \"After learning about and trying the Florida weave, which is similar to this, but weaving the string in between tomato plants, I decided to try growing my snap peas with a string trellis.\" The tone is confident and clear, with music playing softly in the background.\n\nContinuing, the man secures the string tightly around the post, and the scene transitions again to a wider view of the garden. Multiple black metal posts with strings tied between them form a simple trellis system. As Joshua continues to explain the setup, his voice remains clear: \"Now, I bought posts to go in between, so that's an added cost, but you could even use an existing fence post, and the twine, I bought it two years ago, and look how much is left.\" The camera focuses on his hands pulling the string from a cardboard box.\n\nThe video progresses with Joshua, seen in a black shirt and shorts, walking from right to left and then left to right in the garden, tying more string around the metal posts. He adds, \"It's going to last me years longer. I first tried using jute twine, but it didn't quite hold the tension after a while.\" The sound of the string being plucked accompanies his voice. 
The close-up reveals the small green plants growing, supported by the strings.\n\nThe final segment begins with a close-up view of the green plants now thriving along the white string trellis system. The ground is covered with wood chips and straw. As the camera pans along the row of plants, Joshua says, \"Having these in the outside, gives the snap peas something to climb up through and be supported on both sides. I'll link to the twine if you want to check it out. And that's it. Super simple. Let me know if you try it out.\" Finally, a broader view of the garden shows multiple rows of plants and various garden structures. The video closes with a zoomed-in shot of green pea pods hanging from the trellis, indicating the system's success as Joshua warmly concludes, \"I hope you can spend some time in your garden today.\"\n\nThe video effectively demonstrates a practical and affordable method to create a garden trellis while providing helpful tips and a sense of encouragement through Joshua's engaging narrative."
+ }
+ ],
+ "events": [
+ "Close-up of hands tying a white string around a black metal post in a garden.",
+ "Text 'Most affordable trellis!' appears at the top of the frame.",
+ "Man pulls the string tight and secures it around the post.",
+ "Small green plants visible in the background.",
+ "Voice enthusiastically asks, 'Are you ready for the cheapest way to trellis your vegetables? And it actually works really well.'",
+ "Speaker introduces himself, 'Hey, I\u2019m Joshua with The Garden is Growing, and let me show you what I've been doing for the past couple of years.'",
+ "Scene shifts to a wider view of the garden.",
+ "Man with black shoes and shorts walks along a row of plants, securing the string to another post.",
+ "Joshua says, 'After learning about and trying the Florida weave, which is similar to this, but weaving the string in between tomato plants, I decided to try growing my snap peas with a string trellis.'",
+ "Tone is confident and clear, with music playing softly in the background.",
+ "Joshua secures the string tightly around the post.",
+ "Scene transitions to a wider view of the garden.",
+ "Multiple black metal posts with strings tied between them form a simple trellis system.",
+ "Joshua continues to explain the setup.",
+ "Joshua's voice is clear: 'Now, I bought posts to go in between, so that's an added cost, but you could even use an existing fence post, and the twine, I bought it two years ago, and look how much is left.'",
+ "Camera focuses on his hands pulling the string from a cardboard box.",
+ "Joshua, in a black shirt and shorts, walks from right to left and then left to right in the garden, tying more string around the metal posts.",
+ "Joshua adds, 'It's going to last me years longer. I first tried using jute twine, but it didn't quite hold the tension after a while.'",
+ "Sound of the string being plucked accompanies his voice.",
+ "Close-up reveals the small green plants growing, supported by the strings.",
+ "Final segment begins with a close-up view of green plants thriving along the white string trellis system.",
+ "Ground is covered with wood chips and straw.",
+ "Camera pans along the row of plants.",
+ "Joshua says, 'Having these in the outside, gives the snap peas something to climb up through and be supported on both sides. I'll link to the twine if you want to check it out. And that's it. Super simple. Let me know if you try it out.'",
+ "Broader view of the garden shows multiple rows of plants and various garden structures.",
+ "Video closes with a zoomed-in shot of green pea pods hanging from the trellis.",
+ "Joshua warmly concludes, 'I hope you can spend some time in your garden today.'"
+ ]
+ },
+ {
+ "video": "video/47.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a title card that reads 'DUMBEST TEXTS' in bold green and white letters against a dark background. As the scene transitions to a text message conversation, the first message appears in a green text bubble saying, \"Your kid is in my yard again.\" The conversation continues with a white text bubble replying, \"Come on, this again?\" followed by a green text bubble stating, \"He's not wearing any pants and he's chasing my dog.\" The white text bubble retorts, \"Boys will be boys,\" and the final green text bubble humorously points out, \"Yeah, but yours is 24 and drunk.\" The contact name 'Valerie' is displayed at the bottom of the screen. Concurrently, the audio features a man speaking the entire dialogue.\n\nThe scene shifts to the same title card as the first clip and transitions to a new text message conversation with 'Valerie' at the top. The first message, in a blue text bubble, reads, \"hey, you wanna see a movie sometime\" and is marked as read at 12:27 PM. Then, a screenshot reveals the same message sent to a different contact named 'Julian' at the same time. Julian's response in a black text bubble reads, \"this guy is so f*cking annoying and weird I DON'T like him.\" Realizing the mistake, Julian sends another message, \"oh f*ck.\" The final response in a blue text bubble humorously reads, \"nah its cool lol\" with a skull emoji. The audio features a man and a woman exchanging these lines, ending with the man saying, \"nah, it's cool, lol.\"\n\n The scene then introduces a new conversation with 'Mom'. The mom's message reads, \"Sorry Honey Your grandpa died LOL\" at 10:28 PM. To this, the child replies, \"mom do you know what LOL Means,\" and the mom answers, \"Yes! 
'lots of love.'\" The child corrects, \"Mom google it.\" The scene then transitions to another conversation where the contact name is obscured, with the first message in a green text bubble saying, \"Hello.\" The white text bubble replies, \"Tell me something that will make me wet xxxx,\" to which the green text bubble responds, \"Go outside.\" A man primarily voices this dialogue, with a woman interjecting briefly.\n\nAgain, the title card appears and transitions to the continuation of the suggestive conversation. The green text bubble states, \"Go outside. It's raining,\" followed by the white text bubble, \"Try again xxxx.\" The green text bubble suggests, \"Don't use an umbrella,\" and then the scene moves to another conversation with 'Mom'. An image of a caracal cat is shown with partial text, \"he looks like the when a was,\" sent at 6:53 PM. 'Mom' replies in a gray text bubble, \"You're aiding and abetting Russian mafia when u share Floppa memes,\" sent at 6:55 PM. This conversation amusingly concludes with a green text bubble saying, \"Excellent.\" The man continues narrating, while another man also briefly speaks: \"It's what floppa would want,\" at 7:03 PM. The scene then transitions to a new text conversation with 'Taylor's Hot Friend.' The white text bubble reads, \"You didn't like my post yesterday!\" accompanied by laughing and kissing emojis. A blue text bubble responds with a question mark, and the white text bubble clarifies, \"On Instagram!\" The blue text bubble then expresses confusion, \"What in the actual f*ck is happening right now?\" The white text bubble complains, \"Whatever. And now you're liking every picture I've ever posted. Ok we\", indicating frustration. The dialogue continues with alternating male and female voices.\n\nThe video concludes with the same title card and a continuation of the conversation with 'Taylor's Hot Friend.' In this sequence, the white text bubble finishes the last message: \"Ok we get it. A**hole. 
Not what I meant.\" A blue text bubble then harshly replies, \"Yea... I want to make sure I never receive a text this f*cking stupid ever again.\" The video ends with a woman with long blonde hair wearing earphones, accompanied by the text 'SUBSCRIBED' and a bell icon, prompting viewers to subscribe."
+ }
+ ],
+ "events": [
+ "Video begins with a title card 'DUMBEST TEXTS' in bold green and white letters against a dark background",
+ "Scene transitions to a text message conversation",
+ "First message in green text bubble: 'Your kid is in my yard again'",
+ "White text bubble replies: 'Come on, this again?'",
+ "Green text bubble states: 'He's not wearing any pants and he's chasing my dog'",
+ "White text bubble retorts: 'Boys will be boys'",
+ "Final green text bubble: 'Yeah, but yours is 24 and drunk'",
+ "Contact name 'Valerie' is displayed at the bottom of the screen",
+ "Audio features a man speaking the entire dialogue",
+ "Scene shifts to the same title card as the first clip",
+ "New text message conversation with 'Valerie' at the top",
+ "First message in blue text bubble: 'hey, you wanna see a movie sometime' marked as read at 12:27 PM",
+ "Screenshot reveals the same message sent to a different contact named 'Julian' at the same time",
+ "Julian's response in a black text bubble: 'this guy is so f*cking annoying and weird I DON'T like him'",
+ "Julian realizes the mistake and sends another message: 'oh f*ck'",
+ "Final response in blue text bubble: 'nah its cool lol' with a skull emoji",
+ "Audio features a man and a woman exchanging these lines, ending with the man saying, 'nah, it's cool, lol'",
+ "Scene introduces a new conversation with 'Mom'",
+ "Mom's message reads: 'Sorry Honey Your grandpa died LOL' at 10:28 PM",
+ "Child replies: 'mom do you know what LOL Means'",
+ "Mom answers: 'Yes! 'lots of love.''",
+ "Child corrects: 'Mom google it'",
+ "Scene transitions to another conversation with the contact name obscured",
+ "First message in green text bubble says: 'Hello'",
+ "White text bubble replies: 'Tell me something that will make me wet xxxx'",
+ "Green text bubble responds: 'Go outside'",
+ "A man primarily voices this dialogue, with a woman interjecting briefly",
+ "Title card appears again and transitions to the continuation of the suggestive conversation",
+ "Green text bubble states: 'Go outside. It's raining'",
+ "White text bubble: 'Try again xxxx'",
+ "Green text bubble suggests: 'Don't use an umbrella'",
+ "Scene moves to another conversation with 'Mom'",
+ "Image of a caracal cat is shown with partial text: 'he looks like the when a was' sent at 6:53 PM",
+ "'Mom' replies in a gray text bubble: 'You're aiding and abetting Russian mafia when u share Floppa memes' sent at 6:55 PM",
+ "Green text bubble says: 'Excellent'",
+ "A man continues narrating, while another man briefly speaks: 'It's what floppa would want' at 7:03 PM",
+ "Scene transitions to a new text conversation with 'Taylor's Hot Friend'",
+ "White text bubble reads: 'You didn't like my post yesterday!' accompanied by laughing and kissing emojis",
+ "Blue text bubble responds with a question mark",
+ "White text bubble clarifies: 'On Instagram!'",
+ "Blue text bubble expresses confusion: 'What in the actual f*ck is happening right now?'",
+ "White text bubble complains: 'Whatever. And now you're liking every picture I've ever posted. Ok we'",
+ "Dialogue continues with alternating male and female voices",
+ "Video concludes with the same title card",
+ "Continuation of the conversation with 'Taylor's Hot Friend'",
+ "White text bubble finishes the last message: 'Ok we get it. A**hole. Not what I meant.'",
+ "Blue text bubble harshly replies: 'Yea... I want to make sure I never receive a text this f*cking stupid ever again.'",
+ "Video ends with a woman with long blonde hair wearing earphones",
+ "Text 'SUBSCRIBED' and a bell icon prompts viewers to subscribe"
+ ]
+ },
+ {
+ "video": "video/48.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a woman sitting in the driver's seat of a car, wearing a bright orange long-sleeve top and colorful, textured leggings with a rainbow pattern. She holds a smartphone with a marble-patterned case in her left hand, showing nails painted in different colors. Text on the screen reads 'Muahahaha she hates me!', suggesting a playful or mischievous context. The background outside the car window reveals a parking lot with several vehicles and some buildings in the distance. The TikTok username '@jacobfredenburg1' is visible on the screen. \n\nThe audio reveals a man speaking into a microphone, commenting, \"Got a nice short-term MILF walking in right now. Good for one or two more divorces. Looks like she's been foundered. Got a bad quarter on that side. Shortened spread on the teeth, non-legible tattoo. I'd say she's bangs vaccinated, though, because she's been banged more than an anvil at a blacksmith shop,\" with a playful tone in the background. The woman appears to react humorously to the man's speech, covering her mouth with her right hand as she laughs and looks out the window.\n\nThe speaker continues the audacious commentary, adding, \"I mean to tell you, see more wieners than the Oscar Mayer hot dog company. She looks like she's pretty high-maintenance, not an easy keeper. Look at that rear end. Looks like a hail-damaged Volkswagen bug hood. Looks like somebody took a baseball bat and dented the hell out of her.\" The woman in the driver's seat then leans toward the window, still laughing or reacting humorously. This time, she covers her mouth with her right hand, and her left hand rests on her lap.\n\nThe audio continues with the same character's voice, now saying, \"You bet,\" followed by the revving sound of a vehicle engine.\n\nThe person then leans back in the seat, pulling their knees to their chest and covering their face with both hands. 'Muahahaha she hates me!' 
remains on display, underscoring the playful nature of the situation. The parking lot remains unchanged outside the window. The audio now features a man speaking quickly and loudly, saying, \"Right here. What do you want to do? Here. Oh, boy. Nice little rig here. Now 30. Now 1. Now 1. Now 1. You won't find another dumpster fire like her, boy. Now 31. Now 31. While you're up to you. Now 31. Now I'm going to begin 2. Now 32. Now I'm going to begin 32. And 33. And I'm going to begin 33. 33. And I'm going to begin 34. 34...\"\n\nSubsequently, the woman leans forward again, laughing and looking out the window. She again covers her face with her right hand while her left hand rests on her lap. She then leans back into the seat, pulling her knees up and covering her face with both hands again, continuing her humorous reaction. The fast-paced voice continues with a rapid, high-pitched speech pattern, saying, \"And I'm going to begin 34. And 35. And I'm going to begin 35. And 35. And I'm going to begin 35. And 36. She goes zero down the fight in 2.2 seconds, boy. Now 37, now 37, now 37, now 38, now 38, now 39.\" Finally, the woman pulls the collar of her top over her face using her right hand, crossing her left arm over her body with the smartphone resting on her lap. The 'Muahahaha she hates me!' text remains on the screen, and the parking lot scene outside the window is consistent. The same quick, loud speech still continues in the audio.\n\nThe overall theme suggests a humorous, possibly satirical tone as the woman reacts to an ongoing, exaggerated commentary, maintaining a light and playful atmosphere throughout the video."
+ }
+ ],
+ "events": [
+ "Video begins with a woman sitting in the driver's seat of a car.",
+ "Woman is wearing a bright orange long-sleeve top and colorful, textured leggings with a rainbow pattern.",
+ "Woman is holding a smartphone with a marble-patterned case in her left hand.",
+ "Her nails are painted in different colors.",
+ "Text on the screen reads 'Muahahaha she hates me!'",
+ "Background outside the car window shows a parking lot with several vehicles and some buildings in the distance.",
+ "TikTok username '@jacobfredenburg1' is visible on the screen.",
+ "Audio features a man speaking into a microphone with a playful tone in the background.",
+ "Man comments: 'Got a nice short-term MILF walking in right now. Good for one or two more divorces. Looks like she's been foundered. Got a bad quarter on that side. Shortened spread on the teeth, non-legible tattoo. I'd say she's bangs vaccinated, though, because she's been banged more than an anvil at a blacksmith shop.'",
+ "Woman reacts humorously to the man's speech, covering her mouth with her right hand and laughing while looking out the window.",
+ "Man continues commentary: 'I mean to tell you, see more wieners than the Oscar Mayer hot dog company. She looks like she's pretty high-maintenance, not an easy keeper. Look at that rear end. Looks like a hail-damaged Volkswagen bug hood. Looks like somebody took a baseball bat and dented the hell out of her.'",
+ "Woman leans toward the window while laughing or reacting humorously.",
+ "Woman covers her mouth with her right hand, and her left hand rests on her lap.",
+ "Audio features the man's voice saying, 'You bet,' followed by the revving sound of a vehicle engine.",
+ "Woman leans back in the seat, pulling her knees to her chest and covering her face with both hands.",
+ "'Muahahaha she hates me!' text remains on display.",
+ "Background scene outside the window stays consistent with the parking lot.",
+ "Audio features a man speaking quickly and loudly: 'Right here. What do you want to do? Here. Oh, boy. Nice little rig here. Now 30. Now 1. Now 1. Now 1. You won't find another dumpster fire like her, boy. Now 31. Now 31. While you're up to you. Now 31. Now I'm going to begin 2. Now 32. Now I'm going to begin 32. And 33. And I'm going to begin 33. 33. And I'm going to begin 34. 34...'",
+ "Woman leans forward again, laughing and looking out the window.",
+ "Woman covers her face with her right hand while her left hand rests on her lap.",
+ "She then leans back into the seat, pulling her knees up and covering her face with both hands again, continuing her humorous reaction.",
+ "Fast-paced voice continues with rapid, high-pitched speech pattern: 'And I'm going to begin 34. And 35. And I'm going to begin 35. And 35. And I'm going to begin 35. And 36. She goes zero down the fight in 2.2 seconds, boy. Now 37, now 37, now 37, now 38, now 38, now 39.'",
+ "Woman pulls the collar of her top over her face using her right hand, crossing her left arm over her body with the smartphone resting on her lap.",
+ "'Muahahaha she hates me!' text remains on the screen.",
+ "Parking lot scene outside the window remains consistent.",
+ "Quick, loud speech continues in the audio.",
+ "Video theme is humorous and satirical, with the woman reacting to exaggerated commentary and maintaining a light and playful atmosphere."
+ ]
+ },
+ {
+ "video": "video/49.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with two identical Lugia Pok\u00e9mon positioned side by side against a bright green background. The text at the bottom of the screen reads 'Pok\u00e9mon are being fused!' in English. As the two Lugia Pok\u00e9mon start to merge into a single, more complex Pok\u00e9mon with an intricate wing structure, a man narrates in a deep voice, \"Fusion between Lugia and Lugia. Okay, why is his head so tiny? Dude, I like the arms, man. Fire design, except the head, though. \" The text at the bottom changes to 'Congratulations! Your Pok\u00e9mon was fused into Lugia!', indicating the successful fusion. The background remains consistently green.\n\nNext, the newly fused Lugia Pok\u00e9mon, showcasing its evolved and elaborate design with intricate wings, stands against the same bright green background. The scene seamlessly transitions to two identical Torterra Pok\u00e9mon positioned side by side. The text at the bottom reads 'Pok\u00e9mon are being fused!' Once again, a fusion process begins. As the two Torterra Pok\u00e9mon merge into a larger, more complex version, the intricate details on its shell and the larger tree feature become evident. \"It looks kind of weird,\" a man remarks, followed by, \"Fusion between Torterra and Torterra. He has a whole ass Mount Everest on his back, man.\" The text changes to 'Congratulations! Your Pok\u00e9mon was fused into Torterra!' confirming the successful fusion.\n\nThe newly fused Torterra Pok\u00e9mon with its more intricate design stands against the bright green background. The scene quickly transitions to two new Pok\u00e9mon: a robust, yellow Pok\u00e9mon with black stripes and antennae-like structures on its head, and a purple, snake-like Pok\u00e9mon with a menacing face on its hood. The text at the bottom announces 'Pok\u00e9mon are being fused!' The fusion process begins again, and the man exclaims, \"Multiple mountains. Spikes. Tube trees. Bonsai trees. Tree of life. I don't even know, man. 
This is clean.\"\n\nThe final transformation reveals the two different Pok\u00e9mon, side by side. The text at the bottom reads 'Pok\u00e9mon are being fused!' The fusion initiates, and as the green silhouette of the new Pok\u00e9mon emerges, combining the muscular build and purple-black stripes, and featuring multiple antennae-like structures with glowing orange eyes, the man comments, \"Fusion between Electivire and Arbok. Oh yeah, look at the design. It has stripes everywhere, bro. Absolutely fire.\" The text at the bottom of the screen confirms the successful fusion with 'Congratulations! Your Pok\u00e9mon was fused into Arvire!' The background remains consistently green as the video concludes."
+ }
+ ],
+ "events": [
+ "Two identical Lugia Pok\u00e9mon are positioned side by side against a bright green background.",
+ "Text at the bottom reads 'Pok\u00e9mon are being fused!' in English.",
+ "Two Lugia Pok\u00e9mon start to merge into a single, more complex Pok\u00e9mon with intricate wing structure.",
+ "A man narrates in a deep voice: 'Fusion between Lugia and Lugia. Okay, why is his head so tiny? Dude, I like the arms, man. Fire design, except the head, though.'",
+ "Text at the bottom changes to 'Congratulations! Your Pok\u00e9mon was fused into Lugia!'",
+ "Newly fused Lugia Pok\u00e9mon with evolved and elaborate design stands against the bright green background.",
+ "Scene transitions to two identical Torterra Pok\u00e9mon positioned side by side.",
+ "Text at the bottom reads 'Pok\u00e9mon are being fused!'",
+ "Two Torterra Pok\u00e9mon merge into a larger, more complex version with intricate details on its shell and larger tree feature.",
+ "A man remarks: 'It looks kind of weird,' followed by 'Fusion between Torterra and Torterra. He has a whole ass Mount Everest on his back, man.'",
+ "Text changes to 'Congratulations! Your Pok\u00e9mon was fused into Torterra!'",
+ "Newly fused Torterra Pok\u00e9mon with intricate design stands against the bright green background.",
+ "Scene transitions to two new Pok\u00e9mon: a robust, yellow Pok\u00e9mon with black stripes and antennae-like structures on its head, and a purple, snake-like Pok\u00e9mon with a menacing face on its hood.",
+ "Text at the bottom announces 'Pok\u00e9mon are being fused!'",
+ "Fusion process begins and man exclaims: 'Multiple mountains. Spikes. Tube trees. Bonsai trees. Tree of life. I don't even know, man. This is clean.'",
+ "Final transformation reveals the two different Pok\u00e9mon, side by side.",
+ "Text at the bottom reads 'Pok\u00e9mon are being fused!'",
+ "Fusion initiates and green silhouette of new Pok\u00e9mon emerges with muscular build, purple-black stripes, multiple antennae-like structures, and glowing orange eyes.",
+ "Man comments: 'Fusion between Electivire and Arbok. Oh yeah, look at the design. It has stripes everywhere, bro. Absolutely fire.'",
+ "Text at the bottom confirms successful fusion with 'Congratulations! Your Pok\u00e9mon was fused into Arvire!'",
+ "Video concludes with background consistently green."
+ ]
+ },
+ {
+ "video": "video/50.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a young woman standing in front of an elevator, dressed in a grey t-shirt, sweatpants, and a cap. The text overlay reads, \"when you're scared of the elevator:\" She appears hesitant, extending her arm to press the elevator button. The audio reveals a woman saying, \"Come on, let's take the elevator.\" The video shifts to the same woman, now without a cap and playing another character, who says, \"Oh, heck no! There's no way I'm getting on that elevator.\" The woman in the cap says, \"Are you serious? You're actually not going to take the elevator? Why?\" The scene transitions to her in front of a city backdrop, now in a beige hoodie and grey sweatpants, making a cross sign with her fingers, indicating refusal or fear. \n\nThe woman in the beige hoodie and grey sweatpants stands in front of the city backdrop and gestures forward, while the audio continues, \"I've seen it in the movies. The elevators, they can get stuck and people get stuck on it and they die.\" She extends her arm forward hesitantly. The scene transitions back to the woman in the grey outfit in front of the elevator, where she looks down with a questioning expression. The woman with the cap says, gesturing with her hands, \"You're being ridiculous. Everybody rides the elevator. You're fine. We're not going to get stuck.\" The woman without the cap says, gesturing with her hands, \"Let's take the stairs. The stairs is safe. Stairs equal safe.\"\n\nAs the story continues, the young woman in the grey t-shirt, sweatpants, and cap stands still in front of the elevator, her arms slightly raised in apprehension. The audio conveys her loud resistance: \"There's no way I'm taking the stairs. I'm not stepping on those stairs,\" followed by the woman without the cap retorting, \"Well, I'm taking the stairs. See ya, and I'm gonna beat you.\" The woman without the cap then runs away from the elevator, making gestures showing her refusal or fear, emphasizing her fear of using the elevator. 
The scene shifts back to the woman with the cap in front of the elevator, cautiously extending her arm to press the elevator button and stepping inside. The audio ends with her saying, \"Have fun on the stairs, I'm beating you,\" accompanied by the sound of a door opening and closing.\n\nThe video then transitions to the young woman in the beige hoodie and grey sweatpants running down a hotel hallway, with a small white dog sitting on the floor. The text still reads, \"when you're scared of the elevator:\" The woman bursts through a door marked 'Stair 1' and runs up the stairs. The setting alternates between the woman with the cap standing inside the elevator in her grey outfit, saying, \"She's so stupid,\" and the woman without the cap running up the stairs, then opening a door with a poster on the wall reading 'Stair 1' and a big 5 on it. The audio records the woman in the elevator's mocking voice, \"I love the elevator. The elevator is like a ride,\" followed by beeping sounds as the elevator display counts down from 3 to 2.\n\nFurther, the young woman stands apprehensively inside the elevator, then walks out of it saying, \"Oh, and I beat her.\" She responds, \"You beat me? It's about time you got here.\" The woman without the cap says, \"How did you beat me? I took the stairs. I thought it was so quick.\" The scene returns to her running down a carpeted staircase and through a hotel hallway in her beige hoodie and sweatpants, showing her fear of the elevator. \n\nLastly, the woman is back in front of the elevator, still in her grey outfit. She explains something with outstretched arms in a questioning posture: \"It's because the elevator's better than the stairs.\" In a different setting, wearing her beige hoodie and grey sweatpants, she stands in a hallway with a patterned carpet, looking around cautiously. Her exaggerated movements show her avoiding something or expressing fear. The audio ends with another child responding, \"Oh, dang it! I don't like today!\" as she gestures with an angry expression."
+ }
+ ],
+ "events": [
+ "Young woman stands in front of an elevator, dressed in a grey t-shirt, sweatpants, and a cap.",
+ "Text overlay reads: 'when you're scared of the elevator:'",
+ "She appears hesitant, extends her arm to press the elevator button.",
+ "Audio reveals a woman saying: 'Come on, let's take the elevator.'",
+ "Scene shifts to the same woman without a cap, portraying another character, who says: 'Oh, heck no! There's no way I'm getting on that elevator.'",
+ "Woman in cap says: 'Are you serious? You're actually not going to take the elevator? Why?'",
+ "Scene transitions to her in front of a city backdrop, now in a beige hoodie and grey sweatpants, making a cross sign with her fingers.",
+ "Woman in beige hoodie stands in front of the city backdrop, gestures forward, audio continues: 'I've seen it in the movies. The elevators, they can get stuck and people get stuck on it and they die.'",
+ "Scene transitions back to the woman in the grey outfit in front of the elevator, she looks down with a questioning expression.",
+ "Woman with a cap says and gestures: 'You're being ridiculous. Everybody rides the elevator. You're fine. We're not going to get stuck.'",
+ "Woman without cap says and gestures: 'Let's take the stairs. The stairs is safe. Stairs equal safe.'",
+ "Young woman in grey t-shirt, sweatpants, and a cap stands still in front of the elevator, arms slightly raised in apprehension.",
+ "Audio conveys her loud resistance: 'There's no way I'm taking the stairs I'm not stepping on those stairs.'",
+ "Woman without cap debates: 'Well, I'm taking the stairs. See ya, and I'm gonna beat you.'",
+ "Woman without cap runs away from the elevator, making gestures showing her refusal or fear.",
+ "Scene shifts to woman with cap in front of the elevator, extending her arm to press the elevator button and stepping inside.",
+ "Audio ends with her saying: 'Have fun on the stairs, I'm beating you,' with a door's opening and closing sound.",
+ "Video transitions to young woman in beige hoodie and grey sweatpants running down a hotel hallway, with a small white dog sitting on the floor.",
+ "Text still reads: 'when you're scared of the elevator:'",
+ "Woman bursts through a door marked 'Stair 1' and runs up the stairs.",
+ "Setting alternates to the woman with a cap standing inside the elevator in her grey outfit, who says: 'She's so stupid.'",
+ "Woman without cap runs up the stairs, opens a door with a wall poster reading 'Stair 1' and a big '5' on it.",
+ "Audio records woman in elevator's mocking voice: 'I love the elevator. The elevator is like a ride,' followed by beeping sounds as the elevator display counts down from 3 to 2.",
+ "Young woman stands apprehensively inside the elevator then walks out of it says: 'Oh, and I beat her.'",
+ "She responds: 'You beat me? It's about time you got here.'",
+ "Woman without cap says: 'How did you beat me? I took the stairs. I thought it was so quick.'",
+ "Scene returns to woman without cap running down a carpeted staircase and through a hotel hallway in beige hoodie and sweatpants.",
+ "Woman is back in front of the elevator, still in her grey outfit, explains something with outstretched arms, a questioning posture: 'It's because the elevator's better than the stairs.'",
+ "In different setting, wearing beige hoodie and grey sweatpants, she stands in hallway with patterned carpet, looking around cautiously.",
+ "Her exaggerated movements show her avoiding something or expressing fear.",
+ "Audio ends with another child responding: 'Oh, dang it! I don't like today!' and she gestures with angry expression."
+ ]
+ },
+ {
+ "video": "video/51.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts in a Minecraft-like environment, where a character navigates through a water-filled area containing red and beige blocks. The screen displays text corresponding to the audio in real time. At the bottom are a yellow timer along with game items, hearts, and other icons. As the character moves and jumps on the blocks, audio plays over background music: \"You've probably realized that numbers are pretty weird. Probability and statistics really drive the ball out of the park for me though.\" This adds a casual and slightly humorous tone to the video. Near the 'Finish' sign, the text appears, matching the informal audible tone. The clip shows the character standing on a pressure plate by the 'Finish' sign.\n\nNext, the character is in a similar environment, with torch-like objects prominently mounted on the wall. Additional text appears sequentially, suggesting a narrative or list of terms. The video continues with the character's exploration, while the audio discusses: \"Just listen to this. In 1996, David Justice, a baseball player, had a higher batting average each year than Derek Jeter.\" \n\nContinuing with the narrative, more text surfaces as the character continues the traversal. This text sequence adds to the unfolding story. Concurrently, the audio explains: \"However, when looked at the two years together, Derek's batting average was higher, even though in both 1995 and 1996, David's batting averages were higher for each year.\" The background chatter gives more depth to the statistical paradox described in the visuals. The clip concludes with the character at another pressure plate.\n\nIn the next sequence, more text appears, forming a continuation of the narrative. The character's journey remains consistent, and the audio elaborates: \"This paradox is called the Simpson's Paradox, and it's the basis for gerrymandering, which is a real flaw in the electoral system.\" The mix of gameplay and informative discussion deepens viewer engagement.\n\nFinally, the character progresses in the same setting, and on-screen text signals a nearing milestone. Corresponding with the text, the audio states: \"Anyways, if you liked this video, make sure to drop a follow, since we're really close to 400K, and by now...\" The voice, accompanied by different music, builds a mysterious sense of advancement in the video. The clip maintains the same dynamic visual style, keeping the narrative cohesive and visually engaging till the end.\n\nThroughout, the video blends casual humor with educational content, using in-game text to visually represent spoken narratives. The concluding cliffhanger about the follow count, represented both visually and audibly, leaves an engaging call-to-action for viewers."
+ }
+ ],
+ "events": [
+ "Video starts in a Minecraft-like environment",
+ "Character navigates through a water-filled area with red and beige blocks",
+ "Screen displays texts corresponding to audio in real time",
+ "Yellow timer is visible at the bottom",
+ "Game items, hearts, and other icons are shown at the bottom",
+ "Character moves and jumps on the blocks",
+ "Audio plays with background music: 'You've probably realized that numbers are pretty weird. Probability and statistics really drive the ball out of the park for me though.'",
+ "A casual and slightly humorous tone is set for the video",
+ "Near the 'Finish' sign, text appears matching the informal audible tone",
+ "Character stands on a pressure plate by the 'Finish' sign",
+ "Next scene in a similar environment with torch-like objects on the wall",
+ "Additional text appears sequentially, suggesting a narrative or list of terms",
+ "Character explores the environment",
+ "Audio discusses: 'Just listen to this. In 1996, David Justice, a baseball player, had a higher batting average each year than Derek Jeter.'",
+ "Text surfaces as the character continues the traversal",
+ "Audio explains: 'However, when looked at the two years together, Derek's batting average was higher, even though in both 1995 and 1996, David's batting averages were higher for each year.'",
+ "Background chatter adds depth to the statistical paradox description",
+ "Character stands on another pressure plate",
+ "Next sequence with more text appearing to continue the narrative",
+ "Character's journey remains consistent",
+ "Audio elaborates: 'This paradox is called the Simpson's Paradox, and it's the basis for gerrymandering, which is a real flaw in the electoral system.'",
+ "Character progresses in the same setting, with texts indicating a nearing milestone",
+ "Corresponding audio states: 'Anyways, if you liked this video, make sure to drop a follow, since we're really close to 400K, and by now...'",
+ "Different music builds a mysterious sense of advancement in the video",
+ "Video maintains dynamic visual style, keeping narrative cohesive and visually engaging",
+ "Concluding cliffhanger about follow count, both visually and audibly represented",
+ "Blending of casual humor with educational content throughout",
+ "In-game text visually represents spoken narratives",
+ "Engaging call-to-action for viewers"
+ ]
+ },
+ {
+ "video": "video/52.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up view of various small aquatic creatures, most likely snails, moving slowly on a glass surface submerged in water. The background is slightly blurred, showing diverse colors and shapes, suggesting an underwater environment. The snails feature different shell patterns and colors, including brown and red. In the bottom left corner, a neutral-face emoji shifts colors from orange to blue. The audio features a male voice discussing his mixed feelings about snails despite recognizing their usefulness in algae control. He explains his preference for nerite snails because their offspring cannot survive in freshwater, though he mentions that the eggs laid by females are hard to manage.\n\nThe scene seamlessly transitions to a close-up of a larger snail, with a striped shell, resting on a green leaf underwater. The slightly blue background features a few visible aquatic plants, enhancing the underwater setting. The audio continues, centering on the inconvenience of hard white eggs or egg capsules appearing on plants and hardscape, making maintenance more difficult.\n\nNext, the focus shifts to a close-up view of a snail with a rounded, light-colored shell moving slowly. A series of small, striped fish swim around it, adding a dynamic feel to the underwater scene. The scene changes to feature another snail with a black and yellow striped shell attached to a reddish-brown rock. \"#1 BUY NERITE SNAIL\" text appears at the top, recommending the purchase of nerite snails. The audio reveals the difficulty of distinguishing male from female nerite snails and outlines the first step as simply purchasing a nerite snail.\n\nThe narrative continues with a close-up of a hand holding a small, dark-colored snail over a green leaf, with the blue underwater background slightly blurred. \"#2 1 SNAIL PER TANK\" text advises keeping only one snail per tank. It then transitions to a snail with a light-colored shell carrying small particles, resting on a textured surface, alongside the advice \"#3 WAIT AND SEE,\" indicating patience. The scene then shows a brown leaf dotted with small white spots, possibly eggs. Text at the bottom, \"*MOVING DOTS = BBS,\" clarifies that the moving dots are baby brine shrimp. The associated audio suggests steps to manage snail populations and offers guidance on handling female snails that lay eggs.\n\nIt transitions to a snail with a light-colored, spotted shell in a shallow container with a blue and yellow patterned rim. \"#4 SWAP ANY GIRLS\" appears at the top, signaling a strategy to swap snails. The video focuses on another striped snail attached to a green leaf underwater, displaying \"NEW SNAIL,\" indicating the introduction of a new specimen. The clip ends with yet another striped snail moving along a dark, textured surface, introduced with the text \"NO MORE NERITE SNAIL EGGS.\" The accompanying audio advises exchanging female snails at local fish stores to increase the likelihood of acquiring a male, thus preventing egg-laying.\n\nThe final scene features the same snail moving on a textured, dark surface underwater, with green aquatic plants in the background. The text at the bottom reads 'NO MORE NERITE SNAIL EGGS' in blue, paired with 'TIP OF THE DAY' in white in the frame's bottom left corner, emphasizing the informational aim. The detailed visualization of particles floating in the water completes the aquatic ambiance. The concluding audio underscores the effort needed to secure a male nerite snail, which, unlike females, will not lay eggs, effectively rounding off the instructional nature of the video."
+ }
+ ],
+ "events": [
+ "Close-up view of various small aquatic creatures, likely snails, moving on a glass surface submerged in water",
+ "Background slightly blurred, showing diverse colors and shapes, underwater environment",
+ "Snails feature different shell patterns and colors (brown and red)",
+ "Neutral-face emoji in the bottom left corner shifts colors from orange to blue",
+ "Male voice discusses his mixed feelings about snails and their usefulness in algae control",
+ "Preference for nerite snails because their offspring cannot survive in freshwater",
+ "Mentions hard-to-manage eggs laid by female nerite snails",
+ "Transition to a close-up of a larger snail with a striped shell on a green leaf underwater",
+ "Background slightly blue with visible aquatic plants",
+ "Audio discusses the inconvenience of white eggs or egg capsules on plants and hardscape making maintenance difficult",
+ "Close-up of a snail with a rounded, light-colored shell moving slowly",
+ "Small, striped fish swim around it, creating a dynamic underwater scene",
+ "Scene changes to another snail with a black and yellow striped shell on a reddish-brown rock",
+ "#1 BUY NERITE SNAIL text recommends purchasing nerite snails",
+ "Audio mentions the difficulty of identifying male from female nerite snails",
+ "First step is simply to purchase a nerite snail",
+ "Close-up of a hand holding a small, dark-colored snail over a green leaf with a blue underwater background slightly blurred",
+ "#2 1 SNAIL PER TANK text advises keeping only one snail per tank",
+ "Transition to a snail with a light-colored shell carrying small particles on a textured surface",
+ "#3 WAIT AND SEE text advises patience",
+ "Scene shows a brown leaf with small white spots, possibly eggs",
+ "*MOVING DOTS = BBS text clarifies that moving dots are baby brine shrimp",
+ "Audio suggests steps to manage snail populations and handle female snails laying eggs",
+ "Snail with a light-colored, spotted shell in a shallow container with a blue and yellow patterned rim",
+ "#4 SWAP ANY GIRLS text appears, signaling strategy to swap snails",
+ "Focus on another striped snail attached to a green leaf underwater",
+ "NEW SNAIL text indicates the introduction of a new specimen",
+ "Another striped snail moving along a dark, textured surface",
+ "NO MORE NERITE SNAIL EGGS text appears",
+ "Audio advises exchanging female snails at local fish stores to increase the likelihood of acquiring a male",
+ "Final scene with the same snail moving on a textured, dark surface underwater, green aquatic plants in the background",
+ "NO MORE NERITE SNAIL EGGS text in blue with TIP OF THE DAY in white in the frame's bottom left corner",
+ "Detailed visualization of particles floating in the water completes the aquatic ambiance",
+ "Concluding audio underscores the effort needed to secure a male nerite snail, which will not lay eggs"
+ ]
+ },
+ {
+ "video": "video/53.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with two men standing in front of a glass wall. One man, wearing a light gray suit, has his back to the camera, while the other, dressed in a dark suit, faces the camera. The audio introduces a man speaking, \"We need at least a good ten hours. Sydney to Los Angeles. One of the longest flights in the world. He makes it every two weeks.\" Background music accompanies the dialogue, setting a professional and urgent tone.\n\nNext, the visual shifts to another man in a dark suit standing on a staircase, looking down into a modern, well-lit lobby area decorated with several lamps and comfortable seating arrangements. This setting reinforces the high-end, modern environment, suggesting a workplace of significance or an upscale airport. The clip then returns briefly to the two men in front of the glass wall, with the man in the light gray suit turning slightly, before switching back to the man on the staircase as he begins to walk. The scene then reveals a black luxury car parked near an airplane, with a pilot and another man in a red uniform standing nearby, emphasizing the theme of travel and luxury.\n\nFollowing this, the scene transitions back to the two men by the glass wall, before shifting to an airport setting featuring a black luxury jet with the registration 'JC-HV.' A man in a red uniform works on the jet's engine, as another man in a dark suit stands nearby, inspecting the aircraft. The audio continues with the voice-over, \"It must be flying private not if they're unexpected to maintenance we're displaying it would have to be a 747. Why is that? It's not a 747, the pilot's up top on the first class cabins in the nose,\" underscoring technical concerns and the logistics of private flying.\n\nThe visuals then show a man in a dark suit conversing with a pilot in a black uniform and cap. The man in the dark suit appears thoughtful, looking around the modern, well-lit lobby area. The scene then transitions to a man in a dark suit and yellow tie. The clip circles back to the two men in front of the glass wall, where the man in the light gray suit crosses his arms, and the other leans forward, possibly in deep conversation. Meanwhile, a man in a long coat and a woman in a red jacket walk through the lobby, with the man gesturing as if explaining something.\n\nThe narrative unfolds with the man and the woman continuing their walk through the lobby, the man gesticulating and the woman listening intently. It transitions to the man in the dark suit with the yellow tie on a staircase. The dialogue overlaps seamlessly, \"So no one would walk through but you'd have to buy out the entire cabin and the first class flight attendant. I bought the airline,\" enhancing the discussion about premium flight arrangements and exclusivity.\n\nThe video shows the two men standing in front of the glass wall again, continuing their nuanced conversation. A sense of continuity and urgency permeates these settings. The man in the dark suit with the yellow tie appears in a modern lobby area under a sign that reads 'CONCIERGE,' glancing around. The scene transitions one last time to the two men at the glass wall, joined now by a woman in a red jacket.\n\nThe audio then punctuates the climax with, \"It seemed neater. It looks like we have our ten hours,\" as the background music enfolds the closing scenes, creating a resolution to the intertwined narrative of high-stakes planning and luxurious travel logistics. The story concludes, weaving together scenes of careful orchestration and privileged settings, all in an advanced, modern environment."
+ }
+ ],
+ "events": [
+ "Two men stand in front of a glass wall.",
+ "One man in a light gray suit has his back to the camera.",
+ "The other man in a dark suit faces the camera.",
+ "Audio begins with a man speaking about flight duration and a frequent traveler.",
+ "Background music sets a professional and urgent tone.",
+ "Scene shifts to a man in a dark suit standing on a staircase.",
+ "He looks down into a modern, well-lit lobby with several lamps and comfortable seating.",
+ "Video returns to two men in front of the glass wall.",
+ "The man in the light gray suit turns slightly.",
+ "Scene switches back to the man on the staircase as he begins to walk.",
+ "A black luxury car is parked near an airplane.",
+ "A pilot and a man in a red uniform are standing nearby.",
+ "Emphasis on theme of travel and luxury.",
+ "Scene transitions back to two men by the glass wall.",
+ "Scene shifts to an airport setting with a black luxury jet 'JC-HV'.",
+ "A man in a red uniform works on the jet's engine.",
+ "Another man in a dark suit stands nearby, inspecting the aircraft.",
+ "Voice-over discusses technical concerns and logistics of private flying.",
+ "A man in a dark suit converses with a pilot in a black uniform and cap.",
+ "The man in the dark suit appears thoughtful, looking around the modern lobby.",
+ "Scene transitions to a man in a dark suit with a yellow tie.",
+ "Video circles back to the two men in front of the glass wall.",
+ "The man in the light gray suit crosses his arms.",
+ "The other man leans forward, possibly in deep conversation.",
+ "A man in a long coat and a woman in a red jacket walk through the lobby.",
+ "The man is gesticulating as if explaining something to the woman.",
+ "The man and the woman continue their walk through the lobby.",
+ "The woman listens intently to the man.",
+ "Scene transitions to the man in a dark suit with a yellow tie on a staircase.",
+ "Voice-over discusses exclusivity of buying out the entire cabin and flight attendants.",
+ "Voice-over punctuates with 'I bought the airline'.",
+ "Two men stand in front of the glass wall again, continuing their conversation.",
+ "Sense of continuity and urgency in the settings.",
+ "The man in the dark suit with a yellow tie appears in a modern lobby under a 'CONCIERGE' sign.",
+ "He glances around.",
+ "Scene transitions back to the two men at the glass wall, now joined by a woman in a red jacket.",
+ "Audio punctuates with 'It seemed neater. Looks like we have our ten hours'.",
+ "Background music creates a resolution to the narrative."
+ ]
+ },
+ {
+ "video": "video/54.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "This video features one man playing two roles at once. The video begins with a man in a green shirt standing in a doorway, looking concerned; this character is the son. The text \"When You Lose Your Mom's Money\" appears at the top of the screen. The scene transitions to another man sitting on a bed, wearing a pink shawl and a black jacket on his head, who asks \"For what?\" in yellow text; this role is that of the mother. The exchange is underscored by audio where a man says, \"Mummy, can I have $5?\" and is answered by another voice asking, \"For what?\" The scene then cuts back to the man in the green shirt, who responds both visually and audibly with, \"For chewing gum.\" The scene returns to the man on the bed, who says, \"Ok, you can get $5 from my purse. It's on top of the fridge,\" which is visually reinforced by yellow text. The man in the green shirt then replies visually and audibly with \"Ok\" and leaves the doorway as a man's voice says, \"Hi, Uncle.\"\n\nNext, standing outside in a parking lot at night, the person in the green shirt is shown holding some money, with the text \"Hi uncle, can I get 1 pack of chewing gum?\" appearing at the bottom of the screen in yellow. In the audio, the person asks, \"Can I get one pack of Chingum?\" The scene then transitions to the same man now holding a green container of chewing gum along with the money, thanking the uncle and stating, \"I have to give this to mummy,\" both in audio and in yellow text at the bottom. The man then says, \"Ok, let me put this in my pocket, so I don't lose it,\" visually represented as well, as he prepares to put the money and gum in his pocket with traffic sounds in the background.\n\nContinuing, the man stands outside in the parking lot at night, holding the green container of chewing gum and some money. He expresses the same concern about not losing it. The background shows a building with lit windows and a parked car. The scene shifts indoors, where the man announces, \"Mom, I got my chewing gum!\" according to text in yellow at the bottom. \n\nInside a room, still clutching the green container of chewing gum, the man in the green shirt enthusiastically declares, \"Mom, I got my chewing gum!\" as mirrored by yellow text. The scene transitions to the man in the pink shawl and black jacket, sitting on a bed, who replies with \"Are you happy now? And rest of the money, put it on my table, ok?\" in yellow text, mirrored in the audio. The man in the green shirt responds with \"Ok,\" also in yellow text, and starts to place the remaining money in his pocket. The audio captures this exchange, accompanied by music and the sound of a coin dropping.\n\nNext, the man in the green shirt stands in the room, looking down with a concerned expression. The text \"What happened beta?\" appears in yellow at the bottom. The man in the green shirt responds both audibly and visually, \"I put the money in my left pocket, but it's not there,\" pointing to his left pocket. The scene shifts to the man in the pink shawl, seated on the bed, narratively and textually insisting, \"Beta stop lying to me, I know you spent it on some kurkure or something.\" The audio reflects this tension between the characters as they exchange their lines.\n\nThe scene transitions to the green-shirted individual searching his pockets with worry etched on his face, emphasized by the text, \"Did you lose the money again?\" The dialogue follows in the audio. The man in the pink shawl sternly reiterates, \"Did you lose the money again?\" in both text and audio. Surprised and worried, the man in the green shirt looks up, with the pink-shawled man finally saying \"Wait\" while gesturing with his hand. 
The audio includes the dramatic final sounds of a door slamming and a man screaming, mirroring the heightened tension and urgency conveyed visually as the video concludes with the person in the green shirt looking surprised."
+ }
+ ],
+ "events": [
+ "Man in a green shirt (son) stands in a doorway looking concerned",
+ "Text 'When You Lose Your Mom's Money' appears on the screen",
+ "Scene transitions to man in a pink shawl and black jacket (mother) sitting on a bed",
+ "Mother asks 'For what?' in yellow text",
+ "Audio plays: 'Mummy, can I have $5?'",
+ "Another voice responds in audio: 'For what?'",
+ "Scene cuts back to man in a green shirt (son) responding: 'For chewing gum'",
+ "Scene transitions back to mother on the bed saying: 'Ok, you can get $5 from my purse. It's on top of the fridge.' in yellow text",
+ "Man in green shirt (son) replies 'Ok' and leaves the doorway as a voice says 'Hi, Uncle'",
+ "Outside, in a parking lot at night, man in green shirt holding some money with text 'Hi uncle, can I get 1 pack of chewing gum?' in yellow",
+ "Audio: 'Can I get one pack of Chingum?'",
+ "Scene transitions to man holding green container of chewing gum and money, thanking uncle and stating 'I have to give this to mummy' in audio and text",
+ "Man says 'Ok, let me put this in my pocket, so I don't lose it', preparing to put the money and gum in his pocket, with traffic sounds in the background",
+ "Man stands outside at night holding green container of chewing gum and money, expressing concern about not losing it",
+ "Background shows a building with lit windows and a parked car",
+ "Scene shifts indoors where man announces 'Mom, I got my chewing gum!' in yellow text",
+ "Inside a room, man in green shirt declares 'Mom, I got my chewing gum!' with mirrored yellow text",
+ "Scene transitions to mother on bed replying 'Are you happy now? And rest of the money, put it on my table, ok?' in yellow text",
+ "Man in green shirt responds with 'Ok' in yellow text and starts placing money in pocket, audio captures exchange with music and sound of coin dropping",
+ "Man in green shirt stands in room looking down concerned, text 'What happened beta?' appears in yellow",
+ "Man responds 'I put the money in my left pocket, but it's not there,' pointing to his left pocket",
+ "Scene shifts to mother on the bed insisting 'Beta stop lying to me, I know you spent it on some kurkure or something' both narratively and textually",
+ "Green-shirted individual searching pockets worried with text 'Did you lose the money again?'",
+ "Dialogue follows in audio, mother reiterates 'Did you lose the money again?' in both text and audio",
+ "Man in green shirt looks up surprised and worried",
+ "Mother says 'Wait', gesturing with hand",
+ "Audio includes final sounds of door slamming and man screaming",
+ "Video concludes with person in green shirt looking surprised"
+ ]
+ },
+ {
+ "video": "video/55.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man in a black shirt speaking animatedly, using hand gestures for emphasis. Bold white text on a blue background at the bottom of the screen reads, \"EVERY SINGER STRUGGLES WITH IMITATION.\" The background features a white door with pink and blue gradient lighting. The man explains that while every singer struggles with imitating their favorite artist, doing so excessively might lead to losing one's unique sound. The screen continues with the same man, the text now displaying, \"YOU MAY LOSE YOUR UNIQUE SOUND!\", as he gestures to emphasize his warning.\n\nIn the background audio, a man speaks these words. As the man holds an acoustic guitar and prepares to demonstrate musically, a woman's voice starts singing, slightly fading in relevance compared to the spoken narrative.\n\nThe scene then transitions to the same man, now seated with the acoustic guitar, again wearing a black shirt and framed by the same background lighting. The text now reads, \"HOW DO YOU FIND YOUR NATURAL VOICE?\" The man starts to speak, possibly explaining or demonstrating musical techniques. As he continues his explanation, the text changes to \"STEP 1: SPEAK THE WORDS\". The man's gestures accompany his instructions, enhancing his points.\n\nThe narrative continues with the on-screen text still reading \"STEP 1: SPEAK THE WORDS\", and the man remains animated and engaging. The text suddenly shifts to \"I LOOK AT YOU ALL, SEE THE LOVE THERE THAT'S SLEEPING\". His gestures accentuate the verbal instructions. The on-screen text changes again to \"STEP 2: SPEAK THE WORDS ON PITCH\". The speaker transitions seamlessly into the next step of the demonstration.\n\nThe audio here reveals the speaker citing the lyrics, \"I look at you all, see the love there that's sleeping. Step 2, speak the words on pitch.\" The voice alternates between speaking and singing the phrases.\n\nMaintaining the same engaging tone, the man persists with his explanation while holding the acoustic guitar. New on-screen text, \"DON'T SING YET!\", surfaces at the bottom. The man advises through hand gestures and vocal emphasis not to sing just yet, but rather to think of it as doing one's best robot impression.\n\nConcurrently, the voices in the audio mirror this advice. A male voice sings, followed by a recommendation to refrain from singing and to think robotically.\n\nThe scene focuses on the speaker, who now elaborates on the third step: \"STEP 3: MAKE IT SMOOTH!\" is presented in bold at the bottom. The man emphasizes this third step by holding up three fingers and elaborating on making the transition smooth to dissolve strain and tension.\n\nThe audio complements this with the advice being clearly stated by a male voice, pausing slightly and then reaffirming the transition to smooth singing to alleviate strain.\n\nThe final scene continues with the same man, who reiterates the earlier steps and gestures passionately to highlight his points. As he picks up the guitar, the visual narrative aligns perfectly with his intent to demonstrate musically once more. The consistent background ties the entire narrative together.\n\nThe auditory narrative closes, transitioning into music playing and a man singing, which underscores the topic of finding one's natural voice in singing."
+ }
+ ],
+ "events": [
+ "Man in black shirt speaking animatedly using hand gestures.",
+ "Bold white text on blue background: 'EVERY SINGER STRUGGLES WITH IMITATION.'",
+ "Background: white door with pink and blue gradient lighting.",
+ "Man explains singers struggle with imitating their favorite artist.",
+ "Text changes to: 'YOU MAY LOSE YOUR UNIQUE SOUND!'",
+ "Man continues gesturing as he emphasizes his warning.",
+ "Background audio: man speaking.",
+ "Man holding acoustic guitar prepares to demonstrate musically.",
+ "Woman's voice starts singing in the background.",
+ "Scene transitions to the same man seated with the acoustic guitar.",
+ "Text reads: 'HOW DO YOU FIND YOUR NATURAL VOICE?'",
+ "Man starts to speak, explaining or demonstrating musical techniques.",
+ "Text changes to: 'STEP 1: SPEAK THE WORDS'.",
+ "Man's gestures accompany his instructions.",
+ "Text reads: 'STEP 1: SPEAK THE WORDS'.",
+ "Man remains animated and engaging.",
+ "Text shifts to: 'I LOOK AT YOU ALL, SEE THE LOVE THERE THAT'S SLEEPING.'",
+ "Man's gestures accentuate verbal instructions.",
+ "Text changes to: 'STEP 2: SPEAK THE WORDS ON PITCH'.",
+ "Speaker transitions to the next step of the demonstration.",
+ "Audio: speaker cites lyrics 'I look at you all, see the love there that's sleeping. Step 2, speak the words on pitch.'",
+ "Voice alternates between speaking and singing phrases.",
+ "Man maintains engaging tone holding the acoustic guitar.",
+ "Text on-screen: 'DON'T SING YET!'",
+ "Man advises not to sing yet but to think of it as doing one's best robot impression.",
+ "Audio mirrors this advice: male voice sings, followed by a recommendation to refrain from singing and to think robotically.",
+ "Speaker elaborates on third step: 'STEP 3: MAKE IT SMOOTH!' presented in bold.",
+ "Man emphasizes third step by holding up three fingers and elaborating on making transition smooth to dissolve strain and tension.",
+ "Audio: advice stated by male voice, pausing and reaffirming smooth singing transition to alleviate strain.",
+ "Final scene: man reiterates earlier steps with passionate gestures.",
+ "Man picks up guitar to demonstrate musically.",
+ "Consistent background ties narrative together.",
+ "Auditory narrative transitions into music and man singing.",
+ "Topic underscores finding one's natural voice in singing."
+ ]
+ },
+ {
+ "video": "video/56.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a tense baseball game scene. The pitcher, dressed in a gray uniform, prepares to throw a pitch while the batter, in white, stands ready. The backdrop includes a 'True Value' advertisement. The pitcher winds up, throws the ball, and the batter swings, hitting the ball high into the sky. The camera shifts to the outfield, where a player in a gray uniform is sprinting toward the ball. He makes a dramatic diving catch, securing the ball while on the ground and holding it up to show his successful catch. Over this action, a voice narrates, \"The next time he faced Matt. Here's a fly ball center field long run for Jim Edmonds,\" as the crowd erupts in cheers, and the narrator exclaims in excitement, \"Unbelievable!\"\n\nThe scene transitions to the outfielder lying near the outfield wall, holding the ball in his glove. He lifts his glove to confirm the catch, then sits up, still clasping the ball. Another gray-uniformed player, wearing number 15, approaches and helps him to his feet. They share a handshake and a few words before walking away, with the outfielder adjusting his cap. The sound of the crowd's cheers and applause enhances the moment, and a voice comments, \"Jerry, that may be the best catch I have ever seen.\"\n\nThe video then shows a player in a gray uniform with the number 25 running across the field, celebrating his catch by patting his chest and looking around. Nearby, a player in a white uniform with the number 6 stands with a disappointed expression. The celebrating player is joined by another in gray, number 5, and they exchange high-fives. The scene ends with a wide shot of the field, showing the dejected player in white walking away. 
Meanwhile, a voice states, \"And the Royals fans appreciated Steve as much as we do and all the Angel fans who are watching,\" amidst the backdrop of the cheering crowd.\n\nNext, the video displays the player in the gray number 25 uniform running toward the outfield wall, eyes fixed on the ball. He makes a remarkable leap, securing the ball mid-air and rolling on the ground upon landing. He lies there briefly before triumphantly lifting his glove to show the ball. The crowd's cheer swells as a voice on the loudspeaker describes, \"Edmonds playing in because Howard doesn't have a great deal of power. Look how many steps he has to make before he's able to dive for that ball and pull it in.\"\n\nFinally, the video revisits the player in the gray number 25 uniform tracking another high fly ball near the outfield wall. Performing another sensational airborne catch, he lands and rolls again but holds onto the ball. He raises his glove to the audible delight of the crowd. The narrator exclaims, \"What a play by Jim Edmonds. He is absolutely phenomenal in center field,\" as background music plays and the final frame displays the text 'MLB.com \u00a9 2011 MLB Advanced Media'."
+ }
+ ],
+ "events": [
+ "The video begins with a tense baseball game scene.",
+ "The pitcher, dressed in a gray uniform, prepares to throw a pitch.",
+ "The batter, in white, stands ready.",
+ "The backdrop includes a 'True Value' advertisement.",
+ "The pitcher winds up and throws the ball.",
+ "The batter swings and hits the ball high into the sky.",
+ "The camera shifts to the outfield.",
+ "An outfielder in a gray uniform sprints toward the ball.",
+ "The outfielder makes a dramatic diving catch.",
+ "The outfielder secures the ball while on the ground and holds it up to show the catch.",
+ "Voice narrates, 'The next time he faced Matt. Here's a fly ball center field long run for Jim Edmonds.'",
+ "The crowd erupts in cheers.",
+ "The narrator exclaims in excitement, 'Unbelievable!'",
+ "The scene transitions to the outfielder lying near the outfield wall.",
+ "The outfielder holds the ball in his glove.",
+ "The outfielder lifts his glove to confirm the catch.",
+ "The outfielder sits up, still holding the ball.",
+ "Another gray-uniformed player, number 15, approaches.",
+ "Number 15 helps the outfielder to his feet.",
+ "They share a handshake and exchange a few words.",
+ "They walk away, with the outfielder adjusting his cap.",
+ "The crowd's cheers and applause enhance the moment.",
+ "A voice comments, 'Jerry, that may be the best catch I have ever seen.'",
+ "A player in a gray uniform with number 25 runs across the field celebrating his catch.",
+ "Number 25 pats his chest and looks around.",
+ "A player in a white uniform with number 6 stands with a disappointed expression.",
+ "The celebrating player is joined by another in gray, number 5.",
+ "Number 25 and number 5 exchange high-fives.",
+ "The scene ends with a wide shot of the field.",
+ "The dejected player in white walks away.",
+ "A voice states, 'And the Royals fans appreciated Steve as much as we do and all the Angel fans who are watching.'",
+ "The crowd continues to cheer.",
+ "The player in the gray number 25 uniform runs toward the outfield wall.",
+ "Number 25 fixes his eyes on the ball.",
+ "Number 25 makes a remarkable leap.",
+ "Number 25 catches the ball mid-air and rolls on the ground upon landing.",
+ "Number 25 lies on the ground briefly.",
+ "Number 25 triumphantly lifts his glove to show the ball.",
+ "The crowd's cheer swells.",
+ "A voice on the loudspeaker describes, 'Edmonds playing in because Howard doesn't have a great deal of power. Look how many steps he has to make before he's able to dive for that ball and pull it in.'",
+ "The player in the gray number 25 uniform tracks another high fly ball near the outfield wall.",
+ "Number 25 performs another sensational airborne catch.",
+ "Number 25 lands and rolls again but holds onto the ball.",
+ "Number 25 raises his glove to the audible delight of the crowd.",
+ "The narrator exclaims, 'What a play by Jim Edmonds. He is absolutely phenomenal in center field.'",
+ "Background music plays.",
+ "The final frame displays the text 'MLB.com \u00a9 2011 MLB Advanced Media.'"
+ ]
+ },
+ {
+ "video": "video/57.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video, sparing no detail. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with an outdoor scene featuring a rusty water pump partially buried in snow, situated near a green building. Meanwhile, the audio reveals a man explaining, \"When I was at this property dealing with the well outside, they were also telling us that they had low water pressure. Go inside, check the water filter, and it's going to need a little TLC.\" As the scene transitions indoors, a close-up of a man's hand adjusting a blue valve is shown, followed by the man pointing to a rusty valve connected to a pipe with a pressure gauge nearby. The audio underscores this with background sounds. The man then turns on a faucet, and water begins to flow. The final part of the clip shows the man working on a water filtration system, using a wrench to adjust or replace a component.\n\nThe narrative develops as the man, now seen wearing a maroon shirt and brown overalls, continues working on the water filtration system. The audio transitions to the man saying, \"And of course, the filter wrench that I brought did not fit, so we've got to adapt to the situation. I would say this filter is definitely past its life expectancy. Scrub it out.\" The man removes the filter housing, revealing a dirty filter filled with sediment and debris. A man speaks, and the scene is punctuated by a scraping noise as the man inspects and prepares to clean or replace the filter component.\n\nThe focus intensifies on maintaining the water filtration system, showing the man rinsing the dirty, sediment-filled filter housing under running water. The dirt and debris are washed away, and the audio narration continues with the man's voice, \"Clean it out and keep flushing it till we get clear water flowing through. Once we finally got it cleaned out, we're going to add in a new O-ring. 
I didn't know exactly what size O-ring to grab.\" The cleaned filter housing is then carefully reassembled back onto the system, ensuring a secure fit.\n\nWith attention to detail, another scene displays the man holding the filter housing coated with sediment and debris, inspecting it closely. The man then takes out a new gasket from a plastic bag after comparing debris on a white piece of paper with black specks. The man's voice says, \"I just grabbed a multi-pack and luckily enough, it had the correct one in it. Once we verified the O-ring fits, we're going to take it out, use the grease that came with the new package of O-rings, and apply it on the face and on the O-ring.\" The narrative continues as the new gasket is fitted, and the filter housing is reassembled onto the water filtration system.\n\nNext, the theme of system maintenance proceeds with a close-up of the man reassembling and securing the filter housing. Using a wrench and a red tool, the connections are tightened to prevent leaks. The man's voice instructs, \"Once again, use the strap to tighten it up, and then now we just need to flush all of the air out of the system if you have a whole house filter like this.\" This sequence transitions into a close-up of a working faucet, demonstrating the operational system with water smoothly flowing, complemented by the sound of a strap being tightened.\n\nThe video concludes with a close-up of the water flowing from a faucet, transitioning to a pressure gauge reading around 60 PSI, with water droplets indicating recent use. Finally, a man holds a black hose and inspects it in a cluttered workspace. The audio says, \"I highly recommend putting a reminder on your phone and changing it periodically. 
So done with this and on to the next project,\" before the scene closes with ambient music, signifying the completion of one task and the transition to another.\n\nThis vivid narrative ties together a cohesive story of water system inspection, maintenance, and repair, synchronized with the relevant audio commentary, providing a comprehensive understanding of the entire process."
+ }
+ ],
+ "events": [
+ "Outdoor scene with a rusty water pump partially buried in snow near a green building.",
+ "Man explains in audio about low water pressure and checking the water filter needing TLC.",
+ "Scene transitions indoors with close-up of man's hand adjusting a blue valve.",
+ "Man points to a rusty valve connected to a pipe with a pressure gauge nearby.",
+ "Background sounds underscore the actions.",
+ "Man turns on a faucet and water begins to flow.",
+ "Man works on a water filtration system using a wrench to adjust or replace a component.",
+ "Man is now seen wearing a maroon shirt and brown overalls while working on the water filtration system.",
+ "Man explains in audio that the filter wrench did not fit and they needed to adapt to the situation.",
+ "Man inspects and prepares to clean or replace the filter component with scraping noise in the background.",
+ "Man removes filter housing revealing a dirty filter filled with sediment and debris.",
+ "Man rinses the dirty filter housing under running water, washing away dirt and debris.",
+ "Man narrates cleaning out the filter and flushing it until clear water flows.",
+ "Man mentions adding a new O-ring but was unsure about its size.",
+ "Cleaned filter housing reassembled back onto the system ensuring a secure fit.",
+ "Man holds and inspects the sediment-coated filter housing closely.",
+ "Man takes out a new gasket from a plastic bag after comparing debris on a white piece of paper.",
+ "Man mentions grabbing a multi-pack of O-rings and finding the correct one.",
+ "Man applies grease on the face and O-ring before fitting the new gasket.",
+ "New gasket is fitted, and filter housing is reassembled onto the water filtration system.",
+ "Close-up of the man reassembling and securing the filter housing.",
+ "Man tightens connections using a wrench and red tool to prevent leaks.",
+ "Man instructs to use a strap to tighten it up and flush the air out of the system.",
+ "Close-up of the working faucet with water flowing smoothly.",
+ "Sound of strap being tightened accompanies the sequence.",
+ "Video concludes with a close-up of water flowing from a faucet.",
+ "Transition to a pressure gauge reading around 60 PSI with water droplets indicating recent use.",
+ "Man holds a black hose and inspects it in a cluttered workspace.",
+ "Man recommends putting a reminder to periodically change the filter in the audio.",
+ "Scene closes with ambient music signifying completion of the task and transition to another project."
+ ]
+ },
+ {
+ "video": "video/58.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a Hot Wheels toy car package held by a hand. The package features a blue 2006 BMW M3 model car. The packaging is predominantly red with the Hot Wheels logo at the top, accompanied by flames and the text 'Hottest metal cars in the world.' The toy car inside the package is a miniature replica of the blue BMW M3 shown on the packaging. The background consists of a surface covered with small rocks. The hand adjusts the position of the package, showing different angles of the toy car. As the hand moves the package out of the frame, it reveals a yellow toy car held in the same hand. The focus shifts to the yellow toy car against the rocky background.\n\nAs the audio plays, a man says, \"For those of you jumping on Mattel Creations to get the next RLC release of the 2006 BMW M3 from Hot Wheels, you might remember the first release had a bit of a hood issue.\"\n\nThe video then transitions to a close-up of the hand holding the yellow toy car, similar to the one shown at the end of the previous clip. The toy car is a miniature BMW with detailed features like the front grille and headlights. The rocky surface background is consistent with the previous clip. The hand adjusts the position of the toy car, showing different angles, and then places it on the rocky surface. Another hand appears holding the Hot Wheels package with the blue 2006 BMW M3 model car, as seen before. The hand begins to open the package, focusing on the blue toy car inside.\n\nIn the audio, another man adds, \"At least some of them did. So you know what? Just so you can see and I can see, let's just have ourselves a little moment of truth.\"\n\nNext, the video shows a close-up of a hand holding the now-opened Hot Wheels package, revealing the blue 2006 BMW M3 model car inside. The rocky background remains consistent. The hand carefully removes the blue toy car, showing a detailed view of the car's front with the hood open, displaying the engine. 
The hand then closes the hood, offering a clear view of the car's front grille and headlights. Holding both the blue and yellow toy cars side by side, the hand highlights their similarities and differences. The clip ends with a focus on the front view of both cars against the rocky surface.\n\nA clicking sound is heard in the audio, followed by a man saying, \"Ready? Well, it's better, but still not perfect.\"\n\nFinally, the video shows a close-up of the two toy cars, the blue BMW M3 and the yellow BMW M3, placed side by side on the black surface. The front view of both cars is emphasized, showcasing their detailed features like front grilles, headlights, and hoods. The steady view allows for a clear comparison of their colors and designs. The lighting highlights the shiny surfaces of the toy cars, contrasting vividly against the black background. The video concludes with the same angle and focus on the front view of both cars.\n\nIn the audio, the man says, \"I'd say that's better.\""
+ }
+ ],
+ "events": [
+ "Close-up of a Hot Wheels toy car package held by a hand",
+ "Package features a blue 2006 BMW M3 model car",
+ "Packaging is predominantly red with the Hot Wheels logo at the top",
+ "Packaging includes flames and the text 'Hottest metal cars in the world.'",
+ "Toy car inside the package is a miniature replica of the blue BMW M3",
+ "Background consists of a surface covered with small rocks",
+ "Hand adjusts the position of the package, showing different angles of the toy car",
+ "Hand moves the package out of the frame",
+ "Reveals a yellow toy car held in the same hand",
+ "Focus shifts to the yellow toy car against the rocky background",
+ "Voice of a man says, 'For those of you jumping on Mattel Creations to get the next RLC release of the 2006 BMW M3 from Hot Wheels, you might remember the first release had a bit of a hood issue.'",
+ "Transition to a close-up of the hand holding the yellow toy car",
+ "Toy car is a miniature BMW with detailed features like the front grille and headlights",
+ "Rocky surface background remains consistent",
+ "Hand adjusts the position of the toy car, showing different angles",
+ "Hand places the toy car on the rocky surface",
+ "Another hand appears holding the Hot Wheels package with the blue 2006 BMW M3 model car",
+ "Hand begins to open the package, focusing on the blue toy car inside",
+ "Voice of another man adds, 'At least some of them did. So you know what? Just so you can see and I can see, let's just have ourselves a little moment of truth.'",
+ "Close-up of a hand holding the now-opened Hot Wheels package",
+ "Background remains rocky",
+ "Hand carefully removes the blue toy car",
+ "Detailed view of the car's front with the hood open, displaying the engine",
+ "Hand closes the hood, offering a clear view of the car's front grille and headlights",
+ "Hand holds both the blue and yellow toy cars side by side",
+ "Hand highlights the similarities and differences between the blue and yellow toy cars",
+ "Clip ends with a focus on the front view of both cars against the rocky surface",
+ "Clicking sound is heard in the audio",
+ "Voice of a man says, 'Ready? Well, it's better, but still not perfect.'",
+ "Close-up of the two toy cars, the blue BMW M3 and the yellow BMW M3, placed side by side on the black surface",
+ "Emphasis on the front view of both cars, showcasing their detailed features like front grilles, headlights, and hoods",
+ "Steady view allows for a clear comparison of their colors and designs",
+ "Lighting highlights the shiny surfaces of the toy cars, contrasting vividly against the black background",
+ "Video concludes with the same angle and focus on the front view of both cars",
+ "Voice of a man says, 'I'd say that's better.'"
+ ]
+ },
+ {
+ "video": "video/59.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a white fluffy dog named Fluffy, a Maltese, being groomed on a grooming table. The dog is wearing a blue muzzle and is secured with a purple grooming loop attached to an overhead bar. The groomer, dressed in a black shirt, holds Fluffy's paw and uses a nail grinder to trim the dog's nails. In the background, a small shelf with grooming supplies and a sign that reads 'Like the video' can be seen. Fluffy appears slightly anxious, moving his head and body during the grooming process. The groomer gently continues the nail trimming as the audio reveals the groomer's playful banter: \"This dog and I have a very love-bite relationship. As some of you might know, this is Fluffy the Maltese, and sometimes I just gotta stare in his eyes, convince him not to bite me.\" Following this, the groomer continues to speak, creating a light-hearted interaction. The clip cuts to Fluffy sitting calmly on the grooming table, looking towards the camera.\n\nThe scene transitions, showing Fluffy still on the grooming table. Text appears on the screen stating 'we debating' and then changes to 'he so cute though.' The groomer continues to trim Fluffy's nails while holding his paw. In the background, the shelf with grooming supplies remains visible. The audio plays with the groomer saying: \"And it actually works like 85% of the time. And Fluffy really out here levitating. Okay, Duelipa.\" Next, the scene changes to Fluffy in a white bathtub, being washed with a handheld showerhead. Fluffy appears wet and slightly anxious, standing on a non-slip mat as the groomer gently sprays his fur. Then Fluffy looks directly at the camera with a front leg lifted, covered in soap suds, standing in the bathtub.\n\nFluffy returns to the grooming table, wearing the blue muzzle and secured with the purple grooming loop. The groomer uses a red clipper to trim Fluffy's fur. Fluffy looks more relaxed, standing calmly and occasionally panting as he looks around. 
The background includes the small shelf with grooming supplies and the sign 'Like the video.' The audio follows with the groomer commenting on Fluffy's behavior: \"And look at him like debate biting me and then play it off as a yawn. It was very slick of you, Fluffy, but I'm glad you decided not to bite me.\" The groomer continues to trim Fluffy's fur around his legs and face, with the dog remaining calm throughout.\n\nThe grooming continues with Fluffy sitting calmly on the table while the groomer trims the fur around his face with different scissors. Fluffy looks upwards, relaxed and cooperative. The background remains the same. The groomer meticulously ensures that Fluffy's fur is trimmed neatly. The audio description captures the groomer's continuing dialogue mixed with background music: \"It's really only his front feet that he doesn't like me scissoring. And besides the nails, obviously, he's pretty good for everything else. At the end of every dog's groom, I like to give them an option of what accessories they would like.\"\n\nThe groomer adjusts Fluffy's fur and selects a blue bow tie from a jar filled with colorful bows and accessories. Apparently deciding it is not a good match, the groomer replaces it with a green one. Initially, Fluffy's movement makes it difficult to place the bow tie, but the groomer eventually succeeds with the green bow tie. Fluffy sits calmly on the grooming table, now adorned with the green bow tie, and looks towards the camera. The humorous and engaging audio continues: \"Fluffy said the blue one didn't match his outfit, and I mean of course, I'm an idiot. And I guess this green one just makes his harness pop.\"\n\nWith the green bow tie in place, Fluffy sits contentedly on the grooming table. The groomer, dressed in the same black shirt, picks up Fluffy, holding him close. Text appears on screen reading 'digs nails into my neck,' showing the playful and affectionate bond between the groomer and Fluffy. 
The groomer continues to comfort and pet Fluffy, creating a tender moment. The audio concludes with the groomer's fond remarks as music plays softly in the background: \"I know you guys probably see those hairs on the right side of his ear. Little old me took them a little short last time, and so now they need to grow back, but Fluffy is ready to go home. Bye Fluffy.\" \"Bye Fluffy\" is said in a high-pitched, spoof tone. The video concludes, highlighting the affectionate and professional relationship between the groomer and Fluffy."
+ }
+ ],
+ "events": [
+ "The video begins with a white fluffy dog named Fluffy, a Maltese, being groomed on a grooming table.",
+ "Fluffy is wearing a blue muzzle and is secured with a purple grooming loop attached to an overhead bar.",
+ "The groomer, dressed in a black shirt, holds Fluffy's paw and uses a nail grinder to trim the dog's nails.",
+ "In the background, a small shelf with grooming supplies and a sign that reads 'Like the video' can be seen.",
+ "Fluffy appears slightly anxious, moving his head and body during the grooming process.",
+ "The groomer gently continues the nail trimming while engaging in playful banter about their 'love-bite relationship' with Fluffy.",
+ "The groomer mentions to the audience about sometimes having to convince Fluffy not to bite.",
+ "The clip cuts to Fluffy sitting calmly on the grooming table, looking towards the camera.",
+ "The scene transitions, showing Fluffy still on the grooming table.",
+ "Text appears on the screen stating 'we debating' and then changes to 'he so cute though.'",
+ "The groomer continues to trim Fluffy's nails while holding his paw.",
+ "The groomer comments on how convincing Fluffy works like 85% of the time and humorously remarks on Fluffy 'levitating' like 'Duelipa'.",
+ "The scene changes to Fluffy in a white bathtub, being washed with a handheld showerhead.",
+ "Fluffy appears wet and slightly anxious, standing on a non-slip mat as the groomer gently sprays his fur.",
+ "Fluffy looks directly at the camera, lifting his front leg and covered in soap suds, while standing in the bathtub.",
+ "Fluffy returns to the grooming table, wearing the blue muzzle and secured with the purple grooming loop.",
+ "The groomer uses a red clipper to trim Fluffy's fur.",
+ "Fluffy looks more relaxed, standing calmly and occasionally panting as he looks around.",
+ "The background includes the small shelf with grooming supplies and the sign 'Like the video.'",
+ "The groomer comments on Fluffy debating about biting and then playing it off as a yawn, expressing relief that Fluffy didn't bite.",
+ "The groomer continues to trim Fluffy's fur around his legs and face while Fluffy remains calm.",
+ "Fluffy is sitting calmly on the table while the groomer trims the fur around his face with different scissors.",
+ "Fluffy looks upwards, relaxed, and cooperative.",
+ "The background remains the same with grooming supplies visible.",
+ "The groomer ensures that Fluffy's fur is trimmed neatly.",
+ "The audio captures the groomer's dialogue mixed with background music, mentioning Fluffy's front feet sensitivity and preference for accessories.",
+ "The groomer selects a blue bow tie from a jar filled with colorful bows and accessories.",
+ "The groomer decides the blue bow tie is not appropriate and replaces it with a green one.",
+ "Fluffy's movement makes it difficult to place the bow tie, but the groomer eventually succeeds in placing the green bow tie.",
+ "Fluffy sits calmly on the grooming table with the green bow tie and looks towards the camera.",
+ "The groomer humorously remarks about the choice of the bow tie matching Fluffy's outfit.",
+ "Fluffy sits contentedly on the grooming table with the green bow tie.",
+ "The groomer, dressed in the same black shirt, picks up Fluffy, holding him close.",
+ "Text appears on the screen reading 'digs nails into my neck,' showing the playful bond between the groomer and Fluffy.",
+ "The groomer comforts and pets Fluffy, creating a tender moment.",
+ "The groomer remarks about the hairs on the right side of Fluffy's ear needing to grow back.",
+ "The groomer announces that Fluffy is ready to go home.",
+ "The groomer says 'Bye Fluffy' in a high-pitched, spoof tone.",
+ "The video ends, showcasing the affectionate and professional relationship between the groomer and Fluffy."
+ ]
+ },
+ {
+ "video": "video/60.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video showcases a sequence of clips featuring a man wearing a red and black checkered shirt, sitting at a desk with multiple screens displaying blue and purple abstract patterns in the background.\n\nThe man wearing a red and black checkered shirt sits at a desk, surrounded by multiple screens displaying blue and purple abstract patterns. He holds a smartphone and appear to be explaining something. The man uses hand gestures and points to the phone screen to emphasize their points. He says, \"Did you know that Google can tell you the name of a song just by listening to you sing it, hum it, or even whistle it? So let's start with singing.\". He then addresses Google Assistant with, \"Hey Google, what's this song?\", as he holds the phone up. The audio includes a sound effect followed by the man speaking and then a song starts playing in the background.\n\nThe scene transitions to a close-up of a hand holding a smartphone. The screen displays the Google Assistant interface with the text 'Play, sing, or hum a song' at the top, a visual representation of sound waves, and the text 'Listening...' below it, with options for 'Search' and 'What\u2019s this song?'. The background is blurred but appears to show an indoor dark setting. The text on the phone screen updates to 'Still working\u2026' as the Google Assistant processes the input.\n\nThe scene transitions back to the man in the checkered shirt holding the phone and looking at it intently. The man reacts with excitement, and the text 'YES!' appears in large, bold letters. He then shows the phone screen, which displays 'Never Gonna Give You Up' by Rick Astley, with 'GOT IT!' in bold letters. The audio includes the man singing \"I just wanna tell you how I'm feeling. Gotta make you understand\", and then exclaiming \"Yes! Got it! And everyone told me I couldn't sing.\".\n\nBack to the man in the red and black checkered shirt at the desk, emphasizing his points with hand gestures. 
The text 'Mrwhosetheboss' is visible at the bottom throughout the video. He raises an index finger as if making an important point, then hums while holding the phone up. The text 'Hmmmmmm' appears on the screen in large letters. Another scene shows a woman dancing with the same 'Hmmmmmm' text overlay. The clip ends with a close-up of the phone screen displaying search results for 'Never Gonna Give You Up' by Rick Astley, with the text 'IT GOT THAT, TOO!' in bold letters again. The audio includes someone saying, \"But let's try the next level of this. Let's see if it can pick up the song just from humming.\".\n\nIn a close-up of a hand holding a smartphone, the Google Assistant interface displays 'Sounds like one of these' with search results for 'Never Gonna Give You Up' by Rick Astley showing a 75% match. The text 'IT GOT THAT TOO!' appears in bold letters. The scene transitions to the man in the checkered shirt looking satisfied while holding the phone. The screen shows '75% MATCH' in large text. He gestures as if explaining something and then begins to whistle, with musical notes appearing around him, while the text 'WHISTLING' appears in large letters. The audio includes the man saying, \"Ooh, 75% match, I'll take it. This is kind of insane. Okay, whistling. Yes!\".\n\nThe final clip shows a close-up of a hand holding a smartphone, displaying the Google Assistant interface with search results for 'Never Gonna Give You Up' by Rick Astley. The options 'Try again' and 'What's this song?' are visible at the bottom.\n\nTransitioning back to the man in the checkered shirt at the desk, he gestures with his hands. The video ends with him holding the phone up and the text 'SUBSCRIBE' in large letters at the bottom, accompanied by a downward-pointing arrow. The audio includes the person saying, \"72%! I can't tell what's more impressive, Google's intelligence or my musical talent. 
Subscribe and I'll never sing again\", with music playing in the background.\n\nThrough the sequence of demonstrations, the person showcases how Google Assistant can identify songs through singing, humming, and whistling, celebrating its high accuracy and amusingly presenting the capabilities in an engaging manner."
+ }
+ ],
+ "events": [
+ "Man in red and black checkered shirt sits at a desk with multiple screens displaying blue and purple abstract patterns.",
+ "Man holds a smartphone and appears to be explaining something.",
+ "Man uses hand gestures and points to the phone screen to emphasize his points.",
+ "Man says, 'Did you know that Google can tell you the name of a song just by listening to you sing it, hum it, or even whistle it? So let's start with singing.'",
+ "Man addresses Google Assistant with, 'Hey Google, what's this song?', holding the phone up.",
+ "Audio includes a sound effect.",
+ "Man speaks, then a song starts playing in the background.",
+ "Scene transitions to a close-up of a hand holding a smartphone.",
+ "Smartphone screen displays Google Assistant interface with 'Play, sing, or hum a song', visual representation of sound waves, and 'Listening...'",
+ "Background blurred but appears indoors in a dark setting.",
+ "Text on phone screen updates to 'Still working\u2026' as Google Assistant processes input.",
+ "Scene transitions back to the man in the checkered shirt holding the phone and looking at it intently.",
+ "Man reacts with excitement.",
+ "Text 'YES!' appears in large, bold letters.",
+ "Man shows phone screen which displays 'Never Gonna Give You Up' by Rick Astley with 'GOT IT!' in bold letters.",
+ "Audio includes the man singing 'I just wanna tell you how I'm feeling. Gotta make you understand', then exclaiming 'Yes! Got it! And everyone told me I couldn't sing.'",
+ "Man in red and black checkered shirt at the desk, emphasizing his points with hand gestures.",
+ "Text 'Mrwhosetheboss' is visible at the bottom.",
+ "Man raises an index finger, then hums while holding the phone up.",
+ "Text 'Hmmmmmm' appears on the screen in large letters.",
+ "Another scene shows a woman dancing with the 'Hmmmmmm' text overlay.",
+ "Clip ends with a close-up of the phone screen displaying search results for 'Never Gonna Give You Up' by Rick Astley with 'IT GOT THAT, TOO!' in bold letters.",
+ "Audio includes someone saying, 'But let's try the next level of this. Let's see if it can pick up the song just from humming.'",
+ "Close-up of a hand holding a smartphone displaying 'Sounds like one of these' with search results for 'Never Gonna Give You Up' by Rick Astley showing a 75% match.",
+ "Text 'IT GOT THAT TOO!' appears in bold letters.",
+ "Scene transitions to the man in the checkered shirt looking satisfied while holding the phone.",
+ "Screen shows '75% MATCH' in large text.",
+ "Man gestures as if explaining something.",
+ "Man begins to whistle.",
+ "Musical notes appear around the man.",
+ "Text 'WHISTLING' appears in large letters.",
+ "Audio includes the man saying, 'Ooh, 75% match, I'll take it. This is kind of insane. Okay, whistling. Yes!'",
+ "Final clip shows a close-up of a hand holding a smartphone displaying Google Assistant interface with search results for 'Never Gonna Give You Up' by Rick Astley.",
+ "Options 'Try again' and 'What's this song?' are visible at the bottom.",
+ "Transitioning back to the man in the checkered shirt at the desk gesturing with his hands.",
+ "Video ends with man holding the phone up and text 'SUBSCRIBE' in large letters at the bottom, with a downward-pointing arrow.",
+ "Audio includes person saying, '72%! I can't tell what's more impressive, Google's intelligence or my musical talent. Subscribe and I'll never sing again', with music playing in the background.",
+ "Sequence showcases how Google Assistant can identify songs through singing, humming, and whistling.",
+ "Person celebrates Google's high accuracy and amusingly presents its capabilities."
+ ]
+ },
+ {
+ "video": "video/61.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a woman standing in a cozy room filled with various plants, featuring a pink-lit shelf in the background. The woman, dressed in a gray t-shirt with a leopard print design and denim shorts, is seen reading text on the screen that says \"I'm sick but I got a package.\" She bends down to pick up a large cardboard box. As she begins to open the box, the text updates to \"I don't know who this is from.\" The woman's voice is heard saying, \"This is a big box,\" as she holds the flaps open, revealing brown packing paper inside. The text on the screen then exclaims, \"It's a cactus!!!\" with an excited emoji as she discovers a smaller package labeled \"Cactus Mix\" inside. The woman excitedly holds up the package, commenting about the cactus.\n\nThe scene continues with the same woman in the room, holding up the package labeled \"Cactus Mix\" and the text \"It's a cactus!!!\" still displayed. She removes a large piece of bubble wrap from the box, uncovering a cylindrical brown pot. She brings the pot close to her face, smells it, and says, \"It just smells like, like you know, the machine where you go and you put your hands in it and you make the pot,\" as the text \"I'm delirious\" with a delicious emoji appears on screen. Mimicking the action of putting her hands into the pot as if it were a machine, she continues to smell the pot.\n\nShe then comments, \"And you make the pot this feels like it came right out the oven,\" holding the pot up to her face, which is also shown as a subtitle. Placing the pot down, she reaches into the box and pulls out a green, twisted cactus. Subtitles and audio display \"Look how girthy it is a cactus,\" as she examines the cactus with intrigue.\n\nThe woman examines the green, twisted cactus carefully, with the subtitles and audio displaying \"Ouch! Trying to be nice\". She then places the cactus into the cylindrical brown pot, saying \"for in here,\" shown as a subtitle. 
As she admires the cactus in the pot, the text changes to \"How pretty are you?\" She decides to pot the cactus immediately, with subtitles and audio displaying \"Pot it up right now. I'm gonna do it right now, right.\" She starts preparing to pot the cactus by spreading out brown packing paper on the floor.\n\nStill in the same room filled with plants, the woman sits on the floor amidst brown packing paper and bubble wrap. Holding a bag labeled \"Cactus Mix,\" she pours its contents into the cylindrical brown pot, with subtitles and audio displaying \"he's gonna do what we do\". She picks up the green, twisted cactus, saying \"I just want to put you in a home it's okay,\" and carefully places it into the pot. Realizing more soil is needed, she comments, \"I'm gonna need some more mix actually so yeah,\" as the text displays this statement. She looks satisfied with the potted cactus.\n\nThe final scene opens with a close-up of the newly potted green, twisted cactus in a terracotta pot filled with white pebbles. The woman places this pot among other potted cacti of varying shapes and sizes. The screen text reads \"there you go! To the collection you go yeah!\" as she adjusts the new cactus to fit well with the existing collection. The video concludes with a wider shot showing the entire collection of potted cacti near a window with blinds, allowing natural light to illuminate the room. The overall audio ambiance includes varied expressions of excitement and narration about each action, along with occasional sounds of packing material and subtle background music, adding to the overall cozy and enthusiastic atmosphere of the video."
+ }
+ ],
+ "events": [
+ "Video opens with a woman standing in a cozy room filled with various plants.",
+ "A pink-lit shelf is visible in the background.",
+ "The woman is dressed in a gray t-shirt with a leopard print design and denim shorts.",
+ "Text on screen reads 'I'm sick but I got a package.'",
+ "The woman bends down to pick up a large cardboard box.",
+ "Text updates to 'I don't know who this is from.'",
+ "The woman's voice is heard saying, 'This is a big box,'",
+ "She holds the flaps open, revealing brown packing paper inside.",
+ "Text on the screen exclaims, 'It's a cactus!!!' with an excited emoji.",
+ "The woman discovers a smaller package labeled 'Cactus Mix' inside.",
+ "She excitedly holds up the package, commenting about the cactus.",
+ "The text 'It\u2019s a cactus!!!' is still displayed as the woman holds the package.",
+ "She removes a large piece of bubble wrap from the box, uncovering a cylindrical brown pot.",
+ "She smells the pot and comments, 'It just smells like, like you know, the machine where you go and you put your hands in it and you make the pot.'",
+ "Text 'I'm delirious' with a delicious emoji appears on screen.",
+ "She mimics the action of putting her hands into the pot as if it were a machine and continues to smell the pot.",
+ "She comments, 'And you make the pot this feels like it came right out the oven,' holding the pot up to her face.",
+ "She places the pot down and reaches into the box to pull out a green, twisted cactus.",
+ "Subtitles and audio display 'Look how girthy it is a cactus,' as she examines the cactus with intrigue.",
+ "She carefully examines the green, twisted cactus.",
+ "Subtitles and audio display 'Ouch! Trying to be nice'.",
+ "She places the cactus into the cylindrical brown pot, saying 'for in here,'",
+ "Text changes to 'How pretty are you?' as she admires the cactus in the pot.",
+ "She decides to pot the cactus immediately.",
+ "Subtitles and audio display 'Pot it up right now. I'm gonna do it right now, right.'",
+ "She starts preparing to pot the cactus by spreading out brown packing paper on the floor.",
+ "Woman sits on the floor amidst brown packing paper and bubble wrap.",
+ "She holds a bag labeled 'Cactus Mix' and pours its contents into the cylindrical brown pot.",
+ "Subtitles and audio display 'he's gonna do what we do'.",
+ "She picks up the green, twisted cactus, saying 'I just want to put you in a home it's okay,' and carefully places it into the pot.",
+ "Realizing more soil is needed, she comments, 'I'm gonna need some more mix actually so yeah,' as text displays this statement.",
+ "She looks satisfied with the potted cactus.",
+ "Final scene opens with a close-up of the newly potted green, twisted cactus in a terracotta pot filled with white pebbles.",
+ "The woman places this pot among other potted cacti of varying shapes and sizes.",
+ "Text on screen reads 'there you go! To the collection you go yeah!'.",
+ "She adjusts the new cactus to fit well with the existing collection.",
+ "Video concludes with a wider shot showing the entire collection of potted cacti near a window with blinds, allowing natural light to illuminate the room.",
+ "Overall audio ambiance includes varied expressions of excitement and narration about each action, along with occasional sounds of packing material and subtle background music."
+ ]
+ },
+ {
+ "video": "video/62.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a lively scene in a modern indoor space, where a man in a blue shirt performs a backbend on a wooden floor, surrounded by four others standing and watching. The text 'BACKBEND CHALLENGE!' is displayed in bold red at the bottom of the screen. As the participants get into their backbend positions, the audio captures voices guiding them: \"All right, go get in your backbends. Whoever can hold it longest. All right, right leg up. Keep it straight.\" Footsteps and squeaks are audible as the others join in. Gradually, everyone lifts one leg off the ground and the screen flashes, 'who can hold their backbend the longest?' in blue. The participants continue holding their backbends as a woman's voice directs them.\n\nIn the next part of the clip, the audio captures a rhythmic chant encouraging the group: \"Keep it straight. Keep it straight. Oh, my gosh. I'm already shaking. Uh-oh. Sebastian's shaking. I'm going to try.\" The participants, now with one leg lifted, are determined, and text in pink appears at the bottom: 'keep those legs straight!'. The focus is on their combined effort to maintain straight legs while holding their backbend positions.\n\nThe video transitions to a close-up of a participant shifting out of the backbend position, then shows multiple participants in a mix of white tops and black shorts continuing, with their legs lifted. The text 'Nayeli is out!' appears in pink, signifying Nayeli's exit from the challenge. Laughter and conversation fill the audio: \"Here we go, this is better. Oh, okay, who's that? Nayeli, I think you're out. All right, keep watching. Legs higher, higher, higher,\" and the words, 'Legs Higher!!' in blue encourage those still in the competition.\n\nAs the participants struggle to lift their legs higher, the text 'Legs Higher!!' reappears in blue. They are instructed to swap the raised leg when 'SWITCH LEGS!' flashes green on the screen. 
The cheerful and supportive audio cues, \"Switch now? Switch! Yay, left leg. Switch, left leg. Hey, Sebastian. Sebastian, you know you want to get out. Do I? Yeah, you do,\" resonate as everyone shifts.\n\nThe challenge intensifies, showing the participants in their backbend positions with switched legs. Text in purple announces 'PEJA IS OUT!' as a woman in a black outfit drops out, and the camera shifts, focusing on those still competing. Voices in the background reinforce the environment: \"All right, Paige's knee is bent. She's out. Okay, keep your legs straight. Sixteen sisters. Okay, Sebastian's going to do a switch. Okay,\" while a girl's scream adds to the atmosphere.\n\nTowards the end, the video highlights a person in a blue shirt and gray pants struggling, revealing 'SEBASTIAN IS OUT!' in blue when he drops out. The remaining participants work hard to hold their positions with text 'ANISE IS OUT!' in purple marking another elimination. The final moments show the participants as celebratory graphics and excited exclamations fill the scene: \"Oh, he's out. His leg dropped. Sorry, Sebastian, even though that was fancy. All right, Anise, your leg's bent. You're out.\" The video concludes amidst cheers and congratulatory overlays, marking the end of the backbend challenge."
+ }
+ ],
+ "events": [
+ "Man in blue shirt performs a backbend on a wooden floor",
+ "Four others stand and watch the man in a blue shirt",
+ "Text 'BACKBEND CHALLENGE!' in bold red displayed at the bottom of the screen",
+ "Audio: voices guiding participants to get into their backbend positions",
+ "Footsteps and squeaks audible as others join in",
+ "Participants lift one leg off the ground",
+ "Screen flashes 'who can hold their backbend the longest?' in blue",
+ "Woman's voice directs participants",
+ "Audio captures rhythmic chant: 'Keep it straight. Keep it straight. Oh, my gosh. I'm already shaking. Uh-oh. Sebastian's shaking. I'm going to try.'",
+ "Text in pink: 'keep those legs straight!'",
+ "Participants determined to maintain straight legs while holding their backbend positions",
+ "Close-up of a participant shifting out of the backbend position",
+ "Multiple participants in white tops and black shorts with their legs lifted",
+ "Text 'Nayeli is out!' appears in pink",
+ "Audio: Laughter and conversation guide participants: 'Here we go, this is better. Oh, okay, who's that? Nayeli, I think you're out. All right, keep watching. Legs higher, higher, higher,'",
+ "Text 'Legs Higher!!' in blue appears",
+ "Participants struggle to lift their legs higher",
+ "Text 'Legs Higher!!' reappears in blue",
+ "Text 'SWITCH LEGS!' flashes green on the screen",
+ "Audio: cheerful and supportive cues: 'Switch now? Switch! Yay, left leg. Switch, left leg. Hey, Sebastian. Sebastian, you know you want to get out. Do I? Yeah, you do,'",
+ "Participants switch their raised legs",
+ "Text in purple: 'PEJA IS OUT!' as a woman in black outfit drops out",
+ "Camera focuses on those still competing",
+ "Voices reinforce environment: 'All right, Paige's knee is bent. She's out. Okay, keep your legs straight. Sixteen sisters. Okay, Sebastian's going to do a switch. Okay,'",
+ "Girl's scream adds to the atmosphere",
+ "Person in blue shirt and gray pants struggles, text 'SEBASTIAN IS OUT!' in blue when he drops out",
+ "Remaining participants work hard to hold positions",
+ "Text 'ANISE IS OUT!' in purple marking another elimination",
+ "Participants in final moments, celebratory graphics and excited exclamations fill the scene",
+ "Audio: 'Oh, he's out. His leg dropped. Sorry, Sebastian, even though that was fancy. All right, Anise, your leg's bent. You're out.'",
+ "Video concludes amidst cheers and congratulatory overlays"
+ ]
+ },
+ {
+ "video": "video/63.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a close-up shot of a cutout of a man examining something through a microscope, suggesting a scientific or educational context. The overlaid text reads, \"This Is What Things Actually Look Like Under A Microscope.\" The man then holds a small object, possibly a fingertip gyro, and points upwards, as if explaining or demonstrating something. The scene shifts to a group of ants gathered around a small dish, followed by a white background with scattered ants. These visuals emphasize the theme of the demonstration: examining tiny creatures under a microscope. The sequence ends with a highly magnified, detailed image of an ant's face, showcasing the intricate features visible under the microscope.\n\nThe accompanying audio matches the visuals, as a man speaks: \"This is what things actually look like under a microscope. First, we have ants. These are just tiny little creatures and no one's really afraid of them. But if you look at one very close, this is what they look like.\" His statement is followed by a pause before he resumes speaking.\n\nContinuing from the previous scene, the video maintains the highly magnified image of the ant's face before transitioning to a close-up of a small, red, worm-like creature held in a person's hand. Next, a highly magnified, detailed image of a different creature with a gaping mouth and sharp teeth appears, illustrating another example of what things look like under a microscope. The sequence concludes with a highly magnified, colorful image of the worm, showcasing intricate patterns and textures.\n\nIn sync with these visuals, the man describes: \"Next, we have a worm. This is what these slimy creatures look like under a microscope. It's quite terrifying.\" Additional sound effects, like the sound of a worm under a microscope, enhance the realism of the scene.\n\nSegueing seamlessly, the video showcases the colorful, highly magnified image of a plant structure under the microscope. 
Once again, the man keeps pointing upwards, as if to explain or demonstrate.\n\nThe audio follows the narrative structure: \"This is what grass looks like under a microscope, and it's just a bunch of little smiley faces.\" accompanied by the sound of something being spit out, adding an additional layer of auditory engagement.\n\nThe background then transitions to a highly magnified, detailed image of a worm-like creature with a gaping mouth and sharp teeth, resonating with the previous narrative of examining everyday organisms under a microscope. The man keeps pointing upward, explaining, and then making faces. He is saying: \"Next up we have maggots they're kind of gross but this is what they look like under a microscope. Uhh.\"\n\nThe final section of the video retains the highly magnified image of the small, white, and brown objects, likely minerals or crystals. The man points upwards, suggesting continued explanation or demonstration, and the background remains consistent with the magnified image.\n\nThe man's clear and confident voice overlays the visuals, with the concluding audio: \"And this is what salt and pepper looks like for some reason it looks like rocks and marshmallows.\" The harmonious combination of visual magnification and descriptive audio creates an educational narrative that effectively communicates the microscopic wonders of everyday objects and creatures."
+ }
+ ],
+ "events": [
+ "Video opens with a close-up shot of a cutout of a man examining something through a microscope",
+ "Overlaid text reads: 'This Is What Things Actually Look Like Under A Microscope'",
+ "Man holds a small object, possibly a fingertip gyro, and points upwards",
+ "Scene shifts to a group of ants gathered around a small dish",
+ "White background with scattered ants",
+ "Highly magnified, detailed image of an ant's face",
+ "Man says: 'This is what things actually look like under a microscope. First, we have ants. These are just tiny little creatures and no one's really afraid of them. But if you look at one very close, this is what they look like.'",
+ "Video maintains the highly magnified image of the ant's face",
+ "Close-up of a small, red, worm-like creature held in a person's hand",
+ "Highly magnified, detailed image of a different creature with a gaping mouth and sharp teeth",
+ "Highly magnified, colorful image of the worm showcasing intricate patterns and textures",
+ "Man says: 'Next, we have a worm. This is what these slimy creatures look like under a microscope. It's quite terrifying.'",
+ "Additional sound effects like the sound of a worm under a microscope",
+ "Highly magnified image of a plant structure under the microscope",
+ "Man points upwards to explain or demonstrate",
+ "Man says: 'This is what grass looks like under a microscope, and it's just a bunch of little smiley faces.'",
+ "Sound of something being spit out",
+ "Highly magnified, detailed image of a worm-like creature with a gaping mouth and sharp teeth",
+ "Man points upward, makes faces, and says: 'Next up we have maggots they're kind of gross but this is what they look like under a microscope. Uhh.'",
+ "Highly magnified image of small, white, and brown objects likely minerals or crystals",
+ "Man points upwards suggesting continued explanation or demonstration",
+ "Man says: 'And this is what salt and pepper looks like for some reason it looks like rocks and marshmallows.'"
+ ]
+ },
+ {
+ "video": "video/64.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man in a light blue shirt holding two plastic containers filled with raw beef, one in each hand, and positioning them in front of his face. The background features a kitchen setting with light-colored cabinets and a stainless steel appliance. Audibly, the man starts by saying, \"Philly cheesesteak. You wouldn't believe how easy and versatile it can be.\"\n\nThe scene transitions to the man holding a green bell pepper in one hand and pointing at it with the other hand, ready to demonstrate a cooking technique. Sliced white onions can be seen on a cutting board in front of him. The man adds, \"My onions are sliced up and I'm going to show you how to cut the pepper. Slice down the sides. You avoid all those seeds and membrane,\" as he proceeds to slice the bell pepper into halves and then into smaller pieces.\n\nNext, the focus shifts to precise knife movements as the green bell peppers are being sliced on the cutting board. The sounds of slicing can be heard, smoothing into the auditory ambiance of a kitchen. The scene extends seamlessly to a close-up of the sliced green bell peppers.\n\nThe scene develops further with the man gesturing and talking while holding a plastic container filled with raw beef. A close-up shows the sliced beef in a plastic container. He narrates, \"Last side. You can thinly slice some steak yourself, or I like to get shaved beef. Just saves a little bit of work and prep. Saute up those veggies.\" White onions and the cut bell peppers are added to a black skillet on the stove in preparation for cooking. He then sprinkles seasoning onto the vegetables. The sound of sizzling and scraping can be heard in the background.\n\nThe sizzling sounds continue as the clip proceeds, with a close-up of more seasoning being sprinkled over the mixture of vegetables. 
\"I'm gonna add a little salt to help these vegetables sweat down,\" the voice explains, gradually incorporating sliced mushrooms into the skillet. \"I like to add sliced mushrooms to mine and slide everything to one side. Just gonna add a little bit more oil and then add in our beef. Spread that out.\" He then slides everything to one side and adds some oil.\n\nIn the subsequent scene, a large piece of raw beef is placed into the sizzling skillet beside the vegetables, guided by tongs. The man continues narrating, \"That'll need some salt. Now be generous with that cracked pepper. And some garlic powder.\" Salt, freshly ground black pepper, and garlic powder are added with a pepper grinder. The man stirs and mixes the meat and vegetables together. \"Mix it all together, let everything cook through, and we're done.\"\n\nThe mixture is seen being stirred in the skillet with tongs and then set aside. The kitchen lights reflect off the glistening surfaces of the cooked meat, green bell peppers, white onions, and mushrooms. The man holds a plate with slices of cheese and begins adding them to the skillet, mentioning, \"Now you can melt the cheese over the heat.\"\n\nThe video shows the next step: a close-up of placing slices of cheese inside a partially sliced baguette wrapped in aluminum foil. The scene transitions to using tongs to fill the baguette with the cheesesteak mixture. \"...but I like to make individual sandwiches. Cheese down on the inside, some of our Philly cheesesteak mix, and of course, more cheese on top.\"\n\nThe prepared mix, now securely in the baguette, is topped off with additional cheese slices and carefully wrapped in foil, and the wrap is complete. Two foil-wrapped baguettes filled with the cheesy, meaty, vegetable-laden mixture are presented to the camera. 
\"Couple minutes in the oven to get it melty and toasty,\" the voice continues, showcasing the anticipation of the final product.\n\nIn the culminating scene, the man in the light blue shirt holds up the foil-wrapped baguettes before placing them into the oven as the climax of the preparation. A close-up captures one of the baguettes after baking, with cheese perfectly melted over the filling. The melted cheese dominates the scene, emphasizing the richness and appeal of the final product.\n\n\"This turned out incredible, and no one would believe how easy this was,\" the man says, showcasing the baked baguette close to the camera. As the man presents two halves of the baguette with melted cheese stretching between them, he concludes, \"Well, I guess now you know. Cheesy pull,\" accentuated by the sound of crumpling foil, closing on a perfectly prepared Philly cheesesteak."
+ }
+ ],
+ "events": [
+ "Man in light blue shirt holding two plastic containers filled with raw beef",
+ "Man positions containers in front of his face",
+ "Background shows a kitchen with light-colored cabinets and a stainless steel appliance",
+ "Man says, 'Philly cheesesteak. You wouldn't believe how easy and versatile it can be.'",
+ "Scene transitions to man holding a green bell pepper and pointing at it",
+ "Sliced white onions on cutting board in front of man",
+ "Man says, 'My onions are sliced up and I'm going to show you how to cut the pepper. Slice down the sides. You avoid all those seeds and membrane.'",
+ "Man slices bell pepper into halves and then into smaller pieces",
+ "Focus shifts to precise knife movements as green bell peppers are sliced",
+ "Sounds of slicing are heard in the background",
+ "Close-up of sliced green bell peppers",
+ "Man gesturing and talking while holding a plastic container filled with raw beef",
+ "Close-up of sliced beef in plastic container",
+ "Man says, 'Last side. You can thinly slice some steak yourself, or I like to get shaved beef. Just saves a little bit of work and prep. Saute up those veggies.'",
+ "White onions and cut bell peppers added to black skillet on the stove",
+ "Man sprinkles seasoning onto vegetables",
+ "Sound of sizzling and scraping in the background",
+ "Close-up of seasoning being sprinkled over vegetable mixture",
+ "Man says, 'I'm gonna add a little salt to help these vegetables sweat down.'",
+ "Man incorporates sliced mushrooms into skillet",
+ "Man says, 'I like to add sliced mushrooms to mine and slide everything to one side. Just gonna add a little bit more oil and then add in our beef. Spread that out.'",
+ "Man slides vegetables to one side and adds oil to skillet",
+ "Large piece of raw beef placed into sizzling skillet beside vegetables using tongs",
+ "Man continues narrating, 'That'll need some salt. Now be generous with that cracked pepper. And some garlic powder.'",
+ "Salt, freshly ground black pepper, and garlic powder are added with pepper grinder",
+ "Man stirs and mixes meat and vegetables together",
+ "Man says, 'Mix it all together, let everything cook through, and we're done.'",
+ "Mixture is stirred in skillet with tongs and then set aside",
+ "Kitchen lights reflect off cooked meat, green bell peppers, white onions, and mushrooms",
+ "Man holds a plate with slices of cheese and begins adding them to skillet",
+ "Man says, 'Now you can melt the cheese over the heat.'",
+ "Close-up of cheese slices placed inside a partially sliced baguette wrapped in aluminum foil",
+ "Scene transitions to using tongs to fill baguette with cheesesteak mixture",
+ "Man says, '...but I like to make individual sandwiches. Cheese down on the inside, some of our Philly cheesesteak mix, and of course, more cheese on top.'",
+ "Prepared mix topped with additional cheese slices, carefully wrapped in foil",
+ "Two foil-wrapped baguettes filled with mixture are presented to the camera",
+ "Man says, 'Couple minutes in the oven to get it melty and toasty.'",
+ "Man holds up foil-wrapped baguettes before placing them into oven",
+ "Close-up of baguette after baking with cheese melted over the filling",
+ "Man says, 'This turned out incredible, and no one would believe how easy this was.'",
+ "Man showcases baked baguette close to the camera",
+ "Man presents two halves of the baguette with melted cheese stretching between them",
+ "Man concludes, 'Well, I guess now you know. Cheesy pull.'",
+ "Sound of crumpling foil as video closes on perfectly prepared Philly cheesesteak"
+ ]
+ },
+ {
+ "video": "video/65.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a view of a modern kitchen, featuring a white marble countertop, brass fixtures, and pendant lights. Decorative items and a framed picture grace the background. Text on the screen reads, \"`When my first organization video went viral...`\" A hand pours cereal from a plastic bag into a large glass jar. The scene transitions to another angle of the kitchen, with the text now stating, \"`People wondered if we kept our pantry that neat ALL the time...`\" The same hand pours cereal into another glass container labeled \"CEREAL.\" The text changes again, saying, \"`And the answer is...`\" The scene shifts to a white metal rack filled with various snack bags, and the text reads, \"NOPE! Of course, not \ud83d\ude0a\ud83e\udd37,\" followed by, \"We're humans robots \ud83e\udd16,\" underscoring that maintaining a perfectly organized pantry at all times isn't realistic.\n\nIn the background, a woman's voice narrates: \"`When my first organization video went viral, people wondered if we kept our pantry that neat all the time. And the answer is no, of course not. We're humans, not robots. `\"\n\nThe video continues with the same modern kitchen backdrop. A hand places a white mesh basket labeled \"BAKING\" on the countertop, with text on the screen that reads, \"& no matter how well-intentioned we are,\" underlining the ongoing theme of realistic organization. The hand proceeds to place several red packages into the basket. The text changes to, \"There will be days when we slack off \ud83d\ude1c.\" The scene cuts to a woman with long brown hair organizing a pantry. She places the \"BAKING\" basket on a shelf and arranges various items. Text on the screen reads, \"But that's the beauty of organization-,\" followed by, \"It'`s not meant to be a one-off deal, `\" and \"where you clean one thing & it stays like that forever \ud83d\ude1d,\" emphasizing the continuous effort needed for organization. 
The clip ends with a close-up of the woman placing a bag of chips into a wire basket on the pantry door.\n\nThe audio, provided by the same woman's voice, states: \"`How well-intentioned we are, there will always be days when we slack off. But that's the beauty of organization. It's not meant to be a one-off deal where you clean one thing and it stays like that forever, `\" with playful music playing in the background.\n\nThe video then shows a wider view of the pantry, displaying various organized items in labeled containers and baskets. Text appears, saying, \"It's more like an ongoing decision to be present with your surroundings \ud83d\ude4f,`\" emphasizing mindfulness in maintaining organization. A wooden rolling cart filled with neatly arranged fabric bags is shown. The text reads, \"Be appreciative of your things \u2764\ufe0f.\" A woman with long brown hair is seen organizing the bags on the cart. New text appears, saying, \"& maintain a functional & beautiful space that serves you \ud83d\ude18,\" highlighting the value of a well-maintained environment. Finally, the scene shows a cabinet being organized with neatly stacked boxes and bags. The text states, \"If you're curious about any of the featured products such as my Simple human paper towel holder w/ a spray pump,\" inviting viewers to learn more.''\n\nThe audio continues seamlessly: \"It's more like an ongoing decision to be present with your surroundings, be appreciative of your things, and maintain a functional and beautiful space that serves you. If you're curious about any. `\" The sound of paper crinkling is heard.\n\nThe scene shifts to a close-up of a white paper towel holder with a spray pump on the white marble countertop. The text changes to, \"such as my Simplehuman paper towel holder w/ a spray pump,\" as a hand demonstrates the spray pump feature. 
The text then reads, \"I' ll link it all in my bio under `'Storage & Organizers' \u2764\ufe0f.\" The scene transitions to the woman with long brown hair wiping down the countertop with a cloth. Text on the screen reads, \"LIKE this video & FOLLOW me for more! \u2764\ufe0f\ud83d\ude4f.\" The final shot shows the entire kitchen, showcasing its organized and pristine condition."
+ }
+ ],
+ "events": [
+ "The video opens with a view of a modern kitchen featuring a white marble countertop, brass fixtures, and pendant lights.",
+ "Decorative items and a framed picture grace the background.",
+ "Text on the screen reads, 'When my first organization video went viral...'",
+ "A hand pours cereal from a plastic bag into a large glass jar.",
+ "The scene transitions to another angle of the kitchen.",
+ "Text on the screen states, 'People wondered if we kept our pantry that neat ALL the time...'",
+ "The same hand pours cereal into another glass container labeled 'CEREAL.'",
+ "Text on the screen changes to, 'And the answer is...'",
+ "The scene shifts to a white metal rack filled with various snack bags.",
+ "Text reads, 'NOPE! Of course not \ud83d\ude0a\ud83e\udd37,' followed by, 'We're humans robots \ud83e\udd16,'",
+ "In the background, a woman's voice narrates: 'When my first organization video went viral, people wondered if we kept our pantry that neat all the time. And the answer is no, of course not. We're humans, not robots.'",
+ "The video continues with the same modern kitchen backdrop.",
+ "A hand places a white mesh basket labeled 'BAKING' on the countertop.",
+ "Text on the screen reads, '& no matter how well-intentioned we are,'",
+ "The hand places several red packages into the basket.",
+ "Text on the screen changes to, 'There will be days when we slack off \ud83d\ude1c.'",
+ "The scene cuts to a woman with long brown hair organizing a pantry.",
+ "The woman places the 'BAKING' basket on a shelf and arranges various items.",
+ "Text on the screen reads, 'But that's the beauty of organization-,'",
+ "Text changes to, 'It's not meant to be a one-off deal,'",
+ "Text continues with, 'where you clean one thing & it stays like that forever \ud83d\ude1d.'",
+ "The clip ends with a close-up of the woman placing a bag of chips into a wire basket on the pantry door.",
+ "The background audio continues: 'How well-intentioned we are, there will always be days when we slack off. But that's the beauty of organization. It's not meant to be a one-off deal where you clean one thing and it stays like that forever,'",
+ "Playful music plays in the background.",
+ "The video shows a wider view of the pantry, displaying various organized items in labeled containers and baskets.",
+ "Text on the screen appears, saying, 'It's more like an ongoing decision to be present with your surroundings \ud83d\ude4f,'",
+ "A wooden rolling cart filled with neatly arranged fabric bags is shown.",
+ "Text reads, 'Be appreciative of your things \u2764\ufe0f.'",
+ "A woman with long brown hair is seen organizing the bags on the cart.",
+ "New text appears, saying, '& maintain a functional & beautiful space that serves you \ud83d\ude18,'",
+ "The scene shows a cabinet being organized with neatly stacked boxes and bags.",
+ "Text states, 'If you're curious about any of the featured products such as my Simplehuman paper towel holder w/ a spray pump,'",
+ "The audio continues seamlessly: 'It's more like an ongoing decision to be present with your surroundings, be appreciative of your things, and maintain a functional and beautiful space that serves you. If you're curious about any,'",
+ "The sound of paper crinkling is heard.",
+ "The scene shifts to a close-up of a white paper towel holder with a spray pump on the white marble countertop.",
+ "Text changes to, 'such as my Simplehuman paper towel holder w/ a spray pump,'",
+ "A hand demonstrates the spray pump feature.",
+ "Text then reads, 'I'll link it all in my bio under 'Storage & Organizers' \u2764\ufe0f,'",
+ "The scene transitions to the woman with long brown hair wiping down the countertop with a cloth.",
+ "Text on the screen reads, 'LIKE this video & FOLLOW me for more! \u2764\ufe0f\ud83d\ude4f,'",
+ "The final shot shows the entire kitchen showcasing its organized and pristine condition."
+ ]
+ },
+ {
+ "video": "video/66.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a woman with long brown hair, wearing a necklace and a light blue patterned top. The text \"`Come on come on`\" appears at the bottom of the screen. The woman closes her eyes as if expecting something. The audio reveals the same phrase being spoken multiple times. In the next frame, a '+' and '-' button overlay appears at the top right corner, while the woman continues to say \"Yess.\" The woman clicks something at the top of the screen. The text \"`Are you sure?`\" appears in yellow at the top left corner with the same audio, to which the woman responds through a combination of voice and text on screen, \"`Of course I`'m sure, who wouldn'`t want to live forever?`\" The sequence culminates in a black screen that displays the text \"10 years later\" in white, accompanied by a woman's voice continuing to speak amidst soft background music and eventually a man begins singing.\n\nTransitioning from the black screen with \"10 years later\" text, we return to the same woman. The text \"`Did you guys hear what happened yesterday?-`\" appears at the bottom of the screen, echoed by the same question being asked out loud. The text \"Do you have a moment?\" appears on the woman's face, echoed by the same question being asked out loud. The woman looks at the right and says to the camera, \"One sec, I'`ll be right back,`\" as it appears on screen. After briefly fading to black, the text reads \"This is your last warning to change your mind\" in yellow at the top. The clip concludes with the woman declaring, \"I don't wanna change my mind, this is the best decision I'`ve ever made,`\" as reflected in the on-screen text, alongside audio of the woman's and a man's voices and a briefly singing man.\n\nThe screen fades to black and shows the text \"60 years later\" in white. The scene transitions back to the woman, who is now surrounded by the words \"Family gone\", \"Husband gone\", and \"Friends gone\" in a thoughtful overlay. 
The woman frowns and says, \"`I have no one left...`\" both audibly and with on-screen text, followed by background music.\n\nThe narrative progresses with another black screen stating \"10 years later\" in white text. The scene transitions back to the familiar woman, and the word \"Hey\" appears at the bottom of the screen. The woman then says, \"I don't wanna live forever; I've changed my mind,\" with the on-screen text mirroring the audio. Responded in audio and yellow text, \"Yes?\" appears at the top followed by \"I warned you\". The woman then says, \"There has to be a way to change my decision,\" which appears at the bottom, and is met with \"It's impossible\" in yellow at the top left corner. The clip finishes with the woman on screen looking at the camera.\n\nIn the concluding sequence, another black screen with \"10 years later\" in white text precedes the appearance of the woman we have followed throughout. The woman expresses despair with \"I don'`t wanna live like this anymore, please, there has to be a way,`\" as voiced and displayed in text at the bottom. The word \"There isn't unless...\" in yellow appears at the top left corner, leaving an ambiguous note as the music continues to play in the background with occasional singing and speech from a woman and a man.\n\nThe overarching theme of the video explores a journey through time, immortality, and the eventual regret associated with the decision to live forever, conveyed through a consistent visual and audio narrative."
+ }
+ ],
+ "events": [
+ "Woman with long brown hair wearing a necklace and light blue patterned top appears.",
+ "Text 'Come on come on' appears at the bottom of the screen.",
+ "Woman closes her eyes.",
+ "Audio repeats the phrase 'Come on come on' multiple times.",
+ "A '+' and '-' button overlay appears at the top right corner.",
+ "Woman continues to say 'Yess.'",
+ "Woman clicks something at the top of the screen.",
+ "Text 'Are you sure?' appears in yellow at the top left corner.",
+ "Woman responds with voice and text: 'Of course I'm sure, who wouldn't want to live forever?'",
+ "Screen goes black and displays '10 years later' in white.",
+ "Woman's voice continues with soft background music.",
+ "Man begins singing.",
+ "Transition back to the woman.",
+ "Text 'Did you guys hear what happened yesterday?-' appears at the bottom of the screen.",
+ "Audio echoes the question.",
+ "Text 'Do you have a moment?' appears on the woman's face.",
+ "Audio echoes the question.",
+ "Woman looks right and says, 'One sec, I'll be right back,' with corresponding text.",
+ "Scene briefly fades to black.",
+ "Text 'This is your last warning to change your mind' appears in yellow at the top.",
+ "Woman declares, 'I don't wanna change my mind, this is the best decision I've ever made,' with corresponding text.",
+ "Audio includes woman's and man's voices.",
+ "Man briefly sings.",
+ "Screen fades to black and shows '60 years later' in white.",
+ "Transition to the woman surrounded by words 'Family gone', 'Husband gone', and 'Friends gone' in overlay.",
+ "Woman frowns and says, 'I have no one left...' with corresponding text.",
+ "Background music plays.",
+ "Another black screen states '10 years later' in white.",
+ "Transition back to the woman.",
+ "Word 'Hey' appears at the bottom of the screen.",
+ "Woman says, 'I don't wanna live forever; I've changed my mind,' with corresponding text.",
+ "Audio and yellow text respond with 'Yes?' and 'I warned you.'",
+ "Woman says, 'There has to be a way to change my decision,' with corresponding text.",
+ "Response 'It's impossible' appears in yellow at the top left corner.",
+ "Woman looks at the camera.",
+ "Another black screen with '10 years later' in white text.",
+ "Appearance of the woman again.",
+ "Woman expresses despair: 'I don't wanna live like this anymore, please, there has to be a way,' with corresponding text.",
+ "Word 'There isn't unless...' in yellow appears at the top left corner.",
+ "Ambiguous note with continuous background music.",
+ "Occasional singing and speech from a woman and a man."
+ ]
+ },
+ {
+ "video": "video/67.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a person's fingers holding a small fishing hook against a dark background, highlighting the tools and materials used in fly tying. The logo 'Mainely Outdoors' is visible at the top of the frame. The person secures the hook in a vise and uses a white tool to wrap a thin black thread around the hook, creating a tight base layer. Next, a bundle of yellow fibers is picked up, carefully aligned, and secured to the hook, forming the tail of the fly. Throughout this meticulous process, the audio narration reveals, \"`This fly was banned in the 1800s for being too successful. To start this pattern, we'll grab a yellow goose feather. Fix our feather in place on the back of our hook and continue to secure towards the head of our fly.`\" accompanied by the sound of adjustments being made.\n\nThen the video develops through the continuation of the fly tying process. The yellow fibers are now securely attached to the hook, and the black thread is wound further to ensure a tight base. The audio continues, \"`And then grab some gold wire and wrap it back towards the tail. Grab some silver tinsel. Tie it into the back of our fly. Once complete we will wrap our silver tinsel in closed spirals towards the head of the fly.`\" The person introduces a thin gold wire, positioning it along the hook and securing it with the thread. The sound of wire wrapping can be heard amidst the person's careful work.\n\nThe craft intensifies as the person wraps the black thread tightly around the hook, creating a stable foundation. The audio narrates, \"`Secure and snip the excess free. We will then begin to wrap our gold wire in open spirals towards the head of our fly.`\" and we hear the subtle snipping of scissors. The precise placement and securing of each material are highlighted.\n\nThe intricate process continues; the hook displays the attached yellow fibers, complemented by the gold wire. 
The person picks up a small bundle of grey fibers, aligning and securing them to form the fly's body. Meanwhile, the audio tells us, \"`Secure and helicoptering the excess free. Grab a black rooster cape and tying it onto the head of our fly. Begin to hackle our feather around the head of the fly and snip free.`\" against the faint sound of a helicopter, signifying reinforcement of the fibers.\n\nFurther securing involves introducing iridescent green and blue fibers, adding a splash of vibrant color to the structure. The person uses the white tool to wrap the fibers tightly, and another bundle of yellow fibers is carefully positioned on top. Over this, the audio notes, \"`Then grab a peacock sword and tie them onto our fly. Grab some more yellow goose feathers. We will tie these onto either side of our fly.`\" \n\nFinally, the video showcases the completed fly, with close-ups displaying the amalgamation of yellow fibers forming the tail, grey fibers shaping the body, and reflective iridescent fibers adding a colorful flair. The person's meticulous handwork results in a finely crafted fly, with the 'Mainely Outdoors' logo visible throughout the clip. The audio wraps up the process by explaining, \"And this is a variation of the great alexander fly. We're going to be giving away in our discord community. The details will be in the description below.`\" The clip concludes with the website link 'www.mainely-outdoors.com' displayed at the bottom, guiding viewers for more information.\n\nThroughout the video, both the visual and audio elements harmoniously detail the focused and intricate art of fly tying, showcasing the expertise and attention to detail that goes into creating each fly, emphasized by the calm yet informative narration."
+ }
+ ],
+ "events": [
+ "Close-up of fingers holding a small fishing hook against dark background",
+ "Logo 'Mainely Outdoors' visible at the top of the frame",
+ "Person secures the hook in a vise",
+ "Person uses a white tool to wrap a thin black thread around the hook, creating a tight base layer",
+ "Audio narration: 'This fly was banned in the 1800s for being too successful ...'",
+ "Person picks up a bundle of yellow fibers",
+ "Yellow fibers are carefully aligned and secured to the hook, forming the tail of the fly",
+ "Yellow fibers are securely attached to the hook",
+ "Black thread is wound further to ensure a tight base",
+ "Audio narration: 'And then grab some gold wire ...'",
+ "Person introduces a thin gold wire along the hook and secures it with the thread",
+ "Sound of wire wrapping can be heard",
+ "Person wraps the black thread tightly around the hook, creating a stable foundation",
+ "Audio narration: 'Secure and snip the excess free ...'",
+ "Sound of subtle snipping of scissors",
+ "Hook displays attached yellow fibers, complemented by the gold wire",
+ "Person picks up a small bundle of grey fibers",
+ "Grey fibers are aligned and secured to form the fly's body",
+ "Audio narration: 'Secure and helicoptering the excess free ...'",
+ "Faint sound of a helicopter",
+ "Introduction of iridescent green and blue fibers, adding color",
+ "Person uses white tool to wrap the iridescent fibers tightly",
+ "Another bundle of yellow fibers is carefully positioned on top",
+ "Audio notes: 'Then grab a peacock sword and tie them onto our fly ...'",
+ "Close-ups display the amalgamation of yellow fibers forming the tail",
+ "Grey fibers shaping the body",
+ "Reflective iridescent fibers adding a colorful flair",
+ "Person's meticulous handwork showing a finely crafted fly",
+ "Logo 'Mainely Outdoors' visible throughout the clip",
+ "Audio wraps up: 'And this is a variation of the great alexander fly ...'",
+ "Clip concludes with the website link 'www.mainely-outdoors.com' displayed at the bottom"
+ ]
+ },
+ {
+ "video": "video/68.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a man in a white t-shirt holding a black gaming controller in a modern room. Various icons and a prominent green '20' logo are displayed on a large screen behind him. The texts '3 THINGS that I HATE' and 'Xbox Series X' in red and white appears on the screen. The video transitions to a close-up of the screen displaying a message that reads 'Connect to the Xbox network', with 'Network Issues' in red and white below it and the number '1' in a red circle on the left top screen. Returning to the man in the white t-shirt, who appears to be explaining something, the text '`not connected to Xbox Live`' appears in red and white. The audio reveals the man's frustration as he says, \"`These are three things that I hate about the Xbox Series X. Number one are network issues. Sometimes when I randomly turn on my console, it will say I'm not connected to Xbox Live.`\" \n\nThe scene transitions smoothly with the man continuing to speak, his gestures and passionate expressions conveying his frustration. \"`But when I run a network test, it says I'm connected perfectly fine,`\" he explains. The background remains a modern room with the green '20' logo on the screen. A red circle with the number '1' is visible at the top left corner, and the red circle gradually disappears with the passage of time, possibly acting as a timekeeping function. The screen displays the message 'It's all good. Your console is connected to the Internet' with a green 'CONTINUE' button below it, while the text 'perfectly fine' appears in red and white at the bottom. The scene transitions again to a screen showing various icons, including a green settings icon, a game cover for 'F1 2021', and an 'Ultimate Game Sale' advertisement proclaiming 'Save up to 80%'. The number '2' in a red circle is visible. The text 'Adverts' and 'pretty cluttered' appear in red and white below. The man in the white t-shirt continues, \"`Number two is the adverts on the Xbox dashboard. 
Already the Xbox user interface is pretty cluttered.`\" The audio is punctuated by a ding sound, illustrating the shifts of the scene.\n\nThe video continues with a clear view of the modern room, displaying a large screen with icons and the green '20' logo, featuring a black gaming chair and racing wheel setup in front. The number '2' in a red circle persists in the top left corner. The man in the white t-shirt appears again, gesticulating. The '`Xbox Game Pass`' logo appears on the screen. He explains, \"`And then it's filled with all of these 80% off sales that I'm not interested in because I use Xbox Game Pass. I haven't bought a game on Xbox in ages.`\" As he speaks, the audio includes intermittent whoosh sounds, enhancing the transitions. The scene zooms in on a black gaming controller, showing the number '3' in a red circle. The word 'Controller' in red and white appears below it. The following frame displays a close-up of a green and black gaming controller held by a figure in green armor, marked with '117'. The word 'SHAPE SIZE FEEL' in red and white appears below it. The man continues, \"`And number three is the controller.`\"\n\nThe scene then shifts to a white gaming controller held by a hand, maintaining the same red circle and number. The word 'ADAPTIVE TRIGGERS' in red appears at the bottom. Both controllers are then shown side by side on a wooden surface. The video closes with the man in the white t-shirt holding a black gaming controller while gesturing, the green '20' logo visible behind him. \"`Although I prefer the shape, size, and feel of the Xbox controller, I do love the adaptive triggers on the PS5, and I hope we see that one day on the Xbox.`\" he finishes, his voice resonating with an echo, emphasizing his hopes for future improvements. \n\nThe coherent narrative ties the frustrations and hopes of an Xbox user, making the video engaging and relatable for gamers dealing with similar issues."
+ }
+ ],
+ "events": [
+ "The video opens with a man in a white t-shirt holding a black gaming controller in a modern room.",
+ "Various icons and a prominent green '20' logo are displayed on a large screen behind him.",
+ "Texts '3 THINGS that I HATE' and 'Xbox Series X' in red and white appear on the screen.",
+ "The video transitions to a close-up of the screen displaying a message 'Connect to the Xbox network'.",
+ "Texts 'Network Issues' in red and white and number '1' in a red circle appear on the screen.",
+ "The scene returns to the man in the white t-shirt.",
+ "Text '`not connected to Xbox Live`' appears in red and white.",
+ "The man, frustrated, says: 'These are three things that I hate about the Xbox Series X. Number one are network issues. Sometimes when I randomly turn on my console, it will say I'm not connected to Xbox Live.'",
+ "The man explains: 'But when I run a network test, it says I'm connected perfectly fine.'",
+ "Background remains a modern room with a green '20' logo on the screen.",
+ "Red circle with the number '1' is visible at the top left corner and gradually disappears.",
+ "The screen displays the message 'It's all good. Your console is connected to the Internet' with a green 'CONTINUE' button below it and 'perfectly fine' in red and white at the bottom.",
+ "The scene transitions to a screen showing various icons including a green settings icon, a game cover for 'F1 2021', and an 'Ultimate Game Sale' advertisement proclaiming 'Save up to 80%'.",
+ "The number '2' in a red circle is visible.",
+ "Texts 'Adverts' and 'pretty cluttered' appear in red and white below.",
+ "The man in the white t-shirt continues, 'Number two is the adverts on the Xbox dashboard. Already the Xbox user interface is pretty cluttered.'",
+ "A ding sound punctuates the scene shifts.",
+ "The video continues with a modern room displaying a large screen with icons and the green '20' logo, featuring a black gaming chair and racing wheel setup.",
+ "The number '2' in a red circle persists in the top left corner.",
+ "The man in the white t-shirt appears again, gesticulating.",
+ "The 'Xbox Game Pass' logo appears on the screen.",
+ "He explains, 'And then it's filled with all of these 80% off sales that I'm not interested in because I use Xbox Game Pass. I haven't bought a game on Xbox in ages.'",
+ "Audio includes intermittent whoosh sounds enhancing the transitions.",
+ "The scene zooms in on a black gaming controller, showing the number '3' in a red circle.",
+ "The word 'Controller' in red and white appears below it.",
+ "Next frame displays a close-up of a green and black gaming controller held by a figure in green armor, marked with '117'.",
+ "The word 'SHAPE SIZE FEEL' in red and white appears below it.",
+ "The man continues, 'And number three is the controller.'",
+ "Scene shifts to a white gaming controller held by a hand, maintaining the same red circle and number.",
+ "The word 'ADAPTIVE TRIGGERS' in red appears at the bottom.",
+ "Both controllers are shown side by side on a wooden surface.",
+ "The video closes with the man in the white t-shirt holding a black gaming controller while gesturing.",
+ "The green '20' logo is visible behind him.",
+ "He finishes, 'Although I prefer the shape, size, and feel of the Xbox controller, I do love the adaptive triggers on the PS5, and I hope we see that one day on the Xbox.'",
+ "His voice resonates with an echo, emphasizing his hopes for future improvements."
+ ]
+ },
+ {
+ "video": "video/69.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of two hands delicately holding incense sticks in front of a portrait of a person in traditional attire. The audio reveals a male voice talking about a theory based on \"Zuko Alone,\" and he mentions that Iroh nearly besieged the city of Ba Sing Se five years prior, but retreated after learning about his son's death in combat. The video then shows the incense sticks being placed in a holder on a cloth-covered surface, flanked by two round objects.\n\nThe scene transitions to a person dressed in brown clothing and a conical hat riding a horse through a grassy landscape under a stunningly clear blue sky adorned with white clouds. It then shifts to a large tent with a flag on top, backed by rocky cliffs and a cloudy sky. Inside the tent, an older man with a bald head and topknot in red and black armor writes diligently on a scroll at a table.\n\nNext, two people are seen by a fountain under a tree laden with orange fruits. The woman is seated, engrossed in a scroll, while the man stands beside her in a white robe and hat. The clip concludes with a close-up of the bun on the woman's head, displaying a distinctive black and gold headdress.\n\nThe video then shows a man in a dark outfit and conical hat standing before a family of four adorned in light-colored traditional clothing, set against a backdrop of rocky cliffs and clear sky. The audio narration describes Zuko's travels through the Earth Kingdom and how he encounters a family whose eldest son was conscripted into the military. A man and then a woman speak. \n\nThe family is depicted more closely, with the man embracing the woman and one child. The woman appears distressed, covering her face with her hands. The scene transitions to a harrowing image of a child tied to a wooden post, guarded by guards in yellow uniforms holding spears. 
An orange sky over a village of wooden structures hints at either sunrise or sunset.\n\nBack to the family, the bearded man with a topknot in a sleeveless garment appears close-up, looking angry or distressed, as the woman in light clothing behind him places her hands on the child's shoulders. This family stands in a similar setting of a wooden structure with a thatched roof and rocky cliffs. The narration continues with mention of a 600-day siege and the Earth Kingdom forces that killed Lu Ten. Inside a warmly lit room, a person in red and brown attire holds a green garment while another person stands partially visible in the background.\n\nThe video shows a close-up of a hand holding a knife falling backwards, with someone in red and brown clothing sitting on a wooden chair in the background. The portrait of a person, appearing at the beginning of the video, is seen more clearly in the next scene. He has a topknot and wears traditional attire, flanked by Chinese characters. The clip concludes with a person on a horse in tall, golden grass, with the family of three standing beside him. The man and woman hug each other and then go into the room. The child runs to the man on the horse. Rocky cliffs and a clear sky serve as the backdrop, with a wooden structure and fence.\n\nIn the final segment, the background shifts to a wooden structure with an open area. A close-up shows the child with dark hair in a sleeveless light-colored garment and a green belt, holding a knife with both hands. The audio elaborates on the theory, contemplating the conscription timing and its impact on Lee and his brother.\n\nThe video transitions to an evocative outdoor scene during sunset, where an elderly person with long white hair and a beard in dark clothing kneels on the ground with leaves drifting down around him. 
Rocky cliffs and a solitary tree form the backdrop, with two incense sticks under the tree, amplifying the sense of poignant reflection, with the male voice discussing the memories conscription would leave on Lee if it happened five years before the current events."
+ }
+ ],
+ "events": [
+ "Video starts with a close-up of two hands holding incense sticks in front of a traditional portrait",
+ "Male voice narrates about 'Zuko Alone' theory mentioning Iroh's siege on Ba Sing Se",
+ "Incense sticks are placed in a holder, surrounded by two round objects",
+ "Scene transitions to a person in brown clothes and conical hat riding a horse in a grassy landscape",
+ "Clear blue sky with white clouds is shown",
+ "Scene shifts to a large tent with a flag, backed by rocky cliffs and cloudy sky",
+ "Inside the tent, an older man in red and black armor writes on a scroll",
+ "Two people are seen by a fountain under a tree with orange fruits",
+ "A woman engrossed in a scroll while a man in a white robe and hat stands beside her",
+ "Close-up of the woman's bun with a distinctive black and gold headdress",
+ "A man in dark outfit and conical hat stands before a family of four",
+ "Family of four is adorned in light-colored traditional clothing",
+ "Rocky cliffs and clear sky serve as the backdrop",
+ "Narration describes Zuko's travels and encounter with a family whose eldest son was conscripted",
+ "Man and woman from family speak",
+ "Close-up of the family shows the man embracing the woman and one child",
+ "Woman appears distressed, covering her face with her hands",
+ "Scene transitions to a child tied to a wooden post guarded by guards with spears",
+ "Orange sky indicates sunrise or sunset over a village of wooden structures",
+ "Close-up of bearded man with topknot looking angry or distressed",
+ "Woman in light clothing places her hands on child's shoulders",
+ "Family stands near a wooden structure and rocky cliffs",
+ "Narration mentions the 600-day siege and the Earth Kingdom forces killing Lu Ten",
+ "Inside a warmly lit room, person in red and brown attire holds green clothing",
+ "Another person stands partially visible in the background",
+ "Close-up of hand holding a knife falls backwards",
+ "Person in red and brown clothing sits on wooden chair in background",
+ "Clear portrait of the person from beginning of video with topknot and traditional attire",
+ "Chinese characters flank the portrait",
+ "Person on a horse in tall golden grass with family of three beside him",
+ "Man and woman of family hug and go into the room",
+ "Child runs to the man on the horse",
+ "Rocky cliffs and clear sky with a wooden structure and fence in backdrop",
+ "Close-up of child with dark hair in sleeveless light-colored garment and green belt holding a knife",
+ "Audio discusses theory on conscription timing and its impact on Lee and his brother",
+ "Video transitions to outdoor scene during sunset with elderly person kneeling on ground",
+ "Elderly person has long white hair and beard, wears dark clothing",
+ "Leaves drift down around elderly person",
+ "Rocky cliffs and solitary tree form the backdrop",
+ "Two incense sticks under the tree with a sense of poignant reflection",
+ "Narrator discusses memories of conscription impacting Lee if it happened five years before current events"
+ ]
+ },
+ {
+ "video": "video/70.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man wearing a white t-shirt that says 'COMMERCIAL BREAK' in red and yellow, along with gray sweatpants. This man is holding a can of G FUEL energy drink and walks towards the camera, gesturing toward the beverage for emphasis. The background includes a parked black car and a grassy area with trees. He then moves towards the car, pointing and discussing its luxurious features, especially the front grille.\n\nAudio complements the visual with the person introducing the car: \"What is up guys so right behind me is Korea's most luxurious car this g90 lets you save the cash and still get the flash you've got this giant grill that comes all the way across.\" Ambient sounds include the noise of a car driving by.\n\nTransitioning to the next scene, the man continues to wear the same outfit and now stands beside the front of a blue car. He highlights the car's front grille and headlights with great enthusiasm, emphasizing the LED lights wrapping around the vehicle. As he opens the rear door of the car and sits inside, he showcases the tan leather interior, pointing out the rear seat's luxurious features such as the headrest pillows and integrated speakers.\n\nThe audio during this part includes descriptions like: \"These LED lights wrap all the way around but the best part is how you close the door i just push this button in the back i get pillows in the headrest speakers in the roof a mirror to see how successful I am, but it gets really interesting when I push this button.\" The sound of a car door opening and closing punctuates the spoken words.\n\nFollowing this, the man is seen sitting comfortably in the rear seat and points skyward to the ceiling light, then towards the windows, showcasing them. He adjusts the seat to display the ample legroom and pulls down the center armrest, revealing additional features like cup holders. 
The front seat tilts forward and the man's feet are propped up.\n\nThe audio captures his continued commentary: \"If I fold this down, I'm going to have, storage space, a wireless charging pad, my seating controls, a cup holder for my G Fuel, who's sponsoring this video, use my code BURNOUT for 30% off, and a touchscreen to control this, this and that.\" Pressing a button creates an audible click within the audio.\n\nThe narrative continues with the man still in the rear seat engaging with the center console, highlighting storage space, a wireless charging pad, and seating controls. He places a can of G FUEL in the cup holder and discusses the energy drink's relevance, noting a discount code 'BURNOUT' for 30% off.\n\nNext, the man moves to the front seat, further pointing out features on the center console and the sunroof. He describes the front's functionality, showing the infotainment system, dashboard, and glove compartment.\n\nThe spoken audio presents: \"So in the front, the door's also shut by just pushing a button. You have a tinted moonroof, and these quilted leather seats that are heated, ventilated, and massage you. Wood, leather, a touch panel, apple carplay, an android auto that works super fast, a fingerprint reader, wireless charging pad, head-up display, and whenever I turn the car off the speakers do this.\"\n\nFinally, still seated in the front seat, the man engages with the car's infotainment system, navigating through various settings on the screen and showcasing Apple CarPlay and Android Auto. He highlights the high-speed response of the system and mentions unique features like the speakers' response when the car is turned off. At the end of the video the man is seen slowly dropping from view.\n\n"
+ }
+ ],
+ "events": [
+ "Man is wearing a white t-shirt that says 'COMMERCIAL BREAK' in red and yellow and gray sweatpants",
+ "Man is holding a can of G FUEL energy drink and walks towards the camera, emphasizing the beverage",
+ "Background includes a parked black car and a grassy area with trees",
+ "Man moves towards the black car, pointing and discussing its luxurious features, especially the front grille",
+ "Audio: Man introduces the car: 'What is up guys so right behind me is Korea's most luxurious car this g90 lets you save the cash and still get the flash you've got this giant grill that comes all the way across'",
+ "Ambient sounds include the noise of a car driving by",
+ "Scene transitions to the man standing beside the front of a blue car, still wearing the same outfit",
+ "Man highlights the blue car's front grille and headlights, emphasizing the LED lights",
+ "Man opens the rear door of the blue car and sits inside, showcasing the tan leather interior, headrest pillows, and integrated speakers",
+ "Audio: Man describes features: 'These LED lights wrap all the way around but the best part is how you close the door i just push this button in the back i get pillows in the headrest speakers in the roof a mirror to see how successful I am, but it gets really interesting when I push this button.'",
+ "Sound of a car door opening and closing punctuates the spoken words",
+ "Man sits comfortably in the rear seat and points skyward to the ceiling light, then towards the windows",
+ "Man adjusts the seat to display ample legroom and pulls down the center armrest, revealing cup holders and other features",
+ "Front seat tilts forward and the man's feet are propped up",
+ "Audio: Man continues commentary: 'If I fold this down, I'm going to have, storage space, a wireless charging pad, my seating controls, a cup holder for my G Fuel, who's sponsoring this video, use my code BURNOUT for 30% off, and a touchscreen to control this, this and that'",
+ "Pressing a button creates an audible click",
+ "Man engages with the center console, highlighting storage space, a wireless charging pad, and seating controls",
+ "Man places a can of G FUEL in the cup holder and discusses the energy drink, mentioning the discount code 'BURNOUT' for 30% off",
+ "Man moves to the front seat, pointing out features on the center console and sunroof",
+ "Man describes the front's functionality, showing the infotainment system, dashboard, and glove compartment",
+ "Audio: Man describes features: 'So in the front, the door's also shut by just pushing a button. You have a tinted moonroof, and these quilted leather seats that are heated, ventilated, and massage you. Wood, leather, a touch panel, apple carplay, an android auto that works super fast, a fingerprint reader, wireless charging pad, head-up display, and whenever I turn the car off the speakers do this.'",
+ "Man sits in the front seat, engaging with the car's infotainment system",
+ "Man navigates through various settings on the screen, showcasing Apple CarPlay and Android Auto",
+ "Man highlights the high-speed response of the system",
+ "Man mentions unique features like the speakers' response when the car is turned off",
+ "Video ends with the man slowly dropping from view"
+ ]
+ },
+ {
+ "video": "video/71.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a view of a beige dress form, marred with various paint marks, standing on a metal base. Adjacent to it is another dress form fashioned out of brown leather, featuring a distinctive metal cage-like structure at the bottom. Behind them, clothes hang on a rack. Wearing dark nail polish, a man holds a roll of green tape and starts methodically applying it vertically down the center of the beige mannequin's torso, starting from the neck and extending to the lower abdomen. He then applies a horizontal strip across the chest, intersecting with the vertical one, and proceeds to adjust the tape, ensuring it adheres properly. He carefully smooths the tape over the mannequin. The same man with dark nail polish continues his work, adding more vertical strips along the sides of the torso and a horizontal strip around the waist. \n\nAs this happens, a voiceover states: \"Today I'm making a red carpet dress, and in order to do so I need to make a pattern. So I'm going to start by using a draping tape on the dress form and laying it around so I can create the silhouette. Then I start adding in the lines where my seams are going to go.\"\n\nNext, the view remains focused on the beige mannequin, now adorned with vertical and horizontal green tape. He then drapes a piece of beige fabric over the mannequin's front, smoothing it out and pinning it in precise locations. He uses a pencil to mark lines on the fabric, indicating areas that need to be cut or sewn. \n\nOver this sequence, the voice adds: \"Now that I've got that done, I'm going to use a little bit of cotton muslin and drape it over the dress form to create the pattern itself. I create some tension across the dress form using pins and then I trace out where those lines are going to be.\"\n\nThe video then transitions to a close-up of the pinned beige fabric that now features several pencil-marked lines. 
The scene shifts to a flat surface where the man draws precise lines on another piece of beige fabric using various rulers, including a 'Fairgate Fashion Form' and a French curve. Subsequently, the man is shown at a sewing machine, carefully stitching pieces of dark gray fabric together. Concentrating on the task, he guides the fabric under the needle to ensure straight and even stitches.\n\nThe voiceover explains: \"Afterwards, I use this ruler to perfect the pattern and add in the seam allowances. Now that I have the pattern finished up, it's time to start sewing the mock-up together and then make the actual dress. I always get asked if I went to school for this.\"\n\nIn the next sequence, a close-up showcases the man's hand, still adorned with dark nail polish, pressing down on a piece of dark gray fabric with a sleek silver iron. Transitioning, the full view of the beige dress form now sports a fully sewn, form-fitting dark gray fabric dress. A quick scan of the room reveals various garments and accessories hanging on racks in the background. The man then takes a piece of black fabric intricately decorated with silver floral patterns, draping it over the mannequin and adjusting it to fit seamlessly over the dark gray dress. Pinning the floral fabric in place, he ensures all details are perfectly aligned.\n\nDuring this sequence, the voiceover continues: \"And the truth is that I'm totally self-taught, and there's been a lot of trial and error over the years to figure this whole thing out. I have a bit of a learning disability, which is why learning from a book is so challenging for me, and that's why I structure my education.\"\n\nThe final clip begins with a close-up view of the man's hand pinning silver floral patterns onto a piece of black fabric adorned with silver sequins and floral embellishments, creating an ornate design. 
The scene then transitions to showcase the mannequin now dressed in the final product: a form-fitting black gown covered in the intricate silver floral patterns. The background is dramatically accented with rich red draped fabric, which contrasts beautifully with the sleek and elegant gown extending down to the floor. Text appears on the screen: 'Link is in the description!' in English, indicating additional information or related links can be found in the video's description. The video ends with a black screen.\n\nThe voiceover concludes: \"Videos in a way that also makes sense to me. If you're a beginner and have been wanting to learn and don't know where to start, check out my Patreon. The link is in my bio.\" As the final spoken words fade, the sound of a camera shutter clicking concludes the video."
+ }
+ ],
+ "events": [
+ "View of a beige dress form with paint marks standing on a metal base",
+ "Adjacent mannequin fashioned out of brown leather with a distinctive metal cage-like structure at the bottom",
+ "Clothes hanging on a rack in the background",
+ "Man with dark nail polish holds a roll of green tape",
+ "Man applies green tape vertically down the center of the beige mannequin's torso",
+ "Tape starts from the neck and extends to the lower abdomen",
+ "Man applies a horizontal strip across the mannequin's chest intersecting the vertical strip",
+ "Man adjusts the tape to ensure it adheres properly",
+ "Man adds more vertical strips along the sides of the torso",
+ "Man adds a horizontal strip around the waist",
+ "Voiceover states the goal of making a red carpet dress and explains the need to create a pattern",
+ "Man drapes a piece of beige fabric over the mannequin's front",
+ "Man smooths out the fabric and pins it in precise locations",
+ "Man uses a pencil to mark lines on the fabric indicating cut or sew areas",
+ "Voiceover explains the use of cotton muslin to create the pattern itself",
+ "Voiceover explains creating tension across the dress form using pins",
+ "Voiceover explains tracing out lines on the fabric",
+ "Close-up of pinned beige fabric featuring several pencil-marked lines",
+ "Scene shifts to a flat surface where man draws precise lines on another piece of beige fabric using various rulers",
+ "Man uses a Fairgate Fashion Form and a French curve to draw lines",
+ "Man shown at a sewing machine stitching pieces of dark gray fabric together",
+ "Man guides fabric under the sewing machine needle for straight stitches",
+ "Voiceover explains using a ruler to perfect the pattern and add seam allowances",
+ "Voiceover mentions starting to sew the mock-up together to make the actual dress",
+ "Voiceover states the question about whether the man went to school for this craft",
+ "Close-up of man's hand pressing dark gray fabric with a sleek silver iron",
+ "Full view of beige dress form wearing a fully sewn, form-fitting dark gray fabric dress",
+ "Quick scan over room reveals various garments and accessories hanging on racks",
+ "Man takes a piece of black fabric with silver floral patterns",
+ "Man drapes and adjusts black fabric over the dark gray dress on the mannequin",
+ "Man pins the floral fabric in place ensuring all details are aligned",
+ "Voiceover continues discussing self-taught nature and overcoming learning difficulties",
+ "Close-up view of man's hand pinning silver floral patterns onto black fabric with silver sequins and floral embellishments",
+ "Transition to showcase mannequin in final product: a form-fitting black gown with intricate silver floral patterns",
+ "Background with rich red draped fabric contrasts with sleek and elegant gown",
+ "Text appears on screen: 'Link is in the description!' in English",
+ "Video ends with a black screen",
+ "Voiceover concludes by suggesting to check out the creator's Patreon for more learning",
+ "Sound of a camera shutter clicking concludes the video"
+ ]
+ },
+ {
+ "video": "video/72.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a gameplay scene from Fortnite. A character is descending towards a coastal area with lush green terrain and scattered buildings, using a glider shaped like a tree or plant. On the screen, the text 'Space To Deploy Silver Surfer's Surfboard' suggests an option to switch gliders. The mini-map in the top right corner details the surroundings, showing a countdown of players from 94 to 91. The bottom status bar displays the username 'CODE - INTHELITTLEWOOD', level 107, and full health. A small webcam overlay on the left shows a person in a room filled with various items, and notes recent viewer interactions or donations with texts like 'thefadingmusic x6', 'foxhawk4125 1,000 Bits', and 'ISNEAK'. The character continues to descend over a lighthouse towards a small settlement.\n\nThe audio reveals a man speaking: \"Hello everybody, I'm In The Little Wood, also known as Martin. Welcome back to the channel and welcome back to Fortnite. So, it is another brand new week which means there are new points of interest hanging around the map.\" Another voice briefly chimes in.\n\nThe scene continues with the character nearing the ground over a similar lush terrain and scattered buildings. The player count ticks down from 91 to 90. The character glides past a small settlement towards the coast. A quick glance at the map screen shows various locations like 'Stark Industries' and 'Doom's Domain', and visible quick challenges each offering 10K XP.\n\nThe audio continues with the man's voice: \"and, uh, you can already hear a bunch of different gliders, actually. Maybe it's just one fella making his way over to the northwestern corner of the map. You can see there's a blob right about here northwest of Doom's Domain, but before that the lighthouse.\"\n\nNext, the clip transitions to the map screen displaying locations such as 'Stark Industries' and 'Doom's Domain'. Quick challenges are offered for 10K XP each. 
The character uses a new glider shaped like a surfboard and descends towards a settlement with colorful trees and a purple truck. The mini-map in the top right shows the player count starting at 89 and dropping to 87. Upon landing near the truck, text indicating the 'Wolverine Challenge: Locate a Trask Transport Truck' appears, pointing to a challenge objective. The character explores, collects items, and encounters another player. The screen then showcases the discovery of a new landmark 'Homely Hills', earning 1,000 XP.\n\nThe audio features the man's ongoing commentary: \"You can see this going on right here, there's a couple of rifts knocking about as well because obviously this is all slightly portally and timey-wimey, but all you got to do is make your way over towards this truck, and there you go.\"\n\nIn the subsequent scene, the character, now revealed as Groot, engages in combat with another player near the Trask Transport Truck. 'NEW LANDMARK! TRASK TRANSPORT TRUCK' text appears, marking the new landmark discovery. The player uses a shotgun, but the opponent retaliates with a burst assault rifle. Completing the challenge, text says 'Challenge Complete: Locate a Trask Transport Truck - NEW STYLE!'. However, the character is eliminated by the opponent, placing 82nd in the game. The text 'You placed #82' and 'Eliminated by SRTANEII' appear. The opponent flees, as the webcam overlay shows the player's reaction.\n\nThe audio includes the man saying: \"Locate a transport truck or a Trask transport truck is all you needed to do. Guys killed me off. It's fine. If you did find this video helpful, then be sure to leave a like on it.\" followed by gunshots and laughter.\n\nThe final clip portrays the aftermath of the character's elimination with 'You placed #82' displayed. The status bar still shows the username 'CODE - INTHELITTLEWOOD'. 
The opponent, 'SRTANEII', is near a blue house holding a burst assault rifle, earning 2,386 XP for survival and a total match XP of 2,790. Options for 'Ready Up!', 'Report Player', 'Spectate', and 'Return to Lobby' appear on the screen.\n\nThe audio concludes with the man speaking: \"As always subscribe to the channel, use code 'In The Little Wood', and I'll get some more videos done for you. See you again in a sec.\" This is followed by a beep and more speech."
+ }
+ ],
+ "events": [
+ "Gameplay scene from Fortnite begins.",
+ "A character is descending towards a coastal area with lush green terrain and scattered buildings, using a glider shaped like a tree or plant.",
+ "Text on the screen indicates 'Space To Deploy Silver Surfer's Surfboard'.",
+ "Mini-map in the top right corner shows surroundings and player count from 94 to 91.",
+ "Bottom status bar displays username 'CODE - INTHELITTLEWOOD', level 107, and full health.",
+ "A small webcam overlay on the left shows a person in a room with various items and text of recent viewer interactions or donations: 'thefadingmusic x6', 'foxhawk4125 1,000 Bits', and 'ISNEAK'.",
+ "The character descends over a lighthouse towards a small settlement.",
+ "Audio of a man speaking: 'Hello everybody, I'm In The Little Wood, also known as Martin. Welcome back to the channel and welcome back to Fortnite. So, it is another brand new week which means there are new points of interest hanging around the map.'",
+ "Another voice briefly chimes in.",
+ "Character nears the ground over similar lush terrain and scattered buildings.",
+ "Player count ticks down from 91 to 90.",
+ "Character glides past a settlement towards the coast.",
+ "Map screen shows locations such as 'Stark Industries' and 'Doom's Domain', and visible quick challenges offering 10K XP each.",
+ "Audio continues with the man saying: 'and, uh, you can already hear a bunch of different gliders, actually. Maybe it's just one fella making his way over to the northwestern corner of the map. You can see there's a blob right about here northwest of Doom's Domain, but before that the lighthouse.'",
+ "Clip transitions to the map screen displaying locations like 'Stark Industries' and 'Doom's Domain'.",
+ "Quick challenges are offered for 10K XP each.",
+ "Character uses a new glider shaped like a surfboard and descends towards a settlement with colorful trees and a purple truck.",
+ "Mini-map in the top right shows player count dropping from 89 to 87.",
+ "Upon landing near the truck, text indicates 'Wolverine Challenge: Locate a Trask Transport Truck', pointing to a challenge objective.",
+ "Character explores, collects items, and encounters another player.",
+ "Screen showcases the discovery of a new landmark 'Homely Hills', earning 1,000 XP.",
+ "Audio features ongoing commentary: 'You can see this going on right here, there's a couple of rifts knocking about as well because obviously this is all slightly portally and timey-wimey, but all you got to do is make your way over towards this truck, and there you go.'",
+ "Character, revealed as Groot, engages in combat with another player near the Trask Transport Truck.",
+ "Text 'NEW LANDMARK! TRASK TRANSPORT TRUCK' appears, marking the new landmark discovery.",
+ "Player uses a shotgun, opponent retaliates with a burst assault rifle.",
+ "Challenge completed, text says 'Challenge Complete: Locate a Trask Transport Truck - NEW STYLE!'.",
+ "Character is eliminated by the opponent, placing 82nd in the game.",
+ "Text 'You placed #82' and 'Eliminated by SRTANEII' appear.",
+ "Opponent flees as the webcam overlay shows the player's reaction.",
+ "Audio includes man saying: 'Locate a transport truck or a Trask transport truck is all you needed to do. Guys killed me off. It's fine. If you did find this video helpful, then be sure to leave a like on it.' followed by gunshots and laughter.",
+ "Final clip shows aftermath of character's elimination with 'You placed #82' displayed.",
+ "Status bar still shows username 'CODE - INTHELITTLEWOOD'.",
+ "Opponent 'SRTANEII' is near a blue house holding a burst assault rifle, earning 2,386 XP for survival and a total match XP of 2,790.",
+ "Options for 'Ready Up!', 'Report Player', 'Spectate', and 'Return to Lobby' appear on the screen.",
+ "Audio concludes with the man saying: 'As always subscribe to the channel, use code 'In The Little Wood', and I'll get some more videos done for you. See you again in a sec.' followed by a beep and more speech."
+ ]
+ },
+ {
+ "video": "video/73.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video introduces three men: one in a blue suit and striped tie speaking in a studio, one in a black suit with a red tie, and one in a black suit with a blue and red tie, set against a large globe and cityscape in the background, indicating a news or talk show environment. Text overlays frame the conversation: \"1861 because the point is, can we talk about 1861 and 2021 this weekend? And I don't know who, I don't know who and I don't know who's against talking about history.\" The \"PROFOUND THOUGHTS\" logo with a brain image is visible at the screen's bottom throughout the video. The person is discussing historical events from 1861 and comparative examples in 2021.\n\nThe audio in this segment reveals a debate about discussing historical events. The man with the blue and red tie says, \"1861. Because...\" and the man in the red tie says: \"The point is, can we talk about 1861 in 2021?\" The man in the blue and red tie says: \"Yes we can! And I don't know who, I don't know who and I don't know who's against talking about history.\"\n\nThe video continues with the man in the red tie speaking. The man in the blue tie keeps saying: \"Well I mean I understand that they want to you know in Texas they're talking about removing references to the Ku Klux Klan. They're talking about removing, I mean, come on, it's Texas, I mean you watch Forrest Gump without even knowing who Nathan Bedford Forrest is, right? I mean these are people who are historical figures who did bad things, no one is actually saying, we want to go back and erase everything and restart and talk to you about these things critical. This theory is a subject because the Republican right is made, it is something.\"\n\nThe man in the red tie keeps saying: \"No no no, real school organizations are mandating this for skills.\", while the man in the blue suit tries to interrupt. 
The screen temporarily shows another person in a suit and striped tie responding. The camera cuts again, displaying another participant in a blue suit. The man in the blue suit says: \"I hope every Democrat in America takes your perspective and, and, and they will be shellacked at the polls. Ruy Teixeira just wrote a piece, the emerging Democrat majority, right, he just wrote a piece about what he called, the Fox News fallacy, a tendency of people on the left to believe that, just because Fox News mentioned something, it can't be real. The reality is a lot of this stuff is taught in schools.\" indicating a panel discussion or debate format. The on-screen text matches the speech.\n\nThe man in the red tie quickly asks: \"You mean like viruses?\" The man in the blue and red tie interrupts the debate, saying: \"Well okay well can I can I now, I'm just...\" culminating with a crowd laughing and clapping, indicating the spirited nature of the discussion.\n\nOverall, the video conveys a robust dialogue on historical events, educational mandates, and political perspectives, mirroring contemporary societal debates."
+ }
+ ],
+ "events": [
+ "Video begins with a man in a blue suit and striped tie speaking in a studio.",
+ "A man in a black suit with a red tie is also present.",
+ "Another person characterized by a blue and red tie is shown, set against a backdrop featuring a large globe and cityscape, indicating a news or talk show setting.",
+ "Text overlays appear, framing the conversation about historical events from 1861 and comparative examples in 2021.",
+ "The 'PROFOUND THOUGHTS' logo with a brain image is visible at the screen's bottom throughout the video.",
+ "The person wearing the blue and red tie is discussing historical events from 1861 and their relevance to 2021.",
+ "The man with a blue and red tie begins the audio segment by saying, '1861. Because\u2026'",
+ "Another man in a red tie interjects, 'The point is, can we talk about 1861 in 2021?'",
+ "The man in the blue and red tie responds, 'Yes we can! And I don't know who, I don't know who and I don't know who's against talking about history.'",
+ "The man in the red tie continues speaking.",
+ "The man in the blue tie says, 'Well I mean I understand that they want to you know in Texas they're talking about removing references to the Ku Klux Klan. They're talking about removing, I mean, come on, it's Texas, I mean you watch Forrest Gump without even knowing who Nathan Bedford Forrest is, right? I mean these are people who are historical figures who did bad things, no one is actually saying, we want to go back and erase everything and restart and talk to you about these things critical. This theory is a subject because the Republican right is made, it is something.'",
+ "The man in the red tie continues with, 'No no no, real school organizations are mandating this for skills,' while the man in the blue suit tries to interrupt.",
+ "The screen temporarily shows another person in a suit and striped tie responding.",
+ "The camera cuts to display another participant in a blue suit.",
+ "The man in the blue suit states, 'I hope every Democrat in America takes your perspective and, and, and they will be shellacked at the polls. Roy Teixeira just wrote a piece, the emerging Democrat majority, right, he just wrote a piece about what he called, the Fox News fallacy, a tendency of people on the left to believe that, just because Fox News mentioned something, it can't be real. The reality is a lot of this stuff is taught in schools.'",
+ "Text and speech are matched, indicating a panel discussion or debate format.",
+ "The man in the red tie quickly asks, 'You mean like viruses?'",
+ "The man in the blue and red tie interrupts, saying, 'Well okay well can I can I now, I'm just\u2026'",
+ "A crowd laughs and claps, indicating the spirited nature of the discussion.",
+ "The video concludes by highlighting a robust dialogue on historical events, educational mandates, and political perspectives."
+ ]
+ },
+ {
+ "video": "video/74.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens in a tranquil forested area where a man stands holding an orange lighter in his right hand, with his left hand pointing at it. Dressed in a black t-shirt with a graphic design surrounded by ears of wheat, a green camouflage cap, and sunglasses, the man also has a black wristwatch on his left wrist and a green bracelet on his right wrist. Text appears on the screen, 'HOW TO START A FIRE WITH A LIGHTER'. The man gestures with his hands, pointing at the lighter, making explanatory movements. The accompanying audio reveals the man's voice: \"`A lot of people say, oh I carry a lighter to start a fire. I put lighters in many men's hands and they hadn't been able to start a fire. So I'm going to show you exactly how to do that.`\"\n\nNext, the scene shifts to the man standing next to a tree, still in his forest attire. He explains something about the tree using hand gestures, then uses his hands to pick bark from the tree. The audio continues, \"`The first thing we need is tinder. Tinder is a dry combustible material. We're going to take some of this bark from a cedar tree.`\" Soft music plays in the background, enhancing the forest ambiance.\n\nThe camera captures a close-up as he gathers dry leaves and twigs from the forest floor, forming a small bundle in his hands. The audio continues, \"`Now we're going to process this cedar bark up to make it more combustible by splitting it apart.`\"\n\nThe video cuts to a close-up of the man holding the bundle of dry leaves and twigs. The hands are shown meticulously arranging the bundle over the forest floor, which is strewn with leaves, twigs, and pine needles. The man's green shoe and pieces of bark are visible on the ground. He places the bundle carefully while speaking, \"`So this is called the platform embrace method. 
We're gonna brace here and we got a platform to keep our tinder off the wet ground.`\" The consistent background music underscores the instructional tone.\n\nThe man stands up to collect more dry twigs and branches. In the audio, the man's voice explains, \"`Now we need to get kindling. This is wood off the ground that's not wet. You hear that pop, and this needs to be pencil lead size up to pencil size.`\" The natural environment's sounds blend with the man's instructions.\n\nThe next scene shows a close-up of the man organizing twigs and branches on the forest floor. The familiar black wristwatch and green bracelet are prominent as he uses an orange lighter to ignite the bundle. The small flame begins to flicker as the man carefully adds more tinder to keep it alight. The audio complements with, \"`Now we're going to light our tinder. The brace allows me to lift up the light underneath it. This helps give oxygen to fuel so I can pick it up to help the flame grow.`\" The background music continues, maintaining the informative and serene atmosphere.\n\nThe scene evolves with the man crouching down, explaining, and gesturing while the fire burns steadily. The camera captures the earnestness in his movements as he manages the flames, carefully explaining the process with hand gestures. The audio states, \"`I've lit my tinder, it's flaming up and lighting my kindling. Now I have my fuel here because it rained a few hours ago. I have it sitting here to the side. This will help dry it out. Fuel is finger up to wrist size.`\" The constant musical background supports the ongoing narrative.\n\nIn the final sequences, the man ensures the fire remains stable, adjusting the firewood and twigs. The flames are now steady and visible as he tends the fire with dedicated movements. 
The concluding audio includes, \"`You want to keep your fire small and manageable just to the point of whatever you need.`\" The layered audio elements of the man's instructions and background music complete the coherent and informative narrative of the entire video."
+ }
+ ],
+ "events": [
+ "The video opens in a tranquil forested area.",
+ "A man is standing holding an orange lighter in his right hand.",
+ "The man points at the lighter with his left hand.",
+ "The man is dressed in a black t-shirt with a graphic design, green camouflage cap, and sunglasses.",
+ "The man has a black wristwatch on his left wrist and a green bracelet on his right wrist.",
+ "Text appears on the screen: 'HOW TO START A FIRE WITH A LIGHTER'.",
+ "The man gestures with his hands, making explanatory movements.",
+ "The accompanying audio reveals the man's voice giving instructions on fire starting with a lighter.",
+ "The scene shifts to the man standing next to a tree.",
+ "The man explains something about the tree using hand gestures.",
+ "The man picks bark from the tree with his hands.",
+ "The audio mentions needing tinder and explains it as dry combustible material.",
+ "Background soft music enhances the forest ambiance.",
+ "Close-up of the man gathering dry leaves and twigs from the forest floor.",
+ "The man forms a small bundle of dry leaves and twigs in his hands.",
+ "Audio mentions processing the cedar bark to make it more combustible by splitting it apart.",
+ "Close-up of the man arranging the bundle of dry leaves and twigs on the forest floor.",
+ "The man's green shoe and pieces of bark are visible on the ground.",
+ "The man places the bundle carefully.",
+ "Audio explains the platform embrace method for keeping tinder off the wet ground.",
+ "Background music continues to reinforce the instructional tone.",
+ "The man stands up to collect more dry twigs and branches.",
+ "The man's voice explains the need for kindling and describes the appropriate size.",
+ "Natural environment sounds blend with the man's instructions.",
+ "Close-up of the man organizing twigs and branches on the forest floor.",
+ "The black wristwatch and green bracelet are visible.",
+ "The man uses an orange lighter to ignite the bundle.",
+ "A small flame begins to flicker.",
+ "The man carefully adds more tinder to keep the flame alight.",
+ "Audio details the technique for lighting and managing the tinder flame.",
+ "Background music maintains an informative and serene atmosphere.",
+ "The man is crouching down, explaining, and gesturing while the fire burns steadily.",
+ "Camera captures the man's earnest movements as he manages the flames.",
+ "Audio explains lighting tinder, flaming up kindling, and drying out fuel.",
+ "Background music supports the ongoing narrative.",
+ "The man ensures the fire remains stable.",
+ "The man adjusts the firewood and twigs.",
+ "The flames are steady and visible.",
+ "Man tends to the fire with dedicated movements.",
+ "Concluding audio advises keeping the fire small and manageable.",
+ "Layered audio elements of the man's instructions and background music complete the narrative."
+ ]
+ },
+ {
+ "video": "video/75.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video features a sequence from a public debate event between two high-profile figures, Donald Trump and Joe Biden. They are engaged in a heated exchange, with Trump making pointed critiques towards Biden. The entire sequence is overlaid with explanatory text, adding context to the ongoing verbal confrontation.\n\nThe sequence starts with Joe Biden, in a dark suit and striped tie, prominently speaking into a microphone against a blue background. The text overlaid in yellow at the top reads: '\"Don't ever use the word smart with me\" - Trump || Trump lashes out at Biden,' giving an immediate sense of the confrontation's tone. The white text at the bottom highlights one of the arguments being made: '`Knew how dangerous it was going to be back in February, and he didn't even tell you.`' \nThe white text changes to expand the narrative: '`He panicked or he just looked at the stock market. One of the two... because guess what. A lot of people died and a lot more are going to die, unless he gets a lot smarter a lot quicker.`' \n\nIn sync with the video, the audio reveals Biden's voice, sharply addressing an issue related to the timing and communication of dangers back in February: \"`He knew how dangerous it was going to be back in February, and he didn't even tell you. He's on record as saying it. He panicked, or he just looked at the stock... One of the two... because guess what. A lot of people died and a lot more are going to die, unless he gets a lot smarter a lot quicker.`\" The audio signals a back-and-forth exchange typical of debate scenarios.\n\nThe camera shifts to Trump, who is dressed in a suit and begins to speak, with the changing bottom text adding: '`Mr. President... Did you use the word smart?? So you said, you went to Delaware State but... you forgot the name of your college. You didn't go to Delaware State. 
You graduated either the lowest or almost the lowest in your class.`' This moment is underscored by Trump saying, \"Did you use smart. So you said you went to Delaware State, but you forgot the name of your college. You didn't go to Delaware State. You graduated either the lowest,\" followed by a brief laugh, setting a judgemental tone.\n\nThe discourse maintains its sharp edge as Trump continues his unforgiving critique, clearly intending to undermine Biden. The bottom white text appears: \"Don't ever use the word smart with me. Don't ever use that word.\" Biden then speaks, with yellow subtitles displayed: \"Oh give me a break...\" Trump continues with white subtitles: \"Because you know what... there's nothing smart about you Joe. 47 years, you have done nothing.\" Biden speaks with yellow subtitles: \"Let's have this debate.\"\n\nTrump continues: \"If you would have had... let me just tell you something Joe. If you would have had the charge of what I was put through... I had to close the greatest economy in the history of our country. And by the way, now it's being built again.\"\n\nThe screen transitions to a black background, and the sound of an argument can be heard, along with a distinct phrase, \"We're going to get to the economy in the next segment.\"\n"
+ }
+ ],
+ "events": [
+ "Debate event between Donald Trump and Joe Biden.",
+ "Joe Biden speaks into a microphone wearing a dark suit and striped tie against a blue background.",
+ "Yellow text overlay at the top: '\"Don't ever use the word smart with me\" - Trump || Trump lashes out at Biden.'",
+ "White text at the bottom: 'Knew how dangerous it was going to be back in February, and he didn't even tell you.'",
+ "White text changes to: 'He panicked or he just looked at the stock market. One of the two... because guess what. A lot of people died and a lot more are going to die, unless he gets a lot smarter a lot quicker.'",
+ "Audio reveals Biden addressing the timing and communication of dangers back in February.",
+ "Biden's voice: 'He knew how dangerous it was going to be back in February, and he didn't even tell you. He's on record as saying it. He panicked, or he just looked at the stock... One of the two... because guess what. A lot of people died and a lot more are going to die, unless he gets a lot smarter a lot quicker.'",
+ "Camera shifts to Trump dressed in a suit.",
+ "Bottom text: 'Mr. President... Did you use the word smart?? So you said, you went to Delaware State but... you forgot the name of your college. You didn't go to Delaware State. You graduated either the lowest or almost the lowest in your class.'",
+ "Trump's voice: 'Did you use smart. So you said you went to Delaware State, but you forgot the name of your college. You didn't go to Delaware State. You graduated either the lowest,' followed by a brief laugh.",
+ "Bottom white text appears: 'Don't ever use the word smart with me. Don't ever use that word.'",
+ "Biden speaking with yellow subtitles: 'Oh give me a break...'",
+ "Trump continues with white subtitles: 'Because you know what.. there's nothing smart about you Joe. 47 years, you have done nothing.'",
+ "Biden speaking with yellow subtitles: 'Let's have this debate.'",
+ "Trump continues: 'If you would have had... let me just tell you something Joe. If you would have had the charge of what I was put through... I had to close the greatest economy in the history of our country. And by the way, now it's being built again.'",
+ "Screen transitions to a black background.",
+ "Sound of an argument can be heard.",
+ "Distinct phrase heard: 'We're going to get to the economy in the next segment.'"
+ ]
+ },
+ {
+ "video": "video/76.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a close-up of a large piece of silver ore placed on a wooden surface, with the text 'Mining silver ore to make a pure silver ring' displayed at the top. A hand holds a ring pointing to the silver ore, emphasizing its texture and metallic sheen. The audio reveals a man's voice explaining, \"`This is some silver ore that I found in an abandoned mine. I turned it into this pure silver ring by smelting down the ore. The first step was to go 450 feet down in an abandoned mine.`\" The scene transitions to a close-up of a finished silver ring, showcasing its polished surface and intricate design.\n\nThe next scene shows molten silver being poured from a crucible into a mold, with the word 'Spicy' appearing at the bottom, hinting at the intense heat of the molten metal. The audio complements this with the man saying, \"`And then, you have to get the ore out of the wall; this is galena, which is lead, silver, and some other minerals.`\" In the final sequence of this clip, the scene shifts to a miner in a helmet and mining gear descending a 450-foot abandoned mineshaft using ropes, with the text 'Roping 450 ft down an abandoned mineshaft' providing context for the action.\n\nAs the narrative progresses, the video shows a miner using a power tool to extract minerals from the rock wall. The text 'Last time they mined here was in the 1800s. Now, let's refine the Galena.' appears, providing historical context. Close-ups of the extracted minerals follow, with the text 'Galena - lead, silver, and some other minerals' clarifying the composition of the ore. The audio resumes with, \"Back up top it was time to refine the ore,\" and background music starts to play, enhancing the discovery atmosphere. The scene concludes with the hands in gloves holding a mineral.\n\nThe scene transitions to the mineral now on a white surface. The next clip continues the refining process. 
Glove-clad hands place ore into a crusher, with the text 'Crush up the ore' explaining the action. The audio details, \"`The first step is crushing it up. Then you add in some soda ash, and all that goes into a crucible`\" as crushed ore pours into a white bucket. The following text, 'Add a flux, Heat up the mixture,' appears while the hand mixes the crushed ore with a white substance in the bucket. The next frames show the mixture being poured into a crucible, followed by a scene of the crucible being heated. The man's voice elaborates, \"`And heat it to a few thousand degrees. And then you break off the slag which holds all the impurities.`\" as the molten mixture glows orange inside the crucible.\n\nAfter heating, molten metal is poured into a mold to separate impurities. The text 'Pour into a mold to separate the 'slag'' precedes a scene of a person breaking apart solidified slag with a hammer. As the molten metal is heated again, 'Add Zinc to the party. Zinc binds to silver but repels lead' text appears, describing the next action. \"`Then you reheat that mixture and add in zinc. Zinc binds to silver but repels lead. So you scrape all that off. And once again heat it up. Here, you burn off the zinc and oxidize any lead`\" the voice says, capturing the addition of zinc to the process.\n\nThe video then depicts a crucible with zinc and silver being stirred, with the text emphasizing the binding of zinc to silver. A tool scrapes the mixture. With '`Burn off the zinc...`' text, the crucible glows inside a furnace. In the final scene, viewers see a glowing orange crucible containing molten silver, with tongs holding a piece of metal above it. The text 'Pure silver!!' and a money bag emoji highlight the value of the refined silver. \"`Pure silver. Ready to be made into a ring,`\" the man's voice concludes, as the molten silver is placed on a dark surface to cool and solidify. The final frame shows 'Now, to make the ring...' 
text, signaling the next step in the journey.\n\nThe coherent narrative beautifully ties together the mining, refining, and preparation stages of crafting a pure silver ring from raw ore."
+ }
+ ],
+ "events": [
+ "Close-up of a large piece of silver ore on a wooden surface",
+ "Text 'Mining silver ore to make a pure silver ring' displayed at the top",
+ "Hand holds a ring pointing to the silver ore",
+ "Audio: Man's voice explains about the silver ore found in an abandoned mine",
+ "Audio: Man mentions smelting the ore to make a pure silver ring",
+ "Scene transition to a close-up of a finished silver ring",
+ "Molten silver being poured from a crucible into a mold",
+ "Text 'Spicy' appears at the bottom",
+ "Audio: Man explains extracting ore from the wall and mentions galena",
+ "Scene shifts to a miner descending a 450-foot abandoned mineshaft using ropes",
+ "Text 'Roping 450 ft down an abandoned mineshaft' displayed",
+ "Miner using a power tool to extract minerals from the rock wall",
+ "Text 'Last time they mined here was in the 1800s. Now, let's refine the Galena.' displayed",
+ "Close-ups of extracted minerals",
+ "Text 'Galena - lead, silver, and some other minerals' displayed",
+ "Audio: Man says 'Back up top it was time to refine the ore', background music starts",
+ "Hands in gloves holding a mineral",
+ "Scene transitions to mineral on a white surface",
+ "Hands with gloves place ore into a crusher",
+ "Text 'Crush up the ore' displayed",
+ "Audio: Man details the crushing process and adding soda ash",
+ "Crushed ore pours into a white bucket",
+ "Text 'Add a flux, Heat up the mixture' displayed",
+ "Hand mixes crushed ore with a white substance in the bucket",
+ "Mixture being poured into a crucible",
+ "Crucible being heated",
+ "Audio: Man elaborates on heating to a few thousand degrees and breaking off slag",
+ "Molten mixture glows orange inside the crucible",
+ "Molten metal is poured into a mold to separate impurities",
+ "Text 'Pour into a mold to separate the 'slag'' displayed",
+ "Scene of breaking apart solidified slag with a hammer",
+ "Molten metal reheated, text 'Add Zinc to the party. Zinc binds to silver but repels lead' displayed",
+ "Audio: Man explains reheating mixture, adding zinc, scraping off impurities",
+ "Crucible with zinc and silver being stirred",
+ "Text emphasizing the binding of zinc to silver",
+ "Tool scrapes the mixture",
+ "Text 'Burn off the zinc...' displayed, crucible glows inside furnace",
+ "Glowing orange crucible containing molten silver",
+ "Tongs holding a piece of metal above the crucible",
+ "Text 'Pure silver!!' and money bag emoji displayed",
+ "Audio: Man says 'Pure silver. Ready to be made into a ring'",
+ "Molten silver placed on a dark surface to cool and solidify",
+ "Final frame shows 'Now, to make the ring...' text"
+ ]
+ },
+ {
+ "video": "video/77.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a close-up shot of a hand holding a large, rectangular salt lick block. The block is light brown with white patches, indicating it is made of mineral salt. The background is dark, suggesting it is nighttime. The scene is illuminated by a flashlight or a similar light source. Text appears at the top of the frame, reading \"Salt Lick For Deer!\" The hand wearing a dark-colored sleeve, likely a jacket, holds the salt lick steadily. A male voice, speaking in a robotic tone, states, \"`Dear friends, because of your comments on my last video, I spent some dough on a salt lick for the deer that keeps licking me like a lollipop. It sounded like a good idea.`\"\n\nAs the video continues, the camera remains focused on the salt lick, revealing a forested area with trees in the background. A deer with glowing eyes, reflecting the light, appears and approaches the salt lick. The deer stands closer, indicating its interest in the mineral block.\n\nThe scene transitions to a close-up of the same hand and salt lick block, with the deer now actively licking the salt block. The hand is still wearing the dark-colored sleeve, and it gently pets the deer's side. A male voice continues, \"`I showed up in the dark holding this thing like I'm some sort of a hoof fairy, and the deer were acting a little weird until I gave one a little tickle. To humor me back, they actually licked the salt lick.`\" The audio includes the sounds of someone giving a tickle followed by the sound of licking.\n\nThe camera then shifts focus to the deer's face, giving an intimate look at the deer licking the hand of the man holding the salt lick. The deer's fur is light brown and has a smooth texture. The robotic male voice follows, \"`Then perhaps to show their appreciation, or just to be silly, they decide they'd rather lick me instead. 
Now, some folks might assume that because I'm all chummy with these wild animals, I must hate hunters.`\"\n\nNext, an extreme close-up of the deer's face fills the frame, highlighting its eyes and nose. The deer's glowing eyes reflect the light, emanating a magical quality in the dark setting. The camera angle shifts slightly to reveal more of the deer's head and neck, while a hand pets the deer's head gently, moving from its head to its neck. The voice continues, \"`But that's not true. I try to turn judgment into compassion. We're all on our own journey. Hunting is a part of creation's grand plan in our dear lives.`\"\n\nFollowing this, a close-up shot captures the deer licking the hand of the man in the dark-colored sleeve. The deer's behavior signifies trust and affection, as it continues to lick while the man moves his hand, and the deer follows. The voice narrates, \"`and some of the people I love and respect most dearly are hunters. I showed my finger gun to the deer to see what they thought, and they gave it a little kiss. It's like they were saying, after all that's happened between us, we still love humans, including the hunters.`\" culminating with the sound of a deer kissing the hand.\n\nIn the final scene, the deer licks the man's finger before being presented with the large salt lick block again. The ground shows patches of snow, indicating the cold weather. The deer licks the salt block calmly, while another deer with glowing eyes appears in the background. The concluding words are, \"`I spent 10 bucks on that salt lick, but giving it to the deer made me feel like a million bucks. Everybody loves everybody,`\" ending with the voice of another male, rounding off the narrative.\n\nThe video showcases a growing bond of trust and affection between the man and the deer, interspersed with reflective and humorous commentary, interconnected with a coherent narrative of mutual respect and shared existence."
+ }
+ ],
+ "events": [
+ "Video opens with a close-up of a hand holding a large, rectangular salt lick block.",
+ "Salt lick block is light brown with white patches, made of mineral salt.",
+ "Background is dark, indicating nighttime.",
+ "Scene is illuminated by a flashlight or similar light source.",
+ "Text appears at the top: 'Salt Lick For Deer!'",
+ "Hand wearing a dark-colored sleeve, likely a jacket, holds the salt lick.",
+ "Male voice with a robotic tone speaks about buying the salt lick due to comments on a previous video.",
+ "Camera remains focused on the salt lick, revealing a forested area with trees.",
+ "A deer with glowing eyes appears and approaches the salt lick.",
+ "Deer stands closer, showing interest in the salt lick.",
+ "Scene transitions to a close-up of the hand and salt lick with the deer licking it.",
+ "Hand is still wearing the dark-colored sleeve and pets the deer's side.",
+ "Male voice continues commenting about the deer licking the salt lick after some initial weird behavior.",
+ "Audio includes sounds of tickling and licking.",
+ "Camera shifts to focus on the deer's face, licking the man's hand holding the salt lick.",
+ "Deer's fur is light brown with a smooth texture.",
+ "Robotic male voice talks about the deer licking him instead of the salt lick and mentions hunters.",
+ "Extreme close-up of the deer's face with glowing eyes, reflecting light.",
+ "Camera angle shifts to reveal more of the deer's head and neck as the hand pets it.",
+ "Male voice reflects on judgment, compassion, and hunting as part of life.",
+ "Close-up shot of the deer licking the man's hand in the dark-colored sleeve.",
+ "Deer's behavior signifies trust and affection.",
+ "Man moves his hand and the deer follows, continuing to lick.",
+ "Voice narrates about showing a finger gun to the deer, which the deer then kisses.",
+ "Sound of a deer kissing the hand.",
+ "Final scene: the deer licks the man's finger before being presented with the salt lick block again.",
+ "Ground shows patches of snow, indicating cold weather.",
+ "Deer licks the salt block calmly.",
+ "Another deer with glowing eyes appears in the background.",
+ "Concluding words about the salt lick purchase and the bond with the deer, ending with another male voice."
+ ]
+ },
+ {
+ "video": "video/78.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a red vintage Fiat car being loaded onto the back of a white pickup truck using a metal ramp. The car's license plates read 'MIAS 6646' and 'YP-1187'. A man in a black jacket and light blue jeans exclaims with excitement, \"Whoa! Whoa! We're going to try to load the little Fiat into the bed of a new full-size truck.\" Music plays in the background while the man continues talking. The scene transitions to the man in a black jacket, light blue jeans, white sneakers, and a black-and-white beanie standing and speaking next to the red car. \n\nThe scene moves to the red car driving through a suburban area under clear blue skies with autumn foliage. The scene shifts to another older man in a black hoodie speaking and gesturing. The older man says, \u201cI've never contemplated putting a car. And it is a car inside of a bed. This is a 2022 Ram 1500 classic.\u201d The scene cuts to the red car driving in circles. A close-up of the front of the white pickup truck appears with the two men walking and gesturing near it. The scene shifts to the older man sitting in the white car. Meanwhile, the man in a black jacket describes the situation: \"The Fiat weighs right around a thousand pounds, we're gonna use these...\". \n\nThe scene shows a metal ramp held by a hand. The video continues with the white pickup truck moving on a dirt road, its tailgate open to reveal a tool. The man in the black hoodie with logos reading 'IKE Gauntlet' and 'Rocky', black pants, and brown shoes, places a metal ramp onto the tailgate and carefully walks up. Music continues as two men talk. \"Hold on a second. Let's test it. Oh yeah, Andre, if it can hold you, it'll hold the PF. Put at least half of it on. Am I on the tailgate? Almost. We're good now, passenger. A little more. Stop. Am I on? Okay. So, Tommy, we have two millimeters on the front. Aha. We have half an inch on each side. Yeah, we're riding on a bump stop. 
Do you know what we should have done? A 0-60 in the back of the Ram. Because this car won't do 60 miles an hour really by itself, but in the Ram it should.\" they discuss.\n\nNext, the camera focuses on the red vintage car being driven up the metal ramp. The front wheel takes center stage as the car edges its way up. The car is fully seated on the truck bed with the driver visible inside. The older man, dressed in a black jacket, secures the car, ensuring stability. \n\nProgressing further, the video shows a close-up of the rear wheel of the red vintage car as it rolls up the ramp. The man in the black jacket is visible inside, climbing out through the window once the car sits on the truck bed. The older man in the 'IKE Gauntlet' and 'Rocky' hoodie stands nearby, assisting. Lastly, a close-up of the underside of the truck appears, focusing on the suspension and axle. \nThe clip focuses on the license plates 'MIAS 6646' and 'YP-1187' once again. The older man in the 'IKE Gauntlet' and 'Rocky' hoodie squats and works with a red pipe. The man, in the black jacket sits in the driver's seat, holding the steering wheel and speaking. The setting reflects a rural area with a dirt road, scattered objects, and a house. "
+ }
+ ],
+ "events": [
+ "Red vintage Fiat car is loaded onto the back of a white pickup truck using a metal ramp.",
+ "The car's license plates read 'MIAS 6646' and 'YP-1187'.",
+ "A man in a black jacket and light blue jeans expresses excitement.",
+ "Music plays in the background while the man continues talking.",
+ "Scene transitions to the man in a black jacket, light blue jeans, white sneakers, and a black-and-white beanie standing next to the red car.",
+ "The scene shifts to the red car driving through a suburban area with clear blue skies and autumn foliage.",
+ "Scene transitions to an older man in a black hoodie speaking and gesturing.",
+ "The older man mentions he has never contemplated putting a car into the bed of a truck.",
+ "The scene cuts to the red car driving in circles.",
+ "A close-up of the front of the white pickup truck with two men walking and gesturing near it.",
+ "The scene shifts to the older man sitting inside the white car.",
+ "The man in the black jacket describes the Fiat weighing around a thousand pounds.",
+ "A metal ramp is shown held by a hand.",
+ "White pickup truck moves on a dirt road with its tailgate open revealing a tool.",
+ "The man in the black hoodie with 'IKE Gauntlet' and 'Rocky' logos, black pants, and brown shoes places a metal ramp onto the tailgate and carefully walks up.",
+ "Music continues while the two men talk about testing the ramp's stability.",
+ "Camera focuses on the red vintage car as it is driven up the metal ramp.",
+ "The front wheel of the car takes center stage as it edges its way up the ramp.",
+ "The car is fully seated on the truck bed with the driver visible inside.",
+ "Older man dressed in a black jacket secures the car.",
+ "Close-up of the rear wheel of the red vintage car rolling up the ramp.",
+ "Man in the black jacket climbs out through the window once the car sits on the truck bed.",
+ "Older man in 'IKE Gauntlet' and 'Rocky' hoodie stands nearby, assisting.",
+ "Close-up of the underside of the truck focusing on the suspension and axle.",
+ "Clip shows the car's license plates 'MIAS 6646' and 'YP-1187' again.",
+ "Older man in the 'IKE Gauntlet' and 'Rocky' hoodie squats and works with a red pipe.",
+ "Man in black jacket sits in the driver's seat holding the steering wheel and speaking.",
+ "The setting reflects a rural area with a dirt road, scattered objects, and a house."
+ ]
+ },
+ {
+ "video": "video/79.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a close-up of a hand holding a product package labeled \"Korky WaxFREE Toilet Seal & Bolts.\" The package is orange and blue and displays the text \"Easiest Way to Set a Toilet.\" The scene is set in a bathroom, with the text \"Should I do more of these?\" overlaid at the top. The audio reveals a male voice .\"Is customer had a smell coming from their bathroom they could not figure it out.\"\n\nNext, the video transitions to a black and orange tool on a tiled bathroom floor. The scene shows a hand operating the tool to lift the toilet. The audio reveals a male voice .\" I said whenever they ran the shower ,they smell like a sewer gas. This isn't always the problem but oftentimes if the wax seal on a toilet has never been replaced.\"\nThis is backed by onscreen text highlighting similar points.\n\nIn the third segment, a close-up shows a hand using pliers to remove a worn-out wax seal from the base of the toilet. The audio continues to explain that these old seals can also cause smells if not properly replaced. The video goes on to show the dirty wax seal lying on the floor and then a close-up of hands cleaning the toilet flange, with text inviting viewers to follow the channel for more detailed instructions on replacement. The narrator mentions switching to a waxless seal, as music plays softly in the background.\n\nThe fourth clip opens with a close-up of a rusty, dirty toilet flange on the bathroom floor. A person wearing white gloves is seen cleaning the flange while the audio mentions that waxless rings are beginner-friendly since they allow for corrections if the toilet isn't set perfectly the first time. The sound of a toilet flushing is heard at the end, providing real-life context to the instructional nature of the video.\n\nFollowing this, the video shows another close-up of a person with white gloves continuing to clean the flange and elaborating on adjusting the toilet, emphasizing the commitment once the wax is set. 
The narrator reveals more tips available on their channel, providing additional guidance for simplifying the task with a new wax ring from a package labeled \"Korky.\"\n\nFinally, the clip shows a manual being held by gloved hands, detailing the installation steps for the \"Korky WaxFREE Toilet Seal & Bolts.\" The person points out specific instructions, cautioning about the height of the flange and the need to adjust neoprene rings correspondingly. He removes the red object from the box and mounts it on the floor. The narrator invites viewers to leave any questions in the comments, indicating the end of the video with \"Part 2? Follow\", prompting further engagement from the audience.\n\nOverall, the video combines detailed visual instructions with a coherent audio explanation, ensuring a comprehensive guide to resolving bathroom odors by replacing the toilet's wax seal with a more efficient, wax-free solution.\n"
+ }
+ ],
+ "events": [
+ "Close-up of a hand holding a product package labeled 'Korky WaxFREE Toilet Seal & Bolts'",
+ "Package is orange and blue and displays the text 'Easiest Way to Set a Toilet'",
+ "Scene is set in a bathroom with overlay text 'Should I do more of these?'",
+ "Male voice mentions a customer having a smell coming from their bathroom",
+ "Transition to a black and orange tool on a tiled bathroom floor",
+ "Hand operates the tool to lift the toilet",
+ "Male voice mentions sewer gas smell when running the shower and issues with old wax seals",
+ "Onscreen text highlights similar points",
+ "Close-up of a hand using pliers to remove a worn-out wax seal from the base of the toilet",
+ "Audio explains old seals can cause smells if not replaced properly",
+ "Dirty wax seal shown lying on the floor",
+ "Close-up of hands cleaning the toilet flange",
+ "Text invites viewers to follow the channel for more instructions",
+ "Narrator mentions switching to a waxless seal with soft music in the background",
+ "Close-up of a rusty, dirty toilet flange on the bathroom floor",
+ "Person wearing white gloves cleaning the flange",
+ "Audio mentions waxless rings are beginner-friendly and allow for corrections",
+ "Sound of a toilet flushing at the end of this segment",
+ "Close-up of a person with white gloves continuing to clean the flange",
+ "Narrator elaborates on adjusting the toilet and commitment once the wax is set",
+ "Mentions more tips are available on the channel",
+ "Clip shows a manual being held by gloved hands detailing installation steps for 'Korky WaxFREE Toilet Seal & Bolts'",
+ "Person points out specific instructions in the manual, cautioning about flange height and neoprene ring adjustments",
+ "Red object removed from the box and mounted on the floor",
+ "Narrator invites viewers to leave questions in the comments",
+ "End of the video indicated with 'Part 2? Follow' prompt"
+ ]
+ },
+ {
+ "video": "video/80.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a close-up of a man wearing black gloves holding a bowl of creamy macaroni and cheese. The man gestures with excitement and talks to the video. The background is a typical kitchen featuring a refrigerator and various appliances. The view shifts to the man pouring milk into a large copper-colored pot on the stovetop. This action is accompanied by the audio: \"Today I'm going to show you guys how to make a three-ingredient mac and cheese. It is the easiest recipe you'll ever try and you probably have all the ingredients in your kitchen right now. Alright, first to a pot we're going to add in four cups of milk.\"\n\nNext, the man stands by the stovetop, gesturing with his hands, before using a wooden spoon to stir the milk, with steam rising to suggest it's being heated. The audio narrates: \"Then on medium heat, we're going to bring this to a slight boil. Make sure to stir it around every once in a while so the milk doesn't burn. Once it comes to a boil, we're going to add in 12 ounces of pasta.\" The man then gradually adds macaroni to the pot, stirring continuously to mix the ingredients.\n\nThe pot is steamed, and the man adds freshly ground black pepper, captured on audio as: \"On medium-low heat, we're gonna cook this for about 10 to 15 minutes until all the milk gets absorbed by the pasta. I know I only said three ingredients, but if you wanna spice it up just a little bit, you can add in some peppers and some paprika.\" To enhance the flavor, the man pours a red liquid, likely hot sauce, into the pot.\n\nIn the follow-up sequence, the man continues to stir the pot, ensuring an even mix of macaroni, milk, black pepper, and possibly hot sauce. The scene transitions to an overhead shot of the creamy mixture being stirred. 
He then adds shredded cheddar cheese, mixing it until fully melted and blended as heard in the audio: \"About 15 minutes later it should be looking like this, now we want to take it off the heat and add in seven ounces of cheddar cheese and we're gonna mix until everything is well incorporated and the cheese has melted.\" \n\nSubsequent footage shows the man meticulously stirring the pot, ensuring the cheese integrates smoothly to create a creamy sauce. The scene shifts to the man, having transferred the mac and cheese into a blue bowl and taking a bite, expressing approval. He puts the bowl down on the table, and then steps back, showing approval and satisfaction. The audio states: \"Alright, let's give this a try. This is stupid.\"\n\nThe video wraps up with the man at a table, enjoying the mac and cheese. Gesturing with his hands, he conveys high satisfaction with the dish. The video ends with the audio: \"I'm eating mac and cheese for the rest of my life.\" The background remains consistently a kitchen setting, aligning with the comforting, homemade nature of the recipe. And the background music plays along."
+ }
+ ],
+ "events": [
+ "A close-up of a man wearing black gloves holding a bowl of creamy macaroni and cheese.",
+ "The man gestures with excitement and talks to the video.",
+ "The background is a typical kitchen featuring a refrigerator and various appliances.",
+ "The view shifts to the man pouring milk into a large copper-colored pot on the stovetop.",
+ "Audio: 'Today I'm going to show you guys how to make a three-ingredient mac and cheese. It is the easiest recipe you'll ever try and you probably have all the ingredients in your kitchen right now. Alright, first to a pot we're going to add in four cups of milk.'",
+ "The man stands by the stovetop, gesturing with his hands.",
+ "The man uses a wooden spoon to stir the milk, with steam rising.",
+ "Audio: 'Then on medium heat, we're going to bring this to a slight boil. Make sure to stir it around every once in a while so the milk doesn't burn. Once it comes to a boil, we're going to add in 12 ounces of pasta.'",
+ "The man gradually adds macaroni to the pot, stirring continuously to mix the ingredients.",
+ "The pot is steamed.",
+ "The man adds freshly ground black pepper.",
+ "Audio: 'On medium-low heat, we're gonna cook this for about 10 to 15 minutes until all the milk gets absorbed by the pasta. I know I only said three ingredients, but if you wanna spice it up just a little bit, you can add in some peppers and some paprika.'",
+ "The man pours a red liquid, likely hot sauce, into the pot.",
+ "The man continues to stir the pot, ensuring an even mix of macaroni, milk, black pepper, and possibly hot sauce.",
+ "Scene transitions to an overhead shot of the creamy mixture being stirred.",
+ "The man adds shredded cheddar cheese, mixing it until fully melted and blended.",
+ "Audio: 'About 15 minutes later it should be looking like this, now we want to take it off the heat and add in seven ounces of cheddar cheese and we're gonna mix until everything is well incorporated and the cheese has melted.'",
+ "The man meticulously stirs the pot, ensuring the cheese integrates smoothly to create a creamy sauce.",
+ "Scene shifts to the man having transferred the mac and cheese into a blue bowl and taking a bite, expressing approval.",
+ "The man puts the bowl down on the table, and then steps back, showing approval and satisfaction.",
+ "Audio: 'Alright, let's give this a try. This is stupid.'",
+ "The man is at a table, enjoying the mac and cheese.",
+ "The man gestures with his hands, conveying high satisfaction with the dish.",
+ "Audio: 'I'm eating mac and cheese for the rest of my life.'",
+ "Background remains consistently a kitchen setting.",
+ "Background music plays along."
+ ]
+ },
+ {
+ "video": "video/81.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man standing on a wooden deck surrounded by trees. He is wearing a yellow t-shirt with the text 'Have a Beautiful Day' and holding a striped mug. The text 'Increase Your Energy' appears on the screen. The man gestures with his hand while holding the mug, saying: \"Hello my friend, here are some ways you can increase your energy levels naturally. I love coffee, but it's not something we should rely on long term. In fact, most energy drinks and stimulants don't actually give you much energy.\" as the scene transitions to a close-up of him sitting at a table with the same mug in front of him. The final scene shows a close-up of several colorful energy drink cans with condensation on him, with the text 'Avoid Reliance on Energy Drinks & Stimulants' suggesting to avoid these products.\n\nNext, a close-up shot of a white mug filled with coffee beans appears, with the text 'Try Cycling Your Coffee: Drink it Every Other Day!' suggesting a strategy for coffee consumption. The scene transitions to a roaring bear with the text 'Fight or Flight?', indicating a discussion about stress responses. Then, a hand holds several coffee beans with the text 'Don't Numb Your Exhaustion,' emphasizing the importance of addressing fatigue instead of masking it. The final scene features the man in the yellow t-shirt holding two small oranges, with the text 'Food is Fuel' promoting the idea of using food for energy. The audio continues: \"They often just make us less sensitive to our exhaustion and or use our body's reserves, which are supposed to be reserved for special occasions. Think of these as painkillers for your exhaustion. What actually gives us energy are calories, also known as food and drink.\"\n\nNext, a close-up of a plate of creamy pasta with 'Save Heavy Meals For Later' suggests that heavy meals should be eaten later. The man indoors holding instant noodles with 'Shin' in Korean characters follows. 
A close-up of sliced bread on a cutting board with 'High Glycemic Index' highlights the high glycemic index of the bread. The clip ends with a close-up of various berries, directing to 'More info: www.glycemicindex.com.' The voice explains: \"I personally avoid heavy meals early in the day. If you're tired of feeling tired after you eat, you may want to avoid foods with a high glycemic index. Foods with a low glycemic index allow your body to absorb their sugars more slowly.\"\n\nNext, the man in the yellow t-shirt gestures with his hands. Following this, he pours water from a green bottle into a glass, with 'Drink Water' on screen, emphasizing hydration. He continues holding the glass, and 'Dehydration Lowers Energy' and 'Injury & Stress Reduces Energy' highlight the negative impact of dehydration. The final scene features a digital animation of a human body with highlighted internal organs. The audio continues: \"Getting enough water is non-negotiable though. Everyone needs water. Pure health. Lack of water is just one of the many ways we can strain our body. More strain means more energy is spent energy.\"\n\nThe scene transitions to the man in the yellow t-shirt performing arm exercises indoors with a clock behind him. Next, a man sleeping on his side with 'Sleep Long & Deep 8 Hours Daily' emphasizes the importance of sleep. The man in the yellow t-shirt holds a smartphone with 'Reduce Electronic Usage 2 Hours Before Bed', suggesting better sleep hygiene. The final scene shows a woman outdoors performing pull-up exercises. The voice says: \"We heal and recharge when we sleep, so make sure you're getting enough sleep. Staying off electronics for two hours before bed can help you sleep. Exercise in general can also improve your sleep quality.\"\n\nThe woman in a purple outfit performs pull-up exercises on an outdoor fitness bar, promoting physical fitness. 
Indoors, the man in the yellow t-shirt performs a handstand against a wall, with 'Inverse Exercises' on the screen. A serene beach at sunset follows, with 'Use Low Energy Days to Rest,' promoting rest. 'Recharge & Return Stronger' reinforces recovery. The final scene features him outdoors, holding a drink with 'Follow & Like For More :)' encouraging viewers to engage. The finishing audio message is: \"Although if you're just feeling tired, sometimes a quick handstand against the wall can rejuvenate you. And lastly, we should allow ourselves to feel tired sometimes. Our energy should ebb and flow like a tide. Have a beautiful day.\""
+ }
+ ],
+ "events": [
+ "Man standing on a wooden deck surrounded by trees",
+ "Man wearing a yellow t-shirt with the text 'Have a Beautiful Day'",
+ "Man holding a striped mug",
+ "Text 'Increase Your Energy' appears",
+ "Man gestures with hand while holding mug, starts speaking about increasing energy levels naturally",
+ "Man says 'Hello my friend, here are some ways you can increase your energy levels naturally. I love coffee, but it's not something we should rely on long term. In fact, most energy drinks and stimulants don't actually give you much energy.'",
+ "Scene transitions to a close-up of man sitting at a table with the same mug",
+ "Close-up of several colorful energy drink cans with condensation",
+ "Text 'Avoid Reliance on Energy Drinks & Stimulants' appears",
+ "Close-up shot of a white mug filled with coffee beans",
+ "Text 'Try Cycling Your Coffee: Drink it Every Other Day!' appears",
+ "Scene transitions to a roaring bear",
+ "Text 'Fight or Flight?' appears",
+ "Hand holds several coffee beans",
+ "Text 'Don't Numb Your Exhaustion' appears",
+ "Final scene: Man in yellow t-shirt holding two small oranges",
+ "Text 'Food is Fuel' appears",
+ "Audio continues: 'They often just make us less sensitive to our exhaustion and or use our body's reserves, which are supposed to be reserved for special occasions. Think of these as painkillers for your exhaustion. What actually gives us energy are calories, also known as food and drink.'",
+ "Close-up of a plate of creamy pasta",
+ "Text 'Save Heavy Meals For Later' appears",
+ "Man indoors holding instant noodles with 'Shin' in Korean characters",
+ "Close-up of sliced bread on a cutting board",
+ "Text 'High Glycemic Index' appears",
+ "Close-up of various berries",
+ "Text 'More info: www.glycemicindex.com' appears",
+ "Audio explains: 'I personally avoid heavy meals early in the day. If you're tired of feeling tired after you eat, you may want to avoid foods with a high glycemic index. Foods with a low glycemic index allow your body to absorb their sugars more slowly.'",
+ "Man in yellow t-shirt gestures with hands",
+ "Man pours water from a green bottle into a glass",
+ "Text 'Drink Water' appears",
+ "Man continues holding glass",
+ "Text 'Dehydration Lowers Energy' appears",
+ "Text 'Injury & Stress Reduces Energy' appears",
+ "Digital animation of a human body with highlighted internal organs",
+ "Audio continues: 'Getting enough water is non-negotiable though. Everyone needs water. Pure health. Lack of water is just one of the many ways we can strain our body. More strain means more energy is spent energy.'",
+ "Scene transitions to man in yellow t-shirt performing arm exercises indoors with clock behind him",
+ "Man sleeping on his side",
+ "Text 'Sleep Long & Deep 8 Hours Daily' appears",
+ "Man in yellow t-shirt holds a smartphone",
+ "Text 'Reduce Electronic Usage 2 Hours Before Bed' appears",
+ "Woman outdoors performing pull-up exercises",
+ "Audio says: 'We heal and recharge when we sleep, so make sure you're getting enough sleep. Staying off electronics for two hours before bed can help you sleep. Exercise in general can also improve your sleep quality.'",
+ "Woman in purple outfit performing pull-up exercises on outdoor fitness bar",
+ "Man indoors performs a handstand against a wall",
+ "Text 'Inverse Exercises' appears",
+ "Serene beach at sunset",
+ "Text 'Use Low Energy Days to Rest' appears",
+ "Text 'Recharge & Return Stronger' appears",
+ "Final scene: man outdoors holding a drink",
+ "Text 'Follow & Like For More :)' appears",
+ "Finishing audio message: 'Although if you're just feeling tired, sometimes a quick handstand against the wall can rejuvenate you. And lastly, we should allow ourselves to feel tired sometimes. Our energy should ebb and flow like a tide. Have a beautiful day.'"
+ ]
+ },
+ {
+ "video": "video/82.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with an aerial view of a bustling construction site, identifiable as the Wuhan Huoshenshan Hospital. The date '1\u670830\u65e5' (January 30) is displayed in the top right corner. Numerous white buildings with blue rooftops, cranes, and construction vehicles are actively working, creating a sense of urgency and coordinated effort. Tall residential buildings in the background suggest the urban setting of Wuhan. As the camera angle shifts for a closer look at the construction, the rapid development of the hospital becomes evident. The text 'ABC NEWS PRIME' and 'HOSPITAL BUILT OVERNIGHT' along with '600,000 SQ FT FACILITY HASTILY CONSTRUCTED IN WUHAN' provides context about the unprecedented speed and scale of the project. \n\nThe audio features a woman saying, \"`Welcome back to ABC News Live. You are looking at a time lapse of the incredible effort to build a hospital in just 10 days.`\"\n\nThe scene progresses with another detailed aerial view of the construction site. The same date, '1\u670830\u65e5' (January 30), appears in the top right corner, with the previously mentioned texts still overlaid at the bottom. The camera offers varying perspectives of cranes and buildings in operation. A shift in time is indicated by '1\u670831\u65e5' (January 31) appearing in the top right corner towards the end. \n\nThe audio follows with a woman stating, \"`Can you imagine that? The 600,000 square foot facility started treating patients today in Wuhan, China, the epicenter of the coronavirus outbreak.`\" followed by a pause and then she continues speaking.\n\nThe visual focus remains on the construction site with the date '1\u670831\u65e5' (January 31) in the top right corner. Multiple views highlight the continued rapid progress. An overlay of '2\u67081\u65e5' (February 1) marks further advancement. The site, along with its surroundings including a body of water and additional buildings, is captured comprehensively. 
The clip transitions indoors to a studio setting, where a person in a brown suit and red-striped tie stands in front of a blue backdrop with 'GMA' text.\n\nThe audio cuts in with, \"A crew of 7,000 people built the hospital, which has 1,000 beds. A second hospital with 500 more beds is due to open later this week,\" followed by some background music.\n\nNext, the studio setting is in sharp focus, with the individual continuing to speak, gesturing occasionally. The backdrop features the blue 'GMA' screen with a white overlay listing 'SUBSCRIBE NOW' and 'KEEP WATCHING' in gray. Visible through a window behind the person is a building labeled 'ONE VANDERBILT,' situating the studio in an urban locale.\n\nThe man's voice is heard saying, \"Hi, everyone, George Stephanopoulos here. Thanks for checking out the ABC News YouTube channel. If you'd like to get more videos, show highlights, and watch live event coverage, click on the right over here to subscribe to our channel.\" accompanied by the click of a computer mouse.\n\nThe closing clip maintains the focus on the person in the studio, reinforcing the studio environment and the call to action for audience engagement. The textual prompts to 'SUBSCRIBE NOW' and 'KEEP WATCHING' remain present.\n\nThe final audio remarks, \"And don't forget to download the ABC News app for breaking news alerts. Thanks for watching.\" delivered in a slightly robotic tone as the screen fades out."
+ }
+ ],
+ "events": [
+ "Aerial view of Wuhan Huoshenshan Hospital construction site.",
+ "Date '1\u670830\u65e5' (January 30) displayed in the top right corner.",
+ "Numerous white buildings with blue rooftops seen.",
+ "Cranes and construction vehicles actively working on site.",
+ "Tall residential buildings visible in the background indicating urban Wuhan setting.",
+ "Camera angle shifts for closer look at the construction.",
+ "Rapid development of the hospital observed.",
+ "Text overlay: 'ABC NEWS PRIME', 'HOSPITAL BUILT OVERNIGHT', '600,000 SQ FT FACILITY HASTILY CONSTRUCTED IN WUHAN'.",
+ "Audio features a woman saying, 'Welcome back to ABC News Live. You are looking at a time lapse of the incredible effort to build a hospital in just 10 days.'.",
+ "Another detailed aerial view of the construction site.",
+ "Date '1\u670830\u65e5' (January 30) still shown in the top right corner.",
+ "Texts 'ABC NEWS PRIME', 'HOSPITAL BUILT OVERNIGHT', '600,000 SQ FT FACILITY HASTILY CONSTRUCTED IN WUHAN' remains overlaid.",
+ "Camera offers varying perspectives of cranes and buildings in operation.",
+ "Time shift indicated by '1\u670831\u65e5' (January 31) in the top right corner.",
+ "Audio with a woman stating, 'Can you imagine that? The 600,000 square foot facility started treating patients today in Wuhan, China, the epicenter of the coronavirus outbreak.'.",
+ "Visual focus on construction site with date '1\u670831\u65e5' (January 31) in the top right corner.",
+ "Multiple views show continued rapid progress.",
+ "Overlay of '2\u67081\u65e5' (February 1) marks further advancement.",
+ "Site and its surroundings, including a body of water and buildings, captured comprehensively.",
+ "Transition to an indoor studio setting.",
+ "Person in a brown suit and red-striped tie stands in front of a blue backdrop with 'GMA' text.",
+ "Audio: 'A crew of 7,000 people built THE HOSPITAL, WHICH HAS 1,000 BEDS. A SECOND HOSPITAL WITH 500 MORE BEDS IS DUE TO OPEN LATER THIS WEEK', followed by background music.",
+ "Studio setting in sharp focus.",
+ "Individual continues speaking, gesturing occasionally.",
+ "Backdrop features blue 'GMA' screen with 'SUBSCRIBE NOW' and 'KEEP WATCHING' in gray.",
+ "Visible through a window is a building labeled 'ONE VANDERBILT', indicating urban location of studio.",
+ "Man's voice: 'Hi,everyone, George Stephanopoulos here. Thanks for checking out the ABC News YouTube channel. If you'd like to get more videos, show highlights, and watch live event coverage, click on the right over here to subscribe to our channel.' accompanied by computer mouse click.",
+ "Closing clip maintains focus on person in studio, reinforcing studio environment and call to action for audience engagement.",
+ "Text prompts 'SUBSCRIBE NOW' and 'KEEP WATCHING' remain present.",
+ "Final audio: 'And don't forget to download the ABC News app for breaking news alerts. Thanks for watching.' delivered in slightly robotic tone as screen fades out."
+ ]
+ },
+ {
+ "video": "video/83.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man in a plaid shirt with shades of orange, blue, and white, gesturing with raised and open hands against a dark brown background. This scene transitions to an animated character in a blue outfit with a red scarf and a green cloak, standing with crossed arms amidst falling snow -a clear indication of a wintery setting. The text 'ANIME NO MI' appears on the screen, marking the scene as likely from an anime. Throughout this transition, a deep male voice can be heard, expressing, \"Monkey D. Dragon, gonna come to this arc. I mean, this arc already is deeply connected to the backstory of so many characters, and of course one of them is Dragon.\"\n\nMaintaining the thematic continuity, the text 'ANIME NO MI' is still present. The scene then shifts to another picture of the same character, Monkey D. Dragon, with long, dark hair, a green cloak, and a muscular scarred chest set against a warm, orange-hued background representative of either sunrise or sunset. As this transition occurs, the audio shifts seamlessly, revealing a continuation of the previous dialogue: \"We literally saw him talking to Vegapunk at the beginning of the arc. And now, when one of the Gorosei members left the Holy land!\"\n\nNext, the focus moves to a character with white, curly hair and a large white beard, dressed in a black suit, white shirt, and purple tie, holding a stack of papers and actively writing or drawing with a pen. This scene features a golden object and a blue backdrop. The video then shifts to a grassy field where a different set of animated characters gather: one in the foreground with long, dark hair and a blue cloak, \n showing a muscular chest; another in pink attire and black fishnet stockings with blue, curly hair stands in the background. In the backdrop is a character with a green hat and thick lips and another one in a yellow outfit, holding a long tube. The camera pans right, showing another character donned in red, holding a pipe. 
The ambiance remains steady as the audio introduces a different male voice, saying, \"Do you think Dragon is stupid enough to let this golden opportunity to go? If Dragon is any of smart.\"\n\nThe scene transitions yet again to another image of Monkey D. Dragon, with long, dark hair, cloaked in a purple suit with an orange shirt and a green cloak, displaying a facial scar. A character with purple hair and lips, wearing a pearl necklace, is on the left of the video, and a character, Nami, with orange hair, is on the right, while another character, wearing purple and orange glasses, with white and orange hair, is at the top. This segment has a red backdrop, enhancing the dramatic effect. During this visual progression, we hear the audio, \"If he have a lick of sense about the policy of the world, he will not allow this opportunity to go in vain.\"\n\nIn the final clip, we return to the character with white, curly hair and a large white beard, dressed in a black suit and purple tie with hands clasped and a serious expression. The blue background with glowing elements hints at a critical moment. The video effectively wraps up with the initial man in the plaid shirt, gesticulating with hands raised against the consistent dark brown background. The audio now crescendos with urgency as the speaker commands, \"He should attack Meereen so we can capture one of the Gorosei. I will not respect Dragon Intelligence if we miss this opportunity.\" "
+ }
+ ],
+ "events": [
+ "A man in a plaid shirt with shades of orange, blue, and white gestures with raised and open hands against a dark brown background.",
+ "Transition to an animated character in a blue outfit with a red scarf and a green cloak, standing with crossed arms amidst falling snow, indicating a wintery setting.",
+ "Text 'ANIME NO MI' appears on the screen.",
+ "Deep male voice audio expresses: 'Monkey D. Dragon, gonna come to this arc. I mean, this arc already is deeply connected to the backstory of so many characters, and of course one of them is Dragon.'",
+ "The text 'ANIME NO MI' is still present.",
+ "Scene shifts to Monkey D. Dragon with long, dark hair, a green cloak, and muscular scarred chest set against an orange-hued background.",
+ "Audio continues: 'We literally saw him talking to Vegapunk at the beginning of the arc. And now, when one of the Gorosei members left the Holy land!'",
+ "Scene shifts to a character with white, curly hair and a large white beard, dressed in a black suit, white shirt, and purple tie, holding a stack of papers and writing or drawing with a pen. Background features a golden object and a blue backdrop.",
+ "Focus moves to a grassy field with multiple animated characters: one in the foreground with long, dark hair and a blue cloak showing a muscular chest; another in pink attire and black fishnet stockings with blue, curly hair stands in the background; a character with a green hat and thick lips and another in a yellow outfit holding a long tube in the backdrop; the camera pans right to show another character in red holding a pipe.",
+ "Audio introduces a different male voice saying: 'Do you think Dragon is stupid enough to let this golden opportunity to go? If Dragon is any of smart.'",
+ "Scene transitions to another image of Monkey D. Dragon with long, dark hair, cloaked in a purple suit with an orange shirt and a green cloak, displaying a facial scar. Left of the video shows a character with purple hair and lips wearing a pearl necklace, right shows Nami with orange hair, and the top has a character wearing purple and orange glasses with white and orange hair. Background is red.",
+ "Audio continues: 'If he have a lick of sense about the policy of the world, he will not allow this opportunity to go in vain.'",
+ "Final clip returns to the character with white, curly hair and a large white beard, dressed in a black suit and purple tie with hands clasped and a serious expression. Blue background with glowing elements.",
+ "Video wraps up with the initial man in a plaid shirt gesticulating with hands raised against the dark brown background.",
+ "Audio crescendos with urgency: 'He should attack Meereen so we can capture one of the Gorosei. I will not respect Dragon Intelligence if we miss this opportunity.'"
+ ]
+ },
+ {
+ "video": "video/84.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video introduces its theme by showing an image of a woman in a traditional Japanese bridal kimono, adorned with intricate floral patterns and vibrant colors. The overlaid text in English reads, \"Why Brides Brought Katana to Their Wedding.\" A narration starts questioning why Japanese brides brought katanas to their weddings. The scene shifts to a man in a green kimono seated in a traditional Japanese room, explaining in English that in Japan, there used to be a custom of giving a 'guardian katana' to the daughter getting married. Text appears at the bottom of the screen, reinforcing this explanation, followed by an illustration of a bride in a red kimono holding a short katana, adding that women from samurai families traditionally kept tanto (short katana). A man's voice speaks.\n\nThe illustration transitions to a visual of a traditional Japanese tanto with a black and gold handle and sheath, as the video narrates that these katanas were kept between their kimono sashes or bosoms for protection, symbolizing their maturity and adding a protective measure for brides. The overlaid text provides this additional information. The scene features women in colorful kimonos, elaborating on the cultural practice. Detailed images of elegantly designed tanto display the significance of these items. This practice underscored not only the bride's maturity but also her resolve to protect herself, as articulated by the male narrator, accompanied by music.\n\nIllustrated visuals continue as two women in dark blue and purple kimonos with floral patterns appear, emphasizing the bride's determination to protect herself in case of emergency, with text reiterating the message. Further graphics show women seated among various items, amplifying the theme by stating the bride's resolution not to return home no matter what happens. 
The scene transitions to a woman in a black kimono standing on a scenic balcony, with text beneath underscoring her will to protect her chastity from anyone other than her husband. Then the video shifts back to the man. The narration points out, however, that the enforcement of the \"Abolition of Katana Law\" during the Meiji period changed these practices.\n\nThe sequence shifts to show a group of people in western attire with vibrant decorations, contextualizing that after the \"Abolition of Katana Law\" during the Meiji era and the \"Firearms Law\" after World War II, this bridal custom dwindled. The man in the green kimono reappears, explaining that the culture of brides carrying katanas faded, with the screen turning gray for a while. \"Nowadays, when a bride is getting married in a Japanese kimono,\" the scene alludes to the modern customs which have replaced older traditions, described by a male voice with music.\n\nA closer look is taken at a decorative piece resembling a katana incorporated into a bride's kimono. The text in Japanese '\u61d0\u5263' (Kaiken) and its meaning is explained, showing how it's positioned in modern wedding attire, not as a real katana, but rather just a symbolic piece of cardboard. Then two images of women standing, wearing floral kimonos, are shown. The man in the green kimono expresses a personal connection by sharing that he is a big fan of katanas and has two daughters. The audio reveals the speaker's wish to give his daughters actual katanas if they ever get married. The male narrator voices his insights.\n\nIn the final scene, a family portrait is shared, showing the man in traditional attire holding a baby and a young girl standing next to him, all dressed in kimonos. The text expresses a personal wish to give real katanas to his daughters if they marry, strengthening the narrative with intimate family visuals. 
A heartfelt narration concludes the video with the family smiling, emphasizing the cultural and personal significance of this tradition through the spoken words of a man, echoed by another male voice. The video closes on a warm, familial note, tying together historical context and personal aspirations."
+ }
+ ],
+ "events": [
+ "Video introduces theme with an image of a woman in a traditional Japanese bridal kimono.",
+ "Overlay text in English reads 'Why Brides Brought Katana to Their Wedding.'",
+ "Narrator questions why Japanese brides brought katanas to their weddings.",
+ "Scene shifts to a man in a green kimono seated in a traditional Japanese room.",
+ "Man in green kimono explains the custom of giving a 'guardian katana' to the daughter getting married.",
+ "Text at the bottom of the screen reinforces this explanation.",
+ "Illustration of a bride in a red kimono holding a short katana.",
+ "Text adds that women from samurai families traditionally kept tanto (short katana).",
+ "Narration describes the significance of the tanto for protection and symbolizing maturity.",
+ "Visual shows a traditional Japanese tanto with a black and gold handle and sheath.",
+ "Overlaid text provides additional information on brides keeping tanto for protection.",
+ "Scene features women in colorful kimonos, elaborating on cultural practice.",
+ "Detailed images of elegantly designed tanto are shown.",
+ "Male narrator articulates the practice underscored the bride's maturity and resolve.",
+ "Illustration of two women in dark blue and purple kimonos emphasizing bride's determination to protect herself.",
+ "Text reiterates the bride's resolution not to return home no matter what happens.",
+ "Scene transitions to woman in black kimono standing on scenic balcony.",
+ "Text beneath underscores her will to protect her chastity from anyone other than her husband.",
+ "Video shifts back to man in green kimono.",
+ "Narration mentions the 'Abolition of Katana Law' during the Meiji period changed these practices.",
+ "Sequence shifts to group of people in western attire with vibrant decorations.",
+ "Contextualizes that these bridal customs dwindled after the 'Abolition of Katana Law' and 'Firearms Law' after World War II.",
+ "Man in green kimono reappears explaining the fading of this bridal custom.",
+ "Screen turns gray for a while.",
+ "Male voice describes modern customs replacing older traditions.",
+ "Closer look at decorative piece resembling a katana in bride's kimono.",
+ "Text in Japanese '\u61d0\u5263' (Kaiken) and its meaning is explained.",
+ "Shows how it's positioned in modern wedding attire as a symbolic piece of cardboard.",
+ "Two images of women standing, wearing floral kimonos, are shown.",
+ "Man in green kimono expresses personal connection, sharing love for katanas and mention of two daughters.",
+ "Audio reveals speaker's wish to give his daughters actual katanas if they marry.",
+ "Male narrator voices his insights.",
+ "Final scene with family portrait showing man in traditional attire holding baby and young girl dressed in kimonos.",
+ "Text expresses personal wish to give real katanas to daughters if they marry.",
+ "Heartfelt narration concludes video with family smiling.",
+ "Emphasizes cultural and personal significance of the tradition through spoken words of a man.",
+ "Video closes on a warm familial note."
+ ]
+ },
+ {
+ "video": "video/85.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with three men, and the man in the middle uses crutches with the other two's help, suggesting he might be recovering from physical injuries. They are dressed in formal attire, implying they are at an important event. In the next frames, the man using crutches moves forward with difficulty in a setting like a public place or venue. Eventually, this man is on a stage holding a trophy and dressed in a black outfit with a red armband and a silver strap across the chest. Another man with glasses, also dressed in black, stands beside him. The screen displays the texts corresponding to the real-time audio throughout the video. What the man with a red armband says is shown with yellow text, while what the man wearing glasses says is shown with white text. The audio reveals the start of the conversation, \"Uhhh..Uhhh wait yeah...\" Accompanied by cheering crowd sounds and music, the man with glasses further punctuates by saying, \"I wanna know what happened. I don't know about them.\"\n\nThe scene focuses on the two men standing behind a table with the trophy placed on it. The man in black with the red armband gestures towards the trophy, showing disbelief or surprise. Leaning in, the individual says \"rlly now? I'm sorry about this. Very sorry. You really wanna know what happened?\" suggesting an apology or explanation is being made. The man with glasses responds with 'Yeah, what happened?' while holding a crutch, indicating they are also part of this conversation. The interaction continues with the man with glasses adding \"You can't just come out like that,\" and amidst cheers from the crowd, the sounds of a woman speaking and the crowd responding enthusiastically can be heard.\n\nNext, the two men are still behind the table with the trophy. The man in the black outfit and red armband holds the trophy with a helpless face, while the other, holding a crutch, is smiling expectantly. 
Gesturing with the crutch, the man continues \"and not tell us what happened. I mean...\" suggesting they are about to explain. The man with the red armband looks on with attention. The text and audio both articulate, 'You really wanna know?' followed by enthusiastic cheering and applause.\n\nThe video continues with the two men still by the table and trophy; the man with the red armband leans forward, gesturing, as the text appears in yellow. He describes his actions, leading to the text 'I was dancing, then I went into a spin,' while the other man, crutch in hand, listens attentively. As this explanation continues, the colorful abstract background sets the scene. The crowd's cheering accompanies the man's voice narrating the unfolding story.\n\nWith a seamless transition, the explanation lingers as the man with the red armband goes on to say, 'and I twisted my ankle very badly' with yellow text highlighting the severity of the injury. As he pauses, another text 'but first...' suggests a change in the subject. The attentive listener, the man with glasses, still holding the crutch, is questioned by the man with the red armband, 'Eddie, I heard you do have a new album out. You're the one who's DANGEROUS' shown in yellow on the screen, adding a new layer to their conversation. The cheering crowd provides a backdrop to the man's voice.\n\nIn the final scene, the men remain at the table with the trophy. The man with the trophy listens as the other, holding the crutch, speaks. The text asserts, 'See I thought...' and 'I thought...' in white, hinting that the crutch-holder is trying to say something but hesitates. Gesturing, the crutch-holder signals the other speaker with 'Just finish,' displayed prominently. The exchange wraps up with 'Don't drag me into this,' by the crutch-holder, bringing a humorous and light conclusion to their banter, as the crowd's applause and cheering echo in the background, coinciding with the man's voice stating the same."
+ }
+ ],
+ "events": [
+ "Three men appear at the start, with the middle man using crutches and getting help from the others.",
+ "Men dressed in formal attire, suggesting an important event.",
+ "Man using crutches moves with difficulty in a public setting or venue.",
+ "Man on stage holds a trophy, dressed in black outfit with red armband and silver chest strap.",
+ "Another man with glasses, dressed in black, stands beside the trophy holder.",
+ "On-screen text aligns with the real-time audio throughout the video.",
+ "Yellow text represents what the man with the red armband says.",
+ "White text displays what the man wearing glasses says.",
+ "Audio reveals the beginning of the conversation with the man holding a trophy stating, 'Uhhh..Uhhh wait yeah...'.",
+ "Cheering crowd sounds and music in the background.",
+ "Man with glasses says, 'I wanna know what happened. I don't know about them.'",
+ "Scene focuses on two men standing behind a table with the trophy on it.",
+ "Man in black with red armband gestures towards the trophy, showing disbelief or surprise.",
+ "Man with the red armband apologizes, saying, 'rlly now? I'm sorry about this. Very sorry. You really wanna know what happened?'",
+ "Man with glasses replies, 'Yeah, what happened?', holding onto a crutch.",
+ "Interaction continues with man with glasses adding, 'You can't just come out like that.'",
+ "Woman speaking and crowd responding enthusiastically heard.",
+ "Man with black outfit and red armband holds the trophy with a helpless look.",
+ "Man with crutch smiles expectantly.",
+ "Man with crutch gestures with the crutch saying 'and not tell us what happened. I mean...,'",
+ "Man with red armband looks on attentively.",
+ "Text and audio express, 'You really wanna know?' followed by crowd\u2019s enthusiastic cheering and applause.",
+ "Two men remain by the table and trophy.",
+ "Man with red armband leans forward and gestures as yellow text appears.",
+ "Man with red armband narrates actions, saying 'I was dancing, then I went into a spin.'",
+ "Man with a crutch listens attentively.",
+ "Colorful abstract background enhances the scene.",
+ "Crowd cheers while the man narrates the story.",
+ "Man with red armband adds, 'and I twisted my ankle very badly,' in yellow text.",
+ "Man pauses and text 'but first...' hints a topic change.",
+ "Man with red armband questions man with glasses, 'Eddie, I heard you do have a new album out. You're the one who's DANGEROUS,' in yellow text.",
+ "Cheering crowd in the background.",
+ "Men continue to stand at the table with the trophy.",
+ "Man with trophy listens to the man with a crutch speak.",
+ "Text in white asserts, 'See I thought...' and 'I thought...' indicating hesitation.",
+ "Crutch-holder gestures to the speaker, with 'Just finish' displayed prominently.",
+ "Exchange ends humorously with crutch-holder saying, 'Don't drag me into this,' amidst crowd\u2019s applause and cheering.",
+ "Man\u2019s voice coincides with the text."
+ ]
+ },
+ {
+ "video": "video/86.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a boy standing outdoors on a patio, holding several baseballs in both hands, wearing a black cap, a dark hoodie with a bull logo, and gray shorts. A black dog walks in the background. The text overlay on the video reads, \"BEST WAY TO GET BALLS FROM PLAYERS AT A GAME IS TO BE POLITE AND MAKE IT A GAME!\" The TikTok handle '@thebullpentraining' is also visible. The scene then transitions to a baseball field where the same boy is holding a pink drawstring bag, looking over a railing towards the field, and interacting with players. The background shows a crowd of people and a batting cage, indicating a baseball game or practice session. The audio reveals a man's voice saying, \"Best way to get balls from players at a game is to be polite and make it a game. And Mr. Hernandez, rock, paper, scissors for a ball? Rock, paper, scissors,\" showcasing a playful interaction. A woman and others talk in the background.\n\nThe video continues at the baseball field, showing the boy from behind, still holding the pink drawstring bag and interacting with a baseball player in a black and yellow uniform with a purple headband. The boy gestures and points towards the player who holds a baseball and looks up, engaging in the conversation. Their conversation ends with them showing different gestures. The player in a black and yellow uniform waves and leaves. The background captures the green field and a fence, while the text overlay \"SOMETIMES YOU TAKE AN L\" appears, indicating that the attempt might not have been successful. The TikTok handle '@thebullpentraining' remains visible. Audio shifts to a man's voice saying, \"Sometimes you take a nap,\" followed by a group of people talking and laughing, adding to the dynamic atmosphere.\n\nThe camera shifts to another angle, showing the boy from behind. The boy, still holding a pink baseball glove, continues interacting with the player below in the black and yellow uniform. 
Engagement continues with the player holding a baseball, while the background shows the field and other players in the bullpen area. The text overlay changes to \"OTHERS YOU TAKE AN L, BUT STILL GET A W WITH THE BALL,\" suggesting a triumph over initial challenges. The player in the black and yellow uniform throws the ball to the boy and gives him a thumbs-up. The audio complements the scene with a man saying, \"Others you've taken out, but still get a W with the ball.\" Laughter and chatter fill the background, emphasizing the lively environment.\n\nThis leads to a transition where the boy is standing outdoors, now wearing a black cap, a black long-sleeve shirt with a yellow 'P' logo, and sunglasses, displaying happiness and excitement, likely due to acquiring the baseball. The lively background remains with other people and the baseball field in view. Audio catches a man saying, \"Oh, he still got it anyway. Man right there. Let's go. I'm going to ask Carlos to play rock, paper, scissors, shoot. Carlos,\" followed by a woman's voice and ambient conversations. Background music plays, enhancing the cheerful vibe.\n\nThe video continues with the man in the black cap and hoodie walking to the railing and looking down. Many people in colorful clothes lean over the railing, too. The scene transitions to the man leaning over the railing, interacting with a baseball player in a black and yellow uniform who sits on a bench. The lively atmosphere features another man in a white jersey with 'MCCUTCHEON' on the back, also leaning over the railing and observing this. The baseball player in a black and yellow uniform now stands up and takes a ball. The interaction is polite and fun, as the text overlay highlights, \"THAT'S HOW IT'S DONE. PLAYERS LIKE TO HAVE FUN TOO. DON'T JUST YELL BALL!\" The TikTok handle '@thebullpentraining' is visible. Audio captures a man's voice saying, \"That's how it's done. Players like to have fun too. 
Don't just yell ball,\" accompanied by birds chirping and a dog barking in the background, adding to the serene yet lively scene.\n\nIn the final clip, the interaction at the baseball field continues similarly, with the man leaning over the railing and engaging with the player in the black and yellow uniform, number 18. Then, the baseball player in a black and yellow uniform throws a ball to the man in the black cap and hoodie. The additional man with 'MCCUTCHEON' on his jersey remains in sight. The TikTok handle '@thebullpentraining' is shown. This interaction concludes the video with a consistent theme of polite and fun engagement with baseball players, creating a friendly and interactive environment."
+ }
+ ],
+ "events": [
+ "Boy standing outdoors on a patio holding several baseballs in both hands",
+ "Boy wearing a black cap, dark hoodie with a bull logo, and gray shorts",
+ "A black dog walking in the background",
+ "Text overlay reads: 'BEST WAY TO GET BALLS FROM PLAYERS AT A GAME IS TO BE POLITE AND MAKE IT A GAME!'",
+ "TikTok handle '@thebullpentraining' visible",
+ "Scene transitions to a baseball field",
+ "Boy holding a pink drawstring bag, looking over a railing towards the field, interacting with players",
+ "Background shows a crowd of people and a batting cage, indicating a baseball game or practice session",
+ "Man's voice says: 'Best way to get balls from players at a game is to be polite and make it a game. And Mr. Hernandez, rock, paper, scissors for a ball? Rock, paper, scissors,'",
+ "Boy seen from behind, holding pink drawstring bag, interacting with a player in black and yellow uniform with a purple headband",
+ "Boy gesturing and pointing towards a player who holds a baseball and looks up",
+ "Player in black and yellow uniform waves and leaves",
+ "Background shows a green field and a fence",
+ "Text overlay reads: 'SOMETIMES YOU TAKE AN L'",
+ "TikTok handle '@thebullpentraining' visible",
+ "Audio shifts to a man's voice saying: 'Sometimes you take a nap,'",
+ "Group of people talking and laughing in the background",
+ "Camera shifts to another angle, showing the boy from behind, holding a pink baseball glove",
+ "Boy continues interacting with the player below in the black and yellow uniform",
+ "Background shows the field and other players in the bullpen area",
+ "Text overlay changes to: 'OTHERS YOU TAKE AN L, BUT STILL GET A W WITH THE BALL,'",
+ "Player in the black and yellow uniform throws the ball to the boy and gives him a thumbs-up",
+ "Audio complements with man's voice saying: 'Others you've taken out, but still get a W with the ball,'",
+ "Laughter and chatter in the background",
+ "Transition to the boy standing outdoors, now wearing a black cap, black long-sleeve shirt with yellow 'P' logo, and sunglasses",
+ "Boy displays happiness and excitement, likely due to acquiring the baseball",
+ "Lively background remains with other people and the baseball field in view",
+ "Audio catches a man saying: 'Oh, he still got it anyway. Man right there. Let's go. I'm going to ask Carlos to play rock, paper, scissors, shoot. Carlos,'",
+ "Woman's voice and ambient conversations in the background",
+ "Background music plays, enhancing the cheerful vibe",
+ "Man in black cap and hoodie walking to the railing and looking down",
+ "Many people in colorful clothes leaning over the railing",
+ "Scene transitions to man leaning over the railing interacting with a baseball player in black and yellow uniform who sits on a bench",
+ "Lively atmosphere features another man in white jersey with 'MCCUTCHEON' on the back, leaning over the railing and observing",
+ "Baseball player in black and yellow uniform stands up and takes a ball",
+ "Polite and fun interaction",
+ "Text overlay reads: 'THAT'S HOW IT'S DONE. PLAYERS LIKE TO HAVE FUN TOO. DON'T JUST YELL BALL!'",
+ "TikTok handle '@thebullpentraining' visible",
+ "Audio captures a man's voice saying: 'That's how it's done. Players like to have fun too. Don't just yell ball,'",
+ "Birds chirping and a dog barking in the background",
+ "Final clip shows interaction at the baseball field continuing similarly",
+ "Man leaning over the railing engaging with the player in black and yellow uniform number 18",
+ "Baseball player in black and yellow uniform throws ball to man in black cap and hoodie",
+ "Additional man with 'MCCUTCHEON' jersey remains in sight",
+ "TikTok handle '@thebullpentraining' shown",
+ "Video concludes with theme of polite and fun engagement with baseball players creating friendly interactive environment"
+ ]
+ },
+ {
+ "video": "video/87.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man wearing a dark gray V-neck shirt, gesturing with his hand against a simple indoor background with a dark-colored object in the back. A blue logo resembling the Windows logo appears on the man's chest, followed by the text \"SCREENSHOT TRICKS\" in bold red letters. The scene transitions to a light blue background with \"1\" in the center. \n\nThe audio reveals a man's voice introducing the topic: \"Three Windows screenshot tricks revealed.\" The man continues: \"Number one, on your keyboard, press the print screen key. That'll take a screenshot of your entire desktop. It'll place it on your clipboard. You can now paste that anywhere. Combine that with the Windows logo key and it'll automatically save your screenshot under Pictures and Screenshots. Press the Alt key together with Print Screen and that'll take a screenshot of just the currently active window. Combine that with the Windows logo key and it'll automatically save your screenshot to OneDrive and Captures. Number two, you can press the Windows logo key together with Shift and S. Number three, in Edge, press control shift and S. Here, you can take a screenshot of just a specific area. Or you can also click an entire website with a scrolling screenshot.\" \n\n\nA close-up of a keyboard highlights the 'prt sc' (Print Screen) key, with the text \"PRINT SCREEN\" at the bottom. The next frame shows the man in a circular frame above a laptop screen displaying a document with the 'Kevin Cookie Company' logo and the text \"SCREENSHOT ENTIRE DESKTOP.\" \n\nThe text \"PASTE WITH CTRL + V\" appears at the bottom. A close-up of a keyboard highlights the 'Windows' key and the 'prt sc' (Print Screen) key, with the text \"WINDOWS KEY + PRINT SCREEN\" at the bottom. A folder named \"Screenshots\" within the 'Pictures' directory is shown, with a file named \"Screenshot.png\" highlighted. A mouse moves around it. The text \"AUTOMATICALLY SAVES SCREENSHOT\" explains the key combination. 
\n\nThe video progression includes a close-up of a keyboard highlighting the 'alt' key and the 'prt sc' (Print Screen) key, with the text \"ALT + PRINT SCREEN\" at the bottom. The scene transitions to a view of the 'Kevin Cookie Company' document again, and the text \"SCREENSHOT ACTIVE WINDOW\" appears at the bottom. A close-up of a keyboard highlights the 'Windows' key, 'alt' key, and 'prt sc' (Print Screen) key, with the text \"WINDOWS + ALT + PRINT SCREEN\" at the bottom. \n\nThe narrative grows as a file explorer window shows the 'Captures' folder within 'Kevin - Personal,' with \"The Unbelievable Cookie.docx Word 9.29.2022 2,55,30 PM.png\" highlighted. The text \"AUTOMATICALLY SAVES SCREENSHOT\" appears at the bottom. The scene transitions to a light blue background with \"2\" in the center. A close-up of a keyboard highlights the 'shift' key, the 'Windows' key, and the 'S' key, with the text \"WINDOWS + SHIFT + S.\" The man appears in the circular frame above the laptop screen displaying the 'Kevin Cookie Company' document, showing a screenshot.\n\nThe scene transitions to a light blue background with \"3\" in the center. The scene transitions to a close-up of a keyboard highlighting the 'ctrl', 'shift', and 'S' keys, with the text \"CTRL + SHIFT + S.\" The final frame shows the 'Kevin Cookie Company' website with a red banner at the top, a promo code 'BEST' for a $1 discount, and an 'Order Now' button.\n\nThe laptop screen changes to display a video thumbnail titled \"Kevin Cookie Company - Official Co...,\" and the text \"SCROLLING SCREENSHOT\" appears at the bottom. Finally, the screen displays a section of the website with the heading \"Deliciousness in every bite!\" and a descriptive paragraph. \n"
+ }
+ ],
+ "events": [
+ "Man wearing a dark gray V-neck shirt appears on screen.",
+ "Man gestures with his hand against a simple indoor background.",
+ "Dark-colored object seen in the background.",
+ "A blue logo resembling the Windows logo appears on the man's chest.",
+ "Text 'SCREENSHOT TRICKS' in bold red letters appears.",
+ "Scene transitions to a light blue background with '1' in the center.",
+ "Audio reveals man's voice introducing the topic: 'Three Windows screenshot tricks revealed.'",
+ "Audio continues with the description of the first trick involving the print screen key.",
+ "Close-up of a keyboard highlighting the 'prt sc' (Print Screen) key with 'PRINT SCREEN' text at the bottom.",
+ "Man reappears in a circular frame above a laptop screen displaying a document with 'Kevin Cookie Company' logo and 'SCREENSHOT ENTIRE DESKTOP' text.",
+ "Text 'PASTE WITH CTRL + V' appears at the bottom.",
+ "Close-up of a keyboard highlighting the 'Windows' key and 'prt sc' (Print Screen) key with 'WINDOWS KEY + PRINT SCREEN' text at the bottom.",
+ "Folder named 'Screenshots' within the 'Pictures' directory is shown with 'Screenshot.png' highlighted and mouse moving around it.",
+ "Text 'AUTOMATICALLY SAVES SCREENSHOT' explains the key combination.",
+ "Close-up of a keyboard highlighting the 'alt' key and 'prt sc' (Print Screen) key with 'ALT + PRINT SCREEN' text at the bottom.",
+ "View of 'Kevin Cookie Company' document again with 'SCREENSHOT ACTIVE WINDOW' text at the bottom.",
+ "Close-up of a keyboard highlighting 'Windows' key, 'alt' key, and 'prt sc' (Print Screen) key with 'WINDOWS + ALT + PRINT SCREEN' text at the bottom.",
+ "File explorer window shows the 'Captures' folder within 'Kevin - Personal' with 'The Unbelievable Cookie.docx Word 9.29.2022 2:55:30 PM.png' highlighted.",
+ "Text 'AUTOMATICALLY SAVES SCREENSHOT' appears at the bottom.",
+ "Scene transitions to a light blue background with '2' in the center.",
+ "Close-up of a keyboard highlighting the 'shift' key, 'Windows' key, and 'S' key with 'WINDOWS + SHIFT + S' text.",
+ "Man reappears in circular frame above laptop screen displaying 'Kevin Cookie Company' document and showing a screenshot.",
+ "Scene transitions to a light blue background with '3' in the center.",
+ "Close-up of a keyboard highlighting the 'ctrl', 'shift', and 'S' keys with 'CTRL + SHIFT + S' text.",
+ "Final frame shows 'Kevin Cookie Company' website with a red banner, promo code 'BEST' for $1 discount, and 'Order Now' button.",
+ "Laptop screen changes to display video thumbnail titled 'Kevin Cookie Company - Official Co...' and 'SCROLLING SCREENSHOT' text at the bottom.",
+ "Screen displays section of the website with heading 'Deliciousness in every bite!' and a descriptive paragraph."
+ ]
+ },
+ {
+ "video": "video/88.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video offers a comprehensive step-by-step guide on how to clean a dirty garbage disposal, encompassing both visual instructions and auditory guidance.\n\nIt opens with a close-up of a stainless steel sink housing an 'insinkerator pro disposer.' A man wearing blue gloves reaches into the disposal and removes the rubber splash guard, revealing a grimy, food-debris-filled interior. The screen displays the text 'How to Clean That Dirty Disposal,' setting the instructional tone. As the man scrapes off grime and loosens two screws using a tool, we hear a man say, \"One of my most asked questions on here is how do I clean my garbage disposal? Well, let's do it. This one was a little extra gross, so I'm going to knock these teeth loose with a flathead and a hammer.\" In the background, we hear the sound of knocking.\n\nThe footage then zooms in on the disposal's interior as water flows into it, swirling and rinsing away loosened debris. Next, a foamy substance is added, creating a thick layer of foam together with flowing water as the disposal spins, aiding in the cleaning process. We hear the man continue, \"Run some water and then run some dish soap through it. But we are not done. That's just the start. You guys have convinced me. I'll admit it.\" while the disposal cleaning sound chimes softly in the background. The next frame reveals a cleaner disposal, with much of the initial grime gone.\n\nAs the process advances, we see ice cubes being placed into the disposal by the gloved hands. The ice cubes scattered around are added one by one, eventually filling the disposal. They are ground up as the disposal spins, with water aiding in creating a swirling motion that further cleans the interior. Concurrently, the man comments, \"The ice method does work. You can see how dirty it still is with the soap. Let's throw some ice in there. 
And look at that beautiful slosh of grease and ice and water.\" set against a backdrop of continuous music.\n\nNext, the focus shifts back to the foamy interior of the disposal as it spins. The foam gradually dissipates, and the disposal grinds to a halt, making way for a cleaner surface beneath. The narrator enthusiastically observes, \"Mm-hmm, it's a masterpiece. Let's slow it down here, see how it looks. And you remember how it looked before? Uh yeah, it does not look like that anymore. Is it perfect?\" his voice accompanied by music, before trailing off to let the cleanliness of the disposal speak for itself.\n\nIn the concluding clip, we get a close-up of the disposal's interior, now cleaner with its metal components spinning rapidly. As the spinning slows down, text appears on the screen urging 'Follow for more!' The narrator concludes with, \"No, but it's a garbage disposal. It cleans up some nasty stuff. Follow for more DIY homeowner and plumbing tips.\" while music continues to play. The video ends with the disposal coming to a complete stop, its interior shining with minimal debris and just a few small rust spots remaining.\n\nThis cohesive narrative combines clear visual steps and detailed audio commentary, providing a thorough guide to effectively cleaning a dirty garbage disposal."
+ }
+ ],
+ "events": [
+ "Video opens with a close-up of a stainless steel sink housing an 'insinkerator pro disposer'",
+ "Man wearing blue gloves reaches into the disposal",
+ "Man removes the rubber splash guard",
+ "Interior of the disposal is revealed, grimy and filled with food debris",
+ "Text 'How to Clean That Dirty Disposal' appears on the screen",
+ "Man scrapes off grime using a tool",
+ "Man loosens two screws using the tool",
+ "Audio: Man says, 'One of my most asked questions on here is how do I clean my garbage disposal? Well, let's do it. This one was a little extra gross, so I'm going to knock these teeth loose with a flathead and a hammer.'",
+ "Background sound of knocking",
+ "Footage zooms in on the disposal's interior as water flows into it",
+ "Water swirls and rinses away loosened debris",
+ "Foamy substance is added to the disposal",
+ "Thick layer of foam is created with the flowing water as the disposal spins",
+ "Audio: Man says, 'Run some water and then run some dish soap through it. But we are not done. That\u2019s just the start. You guys have convinced me. I'll admit it.'",
+ "Disposal cleaning sound chimes softly in the background",
+ "Frame reveals a cleaner disposal with much of the initial grime gone",
+ "Ice cubes are placed into the disposal by gloved hands",
+ "Ice cubes are added one by one until the disposal is filled",
+ "Ice cubes are ground up as the disposal spins",
+ "Water aids in creating a swirling motion that further cleans the interior",
+ "Audio: Man says, 'The ice method does work. You can see how dirty it still is with the soap. Let's throw some ice in there. And look at that beautiful slosh of grease and ice and water.'",
+ "Continuous music plays in the background",
+ "Focus shifts to the foamy interior of the disposal as it spins",
+ "Foam gradually dissipates",
+ "Disposal grinds to a halt, revealing a cleaner surface",
+ "Audio: Man says, 'Mm-hmm, it\u2019s a masterpiece. Let's slow it down here, see how it looks. And you remember how it looked before? Uh yeah, it does not look like that anymore. Is it perfect?'",
+ "Music continues to play",
+ "Close-up of the disposal's now cleaner interior, metal components spinning rapidly",
+ "Spinning slows down",
+ "Text 'Follow for more!' appears on the screen",
+ "Audio: Man says, 'No, but it\u2019s a garbage disposal. It cleans up some nasty stuff. Follow for more DIY homeowner and plumbing tips.'",
+ "Music continues to play",
+ "Video ends with the disposal coming to a complete stop",
+ "Interior of the disposal is shining with minimal debris and a few small rust spots remaining"
+ ]
+ },
+ {
+ "video": "video/89.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a woman holding a pair of scissors in one hand and two red hair ties in the other. Overhead, the text \"IT'S THAT TIME AGAIN\" with three small scissors icons appears. She uses a purple comb to section her curly hair, securing the top section with a pink hair clip. The woman brushes her hair back and lifts each side with her hands, holding hair ties. A text overlay shows a thumbnail of a YouTube video titled \"MY EASY/UPDATED DIY CURLY HAIR CUT (the pigtails cut)\" by \"Manes by Mell\" with 1.3M views. She then holds up a bottle of hair product, dispensing some into her hand.\n\nThe audio reveals a woman speaking: \"Oh, you know what time it is. I am going to be trimming my hair by doing the now famous pigtails cut technique. But we're gonna do it a little bit differently. I'm just gonna apply a little bit of a cutting lotion...\" Background music plays.\n\nThe woman rubs the product between her hands and shows her hands covered in the product before applying it to her curly hair, with a pink hair clip securing the top section. She holds her hair on the right side and shows the length she intends to keep. She works the product through each strand, ensuring it is well-coated. A text overlay appears, reading \"If you struggle with a triangular or 'bell bottom' shape to your haircut, this will help!\"\n\nThe audio continues with the woman saying: \"...AKA a light leave-in conditioner. And I am going to be customizing this technique so that the ends are not so bottom-heavy. Right in here is what I want to remove.\" Music continues to play.\n\nUsing a black comb, the woman sections her curly hair and ties the hair on her left side with a red hair tie. She points to her hair on her left side to show where she is going to cut. She brushes through the hair until it is smooth and tangle-free. The woman wears a black cape printed with 'MANES BY MELL', indicating a focus on hair care or styling. 
A text overlay reads, \"Contrast to the original placement in my YT video, this will create graduation.\" Wearing the \"MANES BY MELL\" cape, she then holds the hair on her right, which is tied with a red hair tie.\n\nDuring this clip, the woman narrates: \"I split my hair in two and I need to brush it really smooth and then place the hair tie very carefully in the hair. I have placed the hair tie further back and lower down on my head which isn't quite going to layer my hair but it is gonna give the ends a graduated effect, if I wanted to remove more weight up top I would place...\"\n\nShe holds the ponytail high on her right side and brushes the hair. Then she ties it with the hair tie again. She then moves slightly to show the difference between a higher and a lower ponytail. The clip closes with the woman holding both pigtails, tied behind her ears, showing them evenly tied and prepared for the next haircut step.\n\nAccompanying this scene, the woman's voice says: \"...my ponytail higher. And if I wanted to create layers that are shorter in the front and longer in the back, then I would place the hair tie all the way up in the front. See the difference? But all I'm looking to achieve...\" The narration is enthusiastic and high-pitched, with frequent pauses.\n\nThe woman holds the hair on her left and cuts it using scissors. Then she holds her ponytails taut. She moves her hand closer to the camera, and the scene transitions to the woman showing her freshly cut curly hair, now shorter and more voluminous. She is now wearing a spotted purple strapless dress. A text overlay appears: \"Full video on my YT Channel\" with a camera emoji.\n\nThe woman concludes the audio description: \"...is a slightly rounder shape. I will point cut to keep it soft around the edges like this. Ta-da!\" as the music begins to play."
+ }
+ ],
+ "events": [
+ "Woman holds scissors and two red hair ties.",
+ "Text 'IT'S THAT TIME AGAIN' with three scissors icons appears overhead.",
+ "Woman uses a purple comb to section her curly hair.",
+ "Woman secures top section with a pink hair clip.",
+ "Woman brushes her hair back.",
+ "Woman lifts each side of her hair with her hands, holding hair ties.",
+ "Thumbnail of 'MY EASY/UPDATED DIY CURLY HAIR CUT (the pigtails cut)' by 'Manes by Mell' with 1.3M views appears.",
+ "Woman holds up a bottle of hair product, dispensing some into her hand.",
+ "Audio: Woman says she will trim her hair using the pigtails cut technique.",
+ "Background music plays.",
+ "Woman rubs the product between her hands.",
+ "Woman shows her hands covered in the product.",
+ "Woman applies the product to her curly hair.",
+ "Pink hair clip secures the top section again.",
+ "Woman holds her hair on the right side, showing the length she will keep.",
+ "Woman works the product through each strand.",
+ "Text overlay reading 'If you struggle with a triangular or 'bell bottom' shape to your haircut, this will help!' appears.",
+ "Audio: Woman describes the product as a light leave-in conditioner.",
+ "Woman says she will customize the technique to avoid a bottom-heavy look.",
+ "Background music continues.",
+ "Woman uses a black comb to section her curly hair.",
+ "Woman ties the hair on her left side with a red hair tie.",
+ "Woman points to her hair on her left side, showing where she will cut.",
+ "Woman brushes through the hair until smooth and tangle-free.",
+ "Woman wears a black cape printed with 'MANES BY MELL'.",
+ "Text overlay reads, 'Contrast to the original placement in my YT video, this will create graduation'.",
+ "Woman holds the hair on her right, tied with a red hair tie.",
+ "Audio: Woman describes splitting her hair in two and brushing it smooth.",
+ "Woman places hair tie carefully in the hair.",
+ "Woman places hair tie further back and lower down on her head.",
+ "Audio: Woman explains this technique won't layer the hair but will give a graduated effect.",
+ "Woman brushes the ponytail high on her right side.",
+ "Woman ties the ponytail with the hair tie again.",
+ "Woman shows differences between higher and lower ponytails.",
+ "Woman holds both pigtails tied behind her ears, showing them evenly tied.",
+ "Audio: Woman describes creating layers by placing the hair tie higher.",
+ "Woman shows differences for creating shorter front layers.",
+ "Audio: Woman describes achieving a rounder shape.",
+ "Woman narrates enthusiastically and with frequent pauses.",
+ "Woman holds hair on her left and cuts it using scissors.",
+ "Woman holds ponytails taut.",
+ "Woman's hand moves closer to the camera, transitioning the scene.",
+ "Woman shows freshly cut curly hair, now shorter and more voluminous.",
+ "Woman wears a spotted purple strapless dress.",
+ "Text overlay: 'Full video on my YT Channel' with a camera emoji appears.",
+ "Audio: Woman concludes with wanting a rounder shape and keeping edges soft.",
+ "Audio: Woman says 'Ta-da!' as music begins to play."
+ ]
+ },
+ {
+ "video": "video/90.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a woman sitting in a chair, holding a piece of paper and reading it. She is dressed in a light gray sweater and a pearl necklace. Beside her, to the left, a man in a light blue sweater and light-colored shorts leans over to look at the paper, while to her right, another woman in a red sweater with the text 'Don't Be A Morning Person' stands and also looks at the paper. The text 'Pavin Smith' and a bird logo are on the top right of the screen and the text 'INSIDE EDITION' is on the bottom left of the screen. The background reveals this scene to be indoors, likely in a living room, with a framed picture on the wall. The woman continues reading the paper, engaging both individuals beside her in the moment. The audio captures someone saying, \"Okay, just home and love. The words are small. These parents are about to be brought to tears by their own child.\" The voices of a woman speaking, music playing, and another woman speaking can be heard.\n\nThe scene transitions to a sports setting, showing the back of a man wearing a red baseball jersey with 'Smith' and the number '17.' This man holds a baseball bat under his shoulder, standing against a dark background. Above, a sign reads 'CHASE FIELD' with 'HOME OF THE DIAMONDBACKS' underneath, placing the setting in a baseball stadium. It then transitions to another man wearing an orange shirt, raising his right arm in a baseball stadium with many spectators in the background. Someone in the audio says, \"21-year-old Pavin Smith was the seventh overall pick in this year's amateur baseball draft and got a $5 million bonus when he signed with the Arizona Diamondbacks.\" Throughout, the audio contains alternating speaking parts by a woman.\n\nThen, a blurred transition effect leads back to the original indoor setting. 
Again, the woman in the light gray sweater and pearl necklace is seen reading the paper, with the man in a light blue sweater and another woman in a red sweater peering over at the paper. This intimate, focused moment continues with the woman reading the paper. The audio describes, \"He used it to give his parents an incredible Christmas gift. Thank you for raising me in a great home filled with love because of all the sacrifices you made for our family.\" The clip has music playing while a woman speaks.\n\nNext, the scene shows the woman still reading, with the woman in the red sweater moving out of the frame, leaving the woman and the man in the light blue sweater engaged with the paper, both now crying. The woman continues reading, maintaining an emotional and concentrated atmosphere. The audio reveals, \"Smith's mother and father were overcome with emotion when he reveals in his heartfelt note that he paid off their mortgage.\" The background includes the sound of a woman crying, a man speaking, and music playing.\n\nFurther, the scene remains similar with the woman reading the paper. The man in the light blue sweater stands up and leaves, shifting the focus entirely to the woman. Briefly, the scene switches to a man in a dark shirt with a white logo in the same room with bright windows. Returning to the original setting, the woman carries on reading. The audio expresses, \"While Smith says the gift doesn't make up for what his parents have given him, it sure seems to mean a whole lot to them.\" With music in the background, a woman's voice speaks.\n\nFinally, the video closes with the woman reading the paper, becoming emotional, and wiping her eyes. The video then transitions to the 'Inside Edition' logo on a black background, where text encourages viewers to 'LIKE US ON FACEBOOK', 'FOLLOW US ON TWITTER', and 'SUBSCRIBE TO YouTube'. 
The audio concludes with the statement, \"For InsideEdition.com, I'm Mara Montalbano,\" while the music continues to play. The screen finishes with the 'Inside Edition' logo and the text 'Check Local Listings © MMXVI INSIDE EDITION INC. ALL RIGHTS RESERVED.' displayed."
+ }
+ ],
+ "events": [
+ "Woman sits in a chair holding and reading a piece of paper",
+ "Woman is dressed in a light gray sweater and a pearl necklace",
+ "Man in light blue sweater and light-colored shorts leans over to look at the paper",
+ "Another woman in red sweater with 'Don't Be A Morning Person' stands and looks at the paper",
+ "Text 'Pavin Smith' and bird logo on top right of the screen",
+ "Text 'INSIDE EDITION' on bottom left of the screen",
+ "Setting appears to be indoors in a living room with a framed picture on the wall",
+ "Woman continues reading the paper, engaging with individuals beside her",
+ "Audio captures someone saying, 'Okay, just home and love. The words are small. These parents are about to be brought to tears by their own child.'",
+ "Voices of a woman speaking, music playing, and another woman speaking can be heard",
+ "Transition to a sports setting showing back of a man in a red baseball jersey with 'Smith' and number '17'",
+ "Man holds a baseball bat under his shoulder, setting is in a baseball stadium",
+ "Sign reads 'CHASE FIELD' with 'HOME OF THE DIAMONDBACKS' underneath",
+ "Transition to a man in an orange shirt in a stadium raising his right arm with viewers in the background",
+ "Audio says, '21-year-old Pavin Smith was the seventh overall pick in this year's amateur baseball draft and got a $5 million bonus when he signed with the Arizona Diamondbacks.'",
+ "Blurred transition effect back to original indoor setting",
+ "Woman in light gray sweater and pearl necklace reading the paper again",
+ "Man in light blue sweater and woman in red sweater peering over at the paper",
+ "Woman continues reading, audio says, 'He used it to give his parents an incredible Christmas gift. Thank you for raising me in a great home filled with love because of all the sacrifices you made for our family.'",
+ "Music playing while a woman speaks",
+ "Woman still reading, woman in red sweater moves out of the frame",
+ "Woman and man in light blue sweater engaged with the paper and start crying",
+ "Audio states, 'Smith's mother and father were overcome with emotion when he reveals in his heartfelt note that he paid off their mortgage.'",
+ "Background sound of a woman crying, a man speaking, and music playing",
+ "Woman continues reading the paper",
+ "Man in light blue sweater stands up and leaves, focus shifts to woman",
+ "Brief switch to a man in a dark shirt with a white logo in the same room with bright windows",
+ "Return to original setting with woman reading the paper",
+ "Audio expresses, 'While Smith says the gift doesn't make up for what his parents have given him, it sure seems to mean a whole lot to them.'",
+ "Music in the background, a woman's voice speaks",
+ "Video closes with woman reading the paper, becoming emotional and wiping her eyes",
+ "Transition to 'Inside Edition' logo on black background",
+ "Texts encourage viewers to 'LIKE US ON FACEBOOK', 'FOLLOW US ON TWITTER', and 'SUBSCRIBE TO YouTube'",
+ "Audio concludes with, 'For InsideEdition.com, I'm Mara Montalbano,' while music continues",
+ "Screen finishes with 'Inside Edition' logo and text 'Check Local Listings @ MMXVI INSIDE EDITION INC. ALL RIGHTS RESERVED.' displayed"
+ ]
+ },
+ {
+ "video": "video/91.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a young girl in a modern kitchen wearing a floral dress. She holds a metal measuring cup filled with white sugar and pours it into a large bowl of melted butter. She mixes the sugar and butter with a red-handled whisk until it becomes smooth and yellow. She then cracks two eggs into the bowl and continues whisking until the mixture is well combined. The background shows white cabinets, a stainless steel refrigerator, and kitchen appliances. Meanwhile, a woman's voice introduces, \"This is my five-year-old daughter, and she used to make a lot of content on my channel when I first put it up in 2020. But slowly, over the years, she decided, you know what?\"\n\nThe young girl continues working in the kitchen, now vigorously whisking the yellow mixture. She sets the whisk aside, picks up a small bottle of vanilla extract, unscrews the cap, and pours a small amount into the bowl. Then, she pours a spoonful of the white powder into the bowl and whisks. Pink text appears at the bottom of the screen: 'She didn't like voiceovers \ud83d\ude02, but she did want to make these brownies.' The mixture now has a darker color. She pours the yellow mixture into another bowl containing a chocolate mixture. The audio continues with a woman saying, \"Maybe content is not for me. And I often get asked the question, do you want your daughter to bake just like you do? And the answer is actually no. And it's not that baking isn't a viable career and it's not that I don't love doing it myself, but I want her to be genuinely happy and to follow her own dreams. Now I know that all sounds incredibly cliché because, to be fair, it kind of is. However, I was told very much the opposite my whole life, not live your dreams but do what is going to make you money and what is going to just keep you on the straight and narrow road. 
That way of thinking definitely led me to a level of even and steady success.\"\n\nThe girl is now mixing the yellow and chocolate mixtures together with a white spatula until it becomes a uniform, dark brown color. She picks up a red bowl, pours flour into the mixture, and then holds a bag of chocolate chips, indicating they are the next ingredient.\n\nContinuing, the young girl holds a pink bag of chocolate chips and pours them into the dark brown batter. She thoroughly mixes them with the spatula, ensuring even distribution. The modern kitchen remains constant in the background. She then pours the thoroughly mixed batter into a square baking pan and spreads it evenly. The girl occasionally raises her head to communicate with the photographer.\n\nThe girl continues her preparation by holding a square baking pan with dark brown brownie batter. She picks up a blue bag labeled 'Chips' and sprinkles more chocolate chips onto the batter. She carefully distributes them on the surface. The background remains the same.\n\nIn the final scenes, the girl takes the square baking pan filled with dark brown brownie batter topped with chocolate chips. She walks to another room and places it inside the oven. The scene transitions to a close-up of the baked brownies with a glossy surface and melted chocolate chips. The final shot shows the girl holding a pink plate with a piece of brownie topped with a small blue and white decoration. The audio concludes with a young woman excitedly saying, \"However, if I had taken every word of that to heart, I don't know if I would have been able to create the following that I have here on YouTube, which I'm super grateful for. By the way, Arya says her dream is to be an accountant.\"\n\nThe cohesive narrative from start to finish shows the girl's journey through the brownie-making process, effortlessly guided by the background conversations that describe a mother's thoughts on her daughter's baking and future."
+ }
+ ],
+ "events": [
+ "A young girl in a modern kitchen wearing a floral dress holds a metal measuring cup filled with white sugar.",
+ "She pours the sugar into a large bowl of melted butter.",
+ "She mixes the sugar and butter with a red-handled whisk until it becomes smooth and yellow.",
+ "She cracks two eggs into the bowl and continues whisking until the mixture is well combined.",
+ "The background shows white cabinets, a stainless steel refrigerator, and kitchen appliances.",
+ "A woman's voice introduces, 'This is my five-year-old daughter, and she used to make a lot of content on my channel when I first put it up in 2020. But slowly, over the years, she decided, you know what?'",
+ "The young girl continues working in the kitchen, now vigorously whisking the yellow mixture.",
+ "She sets the whisk aside, picks up a small bottle of vanilla extract, unscrews the cap, and pours a small amount into the bowl.",
+ "She pours a spoonful of the white powder into the bowl and whisk.",
+ "Pink text appears at the bottom of the screen: 'She didn't like voiceovers \ud83d\ude02, but she did want to make these brownies.'",
+ "The mixture now has a darker color.",
+ "She pours the yellow mixture into another bowl containing a chocolate mixture.",
+ "The audio continues with a woman saying, 'Maybe content is not for me. And I often get asked the question, do you want your daughter to bake just like you do? And the answer is actually no. And it's not that baking isn't a viable career and it's not that I don't love doing it myself, but I want her to be genuinely happy and to follow her own dreams. Now I know that all sounds incredibly cliché because, to be fair, it kind of is. However, I was told very much the opposite my whole life, not live your dreams but do what is going to make you money and what is going to just keep you on the straight and narrow road. That way of thinking definitely led me to a level of even and steady success.'",
+ "The girl is now mixing the yellow and chocolate mixtures together with a white spatula until it becomes a uniform, dark brown color.",
+ "She picks up a red bowl, pours flour into the mixture, and then holds a bag of chocolate chips, indicating they are the next ingredient.",
+ "The young girl holds a pink bag of chocolate chips and pours them into the dark brown batter.",
+ "She thoroughly mixes them with the spatula, ensuring even distribution.",
+ "The modern kitchen remains constant in the background.",
+ "She pours the thoroughly mixed batter into a square baking pan and spreads it evenly.",
+ "The girl occasionally raises her head to communicate with the photographer.",
+ "The girl continues her preparation by holding a square baking pan with dark brown brownie batter.",
+ "She picks up a blue bag labeled 'Chips' and sprinkles more chocolate chips onto the batter.",
+ "She carefully distributes them on the surface.",
+ "The background remains the same.",
+ "In the final scenes, the girl takes the square baking pan filled with dark brown brownie batter topped with chocolate chips.",
+ "She walks to another room and places it inside the oven.",
+ "The scene transitions to a close-up of the baked brownies with a glossy surface and melted chocolate chips.",
+ "The final shot shows the girl holding a pink plate with a piece of brownie topped with a small blue and white decoration.",
+ "The audio concludes with a young woman excitedly saying, 'However, if I had taken every word of that to heart, I don't know if I would have been able to create the following that I have here on YouTube, which I'm super grateful for. By the way, Arya says her dream is to be an accountant.'",
+ "The cohesive narrative from start to finish shows the girl's journey through the brownie-making process, effortlessly guided by the background conversations that describe a mother's thoughts on her daughter's baking and future."
+ ]
+ },
+ {
+ "video": "video/92.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a sniper scope aimed precisely at a police department building, signifying a meticulous mission setup within the game \"GTA 5 RP.\" The text '24 HOURS' appears, immediately followed by 'IN GTA 5 RP' and 'JOB,' clearly indicating a time-bound mission challenge within the game. The audio reveals a man saying, \"I spent 24 hours as a hitman in GTA 5 RP,\" setting the stage for an intense narrative. The on-screen scene shifts to a character in tactical gear descending the steps of the police department, as the man continues, \"My first job was to kill a cop, get his outfit, then go kill another cop, and then I get my final reward.\"\n\nThe scene transitions to a grim sight where the character stands over a body in the street, blood pooling around it, emphasizing the mission's lethal gravitas. The character is seen pointing a gun toward another individual holding their hands up in surrender. Next, the setting changes to a room filled with stacks of money, and the text 'REWARD' appears, symbolizing the bounty. The intense mission concludes with the character interacting with someone outside a Walmart store, with the voiceover saying, \"Hey, what's good, homie? Alright, so who you need killed right now?\"\n\nThe narrative then cuts to a different scene where a man in a beige shirt and light blue jeans stands in front of a police officer next to a black police SUV marked '1541.' The background is marked by the structure's blue awning and glass doors. The man holds a yellow rifle, and from various angles, their calm interaction unfolds. The text 'Discord.gg/ParadiseSRP' appears at the top, pointing to a specific role-play server. In the audio, the man says, \"Tell me. It's like 2Bravo or 7, whatever his name is. 2Bravo7? All right. Where are you patrolling at?\" The police officer responds calmly, \"Probably Play-Doh, yeah. 
All right, baby.`\"\n\nNext, the scene showcases the same man in beige and blue standing by another police officer in tactical gear, again near the SUV '1541.' The video transitions to a different figure, a man in a green jacket and dark pants, sprinting towards a sheriff's SUV parked close to an 'Ammu-Nation' store. As he opens the SUV door, a police officer emerges from the vehicle. The voiceover matches the visual with, \"`So 2507? Sir, are you 2507? Yeah, yeah, why? Just wondering, but I need a ride. I'm kinda lost. I don't really know where I'm going and shit.`\"\n\nThe story continues with the man in green walking away from the sheriff's SUV, the officer closely trailing him. An intense moment unfolds as the man suddenly turns and shoots the officer, leading him to the ground. As the man stands over the fallen officer, he takes the officer's clothes and runs to the car and drives the car away. The audio adds, \"`You know what I mean, okay let's get the out of here all right buddy I'm gonna take your clothes all right all right buddy I'm gonna take your ship I'm gonna hop in your car I'm gonna kill more cops, alright.`\" followed by tense background music and a gunshot.\n\nThe SUV speeds away with its sirens blaring, passing various streets and an 'Ammu-Nation' store before arriving at a gas station where it parks next to another sheriff's SUV. A police officer in tactical gear exits the vehicle and begins to survey the area. The audio matches the visuals with, \"`Thanks, buddy. Yo, what's good? Look what I caught, bro. Alright, who's next?\"\n\nIn the final clip, two police officers in tactical gear stand beside the sheriff's SUV '1541' at the gas station. One officer wears a beige shirt and light pants, while the other is in a black uniform. They converse and move around the SUV, observing the gas station surroundings. The audio captures a snippet of their conversation with, \"Bro. 
I'm trying to get a payday.`\"\n\nThe video intricately balances a narrative filled with mission objectives, police encounters, and brief moments of criminal undertakings, meshed with the spoken audio revealing the hitman's experiences and tasks within the dynamic world of \"`GTA 5 RP.`\""
+ }
+ ],
+ "events": [
+ "Sniper scope aimed at a police department building",
+ "Text '24 HOURS' appears, followed by 'IN GTA 5 RP' and 'JOB'",
+ "Audio: 'I spent 24 hours as a hitman in GTA 5 RP'",
+ "Character in tactical gear descends the steps of the police department",
+ "Audio: 'My first job was to kill a cop, get his outfit, then go kill another cop, and then I get my final reward.'",
+ "Character stands over a body with blood pooling around it",
+ "Character points a gun at a surrendering individual",
+ "Setting changes to a room filled with stacks of money",
+ "Text 'REWARD' appears",
+ "Character interacts with someone outside a Walmart store",
+ "Audio: 'Hey, what's good, homie? Alright, so who you need killed right now?'",
+ "Man in beige shirt and light blue jeans stands in front of a police officer next to a black police SUV marked '1541'",
+ "Background marked by a blue awning and glass doors",
+ "Man holds a yellow rifle and interacts with the police officer",
+ "Text 'Discord.gg/ParadiseSRP' appears at the top",
+ "Audio: 'Tell me. It's like 2Bravo or 7, whatever his name is. 2Bravo7? All right. Where are you patrolling at?'",
+ "Police officer responds: 'Probably Play-Doh, yeah. All right, baby.'",
+ "Man in beige and blue standing by another police officer in tactical gear near the SUV '1541'",
+ "Man in green jacket and dark pants runs towards a sheriff's SUV parked near an 'Ammu-Nation' store",
+ "Man opens the SUV door and a police officer emerges from the vehicle",
+ "Audio: 'So 2507? Sir, are you 2507? Yeah, yeah, why? Just wondering, but I need a ride. I'm kinda lost. I don't really know where I'm going and shit.'",
+ "Man in green walks away from the sheriff's SUV with officer trailing him",
+ "Man turns and shoots the officer, who falls to the ground",
+ "Man takes the officer's clothes and runs to the SUV",
+ "Man drives the SUV away with sirens blaring",
+ "Audio: 'You know what I mean, okay let's get the out of here all right buddy I'm gonna take your clothes all right all right buddy I'm gonna take your ship I'm gonna hop in your car I'm gonna kill more cops, alright.'",
+ "Tense background music and a gunshot",
+ "SUV speeds away, passing various streets and an 'Ammu-Nation' store",
+ "SUV arrives at a gas station, parking next to another sheriff's SUV",
+ "Police officer in tactical gear exits the vehicle and surveys the area",
+ "Audio: 'Thanks, buddy. Yo, what's good? Look what I caught, bro. Alright, who's next?'",
+ "Two police officers in tactical gear stand beside the sheriff's SUV '1541' at the gas station",
+ "One officer wears a beige shirt and light pants; the other is in a black uniform",
+ "Officers converse and move around the SUV",
+ "Audio: 'Bro. I'm trying to get a payday.'"
+ ]
+ },
+ {
+ "video": "video/93.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The audio features a beeping sound followed by a male voice providing a weather forecast, mentioning, \"When guests of 15 to 16 milli-hours are expected throughout the 65 milli-hours in up-routed areas, with the chance of guests of up to 65 milli-hours in parts of western Scotland where the clouds combine with the morning tide causing spray and deluge waves along the western coast, according to the Mid-Office Statute. Twenty neighborhoods could be, Flooded as the Environmental Protection Agency warned residents to take steps to protect themselves and their property. The agency also urged people to avoid contact walking or driving through the flooded water. However, the Mid Office has also issued a yellow warning covering large swaths of Scotland until 6 p.m. today as a severe storm could reach 75 million hours in the western parts of the country. A Western outlet warned that travel charts and potential delays are severe storms in the UK and northern reaches. The Meteorological Service warned that...\"\n\nThe video begins with an elevated view of a heavily flooded street. The muddy water sweeps away various objects including potted plants and furniture. A prominent purple sign with a white 'M' stands out in the foreground. The street is lined with buildings that have white and peach-colored facades, and outdoor furniture including a white umbrella is visible. The camera zooms in slightly to emphasize the objects being carried away by the floodwaters. The scene transitions to a series of blurry frames, likely due to heavy rain or mist, before clearing up to show an aerial view of a vast flooded landscape. The entire area is submerged in muddy water with only small trees and shrubs protruding. The sky is overcast with thick, dark clouds, indicating stormy weather. The camera pans across the landscape, showcasing the extensive flooding. An indoor setting appears, showing a flooded train station with water swirling around stairs and platforms. Signs reading 'Elizabeth Olympic Park' and 'Way out' are visible. The camera zooms in on the turbulent water inside the station, highlighting the flood's severity. Next, a close-up view continues the scene of the flooded train station. The water is still violently swirling around the stairs and platforms, with signs reading 'Way out' and 'Stratford High Street' visibly in the background. The camera then transitions to an outdoor setting, revealing a street with a black pickup truck parked in front of a white house amid heavy rain and mist. Finally, the scene shows a white house with black shutters and a black roof, slightly elevated above a street. Heavy rain and mist continue to obscure visibility. A black pickup truck drives into the frame from the right and parks in front of the house. The camera pans to the right, revealing more of the street and a large, leafless tree in the background. \n\nThroughout the video, the combination of visual and auditory elements emphasizes the severe impact of the flooding and stormy weather, urging viewers to understand the seriousness and take necessary precautions."
+ }
+ ],
+ "events": [
+ "Beeping sound is heard.",
+ "Male voice provides a weather forecast.",
+ "Weather forecast mentions guests of 15 to 16 milli-hours expected through 65 milli-hours.",
+ "Forecast mentions chance of guests up to 65 milli-hours in parts of western Scotland.",
+ "Clouds combine with the morning tide causing spray and deluge waves along western coast.",
+ "Mid-Office Statute predicts extensive flooding in twenty neighborhoods.",
+ "Environmental Protection Agency warns residents to protect themselves and their property.",
+ "Agency urges people to avoid contact walking or driving through flooded water.",
+ "Mid Office issues yellow warning for large swaths of Scotland until 6 p.m. today.",
+ "Severe storms in western parts of Scotland could reach 75 million hours.",
+ "Western outlet warns of severe storms and potential travel delays in the UK and northern reaches.",
+ "Meteorological Service issues another warning.",
+ "Visual starts with an elevated view of a heavily flooded street.",
+ "Muddy water sweeps away various objects including potted plants and furniture.",
+ "Prominent purple sign with a white 'M' is visible in the foreground.",
+ "Street is lined with buildings having white and peach-colored facades.",
+ "Outdoor furniture including a white umbrella is visible.",
+ "Camera zooms in slightly to emphasize objects being carried away by floodwaters.",
+ "Scene transitions to blurry frames likely due to heavy rain or mist.",
+ "Clear view shows an aerial view of a vast flooded landscape.",
+ "Entire area is submerged in muddy water with small trees and shrubs protruding.",
+ "Sky is overcast with thick, dark clouds indicating stormy weather.",
+ "Camera pans across the landscape showcasing extensive flooding.",
+ "Indoor setting appears showing a flooded train station.",
+ "Water swirls around stairs and platforms.",
+ "Signs reading 'Elizabeth Olympic Park' and 'Way out' are visible.",
+ "Camera zooms in on turbulent water inside the station emphasizing flood severity.",
+ "Close-up view continues showing the flooded train station.",
+ "Water still violently swirls around stairs and platforms.",
+ "Signs reading 'Way out' and 'Stratford High Street' are visible in the background.",
+ "Camera transitions to an outdoor setting revealing a street.",
+ "Black pickup truck is parked in front of a white house amid heavy rain and mist.",
+ "Scene shows a white house with black shutters and a black roof slightly elevated above the street.",
+ "Heavy rain and mist obscure visibility.",
+ "Black pickup truck drives into the frame from the right and parks in front of the house.",
+ "Camera pans to the right revealing more of the street and a large, leafless tree in the background."
+ ]
+ },
+ {
+ "video": "video/94.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with several shirtless men wearing different pants or shorts; one performs a handstand in front of a gym-like structure featuring blue bars and equipment. As he struggles to maintain his balance, text appears on the screen asking, \"`ARE YOU STRUGGLING WITH BALANCE IN YOUR HANDSTAND?`\", \"`DO YOU CONSTANTLY FALL OUT OF IT AFTER JUST A FEW SECONDS?`\" as the man continually falls out of the handstand. 'STRUGGLING' is in yellow, 'FALL OUT' is in orange, and the others are white. The audio echoes these texts, as a narrator's voice poses the same questions and reassures by saying, \"`We got you! When we start training our handstands, our range of balance is very small, meaning that if we shift a bit forward we fall and if we shift back we also fall.`\" Triumphant music underscores the moment the man in gray pants finally holds a handstand, with others around him lying happily on the ground. \n\nIn the next shot, the man holds a handstand against a dark blue wall, while green and red lines with dots are superimposed on his legs, suggesting a form of analysis. The narrative progresses as the shirtless man in gray pants successfully maintains a handstand against the dark blue wall, indicated by the green lines and dots on his legs. The lines and dots in red indicate an incorrect angle, after which the man falls out.\n\nHe then transitions to performing a handstand outside the gym-like structure, helping to demonstrate the right way to do a handstand. Another shirtless man in gray pants sways the legs of the man performing the handstand and explains. The audio continues to elaborate on balance, saying, \"`Our goal is to increase our range of balance so if we overbalance we're able to come back and if we underbalance we're also able to come back.`\" The scene is punctuated by the distant sounds of bicycle bells. As a green tick appears on the screen, the man holds the handstand independently, solidifying his achievement.\n\nIn the next scene, we see the man resuming his handstand against the dark blue wall. Green lines and dots reappear on his legs for further analysis. He adjusts his position by bending and straightening his legs. Outside again by the blue gym structure, the assisting man kneels and gestures nearby, seemingly providing explanation. A black arrow appears on the screen to indicate. The final frames show a close-up of the man's hands on the ground, readying for another handstand. The audio underscores the importance of building capacity in the end ranges of a handstand, suggesting exercises and stating, \"`In order to improve that, we need to build capacity in those end ranges. Handstand press negatives are my preferred exercise to accomplish this. Start about a hand distance from the wall, kick up into your handstand, use your fingers to fit all the wall.`\" highlighting specific training details.\n\nThe theme of instruction continues as the man performs another handstand against the dark blue wall. A yellow and black arrow appears on his legs, a green curved arrow appears on his hip, and a green and black arrow appears near his hands, suggesting directions for movement or adjustments. The man lowers himself from the handstand while another shirtless man provides explanation, gesturing with hands. The narrator advises on technique, suggesting to \"`Shift the weight forward to contrabalance the weight of the legs coming down`\" and adds, \"`At first, you might smash the ground. As you repeat this exercise, you're going to be able to come down slower and slower over time.`\" emphasizing the learning curve and practice needed.\n\nIn the final clip, the same man in gray pants performs a different handstand against the dark blue wall. He gracefully transitions out of the position and joins another shirtless man, holding a child who playfully hangs upside down. The cohesive narrative ends with the audio encouraging the viewer by saying, \"`You're going to also work on tuck and straddle. That's it, family. This is my favorite exercise.`\" solidifying the theme of gradual improvement and community support. The final frames depict the shirtless men standing side by side with the child between them, wrapping up the video with a sense of accomplishment and camaraderie.\n"
+ }
+ ],
+ "events": [
+ "Shirtless man performs a handstand in front of a gym-like structure with blue bars and equipment.",
+ "Man struggles to maintain his balance and continually falls out of the handstand.",
+ "Text appears on screen: 'ARE YOU STRUGGLING WITH BALANCE IN YOUR HANDSTAND?'",
+ "Text appears on screen: 'DO YOU CONSTANTLY FALL OUT OF IT AFTER JUST A FEW SECONDS?'",
+ "'STRUGGLING' is in yellow, 'FALL OUT' is in orange, and other texts are in white.",
+ "Narrator echoes the questions and reassures by explaining balance in handstands.",
+ "Triumphant music plays as the man in gray pants finally holds a handstand.",
+ "Others around him lie happily on the ground.",
+ "Man holds a handstand against a dark blue wall.",
+ "Green and red lines with dots are superimposed on his legs.",
+ "Narrative continues as the man successfully maintains a handstand.",
+ "Green lines and dots suggest correct angle, red indicates incorrect angle.",
+ "Man falls out of the handstand.",
+ "Man performs a handstand outside the gym-like structure demonstrating the right way.",
+ "Another shirtless man in gray pants sways the legs of the man performing a handstand.",
+ "Narrator elaborates on balance and increasing the range of balance.",
+ "Distant sounds of bicycle bells.",
+ "Green tick appears on screen as the man holds the handstand independently.",
+ "Man resumes his handstand against the dark blue wall.",
+ "Green lines and dots reappear for further analysis.",
+ "Man adjusts his position by bending and straightening his legs.",
+ "Outside by the blue gym structure, the assisting man kneels and gestures.",
+ "Black arrow appears on the screen to indicate.",
+ "Close-up of the man's hands on the ground, readying for another handstand.",
+ "Narrator underscores the importance of building capacity in the end ranges of a handstand.",
+ "Narrator suggests exercises and training details for improvement.",
+ "Man performs another handstand against the dark blue wall.",
+ "Yellow and black arrow on his legs, green curved arrow on his hip, and green and black arrow near his hands.",
+ "Man lowers himself from the handstand while another shirtless man provides explanation, gesturing.",
+ "Narrator advises on technique to shift weight forward to contrabalance legs coming down.",
+ "Narrator highlights the learning curve and practice needed.",
+ "Man performs different handstand against the dark blue wall.",
+ "Man gracefully transitions out of the position.",
+ "Man joins another shirtless man holding a child who playfully hangs upside down.",
+ "Audio encourages the viewer to work on tuck and straddle.",
+ "Final frames depict shirtless men standing side-by-side with the child between them."
+ ]
+ },
+ {
+ "video": "video/95.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video showcases a light-hearted and spirited conversation among a group of football players focusing on who has the best hair on the team.\n\nA man wearing a gray long-sleeve shirt and black shorts enters a building, carrying a white helmet and a beige towel. The text \"`Best hair on the team?`\" with a barber emoji appears. He proudly gestures towards his hair. The scene transitions to an outdoor field where teammates are walking and chatting. A player in a black sleeveless shirt and white shorts mentions, \"`I got a haircut,`\" pointing to his hair. Nearby, another player in a white jersey, number 17, and black shorts stands beside him. The top text changes to \"It ain't him!\" A man in a green jersey, number 41, confidently asserts, \"Lane Johnson, no doubt.\" The field background shows several other players engaging in various activities. The audio captures cheerful banter and laughter among the group, punctuated by \"Who has the best hair on the team? It ain't him. Not me. It ain't him. I got a haircut. Layton Johnson, no doubt.\"\n\nNext, on a grassy field, white-jerseyed football players in black shorts are training. A player, number 59, walks towards the camera holding a helmet. He shakes his hair. The text \"Me!\" appears on the screen. The scene fades to another player, number 17, who walks along the sideline. The top text initially reads \"`Matter of fact...`\" and then changes to \"`I changed my mind... T.J. Edwards.`\" He gestures towards the camera, and a picture of another player's hair appears with the text \"`Put a picture of his hair on here, you'll see.`\" As he points to the image, the audio humorously captures his change of heart: \"`Matter of fact, I changed my mind. T.J. Edwards put a picture of his hair on here. You'll see it.`\" The background has ongoing chatter and laughter from the crowd.\n\nA different scene begins with a man in a gray sleeveless shirt with an eagle logo entering a building. Text at the bottom reads, \"`Who has the best hair on the...?`\" He removes his cap, revealing his braided hair, and the top text humorously exclaims, \"Psh, what?!\" followed by \"`You see that?`\" with an eye emoji. He confidently shows off his hair while the top text asserts, \"No question!\" He finishes with a wide grin. The audio captures the man's pride as he says, \"`Who has the best helmet what, do you see that, no question,`\" while laughter and music play faintly in the background.\n\nThe video then cuts to a man standing on a grassy field in a white long-sleeve shirt and black shorts, holding a helmet with a green logo. The top text declares, \"Ah, Isaac.\" Two other players in similar outfits walk towards him, also with helmets. The text changes to \"Is the g.o.a.t. with the best hair on the team.\" with a goat emoji. The man jokes, \"`I told him to cut some off and let me have some.`\" The scene switches to a man in a black sleeveless shirt and shorts entering a building, gloves on, carrying a helmet. The bottom text notes, \"`That's an easy one.`\" It concludes with another man in a black sleeveless shirt on the field. The field activity continues in the background. The audio picks up a conversation: \"`Isaac is the goat with the best hair on the team. I told him to cut some off and let me have some. Come on, that's an easy one. This ain't nothing,`\" with teammates talking and laughing.\n\nFinally, the scene displays a man in a black sleeveless shirt on the field, wearing shoulder pads, emphasizing his hair quality with \"This... this ain't napps now.\" The text updates to \"Know what I'm saying?\" as he continues speaking confidently. He points to someone off-screen, adding, \"Then again, my boy Dallas going bald.\" The text urges, \"Come here Dallas, show 'em.\" As he gestures towards the unseen Dallas, the background of active teammates persists. The audio accentuates the humor: \"`Yeah, my boy Dallas going bald. Come here, Dallas,`\" amidst jovial conversation and laughter.\n\nThroughout the video, the recurring theme of camaraderie and friendly competition is clear, with teammates poking fun at and praising each other's hairstyles. The video combines both on-field and off-field scenes with humorous commentary, capturing the joyful and playful atmosphere among the football players."
+ }
+ ],
+ "events": [
+ "A man wearing a gray long-sleeve shirt and black shorts enters a building, carrying a white helmet and a beige towel.",
+ "Text 'Best hair on the team?' with a barber emoji appears.",
+ "He proudly gestures towards his hair.",
+ "Scene transitions to an outdoor field where teammates are walking and chatting.",
+ "A player in a black sleeveless shirt and white shorts mentions, 'I got a haircut,' pointing to his hair.",
+ "Nearby, another player in a white jersey, number 17, and black shorts stands beside him.",
+ "The top text changes to 'It ain't him!'",
+ "A man in a green jersey, number 41, confidently asserts, 'Lane Johnson, no doubt.'",
+ "Field background shows several other players engaging in various activities.",
+ "Audio captures cheerful banter and laughter among the group with phrases: 'Who has the best hair on the team? It ain't him. Not me. It ain't him. I got a haircut. Layton Johnson, no doubt.'",
+ "On a grassy field, white-jerseyed football players in black shorts are training.",
+ "A player, number 59, walks towards the camera holding a helmet and shakes his hair.",
+ "Text 'Me!' appears on the screen.",
+ "Scene fades to another player, number 17, who walks along the sideline.",
+ "The top text initially reads 'Matter of fact...' then changes to 'I changed my mind... T.J. Edwards.'",
+ "He gestures towards the camera and a picture of another player's hair appears with the text 'Put a picture of his hair on here, you'll see.'",
+ "Audio humorously captures his change of heart with phrases: 'Matter of fact, I changed my mind. T.J. Edwards put a picture of his hair on here. You'll see it.'",
+ "Background has ongoing chatter and laughter from the crowd.",
+ "A man in a gray sleeveless shirt with an eagle logo enters a building.",
+ "Text at the bottom reads 'Who has the best hair on the...?'",
+ "He removes his cap, revealing his braided hair.",
+ "Top text humorously exclaims 'Psh, what?!' followed by 'You see that?' with an eye emoji.",
+ "He confidently shows off his hair while the top text asserts 'No question!'",
+ "He finishes with a wide grin.",
+ "Audio captures the man's pride with phrases: 'Who has the best helmet what, do you see that, no question,' with laughter and music faintly in the background.",
+ "Scene cuts to a man standing on a grassy field in a white long-sleeve shirt and black shorts, holding a helmet with a green logo.",
+ "Top text declares 'Ah, Isaac.'",
+ "Two other players in similar outfits walk towards him, also with helmets.",
+ "Text changes to 'Is the g.o.a.t. with the best hair on the team.' with a goat emoji.",
+ "The man jokes, 'I told him to cut some off and let me have some.'",
+ "Scene switches to a man in a black sleeveless shirt and shorts entering a building, gloves on, carrying a helmet.",
+ "Bottom text notes 'That's an easy one.'",
+ "It concludes with another man in a black sleeveless shirt on the field with field activity continuing in the background.",
+ "Audio picks up a conversation with phrases: 'Isaac is the goat with the best hair on the team. I told him to cut some off and let me have some. Come on, that's an easy one. This ain't nothing,' with teammates talking and laughing.",
+ "Scene displays a man in a black sleeveless shirt on the field, wearing shoulder pads, emphasizing his hair quality with 'This... this ain't napps now.'",
+ "Text updates to 'Know what I'm saying?' as he continues speaking confidently.",
+ "He points to someone off-screen, adding 'Then again, my boy Dallas going bald.'",
+ "Text urges 'Come here Dallas, show 'em.'",
+ "Background of active teammates persists.",
+ "Audio accentuates the humor with phrases: 'Yeah, my boy Dallas going bald. Come here, Dallas,' amidst jovial conversation and laughter.",
+ "Recurring theme of camaraderie and friendly competition, with teammates poking fun at and praising each other's hairstyles.",
+ "Combination of both on-field and off-field scenes with humorous commentary, capturing the joyful and playful atmosphere among the football players."
+ ]
+ },
+ {
+ "video": "video/96.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man in a white shirt standing in front of a bright yellow background with red and black text that reads '`THE MOST PAINFUL THINGS HUMANS CAN EXPERIENCE.`' An illustration of a foot in a pink shoe stepping on a nail is shown next to the text. The man gestures with his hand while speaking, explaining that \"`The most painful things humans can experience. Childbirth is one of the most intense and painful things a woman can experience. This is due to the pain caused by the contraction of muscles and the pressure caused by the fetus.`\"\n\nThe scene transitions to an image of a woman lying in a hospital bed holding a newborn baby, indicating childbirth. The man's voice continues, describing the painful experiences. The next scene shows a diagram of a muscle tear in the arm, still accompanied by the man's narration in a monotone voice.\n\nNext, the scene shows a medical illustration of a baby in the womb, suggesting the severe pain of childbirth. The man continues to speak in front of a blurred background, weaving the narrative together.\n\nThe image then shifts to a woman in a white shirt and dark pants holding her lower abdomen, indicating abdominal pain. The speaker gestures upwards and continues: \"`Women who have gone through this say it feels like very intense cramping.`\" \n\nA close-up of a hand holding a small kidney stone with more stones in the background is shown, suggesting the intense pain of passing kidney stones. The narration continues: \"`Kidney stones can be caused by drinking too little water, exercising too much or too little, obesity, weight-loss surgery, or eating foods with too much.`\" \n\nThe man gestures toward the kidney stones while narrating. The clip concludes with a woman in a blue sports bra and black leggings doing push-ups, indicating physical exertion. The commentary continues about painful experiences during physical activities. The background in the video shifts first to three blue, obese animated characters, then to three animated women in purple, green, and pink bikinis, respectively.\n\nIn the third clip, the man in the white shirt stands in front of an image showing two wooden spoons, one filled with salt and the other with sugar, with the words 'Salt' and 'Sugar' above the respective spoons. The scene transitions back to a close-up of a hand holding a small kidney stone with more stones in the background. The man gestures with his hand, explaining \"`The stones are made of minerals and salts that form in your kidneys.`\" The speaker moves on to describe the pain caused when kidney stones block the urine from traveling from the kidneys to the bladder.\n\nA medical illustration of the urinary system, labeled 'Kidneys,' 'Ureter,' and 'Bladder,' appears. The man gestures towards the diagram while continuing with explanations about the urinary system and kidney stones. The final frame in this clip shows the man in front of a blurred background.\n\nAs the fourth clip starts, the focus shifts to a new topic. The man stands in front of an image of Earth from space, gesturing upwards. The narration describes cluster headaches: \"`Only 1% of the world suffers from them, and the people who have experienced it say it's a piercing, sudden, and intense pain.`\" Following this, there is a close-up of a man holding his head in pain.\n\nThe speaker's commentary shifts to another painful condition, shingles. An illustration of shingles on a person's back is shown, labeled 'Spinal cord,' 'Affected nerve,' and 'Shingles rash.' The speaker describes shingles as a viral infection that causes a burning sensation in your skin and appears as blisters.\n\nIn the final clip, the focus remains on the theme of shingles. The man stands in front of the same medical illustration of shingles from the previous clip. The man underscores the excruciating pain associated with shingles, saying, \"`Finally, shingles is a viral infection that causes burning in your skin that appears as blisters, which contain pus or blood.`\" A high-pitched electronic voice briefly highlights the chemical nature of the blisters as the video continues showing the person providing commentary underlining the severe and painful experiences depicted throughout the images.\n\nThe coherent narrative comes full circle, combining detailed visual depictions of painful human experiences with an engaging narration, providing an informative and impactful overview of some of the most intense pains humans can endure."
+ }
+ ],
+ "events": [
+ "Man in white shirt stands in front of bright yellow background.",
+ "Background text reads 'THE MOST PAINFUL THINGS HUMANS CAN EXPERIENCE.'",
+ "Illustration of a foot in pink shoe stepping on a nail is shown.",
+ "Man gestures with hand and explains about painful human experiences.",
+ "Man mentions childbirth as one of the most intense and painful experiences for women.",
+ "Explanation about childbirth pain due to muscle contractions and fetal pressure.",
+ "Transition to image of woman in hospital bed holding newborn baby.",
+ "Man's voice continues describing painful experiences.",
+ "Diagram of a muscle tear in the arm is shown.",
+ "Medical illustration of a baby in the womb is shown.",
+ "Man continues to speak in front of a blurred background.",
+ "Image shows woman in white shirt and dark pants holding her lower abdomen.",
+ "Speaker gestures upwards and talks about intense cramping experienced by women.",
+ "Close-up of hand holding small kidney stone with more stones in background is shown.",
+ "Narration about causes of kidney stones including insufficient water intake, exercising irregularities, obesity, weight-loss surgery, and diets high in certain foods.",
+ "Man gestures toward the kidney stones while narrating.",
+ "Clip shows woman in blue sports bra and black leggings doing push-ups.",
+ "Commentary on painful experiences during physical activities.",
+ "Background shifts to blue obese animated characters and then to animated women in bikinis.",
+ "Man in white shirt stands in front of image showing wooden spoons with salt and sugar.",
+ "Scene transitions back to close-up of hand holding small kidney stone with more stones in background.",
+ "Man gestures with hand, explaining formation of kidney stones from minerals and salts.",
+ "Man describes pain caused by kidney stones blocking urine flow.",
+ "Medical illustration of urinary system labeled 'Kidneys,' 'Ureter,' and 'Bladder' is shown.",
+ "Man gestures towards diagram and continues explaining about urinary system and kidney stones.",
+ "Final frame shows man in front of blurred background.",
+ "Fourth clip starts with man standing in front of image of Earth from space.",
+ "Man gestures upwards and discusses cluster headaches.",
+ "Close-up of man holding head in pain is shown.",
+ "Speaker shifts to talking about shingles.",
+ "Illustration of shingles on person's back labeled 'Spinal cord,' 'Affected nerve,' and 'Shingles rash' is shown.",
+ "Speaker describes shingles as viral infection that causes burning sensation and blisters.",
+ "Final clip shows man in front of medical illustration of shingles.",
+ "Man underscores painfulness of shingles mentioning blisters with pus or blood.",
+ "High-pitched electronic voice highlights chemical nature of blisters.",
+ "Commentary continues about severe and painful experiences depicted in images."
+ ]
+ },
+ {
+ "video": "video/97.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a black screen, eventually transitioning to a lively scene where a man in a pink jacket animatedly interacts with a blue and orange bird-like puppet adorned with a pink bow. Children, dressed in colorful clothing, play around in the background, enhancing the playful atmosphere. The man in the pink jacket is fully engaged with the puppet, making expressive gestures that entertain the surrounding children. As the animated interaction continues, a voice says, \"`Are you all right? We'll see you back. But of course, why don't we meet our pink windmill kids? Hi, my name's Catrina.`\"\n\nIn the next scene, a girl wearing a green shirt with 'CATRINA' emblazoned on it energetically dances in an environment that suggests a game or play setting, with a dartboard and various boxed items in the background. Music begins to play as Catrina energetically moves around, contributing to the vibrant atmosphere.\n\nThe scene transitions to a boy in a light blue shirt with the name 'HUGH.' He says, \"And my name is Hugh.\" Hugh waves enthusiastically while the background continues to be colorful and playful. The clip then shifts to another girl in a pink shirt labeled 'SARAH,' pointed at by Hugh. The audio matches the visual sequence with voices introducing themselves: \"`I'm Sarah. I'm Emma. And I'm Joe to you.`\" The scene becomes even more energetic as two more individuals, 'JOE' and 'EMMA,' join in, raising their hands and moving energetically. Eventually, a group of five people performs a coordinated dance, maintaining the vibrant and dynamic mood.\n\nFollowing this, the video shows children sitting in an audience, with some wearing shirts with names like 'JANE' and 'MIKE.' A child named Spencer, in a grey shirt, then a girl in a light green shirt, are seen running towards the audience, maintaining the playful ambiance while the audio continues with, \"`Hello, my name is Spencer. 
And my name is Lorraine.`\"\n\nThe next sequence features kids in the audience, excitedly participating in the lively atmosphere. Children with names like 'LORRAINE,' 'ANNIE,' and 'HUGH' are seen animatedly interacting and moving around. This scene is paired with the audio introductions: \"I'm Anthony. I'm Debbie. And my name is Abby.`\"\n\nThe scene shifts back to the energetic group of five dancers in their colorful costumes. They are dancing in a whimsical, pink-themed indoor environment, showcasing wooden beams and various decorations that add to the dynamic atmosphere. This is followed by a transition to a black screen displaying yellow Korean text, '\uc5b4\ub978\uc774 \ub41c\ud6c4..' (After becoming an adult..), hinting at a narrative shift.\n\nThe video then shows Catrina dancing energetically in front of a pink wall with large brown sacks labeled '354.' The vibrant scene continues with Hugh making a hand gesture in front of the same backdrop. The energetic movement persists as Sarah, Emma, and Joe dance lively in this whimsical setting.\n\nIn subsequent scenes, Joe, wearing a green shirt, and Emma in a light green shirt, continue their energetic dance in front of a pink-painted indoor environment. The visuals remain consistent with a focus on a lively interaction among the group members, while voices continue to introduce themselves: \"Hello, my name is Spencer. And mine is Lorraine. I'm Anthony. I'm Debbie.\"\n\nFinally, the video showcases Anthony, Abbie, and Debbie in front of a seated audience, energetically dancing and giving thumbs-up gestures. This concluding section includes the entire group dancing in a coordinated manner, maintaining the whimsical, pink-themed setting. The dancers perform lively movements, ending the clip with a group of dancers joyfully running towards the back of the stage, encapsulating the playful and spirited essence of the video. 
The final audio, a child's voice, introduces, \"`And my name is Abby,`\" as the music plays out, marking the energetic conclusion of the clip."
+ }
+ ],
+ "events": [
+ "Video opens with a black screen.",
+ "Transition to a lively scene with a man in a pink jacket interacting with a blue and orange bird-like puppet adorned with a pink bow.",
+ "Children dressed in colorful clothing play in the background.",
+ "The man in the pink jacket makes expressive gestures to entertain the surrounding children.",
+ "Voice says, 'Are you all right? We'll see you back. But of course, why don't we meet our pink windmill kids? Hi, my name's Catrina.'",
+ "Next scene features a girl in a green shirt with 'CATRINA' emblazoned on it dancing energetically. Background has a dartboard and various boxed items.",
+ "Music begins to play as Catrina dances energetically.",
+ "Scene transitions to a boy in a light blue shirt with the name 'HUGH' who says, 'And my name is Hugh', while waving enthusiastically.",
+ "Scene shifts to another girl in a pink shirt labeled 'SARAH'.",
+ "Audio sequence with voices: 'I'm Sarah. I'm Emma. And I'm Joe to you.'",
+ "Two more individuals, 'JOE' and 'EMMA,' join in, raising their hands and moving energetically.",
+ "Five individuals perform a coordinated dance, maintaining the vibrant and dynamic mood.",
+ "Children sit in an audience, some wearing shirts with names like 'JANE' and 'MIKE'.",
+ "A child named Spencer in a grey shirt runs towards the audience.",
+ "A girl in a light green shirt also runs towards the audience.",
+ "Children in audience, including 'LORRAINE,' 'ANNIE,' and 'HUGH,' animatedly interact and move around.",
+ "Audio introductions: 'I'm Anthony. I'm Debbie. And my name is Abby.'",
+ "Scene shifts back to five dancers in colorful costumes, dancing in a whimsical, pink-themed environment with wooden beams and various decorations.",
+ "Transition to a black screen displaying yellow Korean text, '\uc5b4\ub978\uc774 \ub41c\ud6c4..' (After becoming an adult...).",
+ "Catrina dances energetically in front of a pink wall with large brown sacks labeled '354.'",
+ "Hugh makes hand gestures in front of the same backdrop.",
+ "Sarah, Emma, and Joe continue dancing lively in the whimsical setting.",
+ "Joe in a green shirt and Emma in a light green shirt dance energetically in front of a pink-painted indoor environment.",
+ "Voices continue to introduce themselves: 'Hello, my name is Spencer. And mine is Lorraine. I'm Anthony. I'm Debbie.'",
+ "Anthony, Abbie, and Debbie energetically dance and give thumbs-up gestures in front of a seated audience.",
+ "Entire group dances in a coordinated manner in the whimsical, pink-themed setting.",
+ "Group of dancers joyfully runs towards the back of the stage.",
+ "Child's voice introduces, 'And my name is Abby,' as the music plays out."
+ ]
+ },
+ {
+ "video": "video/98.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with the USPTA (United States Professional Tennis Association) logo displayed prominently on a white background. The logo features a stylized figure playing tennis with a red star in the background and the text \"USPTA\" in blue. A swooshing sound accompanies the transition to a fast-paced padel match on a blue court with glass walls, labeled \"World Padel Tour\" and \"Granada Open.\" Surrounding advertisements for \"Estrella Damm\" and \"Adeslas\" come into view, adding to the event's professional ambiance. Four players, two on each team, compete vigorously as a man's voice says, \"Padel is very similar to tennis. All courts are standardized, roughly 33 by 66 feet, so about a third the size of a tennis court. And it's the same scoring as tennis.\" Concurrently, text elements \"THIS IS PADEL,\" \"33 X 66 FOOT COURT,\" and \"`SAME SCORING AS TENNIS`\" appear at the bottom of the screen, providing viewers with essential information about the sport.\n\nThe setting then shifts to a man in a white jacket holding a padel racket, labeled \"LEE SPONAUGLE\" and \"ALL RACQUET SPORTS\" in red and white text at the screen's bottom left, against the backdrop of a green court where people are playing padel. The scene continues to show the players' dynamic rallies and shots, accentuating the sport's high energy as the audio transitions to a man stating, \"But I would say 99% of the play is doubles play. You start with an underhand serve; after you serve, the ball has to bounce onto the turf.\" Text annotations highlight these rules with prompts \"START WITH UNDERHAND SERVE\", \"BALL HAS TO HIT FLOOR FIRST\" and \"PLAY OFF SIDE AND BACK WALLS.\"\n\nA voiceover elaborates, \"And you can play off the side. If tennis is checkers, padel is chess. 
It's really a thinking person sport, but the great thing about padel.\" The strategic aspect of the sport is underlined by text stating \"THINKING PERSON'S SPORT.\"\n\nThe rally continues fervently, further mirrored by the audio, which continues, \"It is easy to learn and hard to master. That means that people from various levels can go out and play and have fun right away, but it's hard to master.\" Text on the screen alternates between \"EASY TO LEARN HARD TO MASTER\" and \"`PLAY AND HAVE FUN RIGHT AWAY.`\" The clip underscores the sport's broad accessibility and potential for mastery.\n\nAs the scene transitions, the text \"78 COUNTRIES WORLDWIDE\" is displayed at the bottom, marking the global spread of padel. A continuation of the rally shifts to another match labeled \"#WPTMijasOpen,\" where players wearing blue shirts engage in intense gameplay. Highlighting padel's reach and infrastructure, text reads \"30,000 COURTS WORLDWIDE\" and \"35 INTERNATIONAL FEDERATIONS.\" The audio syncs, stating, \"Padel is played in 78 different countries around the world, 30,000 courts around the world, 35 different international federations. Get PADEL certified today.\" This is bolstered by the screen text \"GET PADEL CERTIFIED TODAY.\"\n\nThe video then narrows back to the blue court labeled \"#WPTMijasOpen,\" followed by the text \"@USPTA.COM,\" urging viewers to visit the United States Professional Tennis Association's website. The clip concludes with the USPTA logo reappearing on a white background, with added text \"`IS YOUR TEACHING PRO USPTA CERTIFIED?`\" The audio wraps up with \"Learn more at USPTA.com. Is your Teaching Pro USPTA certified?\" culminating in the final text \"USPTAFINDAPRO.COM,\" directing viewers to a website to find certified USPTA professionals. The clip ends with the cheerful noise of a crowd."
+ }
+ ],
+ "events": [
+ "USPTA logo displayed on white background",
+ "Logo features stylized figure playing tennis with a red star and text 'USPTA' in blue",
+ "Swooshing sound transitions to padel match",
+ "Padel match on blue court with glass walls labeled 'World Padel Tour' and 'Granada Open'",
+ "Surrounding advertisements for 'Estrella Damm' and 'Adeslas'",
+ "Four players, two on each team, compete vigorously",
+ "Man's voice says 'Padel is very similar to tennis. All courts are standardized, roughly 33 by 66 feet, so about a third the size of a tennis court. And it's the same scoring as tennis.'",
+ "Text 'THIS IS PADEL', '33 X 66 FOOT COURT', 'SAME SCORING AS TENNIS' appears on screen",
+ "Scene shifts to man in white jacket holding a padel racket",
+ "Man labeled 'LEE SPONAUGLE' and 'ALL RACQUET SPORTS' on green court where people are playing padel",
+ "Dynamic rallies and shots shown",
+ "Audio transitions to man stating, 'But I would say 99% of the play is doubles play. You start with an underhand serve; after you serve, the ball has to bounce onto the turf.'",
+ "Text annotations 'START WITH UNDERHAND SERVE', 'BALL HAS TO HIT FLOOR FIRST', 'PLAY OFF SIDE AND BACK WALLS'",
+ "Voiceover elaborates, 'And you can play off the side. If tennis is checkers, padel is chess. It's really a thinking person sport, but the great thing about padel.'",
+ "Text 'THINKING PERSON'S SPORT' on screen",
+ "Rally continues fervently",
+ "Audio says, 'It is easy to learn and hard to master. That means that people from various levels can go out and play and have fun right away, but it's hard to master.'",
+ "Text alternates between 'EASY TO LEARN HARD TO MASTER' and 'PLAY AND HAVE FUN RIGHT AWAY.'",
+ "'78 COUNTRIES WORLDWIDE' text displayed at bottom",
+ "Rally shifts to another match labeled '#WPTMijasOpen'",
+ "Players in blue shirts engage in intense gameplay",
+ "Text '30,000 COURTS WORLDWIDE' and '35 INTERNATIONAL FEDERATIONS' on screen",
+ "Audio states, 'Padel is played in 78 different countries around the world, 30,000 courts around the world, 35 different international federations. Get PADEL certified today.'",
+ "Screen text 'GET PADEL CERTIFIED TODAY'",
+ "Scene narrows to blue court labeled '#WPTMijasOpen'",
+ "Text '@USPTA.COM' urging viewers to visit USPTA website",
+ "USPTA logo reappears on white background",
+ "Text 'IS YOUR TEACHING PRO USPTA CERTIFIED?'",
+ "Audio wraps up with 'Learn more at USPTA.com. Is your Teaching Pro USPTA certified?'",
+ "Final text 'USPTAFINDAPRO.COM'",
+ "Clip ends with cheerful noise of a crowd"
+ ]
+ },
+ {
+ "video": "video/99.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video, sparing no detail. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a shirtless man adorned with multiple gold accessories, including a prominent gold chain with an anchor pendant, bracelets, rings, and a watch. He is wearing a gray cap with a logo on it. The background features a modern interior, white curtains, a flat-screen TV mounted on the wall, a railing with metal bars, and various items on the floor, such as a pink bag and a red cup. The lighting is warm, giving the room a cozy ambiance. The man gestures with his hands, adjusts his chain, and moves around slightly while an audio track captures a man loudly saying: \"`Yo, what it is? You already know, man. That nigga Ply checking in with you, man. Everybody has been hitting me by the biggest and the best chain.`\"\n\nContinuing with the same man and setting, the focus shifts to him holding up the large gold chain with both hands, displaying it prominently. The warm lighting maintains the cozy atmosphere. The man continues emphatically: \"`7 kilos, biggest motherfucking Cuban. Bigger than any nigga that ever did it.`\"\n\nThe scene progresses seamlessly with the man now holding up the gold chain with one hand, moving it closer to the camera to showcase its size and intricate details. The audio captures more intense commentary: \"`Nigga told me they didn't do it bigger than 7 kilos that motherfucking lot. And a lot of niggas doing it, but nigga ain't spending that after check, you know what I'm saying? Whole nother kilo right here, boy.`\"\n\nNext, the man is seen holding the chain with both hands again, moving it nearer to the camera for a more detailed view. He adjusts his cap and gestures while the audio delivers: \"`BitBoy shit, though. Y'all already know how I nigga rockin', nigga. Eight kilos altogether. Yaaah. That nigga.`\"\n\nMaintaining the scene's continuity, the man holds up the chain once more, emphasizing its size and details before lowering it to stand with his arms at his sides, showcasing his tattoos. 
The enthusiastic narration continues: \"Bet you can't find a nigga who did it like this, bitch I bet you can't find a nigga Did it this big, nigga Double up, nigga Stop that spinning, that look, that look might not on that Cuban, nigga.\"\n\nFinally, the man brings the gold chain closer to the camera and shifts the focus to his mouth, revealing gold teeth. The audio culminates with: \"Put you some on that Cuban. Yeah. Eight kilos, nigga.\""
+ }
+ ],
+ "events": [
+ "A shirtless man adorned with multiple gold accessories, including a prominent gold chain with an anchor pendant, bracelets, rings, and a watch, appears in the video.",
+ "The man is wearing a gray cap with a logo on it.",
+ "The background features a modern interior with white curtains, a flat-screen TV mounted on the wall, a railing with metal bars, and various items on the floor, such as a pink bag and a red cup.",
+ "The lighting is warm, giving the room a cozy ambiance.",
+ "The man gestures with his hands, adjusts his chain, and moves around slightly.",
+ "The audio track captures a man loudly saying: 'Yo, what it is? You already know, man. That nigga Ply checking in with you, man. Everybody has been hitting me by the biggest and the best chain.'",
+ "The man holds up the large gold chain with both hands, displaying it prominently.",
+ "The warm lighting maintains the cozy atmosphere.",
+ "The man emphatically says: '7 kilos, biggest motherfucking Cuban. Bigger than any nigga that ever did it.'",
+ "The man now holds up the gold chain with one hand, moving it closer to the camera to showcase its size and intricate details.",
+ "The audio captures intense commentary: 'Nigga told me they didn't do it bigger than 7 kilos that motherfucking lot. And a lot of niggas doing it, but nigga ain't spending that after check, you know what I'm saying? Whole nother kilo right here, boy..'",
+ "The man is seen holding the chain with both hands again, moving it nearer to the camera for a more detailed view.",
+ "He adjusts his cap.",
+ "He gestures while the audio delivers: 'BitBoy shit, though. Y'all already know how I nigga rockin', nigga. Eight kilos altogether. Yaaah. That nigga.'",
+ "The man holds up the chain once more, emphasizing its size and details.",
+ "The man lowers the chain to stand with his arms at his sides, showcasing his tattoos.",
+ "The enthusiastic narration continues: 'Bet you can't find a nigga who did it like this, bitch I bet you can't find a nigga Did it this big, nigga Double up, nigga Stop that spinning, that look, that look might not on that Cuban, nigga.'",
+ "The man brings the gold chain closer to the camera and shifts the focus to his mouth, revealing gold teeth.",
+ "The audio culminates with: 'Put you some on that Cuban. Yeah. Eight kilos, nigga.'"
+ ]
+ },
+ {
+ "video": "video/100.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a high-intensity football game between the Broncos in white uniforms and the Giants in blue uniforms. The scoreboard shows the Broncos leading 20-7 in the 4th quarter with 12:03 remaining. The Giants are on 2nd down with 7 yards to go. The scene sets with the Giants lined up in an offensive formation. The quarterback receives the snap, looks for a passing opportunity, but under pressure, he decides to scramble. As he steps up in the pocket, he eventually opts to run, gaining some yards before being tackled by multiple Broncos. The sequence ends with the quarterback on the ground, players from both teams converging around him. In sync with the visuals, a commentator describes the play: \"All right, check out this play. Daniel Jones on a second and seven. Pass play is called, but he's going to scramble. Steps up in the pocket, decides to run, slides, and a Denver defender comes over the top of them.\" Meanwhile, the crowd cheers loudly in the background.\n\nThe game intensifies as players gather around the spot of the tackle. The Broncos are still leading 20-7, now with 11:53 left in the 4th quarter, and it remains 2nd down and 7 yards to go. A heated exchange ensues, with some players pushing and shoving, showing visible frustration and aggression. Referees step in to separate them and restore order. The atmosphere is tense. The audio continues with the commentator noting: \"A couple of linemen take exception to that, and now it's like, yeah, you didn't like my tweet on Twitter, and you were following this person instead of me.\" Music begins to play softly alongside the narration.\n\nThe referee's involvement becomes more pronounced as both teams' players remain gathered and agitated. With the scoreboard now showing 11:43 left in the 4th quarter, the play has moved to 3rd down. Referees are seen trying to manage the situation, keeping the players apart while some continue their heated exchanges. 
The tension remains palpable as the referees diligently work to restore order. During this sequence, the commentator adds: \"A couple of slaps and penalty flags will eventually come out. I think it might be offsetting. Just happened a few seconds ago. But that's enough, fellas. Yellow flag comes on the field.\" The background music complements this narration.\n\nThe video transitions back to the field with the Broncos still leading the Giants 20-7 and 11:34 left in the 4th quarter. With 3rd down in progress, referees continue their efforts to calm the situation. Players, still visibly upset, engage in brief arguments before gradually dispersing and preparing for the next play. As cooler heads eventually prevail, the scene shifts to a different game. Now the focus is on a white-uniformed team with orange helmets huddled together, indicating a game between Cleveland leading 22-20 against Kansas City. The scoreboard shows 8 seconds remaining in the 3rd quarter with 1st down and 10 yards to go. The audio transitions with the commentator stating: \"Cooler heads eventually prevailed. To Arrowhead.\" Music keeps playing and is interrupted, after which the man continues to speak, hinting at the dramatic shift to the new scene."
+ }
+ ],
+ "events": [
+ "The video begins with a high-intensity football game between the Broncos in white uniforms and the Giants in blue uniforms.",
+ "The scoreboard shows the Broncos leading 20-7 in the 4th quarter with 12:03 remaining.",
+ "The Giants are on 2nd down with 7 yards to go.",
+ "The scene sets with the Giants lined up in an offensive formation.",
+ "The quarterback receives the snap.",
+ "The quarterback looks for a passing opportunity.",
+ "The quarterback under pressure decides to scramble.",
+ "The quarterback steps up in the pocket.",
+ "The quarterback eventually opts to run.",
+ "The quarterback gains some yards before being tackled by multiple Broncos.",
+ "The quarterback ends up on the ground with players from both teams converging around him.",
+ "A commentator describes the play.",
+ "The commentator says: 'All right, check out this play. Daniel Jones on a second and seven. Pass play is called, but he's going to scramble. Steps up in the pocket, decides to run, slides, and a Denver defender comes over the top of them.'",
+ "The crowd cheers loudly in the background.",
+ "Players gather around the spot of the tackle.",
+ "Broncos still lead 20-7 with now 11:53 left in the 4th quarter.",
+ "It remains 2nd down and 7 yards to go.",
+ "A heated exchange ensues with some players pushing and shoving.",
+ "Referees step in to separate the players and restore order.",
+ "The atmosphere is tense.",
+ "The commentator notes: 'A couple of linemen take exception to that, and now it's like, yeah, you didn't like my tweet on Twitter, and you were following this person instead of me.'",
+ "Music begins to play softly alongside the narration.",
+ "Referee involvement becomes more pronounced as players remain gathered and agitated.",
+ "The scoreboard now shows 11:43 left in the 4th quarter.",
+ "The play has moved to 3rd down.",
+ "Referees try to manage the situation, keeping players apart.",
+ "Some players continue their heated exchanges.",
+ "The commentator adds: 'A couple of slaps and penalty flags will eventually come out. I think it might be offsetting. Just happened a few seconds ago. But that's enough, fellas. Yellow flag comes on the field.'",
+ "Background music complements the narration.",
+ "The video transitions back to the field with Broncos still leading the Giants 20-7 and 11:34 left in the 4th quarter.",
+ "3rd down is in progress.",
+ "Referees continue their efforts to calm the situation.",
+ "Players engage in brief arguments before gradually dispersing.",
+ "Players prepare for the next play.",
+ "The scene shifts to a different game.",
+ "Focus is now on a white-uniformed team with orange helmets indicating a game between Cleveland leading 22-20 against Kansas City.",
+ "The scoreboard shows 8 seconds remaining in the 3rd quarter with 1st down and 10 yards to go.",
+ "The audio transitions with the commentator stating: 'Cooler heads eventually prevailed. To Arrowhead.'",
+ "Music keeps playing.",
+ "Music is interrupted.",
+ "The commentator hints at the dramatic shift to the new scene."
+ ]
+ },
+ {
+ "video": "video/101.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video introduces a man wearing a blue tank top and a cap, sitting inside a vehicle with a green field visible through the window. The man is wearing a seatbelt. Text appears in red boxes with white letters showing: \"Just got a call from the POLICE. Loose alligators in Marsing, ID.\" As the audio plays, the man says, \"`We just got a call from the police department. We have loose alligators in Marsing.`\" The scene transitions to an outdoor view of a pond, with the overlay text \"Claytonia Pond.\" This is followed by another frame saying, \"Gators in here???\", indicating the need for caution. The narrative continues with the audio, mentioning, \"So we just arrived at Clayton.\" The video reaches a high point as it shows the same man standing on a road wearing a blue tank top and gray shorts, with the text \"Arrived!!!\" displayed at the bottom.\n\nNext, the video naturally progresses with the same man standing on the road while the audio reinforces with, \"We're going to your pond.\" The camera then shows a black police car marked \"STATE\" and an outline of Idaho, parked in a gravel area. The background reveals a white sheriff's vehicle and a red truck, with several people standing around. The audio informs us, \"`Apparently we have some alligators.`\" and transitions to show the pond again, with the anticipation heightened by the text \"There's 10 of them out there.\" The conversation continues audibly, affirming, \"`There's like 10 of them out there.`\" and builds suspense as the next visual shows the same man standing near the pond; in front of him is another man in a red shirt. The text \"14 foot?!?\" appears on the screen, while a police officer stands nearby. A man in a yellow top says, \"`What? Yeah, some of them are like 14 feet long. I live in the south. I like alligators.`\"\n\nThe following scene opens with the serene view of the pond surrounded by greenery, the calm water mirroring the surrounding vegetation. 
The audio breaks this calm by clarifying, \"Yeah you guys, those are fish. Those are not alligators.\" The video aligns, overlaying text: \"`Those are fish! `\" and \"Not alligators!!!\" transitioning from confusion to realization. As the camera zooms on the pond again, the text \"Carp breeding!!!\" overlaid indicates the actual scene. The audio describes, \"`Like some carp or something over there, like a breeding thing.`\"\n\nWrapping up, the video re-focuses on the calm pond with the text \"`Fish just jumped! `\" The audio describes, \"Look at. Look at, guys. See, all the fish just jumped, now they went under.\" The final scene transitions to the man in the blue tank top and gray shorts standing near a sign reading \"PUBLIC ACCESS AREA\" and \"DAY USE ONLY,\" with the man talking to two other people. Text humorously states \"Thanks for giving me a heart attack!!!\" alongside a laughing emoji. The audio follows this light-hearted moment, \"` Well you guys, thank you for giving me a mini heart attack, `\" and concludes with the man talking amusedly amidst laughter, tying the narrative together."
+ }
+ ],
+ "events": [
+ "Man wearing blue tank top and cap, sitting inside a vehicle with green field visible through window",
+ "Man wearing seatbelt",
+ "Text appears in red boxes with white letters: 'Just got a call from the POLICE. Loose alligators in Marsing, ID.'",
+ "Audio: 'We just got a call from the police department. We have loose alligators in Marsing.'",
+ "Scene transitions to outdoor view of pond with overlay text 'Claytonia Pond.'",
+ "Frame with text 'Gators in here???'",
+ "Audio: 'So we just arrived at Clayton.'",
+ "Man standing on road wearing blue tank top and gray shorts",
+ "Text displayed at bottom: 'Arrived!!!'",
+ "Audio: 'We're going to your pond.'",
+ "Camera shows black police car marked 'STATE' and outline of Idaho, parked in gravel area",
+ "Background reveals white sheriff's vehicle, red truck, several people standing around",
+ "Audio: 'Apparently we have some alligators.'",
+ "Scene shows pond again with text 'There's 10 of them out there.'",
+ "Audio: 'There's like 10 of them out there.'",
+ "Man standing near pond with another man in red shirt, police officer nearby",
+ "Text on screen: '14 foot?!?'",
+ "Man in yellow top says, 'What? Yeah, some of them are like 14 feet long. I live in the south. I like alligators.'",
+ "Scene opens with serene view of pond surrounded by greenery, calm water mirroring vegetation",
+ "Audio: 'Yeah you guys, those are fish. Those are not alligators.'",
+ "Overlay text: 'Those are fish!' and 'Not alligators!!!'",
+ "Camera zooms on pond again with text 'Carp breeding!!!'",
+ "Audio: 'Like some carp or something over there, like a breeding thing.'",
+ "Scene re-focuses on calm pond with text: 'Fish just jumped!'",
+ "Audio: 'Look at. Look at, guys. See, all the fish just jumped, now they went under.'",
+ "Final scene transitions to man in blue tank top and gray shorts standing near sign reading 'PUBLIC ACCESS AREA' and 'DAY USE ONLY'",
+ "Man talking to two other people",
+ "Text: 'Thanks for giving me a heart attack!!!' alongside laughing emoji",
+ "Audio: 'Well you guys, thank you for giving me a mini heart attack.'",
+ "Man talking amusedly amidst laughter"
+ ]
+ },
+ {
+ "video": "video/102.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a man's hands working on a mechanical part, threading a white zip tie through a rusted metal component, with the man manipulating the zip tie to secure it. The audio reveals the man speaking, \"You know, this reminds me of a Roadkill Garage zip tie moment. Here it is. When I asked Dulcich to bring me some zip ties to hold up this Caltrack bar,\" as the music plays in the background. The scene transitions to a close-up of two white zip ties lying on a flat surface, suggesting preparation for further work, and then shows the man with a dark-colored cap working under a vehicle, with a tire visible in the background, continuing to secure the zip tie around the metal component.\n\nNext, the video shows another close-up of the man's hands working on the same task, followed by a wider shot of the man under the vehicle, surrounded by various colored containers in a workshop environment. The man manipulates the zip tie to secure it firmly around the metal component. The video then shifts back to the man under the vehicle. The audio narration continues, with the man saying, \"He knowingly did not just bring me one zip tie. Not even two, because he knew I was doing both sides. No, he wisely showed up with four zip ties.\" The scene ends with a close-up of two white zip ties on a flat surface; the man then picks them up for further work, a ring visible on his ring finger.\n\nThe video progresses to focus on the man working under the vehicle again, with the tire and workshop elements visible in the background. The man, wearing a dark-colored hat and black jacket, holds his head with one hand while continuing the task, and gestures with the zip ties in his other hand while talking. The audio captures a man explaining, \"So here's the zip tie moment rule. Never hand a guy one zip tie. Guy's under the car, goes, hey, man, can I have a zip tie? You walk over with five minimum,\" accompanied by background music. 
The scene shows the man's slight movements under the vehicle.\n\nThe camera zooms closer to show the zip tie being threaded through a rusted metal part. The audio provides more context with the man saying, \"The other thing is use up your white zip ties for jobs like this that you know are non-structural and that are not going to stay on the car because the white ones are junk,\" while music plays in the background.\n\nThe final clip of the video shows the man still under the vehicle, with a tire, various colored containers, and a yellow-handled hammer lying nearby. The man continues to hold two white zip ties while speaking, and then he reaches his left hand to the left of the frame. The concluding audio states, \"And that is your Roadkill Garage zip tie moment,\" tying back to the initial narrative, emphasizing the practical utility of zip ties in automotive repair tasks."
+ }
+ ],
+ "events": [
+ "Close-up of a man's hands working on a mechanical part.",
+ "Man threading a white zip tie through a rusted metal component.",
+ "Man manipulating the zip tie to secure it.",
+ "Audio: Man speaks about a Roadkill Garage zip tie moment.",
+ "Background music is playing.",
+ "Transition to a close-up of two white zip ties on a flat surface.",
+ "Man working under a vehicle with a tire visible in the background.",
+ "Man continues to secure the zip tie around the metal component.",
+ "Close-up of the man's hands working on the task.",
+ "Wider shot of the man under the vehicle in a workshop environment.",
+ "Various colored containers are visible.",
+ "Man manipulates the zip tie to secure it firmly.",
+ "Audio: Man speaks about Dulcich bringing four zip ties.",
+ "Close-up of two white zip ties on a flat surface.",
+ "Man picks up the zip ties for further work.",
+ "Ring visible on the man's ring finger.",
+ "Man works under the vehicle again.",
+ "Tire and workshop elements visible in the background.",
+ "Man wears a dark-colored hat and black jacket.",
+ "Man holds his head with one hand while continuing the task.",
+ "Man gestures with zip ties in his other hand while talking.",
+ "Audio: Man explains the zip tie moment rule.",
+ "Background music continues.",
+ "Slight movements of the man under the vehicle.",
+ "Camera zooms closer to the zip tie being threaded through a rusted metal part.",
+ "Audio: Man discusses using white zip ties for non-structural jobs.",
+ "Background music plays.",
+ "Final clip shows the man under the vehicle with visible tire, colored containers, and a yellow-handled hammer nearby.",
+ "Man holds two white zip ties while speaking.",
+ "Man reaches his left hand to the left of the frame.",
+ "Concluding audio: Man mentions the Roadkill Garage zip tie moment."
+ ]
+ },
+ {
+ "video": "video/103.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a deer with large antlers grazing on green grass in a natural setting. The background features tall, dry grass and trees with sparse leaves. Suddenly, the deer appears startled by a gunshot, quickly leaps forward towards the tall grass and trees, and then returns to its original position to resume grazing. The cycle repeats, with the text 'Head' and an arrow highlighting the movement of the deer's head during the leap. Accompanying the visuals is a man's voice explaining: \"This head is down and they hear something coming where they raise that head up to identify the source.\"\n\nThe scene continues with the deer still grazing, and the 'Head' text with an arrow persists, emphasizing its head movement. Another text appears, outlining the deer's body with a dotted line and an arrow pointing to it. The deer gets startled again, leaping forward, with the text and arrows highlighting both head and body movements. The sequence shows a full leap progression. The audio features a man explaining: \"When they're raising the head up, this is pivoting down. The pivot point is right here in the shoulder. So, when they sling that head up, they're dropping their vitals and they drop to lock these back muscles like in sprinter position.\"\n\nIn the next visual, the deer is in mid-leap, with consistent background elements. 'Head' with an arrow points to its head, while additional text and a dotted outline highlight its body. Transitioning from the initial leap, the scene changes to a person kneeling next to a deer model in grassy surroundings, bushes in the background. The individual, wearing a black hoodie and blue jeans, uses several tools and gestures to demonstrate something related to the deer model. The speaker elaborates: \"Think about a sprinter in the starting blocks. 
They're getting down and loading those, those ham muscles to take off really quickly,\" followed by the sound of a horse trotting.\n\nThe video continues with the deer mid-leap, leaping forward into the dense vegetation. Transitioning again, the visual returns to the person with the deer model, explaining and gesturing. The deer scenario reappears, now alert, with text and arrows highlighting its raised head and body movements before it leaps forward again. The sequence captures the progression to a full leap, while the audio follows with the explanation: \"They can't do that quick takeoff in this fully standing up erect position. When they're standing their head up...\"\n\nNext, the deer stands alert amidst mixed green and dry grass, cautiously looking around. The scene has autumn-colored foliage in the background. The deer starts to move, appears startled, and leaps forward through dense vegetation. The visual shifts back to the person next to the deer model, demonstrating and pointing out parts of the model. Returning to the grazing scene, 'Head' and 'Vitals' texts appear with arrows, emphasizing head and body movements as the deer gets startled and fully leaps forward. Audio continues with: \"A lot of people don't like to shoot because they say, well, the deer is already alert. But if you think about it, they have to drop much more. Only at the speed of gravity.\"\n\nLastly, the deer is seen in mid-leap, the 'Head' and 'Vitals' texts persist, with the additional phrase 'faster than gravity' highlighting agility. Progressing through several frames, the deer appears almost horizontal to the ground. It then stands alert, surrounded by green and dry grass and sparse-leaved trees, cautiously observing the surroundings before leaping forward and disappearing into the tall grass. The concluding audio states: \"They can't swing their head up, which pushes the vitals down, you have a much better chance of the arrow arriving before the vitals sink.\""
+ }
+ ],
+ "events": [
+ "Deer with large antlers grazing on green grass",
+ "Background of tall, dry grass and trees with sparse leaves",
+ "Deer gets startled by a gunshot",
+ "Deer quickly leaps forward towards the tall grass and trees",
+ "Deer returns to original position to resume grazing",
+ "Cycle repeats with text 'Head' and an arrow highlighting the deer's head movement",
+ "Man's voice explains head movements of the deer",
+ "Deer continues grazing with 'Head' text and arrow persisting",
+ "Text highlighting the deer's body with a dotted line and an arrow appears",
+ "Deer gets startled again and leaps forward",
+ "Text and arrows highlight head and body movements during the leap",
+ "Complete leap progression shown",
+ "Man explains the pivot and muscle action when deer raises its head",
+ "Video shows deer in mid-leap",
+ "'Head' text with an arrow pointing to its head and additional text highlighting the body",
+ "Transition to a person kneeling next to a deer model",
+ "Person wears black hoodie and blue jeans and uses tools to demonstrate on deer model",
+ "Man compares deer leap to a sprinter in starting blocks",
+ "Sound of a horse trotting",
+ "Deer continues mid-leap into dense vegetation",
+ "Transition back to person with deer model explaining and gesturing",
+ "Scene returns to the deer now alert with text and arrows highlighting movements before leaping forward",
+ "Sequence shows progression to a full leap",
+ "Man explains deer's quick takeoff abilities and head movements",
+ "Deer stands alert amidst mixed green and dry grass with autumn foliage in the background",
+ "Deer moves, gets startled, and leaps forward through dense vegetation",
+ "Visual returns to person next to the deer model demonstrating and pointing out parts",
+ "Grazing scene returns with 'Head' and 'Vitals' texts and arrows",
+ "Deer gets startled and fully leaps forward",
+ "Man explains people\u2019s hesitation to shoot at an alert deer and the speed of gravity",
+ "Deer in mid-leap with 'Head' and 'Vitals' texts and phrase 'faster than gravity' highlight agility",
+ "Deer appears almost horizontal to the ground during leap",
+ "Deer stands alert surrounded by green and dry grass and sparse-leaved trees",
+ "Deer cautiously observes surroundings before leaping forward and disappearing into tall grass",
+ "Concluding audio explains better chances of arrow hitting when deer swings its head up"
+ ]
+ },
+ {
+ "video": "video/104.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a brown dog named Stitch standing on a grooming table, surrounded by clumps of its fur, indicating the start of the grooming process. A man is using clippers on the dog. In the audio, a man introduces the scene, saying, \"This is how I worked with Stitch and gave him the makeover of a lifetime but first let me show you guys his state-issued ID. It's very important. One of Stitch's relatives tried to home groom him\". The background includes a black tool chest and a bag labeled 'Neewer'. The dog is shown reacting to the grooming but eventually stands still, more groomed, and with its tongue out, possibly panting or indicating contentment. A close-up of the dog's collar reveals an ID tag from Illinois, featuring a picture of the dog. \n\nThe narrative continues with a man stating: \"And so he came in super uneven and matted and it's time for me to fix that up, the mats on his ears were super super tight like that's a whole dreadlock.\" In this segment, the dog is now on the grooming table again, the man using a red razor on its face and legs. The scene shifts to a washing station where the dog is completely wet and covered in soap, being bathed while looking at the camera. The dog returns to the grooming table, now more groomed and panting, with the same consistent background setting.\n\nNext, the clip progresses to Stitch being shaved on the table again, with the man using the red razor on the dog's back. In the background, a man says, \"I talked to Stitch's mom about maintenance and matting, and so now hopefully she'll come every six weeks. 
And if you couldn't tell, I did try and attempt to draw Stitch in the background.\" The text 'whatchu looking at' humorously appears on the screen as the dog sits on the table, calm and occasionally sticking its tongue out while its facial fur is being trimmed with scissors.\n\nThe grooming continues, with Stitch sitting secured with a blue harness as a man trims the fur around his face with scissors, gently holding his head. The dog appears calm and cooperative, occasionally opening his mouth slightly. The man continues the trimming around its mouth. An ongoing dialogue features the man saying: \"but other Stitch's head was kind of covering it, so this haircut should be so good for him because really all he needs is his cheek hairs to be maintained.\"\n\nFollowing this, the scene shows Stitch well-groomed and sitting with a blue harness, panting and appearing relaxed. The text 'tmi Logan' appears on the screen. The man comments: \"Me too, Stitch. Me too. Other than his nails, Stitch was really good for the groom, and he was good for the blow dryer, which was a surprise, but hey, I'll take it.\" The man carefully trims the fur on the dog's paw, ensuring a neat appearance, then trims around its ear while Stitch stands panting and calm.\n\nIn the final segment, Stitch stands on the grooming table with a blue harness, while a man trims the fur around his ears. The dog is then shown wearing a yellow plaid tie, looking at the camera and yawning, projecting a more formal, possibly relaxed or tired demeanor. The audio commentary concludes with a cheerful note: \"I'm all done with him though, and I gave him that super fancy business tie because, don't let him fool you, he's got places to be. BYE STITCH!\" The clip ends humorously with the text 'off to corporate' as the man removes the blue harness and helps Stitch off the table, signifying the completion of his grooming session. 
\n\nOverall, the video seamlessly illustrates the comprehensive grooming process, highlighting Stitch's transformation from matted fur to a well-groomed, dapper dog, ready to 'head off to corporate'. The narrative and visual elements combine to provide an engaging and informative overview of the dog's grooming experience."
+ }
+ ],
+ "events": [
+ "Stitch, a brown dog, is standing on a grooming table surrounded by clumps of fur.",
+ "A man is using clippers on Stitch.",
+ "Audio: A man introduces the scene, talking about Stitch's makeover and mentions his state-issued ID.",
+ "Background items: black tool chest and a bag labeled 'Neewer'.",
+ "Dog reacts to grooming but eventually stands still, more groomed, and with its tongue out.",
+ "Close-up of dog's collar showing an Illinois ID tag with Stitch's picture.",
+ "Audio: A man discusses Stitch's uneven and matted fur, particularly tight mats on his ears.",
+ "Stitch is back on the grooming table, with the man using a red razor on its face and legs.",
+ "Scene shifts to washing station where Stitch is wet and soapy, looking at the camera.",
+ "Stitch returns to the grooming table, more groomed and panting.",
+ "Stitch being shaved on the table with a red razor on its back.",
+ "Audio: Man talks about discussing maintenance and matting with Stitch's mom.",
+ "Text on screen: 'whatchu looking at' while Stitch sits calmly on the table.",
+ "Stitch's facial fur is being trimmed with scissors as it remains calm.",
+ "Audio: Man discusses the importance of maintaining Stitch's cheek hairs.",
+ "Scene shows Stitch well-groomed, sitting with a blue harness, panting and relaxed.",
+ "Text on screen: 'tmi Logan'.",
+ "Audio: Man mentions Stitch being good for the groom and the blow dryer.",
+ "Man trims the fur on Stitch's paw and around its ear.",
+ "Stitch stands, panting and calm.",
+ "Final segment: Stitch stands on the grooming table with a blue harness, man trims fur around its ears.",
+ "Stitch wears a yellow plaid tie, looks at the camera, and yawns.",
+ "Audio commentary concludes cheerfully.",
+ "Text on screen: 'off to corporate' as the man removes the blue harness and helps Stitch off the table.",
+ "Video ends with Stitch's grooming session completed."
+ ]
+ },
+ {
+ "video": "video/105.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with two men seated at a wooden dining table in a well-lit kitchen adorned with white cabinets, a chandelier, and a large mirror on the wall. The text \"When your friend asks for a bite\" is displayed at the top of the screen in red with white letters. One man is wearing a green jacket, while the other sports a denim jacket and is holding a sandwich. The denim jacket wearer starts eating the sandwich, with close-ups alternating between him and the green jacket wearer, who looks on longingly. In the background, a man says, \"You should have got a sandwich, man. This thing is so good.\" Another voice responds, \"You think I can get a bite?\" The first voice agrees, saying, \"Uh, yeah.\" Another voice responds, \"Aw, thanks, man.\" The first voice adds, \"Just a little bite, though. Don't overdo it. All right, that's enough.\" A lighthearted tune plays in the background.\n\nThe video continues with the same humorous scenario. The text remains at the top, and close-ups show the green jacket-wearing friend now taking a bite of the sandwich, switching to the denim jacket-wearing friend looking on. The camera then shifts to a wider shot of the kitchen, where the denim jacket wearer heads to the refrigerator, only to find it empty. The clip ends with the green jacket wearer holding up a single grape, signaling that it\u2019s the last piece of food. As the sequence plays out, voices exchange: \"That's good. That's good. That's good. That's good. Come on, dude.\" \"Sorry, man. I guess I take really big bites.\" The response is, \"It's fine. I'll just make another one.\" This is followed by a refrigerator door sound and a final comment, \"What the?\"\nThe video advances to the green jacket wearer holding a grape above their mouth, ready to eat it. Text stays consistent at the top. The denim jacket wearer looks disappointed near the now-closed refrigerator door. 
The scene transitions to an exterior shot of a Chipotle Mexican Grill, where the denim jacket wearer is seen outside, conversing with someone in a blue shirt. The close-up reveals the denim jacket person explaining something. As this happens, a voice says, \"Like I said, I take really big bites.\" Then another voice shouts, \"All right, that's it! I'm going to Chipotle!\" Another says, \"Hey, good luck getting any food. They're closed. Apparently some guy came in and ate all the food.\" Soft music underscores this dialogue.\n\nThe scene then shows the denim jacket wearer outside Chipotle, seemingly in conversation. From here, it cuts to another man in a black jacket eating food with a fork. A close-up of the denim jacket wearer reacting to this is shown, followed by a transition to an exterior shot of a Vons grocery store. The final segment depicts the denim jacket individual walking toward the Vons entrance. In the audio, one man says, \"No. Like I said, I take really big bites.\" The sound of a vehicle passing by is heard.\n\nNext, the denim jacket wearer is shown trying to open glass doors of a building but appears frustrated. The familiar text remains on-screen. They then express exasperation before another person in a green jacket and a white cap enters the frame, holding a phone. A close-up of this man explaining something ensues. This is followed by the denim jacket wearer listening intently. Another man in a black jacket eating from a bag of chips adds comedic relief. Voices in the background say, \"What is happening?\" Another answers, \"You didn't hear? There's a nationwide food shortage. Apparently some guy named Matt ate up all the food.\" This is followed by loud crunching noises.\n\nFinally, the video shows a man in a black jacket eating from a bag of chips. The denim jacket wearer watches this man. The scene cuts to the black jacket individual continuing to eat. 
Then, attention shifts to another man in a green jacket and white cap standing in front of the building\u2019s glass doors, holding a drink. The denim jacket wearer observes him eating. The video humorously captures this ongoing quest for food. Concurrently, voices say, \"I take really big bites. Yo man, mind if I get a sip of that?\" A man's scream and some music serve as the backdrop. \n\nThe entire video ties together the humorous plight of friends navigating sharing food, highlighting the comedic yet relatable nuances of this everyday interaction."
+ }
+ ],
+ "events": [
+ "Two men are seated at a wooden dining table in a well-lit kitchen.",
+ "The kitchen is adorned with white cabinets, a chandelier, and a large mirror on the wall.",
+ "Text 'When your friend asks for a bite' is displayed at the top of the screen in red with white letters.",
+ "One man is wearing a green jacket; the other is wearing a denim jacket and holding a sandwich.",
+ "The denim jacket wearer starts eating the sandwich.",
+ "Close-ups alternate between the denim jacket wearer eating and the green jacket wearer looking on longingly.",
+ "A man says, 'You should have got a sandwich, man. This thing is so good.'",
+ "Another voice responds, 'You think I can get a bite?'",
+ "The first voice agrees, saying, 'Uh, yeah.'",
+ "Another voice responds, 'Aw, thanks, man.'",
+ "The first voice adds, 'Just a little bite, though. Don't overdo it. All right, that's enough.'",
+ "A lighthearted tune plays in the background.",
+ "The green jacket wearer takes a bite of the sandwich.",
+ "Close-ups switch to the denim jacket wearer looking on.",
+ "The camera shifts to a wider shot of the kitchen.",
+ "The denim jacket wearer heads to the refrigerator, only to find it empty.",
+ "The green jacket wearer holds up a single grape, signaling that it\u2019s the last piece of food.",
+ "Voices exchange: 'That's good. That's good. That's good. That's good. Come on, dude.'",
+ "Another voice responds, 'Sorry, man. I guess I take really big bites.'",
+ "The response is, 'It's fine. I'll just make another one.'",
+ "Refrigerator door sound is heard.",
+ "A final comment: 'What the?'",
+ "The green jacket wearer holds a grape above their mouth, ready to eat it.",
+ "The denim jacket wearer looks disappointed near the now-closed refrigerator door.",
+ "The scene transitions to an exterior shot of a Chipotle Mexican Grill.",
+ "The denim jacket wearer is seen outside, conversing with someone in a blue shirt.",
+ "Close-up reveals the denim jacket wearer explaining something.",
+ "A voice says, 'Like I said, I take really big bites.'",
+ "Another voice shouts, 'All right, that's it! I'm going to Chipotle!'",
+ "Another voice says, 'Hey, good luck getting any food. They're closed. Apparently some guy came in and ate all the food.'",
+ "Soft music underscores this dialogue.",
+ "Denim jacket wearer is outside Chipotle, seemingly in conversation.",
+ "Cuts to another man in a black jacket eating food with a fork.",
+ "A close-up of the denim jacket wearer reacting to this.",
+ "Transition to an exterior shot of a Vons grocery store.",
+ "The denim jacket individual is walking toward the Vons entrance.",
+ "In the audio, a man says, 'No. Like I said, I take really big bites.'",
+ "Sound of a vehicle passing by is heard.",
+ "Denim jacket wearer is shown trying to open glass doors of a building but appears frustrated.",
+ "Text remains on-screen.",
+ "Denim jacket wearer expresses exasperation.",
+ "Another person in a green jacket and a white cap enters the frame, holding a phone.",
+ "A close-up of this man explaining something.",
+ "The denim jacket wearer listens intently.",
+ "Another man in a black jacket eating from a bag of chips adds comedic relief.",
+ "Voices in the background say, 'What is happening?'",
+ "Another voice answers, 'You didn't hear? There's a nationwide food shortage. Apparently some guy named Matt ate up all the food.'",
+ "Loud crunching noises are heard.",
+ "A man in a black jacket is shown eating from a bag of chips.",
+ "Denim jacket wearer watches this man.",
+ "Scene cuts to the black jacket individual continuing to eat.",
+ "Attention shifts to another man in a green jacket and white cap standing in front of the building\u2019s glass doors, holding a drink.",
+ "Denim jacket wearer observes him eating.",
+ "Humorous quest for food is captured.",
+ "Voices say, 'I take really big bites. Yo man, mind if I get a sip of that?'",
+ "A man's scream and some music serve as the backdrop."
+ ]
+ },
+ {
+ "video": "video/106.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video shows a systematic experiment to determine how many sheets of paper it takes to stop an airsoft BB. \n\nA man is holding a tan-colored airsoft pistol against the backdrop of a neatly made bed with gray sheets and pillows. There's an overlay comment from 'California Airsoft' asking, \"how many pieces of paper does it take to stop a bb?\". The man introduces the experiment, saying, \"Yo guys, so today we're going to be finding out how many pieces of printer paper it will take to stop an airsoft BB,\" while typing on a keyboard.\n\nNext, the man holds a single sheet of white paper and then five in front of the pistol and fires. The BB passes through, leaving a hole. The audio features the man saying, \"One piece of paper I'm sure it will go through, yep, of course. Okay, guys, we're skipping to five just because I think it will go through four easily. So five. Here we go. And yep, easily through,\" followed by gunshots and further dialogue as he continues testing with an increased number of sheets.\n\nIn the subsequent scenes, the individual repeats the process. They line up sheets of paper; each time, the BB passes through, creating new holes in the paper. The man comments, \"Now six, even. Shoot. Uh, did it do six? Oh, yeah. No, it went through six. Okay, seven layers of printer paper, let's see. Uh, did it go through? I think it stopped it. No, uh, I think the BB's... oh no, it went through.\"\n\nWith 10 sheets of paper, the BB continues to penetrate: \"Okay, so we're at 10 pieces of paper, I think it will stop it this time. Okay, never mind. Went through,\" the man exclaims during the continued tests. The man aims and fires multiple shots, creating several holes, as evident in the visual examination of the paper.\n\nIn the final portion, a thicker stack of white paper is held in front of the pistol. The man fires, and this time, the sound indicates that the BB is not passing through as before. 
Multiple rounds fired at the stack result in the BBs bouncing off, accompanied by the man's observation: \"Okay, we are at 15. I will be very, very surprised if it goes through. Nope, it's just bouncing off. Yep, uh, 15 will stop it.\"\n\nThroughout the experiment depicted in the video, the analytical commentary and gunshots make clear the consistent methodology and the final conclusion of how many pieces of paper are necessary to stop a BB. The visual inspection of the paper at various increments of BB shots correlates directly with his spoken insights and observations."
+ }
+ ],
+ "events": [
+ "A man holds a tan-colored airsoft pistol.",
+ "Background shows a neatly made bed with gray sheets and pillows.",
+ "Overlay comment from 'California Airsoft' asking about stopping a BB with paper.",
+ "Man introduces the experiment while typing on a keyboard.",
+ "Man holds a single sheet of white paper in front of the pistol and fires.",
+ "BB passes through the single sheet of paper leaving a hole.",
+ "Man comments on the BB passing easily through one piece of paper.",
+ "Man holds five sheets of paper in front of the pistol and fires.",
+ "BB passes through the five sheets of paper leaving holes.",
+ "Man comments on the BB passing easily through five pieces of paper.",
+ "Man holds six sheets of paper in front of the pistol and fires.",
+ "BB passes through the six sheets of paper leaving holes.",
+ "Man comments on the BB going through six pieces of paper.",
+ "Man holds seven sheets of paper in front of the pistol and fires.",
+ "Man checks the paper and confirms BB went through seven pieces of paper.",
+ "Man holds ten sheets of paper in front of the pistol and fires.",
+ "Man comments that the BB went through ten pieces of paper.",
+ "Man holds fifteen sheets of paper in front of the pistol and fires.",
+ "BB does not pass through, bullets bounce off the thick stack of paper.",
+ "Man confirms fifteen pieces of paper stop the BB.",
+ "Man makes a final observation on the effectiveness of fifteen sheets in stopping the BB."
+ ]
+ },
+ {
+ "video": "video/107.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up shot of fish frying in a pan. The fish are golden brown and cut diagonally. The text above reads \"non-stick pan does not break the skin is fragrant and tender\" (non-stick pan, does not break the skin, fragrant and tender). A woman's voice introduces the cooking demonstration: \"Today, I'm going to share with you my three months of cooking experience.\" Music plays softly in the background. At this point, the video shows chopsticks being used to flip the fish in the pan; some fish skin is found stuck to the pan, and the chopsticks are then used to scrape it off. The caption above the video reads: \"No matter what fish you fry at home, do not put the oil in the pan directly after the fish, so that the skin will certainly not be preserved, but also the fish thin broken,\" followed by \"Today I would like to share with you my experience from the kitchen in three months.\"\nThe video transitions to a man using a large knife to score a fish on a yellow cutting board. The text says \"Make sure that the skin doesn't fall off when you cook fish at home.\" The yellow subtitles then change to: \"Let's change the fish into a small knife, and then wash it and put it in the basin...\" The woman continues, \"Add a small slice of spring onion and ginger. A little less salt and thirteen spices.\" The woman's voice guides the process.\nNext, a bowl of scored fish, ginger slices, and shallots is displayed, with the words \"Put less MSG\" (add a little MSG) on it. The woman instructs, \"Add a little MSG, a little seafood soy sauce, and a little cooking wine to remove the fishy smell. Finally, add a little white pepper and stir well. Marinate for 15 minutes.\" The background music continues to play, keeping a cheerful tone. 
Then, with one hand, the fish in the basin is thoroughly mixed with the other ingredients.\nIn the next scene, the marinated fish appears again with the words \"Marinate it for 5 minutes to marinate it\" (marinate it for 5 minutes to allow it to absorb the flavor). The video cuts to a pan into which oil is being poured, with the words \"This will make the fried fish taste delicious\" (so that the fried fish will taste delicious), and the woman explains: \"Then put the oil in the pan, heat the oil first. When the oil is hot, pour it out first.\" A hissing sound fills the moment.\nNext is a close-up of the smoking pan with the words \"Another scoop of cool oil in the pan\" (and then add a scoop of cool oil to the pan). The woman says: \"Put a spoonful of cold oil in the pot, after the cold oil, we fry the big fish on the high fire. We fry small fish over low heat. Then put the fish into the pot one by one.\" The music plays softly as the fish sizzle into the pan.\nFinally, the video shows three scored fish frying in hot oil, with the words \"slow fry on low heat after the pan\" (after adding to the pan, slow fry on low heat). The narrator instructs: \"When you're done, turn on the heat and fry slowly. When the other side is done, turn it over and fry it until it's golden on both sides and it's perfect. It doesn't stick to the pan at all. Do you know how to do that?\" As the sizzle continues, the fish are flipped over to reveal their golden-brown bellies, and the video ends with the words \"Did you learn it?\"\nThe audio and video work together to create a seamless cooking tutorial that emphasizes the step-by-step process of frying fish to perfection and offers helpful tips and an engaging narrative."
+ }
+ ],
+ "events": [
+ "Close-up shot of fish frying in a pan",
+ "Fish are golden brown and cut diagonally",
+ "Text on screen: 'non-stick pan does not break the skin is fragrant and tender'",
+ "Woman's voice introduces cooking demonstration",
+ "Woman says: 'Today, I'm going to share with you my three months of cooking experience'",
+ "Soft music plays in the background",
+ "Video shows using chopsticks to flip the fish in the pot",
+ "Fish skin sticks to the pot",
+ "Using chopsticks to scrape off the fish skin",
+ "Caption: 'No matter what fish you fry at home, do not put the oil in the pan directly after the fish, so that the skin will certainly not be preserved, but also the fish thin broken.'",
+ "Caption: 'Today I would like to share with you my experience from the kitchen in three months.'",
+ "Video transitions to a man using a large knife to score a fish on a yellow cutting board",
+ "Text: 'Make sure that the skin doesn't fall off when you cook fish at home'",
+ "Caption changes to: 'Let's change the fish into a small knife, and then wash it and put it in the basin...'",
+ "Woman continues guiding by voice",
+ "Woman says: 'Add a small slice of spring onion and ginger. A little less salt and thirteen spices.'",
+ "Bowl of scored fish, ginger slices, and shallots is displayed",
+ "Text: 'Put less MSG'",
+ "Woman says: 'Add a little MSG, a little seafood soy sauce, and a little cooking wine to remove the fishy smell.'",
+ "Woman says: 'Finally, add a little white pepper and stir well. Marinate for 15 minutes.'",
+ "Background music keeps a cheerful tone",
+ "With one hand, the fish in the basin is thoroughly mixed with other ingredients",
+ "Marinated fish appears with the text: 'Marinate it for 5 minutes to marinate it'",
+ "Cut to a pan where oil is being poured",
+ "Text: 'This will make the fried fish taste delicious'",
+ "Woman explains: 'Then put the oil in the pan, heat the oil first. When the oil is hot, pour it out first.'",
+ "Hissing sound fills the moment",
+ "Close-up of smoking pan with text: 'Another scoop of cool oil in the pan'",
+ "Woman says: 'Put a spoonful of cold oil in the pot, after the cold oil, we fry the big fish on the high fire.'",
+ "Woman says: 'We fry small fish over low heat. Then put the fish into the pot one by one.'",
+ "Music plays softly as the fish sizzle into the pan",
+ "Video shows three scored fish frying in hot oil",
+ "Text: 'slow fry on low heat after the pan'",
+ "Narrator instructs: 'When you're done, turn on the heat and fry slowly. When the other side is done, turn it over and fry it until it's golden on both sides and it's perfect. It doesn't stick to the pan at all. Do you know how to do that?'",
+ "Sizzling sound as fish are flipped over to show golden-brown bellies",
+ "Video ends with the text: 'Did you learn it?'",
+ "Audio and video create a seamless cooking tutorial",
+ "Tutorial emphasizes step-by-step process of frying fish to perfection",
+ "Tutorial offers helpful tips and engaging narrative"
+ ]
+ },
+ {
+ "video": "video/108.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a man's hand holding a rectangular sharpening stone while he sharpens a knife by sliding its blade across the stone's smooth, dark gray surface. The background is a dark surface with some objects visible in the distance. The sharpening continues steadily, and the audio reveals that the man is describing his process: \"So today I'm sharpening 14C 28N steel. I love 14C; it's an incredible steel. I dulled it on a stone first just to show that it was completely dull.\" As the sharpening continues, the man tests the knife's sharpness by smoothly slicing through a piece of white paper, demonstrating its effectiveness.\n\nThen the video shifts slightly to another close-up of knife sharpening, this time on a light gray, textured sharpening stone placed on the same dark surface. The man maintains a consistent motion, capturing the technique from different angles. The man's voice continues to describe his process: \"And then I started reprofiling on a 140 grit diamond plate. Now after reprofiling to 17 degrees per side.\" The sharpening process remains the focal point, with the background unchanged, emphasizing careful and precise movements.\n\nNext, the camera captures a close-up of the man's hand sharpening the knife on the light gray, textured stone, then transitions to a shot of a spray bottle labeled \"Neeve's Knives\" with a skull graphic. The man sprays a foamy liquid onto the sharpening stone and continues sharpening the knife. The voice explains: \"That's my edge angle that I put on each side. I went to my Veneve resin-bonded diamond stones. So these are still diamond stones; they're just in resin.\" The clip concludes with the man switching to a red sharpening stone, keeping the same rhythm and technique.\n\nIn the following segment, the consistent sharpening motion continues on the red stone, now covered in foamy liquid, then transitions to a yellow stone similarly treated. 
The man switches back to the red stone and later uses a leather strop with a wooden handle to refine the knife's edge. The man's voice provides the details: \"So they last an incredible long time. And then I moved to a diamond emulsion, Gunny Juice.\" There is a pause, and another man's voice can be heard.\n\nAs the video progresses, it shows an intimate view of the man's hand sharpening the knife on a dark gray sharpening stone with a wooden handle. He inspects the blade against his palm, highlighting its sharpness. The man tests the knife by cutting through a piece of paper with printed designs, making several clean cuts. The voice narrates: \"After stropping, here is my mirror edge. A beautiful looking edge. I love 14C28N. Takes an incredibly sharp edge. And here's the results.\"\n\nFinally, the man continues to cut through the paper featuring images of various knives with different colored handles and prices. The smooth cuts and intricate shapes formed highlight the knife's sharpness. The voice concludes: \"Obviously a very very sharp edge. Now this paper is even thinner than the paper I tried to cut with the dull edge, so yeah, thank you for watching.\" As the man makes additional precise cuts, the sound of paper being ripped adds to the narrative, showcasing the efficiency of the sharpening process."
+ }
+ ],
+ "events": [
+ "Close-up of a man's hand holding a rectangular sharpening stone.",
+ "Man sharpens a knife by sliding its blade across the stone's dark gray surface.",
+ "Background includes a dark surface with some visible objects.",
+ "Audio reveals the man describing his process.",
+ "Man mentions sharpening 14C 28N steel, and dulling it first.",
+ "Man tests the knife's sharpness by slicing through a piece of white paper.",
+ "Shift to close-up of knife sharpening on a light gray, textured stone.",
+ "Background remains a dark surface, consistent motion captured.",
+ "Man discusses reprofiling on a 140 grit diamond plate at 17 degrees per side.",
+ "Close-up of sharpening on light gray stone.",
+ "Transition to a shot of a spray bottle labeled 'Neeve's Knives' with a skull graphic.",
+ "Man sprays foamy liquid onto the stone and continues sharpening.",
+ "Man explains using Veneve resin-bonded diamond stones.",
+ "Clip shows man switching to a red sharpening stone with same technique.",
+ "Sharpening motion on red stone covered in foamy liquid, transitions to yellow stone.",
+ "Man switches back to red stone, then uses a leather strop with wooden handle to refine edge.",
+ "Man mentions using diamond emulsion, Gunny Juice.",
+ "Pause with another man's voice heard.",
+ "Close-up of man's hand sharpening on a dark gray stone with wooden handle.",
+ "Man inspects blade against palm, highlighting sharpness.",
+ "Man tests knife by cutting through printed paper, making clean cuts.",
+ "Voice narrates showing the mirror edge and mentioning 14C28N steel.",
+ "Man cuts through paper featuring images of knives with different handles and prices.",
+ "Smooth cuts and intricate shapes demonstrate knife's sharpness.",
+ "Voice concludes thanking for watching, mentioning the thinness of paper.",
+ "Sound of paper being ripped enhances narrative of sharpening efficiency."
+ ]
+ },
+ {
+ "video": "video/109.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video features a podcast recording set in a studio with blue and purple lighting. The key participants are a man named John Cena, wearing a green t-shirt with 'USMC' printed on it, a second man with curly hair, and a woman with a necklace. The screen displays text corresponding to the real-time audio at the bottom throughout the video.\n\nThe video opens with John adjusting his headphones, with a 'John Cena' purple banner below, while the curly-haired man asks, \"What's it like to be that muscular and big?\" To which John responds, \"Oh, it's uh, just like everything else, just got a little more.\" The curly-haired man continues, \"Isn't it great though, when you walk down the street, you don't fear any man, do you?\" John answers, \"No.\" The exchange is a humorous one about the benefits and feelings of being muscular and strong.\n\nNext, the focus remains on John as he shares, \"I started working out when I was 12 years old.\" He explains getting beaten up daily, prompting his decision to work out, which is underscored by the onscreen texts 'got beat up a lot, right?' and 'exactly.' The narrative is visually supported by an old photograph of a young boy wearing a makeshift weightlifting belt and the texts 'got beat up every day' and 'in junior high', illustrating his challenging early years.\n\nIn the subsequent segment, the old photograph transitions back to the studio where John elaborates on his story. The phrases 'working out', 'hit the gym', and 'some self-defense' appear on the screen as he recounts how he began weightlifting and learning self-defense. There is also an icon of fitness equipment, accompanied by a sound. The curly-haired man listens attentively and the woman with the necklace asks, \"Was it easy to get big?\" John provides a candid \"No, no,\" accompanied by a \"like\" icon shown in the middle.\n\nThe narrative deepens as John reflects more on his commitment, revealing, \"It was a while. 
I started seriously working out at 12 years old.\" He talks about his dedication over decades, mentioning, \"will go to the gym every day,\" followed by \"A lot of people will go to the gym really strong, for a month or even six months or even a year,\" and concluding with 'I have been doing it for 20 years,' emphasizing his long-term dedication.\n\nFinally, a red 'Subscribe' button graphic appears, encouraging viewers to follow their content for more insights. When the mouse cursor clicks it, the icon turns gray.\n\nThe audio consistently follows the conversation among the three, reinforcing their dialogue about fitness, struggles, and dedication. The combination of their verbal exchanges, the supportive visual texts, and the old photograph harmoniously creates a narrative of perseverance and motivation in the arena of physical fitness."
+ }
+ ],
+ "events": [
+ "The video is set in a studio with blue and purple lighting.",
+ "John Cena is wearing a green t-shirt with 'USMC' printed on it.",
+ "There is a curly-haired man and a woman with a necklace present.",
+ "Texts corresponding to the real-time audio are displayed at the bottom.",
+ "John Cena adjusts his headphones.",
+ "A 'John Cena' purple banner is shown below.",
+ "The curly-haired man asks John, 'What's it like to be that muscular and big?'",
+ "John responds, 'Oh, it's uh, just like everything else, just got a little more.'",
+ "The curly-haired man continues, 'Isn't it great though, when you walk down the street, you don't fear any man, do you?'",
+ "John answers, 'No.'",
+ "Humorous exchange about the benefits and feelings of being muscular and strong ensues.",
+ "John shares, 'I started working out when I was 12 years old.'",
+ "John explains getting beaten up daily, prompting his decision to work out.",
+ "Texts appear on the screen: 'got beat up a lot, right?' and 'exactly.'",
+ "An old photograph of a young boy wearing a makeshift weightlifting belt is shown.",
+ "Texts 'got beat up every day,' and 'in junior high' appear next to the photograph.",
+ "Photograph transitions back to the studio.",
+ "John elaborates on his story about working out and learning self-defense.",
+ "Screen texts: 'working out', 'hit the gym', and 'some self-defense' appear.",
+ "An icon of fitness equipment appears with a sound.",
+ "The curly-haired man listens attentively.",
+ "The woman with the necklace asks, 'Was it easy to get big?'",
+ "John candidly responds, 'No, no.'",
+ "A 'like' icon is shown in the middle.",
+ "John reflects on his commitment to working out.",
+ "John says, 'It was a while. I started seriously working out at 12 years old.'",
+ "John talks about his dedication over decades.",
+ "John mentions going to the gym every day, while a lot of people only go hard for a month, six months, or even a year.",
+ "John concludes, 'I have been doing it for 20 years,' emphasizing long-term dedication.",
+ "A red 'Subscribe' button graphic appears on the screen.",
+ "When the mouse arrow is clicked, the 'Subscribe' icon turns gray.",
+ "Audio consistently follows their conversation about fitness, struggles, and dedication.",
+ "The verbal exchanges, visual texts, and the old photograph create a narrative of perseverance and motivation."
+ ]
+ },
+ {
+ "video": "video/110.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video features a yellow robot with a cylindrical body and a rectangular head, prominently set against a bright pink background. The robot's face, a black screen, displays large, expressive white eyes that convey emotions. It features flexible silver arms and antennas with a red ball on top. The robot's movements, animated and lively, involve waving, pointing, and placing its hand on its head, suggesting it is communicating in a playful and curious manner, while the background remains consistently pink to keep the focus on its expressions. \n\nAs the audio starts with a bell ringing, a man speaks in a robotic voice saying, \"It's time to learn with Plex. It's fun to put on your pajamas. It's fun to put on your pajamas. It's fun to put on your pajamas. And this is how you do it,\" leading into a catchy song with a man singing in a high-pitched voice: \"It's fun to put on your pajamas, it's fun to put on your pajamas, it's fun to put on your pajamas, and this is how we do it.\"\n\nThe scene transitions to a bright blue background featuring the number '1' in different colors stacked on top of each other, which then settles into a single large red number '1', as a group of children enthusiastically shouts the number. Accompanied by joyful music, a male voice guides children with the phrase, \"One! Stick your feet right through the holes. Stick your feet right through the holes.\" It then shows pairs of bare feet upon a bright blue background. A dark blue pajama, placed in front of the bare feet, is pointed out with white arrows. As a chorus of children echoes, \"Stick your feet right through the holes,\" the scene cuts to the pajama fully covering the feet. 
The sequence, from bare feet to blue pajama on, then replays.\n\nThe scene transitions to a bright blue background featuring the number '2' in different colors stacked on top of each other, which then settles into a single large green number '2', as the children enthusiastically shout the number. The section highlights a person lifting a leg while wearing the dark blue pajama from different angles, emphasizing the step of pulling the pants up as the pant leg changes from loose to tight. The male voice continues, \"Two! Pull your pants up high. Pull your pants up high,\" with children echoing, \"Pull your pants up high.\" The pulling-up sequence, from loose to tight, then replays.\n\nThe scene transitions back to a bright blue backdrop featuring the number '3' in different colors stacked on top of each other, which then settles into a single large blue number '3', as the children enthusiastically shout the number. The following sequence shows a boy, now in a white shirt, putting on the dark blue pajama: pulling it over his arm, sliding his arms through the holes, and pulling it snug. The voice commands, \"Three! Find the holes, pull your arms, and slide them right on through,\" followed by voices repeating the instructions. Background music plays as the visual instruction continues. The sequence of pulling the pajama over the arm then replays.\n\nNext, a bright blue backdrop features the number '4' in different colors stacked on top of each other, which then settles into a single large yellow number '4', as the children enthusiastically shout the number. This is followed by the boy zipping up the pajama from bottom to top and spreading out his arms with a broad smile. The voice instructs, \"Zip it up real tight, and now you're ready for bed.\" The boy then spreads his arms, displaying the fully zipped pajamas and projecting readiness for bed. The sequence of zipping the pajama up then replays. 
\n\nThe scene shifts back to the bright pink background, where the yellow robot with its expressive white-eyed screen and flexible silver arms continues to interact playfully. It moves its arms in different directions, waves, points, and places a hand on its head. In rhythmic sync with the visuals, the high-pitched male voice sings over an upbeat, fast-paced tune, \"It's fun to put on your pajamas. It's fun to put on your pajamas. And that's how you do it. Yay!\" The energetic and lively audio reinforces the playful tone established throughout the video."
+ }
+ ],
+ "events": [
+ "A yellow robot with a cylindrical body and a rectangular head is featured against a bright pink background.",
+ "The robot's face, a black screen, displays large, expressive white eyes.",
+ "The robot has flexible silver arms and antennas with a red ball on top.",
+ "The robot's movements are animated and lively, including waving, pointing, and placing its hand on its head.",
+ "The background remains consistently pink throughout to keep focus on the robot's expressions.",
+ "A bell rings, and a man speaks in a robotic voice saying, 'It's time to learn with Plex. It's fun to put on your pajamas. It's fun to put on your pajamas. It's fun to put on your pajamas. And this is how you do it.'",
+ "A catchy song begins with a man singing in a high-pitched voice: 'It's fun to put on your pajamas, it's fun to put on your pajamas, it's fun to put on your pajamas, and this is how we do it.'",
+ "The scene transitions to a bright blue background with the number '1' in different colors stacked on top of each other, then settling into a large red number '1'.",
+ "A group of children enthusiastically shouts the corresponding number, 'One!'",
+ "A voice guides, 'Stick your feet right through the holes.'",
+ "Pairs of bare feet appear against a bright blue background.",
+ "A dark blue pajama is shown in front of the bare feet and pointed out with white arrows.",
+ "Children chorus, 'Stick your feet right through the holes.'",
+ "The scene shows pajama fully covering the feet, and this sequence replays.",
+ "The scene transitions to a bright blue background with the number '2' in different colors stacked on top of each other, then settling into a large green number '2'.",
+ "A group of children enthusiastically shouts the corresponding number, 'Two!'",
+ "The section shows a person lifting a leg while wearing the dark blue pajama, emphasizing the step of pulling pants up from loose to tight.",
+ "The voice says, 'Two! Pull your pants up high.'",
+ "Children echo, 'Pull your pants up high.'",
+ "The scene of pulling pants up from loose to tight replays.",
+ "The scene transitions to a bright blue background with the number '3' in different colors stacked on top of each other, then settling into a large blue number '3'.",
+ "A group of children enthusiastically shouts the corresponding number, 'Three!'",
+ "The following sequence shows a boy, now in a white shirt, pulling on the dark blue pajama: pulling it over his arm, sliding his arms through the holes, and pulling it snug.",
+ "The voice commands, 'Three! Find the holes, pull your arms, and slide them right on through.'",
+ "Voices repeat the instructions as background music plays.",
+ "The sequence of pulling the pajama over the arm replays.",
+ "A bright blue backdrop features the number '4' in different colors stacked on top of each other, then settling into a large yellow number '4'.",
+ "A group of children enthusiastically shouts the corresponding number, 'Four!'",
+ "The boy zips up the pajama from bottom to top and spreads out his arms in a broad smile.",
+ "The voice instructs, 'Zip it up real tight, and now you're ready for bed.'",
+ "The boy spreads his arms, displaying the fully zipped pajamas, and projecting readiness for bed.",
+ "The sequence of zipping the pajama up replays.",
+ "The scene shifts back to the bright pink background with the yellow robot interacting playfully.",
+ "The robot moves its arms in different directions, waves, points, and places a hand on its head.",
+ "The background music features a high-pitched male voice singing, 'It's fun to put on your pajamas. It's fun to put on your pajamas. And that's how you do it. Yay!'",
+ "The energetic audio reinforces the playful tone established throughout the video."
+ ]
+ },
+ {
+ "video": "video/111.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video commences with a man clad in a red jacket and a gray beanie standing in a snowy landscape. Snow-covered trees form a serene backdrop under a cloudy sky. Text appears on the screen, identifying 'CHRIS PAULSON' in the top right corner and 'CTV NEWS' in the bottom right corner. Transitioning to a close-up, the man is now holding a lynx, a wild cat, as the text 'BURNS LAKE, B.C.' flashes in the top left corner. The man converses with the lynx, saying, as matching text appears on screen, \"Let's go see the damage you did, buddy.\" The lynx appears calm, listening as the man continues, asking, \"Yeah? Should we go see the damage you did? How many chickens did you get?\" The audio complements this visual with soft music playing in the background, and the occasional clucking of chickens can be heard.\n\nThe video extends this interaction, showing the man still holding the lynx in the same snowy setting. The man addresses the lynx, saying, with matching text, \"You got some of our new ones. Not good is it? No?\" and adds, \"No, it's not good!\" The lynx remains calm and inquisitive. Background sounds include soothing music, reinforcing the tranquil yet solemn mood.\n\nNext, we continue watching the man with the lynx amidst the unchanging snowy backdrop. The text identifiers are still present. The man says to the lynx, with matching text, \"First the owl and then you, huh?\" and shortly after adds, \"Let's go have a look at this damage.\" The lynx's demeanor stays unperturbed.\n\nIn the following scene, the focus changes to the aftermath of the lynx's actions. The ground outside a building displays a patch of blood and feathers, indicating a recent attack. The camera shifts to show more blood and feathers inside a building. The man with the lynx points these out, saying, with matching text, \"There's one,\" and emphasizing further with, \"See how upset you made everyone? There's two. 
That's two of our new chickens.\" Background sounds include the clucking of chickens, supplemented with soft music.\n\nAs the video progresses, the man with the lynx approaches a group of chickens near a building. The man comments, with matching text, \"These guys are pretty upset over there,\" as the camera captures the chickens amidst the snow. The scene transitions, showing the man and the lynx back outside the building, where the text 'SGT. RON LEBLANC' and 'B.C. CONSERVATION OFFICER SERVICE' appears, followed by a voice, with matching text: \"There could've been a couple enforcement actions, but none were taken in this instance.\" The lynx stays calm, and their interaction remains tender. The background audio features more clucking sounds and the man's voice explaining the lenient approach adopted for the incident.\n\nConcluding the video, a close-up reveals the man still holding the lynx. The man, reflecting on the situation, says, with matching text, \"We're just glad that this incident didn't end in a different way.\" The snowy setting remains, marking a consistent theme throughout the clips. The soundtrack consists of gentle music, aligning with the man's thoughtful remarks, ending the narrative on a note of relief and contemplation."
+ }
+ ],
+ "events": [
+ "The video begins with a man in a red jacket and gray beanie standing in snowy landscape.",
+ "Snow-covered trees form serene backdrop under cloudy sky.",
+ "Text appears identifying 'CHRIS PAULSON' in the top right corner and 'CTV NEWS' in the bottom right corner.",
+ "Transition to a close-up of the man holding a lynx.",
+ "Text 'BURNS LAKE, B.C.' appears in the top left corner.",
+ "The man converses with the lynx, saying, 'Let's go see the damage you did, buddy.'",
+ "Lynx appears calm as the man asks, 'Yeah? Should we go see the damage you did? How many chickens did you get?'",
+ "Soft music and occasional clucking of chickens are heard.",
+ "The man continues holding the lynx in the same snowy setting.",
+ "The man addresses the lynx, saying, 'You got some of our new ones. Not good is it? No?'",
+ "The man adds, 'No, it's not good!'",
+ "Background includes soothing music to complement the tranquil mood.",
+ "The text identifiers are still present.",
+ "The man says to the lynx, 'First the owl and then you, huh?'",
+ "The man adds, 'Let's go have a look at this damage.'",
+ "The lynx's demeanor remains calm and inquisitive.",
+ "Focus shifts to the aftermath of the lynx's actions outside a building.",
+ "A patch of blood and feathers is visible on the ground, indicating an attack.",
+ "Camera shifts to more blood and feathers inside a building.",
+ "The man with the lynx points out, 'There's one,' and emphasizes, 'See how upset you made everyone? There's two. That's two of our new chickens.'",
+ "Background sounds include chicken clucking and soft music.",
+ "The man with the lynx approaches a group of chickens near a building.",
+ "The man comments, 'These guys are pretty upset over there.'",
+ "Camera captures the chickens amidst the snow.",
+ "Scene transitions showing the man and the lynx outside the building.",
+ "Text 'SGT. RON LEBLANC' and 'B.C. CONSERVATION OFFICER SERVICE' appears.",
+ "A voiceover text says, 'There could've been a couple enforcement actions, but none were taken in this instance.'",
+ "The lynx remains calm during the interaction.",
+ "Background audio includes more clucking sounds and man's voice explaining lenient approach.",
+ "A close-up shows the man still holding the lynx, reflecting on the situation.",
+ "The man says, 'We're just glad that this incident didn't end in a different way.'",
+ "The snowy setting remains consistent throughout the video.",
+ "Gentle music aligns with the man's thoughtful remarks, ending on a note of relief and contemplation."
+ ]
+ },
+ {
+ "video": "video/112.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a scene where a man stands by a window overlooking a building. \"Last Purchase:\" text is displayed at the top. The man is wearing a gray shirt that reads \"Mentally Gone\". Another man's voice says, \"Hey, can you see what my son has been buying?\" as purple text appears at the bottom. The camera zooms in, and the text \"You got a picture of him?\" appears in white. The man holds up a phone, revealing \"Last Purchase: Vapes\" in red at the top. He then says, \"Vapes.\" Another man reacts with a surprised voice as the text changes to \"What arr im gonna kill him\" in purple, followed by \"Frankly come here I'm spanking your ass.\" The man ends with, \"OMG.\"\n\nThe following scene transitions to a black screen with \"Last Purchase: Vapes\" in white and red text. It then shows the man with curly hair, standing in front of a window overlooking buildings. \"Last Purchase:\" is still shown with a loading icon. The man's gray shirt is visible as \"I wonder what things Mrs. Wilson buys\" is displayed in blue at the bottom. The man then answers, \"I have no idea.\" Another man then says, \"Dude just use your power!\" Then the man touches his temple. \"Last Purchase: The shaft 69 3,000\" flashes in red, and the man reacts with a black \"What!\" followed by \"What what is it?\" said by another man and \"The shaft 693000\" in blue and black text as he responds.\n\nThe bottom text in blue reads, \"Ohh those are the best!\" as the man makes a confused face. The scene shifts again to a black screen displaying \"Last Purchase: The shaft 69 3,000\" in white and red; it then returns to the man, now with the text, \"Yo you're that freak that has the weird power right? I bet you 100$ you can't guess what my mom's last purchase was...\" in orange. 
Wearing a gray shirt and a black jacket, the man says, \"Fish & chips...she's in England huh\" while holding a picture and touching his temple, and \"Last Purchase: Fish & Chips\" appears in red.\n\nThe text \"Last Purchase: Fish & Chips\" remains in red, with \"What she's on her business trip there!\" in orange at the bottom. The man, in a gray shirt and a black jacket, gets the 100$ and smiles. The text \"Can I next!\" appears in green at the bottom of the screen. The scene changes to the man holding a book titled \"365 Days of Business Ideas\". The text \"Hey can you help me my dad hasn't texted me all day. Do you need a photo of him?\" appears in pink, then the man in gray says, \"Nah I know his face!\" in black while touching his temple, and a loading icon shows.\n\nThe text \"Last Purchase: Casket\" is in red at the top. The man touches his temple, \"*Phone ringing*\" appears in white and black at the bottom, transitioning to \"Hello dad?\" in red. The text changes to \"Hello honey your mother, she passed away\" in brown and \"I'm sorry\" in black. The man looks upset. The scene changes to the man holding a book, and the text \"Police!\" followed by \"What I didn't do anything wrong!\" appears in black, ending with \"Relax kid the chief wants to talk to you\" in yellow, with a loading icon above.\n\nIn the final scene, the man with the gray shirt stands by the window, distressed. \"Last Purchase:\" is displayed atop the scene, with a loading icon. The bottom text in blue reads, \"My son went missing last night can you help me track him down!\" transitioning to \"I'll do my best!\" in black as the man holds up a photo and touches his temple. \"What? It's not working!\" appears in black, and the clip concludes with the man looking confused, holding a pink piece of paper with a stickman on it."
+ }
+ ],
+ "events": [
+ "Man stands by a window overlooking a building.",
+ "'Last Purchase:' text is displayed at the top.",
+ "Man is wearing a gray shirt that reads 'Mentally Gone'.",
+ "Another man's voice says, 'Hey, can you see what my son has been buying?'",
+ "Purple text appears at the bottom.",
+ "Camera zooms in and the text 'You got a picture of him?' appears in white.",
+ "Man holds up a phone showing 'Last Purchase: Vapes' in red at the top.",
+ "Man says, 'Vapes.'",
+ "Another man reacts with a surprised voice.",
+ "Text changes to 'What arr im gonna kill him' in purple.",
+ "Text 'Frankly come here I'm spanking your ass.' appears.",
+ "The man ends with, 'OMG.'",
+ "Scene transitions to a black screen with 'Last Purchase: Vapes' in white and red text.",
+ "Man with curly hair stands in front of window overlooking buildings.",
+ "'Last Purchase:' is still shown with a loading icon.",
+ "Man's gray shirt is visible.",
+ "Text 'I wonder what things Mrs. Wilson buys' is shown in blue at the bottom.",
+ "Man answers, 'I have no idea.'",
+ "Another man says, 'Dude just use your power!'",
+ "Man touches his temple.",
+ "'Last Purchase: The shaft 69 3,000' flashes in red.",
+ "Man reacts with a black 'What!'",
+ "Another man says, 'What what is it?'",
+ "Text 'The shaft 693000' appears in blue and black as he responds.",
+ "Bottom text in blue reads, 'Ohh those are the best!'",
+ "Man makes a confused face.",
+ "Scene shifts to black screen displaying 'Last Purchase: The shaft 69 3,000' in white and red.",
+ "Scene returns to the man.",
+ "Text 'Yo you're that freak that has the weird power right? I bet you 100$ you can't guess what my mom's last purchase was...' appears in orange.",
+ "Man, wearing a gray shirt and black jacket, says, 'Fish & chips...she's in England huh' while holding a picture and touching his temple.",
+ "'Last Purchase: Fish & Chips' appears in red.",
+ "Text 'Last Purchase: Fish & Chips' remains in red.",
+ "Text 'What she's on her business trip there!' appears in orange at the bottom.",
+ "Man, in gray shirt and black jacket, gets the 100$ and smiles.",
+ "Text 'Can I next!' appears in green on the bottom of the screen.",
+ "Scene changes to man holding a book titled '365 Days of Business Ideas'.",
+ "Text 'Hey can you help me my dad hasn't texted me all day. Do you need a photo of him?' appears in pink.",
+ "Man in gray says, 'Nah I know his face!' in black while touching his temple.",
+ "A loading icon shows.",
+ "Text 'Last Purchase: Casket' is in red at the top.",
+ "Man touches his temple.",
+ "*Phone ringing* appears in white and black at the bottom.",
+ "Transition to 'Hello dad?' in red.",
+ "Text changes to 'Hello honey your mother, she passed away' in brown and 'I'm sorry' in black.",
+ "Man looks upset.",
+ "Scene changes to man holding a book.",
+ "Text 'Police!' followed by 'What I didn't do anything wrong!' appears in black.",
+ "Text 'Relax kid the chief wants to talk to you' appears in yellow with a loading icon above.",
+ "Man with gray shirt stands by window, distressed.",
+ "'Last Purchase:' is displayed atop the scene with a loading icon.",
+ "Bottom text in blue reads, 'My son went missing last night can you help me track him down!'",
+ "Text transitions to 'I'll do my best!' in black as the man holds up a photo and touches his temple.",
+ "Text 'What? It's not working!' appears in black.",
+ "Clip concludes with man looking confused, holding a pink piece of paper with a stickman on it."
+ ]
+ },
+ {
+ "video": "video/113.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video features an incident at a covered parking area or garage, where police officers assess the legality of a car's window tint.\n\nA police officer, dressed in a high-visibility yellow vest with reflective strips over a black uniform, and a white hat with a checkered band, stands beside a black car. He is handling a small device. The background contains various equipment and another police vehicle marked with yellow and blue. The atmosphere is filled with natural light streaming through the open side of the garage. The audio captures an officer speaking to someone named Owen: \"Right, Owen, so I see you have a tint reader in your hands.\" Owen confirms, and the first officer continues, \"Right, fantastic. I think it's about time you put it onto that chameleon tint and tell us whether Shogun Customs have done a good job.\"\n\nThe scene remains in the parking area, with the officer now making hand gestures while demonstrating the device, which prominently displays the number '100'. The officer states, \"Do you know what? I'm really hoping it fails because it's horrible. Right. Just calibrating the tint meter. There we go. You can see it's on 100, which means it's working perfectly.\"\n\nThe scene transitions as the officer places the device on the car's dashboard through the open window, showing the '100' display. The conversation continues, \"Let's stick it on and moment of truth. What do you reckon?\" Another voice replies, \"I can see 79 from here. I can.\"\n\nA close-up shows the device on the dashboard once more, the number '100' visible. Adjustments are made, then the scene broadens to display the officer leaning through the car window to adjust the device. The exchange proceeds, \"And see 79 from here.\" The voice outside the frame says, \"And that means...\" The officer adjusts the device further, with another voice inquiring, \"Well, hold on. 
Let's just put it in another place.\" The officer affirms, \"Right, checking another time, are we? 78, 79,\" leading to shared laughter.\n\nFinally, the view shifts to inside the car through the windshield, as the officer continues adjusting the device on the dashboard. The background remains detailed with equipment and a marked police vehicle. The officer asserts, \"That pretty much means that GCM's chameleon tint is road legal.\" The police officer then says, \"It means I can't take it off, which is a real shame because I was looking forward to that.\" The second man speaks again, marking a cohesive narrative that concludes with the realization that the vehicle's tint complies with legal standards, ending on a note of mild disappointment from the officers regarding their inability to remove the tint."
+ }
+ ],
+ "events": [
+ "Police officers are in a covered parking area assessing a car's window tint.",
+ "A police officer wears a high-visibility yellow vest with reflective strips, a black uniform, and a white hat with a checkered band.",
+ "The officer stands beside a black car, handling a small device.",
+ "The background includes various equipment and another police vehicle marked with yellow and blue.",
+ "Natural light streams through the open side of the garage.",
+ "Audio of an officer speaking to someone named Owen: 'Right, Owen, so I see you have a tint reader in your hands.'",
+ "Owen confirms having the tint reader.",
+ "The first officer says: 'Right, fantastic. I think it's about time you put it onto that chameleon tint and tell us whether Shogun Customs have done a good job.'",
+ "The officer demonstrates the device which shows the number '100'.",
+ "The officer comments that he hopes the tint fails because it's horrible, then calibrates the tint meter.",
+ "The officer places the device on the car's dashboard through the open window; the display shows '100'.",
+ "Conversation continues: 'Let\u2019s stick it on and moment of truth. What do you reckon?' Another voice says, 'I can see 79 from here. I can.'",
+ "Close-up of the device showing '100'.",
+ "Adjustments are made to the device by the officer leaning through the car window.",
+ "Another voice states, 'And that means...' with inquiries about rechecking.",
+ "Officer confirms rechecking: 'Right, checking another time, are we? 78, 79,' leading to shared laughter.",
+ "View shifts to inside the car through the windshield, background detailed with equipment and marked police vehicle.",
+ "Officer adjusts the device on the dashboard, concluding: 'That pretty much means that GCM's chameleon tint is road legal.'",
+ "Officer expresses mild disappointment: 'It means I can\u2019t take it off, which is a real shame because I was looking forward to that.'",
+ "The second man speaks again, marking the narrative's conclusion about the legality of the vehicle's tint."
+ ]
+ },
+ {
+ "video": "video/114.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a clear glass mug embossed with the text 'COUNTY FAIR DRINKING JAR' and an image of a rooster. A hand holds the mug over a speckled white countertop, with two yellow paper cups with brown bases in the background. The narration starts with a man saying, \"Try to keep track of which cup has the water, and I promise there's no editing on this video.\" As he speaks, the hand pours clear water from the glass mug into the right paper cup, partially filling it. The hand then places the glass mug down and adjusts the positions of the two paper cups, moving them closer together. \"So first I'm gonna hit you with the little switch swoop, and then the swap swap, then I'm gonna pull it back\", the man continues, followed by the sound of the cups being scraped against a surface. The clip concludes with the hands holding both cups and swapping their positions.\n\nThe scene transitions to a close-up of the two yellow paper cups on the same countertop. Two hands adjust the cups, swapping their positions. The man speaks, \"Double reverse. And now you might think you know what's going on, but ready, boom, boom. And so now you're probably pretty lost. So pause now and select your cup.\" As he speaks, the sound of the cups being scraped against a surface is heard in the background. The hands place the cups down side by side on the countertop and release them, emphasizing their positions.\n\nNext, a close-up of the two yellow paper cups is shown on the speckled white countertop. A hand lifts the right cup, turns it around to show there is nothing inside, and then places it back down next to the other cup. Another hand lifts the left cup, turns it around to show there is nothing inside, and then places it back down. A hand then enters the frame holding a silver packet labeled, 'SCIENCE GONE FUN! Super Absorbent Diaper Polymer Non-Toxic 100% Biodegradable Multipurpose Educational Science Projects.' The packet is opened, revealing a white powder inside. 
The man continues speaking, \"Water falls out of it, but also no water falls out of this one, so what happened? Well, like I said, there was no editing, but that didn't mean there weren't any tricks. I put some of this super absorbent material inside of one of the cups.\" The hand pours the white powder into the right cup. The clip concludes with the cup now filled with the white powder.\n\nOne hand lifts the cup, turning it around, but nothing pours out. The hand uses a spoon to stir the contents, transforming the powder into a gel-like substance. The man explains, \"So when I poured the water in, it absorbed the water and so it stays when I turn the cup over. As the package said, this is a diaper polymer which is the same thing used in diapers to help clean up after accidents.\" The scene concludes with a close-up of a diaper placed on the countertop, suggesting the next activity.\n\nIn the final scene, a close-up of the diaper placed on the countertop is shown. A pair of hands uses scissors to cut open the diaper, revealing its inner contents. The narration continues, \"As you can see, if I pour water in the diaper, it absorbs it and helps keep the baby dry.\" One hand holds a clear glass mug, similar to the one seen in the first clip, and pours water onto the exposed inner material of the diaper. The water quickly gets absorbed by the material. The hand continues to pour water until the material is fully saturated. The clip concludes with the hand using a spoon to stir the now gel-like substance inside the diaper, effectively demonstrating the absorbent properties of the super absorbent polymer.\n\nOverall, the video demonstrates a magic-like trick involving cups and a super absorbent polymer to engage the audience and educate them about the science behind diaper technology."
+ }
+ ],
+ "events": [
+ "Video begins with a close-up of a clear glass mug with text 'COUNTY FAIR DRINKING JAR' and an image of a rooster.",
+ "Hand holds the mug over a speckled white countertop.",
+ "Two yellow paper cups with brown bases are in the background.",
+ "Narration starts with a man saying to keep track of which cup has the water and no editing is involved.",
+ "Hand pours clear water from the glass mug into the right paper cup, partially filling it.",
+ "Hand places the glass mug down and adjusts the positions of the two paper cups, moving them closer together.",
+ "Narrator explains trick moves: 'little switch swoop', 'swap swap', 'pull it back'.",
+ "Sound of the cups being scraped against a surface.",
+ "Hands hold the cups and swap their positions.",
+ "Scene transitions to a close-up of two yellow paper cups on the countertop.",
+ "Two hands adjust the cups and swap their positions.",
+ "Narrator describes more trick moves: 'Double reverse. And now you might think you know what's going on, but ready, boom, boom. Now you're probably pretty lost. So pause now and select your cup.'",
+ "Sound of cups being scraped against a surface.",
+ "Hands place the cups down side by side on the countertop and release them.",
+ "Close-up of the two yellow paper cups on the countertop.",
+ "Hand lifts the right cup, turns it around to show nothing is inside, then places it back down.",
+ "Another hand lifts the left cup, turns it around to show nothing is inside, then places it back down.",
+ "Hand enters the frame holding a silver packet labeled 'SCIENCE GONE FUN! Super Absorbent Diaper Polymer Non-Toxic 100% Biodegradable Multipurpose Educational Science Projects'.",
+ "Packet is opened revealing white powder inside.",
+ "Narrator explains there's no editing but there were tricks involving super absorbent material inside one of the cups.",
+ "Hand pours the white powder into the right cup.",
+ "Cup now filled with white powder.",
+ "Hand lifts the cup, turns it around, but nothing pours out.",
+ "Hand uses a spoon to stir the content, transforming the powder into a gel-like substance.",
+ "Narrator explains the powder absorbs water, making it stay in the cup when turned over. It's a diaper polymer.",
+ "Scene concludes with a close-up of a diaper placed on the countertop.",
+ "Final scene shows a close-up of the diaper on the countertop.",
+ "Pair of hands use scissors to cut open the diaper, revealing its inner contents.",
+ "Narrator explains pouring water into the diaper to show it absorbs water and keeps the baby dry.",
+ "Hand holds clear glass mug and pours water onto the exposed inner material of the diaper.",
+ "Water quickly gets absorbed by the material.",
+ "Hand continues to pour water until the diaper material is fully saturated.",
+ "Hand uses a spoon to stir the now gel-like substance inside the diaper, demonstrating the absorbent properties."
+ ]
+ },
+ {
+ "video": "video/115.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a clean wooden cutting board on a kitchen counter, surrounded by jars and a toaster in the background. The text 'Anju Road' appears at the top of the screen. A man, wearing a ring, places a wrapped package on the cutting board and unwraps it to reveal a piece of steak. He sprinkles kosher salt over the meat, ensuring it's evenly seasoned. The man then rewraps the meat in the paper and places it in a refrigerator drawer filled with various food items. The drawer is closed. A man's voice narrates, \"How to cook the best steak of your life. Get yourself a nice cut. This is a ribeye. Kosher salt as soon as you get home. Seriously, this is super important. Don't skip this step. Wrap it up, put it in the fridge, only take it out about half-hour before you need it.\"\n\nThe scene shifts inside the refrigerator, showing items like a carton of eggs, a jar of chili paste, and a loaf of bread. The man takes the steak away. The view transitions to a kitchen stove where a man places a large black cast-iron griddle on the burners and adjusts the stove's temperature knobs to medium. The camera focuses on a hand turning on a ventilation fan above the stove, labeled 'NuTone.' The scene then jumps to the metal baking tray holding the steak from earlier. The man sprinkles more kosher salt and grinds black pepper over the meat for even seasoning. The man's voice continues, \"Get yourself cast iron, ideally. Turn it up to medium-high. Let it come up to heat and turn on the fan. You're going to need it. A little bit more kosher salt and freshly ground pepper before you cook. And say hi to your cat because she's going to want to eat the steak.\"\n\nNext, a gray cat is shown being held by a man in the kitchen, with the seasoned meat in the background on the baking tray and the black cast-iron griddle on the stove. The viewpoint transitions to the griddle, now empty, before the man places the seasoned meat onto it, causing a sizzle. 
Using tongs, the man adjusts the meat and flips it to cook evenly. Text appears on the screen, 'Flip every 20-30 seconds!'. The man's voice narrates, \"Once the pan is ripping hot and starting to smoke, you lay your ribeye down and we're going to do the pretty blasphemous thing here and we're going to flip it about 20 or 30 seconds in. The reason I'm doing this is because I'm slowly cooking it side to side as it goes through.\"\n\nThe man places the meat back onto the griddle, where it sizzles. He holds a digital thermometer reading 79.3\u00b0F over the meat, then inserts it, changing the temperature to 88.9\u00b0F. The man uses his fingers to press down on the meat, checking its doneness. The voice continues, \"Which will result in a more tender steak. Also, if you're not using a thermometer, you're cooking meat wrong. Keep flipping it. The reason we're doing this is it will develop a crust.\"\n\nThe scene transitions back to the meat sizzling on the griddle. The man uses tongs and a knife to cut the meat, checking its doneness. He holds a digital thermometer over the meat, reading 117.8\u00b0F, which changes when inserted, eventually reading 113.3\u00b0F. The man adjusts the meat with tongs. The man's voice adds, \"But it's really important that each piece gets cooked to the proper temperature. This is a ribeye and the cap. Obviously, they cook at different temperatures, so I separated them. Keep flipping them. I know it sounds crazy, but you're going to have the most tender steak of your life.\" \n\nThe final scene focuses again on the sizzling, browning meat on the griddle. The man flips the meat with tongs, ensuring even cooking. The view shifts to a wooden cutting board where two cooked meat pieces are placed. A digital thermometer reads 132.8\u00b0F when inserted into one of the meat pieces, which changes to 136.7\u00b0F. The man cuts into the meat, revealing a juicy, pink interior, signifying a medium-rare doneness. The man examines the slice closely. 
The man's voice concludes, \"Pull it off, let it rest, it'll come up a few degrees, slice, and look at this, the most beautiful, tender, pull-apart steak you've ever had.\""
+ }
+ ],
+ "events": [
+ "Video opens with a clean wooden cutting board on a kitchen counter.",
+ "Background shows jars and a toaster.",
+ "Text 'Anju Road' appears at the top of the screen.",
+ "A man, wearing a ring, places a wrapped package on the cutting board and unwraps it to reveal a piece of steak.",
+ "Man sprinkles kosher salt over the meat, ensuring it's evenly seasoned.",
+ "Man rewraps the meat in the paper.",
+ "Man places the wrapped meat in a refrigerator drawer filled with various food items.",
+ "Drawer is closed.",
+ "A man's voice narrates the steps on how to cook the steak.",
+ "View shifts inside the refrigerator, showing items like a carton of eggs, a jar of chili paste, and a loaf of bread.",
+ "Man takes the steak away.",
+ "View transitions to a kitchen stove where a man places a large black cast-iron griddle on the burners.",
+ "Man adjusts the stove's temperature knobs to medium.",
+ "Camera focuses on a hand turning on a ventilation fan labeled 'NuTone' above the stove.",
+ "Scene jumps to a metal baking tray holding the steak.",
+ "Man sprinkles more kosher salt and grinds black pepper over the meat for even seasoning.",
+ "Narration continues with more instructions.",
+ "Gray cat is shown being held by a man in the kitchen.",
+ "Seasoned meat is visible in the background on the baking tray and the black cast-iron griddle on the stove.",
+ "View transitions to the empty griddle.",
+ "Man places the seasoned meat onto the griddle, causing a sizzle.",
+ "Using tongs, the man adjusts the meat and flips it to cook evenly.",
+ "Text appears on screen: 'Flip every 20-30 seconds!'",
+ "Narration explains the flipping process and reasoning behind it.",
+ "Man places the meat back onto the griddle, causing it to sizzle.",
+ "Man holds a digital thermometer reading 79.3\u00b0F over the meat, then inserts it, changing the temperature to 88.9\u00b0F.",
+ "Man uses his fingers to press down on the meat, checking its doneness.",
+ "Narration continues providing cooking tips.",
+ "Scene transitions back to the meat sizzling on the griddle.",
+ "Man uses tongs and a knife to cut the meat, checking its doneness.",
+ "Man holds a digital thermometer over the meat, reading 117.8\u00b0F, which changes when inserted, eventually reading 113.3\u00b0F.",
+ "Man adjusts the meat with tongs.",
+ "Narration explains the importance of cooking each piece to the proper temperature.",
+ "Final scene focuses on the sizzling, browning meat on the griddle.",
+ "Man flips the meat with tongs, ensuring even cooking.",
+ "View shifts to a wooden cutting board where two cooked meat pieces are placed.",
+ "Digital thermometer reads 132.8\u00b0F when inserted into one of the meat pieces, which changes to 136.7\u00b0F.",
+ "Man cuts into the meat, revealing a juicy, pink interior signifying medium-rare doneness.",
+ "Man examines the slice closely.",
+ "Narration concludes, emphasizing the results of a beautiful, tender, pull-apart steak."
+ ]
+ },
+ {
+ "video": "video/116.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video portrays a series of informational clips relating to Vitamin B12 and its importance, featuring medical professionals.\n\nTwo men in blue medical scrubs, green surgical caps, and with identification badges around their necks, stand in a medical or surgical room. The man on the right points to a bottle of Jamieson Time Release B12 vitamins on the screen, highlighted with a red arrow. A close-up reveals the text 'Dr. P Zalzal' on the left man's scrub. The left man speaks. The focus then shifts to the man on the right, who is shown speaking and gesturing, with the yellow text 'B12' and 'vitamin' appearing. The final visual shows a girl with long hair against a yellow background, holding her head with one hand.\n\nThe audio reveals an informative dialogue on B12 deficiency: \"B12, you're probably deficient. Hey, hey, don't be so judgy. I mean, I eat some meat, I'm okay. Right, the trouble is B12 is a really critical vitamin.\"\n\nNext, Dr. B Weening, a medical professional in blue scrubs with a green lanyard, is identified with the text 'Dr. B Weening' on the scrub as he speaks. A woman in a white turtleneck sweater is seen in a kitchen, using a pepper grinder over a bowl of salad. Following this, a close-up of a surgical procedure appears. Another scene displays a woman with short hair in a blue checkered shirt holding a glass of orange juice. The scene circles back to the medical room, showing Dr. P Zalzal speaking.\n\nThe audio discussion continues with information on at-risk groups: \"And a lot of people actually are deficient the groups that are most at risk are vegans and vegetarians, people that have bariatric surgery, and older people, over 50 and over 65. ...\"\n\nA medical professional in blue scrubs and a green surgical cap is speaking while gesturing with both hands. The scene transitions to a close-up of various sliced meats like prosciutto and salami, alongside a green grape and a basil leaf. Dr. 
B Weening reappears speaking, with the word 'supplement' in yellow. The final frame shifts back to Dr. P Zalzal speaking.\n\nThe audio gives guidance on supplement intake: \"Hey, pretend there's a mirror right here. But if you are not getting enough through your meat or fortified cereal or almond milk, consider a supplement. And a supplement should be 50 to 100 micrograms per day or 1000 to 2000 micrograms per week.\"\n\nDr. P Zalzal is seen speaking. He continues speaking. Subsequent frames show him gesturing with both hands. Dr. B Weening then speaks. Dr. B Weening gestures with one hand. The final frames show Dr. P Zalzal speaking.\n\nThe audio elaborates on proper supplement intake: \"The math doesn't work but just trust us. Make sure you chew it or put it underneath your tongue do not swallow it it just won't work you are in charge of your own health.\"\n\nDr. B Weening speaks. The scene transitions to the same man. Throughout, a banner indicates the YouTube channel 'Talking With Docs' with 314,289 total subscribers.\n\nThe audio message concludes with a call to action: \"If you like this video, please like and subscribe to our channel and check out our long form content.\""
+ }
+ ],
+ "events": [
+ "Two men in blue medical scrubs, green surgical caps, and with identification badges stand in a medical or surgical room.",
+ "The man on the right points a bottle of Jamieson Time Release B12 vitamins on the screen, highlighted with a red arrow.",
+ "A close-up reveals the text 'Dr. P Zalzal' on the left man's scrub.",
+ "The left man speaks.",
+ "The focus shifts to the man on the right who speaks and gestures.",
+ "Yellow text 'B12' and 'vitamin' appear.",
+ "A girl with long hair against a yellow background holds her head with one hand.",
+ "Informative dialogue on B12 deficiency: 'B12, you're probably deficient. Hey, hey, don't be so judgy. I mean, I eat some meat, I'm okay. Right, the trouble is B12 is a really critical vitamin.'",
+ "Dr. B Weening, in blue scrubs with a green lanyard, is identified with text 'Dr. B Weening' on the scrub while he speaks.",
+ "A woman in a white turtleneck sweater uses a pepper grinder over a bowl of salad in a kitchen.",
+ "A close-up of a surgical procedure.",
+ "A woman with short hair in a blue checkered shirt holds a glass of orange juice.",
+ "The scene returns to the medical room showing Dr. P Zalzal speaking.",
+ "Audio information on at-risk groups: 'And a lot of people actually are deficient the groups that are most at risk are vegans and vegetarians, people that have bariatric surgery, and older people, over 50 and over 65.'",
+ "A medical professional in blue scrubs and a green surgical cap speaks while gesturing with both hands.",
+ "A close-up of various sliced meats like prosciutto and salami, alongside a green grape and a basil leaf.",
+ "Dr. B Weening reappears speaking with the word 'supplement' in yellow.",
+ "The final frame returns to Dr. P Zalzal speaking.",
+ "Audio guide on supplement intake: 'Hey, pretend there's a mirror right here. But if you are not getting enough through your meat or fortified cereal or almond milk, consider a supplement. And a supplement should be 50 to 100 micrograms per day or 1000 to 2000 micrograms per week'",
+ "Dr. P Zalzal is seen speaking.",
+ "Dr. P Zalzal continues speaking and gestures with both hands.",
+ "Dr. B Weening speaks.",
+ "Dr. B Weening gestures with one hand.",
+ "The final frames show Dr. P Zalzal speaking.",
+ "Audio elaborates on proper supplement intake: 'The math doesn't work but just trust us. Make sure you chew it or put it underneath your tongue do not swallow it it just won't work you are in charge of your own health.'",
+ "Dr. B Weening speaks.",
+ "The scene transitions to Dr. B Weening speaking.",
+ "Throughout a banner indicates the YouTube channel 'Talking With Docs' with 314,289 total subscribers.",
+ "Audio concludes with a call to action: 'If you like this video, please like and subscribe to our channel and check out our long form content.'"
+ ]
+ },
+ {
+ "video": "video/117.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video exhibits a step-by-step tutorial on automotive maintenance. A take on car headlight maintenance, though technically nonsensical. Backdropped with consistent music, it features a persistent narrative flow with a man and an intermittent companion conversing about the headlight's condition and maintenance tasks.\n\nA man in a gray hoodie and black shorts scrutinizes the front headlight of a red car with its hood open, pointing and saying, highlighted in yellow text on the screen. The man mentions the headlight being a bit oily. The audio reveals a conversational tone with the text, \"Yo, I noticed my blinker fluid is getting a bit oily... Really? I'm impressed that you even noticed. Most people don't even check until it's too late. Let me check it out though.\" Another man in a gray t-shirt approaches and starts talking about his observations. The scene revisits the headlight close-up. \"Yeah, you're right. It's time to change that blinker fluid.\"\n\nThe scene enlarges, including a wider view of the man now seated on a stool beside the red car with its hood ajar. Gesturing toward the headlight. He outlines the process of needing in the audio and the text, \"So, what you want to do is drain everything out of this first. Now you may need to remove the entire headlight and in my case, I need to remove the bumper to access the bolts for the headlight.\"\n\nA man now dressed casually in black shorts, gray t-shirt, and a black cap, diligently works on removing the car's bumper, proceeding to the headlight. Leading visually to the draining of the fluid into a container with the audio and the text, \"You want to remove the headlight and dump out all the fluid. You can even give it a rinse if you want to. 
Now once that's all drained, you can put the headlight back.\" His careful steps and instructions further illustrated by on-screen texts, the actions syncing fluidly with the voiceover.\n\nIn the next scene, the same man is kneeling and subsequently introduces tools like a red funnel and a clear tube aiding in the replenishment process. A close-up portrayal of a headlight filled with red liquid after showing a white cup with the text 'this is a joke' on it. The audio states with the text, \"Now once you've got the headlight back in, you can top up the blinker fluid. Now, the amount of blinker fluid that you should add is different for every car.\"\n\nClosing the instructional satirical video, the man remains seated, articulating and gesturing, revealing final commentary, explaining, \"And every mechanic tends to do it differently. So just Google your car and see how much you're supposed to fill up.\" The continuity in guidance is maintained clearly through his gesturing explanations, highlighting the maintenance mock-procedure.\n"
+ }
+ ],
+ "events": [
+ "Man in gray hoodie and black shorts scrutinizes the front headlight of a red car with its hood open.",
+ "Highlighted yellow text on screen: 'Yo, I noticed my blinker fluid is getting a bit oily... Really? I'm impressed that you even noticed. Most people don't even check until it's too late. Let me check it out though.'",
+ "Second man in a gray t-shirt approaches and talks about his observations.",
+ "Scene revisits headlight close-up with commentary: 'Yeah, you're right. It's time to change that blinker fluid.'",
+ "Wider view of first man seated on a stool beside red car with hood open, gestures toward headlight.",
+ "First man outlines the maintenance process: draining the fluid and potentially removing the headlight and bumper.",
+ "Man in black shorts, gray t-shirt, and black cap removes the car's bumper and proceeds to remove the headlight.",
+ "Draining the blinker fluid into a container with explanation: 'You want to remove the headlight and dump out all the fluid. You can even give it a rinse if you want to. Now once that's all drained, you can put the headlight back.'",
+ "Next scene introduces a red funnel and a clear tube for replenishing the blinker fluid.",
+ "Close-up of headlight filling with red liquid. White cup with text: 'this is a joke.'",
+ "Audio and text: 'Now once you've got the headlight back in, you can top up the blinker fluid. Now, the amount of blinker fluid that you should add is different for every car.'",
+ "Closing scene with man seated, providing final commentary and gestures: 'And every mechanic tends to do it differently. So just Google your car and see how much you're supposed to fill up.'"
+ ]
+ },
+ {
+ "video": "video/118.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a close-up view of three revolvers on a wooden surface, featuring a silver revolver with a black grip prominently displayed in the center. The hand of a person then appears, inspecting a long-barreled revolver with a white grip. The man speaking in the background begins with, \"Now I'm going to give you a little breakdown of my five favorite revolvers, starting with the guy in the middle, the most expensive, Colt Single Action Army, 1873, Generation 3.\" identifying it as a Colt Single Action Army, 1873, Generation 3.\n\nAs the narration continues to emphasize the beauty and functionality of the revolvers, viewers are shown a close-up of the silver revolver with a dark grip. The speaker says, \"`It is great with ivory grips and it looks beautiful, seven and a half inch barrel add that accuracy, second we have the Colt Python six inch, this is a .357 Magnum as well,`\"\n\nThe hand then picks up the silver revolver to demonstrate its features, and the narration shifts, mentioning, \"`And this is the gun from The Walking Dead. Amazing gun, shoots great, triggers awesome. Double action here, single action here. Here to get the gun for self-defense. Double action is the way to go.`\" The silver revolver is laid back on the wooden surface. The hand briefly touches a black revolver with a long barrel, before refocusing on the silver revolver.\n\nIn the subsequent frames, the hand interacts with another silver revolver, pointing out a separate barrel component lying next to it. The narration continues with detailed remarks on various revolvers, stating, \"Gp100 four-inch, amazingly durable, reliable revolver for the money. This is only about six hundred dollars whereas this is 1200 and you're looking about three grand for that guy.`\" The hand's movements emphasize the highlights on the revolvers, especially showcasing the text 'PYTHON' on one barrel.\n\nThe presentation then transitions to a black revolver with a wooden grip. 
The hand lifts it slightly for a closer examination. The speaker shares, \"`my overall favorite gun to shoot probably is this little 1-inch .357 Magnum. This is actually an 8-shot Smith & Wesson here. This is a badass gun. This is a performance center,`\" affirming the revolver's exceptional attributes.\n\nFinally, the video circles back to include a focus on the black revolver with the wooden grip, and then a silver revolver with a dark grip. The narration closes with, \"`Colt Python, it has the best trigger on the table. And then the venerable 686 from Smith & Wesson, 6-inch .357 Magnum as well. Phenomenal gun,`\" spoken quickly and loudly to highlight the enthusiasm of the speaker.\n\nOverall, the video provides a visually enriched, detailed breakdown of the speaker's favorite revolvers, their unique features, and personal preferences, enhanced dynamically by the consistent wooden surface backdrop and scattered gravel. The combination of close-up visuals and the passionate narration creates an immersive experience for the audience, deepening their understanding of these firearms."
+ }
+ ],
+ "events": [
+ "Video opens with a close-up view of three revolvers on a wooden surface.",
+ "The silver revolver with a black grip is prominently displayed in the center.",
+ "A hand appears, inspecting a long-barreled revolver with a white grip.",
+ "A man begins speaking, introducing the video purpose: breakdown of his five favorite revolvers.",
+ "The speaker identifies the center revolver as a Colt Single Action Army, 1873, Generation 3.",
+ "Close-up of the silver revolver with a dark grip as the narration continues.",
+ "Speaker mentions the beauty and functionality of the revolvers.",
+ "Speaker praises the revolver\u2019s ivory grips and 7.5-inch barrel for accuracy.",
+ "Speaker introduces the second revolver: Colt Python six inch, .357 Magnum.",
+ "Hand picks up the silver revolver and speaker discusses its features.",
+ "Speaker references the gun from The Walking Dead, noting it shoots great and has an awesome trigger.",
+ "Explanation of double action and single action for self-defense purposes.",
+ "Silver revolver is laid back on the wooden surface.",
+ "Hand briefly touches a black revolver with a long barrel.",
+ "Hand refocuses on the silver revolver.",
+ "Hand interacts with another silver revolver, points to a separate barrel component.",
+ "Speaker discusses the Gp100 four-inch revolver, describing its durability and price.",
+ "Hand movements emphasize the highlights on the revolvers, showing text 'PYTHON' on one barrel.",
+ "Camera transitions to a black revolver with a wooden grip.",
+ "Hand lifts the black revolver with a wooden grip for a closer examination.",
+ "Speaker expresses preference: favorite gun to shoot is the 1-inch .357 Magnum Smith & Wesson, a performance center model.",
+ "Video circles back to focus on the black revolver with a wooden grip and then a silver revolver with a dark grip.",
+ "Speaker concludes with enthusiasm for Colt Python, praising its trigger quality.",
+ "Speaker also praises the Smith & Wesson 686, a 6-inch .357 Magnum.",
+ "Video provides a visually enriched and detailed breakdown of revolvers, features, and personal preferences.",
+ "Close-up visuals and passionate narration create an immersive experience with the wooden surface backdrop and scattered gravel."
+ ]
+ },
+ {
+ "video": "video/119.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man in a dark suit and glasses standing near the glass barrier of a hockey rink. He raises his arm and throws an black object to a boy, while the text overlay reads \"This video just kept getting better\" accompanied by a crying-laughing emoji throughout the video. An announcer on a loudspeaker comments on the scene, mentioning something about a puck intended for a kid in red but being stolen, creating a lively and engaging atmosphere. The scene shifts to a man in a light grey jacket and a black cap who is standing on the other side of the glass barrier. He takes away the black object before the boy catches it. The audio continues with loudspeaker announcements and crowd cheers. The man in the suit then stands with his arms crossed, looking up at the stands, as another man in a white shirt and black vest holding a hockey stick stands near the glass. The crowd is clapping and cheering, suggesting a moment of excitement or celebration during the hockey game.\n\nNext, the camera focuses on the stands filled with enthusiastic spectators, continuing the lively ambiance. The scene highlights a young boy in a red jacket, waving and holding something small, possibly a toy or snack. He seems trying to catch the object. The audio describes the boy's excitement and then introduces a mascot dressed in a black and yellow jersey marked with the number 87, who walks down the stairs in the stands. The mascot interacts with the crowd and particularly the young boy, eventually handing him a jersey. The camera zooms in to focus on the boy who is pointing forward. The boy eagerly accepts the jersey, showing it off to the surrounding people, while the crowd intensifies its cheering and clapping. 
A man in the audio remarks on the special moment as music plays in the background, describing it as an early Halloween treat for the youngster receiving a Crosby jersey.\n\nFollowing this, the video maintains an engaging atmosphere as the interaction between the mascot and the boy in the red jacket continues. The mascot hugs the boy, who is visibly elated and happy. The crowd's cheering and clapping continue to add to the heartwarming moment. We see the boy, now proudly holding the mascot-given jersey, interacting with a hockey player near the glass barrier. The player, wearing a white jersey, leans over to greet the boy, making the young fan visibly thrilled by the encounter. A man's voice: \" Is that it? No! Look at this! This is early Halloween for this youngster as he gets Crosby's jersey and he gets a little hug, a couple of flaps. \"over the audio expresses admiration for this special moment, describing the boy's encounter as fantastic, predicting a memorable night for him, possibly crowned by a Penguin win.\n\nFinally, the video extends the heartwarming sequence of events. It shows the young boy now standing by the rink area by a man in a black suit and red tie. The boy is visibly excited. The scene then shows him being greeted by another hockey player in a white jersey near the glass barrier. The players are walking by and some of them are clapping with him. The crowd shares the boy's excitement and the overall scene continues to capture the memorable joy the young fan experiences during this special hockey game. During this, the audio:\" What a night, what a night to make a Penguin to win the...\" highlights the shame felt by a man who previously took the boy's puck, culminating in loud music that complements the lively and heartwarming atmosphere."
+ }
+ ],
+ "events": [
+ "A man in a dark suit and glasses stands near the glass barrier of a hockey rink.",
+ "The man in the suit raises his arm and throws a black object to a boy.",
+ "Text overlay reads 'This video just kept getting better' accompanied by a crying-laughing emoji.",
+ "An announcer on a loudspeaker comments on the scene, mentioning a puck meant for a kid in red but being stolen.",
+ "The scene shifts to a man in a light grey jacket and a black cap standing on the other side of the glass barrier.",
+ "The man in the light grey jacket takes away the black object before the boy catches it.",
+ "Audio continues with loudspeaker announcements and crowd cheers.",
+ "The man in the dark suit stands with his arms crossed, looking up at the stands.",
+ "Another man in a white shirt and black vest holding a hockey stick stands near the glass.",
+ "The crowd is clapping and cheering, suggesting excitement or celebration during the hockey game.",
+ "The camera focuses on the stands filled with enthusiastic spectators.",
+ "A young boy in a red jacket is waving and holding something small, possibly a toy or snack.",
+ "The boy seems to be trying to catch the black object.",
+ "Audio describes the boy's excitement and introduces a mascot dressed in a black and yellow jersey marked with number 87.",
+ "The mascot walks down the stairs in the stands.",
+ "The mascot interacts with the crowd and, particularly, the young boy.",
+ "The mascot hands the boy a jersey.",
+ "The camera zooms in on the boy who is pointing forward.",
+ "The boy eagerly accepts the jersey, shows it off to surrounding people.",
+ "The crowd's cheering and clapping intensify.",
+ "A man in the audio remarks on the special moment as music plays in the background.",
+ "The moment is described as an early Halloween treat for the boy receiving a Crosby jersey.",
+ "The interaction between the mascot and the boy in the red jacket continues.",
+ "The mascot hugs the boy, who is visibly elated and happy.",
+ "The crowd's cheering and clapping continue.",
+ "The boy, now proudly holding the mascot-given jersey, interacts with a hockey player near the glass barrier.",
+ "The hockey player, wearing a white jersey, leans over to greet the boy.",
+ "The boy is visibly thrilled by the encounter.",
+ "A man's voice expresses admiration for the special moment over the audio.",
+ "The boy's encounter is described as fantastic and memorable, possibly crowned by a Penguin win.",
+ "The video continues with heartwarming events.",
+ "The young boy stands by the rink area with a man in a black suit and red tie.",
+ "The boy is visibly excited.",
+ "The boy is greeted by another hockey player in a white jersey near the glass barrier.",
+ "The hockey players walk by, and some clap with the boy.",
+ "The crowd shares the boy's excitement.",
+ "The audio highlights the shame felt by a man who previously took the boy's puck.",
+ "Loud music complements the lively and heartwarming atmosphere."
+ ]
+ },
+ {
+ "video": "video/120.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up shot of a pot on a stove filled with a bubbling brown liquid. A hand holding a bowl of raw orange salmon pieces, with skin on one side, pours the salmon into the pot, causing the liquid to bubble vigorously. Transitioning to a scene, a person holds a bunch of fresh green herbs, ready to add them to the pot. The person wears a white t-shirt adorned with the text 'AC/DC World Tour '79' in black. In the background, a woman's voice narrates, \"I'`m making this salmon tamarind soup for dinner that my husband won't eat. Basically, he's scared of bones, so he won`'t eat,\" against a backdrop of sizzling sounds.\n\nThis transitions to a close-up of the same hand adding the fresh green herbs to the pot. From a wider view, we see the person standing in the kitchen, a beautiful bouquet of flowers in the background. The person holds a package labeled 'bibigo Fully Cooked Mini Wontons Chicken & Cilantro.' The scene then shifts to a pot filled with boiling mini wontons, being stirred with a wooden spoon. The narrator continues, \"`So for his dinner, I'm making chicken cilantro mini wontons from Bibigo. You know when they're ready, when they float to the top,`\" complemented by the sound of sizzling.\n\nWe move on to a close-up of a wooden cutting board with fresh green herbs, probably cilantro, and a small empty glass bowl. Various sauce bottles with red, yellow, black, and yellow labels are visible in the background. A hand pours a dark brown liquid, seemingly soy sauce, from a bottle with a red cap and label with Chinese characters into the glass bowl. Another bottle with a beige cap is then opened, and another dark liquid is poured into the bowl, mixing with the soy sauce. A container marked 'Sugar' is reached for and scooped with a white spoon. 
The narrator outlines, \"To start up, I'`m making a dipping sauce of soy sauce, black vinegar, sesame oil, sugar to balance out the saltiness, soy sauce and the acidity of the black vinegar and cilantro and green onions `\" accompanied by background music.\n\nNext, a close-up of the 'Sugar' container precedes the hand pouring sugar into the bowl of dark liquid on the wooden cutting board, where fresh cilantro is also laid out. The sugar melts into the liquid as the hand adds chopped cilantro and green onions. The dark mixture is then carefully poured over a plate of cooked mini wontons, now glossy and appealing. The voice describes, \"`Okay and here's the sauce I made,`\" as the woman stirs the bowl.\n\nA close-up shows cooked mini wontons on a plate with a hand pouring the dark brown sauce and chopped green herbs over them, enhancing their appearance. The person wearing a beige shirt and black smartwatch uses chopsticks to pick up a wonton. The scene captures the person bringing the wonton towards their mouth, signaling the dish's completion and the start of the meal. \"You'`re trying to use chopsticks these are slippery little suckers you don't want that spoon wow wow,`\" a woman and man exchange remarks and laughter during this moment.\n\nFinally, the scene highlights a person in a beige shirt and black smartwatch sitting at a table with a bowl of cooked mini wontons. They hold a wonton with chopsticks, ready to taste it. The view expands to show another person in a dark shirt holding a white spoon at the white marble-topped table, which is adorned with various items, including a small pumpkin and a chalkboard. The focus then returns to a close-up of the wonton in the chopsticks, with a shrimp soup bowl also visible. 
The video emphasizes the joy of eating the prepared dishes, underscored by laughter and conversation, \"`Kiera just picked it up right away,`\" followed by a woman laughing and another woman's comments.\n\nThis cohesive journey through the preparation and enjoyment of the meal showcases the culinary process, with narrations that mirror the vibrant, bustling kitchen atmosphere filled with sizzling sounds, thoughtful preparation, and shared enjoyment."
+ }
+ ],
+ "events": [
+ "Close-up shot of a pot on a stove filled with bubbling brown liquid.",
+ "Hand holding a bowl of raw orange salmon pieces pours salmon into the pot.",
+ "Transition to a person holding a bunch of fresh green herbs to add to the pot.",
+ "Person wears a white t-shirt with 'AC/DC World Tour '79' text.",
+ "Woman's voice narrates about making salmon tamarind soup.",
+ "Background sizzling sounds heard.",
+ "Close-up of the hand adding fresh green herbs to the pot.",
+ "Wider view shows person standing in the kitchen with a bouquet of flowers in the background.",
+ "Person holds a package labeled 'bibigo Fully Cooked Mini Wontons Chicken & Cilantro'.",
+ "Scene shifts to a pot of boiling mini wontons being stirred with a wooden spoon.",
+ "Narrator continues describing the chicken cilantro mini wontons.",
+ "Close-up of a wooden cutting board with fresh green herbs and a small empty glass bowl.",
+ "Various sauce bottles in the background.",
+ "Hand pours dark brown liquid, soy sauce, from a bottle with a red cap into the glass bowl.",
+ "Another bottle with a beige cap is opened and another dark liquid is poured into the bowl.",
+ "Container marked 'Sugar' is scooped with a white spoon.",
+ "Narrator outlines making a dipping sauce with soy sauce, black vinegar, sesame oil, sugar, cilantro, and green onions.",
+ "Close-up of 'Sugar' container and hand pouring sugar into the bowl of dark liquid.",
+ "Hand adds chopped cilantro and green onions to the dark liquid.",
+ "Dark mixture is poured over a plate of cooked mini wontons.",
+ "Voice describes the sauce made, as the woman stirs the bowl.",
+ "Close-up of cooked mini wontons with sauce and chopped green herbs being poured over them.",
+ "Person in beige shirt and black smartwatch uses chopsticks to pick up a wonton.",
+ "Scene shows the person bringing the wonton towards their mouth.",
+ "Voice and laughter exchange about using chopsticks.",
+ "Person in beige shirt and black smartwatch sitting at a table with a bowl of cooked mini wontons.",
+ "Person holds a wonton with chopsticks, ready to taste it.",
+ "View expands to another person in a dark shirt holding a white spoon at a white marble-topped table with various items like a small pumpkin and a chalkboard.",
+ "Close-up of the wonton in chopsticks with a shrimp soup bowl visible.",
+ "Emphasis on the joy of eating prepared dishes, underscored by laughter and conversation.",
+ "Woman comments about Kiera picking it up right away, followed by laughter."
+ ]
+ },
+ {
+ "video": "video/121.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with an illustration of a samurai dressed in traditional attire, holding two swords, with the text 'Why Musashi Fought with Two Swords' prominently displayed. The audio reveals a man explaining, \"Why Musashi fought with two swords. Miyamoto Musashi is one of the most well-known samurai swordsmen in Japanese history, and he is very famous for being a dual katana fighter.\" The scene evolves into another illustration of a samurai in action, wielding two wooden swords. The final frames of this segment depict a different samurai seated.The background is a traditional Japanese art style with muted colors and intricate details.\n\nTransitioning to the next scene, the video shows a man in traditional Japanese attire.The audio picks up with a question, \"Why did he use two katana swords?\" following with, \"Musashi devoted his whole life to studying and mastering the art of swordsmanship until the very last day of his life.\" Another illustration of a samurai with two swords appears superimposed over the man.The scene transitions to show the same man with an illustration of two samurais in combat, traditional Japanese decor and a wall scroll with Japanese calligraphy are visible.\n\nNext, the video gives a close-up of three rolled-up scrolls with Japanese writing on them on a table. 
The man continues, \"All of his findings have been documented in a book called The Book of Five Rings.\" The scene shifts back to a man in traditional attire, with an illustration of a samurai holding two swords superimposed over them.The narration proceeds with, \u201cHis core belief was that a samurai must be trained to be able to properly fight in any environment.That's why he taught everything from how to use dual katana,various forms,and even the correct mindest during a battle.\n\nThe man starts gesturing with his hands.The next frames present a close-up of a hand holding a katana against a black background ''A person who only knows how to use a single katana with both hands will not be able to fight properly.\" The traditional Japanese art style remains as the backdrop for these frames.\n\nThe scene remains animated with a battle illustration featuring samurais in action, the man says,''When his arm is injured, when he runs, or when he suddenly has to ride a horse.But in other words, if you're in a situation where fighting with two swords is a disadvantage you should immediately hold a single sword.\"The segment finishes with the man in traditional attire, gesturing and a picture of hands holding a sword is seen.\n\nThe final scene shows the samurai holding two swords the transitions to a picture full of samurais.The man continues to say''The important teaching of Musashi is to fight in a way that is most advantageous to you.''"
+ }
+ ],
+ "events": [
+ "Illustration of a samurai dressed in traditional attire holding two swords",
+ "Text displayed: 'Why Musashi Fought with Two Swords'",
+ "Audio: Man explains why Musashi fought with two swords and mentions Miyamoto Musashi",
+ "Illustration of a samurai in action, wielding two wooden swords",
+ "Final frames depict a seated samurai",
+ "Background in traditional Japanese art style with muted colors and intricate details",
+ "Transition to a new scene with a man in traditional Japanese attire",
+ "Audio: Man questions why Musashi used two katana swords",
+ "Audio: Man explains Musashi's lifelong dedication to mastering swordsmanship",
+ "Illustration of a samurai with two swords appears superimposed over the man",
+ "Scene transitions to man with illustration of two samurais in combat",
+ "Traditional Japanese decor and wall scroll with Japanese calligraphy visible",
+ "Close-up of three rolled-up scrolls with Japanese writing on a table",
+ "Audio: Man mentions 'The Book of Five Rings'",
+ "Scene shifts back to man with an illustration of a samurai holding two swords",
+ "Audio: Man explains Musashi's core belief about training to properly fight in any environment",
+ "Man starts gesturing with his hands",
+ "Close-up of a hand holding a katana against a black background",
+ "Audio: Man explains the disadvantage of someone only knowing how to use a single katana",
+ "Animation of a battle illustration featuring samurais in action",
+ "Audio: Man explains scenarios where fighting with two swords is a disadvantage",
+ "Final segment shows man in traditional attire gesturing",
+ "Picture of hands holding a sword",
+ "Final scene shows samurai holding two swords then transitions to a picture full of samurais",
+ "Audio: Man says the important teaching of Musashi is to fight in a way that is most advantageous to you"
+ ]
+ },
+ {
+ "video": "video/122.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man sitting on a bed in a room with a slanted ceiling and a skylight window. Wearing a black 'Purple Rain' t-shirt and black pants, he speaks energetically with hand gestures. The subtitles on the screen read, \"Can we all agree that the back of a school bus would be the loudest and most chaotic? For my first month of middle school, my mom would take me to and from school. Later on though, she thought it would be a good idea to take a bus.\". He is narrating a personal story about his school bus experience.\n\nThe narrative continues with the man still on the bed, maintaining his expressive delivery. The subtitles now read \"keep in mind, this was the first time I ever took a bus.\".\n\nThe scene transitions outdoors, where the man stands in front of a building with a striped facade and a basketball hoop in the background, holding a small note card. Subtitles read \"I even had a little note card to find the bus number.\".\n\nThe video shifts back to the room with the man on the bed, subtitles reading \"so I'm a little late when I first get on the bus and the only open seat is in the second to last row.\". He continues to use gestures to emphasize his story.\n\nThe visual then shifts to the man on a school bus, clutching a black backpack, looking around. Subtitles read \"I'm sitting there all alone surrounded by bigger kids.\". Back on the bed, subtitles read \"and when that bus takes off pure chaos.\". The scene transitions back to the bus, now showing the man in a red shirt and cap standing and raising his hands, emphasizing the chaotic atmosphere.\n\nReturning to the narrative setting, the video shows the man on the bed discussing the bus driver's attempts to manage the chaos, emphasized by subtitles reading \"the bus driver is trying to talk to us.\".\n\nThe scene cuts to inside the bus with the man in a beige shirt holding a microphone saying \"sit down be quiet.\". 
The man in the red shirt famously retorts, \"Man suck my d**k.\".\n\nThis transitions to an outdoor reaction clip of cheering people. Wrapping up the video, the man on the bed discusses his feelings post-bus ride with subtitles \"when I got home I felt like a changed person. I was so scared. After that, I would sit in front of the bus directly behind the driver.\", concluding with \"builds character\"."
+ }
+ ],
+ "events": [
+ "Man sitting on a bed in a room with a slanted ceiling and a skylight window.",
+ "Man wearing a black 'Purple Rain' t-shirt and black pants.",
+ "Man speaks energetically with hand gestures.",
+ "Subtitles: 'Can we all agree that the back of a school bus would be the loudest and most chaotic? For my first month of middle school, my mom would take me to and from school. Later on though, she thought it would be a good idea to take a bus.'",
+ "Man narrates a personal story about his school bus experience.",
+ "Subtitles: 'keep in mind, this was the first time I ever took a bus.'",
+ "Scene transitions outdoors.",
+ "Man stands in front of a building with a striped facade and a basketball hoop in the background, holding a small note card.",
+ "Subtitles: 'I even had a little note card to find the bus number.'",
+ "Scene shifts back to the man on the bed.",
+ "Subtitles: 'so I'm a little late when I first get on the bus and the only open seat is in the second to last row.'",
+ "Man continues to use gestures to emphasize his story.",
+ "Visual shifts to the man on a school bus, clutching a black backpack, looking around.",
+ "Subtitles: 'I'm sitting there all alone surrounded by bigger kids.'",
+ "Back on the bed, subtitles: 'and when that bus takes off pure chaos.'",
+ "Scene transitions back to the bus, showing the man in a red shirt and cap standing and raising his hands, emphasizing the chaotic atmosphere.",
+ "Man on the bed discusses the bus driver's attempts to manage the chaos.",
+ "Subtitles: 'the bus driver is trying to talk to us.'",
+ "Scene cuts to inside the bus with the man in a beige shirt holding a microphone saying 'sit down be quiet.'.",
+ "Man in the red shirt famously retorts, 'Man suck my d**k.'.",
+ "Scene transitions to an outdoor reaction clip of cheering people.",
+ "Man on the bed discusses his feelings post-bus ride.",
+ "Subtitles: 'when I got home I felt like a changed person. I was so scared. After that, I would sit in front of the bus directly behind the driver.'",
+ "Subtitles: 'builds character.'"
+ ]
+ },
+ {
+ "video": "video/123.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a young woman sitting in the driver's seat of a car, dressed in a white tank top and denim shorts with sunglasses. She appears to be a mom and is conversing and gesturing with her hands. A text overlay reads, \"when you skip school but the school calls your mom:\". The audio reveals the mom saying, \"Have a good day at school, sweetie!\" The daughter then leaves the car and says, \"Bye!\" After that, she is seen standing in front of some kind of building, gesturing with her hands: \"You know what? I think I'm gonna skip school today again at the mall since it worked last time. I'm just gonna run back here in time for mom to pick me up again.\"\n\nThe scene shifts to the woman, in a plaid shirt and denim shorts, who exits the car and walks toward a building marked 'Aspen Dental.' She gestures animatedly as if explaining something, and the scene closes with her continuing to talk.\n\nThe video transitions to the daughter, now running energetically in a parking lot, wearing the same plaid shirt, denim shorts, white socks, and white sneakers. Another text overlay reiterates, \"when you skip school but the school calls your mom:\". She looks excited, and the audio depicts her saying, \"Finally! Made it to the mall. I'm just going to wait here until 2:20.\" Then she runs to the mall. Following this, the video cuts to the daughter standing against a wall, looking worriedly at her arm and then walking towards 'Aspen Dental,' still gesturing: \"Time to run back to school. Oh, there's mom.\" Then she waves to her mom. The scene shifts back to the mom, now sitting in the driver's seat.\n\nNext, the scene shows the mom inside a car, dressed in a plaid shirt. Again, the text overlay reads \"when you skip school but the school calls your mom:\". The mom says, \"So how's your school today?\" The daughter, now in the car, says, \"It was fine you know, it's just school.\" The scene moves indoors to the mother in a bathroom, now in a white bathrobe, holding a phone and looking concerned. Subtitled \"*2 HOURS LATER*\", she continues talking on the phone with a worried expression, the audio capturing her saying, \"Hello?\" A voice replies, \"Hello, this is Aspen Dental Middle School calling to inform you that your daughter, Pearl Krabs, was absent today at school. Goodbye.\"\n\nThen the scene transitions to the daughter in a brown jacket, holding a green shoe in her hand. The scene shifts back to the mom in the bathroom, still in a white bathrobe and speaking on the phone with a concerned face. The familiar text overlay is present. The mom demands, \"Why did I just receive a phone call telling me that you were absent from school today?\" The daughter replies, \"I don't know. It must be a glitch or something.\" The scene then transitions to the daughter holding a mint green Croc shoe in one hand and a glass jar labeled 'Cookies' in the other, appearing to explain something. The scene transitions to the mom in the bathrobe, phone in hand, looking increasingly worried; she raises one finger and says, \"No, sweetie.\" Then she says there's no reason for her to skip school.\n\nThen the mom asks, \"Did you skip?\" The daughter says, \"Mom, I was at school. I was literally sitting there with Patrick working on a group assignment.\" The mother seems angry and says, \"You know what? Should I call the school and ask them if you were there?\" The daughter says, \"Go ahead. Call the school. Ask if I was there.\" She is now holding the Croc shoe, with the glass cookie jar in front of her, and gestures more animatedly as the scene switches between the trunk and the kitchen.\n\nThe final clip shows the daughter, still clutching the mint green Croc and the cookie jar, with a last set of worried gestures. Then the video reveals her inner thoughts as she holds the Croc near her face: \"She ain't actually going to call the school. She's not going to call.\" The mom then starts to make a call on her phone, saying, \"Okay, I'm going to call them right now.\" Panic seizes the daughter; holding her hair, she says, \"Oh no. I didn't think she was actually going to call. Dang it. I don't like today.\" She looks increasingly distressed, scratching her head while standing in the kitchen.\n\nThe video deeply explores themes of deception and fear of getting caught, humorously portraying a student's ill-fated attempt to skip school without their parent's knowledge."
+ }
+ ],
+ "events": [
+ "Young woman in driver's seat of a car, dressed in white tank top and denim shorts with sunglasses.",
+ "Text overlay reads, 'when you skip school but the school calls your mom:'.",
+ "Mom says, 'Have a good day at school, sweetie!'",
+ "Daughter leaves the car and says, 'Bye!'",
+ "Daughter stands in front of a building, gesturing with her hands and talking about skipping school to go to the mall.",
+ "Scene shifts to the woman in plaid shirt and denim shorts exiting the car and walking towards 'Aspen Dental'.",
+ "Woman animatedly gestures as if explaining something.",
+ "Scene shifts to the daughter running energetically in a parking lot.",
+ "Text overlay reiterates, 'when you skip school but the school calls your mom:'.",
+ "Daughter excitedly says, 'Finally! Made it to the mall. I'm just going to wait here until 2:20.'",
+ "Daughter runs to the mall.",
+ "Daughter stands against a wall, worriedly looking at her arm.",
+ "Daughter walks towards 'Aspen Dental,' still gesturing.",
+ "Daughter says, 'Time to run back to school. Oh, there's mom.'",
+ "Daughter waves to her mom.",
+ "Scene shifts back to the mom in the driver's seat.",
+ "Scene shows the mom inside a car, dressed in a plaid shirt.",
+ "Text overlay reads, 'when you skip school but the school calls your mom:'.",
+ "Mom says, 'So how's your school today?'",
+ "Daughter now in the car says, 'It was fine you know, it's just school.'",
+ "Scene moves indoors to the mother in a bathroom in a white bathrobe, holding a phone and looking concerned.",
+ "Subtitled '*2 HOURS LATER*', mom continues talking on the phone with a worried expression.",
+ "Voice on phone says, 'Hello? This is Aspen Dental Middle School calling to inform you that your daughter, Pearl Krabs, was absent today at school. Goodbye.'",
+ "Scene transitions to the daughter in a brown jacket, holding a green shoe in her hand.",
+ "Scene shifts back to the mom in a bathroom, still in a white bathrobe and speaking on the phone with a concerned face.",
+ "Familiar text overlay is present.",
+ "Mom says, 'Why did I just receive a phone call telling me that you were absent from school today?'",
+ "Daughter replies, 'I don't know. It must be a glitch or something.'",
+ "Scene transitions to daughter holding a mint green Croc shoe and a glass jar labeled 'Cookies'.",
+ "Mom in bathrobe, phone in hand, looking increasingly worried, raises one finger and says, 'No, sweetie.'",
+ "Mom says there's no reason for her to skip school.",
+ "Mom asks, 'Did you skip?'",
+ "Daughter says, 'Mom, I was at school. I was literally sitting there with Patrick working on a group assignment.'",
+ "Mother seems angry and says, 'You know what? Should I call the school and ask them if you were there?'",
+ "Daughter says, 'Go ahead. Call the school. Ask if I was there.'",
+ "Daughter holding Croc shoe and glass cookie jar in front, gestures more animatedly.",
+ "Scene switches between the trunk and the kitchen.",
+ "Daughter still clutching mint green Croc and cookie jar, with worried gestures.",
+ "Text reveals daughter's thoughts, 'She ain't actually going to call the school. She's not going to call.'",
+ "Mom starts to use her phone to make call and says, 'Okay, I'm going to call them right now.'",
+ "Panic seizes daughter, holds her hair and says, 'Oh no. I didn't think she was actually going to call. Dang it. I don't like today.'",
+ "Daughter looks increasingly distressed, scratching her head while standing in the kitchen."
+ ]
+ },
+ {
+ "video": "video/124.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a person with curly hair in a room with a window in the background. The text 'Your puffs:' appears at the top of the screen, followed by a counter displaying '100,000,000' in green. Below the counter, there are two options labeled 'Private' and 'Public.' The subtitles read, 'Dam talk about living forever.' The man contemplates these options and points upwards, selecting the 'Private' option, indicated by a green lock icon. He is seen wearing a black shirt and later a black backpack. The final subtitle in this segment reads, 'Garyvee!' in white and 'Yeah dude I got 1000 puffs.' in green. In the audio, amidst the visuals, two distinct voices can be heard: a man speaking, then another man responding.\n\nThe scene continues with the same man in the same room making different facial expressions. The green lock icon persists next to 'Your puffs:'. Subtitles appear, including 'I'm happy for you!' in white, 'Devin do you have extra puffs I can use?' in orange, followed by 'Sure.' in white. The final subtitles depict a conversation ending with, 'Actually you ran out, sorry for it.' in blue and 'What was that for?!' in white, concluding with 'Don't be silly you can't be giving away your puffs!' in blue. The man is holding both a blue and a white object, as well as an orange one. The audio captures a conversation between multiple male voices discussing among themselves, conveying the same message as depicted in the subtitles.\n\nThe theme extends with the man still in the same setting, sitting with his hands behind his head. Subtitles progress the narrative, starting with 'Jackass you're not allowed to sit here.' in red, 'Why not?' in white, 'Well everyone over here has 100,000 puffs.' in red, to which the man gestures and responds, 'But you can't even see my number.' in white. This clip ends with 'Well it's private for some reason.' in red. The audio follows a similar tonal shift, with the same conversation carried by alternating voices.\n\nThe video maintains continuity with the same man and background. The lock icon and text 'Your puffs:' remain the same. The man looks surprised as the subtitle reads, 'Yo why is everyone coughing?' in white, followed by a message revealing, 'Didn't you hear? The system cut everyone's puffs in half to reduce population.' in a blue subtitle. The man, holding a tennis ball in his hand, looks up and exclaims 'WHAT THE JOHN!' in white, and the dialogue continues with, 'I heard you only have 1k puffs is that true?' in pink and the reply, 'Yeah that's correct' in white. The audio matches, with a man expressing shock about the 'system' cutting puffs in half and affirming the possession of 'thousand puffs.'\n\nThe consistent scene continues with a pointed exchange; subtitles express a storyline with 'Ew I only dated you because I thought you had more puffs, we're over!' before shifting to 'Your teacher is now dead she ran out of puffs.' and the query 'Does this mean we don't have a test today?' appearing in red. The audio narrative mimics the subtitles, discussing an individual's death due to lack of 'puffs' and queries about a test being canceled.\n\nThe video wraps up with the same man in the same room with a fan against the wall. The man is holding an orange object and shaking it. The screen fades to black with the final subtitle: 'Attention everyone's puffs will be public by tomorrow' in white with speaker emojis. 
There is a final exclamation: \"Oh, fuck.\" The audio concludes with a final declaration eerily mirroring the impending revelation about 'puffs,' emphasizing the surprise.\n\nThe integration of visual responses and dialogues seamlessly etches out a narrative around 'puffs' against a common backdrop, encapsulating an arcing story of realization, regulation, and reaction towards this abstruse concept throughout the series of clips."
+ }
+ ],
+ "events": [
+ "Person with curly hair is in a room with a window in the background.",
+ "Text 'Your puffs:' appears at the top of the screen.",
+ "Counter displaying '100,000,000' in green.",
+ "Two options labeled 'Private' and 'Public' appear below the counter.",
+ "Subtitles read, 'Dam talk about living forever.'",
+ "Man contemplates options and points upwards, selecting the 'Private' option.",
+ "Green lock icon indicates 'Private' selection.",
+ "Man is wearing a black shirt and later a black backpack.",
+ "Final subtitle in this segment reads, 'Garyvee! Yeah dude I got 1000 puffs.'",
+ "Audio captures two distinct male voices conversing.",
+ "Man in the same room makes different facial expressions.",
+ "Green lock icon persists next to 'Your puffs:'.",
+ "Subtitles appear: 'I'm happy for you!' in white.",
+ "Subtitles: 'Devin do you have extra puffs I can use?' in orange.",
+ "Subtitles: 'Sure.' in white.",
+ "Subtitles: 'Actually you ran out, sorry for it.' in blue.",
+ "Subtitles: 'What was that for?!' in white.",
+ "Subtitles: 'Don't be silly you can't be giving away your puffs!' in blue.",
+ "Man is holding a blue and a white object, as well as an orange one.",
+ "Audio captures a conversation with multiple male voices conveying the same message as subtitles.",
+ "Man in the same setting, sitting with hands behind head.",
+ "Subtitles: 'Jackass you're not allowed to sit here.' in red.",
+ "Subtitles: 'Why not?' in white.",
+ "Subtitles: 'Well everyone over here has 100,000 puffs.' in red.",
+ "Man gestures and responds: 'But you can't even see my number.' in white.",
+ "Clip ends with, 'Well it's private for some reason.' in red.",
+ "Audio matches conversation in subtitles with alternation of voices.",
+ "Continuity with the same man and background.",
+ "Green lock icon and text 'Your puffs:' remain the same.",
+ "Man looks surprised.",
+ "Subtitle reads: 'Yo why is everyone coughing?' in white.",
+ "Subtitle: 'Didn't you hear? The system cut everyone's puffs in half to reduce population.' in blue.",
+ "Man holding a tennis ball looks up and exclaims 'WHAT THE JOHN!' in white.",
+ "Dialogues continue: 'I heard you only have 1k puffs is that true?' in pink.",
+ "Reply: 'Yeah that's correct' in white.",
+ "Audio captures shock about 'system' cutting puffs in half and affirmation of 'thousand puffs'.",
+ "Subtitles express storyline: 'Ew I only dated you because I thought you had more puffs, we're over!'.",
+ "Subtitles change to: 'Your teacher is now dead she ran out of puffs.'",
+ "Query: 'Does this mean we don't have a test today?' in red.",
+ "Audio narrative mimics subtitles discussing death due to lack of puffs and canceled test.",
+ "Man in the same room with a fan against the wall.",
+ "Man is holding an orange object and shaking it.",
+ "Screen fades to black.",
+ "Final subtitle: 'Attention everyone's puffs will be public by tomorrow' in white with speaker emojis.",
+ "Final exclamation: 'Oh, fuck.'",
+ "Audio concludes with declaration about 'puffs' going public, emphasizing surprise."
+ ]
+ },
+ {
+ "video": "video/125.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video features a woman with long, straight brown hair, wearing a bright pink shirt with the text 'LET IT BE' emblazoned in bold black and white letters. Against a plain, likely blue or gray background, this woman is involved in a conversation or giving a presentation. Her expressive hand gestures and slight movements indicate a lively engagement with her audience.\n\nThe audio reveals the woman saying, \"Stop saying very. No, don't stop saying very. It's just that I've got a list of some very good words that you can use instead of using very, very, very, very all the time.\"\n\nAs the video progresses, it continues to focus on the same woman, who remains dynamically involved in explaining concepts, emphasized by her hand motions. The solid background and her bright attire stay consistent. In tandem with this, text appears at the bottom of the screen, displaying synonyms for common adjectives: \"Very Angry = Furious,\" \"Very Beautiful = Gorgeous,\" \"Very Big = Massive,\" and \"Very Boring = Dull.\" These terms complement the lady's high-pitched narration, which continues with, \"Very angry, furious, very beautiful, gorgeous, very big, massive, very boring, dull.\"\n\nMaintaining continuity, the woman further elaborates through purposeful hand movements. Additional text appears, presenting more synonyms: \"Very Noisy = Deafening,\" \"Very Poor = Destitute,\" \"Very Cheap = Stingy,\" and \"Very Clean = Spotless.\" The high-pitched female voice narrates in sync with the text, stating, \"Very noisy. Deafening. Very poor. Destitute. Very cheap. Stingy. Very clean. Spotless.\"\n\nThe presentation advances with the woman remaining animated and engaged, adding, \"Very Short = Brief,\" \"Very Dry = Arid,\" \"Very Quick = Rapid,\" and \"Very Bad = Awful.\" The audio matches the visuals precisely, the high-pitched voice delivering, \"Very short, brief, very dry, arid, very quick, rapid, very bad, awful.\"\n\nFurther still, this engaging woman continues to expound upon the topic, with text displaying: \"Very Smart = Intelligent,\" \"Very Sad = Sorrowful,\" and \"Very Upset = Distraught.\" The voice remains consistent, narrating, \"Very smart, intelligent, very sad, sorrowful, very upset, distraught, distraught.\"\n\nAs the video concludes, the woman's bright attire and dynamic presentation style remain steadfast. The text now includes: \"Very Cold = Freezing,\" \"Very Strong = Forceful,\" and \"Very Huge = Colossal.\" The narration mirrors the visual text with, \"Very cold. Freezing. Very strong. Forceful. Very huge. Colossal. Colossal.\", providing a smooth and coherent finish to the video.\n\nThroughout the video, the synthesis of visual and audio elements creates a harmonious and informative experience, effectively emphasizing the use of richer vocabulary in everyday language."
+ }
+ ],
+ "events": [
+ "Woman with long, straight brown hair and wearing a bright pink shirt with 'LET IT BE' text.",
+ "Background likely blue or gray.",
+ "Woman involved in conversation or presentation.",
+ "Expressive hand gestures and slight movements indicating lively engagement.",
+ "Woman says, 'Stop saying very. No, don't stop saying very. It's just that I've got a list of some very good words that you can use instead of using very, very, very, very all the time.'",
+ "Text on screen: 'Very Angry = Furious', 'Very Beautiful = Gorgeous', 'Very Big = Massive', 'Very Boring = Dull.'",
+ "Woman narrates: 'Very angry, furious, very beautiful, gorgeous, very big, massive, very boring, dull.'",
+ "Woman elaborates with purposeful hand movements.",
+ "Text on screen: 'Very Noisy = Deafening', 'Very Poor = Destitute', 'Very Cheap = Stingy', 'Very Clean = Spotless.'",
+ "Woman narrates: 'Very noisy. Deafening. Very poor. Destitute. Very cheap. Stingy. Very clean. Spotless.'",
+ "Presentation continues with woman animated and engaged.",
+ "Text on screen: 'Very Short = Brief', 'Very Dry = Arid', 'Very Quick = Rapid', 'Very Bad = Awful.'",
+ "Woman narrates: 'Very short, brief, very dry, arid, very quick, rapid, very bad, awful.'",
+ "Woman continues to explain topic.",
+ "Text on screen: 'Very Smart = Intelligent', 'Very Sad = Sorrowful', 'Very Upset = Distraught.'",
+ "Woman narrates: 'Very smart, intelligent, very sad, sorrowful, very upset, distraught, distraught.'",
+ "Video concludes with woman's bright attire and dynamic style.",
+ "Text on screen: 'Very Cold = Freezing', 'Very Strong = Forceful', 'Very Huge = Colossal.'",
+ "Woman narrates: 'Very cold. Freezing. Very strong. Forceful. Very huge. Colossal. Colossal.'",
+ "Synthesis of visual and audio elements effective in emphasizing richer vocabulary."
+ ]
+ },
+ {
+ "video": "video/126.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a large crowd in a stadium, all standing and clapping, many wearing blue clothing. The atmosphere is electric as the camera pans over to a group of gymnasts. One gymnast in a blue leotard stands at the center, surrounded by her teammates in white jackets and black pants, who cheer and raise their hands. She high-fives her teammates and then runs toward the camera, beaming with joy. The audience continues to cheer and clap, with some taking photos. In the background, a commentator announces, \"Brishi, they came to today's meet. UCLA has already hit a season high. Well above 198. We think they're at 198.225.\" His voice is amplified by a microphone, merging with the buzz of the crowd.\n\nThe scene transitions to the gymnast in the blue leotard, now standing alone on the mat. The crowd, still clapping and cheering, watches in anticipation. The announcer introduces, \"And here comes Kaitlyn Ohashi. It's not just a gymnastics meet. It's not just a sporting event. It is a happening.\" Music begins to play in the background. Kaitlyn begins her routine, displaying graceful dance moves and spins. The audience remains enthralled, some capturing the moment on their cameras. She executes a high leap, followed by a series of flips and tumbles. The clip ends with her running across the mat, setting up for the next sequence, while her teammates watch supportively from the sidelines.\n\nNext, the video focuses on Kaitlyn continuing her routine. Her teammates stand nearby, clapping and cheering energetically. The crowd also remains enthusiastic, many wearing blue. Kaitlyn performs a series of intricate dance moves and spins. During this, the announcer describes, \"Double layout with that split in the middle. And she added that sting mat recently and it's affected her landings on that opening pass.\" His commentary alternates with another woman's voice providing insights, creating a dynamic narrative that complements Kaitlyn's performance. She lands gracefully from a leap, ready for the next move, as the applause continues to surge around her.\n\nThe visual shifts to Kaitlyn walking confidently on the mat. Her teammates cheer her on, and the crowd's energy peaks as they capture the moment on their cameras. Embarking on her routine again, she performs dynamic dance moves and spins, her movements powerful yet fluid. The crowd's excitement is palpable as the commentator remarks, \"Welby's athletes compete week in, week out. In fact, UCLA has some of the, probably the most meets of any other Pac-12 team.\" Music underscores the vibrant atmosphere, and the crowd's cheers amplify the scene. Kaitlyn strikes a dramatic pose on the mat, her arm extended, the crowd's roar reflecting the height of their encouragement.\n\nThe video continues with Kaitlyn executing dynamic moves on the mat: a backbend and high leg lifts, flowing seamlessly into a handstand, spin, and high leap. The crowd cheers incessantly, capturing the electric moment on their cameras. \"But like you mentioned, she's not feeling well. And who wants to miss a home meet in front of this crowd? My favorite part right here,\" the commentator mentions as applause punctuates his words. Kaitlyn's agility and strength shine through as she continues with energetic dance moves and poses. She runs across the mat, preparing for another sequence, while the \"INQUIRY\" sign in the background indicates an area for questions or information.\n\nIn the final segment, Kaitlyn continues her routine, capturing the crowd's undivided attention. Her dynamic moves include a high leap and a split on the mat. The crowd, many still clad in blue, does not wane in their cheers, and the camera flashes are constant. She performs a series of energetic dance moves and poses, the palpable excitement of the audience permeating the scene. The voiceover narrates, \"The student section's joining in. Her teammates know the routine,\" while upbeat music plays, adding rhythm to her performance. The video culminates with Kaitlyn standing on the mat, raising her arms in a triumphant pose, bathed in the enthusiastic applause from the crowd and her supporting teammates."
+ }
+ ],
+ "events": [
+ "Large crowd in a stadium standing and clapping, many wearing blue clothing",
+ "Camera pans over to a group of gymnasts",
+ "One gymnast in a blue leotard stands at the center, surrounded by teammates in white jackets and black pants",
+ "Teammates cheer and raise their hands",
+ "Gymnast high-fives her teammates",
+ "Gymnast runs toward the camera beaming with joy",
+ "Audience continues to cheer and clap, some taking photos",
+ "Commentator announces UCLA's season high score",
+ "Scene transitions to gymnast in blue leotard standing alone on the mat",
+ "Crowd still clapping and cheering, watching in anticipation",
+ "Announcer introduces Kaitlyn Ohashi",
+ "Music begins to play in the background",
+ "Kaitlyn begins her routine with graceful dance moves and spins",
+ "Audience captures the moment on cameras",
+ "Kaitlyn executes a high leap followed by series of flips and tumbles",
+ "Clip ends with Kaitlyn running across the mat setting up for next sequence",
+ "Teammates watch supportively from the sidelines",
+ "Focus on Kaitlyn continuing her routine",
+ "Teammates nearby clapping and cheering energetically",
+ "Crowd remains enthusiastic, many wearing blue",
+ "Kaitlyn performs intricate dance moves and spins",
+ "Announcer describes her double layout with split in the middle and recent addition of sting mat affecting her landings",
+ "Announcer's commentary alternates with another woman's voice",
+ "Kaitlyn lands gracefully from a leap as applause continues",
+ "Visual shifts to Kaitlyn walking confidently on the mat",
+ "Teammates cheer her on",
+ "Crowd's energy peaks capturing the moment on cameras",
+ "Kaitlyn embarks on her routine again with dynamic dance moves and spins",
+ "Commentator remarks on UCLA's competitive schedule",
+ "Music underscores vibrant atmosphere and crowd's cheers amplify the scene",
+ "Kaitlyn strikes a dramatic pose on the mat with extended arm",
+ "Kaitlyn executes dynamic moves: backbend, high leg lifts, handstand, spin, high leap",
+ "Crowd cheers incessantly capturing the moment on cameras",
+ "Commentator mentions Kaitlyn not feeling well but not wanting to miss a home meet",
+ "Kaitlyn's agility and strength shine as she continues with energetic dance moves and poses",
+ "Kaitlyn runs across the mat preparing for another sequence",
+ "INQUIRY sign in the background indicating area for questions or information",
+ "Kaitlyn continues her routine with high leap and a split on the mat",
+ "Crowd, clad in blue, continues cheering and flashing cameras",
+ "Kaitlyn performs series of energetic dance moves and poses",
+ "Voiceover narrates student section joining in and her teammates knowing the routine",
+ "Upbeat music plays adding rhythm to her performance",
+ "Video culminates with Kaitlyn standing on the mat raising her arms in a triumphant pose",
+ "Crowd and supporting teammates give enthusiastic applause"
+ ]
+ },
+ {
+ "video": "video/127.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with an outdoor scene where a man in a black t-shirt is conducting an interview. He holds a microphone and engages with another rider dressed in a beige t-shirt with 'COAST PARK' printed on it and a helmet, surrounded by bicycles and people. Throughout the video, the speech of the interviewer is displayed at the bottom of the screen in white text on a red background, while the speech of the interviewee is shown in white text on a dark green background. This bustling environment suggests a community biking event. A vibrant yellow Santa Cruz Bronson CC bike is the center of attention. An overlay text reads: 'RIGHT, WE'VE GOT JORDAN HERE WITH YOUR MAGNIFICENT SANTA CRUZ BRONSON CC. TELL US A BIT ABOUT YOUR BIKE BECAUSE I THINK IT'S SUPER NICE.' Audibly, the man in the black t-shirt asks Jordan to talk about his bike. Jordan begins, \"Thank you.\"\n\nThe conversation continues with Jordan explaining the story behind his bike's frame. The frames show text overlays 'IT'S UPGRADED FROM THE 2019, I CRACKED MY FRAME. IT WAS DURING THE COVID THING, SO THEY DIDN'T HAVE ANY FRAMES. SO THEY GAVE ME A BRAND NEW 2022 FRAME.' Jordan explains that the frame cracked during the COVID-19 pandemic and that he received a new 2022 frame as a replacement, which he then enhanced with custom features. The audio mirrors his explanation.\n\nThe video then zooms in to highlight the specific custom features of Jordan's yellow Santa Cruz Bronson CC bike. Close-up shots showcase the bike's front and rear suspension with 'Marzocchi Bomber shock', the 'DT SWISS WHEELS', 'BUNCH OF ONE UP COMPONENTS', and a 'CHROMAG SEAT'. The audio captures Jordan elaborating on these details: \"AND THEN I DECIDED JUST TO BUILD UP A BIT OF CUSTOM WITH SOME MARZOCCHI SUSPENSION FRONT AND BACK. DT SWISS WHEELS. A BUNCH OF ONE-UP COMPONENTS. LOCAL SQUAMISH COMPANY. 
CHROMAG SEAT.\"\n\nStill focusing on the bike's custom features, the scene shows more close-ups of the bike, such as the Garmin navigation device and water bottle. Jordan's enthusiastic words continue, \"AND YEAH, I LOVE IT. IT ABSOLUTELY RIPS.\" The camera zooms in on the handlebars showing a Garmin device with the interviewer commenting, \"I SEE YOU REPPING THE GARMIN UP FRONT. LIKE IT, YOU'VE GOT A PUMP, WATER BOTTLE. WHAT OTHER LITTLE FEATURES, LIKE STICKERS OR SOMETHING?\" \n\nThe fifth clip cuts to a wider view focusing on the bike. After the interviewer asks about additional features like stickers, Jordan points them out and says: \"LITTLE FEATURES...STICKERS...I GOT MY RIDE OR DIE STICKERS, GOT A RIDE OR DIE. SUPPORT LOCAL COAST GRAVITY SUNSHINE COAST COMPANY, AND WORK LESS RIDE MORE.\"\n\nFinally, the video brings the interview narrative to its informative close. With the scene still focused on the yellow bike, the clip cuts to the sticker on the bike frame labeled 'WORK LESS RIDE MORE.' Jordan and the interviewer laugh and discuss the sticker. Laughter and the joy of biking culture fill the air, ending with the words 'SUPER NICE,' capturing the spirit of the event. The audio and visual elements sync well as Jordan and the interviewer agree on the motto, concluding with the interviewer's appreciation for the bike and its message: \"I LIKE THAT ONE. WORK LESS RIDE MORE. SUPER NICE.\" \"A MOTTO TO LIVE BY.\" \"YEAH, THERE IS.\""
+ }
+ ],
+ "events": [
+ "The video opens with an outdoor scene.",
+ "A man in a black t-shirt is conducting an interview.",
+ "The man holds a microphone.",
+ "The man engages with another rider dressed in a beige t-shirt with 'COAST PARK' printed on it and a helmet.",
+ "The scene is surrounded by bicycles and people.",
+ "Text is displayed at the bottom of the screen showing speech: white text on a red background for the interviewer, white text on a dark green background for the interviewee.",
+ "The environment suggests a community biking event.",
+ "A vibrant yellow Santa Cruz Bronson CC bike is the center of attention.",
+ "Overlay text reads: 'RIGHT, WE'VE GOT JORDAN HERE WITH YOUR MAGNIFICENT SANTA CRUZ BRONSON CC. TELL US A BIT ABOUT YOUR BIKE BECAUSE I THINK IT'S SUPER NICE.'",
+ "The man in the black t-shirt asks Jordan to talk about his bike.",
+ "Jordan begins with 'Thank you.'",
+ "Jordan explains the story behind his bike's frame.",
+ "Text overlays the frame: 'IT'S UPGRADED FROM THE 2019, I CRACKED MY FRAME. IT WAS DURING THE COVID THING, SO THEY DIDN'T HAVE ANY FRAMES. SO THEY GAVE ME A BRAND NEW 2022 FRAME.'",
+ "Jordan discusses the frame cracking during the COVID-19 pandemic.",
+ "Jordan received a new 2022 frame as a replacement.",
+ "Jordan enhanced the bike with custom features.",
+ "The audio mirrors his explanation.",
+ "The video zooms in to highlight the specific custom features of Jordan's yellow Santa Cruz Bronson CC bike.",
+ "Close-up shots showcase the bike's front and rear suspension with 'Marzocchi Bomber shock'.",
+ "Close-up shots showcase the 'DT SWISS WHEELS'.",
+ "Close-up shots showcase a 'BUNCH OF ONE UP COMPONENTS'.",
+ "Close-up shots showcase a 'CHROMAG SEAT'.",
+ "The audio captures Jordan elaborating on these details, mentioning the customization.",
+ "The scene shows more close-ups of the bike, such as the Garmin navigation device and water bottle.",
+ "Jordan enthusiastically says, 'AND YEAH, I LOVE IT. IT ABSOLUTELY RIPS.'",
+ "The camera zooms in on the handlebars showing a Garmin device.",
+ "The interviewer comments on the Garmin device and asks about other features like stickers.",
+ "Jordan points out and mentions his stickers: 'LITTLE FEATURES...STICKERS...I GOT MY RIDE OR DIE STICKERS, GOT A RIDE OR DIE. SUPPORT LOCAL COAST GRAVITY SUNSHINE COAST COMPANY, AND WORK LESS RIDE MORE.'",
+ "The scene cuts to a wider view focusing on the bike.",
+ "The scene cuts to the sticker on the bike frame labeled 'WORK LESS RIDE MORE.'",
+ "Jordan and the interviewer laugh and discuss the sticker.",
+ "Laughter and the joy of biking culture fill the air.",
+ "The interview concludes with the scene still focused on the yellow bike.",
+ "The interviewer appreciates the bike and its message.",
+ "The interviewer and Jordan agree on the motto 'WORK LESS RIDE MORE.'",
+ "The video ends with a positive note capturing the spirit of the event."
+ ]
+ },
+ {
+ "video": "video/128.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a man holding a black mask with two circular filters on the front. The mask is sleek and modern in design. As the scene unfolds, the man presses a button on the side of the mask, causing the circular filters to light up with white rings. The man adjusts the mask, and then puts it on, securing it over his nose and mouth. The mask fits snugly, and the lights on the filters remain illuminated. The man then looks around, showcasing the mask from different angles, with the dark background contrasting against his light-colored hoodie, which emphasizes the mask's features. \n\nOn the audio track, a robotic voice comments, \"Because this is the most overkill face mask, designed from the one and only Will.i.am. Although I can't deny it looks freaking cool. Made with a hard exterior plastic shell in case you punch yourself or something but thankfully with some softer material wrapped around the rest of your head,\" followed by the sound of the mask being punched. The man punches the mask and then touches its side, showing the buttons and soft material there.\n\nThe man takes a deep breath. As the clip continues, the focus shifts to the detailed parts of the mask. Close-up shots show the man's hand removing a black lid and placing a white circular filter into a black socket. The interior components of the mask are showcased as the man secures the filter into the socket. The audio explains, \"What makes it really cool though are these buttons, the main one being to basically allow you to properly breathe. That's thanks to the filters built inside these Oreos with an electronic fan that helps with airflow. But it doesn't end there.\"\n\nThe man points to the electronic fan behind his ear. Visuals then emphasize the mask's earbud capabilities. 
Close-ups show the man adjusting a black earbud in his ear, pointing at it to ensure a snug fit, and demonstrating its functionality by pressing small metallic buttons on the mask. Then he puts the other earbud in his ear. Meanwhile, the audio describes, \" It's got Bluetooth earphones magnetically docked behind your ears, with playback controls on the opposite side as well. Because I can't be the only one who's lost an earpod while taking off my mask,\" punctuated by a clearing throat sound for emphasis.\n\nNext, the man adjusts a black cap and then a black strap on the mask. He indicates the strap's position and adjusts it securely, eventually removing the mask to show it. The illuminated filters and small metallic buttons are highlighted again. The audio track critiques, \"I do find the rigid design isn't exactly a universal fit though, but the large strap does make it very comfortable to wear while you head out to get your third $8 latte of the day.\"\n\nThe video shifts back to another close-up of the mask, emphasizing its sleek and modern design as the man adjusts it and showcases its features from different angles. The man touches the button, turning the lights on the mask off. The question on the audio is straightforward, \"So let me know. Would you cop this?\" \n\nThe sequence of clips, combined with the audio descriptions, effectively showcases the innovative design and functionality of the face mask, highlighting its modern features, ease of use, and practical elements, while intermittently providing user impressions and feedback for an engaging overall presentation."
+ }
+ ],
+ "events": [
+ "Man holding black mask with two circular filters",
+ "Close-up of black mask with sleek and modern design",
+ "Man presses button on side of mask",
+ "Circular filters light up with white rings",
+ "Man adjusts mask and secures it over nose and mouth",
+ "Lights on filters remain illuminated",
+ "Man looks around showcasing mask from different angles",
+ "Dark background contrasts with man's light-colored hoodie",
+ "Robotic voice comments about the mask",
+ "Mask described as designed by Will.i.am",
+ "Description of mask's hard exterior plastic shell",
+ "Man punches mask",
+ "Sound of mask being punched",
+ "Man touches the side of the mask",
+ "Close-up showing buttons and soft material on mask",
+ "Man takes a deep breath",
+ "Focus shifts to detailed parts of the mask",
+ "Close-up of man taking out a black lid",
+ "Man places white circular filter into black socket",
+ "Interior components of the mask showcased",
+ "Man secures filter into socket",
+ "Audio explains breathing functionality of mask",
+ "Man points to electronic fan behind his ear",
+ "Visuals emphasize mask's earbud capabilities",
+ "Close-up of man adjusting a black earbud in his ear",
+ "Man points to earbud ensuring snug fit",
+ "Man demonstrates earbud functionality by pressing buttons on mask",
+ "Man puts other earbud in his ear",
+ "Audio description of Bluetooth earphones magnetically docked",
+ "Playback controls highlighted",
+ "Sound of throat clearing",
+ "Man adjusts a black cap on the mask",
+ "Man adjusts black strap, securing it",
+ "Man removes mask to show it",
+ "Highlighted illuminated filters and metallic buttons",
+ "Audio critique of rigid design and large strap",
+ "Video shifts to another close-up of mask",
+ "Man showcases mask features from different angles",
+ "Man touches button turning mask lights off",
+ "Audio poses a question about buying the mask",
+ "Sequences and audio effectively showcase mask's design and functionality",
+ "Audio provides user impressions and feedback"
+ ]
+ },
+ {
+ "video": "video/129.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a woman in a pink shirt standing in a kitchen, holding a large mango. Various kitchen items like jars, a plant, and a cutting board are visible in the background. The woman explains different ways to cut a mango. The woman places the mango on a light blue cutting board, and the text \"favorites\" and \"the way I grew up eating it\" are shown before she picks up a small paring knife. As she starts peeling the mango, the text \"Peel the mango\" and \"with the paring knife\" are displayed. A woman's voice accompanies the scene, stating, \"There's actually a ton of ways to cut a mango, but I'm going to show you my favorites. Now, this is the way I grew up eating it. Peel the mango with a paring knife. It's not a way that I've seen done very often,\" followed by the sound of a knife peeling the mango.\nAs the camera zooms out, the text \"But it's the way I grew up doing it. So I'm pretty good at it. I try to go as close to the middle as I can and what I get is a really really really\" appears on the screen. The woman further explains her technique while background music plays softly.\nThe camera shows the woman holding two halves of the peeled mango, with the text \"a really, really, really nice thick slice and here you can\". She presents a nice thick slice of the mango. The woman steps back to show a wider kitchen view while explaining different cutting methods; the woman's audio narrative continues, saying, \"nice thick slice. And here you can pretty much do it however you want. You can chop it up into slices. You can cut it into chunks. You can do it into wedges and eat it whole,\" with the sound of the knife chopping and slicing heard in the background.\nHolding a wedge of mango, the woman takes a bite, following which the scene zooms into her hands handling the mango. The woman describes another popular method. Then, the woman shares a childhood memory. 
Concurrently, the woman narrates, \"There's definitely another way to do it that's very popular. I remember growing up and seeing my mom cut it the way that I just showed you. And when the first time I saw it being cut,\" with background music continuing.\nExplaining a new method, the woman holds half of a mango, expressing her reaction. The focus shifts to her hands holding the mango and a paring knife just before she demonstrates the scoring method. Then, she scores the mango. The woman\u2019s voice narrates, \"like this, my mind was completely blown away. Cut it in half and then with your knife, score it, making sure not to go all the way down. Do like a checker pattern,\" accompanied by the sound of the knife.\nThe final scene shows a close-up of the scored mango in a checker pattern. The woman demonstrates flipping the mango. She explains further. The clip closes with a detailed view of hands picking off mango pieces. The woman concludes, saying, \"flip it over like a porcupine and then from there you can just pick it off like this,\" while the background music continues to play.\nThis step-by-step guide to cutting a mango interweaves personal narrative and practical instruction, enriched by on-screen text throughout."
+ }
+ ],
+ "events": [
+ "Woman in pink shirt stands in a kitchen holding a large mango",
+ "Various kitchen items like jars, a plant, and a cutting board are visible in the background",
+ "Woman explains different ways to cut a mango",
+ "Woman places mango on a light blue cutting board",
+ "Text 'favorites' and 'the way I grew up eating it' are shown",
+ "Woman picks up a small paring knife",
+ "Text 'Peel the mango' and 'with the paring knife' are displayed",
+ "Woman starts peeling the mango",
+ "Woman's voice states, 'There's actually a ton of ways to cut a mango, but I'm going to show you my favorites. Now, this is the way I grew up eating it. Peel the mango with a paring knife. It's not a way that I've seen done very often'",
+ "Sound of knife peeling the mango",
+ "Camera zooms out",
+ "Text 'But it's the way I grew up doing it. So I'm pretty good at it. I try to go as close to the middle as I can and what I get is a really really really' appears on the screen",
+ "Woman further explains her technique while background music plays softly",
+ "Camera shows woman holding two halves of peeled mango",
+ "Text 'a really, really, really nice thick slice and here you can' is shown",
+ "Woman presents a nice thick slice of the mango",
+ "Woman steps back to show a wider kitchen view",
+ "Woman explains different cutting methods",
+ "Woman's audio narrative continues, 'nice thick slice. And here you can pretty much do it however you want. You can chop it up into slices. You can cut it into chunks. You can do it into wedges and eat it whole'",
+ "Sound of knife chopping and slicing is heard in the background",
+ "Woman holds a wedge of mango and takes a bite",
+ "Scene zooms into woman's hands handling the mango",
+ "Woman describes another popular method",
+ "Woman shares a childhood memory while narrating, 'There's definitely another way to do it that's very popular. I remember growing up and seeing my mom cut it the way that I just showed you. And when the first time I saw it being cut'",
+ "Background music continues",
+ "Woman holds half of a mango and expresses her reaction",
+ "Focus shifts to woman's hands holding the mango and a paring knife",
+ "Woman demonstrates the scoring method",
+ "Woman scores the mango",
+ "Narration, 'like this, my mind was completely blown away. Cut it in half and then with your knife, score it, making sure not to go all the way down. Do like a checker pattern'",
+ "Sound of the knife cutting",
+ "Close-up of the scored mango in a checker pattern",
+ "Woman demonstrates flipping the mango",
+ "Woman explains further while flipping the mango",
+ "Close-up view of hands picking off mango pieces",
+ "Woman concludes, saying, 'flip it over like a porcupine and then from there you can just pick it off like this'",
+ "Background music continues to play",
+ "Step-by-step guide to cutting a mango interweaves personal narrative and practical instruction"
+ ]
+ },
+ {
+ "video": "video/130.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a focus on the rear part of a motorcycle in a garage. The camera captures the large, black tire with distinct tread patterns and a blue rim, as well as the 'WASHINGTON 2022 5OPAID' license plate. The exhaust pipe and mechanical parts are also visible. A man in a gray shirt with a NASA logo and a black smartwatch gestures towards the tire, explaining something. The man then licks the tire and begins to spit. The background shows various tools and equipment. Simultaneously, the audio reveals the man explaining loudly that \"Most people know that new tires are covered in a releasing agent that helps them get extracted from the mold that they're in. You can see it on new tires, and you can even taste it.\" before the man spits.\n\nThe video continues to show the tire and license plate. The man in the NASA shirt is seen gesturing again. Text overlays include a screenshot from TireRack.com reading 'HOW DO I BREAK IN TIRES?' and a red circled note 'of 500 miles' with 'WTF'. The man exclaims, \"fuck I should not have done that,\" then continues, \"but what people really underestimate is how differently your bike is going to handle with new tires that have a completely new profile.\" The audio of the same man provides further explanation.\n\nStill focused on the motorcycle's rear and the same descriptive elements, another image-text overlay appears showing 'AIN'T NOBODY GOT TIME FOR THAT.' The audio aligns with the visuals as the man says, \"The shops are going to tell you to take it really easy and slow for the first 50 miles. 
But what really needs to happen is the tires just need to heat up so that those compounds loosen and then wear off.\" The action shifts to a split screen: the middle part of the screen shows the motorcycle's dashboard on a highway at 158 mph while the man continues excitedly, \"What I do is I jump on the highway and then for about five miles, I give it some speed to get them nice and warm,\" while the remaining area shows a close-up of the 'WASHINGTON 2022 5OPAID' license plate with the man beside it.\n\nThe middle part of the view continues the first-person perspective of the dashboard, with speeds increasing to 162-172 mph. The rider, with gloves featuring a blue and white star pattern, grips the handlebars. The video then moves to a quiet street, showing the rider carving S-shaped turns and riding further and further away. A text overlay states, 'Also called 'Scrubbing in' the tires,' explaining the process of breaking in new tires. The rider narrates, \"Then I take my bike onto some city streets or a parking lot and sneak it back and forth with overemphasized lean.\"\n\nIn an outdoor setting, the rider in the gray NASA shirt and helmet continues carving S-shaped turns toward the camera, with a background of trees, parked trucks, and distant mountains. Soon afterwards, the man rides straight toward the camera with the front wheel raised. The explanation continues, \"so I don't fall and keeping to lower speeds. That way I have time and distance to recover if it looks like I'm going to layer down.\" \n\nFinally, the view returns to the close-up of the rear tire and license plate in the garage. The man in the NASA shirt gestures and points at the tire. A text overlay 'YT|SquidTips' with the YouTube logo appears, suggesting the context of a YouTube series. The audio features the speaker summing up, \"I find that after 10 or 20 miles, I'm comfortable again. 
I understand how the bike handles and I can feel it gripping in the corners and I can just go back to being an asshole.\"\n\nThe video effectively combines practical motorcycle maintenance tips with light-hearted commentary, encapsulating both visual focus on the bike's mechanics and spirited audio guidance on tire break-in techniques."
+ }
+ ],
+ "events": [
+ "Video starts with a focus on the rear part of a motorcycle in a garage",
+ "Camera captures the large, black tire with distinct tread patterns and a blue rim",
+ "Camera shows 'WASHINGTON 2022 5OPAID' license plate",
+ "Exhaust pipe and mechanical parts are visible",
+ "Man in a gray shirt with a NASA logo and a black smartwatch gestures towards the tire",
+ "Man explains something about the tire",
+ "Man licks the tire and begins to spit",
+ "Background shows various tools and equipment",
+ "Audio reveals man explaining loudly about the releasing agent on new tires",
+ "Man spits",
+ "Video continues to show the tire and license plate",
+ "Man in NASA shirt seen gesturing again",
+ "Text overlay includes screenshot from TireRack.com reading 'HOW DO I BREAK IN TIRES?'",
+ "Text overlay shows a red circled note 'of 500 miles' with 'WTF'",
+ "Man exclaims, 'fuck I should not have done that' and continues explaining",
+ "Man explains how new tires affect bike handling differently due to new profile",
+ "Image-text overlay 'AIN'T NOBODY GOT TIME FOR THAT' appears",
+ "Man explains shops advise taking it easy and slow for first 50 miles",
+ "Man narrates that tires need to heat up to loosen compounds and wear off",
+ "Action shifts to a split-screen showing motorcycle's dashboard at 158 mph on a highway",
+ "Man continues explaining his method of speeding up on the highway to warm tires",
+ "Middle screen shows first-person perspective of dashboard speeds increasing to 162-172 mph",
+ "Rider with blue and white star pattern gloves grips the handlebars",
+ "Video shifts to quiet street showing rider performing turn and taking route S",
+ "Text overlay states, 'Also called 'Scrubbing in' the tires'",
+ "Rider narrates taking bike to city streets or parking lot and sneaking back and forth with overemphasized lean",
+ "Rider in gray NASA shirt and helmet performs a turn on a quiet street",
+ "Background includes trees, parked trucks, and distant mountains",
+ "Rider drives straight with front wheel raised",
+ "Explanation continues about keeping to lower speeds for recovery time",
+ "View returns to the close-up of the rear tire and license plate in the garage",
+ "Man in NASA shirt gestures and points at the tire",
+ "Text overlay 'YT|SquidTips' with YouTube logo appears",
+ "Audio features speaker summing up his comfort and understanding after 10-20 miles",
+ "Speaker mentions feeling the bike gripping in corners and returning to normal riding",
+ "Video combines practical maintenance tips with light-hearted commentary"
+ ]
+ },
+ {
+ "video": "video/131.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a view of a wooden door with a white towel hanging on it, followed by a woman in a black long-sleeve shirt holding a white sneaker. She shows the sneaker up close, highlighting its details. Then, the scene changes to a brightly lit room where the woman is holding a colorful sneaker with a blue, yellow, and orange design, seemingly explaining its features. The video cuts to the woman holding a white sneaker and using a green pencil to start drawing on it, initiating a custom design process. \n\nIn the audio, a woman with a high-pitched voice shares, \"My boyfriend wants Pokemon shoes, so I'm going to make it for him. Got the Charmander vibes here. This is the drawing. This is what it's going to look like. Obviously very detailed. Throughout Charmander right here, and he's going to be puking out fire.\"\n\nThe scene shifts to her holding and continuing to draw on the white sneaker with the green pencil, following through with her design idea. A close-up shows the beginning of the painting process with yellow paint carefully applied to a section of the sneaker. Another scene reveals the woman painting the toe area with blue paint. The final touch is a showcase of the completed custom sneaker, vibrantly colored in blue, orange, and yellow. She proudly displays the finished design.\n\nContinuing with the narrative, the woman in the audio comments, \"Doing his little belly first. In fact, if you want to paint shoes, the shoe paint is really sheer if you buy it off of Amazon. We've got the base ready. Look at the base. Looking pretty clean.\"\n\nThe focus returns to the artwork as the woman showcases the brightly colored sneaker featuring a cartoon character spewing fire. Using a small brush, she adds more details to the character. The final reveal shows her holding two completed sneakers, featuring the same whimsical, colorful design, as she explains the result.\n\nThe woman's voice exclaims in the audio, \"He's literally just puking. 
He is puking out fire. So I finished the shoes and we are now going to go surprise my boyfriend. So enjoy!\"\n\nNext, we see a modern living room where a man in a tan hoodie with red plaid pants sits on a gray couch, playing a video game. The woman in a black dress and black heels walks in and joins him on the couch. She engages him in conversation and shows him a small purple pillow, which piques his interest.\n\nIn the audio, a conversation unfolds: \"Hey babe. Yeah? I got you a gift. No way. What? I got you a little ditto. Oh, it's so cute. I love dittos. But I also got you one more gift. This is the best.\"\n\nThe woman retrieves a red Nike shoe box from behind the couch and presents it to the man, who curiously examines it. They talk about the contents of the box, which holds the customized sneakers.\n\nThe accompanying audio voices their excitement, \"So I got you this. Shut the front door. Shoes. It's a little ripped. Don't look too close. What? Where did you get these? I made 'em.\"\n\nLastly, the man sits holding the open shoe box containing the brightly colored sneakers. He takes one sneaker out of the box and shows its design to the woman next to him. The woman admires the details of the sneaker, particularly the blue, orange, and yellow segments with the cartoon character breathing fire. They share a heartfelt hug, reflecting a moment of appreciation and celebration.\n\nThe final audio reflects their joy and amazement: \"No, you didn't. What? Baby, these are fire!\" followed by the woman's laughter, solidifying the heartfelt conclusion of the video."
+ }
+ ],
+ "events": [
+ "View of a wooden door with a white towel hanging on it.",
+ "Woman in a black long-sleeve shirt holding a white sneaker.",
+ "Woman shows the sneaker up close, highlighting its details.",
+ "Scene changes to a brightly lit room.",
+ "Woman holding a colorful sneaker with a blue, yellow, and orange design, explaining its features.",
+ "Cut to woman holding a white sneaker and using a green pencil to start drawing on it.",
+ "Woman's audio: My boyfriend wants Pokemon shoes, so I'm going to make it for him.",
+ "Woman's audio: Got the Charmander vibes here.",
+ "Woman's audio: This is the drawing.",
+ "Woman's audio: This is what it's going to look like.",
+ "Woman's audio: Obviously very detailed.",
+ "Woman's audio: Throughout Charmander right here, and he's going to be puking out fire.",
+ "Woman continues to draw on the white sneaker with the green pencil.",
+ "Close-up shows the beginning of the painting process with yellow paint on a section of the sneaker.",
+ "Woman painting the toe area with blue paint.",
+ "Showcase of the completed custom sneaker with vibrant colors: blue, orange, and yellow.",
+ "Woman proudly displays the finished design.",
+ "Woman's audio: Doing his little belly first.",
+ "Woman's audio: If you want to paint shoes, the shoe paint off of Amazon is really sheer.",
+ "Woman's audio: We've got the base ready.",
+ "Woman's audio: Look at the base. Looking pretty clean.",
+ "Woman showcases the brightly colored sneaker featuring a cartoon character spewing fire.",
+ "Using a small brush, woman adds more details to the character.",
+ "Final reveal of woman holding two completed sneakers with colorful, whimsical design.",
+ "Woman explains the result.",
+ "Woman's audio: He's literally just puking. He is puking out fire.",
+ "Woman's audio: I finished the shoes and we are now going to go surprise my boyfriend.",
+ "Woman's audio: So enjoy!",
+ "Modern living room where man in a tan hoodie with red plaid pants sits on gray couch playing a video game.",
+ "Woman in a black dress and black heels walks in and joins him on couch.",
+ "Woman engages man in conversation and shows him a small purple pillow.",
+ "Man shows interest in the small purple pillow.",
+ "Woman's audio conversation: Hey babe.",
+ "Man's audio response: Yeah?",
+ "Woman's audio: I got you a gift.",
+ "Man's audio: No way. What?",
+ "Woman's audio: I got you a little ditto.",
+ "Man's audio: Oh, it's so cute. I love dittos.",
+ "Woman's audio: But I also got you one more gift.",
+ "Woman's audio: This is the best.",
+ "Woman retrieves red Nike shoe box from behind the couch.",
+ "Woman presents the box to the man.",
+ "Man curiously examines the box.",
+ "They talk about the content within the box.",
+ "Box contains the customized sneakers.",
+ "Audio of shared excitement: So I got you this.",
+ "Man's audio: Shut the front door.",
+ "Woman's audio: Shoes.",
+ "Man's audio: It's a little ripped. Don't look too close.",
+ "Man's audio: What? Where did you get these?",
+ "Woman's audio: I made 'em.",
+ "Man sits holding the open shoe box with brightly colored sneakers inside.",
+ "Man takes one sneaker out of the box and shows its design to the woman next to him.",
+ "Woman admires the details of the sneaker: blue, orange, yellow segments, cartoon character breathing fire.",
+ "They share a heartfelt hug.",
+ "Moment of appreciation and celebration.",
+ "Final audio reflects joy and amazement: No, you didn't.",
+ "Man's audio: What? Baby, these are fire!",
+ "Woman's laughter",
+ "Heartfelt conclusion of the video."
+ ]
+ },
+ {
+ "video": "video/132.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a menacing figure emerging from a green mist, dressed in dark, tattered clothing and wielding a large, mechanical weapon. The text 'Do you fear death?' appears on the screen in yellow, accompanied by a corresponding voiceover from the figure. The scene transitions to another character in blue and silver armor with a fur collar, holding a futuristic gun, with the text appearing in white, aligned to the character's response in the background audio, \"Not since I grew up.\" The view shifts back to the first figure, now with glowing red eyes and a skeletal mask, accompanied by the text in yellow except for \"sister\" in red, as the voice ominously declares, \"Perhaps your sister did.\" The scene closes with the first figure, still in the green mist, asking, \"What frightens an Amazon?\" echoing in the audio.\n\nThe focus then moves to a close-up of a female character in golden and red armor, holding a sword. In a defiant voice, she says, \"You will not scare Zeus's daughter,\" with the text appearing in white. In response, the menacing figure from earlier clips, with glowing red eyes and a skeletal mask, retorts with 'Sounds like daddy issues.', accompanied by text in yellow except for \"daddy\" in red. The scene culminates with the figure, surrounded by green mist and holding a mechanical weapon. The figure's voice says 'Everyone has something to lose.' and the text in yellow appears.\n\nNext, the scene cuts to a character in blue and dark armor, emitting blue energy, along with the text 'Like you're about to lose consciousness.' appearing in white, syncing with the spoken line. The perspective returns to a close-up of the menacing figure, who coldly says, \"I refer to your baby sister,\" with the corresponding text in yellow except for \"baby sister\" in red. 
This part ends with the menacing figure emerging from a green mist and proclaiming, \"All your rage is for nothing,\" with the corresponding text shown in yellow.\n\nThe final section shifts to a close-up of a character with a blurred face, surrounded by a vivid red, organic-looking texture, with the text 'Vengeance is everything.' in white, accompanied by corresponding voiceover. The scene then returns to the menacing figure, still with glowing red eyes and a skeletal mask, who declares, \"But your family is lost forever,\" with the text shown in yellow except for the \"family\" in red. The video wraps up with the figure, still enveloped in a swirling green mist, clutching the mechanical weapon as ominous music plays in the background, amplifying the mood of enduring conflict and unresolved vengeance."
+ }
+ ],
+ "events": [
+ "A menacing figure emerges from a green mist, dressed in dark, tattered clothing, and wielding a large mechanical weapon.",
+ "The text 'Do you fear death?' appears on the screen in yellow.",
+ "A corresponding voiceover of the figure asks 'Do you fear death?'.",
+ "Scene transitions to a character in blue and silver armor with a fur collar, holding a futuristic gun.",
+ "The text 'Not since I grew up.' appears in white.",
+ "A voice in the background says, 'Not since I grew up.'.",
+ "The view shifts back to the first figure with glowing red eyes and a skeletal mask.",
+ "The text 'Perhaps your sister did.' appears in yellow except for 'sister' which is red.",
+ "The figure's voice ominously declares, 'Perhaps your sister did.'.",
+ "Scene closes with the first figure in the green mist asking, 'What frightens an Amazon?'.",
+ "The question, 'What frightens an Amazon?' echoes in the audio.",
+ "Focus shifts to a close-up of a female character in golden and red armor, holding a sword.",
+ "The female character says, 'You will not scare Zeus\u2019s daughter.' in a defiant voice.",
+ "The text 'You will not scare Zeus\u2019s daughter.' appears in white.",
+ "The menacing figure with glowing red eyes and a skeletal mask retorts, 'Sounds like daddy issues.'.",
+ "The text 'Sounds like daddy issues.' appears in yellow except for 'daddy' which is red.",
+ "The figure surrounded by green mist says, 'Everyone has something to lose.'.",
+ "The text 'Everyone has something to lose.' appears in yellow.",
+ "The scene cuts to a character in blue and dark armor emitting blue energy.",
+ "The text 'Like you\u2019re about to lose consciousness.' appears in white.",
+ "A voice corresponds with 'Like you\u2019re about to lose consciousness.'.",
+ "Perspective returns to a close-up of the menacing figure.",
+ "The figure says coldly, 'I refer to your baby sister.'.",
+ "The text 'I refer to your baby sister.' appears in yellow except for 'baby sister' which is in red.",
+ "The menacing figure emerges from a green mist again.",
+ "The figure proclaims, 'All your rage is for nothing.'.",
+ "The text 'All your rage is for nothing.' appears in yellow.",
+ "Final section shifts to a close-up of a character with a blurred face, red organic-looking texture surrounding.",
+ "The text 'Vengeance is everything.' appears in white.",
+ "A voiceover corresponds with 'Vengeance is everything.'.",
+ "The scene returns to the menacing figure with glowing red eyes and a skeletal mask.",
+ "The figure declares, 'But your family is lost forever.'.",
+ "The text 'But your family is lost forever.' appears in yellow except for 'family' which is red.",
+ "The video wraps up with the figure in swirling green mist, clutching a mechanical weapon.",
+ "Ominous music plays in the background."
+ ]
+ },
+ {
+ "video": "video/133.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a man's hand, adorned with a silver bracelet, interacting with a slot machine in a bustling casino. The man points at various symbols and numbers on the slot machine screen. Overlaid text reads \"IN BETWEEN A D LUCKY JACKPOT EXPERIENCE IN LAS VEGAS\". The man indicates different parts of the slot machine, including the buttons and payout area. Meanwhile, the man enthusiastically announces, \"$100,000. Top dollar. $100,000. Let's go.\" As the video transitions to a casino staff member handling a stack of money labeled \"BEFORE OPENING,\" the man continues, \"So you got $10, $20, $30, $40, $50. Okay, $50.\" The clip focuses on the winnings from the slot machine as background music plays; a woman briefly speaks, followed by the man again.\n\nThe video smoothly transitions to a close-up of another woman's hands holding a stack of money wrapped in plastic, also labeled \"BEFORE OPENING\", with the same overlaid text from the first clip. The man places this stack next to another on a white chair. Another scene shows someone holding a larger stack of money, with the first man's pointing finger emphasizing the monetary gain. The voiceover continues with, \"60, 70, 80, 90, 100, 100 000, okay. Oh, okay, awesome. Here you go, we appreciate it,\" as music plays and both a man and a woman converse.\n\nNext, the video presents a close-up of hands meticulously counting a stack of $100 bills, reiterating the overlaid text \"IN BETWEEN A D LUCKY JACKPOT EXPERIENCE IN LAS VEGAS\". The scene shifts to a casino staff member handing over a white envelope and a stack of money, indicating a substantial prize. 
As the man continues counting, the voiceover counts aloud, \"One, two, three, four, five, six, seven, eight, nine, ten, thank you, just like that, one hundred thousand.\" The man's finger points to a winning combination on the slot machine, solidifying the victory.\n\nThe final part of the video repeats the initial scene, focusing again on the silver bracelet-wearing hand interacting with the slot machine and its symbols. The text \"IN BETWEEN A D LUCKY JACKPOT EXPERIENCE IN LAS VEGAS\" appears one last time as the person navigates the machine, culminating in a screen displaying \"1000 CREDITS\". A voiceover jubilantly declares, \"100,000. Top prize. Yeah.\" The man signals a thumbs-up before the camera zooms in on a stack of money on a white chair next to a black wallet, underscoring the winnings once more. The music carries on, encapsulating the overall excitement and triumph of the jackpot experience in Las Vegas."
+ }
+ ],
+ "events": [
+ "Close-up of a man's hand with a silver bracelet interacting with slot machine",
+ "Man points at various symbols and numbers on slot machine screen",
+ "Overlaid text: 'IN BETWEEN A D LUCKY JACKPOT EXPERIENCE IN LAS VEGAS'",
+ "Man indicates different parts of the slot machine including buttons and payout area",
+ "Man enthusiastically announces, '$100,000. Top dollar. $100,000. Let's go.'",
+ "Transition to casino staff member handling stack of money labeled 'BEFORE OPENING'",
+ "Man continues, 'So you got $10, $20, $30, $40, $50. Okay, $50.'",
+ "Focus on money winnings from the slot machine with background music",
+ "A woman briefly speaks",
+ "Man speaks again",
+ "Transition to close-up of another woman's hands holding a stack of money wrapped in plastic",
+ "Same overlaid text from the first clip",
+ "Man places the stack next to another on a white chair",
+ "Scene of someone holding a larger stack of money",
+ "First man's pointing finger emphasizes monetary gain",
+ "Voiceover continues, '60, 70, 80, 90, 100, 100 000, okay. Oh, okay, awesome. Here you go, we appreciate it.'",
+ "Music plays and man and woman converse",
+ "Close-up of hands meticulously counting stack of $100 bills",
+ "Reiterated overlaid text: 'IN BETWEEN A D LUCKY JACKPOT EXPERIENCE IN LAS VEGAS'",
+ "Scene shifts to casino staff member handing over white envelope and a stack of money, indicating a substantial prize",
+ "Man continues counting aloud, 'One, two, three, four, five, six, seven, eight, nine, ten, thank you, just like that, one hundred thousand.'",
+ "Man's finger points to a winning combination on the slot machine",
+ "Scene solidifies the victory",
+ "Final part of the video repeats initial scene",
+ "Focus again on the silver bracelet-wearing hand interacting with the slot machine and its symbols",
+ "Text 'IN BETWEEN A D LUCKY JACKPOT EXPERIENCE IN LAS VEGAS' appears one last time",
+ "Person navigates the machine",
+ "Screen displaying '1000 CREDITS'",
+ "Voiceover jubilantly declares, '100,000. Top prize. Yeah.'",
+ "Man signals a thumbs-up",
+ "Camera zooms in on stack of money on white chair next to black wallet",
+ "Music encapsulates overall excitement and triumph of the jackpot experience in Las Vegas"
+ ]
+ },
+ {
+ "video": "video/134.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a detailed close-up of the knife sharpening process, where a hand holds a precision sharpening tool over a green-handled knife clamped securely in place. The sharpening tool, a metallic device with various adjustment knobs and a long rod, is moved back and forth along the knife's edge, creating sparks and metal shavings. The background features a white surface with some black stains, a cardboard box with 'PRIORITY MAIL' and 'TRACKED INSURED' labels, and a bottle of isopropyl alcohol, illustrating a well-organized, practical workspace. \n\nAs the rod meticulously slides across the blade, a man explains, \"In this video, I'm sharpening a super steel known as CPM 4V. CPM 4V is incredible steel; it's very tough and in the right hands takes an incredibly sharp edge.\" The monotone yet informative narration complements the visual focus on the precision and care involved in sharpening such a resilient steel.\n\nThe process continues with the tool in action, showcasing the repetitive and careful strokes essential for achieving a sharp edge. This hand tool flips over and continues to sharpen the other side of the knife. The man continues, \"I'm sharpening it on the TS Prof Blitz Sharpening System, which is a very professional sharpening system. And I'm using resin-bonded diamonds to sharpen it.\" His voice reflects the expertise and reliability of the tool being used, with no background noise, ensuring the viewer's attention is solely on the sharpening process.\n\nThe scene transitions to another close-up, still focused on the precision tool. The man narrates, \"Now I'm sharpening it at 15 degrees per side, right around 5000 grit or 3/2 micron, and you can see how incredibly sharp this edge got.\" His explanation is meticulous, showing the advanced skills required for honing a blade to such fine tolerances.\n\nSwitching to a different setting, the video shows a wooden paddle with a dark surface. 
A hand, still holding the green-handled knife, strops the knife back and forth on the paddle to refine the edge further. The background remains consistent with the white surface, black stains, cardboard box, and isopropyl alcohol, portraying continuity. The clip progresses to demonstrate the knife's sharpness as it effortlessly slices through a piece of paper from a magazine or catalog, filled with various images and texts.\n\nNext, the man states, \"Now the knife itself is the Kaiser Begleiter 2, which does come in different steels. This just happens to be the 4V version and I do show and teach a lot of sharpening tips and tricks on my YouTube channel along with lots of knife reviews and other knife content.\" His narrative praises the knife's performance, connecting the visual evidence of its sharpness with its specialized design.\n\nThe knife slicing through the paper extends to another piece, a white paper towel, demonstrating the same effortless precision. The consistent background reinforces the focused atmosphere of the workspace. The extreme close-up of a hand holding a single hair, delicately cutting it with the green-handled knife, visually confirms the knife's extraordinary sharpness.\n\nThe man narrates, \"I also offer a sharpening service for those that do need knives sharpened but if you want more knife topics or want to see more knife-related subjects, watch my YouTube channel. Peace.\" His voice maintains a consistent, informative tone, emphasizing the educational aspect of the video.\n\nConcluding the sequence, a close-up shot features the hand holding the green-handled knife, now with a black blade, rotating it to showcase the refined sharp edge against a yellow surface, a stark contrast to the previous white background. The video highlights the sleekness and sharpness achieved, underlining the effective sharpening process.\n\nThe concluding words provide a call to action, inviting viewers to further engage with the content creator's expertise. 
The video ends with a close-up of the knife's edge, summarizing the entire process and celebrating the precision and care devoted to achieving the perfect sharpness."
+ }
+ ],
+ "events": [
+ "The video begins with a detailed close-up of the knife sharpening process",
+ "A hand holds a precision sharpening tool over a green-handled knife clamped securely in place",
+ "The sharpening tool, a metallic device with various adjustment knobs and a long rod, is moved back and forth along the knife's edge",
+ "Sparks and metal shavings are created",
+ "The background features a white surface with some black stains, a cardboard box with 'PRIORITY MAIL' and 'TRACKED INSURED' labels, and a bottle of isopropyl alcohol",
+ "A man explains, 'In this video, I'm sharpening a super steel known as CPM 4V. CPM 4V is incredible steel; it's very tough and in the right hands takes an incredibly sharp edge.'",
+ "The rod meticulously slides across the blade",
+ "The tool showcases repetitive and careful strokes essential for achieving a sharp edge",
+ "The hand tool flips over to sharpen the other side of the knife",
+ "The man continues, 'I'm sharpening it on the TS Prof Blitz Sharpening System, which is a very professional sharpening system. And I'm using resin-bonded diamonds to sharpen it.'",
+ "The scene transitions to another close-up, still focused on the precision tool",
+ "The man narrates, 'Now I'm sharpening it at 15 degrees per side, right around 5000 grit or 3/2 micron, and you can see how incredibly sharp this edge got.'",
+ "The close-up showcases the advanced skills required for honing a blade to such fine tolerances",
+ "The setting switches to a wooden paddle with a dark surface",
+ "A hand strops the green-handled knife back and forth on the paddle to refine the edge further",
+ "The background remains consistent with the white surface, black stains, cardboard box, and isopropyl alcohol",
+ "The knife effortlessly slices through a piece of paper from a magazine or catalog filled with various images and texts",
+ "The man states, 'Now the knife itself is the Kaiser Begleiter 2, which does come in different steels. This just happens to be the 4V version and I do show and teach a lot of sharpening tips and tricks on my YouTube channel along with lots of knife reviews and other knife content.'",
+ "The knife slicing through the paper extends to another piece, a white paper towel, demonstrating the same effortless precision",
+ "The background is consistent, reinforcing the focused atmosphere of the workspace",
+ "An extreme close-up shows a hand holding a single hair and delicately cutting it with the green-handled knife",
+ "The man narrates, 'I also offer a sharpening service for those that do need knives sharpened but if you want more knife topics or want to see more knife-related subjects, watch my YouTube channel. Peace.'",
+ "Concluding the sequence, a close-up shot features the hand holding the green-handled knife, now with a black blade, rotating it to showcase the refined sharp edge against a yellow surface",
+ "The video highlights the sleekness and sharpness achieved",
+ "The video ends with a close-up of the knife's edge, summarizing the entire process and celebrating the precision and care devoted to achieving the perfect sharpness."
+ ]
+ },
+ {
+ "video": "video/135.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video showcases a fun scene on a baseball field involving three people: one man and two women. The setting revolves around a friendly high school baseball vs. high school softball competition, filled with humor and camaraderie.\n\nThe background features a green grassy field, with trees and a clear blue sky. The text at the top of the screen reads, \"High School Baseball vs. High School Softball Part 2.\" The three individuals on the field are: on the left, a man in a blue shirt with \"COWBOYS\" written in yellow, black shorts, and white sneakers, holding a baseball glove and a ball; in the middle, a woman in a purple shirt, black shorts with a white pattern, black gloves, and white sneakers, holding a white baseball bat; and on the right, a woman in a neon green shirt with a graphic, black shorts with a white pattern, and white sneakers, holding a baseball glove. The scene progresses with the man in the blue shirt demonstrating some actions with the baseball and glove while the two women watch. A close-up shows the man's face and upper body, and the clip ends with a focus on the woman in the middle, now wearing a helmet and preparing to bat.\n\nIn this part, a lively conversation unfolds:\n- Voice 1: \u201cAlright, we\u2019ve got high school baseball and high school softball, and we\u2019re about to pitch to them. How do you feel about this one?\u201d\n- Voice 2: \u201cBro, I\u2019m about to strike him out.\u201d\n- Voice 1: \u201cOkay. How are you feeling? Think you\u2019re going to hit him?\u201d\n- Voice 3: \u201cI feel like he\u2019s going to hit me.\u201d\n\nLaughter and the group\u2019s conversation create an energetic atmosphere.\n\nThe scene transitions to a close-up of the woman in the purple shirt, wearing a helmet and holding the bat, with a chain-link fence and buildings in the background. As she walks toward a batting net, the camera shows a wider shot where the man in the blue shirt pitches the ball. 
The ball is hit high, and from a distant view of the field, the ball lands on the ground.\n\nDuring this, the audio captures an amusing exchange:\n- Voice 1: \u201cOkay, first pitch. Oh, no. Okay, this is not going to be good. Oh\u2026\u201d\n\nThe conversation is accompanied by group laughter and talking, enhancing the light-hearted tone.\n\nThe video further shows the man in the blue shirt walking on the sandy field with a baseball glove. As he pitches, the woman in the purple shirt practices swinging near an orange cone. Repeated attempts show the man pitching while the other woman continues to practice batting, with the chain-link fence, trees, and grassy area in the background.\n\nAn audio snippet features light-hearted confusion:\n- Voice 1: \u201cWait, what? A ball?\u201d\n\nMore laughter and conversations among the group flow smoothly.\n\nThe sequence continues with another pitch from the man in the blue shirt toward the purple-shirted batter. The screen displays \u201c20th pitch.\u201d The bat swings near the orange cone, and the pitching process continues. The camera zooms in, revealing the man\u2019s focused expression.\n\nThe dialogue adds more humor to the practice:\n- Voice 1: \u201cYour first run! Let\u2019s go! It\u2019s a change-up. A change-up? You can barely throw a fastball. Uh-oh. Yeah! No, Tony.\u201d\n\nBackground chatter and laughter add to the engaging scene.\n\nIn the next scene, the man in the blue shirt retrieves a baseball from the ground and pitches it again. The purple-shirted player hits the ball, sending it flying, while the pitcher watches its trajectory before preparing for another pitch.\n\nThe audio captures an ongoing discussion mixed with laughter:\n- Voice 1: \u201cThat\u2019s the one that works. Okay right, slow it down. It still ain\u2019t working.\u201d\n\nFinally, the man in the blue shirt approaches the pitching position, while the purple-shirted player stands ready to bat near an orange cone. 
The woman in the neon green shirt is seen standing near a fence. The batter swings at the pitched ball. The screen ends with \"Follow and comment what we should do next!\" in red.\n\nThe closing audio wraps up the activity:\n- Voice 1: \u201cThat was your best pitch, easily. Brett might have figured it out.\u201d\n- Voice 2: \u201cNo, he did not.\u201d\n- Voice 1: \u201cOkay, yeah.\u201d\n\nThe video presents an entertaining, interactive baseball practice session with humorous commentary, reinforcing the theme of friendly competition."
+ }
+ ],
+ "events": [
+ "Three people (one man and two women) are on a baseball field.",
+ "The setting involves a friendly high school baseball vs. high school softball competition.",
+ "Background includes a green grassy field, trees, and a clear blue sky.",
+ "Text on the screen reads: 'High School Baseball vs. High School Softball Part 2'.",
+ "The man on the left is wearing a blue shirt with 'COWBOYS' written in yellow, black shorts, and white sneakers, and he is holding a baseball glove and ball.",
+ "The woman in the middle is wearing a purple shirt, black shorts with a white pattern, black gloves, and white sneakers, and she is holding a white baseball bat.",
+ "The woman on the right is wearing a neon green shirt with a graphic, black shorts with a white pattern, and white sneakers, and she is holding a baseball glove.",
+ "The man in the blue shirt demonstrates actions with the baseball and glove while the two women watch.",
+ "Close-up of the man's face and upper body.",
+ "Focus on the woman in the middle wearing a helmet and preparing to bat.",
+ "Voice 1: 'Alright, we\u2019ve got high school baseball and high school softball, and we\u2019re about to pitch to them. How do you feel about this one?'",
+ "Voice 2: 'Bro, I\u2019m about to strike him out.'",
+ "Voice 1: 'Okay. How are you feeling? Think you\u2019re going to hit him?'",
+ "Voice 3: 'I feel like he\u2019s going to hit me.'",
+ "Laughter and conversation add an energetic atmosphere.",
+ "Scene transitions to a close-up of the woman in the purple shirt wearing a helmet and holding a bat, with a chain-link fence and buildings in the background.",
+ "The woman walks toward a batting net.",
+ "Wider shot shows the man in the blue shirt pitching the ball.",
+ "The woman in purple hits the ball high.",
+ "Distant view shows the ball landing on the ground.",
+ "Voice 1: 'Okay, first pitch. Oh, no. Okay, this is not going to be good. Oh...'",
+ "Laughter and talking from the group enhance the light-hearted tone.",
+ "Man in the blue shirt walks on the sandy field with a baseball glove.",
+ "Woman in the purple shirt practices swinging near an orange cone.",
+ "Repeated attempts of the man pitching while the woman practices batting.",
+ "Background includes a chain-link fence, trees, and grassy area.",
+ "Voice 1: 'Wait, what? A ball?'",
+ "Laughter and conversations flow smoothly.",
+ "Man in the blue shirt pitches toward the purple-shirted batter.",
+ "Screen displays '20th pitch.'",
+ "Woman in purple swings near the orange cone.",
+ "Pitching process continues.",
+ "Camera zooms in, revealing the man\u2019s focused expression.",
+ "Voice 1: 'Your first run! Let\u2019s go! It\u2019s a change-up. A change-up? You can barely throw a fastball. Uh-oh. Yeah! No, Tony.'",
+ "Background chatter and laughter add to the engaging scene.",
+ "Man in the blue shirt retrieves a baseball from the ground and pitches it again.",
+ "Purple-shirted player hits the ball, sending it flying.",
+ "Pitcher watches the ball\u2019s trajectory before preparing for another pitch.",
+ "Voice 1: 'That\u2019s the one that works. Okay right, slow it down. It still ain't working.'",
+ "Man in the blue shirt approaches the pitching position.",
+ "Purple-shirted player stands ready to bat near an orange cone.",
+ "Woman in the neon green shirt stands near a fence.",
+ "Batter swings at the pitched ball.",
+ "Screen ends with 'Follow and comment what we should do next!' in red.",
+ "Voice 1: 'That was your best pitch, easily. Brett might have figured it out.'",
+ "Voice 2: 'No, he did not.'",
+ "Voice 1: 'Okay, yeah.'",
+ "Video presents an entertaining, interactive baseball practice session with humorous commentary.",
+ "Theme of friendly competition is reinforced."
+ ]
+ },
+ {
+ "video": "video/136.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video features an exciting cricket match unfolding under bright stadium lights. The scene opens with a green-clad player lining up to bowl, a white cricket ball clutched in his hand. The well-maintained field is surrounded by advertisements for brands like 'Sooper Soft Bakes' and 'KENWOOD'. As he runs up to deliver the ball, the scoreboard displays '0 WIN' with '23 RUNS'. In synchrony with each action, an announcer's voice narrates, \"Needs to start the over with a boundary. Could be out. It'll be a great catch if it is. Oh, it's not,\" followed by crowd cheers, and the player in a green uniform kneels on the grass.\n\nSwitching to another part of the match, we see a diving fielder, also in green, trying to stop the cricket ball. Sponsors like 'Lifebuoy', 'Pepsi', 'Brighto', and 'HOMAGE' decorate the sidelines. The scoreboard remains unchanged from the previous clip. The players prepare for another critical moment as the scoreboard now reads '0 WIN' and '21 RUNS'. The batsman, in a maroon outfit, lofts the ball high into the sky again. The man's voice compliments the hit, \"There's one. There's one. That sounded so good and has gone many a mile,\" eliciting more cheers from the crowd.\n\nThe tension builds with a sequence displaying a bowler delivering a swift ball, which a batsman in maroon and yellow misses, allowing the wicketkeeper to collect it. Advertisements for 'Sooper Soft Bakes', 'KENWOOD', and 'HOMAGE' remain prominent. The scoreboard flashes 'OSHANE 0 0' and 'SHEPHERD 30 16'. The batsman connects on the next delivery, sending the ball soaring. \"Goes yorker straight. Spin away for four,\" continues the announcer, as the fielders chase towards the boundary.\n\nThe excitement stays high with a green-uniformed player darting to the boundary to retrieve the white ball. Ads for 'Sooper Soft Bakes', 'Super Asia', and others blanket the background. 
The batsman strikes the next ball; the camera perfectly captures its flight to the boundary, setting the scoreboard to 'OSHANE 0 0' and 'SHEPHERD 34 18'. The voice-over remarks on the effort, \"where's that gone into the gap oh good effort. That's a thigh. That is a match winning thigh,\" with enthusiastic crowd applause.\n\nNext, the camera zooms in on a white cricket ball bowled with precision by another green-clad player. The batsman connects, and the ball races across the lush green field towards the boundary line, closely pursued by a fielder. Ads by 'Sooper Soft Bakes' and 'KENWOOD' linger in the surroundings, and the scoreboard reads '0 WIN' and '10 RUNS'. \"Asif Ali in the way of it again that's why I think Sheppard is the more accomplished of the two if I compare him with Odin Smith,\" resonates the commentator's voice, rallying the crowd.\n\nIn the final sequence, the green-shirted bowler pitches a ball that the batsman misses, collected by an alert wicketkeeper against a backdrop of 'Sooper Soft Bakes' and 'KENWOOD' advertisements. The scoreboard now captures '0 WIN' and '10 RUNS'. The focus shifts to the batsman fine-tuning his gear before returning to celebrate the bowler marked '97, RAUF'. The wrap-up shows another player in green, representing 'Pepsi' and 'AVISTAN', standing guard on the field. \"Bolden, last ball. O'Shane Thomas cleaned up like he was last night,\" signals the thrilling commentary over the cheering crowd, concluding the video on a powerful note."
+ }
+ ],
+ "events": [
+ "Scene opens with a green-clad player holding a white cricket ball.",
+ "Advertisements for 'Sooper Soft Bakes' and 'KENWOOD' are visible.",
+ "The player runs up to deliver the ball.",
+ "Scoreboard displays '0 WIN' and '23 RUNS'.",
+ "Announcer says, 'Needs to start the over with a boundary. Could be out. It'll be a great catch if it is. Oh, it's not.'",
+ "Crowd cheers.",
+ "Green-uniformed player kneels on the grass.",
+ "Switch to another part of the match.",
+ "Diving fielder in green trying to stop the cricket ball.",
+ "Advertisements for 'Lifebuoy', 'Pepsi', 'Brighto', and 'HOMAGE' are visible.",
+ "Scoreboard reads '0 WIN' and '21 RUNS'.",
+ "Maroon-clad batsman lofts the ball high.",
+ "Announcer says, 'There's one. There's one. That sounded so good and has gone many a mile.'",
+ "Crowd cheers.",
+ "Sequence of a bowler delivering a swift ball.",
+ "Batsman in maroon and yellow misses, wicketkeeper collects.",
+ "Advertisements for 'Sooper Soft Bakes', 'KENWOOD', and 'HOMAGE' are visible.",
+ "Scoreboard shows 'OSHANE 0 0' and 'SHEPHERD 30 16'.",
+ "Batsman connects on the next delivery, sending the ball soaring.",
+ "Announcer says, 'Goes yorker straight. Spin away for four.'",
+ "Fielders chase the ball towards the boundary.",
+ "Green-uniformed player darts to the boundary to retrieve the ball.",
+ "Advertisements for 'Sooper Soft Bakes' and 'Super Asia' are visible.",
+ "Batsman strikes the next ball.",
+ "Camera captures the ball's flight to the boundary.",
+ "Scoreboard updates to 'OSHANE 0 0' and 'SHEPHERD 34 18'.",
+ "Announcer says, 'where's that gone into the gap oh good effort. That's a thigh. That is a match winning thigh.'",
+ "Crowd applauds.",
+ "Camera zooms in on a white cricket ball bowled with precision.",
+ "Batsman connects, ball races across the field.",
+ "Fielder pursues the ball.",
+ "Advertisements for 'Sooper Soft Bakes' and 'KENWOOD' are visible.",
+ "Scoreboard reads '0 WIN' and '10 RUNS'.",
+ "Announcer says, 'Asif Ali in the way of it again that's why I think Sheppard is the more accomplished of the two if I compare him with Odin Smith.'",
+ "Crowd rallies.",
+ "Final sequence: Green-shirted bowler pitches a ball that the batsman misses.",
+ "Wicketkeeper collects the ball.",
+ "Advertisements for 'Sooper Soft Bakes' and 'KENWOOD' are visible.",
+ "Scoreboard reads '0 WIN' and '10 RUNS'.",
+ "Focus shifts to a batsman fine-tuning his gear.",
+ "Batsman celebrates the bowler marked '97, RAUF'.",
+ "Player in green, representing 'Pepsi' and 'AVISTAN', stands guard on the field.",
+ "Announcer says, 'Bolden, last ball. O'Shane Thomas cleaned up like he was last night.'",
+ "Crowd cheers."
+ ]
+ },
+ {
+ "video": "video/137.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins in a pet store setting, featuring a white fluffy dog standing on a platform. Surrounding the dog are various pet products on shelves against a pink wall, and an illuminated sign reads 'JOY PETS' with the tagline 'Gotta make 'em smile.' The texts 'Build your brand online' and 'with a beautiful logo' appear on the screen. Transitioning smoothly, the scene changes to a green background with a search bar, where 'Wix Logo Maker' is typed. The next frame reveals a webpage with 'Start designing your logo,' showing fields for entering the name and tagline of a business. Here, 'Joy Pets' and 'Gotta make 'em smile' are entered, and a 'Let's Go' button is clicked, starting the logo design process.\n\nThroughout this sequence, a woman's voice guides the steps: \"Build your brand online with a beautiful logo. Just go to Wix Logo Maker. To start, type the business name and add the tagline.\" Music plays softly in the background, and the sounds of a computer keyboard being typed on enhance the instructional feel.\n\nAfter clicking the blue 'Let's Go' button, the screen prompts 'What's your logo for?' with 'Pet Shop' typed in and 'Next' clicked. Following that, the screen asks 'How should your logo look & feel?' Various options like 'Dynamic,' 'Fun,' 'Playful,' and more appear; 'Fun,' 'Playful,' and 'Fresh' are selected, and the 'Next' button is clicked. Then, the screen displays 'Which of these logos do you like better?' with two designs side by side. Across three pairs of designs, the favored option is chosen from each pair.\n\nDuring this segment, the woman speaks: \"It's for a pet supply shop, so it's gotta be fun, playful, and fresh.\" Background music continues to play, setting a pleasant tone.\n\nContinuing the process, the video again displays 'Which of these logos do you like better?', showing two logos with different designs. The cursor selects the right, the left, and the right logo in order. 
A new screen asks 'Where do you want to use your logo?' with options like 'Website,' 'Business Cards,' and more. The cursor selects 'Website,' 'Social Media,' and 'Other,' then clicks 'Next.'\n\nThe woman's voice narrates: \"Then, help Wix logo maker understand your style. Choose where you'll use it. Here we go.\" Gentle music persists in the background, keeping the viewer engaged.\n\nThe video then displays a white screen with colorful icons and the text 'Creating choices...' transitioning quickly to 'Done!' A webpage titled 'Choose Your New Logo' appears, offering various logo designs in a grid format, all customizable. The cursor rolls over and selects a logo, then clicks 'Edit,' moving to a design interface where different elements like icons, text, shapes, and backgrounds can be managed.\n\nIn this part, the woman remarks: \"Wow, so many great options. Let's go with this one. Here is where you can customize the logo, like changing the icon.\" Background music continues to add to the enjoyable atmosphere.\n\nNext, the design interface on the Wix platform appears, featuring the 'Joy Pets' logo with a stylized dog face and the tagline 'Gotta make 'em smile.' Various animal icon options are listed, and 'smile face' is selected. Various font options for 'Pet Shop' are listed, and 'Sukhumvit Set' is selected. The scene then transitions to a screen titled 'Print Your Logo Everywhere,' showcasing a t-shirt with the logo and the text 'T-shirts from US$19.00,' followed by business cards and mugs with prices, and the backward arrow button being clicked.\n\nIn this section, the woman says: \"And the font. You can also print the new logo on T-shirts, business cards, even mugs. It's gonna look perfect on your Wix site.\" Music supports the ongoing narrative seamlessly.\n\nFinally, a laptop on a wooden desk displays the 'Joy Pets' website, divided into sections with details about pet products. 
The left section features the logo and text 'Food & Goods For All Pets' with a 'Shop Now' button. The right section displays a dog next to a food bowl labeled 'Food Bowl $19.95 Quick View.' The background elements tie into the brand's aesthetics. The video concludes with a red screen displaying the 'Wix' logo and the text 'Create Your Professional Logo Today.'\n\nThe woman's closing message echoes: \"Go to Wix.com Logo Maker and create your professional logo today.\" The consistent background music wraps up the informative, step-by-step journey through creating and applying a professional logo with Wix."
+ }
+ ],
+ "events": [
+ "Video begins in a pet store setting with a white fluffy dog standing on a platform.",
+ "Various pet products are on shelves against a pink wall.",
+ "An illuminated sign reads 'JOY PETS' with the tagline 'Gotta make 'em smile.'",
+ "Texts 'Build your brand online' and 'with a beautiful logo' appear on the screen.",
+ "Scene transitions to a green background with a search bar.",
+ "'Wix Logo Maker' is typed in the search bar.",
+ "Next frame reveals a webpage with 'Start designing your logo,' showing fields for entering the name and tagline of a business.",
+ "'Joy Pets' and 'Gotta make 'em smile' are entered.",
+ "'Let's Go' button is clicked, starting the logo design process.",
+ "A woman's voice guides the steps: 'Build your brand online with a beautiful logo. Just go to Wix Logo Maker. To start, type the business name and add the tagline.'",
+ "Soft background music plays with sounds of a computer keyboard being typed on.",
+ "After clicking the blue 'Let's Go' button, screen prompts 'What's your logo for?'",
+ "'Pet Shop' is typed in and 'Next' is clicked.",
+ "Screen asks 'How should your logo look & feel?'",
+ "Options like 'Dynamic,' 'Fun,' 'Playful,' and more appear.",
+ "'Fun,' 'Playful,' and 'Fresh' are selected, then 'Next' button is clicked.",
+ "Screen displays 'Which of these logos do you like better?' with two designs side by side.",
+ "Across three pairs of designs, the favored option is chosen from each pair.",
+ "Woman's voice: 'It's for a pet supply shop, so it's gotta be fun, playful, and fresh.'",
+ "Background music continues to play.",
+ "Screen again shows 'Which of these logos do you like better?' with two logos of different designs.",
+ "Cursor selects the right, the left, and the right logo in order.",
+ "New screen asks 'Where do you want to use your logo?'",
+ "Options like 'Website,' 'Business Cards,' and more are shown.",
+ "Cursor chooses 'Website,' 'Social Media,' and 'Other,' then clicks 'Next.'",
+ "Woman's voice: 'Then, help Wix logo maker understand your style. Choose where you'll use it. Here we go.'",
+ "Gentle background music persists.",
+ "White screen with colorful icons and text 'Creating choices...' transitioning quickly to 'Done!'",
+ "Webpage titled 'Choose Your New Logo' appears with various logo designs in a grid format, all customizable.",
+ "Cursor rolls and selects a logo, then clicks 'Edit,' moving to a design interface.",
+ "Design interface shows different elements like icons, text, shapes, and backgrounds can be managed.",
+ "Woman remarks: 'Wow, so many great options. Let's go with this one. Here is where you can customize the logo, like changing the icon.'",
+ "Background music continues.",
+ "Wix platform design interface shows the 'Joy Pets' logo with a stylized dog face and tagline 'Gotta make 'em smile.'",
+ "Various animal icon options are listed, 'smile face' is selected.",
+ "Various font options for 'Pet Shop' are listed, 'Sukhumvit Set' is selected.",
+ "Scene transitions to a screen titled 'Print Your Logo Everywhere,' showcasing a t-shirt with the logo and text 'T-shirts from US$19.00.'",
+ "Screen shows business cards and mugs with prices, along with the backwards arrow button being clicked.",
+ "Woman says: 'And the font. You can also print the new logo on T-shirts, business cards, even mugs. It's gonna look perfect on your Wix site.'",
+ "Background music supports the ongoing narrative.",
+ "Laptop on a wooden desk displays the 'Joy Pets' website.",
+ "Website divided into sections with details about pet products.",
+ "Left section features the logo and text 'Food & Goods For All Pets' with a 'Shop Now' button.",
+ "Right section shows a dog next to a food bowl labeled 'Food Bowl $19.95 Quick View.'",
+ "Background elements tie into the brand's aesthetics.",
+ "Video concludes with a red screen displaying the 'Wix' logo and text 'Create Your Professional Logo Today.'",
+ "Woman's closing message: 'Go to Wix.com Logo Maker and create your professional logo today.'",
+ "Background music wraps up the video."
+ ]
+ },
+ {
+ "video": "video/138.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a talk show, The Tonight Show Starring Jimmy Fallon, set in a studio with a cityscape backdrop. The host, Jimmy Fallon, dressed in a dark suit with a white shirt and blue tie, holds a smartphone and is showing it to the audience. Positioned at a desk with a microphone in front of him, he seems engaged in a humorous or surprising moment as indicated by his expressive gestures and body language. The guest, seated across from the host, wears a plaid suit with a white shirt and a black tie. He has a mug on the desk before him. The text \"TOP VIEWED TV\" is visible in the top right corner, indicating the channel of the clip. The video is shown in two parts, upper and lower.\n\nIn the audio, a man on the phone says, \"It's crazy you were just facetiming me. Kevin Hart was facetiming me. He just continues to, like, text me. He just facetimed me like two times. I'm like, dude...,\" while background crowd chatter punctuates the clip and Jimmy Fallon laughs.\n\nThe scene seamlessly continues in the same talk show studio. Jimmy Fallon laughs, showing the smartphone to the audience with a facetime call on display. Still dressed in his suit, his gestures and body language indicate humor. The guest, in his plaid suit, observes the host's actions, looking serious. The text \"TOP VIEWED TV\" remains visible.\n\nThe background audio features people laughing followed by Jimmy Fallon and the man on the facetime saying, \"Stand. End it. You know what's crazy?\" \"Yeah.\" \"Look, about Kevin, and I know you're shooting a show, and the audience will love this, too, because they probably feel the same way.\"\n\nThe scenario unfolds further with Jimmy engaging seriously with the facetime call. The guest continues to observe attentively. \n\nThe video zooms in on the phone, and the audio reflects a narrative with: \"Kevin, to me, is like a little gremlin that... 
I just want to take him and throw him into oncoming traffic...\" Laughter from the crowd underscores the humor in the dialogue. And the video zooms out, showing that the guest still looks serious.\n\nThe video zooms in on the phone again. The dialogue advances with, \"Yeah, yeah, it, it's good to talk to you, buddy, and we love you. And thank you.\" \"I love you back.\" \"Just checking in on you. Just want to make sure everything's cool, man. Thank you so much...,\" with the sounds of people laughing and talking capturing the convivial atmosphere.\n\nThe host finishes the video call, prominently showing his hand pressing the red 'X' button on the smartphone. Maintaining the same setting and attire, he interacts with his guest, possibly looking to transition to the next part of the show.\n\nThe audio punctuates this moment with, \"I'll talk to you later. Bye, buddy. Bye-bye...,\" followed by applause and cheers from the audience, moving the scene along with a sense of closure to the call.\n\nThe final scene captures Jimmy putting his smartphone into his suit pocket, signaling a transition from the video call segment to another part of the show. The attentive guest prepares to engage in the next segment.\n\nJimmy looks quite guilty and hesitant and talks to the guest, who looks at him, surprised, and then mad. The audio finale concludes with, \"Uh... He wasn't there. It went straight to voicemail.\" \"I heard the whole con(versation),\" accompanied by laughter and applause from the crowd, maintaining the lively tone of the show.\n\nThroughout the video, the discussions, reactions, and interactions on the smartphone create a humorous and engaging atmosphere that keeps the audience entertained."
+ }
+ ],
+ "events": [
+ "Talk show 'The Tonight Show Starring Jimmy Fallon' begins in a studio with a cityscape backdrop.",
+ "Jimmy Fallon, the host, is dressed in a dark suit with a white shirt and blue tie.",
+ "Jimmy Fallon holds a smartphone and shows it to the audience.",
+ "Jimmy is positioned at a desk with a microphone in front of him.",
+ "Jimmy appears engaged in a humorous or surprising moment as indicated by his gestures and body language.",
+ "The guest, seated across from Jimmy, wears a plaid suit with a white shirt and a black tie.",
+ "The guest has a mug on the desk before him.",
+ "The text 'TOP VIEWED TV' is visible in the top right corner, indicating the channel of the clip.",
+ "Video shown in two parts, upper and lower.",
+ "Audio begins with a man on the phone saying: 'It's crazy you were just facetiming me. Kevin Hart was facetiming me. He just continues to, like, text me. He just facetimed me like two times. I'm like, dude...'",
+ "Background crowd chatter punctuates the clip.",
+ "Jimmy Fallon laughs in response.",
+ "Scene continues seamlessly in the same talk show studio.",
+ "Jimmy Fallon laughs and shows the smartphone to the audience with a facetime call on display.",
+ "Jimmy's gestures and body language indicate humor.",
+ "Guest in plaid suit observes the host's actions looking serious.",
+ "Text 'TOP VIEWED TV' remains visible.",
+ "Background audio features people laughing.",
+ "Jimmy Fallon and the man on the facetime say: 'Stand. End it. You know what's crazy?' 'Yeah.' 'Look, about Kevin, and I know you're shooting a show, and the audience will love this, too, because they probably feel the same way.'",
+ "Jimmy engages seriously with the facetime call.",
+ "Guest continues to observe attentively.",
+ "Video zooms in on the phone.",
+ "Audio narrative reflects: 'Kevin, to me, is like a little gremlin that... I just want to take him and throw him into oncoming traffic...' Laughter follows from the crowd.",
+ "Video zooms out, showing the guest still looks serious.",
+ "Video zooms in on the phone again.",
+ "Dialogue advances with: 'Yeah, yeah, it, it's good to talk to you, buddy, and we love you. And thank you.' 'I love you back.' 'Just checking in on you. Just want to make sure everything's cool, man. Thank you so much...,' with sounds of people laughing and talking.",
+ "Host finishes the video call showing prominently his hand pressing the red 'X' button on the smartphone.",
+ "Still in the same setting and attire, Jimmy interacts with his guest.",
+ "Audio punctuates the moment with: 'I'll talk to you later. Bye, buddy. Bye-bye...,' followed by applause and cheers from the audience.",
+ "Final scene shows Jimmy putting his smartphone into his suit pocket, signaling a transition to another part of the show.",
+ "Guest prepares to engage in the next segment.",
+ "Jimmy looks quite guilty and hesitant.",
+ "Guest looks at Jimmy, surprised, then mad.",
+ "Audio finale concludes with: 'Uh... He wasn't there. It went straight to voicemail.' 'I heard the whole con(versation),' accompanied by laughter and applause from the crowd."
+ ]
+ },
+ {
+ "video": "video/139.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video features a woman with light brown hair tied back, dressed in a black and red striped top, with a small orange ball resembling Mars shown at the top of the screen. The video unfolds in a dark room with a visible light strip in the background.\n\nThe scene opens with the text \"MARS: 4 Seconds\" at the top, indicating a countdown. \"Wait you're not alone. I also chose Mars. actually we're both safe because Mars is-\" Another woman's voice corresponds with the subtitles, speaking with a calm and assuring tone. As the countdown reaches \"0 Seconds,\" the screen briefly goes black and the audio stops. When the woman in a black and red striped top reappears, the subtitles read \"YOU ARE IN MARS.\" The woman looks surprised, reflecting the audio, as she says, \"Because Mars is what? Wait, I'm already here?\"\n\nAnother woman says, \"Yep, welcome to Mars,\" as the subtitles confirm. Music plays softly in the background, providing a contemplative atmosphere. \"If you have any questions please let me know.\" The screen briefly goes black, and upon returning, the other woman says, \"Amara, there you are.\" The woman in a black and red striped top asks, \"What were you trying to say before we got here?\" The other woman continues, \"I was trying to say that Mars is more suitable for humans. It's more likely we'll survive here than in Jupiter.\" The woman in a black and red striped top pauses, then asks, \"Wait, I wanna ask, if Mars is more suitable for humans, then why did everyone choose Jupiter?\" The scene goes black again momentarily, with the subtitles and audio aligning perfectly with the dialogue. The other woman responds, \"Because researchers and scientists want to experiment with them, they sent them there to see if the planet is actually liveable, which it's not.\" Expressing concern, the woman in a black and red striped top declares, \"Then what should I do? My whole family chose Jupiter! 
I need to help them,\" her worry mirrored in the sound of her voice. A sense of urgency and emotional depth is emphasized both visually and aurally. \"Okay, I'm coming with you. Let's go there now.\" The woman in a black and red striped top continues, \"Okay, you ready?\" raising her arm confidently as if preparing to embark on a journey. The other woman's voice hints at readiness with a sigh, bringing the narrative to a close.\n\nIn essence, the video takes viewers on a visually consistent yet dynamic journey exploring themes of decision-making, exploration, and camaraderie set against the backdrop of interplanetary travel. The conversations drive the narrative, deepening the intrigue and maintaining engagement throughout."
+ }
+ ],
+ "events": [
+ "Woman with light brown hair tied back in a black and red striped top appears",
+ "Small orange ball resembling Mars shown at the top of the screen",
+ "Scene set in a dark room with a visible light strip in the background",
+ "Text 'MARS: 4 Seconds' at the top indicating a countdown",
+ "Other woman's voice says 'Wait you're not alone. I also chose Mars. actually we're both safe because Mars is-'",
+ "Countdown reaches '0 Seconds' and the screen goes black and audio stops",
+ "Woman in black and red striped top reappears, subtitles read 'YOU ARE IN MARS'",
+ "Woman looks surprised, says 'Because Mars is what? Wait, I'm already here?'",
+ "Another woman says 'Yep, welcome to Mars', confirmed by subtitles",
+ "Soft music plays in the background, creating a contemplative atmosphere",
+ "Text 'If you have any questions please let me know' displayed briefly",
+ "Screen goes black briefly",
+ "Another woman says 'Amara, there you are'",
+ "Woman in black and red striped top asks, 'What were you trying to say before we got here?'",
+ "Another woman explains 'I was trying to say that Mars is more suitable for humans. It's more likely we'll survive here than in Jupiter.'",
+ "Woman in black and red striped top pauses and asks 'Wait, I wanna ask, if Mars is more suitable for humans, then why did everyone choose Jupiter?'",
+ "Screen goes black briefly",
+ "Another woman responds 'Because researchers and scientists want to experiment with them, they sent them there to see if the planet is actually liveable, which it's not.'",
+ "Woman in black and red striped top expresses concern, declares 'Then what should I do? My whole family chose Jupiter! I need to help them.'",
+ "Woman in black and red striped top says 'Okay, I'm coming with you. Let's go there now'",
+ "Woman raises her arm confidently as if preparing to embark on a journey, says 'Okay, you ready?'",
+ "Another woman's voice hints at readiness with a sigh"
+ ]
+ },
+ {
+ "video": "video/140.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a shirtless man standing on a wooden deck in a backyard, under a clear blue sky with a few clouds, surrounded by green trees and a wooden fence. He has long hair and is wearing dark shorts. He says: \"A quick tip on getting leaner body composition. Probably never heard this one before. If you don't know, I used to be a little bit more fluffy. Okay, I was a lot more fluffy.\" Meanwhile, corresponding subtitles appear, showing only the key words. Next, an older photo of the same man is shown with the text \"fluff\" and an arrow pointing to his stomach. Throughout the segment, the man continues to gesture and speak, with the background remaining unchanged.\n\nThe scene continues with the man still on the wooden deck. The man explains: \"Now, one thing that I started implementing is crawling for five minutes a day. So what I'd do is I'd go to the park, set a little timer on my phone, and then get down and crawl around on all fours for five minutes straight.\"\n\nIn the next sequence, the man holds a small purple object while using hand gestures to emphasize his points. He holds up five fingers. He says: \"If you've never done this before, it's going to be so hard. You're going to get to a minute and you're going to be like, God dang it.\"\n\nThe video continues uninterrupted with the man giving a thumbs-up gesture, then holding his left arm with his right hand, and subsequently gesturing near his chest. He bends down, placing his hands on the deck, and then crouches as if preparing to crawl. He continues to explain: \"Why did I listen to the strengths I got. If that's the case, just crawl for a minute to two minutes and then accumulate to five minutes total. So this is what you do.\" The background includes the sound of water splashing and gurgling.\n\nIn the following scenes, the man is seen crawling on the wooden deck. 
He moves forward on all fours and then starts rising from the crawling position, eventually standing up and pointing to his abdominal muscles. He ends this segment by squatting next to a black and white dog. Text \"SUBSCRIBE TO STRENGTHSIDE\" appears in red with arrows pointing downwards. His voice continues: \"You get down, hands are here down, and we just crawl around in the grass or the cement or whatever it is. Keep doing that for five minutes. You hop up, look down, and you\u2019re like, whoa six-pack,\" followed by the sound of a door opening and closing, then the man's voice again.\n\nIn the final scenes, the man is squatting next to the dog, gently holding it by the collar. The text \"SUBSCRIBE TO STRENGTHSIDE\" remains on screen. The dog looks down and then to the side as the man adjusts his grip. In the last frames, he returns to a crawling position with the text still visible, then to a crouched position while gesturing. He interacts with the dog while saying: \"If that doesn\u2019t work, you can just get yourself a rambunctious dog. Keeps you moving. Hey, come here. Subscribe to Strengthside.\" The video concludes with the sound of the dog barking and the man speaking again. The background elements\u2014the clear blue sky, green trees, and wooden fence\u2014remain consistent throughout the video. The red text \"SUBSCRIBE TO STRENGTHSIDE\" is still visible on the screen."
+ }
+ ],
+ "events": [
+ "Shirtless man standing on a wooden deck in a backyard under a clear blue sky with a few clouds, surrounded by green trees and a wooden fence",
+ "Man has long hair and is wearing dark shorts.",
+ "Man says: 'A quick tip on getting leaner body composition. Probably never heard this one before. If you don't know, I used to be a little bit more fluffy. Okay, I was a lot more fluffy.'",
+ "Corresponding subtitles appear showing only key words.",
+ "Older photo of the same man is shown with the text 'fluff' and an arrow pointing to his stomach.",
+ "Man gestures and speaks with the background remaining unchanged.",
+ "Man explains: 'Now, one thing that I started implementing is crawling for five minutes a day. So what I'd do is I'd go to the park, set a little timer on my phone, and then get down and crawl around on all fours for five minutes straight.'",
+ "Man holds a small purple object while using hand gestures to emphasize his points.",
+ "Man holds up five fingers.",
+ "Man says: 'If you've never done this before, it's going to be so hard. You're going to get to a minute and you're going to be like, God dang it.'",
+ "Man gives a thumbs-up gesture.",
+ "Man holds his left arm with his right hand and then gestures near his chest.",
+ "Man bends down placing his hands on the deck and then crouches as if preparing to crawl.",
+ "Man continues to explain: 'Why did I listen to the strengths I got. If that's the case, just crawl for a minute to two minutes and then accumulate to five minutes total. So this is what you do.'",
+ "Background sound of water splashing and gurgling.",
+ "Man is seen crawling on the wooden deck.",
+ "Man moves forward on all fours and then starts rising from the crawling position.",
+ "Man eventually stands up and points to his abdominal muscles.",
+ "Man ends this segment by squatting next to a black and white dog.",
+ "Text 'SUBSCRIBE TO STRENGTHSIDE' appears in red with arrows pointing downwards.",
+ "Man's voice continues: 'You get down, hands are here down, and we just crawl around in the grass or the cement or whatever it is. Keep doing that for five minutes. You hop up, look down, and you're like, whoa six-pack.'",
+ "Sound of a door opening and closing, followed by the man's voice again.",
+ "Man is squatting next to the dog gently holding it by the collar.",
+ "Text 'SUBSCRIBE TO STRENGTHSIDE' remains on screen.",
+ "Dog looks down and then to the side as the man adjusts his grip.",
+ "Man returns to a crawling position with the text still visible.",
+ "Man returns to a crouched position while gesturing.",
+ "Man interacts with the dog while saying: 'If that doesn't work, you can just get yourself a rambunctious dog. Keeps you moving. Hey, come here. Subscribe to Strengthside.'",
+ "Video concludes with the sound of the dog barking and the man speaking again.",
+ "Background elements\u2014the clear blue sky, green trees, and wooden fence\u2014remain consistent throughout the video.",
+ "Red text 'SUBSCRIBE TO STRENGTHSIDE' is still visible on the screen."
+ ]
+ },
+ {
+ "video": "video/141.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a man standing in a kitchen, holding a bundle of wheat stalks. He is dressed in a dark long-sleeve shirt and jeans. On the wooden countertop before him are two loaves of bread in glass baking dishes. The background features a vibrant painting of a landscape with blue skies and clouds. He gestures toward the bread and wheat as if explaining a process or recipe. The audio complements the scene with a man asking, \"Is it feasible to grow enough wheat at home to make a steady supply of bread? Well, we did it. Here's how. We started with a landrace wheat variety called Red Fife, and we devoted one of our entire boulevard spaces to wheat production.\" The video then transitions to a close-up of a package labelled 'Greenboy Organic Red Fife Kernels,' encased in a clear pack with a light green label. It concludes with the same man, outdoors by a chain-link fence in a garden, wearing a blue shirt, brown pants, gloves, and a wide-brimmed hat, seemingly preparing to plant or harvest.\n\nNext, we see a garden area beside a chain-link fence, where the soil is freshly tilled and arranged into rows ready for planting. Houses and a parked car are visible in the background. The same man in a blue shirt and wide-brimmed hat is shown using a manual seeder to plant seeds in these rows, followed by watering them with a hose. The audio from a man's voice states, \"This 450 square foot plot was seeded quickly with our Earthway seeder on May 11th, and after watering in the beds once initially, the crop grew really well with minimal maintenance and irrigation throughout the summer.\" The sounds of water splashing and crops rustling enhance the garden\u2019s tranquil atmosphere.\n\nThe narrative progresses showing the garden by the chain-link fence again, this time lush with green plants growing in rows, well-maintained next to an adjacent sidewalk. A later scene shows the plants taller and golden, ready for harvest. 
The same man in a light-coloured long-sleeve shirt, brown pants, gloves, and a wide-brimmed hat is shown harvesting the wheat by hand and gathering it into bundles. The final part of this clip highlights the same man transporting the harvested wheat on a hand truck along the sidewalk, with residential elements like houses, a parked car, and trees in the background. The continuing voice adds, \"On September 8th the entire plot was cut with a sickle then bundled into sheaths and hauled back to our home for threshing.\"\n\nThe video transitions to an indoor workshop where the same man works on a green machine, assembling or repairing it by attaching a black component. This machine is later seen in action where the same man feeds wheat stalks into it, functioning as a thresher separating kernels from the stalks. The audio aligns: \"I didn't just want to randomly beat this grain with a flail, so I built a small-scale threshing machine. It knocks the grain from the heads and separates the kernels from the chaff.\" The whirring of the threshing machine underscores this process. The clip concludes with a close-up of the processed wheat kernels being collected in a container, while the same man manages additional wheat stalks in the backdrop of houses and trees.\n\nThe continuity extends with a close-up view of wheat kernels being poured from a wooden container into another container outdoors. In the background, a woman wearing a purple shirt and black pants handles wheat stalks. The scene shifts to an indoor pantry where the same man dressed in a dark long-sleeve shirt and jeans places a large white bucket on a shelf. The video then focuses on a white grain mill on a wooden countertop, filled with wheat kernels. Lastly, we see the same man slicing a loaf of bread on the same wooden countertop, with the vibrant landscape painting behind him. 
As he slices the bread, the audio describes, \"After threshing, we were left with pounds of grain from our plot, which is now stored in our walk-in cooler. So whenever we're out of flour, we grind a small batch with our tiny flour mill and put it to use for bread like this.\"\n\nConcluding the video, the same person is seen again in the kitchen wearing a dark long-sleeve shirt and jeans, positioned next to the wooden countertop with a loaf of sliced bread on a cutting board. He gestures towards the bread, explaining or demonstrating its importance or the process behind it. This final scene alternates between this kitchen shot and one where the same man in a black shirt and brown pants stands next to the green thresher machine, explaining its function. The audio rounds off with, \"If you've also grown your own grain by hand, please share your experience in the comments. And if you'd like to learn more about our threshing machine, please check out my long video demonstration, I'll leave a link to that in the description. See you guys next one.\""
+ }
+ ],
+ "events": [
+ "Video starts with a man standing in a kitchen holding a bundle of wheat stalks.",
+ "Man is dressed in a dark long-sleeve shirt and jeans.",
+ "On the countertop are two loaves of bread in glass baking dishes.",
+ "Background features a vibrant painting of a landscape with blue skies and clouds.",
+ "Man gestures toward the bread and wheat as if explaining a process or recipe.",
+ "Audio from a man asks, 'Is it feasible to grow enough wheat at home to make a steady supply of bread? Well, we did it. Here's how.'",
+ "Audio continues explaining about starting with 'red fife' wheat and using boulevard space for production.",
+ "Video transitions to a close-up of a package labeled 'Greenboy Organic Red Fife Kernels' in a clear pack with a light green label.",
+ "Video concludes with the man outdoors by a chain-link fence in a garden, wearing a blue shirt, brown pants, gloves, and a wide-brimmed hat.",
+ "Man appears to be preparing to plant or harvest.",
+ "Next scene shows a garden area beside a chain-link fence with freshly tilled soil arranged into rows.",
+ "Houses and a parked car are visible in the background.",
+ "Man in a blue shirt and wide-brimmed hat uses a manual seeder to plant seeds in rows.",
+ "Man waters the planted seeds with a hose.",
+ "Audio states, 'This 450 square foot plot was seeded quickly with our earthway seeder on May 11th.'",
+ "Audio continues, explaining minimal maintenance and irrigation throughout the summer.",
+ "Sounds of water splashing and crops rustling enhance the garden\u2019s atmosphere.",
+ "Garden by the chain-link fence is now lush with green plants growing in rows.",
+ "Plants are well-maintained next to an adjacent sidewalk.",
+ "Later scene shows plants taller and golden, ready for harvest.",
+ "Man in a light-coloured long-sleeve shirt, brown pants, gloves, and a wide-brimmed hat is shown harvesting wheat by hand.",
+ "Man gathers wheat into bundles.",
+ "Final part shows the man transporting the harvested wheat on a hand truck along the sidewalk.",
+ "Background features houses, a parked car, and trees.",
+ "Audio adds, 'On September 8th the entire plot was cut with a sickle then bundled into sheaths and hauled back to our home for threshing.'",
+ "Video transitions to an indoor workshop.",
+ "Man works on a green machine, assembling or repairing it by attaching a black component.",
+ "Machine is later in action with the man feeding wheat stalks into it.",
+ "Machine functions as a thresher, separating kernels from the stalks.",
+ "Audio states, 'I didn't just want to randomly beat this grain with a flail, so I built a small-scale threshing machine.'",
+ "Audio continues explaining the threshing process.",
+ "Whirring of the threshing machine enhances the process.",
+ "Clip concludes with a close-up of processed wheat kernels being collected in a container.",
+ "Same man manages additional wheat stalks in the backdrop of houses and trees.",
+ "Close-up view of wheat kernels being poured from a wooden container into another container outdoors.",
+ "In the background, a woman wearing a purple shirt and black pants handles wheat stalks.",
+ "Scene shifts to an indoor pantry.",
+ "Man dressed in a dark long-sleeve shirt and jeans places a large white bucket on a shelf.",
+ "Video focuses on a white grain mill on a wooden countertop filled with wheat kernels.",
+ "Man is shown slicing a loaf of bread on the wooden countertop with the vibrant landscape painting behind them.",
+ "Audio describes the process of threshing and storing grain, milling it into flour, and using it for bread.",
+ "Concluding scene in the kitchen with man in a dark long-sleeve shirt and jeans next to the countertop with sliced bread on a cutting board.",
+ "Man gestures towards the bread, explaining or demonstrating its importance or the process behind it.",
+ "Final scene alternates between the kitchen shot and the man in black shirt and brown pants by the green thresher machine.",
+ "Audio concludes, 'If you've also grown your own grain by hand, please share your experience in the comments.'",
+ "Audio continues, inviting viewers to check out a longer video demonstration of the threshing machine.",
+ "Audio ends with, 'See you guys next one.'"
+ ]
+ },
+ {
+ "video": "video/142.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man in a dark blue suit and a blue and white striped tie standing in front of a large swimming pool, surrounded by several white lounge chairs and umbrellas, suggesting a resort or hotel setting. The \u2018abc NEWS\u2019 logo is visible in the bottom right corner of the frame. The man is speaking, introducing some remarkable footage to the audience. The audio features the man saying, \u201cThese are remarkable shots. For the first time ever on video, the giant squid in its natural habitat.\u201d The scene then transitions to black and white underwater footage showing a giant squid with long, white tentacles. The squid is seen moving and wrapping its tentacles around an object, while the \u2018Discovery\u2019 logo appears in the bottom right corner. Underwater sounds of the squid swimming can be heard.\n\nThe following underwater footage remains black and white, showing the squid\u2019s long white tentacles tightly wrapping around an underwater piece of equipment, with both \u2018Discovery\u2019 and \u2018abc NEWS\u2019 logos still visible. The audio continues with the narration, \u201cThis mysterious creature can grow up to 50 feet long with eyes the size of dinner plates. A Japanese film crew tried hundreds of times at depths of 3,000 feet or more before finally capturing these remarkable images.\u201d The sound of a helicopter flying overhead is heard, adding a sense of adventure. The scene then transitions back to color, closely focusing on the squid\u2019s head and tentacles, now brown and white, against a dark blue background indicative of deep-sea conditions. The clip concludes by focusing on the squid\u2019s eye and the intricate details of its tentacles.\n\nThe next clip returns to the close-up of the giant squid\u2019s brown and white tentacles against the dark blue deep-sea backdrop, with the \u2018Discovery\u2019 and \u2018abc NEWS\u2019 logos still visible. 
The scene soon reverts to black and white underwater footage, again displaying the squid\u2019s long white tentacles wrapping around underwater equipment. The logos remain unchanged. This visual is accompanied by the narration, \u201cAll part of a Discovery Channel special later this month. Of course, after yesterday\u2019s news of the baby giant squid in this one.\u201d Concurrently, the sound of a helicopter can be heard, enhancing the sense of exploration. The final scene again cuts back to the man in the dark blue suit, standing in front of the swimming pool with the familiar resort backdrop, continuing the news report.\n\nWe then see the man in the dark blue suit again, with the consistent resort backdrop featuring the large swimming pool and white lounge chairs. The \u2018abc NEWS\u2019 logo remains visible. The narrator gestures with his hand while speaking, saying, \u201cI will be rethinking my morning dip in the Miami Beach waters here after this newscast. Careful where you go diving, I think.\u201d Laughter from a woman can be heard, after which the narrator continues speaking. The scene broadens to reveal a modern studio setting with three people seated at a desk, watching the individual on the large screen. The studio is decorated in blue and yellow, with the \u2018Good Morning America\u2019 logo present on the desk.\n\nFinally, the clip features a wider shot of the same studio, with the three individuals at the desk continuing to watch the person on the large screen. The studio design and the \u2018Good Morning America\u2019 logo remain consistent. The individuals seated at the desk are conversing\u2014the woman on the right is wearing a red dress, the woman on the left is in a blue top, and the woman in the middle is in a gray suit. The background shows a cityscape with large windows offering views of pedestrians and traffic, maintaining the energy of a bustling city. 
The active dialogue includes a remark, \u201cYou know, Josh, you have a squid beat this week. As we send it back to you guys in New York,\u201d followed by sounds of conversation and laughter, bringing a lively conclusion to the video. The \u2018abc NEWS\u2019 logo remains visible, tying together the entire segment."
+ }
+ ],
+ "events": [
+ "The video begins with a man in a dark blue suit and a blue and white striped tie standing in front of a large swimming pool.",
+ "Several white lounge chairs and umbrellas are visible, suggesting a resort or hotel setting.",
+ "The 'abc NEWS' logo is visible in the bottom right corner of the frame.",
+ "The man is speaking, introducing some remarkable footage to the audience.",
+ "Audio features the man saying, 'These are remarkable shots. For the first time ever on video, the giant squid in its natural habitat.'",
+ "Scene transitions to black and white underwater footage showing a giant squid with long, white tentacles.",
+ "The squid is seen moving and wrapping its tentacles around an object.",
+ "The 'Discovery' logo appears in the bottom right corner.",
+ "Underwater sounds of the squid swimming can be heard.",
+ "Footage shows the squid\u2019s long white tentacles tightly wrapping around an underwater piece of equipment.",
+ "Both 'Discovery' and 'abc NEWS' logos are visible.",
+ "Narration continues, 'This mysterious creature can grow up to 50 feet long with eyes the size of dinner plates. A Japanese film crew tried hundreds of times at depths of 3,000 feet or more before finally capturing these remarkable images.'",
+ "Sound of a helicopter flying overhead is heard, adding a sense of adventure.",
+ "Scene transitions to color, focusing on the squid\u2019s head and tentacles, now brown and white, against a dark blue background.",
+ "Clip concludes by focusing on the squid\u2019s eye and intricate details of its tentacles.",
+ "Next clip returns to close-up of the giant squid\u2019s brown and white tentacles against the dark blue deep-sea backdrop.",
+ "Both 'Discovery' and 'abc NEWS' logos are still visible.",
+ "Scene reverts to black and white underwater footage displaying the squid\u2019s long white tentacles wrapping around underwater equipment.",
+ "Logos remain unchanged.",
+ "Narration continues, 'All part of a Discovery Channel special later this month. Of course, after yesterday\u2019s news of the baby giant squid in this one.'",
+ "Sound of a helicopter can be heard, enhancing the sense of exploration.",
+ "Final scene cuts back to the man in the dark blue suit, standing in front of the swimming pool with the resort backdrop.",
+ "The man continues the news report.",
+ "Man gestures with his hand while speaking, saying, 'I will be rethinking my morning dip in the Miami Beach waters here after this newscast. Careful where you go diving, I think.'",
+ "Laughter from a woman is heard.",
+ "Narrator continues speaking.",
+ "Scene broadens to reveal a modern studio setting with three people seated at a desk.",
+ "They are watching the individual on the large screen.",
+ "The studio is decorated in blue and yellow, with the 'Good Morning America' logo present on the desk.",
+ "Clip features a wider shot of the same studio, with the three individuals at the desk continuing to watch the person on the large screen.",
+ "Studio design and 'Good Morning America' logo remain consistent.",
+ "Individuals seated at the desk are conversing.",
+ "Woman on the right is wearing a red dress.",
+ "Woman on the left is in a blue top.",
+ "Woman in the middle is in a gray suit.",
+ "Background shows a cityscape with large windows offering views of pedestrians and traffic.",
+ "Active dialogue includes a remark, 'You know, Josh, you have a squid beat this week. As we send it back to you guys in New York.'",
+ "Sounds of conversation and laughter are heard.",
+ "The 'abc NEWS' logo remains visible, tying together the entire segment."
+ ]
+ },
+ {
+ "video": "video/143.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a black screen that quickly transitions to a white background. Large, bold gray text appears in the center, reading \"Some Things From The.\" The text changes to \"Some Things From The 80s,\" with \"80s\" in a larger font size, and finally to \"Some Things From The 80s Were a little bit...\" with \"Were a little bit...\" in a smaller font size. \n\nThe scene transitions to a vibrant and retro background filled with geometric shapes and bright colors, where a person is wearing a bright pink mesh top and an orange dress, accessorized with colorful bracelets and a headband. Throughout the video, the 'Vat19.com' logo is in the bottom left corner of the frame. A man's voiceover states, \"Some things from the 80s were a little bit... yeah. Thankfully, those days have come and gone, making our lives more portable, more manageable, and more awesome.\" while music plays in the background.\n\nNext, we see a split-screen comparison. On the left side, a man in a gray shirt with a piano key tie holds a large, old-fashioned mobile phone against a blue background with purple diamond shapes. On the right, another man wearing a beige shirt holds a modern smartphone against a plain light blue background.\n\nThe clip transitions to another split-screen comparison. On the left side, a man holds a large, old-fashioned video camera, wearing a green shirt and a beige jacket against a pink background with teal geometric shapes. On the right, another man, wearing a brown and white striped shirt, holds a modern, compact video camera against a plain light purple background.\n\nContinuing with the comparisons, the video transitions to a new scene. A man wearing a red and black jacket and a silver glove holds a regular-sized Rubik's Cube against a colorful, retro background. Another hand enters the frame holding a tiny Rubik's Cube keychain. 
The man in the red jacket takes the keychain and raises it up to watch as the text \"Even more AWESOME!\" appears in bold white letters. The voiceover adds, \"Like with the Rubik's Cube keyring. Even more awesome!\"\n\nThe scene shifts to a close-up of the keychain on a white background, showcasing its side length of 1.125 inches, with the text \"1.125\"\" in orange below it. Two other blurred Rubik's Cubes are visible in the top right corner of the background.\n\nThe visuals then present a larger Rubik's Cube moving from the top right corner to the center, showcasing its side length of 2.125 inches, with the text \"2.125\"\" in orange below it. Similarly, a third Rubik's Cube with a digital clock display moves from the top right corner to the center, indicating a 3-inch side length with the text \"3\"\" in orange. Above, the text \"Cross Promo!\" appears in orange. The next scene shows a person in a gray shirt and blue jeans putting the Rubik's Cube keychain into their trousers pocket, followed by another person in a brown and white striped shirt attaching the same keychain to a brown backpack, then carrying the backpack. The first person's background is plain blue, while the second's is plain red. The voiceover describes the keyring\u2019s practicality, \"The keyring is about half the size of its retro counterpart and about a third the size of the Rubik's Cube alarm clock. Though it's small enough to fit into your pocket or hang from your backpack by the attached keyring.\"\n\nThe visuals then present the man wearing a green shirt and a beige jacket sitting in front of a computer to restore the Rubik's Cube keychain, and the camera zooms in on the action of his hand. The next scene shows a close-up of the restored Rubik's Cube keychain on a white background, displaying green and yellow faces. Bold orange and gray text reads \"Rubik's Cube Keyring.\" The background features a blurred image of a larger Rubik's Cube. 
The voiceover adds, \"The Rubik's Cube keyring is still fully functional.\"\n\nThe video transitions to the last sequence, displaying the Vat19.com logo in large, bold orange and gray letters. Below the logo, the text \"Shop CURIOUSLY AWESOME gifts\" appears in smaller, black and orange letters. A cursor clicks on the 'Vat19' logo, emphasizing the website. The voiceover concludes, \"Rubik's Cube keyring, buy today at Vat19.com. Vat19.com, purveyors of curiously awesome products. Vat19.com,\" followed by upbeat music and a man singing."
+ }
+ ],
+ "events": [
+ "Video begins with a black screen",
+ "Transition to a white background",
+ "Large, bold gray text appears reading 'Some Things From The'",
+ "Text changes to 'Some Things From The 80s' with '80s' in larger font",
+ "Text changes again to 'Some Things From The 80s Were a little bit...' with 'Were a little bit...' in smaller font",
+ "Scene transitions to a vibrant, retro background with geometric shapes and bright colors",
+ "Person is wearing bright pink mesh top, orange dress, colorful bracelets, and headband",
+ "'Vat19.com' logo appears in the bottom left corner of the frame",
+ "Man's voiceover starts, 'Some things from the 80s were a little bit... yeah. Thankfully, those days have come and gone, making our lives more portable, more manageable, and more awesome.'",
+ "Music plays in the background",
+ "Split-screen comparison appears",
+ "Left side: man in a gray shirt with a piano key tie holds a large, old-fashioned mobile phone against a blue background with purple diamonds",
+ "Right side: another man in a beige shirt holds a modern smartphone against a plain light blue background",
+ "Clip transitions to another split-screen comparison",
+ "Left side: man holds a large, old-fashioned video camera while wearing a green shirt and beige jacket against a pink background with teal geometric shapes",
+ "Right side: another man in a brown and white striped shirt holds a modern, compact video camera against a plain light purple background",
+ "New scene with a man in a red and black jacket and silver glove holding a regular-sized Rubik's Cube against a colorful, retro background",
+ "Another hand enters the frame holding a tiny Rubik's Cube keychain",
+ "Man in the red jacket takes the keychain and raises it up",
+ "Text 'Even more AWESOME!' appears in bold white letters",
+ "Voiceover adds, 'Like with the Rubik's Cube keyring. Even more awesome!'",
+ "Scene shifts to a close-up of the keychain on a white background, showcasing its side length as 1.125 inches",
+ "Text '1.125\"' appears in orange below the keychain",
+ "Two other blurred Rubik's cubes are visible in the top right corner",
+ "Larger Rubik's Cube moves from the top right corner to the center, showcasing its side length as 2.125 inches",
+ "Text '2.125\"' appears in orange below the larger Rubik's Cube",
+ "Third Rubik's Cube with a digital clock display moves to the center, indicating a 3-inch side length",
+ "Text '3\"' appears in orange",
+ "Text 'Cross Promo!' appears in orange above",
+ "Next scene: person in a gray shirt and blue jeans puts the Rubik's Cube keychain into their trousers pocket",
+ "Another person in a brown and white striped shirt attaches the same keychain to a brown backpack",
+ "Background for first person is plain blue, for second person is plain red",
+ "Voiceover describes the keyring\u2019s practicality, 'The keyring is about half the size of its retro counterpart and about a third the size of the Rubik\u2019s Cube alarm clock. Though it's small enough to fit into your pocket or hang from your backpack by the attached keyring.'",
+ "Man wearing green shirt and beige jacket sits in front of a computer to restore the Rubik's Cube keychain",
+ "Camera zooms in on the action of his hand",
+ "Close-up of the restored Rubik's Cube keychain on a white background displaying green and yellow faces",
+ "Bold orange and gray text reads 'Rubik's Cube Keyring'",
+ "Background features a blurred image of a larger Rubik's Cube",
+ "Voiceover adds, 'The Rubik\u2019s Cube keyring is still fully functional.'",
+ "Last sequence: Vat19.com logo appears in large, bold orange and gray letters",
+ "Below the Vat19.com logo, the text 'Shop CURIOUSLY AWESOME gifts' appears in smaller black and orange letters",
+ "A cursor clicks on the 'Vat19' logo",
+ "Voiceover concludes, 'Rubik's Cube keyring, buy today at Vat19.com. Vat19.com, purveyors of curiously awesome products. Vat19.com,'",
+ "Upbeat music plays",
+ "Man starts singing"
+ ]
+ },
+ {
+ "video": "video/144.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts by showing a man in a blue suit and light blue shirt seated at a desk, with his hands clasped together and looking to his left. The background is dim with illuminated shelves and various objects. The scene then transitions to another man dressed in a dark suit, light blue shirt, and pink tie, also seated and looking to his left. The backdrop stays consistent with illuminated shelves and objects. The view then changes to a split screen with top and bottom panels. The bottom offers a wider shot where both men are facing each other across the desk, while the top focuses on the man in the light blue shirt. The man in the blue suit is on the right, while the man in the dark suit with the pink tie is on the left, against the same dark background with illuminated shelves.\n\nIn the audio, a male voice begins by stating, \"40,000 people did not assault the Capitol. Okay, that's just factually inaccurate. There are maybe 1,000 people. That is not making light of the evil of those people who all will end up rotting in prison, as they should\" followed by another male voice responding.\n\nThe visual transitions to a close-up of the man in the dark suit with the pink tie, indicating he's speaking with his mouth in motion. The dark background with illuminated objects remains consistent. The view then changes to a split screen with top and bottom panels. The bottom shows both men seated at the desk, engaged in what appears to be an impassioned conversation, while the top focuses on the man in the light blue shirt. The clip then zooms in on the man in the dark suit with the pink tie, who continues to speak.
Yes, it is inaccurate, Malcolm.\"\n\nThe top is a close-up of the man in the blue suit and light blue shirt, his mouth slightly open as he speaks, while the bottom focuses on the two men. The dark background with illuminated shelves persists. The bottom is a wider view revealing both men, with the man in the blue suit gesticulating as he talks, while the top focuses on the man in the light blue shirt. Once again, the setup remains with dark, illuminated shelves and a reflective desk surface.\n\nThe audio records: \"Yes, you're not mentioning them. Because, Malcolm, you have a terrible habit of conflating people who do actual violent activity with a broader rubric of everyone, for example, who voted for Trump, which was 75 million people.\"\n\nThe top focuses closely on the man in the blue suit, who gestures with his hands while speaking, the backdrop unchanged. The bottom shows both men; the man in the blue suit is animated, emphasizing his points through gesture. The man in the dark suit with the pink tie rests his chin on his hand, listening. The scene holds its consistent dark backdrop with lit shelves and a reflective desk.\n\nIn the audio, the male voice articulates: \"And that really is ugly. I mean, you've tweeted about me, for example, that I'm such a racist I shouldn't be allowed in Washington, D.C. That sounds rather authoritarian in an attitude to me. I've never called for you to be kicked off of MSNBC.\"\n\nThe visual cuts to a close-up of the man in the dark suit with a pink tie, who is contemplative, resting his chin on his hand. The dark shelves and illuminated objects remain consistent in the background. The bottom zooms out to show both men, while in the top the man in the blue suit now clasps his hands together on the desk, looking intently at his counterpart. The clip wraps up with a close-up of the man in the dark suit who begins to speak.\n\nThe audio contains the dialogue: \"I've never called for anybody not to buy your book. 
I've never called for anybody to censor you on Twitter. Let me tell you something. Two years ago, I was invited to Auschwitz for the Auschwitz Foundation, where they holding a conference call.\"\n\nFinally, the focus is on the man in the dark suit and pink tie, speaking animatedly as he gestures with his hands. The dark background with illuminated objects continues. The bottom zooms out to show both men in the frame; the man in the blue suit listens attentively while the top focuses on the man in the light blue shirt. The clip closes with a close-up of the man in the dark suit, who is vigorously speaking and gesturing.\n\nThe concluding audio statement is: \"Are you going to lecture me on the Holocaust? No, no, no. I'm going to tell you what.\"\n\nThroughout the video, the scenes cut back and forth between close-ups and wider shots, keeping continuity with the dimly lit background featuring illuminated shelves. The exchanges between the two men stay intense and dynamic, reflecting the gravity of the topics discussed."
+ }
+ ],
+ "events": [
+ "Man in blue suit and light blue shirt seated at desk, hands clasped, looking to his left.",
+ "Dim background with illuminated shelves and various objects.",
+ "Scene transitions to another man in dark suit, light blue shirt, and pink tie, seated, looking to his left.",
+ "Background remains consistent with illuminated shelves and objects.",
+ "Split-screen: top focuses on man in light blue shirt, bottom shows both men facing each other across the desk.",
+ "Man in blue suit on the right, man in dark suit with pink tie on the left, same background with illuminated shelves.",
+ "Audio: Male voice states, '40,000 people did not assault the Capitol...'",
+ "Scene transitions to close-up of man in dark suit with pink tie speaking.",
+ "Consistent backdrop: dark with illuminated objects and shelves.",
+ "Split-screen: top focuses on man in light blue shirt, bottom shows both men at the desk in conversation.",
+ "Zoom-in on man in dark suit with pink tie speaking.",
+ "Audio: Male voice continues, 'Were all of those people assaulting the Capitol?...'",
+ "Top split-screen: Close-up of man in blue suit and light blue shirt speaking.",
+ "Bottom split-screen: Wider view of both men, man in blue suit gesticulates.",
+ "Consistent setup with dark, illuminated shelves, and reflective desk surface.",
+ "Audio: Male voice says, 'Yes, you're not mentioning them... 75 million people.'",
+ "Top split-screen: Close-up of man in blue suit gesticulating.",
+ "Bottom split-screen: Both men shown, man in blue suit animated, man in dark suit with pink tie listens.",
+ "Consistent dark backdrop with lit shelves and reflective desk.",
+ "Audio: Male voice continues, 'And that really is ugly... I've never called for you to be kicked off of MSNBC.'",
+ "Close-up of man in dark suit with pink tie, contemplative, resting chin on hand.",
+ "Bottom split-screen: Zooms out to show both men, top split-screen reveals man in blue suit clasping hands, looking intently.",
+ "Close-up of man in dark suit begins speaking.",
+ "Audio: 'I've never called for anybody not to buy your book... holding a conference call.'",
+ "Focus on man in dark suit with pink tie, gesturing animatedly while speaking.",
+ "Consistent dark background with illuminated shelves.",
+ "Bottom split-screen: Zooms out to show both men, man in blue suit listens attentively, top split-screen focuses on man in light blue shirt.",
+ "Close-up of man in dark suit speaking vigorously and gesturing.",
+ "Audio: 'Are you going to lecture me on the Holocaust?... No, no, no. I'm going to tell you what.'",
+ "Scenes cut back and forth between close-ups and wider shots maintaining continuity.",
+ "Exchanges between the two men stay intense and dynamic."
+ ]
+ },
+ {
+ "video": "video/145.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up shot of an elderly woman with long gray hair wearing glasses and a dark suit paired with a light blue shirt. Visible in the background is a medical device with various buttons, a screen, and some cables. As an informative voiceover, a man states, \"If you experience tingling, weakness, or stabbing pain in your hands or feet, you may be among the 2 to 3 percent of the population with peripheral neuropathy.\" The scene transitions to a close-up of a hand with a gold ring on the ring finger, resting on a white cloth on a table. This clip then shifts to show another person wearing a dark suit with a red shirt underneath, removing a black sock from their foot, and subsequently examining their bare foot closely with their hand on it. Throughout this segment, the consistent presence of the Mayo Clinic logo is noticeable in the bottom right corner.\n\nThe narrative continues with a close-up of a picture showing two feet, one of which is lifted slightly off the ground, revealing pain or pressure points on the sole marked in red areas. The plain white background and the Mayo Clinic logo remain constant. A woman takes over the voiceover, saying, \"It's like a condition that affects what we call the peripheral nerves of the body.\" Following this, a close-up of a woman with shoulder-length blonde hair in a black top, situated in an indoor setting with a warm color palette, is shown. The frame transitions to an illustration of the human nervous system focusing specifically on peripheral nerves. The text 'Peripheral neuropathy' appears on the screen, with an additional note that states, 'For people with diabetes, peripheral neuropathy is more common in the feet.' A man's voice continues to explain the illustration, concluding that damage to these nerves leads to peripheral neuropathy: \"Peripheral neuropathy happens when those nerves are damaged. Some patients experience decreased sensation or numbness in their hands and feet.\"\n\nNext, a woman's hand with a gold ring on the ring finger appears again, this time being treated with a white medical device by a doctor's hand. Subsequently, a close-up shows the same hand being examined with a medical device by a gloved doctor. The examination is conducted against the backdrop of medical equipment and a monitor displaying an ultrasound image labeled 'LEFT PROXIMAL PNE.' A person in a red shirt can be seen. The frame transitions to a detailed model of nerves and blood vessels, emphasized by a pen pointing to distinct areas. Returning to the woman with shoulder-length blonde hair, now identified on-screen as 'Michelle Mauermann, M.D., Neurology, Mayo Clinic,' she speaks: \"They can also have what we call positive sensory symptoms like prickling and tingling or burning or electrical shock-like sensations.\"\n\nDr. Mauermann's discussion continues in a subsequent close-up where she continues her explanation. The next scene depicts the woman with a gold ring, wearing a light blue shirt and dark vest, resting her hands. The narrative unfolds to show an elderly woman in a floral dress and black shorts sitting on the floor near a staircase, seemingly after a fall, while another individual in jeans and brown shoes assists. An image then centers on a man with short hair, wearing a dark suit and red tie, with his blood pressure being checked by a woman with long red hair in a navy blue top. Accompanying medical equipment is also visible. Mayo Clinic\u2019s prominent logo remains throughout, while the voiceover adds, \"Muscle weakness, balance issues, and abnormalities in blood pressure or pulse also can be symptoms of peripheral neuropathy, which can be brought on by a number of reasons, the most common being diabetes.\"\n\nIn the continuation, the video features a man in a dark suit and red tie taking another woman's blood pressure using a blue cuff, amid medical equipment. This is followed by footage of a man in a dark suit and red tie conversing with a woman in a colorful patterned blouse in an office setting; both of them are seated. The scene transitions to a close-up of a hand with a ring holding a blood glucose meter, which is likely used to prick the finger and test the blood glucose level. The scene then transitions to a pair of hands holding the blood glucose meter, labeled 'Contour next EZ.', as the number on the screen changes from 5 to 4. In a subsequent scene, a bare foot on a carpeted floor is being gently examined by a hand. A beep precedes another male voice in the background, which notes, \"Dr. Mauermann says the good news is positive sensory symptoms can be treated, and in some cases, it's possible to actually halt or reverse the neuropathy.\"\n\nThe final segment transitions to blue Crocs on a wooden dock as a woman walks by them. Following this, a woman is seen seated at the edge of the dock, dipping her feet into the water with a serene lakeside background. A close-up captures the woman's feet submerged, displaying pink toenail polish. Her feet sway in the water. The screen fades to a white background showcasing the Mayo Clinic logo, the text 'newsnetwork.mayoclinic.org,' and a 2020 copyright notice from the Mayo Foundation for Medical Education and Research. The audio concludes with a man saying, \"Mayo Clinic.\""
+ }
+ ],
+ "events": [
+ "Close-up shot of elderly woman with long gray hair, glasses, dark suit, light blue shirt",
+ "Medical device visible in background with buttons, screen, and cables",
+ "Voiceover by man mentioning symptoms of peripheral neuropathy",
+ "Transition to close-up of a hand with a gold ring on a ring finger resting on white cloth",
+ "Person in dark suit with red shirt removes black sock from foot and examines bare foot",
+ "Mayo Clinic logo visible in bottom right corner",
+ "Close-up of a picture showing two feet, one lifted off the ground with pain points in red",
+ "Voiceover by woman discussing peripheral nerves of the body",
+ "Close-up of woman with shoulder-length blonde hair in black top in warm indoor setting",
+ "Transition to an illustration of human nervous system focusing on peripheral nerves",
+ "Text 'Peripheral neuropathy' appears with note about diabetes",
+ "Man's voice continues explaining nerve damage leading to peripheral neuropathy",
+ "Close-up of woman's hand with a gold ring being treated with white medical device by doctor's hand",
+ "Close-up of hand being examined with medical device by gloved doctor",
+ "Monitor displaying ultrasound image labeled 'LEFT PROXIMAL PNE'",
+ "Person in red shirt visible",
+ "Frame transitions to detailed model of nerves and blood vessels with a pen pointing to areas",
+ "Returning to woman identified as 'Michelle Mauermann, M.D., Neurology, Mayo Clinic,' speaking about symptoms",
+ "Subsequent close-up of Dr. Mauermann continuing to explain",
+ "Woman with gold ring wearing light blue shirt and dark vest resting her hands",
+ "Elder woman in floral dress and black shorts sitting on the floor near staircase after a fall",
+ "Another individual in jeans and brown shoes assisting the elderly woman",
+ "Man in dark suit with red tie having blood pressure checked by woman with long red hair in navy blue top",
+ "Accompanying medical equipment visible",
+ "Voiceover discussing muscle weakness, balance issues, and blood pressure symptoms of peripheral neuropathy",
+ "Man in dark suit and red tie taking another woman's blood pressure with a blue cuff amid medical equipment",
+ "Man in dark suit and red tie conversing with woman in colorful patterned blouse in an office setting, both sitting",
+ "Close-up of hand with a ring holding a blood glucose meter",
+ "Scene transitions to hands holding blood glucose meter labeled 'Contour next EZ', numbers on screen transitioning from 5 to 4",
+ "Bare foot on carpeted floor being gently examined by a hand",
+ "Background audio mentions possibility to treat and reverse neuropathy",
+ "Blue Crocs on wooden dock, woman walks by",
+ "Woman seated at edge of dock dipping her feet into water with serene lakeside background",
+ "Close-up of woman's feet submerged in water with pink toenail polish",
+ "Feet swaying in water",
+ "Screen fades to white background showing Mayo Clinic logo, text 'newsnetwork.mayoclinic.org,' and 2020 copyright notice",
+ "Audio concludes with man stating 'MAYO CLINIC'"
+ ]
+ },
+ {
+ "video": "video/146.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a dark screen, which quickly transitions to a detailed financial chart. The chart illustrates a significant upward trend with multiple peaks and valleys, overlaid with various lines that likely represent different moving averages. An information box appears, showing key data: '68.62 (1186.77%), 6862, 225 bars, 32w4d, Vol 886.633M'. Shortly after, another box with new data '71.52 (1236.86%), 7152, 239 bars, 346d, Vol 927.754M' is displayed.\n\nSimultaneously, the audio features a man explaining: \"Here is how I made 1200% on one single trade using a simple moving average strategy. First, you want to get on TradingView, type in 'three moving averages', and click this one by Advent Trading.\"\n\nThe scene shifts to a Google search page showing trending searches, followed by a search for 'tradingview', with results such as 'tradingview app', 'tradingview desktop', and 'tradingview btc'. In the background, typing sounds can be heard, adding realism to the search process.\n\nThe next part transitions to a financial software interface. From the 'Indicators, Metrics & Strategies' section, 'three moving averages' is selected, highlighting 'Three Moving Averages [AdventTrading]' as a favorite. The screen blurs momentarily and then reveals the 'Favorites' tab with scripts like '[RS]Support and Resistance V0', 'Auto Fib', and the highlighted 'Three Moving Averages [AdventTrading]'. The search term 'ema' is typed in, displaying results such as 'Moving Average Exponential', 'Double EMA', and 'Triple EMA'. The financial chart reappears with multiple moving averages indicating market trends.\n\nThe accompanying audio features the same man stating: \"Then you want an exponential moving average, so type in EMA and click this one right here. Now that we've added the indicators, let's change the settings a bit.\"\n\nNext, TradingView displays the stock 'Digital Turbine, Inc.' (ticker symbol 'APPS') with stock prices listed in red and blue boxes ('44.50' and '45.00'). Indicators like 'TEMA 50 100 200 close close close' and 'EMA 9 close 0 47.83' are visible. The settings menu for the 'TEMA' indicator appears, where the user enables and adjusts the lengths of moving averages to '13', '21', and '55', with the source set to 'close'. The style tab is opened, and colors for the moving averages are changed to green, yellow, and red. The chart updates, reflecting these new settings.\n\nThe audio describes: \"Click the settings for the triple moving average, change the lengths to the following: 13, 21, and 55. Then I'm also going to change the colors for simplicity's sake.\"\n\nThe focus shifts again to TradingView, displaying the same stock and its current prices. The settings for the 'EMA' indicator are modified, changing the 'Length' to '8'. A detailed financial chart appears with moving averages colored green, yellow, and red, alongside the numbers '8', '13', '21', and '55' indicating their lengths.\n\nThe audio continues: \"Then I'm gonna go to the settings of the exponential moving average and change the length to eight. The strategy is simple. You're gonna have four lines on your chart: the eight, 13, 21, and 55.\"\n\nA detailed financial chart is shown with multiple moving averages in red, green, yellow, and blue. Highlighted points on the chart emphasize specific trading signals. The chart zooms in and out, offering views of both specific and broader market trends.\n\nThe man elaborates: \"You want to buy when the 55, the red line, crosses to the bottom. So right here, for example. You are gonna hold until the 55 crosses all the way to the top. So right here, for instance, you would have sold.\"\n\nLastly, the chart displays both downward trends with red candlesticks and potential upward movements with green candlesticks, reflecting market volatility. Alternating red and green candlesticks suggest fluctuating trends.\n\nThe audio concludes: \"I should also be clear, if the 55 crosses only one line like this, you do not sell. You only sell if the 55 is all the way on the top.\"\n\nThroughout the video, the financial chart remains prominently highlighted against a dark background, ensuring the overlaying data and moving averages are clearly visible, with precise audio guidance accompanying each visual transition."
+ }
+ ],
+ "events": [
+ "Video begins with a dark screen.",
+ "Screen transitions to a detailed financial chart.",
+ "Chart illustrates an upward trend with peaks and valleys.",
+ "Various lines represent different moving averages on the chart.",
+ "Information box appears showing data: '68.62 (1186.77%), 6862, 225 bars, 32w4d, Vol 886.633M'.",
+ "Another box with data: '71.52 (1236.86%), 7152, 239 bars, 346d, Vol 927.754M' appears.",
+ "Audio: Man explains how he made 1200% on a single trade using a simple moving average strategy.",
+ "Audio: Man instructs to get on TradingView, type in 'three moving averages', and select by Advent Trading.",
+ "Scene shifts to a Google search page showing trending searches.",
+ "Search for 'tradingview' is performed.",
+ "Results such as 'tradingview app', 'tradingview desktop', and 'tradingview btc' are displayed.",
+ "Typing sounds can be heard in the background.",
+ "Transition to a financial software interface.",
+ "'Three moving averages' is selected from 'Indicators, Metrics & Strategies' section.",
+ "'Three Moving Averages [AdventTrading]' is highlighted as favorite.",
+ "Screen blurs briefly, then reveals 'Favorites' tab with scripts like '[RS]Support and Resistance V0', 'Auto Fib', and 'Three Moving Averages [AdventTrading]'.",
+ "Search term 'ema' is typed in.",
+ "Results such as 'Moving Average Exponential', 'Double EMA', and 'Triple EMA' are displayed.",
+ "Financial chart reappears indicating market trends with multiple moving averages.",
+ "Audio: Man states to add exponential moving average by typing EMA and selecting it.",
+ "Scene shows TradingView displaying stock 'Digital Turbine, Inc.' (ticker symbol 'APPS').",
+ "Stock prices listed in red and blue boxes: '44.50' and '45.00'.",
+ "Indicators like 'TEMA 50 100 200 close close close' and 'EMA 9 close 0 47.83' are visible.",
+ "Settings menu for 'TEMA' indicator appears.",
+ "User enables and adjusts the lengths of moving averages to '13', '21', and '55' with the source set to 'close'.",
+ "Style tab opened, colors changed to green, yellow, and red.",
+ "Chart updates reflecting new settings.",
+ "Audio: Man instructs to change TEMA lengths to '13', '21', and '55' and change colors.",
+ "Focus returns to TradingView displaying same stock and prices.",
+ "Settings for 'EMA' indicator are modified, changing the 'Length' to '8'.",
+ "Detailed financial chart appears with moving averages in green, yellow, and red.",
+ "Numbers '8', '13', '21', and '55' indicate moving average lengths.",
+ "Audio: Man explains the simple strategy with four lines on the chart: '8', '13', '21', and '55'.",
+ "Detailed financial chart shown with moving averages in red, green, yellow, and blue.",
+ "Highlighted points on the chart indicate specific trading signals.",
+ "Chart zooms in and out for specific and broader market views.",
+ "Audio: Man advises to buy when the '55', the red line, crosses to the bottom and to hold until '55' crosses to the top.",
+ "Chart displays downward trends with red candlesticks.",
+ "Chart shows potential upward movements with green candlesticks.",
+ "Alternating red and green candlesticks suggest market volatility.",
+ "Audio: Man specifies not to sell if '55' crosses only one line; sell only if '55' is at the top.",
+ "Financial chart remains highlighted against a dark background throughout the video.",
+ "Overlaying data and moving averages are clearly visible.",
+ "Precise audio guidance accompanies each visual transition."
+ ]
+ },
+ {
+ "video": "video/147.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video portrays a seamless narrative focusing on a woman experiencing a common technical issue and receiving guidance to resolve it, along with a motivational message for the audience to engage further with her content.\n\nThe video starts with a woman in a striped dress entering a well-furnished living room, holding a laptop. The elegantly designed room features a wooden door, a blue sofa, a chandelier, and a large TV on a wooden entertainment unit. She approaches a wooden dining table with a woven placemat, places the laptop on it, and starts working. The scene fades to a man sitting at the same table holding a smartphone, while the woman continues to work on the laptop, standing by the table.\n\nThe audio captures a tense conversation: \"Bro, help me. It's an emergency.\" A man's voice replies, \"Why? What happened? Tell me.\" The woman urgently explains, \"I was actually working on a Word document and suddenly my laptop turned off due to low battery. I didn't even save it, bro. Do I have to do it again?\" The woman is still standing at the dining table, explaining something with emphatic gestures. Behind her, a modern kitchen with a refrigerator and a clock on the wall is visible.\n\nA scene shift shows a man sitting at a different table working on a laptop, wearing a black t-shirt with a yellow design. He speaks animatedly, emphasizing his points with hand gestures, \"Chill ma, I'm there na. Even if you don't save your Word file or even you don't have Autosave turned on, MS Word will automatically make a temporary backup file called as dot asd files and these can be easily accessed from your C drive. I'll quickly show you where you can find that. So, head over to file explorer and in your C drive.\" Text on the screen reads, \"This will work even if Autosave is not enabled!!,\" suggesting a feature related to Microsoft Word. The man explains using hand gestures. Text on the screen reads, \"MS Word always creates a Temporary Backup .asd file to help in these Critical Situations...,\" as the man is speaking.\n\nThe scene shifts to a close-up of a laptop screen showing a file explorer window. A hand points out the path: 'This PC > OS (C:) > Users > ROG > AppData > Roaming > Microsoft > Word', highlighting several backup files including 'STARTUP', 'AutoRecovery save of World Domination (Unsaved-309749...)', and others. Text on the screen advises, \"You can access these unsaved backup files from this path in Your C: Drive.\" The audio continues, \"in the user section, head over to Microsoft and Word document. This is where you'll find your recently unsaved files. Just click on that and you'll be able to easily access the file you were working on,\" accompanied by typing sounds and the man's voice.\n\nThe final scene shows a close-up of a laptop screen displaying a Microsoft Word document titled 'World Domination'. The content introduces 'Sheetal Chintapalli', a budding tech reviewer on YouTube, seeking loyal supporters for her goal of world domination. The woman in the striped dress is then shown sitting at the dining table with the same laptop, gesturing as she discusses the document\u2019s content. The stylish living room in the background features a wooden door, blue sofa, chandelier, and a large TV on a wooden entertainment unit. An iPhone rests on the table beside the laptop. Text on the screen reads, \"Back on track :),\" and then a 'Subscribe' button with a pointing hand icon appears, inviting viewers to subscribe to her channel. Audio features the woman saying, \"Wow, thank you, that was such a useful tip! And guys, if you want more such tips, do follow and subscribe to our channel,\" followed by a man and then a child speaking, culminating in the call to action.\n\nThus, the video comprehensively guides the audience through a technical troubleshooting process while integrating a promotional message, maintaining a coherent flow of visual and auditory elements."
+ }
+ ],
+ "events": [
+ "Woman in a striped dress enters a well-furnished living room holding a laptop.",
+ "Living room features a wooden door, blue sofa, chandelier, and large TV on a wooden entertainment unit.",
+ "Woman approaches a wooden dining table with a woven placemat, places the laptop on it, and starts working.",
+ "Scene fades to a man sitting at the same table holding a smartphone, while the woman continues working on the laptop.",
+ "Audio captures the woman saying: 'Bro, help me. It's an emergency.'",
+ "Man replies: 'Why? What happened? Tell me.'",
+ "Woman urgently explains about her laptop turning off due to low battery and her unsaved Word document.",
+ "Woman stands at the dining table explaining with emphatic gestures; modern kitchen, refrigerator, and clock are visible in the background.",
+ "Scene shifts to a man sitting at a different table, working on a laptop, wearing a black t-shirt with a yellow design.",
+ "Man speaks animatedly, explaining how to recover an unsaved Word document from a temporary backup file (.asd) on the C drive.",
+ "Text on screen: 'This will work even if Autosave is not enabled!!'",
+ "Man uses hand gestures to explain the process.",
+ "Text on screen: 'MS Word always creates a Temporary Backup .asd file to help in these Critical Situations...'",
+ "Close-up of a laptop screen showing a file explorer window.",
+ "Hand points out the file path 'This PC > OS (C:) > Users > ROG > AppData > Roaming > Microsoft > Word'.",
+ "Several backup files are visible, including 'STARTUP' and 'AutoRecovery save of World Domination (Unsaved-309749...)'.",
+ "Text advises: 'You can access these unsaved backup files from this path in Your C: Drive.'",
+ "Audio continues explaining how to access recently unsaved files in the user folder under Microsoft and Word.",
+ "Scene shifts to a close-up of a laptop screen displaying a Microsoft Word document titled 'World Domination'.",
+ "Content introduces 'Sheetal Chintapalli', a tech reviewer on YouTube, asking for supporters.",
+ "Woman in the striped dress is shown sitting at the dining table with the laptop, gesturing and discussing the document.",
+ "Background shows stylish living room with a wooden door, blue sofa, chandelier, and large TV on a wooden entertainment unit.",
+ "An iPhone rests on the table beside the laptop.",
+ "Text on the screen reads: 'Back on track :)'",
+ "Subscribe button with a pointing hand icon appears, inviting viewers to subscribe.",
+ "Audio captures the woman saying: 'Wow, thank you, that was such a useful tip! And guys, if you want more such tips, do follow and subscribe to our channel.'",
+ "Audio features a concluding message from a man and then a child, reinforcing the call to action."
+ ]
+ },
+ {
+ "video": "video/148.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a woman with long, straight, pink hair adorned with green hair clips on both sides and a red mark on her face, standing against a plain, light-colored wall. The text 'Here let me show you something.' appears in the top left corner. A conversation starts with a woman saying, \"Hey, let me show you something.\" Then the pink-haired woman says, \"Wait, what are you doing? No, don't post it!\" The woman looks panicked and sad. The reaction text 'Wait what are you doing no don't post it!' appears at the bottom, reflecting the spoken words. The screen momentarily goes black, indicating a transition, and the voice continues, \"See? Easy. Thank me later.\" As the scene goes black, the woman asks, \"Okay, so what now?\" and the audio replies, \"You wait.\" The pink-haired woman then asks, \"How long?\"\n\nThe next scene continues with the woman, and the text 'I don't know until it goes viral I guess anyways I need to get to class.' appears in the top left corner, reflecting the ongoing conversation; the dialogue matches the text. The video then shows a black screen with the text '1 hour later,' suggesting a time lapse. The text 'What are you having for lunch?' appears, and the text 'I'm not really hungry' shows at the bottom while the woman says exactly the same words. Music sung by a man is heard in the background as the clip concludes with the woman reappearing; the text and audio both say, \"I heard they're serving pizza. Let's go see.\"\n\nAnother transition occurs with a black screen, then the woman reappears with the text 'Why is everyone staring at me' at the bottom, matching her spoken confusion as she says the same as the text. A voice says, \"Did you see her video? It has 12 million views. Amara, your mark's gone.\" The woman realizes her newfound popularity while the music continues to play in the background.\n\nThe scene starts again with the text '+300K followers' at the bottom. The text 'I have 300,000 followers but how??' appears as the woman expresses surprise and says the same words. Another voice, shown in red text, says, \"See, I told you you'd go viral.\" The woman then confidently states, \"I'm gonna go get that interview!!\" while the text appears at the bottom. The clip ends with 'Ok' at the bottom, with the audio adding, \"Okay, good luck, bestie.\"\n\nContinuing the saga, the woman stands with the text 'Hi I'm here for an interview' at the bottom while saying the same. A voice says, \"Let me check your marks. No marks? Wow, that's amazing. I've never heard anyone apply with no marks.\" The confusion is captured in the text 'What do you mean I thought you had a strict no mark policy?', which the woman voices while background music plays.\n\nIn the final scene, the conversation continues with the text 'What do you mean I thought you had a strict no mark policy?' in the top left. The dialogue reveals more about the interview policy, \"We accept students with up to 10 marks,\" as a group of people chatter in the background, explaining the conditions further."
+ }
+ ],
+ "events": [
+ "Woman with long, straight, pink hair and green hair clips stands against a light-colored wall",
+ "Red mark is visible on the woman's face",
+ "Text 'Here let me show you something.' appears in the top left corner",
+ "A woman says, 'Hey, let me show you something.'",
+ "Pink haired woman responds, 'Wait, what are you doing? No, don't post it!'",
+ "Woman looks panicked and sad",
+ "Reaction text 'Wait what are you doing no don't post it!' appears at the bottom",
+ "Screen goes black, transitioning scene",
+ "Voice continues, 'See? Easy. Thank me later.'",
+ "Woman says, 'Okay, so what now?'",
+ "Audio says, 'You wait.'",
+ "Pink haired woman asks, 'How long?'",
+ "Text in the top left corner says, 'I don't know until it goes viral I guess anyways I need to get to class.'",
+ "Dialogue matches the text",
+ "Black screen with text '1 hour later,' suggesting a time lapse",
+ "Text 'What are you having for lunch?' appears on screen",
+ "Text 'I'm not really hungry' shows at the bottom",
+ "Woman says, 'I'm not really hungry'",
+ "Background music with a man singing is heard",
+ "Clip concludes with woman reappearing",
+ "Text and audio say, 'I heard they're serving pizza. Let's go see.'",
+ "Another transition to black screen",
+ "Woman reappears with text 'Why is everyone staring at me' at the bottom",
+ "Woman says, 'Why is everyone staring at me?'",
+ "Voice says, 'Did you see her video? It has 12 million views. Amara, your mark's gone.'",
+ "Woman realizes her newfound popularity",
+ "Background music continues to play",
+ "Scene starts with '+300K followers' text at the bottom",
+ "Text 'I have 300,000 followers but how??' appears",
+ "Woman says, 'I have 300,000 followers but how??'",
+ "Red text says, 'See I told you you'd go viral.'",
+ "Woman confidently states, 'I'm gonna go get that interview!!'",
+ "Text 'I'm gonna go get that interview!!' appears at the bottom",
+ "Clip ends with text 'Ok' at the bottom",
+ "Audio adds, 'Okay, good luck, bestie.'",
+ "Scene continues with woman standing and text 'Hi I'm here for an interview' at the bottom",
+ "Woman says, 'Hi I'm here for an interview.'",
+ "Voice says, 'Let me check your marks. No marks? Wow, that's amazing. I've never heard anyone apply with no marks.'",
+ "Text 'What do you mean I thought you had a strict no mark policy?' appears in the top left",
+ "Woman voices, 'What do you mean I thought you had a strict no mark policy?'",
+ "Background music continues",
+ "Group of people chatter in the background, explaining interview policy",
+ "Voice says, 'We accept students with up to 10 marks.'"
+ ]
+ },
+ {
+ "video": "video/149.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video is a split screen divided into two parts. The top half begins with an emotional close-up of a woman with short blonde hair wearing dangling earrings. The text overlay reads '2009' and 'Taylor Swift's VMAs speech.' The bottom half of the video stops at Kanye in light brown and Taylor in black sequins standing on stage.\n\nThe view shifts to a stage where Taylor Swift in a white dress is holding a trophy, while Kanye in a dark shirt and jeans interrupts her, speaking into a microphone. The background shows a cityscape with other people standing. Audio reveals, \"Yo, Taylor. I'm really happy for you. I'm gonna let you finish. But Beyonce had one of the best videos of all time.\" The crowd cheers and applauds as Kanye's hand holding the microphone is highlighted.\n\nThe scene continues with Kanye holding the microphone, maintaining the text overlay of '2009' and 'Taylor Swift's VMAs speech.' The camera captures a woman in a red dress with curly hair reacting emotionally, who seems to be Beyonc\u00e9, while the background shows other people seated in the audience. It cuts back to Kanye and Taylor in white on stage, reinforcing the cityscape backdrop. He continues, \"...one of the best videos of all time.\" The crowd's cheers persist as the scene transitions to a wide shot of the stage.\n\nThe video progresses with a close-up of the emotional woman in the red dress once more, maintaining the '2009' and 'Taylor Swift's VMAs speech' overlay. The scene changes to a stage where a man in a beige shirt hands a trophy to a woman in a black and silver outfit amidst a brightly lit audience. Following this, the woman in the white dress resumes speaking into the microphone. The text overlay reinforces '2009' and 'Taylor Swift's VMAs speech,' leading to repeated footage of the man in a dark shirt and the woman in white. The first half of the video ends with a freeze frame of Kanye holding a microphone and speaking to Taylor. Then Taylor, in a black and gold outfit, speaks with the text '2015.' The audio details Taylor saying, \"I first met Kanye West six years ago at this show, actually. So I guess I have to say to all the other women.\" The crowd cheers in the background.\n\nTransitioning again, the scene shows Kanye in a dark shirt and jeans on stage with Taylor in a sparkling silver dress holding a trophy. The consistent text 'Taylor Swift's VMAs speech' appears as the audience remains dimly lit. It switches to a close-up of Taylor in a black and gold outfit speaking, with a '2015' overlay. The clip ends with the audience responding to the speech. The audio captures, \"Tonight, I'm really happy for you. And I'm gonna let you finish. But Kanye West has had one of the greatest of all time!\" while the crowd cheers and some background music plays.\n\nFinally, the video displays the stage scene where Kanye in a beige shirt hugs Taylor, captioned 'Taylor Swift's VMAs speech' with a '2015' text overlay. The overlay text reads 'SWIFT, WEST BURY THE HATCHET.' The final shot displays 'KANYE WEST' with '2015' as Kanye and Taylor hug. The audio concludes with applause and cheering from the crowd."
+ }
+ ],
+ "events": [
+ "Split-screen video setup with top half focusing on a woman and bottom half on Kanye and Taylor on stage",
+ "Top half begins with an emotional close-up of a woman with short blonde hair wearing dangling earrings",
+ "Text overlay reads '2009' and 'Taylor Swift's VMAs speech'",
+ "Bottom half stops at Kanye (in light brown) and Taylor (in black sequins) standing on stage",
+ "View shifts to a stage where Taylor Swift (in a white dress) holds a trophy",
+ "Kanye (in a dark shirt and jeans) interrupts Taylor, speaking into a microphone",
+ "Background on stage shows a cityscape with other people standing",
+ "Audio captures Kanye saying, 'Yo, Taylor. I'm really happy for you. I'm gonna let you finish. But Beyonce had one of the best videos of all time.'",
+ "Crowd cheers and applauds as Kanye's hand holding the microphone is highlighted",
+ "Kanye continues to hold the microphone with text overlay maintaining '2009' and 'Taylor Swift's VMAs speech'",
+ "Camera captures a woman in a red dress with curly hair reacting emotionally (likely Beyonc\u00e9)",
+ "Background shows other people seated in the audience",
+ "Cuts back to Kanye and Taylor (both in white) on stage with cityscape backdrop",
+ "Kanye continues, '...one of the best videos of all time.'",
+ "Crowd's cheers continue as the scene transitions to a wide shot of the stage",
+ "Close-up of the emotional woman (in red dress) is shown again",
+ "Scene shifts to a new stage where a man in a beige shirt hands a trophy to a woman in black and silver outfit amidst a brightly lit audience",
+ "Woman in a white dress resumes speaking into the microphone with text overlay '2009' and 'Taylor Swift's VMAs speech'",
+ "Repeated footage of the man in dark shirt and woman in white on stage",
+ "First half of the video ends with freeze frame of Kanye holding a microphone and speaking to Taylor",
+ "Close-up of Taylor in a black and gold outfit with text overlay '2015'",
+ "Audio of Taylor saying, 'I first met Kanye West six years ago at this show, actually. So I guess I have to say to all the other women.'",
+ "Crowd cheers in the background",
+ "Scene transitions to Kanye in dark shirt and jeans on stage with Taylor in a sparkling silver dress holding a trophy",
+ "Consistent text 'Taylor Swift's VMAs speech' appears with dimly lit audience",
+ "Switches to close-up of Taylor in black and gold outfit speaking, with '2015' overlay",
+ "Clip ends with audience responding to the speech",
+ "Audio captures, 'Tonight, I\u2019m really happy for you. And I'm gonna let you finish. But Kanye West has had one of the greatest of all time!' while crowd cheers and background music plays",
+ "Final scene shows Kanye (in a beige shirt) hugging Taylor on stage with text overlay 'Taylor Swift's VMAs speech.'",
+ "Additional overlay text reads '2015' and 'SWIFT, WEST BURY THE HATCHET.'",
+ "Final shot displays 'KANYE WEST' with '2015' as Kanye and Taylor hug together",
+ "Audio concludes with applause and cheering from the crowd"
+ ]
+ },
+ {
+ "video": "video/150.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a thrilling shot of a race track, showcasing a blue and white Formula 1 car racing ahead of a yellow car. Damon Hill, driving the yellow car in Canada in 1999, navigates the track with a red car in pursuit. Close-up shots capture the intensity of Hill's driving. The scene transitions to a close-up of a man speaking behind a microphone before transitioning to a scene featuring a person in a yellow racing suit running alongside the track. As the camera zooms in, the audio sets the scene with a man's voice explaining, \"`I lost control of the vehicle and hit the wall. There's nothing more to it than that,`\" accompanied by music and the harsh sound of tires squealing.\nThe focus then shifts to a red Formula 1 car, driven by Michael Schumacher in the same race.\n\nTransitioning to the next segment, the red Schumacher car is shown up close before dramatically crashing into a wall, smoke billowing around the impact site. Another orange and black car speeds ahead, while Schumacher's car remains motionless against the wall. The video then shifts to a close-up of the red car on the helmeted driver. The audio captures a voice over the loudspeaker declaring, \"`he is trying a way through the field. Schumacher out of the race. Michael Schumacher slams out of turn 13.`\" matched by background music that heightens the tension.\n\nThe video continues with an onboard perspective from a red Formula 1 car featuring prominent 'Marlboro' branding, navigating a turn on the race track. The car races along a straight path beside a wall labeled 'Quebec' before it crashes violently into the barrier. The scene highlights the text 'Bienvenue' on the wall as the crashed car sits stationary. Text on the screen identifies the event: '1999 Canada Jacques Villeneuve,' and as another car whizzes past, audio commentary notes, \"`Well, that was very cleanly mistake by myself just lost it and hit the wall.`\" Another excited man's voice shouts, \"`there's Villeneuve out Jacques Villeneuve I was talking about has done exactly the same.`\"\n\nShifting gears, a blue and yellow Formula 1 car labeled 405 spins out of control on the track, trailing smoke as it skids into the wall. The camera then captures a white and black car driven by Jenson Button in the 2005 Canadian Grand Prix, meeting a similar fate as it crashes into the same wall, engulfed by smoke. The commentary over the loudspeaker reflects the dramatic event, \"`that Michael Schumacher did oh dear me that's Button hitting the wall the wall of champions and Jenson Button has slammed into it oh my goodness me the podium,`\" all set to an intense musical backdrop.\n\nThe scene then cuts to a white and black stationary Formula 1 car with smoke trickling around it, set against a 'Budweiser' branded wall. An onboard view from a blue Formula 1 car branded with Red Bull details the events of the 2011 Canadian Grand Prix driven by Sebastian Vettel. As the car speeds along the track, it hits a curb and crashes into the wall, leaving debris scattered around. The screen text reads, '`Hit the kerb and then,`' matching the spoken words over the loudspeaker.\n\nIn the culmination of the sequence, a close-up of a blue Formula 1 car branded Renault is shown. This car, now with Red Bull branding, slides off the track and crashes into the wall. The screen text sequentially states, '`slid off the track and hit the wall,`' followed by 'Exit of last corner.' The car sits stationary, smoke swirling around its tires, with the backdrop featuring 'Quebec' branding. The final commentary in a foreign language, accompanied by music, reinforces the dramatic ambiance: \"`slid off the track and hit the wall. Exit of left corner.`\"\n\nThe video seamlessly threads these scenes together, showcasing a series of races where control, speed, and inevitable crashes define the evolution of each narrative. The audio complements the visual action, providing context and a narrative thread that ties these exhilarating and tense moments into a cohesive story of Formula 1 racing's highs and lows.\n"
+ }
+ ],
+ "events": [
+ "Thrilling shot of a race track",
+ "Blue and white Formula 1 car racing ahead of a yellow car",
+ "Damon Hill driving the yellow car in Canada in 1999",
+ "Red car in pursuit of Damon Hill",
+ "Close-up shots capturing the intensity of Hill's driving",
+ "Transition to a close-up of a man speaking behind a microphone",
+ "Person in a yellow racing suit running alongside the track",
+ "Man's voice explaining `I lost control of the vehicle and hit the wall. There's nothing more to it than that,`",
+ "Background music and harsh sound of tires squealing",
+ "Focus shifts to a red Formula 1 car driven by Michael Schumacher",
+ "Close-up of Schumacher's red car",
+ "Schumacher's red car crashes into a wall, smoke billowing",
+ "Orange and black car speeds ahead while Schumacher's car stays motionless",
+ "Close-up of the red car on the helmeted driver",
+ "Voice over loudspeaker: `he is trying a way through the field. Schumacher out of the race. Michael Schumacher slams out of turn 13.`",
+ "Background music heightens the tension",
+ "Onboard perspective from a red Formula 1 car with 'Marlboro' branding",
+ "Car navigates a turn on the race track and races beside a wall labeled 'Quebec'",
+ "Car crashes into the barrier with 'Bienvenue' text on the wall",
+ "Text on screen: '1999 Canada Jacques Villeneuve,'",
+ "Another car whizzes past",
+ "Audio commentary: `Well, that was very cleanly mistake by myself just lost it and hit the wall.`",
+ "Excited man's voice: 'there's Villeneuve out Jacques Villeneuve I was talking about has done exactly the same.'",
+ "Blue and yellow Formula 1 car labeled 405 spins out of control",
+ "Car skids into the wall trailing smoke",
+ "White and black car driven by Jenson Button in the 2005 Canadian Grand Prix crashes into the wall engulfed by smoke",
+ "Commentary over loudspeaker: `that Michael Schumacher did oh dear me that's Button hitting the wall the wall of champions and Jenson Button has slammed into it oh my goodness me the podium,`",
+ "Intense musical backdrop",
+ "White and black stationary Formula 1 car with smoke trickling",
+ "Budweiser branded wall in the background",
+ "Onboard view from a blue Formula 1 car branded with Red Bull at the 2011 Canadian Grand Prix driven by Sebastian Vettel",
+ "Car speeds along the track",
+ "Car hits a curb and crashes into the wall scattering debris",
+ "Screen text: '`Hit the kerb and then,`'",
+ "Spoken words over loudspeaker: '`Hit the kerb and then,`'",
+ "Close-up of a blue Formula 1 car branded Renault and Red Bull",
+ "Car slides off the track and crashes into the wall",
+ "Screen text sequentially: '`slid off the track and hit the wall,`' and 'Exit of last corner.'",
+ "Car sits stationary with smoke swirling around its tires",
+ "`Quebec` branding in the background",
+ "Commentary in a foreign language: `slid off the track and hit the wall. Exit of left corner.`",
+ "Music reinforcing the dramatic ambiance"
+ ]
+ },
+ {
+ "video": "video/151.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man setting up a black pot over a fire pit made of stacked stones. Flames flicker vividly as the fire burns. The man, dressed in a green shirt, black shorts, and black shoes, meticulously prepares the ingredients. He puts some oil on a bowl of raw meat. The camera zooms in on a container of 'Spiceology' brand seasoning labeled 'Salt Pepper Garlic Blend,' which is generously sprinkled over the bowl of raw meat. Subsequently, various dried chiles and garlic cloves labeled 'Ancho Chiles, Guajillo Chiles, Chipotle Chiles, Garlic', and water are added into the pot over the fire. Steam rises, indicating the onset of the cooking process.\n\nThe accompanying audio reveals a man speaking energetically, \"Campfire quesadilla tacos \u2013 I've been wanting to make these guys for a while, so let's dive in. Season up some short ribs, adding some chilies and garlic into the skillet to simmer, and then we're going to add those into our food processor along with some seasoning to make that classic taste. Next, I'm going to sear off each side of these short ribs, adding them all in along with the paste, a little bit of beef broth, some white vinegar, a little bit of onion, some bay leaves, and some dark beer. Let that simmer over medium-high heat covered for three and a half to four hours or until fork tender. Next, I'm going to pull them all out, shred them up, get a little piece for myself because why not? And then we're going to strain out that consomm\u00e9, dipping the tortilla in it. I'm going to add it onto a hot skillet along with the shredded beef, some onion and cilantro, and some shredded cheese, flipping it onto itself. We're gonna make that classic quesadilla taco. These were so freaking delicious...\" The sizzling of food in the skillet punctuates his commentary.\n\nHe takes the pot off the fire. The next sequence showcases the soaking of dried chiles and garlic cloves in a reddish liquid, followed by their transfer into a blender. A close-up captures a bowl of diverse spices labeled 'OREGANO, CINNAMON, SALT, PEPPER, GARLIC, CUMIN,' which are subsequently added to the blender. The blending process transforms the mixture into a smooth, reddish-brown sauce. Transitioning back outdoors, the seasoned meat chunks are placed into the black pot, browning on all sides. The episode culminates with the pouring of the freshly blended sauce over the meat in the pot.\n\nIn continuity, the visual focus is on the addition of the reddish-brown sauce into the black pot, which envelopes the browned meat. This is followed by beef broth, white vinegar, sliced onions, bay leaves, and dark beer, each labeled clearly and contributing to the rich, bubbling concoction in the pot. The interaction of these ingredients underlines the complexity of flavor building up within the simmering sauce.\n\nFurther enriching the narrative, the man attentively stirs the contents of the black pot over the fire pit, ensuring coherence in the simmering sauce. The addition of a lid allows the mixture to cook undisturbed. Revealing the succession, the lid is removed, displaying the now tender meat and thick sauce. The meat is transferred into a clear glass bowl, and subsequent shredding with forks shows its tenderness. He tastes it and gives a thumbs up; the taste test confirms the cook's satisfaction.\n\nThe following sequence transitions to straining the liquid from the solid remnants of the sauce, pouring it through a strainer into a clear glass bowl. The black pot is then put back on the fire. The cook dips a tortilla into the strained sauce and places it on a flat cooking surface above the fire. Shredded meat, onions, cilantro, and crumbled Oaxaca cheese are layered onto the tortilla. The tortilla is folded into a taco and pressed down to ensure an even cook.\nConcluding the delightful culinary journey, the final clip highlights pressing down on the golden brown, crispy tortilla filled with shredded meat, onions, cilantro, and melted cheese on a hot, flat surface. A close-up reveals a plate containing three tacos garnished with chopped cilantro, onions, lime wedges, and a small bowl of reddish-brown sauce. The tantalizing image of the taco being dipped into the sauce and the cook's evident enjoyment from taking a bite serves as a rewarding finale."
+ }
+ ],
+ "events": [
+ "Man sets up a black pot over a fire pit made of stacked stones",
+ "Flames flicker vividly as the fire burns",
+ "Man, dressed in green shirt, black shorts, and black shoes, prepares ingredients",
+ "Man puts oil on a bowl of raw meat",
+ "Camera zooms in on 'Spiceology' brand seasoning labeled 'Salt Pepper Garlic Blend'",
+ "Seasoning is generously sprinkled over the bowl of raw meat",
+ "Various dried chiles and garlic cloves labeled 'Ancho Chiles, Guajillo Chiles, Chipotle Chiles, Garlic' are added into the pot",
+ "Water is added into the pot",
+ "Steam rises from the pot indicating cooking process starts",
+ "Energetic man in audio talks about making campfire quesadilla tacos",
+ "Man mentions seasoning short ribs and adding chilies and garlic into skillet to simmer",
+ "Adding ingredients into food processor with seasoning to make a classic taste",
+ "Man mentions searing each side of short ribs, adding paste, beef broth, white vinegar, onion, bay leaves, and dark beer",
+ "Cook simmers mixture over medium-high heat covered for 3.5 to 4 hours until fork tender",
+ "Removes and shreds meat, tastes a piece",
+ "Strains consomm\u00e9, dips tortilla in it, adds to hot skillet with shredded beef, onion, cilantro, shredded cheese",
+ "Cook flips tortilla to make quesadilla taco",
+ "Man removes pot from fire",
+ "Dried chiles and garlic cloves are soaked in reddish liquid",
+ "Chiles and garlic cloves are transferred into a blender",
+ "Close-up of spices labeled 'OREGANO, CINNAMON, SALT, PEPPER, GARLIC, CUMIN' added to blender",
+ "Blending process transforms mixture into smooth, reddish-brown sauce",
+ "Seasoned meat chunks are placed into black pot, browned on all sides",
+ "Freshly blended sauce is poured over meat in pot",
+ "Reddish-brown sauce envelopes browned meat in pot",
+ "Beef broth, white vinegar, sliced onions, bay leaves, and dark beer are added to the pot",
+ "Rich, bubbling concoction in pot is shown",
+ "Man stirs contents of black pot over fire pit, ensuring coherent sauce",
+ "Lid is added to cook mixture undisturbed",
+ "Lid is removed, showing tender meat and thick sauce",
+ "Meat is transferred into a clear glass bowl and shredded with forks",
+ "Cook tastes meat and shows thumbs up",
+ "Cook strains liquid from solid remnants of sauce, pours into clear glass bowl",
+ "Black pot is put back on fire",
+ "Cook dips tortilla into strained sauce, places on flat cooking surface above fire",
+ "Shredded meat, onions, cilantro, and crumbled Oaxaca cheese are layered onto tortilla",
+ "Tortilla is folded into a taco and pressed down to ensure even cook",
+ "Taco is pressed down on hot, flat surface until golden brown and crispy",
+ "Close-up of plate with three tacos garnished with chopped cilantro, onions, lime wedges, and small bowl of reddish-brown sauce",
+ "Cook dips taco into sauce and takes a bite"
+ ]
+ },
+ {
+ "video": "video/152.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with an exterior view of a building on a sunny day. The building has a light brown facade, a wooden awning, and a sign above the entrance. In the foreground, a wooden fence has a plastic cup containing a drink placed on it. Two people are seen sitting outside the building. The audio reveals a man talking about a renowned barbecue joint in Dallas, Texas, called Pecan Lodge, reputedly serving the best Texas barbecue in the city. He humorously adds that he's not looking to start any fights over this claim. The scene transitions indoors to a merchandise area filled with folded t-shirts and various colored caps, underlined by a sign reading 'SWAG'. \n\nThe camera then focuses on a large chalkboard menu listing an array of BBQ items, specials, sides, and desserts, such as 'Turkey,' 'Brisket,' 'Pork Ribs,' 'Pulled Pork,' 'Handmade Sausage,' 'Combo Plates,' 'Sandwiches,' 'Giant Beef Ribs,' 'Burnt Ends,' 'Fried Ribs,' and 'Smokehouse Nachos.' The menu also mentions 'The Trough,' 'Southern Comfort,' 'Hot Mess,' and 'Southern Fried Chicken,' along with multiple sides and desserts. Framed on the wall is a certificate reading 'THE BEST LIST - Best Barbecue in Dallas - Pecan Lodge Catering.' The scene ends with a close-up of a wooden wall sign stating, 'Life is hard, it's harder if you're stupid.' - John Wayne.\n\nThe scene transitions to a close-up of a plate of food, focusing on crinkle-cut pickles atop shredded cabbage. The camera zooms out to reveal a variety of BBQ meats, including a charred beef rib, brisket slices, pulled pork, and sausages, all arranged on a brown paper-lined tray. In the background, a blue plastic cup and a large roll of paper towels are visible on a wooden table. The audio describes how Texas barbecue might be America's greatest food, highlighting a well-known $14 dish called 'Hot Mess,' a loaded sweet potato topped with a mountain of barbecue.\n\nNext, the video displays another tray of food, featuring a loaded baked potato topped with shredded meat, green onions, and creamy sauce, alongside potato salad and macaroni with bacon bits. The audio continues with a man speaking quickly and excitedly, expressing how thoroughly impressed he is with everything, aligning with visuals that showcase the delicious BBQ meats, their texture and char emphasized.\n\nReturning to the loaded baked potato, the camera offers a detailed view, emphasizing the creamy sauce and meat toppings. A black plastic fork digs into the potato, and the camera captures a forkful of the mixture, illustrating the layers of meat and potato. The audio continues with a man describing the dish as a brisket-topped spud with cheese, bacon bits, and chipotle crema, saying it's what he wishes a chipotle burrito would taste like. He mentions that everything there has blown him away.\n\nThe clip ends with a man wearing a white t-shirt and cap, emblazoned with the text 'GETTIN' MORE,' digging into the loaded baked potato amidst a wooden-decor interior with a sign reading 'UP-200.' The man in the audio gives his enthusiastic review, rating the 'Sweet Spud' 9.1, and the video concludes with the visual of the person thoroughly enjoying the meal."
+ }
+ ],
+ "events": [
+ "Exterior view of building on a sunny day",
+ "Building has a light brown facade, a wooden awning, and a sign above the entrance",
+ "Wooden fence in the foreground has a plastic cup with a drink placed on it",
+ "Two people sitting outside the building",
+ "Audio: Man talks about a renowned barbecue joint in Dallas, Texas, called Pecan Lodge",
+ "Audio: Man humorously says he's not looking to start any fights over the claim of best Texas barbecue",
+ "Scene transitions indoors to a merchandise area with folded t-shirts and various colored caps",
+ "Sign reading 'SWAG' in the merchandise area",
+ "Camera focuses on a large chalkboard menu listing various BBQ items, specials, sides, and desserts",
+ "Menu lists items such as 'Turkey,' 'Brisket,' 'Pork Ribs,' 'Pulled Pork,' 'Handmade Sausage,' 'Combo Plates,' 'Sandwiches,' 'Giant Beef Ribs,' 'Burnt Ends,' 'Fried Ribs,' and 'Smokehouse Nachos'",
+ "Menu also lists 'The Trough,' 'Southern Comfort,' 'Hot Mess,' and 'Southern Fried Chicken,' along with multiple sides and desserts",
+ "Framed certificate on the wall reading 'THE BEST LIST - Best Barbecue in Dallas - Pecan Lodge Catering'",
+ "Scene ends with a close-up of a wooden wall sign stating, 'Life is hard, it's harder if you're stupid.' - John Wayne",
+ "Scene transitions to a close-up of a plate of food with crinkle-cut pickles atop shredded cabbage",
+ "Camera zooms out to reveal a variety of BBQ meats on a brown paper-lined tray, including a charred beef rib, brisket slices, pulled pork, and sausages",
+ "In the background, a blue plastic cup and a large roll of paper towels on a wooden table",
+ "Audio: Man describes Texas barbecue as possibly America's greatest food",
+ "Audio: Man highlights a $14 dish called 'Hot Mess,' a loaded sweet potato topped with a mountain of barbecue",
+ "Video displays another tray of food, featuring a loaded baked potato topped with shredded meat, green onions, and creamy sauce, alongside potato salad and macaroni with bacon bits",
+ "Audio: Man speaks quickly and excitedly, expressing his thorough impression with everything",
+ "Camera focuses back on the loaded baked potato, offering a detailed view of creamy sauce and meat toppings",
+ "Black plastic fork digs into the loaded baked potato",
+ "Camera captures a forkful of the potato mixture, showing layers of meat and potato",
+ "Audio: Man describes the dish as a brisket-topped spud with cheese, bacon bits, and chipotle crema",
+ "Audio: Man mentions everything there has blown him away",
+ "Clip ends with a man in a white t-shirt and cap with text 'GETTIN' MORE,' digging into the loaded baked potato",
+ "Man digs into the loaded baked potato amidst a wooden-decor interior with a sign reading 'UP-200'",
+ "Audio: Man gives an enthusiastic review, rating the 'Sweet Spud' 9.1",
+ "Video concludes with a visual of the person thoroughly enjoying the meal"
+ ]
+ },
+ {
+ "video": "video/153.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with an evocative visual of the American flag. This image transitions seamlessly into a bold title card that reads 'JUDGE JESSIE' in illuminated letters. The scene then shifts to a courtroom featuring a golden scale of justice and the continuous appearance of the text 'JUDGE JESSIE.' A well-dressed individual stands at a podium, and the camera cuts to a judge in a black robe behind a bench with the 'JUDGE JESSIE' sign above. The judge is actively engaged, pointing and speaking, while close-up shots capture the reactions and interactions of those involved in the courtroom session.\n\nIn the background, a voice narrates, \"Today on Judge Jessie, Judge Jessie turns up the heat. You signed a piece of paper. He gave you the snake. Then you bring a snake into the home with two small chihuahuas.\" The audio further includes a man speaking over music, transitioning to another man's voice.\n\nThe next sequence focuses on the judge, who is seen speaking and gesturing passionately. The camera zooms into a person in a light-colored suit with a pink shirt and a patterned tie, reflecting their attentive listening. The judge looks to the side, possibly addressing someone off-camera, followed by a cut to a security officer in a light blue uniform standing in front of a bookshelf filled with legal books. The clip concludes by returning to a close-up of the person in the light-colored suit and another individual with long hair in a dark suit and a chain necklace.\n\nIn the courtroom, another dialogue continues, highlighting: \"I had not been made aware that the snake would get rid of its skin. And then the sprinklers came down with the snake skin, ruined my carpet. What set off the sprinklers?\" This audio shifts from one man\u2019s voice to music and then to another man speaking.\n\nThe video resumes with a close-up of the light-colored suited individual at the podium, maintaining the courtroom setting with wooden paneling. The scene momentarily shifts to another 'JUDGE JESSIE' title card against the American flag backdrop. Subsequent scenes show a police officer in a light blue uniform escorting a black-outfitted individual towards a police car outside a columned building. The officer, standing confidently before the building, is accompanied by the text 'POLICE OFFICER.' A courtroom scene features an orange jumpsuit-clad person seated at a table next to another individual in a dark suit.\n\nSimultaneously, the audio narrative mentions: \"unrelated popcorn fire, Your Honor. Judge Jesse, as a police officer, he defended the streets.\" The audio consists of a man\u2019s voice with background music, followed by another man's voice.\n\nThe following frames present a person in a black outfit addressing the court, with another in an orange jumpsuit seated. The courtroom audience watches attentively. A close-up of a dark-suited individual with a blue-striped tie appears with the text 'LEGAL LAW.' The scene transitions to an outdoor setting with a traditional lantern and a martial artist performing moves in a white uniform and black belt, followed by the text 'BLACK BELT.' The video concludes with a bright, overexposed scene.\n\nIn the audio, it's stated: \"As a trial lawyer, he fought for the truth. As a black belt. He mastered discipline.\" This includes the sound of a belt whipping and a man's pained scream, repeatedly.\n\nThe subsequent footage shows a person in a plaid shirt and brown overalls holding a power tool and measuring tape within a workshop. 'CARPENTER' is displayed in golden letters. A hospital scene follows, featuring a person in blood-stained blue scrubs with 'SURGEON' indicated. The final shots of this sequence include a wall of pictures and clippings with the 'COMEDY CENTRAL' logo, lending a comedic or satirical context.\n\nThe audio narration underlines: \"As a carpenter, he learned the value of craftsmanship. As a surgeon, he mastered a cool head.\"\n\nThe concluding scenes retain the string-hung clippings with the 'COMEDY CENTRAL' logo. A blue-outfitted individual interacts with this display. Another person dressed in a green and blue sweater with ripped shorts is seen sitting in a stairwell, using a lighter for drug preparation, hinted by the 'CRACK H' text. The final transition shows a suited person speaking into a microphone in a recording studio, with 'ANNOUNC' reflecting their announcer role.\n\nThe concluding audio notes, \"As a crack holder, he learned to go days without sleep. And as an announcer, he does the voiceover on this show,\" accompanied by background music."
+ }
+ ],
+ "events": [
+ "An evocative visual of the American flag at the beginning.",
+ "Transitions to a title card reading 'JUDGE JESSIE' in illuminated letters.",
+ "Scene shifts to a courtroom featuring a golden scale of justice and the continuous appearance of the text 'JUDGE JESSIE'.",
+ "A well-dressed individual stands at a podium.",
+ "Cuts to a judge in a black robe behind a bench with a 'JUDGE JESSIE' sign above.",
+ "The judge is actively engaged, pointing and speaking.",
+ "Close-up shots capture the reactions and interactions of those involved in the courtroom session.",
+ "Voice narrates, 'Today on Judge Jessie, Judge Jessie turns up the heat. You signed a piece of paper. He gave you the snake. Then you bring a snake into the home with two small chihuahuas.'",
+ "Audio transitions to another man's voice speaking over music.",
+ "Focuses on the judge speaking and gesturing passionately.",
+ "Camera zooms into a person in a light-colored suit, pink shirt, and patterned tie, showing their attentive listening.",
+ "The judge looks to the side, possibly addressing someone off-camera.",
+ "Cut to a security officer in a light blue uniform standing in front of a bookshelf filled with legal books.",
+ "Returns to a close-up of the person in the light-colored suit and another in a dark suit with a chain necklace.",
+ "Dialogue continues in the courtroom: 'I had not been made aware that the snake would get rid of its skin. And then the sprinklers came down with the snake skin, ruined my carpet. What set off the sprinklers?'",
+ "Audio shifts from one man\u2019s voice to music, then to another man's voice.",
+ "Close-up of the light-colored suited individual at the podium in the courtroom setting with wooden paneling.",
+ "Momentary shift to another 'JUDGE JESSIE' title card against the American flag backdrop.",
+ "A police officer in light blue uniform escorts a black-outfitted individual towards a police car outside a columned building.",
+ "Officer standing confidently before the building with the text 'POLICE OFFICER'.",
+ "Courtroom scene shows an orange jumpsuit-clad person seated at a table next to another individual in a dark suit.",
+ "Audio narrative mentions: 'unrelated popcorn fire, Your Honor. Judge Jesse, as a police officer, he defended the streets.'",
+ "Man\u2019s voice with background music, followed by another man's voice.",
+ "Person in a black outfit addresses the court.",
+ "Another person in an orange jumpsuit is seated.",
+ "Courtroom audience watches attentively.",
+ "Close-up of a dark-suited individual with a blue-striped tie with the text 'LEGAL LAW'.",
+ "Outdoor setting with a traditional lantern and a martial artist performing moves in a white uniform and black belt, with the text 'BLACK BELT'.",
+ "Bright, overexposed concluding scene.",
+ "Audio states: 'As a trial lawyer, he fought for the truth. As a black belt. He mastered discipline.'",
+ "Sound of a belt whipping and a man\u2019s pained scream repeatedly.",
+ "Person in a plaid shirt and brown overalls holding a power tool and measuring tape in a workshop, with 'CARPENTER' displayed in golden letters.",
+ "Hospital scene with a person in blood-stained blue scrubs, indicated as 'SURGEON'.",
+ "Final shots include a wall of pictures and clippings with 'COMEDY CENTRAL' logo.",
+ "Audio underlines: 'As a carpenter, he learned the value of craftsmanship. As a surgeon, he mastered a cool head.'",
+ "String-hung clippings with 'COMEDY CENTRAL' logo.",
+ "Blue-outfitted individual interacts with this display.",
+ "Person in a green and blue sweater with ripped shorts sits in a stairwell, using a lighter for drug preparation, hinted by the 'CRACK H' text.",
+ "Final transition to a suited person speaking into a microphone in a recording studio, with 'ANNOUNC' indicating their announcer role.",
+ "Concluding audio notes: 'As a crack holder, he learned to go days without sleep. And as an announcer, he does the voiceover on this show,' accompanied by background music."
+ ]
+ },
+ {
+ "video": "video/154.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a hand holding a colorful patterned cube against a white background featuring the logo 'THECUBICLE', a cube with red, green, and blue sections. The text '$1 Rubik's Cube' appears as a close-up reveals a Rubik's Cube with various colored smiley faces on each square. The Rubik's Cube is manipulated to show different sides filled with these vibrant smiley faces. The steady background music complements a high-pitched voice saying, \"`This is the most affordable Rubik's Cube. It's only a dollar, and you can get it from your local dollar store. What I really love about this cube is that it's really shiny, and it makes me happy.`\"\n\nThe scene transitions and focuses on the Rubik's Cube with blue and red smiley faces on each square. The cube is rotated, revealing various combinations of smileys. The man's voice continues, \"`Because the smiley faces and best of all it turns well sometimes and it can even corner cut,`\" while the consistent backdrop of light gray remains unchanged.\n\nNext, the Rubik's Cube with blue, yellow, green, and pink smiley faces is manipulated forcefully, causing it to break apart and scatter pieces with green, red, and blue smileys. The logo 'THECUBICLE' is visible in the background, along with the voice stating, \"`And it never locks up. The stickers never peel.`\"\n\nThe narrative progresses to a partially disassembled Rubik's Cube with varying colored smiley faces, then shifts to another cube with a white base featuring orange, black, and green squares. The text '$1 Rubik's Cube' is prominent. The scene then shifts to another Rubik's Cube inside a blue and purple box labeled 'GAN12 Maglev', and the text '$80 Rubik's Cube' appears. The box is opened to reveal a white-based Rubik's Cube with red, blue, and green squares, indicating its smooth movements. The audio adds, \"And you can even make sticker mods with it. What's better than that? And this is an $80 Rubik'`s Cube. 
Comes with a super nice case, has a special grippy coat and turns extremely fast. ,`\" amidst upbeat, cheerful music and the voice speaking in a high pitch, occasionally switching to a foreign language.\n\nContinuing, there's a close-up of hands rotating a Rubik's Cube with a white base with colored squares. Massive transition highlights multiple combinations of color squares illustrating the smooth rotations and ends with opening a box to reveal a small white container. The upbeat and fast-tempo electronic music underscores the clear, monotone voice stating, \"`Believe it or not, it can even corner cut better than the other cube. And it turns a little bit better too. Ooh, it comes with AirPod Pros!`\"\n\nIn the final segment, the hands hold a small white container with a transparent lid showing its mechanism against the backdrop of a blue and purple box labeled 'GAN12 Maglev' and 'THECUBICLE' case. The Rubik's Cube is manipulated again. There's a transition to unpacking a blue velvet pouch with 'GANCUBE' logo, revealing its interior. The clip concludes with a disassembled cube next to the pouch, presenting the text, 'Which do you prefer?' The man concludes, \"`Aw, crap. Just tensioning tools. It even has a whole bunch of magnets. And a super clean bag. So which cube do you prefer?`\" The audio continues to play the consistent music in the background.\n\nThrough these clips, the video presents a comparison between a $1 Rubik's Cube and an $80 GAN12 Maglev Rubik's Cube, displaying their features, ease of manipulation, and additional accessories to engage the audience in considering their preferences."
+ }
+ ],
+ "events": [
+ "A hand holding a colorful patterned cube against a white background featuring the logo 'THECUBICLE'.",
+ "The text '$1 Rubik's Cube' appears.",
+ "A close-up reveals a Rubik's Cube with various colored smiley faces on each square.",
+ "The Rubik's Cube is manipulated to show different sides filled with vibrant smiley faces.",
+ "Background music plays.",
+ "A high-pitched voice says, 'This is the most affordable Rubik's Cube. It's only a dollar, and you can get it from your local dollar store. What I really love about this cube is that it's really shiny, and it makes me happy.'",
+ "The scene transitions to focus on the Rubik's Cube with blue and red smiley faces on each square.",
+ "The cube is rotated revealing various combinations of smileys.",
+ "The man's voice continues, 'Because the smiley faces and best of all it turns well sometimes and it can even corner cut,'",
+ "The consistent backdrop of light gray remains unchanged.",
+ "The Rubik's Cube with blue, yellow, green, and pink smiley faces is manipulated forcefully.",
+ "The Rubik's Cube breaks apart scattering pieces with green, red, and blue smileys.",
+ "The logo 'THECUBICLE' is visible in the background.",
+ "The voice states, 'And it never locks up. The stickers never peel.'",
+ "Focus shifts to a partially disassembled Rubik's Cube with varying colored smiley faces.",
+ "Another cube with a white base featuring orange, black, and green squares is shown.",
+ "The text '$1 Rubik's Cube' appears prominently.",
+ "The scene shifts to Rubik's Cube inside a blue and purple box labeled 'GAN12 Maglev'.",
+ "The text '$80 Rubik's Cube' appears.",
+ "The box is opened to reveal a white-based Rubik's Cube with red, blue, and green squares.",
+ "The audio adds, 'And you can even make sticker mods with it. What's better than that? And this is an $80 Rubik's Cube. Comes with a super nice case, has a special grippy coat and turns extremely fast.'",
+ "Upbeat, cheerful music plays with the voice speaking in a high pitch.",
+ "The voice occasionally switches to a foreign language.",
+ "Close-up of hands rotating a Rubik's Cube with a white base with colored squares.",
+ "Massive transition highlighting multiple combinations of color squares illustrating smooth rotations.",
+ "The scene ends with opening a box to reveal a small white container.",
+ "Upbeat, fast-tempo electronic music underscores the clear, monotone voice stating, 'Believe it or not, it can even corner cut better than the other cube. And it turns a little bit better too. Ooh, it comes with AirPod Pros!'",
+ "The final segment shows hands holding a small white container with a transparent lid showing its mechanism.",
+ "The backdrop is a blue and purple box labeled 'GAN12 Maglev' and 'THECUBICLE' case.",
+ "The Rubik's Cube is manipulated again.",
+ "Transition to unpacking a blue velvet pouch with 'GANCUBE' logo revealing its interior.",
+ "The clip concludes with a disassembled cube next to the pouch.",
+ "The text 'Which do you prefer?' appears.",
+ "The man concludes, 'Aw, crap. Just tensioning tools. It even has a whole bunch of magnets. And a super clean bag. So which cube do you prefer?'",
+ "The audio continues to play the consistent music in the background.",
+ "The video presents a comparison between a $1 Rubik's Cube and an $80 GAN12 Maglev Rubik's Cube.",
+ "Features, ease of manipulation, and additional accessories are displayed to engage the audience in considering their preferences."
+ ]
+ },
+ {
+ "video": "video/155.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a portafilter filled with finely ground coffee. A spoonful of light brown sugar is added on top, and the sugar is spread evenly across the surface using the back of the spoon. The portafilter is then placed into a coffee grinder labeled 'BARATZA,' dispensing more coffee grounds. The freshly ground coffee is leveled off with a tool, ensuring an even surface, and a tamper labeled 'normcore' is used to press down the coffee grounds firmly, preparing it for brewing. As these meticulous coffee preparation steps unfold, a man comments on the process: \"`Nothing sounds more like a recipe for disaster than putting brown sugar on top of your coffee grounds before you pull your shot. But that's what a Cafe Cubano is and we have to try it.`\"\n\nThe scene transitions to another close-up of a portafilter filled with finely ground coffee. A hand adds and spreads a spoonful of light brown sugar evenly across the surface. The portafilter, now with evenly distributed sugar, is pressed down firmly using the tamper labeled 'normcore,' ready for brewing. A voice explains over some background music: \"`You tamp your espresso grounds, then you add a couple scoops of brown sugar on top, level them out, and then tamp on top of the brown sugar again. It comes out looking like this.`\"\n\nIn the next scene, the portafilter filled with coffee and sugar is seen with a coffee grinder and a container of sugar in the background. A hand inserts the portafilter into an espresso machine labeled 'ascaso' and locks it into place. The brewing process begins, and rich, dark espresso flows into a glass, creating a visually satisfying stream. The accompanying audio reveals someone's thoughts: \"And that'`s what we're gonna pull our shot of espresso with. I've never been an 'add sweetener to my coffee' kind of guy, but this looks so satisfying. 
I figured I would enjoy the process of doing it even if I hate the way that it tastes.\"\n\nThe narrative continues as we watch the espresso machine brew coffee into a small amber-colored glass. The rich dark espresso flows smoothly, forming a light crema on top. A hand holds the glass, showcasing the freshly brewed espresso before a man in a blue shirt takes a sip, savoring the drink. In the background, a voice hesitantly admits: \"I mean, that looks beautiful. Okay, let's give this a try.`\"\n\nThe final scene features the same man in a blue shirt, holding an amber-colored glass of espresso. Gesturing with their hand and appearing engaged in conversation, the man continues to explain something. The warmly lit indoor setting, with visible ceiling fan and light fixtures, adds to the intimate atmosphere. As he sips the espresso and emphasize their points with hand gestures, the man reflects: \"`That is very sweet, but I kind of like it. It's like a little dessert.`\"\n\nOverall, the video captures the detailed process of making a sweetened espresso drink known as Caf\u00e9 Cubano while providing a narrative that navigates through initial skepticism to a somewhat appreciative conclusion."
+ }
+ ],
+ "events": [
+ "Close-up of a portafilter filled with finely ground coffee",
+ "A spoonful of light brown sugar is added on top of the coffee grounds",
+ "The sugar is spread evenly across the surface using the back of the spoon",
+ "Portafilter is placed into a coffee grinder labeled 'BARATZA'",
+ "The grinder dispenses more coffee grounds into the portafilter",
+ "Freshly ground coffee is leveled off with a tool",
+ "A tamper labeled 'normcore' is used to press down the coffee grounds firmly",
+ "A man comments on the process: 'Nothing sounds more like a recipe for disaster than putting brown sugar on top of your coffee grounds before you pull your shot. But that's what a Cafe Cubano is and we have to try it.'",
+ "Transition to another close-up of a portafilter filled with finely ground coffee",
+ "A hand adds and spreads a spoonful of light brown sugar evenly across the surface",
+ "Portafilter with evenly distributed sugar is pressed down firmly using a tamper labeled 'normcore'",
+ "A voice explains over background music: 'You tamp your espresso grounds, then you add a couple scoops of brown sugar on top, level them out, and then tamp on top of the brown sugar again. It comes out looking like this.'",
+ "Portafilter filled with coffee and sugar is shown with a coffee grinder and a container of sugar in the background",
+ "A hand inserts the portafilter into an espresso machine labeled 'ascaso' and locks it into place",
+ "Brewing process starts, and rich, dark espresso flows into a glass",
+ "Accompanying audio reveals someone's thoughts: 'And that\u2019s what we're gonna pull our shot of espresso with. I've never been an 'add sweetener to my coffee' kind of guy, but this looks so satisfying. I figured I would enjoy the process of doing it even if I hate the way that it tastes.'",
+ "Espresso machine brews coffee into a small amber-colored glass",
+ "Rich dark espresso flows smoothly, forming a light crema on top",
+ "A hand holds the glass, showing the freshly brewed espresso",
+ "A man in a blue shirt takes a sip, savoring the drink",
+ "In the background, a voice hesitantly admits: 'I mean, that looks beautiful. Okay, let's give this a try.'",
+ "Final scene features the man in a blue shirt holding an amber-colored glass of espresso",
+ "Man gestures with his hand and appears engaged in conversation",
+ "Warmly lit indoor setting with visible ceiling fan and light fixtures creates an intimate atmosphere",
+ "Man sips espresso and emphasizes points with hand gestures",
+ "Man reflects: 'That is very sweet, but I kind of like it. It's like a little dessert.'"
+ ]
+ },
+ {
+ "video": "video/156.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a gray background displaying the words 'These are the top 3 ever made' one by one in black text. Transitioning to a high-tech helmet with a dark visor, a thermal image appears on a small screen inside the visor. Then, a man is seen holding a large firearm with a drum magazine, aiming it outdoors in a wooded area. The scene changes to display two tactical knives with tan and black handles, one with an orange tip. Another high-tech helmet with a camouflage strap is shown, labeled '#1' and 'Nexo Python' in white text. The video then shows a webpage displaying 'ASG Strike Systems Tactical Airsoft Goggles - Clear' priced at $24.00, followed by a thermal image viewed through the goggles, highlighting a person in a forested area with a red arrow pointing at the image.\n\nAs the visuals unfold, a man's voice narrates: \"These are the top three weirdest airsoft accessories ever made. First up, we have the Nexo Python airsoft mask. And while this might look like a regular mask, it actually has a thermal camera built into it to see if enemies are hiding behind bushes...\" Some popping sounds and typing sounds interrupt the narration briefly.\n\nNext, displayed is a thermal image of a person holding a firearm, aiming it in a wooded area. The high-tech screen shows the time '4.53 AM' and a battery icon. The scene transitions to an outdoor urban setting, showing a building's side with several large glass panels and a trash bin further down the alley. A person in camouflage gear holds a firearm and communicates through a device. A digital display next to the person shows '3:00 PM' and the date '03.04', followed by various icons appearing, including a location pin, power buttons, and other symbols. Text '$865' appears sequentially. 
The frame changes to a black background with '#2' in white and ends with a partial view of a high-tech helmet.\n\nThe narration continues: \"...a regular camera to record exactly what you're seeing, and a heads-up display to see exactly where your teammates are. The price of all of that? $865. \" accompanied by the sound of a cash register ringing.\n\nThe transition leads to a black background displaying '#2' in white, followed by a close-up of a firearm with a black rubber bayonet attached, featuring a red tip. The text 'rubber bayo' appears, followed by 'rubber bayonet'. A different firearm is shown with a bayonet attached, this time with a brown handle and tan strap. A black-and-white graphic of a ninja performing a high kick follows. The final scene shows '#3' in white on a black background, then a close-up of a black drum magazine with 'Shotgun to M4' text at the bottom of the screen.\n\nThe man speaks: \"In at number two, we have the rubber bayonet, and I'm not exactly sure why anyone would ever want to use this, but I guess if you want to practice your ninja skills, this might be for you. Coming in at number three, we have a shotgun to M4 drum mag adapter.\" The sound of kicking in the air can be heard.\n\nA man outdoors in a wooded area holds a firearm equipped with a drum magazine, wearing gloves and a dark long-sleeve shirt. 
The scene shows a webpage displaying 'Matrix 30 Round Shotgun Shell Magazines for Spring Powered Airsoft Shotguns (Color: Red / Pack of 6 Shells)' priced at $14.00, followed by another webpage showing 'A&K 3000rd Electric C-Mag Drum for M4 M16 Series Airsoft AEG (Sound Control / Auto Winding)' priced at $60.00, with a red arrow pointing to the product name.\n\nThe narration concludes: \"It looks a little odd, but this thing is actually super cool because a normal shotgun only holds 30 BBs, whereas this drum mag holds 3,000.\" \n\nThe video narrative and audio complement each other, detailing the top three weirdest airsoft accessories with corresponding visual demonstrations, time stamps, price points, and tactical uses, creating a thoroughly engaging and informative description."
+ }
+ ],
+ "events": [
+ "Gray background displaying 'These are the top 3 ever made' one by one in black text",
+ "High-tech helmet with a dark visor shown",
+ "Thermal image appears on a small screen inside the visor",
+ "Man holding a large firearm with a drum magazine, aiming it outdoors in a wooded area",
+ "Two tactical knives with tan and black handles displayed, one with an orange tip",
+ "Another high-tech helmet with a camouflage strap shown, labeled '#1' and 'Nexo Python' in white text",
+ "Webpage displaying 'ASG Strike Systems Tactical Airsoft Goggles - Clear' priced at $24.00",
+ "Thermal image viewed through goggles, highlighting person in forested area with a red arrow pointing at the image",
+ "Narration mentions top three weirdest airsoft accessories ever made",
+ "Narrator introduces Nexo Python airsoft mask with a thermal camera",
+ "Popping sounds and typing sounds interrupt narration briefly",
+ "Thermal image of person holding a firearm, aiming in wooded area, shown with '4.53 AM' time and battery icon",
+ "Outdoor urban setting displayed, showing a building's side with several large glass panels and a trash bin",
+ "A person in camouflage gear holds a firearm and communicates through a device",
+ "Digital display next to person shows '3:00 PM' and '03.04', followed by various icons appearing",
+ "Text '$865' appears sequentially",
+ "Black background with '#2' in white displayed",
+ "Partial view of a high-tech helmet shown",
+ "Narrator mentions helmet's camera to record and heads-up display showing teammates' locations",
+ "Cash register ringing sound",
+ "Black background displaying '#2' in white",
+ "Close-up of a firearm with a black rubber bayonet attached, featuring a red tip",
+ "Text 'rubber bayo' and 'rubber bayonet' appears",
+ "Different firearm with brown handle and tan strap shown with attached bayonet",
+ "Black-and-white graphic of a ninja performing a high kick",
+ "Black background with '#3' in white",
+ "Close-up of black drum magazine with 'Shotgun to M4' text at the bottom of the screen",
+ "Narrator introduces rubber bayonet and mentions practicing ninja skills",
+ "Narrator introduces shotgun to M4 drum mag adapter",
+ "Sound of kicking in the air",
+ "Man outdoors in wooded area holding a firearm equipped with a drum magazine, wearing gloves and dark long-sleeve shirt",
+ "Webpage displaying 'Matrix 30 Round Shotgun Shell Magazines for Spring Powered Airsoft Shotguns (Color: Red / Pack of 6 Shells)' priced at $14.00",
+ "Another webpage showing 'A&K 3000rd Electric C-Mag Drum for M4 M16 Series Airsoft AEG (Sound Control / Auto Winding)' priced at $60.00 with red arrow pointing to product name",
+ "Narrator mentions shotgun holding 30 BBs versus drum mag holding 3,000",
+ "Narration and audio complement with visual demonstrations, time stamps, price points, and tactical uses"
+ ]
+ },
+ {
+ "video": "video/157.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a black-and-white photograph of a woman wearing a crown, fine jewelry, and a lace veil. Yellow text at the bottom reads: \"The reign of Victoria has begun.\" A male voice explains: \"Queen Victoria's reign began in 1837 and lasted until her death in 1901.\" The picture and text remain the same, with the addition of \"1837, continued until the Queen's death\" in yellow font to reinforce the narrative.\nTransition to a colorful painting of a woman sitting on a throne wearing a red and white gold embroidered robe with a crown on her head. The text \"Later in 1901\" is shown in yellow. The audio continues, \"And now, at the age of 18, Alexandrina Victoria must ascend the throne,\" synced with the yellow visual text \"And Now it's only 18.\"\nThe image cuts to a young girl in a white dress and blue sash, holding flowers, with the words \"Alexandrina Victoria\" written in yellow at the bottom of the image.\nThe audio switches between male and female voices, and the narrator picks up with a woman: \"Queen Victoria was fifth in line when she was born...\" Then a picture appeared of a little girl in a white dress with a blue belt and flowers. Next to her, another child in a white dress and a blue sash sat on the ground. The text reads \"Must ascend to the throne\" in yellow. The yellow text continues \"She was fifth when she was born.\" \"So from the beginning, it was unlikely that she would get the crown,\" the audio says along with the corresponding text.\nThe video then shows a drawing of a child wearing a black coat decorated with fur and a large hat decorated with feathers, with the words \"She is not actually likely to get the crown\" written in yellow. 
The voiceover echoes: \"But then, one by one, her family suddenly died.\" This is synchronized with the text of the last frame, \"But one by one, all her family members,\" appears.\nIn the next scene, a painting shows a woman in a white dress with rolled-up sleeves, accompanied by audio, \"So around 1830, Victoria was only 11 years old, and she was already next in line to the throne.\" Meanwhile, the words \"sudden departure\" are shown in yellow. The image of the woman wearing a white dress, blue sash, and jewelry was accompanied by the caption \"Victoria is only 11 years old and already\", followed by \"She is next in line to the throne\". The voiceover ends with, \"Victoria was also raised,\" which is intuitively reflected in the yellow text of a person wearing a crown in the following scene.\nThe next segment cuts to a woman in a red and white robe with gold embroidery, holding a scepter and orb in her hands, and a crown on her head. The text \"Under the Kensington System\" is shown in yellow, and the audio reads, \"Victoria's mother, Victoria, Duchess of Kent, created this system to control her daughter.\" This is consistent with the visual transition. A woman in a black dress and a pearl necklace stands next to a child in a white dress and a blue sash, holding a small portrait. As the audio continues, \"She literally isolated her from birth,\" and the corresponding yellow text appears.\nIn the evolving narrative, a woman in a white dress with puffy sleeves stands at a table covered in red cloth and flowers. \"Her mother did it to keep her pure,\" the voice-over says. The text reflects the narrative. As the saying goes \"Her mother will monitor every movement, including who she can,\" with audio synced and visual frames kept aligned. 
Then the frame says, \"See or speak anytime.\" In the final scene of the video, a woman wearing a black dress with lace details reads \"She shares a room with her mother,\" completing the same narrative with audio.\nThe final segment opens with a woman in a dark dress adorned with intricate lace details and gold earrings with the yellow text, \"Yes, she really can't,\" in line with the male narration. The scene cuts to a woman in a white dress wearing a crown, and the words \"Victoria reflects on her childhood\" appear as the narrative progresses. The final image depicts a man in a red military uniform with MEDALS and the text \"manipulating her mother,\" underlining the solemn conclusion, with a narrator saying: \"Victoria confirms that she hates John Conroy manipulating her mother.\"\nThroughout the video, the synchronized interplay of historical visuals and narratives accurately encapsulates Queen Victoria's early life and ascent to the throne, against a backdrop of carefully arranged voiceovers and written descriptions."
+ }
+ ],
+ "events": [
+ "Opening scene with a black-and-white photograph of a woman wearing a crown, fine jewelry, and a lace veil.",
+ "Yellow text at the bottom reads: 'The reign of Victoria has begun.'",
+ "Male voice explains: 'Queen Victoria's reign began in 1837 and lasted until her death in 1901.'",
+ "Addition of the text '1837, continued until the Queen's death' in yellow font.",
+ "Transition to a colorful painting of a woman sitting on a throne wearing a red and white gold embroidered robe with a crown on her head.",
+ "Text 'Later in 1901' is shown in yellow.",
+ "Audio continues: 'And now, at the age of 18, Alexandrina Victoria must ascend the throne.'",
+ "Yellow visual text 'And Now it's only 18.'",
+ "Cut to an image of a young girl in a white dress and blue sash, holding flowers, with text 'Alexandrina Victoria' in yellow at the bottom.",
+ "Audio switches between male and female voices; female voiceover: 'Queen Victoria was fifth in line when she was born...'",
+ "Picture of a little girl in a white dress with a blue belt and flowers appears.",
+ "Next to her, another child in a white dress and a blue sash sat on the ground.",
+ "Text reads 'Must ascend to the throne' in yellow.",
+ "Yellow text continues 'She was fifth when she was born.'",
+ "Voiceover: 'So from the beginning, it was unlikely that she would get the crown,' with corresponding text.",
+ "Drawing of a child wearing a black coat decorated with fur and a large hat decorated with feathers.",
+ "Words 'She is not actually likely to get the crown' are shown in yellow.",
+ "Voiceover echoes: 'But then, one by one, her family suddenly died,' synchronized with the text 'But one by one, all her family members,' appears.",
+ "Scene with a painting of a woman in a white dress with rolled-up sleeves.",
+ "Voiceover: 'So around 1830, Victoria was only 11 years old, and she was already next in line to the throne.'",
+ "Words 'sudden departure' are shown in yellow.",
+ "Image of a woman wearing a white dress, blue sash, and jewelry with the caption 'Victoria is only 11 years old and already,' followed by 'She is next in line to the throne.'",
+ "Voiceover ends with: 'Victoria was also raised,' reflected in the yellow text of a person wearing a crown in the following scene.",
+ "Segment cuts to a woman in a red and white robe with gold embroidery, holding a scepter and orb in her hands, and a crown on her head.",
+ "Text 'Under the Kensington System' is shown in yellow.",
+ "Audio: 'Victoria's mother, Victoria, Duchess of Kent, created this system to control her daughter,' consistent with the visual transition.",
+ "Woman in a black dress and a pearl necklace stands next to a child in a white dress and a blue sash, holding a small portrait.",
+ "Audio continues: 'She literally isolated her from birth,' with corresponding yellow text.",
+ "Woman in a white dress with puffy sleeves stands at a table covered in red cloth and flowers.",
+ "Voiceover: 'Her mother did it to keep her pure,' reflected in the text.",
+ "Frame: 'Her mother will monitor every movement, including who she can,' with audio synced.",
+ "Frame says: 'See or speak anytime.'",
+ "Final scene with a woman wearing a black dress with lace details; text reads 'She shares a room with her mother,' completing the narrative with audio.",
+ "Final segment with a woman in a dark dress adorned with intricate lace details and gold earrings; text, 'Yes, she really can't,' in line with the male narration.",
+ "Scene cuts to a woman in a white dress wearing a crown; text, 'Victoria reflects on her childhood.'",
+ "Final image depicts a man in a red military uniform with MEDALS; text, 'manipulating her mother,' with a narrator saying: 'Victoria confirms that she hates John Conroy manipulating her mother.'"
+ ]
+ },
+ {
+ "video": "video/158.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a man walking outdoors on a sunny day wearing a grey 'Oswego Rugby' t-shirt and a backpack. The background features a clear blue sky, some greenery, and buildings. The scene shifts to an aerial view from an airplane, showing the wing and a sky filled with clouds. The man continues walking and talking. The clip concludes with the man in a different outdoor setting, accompanied by the text \"TODAY I'M FLYING ON THE WORLD'S CHEAPEST AIRLINE, BUT NOT ONLY ARE THEY THE WORLD'S CHEAPEST AIRLINE, THEY HAVE ONE OF THE WORLD'S STRICTEST BAG POLICIES. NOW TODAY'S INTERNATIONAL FLIGHT FROM BOSNIA AND HERZEGOVINA TO GERMANY IS $30.\"\n\n\n\nThe next scene is a close-up of a laptop screen displaying a flight search page. A hand points to details like the departure city 'Sarajevo' and the destination 'Memmingen (FMM)' with the date 'Thu, Jul 21.' Transitioning back to the man outdoors, the screen displays text: \"BUT IF I WANTED TO BRING THIS BAG. THE BAG ALONE WOULD HAVE BEEN $70. WE JUST HAVE 2 HURDLES TO JUMP THROUGH FIRST. I HAVE TO GET MY BOARDING PASS.\" The clip ends with a close-up of a boarding pass for 'Wizz Air,' displaying flight number '4637,' seat '11-F,' and the name 'GROSSI/CHRISTIANE.'\n\n\nNext, a close-up of a 'Wizz Air' boarding pass shows details such as the passenger name 'GROSSI/CHRISTIANMR,' from Sarajevo to FMM, flight number 'W6 4637,' departure date '21JUL,' boarding time '1245,' and seat '11-F.' The text \"BECAUSE I'M NOT A CITIZEN HERE, SO THEY DON'T ISSUE IT ONLINE\" appears. The scene shifts outdoors, with the text continuing, and the man explains they can\u2019t get the boarding pass online due to non-citizenship. The scene changes to walking, describing two hurdles: getting the boarding pass and checking the bag size. A close-up of a bag check box shows text \"1 BAG ONLY non WIZZ Priority\" and \"1 BAG WIZZ Priority.\" The man points to the box, indicating the bag check process.
The final frame shows the man indoors, explaining that two people ahead of them were told their bags were too big. Screen text reads \"AND THE SECOND ONE IS THE BAG CHECK ITS BOX. OKAY SO THEY TOLD 2 PEOPLE IN FRONT OF ME THAT THEIR BAGS WERE TOO BIG BUT I GOT THROUGH.\"\n\n\n\nInside, the man holds a paper and speaks to the camera. The text \"SO WE HAVE ONE MORE BAGGAGE PERSON LEFT\" appears. A close-up follows, showing a large grey and orange backpack on the floor in front of a red refrigerator. The frame shifts to the smaller blue and grey backpack held in front of the larger one. Text reads \"AND JUST TO GIVE YOU PERSPECTIVE. HERE'S THE SIZE OF MY 45L BACKPACK. HERE'S THE SIZE OF THE BAG THAT WIZZ AIR SAYS I CAN BRING ON THE PLANE.\" The clip ends with a close-up of the smaller backpack, highlighting the size difference.\n\n\n\nThe scene transitions to a close-up of the man in the grey t-shirt going through the boarding procedure and speaking to the camera, with the text \"ALL RIGHT WE JUST HAVE ONE MORE PERSON TO GET PAST. WE HAVE FINESSED WHIZZ AIRS. BAG POLICY. I WILL SAY I HAVE SEEN THEM SPOT CHECK BAG SIZES AT THE GATE, SO...I DON'T KNOW. I DID'T HAVE TO PAY $70 FOR A BAG. THAT WOULD HAVE BEEN OUTRAGEOUS. I WOULD HAVE BEEN SO MAD IF I HAVE TO DO THAT. TAKE THAT WIZZ AIR! I'VE DEFEATED YOUR BACK POLICY.\""
+ }
+ ],
+ "events": [
+ "Man walks outdoors on a sunny day wearing a grey 'Oswego Rugby' t-shirt and a backpack.",
+ "Background features a clear blue sky, some greenery, and buildings.",
+ "Scene shifts to an aerial view from an airplane, showing the wing and sky filled with clouds.",
+ "Man continues walking and talking.",
+ "Text: 'TODAY I'M FLYING ON THE WORLD'S CHEAPEST AIRLINE, BUT NOT ONLY ARE THEY THE WORLD'S CHEAPEST AIRLINE, THEY HAVE ONE OF THE WORLD'S STRICTEST BAG POLICIES. NOW TODAY'S INTERNATIONAL FLIGHT FROM BOSNIA AND HERZEGOVINA TO GERMANY IS $30.'",
+ "Next scene: close-up of a laptop screen displaying a flight search page.",
+ "Hand points to details like the departure city 'Sarajevo' and the destination 'Memmingen (FMM)' with the date 'Thu, Jul 21.'",
+ "Transition back to the man outdoors.",
+ "Screen displays text: 'BUT IF I WANTED TO BRING THIS BAG. THE BAG ALONE WOULD HAVE BEEN $70. WE JUST HAVE 2 HURDLES TO JUMP THROUGH FIRST. I HAVE TO GET MY BOARDING PASS.'",
+ "Close-up of a boarding pass for 'Wizz Air' displaying flight number '4637,' seat '11-F,' and name 'GROSSI/CHRISTIANE.'",
+ "Close-up of a 'Wizz Air' boarding pass shows details: passenger name 'GROSSI/CHRISTIANMR,' from Sarajevo to FMM, flight number 'W6 4637,' departure date '21JUL,' boarding time '1245,' and seat '11-F.'",
+ "Text: 'BECAUSE I'M NOT A CITIZEN HERE, SO THEY DON'T ISSUE IT ONLINE.'",
+ "Scene shifts outdoors, text continues.",
+ "Man explains they can\u2019t get the boarding pass online due to non-citizenship.",
+ "Scene changes to walking, describing two hurdles: getting the boarding pass and checking the bag size.",
+ "Close-up of a bag check box shows text: '1 BAG ONLY non WIZZ Priority' and '1 BAG WIZZ Priority.'",
+ "Man points to the box, indicating the bag check process.",
+ "Final frame shows the man indoors, explaining that two people ahead of them were told their bags were too big.",
+ "Screen text: 'AND THE SECOND ONE IS THE BAG CHECK ITS BOX. OKAY SO THEY TOLD 2 PEOPLE IN FRONT OF ME THAT THEIR BAGS WERE TOO BIG BUT I GOT THROUGH.'",
+ "Inside, man holds a paper and speaks to the camera.",
+ "Text: 'SO WE HAVE ONE MORE BAGGAGE PERSON LEFT.'",
+ "Close-up shows a large grey and orange backpack on the floor in front of a red refrigerator.",
+ "Frame shifts to the smaller blue and grey backpack held in front of the larger one.",
+ "Text: 'AND JUST TO GIVE YOU PERSPECTIVE. HERE'S THE SIZE OF MY 45L BACKPACK. HERE'S THE SIZE OF THE BAG THAT WIZZ AIR SAYS I CAN BRING ON THE PLANE.'",
+ "Close-up of the smaller backpack, highlighting the size difference.",
+ "Scene transitions to a close-up of the man in the grey t-shirt going through the boarding procedure and speaking to the camera.",
+ "Text: 'ALL RIGHT WE JUST HAVE ONE MORE PERSON TO GET PAST. WE HAVE FINESSED WHIZZ AIRS. BAG POLICY. I WILL SAY I HAVE SEEN THEM SPOT CHECK BAG SIZES AT THE GATE, SO...I DON'T KNOW. I DID'T HAVE TO PAY $70 FOR A BAG. THAT WOULD HAVE BEEN OUTRAGEOUS. I WOULD HAVE BEEN SO MAD IF I HAVE TO DO THAT. TAKE THAT WIZZ AIR! I'VE DEFEATED YOUR BACK POLICY.'"
+ ]
+ },
+ {
+ "video": "video/159.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video features a man in a light blue long-sleeve shirt and wristwatch, sitting at a desk with a light-colored surface. The background includes blue curtains and a calming blue light. The man speaks animatedly, using hand gestures to emphasize his points. The text 'YOU DON'T NEED TO BE STUDYING 11, 12, 13 H/DAY' appears on the screen in the beginning. The voiceover states, \"I'm telling you right now, 99% of you don't need to be studying 11, 12, 13 hours a day.\"\n\nAs the video progresses, the text changes to 'STUDYING SMARTER RATHER THAN HARDER.' in white and green. The speaker continues, explaining, \"You should be aiming to decrease the number of hours you're studying, but studying smarter rather than harder. And here's three ways you can do that.\" The clip introduces the Pareto Principle with the text 'TIP1 PARETO PRINCIPLE.' The speaker elaborates, \"The Pareto principle states that 80% of the output is a result of 20% of the input.\" On-screen text reinforces this by stating, '80% OF THE OUTPUT IS A RESULT OF 20% OF THE INPUT.' in white and red.\n\nThe man further explains how to apply this principle, with the text 'DOUBLE DOWN ON THE TASKS THAT MAKE UP THIS 20%' in white and green. The voiceover says, \"So 80% of your grade was probably as a result of 20% of your studying, so you need to double down on the tasks that are making up this 20%.\" The video then shifts focus to 'TIPS2 SPACED REPETITION,' defined by on-screen text, which the voiceover explains as \"Studying one subject for eight hours in a day can often be a waste of time.\"\n\nThe speaker advises that \"A far more effective way can be to study that subject for one hour every day for eight days.
This takes advantage of the spacing effect, where it gives your brain enough time to process the information for long-term memory.\" The environment transitions to show the man, now in a black shirt with a white logo, reading an open book on a grey couch with a laptop in front of him, signifying focused study in a relaxing setting.\n\nBack in the original setting, the subject continues discussing tips, leading to 'TIPS 3 ADVANCED INFORMATION PROCESSING' being displayed on screen. The speaker says, \"When you're revising for your exams, you don't consume the material passively. It is a waste of time,\" reinforced by the text 'DON'T CONSUME MATERIALS PASSIVELY.'\n\nTowards the end, the scene changes again, showing the man now in a dark gray shirt with an orange wall and whiteboard backdrop, deeply engaged with Seth Godin's 'The Icarus Deception.' The next scene shows the man highlighting text with a pink highlighter and typing on a laptop before returning to the original blue setting. The voiceover concludes, \"You need to be processing the information so that it can be stored in your long-term memory. And that is three ways to study smarter rather than harder.\"\n\nThroughout the video, the calm and informative tone is maintained, with clear and professional guidance on efficient study techniques, making the educational message compelling and accessible."
+ }
+ ],
+ "events": [
+ "A man in a light blue long-sleeve shirt and wristwatch sits at a desk.",
+ "The desk has a light-colored surface.",
+ "Background includes blue curtains and a calming blue light.",
+ "The man speaks animatedly, using hand gestures.",
+ "Text on screen: 'YOU DON'T NEED TO BE STUDYING 11, 12, 13 H/DAY.'",
+ "Voiceover states: 'I'm telling you right now, 99% of you don't need to be studying 11, 12, 13 hours a day.'",
+ "Text changes to: 'STUDYING SMARTER RATHER THAN HARDER.' in white and green.",
+ "Speaker explains the concept of studying smarter rather than harder.",
+ "Speaker mentions three ways to study smarter.",
+ "Introduction of the Pareto Principle with text: 'TIP1 PARETO PRINCIPLE.'",
+ "Speaker elaborates on the Pareto Principle.",
+ "Text on screen reinforces: '80% OF THE OUTPUT IS A RESULT OF 20% OF THE INPUT.' in white and red.",
+ "Text appears: 'DOUBLE DOWN ON THE TASKS THAT MAKE UP THIS 20%' in white and green.",
+ "Voiceover advises to focus on the 20% of tasks that yield 80% of the results.",
+ "Shift focus to 'TIPS2 SPACED REPETITION' with text on screen.",
+ "Voiceover explains Spaced Repetition method for studying.",
+ "Speaker advises studying one subject for one hour every day for eight days.",
+ "Transition to man in black shirt with white logo, reading an open book on a grey couch with laptop.",
+ "Man appears in a relaxing setting, representing focused study.",
+ "Return to original setting for further discussion.",
+ "Introduction of 'TIPS3 ADVANCED INFORMATION PROCESSING' with text on screen.",
+ "Speaker advises against passive consumption of material.",
+ "Text on screen: 'DON'T CONSUME MATERIALS PASSIVELY.'",
+ "Scene changes to man in dark gray shirt, orange wall, whiteboard backdrop.",
+ "Man engaged with book 'The Icarus Deception' by Seth Godin.",
+ "Scene shows man highlighting text with a pink highlighter.",
+ "Man types on a laptop.",
+ "Return to original blue setting.",
+ "Voiceover concludes by summarizing the three ways to study smarter.",
+ "Video maintains a calm and informative tone throughout.",
+ "Clear and professional guidance on efficient study techniques."
+ ]
+ },
+ {
+ "video": "video/160.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a man standing indoors, holding a microphone emblazoned with the 'CBC News' logo. The man, wearing a light blue button-up shirt, stands to the left while a gray cat rests on a beige platform of a cat tree to the right. The setting includes a gray wall and a glass door that reflects the man. The man speaks, occasionally moving his hand.\n\n\"I'm Alex LeDuc. The West Island SPCA is issuing an urgent plea to the public for donations because a lack of funds has them facing the possibility of shutting down. I'll have that story coming up at six.\"\n\nThe scene flashes, and the same man, now in a gray button-up shirt, appears in a similar setting. The same gray cat is still lying on the cat tree, and the glass door continues to reflect him.\n\n\"Poor puppies,\" the man says, after which he resumes speaking.\n\nIn the following clip, the man maintains his position, microphone in hand and gray button-up still on. The cat wakes and begins to groom itself, shifting from its restful position. The backdrop remains consistent, with the gray wall and glass door's reflection.\n\n\"How much money the West Island SPCA says they need to stay open...\"\n\nThe man seems about to say something but trails off into silence, moving around and seeming awkward. He glances at his phone occasionally. He raises his microphone and repeats his sentence, but is interrupted by the cat, which sniffles.\n\nThe man turns his head around and looks at the cat, which stands on the platform.\n\n\"Oh, bless you.\" The man sniffles. \"That story coming up...\"\n\nConcluding the video, the man remains in the same setting, still holding the microphone. The cat is now sitting upright, grooming itself consistently. The lighting, wall, and reflections retain their uniform appearance throughout.\n\n\"Tonight on CBC News, Montreal at 5. Tonight on CBC News, Montreal at 6.
Tonight on CBC News, Montreal at 11.\" The scene then flashes several times, and the man repeats himself for a while.\n\nThus, the video weaves a coherent story of an urgent SPCA funding plea, smoothly transitioning through consistent visuals and synchronized audio, highlighting crucial segments of a news broadcast featuring Alex LeDuc and an ever-serene gray cat."
+ }
+ ],
+ "events": [
+ "Man standing indoors holding a microphone with 'CBC News' logo.",
+ "Man wearing a light blue button-up shirt.",
+ "Gray cat resting on a beige platform of a cat tree to the right.",
+ "Setting includes a gray wall and a glass door reflecting the man.",
+ "Man speaks and occasionally moves his hand.",
+ "Audio: 'I'm Alex LeDuc. The West Island SPCA is issuing an urgent plea to the public for donations because a lack of funds has them facing the possibility of shutting down. I'll have that story coming up at six.'",
+ "Scene flashes.",
+ "Same man now in a gray button-up shirt.",
+ "Similar setting with gray cat still lying on the cat tree.",
+ "Glass door continues to reflect him.",
+ "Audio: 'Poor puppies,'",
+ "Man resumes speaking.",
+ "Man maintains his position, microphone in hand, now in gray button-up shirt.",
+ "Cat wakes and begins to groom itself.",
+ "Backdrop remains consistent with gray wall and glass door's reflection.",
+ "Audio: 'How much money the West Island SPCA says they need to stay open...'",
+ "Man seems about to say something but trails off into silence.",
+ "Man moves around, seeming awkward.",
+ "Man glances at his phone occasionally.",
+ "Man raises his microphone and repeats his sentence, but the cat's sniffle interrupts him.",
+ "Man turns his head to look at the cat.",
+ "Cat stands on the platform.",
+ "Audio: 'Oh, bless you.'",
+ "Man sniffles.",
+ "Audio: 'That story coming up...'",
+ "Man remains in the same setting, still holding the microphone.",
+ "Cat now sitting upright, grooming itself consistently.",
+ "Lighting, wall, and reflections remain uniform.",
+ "Audio: 'Tonight on CBC News, Montreal at 5. Tonight on CBC News, Montreal at 6. Tonight on CBC News, Montreal at 11.'",
+ "Scene flashes several times.",
+ "Man repeats himself for a while."
+ ]
+ },
+ {
+ "video": "video/161.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a close-up of a man wearing a white striped dress shirt, preparing to tie a blue floral-patterned necktie. The man carefully adjusts the tie, with one end longer than the other, crossing the longer end over the shorter end to form an 'X' shape. As he manipulates the tie into a knot, a small instructional graphic in the top left corner provides additional guidance with green arrows illustrating the steps being demonstrated.\n\nDuring this initial process, a male instructional voice narrates: \"Cross the wide end over the narrow end, making an X. Grab hold of the center of the X. Step 1 is behind. Take the wide end of the tie and make it go behind the tie hole. The back of the tie should now be showing.\"\n\nAs the demonstrator continues, holding the intersection of the 'X' with one hand and pulling the longer end through the loop with the other, the voice further explains: \"Step 2 is in. Take the wide end of the tie and make it go in the tie hole. The back of the tie should again be showing.\"\n\nNext, the demonstrator wraps the longer end around the shorter end, forming a loop, while the instructional graphic clearly shows these movements. The demonstrator tightens the knot and adjusts it under the collar to ensure it's properly aligned.\n\nThe narration continues in sync: \"Step 3 is across. Take the wide end of the tie and make it go across the tie hole. Tip, keep a finger here.\"\n\nThe video shows the demonstrator wrapping the tie around the knot again and adjusting it to sit neatly. The graphic continues to provide step-by-step visual aid.\n\nThe voice further explains: \"The front of the tie should now be showing. Step 4 is out. Take the wide end of the tie and make it go out of the tie hole.\" The demonstrator places the wide end of the tie where their finger is and pulls it through, creating a clean knot.
The instructional graphic clearly marks each step.\n\nThe voice provides additional tips: \"To finish it up, put the wide end of the tie where your finger is and pull it through. Tip. Squeeze the bottom of the knot and pull the tops apart to create the triangular shape.\"\n\nAs the demonstrator makes final adjustments to ensure the knot is tight and properly aligned under the shirt collar, the voice summarizes the steps: \"Another tip, as you pull the knot tight, you can use a finger to try to create the dimple. Tighten the knot around your neck by holding the narrow end and sliding the knot up.\"\n\nThe video concludes with the demonstrator smoothing down the tie and ensuring it sits perfectly neat, showcasing the finished look achieved with the step-by-step guidance provided both visually and audibly. The instructional graphic in the top left corner effectively assists throughout the knot-tying process."
+ }
+ ],
+ "events": [
+ "Video starts with a close-up of a man wearing a white striped dress shirt.",
+ "Man is preparing to tie a blue floral-patterned necktie.",
+ "Man adjusts the tie, with the longer end crossing over the shorter end to form an 'X' shape.",
+ "Instructional graphic in the top left corner provides additional guidance with green arrows.",
+ "Male instructional voice narrates the steps: 'Cross the wide end over the narrow end, making an X. Grab hold of the center of the X. Step 1 is behind. Take the wide end of the tie and make it go behind the tie hole. The back of the tie should now be showing.'",
+ "Demonstrator holds the intersection of the 'X' with one hand.",
+ "Demonstrator pulls the longer end through the loop with the other hand.",
+ "Voice explains: 'Step 2 is in. Take the wide end of the tie and make it go in the tie hole. The back of the tie should again be showing.'",
+ "Demonstrator wraps the longer end around the shorter end, forming a loop.",
+ "Instructional graphic clearly shows these movements.",
+ "Demonstrator tightens the knot and adjusts it under the collar.",
+ "Voice narrates: 'Step 3 is across. Take the wide end of the tie and make it go across the tie hole. Tip, keep a finger here.'",
+ "Demonstrator wraps the tie around the knot again and adjusts it to sit neatly.",
+ "Voice explains: 'The front of the tie should now be showing. Step 4 is out. Take the wide end of the tie and make it go out of the tie hole.'",
+ "Demonstrator places the wide end of the tie where their finger is and pulls it through, creating a clean knot.",
+ "Instructional graphic marks each step.",
+ "Voice provides additional tips: 'To finish it up, put the wide end of the tie where your finger is and pull it through. Tip. Squeeze the bottom of the knot and pull the tops apart to create the triangular shape.'",
+ "Demonstrator makes final adjustments to ensure the knot is tight and properly aligned under the shirt collar.",
+ "Voice summarizes the steps: 'Another tip, as you pull the knot tight, you can use a finger to try to create the dimple. Tighten the knot around your neck by holding the narrow end and sliding the knot up.'",
+ "Video concludes with the demonstrator smoothing down the tie.",
+ "Demonstrator ensures the tie sits perfectly neat.",
+ "Finished look is showcased.",
+ "Instructional graphic in the top left corner assists throughout the knot-tying process."
+ ]
+ },
+ {
+ "video": "video/162.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a black screen and then transitions to a tennis court, where a man dressed in a bright yellow T-shirt, gray shorts, and checkered slip-on shoes is holding a tennis racket with both hands. The text \"CAILLOU PLAYS TENNIS\" appears on the screen. In the background, a clear sky and multiple tennis courts with green fences are visible. Suddenly, the man is wearing a long red shirt, still holding a tennis racket and a tennis ball, and walks on the court. Then, the perspective changes back to the man in the yellow T-shirt, standing in a ready stance with the racket. The scene concludes with the man in the red long-sleeve shirt holding the racket and tennis ball.\n\nThroughout these visuals, a voice enthusiastically exclaims, \"Caillou plays tennis. Caillou, today we're going to play tennis. Yay!\"\n\nNext, the man in the red long-sleeve shirt stands ready on the tennis court, holding a tennis ball in one hand and a tennis racket in the other. The sequence then shifts to the same man in the yellow T-shirt, also in a ready stance with the racket held by both hands. The background features a clear sky and green fences encircling the tennis courts. Following this, the man in the red long-sleeve shirt looks to the side while holding the racket and tennis ball. Then he throws the ball into the air. The clip ends with the man in the yellow T-shirt holding both the tennis ball and racket, and then throwing the ball into the air.\n\nDuring these visuals, a voice instructs, \"Step one, Stance. Step two is throwing the ball in the air. Throw the ball in the air.\" Meanwhile, birds chirp, and the sound of a ball being thrown into the air is audible.\n\nThe scene transitions to the same man now wearing a red long-sleeve shirt, standing in a relaxed pose holding the tennis equipment. This is followed by the man in the red long-sleeve shirt tossing the ball into the air, preparing to serve. He successfully hits the ball.
The view then shifts back to the man in the yellow T-shirt in the middle of a serve, with the racket raised and the ball airborne. The sequence ends with the man in the yellow T-shirt jumping and swinging the racket at the ball in an action shot. He fails to hit the ball.\n\nThroughout these scenes, a child's voice says, \"And the last step is serving and hitting the ball. I can't do it, Mommy!\" Accompanying these visuals, the sound of wind blowing and a ball bouncing can be heard.\n\nThe final sequence showcases the man in the red long-sleeve shirt standing on the tennis court, holding a tennis ball and racket while in a ready stance. At this point, a deformation effect is added to his face. Following that, the man tosses the ball into the air and swings the racket to start a serve. He successfully hits the ball, and the ball flies over the net. Then, from the perspective of the ball, the camera gradually approaches the man in the yellow T-shirt, whose expression becomes scared, and then he is hit on the head by the ball. He then falls to the ground. The video concludes with a black screen.\n\nIn this clip, the audio features a man's voice encouraging, \"You can do it. Just whack the ball, Caillou.\" This is followed by a child's voice exclaiming, \"Mommy!\" The sound of a child's scream is also heard."
+ }
+ ],
+ "events": [
+ "The video starts with a black screen.",
+ "The video transitions to a tennis court.",
+ "A man is dressed in a bright yellow T-shirt, gray shorts, and checkered slip-on shoes, holding a tennis racket with both hands.",
+ "The text 'CAILLOU PLAYS TENNIS' appears on the screen.",
+ "In the background, a clear sky and multiple tennis courts with green fences are visible.",
+ "Suddenly, the man is wearing a long red shirt, still holding a tennis racket and a tennis ball, and walks on the court.",
+ "The perspective changes back to the man in the yellow T-shirt, standing in a ready stance with the racket.",
+ "The scene concludes with the man in the red long-sleeve shirt holding the racket and tennis ball.",
+ "Throughout these visuals, a voice enthusiastically exclaims, 'Caillou plays tennis. Caillou, today we're going to play tennis. Yay!'",
+ "The man in the red long-sleeve shirt stands ready on the tennis court, holding a tennis ball in one hand and a tennis racket in the other.",
+ "The sequence shifts to the same man in the yellow T-shirt, also in a ready stance with the racket held by both hands.",
+ "The background features a clear sky and green fences encircling the tennis courts.",
+ "The man in the red long-sleeve shirt looks to the side while holding the racket and tennis ball.",
+ "The man in the red long-sleeve shirt throws the ball into the air.",
+ "The clip ends with the man in the yellow T-shirt, holding both the tennis ball and racket, and then throws the ball into the air.",
+ "During these visuals, a voice instructs, 'Step one, Stance. Step two is throwing the ball in the air. Throw the ball in the air.'",
+ "Birds chirp, and the sound of a ball being thrown into the air is audible.",
+ "The scene transitions to the same man now wearing a red long-sleeve shirt, standing in a relaxed pose holding the tennis equipment.",
+ "The man in the red long-sleeve shirt tosses the ball into the air, preparing to serve.",
+ "The man in the red long-sleeve shirt successfully hits the ball.",
+ "The view shifts back to the man in the yellow T-shirt in the middle of a serve, with the racket raised and the ball airborne.",
+ "The sequence ends with the man in the yellow T-shirt jumping and swinging the racket at the ball in an action shot.",
+ "He fails to hit the ball.",
+ "A child's voice says, 'And the last step is serving and hitting the ball. I can't do it, Mommy!'",
+ "The sound of wind blowing and a ball bouncing can be heard.",
+ "The man in the red long-sleeve shirt is standing on the tennis court, holding a tennis ball and racket while being in a ready stance.",
+ "A deformation effect is added to his face.",
+ "The man tosses the ball into the air and swings the racket to start a serve.",
+ "He successfully hits the ball.",
+ "The ball flies over the net.",
+ "From the perspective of the ball, the camera gradually approaches the man in the yellow T-shirt.",
+ "The man's expression becomes scared.",
+ "The man in the yellow T-shirt is hit on the head by the ball.",
+ "The man in the yellow T-shirt falls to the ground.",
+ "The video concludes with a black screen.",
+ "A man's voice encourages, 'You can do it. Just whack the ball, Caillou.'",
+ "A child's voice exclaims, 'Mommy!'",
+ "The sounds of a child's scream are heard."
+ ]
+ },
+ {
+ "video": "video/163.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up shot of a metal strip being distorted by a tool, and a figure in blue jeans can be seen in the background. The scene then transitions to showcasing a variety of traditional drilling tools. First, a \"pump drill\" is introduced, which rotates the spindle and drills into the wood as the wooden handle is moved up and down. Next, an \"auger,\" a manual drill with an auger bit, demonstrates how it rotates on a wooden surface. Finally, a \"brace\" hand drill with a crank handle is used to drill into the wood, and a close-up shows its bit producing wood chips.\nIn sync with these visuals, a man recounts: \"In my last few short films, I built some early forms of drills: the pump drill, the auger drill, and the brace drill. Now that they are done, I am curious to see which is the most effective and the best choice.\" During his narration, we hear the distinct sounds of the different rigs: pump drilling, auger drilling, and brace drilling.\nNext, the close-up focuses on the bit rotating and creating wood chips as it penetrates the wood. The scene cuts to a figure in a gray shirt demonstrating a traditional bow drill, moving the wooden handle back and forth, rotating the spindle, and drilling into the wood. The camera highlights the hand and detailed operation of the bow drill against the workshop background. The segment ends with a close-up of the bow drill's bit drilling into wood, underscoring its effectiveness. Offscreen, the same man explains, \"So I take a board and have each of them tested. The first is the pump drill, but after several attempts, I keep finding that once the bit is fully into the wood, there is too much friction and it will not spin anymore. So I try a few...\"\nThe following section continues to focus on the bow drill. The man continues to operate the tool, with the spindle rotating and drilling, the detailed operation shown in the same workshop setting.
The close-up further highlights the sawdust on the bow drill's bit, illustrating the effectiveness of the tool. The narrator continues: \"... different bit shapes to help reduce friction. First I smoothed the tip, then narrowed the entire tip. However, I continue to encounter the same problems. So while it's a very clever design, the pump drill...\" This section ends with a faint thud.\nThe video switches to a split-screen view to compare two traditional drilling tools: the \"auger\" at the top and the \"brace\" at the bottom. Both tools drill into the wood simultaneously. A timer in the center of the screen counts from 0:00 to 0:04, showing the drilling time for each tool. The workshop setting remains unchanged. Meanwhile, the man says, \"... feeling a little picky. If anyone has any suggestions to make it work better, please let me know. That leaves the auger and the brace drill. I put them through the time trial and they actually finished almost...\" This part of the audio includes several pauses between sentences.\nThe split-screen comparison continues, focusing on the person wearing the gray shirt, using the \"auger\" at the top and the \"brace\" at the bottom. The timer runs from 0:05 to 0:27 to highlight each tool's drilling efficiency. The workshop setting emphasizes the operation of traditional tools. The man says, \"... at the same time. However, I have found that each method requires different skills. The auger seems to work best with a lot of force and a few tough turns, while the brace seems to work best with a small amount of force.\"\nFinally, the comparison between the \"auger\" and the \"brace\" ends with the timer running from 0:28 to 0:33. The man in the gray shirt continues to operate both tools in the workshop environment, highlighting their efficiency. At the end, the narrator says, \"... It keeps spinning. The auger is a little faster, but there's a lot less work to do with the brace.
So I think with some improvements to the fit, the brace drill will be the clear winner.\" At the end of the video, the auger is shown, and the operator can be seen trying his best to operate the auger, from time to time blowing away the sawdust next to the bit."
+ }
+ ],
+ "events": [
+ "Close-up shot of a metal strip distorted by a tool with a figure in blue jeans in the background.",
+ "Scene transitions to a variety of traditional drilling tools.",
+ "Introduction of a pump drill with demonstration of its operation (wood handle moving up and down).",
+ "Demonstration of an auger, a manual drill, rotating on a wooden surface.",
+ "Demonstration of a brace hand drill drilling into the wood, with close-up showing wood chips produced by its bit.",
+ "Narration: 'In my last few short films, I made some early forms of drills: the pump drill, the auger drill, and the brace drill. Now that they are done, I am curious to see which is the most effective and best choice.'",
+ "Sounds of pump drilling, auger drilling, and brace drilling are heard.",
+ "Close-up of the bit rotating and creating wood chips as it penetrates the wood.",
+ "Cut to a figure in a gray shirt demonstrating a traditional bow drill.",
+ "Close-up of the bow drill's hand and detailed operation against the workshop background.",
+ "Close-up of the bow drill's bit drilling into wood, underscoring its effectiveness.",
+ "Narration: 'So I take a board and have each of them tested. The first is pump drilling, but after several attempts, I keep finding that once the bit is fully into the wood, there is too much friction and it will not spin anymore. So I try a few...' followed by the scene continuing the focus on the bow drill.",
+ "Detailed operation of the bow drill shown in the workshop setting.",
+ "Close-up highlighting sawdust on the bow drill bit, illustrating the effectiveness of the tool.",
+ "Narration continues: 'Different bit shapes help reduce friction. First, smooth the tip of the knife, then narrow the entire tip. However, I continue to encounter the same problems. So while it's a very clever design, the pump drill...'",
+ "Segment ends with a faint thud.",
+ "Video switches to a split-screen view to compare auger and brace drills.",
+ "Both tools drill into wood simultaneously with a timer counting from 0:00 to 0:04.",
+ "Narration: '... Feeling a little picky. If anyone has any suggestions to make it work better, please let me know. The rest is the auger and the brace drill. Put them through the time trial and they actually almost all passed...' (with pauses between sentences).",
+ "Split-screen continues, focusing on the person in the gray shirt using the auger at the top and the brace at the bottom.",
+ "Timer counts from 0:05 to 0:27 highlighting the tool's drilling efficiency.",
+ "Narration: '... At the same time. However, I have found that each method requires different skills. The auger seems to work best with a lot of force and a few tough turns, while the brace seems to work best with a small amount of force.'",
+ "Comparison between the auger and the brace ends with a timer from 0:28 to 0:33.",
+ "Man in the gray shirt continues to operate both tools in the workshop environment.",
+ "Narration ends: '... It keeps spinning. The auger is a little faster, but there's a lot less work to do with the brace. So I think with some improvements to the fit, the brace drill will be the clear winner.'",
+ "Scene of the auger shown with the operator trying his best to operate it, blowing away sawdust next to the bit.",
+ ]
+ },
+ {
+ "video": "video/164.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man wearing a black t-shirt displaying the logo 'Institute of Human Anatomy' on the left side of the chest. The setting is a laboratory or medical environment with various equipment and shelves. It is worth mentioning that the screen displays text corresponding to the real-time audio throughout the video. White text appears at the bottom, syncing with the narration about herpes being caused by two different viruses, among which the important text is shown in pink. The scene transitions with sequential texts, emphasizing the complexity and dual nature of herpes viruses. A close-up illustration of a virus displays the white text 'Herpes Simplex Virus Type 1 (HSV-1)' on a pink background, explaining its characteristics.\n\nThe audio reveals, \"Now herpes is caused by one of two different viruses, although there have been people that have contracted both. One is called herpes simplex virus type 1 or HSV-1,\" giving necessary details about the types of herpes simplex viruses.\n\nFollowing this, there's a close-up, detailed illustration of another virus appearing on the screen. The virus is identified as 'Herpes Simplex Virus Type 2 (HSV-2),' and another scene emphasizes this with repeated white text on a pink background. The video returns to the man in the t-shirt, who explains further. It then transitions to an illustration of HSV-1 Labial Herpes around the lips.\n\nThe audio narration continues with \"And the other is called herpes simplex virus type 2 or HSV-2. Now HSV-1 is most often associated with sores around the mouth and the lips, often referred to as cold sores or fever blisters.\"\n\nAs the video shows the man in the black t-shirt again, it transitions to a detailed illustration of HSV-2 Genital Herpes around the genital area. The man reappears discussing HSV-2 and its symptoms.\n\nThe audio states, \"Whereas HSV-2 is most often associated with sores around the genital structures and even the anus. 
But there is some potential crossover between these two viruses and that has to do with how you contract them.\"\n\nMoving forward, a man with a beard and a black hoodie appears, saliva spraying from his mouth, with the scene looping in slow motion.\n\nThe audio explains, \"HSV-1 is typically contracted or spread through direct contact with infected oral secretions from someone who has HSV-1.\"\n\nThe sequence then shows the man in a black shirt speaking, then transitioning to a close-up of a child's face with visible sores, indicating an HSV-1 infection. The child opens his mouth and laughs at the camera. Another close-up focuses on the sores, confirming the visual evidence and elaborating on childhood infections.\n\nThe audio continues, \"And this is actually often done during childhood. Say like a parent or a sibling has a visible active lesion.\"\n\nSubsequently, a scene presents a woman in a red shirt kissing a kid in her arms. Another close-up shows a child in a white onesie with whale patterns biting a nipple, leading back to the man in the black t-shirt. The text 'Full Video Link Pinned in the Comments!' appears at the bottom. Finally, the ending frame adds the 'INSTITUTE OF HUMAN ANATOMY' logo, the concluding text 'WHAT HERPES DOES TO THE BODY', and the URL 'www.youtube.com/InstituteofHumanAnatomy', together with the overlay video of the man speaking in the middle.\n\nThe audio finishes with, \"And this infected oral secretion gets transferred from say like kissing or sharing of other various items that eventually make it to the mouth,\" followed by music, emphasizing the mode of transmission for herpes and providing a cue for additional educational resources on the institute's YouTube channel."
+ }
+ ],
+ "events": [
+ "Man in black t-shirt appears with 'Institute of Human Anatomy' logo",
+ "Setting in laboratory with medical equipment and shelves",
+ "Texts appear at the bottom syncing with the narration",
+ "Narration about herpes being caused by two different viruses",
+ "Important text in pink emphasizing complexity and dual nature of herpes viruses",
+ "Close-up illustration of herpes virus with text 'Herpes Simplex Virus Type 1 (HSV-1)' on pink background",
+ "Audio reveals details about HSV-1",
+ "Close-up illustration of another virus 'Herpes Simplex Virus Type 2 (HSV-2)' with text on pink background",
+ "Man with t-shirt explains details about HSV-2",
+ "Illustration of HSV-1 Labial Herpes around lips",
+ "Narration discusses HSV-1 symptoms (cold sores or fever blisters)",
+ "Man in black t-shirt reappears",
+ "Illustration of HSV-2 Genital Herpes around the genital area",
+ "Man explaining HSV-2 symptoms",
+ "Narration discusses HSV-2 symptoms (sores around the genital area and anus)",
+ "Slow-motion scene of man with a beard and black hoodie spouting saliva",
+ "Narration explains HSV-1 contraction through infected oral secretions",
+ "Close-up of a laughing child's face with HSV-1 sores",
+ "Narration about childhood HSV-1 infection via parents or siblings",
+ "Woman in red shirt kissing a kid in her arms",
+ "Close-up of a child in a white onesie with whale patterns biting a nipple",
+ "Return to man in black t-shirt with 'Full Video Link Pinned in the Comments!' text",
+ "Ending frame with 'INSTITUTE OF HUMAN ANATOMY' logo, 'WHAT HERPES DOES TO THE BODY' text, YouTube URL, and overlay video of man speaking",
+ "Narration concludes explaining transmission of herpes via oral secretions, emphasizing the mode of transmission",
+ "Background music plays, promoting additional educational resources on the YouTube channel"
+ ]
+ },
+ {
+ "video": "video/165.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a girl standing in a bathroom with dark tiled walls, wearing a brown towel over a white crop top. With curly hair, a necklace, and a belly button piercing, she is actively engaged in a hair care routine. Holding a spray bottle with blue liquid, she sprays her hair and wipes off excess liquid with a brown towel. During this process, the audio explains the steps being taken: \"Today I'm going to be showing you guys how to part your hair. I like to part my hair while it's wet, but you don't have to. If you don't want to wet your whole head, you can just spray the part that you're parting.\"\n\nMoving forward, the girl holds up two combs, a black comb and a pink wide-tooth comb, demonstrating their features and differences. She uses the black comb to part her curly hair down the middle, aiming for a straight line. The audio provides additional guidance: \"You're going to need a comb like this. If you don't want to part your hair with a comb like this, you can use this side or this side. I prefer this side. Press down, pull back, then separate these two pieces. This isn't straight.\"\n\nContinuing in the same setting, the girl smooths down her hair with her hands, ensuring the middle part remains straight. Employing the black comb once more, she further defines the parting while holding sections of her hair to display the texture and volume. The audio continues to instruct: \"It's not always easy to get it on the first try, so you gotta fix it up, press down super hard, pull back. Okay, the part is straight. If you want it straighter than this, I have a tip. You're going to take this gel, it's called Shine 'n Jam.\"\n\nNext, now wearing only a white crop top and blue jeans without the brown towel, the girl holds up the yellow container labeled 'Shine 'n Jam,' takes some out with one finger, and shows it as she introduces the product. 
After applying the Shine 'n Jam to the exposed straight line of her hair in the middle, the girl uses the black comb to ensure the middle part is perfectly straight, smoothing down her hair with her hands. The audio adds more details on the product and its application: \"Take like this much and put it in the section you're parting. It should look like this. Same thing, press down, pull back, separate the strands.\"\n\nFinally, the girl, still in a white crop top and blue jeans, continues to style her curly hair, emphasizing a well-defined middle part. Using both hands, she gathers her hair into a low ponytail, securing it carefully. The sequence highlights the neatness of the middle part and the smoothness of the hair, with the audio concluding: \"Be a lot slower. The part is a lot cleaner when you include Shine 'n Jam. If you have any tips for parting your hair, leave them in the comments below.\" \n\nThe video depicts a coherent and methodical hair care routine, with detailed instructions given throughout, guiding viewers on the use of specific products and techniques for styling curly hair."
+ }
+ ],
+ "events": [
+ "Girl standing in a bathroom with dark tiled walls",
+ "Girl wearing a brown towel over a white crop top",
+ "Girl has curly hair, a necklace, and a belly button piercing",
+ "Girl holding a spray bottle with blue liquid, and spraying her hair",
+ "Girl wipes off excess liquid with a brown towel",
+ "Audio explains hair parting steps",
+ "Girl holds up two combs: black comb and pink wide-tooth comb",
+ "Girl demonstrates features and differences of the combs",
+ "Girl uses black comb to part her curly hair down the middle",
+ "Audio provides additional guidance for parting hair",
+ "Girl smooths down her hair with her hands, ensuring middle part remains straight",
+ "Girl further defines the parting with the black comb",
+ "Girl holds sections of her hair to display texture and volume",
+ "Audio instructs on parting techniques and mentions using a gel",
+ "Girl now without the towel, wearing a white crop top and blue jeans",
+ "Girl holds up yellow container labeled 'Shine 'n Jam'",
+ "Girl takes some gel out with one finger and shows it",
+ "Girl applies Shine 'n Jam to the exposed straight line of her hair in the middle",
+ "Girl uses black comb to ensure middle part is perfectly straight",
+ "Girl smooths down her hair with her hands",
+ "Audio adds details on the product and its application",
+ "Girl continues to style her curly hair, emphasizing a well-defined middle part",
+ "Girl gathers her hair into a low ponytail using both hands",
+ "Girl secures the low ponytail carefully",
+ "Sequence highlights the neatness of the middle part and the smoothness of the hair",
+ "Audio concludes with a comment to leave tips in the comments below"
+ ]
+ },
+ {
+ "video": "video/166.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with two individuals standing in a lush, forested area, each equipped with a backpack and dressed in outdoor clothing; one in a green shirt and the other in a bright pink shirt. A sunny ambiance lights up the tall trees surrounding them. The man clad in green holds up a snack bar package, showcasing it to the camera, while the other holds a trekking pole. Text appears on the screen: \"Our Top 5 Bars!\" in English, signaling the video's focus on snack bars that fuel their trek. A voiceover narrates, mentioning their 40-day journey, emphasizing their need for calories. \"We've been walking for over 40 days straight and eat all the calories we can get. These are our favorite bars to power our trek. Bars are an easy snack on trail, and we choose them based on calories, protein, taste, and price,\" the voice explains as upbeat music plays in the background.\n\nThe scene transitions to a close-up of various snack bars laid out, including Clif, Luna, and L\u00e4rabar, as the text \"Our Top 5 Bars!\" remains on screen. From here, the video shifts into a more detailed nutritional exploration.\n\nCalories, protein, and taste become focal points with text on the screen: 'Calories, Protein, Taste, Price'. The next scene shows a woman in a red shirt and gray shorts hiking with trekking poles, traversing a picturesque mountainous landscape under a partly cloudy sky. '>200 calories!' text flashes, underscoring the energy content crucial for their trekking adventures. The camera then zooms into a hand holding a Nature's Bakery Fig Bar package, with additional nutritional details: '5. Natures Bakery Fig Bars' and 'Calories: 200, Protein: 3g'. The woman, now sitting comfortably, munches on the fig bar. \n\nThe narrative continues as the voiceover elaborates, \"Bars under 200 calories don't give us enough energy to justify their weight. Fig bars are cheap and offer a unique texture. 
Peanut butter graham cracker sandwiches are a filling option when we can't find the bars we want in stores.\" Here, the screen shows the man in a green jacket savoring a PB/Graham Cracker sandwich with text detailing its nutritional content: '4. PB/Graham Crackers' and 'Calories: ~450, Protein: ~12g (2 sandwiches)'.\n\nMoving forward, the next clip shows the woman also enjoying PB/Graham Cracker sandwiches, reiterating the '4. PB/Graham Crackers' nutritional text. Attention shifts to a Larabar; text on the screen reads '3. L\u00e4rabars' - Calories: 200, Protein: 4g. Next, the same hiker relishes the Larabar, sitting on a mat. The scene cuts to them holding a Clif Bar package: '2. Clif Bars' - Calories: 250, Protein: 9g. The background showcases the etched trails among the rolling mountains as the woman snacks on the Clif Bar.\n\n\"L\u00e4rabars have simple ingredients and taste amazing. Regular Clif Bars are easy to find almost anywhere, and it's a bonus when they include caffeine,\" says the voice-over, providing more insights into their snack preferences.\n\nThe camera then zooms into a hand holding a Clif Builder Bar package. The final clip centers around a man wearing a green long-sleeve shirt and pink shorts in the forest, holding and opening a Clif Builder Bar package and then eating it. \"Clif Builders are our favorite bars. They're full of protein and keep us full the longest,\" the audio narrates. Text on screen highlights: '1. Clif Builder Bars' with 'Calories: 280, Protein: 20g'. 
The video concludes with a focused shot of the opened Clif Builder Bar, emphasizing the substantiality and nutritional benefit that makes it their top choice, framed beautifully against the sunny forest backdrop.\n\nThe connected series of clips not only introduces viewers to a variety of nutritional snack bars but does so by intertwining the lived experiences of hikers dependent on these calorie-dense snacks for their prolonged adventure trail, hence providing both practical advice and visually enriched content."
+ }
+ ],
+ "events": [
+ "Two individuals stand in a lush, forested area with backpacks and outdoor clothing.",
+ "One person wears a green shirt, the other a bright pink shirt.",
+ "The sunny ambiance lights up the tall trees around them.",
+ "The man in the green shirt holds up a snack bar package and shows it to the camera.",
+ "The other person holds a trekking pole.",
+ "Text appears on screen: 'Our Top 5 Bars!' in English.",
+ "Voiceover mentions their 40-day journey and need for calories.",
+ "Voiceover explains they have been walking for over 40 days and need all the calories they can get.",
+ "Voiceover says they choose snack bars based on calories, protein, taste, and price.",
+ "Upbeat music plays in the background.",
+ "Scene transitions to a close-up of various snack bars laid out.",
+ "Snack bars include Clif, Luna, and L\u00e4rabar.",
+ "Text 'Our Top 5 Bars!' remains on screen.",
+ "Video shifts to detailed nutritional exploration.",
+ "Text on screen: 'Calories, Protein, Taste, Price'.",
+ "Scene shows a woman in a red shirt and gray shorts hiking with trekking poles.",
+ "She traverses a picturesque mountainous landscape under a partly cloudy sky.",
+ "Text '>200 calories!' flashes on screen.",
+ "Camera zooms into a hand holding a Nature's Bakery Fig Bar package.",
+ "Text on screen: '5. Natures Bakery Fig Bars' and 'Calories: 200, Protein: 3g'.",
+ "Woman sits comfortably and munches on the fig bar.",
+ "Voiceover elaborates on the importance of bars over 200 calories.",
+ "Voiceover talks about Fig bars being cheap and having a unique texture.",
+ "Screen shows the man in a green jacket savoring a PB/Graham Cracker sandwich.",
+ "Text on screen: '4. PB/Graham Crackers' and 'Calories: ~450, Protein: ~12g (2 sandwiches)'.",
+ "Woman enjoys PB/Graham Cracker sandwiches.",
+ "Text '4. PB/Graham Crackers' nutritional text is reiterated.",
+ "Attention shifts to a Larabar.",
+ "Text on screen: '3. L\u00e4rabars' - Calories: 200, Protein: 4g.",
+ "Same hiker relishes the Larabar, sitting on a mat.",
+ "Scene cuts to them holding a Clif Bar package.",
+ "Text on screen: '2. Clif Bars' - Calories: 250, Protein: 9g.",
+ "Background shows etching trails among rolling mountains.",
+ "Woman snacks on the Clif Bar.",
+ "Voiceover mentions L\u00e4rabars have simple ingredients and taste amazing.",
+ "Voiceover says regular Clif Bars are easy to find and a bonus with caffeine.",
+ "Camera zooms into a hand holding a Clif Builder Bar package.",
+ "Final clip centers around a man in a green long-sleeve shirt and pink shorts in the forest.",
+ "He holds and opens a Clif Builder Bar package and eats it.",
+ "Voiceover states Clif Builders are their favorite bars full of protein and keep them full the longest.",
+ "Text on screen: '1. Clif Builder Bars' with 'Calories: 280, Protein: 20g'.",
+ "Video concludes with a focused shot of the opened Clif Builder Bar.",
+ "Opened Clif Builder Bar is framed against a sunny forest backdrop.",
+ "Video provides practical advice and visually enriched content about nutritional snack bars intertwined with hikers' experiences."
+ ]
+ },
+ {
+ "video": "video/167.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a close-up of a white goblet filled with a vibrant blue liquid. A hand holds up the goblet and pours more of this blue liquid from a transparent container into the goblet, causing the liquid to overflow. The hand stops pouring just before the liquid reaches the rim and then gently holds the goblet, the liquid turning transparent. In the background, a man explains, \"This cup is called the greedy cup. What happens is you can fill it up like normal, but you shouldn't go past the fill line. If you go past the fill line, then it actually drains the entire contents of the cup.\"\n\nThe frame transitions to a close-up of another white goblet held over a black surface with a small puddle of blue liquid. The hand carefully pours blue liquid from a transparent container into a white, intricately shaped object. The liquid's progress highlights the object's unique design. Accompanying the visuals, the man's voice details, \"So what happens is you fill up the cup, the water level rises, and it will rise up that center U-tube as well, but it won't get to the top unless you overfill it past the fill line.\"\n\nThe narrative continues with a close-up of the intricate object, now steadily dispensing blue liquid from its bottom. The blue liquid continuously flows as the hand keeps the object steady, emphasizing the consistent nature of the siphon effect. Meanwhile, the narrator explains, \"And once it's past the fill line, then the water will start flowing down the other side of the U-shaped tube in there. And after that, that creates a siphon effect.\"\n\nThe scene focuses on the uninterrupted stream of blue liquid from the object, further stressing its intricate design and the siphon mechanism's functionality. The man elaborates, \"The siphon effect can now continue until it empties the entire contents of the cup from the top to the bottom. 
But you can actually defeat the greedy cup by putting a denser liquid at the bottom.\"\n\nFollowing this, the video returns to the initial white goblet, again showing the hand pouring the transparent liquid and then the blue liquid until just below the rim. The goblet is held up. The clip ends with a cross-sectional diagram of the goblet, unveiling its inner workings and the flow path of the liquid. The voice-over says, \"We can get as greedy as we want. We're completely over the fill line now, but it doesn't leak out. So the gallium is only going to move up the U-shaped tube until it equals the weight of water outside of the tube.\"\n\nThe video concludes with an illustrative cross-sectional diagram of a goblet, identical to the one seen earlier. The brown outer structure and a central vertical tube within are depicted, with red liquid at the top and a gray substance at the bottom, highlighting the design and internal flow of the goblet. The static view emphasizes the goblet's detailed morphology and liquid distribution. The narrator finishes, \"And because gallium is so dense, it's only a little tiny bit of gallium that moves up the tube that equals the weight of water above it.\""
+ }
+ ],
+ "events": [
+ "Close-up of a white goblet filled with vibrant blue liquid",
+ "A hand holds up the goblet and pours more blue liquid from a transparent container",
+ "Blue liquid overflows from the goblet",
+ "Hand stops pouring just before the liquid reaches the rim",
+ "Blue liquid turns transparent in the goblet",
+ "Background narration about the greedy cup and fill line",
+ "Frame transitions to close-up of another white goblet over a black surface with a small puddle of blue liquid",
+ "Hand pours blue liquid into an intricately shaped white object",
+ "Narration continues explaining the water level rise in the center U-tube",
+ "Close-up of the intricate object dispensing blue liquid from its bottom",
+ "Hand keeps object steady, highlighting the siphon effect",
+ "Narration about the siphon effect starting and emptying the cup",
+ "Scene focuses on uninterrupted stream of blue liquid from the object",
+ "Narration elaborates on siphon effect emptying the cup",
+ "Explanation that denser liquid can defeat the greedy cup",
+ "Video returns to initial white goblet, hand pours transparent and then blue liquid just below the rim",
+ "Goblet is held up",
+ "Clip ends with cross-sectional diagram of goblet showing inner workings",
+ "Voice-over explains the gallium's movement in the U-shaped tube",
+ "Illustrative cross-sectional diagram of a goblet with brown outer structure and central vertical tube",
+ "Diagram shows red liquid at the top and gray substance at the bottom",
+ "Narration about the weight equivalence of gallium and water in the tube",
+ "Static view of the goblet's detailed morphology and liquid distribution"
+ ]
+ },
+ {
+ "video": "video/168.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man sitting in a chair, wearing a light blue polo shirt. The background includes a shelf with various items and a blue light strip. The man holds a pen in his clasped hands. While the visual presents a calm and collected setting, the audio reveals a conversational tone with someone saying, \"All right, guys, it's that time of year again. What, we're eliminated from playoff contention already? That was fast, but I'm not surprised.\" This remark is followed by another voice responding, \"What? No, I'm talking about fantasy football.\" The conversation continues with, \"Oh, hell yeah. I love fantasy football. Why are you always so happy? Shut up. Sorry.\"\n\nThe scene transitions to the same man now standing and wearing a white t-shirt with the text 'Saquon Barkley' in large blue letters. The background is a plain white wall, and the discussion about fantasy football deepens. \"Look, the point of this meeting is to figure out who should be the number one pick, because the league wants to know who they should be marketing as the top guy.\"\n\nThe man is back in the chair, again in the light blue polo shirt, with the initial background setup. The discourse continues with a man asserting, \"That's got to be me, right? Is there really another option?\" The scene shifts to the man standing in a dark blue t-shirt with 'Austin Ekeler' in large blue letters, against a background that includes a framed certificate and hanging lanyards. \n\nNow, seated by the window in a black t-shirt displaying 'Nick Chubb' in large orange letters, the discussion turns more analytical. \"Oh, come on. I mean, yeah, you're a great running back entering his prime behind a good O-line with minimal backfield competition and an offense that should be fairly productive and, uh, what was I arguing?\" This moment highlights confusion and strategy deliberation. 
The next visual switch shows the man in the chair again, wearing the light blue polo shirt, looking to the side as the spoken inquiry continues, \"Dude, you are not helping. Shouldn't you be rehabbing or something?\" Ending with, \"I'm not injured. Well, that's first. ...\"\n\nOnscreen, the man stands, arms crossed, wearing a white t-shirt with 'Jonathan Taylor' in large blue letters, the backdrop featuring a framed certificate. Another shift displays the man standing in a light blue t-shirt with 'Christian McCaffrey' in large black letters against a plain white wall, deepening the player's critiques: \"Speaking of not injured, did everyone forget that my foot's healed?\" This claim is met with skepticism, sparking a response, \"You're also 28, which makes you like 75 in running back years, but at least if you get hurt again, it'll be covered by Medicare.\"\n\nDressed in a black sleeveless t-shirt with 'Najee Harris' in large yellow letters, the plain white background contrasts with their focused stance. The conversation continues: \"Are we really still doing the whole number one pick has to be a running back thing? Can we please get with the times?\" Before another voice interjects, the visual shifts to the same man in a blue sleeveless t-shirt with 'Derrick Henry' in white letters.\n\nChanging to a black t-shirt that says 'Dalvin Cook' in large purple letters, the background plain and white, the argument proceeds, \"Agreed, but I should be the number one pick because you're not going to win the Triple Crown again. I might.\" Another retort, \"Matthew and I have...\" breaks off before shifting scenes. Wearing a white long-sleeve t-shirt stating 'Cooper Kupp' in large blue letters, with 'YouTube Ideas' and 'Heavy' scribbled on a whiteboard in the background, the discussion intensifies: \"Yeah, yeah. You and Stafford bump uglies every night and shower together every morning. We get it. 
If we're going receiver,\"\n\nTransitioning to a black t-shirt with 'Justin Jefferson' in large purple letters, the straightforward white background frames the ongoing debate. \"I feel like I'm a better pick than both of you.\" Facing slightly to the side, another voice chimes in, \"Yeah, nobody plays in a points-per-gritty league, so no.\" Then, concluding with \"So we're going with Jonathan Taylor, right? Yeah, probably. He is a fantastic pick. Nice. All right, cool.\" The man in the blue shirt then uses his pen to write something in his notebook.\n\nThroughout the video, the seamless progression of the fantasy football debate integrates various wardrobe changes symbolizing different players, each representing their case dynamically. The alternating scenes cultivate a narrative steeped in visual and verbal interplay, creating an engaging experience that encapsulates the strategic depth of fantasy football picks."
+ }
+ ],
+ "events": [
+ "Man sitting in a chair, wearing a light blue polo shirt",
+ "Background includes a shelf with various items and a blue light strip",
+ "Man holds a pen in his clasped hands",
+ "Audio: \\\"All right, guys, it's that time of year again. What, we're eliminated from playoff contention already? That was fast, but I'm not surprised.\\\"",
+ "Response: \\\"What? No, I'm talking about fantasy football.\\\"",
+ "Follow-up Conversation: \\\"Oh, hell yeah. I love fantasy football. Why are you always so happy? Shut up. Sorry.\\\"",
+ "Scene transitions to man standing",
+ "Man wearing a white t-shirt with 'Saquon Barkley' in large blue letters",
+ "Background is a plain white wall",
+ "Discussion deepens about fantasy football",
+ "Man: \\\"Look, the point of this meeting is to figure out who should be the number one pick, because the league wants to know who they should be marketing as the top guy.\\\"",
+ "Man back in chair, wearing light blue polo shirt, with initial background setup",
+ "Man: \\\"That's got to be me, right? Is there really another option?\\\"",
+ "Scene shifts to man standing in a dark blue t-shirt with 'Austin Ekeler' in large blue letters",
+ "Background includes a framed certificate and hanging lanyards",
+ "Man seated by the window in a black t-shirt displaying 'Nick Chubb' in large orange letters",
+ "Discussion turns more analytical: \\\"Oh, come on. I mean, yeah, you're a great running back entering his prime behind a good O-line with minimal backfield competition and an offense that should be fairly productive and, uh, what was I arguing?\\\"",
+ "Man back in chair in light blue polo shirt, looking to the side",
+ "Man: \\\"Dude, you are not helping. Shouldn't you be rehabbing or something?\\\"",
+ "Response: \\\"I'm not injured. Well, that's first. ...\\\"",
+ "Onscreen man stands, arms crossed, wearing a white t-shirt with 'Jonathan Taylor' in large blue letters",
+ "Backdrop features a framed certificate",
+ "Man standing in light blue t-shirt with 'Christian McCaffrey' in large black letters against a plain white wall",
+ "Discussion about player's critiques: \\\"Speaking of not injured, did everyone forget that my foot's healed?\\\"",
+ "Response: \\\"You're also 28, which makes you like 75 in running back years, but at least if you get hurt again, it'll be covered by Medicare.\\\"",
+ "Man dressed in a black sleeveless t-shirt with 'Najee Harris' in large yellow letters",
+ "Plain white background contrasts with focused stance",
+ "Man: \\\"Are we really still doing the whole number one pick has to be a running back thing? Can we please get with the times?\\\"",
+ "Visual shifts to man in blue sleeveless t-shirt with 'Derrick Henry' in white letters",
+ "Man changing to black t-shirt that says 'Dalvin Cook' in large purple letters, plain white background",
+ "Man: \\\"Agreed, but I should be the number one pick because you're not going to win the Triple Crown again. I might.\\\"",
+ "Retort: \\\"Matthew and I have...\\\"",
+ "Scene shifts",
+ "Man wearing a white long-sleeve t-shirt stating 'Cooper Kupp' in large blue letters",
+ "Background: 'YouTube Ideas' and 'Heavy' scribbled on whiteboard",
+ "Discussion intensifies: \\\"Yeah, yeah. You and Stafford bump uglies every night and shower together every morning. We get it. If we're going receiver,\\\"",
+ "Man in black t-shirt with 'Justin Jefferson' in large purple letters, straightforward white background",
+ "Ongoing debate: \\\"I feel like I'm a better pick than both of you.\\\"",
+ "Another voice: \\\"Yeah, nobody plays in a points-per-gritty league, so no.\\\"",
+ "Conclusion: \\\"So we're going with Jonathan Taylor, right? Yeah, probably. He is a fantastic pick. Nice. All right, cool.\\\"",
+ "Man in blue shirt uses pen to write something on notebook",
+ "Wardrobe changes represent different players dynamically",
+ "Scenes cultivate narrative of fantasy football pick strategy"
+ ]
+ },
+ {
+ "video": "video/169.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a detailed and captivating close-up view of a rocky surface. As a hand wearing a dark sleeve points towards a small hole in the rock, the viewer's interest is piqued. The visual elements show various shades of gray and brown, indicative of a natural, unrefined environment. Zooming in, the hole's contents are revealed to be water with small particles floating in it. Approaching even closer, the water inside the hole displays a blue hue, unveiling an underwater cave with rocky walls and a light source at the end, alluding to another chamber or an exit. \n\nIn synchronization, the audio reveals a man narrating his diving experience, speaking about seeing the \"tightest hole\" and squeezing through it to find there's not much left down there. This narrative sets a tone of exploration and a hint of meticulous discovery.\n\nTwo hands rising over the rocky surface suggest the man navigating or holding on underwater. As the scene broadens, fish swim among the rocks, and a woman in a blue and white swimsuit with snorkeling gear swims towards a new rocky area, visibly intrigued by the underwater observation. The concluding frames highlight the camera reaching the darker area, the woman pointing towards it, signaling the discovery of another underwater cave or passage.\n\nSimultaneously, the audio continues with a narrative describing how someone guesses that tighter holes lead to quicker explorations and how Mia found something exciting in a bigger cave, leading the narrator to come down and explore it further, filling the scene with a sense of camaraderie and shared adventure.\n\nDrawing nearer, the dimly lit cave reveals more of its rocky, shaded ambiance, while small particles float in the water. As the camera ventures into the cave, a resting sea turtle is revealed on a rocky ledge. The man waves at it, greeting it. 
The turtle's shell and flippers become focal points before the camera exits the cave to show a wider underwater setting with two snorkelers near the surface, one in a white and blue swimsuit, and the other in a yellow swimsuit. And the man holds a camera, facing forward.\n\nThe audio narrates the scene, mentioning shadows, discovering a \"sleepy turtle,\" playful interaction with it, and then Mia's interest in diving through a tunnel despite dealing with congestion issues. The playful and curious nature of observation continues.\n\nSwimming near a rocky underwater surface, Mia in a blue and white swimsuit explores. Natural light accentuates the shades of brown and green on the rocks. The man and Mia move towards darker areas, another cave. Then the final clip shows the camera turning around to catch the other woman, who is not coming along, with the camera on the screen.\n\nOverlapping is an audio narrative detailing Mia's slow descent due to equalization issues and successfully passing through a tunnel without sinus pain. The audio closes with the man noting Brittany's absence and succinctly wrapping up all they saw throughout the journey with a simple farewell, \"Okay, that's all the stuff we saw. Bye.\" This farewell resonates with the conclusion of an enriching and meticulously documented underwater exploration."
+ }
+ ],
+ "events": [
+ "Close-up view of a rocky surface",
+ "Hand in a dark sleeve points towards a small hole in the rock",
+ "Various shades of gray and brown shown",
+ "Zoom into the hole to reveal water with small particles floating",
+ "Closer view reveals an underwater cave with rocky walls and light source at the end",
+ "Audio narration begins about a diving experience and squeezing through a tight hole",
+ "Two hands raise over the rocky surface suggesting navigation",
+ "Fish swim among the rocks",
+ "Woman in a blue and white swimsuit with snorkeling gear swims towards a new rocky area",
+ "Woman points towards a new underwater cave or passage",
+ "Audio narrative describes exploration and excitement about a bigger cave",
+ "Dimly lit cave reveals rocky shaded ambiance with floating particles",
+ "Camera ventures into the cave, revealing a resting sea turtle on a rocky ledge",
+ "Man waves at the sea turtle, greeting it",
+ "Camera exits the cave to show a wider underwater setting with two snorkelers near the surface",
+ "One snorkeler in a white and blue swimsuit, the other in a yellow swimsuit",
+ "Man holds camera facing forward",
+ "Audio mentions shadows, discovering a sleepy turtle, playful interaction, and Mia's congestion issues",
+ "Swimming near a rocky underwater surface, Mia explores",
+ "Natural light highlights shades of brown and green on the rocks",
+ "Man and Mia move towards darker areas, another cave",
+ "Final clip shows the camera turning around to capture another woman with a camera on the screen",
+ "Audio details Mia's slow descent, passing through a tunnel, and absence of Brittany",
+ "Audio concludes with a farewell, summarizing the exploration"
+ ]
+ },
+ {
+ "video": "video/170.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up view of a blue rectangular box that reads 'FN HERSTAL' and features a circular logo. A hand reaches toward the box, excitedly commenting, \"I haven't bought an airsoft shotgun in four years and today I just got a new one in so let's get straight into this unboxing yeet oh my god guys this looks amazing.\" As the hand opens the box, it reveals a black firearm, two red shotgun shells, and a small bag of white pellets, fully displaying the items inside.\n\nThe scene transitions, maintaining focus on the open box from before, spotlighting the black firearm, two red shotgun shells, and the small bag of white pellets lined within white foam. Orange labels are visible inside the box. The hand points to and then lifts the black firearm, revealing a gray-pillowed bed and a wall poster in the background. The firearm is shown in close-up, while a text overlay displays a product listing for a 'Cybergun FN Herstal Licensed SLP Tactical CO2 Powered Airsoft Shotgun' priced at $165.00 with 4 customer reviews. The narrator mentions, \"So if you guys don't know what this is, this is the FN Herstal Semi-Automatic CO2 Shotgun. So this shotgun retails for about $160.\"\n\nMaintaining continuity, the next scene continues with the close-up of the product listing. The background remains consistent with the gray-pillowed bed and wall poster. The hand holds the black firearm, examining it closely, rotating it to show details including the text '3' MAGNUM' and the FN Herstal logo. \"It just recently came out. It's CO2, a semi-automatic, and you can either do single shot or three-round burst. However, it does not have three barrels. It only has one.\" The excited explanation continues with intermittent commentary from another voice, adding further details.\n\nThe view shifts to a close-up of the black firearm held with both hands, pointed towards the camera, displaying '3' MAGNUM' and the FN Herstal logo. 
The familiar gray-pillowed bed and wall poster remain in the background. The hand displays three red shotgun shells with 'SWAT 12GA 3' MAGNUM' printed on them. Transitioning outdoors, the firearm is aimed at a target board with a wooden fence and trees in the backdrop. As the firearm is fired, the narrator adds, \"It comes with three shells. They each hold 30 BBs, but you can only put one shell in at a time. Alright guys, we're outside an ideal shotgun range.\"\n\nContinuing outdoors, the black shotgun is aimed at a white target board with a brown silhouette. The wooden fence, trees, and clear blue sky form the backdrop. As the narrator discusses \"And we're gonna see if we can get a decent spread from the single barrel on triple shot mode,\" the firearm is fired at the target, showing impact marks. The man reloads the shotgun by inserting a red shell with a gold base. The clip concludes with the man aiming again, ready to fire.\n\nThe final scene remains in the same outdoor setting. The shotgun is once again aimed at the white target board with a brown silhouette, the background of the wooden fence, trees, and clear blue sky unchanged. The narrator remarks, \"we did actually get a very decent spread out of the shotgun with only one barrel so wow I'm really impressed,\" as the shotgun is fired, resulting in multiple small holes on the target. A close-up view of the target board shows labeled sections: 'A' at the top, 'B' below it, 'C' in the center, and 'D' on the sides. The text 'EMG International' is visible at the bottom, with a hand pointing at the 'C' section to highlight the spread of impact marks."
+ }
+ ],
+ "events": [
+ "Close-up view of a blue rectangular box reading 'FN HERSTAL' with a circular logo",
+ "Hand reaches toward the box, excitedly comments about buying a new airsoft shotgun",
+ "Hand opens the box, revealing a black firearm, two red shotgun shells, and a small bag of white pellets",
+ "Scene transitions, maintaining focus on the open box with the black firearm, two red shotgun shells, and the small bag of white pellets in white foam",
+ "Orange labels visible inside the box",
+ "Hand points to and lifts the black firearm, revealing a gray-pillowed bed and a wall poster in the background",
+ "Close-up of the firearm with a text overlay: 'Cybergun FN Herstal Licensed SLP Tactical CO2 Powered Airsoft Shotgun' priced at $165.00 with 4 customer reviews",
+ "Narrator mentions the product and its retail price, explaining it is a CO2 semi-automatic shotgun",
+ "Close-up continues with product details, gray-pillowed bed and wall poster in the background",
+ "Hand holds firearm, rotating it to show details including '3' MAGNUM' and the FN Herstal logo",
+ "Further explanation about the shotgun's capabilities and features despite it only having one barrel",
+ "View shifts to the black firearm held with both hands, displaying details to the camera",
+ "Three red shotgun shells with 'SWAT 12GA 3' MAGNUM' printed on them are displayed",
+ "Transition to outdoor setting where firearm is aimed at a target board with a wooden fence and trees in the backdrop",
+ "Firearm is fired at the target, while narrator explains the capacity of each shell and mentions the ideal shotgun range",
+ "Continued outdoor scene with the shotgun aimed at a white target board featuring a brown silhouette, wooden fence, trees, and clear blue sky",
+ "Narrator discusses testing the triple shot mode for spread, fire is shown impacting the target",
+ "Man reloads the shotgun by inserting a red shell with a gold base",
+ "Man aims again, ready to fire",
+ "Final outdoor scene with shotgun aimed at the white target board, unchanged background",
+ "Narrator remarks on the shotgun's spread with only one barrel and fires again, displaying multiple small holes on the target",
+ "Close-up of the target board showing labeled sections 'A', 'B', 'C', and 'D' with 'EMG International' at the bottom",
+ "Hand points at the 'C' section to highlight the spread of impact marks"
+ ]
+ },
+ {
+ "video": "video/171.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a man lying on a blue exercise mat on a wooden deck, surrounded by lush greenery. This man, dressed in a sleeveless dark blue shirt, gestures while lying on his stomach. The text \"Can't Do Pushups?\" appears in white letters. The man introduces the topic, then begins demonstrating: he positions his hands on the mat and lifts his body into a push-up. As he successfully pushes himself up with only his two hands supporting him, the text \"Fitness is a journey.\" appears, underscoring the motivational theme.\n\nA man speaks over background music, saying: \"My friend, if you can't do push-ups, just like with everything else in the world, you can build up. Let me help. There's no reason to be ashamed if you can't do a push-up. Fitness is a journey and we all start somewhere.\"\n\nThen he stands up on the wooden deck within the same lush setting, with a satellite dish visible in the background. The camera then shows the man standing against the side of a house, demonstrating wall push-ups. He starts doing wall push-ups with one hand, and then two hands. The text \"Wall Pushups Goal: 3 Sets of 50\" appears in white letters, providing exercise instructions. \n\nThe voiceover continues: \"Plus, there are many kinds of push-ups. So we find one we can do, and we work it until we're strong enough to move on. We can start with some wall push-ups. Aim for 3 sets of 50. The reason you're doing so many is so you can practice your pushing motions.\"\n\nNext, we see the same man leaning against the house, extending his arms while speaking. The scene changes to him performing incline push-ups on the deck railing and speaking. The text \"Incline Pushups Goal: 3 Sets of 40\" appears in white letters. He proceeds to use a chair for another variation of incline push-ups. The man then stands closer to the deck railing, gesturing his hand between his hip and the railing, suggesting an ideal height of the chair. 
The man performs incline push-ups on the deck railing again.\n\nThe narration says: \"We can do incline push-ups on a railing, table, or any stable surface. The lower the object, the harder. I recommend starting with something at least higher than your hips. Work these until you can do three sets of 40.\"\n\nThe scene then shifts to the man on a blue exercise mat, doing standard push-ups, surrounded by the lush outdoor setting with a visible satellite dish. He then transitions to kneeling push-ups on the mat, shown from a side angle, with the text \"Kneeling Pushups Goal: 2 Sets of 25\" appearing on the screen.\n\nThe man's voice explains: \"Next, we finally go back to the ground, and we can do kneeling push-ups. And no, I never call these girl push-ups. This is a great therapy exercise that we don't need to shame, and I've never met a girl that likes them being called that.\"\n\nThe scene transitions back to the front, showing the man doing kneeling push-ups, which then transitions to him doing full push-ups. This clip ends with him raising his left hand, seemingly about to high-five the camera. Then the video shows the man sitting on the blue mat, with the text \"Pushups should not hurt.\" in white letters. The man gestures while sitting and speaking, emphasizing proper form and safety.\n\nThe narration continues: \"Anyway, work these until you can do 2 sets of 25, and then you should be strong enough to do a regular full push-up. Congratulations! If you start getting joint pain, you may have progressed too quickly. Go back to some of the easy variations.\"\n\nThe final clip shows the text \"Muscles develop faster than joints.\" in white letters. 
The man continues to emphasize the educational content, gesturing and the screen presents \"Longer Video: hybridcalisthenics.com/pushups\" and \"Free Fitness Routine: hybridcalisthenics.com/routine.\" The concluding text \"Follow & Like For More :)\" appears in white letters.\n\nThe man concludes: \"Sometimes our muscles are strong enough to move on, but our joints are not. If you're interested, I have a full YouTube video on this that I'll link in the comments. Have a beautiful day.\"\n\nThroughout the video, the man's voice complements the motivational and instructional content, accompanied by consistent background music."
+ }
+ ],
+ "events": [
+ "Video opens with a man lying on a blue exercise mat on a wooden deck, surrounded by lush greenery.",
+ "Man is dressed in a sleeveless dark blue shirt and gestures while lying on his stomach.",
+ "Text 'Can't Do Pushups?' appears in white letters.",
+ "Man introduces the topic of push-ups.",
+ "Man demonstrates a push-up by positioning his hands on the mat and lifting his body.",
+ "Text 'Fitness is a journey.' appears.",
+ "Man speaks over background music about building up the ability to do push-ups.",
+ "Man stands up on the wooden deck with a satellite dish in the background.",
+ "Camera shows the man standing against the side of a house, demonstrating wall push-ups.",
+ "Man does wall push-ups with one hand, then with two hands.",
+ "Text 'Wall Pushups Goal: 3 Sets of 50' appears.",
+ "Voiceover explains different kinds of push-ups and recommends starting with wall push-ups.",
+ "Man leans against the house, extends his arms while speaking.",
+ "Scene changes to the man performing incline push-ups on the deck railing.",
+ "Text 'Incline Pushups Goal: 3 Sets of 40' appears.",
+ "Man uses a chair for another variation of incline push-ups.",
+ "Man stands closer to the deck railing, gesturing his hand between his hip and the railing, suggesting an ideal height of the chair.",
+ "Man performs incline push-ups on the deck railing.",
+ "Voiceover explains using different stable surfaces for incline push-ups.",
+ "Scene shifts to man on a blue exercise mat, doing standard push-ups in a lush outdoor setting with a satellite dish visible.",
+ "Man transitions to kneeling push-ups on the mat, shown from a side angle.",
+ "Text 'Kneeling Pushups Goal: 2 Sets of 25' appears.",
+ "Man explains kneeling push-ups and emphasizes they're not 'girl push-ups'.",
+ "Scene transitions back to the front, showing man doing kneeling push-ups.",
+ "Scene transitions to man doing full push-ups.",
+ "Clip ends with man raising his left hand as if to high-five the camera.",
+ "Video shows man sitting on blue mat with text 'Pushups should not hurt.' in white letters.",
+ "Man gestures while sitting and speaking about proper form and safety.",
+ "Voiceover advises working on easier variations if experiencing joint pain.",
+ "Final clip shows text 'Muscles develop faster than joints.' in white letters.",
+ "Screen presents 'Longer Video: hybridcalisthenics.com/pushups' and 'Free Fitness Routine: hybridcalisthenics.com/routine.'",
+ "Concluding text 'Follow & Like For More :)' appears in white letters.",
+ "Man concludes by suggesting watching a full YouTube video and wishes a beautiful day.",
+ "Man's voice complements the motivational and instructional content throughout video.",
+ "Consistent background music plays throughout."
+ ]
+ },
+ {
+ "video": "video/172.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens in a lively Sephora store, where shelves are neatly stocked with Fenty Beauty by Rihanna products. Text reading \"Working at Sephora \ud83d\ude33\" appears prominently at the top of each frame. A man is featured in two different attires: one in a stylish black top, the other in a red plaid shirt with a matching red towel wrapped around his head, accessorized with large, ornate earrings. This alternating fashion suggests a playful transformation in attire while in the store.\n\nAs he interacts with customers, the audio reveals a cheerful male voice welcoming, \"Hi, welcome to Sephora!\" Then the man in red says, \"Oh, that's a lot.\" The man in black says, \"Yeah, Pride Month so.\" This is followed by a jovial conversation between him and another female voice about Pride Month, with the man in red humorously asking, \"Oh, that's right. When's my month?\" Then the man in black says, \"Every month.\" The discussion transitions smoothly with another female voice commenting on the makeup: \"Anyway, I'm looking for a long-lasting lip that isn't drying. Every time I kiss my husband, and I kiss him a lot, he complains that I get makeup on his face.\"\n\nThe video continues with the same dynamic visuals and playful alternation between the two outfits. The conversation builds as the woman elaborates, \"and he can't have makeup on his face. No, no, he's a real man's man.\" This is contrasted by a slightly humorous observation, \"The gentleman over there in the penny loafers?\" The man in red nods his head. Then the man in black says, \"Anyway, we do.\" He proceeds to discuss specific products, with the male voice highlighting, \"Fenty Beauty just launched the new Palsicle Hydrating lip stain.\"\n\nMaintaining the same continuity, the visuals keep showing the two different looks as the man enthusiastically continues his product demonstration, \"this is what I'm wearing. 
It's a super innovative formula; it goes on glossy, leaves behind a soft tint, and lasts all day. I'll actually remove the gloss so you can see the tint.\" He takes a moment to demonstrate the product swatches on her hand, holding a white cloth. The man kisses the white cloth in his hand and then shows the cloth, still clean, to demonstrate that the lipstick does not fade.\n\nFurther into the video, the woman continues to extol the product's virtues, \"No feathering or bleeding, and it comes in four shades.\" The man in black shows the four shades on his hand. The man in red responds with enthusiastic approval, \"Oh, well that is just perfect. I'll take all of them.\" His approval garners a playful remark, \"Maybe you do deserve your own month.\" The conversation concludes with a heartfelt, \"Oh, thank you.\"\n\nFinally, the vibrant atmosphere of the Sephora store remains consistent as the video nears its end. The man, still in his red plaid and towel ensemble, makes an expressive hand gesture. An excited voice punctuates the light-hearted tone with a playful caution, \"Just one. Don't get greedy.\"\n\nThe consistent background of the Sephora store and the alternating outfits of the man, paired with engaging and humorous interactions, create a cohesive and enjoyable narrative capturing the lively spirit of a day \"Working at Sephora\"."
+ }
+ ],
+ "events": [
+ "The video opens in a lively Sephora store.",
+ "Shelves are neatly stocked with Fenty Beauty by Rihanna products.",
+ "Text reading 'Working at Sephora \ud83d\ude33' appears at the top of each frame.",
+ "A man is featured in two different attires: a stylish black top and a red plaid shirt with a matching red towel wrapped around his head, accessorized with large, ornate earrings.",
+ "The man interacts with customers.",
+ "A cheerful male voice says, 'Hi, welcome to Sephora!'",
+ "The man in red says, 'Oh, that's a lot.'",
+ "The man in black says, 'Yeah, Pride Month so.'",
+ "A jovial conversation between the man and another female voice about Pride Month ensues.",
+ "The man in red humorously asks, 'Oh, that's right. When's my month?'",
+ "The man in black says, 'Every month.'",
+ "A female voice comments, 'Anyway, I'm looking for a long-lasting lip that isn't drying. Every time I kiss my husband, and I kiss him a lot, he complains that I get makeup on his face.'",
+ "The video continues with dynamic visuals and alternating outfits.",
+ "The woman elaborates, 'And he can't have makeup on his face. No, no, he's a real man's man.'",
+ "A humorous observation is made: 'The gentleman over there in the penny loafers?'",
+ "The man in red nods his head.",
+ "The man in black says, 'Anyway, we do.'",
+ "The man discusses specific products, highlighting, 'Fenty Beauty just launched the new Palsicle Hydrating lip stain.'",
+ "The man demonstrates the product enthusiastically, 'This is what I'm wearing. It's a super innovative formula; it goes on glossy, leaves behind a soft tint, and lasts all day. I'll actually remove the gloss so you can see the tint.'",
+ "He demonstrates the product swatches on the woman's hand, holding a white cloth.",
+ "The man kisses the white cloth and shows the cloth is still clean, indicating the lipstick does not fade.",
+ "The woman extols the product's virtues, 'No feathering or bleeding, and it comes in four shades.'",
+ "The man in black shows the four shades on his hand.",
+ "The man in red responds with enthusiastic approval, 'Oh, well that is just perfect. I'll take all of them.'",
+ "A playful remark is made, 'Maybe you do deserve your own month.'",
+ "The conversation concludes with a heartfelt, 'Oh, thank you.'",
+ "The vibrant atmosphere of the Sephora store remains consistent as the video nears its end.",
+ "The man in red plaid and towel ensemble makes an expressive hand gesture.",
+ "An excited voice cautions playfully, 'Just one. Don't get greedy.'",
+ "The consistent background of the Sephora store and alternating outfits capture the lively spirit of 'Working at Sephora'."
+ ]
+ },
+ {
+ "video": "video/173.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a man in a white t-shirt with a red and black graphic design holding a large yellow fruit identified as a pomelo in front of his chest. The kitchen background has warm lighting, white cabinets, and a metallic pot on the counter. The man rotates the pomelo, showing its smooth, yellow surface, transitioning to a close-up that highlights the fruit's texture and size. The fruit is shown wrapped in pink packaging with Chinese characters and a cartoon figure, featuring the text '\u798f' (fu, meaning 'fortune' or 'luck'). The man removes the packaging, revealing the unwrapped fruit and displaying it prominently. \n\nAudio: A man says: \"Let's try a giant lemon. It's not actually a lemon but it does look like one. It's called a pomelo and I always see these in Asian supermarkets and they're always wrapped up in nice packaging.\" This is followed by the sound of smelling things.\n\nThe scene transitions with the man smelling the pomelo, then placing it on a wooden cutting board and using a large knife to cut into it. The video depicts a close-up of the inside of the cut pomelo, revealing the red, juicy flesh inside. He then holds up the halved pomelo, displaying its vibrant red interior, ending with a close-up of the pomelo's interior, showing the texture and seeds.\n\nAudio: The same man says: \"It smells almost like an orange or a grapefruit. Looks more like a grapefruit on the inside. I'm not even sure how to eat this.\" Music plays in the background.\n\nNext, the man cuts the pomelo into smaller pieces. The scene transitions to the man holding a slice of the red-fleshed pomelo, showcasing its vibrant interior and thick rind. The same kitchen setting is maintained. The man brings the whole pomelo close to his mouth, takes a bite of it, then holds the slice again, rotating it slightly to provide a detailed view of its texture and seeds. 
The video ends with the man holding the pomelo slice in front of his chest.\n\nAudio: The man continues: \"It does remind me of watermelon. It's almost like an orange. I thought it would be sour like a grapefruit but it's not really sour. It's not as juicy so there's seems to be like a hard skin in the center that you don't really eat.\"\n\nThen, focusing closely again on the man holding the pomelo slice, he points to its juicy texture. The man compares the halved pomelo to the slice, showcasing both, and then holds up the halved pomelo, once more displaying its vibrant red interior. The clip ends with the man holding the slice of pomelo in front of his chest and speaking once again.\n\nAudio: The man says: \"It went from looking like a lemon on the outside, a grapefruit on the inside, to a watermelon just by the way that I sliced it and it tastes like an orange.\" The sound of a guitar accompanies his words.\n\nLastly, the man is shown holding a slice of the red-fleshed pomelo in his left hand and a plate with additional pomelo slices in his right hand. The text on the person's t-shirt reads 'IF YOU EVEN DREAM OF BEATING ME... YOU'D BETTER WAKE UP AND APOLOGIZE.' The consistent kitchen background is maintained. He puts half a pomelo on the table and brings the slice of pomelo close to his mouth as if preparing to eat it. \n\nAudio: The man concludes with: \"That's a pomelo.\" His final words are accompanied by his laughter."
+ }
+ ],
+ "events": [
+ "Man in a white t-shirt with a red and black graphic design holds a large yellow pomelo in front of his chest",
+ "Kitchen background with warm lighting, white cabinets, and a metallic pot on the counter",
+ "Man rotates the pomelo, showing its smooth, yellow surface",
+ "Close-up of the pomelo highlighting its texture and size",
+ "Pomelo shown wrapped in pink packaging with Chinese characters and a cartoon figure",
+ "Text on packaging reads '\u798f' (fu, meaning 'fortune' or 'luck')",
+ "Man removes the packaging, revealing the unwrapped pomelo",
+ "Man displays the unwrapped pomelo prominently",
+ "Audio: Man says: 'Let's try a giant lemon. It's not actually a lemon but it does look like one. It's called a pomelo and I always see these in Asian supermarkets and they're always wrapped up in nice packaging.'",
+ "Sound of smelling",
+ "Scene transitions to the man smelling the pomelo",
+ "Man places pomelo on a wooden cutting board and uses a large knife to cut into it",
+ "Close-up of the inside of the cut pomelo, revealing the red, juicy flesh",
+ "Man holds up the halved pomelo, displaying its vibrant red interior",
+ "Close-up of pomelo's interior showing texture and seeds",
+ "Audio: Man says: 'It smells almost like an orange or a grapefruit. Looks more like a grapefruit on the inside. I'm not even sure how to eat this.'",
+ "Background music plays",
+ "Man cuts the pomelo into smaller pieces",
+ "Scene transitions to the man holding a slice of the red-fleshed pomelo, showcasing its vibrant interior and thick rind",
+ "Same kitchen setting is maintained",
+ "Man brings the whole pomelo close to his mouth and takes a bite of it",
+ "Man holds the slice again, rotating it slightly to show its texture and seeds",
+ "Video ends with the man holding the pomelo slice in front of his chest",
+ "Audio: Man says: 'It does remind me of watermelon. It's almost like an orange. I thought it would be sour like a grapefruit but it's not really sour. It's not as juicy so there's seems to be like a hard skin in the center that you don't really eat.'",
+ "Man points to the juicy texture of the pomelo slice",
+ "Man compares the halved pomelo to the slice, showcasing both",
+ "Man holds up the halved pomelo displaying its vibrant red interior again",
+ "Clip ends with the man holding the pomelo slice in front of his chest and speaking once again",
+ "Audio: Man says: 'It went from looking like a lemon on the outside, a grapefruit on the inside, to a watermelon just by the way that I sliced it and it tastes like an orange.'",
+ "Sound of a guitar accompanies his words",
+ "Man shown holding a slice of red-fleshed pomelo in his left hand and a plate with additional pomelo slices in his right hand",
+ "Text on the person's t-shirt reads 'IF YOU EVEN DREAM OF BEATING ME... YOU'D BETTER WAKE UP AND APOLOGIZE.'",
+ "Consistent kitchen background is maintained",
+ "Man puts half a pomelo on the table",
+ "Man brings the slice of pomelo close to his mouth as if preparing to eat it",
+ "Audio: Man concludes with: 'That's a pomelo.' Final words accompanied by his laughter"
+ ]
+ },
+ {
+ "video": "video/174.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video follows the journey of a woman with vibrant pink hair as she investigates a drink discovered on TikTok. \n\nStarting outside on a sunny day with a clear blue sky and green trees, the woman walks past a building displaying 'STARBUCKS COFFEE' and the address '6290'. The woman, dressed in a dark green shirt with large yellow text, heads into the Starbucks, where a rainbow flag decorates the window. The woman's excited voice declares, \"Can't be true. This can't be true. I have to find out for myself right now. I see this on TikTok. Let's see if it actually works though.\" The sounds of a car driving by are heard, setting the scene in a bustling environment.\n\nInside Starbucks, the investigation continues as the text 'It's supposed to taste like Disney Dole Whip' flashes on the screen. The ambiance reveals wooden beams and hanging lights. The woman receives a light yellow drink topped with whipped cream and observes, \"Did I get a venti vanilla bean frappe? You know, with sweet cream instead of the milk and add pineapple base. Too complicated or no? Okay, okay good. Okay, here it is.\" Background music plays calmly, complementing the moment of anticipation.\n\nThe focus remains on the woman holding the drink, gesturing animatedly with red-polished nails. The background showcases a brightly lit window and a framed picture, immersing the scene in natural light. The enthusiastic narrative continues: \"Okay, this is supposed to taste like the Dole Whip from Disney. And it's like the Frappuccino Starbucks version. So I'm going to try it.\" Adding texture to the scene, the audio briefly shifts to the sound of a soda bottle opening.\n\nFinally, the woman enthusiastically presents the drink to the camera and tastes it, emphasizing her experience. The vibrant surroundings continue with natural light streaming through the window and artistic decor on the wall. 
As she gestures and describes the drink, the voice joyfully concludes, \"It's so good. It's like pineapple like creamy like pina colada it's so good,\" followed by a gasp of delight. The consistency of the scene and the continuous engagement of the woman with her drink provide a lighthearted and dynamic video journey."
+ }
+ ],
+ "events": [
+ "Woman with vibrant pink hair begins journey outside on a sunny day.",
+ "Clear blue sky and green trees are visible.",
+ "Woman walks past building with 'STARBUCKS COFFEE' sign and address '6290'.",
+ "Woman is dressed in a dark green shirt with large yellow text.",
+ "Rainbow flag decorates the Starbucks window.",
+ "Woman's excited voice says, 'Can't be true. This can't be true. I have to find out for myself right now. I see this on TikTok. Let's see if it actually works though.'",
+ "Sounds of a car driving by are heard, indicating a bustling environment.",
+ "Inside Starbucks, text 'It's supposed to taste like Disney Dole Whip' flashes on the screen.",
+ "The Starbucks interior has wooden beams and hanging lights.",
+ "Woman receives a light yellow drink topped with whipped cream.",
+ "Woman comments, 'Did I get a venti vanilla bean frappe? You know, with sweet cream instead of the milk and add pineapple base. Too complicated or no? Okay, okay good. Okay, here it is.'",
+ "Calm background music plays.",
+ "Focus on woman holding the drink with red-polished nails.",
+ "Background shows a brightly lit window and a framed picture.",
+ "Woman narrates, 'Okay, this is supposed to taste like the Dole Whip from Disney. And it's like the Frappuccino Starbucks version. So I'm going to try it.'",
+ "Sound of a soda bottle opening is heard briefly.",
+ "Woman presents the drink to the camera and tastes it.",
+ "Natural light streams through the window.",
+ "Artistic decor is visible on the wall.",
+ "Woman describes the drink, 'It's so good. It's like pineapple like creamy like pina colada it's so good,' followed by a gasp of delight.",
+ "Scene remains consistently vibrant and lighthearted."
+ ]
+ },
+ {
+ "video": "video/175.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a split-screen view. The top half displays a game interface with bold yellow and white letters spelling out 'THAT SECRET TOP 100 ARMY'. Below this text, we see a replay in progress of a fortified base, with a countdown timer that starts at 2 minutes and 57 seconds. The bottom half of the screen features a man in a red shirt adorned with cartoon characters and a black cap speaking into a microphone. Behind him are shelves filled with items and a colorful lighting setup. \n\nThe man animatedly explains, \"So this is the attack strategy that did not make it in the top five Builder Hall attacks because it was considered too hard for you guys. And the army itself is Barch. I know.\" The scene then shifts to a close-up of the speaker, who continues speaking and gesturing with his hands. White bold text with a black outline stating 'TOO... HARD?' appears on the screen. \n\nThe clip transitions back to the split-screen view, showing the game interface timer at 2 minutes and 45 seconds while the man continues speaking. As the video progresses, the speaker remarks, \"It's kind of crazy when you think about it that Barch is considered a global top 100 army. But we're going to speed this up because this attacker takes exactly 2 minutes and 51 seconds to get 82% of this base down.\" In the background, the faint sounds of a helicopter flying can be heard. The timer on the game interface counts down from 2 minutes and 43 seconds to 2 minutes and 28 seconds.\n\nNext, the screen shifts to a close-up of the game, showing 'Demon' achieving 82% destruction and 'Black Castle' achieving 56%. Below this information are two 'Replay' buttons. The view transitions again to the fortified base under attack, with the timer now showing 'Replay ends in 2m 17s'. Various defensive structures and troops come into action, with explosions visible as the timer counts down from 2 minutes and 17 seconds to 2 minutes and 5 seconds. 
The narrator comments, \"Starting out with the use of bombers, and the bombers are gonna open up walls for the barbarians. Rage barbarians are gonna pick off buildings a little bit here and there so I can see where the patience is required.\" This is accompanied by the sound of a plane flying overhead.\n\nContinuing with the close-up of the fortified base, the scene indicates 'Replay ends in: 2M 3S'. Explosions and attacks are prevalent as the battle rages on. The timer decreases from 2 minutes and 3 seconds to 1 minute and 45 seconds. The speaker explains, \"drawing the guard post troops out. He's got some sneaky archers down the map taking out a few more buildings, doing some damage.\" The speaker's commentary is consistent with the ongoing action on the screen.\n\nThe timer now shows 'Replay ends in 1M 43S' as the focus remains on the game interface. The countdown continues, emphasizing the intensity of the battle. The speaker says, \"We're gonna have to do a little faster than this. Sneaky archers are taking down cannons, double cannons, and there go the guard post troops being taken out by the Rage Barbarians. Here comes the battle machine in a whole wave of the Rage Barbarians to take out the Builder Hall and more buildings, finishing out at exactly 82%.\" Background music begins to play, adding to the excitement.\n\nFinally, the timer displays 'Replay ends in 18S'. The battle continues fiercely, with the countdown reaching zero. Explosions and defensive actions persist as the scene finally shifts to the battle results screen, showing 'Total damage: 82%' with 'Troops expended: 60 Barbarians, 5 Archers, 28 Giants, 30 Wizards'. The result is two stars. \n\nThe narrative seamlessly ties together the gaming tactics and the strategic considerations, enhanced by the speaker's enthusiastic and detailed explanations. 
The dramatic countdown and visual explosions underscore the intensity of the in-game battle while aligning perfectly with the speaker's voice-over."
+ }
+ ],
+ "events": [
+ "Split-screen view appears: Top half shows game interface and bottom half shows a man in a red shirt and black cap speaking.",
+ "Top half of the split-screen has bold yellow and white letters spelling out 'THAT SECRET TOP 100 ARMY'.",
+ "Replay of a fortified base in progress with a countdown timer starting at 2:57 minutes on the top half.",
+ "Man on the bottom half speaking into a microphone with shelves and colorful lighting in the background.",
+ "Man explains why the attack strategy is not in the top five Builder Hall attacks.",
+ "Scene shifts to a close-up of the speaker continuing to speak and gesture.",
+ "White bold text with black outline stating 'TOO... HARD?' appears on the screen.",
+ "Transition back to split-screen, game timer at 2:45 minutes while the man continues speaking.",
+ "Speaker remarks on Barch being considered a global top 100 army and mentions speeding up the video.",
+ "Faint sounds of a helicopter flying in the background.",
+ "Game timer counts down from 2:43 minutes to 2:28 minutes.",
+ "Screen shifts to a close-up of the game results showing 'Demon' achieving 82% and 'Black Castle' 56% with two 'Replay' buttons.",
+ "Transition back to the fortified base under attack with timer showing 'Replay ends in 2m 17s'.",
+ "Various defensive structures and troops come into action with explosions visible.",
+ "Timer counts down from 2:17 minutes to 2:05 minutes.",
+ "Narrator comments on the use of bombers and Rage barbarians.",
+ "Sound of a plane flying overhead.",
+ "Close-up of the fortified base with timer indicating 'Replay ends in: 2M 3S'.",
+ "Explosions and attacks continue as the timer decreases from 2:03 minutes to 1:45 minutes.",
+ "Speaker explains actions including drawing guard post troops and sneaky archers taking out buildings.",
+ "Timer shows 'Replay ends in 1M 43S' with focus on the game interface.",
+ "Countdown continues emphasizing battle intensity with speaker's commentary consistent with screen action.",
+ "Speaker details sneaky archers taking down cannons and Rage Barbarians taking out guard post troops.",
+ "Battle machine and a wave of Rage Barbarians attack Builder Hall and other buildings, achieving 82% damage.",
+ "Background music begins to play.",
+ "Timer displays 'Replay ends in 18S'.",
+ "Battle continues fiercely until countdown reaches zero with ongoing explosions and defensive actions.",
+ "Scene shifts to battle results screen showing 'Total damage: 82%' and 'Troops expended: 60 Barbarians, 5 Archers, 28 Giants, 30 Wizards'.",
+ "Result is two stars.",
+ "Narrative ties together gaming tactics and strategic considerations with speaker's enthusiastic and detailed explanations.",
+ "Dramatic countdown and visual explosions underscore the intensity of the in-game battle."
+ ]
+ },
+ {
+ "video": "video/176.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a title screen that reads 'The Lie You've Been Told about Katana' in bold text, with 'Lie' highlighted in red. The background shows a person holding a katana. The scene transitions to a close-up of a man dressed in black traditional Japanese attire, holding a katana by its sheath. The screen displays blue text corresponding to the real-time audio at the bottom throughout the video. In the audio, a male voice asserts, \"The lie you've been told about katana. The movement you do to pull the katana from the sheath is this, right?\"\n\nNext, a close-up follows of the man's hands on the katana, preparing to pull the katana from the sheath. The visual switches to another man in traditional attire sitting and speaking, with the text 'THE LIE YOU'VE BEEN TOLD ABOUT KATANA' in yellow on the top, accompanied by the spoken explanation, \"Even in iaido that I train in, we are taught this movement called \u9bc9\u53e3\u3092\u5207\u308b (Koikuchi-wo-kiru, meaning 'cutting the koi-guchi'),\" followed by the man dressed in black traditional Japanese attire pulling the katana out.\n\nTransitioning to the next scene, the same man is shown sitting and speaking, dressed in light blue traditional attire, with a wall featuring a gold play button and a framed scroll in the background. Meanwhile, the audio adds context, \"However, there are some opinions that say that this is not practical.\"\n\nThe man dressed in black traditional Japanese attire then stands and demonstrates the katana draw. The explanation continues in the audio, \"When you happen to draw your sword in a hurry in a real fight,\". The next frame shows a close-up of the man in black traditional Japanese attire, zooming in on his hands on the katana. 
The voiceover continues, \"there will be a slight moment before you put your thumb on the tsuba.\"\n\nA close-up of a person's thumb on the tsuba follows, as the audio warns, \"or you may rush and cut your thumb.\" The video transitions to the man in black traditional attire putting the katana upside down. Concurrently, the voice explains, \"However, Japanese swords are tightly fastened so they cannot be easily pulled out.\"\n\nFollowing this, the screen turns gray and the same man in light blue traditional attire appears again, folding his hands, with the query in the audio, \"Then how are you supposed to pull the sword out?\" The following frame shows the man in light blue traditional attire opening his hands to the camera, leading into the audio answer, \"The answer is this.\"\n\nThe video then shows a close-up of the man in black traditional attire gripping the handle of a katana. The voice continues, \"Just by gripping it in, you can actually draw the katana.\" The scene transitions to a wider shot of the person demonstrating the technique. The man in black traditional attire draws the katana smoothly from its sheath as the audio confirms, \"This way, you can eliminate the disadvantage of using your thumb.\" The final frame shows the man in light blue traditional attire pointing and speaking, \"Please try it the next time you have a chance to pull out a katana.\"\n\nThe concluding segment presents the man in light blue traditional Japanese attire gesturing and speaking with familiar background elements, including the gold play button and a framed scroll. The subsequent frame transitions to a promotional image with the text 'KATANA HOME TRAINING TUTORIAL' in bold white and red letters, featuring a person holding a katana. 
The audio continues, \"If, like Lorimar, you would like to learn more about Japanese traditional culture, Kyoto...\"\n\nThen two promotional images follow, one with the text 'CUTTING MATS IN ARMOR' in blue and another with 'KYOTO GOING BANKRUPT' in bold red letters, indicating different video topics. The closing frame shows a 'SUBSCRIBE' button with a hand pointing towards it, accompanied by the invitation, \"please check out my channel and subscribe.\""
+ }
+ ],
+ "events": [
+ "Title screen reads 'The Lie You've Been Told about Katana' with 'Lie' highlighted in red.",
+ "Background shows a person holding a katana.",
+ "Scene transitions to a close-up of a man dressed in black traditional Japanese attire holding a katana by its sheath.",
+ "Blue text corresponding to real-time audio is displayed at the bottom throughout the video.",
+ "Male voice asserts, 'The lie you've been told about katana. The movement you do to pull the katana from the sheath is this, right?'",
+ "Close-up of the man's hands on the katana preparing to pull it from the sheath.",
+ "Visual switches to another man in traditional attire sitting and speaking.",
+ "Text 'THE LIE YOU'VE BEEN TOLD ABOUT KATANA' in yellow is displayed on top.",
+ "Spoken explanation: 'Even in iaido that I train in, we are taught this movement called \u9bc9\u53e3\u3092\u5207\u308b (Koikuchi-wo-kiru, meaning 'cutting the koi-guchi')'.",
+ "Man dressed in black traditional Japanese attire pulls the katana out.",
+ "Scene transitions to the same man sitting and speaking, now dressed in light blue traditional attire.",
+ "Background features a gold play button and a framed scroll.",
+ "Audio adds context: 'However, there are some opinions that say that this is not practical.'",
+ "Man dressed in black traditional Japanese attire stands and demonstrates the katana draw.",
+ "Explanation continues: 'When you happen to draw your sword in a hurry in a real fight,'",
+ "Close-up of the man's hands on the katana.",
+ "Voiceover continues: 'there will be a slight moment before you put your thumb on the tsuba.'",
+ "Close-up of a person's thumb on the tsuba.",
+ "Audio warns: 'or you may rush and cut your thumb'",
+ "Man in black traditional attire puts the katana upside down.",
+ "Voice explains: 'However, Japanese swords are tightly fastened so they cannot be easily pulled out.'",
+ "Screen turns gray and the same man in light blue traditional attire appears, folding his hands.",
+ "Query in the audio: 'Then how are you supposed to pull the sword out?'",
+ "Man in light blue traditional attire opens his hands to the camera.",
+ "Audio answer: 'The answer is this.'",
+ "Close-up of the man in black traditional attire gripping the handle of a katana.",
+ "Voice continues: 'Just by gripping it in, you can actually draw the katana.'",
+ "Scene transitions to a wider shot of the person demonstrating the technique.",
+ "Man in black traditional attire draws the katana smoothly from its sheath.",
+ "Audio confirmation: 'This way, you can eliminate the disadvantage of using your thumb.'",
+ "Final frame shows the man in light blue traditional attire pointing and speaking.",
+ "Audio: 'Please try it the next time you have a chance to pull out a katana.'",
+ "Concluding segment presents the man in light blue traditional Japanese attire gesturing and speaking.",
+ "Familiar background elements including the gold play button and a framed scroll.",
+ "Frame transitions to a promotional image with text 'KATANA HOME TRAINING TUTORIAL' in bold white and red letters.",
+ "Image features a person holding a katana.",
+ "Audio continues: 'If, like Lorimar, you would like to learn more about Japanese traditional culture, Kyoto...'",
+ "Two promotional images follow: one with text 'CUTTING MATS IN ARMOR' in blue and another with 'KYOTO GOING BANKRUPT' in bold red letters.",
+ "Closing frame shows a 'SUBSCRIBE' button with a hand pointing towards it.",
+ "Invitation: 'please check out my channel and subscribe.'"
+ ]
+ },
+ {
+ "video": "video/177.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video focuses on a seating dispute that takes place inside the luxurious interior of an airplane with spacious seats.\n\nA man, presumably the passenger, wearing a gray cap and a black leather jacket holds a boarding pass and says, \"I think you're in my seat.\" The scene shifts to a man in a gray hoodie and glasses, presumably the husband, who replies, \"Do you mind switching seats with my wife? We're on our honeymoon and we want to sit together.\" The camera then shows a woman in a floral jacket, presumably the wife, sitting in another seat and stroking her hair. It is worth mentioning that the passenger, the husband, and the wife are all played by the same person. The passenger queries, \"Ok where is your wife's seat?\" The wife in the floral jacket holds her boarding pass and responds, \"That's my seat.\" The audio reinforces this exchange with the woman speaking again, adding a thank you. The screen displays text corresponding to the audio in real time.\n\nContinuing the dispute, the passenger in the gray cap and black leather jacket emphasizes, \"That's in economy... I paid for business.\" The camera cuts to the husband in the gray hoodie and glasses, who responds, \"Wow, we're on our honeymoon... 
can't you show some kindness??\" The passenger insists, \"Can I please have my seat?\" The flight attendant in a red uniform appears and inquires, \"What seems to be the problem here?\" The husband in the gray hoodie and glasses explains, \"This guy said we can change seats and now he's taking it back,\" to which the passenger in the gray cap and black leather jacket counters, \"I never said that!\" The audio follows this argument, with alternating statements from the participants.\n\nThe dispute escalates as the wife in the floral jacket holds a boarding pass and attempts to validate her claim, saying, \"He said he would switch with me.\" The flight attendant in the red uniform reacts sternly, \"I actually already took a look at your ticket and you got a free upgrade and now you're trying to steal this passenger's seat,\" with the audio capturing the flight attendant speaking quickly and loudly. The flight attendant continues with gestures, \"So I'm gonna give you two options, either both of you go to economy or both of you get off this flight, on your own or with security.\" The audio remains consistent with the flight attendant's ultimatum, followed by a revving car engine sound implying rising tension.\n\nAs the confrontation nears its conclusion, the husband in the gray hoodie and glasses defiantly says, \"No we're not moving.\" The flight attendant calls out for \"Security!\" prompting the husband's more aggressive response, \"You can't do this I'm gonna sue!\" The video shows a security officer pulling the husband's gray hoodie, preparing to drag him away. 
The final frames show the flight attendant waving and saying, \"Have a nice day!\" The audio lines up with the visuals as the woman speaks, followed by the man's low-pitch retort, and closing with the flight attendant's cheerful farewell.\n\nThroughout the video, the luxurious setting of the airplane, complete with wide, comfortable seats, provides a stark contrast to the escalating conflict among the passengers and crew."
+ }
+ ],
+ "events": [
+ "Scene inside a luxurious airplane with spacious seats.",
+ "A passenger in a gray cap and black leather jacket holds a boarding pass.",
+ "Passenger says, 'I think you're in my seat.'",
+ "Camera shows a man in a gray hoodie and glasses, presumed to be the husband.",
+ "Husband replies, 'Do you mind switching seats with my wife? We're on our honeymoon and we want to sit together.'",
+ "Camera shows a woman in a floral jacket, presumably the wife, sitting in another seat and stroking her hair.",
+ "Passenger asks, 'Ok where is your wife's seat?'",
+ "Wife in floral jacket holds her boarding pass and responds, 'That's my seat.'",
+ "Audio reinforces the dialogue with wife's thank you.",
+ "Text displaying the conversation appears on screen.",
+ "Passenger in gray cap and black leather jacket emphasizes, 'That's in economy... I paid for business.'",
+ "Camera cuts to husband in gray hoodie and glasses who replies, 'Wow, we're on our honeymoon... can't you show some kindness??'",
+ "Passenger insists, 'Can I please have my seat?'",
+ "A flight attendant in a red uniform appears and asks, 'What seems to be the problem here?'",
+ "Husband explains, 'This guy said we can change seats and now he's taking it back.'",
+ "Passenger counters, 'I never said that!'",
+ "Wife in floral jacket holds her boarding pass and attempts to validate her claim, saying, 'He said he would switch with me.'",
+ "Flight attendant reacts sternly, 'I actually already took a look at your ticket and you got a free upgrade and now you're trying to steal this passenger's seat.'",
+ "Flight attendant continues, 'So I'm gonna give you two options, either both of you go to economy or both of you get off this flight, on your own or with security.'",
+ "Audio remains consistent with flight attendant's ultimatum.",
+ "Revving car engine sound implies rising tension.",
+ "Husband defiantly says, 'No we're not moving.'",
+ "Flight attendant calls out for 'Security!'",
+ "Husband responds aggressively, 'You can't do this I'm gonna sue!'",
+ "Security starts pulling husband's gray hoodie, preparing to drag him away.",
+ "Flight attendant waves and says, 'Have a nice day!'",
+ "Audio aligns with visuals, concluding with flight attendant's cheerful farewell.",
+ "Luxurious airplane setting with wide, comfortable seats contrasts with the escalating conflict."
+ ]
+ },
+ {
+ "video": "video/178.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "\nThe video showcases a humorous interaction in a medieval fantasy setting, revolving around the limitations of a dungeon party.\n\nIn the beginning, two characters stand in an open field dressed in medieval fantasy attire. The woman on the left wears a white and blue outfit with a red cape and holds a staff, while the man on the right is clad in chainmail armor with a brown leather vest, holding a sword. As the text \"EPIC NPC MAN\" and \"DUNGEON PARTY LIMIT\" appear at the top, a third character, a man in a green and brown outfit with a bow and quiver of arrows, joins them. Text bubbles at the bottom read, \"Hey what's up, guys?\" and \"Ayyyy, there he is.\" They greet each other and chat, as a fourth character, a man in scale armor, appears in a different scene, saying, \"Okay, sweet, we've got a full team. Nice. The plan...\" before greeting another character who shows up in a red scarf and leather armor with \"Hey guys!\".\n\nNext, the video shifts to the character in the red scarf asking, \"Are you going to the Darkstone Depths?\" near a stone arch in a forest. A character from the group responds, \"Yeah, we're just about to leave,\" to which the red-scarfed character replies, \"Oh sweet, I'll tag along.\" But the character in scale armor interjects, \"Oh, we've got a team already,\" trying to decline politely as the red-scarfed character insists, \"Oh that's all good, I can just join the party,\" met with \"No, no, no\" from the scale-armored character. The audio captures the cheerful and insistent tone of the conversation.\n\nThe video continues with a close-up of the scale-armored character asserting with a smile, \"Sorry, Y-you can't join the party.\" Cut to the three characters in the field, where the character with a bow offers, \"Guys, I don't mind jumping out if Rowan wants to...\" hinting at making a sacrifice. The scale-armored character quickly interjects, \"No, No, Alan, it's fine. You don't have to. You can't join!\" 
The red-scarfed character is visibly frustrated, exclaiming, \"What do you mean 'I can't join'?\" followed by \"Oh my god. I can easily come along. Five of us, in the Dungeon, fighting monsters together!\" This seems practical to everyone as the scale-armored character says, \"That would make it easier, wouldn't it? Just do that.\" They consider the feasibility until the red-scarfed character, feeling accused, asks, \"Sorry, Dude. Have I done something to offend you? Because I don't know why you're being such a dick about this,\" as the scale-armored individual tries to justify but is cut short. The conversation in the audio conveys confusion and slight tension about party limits. \n\nThe video delves deeper as the scale-armored character clarifies, \"It's physically impossible for more than four people to go into a dungeon at once.\" Highlighted text \"PHYSICALLY IMPOSSIBLE\" emphasizes this point, stirring frustration. When the red-scarfed character suggests, \"How 'bout we just give it a go, then, Ben?\" the scale-armored character reluctantly agrees, \"Fine.\" The character with a bow tries to offer help, \"Guys, seriously I don't mind-\" but is told, \"No, no, no, Alan, it's fine, we're just gonna give it a go. We'll just give it a crack.\" The audio maintains the conversation's frustrated and resigned atmosphere. The red-scarfed character declares, \"Yeah. Working as a team!\" as they start to walk. The scale-armored character says, \"Let's give it a go. Come on then.\" Comedic text appears above the red-scarfed character labeling him as \"A Lousy Player,\" and he collides with an invisible wall, while the other three characters, each labeled with a nickname, pass through it. They converse as they walk through the forest, with the scale-armored character pointing out, \"Isn't it weird that it's exactly like I said, Rowan?\" The video ends with their continued camaraderie, underlined by escalating background music."
+ }
+ ],
+ "events": [
+ "Two characters stand in an open field dressed in medieval fantasy attire.",
+ "The woman on the left wears a white and blue outfit with a red cape and holds a staff.",
+ "The man on the right is clad in chainmail armor with a brown leather vest and holds a sword.",
+ "Text 'EPIC NPC MAN' and 'DUNGEON PARTY LIMIT' appear at the top.",
+ "A third character in a green and brown outfit with a bow and quiver of arrows joins them.",
+ "Text bubbles at the bottom read, 'Hey what's up, guys?' and 'Ayyyy, there he is.'",
+ "They greet each other and chat.",
+ "A fourth male character in scale armor appears in a different scene.",
+ "Fourth character says, 'Okay, sweet, we've got a full team. Nice. The plan...'.",
+ "Another character in a red scarf and leather armor shows up saying, 'Hey guys!'.",
+ "The character in the red scarf asks, 'Are you going to the Darkstone Depths?' near a stone arch in a forest.",
+ "A character from the group responds, 'Yeah, we're just about to leave.'.",
+ "The red-scarfed character replies, 'Oh sweet, I'll tag along.'.",
+ "The scale-armored character interjects, 'Oh, we've got a team already,' trying to decline politely.",
+ "The red-scarfed character insists, 'Oh that's all good, I can just join the party.'.",
+ "The scale-armored character says, 'No, no, no.'.",
+ "Close-up of the scale-armored character asserting with a smile, 'Sorry, Y-you can't join the party.'.",
+ "Cut to three characters in the field.",
+ "The character with a bow offers, 'Guys, I don't mind jumping out if Rowan wants to...'.",
+ "The scale-armored character quickly says, 'No, No, Alan, it's fine. You don't have to. You can't join!'.",
+ "The red-scarfed character is visibly frustrated, exclaiming, 'What do you mean I can't join?'.",
+ "The red-scarfed character says, 'Oh my god. I can easily come along. Five of us, in the Dungeon, fighting monsters together!'.",
+ "The scale-armored character says, 'That would make it easier, wouldn't it? Just do that.'.",
+ "The red-scarfed character asks, 'Sorry, Dude. Have I done something to offend you? Because I don't know why you're being such a dick about this.'.",
+ "The scale-armored character tries to justify but is cut short.",
+ "The scale-armored character clarifies, 'It's physically impossible for more than four people to go into a dungeon at once.'.",
+ "Highlighted text 'PHYSICALLY IMPOSSIBLE' emphasizes this point.",
+ "The red-scarfed character suggests, 'How 'bout we just give it a go, then, Ben?'.",
+ "The scale-armored character reluctantly agrees, 'Fine.'.",
+ "The character with a bow tries to offer help again, 'Guys, seriously I don't mind-'.",
+ "The scale-armored character responds, 'No, no, no, Alan, it's fine, we're just gonna give it a go. We'll just give it a crack.'.",
+ "The red-scarfed character declares, 'Yeah. Working as a team!'.",
+ "They start to walk.",
+ "The scale-armored character says, 'Let's give it a go' and 'Come on then.'.",
+ "Comedic text appears above the red-scarfed character labeling him as 'A Lousy Player.'.",
+ "The red-scarfed character collides with an air wall.",
+ "The other three characters pass through what appears to be an invisible wall.",
+ "They converse as they walk through the forest.",
+ "The scale-armored character points out, 'Isn't it weird that? It's exactly like I said, Rowan?'.",
+ "The video ends with their continued camaraderie.",
+ "The background music escalates as the video ends."
+ ]
+ },
+ {
+ "video": "video/179.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a light brown dog standing on grass, holding a picture frame in its mouth. A red arrow points to the dog, while the text \"ANIMALS STEALING\" is displayed in white capital letters. A man's voice, high-pitched, comments, \"Animals stealing things that aren't theirs.\" The scene transitions to a golden retriever walking on a cobblestone street, carrying a pizza in its mouth. The background shows a few people and a parked car as the man jokingly says, \"Why did the dog cross the road? To steal a pizza!\" \n\nNext, a gray cat is shown on a table eating a cake, crumbs on its face and paws, with the cake partially eaten and crumbs scattered around as the voice remarks, \"This is the face of a cat that knows he did wrong, but he's probably gonna keep going.\" \n\nA blue bird appears, perched on a wooden table, holding several sugar packets in its beak, prompting the question, \"What are you even gonna do with those? You don't drink coffee.\" The scene then shows a large chocolate chip cookie and a cup of coffee on a wooden surface, as a small paw reaches for the cookie. The voice warns, \"You never know where a cat is gonna come from to try to steal your cookie.\"\n\nThe focus shifts to a gray tabby cat on a white chair, curiously inspecting a dark green handbag with brown straps. Initially, the cat's head is inside the bag; then it sits upright, looking directly at the camera. The voice remarks, \"This is the face of cat caught in the act, trying to steal from a purse.\" The scene then moves to a kitchen with multiple cats, five in total: two on the countertop, one in an open cabinet, and two more in another open cabinet, their eyes glowing due to the camera flash. The voice states, \"I would just leave. This is the cat's kitchen now.\" The final scene in this sequence features a black and white cat in a tree, holding a string of sausages in its mouth, with a building and greenery in the background. 
The voice questions, \"Where did you steal hot dogs from?\"\n\nA white dog lying on a brown rug is shown next, holding a stack of dollar bills with its paws, and more bills scattered around as the voice says, \"Mr. Dog, I don't think you even can do anything with that money.\" The scene then transitions to a black and white cat on a kitchen table with a white takeout box stuck on its head, prompting the comment, \"Cat is aggressively stealing this Chinese food.\" The concluding scene in this sequence features a white and gray cat outside a glass door, holding a sausage in its mouth, with a wooden deck and greenery in the background. The voice questions again, \"Where are the cats getting the hot dogs?\"\n\nA small dog with light brown and black fur lies on a multicolored knitted blanket on a couch, holding a white sock with pink stripes in its mouth. The man says, \"If you ever missing a sock, I might know the cop right.\" The scene then shows a black dog with a collar, lying on a patterned couch with one paw inside an orange handbag, with the voice stating, \"Those eyes! He knows he was busted!\" Finally, a woman in a gray suit crouches on a sidewalk, reaching out to three small ducklings, while a white duck with its beak inside her handbag is shown, the voice concluding, \"This family of ducks is working as a team, she's distracted by the cute little ducklings while she steals the money!\"\n\nThe final segment begins with a seagull standing on a tiled floor, holding a bag of chips in its beak, with a wooden cabinet and countertop in the background. 
A child's high-pitched voice questions, \"Do birds just like potato chips?\" As the video shows a man wearing a red shirt gesturing with his hands, the text \"Follow!\" is displayed in white capital letters at the bottom of the frame, and his voice encourages, \"Follow for more!\"\n\nThroughout the video, various music tracks and sound effects complement the humorous and light-hearted narration, enhancing the overall amusing and engaging tone of the content."
+ }
+ ],
+ "events": [
+ "A light brown dog standing on grass holding a picture frame in its mouth.",
+ "A red arrow points to the dog.",
+ "The text 'ANIMALS STEALING' is displayed in white capital letters.",
+ "A man's high-pitched voice comments, 'Animals stealing things that aren't theirs.'",
+ "The scene transitions to a golden retriever walking on a cobblestone street carrying a pizza in its mouth.",
+ "Background shows a few people and a parked car.",
+ "The man jokingly says, 'Why did the dog cross the road? To steal a pizza!'",
+ "A gray cat is shown on a table eating a cake with crumbs on its face and paws.",
+ "The cake is partially eaten and crumbs are scattered around.",
+ "The voice remarks, 'This is the face of a cat that knows he did wrong, but he's probably gonna keep going.'",
+ "A blue bird appears perched on a wooden table holding several sugar packets in its beak.",
+ "The voice prompts, 'What are you even gonna do with those? You don't drink coffee.'",
+ "A large chocolate chip cookie and a cup of coffee are shown on a wooden surface.",
+ "A small paw reaches for the cookie.",
+ "The voice warns, 'You never know where a cat is gonna come from to try to steal your cookie.'",
+ "Focus shifts to a gray tabby cat on a white chair inspecting a dark green handbag with brown straps.",
+ "The cat's head is inside the bag initially.",
+ "The cat sits upright and looks directly at the camera.",
+ "The voice remarks, 'This is the face of cat caught in the act, trying to steal from a purse.'",
+ "The scene moves to a kitchen with multiple cats (five in total): two on the countertop, one in an open cabinet, and two more in another open cabinet.",
+ "The cats' eyes are glowing due to the camera flash.",
+ "The voice states, 'I would just leave. This is the cat's kitchen now.'",
+ "Final scene in the sequence features a black and white cat in a tree holding a string of sausages in its mouth.",
+ "A building and greenery are in the background.",
+ "The voice questions, 'Where did you steal hot dogs from?'",
+ "A white dog lying on a brown rug holding a stack of dollar bills with its paws.",
+ "More bills are scattered around.",
+ "The voice says, 'Mr. Dog, I don't think you even can do anything with that money.'",
+ "The scene transitions to a black and white cat on a kitchen table with a white takeout box stuck on its head.",
+ "The voice comments, 'Cat is aggressively stealing this Chinese food.'",
+ "The concluding scene in the sequence features a white and gray cat outside a glass door holding a sausage in its mouth.",
+ "A wooden deck and greenery are in the background.",
+ "The voice questions again, 'Where are the cats getting the hot dogs?'",
+ "A small dog with light brown and black fur lies on a multicolored knitted blanket on a couch holding a white sock with pink stripes in its mouth.",
+ "The man says, 'If you ever missing a sock, I might know the cop right.'",
+ "The scene shows a black dog with a collar lying on a patterned couch with one paw inside an orange handbag.",
+ "The voice states, 'Those eyes! He knows he was busted!'",
+ "A woman in a gray suit crouches on a sidewalk reaching out to three small ducklings.",
+ "A white duck with its beak inside her handbag is shown.",
+ "The voice concludes, 'This family of ducks is working as a team, she's distracted by the cute little ducklings while she steals the money!'",
+ "The final segment begins with a seagull standing on a tiled floor holding a bag of chips in its beak.",
+ "A wooden cabinet and countertop are in the background.",
+ "A child's high-pitched voice questions, 'Do birds just like potato chips?'",
+ "The video shows a man wearing a red shirt gesturing with his hands.",
+ "The text 'Follow!' is displayed in white capital letters at the bottom of the frame.",
+ "His voice encourages, 'Follow for more!'",
+ "Various music tracks and sound effects complement the narration throughout the video."
+ ]
+ },
+ {
+ "video": "video/180.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a blank gray screen, which quickly transitions to a black screen featuring a rotating 3D logo that reads \"INSIDE edition\" in white and red text. The logo spins and zooms in, creating a visually engaging effect. The scene then introduces a bustling water park, filled with people enjoying the clear, blue water on blue inflatable tubes. Structures and trees in the background highlight its outdoor setting. The 'INSIDE edition' logo is now in the bottom left corner. A serious male voice then states: \"`SOMEBODY IN THIS POOL IS DROWNING. AND THE OTHER SWIMMERS DON'T EVEN KNOW IT. THERE THEY ARE, ENJOYING A WARM SUMMER DAY IN SOUTH CAROLINA.`\"\n\nThe scene continues at the water park, showcasing more people having fun on blue inflatable tubes. In the foreground, a person in a blue swimsuit sits on a tube while another person in blue swim trunks talks to them. The atmosphere is lively and carefree. However, the voiceover introduces an unsettling contrast: \"`NOTICE THE ARTIFICIAL WAVES MADE BY MACHINE IN THIS VIDEO POSTED ONLINE. NOW SEE THE 3 YEAR OLD CHILD. SHE'S DROWNING.`\" The background noise of a child's scream underscores the gravity of the situation.\n\nThe video remains in the water park, capturing the energetic environment around the pool, with more people swimming and floating on tubes. A person in a red swimsuit stands at the edge of the pool, holding what seems to be a flotation device. The water remains clear and inviting. The man\u2019s voice narrates urgently: \"`Thank goodness the lifeguard is super alert. Watch as he leaps in, feet first, reaching the drowning girl in the nick of time.`\" The sound of a siren adds to the tension, emphasizing the lifeguard's timely intervention.\n\nThe clip stays in the water park setting initially, featuring the same lively atmosphere. The person in the red swimsuit is still at the pool's edge. The scene then shifts to the ocean, showing a person swimming amidst the waves. 
The 'INSIDE edition' logo stays consistent at the bottom left. The audio captures: \"`Now, look, it's the same poor youngster struggling in the deep end. Once more, a lifeguard comes to the rescue.`\" A whistle blows and splashing sounds are heard as the lifeguard rushes to assist.\n\nMoving forward, the ocean serves as the backdrop for a rescue. The struggling swimmer raises their arms above the surface while another person approaches with a yellow flotation device. Waves create a dynamic scene. The action then cuts to a person paddling a blue surfboard toward shore, with a crowded beach and buildings visible in the background. The man\u2019s voiceover continues: \"`And there's drama at this beach where a youngster is in trouble. There's a father and his daughter right by her, but they don't notice.`\" Background music sets an intense tone.\n\nThe final part of the video shows the rescue operation from the ocean to the beach. A person lying on a blue surfboard is being held by another, ensuring they stay on the board while navigating small waves. The crowded shoreline forms the backdrop, and the 'INSIDE edition' logo remains prominent. The narration adds: \"`This extraordinary video shot with a GoPro on a rescue board shows the lifeguard saving the little girl.`\" The scene wraps up with a transition to a black screen displaying the 'INSIDE edition' logo in the center, followed by prompts to like on Facebook and follow on Twitter, alongside 'Check Local Listings' at the bottom. The video concludes with a copyright notice: \"`MMXVI INSIDE EDITION INC. ALL RIGHTS RESERVED.`\" As the video ends, background music and a woman\u2019s gasp emphasize the dramatic, life-saving efforts showcased."
+ }
+ ],
+ "events": [
+ "Video opens with a blank gray screen",
+ "Transition to a black screen with a rotating 3D 'INSIDE edition' logo in white and red text",
+ "Logo spins and zooms in",
+ "Scene introduces a bustling water park with people on blue inflatable tubes",
+ "Structures and trees in the background indicate outdoor setting",
+ "'INSIDE edition' logo appears in the bottom left corner",
+ "Serious male voice states someone is drowning and other swimmers don't know it",
+ "`SOMEBODY IN THIS POOL IS DROWNING. AND THE OTHER SWIMMERS DON'T EVEN KNOW IT. THERE THEY ARE, ENJOYING A WARM SUMMER DAY IN SOUTH CAROLINA.`",
+ "More scenes of people having fun on blue inflatable tubes in water park",
+ "Foreground shows person in blue swimsuit on a tube and another in blue swim trunks talking to them",
+ "Atmosphere lively and carefree",
+ "Voiceover highlights artificial waves and points out a drowning 3-year-old child",
+ "`NOTICE THE ARTIFICIAL WAVES MADE BY MACHINE IN THIS VIDEO POSTED ONLINE. NOW SEE THE 3 YEAR OLD CHILD. SHE'S DROWNING.`",
+ "Background noise of child's scream",
+ "Scenes continue in the water park showing people swimming and floating on tubes",
+ "Person in red swimsuit stands at pool's edge with a flotation device",
+ "Male narrator urgently describes lifeguard's actions",
+ "`Thank goodness the lifeguard is super alert. Watch as he leaps in, feet first, reaching the drowning girl in the nick of time.`",
+ "Sound of a siren adds tension",
+ "Lively atmosphere around the pool remains",
+ "Person in red swimsuit still at the pool's edge",
+ "Scene shifts to ocean showing person swimming amidst waves",
+ "'INSIDE edition' logo stays at the bottom left",
+ "Narrator describes the struggling youngster in the deep end",
+ "`Now, look, it's the same poor youngster struggling in the deep end. Once more, a lifeguard comes to the rescue.`",
+ "Whistle blows and splashing sounds as lifeguard rushes to assist",
+ "Rescue operation in the ocean depicted",
+ "Struggling swimmer raises arms above the surface",
+ "Another person approaches with yellow flotation device",
+ "Waves create dynamic scene",
+ "Cut to person paddling blue surfboard toward shore, crowded beach and buildings in background",
+ "Narrator describes drama at beach with youngster in trouble",
+ "`And there's drama at this beach where a youngster is in trouble. There's a father and his daughter right by her, but they don't notice.`",
+ "Intense background music",
+ "Final scene of rescue operation from ocean to beach with person on blue surfboard held by another navigating small waves",
+ "Crowded shoreline in the background",
+ "'INSIDE edition' logo remains prominent",
+ "Narrator mentions GoPro footage of lifeguard saving little girl",
+ "`This extraordinary video shot with a GoPro on a rescue board shows the lifeguard saving the little girl.`",
+ "Wrap up with transition to black screen displaying 'INSIDE edition' logo in the center",
+ "Prompts to like on Facebook and follow on Twitter, 'Check Local Listings' at the bottom",
+ "Video concludes with copyright notice",
+ "`MMXVI INSIDE EDITION INC. ALL RIGHTS RESERVED.`",
+ "Background music and woman\u2019s gasp emphasize dramatic, life-saving efforts"
+ ]
+ },
+ {
+ "video": "video/181.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "At the beginning of this video, we see a Fox News Channel broadcast with the headline 'CAN'T GO HOME' prominently displayed in the top right corner. The news ticker at the bottom updates viewers on military operations and international support for a draft of an Iraq resolution. The broadcast cuts away to show a woman in a black, fringed top energetically dancing in what appears to be a storage or utility room. Her face is partially obscured as the camera zooms in on her.\n\nThe audio of a man saying, \"J. Lo's new song, Jenny from the Block, all about Lopez's roots, about how she's still a neighborhood gal at heart,\" plays softly over the scene, filled with rhythmic music.\n\nThe video continues to focus on the same woman dancing in the storage or utility room. The Fox News Channel logo remains visible in the bottom left corner, and the news ticker updates include 'AP: BRITISH PM BLAIR SAYS NO DECISION HAS BEEN MADE ON CALLING UP RESERVISTS FOR POSSIBLE CONFLICT IN IRAQ.' Eventually, the scene shifts to an outdoor setting where the woman is conversing with another person. The backdrop features a brick building and a group of people, including children, seated on benches.\n\nA man narrates, \"But folks from that street in New York, the Bronx section, sound more likely to give her a curb job than a block party. The New York Post... Sorry about that slip up there, I have no idea how that happened, but it won't happen again.\" with music lingering in the background.\n\nThe video then returns to the Fox News studio, showing an anchor in a dark suit and tie. The bottom news ticker now includes updates such as 'AP: BRITISH PM BLAIR SAYS NO DECISION HAS BEEN MADE ON CALLING UP RESERVISTS FOR POSSIBLE CONFLICT IN IRAQ' and '15-PASSENGER VANS' followed by 'ASIAN LDRS ATTENDING ASSOCIATION OF SOUTH EAST ASIAN NATIONS (ASEAN) SUMMIT IN CAMBODIA NEGOTIATING DEAL WITH CHINA.' 
The studio's typical blue and green color scheme is visible, with the camera focused consistently on the speaking anchor.\n\nThe anchor's voice asserts, \"And that's your news. And the G Block, as Fox reports this Monday, November the 4th, 2002,\" accompanied by background music.\n\nThe scene remains on the Fox News Channel broadcast, with the anchor continuing to present in a dark suit and tie. The news ticker now highlights 'ASIAN LDRS ATTENDING ASSOCIATION OF SOUTH EAST ASIAN NATIONS (ASEAN) SUMMIT IN CAMBODIA NEGOTIATING DEAL WITH CHINA TO CREATE WORLD'S LARGEST FREE TRADE ZONE.' The studio's consistent blue and green color scheme frames the anchor as he delivers the news.\n\nThe anchor declares, \"And still ahead the most watched prime time in all of cable news. Period. Yes. All right...\" just before a loud bang is heard, followed by a man shouting."
+ }
+ ],
+ "events": [
+ "Fox News Channel broadcast begins with headline 'CAN'T GO HOME' in the top right corner",
+ "News ticker updates on military operations",
+ "News ticker updates on international support for a draft of an Iraq resolution",
+ "Broadcast cuts to a woman in a black, fringed top dancing in a storage or utility room",
+ "Camera zooms in on the woman, partially obscuring her face",
+ "Audio of a man mentions J. Lo's new song 'Jenny from the Block', saying it's about Lopez's roots and neighborhood",
+ "Fox News Channel logo remains visible in the bottom left corner",
+ "News ticker updates mention 'AP: BRITISH PM BLAIR SAYS NO DECISION HAS BEEN MADE ON CALLING UP RESERVISTS FOR POSSIBLE CONFLICT IN IRAQ.'",
+ "Scene shifts to an outdoor setting with the woman conversing with another person",
+ "Brick building and group of people, including children, seated on benches, visible in the background",
+ "Man narrates about folks from the Bronx section being more likely to give her a curb job than a block party",
+ "The man apologizes for a slip up during narration, mentions it won't happen again",
+ "Video returns to Fox News studio with anchor in dark suit and tie",
+ "News ticker updates include 'AP: BRITISH PM BLAIR SAYS NO DECISION HAS BEEN MADE ON CALLING UP RESERVISTS FOR POSSIBLE CONFLICT IN IRAQ'",
+ "News ticker mentions '15-PASSENGER VANS' and 'ASIAN LDRS ATTENDING ASSOCIATION OF SOUTH EAST ASIAN NATIONS (ASEAN) SUMMIT IN CAMBODIA NEGOTIATING DEAL WITH CHINA'",
+ "Studio's typical blue and green color scheme visible",
+ "Anchor asserts, 'And that's your news. And the G Block, as Fox reports this Monday, November the 4th, 2002.'",
+ "News ticker updates with 'ASIAN LDRS ATTENDING ASSOCIATION OF SOUTH EAST ASIAN NATIONS (ASEAN) SUMMIT IN CAMBODIA NEGOTIATING DEAL WITH CHINA TO CREATE WORLD'S LARGEST FREE TRADE ZONE.'",
+ "Anchor declares, 'And still ahead the most watched prime time in all of cable news. Period. Yes. All right...'",
+ "Loud bang heard, followed by a man shouting"
+ ]
+ },
+ {
+ "video": "video/182.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens on a roadside interaction where a police officer stands next to a white police car at the side of a road. The officer is in a black uniform, distinguished by an American flag patch on the chest. In the background, a cloudy sky looms over a row of trees, and several road signs are visible. Across the top of the frame, the bold text \"36K$ Settlement!\" appears, hinting at an important narrative. The audio reveals an exchange between the officer and a man: \"`What's going on? Exactly what you see. Okay. Yeah. Is there, can I ask, is there a specific reason for your? Well, there's always a reason, but I'm not going to share it. Okay. Yeah.`\" This suggests the man, likely recording the interaction, is inquiring about the reason for the encounter.\n\nThe scene continues with the officer, now holding a radio or communication device, still next to the police car. The backdrop remains the same: overcast skies, trees, and road signs. The word 'POLICE' is visible on the side of the car. The audio intensifies as the dialogue proceeds: \"`I'm going to be on my way. Have a good day. You're going to stay here for a second. Are you detaining me? Yes. For what reason? Because I would like to know what's going on.`\" This exchange clearly indicates a turn towards a more assertive interaction, with the officer expressing the intent to detain the individual.\n\nMaintaining the same visual context, the officer, now in a relaxed posture with arms crossed, stands steadfast by the roadside. Several cars pass by, adding a subtle dynamism to the scene. The audio further reveals a back-and-forth: \"`So here's the thing, you can tell me who you are and I can have you on your way very easily. Here's the thing, at this point in time it's an illegal detainment, because I have broken no law, so therefore you have no right to detain me or try to ID me.`\" 
The dialogue underscores the tension, as the man challenges the legitimacy of his detainment.\n\nWith the consistent visual backdrop, the next dialogue amplifies the clash of authority and personal rights: \"`That might be what you think. That's what I know. But you can tell me and I can get you on your way. Am I being detained or am I free to go? No, you are not free.`\" This conversation highlights the ongoing dispute about detainment and identification.\n\nThe scene persists with the officer in his relaxed stance, transitioning slightly toward a broader view that includes a tree and some greenery. The audio escalates as the situation nears confrontation: \"`Okay, silent. Set your cameras down. No. Can I get your supervisor, please? Put your hands behind your back. What are you arresting me for? Obstructing. Obstructing what?`\" The tension is palpable, with the demand for a supervisor and the movement toward an arrest for obstruction.\n\nFinally, the video concludes with a blurred, unstable shot, suggesting motion or a scene transition. The background, an indistinct mix of green and brown, implies a natural surrounding. Text at the bottom reads \"FOLLOW FOR MORE POLICE INTERACTIONS,\" nudging viewers towards further engagement. Then the text \"FULL VIDEO ON YOUTUBE.\" appears on the screen. The audio wraps up with: \"`And almost a year after the incident, the city settled for $36,000.`\", providing a resolution to the encounter and linking it back to the \"36K$ Settlement!\" text seen earlier. A passage that appears to be the settlement agreement is shown on the screen.\n\nThis entire narrative stitches together a tense roadside confrontation between a man and a police officer, culminating in a financial settlement, all framed within a broader call to observe and perhaps scrutinize police interactions more closely."
+ }
+ ],
+ "events": [
+ "Video opens on a roadside interaction with a police officer and a white police car.",
+ "Police officer is in a black uniform with an American flag patch on the chest.",
+ "Background features a cloudy sky, a row of trees, and several road signs.",
+ "Bold text '36K$ Settlement!' appears at the top of the frame.",
+ "Audio reveals an exchange between the officer and a man starting with 'What's going on?'.",
+ "The man inquires about the reason for the encounter, which the officer does not share.",
+ "Officer holds a radio or communication device next to the police car.",
+ "Word 'POLICE' is visible on the side of the car.",
+ "The conversation intensifies with the officer stating the man is being detained.",
+ "Officer stands with arms crossed by the roadside.",
+ "Several cars pass by in the background.",
+ "Man challenges the legitimacy of his detainment, stating he has broken no law.",
+ "Ongoing dispute about detainment and identification continues in the dialogue.",
+ "Officer maintains a relaxed stance, visual transition includes a tree and some greenery.",
+ "Audio escalates with demands for a supervisor and the movement toward an arrest.",
+ "Man is instructed to put his hands behind his back, facing arrest for obstruction.",
+ "Video concludes with a blurred, unstable shot implying motion or scene transition.",
+ "Background displays an indistinct mix of green and brown.",
+ "Text at the bottom reads 'FOLLOW FOR MORE POLICE INTERACTIONS' and 'FULL VIDEO ON YOUTUBE.'",
+ "Audio wraps up stating 'almost a year after the incident, the city settled for $36,000.'",
+ "Passage about the settlement agreement appears on the screen.",
+ "Narrative stitches together a tense roadside confrontation and financial settlement."
+ ]
+ },
+ {
+ "video": "video/183.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a black screen, transitioning to a scene featuring a happy Siberian Husky with striking blue eyes sitting on a wooden floor against a gray wall. The dog is wearing a collar with a blue tag, with its mouth open and tongue out. A hand reaches out, and then the dog lifts its paw to shake hands. A male voiceover instructs: \"`Would you like to teach your Siberian Husky how to shake your hand every single time that you ask? Here's how to do it in 60 seconds. First thing you do, grab a chair.`\" The narration is followed by a ding and another man speaking. The scene then shifts to a kitchen setting with a wooden chair on a tiled floor, a hand pointing at the chair against the backdrop of a stainless steel refrigerator and white cabinets.\n\nThe scene transitions to the Husky sitting on a tiled floor, looking happy, with another dog partially visible in the background. The setup includes a dining area with chairs and a rug. A hand points towards the Husky, and the dog remains seated. The scene transitions to a hand opening a stainless steel refrigerator, revealing contents such as a can of Pepsi, a bottle of Heinz tomato ketchup, and several food containers. The hand grabs a package of deli meat labeled 'Publix' and places a slice on the countertop. The voiceover continues, interspersed with music: \"`Step 2, grab your pepperoni pizza. Step three, get some turkey from the deli. Trust me, turkey works the best. Step four, roll that turkey up like it's... A cinnamon stick. Yeah.`\"\n\nFollowing this, the video focuses on a close-up of a hand placing a slice of deli meat on a dark-colored countertop. A hand rolls this piece of meat and then lifts the meat. The view transitions to a colorful countertop as the hand takes the piece of meat into a fist. The Husky eagerly looks up from the floor. The dog, still sporting its blue tag, appears excited as the hand moves closer to the meat, beginning to lick the hand. 
As the dog eats and licks the hand contentedly, the voiceover adds: \"`Now take that turkey roll and hold it in your hand, but make sure that your dog can't get to it. While they're interested in the turkey that's being covered by your hand, go ahead and grab their paw and place it up on your knee. But the second their paw is on your knee, and you can hold it there yourself.`\" In the background, one can discern the chirping of crickets.\n\nThe scene continues with the man taking his hand away from the dog's mouth so the dog stops licking his hand. The man gestures at the dog, who is sitting on a tiled floor with part of a wooden floor visible. The Husky's eyes, as striking and focused as before, convey affection and enjoyment.\n\nContinuing with a close-up view, the Husky, sitting on a tiled floor with the man in light-colored shorts, licks the man's hand. The man uses his hand to hold the dog's paw. The dog tries to eat the meat from the man's hand. The affectionate nibbling or licking follows through in a side view. The kitchen background reappears with part of the wooden floor. The voiceover resumes: \"`Give them that turkey. Then, the moment their paw drops off of your knee, cut off access to that turkey and put it back into your fist. Continue doing this.`\"\n\nThe scene transitions to the Husky still sitting on a tiled floor with part of a wooden floor visible as the man points to his knee. Then the Husky rests its front paw on the man's knee and begins to lick the man's hand. The final clip captures the Husky maintaining affectionate licking and nibbling of the person's hand, still sitting contentedly with its paw held by the man's hand. The background remains consistent with the kitchen. The voiceover concludes: \"`Be sure to reinforce that action with the word paw or shake, whichever one you want to choose. 
For more free Siberian Husky tips and tricks, go ahead and hit that subscribe button and don't forget to hit the like...`\" accompanied by the sound of a button being pressed.\n\nThe video wraps up with the Husky shaking hands with the man whenever it sees the man extend his hand.\n\nThroughout the video, the Husky's pleasant demeanor, striking blue eyes, and the consistent setting of the kitchen and tiled floor establish a coherent and focused tutorial on teaching the dog to shake hands."
+ }
+ ],
+ "events": [
+ "Video starts with a black screen",
+ "Scene transitions to a happy Siberian Husky with striking blue eyes sitting on a wooden floor against a gray wall",
+ "Dog is wearing a collar with a blue tag, mouth open, and tongue out",
+ "A hand reaches out, and the dog lifts its paw to shake hands",
+ "Male voiceover instructs how to teach the Husky to shake hands",
+ "Voiceover says: 'Would you like to teach your Siberian Husky how to shake your hand every single time that you ask? Here's how to do it in 60 seconds. First thing you do, grab a chair.'",
+ "Sound of a ding followed by another male voice speaking",
+ "Scene shifts to a kitchen setting with a wooden chair on a tiled floor",
+ "Hand points at the chair against the backdrop of a stainless steel refrigerator and white cabinets",
+ "Scene transitions to the Husky sitting on a tiled floor, looking happy, with another dog partially visible in the background",
+ "Setup includes a dining area with chairs and a rug",
+ "Hand points towards the Husky, and the dog remains seated",
+ "Scene transitions to a hand opening a stainless steel refrigerator",
+ "Refrigerator contains a can of Pepsi, a bottle of Heinz tomato ketchup, and several food containers",
+ "Hand grabs a deli meat labeled 'Publix' package and places a slice on the countertop",
+ "Voiceover continues: 'Step 2, grab your pepperoni pizza. Step 3, get some turkey from the deli. Trust me, turkey works the best. Step 4, roll that turkey up like it's... A cinnamon stick. Yeah.'",
+ "Close-up of a hand placing a slice of deli meat on a dark-colored countertop",
+ "Hand rolls the meat and lifts it",
+ "View transitions to a colorful countertop as the hand takes the piece of meat into a fist",
+ "Husky eagerly looks up from the floor, excited",
+ "Dog, with blue tag still visible, appears excited as the hand moves closer to the meat and begins to lick the hand",
+ "Voiceover adds: 'Now take that turkey roll and hold it in your hand, but make sure that your dog can't get to it. While they're interested in the turkey that's being covered by your hand, go ahead and grab their paw and place it up on your knee. But the second their paw is on your knee, and you can hold it there yourself.'",
+ "Background noise of chirping crickets",
+ "Man takes hand away from the dog's mouth, the dog stops licking",
+ "Man gestures at the dog sitting on a tiled floor, part of wooden floor visible",
+ "Husky's eyes convey affection and enjoyment",
+ "Close-up view of the Husky sitting on the tiled floor with the man in light-colored shorts",
+ "Husky licks the man's hand",
+ "Man holds the dog's paw with his hand",
+ "Dog tries to eat meat from the man's hand",
+ "Affectionate nibbling or licking from the dog follows",
+ "Side view of the husky and the man",
+ "Kitchen background reappears with part of the wooden floor",
+ "Voiceover resumes: 'Give them that turkey. Then, the moment their paw drops off of your knee, cut off access to that turkey and put it back into your fist. Continue doing this.'",
+ "Scene transitions to the Husky still sitting on the tiled floor with part of the wooden floor",
+ "Man points to his knee",
+ "Husky places its front paw on the man's knee and begins to lick the man's hand",
+ "Final clip shows the Husky maintaining affectionate licking and nibbling of the person's hand",
+ "Background remains consistent with part of the kitchen",
+ "Voiceover concludes: 'Be sure to reinforce that action with the word paw or shake, whichever one you want to choose. For more free Siberian Husky tips and tricks, go ahead and hit that subscribe button and don't forget to hit the like...'",
+ "Sound of a button being pressed",
+ "Video wraps up with the Husky shaking hands with the man whenever it sees the man extend his hand",
+ "Throughout the video, the Husky's pleasant demeanor, striking blue eyes, and the consistent setting of the kitchen and tiled floor establish a coherent and focused tutorial on teaching the dog to shake hands."
+ ]
+ },
+ {
+ "video": "video/184.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with an exterior view of a house covered in snow, accompanied by a garage and front porch. On the porch, two people stand while the screen displays the text \"Been here since the beginning it's insane. Part 35.\" Transitioning indoors, a person in a white long-sleeve shirt and a black cap is busy tiling the floor on the spreading grout, and using orange spacers to fix the hexagonal outer frame. The scene transitions, and the ongoing renovation is evident as another individual wearing a gray hoodie on a ladder works on the upper part of a wall, potentially sanding or painting. Following this, the person in gray hoodie uses a roller to paint the lower part of a wall. The final segment of this part presents the entire space and reveals a window.\n\nA woman's voice narrates the events: \"We tiled the entire floor in one day. We jumped directly on the drywall in the kitchen, sanding and priming it, and started wondering what to do with the window.\" This statement reflects the intensity and personal challenges faced during renovations.\n\nNext, the video showcases a woman in a gray long-sleeve shirt and black pants working on a ladder, ensecuringe its parts. The scene transitions to show the progress on the walls. The woman then climbs a step ladder to work on the closet, including the right, upper and left part in order. Following this, the scene transitions to a close-up of various tools on the floor, including red-handled pliers, a yellow measuring tape, and metal strips. A subsequent frame reveals a metal strip being installed above a window. The clip concludes with the man in a white long-sleeve shirt and a black cap working on the window frame, possibly applying putty or paint.\n \nThe narration continues: \" While we had the primer, I went ahead and painted the bedroom closet, which Ian said to tear out, but I'm going to make it work. 
We decided on the window to wrap it in metal corners and finish the drywall, which should be interesting because we never did that before.\" This dialogue highlights the decisions and changes made during the renovation process. \n\nThe video focuses on the man in a white long-sleeve shirt and a black cap, working on a window frame with a putty knife. Then it transitions to a view of a wooden door with a golden handle, followed by a close-up of a tiled floor with orange spacers. A person uses a black rubber mallet to tap the spacers, making them fall off. Two individuals, including the man and the woman, kneel on the floor, working together to tap the spacers and move them away. Then the woman stands up and kicks the orange spacers away with her feet. This is followed by a top-down view where one can see the orange spacers gradually being removed altogether.\n\nThe woman's voice narrates: \"We've never done that before. But guess what? The mortar's dry and we can remove these leveling spacers. We started with the mallet and then we started having fun with it and kicked them and then went back to the mallet.\" This reflects the novel experiences and the learning process involved in the renovation.\n\nThe next frames show someone vacuuming the gaps between the tiles and, subsequently, cleaning the tile surfaces with a yellow sponge. The final scene depicts a person applying grout to the gaps using a putty knife.\n\nThe narration continues: \"We had to wipe down marks that the mallet left, and I can't believe those plastic things stay in there.\" This underscores the meticulous care required in their work.\n\nFollowing this, a person in a gray shirt and black cap uses a grout float to spread grout over the hexagonal tiles. Transitioning to a close-up, the meticulous effort to fill the gaps is showcased. 
A woman in a white shirt and black pants kneels, cleaning excess grout with a yellow sponge, while another person in a gray hoodie applies grout to a different floor section.\n\nThe narration adds: \"But the grout covers them up and grouting is so easy. You just have to go round and round and then wipe it all down and just stay with the motion in the ocean. We worked ourselves around the room and this is the first time we ever did.\" This highlights the ease and enjoyment they find in their tasks despite being first-timers.\n\nThe video concludes with a top-down view of a person in gray applying finishing touches to a window frame, and then the video focuses on two yellow paint cans labeled 'Clare.' The clip ends with two paint samples on a wall under the text \"Which Color? Darker or Lighter.\"\n\nThe final narration remarks: \"A home remodeling job in socks. While that was drying, Trey jumped directly on the second coat for the drywall, and we have to pick a paint color. \""
+ }
+ ],
+ "events": [
+ "Exterior view of house covered in snow with garage and front porch",
+ "Two people standing on porch with text 'Been here since the beginning it's insane. Part 35.'",
+ "Transition to indoors showing person in white long-sleeve shirt and black cap tiling floor",
+ "Person spreads grout and uses orange spacers to fix hexagonal outer frame",
+ "Transition to another person in gray hoodie working on upper part of wall on ladder (sanding or painting)",
+ "Person in gray hoodie uses roller to paint lower part of wall",
+ "Scene shows entire space and a window",
+ "Woman narrates about tiling floor, drywall sanding and priming in kitchen, and deliberation about the window",
+ "Woman in gray long-sleeve shirt and black pants works on ladder ensuring its parts",
+ "Transition to progress on walls",
+ "Woman climbs step ladder to work on closet (right, upper, left part in order)",
+ "Close-up of tools on floor: red-handled pliers, yellow measuring tape, metal strips",
+ "Metal strip being installed above window",
+ "Man in white long-sleeve shirt and black cap works on window frame (applying putty or paint)",
+ "Narration about painting bedroom closet and wrapping window with metal corners",
+ "Focus on man in white shirt and black cap working on window frame with putty knife",
+ "View of wooden door with golden handle",
+ "Close-up of tiled floor with orange spacers",
+ "Person uses black rubber mallet to tap and remove spacers",
+ "Two people kneel on floor working together to remove spacers",
+ "Woman stands up and kicks orange spacers away with feet",
+ "Top-down view of orange spacers being removed",
+ "Narration about removing leveling spacers and fun process",
+ "Person vacuuming gaps between tiles",
+ "Person cleaning tile surfaces with yellow sponge",
+ "Person applying grout to gaps using putty knife",
+ "Narration about wiping down mallet marks and disbelief about plastic spacers",
+ "Person in gray shirt and black cap uses grout float to spread grout over hexagonal tiles",
+ "Close-up of filling gaps with grout",
+ "Woman in white shirt and black pants kneels cleaning excess grout with yellow sponge",
+ "Another person in gray hoodie applies grout to different floor section",
+ "Narration about ease of grouting and first-time experience",
+ "Top-down view of person in gray applying finishing touches to window frame",
+ "Focus on two yellow paint cans labeled 'Clare'",
+ "Two paint samples on wall with text 'Which Color? Darker or Lighter'",
+ "Final narration about remodeling job in socks, second coat of drywall, and picking a paint color"
+ ]
+ },
+ {
+ "video": "video/185.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a blue and yellow taxi cab, marked with the text \"Vets Cab Windsor - Ontario\" on the side.The driver inside the car says,''CAB FOR ZACHERY?\" and the man outside the car says,'YES I'M ZACHERY MAN...'' As the camera moves closer, the man inside responds with a drawn-out \"YEAH...\" The man outside says,\"BUT I JUST REALISED,I LEFT MY WALLET AT HOME...I'`M JUST GOING TO HK...IS THERE ANY CHANCE YOU COULD GIVE ME.I SPRAINED MY ANKLE REALLY BAD...`\" The driver responds with a friendly \"NO PROBLEM...\"\n\nContinuing, we observe a close-up of the blue and yellow taxi cab, with a clearer view of the \"Vets Cab Windsor - Ontario\" text on its side. The man outside says, \"`IS THAT OKAY?`\" is displayed, and the driver inside the cab is gesturing with his hand says,'WHERE IS HK?'' The man outside says,\"JUST ABOUT 3 BLOCKS DOWN...HUMAN KINETICS...RIGHT ACROSS THE RAILROAD TRACK...IS TAHT OK?\"The driver says,'YEAH,YEAH...\"The camera then transitions to the taxi's interior, showing a plastic partition,the man says to the driver,\"`WHY DID YOU LET ME COME IN THE TAXI FOR FREE?`\". The driver explains straightforwardly with \"BECAUSE KARMA...I DID THIS FOR YOU...TRUST ME...IF IT'`S NOT TODAY.....TOMORROW....I WILL GET SOMETHING...\"\n\nThe man in the car continues\"`GIVE ME ONE SEC, OKAY?`\"then he opens the car's door and stands outside the cab before the driver says, \"OKAY...\". The driver is seen wearing a green jacket and blue face mask inside the cab.The man outside says,\"`SO I WAS...ASKING STRANGERS...TO GIVE ME A RIDE...AND THE FIRST TAXI DRIVER TO GIVE ME A RIDE...I WAS GIVING THEM $500...\".\n\nThe man outside then gives the driver the cash the the driver takes the money.Exhibiting gratitude and surprise, the man gestures,looks at the camera and says \"`THANK YOU SO MUCH...OH MY GOD I'M GOING TO HUG YOU MAN...\". He is visibly smiling, and man outside the car says encouragedly: \"COME OUT THERE AND HUG ME MAN...\". 
The camera shifts focus outside the cab with an amused \"HAHA...\" The final text, \"YOU SAID GOOD KARMA...I WAS...\" wraps this segment. The conversation in audio involves the person appreciating the driver's instant goodwill and promising, in joyous disbelief, a reward, culminating in a request for a hug to symbolize his gratitude.\n\nIn the conclusion of this series of sequences, the driver, dressed in the same green jacket and blue face mask, standing outside the taxi, hugs the man. The camera zeroes in on his jacket's blue and white argyle pattern, while the text \"I BELIEVE IN KARMA...\" is displayed. The surrounding parking lot, with its array of cars and trees in the background, enriches the setting, and he says, \"I love you, man,\" accompanied by the gentle hum of the taxi's idling engine. Then the screen turns to black.\n\nThrough these humane and fortuitous moments, the narrative ties together themes of kindness, karma, and unexpected generosity, framed by the consistent visual and auditory cues of conversation happening within and outside a cab, against the backdrop of a typical urban scene."
+ }
+ ],
+ "events": [
+ "The video begins with a blue and yellow taxi cab marked 'Vets Cab Windsor - Ontario' on the side.",
+ "The driver inside the car says, 'CAB FOR ZACHERY?'",
+ "The man outside the car says, 'YES I'M ZACHERY MAN...'",
+ "As the camera moves closer, the man inside responds with a drawn-out 'YEAH...'",
+ "The man outside says, 'BUT I JUST REALISED, I LEFT MY WALLET AT HOME...I'M JUST GOING TO HK...IS THERE ANY CHANCE YOU COULD GIVE ME. I SPRAINED MY ANKLE REALLY BAD...'",
+ "The driver responds with a friendly 'NO PROBLEM...'",
+ "A close-up of the blue and yellow taxi cab, with a clearer view of 'Vets Cab Windsor - Ontario' text on its side.",
+ "The man outside says, 'IS THAT OKAY?'",
+ "The driver inside the cab gestures with his hand and says, 'WHERE IS HK?'",
+ "The man outside says, 'JUST ABOUT 3 BLOCKS DOWN...HUMAN KINETICS...RIGHT ACROSS THE RAILROAD TRACK...IS THAT OK?'",
+ "The driver says, 'YEAH, YEAH...'",
+ "The camera transitions to the taxi's interior, showing a plastic partition.",
+ "The man says to the driver, 'WHY DID YOU LET ME COME IN THE TAXI FOR FREE?'",
+ "The driver explains straightforwardly with 'BECAUSE KARMA...I DID THIS FOR YOU...TRUST ME...IF IT'S NOT TODAY.....TOMORROW....I WILL GET SOMETHING...'",
+ "The man in the car continues, 'GIVE ME ONE SEC, OKAY?'",
+ "The man opens the car's door and stands outside the cab.",
+ "The driver says, 'OKAY...'",
+ "The driver is seen wearing a green jacket and blue face mask inside the cab.",
+ "The man outside says, 'SO I WAS...ASKING STRANGERS...TO GIVE ME A RIDE...AND THE FIRST TAXI DRIVER TO GIVE ME A RIDE...I WAS GIVING THEM $500...'",
+ "The man outside then gives the driver the cash.",
+ "The driver takes the money.",
+ "The man gestures, looks at the camera, and says, 'THANK YOU SO MUCH...OH MY GOD I'M GOING TO HUG YOU MAN...'",
+ "He is visibly smiling.",
+ "The man outside the car says encouragedly: 'COME OUT THERE AND HUG ME MAN...'",
+ "The camera shifts focus outside the cab with an amused 'HAHA...'",
+ "The final text, 'YOU SAID GOOD KARMA...I WAS...' wraps this segment.",
+ "The person appreciates the driver's instant goodwill and promises a reward in joyous disbelief, culminating in a request for a hug.",
+ "The driver, dressed in the same green jacket and blue face mask, stands outside the taxi and hugs the man.",
+ "The camera zeroes in on his jacket's blue and white argyle pattern, and the text saying 'I BELIEVE IN KARMA...' is evident.",
+ "The surrounding parking lot, with its array of cars and trees in the background, enriches the setting.",
+ "He says, 'I love you, man,' accompanied by the gentle hum of the taxi's idling engine.",
+ "The camera turns to black.",
+ "Narrative ties themes of kindness, karma, and unexpected generosity through the visual and auditory cues of conversation within and outside a cab in an urban scene."
+ ]
+ },
+ {
+ "video": "video/186.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video showcases a thorough comparison of wooden planks in the context of a home renovation project, featuring a man in a blue polo shirt with a 'Home Renovision' logo. Initially, the man is shown holding two wooden planks, one with a darker wood grain pattern and the other with a lighter one. The background is a partially constructed wall with wooden studs, accentuating the ongoing renovation. The camera cuts to a close-up of two wooden planks of different colors placed next to each other.\n\nAs the man holds the planks, the audio reveals commentary: \"Here'`s my prediction. I think Laminate's making a comeback. And here's why. When Laminate came out, it was cheap, it was DIY, it was attractive. It was the flooring of choice for people with a budget.`\" \n\nThe scene transitions to the man using a circular saw with a yellow blade labeled 'DEWALT' to cut a piece of wood. A cloud of sawdust fills the air as the saw slices through the wood. This visual action is paired with the continuation of the audio: \"`But it had two problems. It was really dusty to cut and it did really lousy on water. And then vinyl came along and it solved that problem because it was completely waterproof and you could cut it with a knife.`\"\n\nNext, the man is seen holding the two planks again, comparing them just like before. The comparison involves close-ups that highlight the differing grain patterns and textures of the planks. The man then scores one of the wooden planks with a utility knife. The audio continues: \"`And the truth is, most of this stuff was made overseas. As long as shipping costs were down, it was really affordable. But things have changed.\"\n\nAgain, the man is seen holding the two planks, comparing them and explaining the details just like before. The narrative moves to a close-up of a red plastic cup lying on its side on a wooden surface, similar in texture to the darker plank, and the water in the cup spills on the plank. 
The audio explains: \"`Now getting good quality vinyl means you`'ve got to pay a lot more money for it or they start making it so thin that the locking joints just aren't what they used to be. Laminate has solved their problem. `\" giving insight into the innovations in laminate flooring.\n\nThe next sequence shows the man fitting two wooden darker planks together, ensuring they align perfectly. This scene highlights the precision required in home renovation. The audio provides additional context: \"`Now they've got a new surface that gives you a 24-hour protection against spills, and the new locking joint means that once it's put together, it's locked for good, and you're not going to have corners breaking off.`\" A loud bang suggests the secure fitting of the planks, emphasizing the durability of the new laminate flooring.\n\nFinally, the man is seen explaining something using hand gestures in front of the same partially constructed wall. The man appears to be providing detailed guidance or instructions about the renovation. The concluding audio states: \"`That's why I think it will come back. The building materials are going through the roof, that can save the cost of your project.`\" , reinforcing the informative and instructional nature of the video.\n\nThroughout the video, the text 'Home Renovision' on the man's shirt serves as a constant reminder of the professional context, emphasizing expertise in home renovation. The continuity of scenes and the detailed audio commentary create a coherent and comprehensive narrative, guiding viewers through the practical choices and benefits of laminate versus vinyl flooring in home renovation projects."
+ }
+ ],
+ "events": [
+ "Man in blue polo shirt with 'Home Renovision' logo is featured.",
+ "Man is holding two wooden planks - one darker, one lighter.",
+ "Background shows a partially constructed wall with wooden studs.",
+ "Camera cuts to close-up of two wooden planks of different colors placed next to each other.",
+ "Man's commentary about laminate: 'Here's my prediction. I think Laminate's making a comeback. And here's why. When Laminate came out, it was cheap, it was DIY, it was attractive. It was the flooring of choice for people with a budget.'",
+ "Scene transitions to the man using a circular saw with a yellow blade labeled 'DEWALT' to cut wood.",
+ "Sawdust fills the air as the saw slices through the wood.",
+ "Audio commentary continues: 'But it had two problems. It was really dusty to cut and it did really lousy on water. And then vinyl came along and it solved that problem because it was completely waterproof and you could cut it with a knife.'",
+ "Man is holding the two planks again, comparing them with close-ups of grain patterns and textures.",
+ "Man scores one of the wooden planks with a utility knife.",
+ "Audio continues: 'And the truth is, most of this stuff was made overseas. As long as shipping costs were down, it was really affordable. But things have changed.'",
+ "Man is holding the two planks again, comparing and explaining.",
+ "Close-up of a red plastic cup on its side on a wooden surface similar to the darker plank, water spills on the plank.",
+ "Audio explains: 'Now getting good quality vinyl means you've got to pay a lot more money for it or they start making it so thin that the locking joints just aren't what they used to be. Laminate has solved their problem.'",
+ "Man is fitting two darker wooden planks together, ensuring perfect alignment.",
+ "Audio provides context: 'Now they've got a new surface that gives you a 24-hour protection against spills, and the new locking joint means that once it's put together, it's locked for good, and you're not going to have corners breaking off.'",
+ "Loud bang emphasizes the secure fitting of the planks.",
+ "Man is explaining something using hand gestures in front of the partially constructed wall.",
+ "Concluding audio states: 'That's why I think it will come back. The building materials are going through the roof, that can save the cost of your project.'",
+ "Text 'Home Renovision' on man's shirt serves as a constant reminder of professional context.",
+ "Video emphasizes expertise in home renovation and practical choices/ benefits of laminate versus vinyl flooring."
+ ]
+ },
+ {
+ "video": "video/187.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a man in a black cap and shirt emblazoned with the 'macmost' logo, standing against a dark red background. The background has a digital pattern of red lines and dots, with the text 'macmost' prominently at the top. The scene transitions to a close-up of an iPhone home screen, showing app icons like FaceTime, Calendar, Photos, and Camera. The next shot reveals the 'Browse' screen of the Files app, spotlighting various locations such as iCloud Drive, On My iPhone, and Recently Deleted. The 'Projects' folder is selected under Favorites, and the final frames display the iCloud Drive contents, highlighting the 'Desktop' folder which contains 16 items.\n\nA man advises, \"Don'`t use your iPhone's Camera app to take pictures of receipts and other documents. Instead, use the Files app. Pick a location like iCloud Drive to store your receipts.`\"\n\nThe video continues by showing the iCloud Drive screen on an iPhone, listing folders like CotEditor, Desktop, Documents, and more. A drop-down menu with options including Select, New Folder, and Scan Documents is accessed. The user selects 'New Folder', creating an 'untitled folder' which is then renamed to 'Receipts'. The folder syncs with iCloud and opens to reveal its emptiness.\n\nThe man instructs, \"`Tap the three dots button at the top right and create a new folder called `'Receipts'`. You only need to do this the first time. From then on, just go to that Receipts folder when you want to scan.`\"\n\nNext, the 'Receipts' folder is displayed, still empty. The user activates the three-dot menu, selects 'Scan Documents', and positions the iPhone to capture a receipt. The receipt shows items like 'Taco Supreme Platter,' 'Cheeseburger,' 'Fries,' and 'Brownie Sundae', along with the total amount and server's name. The receipt is scanned and saved.\n\nAudio: The man continues, \"`Tap the three dots button again and choose Scan Document. Then just point your iPhone at the receipt. 
It will automatically take the picture or you can use that white button to do this as well. You could also take some more pictures if it's a multi-page receipt.`\"\n\nFollowing this, the iPhone camera view shows a dark textured surface with a blue edge. The text 'Ready for next scan.' appears before the user saves the scanned receipt, which is listed in the 'Receipts' folder as 'Scanned Document'. The document is renamed to '2022-12-15 Diner' and synced with iCloud.\n\nAudio: The man explains, \"`Then give it a good name if you like. Something you could look up later on.`\" Then the man explains that this document can be seen on all Apple devices as long as you log in to your iCloud account.\n\nThe 'Receipts' folder now shows '2022-12-15 Diner'. The scanned receipt is opened, displaying details like the restaurant name, address, date, time, server's name, and ordered items with prices. The user returns to the 'Receipts' folder, selects the 'Scan Documents' option, and captures another receipt, adjusting the color settings to 'Black & White'.\n\nAudio: \"`Once saved to iCloud, it's accessible across all your devices. And if you want smaller files, when you go to scan, tap this button here at top and choose Black & White. Now you'll take a black and white scan which will give you a much smaller file size.`\" Background music plays.\n\nFinally, the iPhone camera captures a receipt with details, and the user saves it. The 'Receipts' folder now shows two items. The 'Scanned Document' reveals the same receipt details. The video ends on a red background with the text 'macmost.com', a YouTube icon, and 'YouTube.com/MacMostVideo', concluding with 'COPYRIGHT CLEVERMEDIA, INC.'\n"
+ }
+ ],
+ "events": [
+ "A man in a black cap and shirt with 'macmost' logo stands against a dark red background.",
+ "The background has a digital pattern of red lines and dots with 'macmost' text at the top.",
+ "Scene transitions to a close-up of an iPhone home screen showing app icons like FaceTime, Calendar, Photos, and Camera.",
+ "Next shot reveals the 'Browse' screen of the Files app showing locations like iCloud Drive, On My iPhone, and Recently Deleted.",
+ "The 'Projects' folder is selected under Favorites.",
+ "Final frames display iCloud Drive contents highlighting the 'Desktop' folder with 16 items.",
+ "Man advises: \u201cDon\u2019t use your iPhone\u2019s Camera app to take pictures of receipts and other documents. Instead, use the Files app. Pick a location like iCloud Drive to store your receipts.\u201d",
+ "iCloud Drive screen on iPhone shows folders like CotEditor, Desktop, Documents, and more.",
+ "Drop-down menu with options including Select, New Folder, and Scan Documents is accessed.",
+ "User selects 'New Folder', creating an 'untitled folder' which is renamed to 'Receipts'.",
+ "The 'Receipts' folder syncs with iCloud and opens to reveal its emptiness.",
+ "Man instructs: \u201cTap the three dots button at the top right and create a new folder called 'Receipts'. You only need to do this the first time. From then on, just go to that Receipts folder when you want to scan.\u201d",
+ "The 'Receipts' folder is displayed, still empty.",
+ "User activates the three-dot menu and selects 'Scan Documents'.",
+ "User positions iPhone to capture a receipt showing items like 'Taco Supreme Platter', 'Cheeseburger', 'Fries', and 'Brownie Sundae', along with total amount and server's name.",
+ "The receipt is scanned and saved.",
+ "Man continues: \u201cTap the three dots button again and choose Scan Document. Then just point your iPhone at the receipt. It will automatically take the picture, or you can use that white button to do this as well. You could also take some more pictures if it's a multi-page receipt.\u201d",
+ "iPhone camera view shows a dark textured surface with a blue edge and text 'Ready for next scan.'",
+ "User saves the scanned receipt listed in 'Receipts' folder as 'Scanned Document'.",
+ "Document is renamed to '2022-12-15 Diner' and synced with iCloud.",
+ "Man explains: \u201cThen give it a good name if you like. Something you could look up with later on. This document can be seen on all Apple Devices as long as you log in your iCloud account.\u201d",
+ "The 'Receipts' folder shows '2022-12-15 Diner'.",
+ "Scanned receipt is opened, displaying details like restaurant name, address, date, time, server's name, and ordered items with prices.",
+ "User returns to 'Receipts' folder, selects 'Scan Documents' and captures another receipt.",
+ "User adjusts color settings to 'Black & White'.",
+ "Man explains: \u201cOnce saved to iCloud, it's accessible across all your devices. And if you want smaller files, when you go to scan, tap this button here at top and choose Black & White. Now you'll take a black and white scan which will give you a much smaller file size.\u201d",
+ "Background music plays.",
+ "iPhone camera captures a receipt with details and user saves it.",
+ "The 'Receipts' folder now shows two items.",
+ "The 'Scanned Document' reveals the same receipt details.",
+ "Video ends on a red background with text 'macmost.com', YouTube icon, and 'YouTube.com/MacMostVideo', concluding with 'COPYRIGHT CLEVERMEDIA, INC.'"
+ ]
+ },
+ {
+ "video": "video/188.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a cozy, well-decorated room filled with various plants, a large window draped with glass bead curtains, and an array of decorations that includes a mirror and a deer head mount. A screenshot of a comment reads, 'Also remember it only takes one bead to refract right and start the fire, you have hundreds spinning possibly into alignment to do that' and the text 'THEY ARE GLASS BEADS' appears on the screen. This highlights the potential fire hazard posed by the refractive properties of glass beads. The camera transitions to a close-up of the curtains from different angles, emphasizing their reflective nature. The scene then shifts to another room, where a man in a patterned shirt and graphic t-shirt is seated among books, a guitar, and framed pictures. White text on a black background appears at the top of the screen as the man speaks, followed by a close-up of his face. Notably, the view shakes considerably, and captions corresponding to the real-time audio are displayed throughout the video. The audio features a woman asking, \"Yes or no? Yes or no?\" followed by the response from a man, \"Can your glass bead window curtains cause a fire? Not those ones.\"\n\nThe man now stands in the room filled with decorations, continuing to speak and gesturing animatedly with his hands. The audio continues with the man's high-pitched voice explaining, \"Now, this is a concern. You need something that's collecting enough light that when it focuses it to, you know, roughly as focused as something that's made of glass or acrylic can make something.\"\n\nIn the same room, the man emphasizes his explanation, discussing the potential for smoldering. 
The audio supports the narrative with, \"Uh... then you could actually get something that might cause something to smolder, and then eventually if there's enough fuel nearby, to catch fire.\" The visual reveals more of the room, where many decorations are placed on the shelves against a green, white, and black wall.\n\nContinuing in the same room, the man explains the various items that can cause fires, pointing to them and raising his fingers to count as he enumerates them. The audio matches this narrative, \"Things that have done this, and actually caused house fires include fish bowls, vases filled with water that don't have anything in them, once a glass doorknob. A glass doorknob!\"\n\nThe man continues standing in the decorated room, gesturing while explaining, and points to various objects. The audio aligns with these visuals, \"London Fire Brigade said that there were 125 fires triggered by the sun's rays over the last five years. But, you need something that will collect enough of the sun's light,\" spoken in a British accent.\n\nIn the final segment, the man continues discussing the dangers associated with reflected sunlight. The man gestures to emphasize his explanations as the audio concludes with, \"...that the heat on the other end of that is going to be a potential hazard and a bead isn't going to be collecting that much light.\" Along with the explanation, the man makes a gesture indicating a small amount.\n\nThe narrative throughout the video builds a cohesive cautionary statement on how certain household items, especially those that can refract light like glass beads, can potentially pose fire hazards. The vivid descriptions and synchronized audio guide the viewer through understanding the potential risks within their living spaces."
+ }
+ ],
+ "events": [
+ "The video opens with a cozy, well-decorated room filled with various plants.",
+ "A large window is draped with glass bead curtains.",
+ "The room contains an array of decorations, including a mirror and a deer head mount.",
+ "A screenshot of a comment reads, 'Also remember it only takes one bead to refract right and start the fire, you have hundreds spinning possibly into alignment to do that'.",
+ "The text 'THEY ARE GLASS BEADS' appears on the screen.",
+ "The camera transitions to a close-up of the curtains from different angles, emphasizing their reflective nature.",
+ "The scene shifts to another room.",
+ "A man in a patterned shirt and graphic t-shirt is seated among books, a guitar, and framed pictures.",
+ "White text on a black background appears at the top of the screen as the man speaks.",
+ "Close-up of the man's face.",
+ "The view shakes considerably and displays captions corresponding to the real-time audio throughout the video.",
+ "Audio: A woman asks, 'Yes or no? Yes or no?'",
+ "Audio: A man responds, 'Can your glass bead window curtains cause a fire? Not those ones.'",
+ "The man stands in the room filled with decorations, continuing to speak.",
+ "The man gestures animatedly with his hands.",
+ "Audio: The man explains in a high-pitched voice, 'Now, this is a concern. You need something that's collecting enough light that when it focuses it to, you know, roughly as focused as something that's made of glass or acrylic can make something.'",
+ "The man discusses the potential for smoldering.",
+ "Audio: 'Uh..then you could actually get something that might cause something to smolder, and then eventually if there's enough fuel nearby, to catch fire.'",
+ "The visual reveals more of the room with decorations on shelves against a green, white, and black wall.",
+ "The man explains the various items that can cause fires.",
+ "The man points out these items and raises his finger to count them.",
+ "Audio: 'Things that have done this, and actually caused house fires include fish bowls, vases filled with water that don't have anything in them, once a glass doorknob. A glass doorknob!'",
+ "The man continues standing in the decorated room, gesturing while explaining.",
+ "Audio: 'London Fire Brigade said that there were 125 fires triggered by the sun's rays over the last five years. But, you need something that will collect enough of the sun's light,' spoken in a British accent.",
+ "The man continues discussing the dangers of reflected sunlight in the final segment.",
+ "The man gestures to emphasize his explanations.",
+ "Audio: '...that the heat on the other end of that is going to be a potential hazard and a bead isn't going to be collecting that much light.'",
+ "The man gestures to indicate a small amount.",
+ "The narrative builds a cohesive cautionary statement on fire hazards posed by household items that can refract light.",
+ "Vivid descriptions and synchronized audio guide the viewer to understand potential risks in living spaces."
+ ]
+ },
+ {
+ "video": "video/189.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man holding a bright green sign reading \"FREE DOG TRAINING\" in large, bold letters, along with a small smiley face. A woman says, \"What kind of free dog training you got? Just you make this one less crazy?\" The man, wearing a gray shirt and a black cap, walks away, and the back of his shirt reads \"NO BAD DOGS\" with an illustration of a dog. He says, \"Yeah, sure.\" A small white dog moves energetically in a park with red flowers, accompanied by the text \"This lady was struggling with her dog in the park.\" A woman says, \"Not a mean bone in him but sometimes he'll approach people and it looks like he's going to like nip them and also leash aggressive.\" In the background, people talk while a dog barks.\n\nThe small white dog runs energetically on the grass, wearing a harness and leash. A transition shows the man holding the bright green sign; he points to the dog and says, \"SO WHEN HE DOES FREAK OUT WHAT DO YOU DO?\" The dog stands on a paved surface, viewed by the camera. A woman says, \"WE HAD ANOTHER TRAINER ABOUT A YEAR AGO AND SHE WAS SAYING HI SWEETHEART.\" The scene shifts to a woman in a pink tank top and light-colored shorts speaking, \"AS SOON AS HE STARTS TO SHOW ANXIETY OR REACTION JUST TURN AROUND AND WALK THE OTHER WAY AND I SAID, WELL, THAT FEELS LIKE IT'S REINFORCING SOMETHING THAT HE SHOULD BE AFRAID OF.\"\n\nThe woman in the pink top and white pants continues, \"AND IT DIDN'T WORK. YOU KNOW, HE'S SMART.\" The red text \"There's another dog coming \ud83d\ude28\" appears as she continues. The small white dog runs again, with the man saying, \"CAN I USE A LITTLE SLIP REALLY QUICK.\" Background audio consists of the woman and a man conversing.\n\nThe small white dog continues running, with \"There's another dog coming \ud83d\ude28\" in red. The man holds the leash, saying, \"SO, I'M GOING TO WALK HIM TOWARDS THIS DOG\", transitioning to a close-up with \"Dog walking by\" and an arrow pointing to the dog. 
A command is given, and the text \"LEAVE\" appears as the dog walks. The dog, now calm and attentive, acknowledges \"YES.\" and \"LEAVE IT.\" The handler rewards the dog, which sits calmly, and the text \"IT WAS MUCH BETTER.\" appears. A woman in the background says, \"It was amazing!\" The man walks the small dog on a leash, with \"YEAH, IT WAS AMAZING\" at the bottom.\n\nAs the scene transitions, a woman in a blue jacket and a black top, sitting on a bench, remarks, \"That was seconds!\" with \"YEAH.\" at the bottom. As the gray-shirted man walks by, the woman in the pink top says, \"There wasn't even a bark I didn't hear!\" A close-up shows the dog on a paved surface, and the man says, \"OH, WHEN YOUR DOG REACTS AND DOES THAT BEHAVIOR, WALK AWAY, YOU'RE LIKE THAT DIDN'T WORK, AND THEN I'M LIKE, HEY, IF YOU JUST CORRECT THE DOG IMMEDIATELY WHEN THEY DID THAT BEHAVIOR, IT HELPS AND IT FIXES IT OR IT MODIFIES IT AT THAT TIME.\"\n\nHe hands the leash to the woman in the pink tank top, with \"Share this with anyone who is struggling with their dog \u2764\ufe0f\" at the top and \"THANK YOU.\" at the bottom. The woman expresses gratitude as they exchange introductions; the man says, \"WHAT WAS YOUR NAME\" and she says, \"Kate.\" The man says, \"Katie, it's Tom.\" The woman says, \"NICE TO MEET YOU.\" The man responds, \"Nice to meet you, pleasure.\" The video concludes with them shaking hands, with the share message persistently displayed.\n\nThroughout the video, instructional texts, conversations, and practical demonstrations seamlessly combine to narrate the journey and tips on dog training, illustrating effective ways to manage and train dogs, especially those with anxieties and energetic behaviors."
+ }
+ ],
+ "events": [
+ "Man holding a bright green sign reading 'FREE DOG TRAINING' with a small smiley face",
+ "Woman asks about the type of free dog training for her dog",
+ "Man, wearing a gray shirt and a black cap, walks away",
+ "Back of man's shirt reads 'NO BAD DOGS' with an illustration of a dog",
+ "Man replies, 'Yeah, sure'",
+ "Small white dog moves energetically in a park with red flowers",
+ "Text on screen: 'This lady was struggling with her dog in the park'",
+ "Woman talks about her dog's behavior issues (nip at people, leash aggressive)",
+ "Background audio of people talking and a dog barking",
+ "Small white dog runs energetically on the grass wearing a harness and leash",
+ "Scene transition to man holding the bright green sign again",
+ "Man asks 'SO WHEN HE DOES FREAK OUT WHAT DO YOU DO?'",
+ "Dog stands on a paved surface, camera looking at it",
+ "Woman mentions past experience with another trainer",
+ "Scene shifts to a woman in a pink tank top and light-colored shorts speaking",
+ "Woman recounts past trainer's advice to turn around and walk away when the dog shows anxiety or reacts",
+ "Woman in pink top and white pants continues saying past method didn't work",
+ "Text on screen: 'There's another dog coming \ud83d\ude28'",
+ "Small white dog runs again",
+ "Man asks to use a little slip lead quickly",
+ "Background audio consists of the woman and a man conversing",
+ "Small white dog continues running",
+ "Text on screen: 'There's another dog coming \ud83d\ude28'",
+ "Man holds the leash and says, 'SO, I'M GOING TO WALK HIM TOWARDS THIS DOG'",
+ "Close-up transition with 'Dog walking by' and an arrow pointing to the dog",
+ "Command 'LEAVE' appears as the dog walks",
+ "Dog walks calmly and attentively, acknowledging 'YES.' and 'LEAVE IT.'",
+ "Handler rewards the dog, which sits calmly",
+ "Text on screen: 'IT WAS MUCH BETTER.'",
+ "Woman in the background exclaims, 'It was amazing!'",
+ "Man walks the small dog on a leash",
+ "Text on screen: 'YEAH, IT WAS AMAZING'",
+ "Scene transitions to woman in a blue jacket and black top sitting on a bench, commenting 'That was seconds!'",
+ "Text on screen: 'YEAH.'",
+ "Gray-shirted man walks by, woman in pink top says, 'There wasn't even a bark I didn't hear!'",
+ "Close-up of the dog on a paved surface",
+ "Man advises on how to correct the dog immediately when it exhibits the unwanted behavior",
+ "Man hands the leash to the woman in the pink tank top",
+ "Text on screen: 'Share this with anyone who is struggling with their dog \u2764\ufe0f'",
+ "Text on screen: 'THANK YOU.'",
+ "Woman expresses gratitude",
+ "Man and woman exchange introductions ('WHAT WAS YOUR NAME', 'Kate', 'Katie, it's Tom', 'NICE TO MEET YOU')",
+ "Video concludes with man and woman shaking hands",
+ "Message to share the video persists on the screen"
+ ]
+ },
+ {
+ "video": "video/190.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a man in a dark suit and a red polka dot tie speaking against a teal background, emphasizing the phrase displayed on the screen, \"The Bible is a Catholic book.\" His voice carries on, stating, \"If you don't trust the Church on anything else.\" The scene transitions to a religious procession inside a church where several people in white robes and one person in a red robe holding a large, ornate book are seen. The text now reads, \"Why would you trust the Catholic Church to provide you with the Bible? The one book you are basing your entire religion off of.\" The scene finally shows a person in a red robe performing a religious ritual with incense.\n\nNext, a man in a dark suit and blue shirt, wearing a necklace with a religious pendant, speaks against the same teal background. The words \"Sola Scriptura simply isn't scriptural\" are seen on the screen. Black-and-white footage of another man holding an open book and speaking at a podium continues the message, keeping the text \"Sola Scriptura simply isn't scriptural.\" The speaker points to the book, and the text changes to, \"Nowhere in the Bible does the Bible say the Bible alone or Scripture alone.\" A close-up shot follows, revealing the text on screen. The man's voice resonates, \"Sola Scriptura simply isn't scriptural. Nowhere in the Bible does the Bible say the Bible alone or Scripture alone,\" highlighting his argument against the concept. 
The footage shows a book passage stating, \"All Scripture is inspired by God and profitable for teaching, for reproof, for correction, and for training in righteousness, that the man of God may be complete, equipped for every good work.\"\n\nThe focus lingers on the text from the previous passage, \"All Scripture is inspired by God and profitable for teaching, for reproof, for correction, and for training in righteousness...\" before transitioning to black-and-white footage of a religious figure in ornate attire making a gesture with one hand raised and the other on the chest. The speaker discusses references to the Old Testament, explaining, \"...are references to the Old Testament because the New Testament didn't exist yet. There has to be an authoritative interpretation,\" and warns, \"or else you're going to have a bunch of different churches that all disagree on difficult passages.\" A painting depicts a central figure in red and blue robes seated on a rock, raising one hand, surrounded by attentive listeners.\n\nThe narrative flows into a similar painting scene, highlighting continuity before transitioning to black-and-white footage of a man in a suit holding an open book at a podium. The screen text reads, \"Jesus wanted his people to be one, but the fruit of Sola Scriptura is disunity.\" This is followed by shots of various church signs, starting with \"Fishers of Men Lutheran Church,\" then \"Christ Church Sugar Land, A United Methodist Community,\" accentuating the diversity and potential division among churches.\n\nFurther illustrating this point, the video shows multiple church signs, starting with \"Christ Church Sugar Land, A United Methodist Community,\" and transitions to \"Sugar Land Church of God, Anderson, Indiana,\" stating \"WE DON'T CHANGE THE MESSAGE, THE MESSAGE CHANGES US. JOIN US SUNDAY AT 10:30 A.M.\" More signs follow: \"FAITH LUTHERAN CHURCH & SCHOOL, THANKS PASTOR KEN.LIZ,\" thanking Pastor Ken and Liz for their service, \"48+ YEARS OF FAITHFULNESS! 
NOV I SUNDAYS 8:00 10:00,\" \"Spirit of Life Church 4pm Sunday,\" \"thebridge.me,\" \"Sugarland Family Church,\" \"Alpha & Omega Int'l Ministry of Faith,\" and \"The Fort Bend Church,\" each highlighting different church messages and leadership. The man's voice underscores the fragmentation, \"Next thing you know, there are thousands of different churches all claiming to go on the Bible alone,\" and reiterates, \"The Bible is a Catholic book, just as faith and reason go together.\"\n\nClosing the narrative, the video shows \"FIRST COLONY CHURCH OF CHRIST,\" with \"Ronnie Norman, Senior Minister\" displayed. The familiar teal background hosts the man in a suit and blue shirt, still wearing the religious pendant, speaking emphatically with \"Scripture and Tradition go together\" on the screen. He states, \"There's nothing that the Catholic Church teaches, when properly understood, that contradicts Scripture properly understood.\" The theme of the intertwining of Scripture and Tradition is reinforced as the video concludes, leaving the viewer with a cohesive message about the relationship between the Catholic Church, its teachings, and the Bible."
+ }
+ ],
+ "events": [
+ "A man in a dark suit and red polka dot tie speaks against a teal background.",
+ "Phrase displayed on the screen: 'The Bible is a Catholic book.'",
+ "Man's voice says, 'If you don't trust the Church on anything else.'",
+ "Scene transitions to a religious procession inside a church.",
+ "Several people in white robes and one in a red robe hold a large, ornate book.",
+ "Text reads: 'Why would you trust the Catholic Church to provide you with the Bible? The one book you are basing your entire religion off of.'",
+ "Scene shows a person in a red robe performing a religious ritual with incense.",
+ "Next, a man in a dark suit and blue shirt with a religious pendant speaks against the teal background.",
+ "Words on screen: 'Sola Scriptura simply isn't scriptural.'",
+ "Black-and-white footage of another man holding an open book and speaking at a podium appears.",
+ "Text: 'Sola Scriptura simply isn't scriptural.'",
+ "Speaker points to the book, text changes to: 'Nowhere in the Bible does the Bible say the Bible alone or Scripture alone.'",
+ "Close-up shot reveals text.",
+ "Man's voice resonates: 'Sola Scriptura simply isn't scriptural. Nowhere in the Bible does the Bible say the Bible alone or Scripture alone.'",
+ "Footage shows a book passage: 'All Scripture is inspired by God and profitable for teaching, for reproof, for correction, and for training in righteousness, that the man of God may be complete, equipped for every good work.'",
+ "Focus on text from the previous passage.",
+ "Transition to black-and-white footage of a religious figure in ornate attire making a gesture.",
+ "Speaker discusses references to the Old Testament, explains: '...are references to the Old Testament because the New Testament didn't exist yet. There has to be an authoritative interpretation.'",
+ "Warns: 'or else you're going to have a bunch of different churches that all disagree on difficult passages.'",
+ "Painting depicts a central figure in red and blue robes seated on a rock, surrounded by attentive listeners.",
+ "Narrative flows to a similar painting scene.",
+ "Transition to black-and-white footage of a man in a suit holding an open book at a podium.",
+ "Screen text reads: 'Jesus wanted his people to be one, but the fruit of Sola Scriptura is disunity.'",
+ "Shots of various church signs begin with 'Fishers of Men Lutheran Church.'",
+ "Church sign: 'Christ Church Sugar Land, A United Methodist Community.'",
+ "Video shows multiple church signs: 'Christ Church Sugar Land, A United Methodist Community,' 'Sugar Land Church of God, Anderson, Indiana,' 'FAITH LUTHERAN CHURCH & SCHOOL, THANKS PASTOR KEN.LIZ,' '48+ YEARS OF FAITHFULNESS! NOV I SUNDAYS 8:00 10:00,' 'Spirit of Life Church 4pm Sunday,' 'thebridge.me,' 'Sugarland Family Church,' 'Alpha & Omega Int'l Ministry of Faith,' and 'The Fort Bend Church.'",
+ "Man's voice underscores the fragmentation: 'Next thing you know, there are thousands of different churches all claiming to go on the Bible alone.'",
+ "Voice reiterates: 'The Bible is a Catholic book, just as faith and reason go together.'",
+ "Closing scene shows 'FIRST COLONY CHURCH OF CHRIST,' with 'Ronnie Norman, Senior Minister' displayed.",
+ "Teal background hosts the man in a suit and blue shirt still wearing the religious pendant.",
+ "Words on screen: 'Scripture and Tradition go together.'",
+ "Man speaks emphatically: 'There's nothing that the Catholic Church teaches, when properly understood, that contradicts Scripture properly understood.'",
+ "Video concludes, reinforcing the theme of intertwining Scripture and Tradition."
+ ]
+ },
+ {
+ "video": "video/191.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up view of a man walking on a light wooden floor. Overlaid text reads, \"Stop buying almond milk from the store,\" followed by a red cross mark. The audio reveals a man's voice emphasizing the same message: \"Stop buying almond milk from the store if the ingredient list looks like this.\" The scene transitions into a grocery store aisle, focusing on a shelf displaying a carton labeled \"NUT MILK\" in handwritten text. As the man picks up the carton, the camera zooms in on the ingredients list, highlighted with overlaid text that reads, \"if the ingredient list looks like this...\" The ingredients list includes filtered water, cane sugar, ground almonds (2.5%), and other additives.\n\nThe scene shifts to a kitchen countertop, subdivided into four quadrants, each showcasing one of the four ingredients essential for making almond milk: a handful of almonds, a glass of water, a bottle of maple syrup, and a bottle of vanilla extract. The overlaid text reads, \"All you need is 4 ingredients...\" while the man's voice continues, \"All you need is really four ingredients.\"\n\nOnscreen, a hand reaches into a bowl filled with almonds, indicating the beginning of the almond milk preparation process. Zooming into the next scene, we see a bowl filled with almonds being topped up with water, accompanied by overlaid text, \"soak your almonds.\" The audio supports this by saying, \"First step, grab your nuts. 
If you want it to be easier to digest, we highly recommend soaking them for 12 hours and then rinsing and draining just like rice.\" The text on a black screen advises, \"wait for 12 hours for the almonds to soak.\" The scene then shows the almonds being drained over a sink, with overlaid text reading, \"rinse & drain the almonds (like rice).\" The man pours these rinsed almonds into a blender, emphasized with the text, \"1 cup almonds,\" followed by water being poured in with the caption, \"1-2L water (less water = creamier).\" The audio adds, \"When you're ready, throw everything into your optimum blender.\"\n\nAs the next scene unfolds, we see a close-up view of the blender, now containing almonds and water, and the overlaid text reads, \"1 tbs maple syrup.\" The audio further elaborates, \"If you want it to be creamier, add less water. If you want it to be sweeter or less sweet, do what you want,\" as a finger presses the button on the blender's control panel, setting it to 45 seconds. The visual text advises to \"blend for 45 seconds :)\", and the camera captures the frothy, white mixture blending vigorously. The man narrates, \"And then blend it for 45 seconds,\" as the scene transitions to a person holding a nut milk bag over a large glass bowl. The overlaid text reads, \"pour into nut milk bag,\" while the man's voice instructs, \"Grab your nut milk bag and pour all the liquid into it.\" The person begins straining the mixture with the nut milk bag, ensuring all the liquid is squeezed out.\n\nNext, a person continues to squeeze a nut milk bag over a large glass bowl, with the overlaid text, \"strain all the milk out.\" The man's voice endorses this, \"Grab your nut milk bag and pour all the liquid into the nut milk bag. And then just start straining all the milk out. By the way, this is how you milk an almond. 
Squeeze every last bit of the goodness out.\" The text changes to explain more explicitly, \"this is how you milk an almond.\" The person opens the nut milk bag to display the remaining almond pulp. The man continues, \"This is the pulp. We've got recipes coming soon on how to use them,\" as he pours the freshly strained milk into an empty glass jar.\n\nIn the final scenes, we see a clear glass jar being filled steadily with the freshly strained almond milk. The jar is topped with a golden lid and secured tightly. Overlaid text reads, \"follow for more blender recipes :),\" as the man's voice continues, \"But until then, just pour it in into a nice tall glass jar, like that and then store in your fridge. Follow for more recipes.\" The clip concludes with a person holding the sealed jar of almond milk, ready for storage or use.\n"
+ }
+ ],
+ "events": [
+ "Close-up view of a man walking on a light wooden floor.",
+ "Overlaid text reads, 'Stop buying almond milk from the store,' followed by a red cross mark.",
+ "Audio message: 'Stop buying almond milk from the store if the ingredient list looks like this.'",
+ "Scene transitions to a grocery store aisle.",
+ "Focus on a shelf displaying a carton labeled 'NUT MILK' in handwritten text.",
+ "Man picks up the 'NUT MILK' carton.",
+ "Camera zooms in on the ingredients list with overlaid text, 'if the ingredient list looks like this...'",
+ "Ingredients list displayed: filtered water, cane sugar, ground almonds (2.5%), and other additives.",
+ "Scene shifts to a kitchen countertop, subdivided into four quadrants.",
+ "Four essential ingredients shown: a handful of almonds, a glass of water, a bottle of maple syrup, and a bottle of vanilla extract.",
+ "Overlaid text reads, 'All you need is 4 ingredients...'",
+ "Audio message: 'All you need is really four ingredients.'",
+ "Hand reaches into a bowl filled with almonds.",
+ "Bowl filled with almonds being topped up with water.",
+ "Overlaid text reads, 'soak your almonds.'",
+ "Audio message: 'First step, grab your nuts. If you want it to be easier to digest, we highly recommend soaking them for 12 hours and then rinsing and draining just like rice.'",
+ "Text on a black screen advises: 'wait for 12 hours for the almonds to soak.'",
+ "Scene shows the almonds being drained over a sink.",
+ "Overlaid text reads, 'rinse & drain the almonds (like rice).'",
+ "Man pours rinsed almonds into a blender with the text, '1 cup almonds.'",
+ "Water being poured into the blender with overlaid caption, '1-2L water (less water = creamier).'",
+ "Audio message: 'When you're ready, throw everything into your optimum blender.'",
+ "Close-up view of the blender containing almonds and water.",
+ "Overlaid text reads, '1 tbs maple syrup.'",
+ "Audio message: 'If you want it to be creamier, add less water. If you want it to be sweeter or less sweet, do what you want.'",
+ "Finger presses the button on the blender's control panel, setting it to 45 seconds.",
+ "Visual text advises: 'blend for 45 seconds :)'",
+ "Camera captures frothy, white mixture blending vigorously.",
+ "Audio message: 'And then blend it for 45 seconds.'",
+ "Scene transitions to a person holding a nut milk bag over a large glass bowl.",
+ "Overlaid text reads, 'pour into nut milk bag.'",
+ "Audio message: 'Grab your nut milk bag and pour all the liquid into it.'",
+ "Person strains the mixture with the nut milk bag, ensuring all the liquid is squeezed out.",
+ "Overlaid text reads, 'strain all the milk out.'",
+ "Audio message: 'Grab your nut milk bag and pour all the liquid into the nut milk bag. And then just start straining all the milk out. By the way, this is how you milk an almond. Squeeze every last bit of the goodness out.'",
+ "Text changes to explain: 'this is how you milk an almond.'",
+ "Person opens the nut milk bag to display the remaining almond pulp.",
+ "Audio message: 'This is the pulp. We've got recipes coming soon on how to use them.'",
+ "Man pours freshly strained milk into an empty glass jar.",
+ "Clear glass jar being filled steadily with freshly strained almond milk.",
+ "Jar is topped with a golden lid and secured tightly.",
+ "Overlaid text reads, 'follow for more blender recipes :)'",
+ "Audio message: 'But until then, just pour it in into a nice tall glass jar, like that and then store in your fridge. Follow for more recipes.'",
+ "Clip concludes with a person holding the sealed jar of almond milk, ready for storage or use."
+ ]
+ },
+ {
+ "video": "video/192.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man and a woman sitting closely together inside a tent. They are dressed warmly in outdoor gear and wrapped in purple and gray sleeping bags. The screen displays \"PNC Day 7,\" indicating it is the seventh day of their journey. The audio reveals someone saying, \u201cHello from our tent, safe and sound. Today the Pacific Northwest Circuit gave us a bit more adventure than we wanted. We\u2019ve been heading up into the Olympic Mountains and camp this morning was surrounded by snow.\u201d The scene transitions to an outdoor setting where the hiker, using trekking poles and a large backpack, climbs a snowy slope. Snow-covered mountains and evergreen trees provide the backdrop, highlighting the effort and determination involved in the climb. The hiker triumphantly raises a trekking pole at a higher elevation, and the text \"4,100 ft.\" appears, indicating the altitude reached. The clip concludes with the hiker walking towards a tent set up on a snowy landscape.\n\nThe second part of the video shows a campsite set up on a snowy landscape, with a tent on a patch of green ground surrounded by snow. A woman in a gray jacket stands near the tent, next to a tree with gear placed around it. The scene then focuses on a woman in a red jacket sitting on a mat, putting on black socks, with emphasis on her adjusting crampons to her shoes. As the woman hikes up a snowy slope with a backpack and trekking poles, the text \"4,400 ft.\" appears on screen, revealing the attained altitude. The audio mentions, \u201cWe put on our waterproof socks and micro spikes and headed off, unsure of the conditions ahead. The day started relatively flat and easy and we soon hit a milestone. We made it to mile 100!\u201d The clip shows the hiker standing and speaking, with the text \"PNC MILE 100!\" emphasizing the significance of their progress.\n\nContinuing, the video displays a woman in a gray jacket on a snowy landscape, marked by the text \"PNC MILE 100!\" Snow-covered slopes and evergreen trees form a captivating backdrop. The scene shifts to a close-up of the woman\u2019s feet in red crampons trudging through the snow. Another person in an orange jacket and black hat uses a smartphone to check the route. The audio indicates, \u201cWhile the trail is buried under several feet of snow, we used GPS and maps on our phones for navigation. As the trail got steeper, we had to kick steps into the side of the mountain so we wouldn\u2019t slip down.\u201d Subsequently, the text \"4,700 ft.\" appears as the clip shows the hiker navigating the challenging snowy terrain.\n\nAs the journey progresses, the video shows a woman in a gray jacket and yellow beanie using a small shovel to dig into the snow against the backdrop of snow-covered mountains under a partly cloudy sky. The woman then melts snow over a portable stove. Another scene shows the person in an orange jacket eating noodles from a pot with a fork. Another hiking segment follows, with the text \"5,000 ft.\" indicating their ascent. The audio describes their experience: \u201cAll the streams were frozen, so we melted snow to make ramen for lunch. After lunch, the trail got even harder with endless traverses on steep mountainsides. We started to question if we had the equipment and skills needed.\u201d\n\nThe challenging conditions unfold as the woman in a gray jacket and yellow beanie continues to hike up the snowy slope, marked by \"5,000 ft.\" Close-up shots show red crampons on the hiker\u2019s shoes. The man in an orange jacket, standing against the snowy landscape, appears concerned. The hike resumes, with \"5,200 ft.\" signifying their achieved altitude. Summarizing their arduous effort, the audio follows, \u201cWe had only made it three and a half miles in five hours, and then Tim took a small tumble. I just slid down the mountain and hit a tree. That helped us decide it was time to turn around. There\u2019s too much snow, so we\u2019re headed back where we came from.\u201d\n\nConcluding the video, the man and woman pose together for a photo at \"5,200 ft.\" They then continue hiking through the snowy landscape. Eventually, the text \"3,500 ft.\" appears, indicating a descent to a lower altitude. At their new campsite, the woman in the red jacket is inside the tent, wrapped in a purple sleeping bag. The audio concludes with, \u201cWe\u2019re disappointed we couldn\u2019t make it safely over the snowy mountains. We\u2019ll find a lower alternate route around them, and our adventure will continue.\u201d"
+ }
+ ],
+ "events": [
+ "Man and woman sitting inside a tent",
+ "They are dressed warmly in outdoor gear",
+ "Wrapped in purple and gray sleeping bags",
+ "Screen displays 'PNC Day 7'",
+ "Audio: 'Hello from our tent, safe and sound. Today the Pacific Northwest Circuit gave us a bit more adventure than we wanted. We\u2019ve been heading up into the Olympic Mountains and camp this morning was surrounded by snow.'",
+ "Scene transitions to an outdoor setting",
+ "Hiker climbs a snowy slope with trekking poles and a large backpack",
+ "Snow-covered mountains and evergreen trees in the background",
+ "Hiker raises a trekking pole at a higher elevation",
+ "Text '4,100 ft.' appears on screen",
+ "Hiker walks towards a tent set up on a snowy landscape",
+ "Campsite on a snowy landscape with a tent on a patch of green ground",
+ "Woman in a gray jacket stands near the tent",
+ "Gear placed around a tree near the tent",
+ "Woman in a red jacket sits on a mat putting on black socks",
+ "Woman adjusts crampons to her shoes",
+ "Woman hikes up a snowy slope with a backpack and trekking poles",
+ "Text '4,400 ft.' appears on screen",
+ "Audio: 'We put on our waterproof socks and micro spikes and headed off, unsure of the conditions ahead. The day started relatively flat and easy and we soon hit a milestone. We made it to mile 100!'",
+ "Hiker standing and speaking with the text 'PNC MILE 100!'",
+ "Snow-covered slopes and evergreen trees in the background",
+ "Close-up of the woman\u2019s feet in red crampons trudging through the snow",
+ "Person in an orange jacket and black hat using a smartphone",
+ "Audio: 'While the trail is buried under several feet of snow, we used GPS and maps on our phones for navigation. As the trail got steeper, we had to kick steps into the side of the mountain so we wouldn\u2019t slip down.'",
+ "Text '4,700 ft.' appears on screen",
+ "Hiker navigating challenging snowy terrain",
+ "Woman in gray jacket and yellow beanie using a small shovel to dig into the snow",
+ "Snow-covered mountains under a partly cloudy sky",
+ "Woman melts snow over a portable stove",
+ "Person in an orange jacket eating noodles from a pot with a fork",
+ "Text '5,000 ft.' appears on screen",
+ "Audio: 'All the streams were frozen, so we melted snow to make ramen for lunch. After lunch, the trail got even harder with endless traverses on steep mountainsides. We started to question if we had the equipment and skills needed.'",
+ "Woman in gray jacket and yellow beanie hiking up a snowy slope",
+ "Close-up shots of red crampons on hiker\u2019s shoes",
+ "Man in an orange jacket standing against the snowy landscape, appearing concerned",
+ "Text '5,200 ft.' appears on screen",
+ "Audio: 'We had only made it three and a half miles in five hours, and then Tim took a small tumble. I just slid down the mountain and hit a tree. That helped us decide it was time to turn around. There\u2019s too much snow, so we\u2019re headed back where we came from.'",
+ "Man and woman pose together for a photo at '5,200 ft.'",
+ "Man and woman continue hiking through the snowy landscape",
+ "Text '3,500 ft.' appears on screen, indicating a descent",
+ "Woman in the red jacket inside the tent wrapped in a purple sleeping bag",
+ "Audio: 'We\u2019re disappointed we couldn\u2019t make it safely over the snowy mountains. We\u2019ll find a lower alternate route around them, and our adventure will continue.'"
+ ]
+ },
+ {
+ "video": "video/193.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man standing inside a house, holding a crumpled piece of aluminum foil. A text overlay reads, 'THIS IS WHY I ALWAYS CHECK MY PROPERTIES.' The man, dressed in a dark blue shirt and wearing a watch, stands in a kitchen area with white cabinets and wooden flooring. He transitions outside to show some fallen boards on the side of the house. The text overlay reads, 'I GOTTA COUPLE BOARDS DOWN.' The man explains that these boards cover up the basement and have fallen down. The text changes to 'YOU CAN SEE THIS IS JUST TO COVER UP THE BASEMENT, THEY'VE FALLEN DOWN.' Frustrated, the man says, 'OH, Y'KNOW WHAT, I GUARANTEE... THAT WAS THE FREAKIN' INSPECTOR, WHO DID THAT CRAP. THAT PISSES ME OFF.' while holding a phone and looking at the fallen boards.\n\nAudio reveals the person saying, \"This is why I like to come check on my properties. I got a couple boards down. You can see this is just to cover up the basement. They've fallen down. Oh, you know what? I guarantee that was the freaking inspector, who did that crap. That pisses me off.\" against a backdrop of continuous speech.\n\nThe scene continues outside the house with the man holding a phone, still focused on the fallen boards. The text overlay reads, 'THE ONLY REASON THEY GET UNDERNEATH THERE IS TO..INSPECT THE PROPERTY.' The man's speculation follows, 'SO I'M SURE THE GUY TOOK THEM DOWN AND DIDN'T PUT THEM BACK UP.' The scene shifts inside the house where the man is looking upwards, the overlay now reading, 'WHY...DID THEY LEAVE THE FAN GOING ON? WHY DID THEY LEAVE THE LIGHTS ON?'\n\nThe audio captures the man's frustration: \"The only reason to get underneath there is to inspect the property. So I'm sure the guy took him down and didn't put them back up. Why did they leave the fan going on? Why did they leave the lights on?\" with continuous speech.\n\nThe man then moves to a wall, pointing at it, with 'WHY...DID WE NOT TAKE CARE OF THIS, ON THE INITIAL WALK THROUGH WHERE THEY HUNG THEIR TV?' The man inspects more holes, and 'WHAT THE HELL HAPPENED HERE?' shows his perplexity. Transitioning to another part of the wall with a rectangular cut-out, 'WHO CUT THESE HOLES IN HERE?' is the overlay. The man ends by moving towards a closet with a washing machine, with 'LET'S TURN THIS OFF..IT SHOULDN'T BE ON, THEY PROBABLY TESTED IT AND DON'T CARE WHAT HAPPENS TO MY HOUSE.' as he reaches to turn off the machine.\n\nAudio reveals: \"Why did they leave the lights on? Why did we not take care of this on the initial walk through where they hung their TV? What the hell happened here? Who cut these holes in here? Let's turn this off.. It shouldn't be on. They probably tested it and don't care what happens to my house.\"\n\nThe video collectively follows the frustrating inspection of a property, highlighting negligence like fallen boards, lights and fans left on, unaddressed damages, and a running washing machine, as evidenced and lamented by the person throughout each scene."
+ }
+ ],
+ "events": [
+ "Video begins with a man standing inside a house, holding a crumpled piece of aluminum foil.",
+ "Text overlay reads, 'THIS IS WHY I ALWAYS CHECK MY PROPERTIES.'",
+ "Man is dressed in a dark blue shirt and wearing a watch.",
+ "Man is standing in a kitchen area with white cabinets and wooden flooring.",
+ "Man transitions outside to show some fallen boards on the side of the house.",
+ "Text overlay reads, 'I GOTTA COUPLE BOARDS DOWN.'",
+ "Man explains the boards cover up the basement and have fallen down.",
+ "Text changes to 'YOU CAN SEE THIS IS JUST TO COVER UP THE BASEMENT, THEY'VE FALLEN DOWN.'",
+ "Man expresses frustration and says, 'OH, Y'KNOW WHAT, I GUARANTEE... THAT WAS THE FREAKIN' INSPECTOR, WHO DID THAT CRAP. THAT PISSES ME OFF.'",
+ "Man holds a phone while looking at the fallen boards.",
+ "Audio reveals the person saying, 'This is why I like to come check on my properties. I got a couple boards down. You can see this is just to cover up the basement. They've fallen down. Oh, you know what? I guarantee that was the freaking inspector, who did that crap. That pisses me off.'",
+ "Scene continues outside with the man holding a phone, still focused on the fallen boards.",
+ "Text overlay reads, 'THE ONLY REASON THEY GET UNDERNEATH THERE IS TO..INSPECT THE PROPERTY.'",
+ "Man speculates, 'SO I'M SURE THE GUY TOOK THEM DOWN AND DIDN'T PUT THEM BACK UP.'",
+ "Scene shifts inside the house where the man is looking upwards.",
+ "Overlay reads, 'WHY...DID THEY LEAVE THE FAN GOING ON? WHY DID THEY LEAVE THE LIGHTS ON?'",
+ "Audio captures the man's frustration: 'The only reason to get underneath there is to inspect the property. So I'm sure the guy took him down and didn't put them back up. Why did they leave the fan going on? Why did they leave the lights on?'",
+ "Man moves to a wall and points at it.",
+ "Text overlay reads, 'WHY...DID WE NOT TAKE CARE OF THIS ON THE INITIAL WALK THROUGH WHERE THEY HUNG THEIR TV?'",
+ "Man inspects more holes and shows perplexity with 'WHAT THE HELL HAPPENED HERE?'",
+ "Transitioning to another part of the wall with a rectangular cut-out, text overlay reads, 'WHO CUT THESE HOLES IN HERE?'",
+ "Man moves towards a closet with a washing machine.",
+ "Text overlay reads, 'LET'S TURN THIS OFF..IT SHOULDN'T BE ON, THEY PROBABLY TESTED IT AND DON'T CARE WHAT HAPPENS TO MY HOUSE.'",
+ "Man reaches to turn off the washing machine.",
+ "Audio reveals: 'Why did they leave the lights on? Why did we not take care of this on the initial walk through where they hung their TV? What the hell happened here? Who cut these holes in here? Let's turn this off.. It shouldn't be on. They probably tested it and don't care what happens to my house.'",
+ "Video follows the frustrating inspection of a property.",
+ "Video highlights neglect like fallen boards, lights and fans left on, unaddressed damages, and a running washing machine."
+ ]
+ },
+ {
+ "video": "video/194.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a serene farm setting, characterized by green fields under a partly cloudy sky. A man wearing a black tank top and gloves leans over a wooden fence to interact with a large black and white sheep, distinguished by its thick, curly wool and a green ear tag. A female voice introduces the sheep, saying, \"This is Igante. He's way too big for me, so I let my wife handle him. With proper technique, she makes it look like a breeze.\" As music plays in the background, the scene transitions to another woman in a blue tank top and jeans getting the sheep down inside a wooden structure.\n\nInside the wooden structure, equipped with various tools, including a yellow container and a black toolbox, she uses electric shears to skillfully remove the sheep's wool, maneuvering the animal gently to reach different parts of its body. The voice narration continues, \"Now that she's done all my heavy lifting, I can get started. Again, I skipped straight to the hip because I really want you to see the fleece at the end. Pigante is a Columbia sheep; they're known to be extremely tall and have excellent fiber.\"\n\nAs the shearing goes on, with large sections of wool coming off, the voice remarks, \"His fleece does not disappoint. His fleece is incredibly thick and rich, which is why he's a stud for this herd. He's not the heaviest, but he's definitely the tallest boy I've shorn, so I am grateful that he is being super cooperative until I get to this head where he reminds me that if he wanted to, he could totally take me. This is just one year's worth of growth and even though I'm bending him around, he's not being hurt at all. Quick turn off of my shear so I can maneuver him into the upward position. A couple of wiggles to get comfy, and back to rolling the rest of it off. This guy lives on a fiber farm, so his 12-pound fleece will be sold so that someone can make yarn out of it.\" Throughout this process, another man wearing a blue shirt and red hat appears in the background, observing the procedure. The wool keeps piling up, and the sheep remains calm and cooperative. The shearer's voice and technical skill maintain a harmonious cadence, merging expertise with care.\n\nFinally, the shearer stands the now-sheared sheep up, showcasing the animal's clean-cut look and a hefty pile of wool. The voice is enthused, \"Now that he's all cleaned up, let's go check his fleece out. Look at that spread. Truly amazing. Thanks, buddy.\" With the sheep standing relieved and the fleece meticulously prepared on a table, the video concludes, capturing the essence of patience, skill, and the pastoral beauty of the shearing process."
+ }
+ ],
+ "events": [
+ "Video begins with a serene farm setting characterized by green fields under a partly cloudy sky.",
+ "A man wearing a black tank top and gloves leans over a wooden fence to interact with a large black and white sheep with thick, curly wool and a green ear tag.",
+ "A female voice introduces the sheep, saying, 'This is Igante. He's way too big for me, so I let my wife handle him. With proper technique, she makes it look like a breeze.'",
+ "Music plays in the background.",
+ "Scene transitions to another woman in a blue tank top and jeans getting the sheep down inside a wooden structure.",
+ "The scene transitions to the man in a blue tank top and jeans shearing the sheep inside a wooden structure.",
+ "The wooden structure is equipped with various tools including a yellow container and a black toolbox.",
+ "The woman uses electric shears to skillfully remove the sheep's wool, maneuvering the animal gently to reach different parts of its body.",
+ "Voice narration continues, 'Now that she's done all my heavy lifting, I can get started. Again, I skipped straight to the hip because I really want you to see the fleece at the end. Pigante is a Columbia sheep; they're known to be extremely tall and have excellent fiber.'",
+ "Large sections of wool come off during shearing.",
+ "Male voice remarks, 'His fleece does not disappoint. His fleece is incredibly thick and rich, which is why he's a stud for this herd. He's not the heaviest, but he's definitely the tallest boy I've shorn, so I am grateful that he is being super cooperative until I get to this head where he reminds me that if he wanted to, he could totally take me. This is just one year's worth of growth and even though I'm bending him around, he's not being hurt at all. Quick turn off of my shear so I can maneuver him into the upward position. A couple of wiggles to get comfy, and back to rolling the rest of it off. This guy lives on a fiber farm, so his 12-pound fleece will be sold so that someone can make yarn out of it.'",
+ "Another man wearing a blue shirt and red hat appears in the background observing the procedure.",
+ "The wool keeps piling up, and the sheep remains calm and cooperative.",
+ "The female voice and technical skill maintain a harmonious cadence, merging expertise with care.",
+ "The man stands the now-sheared sheep up, showcasing the animal's clean-cut look and a hefty pile of wool.",
+ "Female voice enthused, 'Now that he's all cleaned up, let's go check his fleece out. Look at that spread. Truly amazing. Thanks, buddy.'",
+ "The sheep stands relieved and the fleece is meticulously prepared on a table.",
+ "Video concludes, capturing the essence of patience, skill, and the pastoral beauty of the shearing process."
+ ]
+ },
+ {
+ "video": "video/195.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a man in a white lab coat drinking milk from a glass. The text and audio display \"This is the cleanest milk in the world.\" The scene transitions to a close-up of milk being poured into a clear glass bottle, and the text and audio display \"It doesn't come from a cow, It doesn't come from a goat.\" The narrator continues, \"It has no lactose or cholesterol,\" as the milk is poured. The scene shifts to a man in a lab coat and blue gloves working with laboratory equipment, with the audio and text concluding, \"And the best part? It's made in a lab by these guys.\"\n\nThe narrative continues with a scene showing a man in a white lab coat and blue gloves working with laboratory equipment. The scene transitions to two men in lab coats standing back-to-back. In the next frame, the same men wear green and black shirts with the 'remilk' logo and introduce themselves. The text and audio display \"Hi, my name is Aviv\" and \"And my name is Ori,\" respectively, followed by \"And we found a way to fix milk.\" The clip continues with a close-up of two clear glass bottles filled with milk; a man holds the two bottles as the text and audio display \"Turns out normal milk is not healthy, because it's full of lactose.\"\n\nA woman in a black shirt pours milk into a glass bottle, with the text and audio \"We humans cannot tolerate lactose.\" The scene shifts to cows in a milking parlor, with the text and audio \"To make it, animals are abused, and tons of resources like water, money and energy are wasted.\"\n\nThe scene shows a cow being sprayed with water, one hand counting money, and a person pouring feed into a trough for cows. The scene then shows two men in lab coats standing back-to-back, with the text and audio \"That's why these guys turned cows' genes into a microbe that can generate milk all by itself.\" A close-up of COW GENES and MICROBE appears on the screen.\n\nA man in a lab coat works with laboratory equipment, with the text and audio \"Using fermentation, it looks and tastes exactly the same as cow milk.\" The scene transitions to a close-up of a hand holding a clear glass bottle filled with milk, and a woman drinks milk from the bottle. The scene shifts to a dairy section in a store, with the text and audio \"And it can even be used to create yogurt and cheese.\"\n\nThe closing scene begins with a man adjusting his glasses in front of a whiteboard with various diagrams and texts. The narrator explains, \"These two scientists are working non-stop to do something that has never been done before.\" The scene shifts to another man in a white lab coat revealing a green 'remilk' logo shirt. In the end, a man in a black shirt makes a gesture and says, \"And when they succeed, we all succeed. That's 1 minute, see you tomorrow,\" with the words '1 minute' and 'tomorrow' displayed in yellow."
+ }
+ ],
+ "events": [
+ "A man in a white lab coat drinks milk from a glass.",
+ "Text and audio display: 'This is the cleanest milk in the world.'",
+ "Scene transitions to a close-up of milk being poured into a clear glass bottle.",
+ "Text and audio display: 'It doesn't come from a cow, It doesn't come from a goat.'",
+ "Narrator states: 'It has no lactose or cholesterol.'",
+ "Milk is shown being poured.",
+ "Scene shifts to a man in a lab coat and blue gloves working with laboratory equipment.",
+ "Text and audio conclude: 'And the best part? It's made in a lab by these guys.'",
+ "Scene shows a man in a white lab coat and blue gloves working with laboratory equipment.",
+ "Scene transitions to two men in lab coats standing back-to-back.",
+ "Scene transitions to the same men wearing green and black shirts with the 'remilk' logo.",
+ "Text and audio display: 'Hi, my name is Aviv.'",
+ "Text and audio display: 'And my name is Ori.'",
+ "Text and audio state: 'And we found a way to fix milk.'",
+ "Clip concludes with a close-up of two clear glass bottles filled with milk.",
+ "A man holds two clear glass bottles filled with milk.",
+ "Text and audio display: 'Turns out normal milk is not healthy, because it's full of lactose.'",
+ "A woman in a black shirt pours milk into a glass bottle.",
+ "Text and audio display: 'We humans cannot tolerate lactose.'",
+ "Scene shifts to cows in a milking parlor.",
+ "Text and audio display: 'To make it, animals are abused, and tons of resources like water, money, and energy are wasted.'",
+ "Scene shows a cow being sprayed with water.",
+ "A hand is shown counting money.",
+ "A person pours feed into a trough for cows.",
+ "Scene shows two men in lab coats standing back-to-back.",
+ "Text and audio display: 'That's why these guys turned cows' genes into a microbe that can generate milk all by itself.'",
+ "Close-up of COW GENES and MICROBE displayed on the screen.",
+ "A man in a lab coat works with laboratory equipment.",
+ "Text and audio display: 'Using fermentation, it looks and tastes exactly the same as cow milk.'",
+ "Scene transitions to a close-up of a hand holding a clear glass bottle filled with milk.",
+ "A woman drinks milk from the bottle.",
+ "Scene shifts to a dairy section in a store.",
+ "Text and audio display: 'And it can even be used to create yogurt and cheese.'",
+ "Closing scene begins with a man adjusting his glasses in front of a whiteboard with various diagrams and texts.",
+ "Narrator explains: 'These two scientists are working non-stop to do something that has never been done before.'",
+ "Scene shifts to another man in a white lab coat revealing a green 'remilk' logo shirt.",
+ "A man in a black shirt makes a gesture and says: 'And when they succeed, we all succeed. That's 1 minute, see you tomorrow.'",
+ "The words '1 minute' and 'tomorrow' are displayed in yellow."
+ ]
+ },
+ {
+ "video": "video/196.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video features a series of segments where individuals showcase and discuss various beauty products. In the initial segment, a woman with blonde hair tied back with a black headband appears against a plain background. Dressed in a blue top, she applies a beauty product to her face. She says, \"I'm really trying, it's gonna be no for me.\" The second woman, with long, wavy dark hair, is seated in a room filled with shelves and decorations. She wears a light-colored top and a butterfly pendant necklace. Holding a jar of a creamy, beige product, likely a foundation or skincare item, she seems to explain its utility. She says, \"It does not look good on me, and the scent is overwhelming.\" The background then shifts to a third woman with long braided hair, also spotlighted in a bright room with a white backdrop, where a white bed and mirrored closet door are visible. Holding a round container that reads 'JONES ROAD', she introduces or reviews the product. On the audio, a voice, presumably hers, states: \"This is the most hated foundation on the internet. So I bought it.\"\n\nTransitioning to the second segment, the woman with the long braided hair remains the focus in the same well-lit room. Wearing a dark-colored shirt, she opens the 'JONES ROAD' container to reveal a creamy, brown substance. She showcases its texture and consistency and then applies it to her face using her fingers. The audio accompanies this visual with the dialogue: \"This is what it looks like up close. I got the color cinnamon but I think it might be too dark. We'll see, I'm applying with my fingers just the tiniest bit.\" Her narrative is punctuated by light laughter.\n\nIn the third segment, the same woman continues applying the beauty product to her face. Her light blue nail polish is visible as she blends the product, demonstrating its application. The audio highlights her commentary: \"A little dark but I feel like when I blend it out it won't matter. Guys, I feel like it's blending out really nicely. I think it looks great, it made my skin look really glow and moisturized but with very little coverage.\"\n\nDuring the fourth segment, she gestures animatedly with her hands, adorned with light blue nail polish, and explains more about the product. Once again, she holds up the round container with the creamy, brown substance. She elaborates, \"So it's really good for like a clean girl makeup look. If you're a full glam, full coverage makeup girl, this is not gonna work for you.\" Her voice becomes more pronounced as she emphasizes her points.\n\nIn the concluding segment, the focus remains on the same woman with braided hair and blue nail polish. She continues discussing the product, occasionally gesticulating. Holding the round container, she remarks: \"It does smell kind of weird though. Earthy.\" The audio captures her speaking, sniffing the product, and then commenting on its smell.\n\nThe consistent backdrop of the white bed and mirrored closet door ties the segments together visually, while the in-depth discussions about the beauty product, enhanced by close-up demonstrations of its texture and application, provide a comprehensive guide to the viewer."
+ }
+ ],
+ "events": [
+ "Woman with blonde hair tied back with a black headband appears against a plain background.",
+ "She is dressed in a blue top and applies a beauty product to her face.",
+ "She says, 'I'm really trying, it's gonna be no for me.'",
+ "Second woman with long, wavy dark hair is seated in a room filled with shelves and decorations.",
+ "She wears a light-colored top and a butterfly pendant necklace.",
+ "She holds a jar of a creamy, beige product and explains its utility.",
+ "She says, 'It does not look good on me, and the scent is overwhelming.'",
+ "Background shifts to a third woman with long braided hair in a bright room with a white backdrop.",
+ "A white bed and mirrored closet door are visible in her room.",
+ "She holds a round container that reads 'JONES ROAD'.",
+ "Audio of her voice says, 'This is the most hated foundation on the internet. So I bought it.'",
+ "Focus remains on the woman with long braided hair in the same well-lit room.",
+ "She wears a dark-colored shirt and opens the 'JONES ROAD' container.",
+ "She reveals a creamy, brown substance in the container.",
+ "She showcases its texture and consistency, then applies it to her face using her fingers.",
+ "Audio says, 'This is what it looks like up close. I got the color cinnamon but I think it might be too dark. We'll see, I'm applying with my fingers just the tiniest bit.'",
+ "Her narrative is punctuated by light laughter.",
+ "She continues applying the beauty product to her face.",
+ "Light blue nail polish is visible as she blends the product.",
+ "Audio highlights her commentary: 'A little dark but I feel like when I blend it out it won't matter. Guys, I feel like it's blending out really nicely. I think it looks great, it made my skin look really glow and moisturized but with very little coverage.'",
+ "In the fourth segment, she gestures animatedly with her hands.",
+ "Her hands are adorned with light blue nail polish.",
+ "She holds up the round container with the creamy, brown substance.",
+ "She elaborates, 'So it's really good for like a clean girl makeup look. If you're a full glam, full coverage makeup girl, this is not gonna work for you.'",
+ "Her voice becomes more pronounced as she emphasizes her points.",
+ "In the concluding segment, she continues discussing the product, occasionally gesticulating.",
+ "She holds the round container and remarks, 'It does smell kind of weird though. Earthy.'",
+ "Audio captures her speaking, sniffing the product, and commenting on its smell.",
+ "Consistent backdrop of the white bed and mirrored closet door ties the segments together visually.",
+ "In-depth discussions about the beauty product are enhanced by close-up demonstrations of its texture and application."
+ ]
+ },
+ {
+ "video": "video/197.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens in a gym setting where a man, dressed in a gray t-shirt and dark gray shorts, is squatting on the floor near a blue climbing wall. He claps his hands while texts 'DR. JASON HOOPER, PT, DPT, OCS, SCS' overlay at the bottom, revealing his credentials. As he maintains his squat, an orange 'X' marks his feet suggesting an incorrect position. Suddenly, he falls backward, indicative of what not to do. A close-up of his legs and feet emphasizes his hip area with an orange circle, indicating the correct posture or focus technique.\n\nOver the visuals, a man narrates, \"If you're having trouble doing this squat, it's probably not because of ankle mobility, contrary to popular belief. Typically, the limitation is actually in the hips, specifically hip flexion range of motion.\"\n\nThe scene shifts to a close-up of a man's feet and lower legs in a squatting position, marked with yellow question marks dissecting areas of common mistakes. As the camera zooms out, revealing the man's entire position against a yellow climbing wall, orange marks highlight his knees and hips, emphasizing correct posture or technique. Eventually, the man falls backward again, with an orange circle highlighting the ankle area to signify a point of focus.\n\nThe narration continues, \"So why people think it's their ankles. Because if you have limited hip flexion ability, your hips will push you backwards when you get near the bottom of your squat, changing your center of gravity. Your ankles will then attempt to combat this by pulling you forward. This creates a large load on the anterior tibialis, which can make it seem like the limitation is in the ankles. \"\n\nNext, a close-up of his ankle and foot, highlighted by an orange circle, transitions into a hand pointing towards the highlighted area, providing a demonstration or explanation. The frame shifts to an anatomical illustration of the lower legs and feet displaying bones and muscles. 
Finally, the man is seen with his foot against a wall, likely demonstrating a stretch or exercise.\n\nThe narration elaborates, \"To test this yourself, approach a wall, put your toes about eight centimeters away and see if you can touch your knee to the wall without your heel coming off the ground. \"\n\nThe focus moves to a man's lower leg and foot against a wall, bending his knee towards it, marked by an orange circle to emphasize the knee. Further, his ankle area is also circled in orange. Subsequently, the man squats holding a wooden stick vertically for support, transitioning to holding a dumbbell\u2014indicative of recommended exercise techniques.\n\nThe narration adds, \"If you can, your ankles are probably not the problem. To improve your hip mobility, you can use this stretch while holding a weight or a sturdy object in front of you, or you can try the child's pose, emphasizing left and right movement.\"\n\nThe man squats, gripping a dumbbell, focusing on his arms and hands. He shifts to a child's pose, stretching his arms\u2014first forward and then to the sides, executing a comprehensive stretching routine. He transitions from one stretch to another, sitting back on his heels, promoting flexibility.\n\nThe narration underscores, \"If you're experiencing pinching in the front of the hip, back off using your hands for support on the ground and try to slowly settle under the stretch.\"\n\nEventually, the man returns to the child's pose and continues stretching, reinforcing the exercises. The man then squats on the floor with animated hand gestures while text appears: 'Follow for more!' in a red box, signaling viewers for a call to action.\n\nClosing the narration, the man states, \"And that's today's beta.\""
+ }
+ ],
+ "events": [
+ "Man squats on the floor near a blue climbing wall in a gym",
+ "Man is dressed in a gray t-shirt and dark gray shorts",
+ "Texts appear overlayed: 'DR. JASON HOOPER, PT, DPT, OCS, SCS' revealing his credentials",
+ "Man claps his hands while squatting",
+ "Orange 'X' marks the man's feet indicating an incorrect position",
+ "Man falls backwards, showing what not to do in a squat",
+ "Close-up of man's legs and feet with an orange circle emphasizing the hip area for correct posture",
+ "Voice narration: 'If you're having trouble doing this squat... typically, the limitation is actually in the hips, specifically hip flexion range of motion.'",
+ "Scene shifts to close-up of a man's feet and lower legs in a squatting position, marked with yellow question marks indicating common mistakes",
+ "Camera zooms out showing man's entire position against a yellow climbing wall, orange marks highlight his knees and hips",
+ "Man falls backward again, orange circle highlights the ankle area",
+ "Voice narration: 'So why people think it's their ankles... This creates a large load on the anterior tibialis, which can make it seem like the limitation is in the ankles.'",
+ "Close-up of man's ankle and foot highlighted by an orange circle",
+ "Hand points towards the highlighted area for demonstration",
+ "Frame shifts to anatomical illustration of lower legs and feet showing bones and muscles",
+ "Man seen with his foot against a wall demonstrating a stretch or exercise",
+ "Voice narration: 'To test this yourself...knee to the wall without your heel coming off the ground.'",
+ "Focus moves to man's lower leg and foot against a wall, knee bending towards it marked by an orange circle",
+ "Ankle area circled in orange",
+ "Man squats holding a wooden stick vertically for support",
+ "Man holds a dumbbell indicating recommended exercise techniques",
+ "Voice narration: 'If you can, your ankles are probably not the problem... you can try the child's pose, emphasizing left and right movement.'",
+ "Man squats holding a dumbbell focusing on his arms and hands",
+ "Man transitions to child's pose stretching his arms first forward then to the sides",
+ "Man transitions between stretches sitting back on his heels promoting flexibility",
+ "Voice narration: 'If you're experiencing pinching in the front of the hip... try to slowly settle under the stretch.'",
+ "Man returns to child's pose and continues stretching",
+ "Man squats on the floor with animated hand gestures, text appears: 'Follow for more!' in a red box",
+ "Voice narration: 'And that's today's beta.'"
+ ]
+ },
+ {
+ "video": "video/198.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a girl who has long, braided hair adorned with white flowers and a pearl headband, wearing a yellow shirt with a colorful design and holding a small panda plush toy. The background is decorated with green, shimmering curtains. The text 'Wishes \ud83c\udf84\u2728: 1' appears at the top of the screen. \n\nA dialogue unfolds in a playful manner:\n- The text 'mommy what's above my head' appears in pink at the bottom, accompanied by a young girl's voice asking the question.\n- The text 'that's your Christmas wish honey. you can wish for whatever you like' appears in yellow, and a woman responds warmly.\n- The text 'I wanna wish for a boy' appears in pink at the bottom, accompanied by a young girl's voice, then she covers her mouth with a cry of surprise. But the text 'you are not getting a boyfriend' appears, and the woman's voice gently but firmly denies the request.\n- The text 'you're way too young honey' appears in yellow, a woman's reassuring tone.\n- The text 'but sissy wish for a boyfriend' appears in pink, voiced by the young girl.\n- A denial in red text 'no I didn't' appears, suggesting a sibling disagreement.\n- The text 'Lacey I told you not to do that' appears in yellow, reprimanding.\n- The text 'you little tattletale' appears in red, followed by an accusatory 'you got me in trouble' \n- The text 'I'm sorry' appears in pink for the apology.\n- The red text then reads 'No just for that I'm not sharing the money Grandma gave us,' followed by 'haha you can't reach it' and 'oh poor baby,' teasing statements.\n- The girl complains with 'it's not fair' in pink and makes a wish with the text 'I wish that I was a big girl, older than Sissy so I could boss her around' expresses a wish.\n\n- A black screen appears with 'Wishes \ud83c\udf84\u2728: 0' at the top and 'GRANTED\u2705' in green in the center, signifying the wish's fulfillment.\n- The girl now has long, loose hair and wears a red dress with a necklace. 
The dialogue resumes with 'wow that felt like a long nap' in pink, ending with the question 'why does my voice sound different' in the same pink text, revealing her discovery of change.\n\n- The girl examines herself, with 'what the' in pink for a gasp.\n- Realizations come with texts: 'I have pierced ears' and 'I have a big girl body' in pink.\n- The dialogue includes: 'well good morning to you' and 'you're looking nice' in blue, expressing a greeting and a compliment from a man.\n- The blue text 'come here *kisses her hand*' denotes an affectionate gesture.\n- The pink text 'ew gross' expresses disgust.\n- A blue text questions, 'hey what's the matter,' met by the reaction, 'I think my Christmas wish came true,' in pink.\n- The request 'give me a kiss\u2019 in blue is met urgently with 'come on I gotta go.'\n- The concluding frame shows 'who are you and where are my mom and sister' in pink, indicating confusion and concern from the now transformed young girl.\n\nThe video seamlessly blends both visual and audio elements, wrapping up a whimsical and humorous holiday tale about wishes, family dynamics, and unexpected transformations."
+ }
+ ],
+ "events": [
+ "The video begins with a girl with long, braided hair adorned with white flowers and a pearl headband.",
+ "The girl is wearing a yellow shirt with a colorful design.",
+ "The girl is holding a small panda plush toy.",
+ "The background is decorated with green, shimmering curtains.",
+ "Text 'Wishes \ud83c\udf84\u2728: 1' appears at the top of the screen.",
+ "A dialogue unfolds in a playful manner.",
+ "Text 'mommy what's above my head' appears in pink at the bottom, accompanied by a young girl's voice asking the question.",
+ "Text 'that's your Christmas wish honey. you can wish for whatever you like' appears in yellow, and a woman responds warmly.",
+ "Text 'I wanna wish for a boy' appears in pink, accompanied by a young girl's voice, then she covers her mouth with a cry of surprise.",
+ "Text 'you are not getting a boyfriend' appears, and the woman's voice gently but firmly denies the request.",
+ "Text 'you're way too young honey' appears in yellow, a woman's reassuring tone.",
+ "Text 'but sissy wish for a boyfriend' appears in pink, voiced by the young girl.",
+ "Denial in red text 'no I didn't' appears, suggesting a sibling disagreement.",
+ "Text 'Lacey I told you not to do that' appears in yellow, reprimanding.",
+ "Text 'you little tattletale' appears in red, followed by an accusatory 'you got me in trouble'.",
+ "Text 'I'm sorry' appears in pink for the apology.",
+ "Red text reads 'No just for that I'm not sharing the money Grandma gave us'.",
+ "Text 'haha you can't reach it' and 'oh poor baby' appears in teasing statements.",
+ "The girl complains with 'it's not fair' in pink and makes a wish with the text 'I wish that I was a big girl, older than Sissy so I could boss her around'.",
+ "A black screen appears with 'Wishes \ud83c\udf84\u2728: 0' at the top and 'GRANTED\u2705' in green in the center, signifying the wish's fulfillment.",
+ "The girl now has long, loose hair and wears a red dress with a necklace.",
+ "The dialogue resumes with 'wow that felt like a long nap' in pink, ending with the question 'why does my voice sound different' in the same pink text.",
+ "The girl examines herself, with 'what the' in pink for a gasp.",
+ "Text 'I have pierced ears' and 'I have a big girl body' appears in pink reflecting realization.",
+ "Dialogue includes 'well good morning to you' and 'you're looking nice' in blue, expressing a greeting and a compliment from a man.",
+ "Blue text 'come here *kisses her hand*' denotes an affectionate gesture.",
+ "Pink text 'ew gross' expresses disgust.",
+ "Blue text questions 'hey what's the matter', met by the reaction 'I think my Christmas wish came true' in pink.",
+ "Request 'give me a kiss\u2019 in blue is met urgently with 'come on I gotta go'.",
+ "Concluding frame shows 'who are you and where are my mom and sister' in pink, indicating confusion and concern from the now transformed young girl."
+ ]
+ },
+ {
+ "video": "video/199.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "A man wearing a white shirt and several necklaces speaks against a dark background. The text 'www.OutstandingScreenplays.com' is visible at the bottom of the frame. This man mentions,'`YOU GET SUMMONED BY THE WILD AND WONDERFUL QT.WE GO OVER TO HIS HOUSE BECAUSE THE LAST SCRIPT GOT LEAKED.SO HE WASN'T HAVING ANY OF THAT.`' The scene is black and white and the man is saying with his hand gesturing.\n\nThen, the scene changes to the man now wearing a hat and different outfit says,'' SIT DOWN AT HIS TABLE,HE USUALLY MAKES YOU A TEA AND HE`'LL TELL YOU ABOUT THE STORY ABOUT AN HOUR,YOU MAY NOT KNOW THIS HE'S A BIT VERBAL.'' The scene then becomes wider,revealing the same man sitting next to another man on stage, indicating that after the discussion, then the man continues by,`AND THEN YOU READ THE SCRIPT ALONE.HE'`LL COME AND CHECK ON YOU ABOUT HALFWAY.`'And the man is spinning his wrist.\n\nReturning to the dark background, the man in the white shirt,continues saying, 'IT WAS PRETTY CRISP, PRETTY CLEAN SCRIPT.ONE SCRIPT AND READ IT AND WAS QUITE TAKEN WITH IT.`'I DON'T KNOW,SEVERAL WEEKS LATER AND THAT SAME SCRIPT WAS THERE,BUT IT WAS ALL DOG EARED AND STAINED.`'While the man is speaking with his hands gesturing,another woman's voice can be heard,seems like she is responding to the man.\n\nTransitioning to a scene where the man in the hat continues speaking - '`YOU KNOW, THAT WE'VE ALL KIND OF''While he is gesturing with his hand,another man sitting next to him says,'' EVERYONE'S COME IN.WELL, IT'`S INTERESTING BECAUSE HE WANTS TO HAVE THIS ONE ON ONE IMMEDIATE REACTION AND HE'`S CONSTANTLY CHECKING ON YOU.TO SEE IF YOU'RE LAUGHING AT THE RIGHT SPOT.''A close-up of the man's face is seen.\n\nhe scene moves to two people in a car, with '`CLIFF BOOTH WAS THE NAME OF THE STUNTMAN,`' and admiration for this character expressed 'I WAS REALLY DRAWN TO HIM.'And the scene transitions back to the original man in a black and white filter.\n\nFinally, the man in a 
white shirt expresses admiration for Quinton's writing, with 'I LOVE WHAT QUINTON HAD WRITTEN.IT WAS A GUY WHO WAS PRETTY AT PEACE WITH.HIS PLACE IN LIFE, LOOKED FOR THE BEST IN PEOPLE.' Moving to another speaker in a yellow shirt, the realistic outlook on people is shared, noted by 'BUT WAS NOT SURPRISED TO GET THE WORST IN PEOPLE.I COULD RELATE TO THAT.`'\n\nThroughout the video, the man speaking is highlighted by another bunch of people are standing in a garden.A man with sunglasses is shown,then a woman in a colorful top with her hands on her hips is shown.Then the scene shifts back to the man in white shirt."
+ }
+ ],
+ "events": [
+ "Man wearing a white shirt and several necklaces speaks against a dark background",
+ "Text 'www.OutstandingScreenplays.com' is visible at the bottom of the frame",
+ "Man says: 'YOU GET SUMMONED BY THE WILD AND WONDERFUL QT. WE GO OVER TO HIS HOUSE BECAUSE THE LAST SCRIPT GOT LEAKED. SO HE WASN'T HAVING ANY OF THAT.'",
+ "Scene is black and white",
+ "Man gestures with his hand while speaking",
+ "Scene changes to the man now wearing a hat and different outfit",
+ "Man says: 'SIT DOWN AT HIS TABLE, HE USUALLY MAKES YOU A TEA AND HE'LL TELL YOU ABOUT THE STORY ABOUT AN HOUR, YOU MAY NOT KNOW THIS HE'S A BIT VERBAL.'",
+ "Scene becomes wider, revealing the man next to another man on stage",
+ "Man continues: 'AND THEN YOU READ THE SCRIPT ALONE. HE'LL COME AND CHECK ON YOU ABOUT HALFWAY.'",
+ "Man spins his wrist",
+ "Returns to the dark background, man in the white shirt speaking again",
+ "Man says: 'IT WAS PRETTY CRISP, PRETTY CLEAN SCRIPT. ONE SCRIPT AND READ IT AND WAS QUITE TAKEN WITH IT. I DON'T KNOW, SEVERAL WEEKS LATER AND THAT SAME SCRIPT WAS THERE, BUT IT WAS ALL DOG EARED AND STAINED.'",
+ "Another woman's voice can be heard responding to the man",
+ "Scene transitions to the man in the hat speaking again",
+ "Man says: 'YOU KNOW, THAT WE'VE ALL KIND OF'",
+ "Man gestures with his hand",
+ "Another man sitting next to him says: 'EVERYONE'S COME IN. WELL, IT'S INTERESTING BECAUSE HE WANTS TO HAVE THIS ONE ON ONE IMMEDIATE REACTION AND HE'S CONSTANTLY CHECKING ON YOU. TO SEE IF YOU'RE LAUGHING AT THE RIGHT SPOT.'",
+ "Close-up of the man's face",
+ "Scene moves to two people in a car",
+ "Text 'CLIFF BOOTH WAS THE NAME OF THE STUNTMAN'",
+ "Admiration for the character expressed: 'I WAS REALLY DRAWN TO HIM.'",
+ "Scene transitions back to the original man in a black and white filter",
+ "Man in a white shirt expresses admiration for Quinton's writing",
+ "Man says: 'I LOVE WHAT QUINTON HAD WRITTEN. IT WAS A GUY WHO WAS PRETTY AT PEACE WITH. HIS PLACE IN LIFE, LOOKED FOR THE BEST IN PEOPLE.'",
+ "Scene moves to another speaker in a yellow shirt",
+ "Speaker expresses a realistic outlook on people, saying: 'BUT WAS NOT SURPRISED TO GET THE WORST IN PEOPLE. I COULD RELATE TO THAT.'",
+ "Throughout the video, the man speaking is highlighted",
+ "Bunch of people in a garden",
+ "Man with sunglasses shown",
+ "Woman in a colorful top with her hands on her hips shown",
+ "Scene shifts back to the man in white shirt"
+ ]
+ },
+ {
+ "video": "video/200.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a scene from a video game where a character wielding a scythe is standing near a blue turret under attack by red minions. The character engages the minions using various abilities, creating visual effects in blue and white flashes. Concurrently, in the audio, a man says, \"`If you think Diana can only dash twice with her E, then you are wrong! There's a trick that only Diana mains know about that lets you dash three times instead of two. Here's how it works. When Diana Qs, she applies Moonlight to every target hit.`\" The scene transitions as the character chases a red enemy through a grassy area, indicated by pulses of blue light. The enemy attempts to escape but is pursued relentlessly.\n\nThe character then reaches another spot and targets two red minions using an ability that creates a blue circular effect, dealing damage. The statistics 'Last Hits: 136', 'DPS: 136', and 'Total: 136' appear above the minions and update to 'Last Hits: 55', 'DPS: 231', and 'Total: 191' after the attack. Text on the screen reads 'Moonlight fell off' and 'Can't reset E twice'. The audio has a second male voice explaining: \"`Unfortunately, if you E a moonlit target to get your E reset, the moonlight falls off of every other target hit, so you can't reset your E multiple times unless you know this trick.`\". The clip ends with the character ready to attack the minions again.\n\nNext, the character wielding a scythe is targeting two red minions, with another blue circular targeting effect appearing. Damage statistics again update, showing detailed numbers as the character attacks the minions. In the audio, a man says, \"` If you're hitting multiple targets with your Q, then you should dash with E before your Q even lands. This will still give you an E reset, but the moonlight won't fall off the other targets. This means you can E reset again, letting you get three total dashes. 
`\" followed by alternating dialogue between the woman and a man explaining the mechanics of the ability reset and usage of dashes in the gameplay.\n\nIn a lush, grassy area, the character, surrounded by a purple aura, engages red minions, executing blue and purple flashes. As combat continues, a new enemy in red is targeted. Text on the screen states 'Counter: 1' and then updates to 'Counter: 2' with more flashes of red and blue. As the character chases the fleeing enemy character, the audio reveals a woman saying, \"`This is useful for jungle and mid because no one expects you to have 3 dashes. Example, as you gank, you can E reset on the minion wave and then dash again to your opponent. This might make them think you have no more dashes left and bait them into wasting their mobilities like flash.`\"\n\nThe character, still wielding the scythe, next appears near a red turret, engaging an enemy. A blue circular effect deals damage, and 'First Blood!' text appears, indicating a first kill in the game. As the character moves away from the turret and meets another group of red minions, the 'Counter: 3' text displays and then changes to 'Counter: 2' amidst blue and purple flashes. The audio overlaps with a male voice explaining about using mobility like flash, and a female voice speaks about securing the kill for mid laners.\n\nFinally, the scene shows the scythe wielder, still surrounded by a purple aura, amidst a cluster of red minions and an enemy. Blue, white, red, and purple visual effects highlight the combat as text updates from 'Counter: 1' to 'Counter: 4'. The enemy attempts to escape repeatedly but is pursued persistently by the character. The audio consists of a man explaining tactics, saying, \"`You can bait your opponent into thinking you're out of mobility after you use two dashes. 
You run away, bait them to chase, and completely catch them off guard with your third dash to score a kill.`\" The clip ends with a promotional message offering more guides, \"`Check out Skillcapped for more tips like these,`\" accompanied by music.\n\nThus, the video showcases various strategic tips and mechanics for mastering complex abilities and movements in the game, particularly focusing on the character Diana and her unique dashing abilities."
+ }
+ ],
+ "events": [
+ "Begin with scene from a video game.",
+ "Character wielding a scythe stands near a blue turret under attack by red minions.",
+ "Character engages minions using various abilities, creating blue and white flashes.",
+ "Audio: Man explains about Diana's ability to dash three times instead of two using Moonlight applied with her Q.",
+ "Scene transitions to character chasing a red enemy through a grassy area with pulses of blue light.",
+ "Red enemy attempts to escape but is pursued.",
+ "Character reaches another spot, targets two red minions with a blue circular effect.",
+ "Damage statistics appear and update from 'Last Hits: 136', 'DPS: 136', and 'Total: 136' to 'Last Hits: 55', 'DPS: 231', and 'Total: 191'.",
+ "Text 'Moonlight fell off' and 'Can't reset E twice' appears on screen.",
+ "Audio: Second male voice explains the mechanics of E resetting with Moonlight.",
+ "Clip ends with character preparing to attack minions again.",
+ "Character targets two red minions, blue circular targeting effect appears.",
+ "Damage statistics update with detailed numbers as character attacks minions.",
+ "Audio: Man explains about dashing with E before Q lands for multiple E resets.",
+ "Dialogue alternates between woman and man explaining ability reset mechanics and usage of dashes.",
+ "In lush, grassy area, character engages red minions with blue and purple flashes.",
+ "New enemy in red is targeted.",
+ "Text 'Counter: 1' updates to 'Counter: 2' with flashes of red and blue.",
+ "Character chases fleeing enemy.",
+ "Audio: Woman explains usefulness of three dashes in jungle and mid, mentions baiting opponent's mobility.",
+ "Character appears near a red turret, engages an enemy.",
+ "Blue circular effect deals damage, 'First Blood!' text appears indicating first kill.",
+ "Character moves away from turret, meets another group of red minions.",
+ "Text 'Counter: 3' updates to 'Counter: 2' amidst blue and purple flashes.",
+ "Audio overlaps with male voice about using mobility and female voice about securing kills for mid laners.",
+ "Character with scythe amidst red minions and enemy surrounded by purple aura.",
+ "Blue, white, red, and purple visual effects highlight combat.",
+ "Text updates from 'Counter: 1' to 'Counter: 4'.",
+ "Enemy attempts to escape repeatedly but is pursued persistently.",
+ "Audio: Man explains tactic of baiting opponent with mobility.",
+ "Clip ends with promotional message: 'Check out Skillcapped for more tips like these,' accompanied by music."
+ ]
+ },
+ {
+ "video": "video/201.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with a man in front of a black Mustang parked in a garage. On the screen, the text reads, \"Would my 1,200hp Mustang pass inspection?\" The man points at the car, showcasing its sleek design, prominent front grille, and headlights, while the audio reveals someone saying, \u201cWould my 1200 horsepower Mustang pass inspection? Let's talk about it. If you're a cop watching, click away.\" Then the video transitions to the man, and he says, \"The short answer is no.\" The camera switches to him pointing at the car, with the background sound saying, \"Why wouldn't it pass?\" The man is now next to the license plate and continues to speak to the camera, \"Because of this state right here, everything in this video is going to be in regard to the laws here in California.\" The man continues to gesture towards the car, highlighting its features. The clip ends with a close-up of the car\u2019s rear displaying a California license plate 'BADY500' and a Shelby Cobra emblem, accompanied by the sound of music starting to play and the man says, \"Aftermarket exhaust. Any exhaust that omits higher than 95 decibels is illegal in the state of California.\" The garage features various items, including an American flag and a banner. Then the car makes a loud noise and the man says something loudly which is unclear.\n\nThe camera then provides a close-up of the rear dual exhaust pipes. The same man gestures at the pipes before the view transitions to the engine with the hood open, showcasing shiny metallic pipes and various colored parts. The audio continues, \u201cThe car is also equipped with long tube headers. The removal of catalytic converters is illegal. \u201d\n\nThe camera shifts to a close-up of the underside of the Mustang, concentrating on the exhaust system and mechanical components. The man holds a similar mechanical part from the previous clip while the background reveals the Mustang\u2019s rear and license plate. 
He points towards the rear tire, highlighting its size and tread. The camera then shows the side of the Mustang\u2019s sleek exterior. The audio informs, \u201cI also have exhaust cutouts, which essentially opens up my exhaust right after my long tube headers. That's a big no-no. I haven't done my wheel review yet, but drag racing tires? Illegal. \u201d The clip ends with the man continuing to gesture and speak with the Mustang prominently displayed.\n\nFocusing on the Mustang\u2019s side, the camera showcases the shiny exterior reflecting garage items, including a banner with the text 'XXI'. The scene shifts to the trunk where the man points to red and blue wires and a red mechanical part. Details of the engine are observed as the camera transitions to the front, revealing shiny metallic pipes and colored parts. The audio describes, \u201cTint is pretty self-explanatory. You can't tint your windows. In my trunk, I have the dual booster pump, which forces more fuel into the engine. California doesn't like that.\u201d The man continues gesturing and speaking, with music playing in the background.\n\nConcluding, the video shows a close-up of the engine, featuring shiny metallic pipes, blue and black parts, and a red cap. The man points out various engine components and moments later, the camera offers a wider view of the engine bay. The man continues to gesture and speak about the engine parts. The audio summarizes, \u201cTo pair it, I also have 1,700 injectors, which, again, adds so much fuel. That's a no-no. Also includes a computer tune which enables this engine to push out 1200 horsepower. And I think that is all.\u201d The final view is of the man standing in front of the open hood Mustang, gesturing towards the engine with the car prominently displayed in the garage. Music continues to play softly in the background."
+ }
+ ],
+ "events": [
+ "Man standing in front of black Mustang parked in a garage.",
+ "Text on the screen reads, 'Would my 1,200hp Mustang pass inspection?'",
+ "Man points at the car, showcasing its design, front grille, and headlights.",
+ "Audio says, 'Would my 1200 horsepower Mustang pass inspection? Let's talk about it. If you're a cop watching, click away.'",
+ "Video transitions to the man; he says, 'The short answer is no.'",
+ "Camera switches to the man pointing at the car.",
+ "Background sound says, 'Why wouldn't it pass?'",
+ "Man stands next to the license plate, continues speaking to the camera, 'Because of this state right here, everything in this video is going to be in regard to the laws here in California.'",
+ "Man gestures towards the car, highlighting its features.",
+ "The video ends with a close-up of the car\u2019s rear displaying a California license plate 'BADY500' and a Shelby Cobra emblem.",
+ "Music starts to play.",
+ "Man says, 'Aftermarket exhaust. Any exhaust that emits higher than 95 decibels is illegal in the state of California.'",
+ "Garage features various items, including an American flag and a banner.",
+ "Car makes a loud noise.",
+ "Man says something loudly which is unclear.",
+ "Camera provides a close-up of the rear dual exhaust pipes.",
+ "Man gestures at the exhaust pipes.",
+ "View transitions to the engine with the hood open, showing shiny metallic pipes and various colored parts.",
+ "Audio says, 'The car is also equipped with long tube headers. The removal of catalytic converters is illegal.'",
+ "Camera shifts to a close-up of the underside of the Mustang, focusing on the exhaust system and mechanical components.",
+ "Man holds a similar mechanical part from the previous clip.",
+ "Background shows the Mustang\u2019s rear and license plate.",
+ "Man points to the rear tire, highlighting its size and tread.",
+ "Camera shows the side of the Mustang\u2019s sleek exterior.",
+ "Audio says, 'I also have exhaust cutouts, which essentially opens up my exhaust right after my long tube headers. That's a big no-no. I haven't done my wheel review yet, but drag racing tires? Illegal.'",
+ "Man continues to gesture and speak with the Mustang prominently displayed.",
+ "Camera focuses on the Mustang\u2019s side, showcasing the shiny exterior reflecting garage items including a banner with the text 'XXI'.",
+ "Scene shifts to the trunk, where the man points to red and blue wires and a red mechanical part.",
+ "Details of the engine are observed as the camera transitions to the front, revealing shiny metallic pipes and colored parts.",
+ "Audio says, 'Tint is pretty self-explanatory. You can't tint your windows. In my trunk, I have the dual booster pump, which forces more fuel into the engine. California doesn't like that.'",
+ "Man continues gesturing and speaking with music playing in the background.",
+ "Video shows a close-up of the engine, featuring shiny metallic pipes, blue and black parts, and a red cap.",
+ "Man points out various engine components.",
+ "Camera offers a wider view of the engine bay.",
+ "Man continues to gesture and speak about the engine parts.",
+ "Audio says, 'To pair it, I also have 1,700 injectors, which, again, adds so much fuel. That's a no-no. Also includes a computer tune which enables this engine to push out 1200 horsepower. And I think that is all.'",
+ "Final view of the man standing in front of the open hood Mustang, gesturing towards the engine.",
+ "Car prominently displayed in the garage.",
+ "Music continues to play softly in the background."
+ ]
+ },
+ {
+ "video": "video/202.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a close-up of a wooden beam joint, showcasing the intricate texture and grain, and establishing its construction context. As the view zooms out, a partially constructed wooden structure, potentially a small house or shed, comes into sight. It sits in a lush, green outdoor setting. A hand holding a pencil appears, marking measurements on the wooden beam with a blue triangular carpenter's square for precision. In the background, birds chirp energetically, and a man explains, \"So this is how you make a bird's mouth notch without measuring really quick with a speed square. Come up here, place it against your plate, make your mark.\"\n\nThe scene continues as the man keeps marking the wood. Hands holding the square and pencil precisely mark the beams. The backdrop of the green foliage remains constant, while the audio shifts slightly to include insects chirping, and the man's instructions resonate: \"You come on the top, place a speed square up there, make another mark. So you make your cut along the marks you made. Let me see if I can give you one more view.\" A close-up shows a beam with a notch lying above the grass, emphasizing the ongoing, meticulous craftsmanship.\n\nThe focus then shifts to intersecting wooden beams, highlighting the joint. A man's hand, still using the blue triangular carpenter's square, adjusts the beams, ensuring proper alignment. Detailed textures and cross marks indicate where cuts and adjustments will be made. The outdoor ambiance is enhanced by the distant hum of a vehicle. The instructional voice continues, \"So you got your top plate here, rafter here, place your speed square like that, make your mark.\"\n\nThe intricate process of alignment and marking is captured with intersecting wooden beams supported by brick. Foliage in the background maintains the video's outdoor theme. 
The repetitive precision of markings and adjustments is highlighted as the blue triangular carpenter's square is removed, leaving marks ready for cutting. The calming yet instructional tone repeats, \"Then you go ahead and place it like this, make your second mark and then you just cut that out and that's your bird's mouth.\"\n\nThe detailed close-up of a wooden beam joint reappears, now part of the expanding structure with nails securing the beams together. The overall craftsmanship is accentuated against the backdrop of green foliage. A thematic graphic overlay featuring a stylized deer antler and the text \"Mountain's BREATH\" ('Mountain's Breath') appears, giving a name or brand to the construction effort. Crickets chirping provide a tranquil background, and the man's concluding words, \"And you get a good fit every time,\" signify the precision and skill involved in the project.\n\nThis detailed journey through the construction process emphasizes the meticulous craftsmanship, set to a backdrop of natural, outdoor sounds and the instructional voice guiding each step."
+ }
+ ],
+ "events": [
+ "Close-up of a wooden beam joint showcasing intricate texture and grain",
+ "View zooms out to reveal a partially constructed wooden structure in a lush, green outdoor setting",
+ "Hand holding a pencil marks measurements on the wooden beam using a blue triangular carpenter's square",
+ "Background audio includes birds chirping energetically",
+ "Man explains making a bird's mouth notch quickly without measuring using a speed square",
+ "The man continues marking the wood with precision",
+ "Background includes insects chirping",
+ "Man instructs to place the speed square on top and make another mark",
+ "Close-up of a beam with a notch lying above the grass",
+ "Focus shifts to intersecting wooden beams, highlighting the joint",
+ "Man adjusts the beams using the blue triangular carpenter's square for proper alignment",
+ "Detailed textures and cross marks shown indicating where cuts and adjustments will be made",
+ "Background audio includes the distant hum of a vehicle",
+ "Man continues instructional voice on making marks with the speed square",
+ "Scene captures intersecting wooden beams supported by brick, highlighting the precision of markings and adjustments",
+ "Blue triangular carpenter's square removed, leaving marks ready for cutting",
+ "Man repeats instructional tone on how to make second mark and cut for bird's mouth",
+ "Close-up of a wooden beam joint reappears, now part of the expanding structure with nails securing the beams",
+ "Backdrop of green foliage enhances the craftsmanship display",
+ "Thematic graphic overlay featuring a stylized deer antler and text 'Mountain's BREATH' appears",
+ "Background includes crickets chirping",
+ "Man concludes with, 'And you get a good fit every time,' signifying precision and skill",
+ "Overall focus on the meticulous craftsmanship with natural, outdoor sounds and instructional guidance"
+ ]
+ },
+ {
+ "video": "video/203.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video showcases a variety of flashlights and pens, neatly displayed on a teal surface. The flashlights come in different colors and designs, such as black, green, red, and metallic shades, and are complemented by colorful pens, some with metallic finishes. A hand picks up a black flashlight, labeled 'Olight Seeker 3 Pro,' and is then shown in close-up, with its features highlighted, including a blue button and a textured grip. The thumb spins the wheel. When the flashlight is turned on, it emits a bright light. The scene transitions to three small keychain flashlights in a hand, one designed with the American flag. The functionality of one of the keychain flashlights is demonstrated by twisting its head to turn it on, emphasizing its compact size and design.\n\nSimultaneously, the audio reveals a man explaining his top five picks among Olight flashlights: \"`I have used and tested a lot of Olights but these are my top five picks. Number five is the i15. This keychain flashlight is easy to use and easy to give as gifts. USB Type-C charging, plenty bright, and under $20.`\" This narration is followed by the sound of the flashlight being turned on.\n\nThe video then features details of the small green keychain flashlight, such as a USB symbol and the word 'Olight', as it is held against the teal background. When the head of the flashlight is twisted, it emits different levels of light. Another scene shows a metallic red container being opened to reveal a sleek red flashlight, underscoring its portability.\n\nThe audio continues with the man stating: \"`Number four is going to be the ultra-popular Baton 3, for very good reason: 1,200 lumens max output. This is definitely a flashlight you can carry that does it all. 
The i5R comes in at number three because it can use both a standard double A as well as the included lithium battery. Ultra convenient and definitely easy to carry as well.`\" The sound of a cassette tape being inserted into a player is heard next.\n\nThe video continues with a close-up of a red flashlight, with a textured grip and gold clip, further rotated to exhibit the text 'Baton' and 'Olight' engraved on its body. The hand turns it on and shines the light on the other hand. The scene transitions to a hand holding gold, silver, and green flashlights, displaying their size and design. The hand puts the gold and silver flashlights down and shows the green one to the camera.\n\nNext, the video highlights a green flashlight with the text 'I5R EOS' and '5V 9273', rotating it to show its features. The hand is shown holding the green flashlight, turning it on, and demonstrating its features. The video proceeds to the same green flashlight with a textured grip and a silver button, which emits a bright white light when turned on.\n\nThe scene proceeds to showcase a hand holding multiple small flashlights in various colors, including blue, black, orange, red, copper, and silver, emphasizing their compactness. The red flashlight's body, with 'I5R EOS' engraved, is then presented, and its features are displayed by rotating it. The other flashlights are put down.\n\nThe audio narrates: \"`Runner-up is the Warrior Mini 2, my most carried flashlight. It manages to both have 1,750 lumens max output and still be EDC friendly.`\" followed by the sound of a coin dropping on a surface.\n\nFinally, the video focuses on a red flashlight with a textured grip and gold clip against the teal background, showing the text 'Olight' engraved. The flashlight is turned on, emitting a bright white light. The same hand then showcases six small flashlights in a variety of colors, neatly arranging them to highlight their designs. 
The hand holding all six flashlights is shown at the end, emphasizing their varying colors and features.\n\nThe audio concludes with the emphasis on the top pick: \"`At less than $20, super bright, easy to carry, and easy to use, the i3T comes in at number one. This is definitely the flashlight I would recommend to damn near anybody, and it comes with a billion different materials, so pick one and go.`\" The audio wraps up with the sound of a flashlight being picked up."
+ }
+ ],
+ "events": [
+ "flashlights and pens displayed on teal surface",
+ "flashlights in black, green, red, and metallic shades",
+ "colorful pens with metallic finishes",
+ "hand picks up black flashlight labeled 'Olight Seeker 3 Pro'",
+ "close-up of black flashlight with blue button and textured grip",
+ "thumb spins the wheel on the black flashlight",
+ "black flashlight turned on emits bright light",
+ "scene transitions to three small keychain flashlights",
+ "one keychain flashlight has American flag design",
+ "demonstration of keychain flashlight functionality by twisting its head",
+ "man explains his top five picks among Olight flashlights in the audio",
+ "i15 keychain flashlight described by the man in the audio",
+ "sound of flashlight being turned on",
+ "small green keychain flashlight details shown",
+ "USB symbol and word 'Olight' on green keychain flashlight",
+ "head of green keychain flashlight twisted emits different levels of light",
+ "metallic red container opened to reveal a red flashlight",
+ "man continues with `Number four is going to be the ultra-popular Baton 3` in the audio",
+ "sound of cassette tape being inserted into player",
+ "close-up of red flashlight with textured grip and gold clip",
+ "text 'Baton' and 'Olight' engraved on red flashlight body",
+ "hand turns red flashlight on and shines light on other hand",
+ "hand holding gold, silver, and green flashlights shown",
+ "gold and silver flashlights put down, green one shown to camera",
+ "video highlights green flashlight with text 'I5R EOS' and '5V 9273'",
+ "green flashlight features demonstrated",
+ "green flashlight with textured grip and silver button emits bright white light",
+ "hand holding multiple small flashlights in various colors",
+ "compact small flashlights in blue, black, orange, red, copper, and silver colors",
+ "red flashlight's body with 'I5R EOS' engraved presented",
+ "runner up warrior mini 2 mentioned by the man in the audio",
+ "sound of a coin dropping on a surface",
+ "focus on red flashlight with textured grip and gold clip against teal background",
+ "text 'Olight' engraved on red flashlight",
+ "red flashlight turned on emits bright white light",
+ "hand showcases six small flashlights in various colors",
+ "hand holding all six flashlights shown emphasizing their colors and features",
+ "top pick i3T described by the man in the audio",
+ "sound of a flashlight being picked up"
+ ]
+ },
+ {
+ "video": "video/204.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts with an intense MMA fight sequence. Two fighters are on the mat, grappling for dominance. The fighter wearing red gloves is positioned on top, while the fighter with blue gloves attempts to secure a submission hold. The text \"Do buggy chokes work in MMA?\" appears at the top of the frame. From the audio, a commentator explains, \"He's looking for a buggy choke. This is a choke from...\" as the crowd cheers. The action intensifies as the fighter in blue gloves wraps his legs around his opponent's torso, targeting a chokehold. The fighter in red gloves remarkably lifts his opponent off the ground while still engaged in the hold. A voice on the loudspeaker continues, \"There was pressure on that choke,\" as the crowd's excitement grows palpable. \n\nAs the battle continues, the fighter in red gloves struggles to break free from the unrelenting hold. The tension rises on screen as the red-gloved fighter slams his opponent back onto the mat. Despite the impact, the fighter in blue gloves maintains his grip. Complementing the visuals, the commentator admiringly notes, \"Beautiful positioning right there,\" as the arena rings with cheers and microphone announcements. \n\nThe escalation persists in the third sequence. The fighter in red gloves begins to hit the fighter in blue gloves, who is lying on the ground. The fighter in red gloves, still on top, repeatedly tries to punch his way out of the chokehold. The fighter in blue gloves tries to protect his head with his hands. Their positions change slightly, but the fighter in red gloves still has the overwhelming upper hand. He keeps hitting the fighter in blue gloves, who has no power to fight back and is always on the defensive. The referee closely monitors the fighters. Then the referee stops the fighter in red gloves, the match is suspended, and the fighter in red gloves stands up. 
As the scene of the fighter in red gloves lifting his opponent up and throwing him to the ground is replayed, audio commentary resumes, \"This being picked up and dropped down, that was a huge snap.\"\n\nAs the footage is replayed in slow motion, we can see more clearly how the fighter in the red gloves breaks free from the buggy choke. The audio description captures the critical moment with, \"Look at how the legs come out. The arm lets go of it. He's in trouble right there. That hurt.\" amidst the raucous crowd noise and live commentary that heightens the drama of the relentless MMA encounter.\n\nAll throughout, the thematic focus on the efficacy of the buggy choke in MMA is consistently emphasized, both visually and auditorily, delivering a coherent and intense narrative of the fight."
+ }
+ ],
+ "events": [
+ "Intense MMA fight sequence starts",
+ "Two fighters grapple on the mat",
+ "Fighter with red gloves is on top",
+ "Fighter with blue gloves attempts a submission hold",
+ "Text \"Do buggy chokes work in MMA?\" appears",
+ "Commentator explains fighter in blue gloves is looking for a buggy choke",
+ "Crowd cheers",
+ "Action intensifies as blue gloves fighter wraps legs around red gloves opponent's torso",
+ "Blue gloves fighter targets a chokehold",
+ "Red gloves fighter lifts opponent off the ground while in the hold",
+ "Commentator says, \"There was pressure on that choke,\" as crowd excitement grows",
+ "Red gloves fighter struggles to break free from buggy choke",
+ "Red gloves fighter slams blue gloves opponent back on the mat",
+ "Blue gloves opponent maintains grip despite impact",
+ "Commentator notes, \"Beautiful positioning right there,\" as arena rings with cheers and announcements",
+ "Fighter in red gloves starts hitting blue gloves fighter on the ground",
+ "Red gloves fighter repeatedly punches to escape chokehold",
+ "Blue gloves fighter protects head with hands",
+ "Positions change slightly with red gloves fighter maintaining upper hand",
+ "Red gloves fighter continues to hit defenseless blue gloves fighter",
+ "Referee closely monitors fighters",
+ "Referee stops red gloves fighter and suspends match",
+ "Red gloves fighter stands up",
+ "Red gloves fighter lifts and throws blue gloves opponent on the ground again in replay",
+ "Commentator says, \"This being picked up and dropped down that was a Huge snap,\" during replay",
+ "Footage replayed in slow motion",
+ "Clear view of red gloves fighter breaking free from buggy choke in slow motion",
+ "Audio description captures critical moment, \"Look at how the legs come out. The arm lets go of it. He's in trouble right there. That hurt.\"",
+ "Crowd noise and live commentary heighten drama",
+ "Thematic focus on efficacy of buggy choke in MMA emphasized throughout video"
+ ]
+ },
+ {
+ "video": "video/205.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up view of an orange boat bearing the name '\u00c1RS\u00c6LL,' alongside a blue, red and white cross symbol. The boat is turning around in the water, and waves splash against it, creating a dynamic scene. The audio reveals a man speaking over background music, saying, \"This boat cannot be capsized, so why don't they make all ships this way, so they never capsize and sink? It's not what you think. For the Coast Guard, which regularly has to rescue others or get stuck in storms, having boats that don't capsize gives them peace of mind.\" \n\nThe scene transitions to a wider view, showing a large, partially submerged shipwreck in the background with cranes and other equipment visible on the shore. Two individuals in orange suits and helmets labeled 'ISCG' navigate a smaller boat through rough seas while large waves crash around them. The audio continues, \"But how does the boat upright itself? There are two key components. First, these boats have a low center of gravity.\"\n\nAerial views show two boats navigating the rough waves, leaving trails of white foam in their wake. Meanwhile, the audio explains, \"And second, the cabin and hull are watertight. Being watertight means that even if the boat is flipped on its head, it would remain buoyant, and then the low center of gravity ensures the hull flips back upright.\"\n\nThe video transitions to an orange boat labeled 'NORS' being lifted out of the water by a crane, submerged partially, and turned around. Another boat, a red one, is also lifted out of the water in a calm, reflective setting, and turned around, indicating a more controlled environment. During this segment, the audio continues seamlessly, adding, \"But due to their sheer size, making all components watertight on a commercial or military ship is practically impossible.\"\n\nThe scene shifts to the interior of a boat, revealing an empty cabin with well-maintained seats and a control panel. 
Then it shifts to a red boat being lifted by a crane, turned around, and tilted as water pours out, with the cloudy sky and calm sea in the backdrop. The man speaks, explaining the mechanics behind the resilience of such boats.\n\nThe visual narrative then shows another gray boat navigating through rough seas, cutting through large waves crashing around it. The powerful visual highlights the role of both the boat's design and the skill of its operators in overcoming nature's challenges. The audio continues to add depth to the visual narrative, elaborating on the principles that keep these boats from capsizing.\n\nFinally, the video continues with an aerial view of the gray boat swiftly moving through rough seas, waves breaking around it, leaving white foam trails. The boat's crew, dressed in orange suits, are likely part of a rescue or Coast Guard team, reinforcing the theme of resilience and skill. The audio in a British accent wraps up with, \"So they just have to be careful.\" The closing scene encapsulates the persistent determination and technical prowess required to navigate such treacherous waters."
+ }
+ ],
+ "events": [
+ "Close-up view of an orange boat named '\u00c1RS\u00c6LL' with a blue, red and white cross symbol",
+ "Boat turning around in the water",
+ "Waves splash against the boat",
+ "Audio states: 'This boat cannot be capsized, so why don't they make all ships this way, so they never capsize and sink? It's not what you think. For the Coast Guard, which regularly has to rescue others or get stuck in storms, having boats that don't capsize gives them peace of mind.'",
+ "Transition to a wider view with a large, partially submerged shipwreck in the background",
+ "Cranes and other equipment visible on the shore",
+ "Two individuals in orange suits and helmets labeled 'ISCG' navigate a smaller boat through rough seas",
+ "Large waves crash around the smaller boat",
+ "Audio explains: 'But how does the boat upright itself? There are two key components. First, these boats have a low center of gravity.'",
+ "Aerial views of two boats navigating the rough waves, leaving trails of white foam",
+ "Audio continues: 'And second, the cabin and hull are watertight. Being watertight means that even if the boat is flipped on its head, it would remain buoyant, and then the low center of gravity ensures the hull flips back upright.'",
+ "Transition to an orange boat labeled 'NORS' being lifted out of the water by a crane, submerged partially, and turned around",
+ "Another boat, a red one, is lifted out of the water in a calm, reflective setting, and turned around",
+ "Audio explains: 'But due to their sheer size, making all components watertight on a commercial or military ship is practically impossible.'",
+ "Scene shifts to the interior of a boat, revealing an empty cabin with well-maintained seats and a control panel",
+ "Shift to a red boat being lifted by a crane, turned around, and tilted as water pours out",
+ "Cloudy sky and calm sea in the background",
+ "Explanation of the mechanics behind the resilience of such boats",
+ "Visual shows another gray boat navigating through rough seas, cutting through large waves",
+ "Audio elaborates on principles that keep boats from capsizing",
+ "Aerial view of the gray boat swiftly moving through rough seas, waves breaking around it, leaving white foam trails",
+ "Boat's crew dressed in orange suits, likely part of a rescue or Coast Guard team",
+ "Audio wraps up: 'So they just have to be careful.'",
+ "Closing scene encapsulates determination and technical prowess required to navigate treacherous waters"
+ ]
+ },
+ {
+ "video": "video/206.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a man in a workshop setting, wearing a white shirt and blue apron, and operating a piece of machinery. The background features a brown wall adorned with various tools and equipment. The man holds a tool and works on a surface. The text and audio display, \"We are often asked, 'What is the oldest woodcut that we have in our collection?'\" The text and audio change to, \"It's hard to tell because we don't have dates on all of them.\" The man continues to work with the tool before turning back to the machinery.\n\nThe video transitions, keeping the workshop setting, as the same man moves next to another piece of machinery while holding a tool. The man then relocates to another part of the workspace, and the text and audio display, \"And we don't know the true origins of some of these.\" The man picks up a small piece of paper and examines it closely as the text and audio update, \"But I have a woodcut here.\" The man shows the paper to the camera and places it on the work surface before picking up another tool from the table.\n\nMaintaining the coherent scenery of the workshop, the individual continues to examine the paper intensely. The text and audio display, \"That is the advertisement for a local photo and wood engraving business at 627 J Street.\" The man sets down the paper and picks up a large wooden frame. The man adjusts the frame meticulously, aligning it on the work surface. The man makes further precise adjustments to the frame and reaches for another tool from the table.\n\nIn the familiar workshop ambiance, the man is now operating a large piece of machinery, with a sign \"SACRA\" visible in the background. The text and audio display, \"in the mid-1880s. The business belonged to C.H. Holmes.\" The man continues handling the machinery, aligning pieces expertly.\n\nThe next phase keeps the environment consistent, with the man standing at the work surface. A visible sign reads \"SACRA\" in the background. 
The text and audio display, \"He was hired in 1886 to become the State Engraver.\" The man carefully handles a piece of paper, ensuring its proper alignment on the work surface. The text and audio change to, \"and he sold his business to Lewis Winter.\"\n\nIn the final scene, the workshop stays consistent, with the man holding a piece of paper containing a printed advertisement and showing it to the camera. The paper reads, \"C.H. Holmes, Engraver on Wood, 627 J St., Sac'to, Cal.\" The text and audio display, \"So we have his advertisement here from the Lewis Winter Collection.\" The man ensures the details are visible to the camera. The video closes with a close-up of the advertisement and a voice saying, \"Press the red button, Howard.\""
+ }
+ ],
+ "events": [
+ "Close-up of a man in a workshop setting, wearing a white shirt and blue apron, operating a piece of machinery.",
+ "Background features a brown wall adorned with various tools and equipment.",
+ "Man holds a tool and works on a surface.",
+ "Text and audio display, 'We are often asked: What is the oldest woodcut that we have in our collection?'",
+ "Text and audio change to, 'It's hard to tell because we don't have dates on all of them.'",
+ "Man continues to work with the tool before turning back to the machinery.",
+ "Video transitions, keeping the workshop setting.",
+ "Same man moves next to another piece of machinery while holding a tool.",
+ "Man relocates to another part of the workspace.",
+ "Text and audio display, 'And we don't know the true origins of some of these.'",
+ "Man picks up a small piece of paper and examines it closely.",
+ "Text and audio update, 'But I have a woodcut here.'",
+ "Man shows the paper to the camera and places it on the work surface.",
+ "Man picks up another tool from the table.",
+ "Individual continues to examine the paper intensely.",
+ "Text and audio display, 'That is the advertisement for a local photo and wood engraving business at 627 J Street.'",
+ "Man sets down the paper and picks up a large wooden frame.",
+ "Man adjusts the frame meticulously, aligning it on the work surface.",
+ "Man makes further precise adjustments to the frame.",
+ "Man reaches for another tool from the table.",
+ "Man is now operating a large piece of machinery.",
+ "Sign 'SACRA' is visible in the background.",
+ "Text and audio display, 'in the mid-1880s. The business belonged to C.H. Holmes.'",
+ "Man continues handling the machinery, aligning pieces expertly.",
+ "Man stands at the work surface.",
+ "Visible sign reads 'SACRA' in the background.",
+ "Text and audio display, 'He was hired in 1886 to become the State Engraver.'",
+ "Man carefully handles a piece of paper, ensuring its proper alignment on the work surface.",
+ "Text and audio change to, 'and he sold his business to Lewis Winter.'",
+ "Man is holding a piece of paper containing a printed advertisement and showing it to the camera.",
+ "Paper reads, 'C.H. Holmes, Engraver on Wood, 627 J St., Sac'to, Cal.'",
+ "Text and audio display, 'So we have his advertisement here from the Lewis Winter Collection.'",
+ "Man ensures the details are visible to the camera.",
+ "Video closes with a close-up of the advertisement and a voice saying, 'Press the red button, Howard.'"
+ ]
+ },
+ {
+ "video": "video/207.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a clear white surface flanked by two colorful blocks. Bold yellow text stating 'This' takes center stage. The scene transitions to hands skillfully manipulating a Rubik's cube, accompanied by the green text '$100' below a compact image of the cube. The hands deftly solve the cube, followed by yellow text 'limited edition...' as the cube appears completed. The scene then shifts to an unboxing of a sleek gray box revealing a green Rubik's cube inside. 'Unboxing' in yellow text appears as a gold pouch with 'gold' in yellow is also unveiled. The hands retrieve the cube and the pouch, showcasing additional items like stickers and a manual. The text 'cube' signifies the main focus.\n\nA man's voice states: \"This is a $100 speed cube that is limited edition. And here's the unboxing experience. It comes with the cube, the case, a soft gold bag, extra stickers, and a how-to-solve-the-cube guide,\" all while lively music plays in the background.\n\nThe video then displays hands holding a Rubik's cube with visible green, white, and orange faces. 'Called' appears in bold yellow text. 'Gan 13 Kunlun' identifies the cube in bold yellow. The hands reveal the cube's sparkling internal mechanism, showcased by the text 'at' in yellow. The glittery gold insides of the cube are revealed, noted by the yellow text 'glitter'. The sequence progresses to show the hands peeling off stickers, with 'sticker' and 'off' in yellow text, and transforming the cube into a gleaming gold block, marked by 'now' in yellow.\n\nThe same man's voice continues: \"This is called the Gan 13 Kunlun Maglev. Look at the details of this cube with the sparkly glitter and these bright thicker shades too. Makes me want to take it all off. Ah, now it looks much better.\"\n\nNext, the clip showcases the hands holding an entirely gold Rubik's cube against a background of colorful blocks and scattered stickers. The hands turn the cube, which is then disassembled to reveal glamorous gold-colored internal parts. The intricate internal mechanism is shown. The cube is reassembled and solved, culminating with the question '$100?' in yellow.\n\nWhile coins clink in the background, the voice states, \"We can actually enjoy the gold colorfully. The insides look glamorous that you can even see the magnets inside. So do I think this is worth $100? Well, let's quickly do a solve first and see.\"\n\nThe focus then shifts to the hands holding a Rubik's cube with green, white, and orange faces visible. A close-up of a smartphone timer set to 11 seconds appears. The hands scramble the cube. The hands keep solving, emphasizing the cube's qualities.\n\nThe audio describes: \"Let's try to beat this timer before it hits zero. Turning this cube is super clicky because of all the 88 magnets inside. It's super fast and stable.\"\n\nThe final segment features the hands deftly solving the Rubik's cube with a running timer showing 3 seconds. The timer counts down sequentially from 3 to 2 to 1 second, with the final text emphasizing the limited edition nature of the cube as the timer hits zero. A close-up of the cube's internal mechanism reinforces the special edition theme.\n\nThe man concludes: \"But with all this, I don't think this cube is worth $100 unless you really like limited edition gold things.\"\n\nOverall, the video thoughtfully explores the unique attributes and the perceived value of an exclusive $100 gold-plated, limited edition Rubik's cube, balancing visual details and explanatory audio commentary."
+ }
+ ],
+ "events": [
+ "Video opens with a white surface with two colorful blocks",
+ "Bold yellow text 'This' appears",
+ "Scene transitions to hands manipulating a Rubik's cube",
+ "Green text '$100' appears below the image of the Rubik's cube",
+ "Hands solve the Rubik's cube",
+ "Yellow text 'limited edition...' appears when the cube is solved",
+ "Scene shifts to unboxing a sleek gray box",
+ "Green Rubik's cube is revealed inside the box",
+ "Yellow text 'Unboxing' appears",
+ "Gold pouch with 'gold' in yellow text is unveiled",
+ "Hands retrieve the green cube and the gold pouch",
+ "Additional items like stickers and a manual are showcased",
+ "Yellow text 'cube' appears to emphasize the main focus",
+ "Man's voice describes the $100 limited edition speed cube and mentions the unboxing experience",
+ "Lively music plays in the background",
+ "Hands hold a Rubik's cube with green, white, and orange faces",
+ "Bold yellow text 'Called' appears",
+ "Bold yellow text 'Gan 13 Kunlun' identifies the cube",
+ "Hands reveal the cube's sparkling internal mechanism",
+ "Yellow text 'at' appears",
+ "Yellow text 'glitter' appears when the glittery gold insides are shown",
+ "Hands peel off stickers from the cube",
+ "Yellow texts 'sticker' and 'off' appear during the peeling process",
+ "Cube transforms into a gleaming gold block, noted by 'now' in yellow",
+ "Man's voice continues describing the cube, emphasizing sparkly glitter and thicker shades",
+ "Man expresses a desire to remove more of the stickers",
+ "Hands hold an entirely gold Rubik's cube against a background of colorful blocks and scattered stickers",
+ "Hands turn the gold cube",
+ "Cube is disassembled to reveal glamorous gold-colored internal parts",
+ "Intricate internal mechanism of the cube is shown",
+ "Cube is reassembled and solved",
+ "Yellow text '$100?' appears",
+ "Coins clink in the background",
+ "Voice states potential enjoyment of the colorful gold cube and mentions visible magnets",
+ "Voice questions whether the cube is worth $100 and suggests a quick solve to determine",
+ "Focus shifts to hands holding a Rubik's cube with green, white, and orange faces",
+ "Close-up of a smartphone timer set to 11 seconds appears",
+ "Hands scramble the Rubik's cube",
+ "Hands solve the cube, showing its qualities",
+ "Audio describes trying to beat the timer before it hits zero, mentions 88 magnets, clicky turn, fast and stable",
+ "Hands solve the Rubik's cube with a running timer showing 3 seconds",
+ "Timer counts down from 3 to 2 to 1 second",
+ "Final emphasis on the limited edition nature of the cube as timer hits zero",
+ "Close-up of the cube's internal mechanism reinforcing special edition theme",
+ "Man concludes with a statement expressing that the cube may not be worth $100 unless one likes limited edition gold items",
+ "Video explores the unique attributes and perceived value of a $100 gold-plated, limited edition Rubik's cube"
+ ]
+ },
+ {
+ "video": "video/208.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a woman standing in front of a green chalkboard, wearing a grey blazer over a patterned shirt with geometric designs in shades of blue and green, along with dark pants and small hoop earrings. The woman looks around as if addressing an audience or a classroom. She makes hand gestures to emphasize points, reinforcing the classroom setting. The audio reveals the woman speaking, accompanied by background music and occasional laughter from people present: \"So yes, you may be dumb today, but what you are tomorrow starts right here. In this classroom. Am I understood?\"\n\nThe scene transitions to 2 women sitting at desks. One is in a pink sweater and the other in a green jacket with a yellow shirt underneath, both with arms crossed. The classroom setting is enriched by various posters and a plant in the background. The audio includes the 2 women's voices: \"Yeah. Yes. Okay.\"\n\nThe video then features the woman at the chalkboard once more, still addressing the audience or classroom. The audio continues: \"Now, don't be embarrassed. How many of you can read?\" Laughter from the crowd follows. The scene switches back to the two students, one of whom (in the green jacket) is now making a hand gesture, likely speaking or asking a question. They answer, \"Um, all of us? Miss, this is an honors-level physics class.\"\n\nThe woman at the chalkboard reappears, continuing to address the crowd, before transitioning to another student in a blue hoodie who is gesturing with his hands, likely speaking or asking a question. The audio reflects a dialog of sorts, punctuated by laughter: \"This class is that? Yeah, every class at this school is honors level.\" The scene transitions to 2 women sitting at desks. The woman in pink says: \"This is a STEM school.\"\n\nThe woman at the chalkboard is shown again, looking a little awkward. The clip then cuts to the student in the pink sweater who is now gesturing with their hands. The dialogue shifts to: \"Okay. We all had to take a college-level test to get in here. Okay. Did the principal not tell you that?\" followed by crowd laughter and another man's voice. Returning to the teacher at the chalkboard, the scene closes with another look at the student in the blue hoodie, again gesturing.\n\nThe student in the green jacket with the yellow shirt makes a concluding appearance, still with arms crossed. The last audio clip ties the segments together: \"Nah. And wait, who was calling us dumb? Nobody.\" This exchange underlines the shared experience and sets the tone for the entire classroom scenario, marked by humor and engagement among both the instructor and students.\n\nThroughout the video, the repeated visual themes of the classroom environment and the central figure of the teacher at the chalkboard, paired with the collaborative and humorous interactions in the audio, create a cohesive narrative of a dynamic and lively educational setting."
+ }
+ ],
+ "events": [
+ "A woman stands in front of a green chalkboard.",
+ "The woman wears a grey blazer over a patterned shirt with geometric designs in shades of blue and green, along with dark pants and small hoop earrings.",
+ "The woman looks around as if addressing an audience or a classroom.",
+ "She makes hand gestures to emphasize points.",
+ "Background audio includes her speaking, background music, and occasional laughter.",
+ "Audio: 'So yes, you may be dumb today, but what you are tomorrow starts right here. In this classroom. Am I understood?'",
+ "Scene transitions to 2 women sitting at desks.",
+ "One woman wears a pink sweater, and the other wears a green jacket with a yellow shirt underneath. Both have their arms crossed.",
+ "The background includes various posters and a plant.",
+ "Audio includes the voices of the 2 women saying: 'Yeah. Yes. Okay.'",
+ "Scene features the woman at the chalkboard again, addressing the audience or classroom.",
+ "Audio: 'Now, don't be embarrassed. How many of you can read?'",
+ "Laughter from the crowd follows.",
+ "Scene switches back to the two students. The woman in the green jacket makes a hand gesture, likely speaking or asking a question.",
+ "Audio: 'Um, all of us? Miss, this is an honors-level physics class.'",
+ "Woman at the chalkboard reappears, continuing to address the crowd.",
+ "Transition to another student in a blue hoodie, gesturing with his hands, likely speaking or asking a question.",
+ "Audio: 'This class is that? Yeah, every class at this school is honors level.'",
+ "Scene transitions to the 2 women sitting at desks again.",
+ "The woman in pink says: 'This is a STEM school.'",
+ "Woman at the chalkboard looks a little awkward.",
+ "Clip cuts to the student in the pink sweater, gesturing with their hands.",
+ "Audio: 'Okay. We all had to take a college-level test to get in here. Okay. Did the principal not tell you that?' Crowds laugh and another man's voice is heard.",
+ "Returning to the teacher at the chalkboard.",
+ "Scene closes with another look at the student in the blue hoodie, again gesturing.",
+ "The student in the green jacket with the yellow shirt appears again, arms crossed.",
+ "Audio: 'Nah. And wait, who was calling us dumb? Nobody.'",
+ "The video includes repeated visual themes of the classroom environment and the central figure of the teacher at the chalkboard.",
+ "Collaborative and humorous interactions among the instructor and students are portrayed throughout the video."
+ ]
+ },
+ {
+ "video": "video/209.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens on a wooden surface with three water bottles: Evian, Fiji, and Dasani, arranged from left to right. The Evian has a light blue cap and a white label with red text, the Fiji sports a blue cap and a blue label with a pink flower, and the Dasani has a green cap with a dark blue label adorned by green leaves. A man's hands approach the bottles, picking up the Fiji bottle and bringing it closer to the camera to display its label that reads 'Fiji Natural Spring Water' and 'From the islands of Fiji.' Then, the man brings a clear glass bottle with a swing-top cap into view, and he drinks from it.\n\nAs this sequence unfolds, a man's voice is heard saying, \"Let's taste-test bottled water. First up, we got Fiji. And for reference, I'm gonna drink some tap water. Okay, that's tap water.\" This is followed by the sounds of a bottle being opened and liquid being poured into a glass.\n\nNext, a close-up shows the Fiji bottle being opened. The man unscrews the blue cap and then drinks from the bottle. The man, wearing a dark maroon shirt and seated at a wooden table, holds up the bottle, revealing the front label with its pink flower and the text 'Fiji Natural Spring Water.' The scene then switches to show the back label with text and a barcode, before the man takes another sip and puts the bottle down.\n\nDuring this segment, the voice narrates, \"Tastes like tap water. From the islands of Fiji, I'm trying to picture a nice waterfall. Nothing.\" Gentle background music plays as the man speaks.\n\nIn the following clip, a man holds an Evian water bottle, highlighting its light blue cap and white label with red text stating 'Evian' and 'Made by the French Alps.' The man rotates the bottle to show the back label before unscrewing the cap, taking a sip, and placing the bottle back on the table. The bottle is displayed upright once more, showing its front label.\n\nHere, the audio component features a man saying, \"Evian, made by the French Alps. Natural spring water. 7.2 pH. No difference, not impressed right now.\" with background music continuing to play.\n\nThe next scene transitions to the man holding the Evian water bottle and screwing the cap back on. The frame shifts to a close-up of the Dasani bottle with its green cap and dark blue label with green leaves. Holding the Dasani bottle horizontally, the man displays the front label reading 'Dasani' and 'Mineralized Water.' The clip ends with the Dasani bottle upright, showing the back label with text and a barcode.\n\nThe spoken description is accompanied by a man saying, \"Dasani. Remineralized water. Fluoride, zero. Reverse osmosis. It has water, mainly in salty, potassium, chloride, salt.\" with the additional sound of water splashing and gurgling.\n\nFollowing this, the Dasani bottle is seen being unscrewed and the man, wearing the same dark maroon shirt, takes a sip before placing it back on the table. The Dasani bottle is held upright to show its front label, after which the man brings the Fiji water bottle into view, holding both bottles side by side to compare their labels.\n\nThe narration continues with, \"What a waste of time. Surprisingly, they didn't taste any different at all from each other.\" as background music plays.\n\nFinally, the man sits at the wooden table with the Dasani and Fiji bottles in front. He holds a clear glass bottle with a swing-top cap in one hand and the Dasani bottle on the table beside him, inspecting and comparing the two. The final scene shows both bottles side by side.\n\nThe last few words are spoken by a man, saying, \"But yeah, it's the same thing as tap water,\" concluding the audio as another man's voice begins to speak.\n\nThe video concludes with the man holding the glass bottle, prominently showcasing both bottles."
+ }
+ ],
+ "events": [
+ "Video opens on a wooden surface with three water bottles: Evian, Fiji, and Dasani, arranged from left to right.",
+ "Evian has a light blue cap and a white label with red text.",
+ "Fiji has a blue cap and a blue label with a pink flower.",
+ "Dasani has a green cap with a dark blue label adorned by green leaves.",
+ "Man's hands approach the bottles and pick up the Fiji bottle.",
+ "Man brings the Fiji bottle closer to the camera to display its label.",
+ "Fiji bottle label reads 'Fiji Natural Spring Water' and 'From the islands of Fiji.'",
+ "Man brings a clear glass bottle with a swing-top cap into view and drinks from it.",
+ "Man's voice says, 'Let's taste-test bottled water. First up, we got Fiji. And for reference, I'm gonna drink some tap water. Okay, that's tap water.'",
+ "Sound of a bottle being opened and liquid being poured into a glass.",
+ "Close-up shows the Fiji bottle being opened.",
+ "Man unscrews the blue cap and drinks from the Fiji bottle.",
+ "Man wears a dark maroon shirt and is seated at a wooden table.",
+ "Man holds up the Fiji bottle, revealing the front label with a pink flower and text 'Fiji Natural Spring Water.'",
+ "Scene switches to show the back label of the Fiji bottle with text and a barcode.",
+ "Man takes another sip and puts the Fiji bottle down.",
+ "Voice narrates, 'Tastes like tap water. From the islands of Fiji, I'm trying to picture a nice waterfall. Nothing.'",
+ "Gentle background music plays.",
+ "Man holds an Evian water bottle, highlighting its light blue cap and white label with red text stating 'Evian' and 'Made by the French Alps.'",
+ "Man rotates the Evian bottle to show the back label.",
+ "Man unscrews the cap of the Evian bottle, takes a sip, and places it back on the table.",
+ "Evian bottle is displayed upright again, showing its front label.",
+ "Audio features a man saying, 'Evian, made by the French Alps. Natural spring water. 7.2 pH. No difference, not impressed right now.'",
+ "Background music continues to play.",
+ "Man screws the cap back on the Evian bottle.",
+ "Frame shifts to a close-up of the Dasani bottle with its green cap and dark blue label with green leaves.",
+ "Man holds the Dasani bottle horizontally, displaying the front label reading 'Dasani' and 'Mineralized Water.'",
+ "Clip ends with the Dasani bottle upright, showing the back label with text and a barcode.",
+ "Spoken description is accompanied by a man saying, 'Dasani. Remineralized water. Fluoride, zero. Reverse osmosis. It has water, mainly in salty, potassium, chloride, salt.'",
+ "Additional sound of water splashing and gurgling.",
+ "Dasani bottle is seen being unscrewed.",
+ "Man, wearing the same dark maroon shirt, takes a sip from the Dasani bottle.",
+ "Man places the Dasani bottle back on the table.",
+ "Dasani bottle is held upright to show its front label.",
+ "Man brings the Fiji water bottle into view, holding both bottles side by side to compare their labels.",
+ "Narration continues with, 'What a waste of time. Surprisingly, they didn't taste any different at all from each other.'",
+ "Background music plays.",
+ "Man sits at the wooden table with the Dasani and Fiji bottles in front.",
+ "Man holds a clear glass bottle with a swing-top cap in one hand and the Dasani bottle on the table beside him, inspecting and comparing the two.",
+ "Final scene shows both bottles side by side.",
+ "Audio concludes with a man saying, 'But yeah, it's the same thing as tap water.'",
+ "Another man's voice begins to speak.",
+ "Video concludes with the man holding the glass bottle, prominently showcasing both bottles."
+ ]
+ },
+ {
+ "video": "video/210.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video introduces a young woman in a gray hoodie with 'BUFFALO' and a buffalo image on it. Her wet hair suggests she has just showered. The setting features a ceiling fan with lights and household items like a beige bag and a pink item on a shelf. She holds up a box labeled 'MIGHTY PATCH' and appears to be showcasing it.\n\nHer voice narrates, \"So I have this giant zit and I want to wear makeup, but I don't want to aggravate it more, so I'm thinking I'm going to put a pimple patch on it and then put my makeup over the pimple patch, and hopefully it won't look weird.\"\n\nThe focus stays on her as she continues speaking to the camera, explaining, \"I have seen somebody else do this on TikTok and they looked fine, so fingers crossed. I already cleaned my face off, so it's all clean and ready for the patch.\" Soft music underlays her speech.\n\nShe then holds up three makeup products: a 'COVERGIRL TRUBLEND' bottle, a 'L'OREAL INFALLIBLE' tube, and another with a red cap. She keeps the narrative going with, \"You can't really see it, but she's on there. These are the foundations I'm going to be using. They do look really gross. Oh no.\" She speaks to herself as she applies the patch, emphasizing her words with hand gestures.\n\nTransitioning smoothly, the video captures her applying makeup with a pink sponge while she comments, \"That does not look good. Okay, the foundation is done. It's looking a little better than what it was. Concealer time.\"\n\nNext, she uses a makeup brush and showcases a gold bottle labeled 'MORPHE,' which she possibly uses as a setting spray. Her voiceover explains, \"Contour. Loose powder. And then finally setting spray. So this is without the beauty filter. And this is from like a greater distance.\" The sound of spraying accompanies her words.\n\nThe continuity of the scene remains as she addresses the camera, stating, \"And this is the close-up. Girly. Like, it's kind of noticeable, right? But if you're not looking for it, you literally cannot see it. That's crazy.\" Her final words reflect a mix of amazement and satisfaction with her makeup application process.\n\nThroughout the video, the synchronization of her explanations with visual demonstrations creates an engaging and informative tutorial on tackling a common beauty concern."
+ }
+ ],
+ "events": [
+ "Young woman in a gray hoodie with 'BUFFALO' and a buffalo image on it is introduced.",
+ "Her wet hair suggests she has just showered.",
+ "The setting features a ceiling fan with lights, a beige bag, and a pink item on a shelf.",
+ "She holds up a box labeled 'MIGHTY PATCH'.",
+ "She appears to be showcasing the 'MIGHTY PATCH' box.",
+ "Her voice narrates about having a giant zit and planning to use a pimple patch under makeup.",
+ "Narration describes her intention to avoid aggravating the zit by using a pimple patch.",
+ "She shares seeing someone do this on TikTok and how their makeup looked fine.",
+ "She mentions her face is already cleaned and ready for the patch.",
+ "Soft music is playing in the background.",
+ "She holds up three makeup products: a 'COVERGIRL TRUBLEND' bottle, a 'L'OREAL INFALLIBLE' tube, and another with a red cap.",
+ "She comments that the foundations look really gross.",
+ "She applies the pimple patch while narrating.",
+ "She uses hand gestures to emphasize her words.",
+ "She starts applying makeup with a pink sponge.",
+ "She comments that the initial makeup application does not look good.",
+ "She mentions that the foundation application is done and it looks a bit better.",
+ "She announces it is 'concealer time'.",
+ "She uses a makeup brush.",
+ "She showcases a gold bottle labeled 'MORPHE', possibly using it as a setting spray.",
+ "Her voiceover describes the steps: contour, loose powder, and setting spray.",
+ "Sound of spraying accompanies her words.",
+ "She shows the result without a beauty filter and from a greater distance.",
+ "She addresses the camera directly, showing a close-up.",
+ "She comments that the patch is noticeable up close but not from a distance.",
+ "She expresses amazement and satisfaction with the makeup application process.",
+ "The video synchronizes her explanations with visual demonstrations."
+ ]
+ },
+ {
+ "video": "video/211.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a vibrant scene on a grassy field under a clear blue sky, where several individuals in sports gear are present. A prominent yellow and black machine with rotating orange arms, labeled \"THE GAUNTLET,\" is showcased. The text \"This is The Gauntlet, a machine designed to test a player's balance and ball security.\" appears on the screen. The audio features a group of people shouting and cheering, accompanied by a man speaking over a loudspeaker: \"This is the gauntlet, a machine designed to test a player's balance and ball security.\" The scene shifts to a player in green and white sports gear running through the machine while holding a football. The audio of people cheering continues, underscored by the explanation \"Our social team thought, given I was Division II defensive back, I should give it a go.\" The final part of the clip displays a text \"SOCIAL MEDIA COORDINATOR\" above a person in similar sports gear walking on the field with \"Our social team thought, given I was a division two defensive back, I should give it a go.\" displayed below.\n\nTransitioning to the second clip, we see the same person in green and white sports gear holding a football on the grassy field. The audio captures wind blowing as a man says, \"Sheesh. Haven't been fitted for a helmet in ages.\" The scene then moves indoors where the man sits on a stool while another man measures his head with a tape measure. The text \"Sheesh.\" and \"Haven't been fitted for a helmet in ages.\" appears, followed by \"*It's been 3 years*\" on the screen. The clip concludes with the man putting on shoulder pads as the text \"For safety, our EQ team got me fitted and suited up.\" displays at the bottom, with the man audibly adding, \"For safety, our EQ team got me fitted and suited up.\"\n\nThen, the man, now fully suited up in the helmet and shoulder pads, holds a football on the grassy field. The text \"All ready to go.\" appears as he approaches \"THE GAUNTLET\" machine while holding the football. The audio, filled with the sounds of wind, a bicycle bell, and a man speaking, conveys a sense of movement and anticipation. The text \"Now, here's round one.\" is displayed as he attempts to run through the machine but is knocked down by the rotating orange arms. The text \"Go ahead. Get your laughs in.\" follows an emoji with wide eyes, illustrating the humor of the situation as the man lies on the ground recovering.\n\nIn the subsequent clip, the man, still fully suited up, sits on the grassy field, recovering from his fall. The audio resumes with wind blowing and a man's voice saying, \"Now here's round two.\" The man stands up and prepares for another attempt. The text \"Now, here's round two.\" overlays as he approaches the machine again. This time, he navigates through \"THE GAUNTLET\" with significantly improved balance and control. \"Showed a lot more balance.\" reads the text as he successfully passes through. The clip ends with the message \"Not too bad, overall.\" displayed on the screen, reflecting the success of the second round, while a man yells in the distance.\n\nThe final clip shows the man, still fully suited up, making another run through \"THE GAUNTLET\" on the grassy field. As he struggles and subsequently falls, the text \"Ehhh...Okay, maybe it was bad.\" appears, accompanied by a laughing emoji. The audio describes the scene with a man speaking and the wind blowing. The video ends with the text \"Who's next?\" as he begins to sit up, suggesting a continuation of the challenge and a playful camaraderie among participants.\n\nThis engaging narrative, showcasing the trials and tribulations of navigating \"THE GAUNTLET,\" is complemented by a coherent audio description that aligns with the visuals, ensuring a complete and immersive experience for the viewer."
+ }
+ ],
+ "events": [
+ "Video opens with a vibrant scene on a grassy field under a clear blue sky",
+ "Several individuals in sports gear are present",
+ "A prominent yellow and black machine with rotating orange arms, labeled 'THE GAUNTLET,' is showcased",
+ "Text 'This is The Gauntlet, a machine designed to test a player's balance and ball security.' appears on the screen",
+ "Audio features a group of people shouting and cheering",
+ "A man speaking over a loudspeaker: 'This is the gauntlet, a machine designed to test a player's balance and ball security.'",
+ "Scene shifts to a player in green and white sports gear running through the machine while holding a football",
+ "People cheering continues",
+ "Audio explains: 'Our social team thought, given I was Division II defensive back, I should give it a go.'",
+ "Text 'SOCIAL MEDIA COORDINATOR' appears above a person in similar sports gear walking",
+ "Text 'Our social team thought, given I was a division two defensive back, I should give it a go.' displayed below",
+ "Second clip starts with same person in green and white sports gear holding a football on the grassy field",
+ "Audio captures wind blowing",
+ "A man says, 'Sheesh. Haven't been fitted for a helmet in ages.'",
+ "Scene moves indoors where man sits on a stool while another man measures his head with a tape measure",
+ "Text 'Sheesh.' and 'Haven't been fitted for a helmet in ages.' appears, followed by '*It's been 3 years*' on the screen",
+ "Man puts on shoulder pads",
+ "Text 'For safety, our EQ team got me fitted and suited up.' displays at the bottom",
+ "Man says, 'For safety, our EQ team got me fitted and suited up.'",
+ "Man, now fully suited up in helmet and shoulder pads, holds football on the grassy field",
+ "Text 'All ready to go.' appears as he approaches 'THE GAUNTLET' machine while holding the football",
+ "Audio filled with sounds of wind, a bicycle bell, and a man speaking",
+ "Text 'Now, here's round one.' is displayed as he attempts to run through the machine but is knocked down by rotating orange arms",
+ "Text 'Go ahead. Get your laughs in.' follows with an emoji with wide eyes",
+ "Subsequent clip shows man, still fully suited up, sitting on grassy field, recovering from fall",
+ "Audio resumes with wind blowing and a man's voice saying, 'Now here's round two.'",
+ "Man stands up and prepares for another attempt",
+ "Text 'Now, here's round two.' overlays as he approaches the machine again",
+ "This time, he navigates through 'THE GAUNTLET' with significantly improved balance and control",
+ "Text 'Showed a lot more balance.' reads as he successfully passes through",
+ "Clip ends with message 'Not too bad, overall.' displayed on the screen",
+ "Audio of a man yelling in the distance",
+ "Final clip shows man, still fully suited up, making another run through 'THE GAUNTLET' on the grassy field",
+ "Man struggles and subsequently falls",
+ "Text 'Ehhh...Okay, maybe it was bad.' appears with a laughing emoji",
+ "Audio features a man speaking and the wind blowing",
+ "Video ends with text 'Who's next?' as he begins to sit up"
+ ]
+ },
+ {
+ "video": "video/212.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a small, energetic white dog walking on a tiled floor, viewed from above. The lively pup then moves from the kitchen, where papers and a remote control rest on the counter, into a room with a white floor. The dog eventually settles on a couch, attentively watching a television displaying a scene with a German Shepherd and a person behind bars, accompanied by the text '@goodboyhamlet' in the top right corner. A woman's voice narrates the dog's story, saying, \u201cHe was six weeks old and very clingy. He always had to be touching you. On his second day home, I thought it'd be nice to cuddle up on the couch with him and put a movie on.\u201d\n\nTransitioning to the next scene, the white dog is still on the couch, watching the TV as scenes change on the screen. The text '@goodboyhamlet' remains visible. The German Shepherd barks loudly beside a woman while the white dog watches the movie attentively. The scene on the TV then transitions to a brown dog with a red collar beside a man. The audio states, \"Hamlet pops up and he's like, oh my gosh.\"\n\nThe white dog on the couch eventually moves to a different room with a wooden floor and a toy-filled box, nose-diving into it. The audio features a woman mentioning, \"As he got older, he developed this bedtime routine around toys.\" The ambiance includes the continuation of the woman\u2019s narration. In the following scene, the white dog stands near the toy box, appearing curious as it looks up at the camera. The ever-present text '@goodboyhamlet' is still in the top right corner. A hand offers the dog an orange ball, which it eagerly takes in its mouth. In a playful exchange in the audio, a woman explains, \"He has this toy box which is actually just a cardboard box from when we moved in,\" with intermittent contributions from a child and more from the woman. The scene transitions to the dog walking down a hallway with a purple stuffed toy, tail wagging.\n\nMoving forward, the dog, now carrying an egg crate, roams around a room with a wooden floor, the '@goodboyhamlet' text still there. The scene shifts to the dog on a cushioned surface next to a table, holding a brown stuffed toy. The curious creature moves to a different room with a variety of toys scattered around. Here, the audio features a woman\u2019s narration, \u201cAnd sometimes it's not a toy at all, it could actually be an egg crate or a piece of cardboard. Working from home, we haven't been able to go on our usual outdoor adventures. So to cheer him up one day, I decided to throw him a surprise movie day with all his favorite toys,\u201d followed by a man speaking in a foreign language.\n\nNext, the dog is seen in a room filled with toys scattered on a white mat, curious and playful as always, while the text '@goodboyhamlet' persists. Sunlight floods the room through large windows, and the dog picks up a small brown stuffed toy in its mouth, moving closer to the camera. The scene zooms in on the dog holding the toy, with the modern kitchen and white sofa in the background. It finally stands on the couch, content and playful. The audio describes, \"I put his favorite show on, and I knew it worked because he was so excited that he started running around and wagging all of his toys all over the place and not really settling it.\"\n\nIn the concluding scenes, the dog is on a white couch, seemingly searching for something. Consistent with earlier visuals, the '@goodboyhamlet' text is in the frame\u2019s top right. It then transitions to the dog facing a TV on a white mat surrounded by its toys. The TV shows a lion running in a natural setting, capturing the dog's focus as it wags its tail. The surroundings reveal a small table, a white chair, and a gray wall. This is accompanied by a woman\u2019s voice in the audio, followed by the sounds of a door opening and closing. The video offers a heartfelt glimpse into the life of an affectionate and curious pup named Hamlet, bringing joy and comfort through playful moments, favorite toys, and family interactions, highlighted by the consistent tag '@goodboyhamlet'."
+ }
+ ],
+ "events": [
+ "Small, energetic white dog walks on a tiled floor, viewed from above",
+ "Dog moves from the kitchen to a room with a white floor",
+ "Papers and a remote control are seen resting on the kitchen counter",
+ "Dog settles on a couch and watches a TV",
+ "TV displays a scene with a German Shepherd and a person behind bars",
+ "Text '@goodboyhamlet' visible in the top right corner of the TV screen",
+ "Woman's voice narrates the dog's story about him being six weeks old and very clingy",
+ "Woman mentions cuddling up on the couch with the dog and putting a movie on",
+ "Scene changes, the white dog still on the couch watching TV",
+ "German Shepherd barks loudly beside a woman on the TV screen",
+ "Dog watches the TV attentively",
+ "TV scene transitions to a brown dog with a red collar beside a man",
+ "Audio states 'Hamlet pops up and he's like, oh my gosh'",
+ "White dog moves to a different room with a wooden floor and a toy-filled box",
+ "Woman mentions dog's bedtime routine around toys as he got older",
+ "White dog nose dives into the toy box",
+ "Dog stands near the toy box, looking up curiously at the camera",
+ "Text '@goodboyhamlet' still in the top right corner",
+ "Hand offers the dog an orange ball, which it eagerly takes",
+ "Woman explains about the toy box being a cardboard box from when they moved in",
+ "Intermittent child's contributions in the audio",
+ "Dog walks down a hallway with a purple stuffed toy, tail wagging",
+ "Dog carrying an egg crate roams a room with a wooden floor",
+ "Text '@goodboyhamlet' still visible",
+ "Dog on a cushioned surface next to a table holding a brown stuffed toy",
+ "Dog moves to a different room with toys scattered around",
+ "Woman's narration about non-toy items being used like egg crates or cardboard",
+ "Woman explains work from home situation and lack of outdoor adventures",
+ "Woman mentions throwing a surprise movie day with the dog's favorite toys",
+ "Man speaks in a foreign language in the background",
+ "Room filled with toys scattered on a white mat, dog curious and playful",
+ "Sunlight floods the room through large windows",
+ "Dog picks up a small brown stuffed toy and moves closer to the camera",
+ "Scene zooms in on the dog holding the toy",
+ "Modern kitchen and white sofa visible in the background",
+ "Dog stands on the couch, content and playful",
+ "Audio describes putting the dog's favorite show on and the dog's excitement",
+ "Dog runs around, wagging its toys everywhere without settling",
+ "Dog on a white couch searching for something",
+ "Text '@goodboyhamlet' in the top right corner",
+ "Dog faces a TV on a white mat surrounded by toys",
+ "TV shows a lion running in a natural setting",
+ "Dog focuses on the TV, wagging its tail",
+ "Room includes a small table, white chair, and gray wall",
+ "Woman's voice in the audio followed by door sounds",
+ "Video highlights the affectionate and curious pup, Hamlet",
+ "Video emphasizes playful moments, favorite toys, and family interactions",
+ "Consistent tag '@goodboyhamlet' throughout the video"
+ ]
+ },
+ {
+ "video": "video/213.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a blue work mat in a workspace environment, with various tools in the background. The text 'HAVE' appears on the screen. A black battery labeled 'AMPSENTRIX' is then shown on the blue mat, with 'WHAT YOU'RE' appearing on the screen. A hand points at the battery, accompanied by the text 'BATTERY'. An animated flame icon appears on the battery, indicating a potential hazard. The hand picks up a screwdriver and starts poking the battery, with the text 'DON'T' on the screen, clearly warning against such actions.\n\nIn the audio, a man\u2019s voice poses a question: \"Have you ever wondered what you're supposed to do if your battery catches on fire? Well, first things first, don't panic. The battery is going to smoke, so you definitely want to be in a well-ventilated area. Obviously, never stab the battery like I'm doing right now. This is for experimental purposes.\" Music plays softly in the background.\n\nThe scene transitions to a close-up of the same black 'AMPSENTRIX' battery on the blue mat. A hand holding a screwdriver continues poking the battery from the previous clip. The hand moves away, and the camera zooms in, revealing details such as 'MPN: AMP-IP7P', 'Li-ion Polymer Battery', '3.82V 2900mAh 11.1Wh', and various certification symbols. Two hands pick up the battery, examining it closely by bending it, before placing it back on the mat. The focus then shifts to a kitchen countertop with a white device connected to a charging cable.\n\nThe man speaks in the background, cautioning: \"You want to avoid bending the battery while it's pierced because that could be really dangerous.\" Another male voice interjects: \"Oh, what's that?\" Music continues to play softly.\n\nOn the kitchen countertop, the white device catches fire, producing significant smoke. A colorful scene indicating a pause appears. The scene then cuts back to a close-up of the 'AMPSENTRIX' battery on the blue mat. A plastic container filled with sand is placed on the mat, covering the battery partially. A hand picks up the battery and puts it into the container. The hand then tilts the container to reveal the sand inside. The text 'LET'S JUST ACT LIKE THIS IS A BUCKET' appears on the screen, suggesting a demonstration.\n\nThe audio explains: \"As a tech, believe it or not, you should have a bucket of sand. Just gonna place the battery in the bucket of sand and we're gonna place this outside for a couple of hours. Leaving the battery alone is not the move because it could spark and catch on fire. Stay safe out there.\" Background music plays as the man speaks.\n\nThe final scene again shows the plastic container filled with sand on the blue mat. The hand points to the container, emphasizing its purpose. The background shows various tools, a red 'EASY' button, and a small potted plant, indicating a workspace. The camera shifts to a close-up of the 'AMPSENTRIX' battery, showing its details once more.\n\nOverall, the video delivers a coherent narrative on battery safety, emphasizing the importance of not tampering with batteries and using sand as a precautionary measure."
+ }
+ ],
+ "events": [
+ "Close-up of a blue work mat in a workspace environment.",
+ "Various tools are visible in the background.",
+ "Text 'HAVE' appears on the screen.",
+ "A black battery labeled 'AMPSENTRIX' is shown on the blue mat.",
+ "Text 'WHAT YOU'RE' appears on the screen.",
+ "A hand points at the battery.",
+ "Text 'BATTERY' appears on the screen.",
+ "Animated flame icon appears on the battery, indicating a potential hazard.",
+ "The hand picks up a screwdriver and starts poking the battery.",
+ "Text 'DON'T' appears on the screen, warning against such actions.",
+ "Male voice in audio: 'Have you ever wondered what you're supposed to do if your battery catches on fire? Well, first things first, don't panic.'",
+ "Male voice: 'The battery is going to smoke, so you definitely want to be in a well-ventilated area.'",
+ "Male voice: 'Obviously, never stab the battery like I'm doing right now. This is for experimental purposes.'",
+ "Soft background music plays.",
+ "Scene transitions to a close-up of the same black 'AMPSENTRIX' battery on the blue mat.",
+ "Hand holding a screwdriver continues poking the battery.",
+ "Hand moves away and camera zooms in on battery details.",
+ "Battery details visible: 'MPN: AMP-IP7P', 'Li-ion Polymer Battery', '3.82V 2900mAh 11.1Wh', and various certification symbols.",
+ "Two hands pick up the battery and examine it closely by bending it.",
+ "Hands place the battery back on the mat.",
+ "Focus shifts to a kitchen countertop with a white device connected to a charging cable.",
+ "Male voice in audio: 'You want to avoid bending the battery while it's pierced because that could be really dangerous.'",
+ "Another male voice: 'Oh, what's that?'",
+ "Background music continues to play.",
+ "White device on the kitchen countertop catches fire, producing significant smoke.",
+ "Colorful scene indicating a pause in action appears.",
+ "Scene cuts back to a close-up of the 'AMPSENTRIX' battery on the blue mat.",
+ "Plastic container filled with sand is placed on the mat, covering the battery partially.",
+ "Hand picks up the battery and puts it into the container.",
+ "Hand tilts the container to reveal the sand inside.",
+ "Text 'LET'S JUST ACT LIKE THIS IS A BUCKET' appears on the screen, suggesting a demonstration.",
+ "Male voice in audio: 'As a tech, believe it or not, you should have a bucket of sand.'",
+ "Male voice: 'Just gonna place the battery in the bucket of sand and we're gonna place this outside for a couple of hours.'",
+ "Male voice: 'Leaving the battery alone is not the move because it could spark and catch on fire. Stay safe out there.'",
+ "Soft background music continues as the man speaks.",
+ "Final scene demonstrates the plastic container filled with sand on the blue mat.",
+ "Hand points to the container, emphasizing its purpose.",
+ "Background shows various tools, a red 'EASY' button, and a small potted plant, indicating a workspace.",
+ "Camera shifts to a close-up of the 'AMPSENTRIX' battery, showing its details once more."
+ ]
+ },
+ {
+ "video": "video/214.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video is set at Safeco Field during an intense and lively baseball game, highlighting the electrifying moments from multiple angles.\n\nThe video shows the pitcher, Paxton, in a white uniform with the number '65,' winding up to throw the ball with the scoreboard revealing the Cleveland team leading 4-1 against Seattle in the third inning. The batter swings and connects with the ball, sending it high into the outfield. The crowd watches closely, with advertisements such as 'Join now and get your travel on,' 'Alaska Mileage Plan,' and 'Fly-and-go-seek Alaska Airlines' displayed around the field. During this exciting moment, the animated voice over the loudspeaker exclaims, \"Against Paxton. This one hit the left field the truck backing up running out of room on the track,\" intensifying the moment as the crowd cheers energetically.\n\nThe scene transitions to an outfielder in a white uniform with the number '51' skillfully catching the ball near the wall. The enthusiastic crowd continues to cheer, celebrating the catch. The scoreboard still shows Cleveland leading 4-1 in the third inning with advertisements for 'Fred Hutch Cures Start Here,' 'Emerald Queen Casino,' and 'Fly-and-go-seek Alaska Airlines.' The jubilant atmosphere is captured as the pitcher, Paxton, raises his glove in acknowledgment of the excellent play. Over the loudspeaker, a voice celebrates the catch, proclaiming, \"He did it! 8-0, making a mark here in Seattle on his second stint,\" signifying the player's success and the crowd's elation.\n\nIn the stands, fans in Mariners jerseys, including a man with 'Paxton' on his back, wave their arms in excitement. The scoreboard again displays the same game stats: Cleveland leading Seattle 4-1 and Paxton having thrown 50 pitches. Attention shifts to the outfield, where the Mariners player acknowledges the crowd, and then to a close-up of a determined Cleveland batter focused on the next pitch. The nervous energy is palpable as a player in the Cleveland dugout gestures towards the field. The voice over the loudspeaker announces, \"Standing ovation for the 44-year-old future Hall of Famer,\" highlighting the game's tension and excitement.\n\nThe camera captures a pivotal play as the outfielder with the number '51' leaps against the outfield wall to catch the ball, with Cleveland players watching closely. Advertisements 'PREMERA Blue Cross' and 'FRED HUTCH' are visible on the wall. The focus then shifts to the outfielder's glove reaching over the wall for what might be a crucial catch. The lively crowd watches eagerly. The commentator's voice rings out, \"I thought this was going to end up in the pen as you take a look at the concentration. Definitely would have left the ballpark up over the wall perfectly timed by Ichiro,\" while the crowd erupts into applause.\n\nThe final moments center on the outfielder number '51' successfully catching the ball against the outfield wall. The camera captures the perfect catch, with the crowd watching tensely. The commentator captures the dramatic finish with, \"He makes the catch and falls back down onto the warning track,\" as the crowd roars with approval.\n\nThroughout these clips, the atmosphere remains charged with excitement and tension, reflecting the dynamic and thrilling nature of the baseball game taking place at Safeco Field."
+ }
+ ],
+ "events": [
+ "The video is set at Safeco Field during an intense and lively baseball game, highlighting electrifying moments from multiple angles.",
+ "The pitcher, Paxton, in a white uniform with the number '65,' winds up to throw the ball.",
+ "The scoreboard shows Cleveland leading 4-1 against Seattle in the third inning.",
+ "The batter swings and connects with the ball, sending it high into the outfield.",
+ "The crowd watches closely.",
+ "Advertisements such as 'Join now and get your travel on,' 'Alaska Mileage Plan,' and 'Fly-and-go-seek Alaska Airlines' are displayed around the field.",
+ "The animated voice over the loudspeaker announces, 'Against Paxton. This one hit the left field the truck backing up running out of room on the track.'",
+ "The crowd cheers energetically.",
+ "The scene transitions to an outfielder in a white uniform with the number '51' skillfully catching the ball near the wall.",
+ "The enthusiastic crowd continues to cheer, celebrating the catch.",
+ "The scoreboard still shows Cleveland leading 4-1 in the third inning with advertisements for 'Fred Hutch Cures Start Here,' 'Emerald Queen Casino,' and 'Fly-and-go-seek Alaska Airlines'.",
+ "The pitcher, Paxton, raises his glove in acknowledgment of the excellent play.",
+ "A voice over the loudspeaker proclaims, 'He did it! 8-0, making a mark here in Seattle on his second stint,'",
+ "Fans in Mariners jerseys, including a man with 'Paxton' on his back, wave their arms in excitement.",
+ "The scoreboard again displays the same game stats: Cleveland leading Seattle 4-1 and Paxton having thrown 50 pitches.",
+ "Attention shifts to the outfield where the Mariners player acknowledges the crowd.",
+ "A close-up of a determined Cleveland batter focused on the next pitch is shown.",
+ "A player in the Cleveland dugout gestures towards the field.",
+ "The loudspeaker voice announces, 'Standing ovation for the 44-year-old future Hall of Famer.'",
+ "The outfielder with the number '51' leaps against the outfield wall to catch the ball.",
+ "Cleveland players watch closely.",
+ "Advertisements 'PREMERA Blue Cross' and 'FRED HUTCH' are visible on the wall.",
+ "The outfielder's glove reaches over the wall for what might be a crucial catch.",
+ "The crowd watches eagerly.",
+ "The commentator's voice declares, 'I thought this was going to end up in the pen as you take a look at the concentration. Definitely would have left the ballpark up over the wall perfectly timed by Ichiro,'",
+ "The crowd erupts into applause.",
+ "The outfielder number '51' successfully catches the ball against the outfield wall.",
+ "The camera captures the perfect catch.",
+ "The crowd watches tensely.",
+ "The commentator captures the dramatic finish with, 'He makes the catch and falls back down onto the warning track,'",
+ "The crowd roars with approval.",
+ "The atmosphere remains charged with excitement and tension, reflecting the dynamic and thrilling nature of the baseball game at Safeco Field."
+ ]
+ },
+ {
+ "video": "video/215.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up of a person setting up a Samsung Galaxy smartphone on a tripod. The text 'HOW TO SHOOT THE GALAXY USING A SAMSUNG GALAXY (OR ANY PHOTO OF STARS)' is displayed at the top. A stunning image of the galaxy appears, emphasizing the goal of capturing such a photo. The person adjusts the tripod and the phone, ensuring it is securely mounted and properly aligned. The phone's camera interface is shown, demonstrating how to enable 'Pro Mode' and adjust settings for astrophotography. Text overlays guide the viewer: 'TURN ON THE PRO MODE' and 'SHUTTER SPEED: 30S'. The camera interface displays the adjusted settings, ready to capture the galaxy.\n\nSimultaneously, a man's voice explains: \"So here's how I took this photo of the galaxy using a Samsung Galaxy phone. In the camera app, turn on the pro mode. Then set your shutter speed to 30 seconds.\" Music plays softly in the background as he speaks.\n\nThe camera interface of a Samsung Galaxy smartphone is shown, displaying various settings for astrophotography. The ISO is set to 400, with the text 'ISO: 400' prominently shown. The focus setting changes from 'AF' (Auto Focus) to 'MF' (Manual Focus), accompanied by the text 'FOCUS: MANUAL FOCUS.' Instructions appear: 'ZOOM IN AS FAR AS YOU CAN HELP FOCUS (DON'T FORGET TO ZOOM BACK OUT),' and the zoom level is adjusted to 10.0X. The camera interface shows a focused star in the center of the screen.\n\nThroughout, the man's voice continues, \"Then set your ISO to 400. Then change the focus to manual focus. Zoom in as far as you can and use the slider to focus on the brightest star.\"\n\nNext, the camera interface emphasizes the process of capturing a star. The text 'FOCUS UNTIL BRIGHTEST STAR TURNS GREEN' is displayed, indicating ideal focus. Below, 'ZOOM IN AS FAR AS YOU CAN HELP FOCUS (DON'T FORGET TO ZOOM BACK OUT)' reinforces the zooming technique. The zoom level remains at 10.0X, with visible camera settings at the bottom. Text on the screen advises: 'TURN ON 2S TIMER.' The final frames show a countdown starting from '1', signaling the start of the photo capture.\n\nThe man's voice explains: \"If it's in focus when the star turns green because of the focus peaking. Next, turn on the 2-second timer so that the phone doesn't move when you press the shutter. Take the photo and wait for,\" as the countdown and music play.\n\nThe text 'Taking picture... Hold your phone steady.' appears, followed by 'USE A TRIPOD OR REST THE PHONE SOMEWHERE SO IT DOESN'T MOVE,' stressing stability. Additional tips: 'GO TO A PLACE WITH NO LIGHT POLLUTION' and 'IDEALLY SHOOT DURING A NEW MOON' are displayed. The progress circle at the bottom indicates the ongoing photo capture.\n\nThe man advises: \"30 seconds. Make sure to use a tripod or put the phone down somewhere so it doesn't move. Go to a place with no or little amounts of light pollution and ideally you want to shoot during a new,\" followed by a buzzing sound and continued speech.\n\nWhen the circle's progress bar is full, the scene changes. It transitions to a stunning night sky filled with stars, showcasing the result. A clearer, detailed image of the galaxy follows, emphasizing successful astrophotography. Text: 'AFTER A QUICK EDIT USING LIGHTROOM MOBILE,' appears, indicating image enhancement. The final image shows the edited galaxy photo.\n\nThe man's voice concludes: \"moon. And that's it, here's the photo straight out of the phone camera and here's the photo after a quick edit on Lightroom,\" followed by the sound of a camera shutter clicking."
+ }
+ ],
+ "events": [
+ "Close-up of a person setting up a Samsung Galaxy smartphone on a tripod.",
+ "Text 'HOW TO SHOOT THE GALAXY USING A SAMSUNG GALAXY (OR ANY PHOTO OF STARS)' displayed at the top.",
+ "A stunning image of the galaxy appears.",
+ "Person adjusts the tripod and phone, ensuring it is securely mounted and aligned.",
+ "Phone's camera interface shown, demonstrating how to enable 'Pro Mode'.",
+ "Adjust settings for astrophotography shown on camera interface.",
+ "Text overlays: 'TURN ON THE PRO MODE' and 'SHUTTER SPEED: 30S'.",
+ "Camera interface displays the adjusted settings.",
+ "Man's voice explains turning on pro mode and setting shutter speed to 30 seconds.",
+ "Music plays softly in the background.",
+ "Phone's camera interface displays various settings for astrophotography.",
+ "ISO is set to 400, with text 'ISO: 400' prominently shown.",
+ "Focus setting changes from 'AF' to 'MF', accompanied by text 'FOCUS: MANUAL FOCUS'.",
+ "Instructions appear: 'ZOOM IN AS FAR AS YOU CAN HELP FOCUS (DON'T FORGET TO ZOOM BACK OUT)'.",
+ "Zoom level adjusted to 10.0X.",
+ "Camera interface shows a focused star in the center of the screen.",
+ "Man's voice explains setting ISO to 400, changing to manual focus, and zooming in.",
+ "Camera interface emphasizes the process of capturing a star.",
+ "Text 'FOCUS UNTIL BRIGHTEST STAR TURNS GREEN' displayed.",
+ "Text 'ZOOM IN AS FAR AS YOU CAN HELP FOCUS (DON'T FORGET TO ZOOM BACK OUT)' displayed again.",
+ "Zoom level remains at 10.0X.",
+ "Camera settings visible at the bottom.",
+ "Text 'TURN ON 2S TIMER' displayed.",
+ "Final frames show a countdown starting from '1'.",
+ "Man's voice explains using focus peaking, turning on 2-second timer, and taking the photo.",
+ "Text 'Taking picture... Hold your phone steady.' appears.",
+ "Text 'USE A TRIPOD OR REST THE PHONE SOMEWHERE SO IT DOESN'T MOVE' displayed.",
+ "Additional tips: 'GO TO A PLACE WITH NO LIGHT POLLUTION' and 'IDEALLY SHOOT DURING A NEW MOON' displayed.",
+ "Progress circle at the bottom indicates ongoing photo capture.",
+ "Man advises to use a tripod or put phone down, go to a place with no light pollution, and ideally shoot during a new moon.",
+ "Buzzing sound and continued speech from the man.",
+ "Circle's progress bar fills.",
+ "Scene transitions to a stunning night sky filled with stars.",
+ "Detailed image of the galaxy shown, emphasizing successful astrophotography.",
+ "Text 'AFTER A QUICK EDIT USING LIGHTROOM MOBILE' appears.",
+ "Final image shows the edited galaxy photo.",
+ "Man's voice concludes, followed by the sound of a camera shutter clicking."
+ ]
+ },
+ {
+ "video": "video/216.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nThoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a person in a black outfit speaking against a dark background. Chinese text appears at the top, middle, and bottom of the screen in succession, narrating a sorrowful story about an official sabotage involving the Pok\u00e9mon GO trainer in Japan. The text reads: '\u6293\u76ae\u5361\u4e18\u88ab\u5b98\u65b9\u885d\u5eb7 \u8a13\u7df4\u5e2b\u4eba\u751f\u60b2\u75db\u7d50\u675f\uff01' (Officially sabotaged for catching Pikachu, trainer's life ends in sorrow!), '\u8d85\u7a00\u6709\u8077\u696d\u63ed\u7955' (Revealing a super rare profession), '\u4f60\u6709\u5690\u904e\u88ab\u6700\u4fe1\u4efb\u7684\u4eba\u80cc\u53db\u7684\u60b2\u75db\u6ecb\u5473\u55ce\uff1f' (Have you ever tasted the grief of being betrayed by someone you trust the most?) The scene transitions to a close-up of a person in a white shirt using a smartphone, with the text '\u65e5\u672c\u6709\u4f4d Pok\u00e9mon GO \u7684\u8a13\u7df4\u5bb6' (There is a Pok\u00e9mon GO trainer in Japan). It introduces the trainer named '\u30a2\u30eb\u30c1 (Aruci)', who has been committed to catching Pikachu since 2017, as stated with '\u6253\u5f9e2017\u5e74\u5165\u5751\u958b\u59cb' (Since starting in 2017). He is marked by the singular goal of '\u5c31\u4ee5\u6293\u76ae\u5361\u4e18 \u4f5c\u70ba\u552f\u4e00\u76ee\u6a19' (Just catching Pikachu as the only goal). The clip ends highlighting his dedication: '\u4e0d\u65b7\u7a81\u7834\u81ea\u5df1\u7d00\u9304' (Constantly breaking their own records).\n\nIn the audio, a man emotionally narrates: \"Have you ever tasted the pain of being betrayed by the most trusted person? In Japan, a Pok\u00e9mon GO trainer has been using Pikachu as his only goal since entering the pit in 2017, constantly breaking his own record,\" accompanied by the sound of a video game being played.\n\nThe video transitions to an animated scene featuring Pikachu in a forest setting. 
In May 2020, the trainer caught his 3,000th Pikachu, '2020 \u5e74 5 \u6708 \u4ed6\u6293\u5230\u7b2c 3 \u5343\u96bb\u76ae\u5361\u4e18' (May 2020, he caught the 3,000th Pikachu), and by July 2020, his achievement escalated to 6,000 Pikachu, '7\u6708\u6293\u5230\u7b2c6\u5343\u96bb' (Caught the 6,000th in July), marked further in December 2021, reaching 10,000, '2021.12 10000\u96bb\u9054\u6210!' (December 2021, 10,000 achieved!). Despite his continuous progress, with '1 \u842c\u96bb\u76ae\u5361\u4e18\u7684\u58ef\u8209!' (The feat of 10,000 Pikachus!), it emphasizes '\u4ed6\u7684\u8a13\u7df4\u4eba\u751f\u5c31\u9019\u6a23\u6301\u7e8c\u524d\u9032' (His training life continued like this), hinting at looming despair, '\u4f46\uff0c\u610f\u5916\u7e3d\u662f\u7a81\u7136\u964d\u81e8\u2026\u2026' (But accidents always happen suddenly\u2026).\n\nIn the audio, the narration continues: \"In May 2020, he caught 3,000 Pikachu, and in July he caught 6,000. In December of last year, he was able to catch 10,000 Pikachu. His training life continued like this, but the accident.\"\n\nNext, the clip showcases Pikachu\u2019s happier moments interspersed with an unexpected twist, stating: '\u4f46\uff0c\u610f\u5916\u7e3d\u662f\u7a81\u7136\u964d\u81e8\u2026\u2026' (But, accidents always happen suddenly...). The Pok\u00e9mon GO screen reveals an unsettling discovery: '\u4e0a\u500b\u6708/\u4ed6\u9a5a\u5687\u7684\u767c\u73fe' (Last month, he was shocked to discover) his recent Pikachu catch turned into Zorua, '\u81ea\u5df1\u525b\u6293\u5230\u7684\u4e00\u96bb\u76ae\u5361\u4e18 \u7adf\u7136\u8b8a\u6210\u4e86\u300c\u7d22\u7f85\u4e9e\u300d!' (The Pikachu he just caught turned into 'Zorua'!). The surprising transition showcases '\u7d22\u7f85\u4e9e' (Zorua) pressing its cheeks. In a moment of dismay, the trainer realizes this taint on his record: '\u554a\u6211\u7684\u76ae\u5361\u4e18\u54e9! 
100%\u5168\u76ae\u5361\u4e18\u7684\u7d00\u9304 \u9813\u6642\u51fa\u73fe\u6c61\u9ede \u53ea\u5269\u4e0b\u767e\u5206\u4e4b99.99092\u4e86!!!' (Ah, my Pikachu! 100% record of all Pikachus. Suddenly a stain appeared. Only 99.99092% left!!!).\n\nFurthering the dramatic frame, the audio continues with: \"\u4e8b\u5f8c\u4ed6\u624d\u8a8d\u77e5\u5230\u9019\u4e00\u5207\u90fd\u662f\u5b98\u65b9\u5728\u842c\u8056\u7bc0\u7684\u60e1\u4f5c\u5287\uff0c\u6d3e\u4e86\u7d22\u7f85\u4e9e\u505a\u4e86\u9593\u8adc\u3002\u4ed6\u8868\u793a\uff0c\u9019\u5c31\u50cf\u662f\u88ab\u6700\u4fe1\u4efb\u7684\u4eba\u80cc\u523a\uff0c\u81ea\u5df1\u5df2\u7d93\u842c\u5ff5\u4ff1\u7070\u6c7a\u5b9a\u7d50\u675f\u81ea\u5df1\u7684\u8a13\u7df4\u5e2b\u4eba\u751f\u3002\u597d\u6158\uff0c\u4f60\u80fd\u63a5\u53d7\u81ea\u5df1\u7684\u73a9\u904a\u62325\u5e74\u7684\u7d00\u9304\u88ab\u8feb\u4e2d\u65b7\uff0c\u800c\u4e14\u9084\u662f\u88ab\u5b98\u65b9\u7d66\u885d\u5eb7\u4e86\u55ce\uff1f\" (Only later did he realize that it had all been an official Halloween prank, and that Zorua had been sent as a spy. He said it felt like being backstabbed by the person he trusted most, and, utterly disheartened, he had decided to end his trainer career. How miserable; could you accept your 5-year gaming record being cut short, and by official sabotage at that?)"
+ }
+ ],
+ "events": [
+ "Person in black outfit speaks against a dark background",
+ "Chinese text appears narrating a sorrowful story about official sabotage involving a Pok\u00e9mon GO trainer",
+ "Text reads: '\u6293\u76ae\u5361\u4e18\u88ab\u5b98\u65b9\u885d\u5eb7 \u8a13\u7df4\u5e2b\u4eba\u751f\u60b2\u75db\u7d50\u675f\uff01'",
+ "Text reads: '\u8d85\u7a00\u6709\u8077\u696d\u63ed\u7955'",
+ "Text reads: '\u4f60\u6709\u5690\u904e\u88ab\u6700\u4fe1\u4efb\u7684\u4eba\u80cc\u53db\u7684\u60b2\u75db\u6ecb\u5473\u55ce\uff1f'",
+ "Scene transitions to a close-up of a person in a white shirt using a smartphone",
+ "Text introduces a Pok\u00e9mon GO trainer in Japan named '\u30a2\u30eb\u30c1 (Aruci)'",
+ "Text states Aruci has been committed to catching Pikachu since 2017",
+ "Text reads: '\u5c31\u4ee5\u6293\u76ae\u5361\u4e18 \u4f5c\u70ba\u552f\u4e00\u76ee\u6a19'",
+ "Clip highlights Aruci's dedication: '\u4e0d\u65b7\u7a81\u7834\u81ea\u5df1\u7d00\u9304'",
+ "Audio narration emotionally describes betrayal and Aruci's dedication",
+ "Sound of a video game being played accompanies the narration",
+ "Video transitions to an animated scene featuring Pikachu in a forest setting",
+ "Trainer caught his 3,000th Pikachu in May 2020",
+ "Trainer caught his 6,000th Pikachu in July 2020",
+ "Trainer achieved 10,000 Pikachu in December 2021",
+ "Text reads: '1 \u842c\u96bb\u76ae\u5361\u4e18\u7684\u58ef\u8209!'",
+ "Text emphasizes '\u4ed6\u7684\u8a13\u7df4\u4eba\u751f\u5c31\u9019\u6a23\u6301\u7e8c\u524d\u9032'",
+ "Text hints at looming despair: '\u4f46\uff0c\u610f\u5916\u7e3d\u662f\u7a81\u7136\u964d\u81e8\u2026\u2026'",
+ "Audio narration continues detailing Aruci's achievements and a sudden accident",
+ "Clip shows happier moments of Pikachu interspersed with an unexpected twist",
+ "Text reads: '\u4f46\uff0c\u610f\u5916\u7e3d\u662f\u7a81\u7136\u964d\u81e8\u2026\u2026'",
+ "Pok\u00e9mon GO screen reveals an unsettling discovery",
+ "Text reads: '\u4e0a\u500b\u6708/\u4ed6\u9a5a\u5687\u7684\u767c\u73fe'",
+ "Trainer's recently caught Pikachu turned into Zorua",
+ "Text reads: '\u81ea\u5df1\u525b\u6293\u5230\u7684\u4e00\u96bb\u76ae\u5361\u4e18 \u7adf\u7136\u8b8a\u6210\u4e86\u300c\u7d22\u7f85\u4e9e\u300d!'",
+ "Zorua is shown pressing its cheeks",
+ "Trainer realizes the stain on his 100% Pikachu record, now 99.99092%",
+ "Audio narration describes Aruci's realization of the official Halloween prank",
+ "Text and audio detail Aruci feeling betrayed and deciding to end his career as a trainer",
+ "Text and audio ask if viewers could accept their 5-year game record being interrupted by official sabotage"
+ ]
+ },
+ {
+ "video": "video/217.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens in a modern kitchen where a woman dressed in a green sweater and light blue jeans stands next to a stainless steel oven. She holds onto the oven with one hand while the other gestures expressively, her voice projecting confidently through a microphone clipped to her sweater. The kitchen counter behind her displays a wooden tray with bananas and other items, while a box of Triscuit crackers and other food items rests on a shelf above the oven. The woman begins by sharing: \"As a dietitian, these are the foods I never keep in the house. Okay, so you probably were expecting me to say chips and cookies and ice cream. But since healing my relationship with food...\" Her speech reveals that the speaker is a woman.\n\nContinuing in the same kitchen, the woman is now standing with both hands slightly raised as she continues her explanation. The ambiance remains consistent, with the Triscuit box and the wooden tray full of bananas still in view. She elaborates: \"I've really found that I've been able to keep these foods in the house and largely forget about them. So what do I never buy? Pretty much just things that my family and I don't like.\"\n\nMaintaining the setting, the woman is now holding a red bag of chips in one hand, making expressive gestures with the other. She is still positioned near the stainless steel oven, speaking confidently through the microphone clipped to her sweater, and occasionally eating a chip as she converses. The woman discusses with some frustration and humor: \"Number one, kale chips. Like why? Eat your kale and have your chips. Like I always find that they're just so dry and I kind of feel like I'm gonna choke when I eat them.\" The sound of crunching and chewing accentuates her point.\n\nThe scene shifts slightly within the kitchen, now focusing on a large black cabinet with an open drawer filled with various food items. 
The woman, still dressed in a green sweater and light blue jeans, stands in front of it, holding a container with a yellow lid and explaining its contents with animated hand movements. The conversation transitions smoothly, and the woman critiques: \"Number two low-calorie ice cream like hell to the no no no. Honestly, it just tastes like air and sadness to me. I go with the OG and three energy drinks.\"\n\nThe camera pans to showcase the woman near a marble countertop adorned with a knife block and other kitchen essentials. With both hands gesturing actively, she dives into the disadvantages of certain foods, pointing occasionally to items on the counter. Her voice carries her concerns clearly: \"Those are like a big no-no for me because I have anxiety and insomnia. I don't need like 200 plus milligrams of caffeine in one go.\"\n\nAs the video culminates, the woman stays near the marble countertop, this time holding a white mug. She continues to discuss passionately, using her free hand for emphasis while taking occasional sips from the mug. The consistent backdrop includes the marble countertop, knife block, window view of the outdoors, and a few books on a shelf. Wrapping up her advice, she encourages: \"So I'm gonna stick with my usual latte. If you guys are working on your relationship with food, I know it can feel really hard to keep some of these trigger foods in your house, but hit that follow if you want to learn how...\". The steady hum of the coffee machine in the background modestly punctuates the ending of her session. The video effectively combines the visual and audio elements, providing a coherent narrative of dietary advice and personal experiences related to food in a homely kitchen setting."
+ }
+ ],
+ "events": [
+ "Video opens in a modern kitchen.",
+ "A woman dressed in a green sweater and light blue jeans stands next to a stainless steel oven.",
+ "She holds onto the oven with one hand while the other gestures expressively.",
+ "Her voice projects confidently through a microphone clipped to her sweater.",
+ "The kitchen counter behind her displays a wooden tray with bananas and other items.",
+ "A box of Triscuit crackers and other food items rests on a shelf above the oven.",
+ "The woman begins by sharing: 'As a dietitian, these are the foods I never keep in the house...'",
+ "She mentions chips, cookies, and ice cream, and talks about healing her relationship with food.",
+ "The woman stands with both hands slightly raised as she continues her explanation.",
+ "The ambiance remains consistent with the Triscuit box and the wooden tray full of bananas in view.",
+ "She elaborates: 'I've really found that I've been able to keep these foods in the house and largely forget about them...'",
+ "She explains she does not buy things her family does not like.",
+ "The woman is now holding a red bag of chips in one hand, making expressive gestures with the other.",
+ "She is still positioned near the stainless steel oven, speaking confidently through the microphone clipped to her sweater.",
+ "She occasionally eats a chip as she converses.",
+ "The woman discusses with frustration and humor: 'Number one, kale chips...'",
+ "She notes kale chips are dry and make her feel like she is going to choke.",
+ "The sound of crunching and chewing accentuates her point.",
+ "The scene shifts, focusing on a large black cabinet with an open drawer filled with various food items.",
+ "The woman, still dressed in a green sweater and light blue jeans, stands in front of it.",
+ "She holds a container with a yellow lid and explains its contents with animated hand movements.",
+ "The woman critiques: 'Number two low-calorie ice cream...'",
+ "She says it tastes like air and sadness and prefers the original ice cream.",
+ "The camera pans to showcase the woman near a marble countertop adorned with a knife block and other kitchen essentials.",
+ "With both hands gesturing actively, she dives into the disadvantages of certain foods.",
+ "She points occasionally to items on the counter.",
+ "She states energy drinks are a big no-no for her because of anxiety and insomnia.",
+ "She mentions the high caffeine content.",
+ "The woman stays near the marble countertop, this time holding a white mug.",
+ "She discusses passionately, using her free hand for emphasis while taking occasional sips from the mug.",
+ "Consistent backdrop includes the marble countertop, knife block, window view, and books on a shelf.",
+ "She wraps up her advice and encourages: 'So I'm gonna stick with my usual latte...'",
+ "She encourages viewers to follow her for advice on working on their relationship with food.",
+ "The steady hum of the coffee machine modestly punctuates the ending of her session."
+ ]
+ },
+ {
+ "video": "video/218.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nProvide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned."
+ },
+ {
+ "from": "gpt",
+ "value": "The video starts indoors with a man wearing a brown cap with the text 'YOUR FAVORITE MARTIAN' and a black hoodie, sitting comfortably. The text 'It finally happened' is displayed on the screen. This suggests that an anticipated event has taken place. The scene then cuts to a parking lot where a woman in a bright pink jacket is seen standing in front of a gray car, seemingly blocking it from parking. The text 'Crazy Karen Tries To Save Parking Spot With Her Body' appears at the top, indicating a dramatic confrontation. A male voice narrates: \"It finally happened. Someone did the thing that we all wanna do when someone stands in a parking spot.\" As the conflict unfolds, the text 'Find another parking spot' and 'I already went around here' appear, with the woman emphatically gesturing. The man's face emerges in the bottom right corner of the screen and continues to speak. The male voice reads the texts out loud, \"Find another parking spot. I already went around here.\" Tension escalates as the woman continues her stance, and the text 'You really want to get in trouble?' appears, with the voiceover adding, \"You really wanna get in trouble?\", hinting at a potential conflict.\n\nThe video continues with the same scene in the parking lot. The woman in the bright pink jacket is still blocking the car. The text and narration reinforce the confrontation with, \"You really want to get in trouble?\" The scene shifts slightly, showing her from a different angle while the texts 'You're f***ing crazy' and 'I'm gonna tell them so it's cool...' appear sequentially. The driver's frustration is evident as the male voiceover transitions through \"You're crazy,\" and then imitates the woman's intention to report: \"I'm gonna tell them so it's cool.\" The view switches to inside the car, where the man wearing the brown cap is seen again, with the text 'You're a f***ing b***h' appearing, indicating the woman's anger. 
Despite the woman's attempts to block the car, it continues to move forward in the car park. The audio is layered with beeping sounds censoring the harsh words.\n\nThen, the scene transitions back indoors to the man in the brown cap and black hoodie, gesturing animatedly while speaking. The audio features the man saying, \"the worst but you can't just and then\", appearing frustrated at someone not editing a video correctly. An animated sequence follows, showing a crowd in front of a store named 'MOTHER'S JERKY,' with the text '(screaming)' animated at the bottom, reflecting the chaotic scene. The man in the brown cap continues speaking, with the text 'and then' followed by 'DON'T' and 'edit the video,' visually emphasizing his point. A small animated character dances in the corner to the text 'That Ludacris song.' The audio references this with, \"Definitely don't edit the video to include that ludicrous song.\"\n\nThe final segment circles back to the parking lot. The woman in the bright pink jacket still blocks the gray car. Throughout this, the text 'Crazy Karen Tries To Save Parking Spot With Her Body' stays at the top, continuing to define the confrontation strongly. The woman is now leaning against the car, and a person in the background is seen getting into a white vehicle, indicating more action in the parking lot. The tension remains high as the woman maintains her blocking position. The audio features a quick and loud male voice proclaiming, \"Parking spot at Walmart?\" summarizing the exasperation and absurdity of the situation.\n\nThis coherent narrative shows both the building and resolution of a tense parking lot confrontation, interspersed with reflective commentary and humorous animation, painting a vivid picture of the drama and the observer's disbelief."
+ }
+ ],
+ "events": [
+ "The video starts indoors with a man wearing a brown cap with the text 'YOUR FAVORITE MARTIAN' and a black hoodie, sitting comfortably.",
+ "The text 'It finally happened' is displayed on the screen.",
+ "The scene cuts to a parking lot with a woman in a bright pink jacket standing in front of a gray car, seemingly blocking it from parking.",
+ "The text 'Crazy Karen Tries To Save Parking Spot With Her Body' appears at the top of the screen.",
+ "A male voice narrates: 'It finally happened. Someone did the thing that we all wanna do when someone stands in a parking spot.'",
+ "The text 'Find another parking spot' and 'I already went around here' appear as the woman gestures.",
+ "The man's face appears in the bottom right corner of the screen, speaking.",
+ "The male voice reads the texts 'Find another parking spot. I already went around here.'",
+ "The text 'You really want to get in trouble?' appears on the screen.",
+ "The voiceover adds, 'You really wanna get in trouble?'",
+ "The woman in the bright pink jacket continues blocking the car.",
+ "The text and narration reinforce with 'You really want to get in trouble?'",
+ "The scene shifts to a different angle of the woman.",
+ "The texts 'You're f***ing crazy' and 'I'm gonna tell them so it's cool...' appear sequentially.",
+ "The male voiceover expresses the driver's frustration through 'you're crazy' and 'I'm gonna tell them so it's cool.'",
+ "The view switches to inside the car, showing the man wearing the brown cap.",
+ "The text 'You're a f***ing b***h' appears, indicating the woman's anger.",
+ "Despite the woman's attempts to block the car, it continues to move forward in the car park.",
+ "The audio includes beeping sounds censoring harsh words.",
+ "Scene transitions back indoors to the man in the brown cap and black hoodie, gesturing animatedly.",
+ "The audio features the man saying 'the worst but you can't just and then,' frustrated at someone not editing a video correctly.",
+ "An animated sequence shows a crowd in front of a store named 'MOTHER'S JERKY', with the text '(screaming)' animated at the bottom.",
+ "The man in the brown cap continues speaking, with texts 'and then' followed by 'DON'T' and 'edit the video,' visually emphasizing their point.",
+ "An animated character dances in the corner to the text 'That Ludacris song.'",
+ "The audio references with 'Definitely don't edit the video to include that ludicrous song.'",
+ "The final segment circles back to the parking lot with the woman in the bright pink jacket still blocking the gray car.",
+ "The text 'Crazy Karen Tries To Save Parking Spot With Her Body' remains at the top.",
+ "The woman leans against the car, and a person in the background gets into a white vehicle.",
+ "The tension remains high as the woman maintains her blocking position.",
+ "The audio features a quick and loud male voice proclaiming 'Parking spot at Walmart?' summarizing the exasperation and absurdity of the situation."
+ ]
+ },
+ {
+ "video": "video/219.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video features a focused demonstration of a drumming technique known as 'Bass Drum Doubles.' The consistent elements across all the clips include a man seated on a drum stool, dressed in a light gray t-shirt, black pants, and white sneakers. The setting remains dark, spotlighting the drum set\u2014which includes a bass drum, snare drum, and cymbals\u2014and the demonstrator. Throughout the video, the text 'BASS DRUM DOUBLES' is persistently displayed at the top of the frames, elucidating the video's focus.\n\nIn the audio track, the man articulates step-by-step instructions for executing 'Bass Drum Doubles.' His explanations are punctuated by demonstrations using the bass drum pedal operated by his right foot. The man occasionally gestures with his hands, while the drumsticks rest on the snare drum when not in use.\n\nInitially, the man is heard explaining the first step: \"So, I'll lift my leg up, still letting the pedal shadow my foot, I'll come down for the first note, like that, and then on the way down, I'll let my foot kind of drop down and that gives me the second note, so I'm doing kind of like this. So there's kind of like one note on the way up, one out on the way down.\" He looks at the camera from time to time with gestures, and shows the technique while explaining, followed by a close-up of the foot's motion. His commentary smoothly integrates with the visual demonstration of the foot technique.\n\nAs the demonstration continues, the man references another drummer, Anika: \"And this is similar to how Anika does it. The little difference that my body just kind of naturally started doing was, especially at quicker speeds, I kind of ankle my heel a little bit inwards, towards the snare. So I'm doing like a little bit of almost like a swivel thing, where the first note doesn't look like this, it rather looks something like this, with my heel kind of going this way. 
So that's the first note and then the second note comes down that way like, like that, so, that's the motion that I like to use,\" as the audio shows. This insight is paired with visuals of the man's foot executing the technique at various speeds. The text overlay '*angle' in white font appears on the screen near the man, correcting his verbal slip. He is still gesturing while explaining, sometimes pointing to the left with his thumb, sometimes pointing upwards and downwards, and he looks at the camera occasionally.\n\nFinally, the man summarizes the technique's efficiency at higher speeds: \"You'll see it especially when I try to do it at quicker speeds.\" This concluding remark is accompanied by another close-up of his foot, offering an intricate view of the movement.\n\nThroughout the video, the man's instructions are consistent and coherent, enabling viewers to grasp the 'Bass Drum Doubles' technique thoroughly. The visual and audio components are tightly interwoven, ensuring a precise and engaging educational experience."
+ }
+ ],
+ "events": [
+ "The video starts with a man seated on a drum stool.",
+ "The man is dressed in a light gray t-shirt, black pants, and white sneakers.",
+ "The setting is dark with a spotlight on the drum set.",
+ "The drum set includes a bass drum, snare drum, and cymbals.",
+ "The text 'BASS DRUM DOUBLES' is displayed at the top of the frames throughout the video.",
+ "The man explains the drumming technique 'Bass Drum Doubles' step-by-step.",
+ "The man lifts his leg up and explains the first note of the technique.",
+ "He demonstrates how the foot comes down for the first note and then drops down for the second note.",
+ "He shows a close-up of the foot's motion while explaining.",
+ "The man references another drummer, Anika, and compares his technique to hers.",
+ "He mentions his different approach at quicker speeds, involving a slight inward ankle movement.",
+ "The text overlay '*angle' appears on the screen to correct his verbal slip.",
+ "He gestures with his hands while explaining, sometimes pointing to the left with his thumb, and upward or downward.",
+ "The man occasionally looks at the camera while explaining.",
+ "He shows visuals of his foot executing the technique at various speeds.",
+ "The man summarizes the efficiency of the technique at higher speeds.",
+ "A close-up is shown of the man's foot doing the technique at higher speeds.",
+ "The man's instructions remain consistent and coherent throughout the video.",
+ "The visual and audio components are tightly interwoven for an engaging educational experience."
+ ]
+ },
+ {
+ "video": "video/220.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with an intense football game scene where a team in white jerseys is lining up at the line of scrimmage. The players are bracing themselves for the play, with number 6 standing behind the formation, ready to punt the ball. The opposing team, clad in black and yellow jerseys, is strategically positioned to defend. As the ball is snapped from the center, number 6 kicks it forcefully, sending it soaring down the field. The camera follows the ball as it bounces and rolls closer to the end zone, while players from both teams sprint towards it.\n\nSimultaneously, the audio reveals a man's voice: \"Bounces near the 10. Chase to the end zone.\" The roar of the crowd fills the background, accompanied by distant shouts and someone speaking over a loudspeaker, amplifying the excitement of the moment.\n\nNext, the football bounces precariously near the end zone, the word 'END' prominently displayed on the field. A player in white and blue dashes towards the ball, diving to keep it from crossing the goal line. Successfully, he halts the ball, preventing a touchback, much to the elation of his team. Another player from the black and yellow team attempts to reach the ball but falls short. The white jersey team celebrates, surrounding the player in a triumphant huddle.\n\nIn conjunction with this, the audio continues: \"Oh, two diving attempts, and they never touch the end line. Beautiful play by that special team's unit.\" The crowd's cheers blend with the commentary from a man speaking into a microphone, enhancing the atmosphere of the game.\n\nThe scene transitions to a close-up of the referee, wearing a black and white striped uniform, complete with a white cap bearing the NFL logo. Using hand signals, he explains the play to the crowd, who predominantly don black and yellow, indicating support for the opposing team. The focus shifts to a low-angle shot capturing the white jersey player diving near the goal line to keep the ball in play. 
Another player joins him in the celebration as they continue to revel in their success.\n\nDuring this moment, the audio declares: \"After reviewing the play, the ruling field stands as called. First and ten at the one-yard line. With the trapeze act here. You see, Jones already hits it, then Burkhead, but, I mean, those are such well-coached....\" The lively background of crowd cheers and a man's voice on a microphone enriches the unfolding events.\n\nIn the final clip, two players from the white jersey team energetically celebrate their successful play. The player wearing jersey number 14 spreads his arms wide, whereas number 10 holds onto him in pure joy. Both sport silver helmets with a blue logo. The focus then shifts to number 10, who leaps into the air in a jubilant celebration, with a referee (number 33) visible in the background. Number 10 lands, clutching the football and continuing his exuberant celebration.\n\nThe corresponding audio captures the moment: \"Look at this. Exactly. Oh, I mean, wow. That's when you know the head coach makes it a point of emphasis to be great on special teams.\" The cheering crowd, paired with the enthusiastic commentary from a man speaking over the microphone, provides a fitting end to the triumphant sequence.\n\nThroughout the video, the energy and emotion of the game are palpable, as the harmonious blend of visuals and audio brings the thrilling football match to life."
+ }
+ ],
+ "events": [
+ "The video starts with an intense football game scene.",
+ "A team in white jerseys lines up at the line of scrimmage.",
+ "Players brace themselves for the play.",
+ "Player number 6 stands behind the formation, ready to punt the ball.",
+ "The opposing team wears black and yellow jerseys and is positioned to defend.",
+ "The ball is snapped from the center.",
+ "Number 6 kicks the ball forcefully.",
+ "The ball soars down the field as the camera follows it.",
+ "Players from both teams sprint towards the bouncing and rolling ball near the end zone.",
+ "The audio reveals a man\u2019s voice: 'Bounces near the 10. Chase to the end zone.'",
+ "The background is filled with the roar of the crowd, distant shouts, and announcements over a loudspeaker.",
+ "The football bounces precariously near the end zone.",
+ "The word 'END' is prominently displayed on the field.",
+ "A player in white and blue dashes towards the ball and dives to keep it from crossing the goal line.",
+ "The player successfully halts the ball, preventing a touchback.",
+ "The white jersey team elates and celebrates, surrounding the player in a huddle.",
+ "Another player from the black and yellow team attempts to reach the ball but falls short.",
+ "The audio continues: 'Oh, two diving attempts, and they never touch the end line. Beautiful play by that special team's unit.'",
+ "Crowd\u2019s cheers blend with commentary from a man speaking into a microphone.",
+ "The scene transitions to a close-up of the referee in black and white striped uniform with a white cap bearing the NFL logo.",
+ "The referee uses hand signals to explain the play to the crowd, who predominantly don black and yellow.",
+ "The focus shifts to a low-angle shot of the white jersey player diving near the goal line.",
+ "Another player joins the celebration as they revel in their success.",
+ "Audio declares: 'After reviewing the play, the ruling field stands as called. First and ten at the one-yard line. With the trapeze act here. You see, Jones already hits it, then Burkhead, but, I mean, those are such well-coached....'",
+ "The background is lively with crowd cheers and commentary over a microphone.",
+ "Two players from the white jersey team energetically celebrate their successful play.",
+ "Player number 14 spreads his arms wide.",
+ "Number 10 holds onto number 14 in joy.",
+ "Both players wear silver helmets with a blue logo.",
+ "Focus shifts to number 10 leaping into the air in celebration with the referee (number 33) visible in the background.",
+ "Number 10 lands, clutching the football and continues celebrating.",
+ "Audio captures: 'Look at this. Exactly. Oh, I mean, wow. That's when you know the head coach makes it a point of emphasis to be great on special teams.'",
+ "Cheering crowd and enthusiastic commentary provide a fitting end to the sequence.",
+ "The energy and emotion of the game are palpable throughout the video.",
+ "The harmonious blend of visuals and audio brings the thrilling football match to life."
+ ]
+ },
+ {
+ "video": "video/221.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "A man in a white bathrobe, white tank top, and dark pants descends a narrow staircase with beige walls and a wooden handrail. As he reaches the bottom, the scene transitions to a well-lit kitchen area where natural light pours through the windows. A woman in a light blue sweater and white pants interacts with a young woman in a blue jacket seated at the counter adorned with various kitchen items like a coffee pot and a cup. The man from the staircase enters, and the woman turns to face him. He gestures toward the young woman, indicating a conversation or confrontation. The audio reveals someone saying, \"What's the matter? What happened? Your nephew ...\" This suggests an escalating situation.\n\nThe man in the white bathrobe stands closer to the young woman; his right hand touches the young woman's face as if examining a wound. The young woman holds a cigarette in her left hand. The kitchen retains the same warm lighting, hinting at an ongoing serious discussion. The audio continues with, \"Christopher did this? He hit you? What for? He was high. I don't pray,\" highlighting a tense conversation between the man and the young woman.\n\nUrgency and frustration are palpable as the man in the white bathrobe moves quickly past the camera. The focus shifts to the young woman at the counter, the beige wall, staircase, and lamp providing a familiar backdrop. The woman, now standing next to the young woman, appears to be mediating. The man re-enters the frame, picks up a phone, and starts speaking into it, increasing the tension as the woman and young woman look concerned. The audio captures, \"And I'm gonna wring his neck, where is he, he's been using drugs, heroin, yeah, who's your,\" interspersed with sounds of speaking, phone ringing, and more dialogue.\n\nThe scene remains in the kitchen as the man on the phone continues his serious conversation, maintaining the same warm lighting. 
The focus shifts to the concerned young woman and woman, the latter appearing to comfort the former who holds a cigarette. The man leans forward, raising the stakes of the situation. The audio captures, \"You called me, marico. Where's Christopher, asshole? Fuck your mother. Oh, his car was stolen by some Puerto Ricans. I took his cell phone, too,\" blending the gravity of the man's statements with the woman's responses.\n\nThe focus shifts back to the man in the white bathrobe who continues to speak, perhaps continuing the phone call. The young woman, now standing and distressed, and the woman with her comforting hand on her shoulder, dominate the frame. The man remains serious, and the scene brims with tension. The audio intensifies as a woman declares, \"I don't know what to do. I already packed a suitcase for my aunt's in Daytona,\" and \"Take it easy. Don't do something you're gonna regret,\" followed by the man's grave rejoinder and the woman's calming words.\n\nCentering on the distressed young woman and comforting woman, the atmosphere is thick with concern. The man in the white bathrobe now faces the camera. The surrounding tension is palpable, underscored by the audio: \"You can hire someone to do an intervention, apparently. Girls, come on. We're a family here,\" a final plea to maintain unity in a fraught family situation.\n\nThroughout the video, the consistent kitchen backdrop, warm lighting, and tense atmosphere persist, reflecting a scenario fraught with familial conflict and concern. The audio complements the visuals by painting a detailed picture of a family crisis revolving around a young woman in distress and the concerned relatives attempting to manage the escalating situation."
+ }
+ ],
+ "events": [
+ "A man in a white bathrobe, white tank top, and dark pants descends a narrow staircase with beige walls and a wooden handrail.",
+ "The scene transitions to a well-lit kitchen area where natural light pours through the windows.",
+ "A woman in a light blue sweater and white pants interacts with a young woman in a blue jacket seated at the counter adorned with various kitchen items like a coffee pot and a cup.",
+ "The man from the staircase enters, and the man turns to face him.",
+ "He gestures toward the young woman, indicating a conversation or confrontation.",
+ "Audio reveals someone saying, 'What's the matter? What happened? Your nephew ...'",
+ "The man in the white bathrobe stands closer to the young woman.",
+ "The man's right hand touches the young woman's face as if examining a wound.",
+ "The young man holds a cigarette in his left hand.",
+ "The kitchen retains the same warm lighting, hinting at an ongoing serious discussion.",
+ "The audio continues with, 'Christopher did this? He hit you? What for? He was high. I don't pray,' highlighting a tense conversation between the man and the young woman.",
+ "The man in the white bathrobe moves quickly past the camera, indicating urgency and frustration.",
+ "The focus shifts to the young woman at the counter, the beige wall, staircase, and lamp providing a familiar backdrop.",
+ "The woman, now standing next to the young woman, appears to be mediating.",
+ "The man re-enters the frame, picks up a phone, and starts speaking into it, increasing the tension.",
+ "The woman and young man look concerned.",
+ "Audio captures, 'And I gonna ring his neck where is he he's been using drugs heroin yeah who's your,' interspersed with sounds of speaking, phone ringing, and more dialogue.",
+ "The scene remains in the kitchen as the man on the phone continues his serious conversation.",
+ "Focus shifts to the concerned young woman and woman, the latter appearing to comfort the former who holds a cigarette.",
+ "The man leans forward, raising the stakes of the situation.",
+ "Audio captures, 'You called me, marico. Where's Christopher, asshole? Fuck your mother. Oh, his car was stolen by some Puerto Ricans. I took his cell phone too,' blending the gravity of the man's statements with the woman's responses.",
+ "Focus shifts back to the man in the white bathrobe who continues to speak, perhaps continuing the phone call.",
+ "The young woman, now standing and distressed, and the woman with her comforting hand on her shoulder dominate the frame.",
+ "The man remains serious, and the scene brims with tension.",
+ "Audio intensifies as a woman declares, 'I don't know what to do. I already packed a suitcase for my aunt's in Daytona.'",
+ "'Take it easy. Don't do something you're gonna regret,' followed by the man's grave rejoinder and the woman's calming words.",
+ "Centering on the distressed young woman and comforting woman, the atmosphere is thick with concern.",
+ "The man in the white bathrobe now faces the camera.",
+ "The surrounding tension is palpable.",
+ "Audio underscores the situation: 'You can hire someone to do an intervention, apparently. Girls, come on. We're a family here,'",
+ "A final plea to maintain unity in a fraught family situation.",
+ "Consistent kitchen backdrop, warm lighting, and tense atmosphere reflect a scenario fraught with familial conflict and concern.",
+ "Audio complements visuals by painting a detailed picture of a family crisis.",
+ "Focus on young woman in distress and concerned relatives attempting to manage the escalating situation."
+ ]
+ },
+ {
+ "video": "video/222.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a serene scene featuring a sign that reads \"BAME FARM\" in red letters on a brown background, set against a green bush with a yellow top under a blue sky dotted with a few clouds. Emerging from behind the bush is a character with large, round eyes, wearing a green hat, red shirt, and blue pants. This character approaches the camera, shifting from a wide smile to a more neutral expression. Audio enhances this scene with a chicken crowing followed by a man's voice introducing himself: \"From the earth I rise, and to the earth I one day will return. What's up, mother shuckers?\" accompanied by some background music.\n\nTransitioning to the next scene, the same character, still in the same attire, appears close up, pointing upwards with a neutral facial expression. The blue sky and a light brown structure form the background, while the text \"BAMBI BAMBI BAMBI\" flashes in bold, white letters at the top one by one from the left to the right. The character's expression turns serious as they continue pointing upwards, and the scene shifts to show the character standing before a large ear of corn, surrounded by multiple images of corn, emphasized by a plain white background. The audio here introduces \"My name is Colonel Cornelius Cornwall. And I'm here today to introduce you to Corn.\" layered over cheerful crowd sounds and a popping noise.\n\nThe character's expression turns serious as they shake from left to right while various images of corn appear. This consistency helps emphasize the new text \"MAZE,\" which appears in bold, white letters. Next, the character stands amidst what looks like tall green grass, giving a thumbs-up with a number from 15 to 17 displayed in red at the bottom. The audio continues to describe corn: \"Corn is a North American cereal plant that yields large kernels on a cob. 
It is also called maize because it's easy to get lost in its incredible flavor,\" set to a backdrop of music.\n\nThe scene soon transitions to a closeup where the character holds a piece of paper with a serious look. Another character in a blue outfit is seen sitting with a small object in hand, against a background showing a green field and a gray structure marked with \"GATE.\" The second character appears hurried and worried, while the first character remains focused on their paper. The scene ends with the character under a blue sky holding a small piece of corn, standing by a large green structure with yellow tops and a sign reading \"CRON\" in red on a brown background. The main character stands beside a second character with a yellow face, black eyes, wearing similar red and blue attire. Suddenly, as the character sings, a face with red eyes and a red hat appears with a horrifying sound. The audio humorously warns: \"Side effects of corn may include choking, I guess. Come down today and try some corn or we will sacrifice your newborn,\" spiced up by background music."
+ }
+ ],
+ "events": [
+ "Sign reading 'BAME FARM' in red letters on a brown background",
+ "Background features a green bush with a yellow top under a blue sky with a few clouds",
+ "Character with large, round eyes emerges from behind the bush",
+ "Character is wearing a green hat, red shirt, and blue pants",
+ "Character approaches the camera with a wide smile then shifts to a neutral expression",
+ "Audio with a chicken crowing",
+ "Man's voice introduces himself with the phrase: 'From the earth I rise, and to the earth I one day will return. What's up, mother shuckers?'",
+ "Background music accompanies the introduction",
+ "Character appears close up, pointing upwards with a neutral facial expression",
+ "Background features a blue sky and a light brown structure",
+ "Text 'BAMBI BAMBI BAMBI' flashes in bold, white letters at the top one by one from the left to the right",
+ "Character's expression turns serious as they continue pointing upwards",
+ "Scene shifts to show the character standing before a large ear of corn surrounded by multiple images of corn on a plain white background",
+ "Audio introduces: 'My name is Colonel Cornelius Cornwall. And I'm here today to introduce you to Corn.'",
+ "Cheerful crowd sounds and a popping noise in the background",
+ "Character's expression shifts to serious as they shake from left to right",
+ "Various images of corn show up",
+ "Text 'MAZE' appears in bold, white letters",
+ "Character stands amidst tall green grass giving a thumbs-up",
+ "A number between 15 to 17 displayed in red at the bottom",
+ "Audio describes corn: 'Corn is a North American cereal plant that yields large kernels on a cob. It is also called maize because it's easy to get lost in its incredible flavor,'",
+ "Background music continues",
+ "Scene transitions to a closeup where the character holds a piece of paper with a serious look",
+ "Another character in a blue outfit is seated, holding a small object",
+ "Background shows a green field and a gray structure marked with 'GATE'",
+ "Second character appears hurried and worried",
+ "First character remains focused on their paper",
+ "Character under a blue sky holding a small piece of corn",
+ "Stands by a large green structure with yellow tops and a sign reading 'CRON' in red on a brown background",
+ "Main character stands beside a second character with a yellow face and black eyes, wearing similar red and blue attire",
+ "As the character sings, a face with red eyes and a red hat appears with a horrifying sound",
+ "Audio humorously warns: 'Side effects of corn may include choking, I guess. Come down today and try some corn or we will sacrifice your newborn,'",
+ "Background music spices up the humorous warning"
+ ]
+ },
+ {
+ "video": "video/223.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a view from inside a car looking out at a sidewalk scattered with discarded items, including a gray couch, a pink object, and a metal rack. The scene shifts to a closer view, focusing on the abandoned couch and the surrounding debris. A man in a black 'Aero 1987' sweatshirt, gray pants, and colorful sneakers picks up a pair of gray and white sneakers from the ground, closely examining them. The sneakers appear worn and dirty with scuffed and cracked surfaces. As this unfolds, the audio features a man speaking while driving, stating that he had to turn back to verify what he saw. He expresses disbelief, noting that the shoes were dirty and had sole separation.\n\nNext, we see a close-up of a pair of worn gray sneakers with red accents on a wooden stool. Various angles highlight the scuffed and cracked surfaces and the red Jumpman logo on the tongues. The scene transitions to a detailed inspection of the leather texture and worn-out soles. The man, now in a dark shirt and apron, holds one sneaker, showing the detached sole and the interior. Then he presents a bottle of 'EB Kicks Shoe Cleaner & Conditioner,' indicating the intent to clean and restore the sneakers. The man inserts a black shoe holder into the shoe for support. The audio features the man proclaiming his scavenger nature and his ability to make shoes look spectacular as he prepares to clean the shoes. The sound of shoe cleaning ensues.\n\nThe man, wearing a dark shirt and apron with a distinctive design, sits at a wooden stool, with the worn gray sneakers with red accents from the previous clips resting on it. He pours a cleaning solution from a labeled bottle 'EB Kicks' onto a black sponge. The scene shows vigorous scrubbing on different parts, including the side, front, inside, and sole, generating substantial foam. Close-ups reveal the man cleaning the sneaker's tongue, featuring the red Jumpman logo, now covered in foam. 
The audio captures the man enthusiastically describing his process, noting that when he sees foam, he is in his zone.\n\nFollowing this, there is a close-up of the man scrubbing the gray sneaker with red accents, creating a substantial amount of foam with a black sponge. The scene shifts to the man cleaning the sole, featuring the 'JORDAN' logo in red, using a brush to generate more foam. Focus is given to cleaning the interior of the sneaker, particularly the insole and heel area, using a small tool to scrape off dirt and debris. The meticulous cleaning process is highlighted, suggesting a thorough restoration. In the audio, the man discusses cleaning the insoles and outsoles and the need to remove old glue residue for the new glue to adhere properly, accompanied by scraping sounds.\n\nThe restoration continues with a close-up of the man scraping off old adhesive from the sole of the gray sneaker with red accents. He applies new adhesive using a small bottle with a nozzle, carefully aligning the sole with the upper part and pressing them together for a secure bond. The man shakes a white container, possibly containing pigment, and applies a white substance to the midsole, meticulously focusing on the detailed areas around the 'AIR' logo on the heel. The audio features the man explaining the need for better adhesion, clamping the shoe, and beginning to paint, acknowledging that he's merely touching it up.\n\nIn the final scene, the man meticulously paints the 'AIR' logo on the heel using a small brush and then applies a black substance to the toe area, evenly spreading it. He proceeds to lace up the sneaker, threading the laces properly through the eyelets. The restored sneaker and its worn counterpart are placed on the wooden stool, which the man spins to show them off. The comparison highlights the significant transformation, showcasing the clean and polished appearance of one against the dirty state of the other. 
The audio concludes with the man expressing satisfaction with the restoration, feeling he saved himself $150 and proudly stating, \"Let's make sneakers great again.\"\n\nThroughout all the scenes, both the visual and audio elements emphasize the restoration process of the worn sneakers, showcasing a journey from discarded and damaged to carefully repaired and rejuvenated."
+ }
+ ],
+ "events": [
+ "Camera view from inside a car looking out at a sidewalk with discarded items including a gray couch, pink object, and metal rack.",
+ "Scene shifts to a closer view of the abandoned couch and surrounding debris.",
+ "A man in a black 'Aero 1987' sweatshirt, gray pants, and colorful sneakers picks up gray and white sneakers from the ground and examines them.",
+ "Close-up of the worn and dirty sneakers with scuffed and cracked surfaces.",
+ "Audio of a man driving, expressing disbelief about the dirty shoes with sole separation.",
+ "Close-up of worn gray sneakers with red accents on a wooden stool, highlighting scuffed surfaces and red Jumpman logo.",
+ "Detailed inspection of the leather texture and worn-out soles.",
+ "Man in a dark shirt and apron holds one sneaker, showing the detached sole and interior.",
+ "He presents a bottle of 'EB Kicks Shoe Cleaner & Conditioner' for cleaning and restoration.",
+ "The man inserts a black shoe holder into the shoe for support.",
+ "Audio of the man proclaiming his scavenger nature and ability to make shoes look spectacular.",
+ "Sound of shoe cleaning begins.",
+ "Scene of the man at a wooden stool with the worn gray sneakers.",
+ "He pours cleaning solution from the 'EB Kicks' bottle onto a black sponge.",
+ "Vigorous scrubbing on different parts of the sneaker generating foam.",
+ "Close-ups of the man cleaning the sneaker's tongue with foam covering the red Jumpman logo.",
+ "Audio of the man describing his cleaning process and feeling in the zone when seeing foam.",
+ "Close-up of the man scrubbing the gray sneaker, creating more foam with a black sponge.",
+ "Scene shifts to cleaning the sole featuring the 'JORDAN' logo in red.",
+ "Focus on cleaning the interior, especially the insole and heel area.",
+ "Man uses a small tool to scrape off dirt and debris.",
+ "Audio of the man discussing cleaning insoles/outsoles and removing old glue residue.",
+ "Sounds of scraping in the background.",
+ "Close-up of the man scraping off old adhesive from the sole.",
+ "Man applies new adhesive from a small bottle and aligns the sole with the upper.",
+ "White container (possibly with pigment) is shaken, and a white substance is applied to the midsole.",
+ "Focus on detailed areas around the 'AIR' logo on the heel.",
+ "Audio of the man explaining the need for better adhesion and clamping the shoe.",
+ "Man begins to paint, mentioning he's merely touching it up.",
+ "Final scene of the man painting the 'AIR' logo on the heel with a small brush.",
+ "He applies a black substance to the toe area and spreads it evenly.",
+ "Man laces up the sneaker properly through the eyelets.",
+ "Restored sneaker placed next to its worn counterpart on the wooden stool.",
+ "Comparison highlights the transformed, clean appearance of one against the dirty state of the other.",
+ "Audio concludes with the man expressing satisfaction with the restoration.",
+ "Man states he saved himself $150 and says, 'Let's make sneakers great again.'"
+ ]
+ },
+ {
+ "video": "video/224.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nDescribe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video opens with a close-up of a hand showcasing long, colorful acrylic nails designed with a gradient of pink, orange, and purple. The text in the bottom-left corner reads 'TikTok @anacrylics'. The text 'Let's take my nails off' appears at the bottom as a female voice says the same. The hand tries to remove an acrylic nail while the text at the bottom reads, 'don't rip them out this time ugh', accompanied by a female voice saying the same. The clip shows a nail clipper being used to trim the nails, accompanied by the text 'I guess we'll do a soak off'. The female voice then says, \"Okay, I guess we'll do a soak off.\"\n\nThe text at the bottom reads 'The first thing you're gonna wanna do is clip down your nails', as a woman says the same. The long nails are shown being clipped, with bits falling off. Then the text and the woman's voice continue, \"Once you're done that, you're going to go ahead and grab your drill and just drill it down.\" The clip shows a close-up of a nail drill bit, while background music plays.\n\nIn the third part, the nail drill is used on a pink acrylic nail, creating dust. The text 'without burning yourself without burning yourself.' appears with the audio. Following this, the view shows the nails thinned down, with the text 'once they're thin and they look like this' aligning with the audio. The next clip focuses on two other gradient nails being filed down, and then the woman removes the dust from them, and the text 'Now you're ready to soak them off' appears in sync with the audio. The clip then shows a container being set up for the soak-off, while upbeat background music plays.\n\nMoving into the fourth clip, a container is filled with hot water. The text at the bottom reads 'I like using hot water and 100% pure acetone', as the woman states the same. Then she covers the container with a lid with pink edges. 
The view shifts to a close-up of an acetone bottle, clearly labeled 'ACETONE', before showing it being poured into a smaller container within the larger one. The text and the audio accompany: 'The hot water activates the acetone a little bit more so the soak off doesn't take forever.' The clip concludes with nails partially soaked and softened, and the text 'and after that, your nail should be looking like this' appears as the woman says the same, while music plays.\n\nIn the fifth segment, a metal tool gently scrapes off the softened acrylic. This is followed by another close-up of the tool scraping off more acrylic, with the text and audio stating, 'It should be flaky and not gooey. Now you're gonna go ahead and lightly file the top of your nail and buff it out', accompanied by background music.\n\nIn the final clip, the nail drill is followed by a purple buffer that further smooths the nail surface. The video transitions to a close-up of cuticle oil being applied to the nails. The concluding shots show the nails looking clean and moisturized, with the text 'Go ahead and add some cuticle oil and then you're done' appearing as the music fades out, completing the nail removal and care process."
+ }
+ ],
+ "events": [
+ "Close-up of a hand showcasing long, colorful acrylic nails",
+ "Text 'TikTok @anacrylics' in the bottom-left corner",
+ "Text 'Let's take my nails off' appears",
+ "Female voice says 'Let's take my nails off'",
+ "Hand tries to remove an acrylic nail",
+ "Text at the bottom reads 'don't rip them out this time ugh'",
+ "Female voice says 'don't rip them out this time ugh'",
+ "Clip shows a nail clipper being used to trim the nails",
+ "Text 'I guess we'll do a soak off' appears",
+ "Female voice says 'Okay, I guess we'll do a soak off'",
+ "Text reads 'The first thing you're gonna wanna do is clip down your nails'",
+ "Woman's voice explains the same",
+ "Long nails are shown being clipped with bits falling off",
+ "Text and voice continued, 'Once you're done that, you're going to go ahead and grab your drill and just drill it down'",
+ "Clip shows a close-up of a nail drill bit",
+ "Background music plays",
+ "Nail drill is used on a pink acrylic nail creating dust",
+ "Text 'without burning yourself without burning yourself.' appears",
+ "Audio saying the same",
+ "View shows the nails thinned down",
+ "Text 'once they're thin and they look like this' appears",
+ "Audio matches the text",
+ "Clip focuses on two other gradient nails being filed down",
+ "Woman removes the dust from the nails",
+ "Text and audio 'Now you're ready to soak them off' appear",
+ "Clip shows a container being set up for the soak-off",
+ "Upbeat background music plays",
+ "Container is filled with hot water",
+ "Text 'I like using hot water and 100% pure acetone' at the bottom",
+ "Woman\u2019s voice states the same",
+ "Container with a lid with pink edges is covered",
+ "View shifts to a close-up of an acetone bottle labeled \u2018ACETONE\u2019",
+ "Acetone is poured into a smaller container within the larger one",
+ "Text and audio 'The hot water activates the acetone a little bit more so the soak off doesn't take forever.' accompany",
+ "Clip shows nails partially soaked and softened",
+ "Text 'and after that, your nail should be looking like this' appears",
+ "Woman\u2019s voice says the same",
+ "Music plays",
+ "Metal tool gently scrapes off the softened acrylic",
+ "Close-up of the tool scraping off more acrylic",
+ "Text and audio 'It should be flaky and not gooey. Now you're gonna go ahead and lightly file the top of your nail and buff it out' appear",
+ "Background music continues",
+ "Nail drill followed by a purple buffer smoothing the nail surface",
+ "Close-up of cuticle oil being applied to the nails",
+ "Concluding shots show the nails looking clean and moisturized",
+ "Text 'Go ahead and add some cuticle oil and then you're done' appears",
+ "Music fades out completing the nail removal and care process"
+ ]
+ },
+ {
+ "video": "video/225.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nPlease describe all the information in the video without sparing any detail. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a close-up view of a sleek, gray helicopter in flight, showcasing its streamlined design and modern features. Text appears on the screen saying, \"United States Insane New Stealth Helicopter,\" underlining the advanced nature of the aircraft. The helicopter flies over a forested area, demonstrating its maneuverability and stealth capabilities. As the scene transitions, the helicopter is shown on the ground with the text \"Bell 360 Invictus\" identifying the model. A night-time scene reveals the helicopter in flight, highlighting its operational capabilities in low-light conditions. \n\nIn the audio, a man is saying, \"This is the United States insane new super advanced stealth helicopter, the Bell 360 Invictus. The Bell 360 is a proposed helicopter design intended to meet the United States Army requirement for a future attack reconnaissance helicopter.\" accompanied by background music.\n\nNext, the video captures a chaotic scene inside a helicopter where military personnel in camouflage uniforms and helmets are readying for a jump. The camera pans to an exterior view, showing soldiers jumping out with parachutes and descending over a vast forested area. Multiple soldiers parachute down in formation before the focus shifts back to a sleek, gray helicopter in flight. The camera transitions to a man in a white protective suit cleaning a large, light green panel inside a facility, emphasizing the precision involved. \n\nThe video then shows a green radar screen with a sweeping motion before transitioning to a close-up of a helicopter\u2019s rotor blades, highlighted by a red arrow. The next sequence depicts a black helicopter taking off from a grassy field, flying low over it with mountains and buildings in the background. 
The camera transitions to a close-up view of a tan helicopter in flight that shows off its design, then to another gray helicopter in flight, and the final scene emphasizes the power of the aircraft with a detailed view of the engine bearing the text \"General Electric T901\".\n\nThe audio continues with, \"A distinguishing feature of the chopper is its advanced radar-absorbing materials that will greatly reduce its radar signature and its stub wings that are intended to provide lift equivalent to approximately 50% of the weight of the aircraft. The aircraft will be powered by a General Electric T901, one turboshaft engine.\" with background music playing.\n\nThe video then moves to another detailed engine view, elaborately showcasing its design. A black helicopter flies over a snowy landscape and forest, demonstrating its agility before the text \"300 km/h\" appears, indicating the speed. The video continues with the helicopter flying at high speed, demonstrating its advanced performance.\n\nNext, the video transitions to a helicopter flying at high speed. A side view of the helicopter reveals a vast landscape below with forests and water bodies and text stating \"Over 4,500 Meters.\" The video then moves to scenes of the helicopter on the ground, shown from various angles in a well-lit area, followed by a view of another black helicopter, emphasizing precise construction and maintenance, indicated by the visible 'Bell' logo. Then the video shows a dark green helicopter marked with the 'Bell' logo, equipped with missiles.\n\nThe audio continues with, \"The engine is capable of producing three thousand horsepower, which will enable the aircraft to fly at a maximum speed of over 300 kilometers per hour, and an altitude of over fifteen thousand feet.\" against the sound of a helicopter in the background. 
\n\nThe video then shows a transition to a green radar screen that highlights its functionality before a massive desert explosion erupts, sending debris and smoke skywards. A rocket launch follows, leaving a smoke trail into the sky, with text \"Subscribe\" and a bell icon suggesting viewers subscribe to a channel. The video ends with a missile exploding in the sky, maintaining the \"Subscribe\" text and bell icon.\n\nThe man's voice concludes with, \"The Bell 360 will also incorporate an active horizontal stabilizer, a tandem cockpit, and will carry its weapons internally to further minimize radar signature. Speaking of weapons, it will also be armed with advanced guided rockets and missiles and a 20-millimeter extremely high-rate gun turret system,\" alongside background music, creating a tense atmosphere."
+ }
+ ],
+ "events": [
+ "Close-up view of a sleek, gray helicopter in flight.",
+ "Text on screen: 'United States Insane New Stealth Helicopter'.",
+ "Helicopter flies over a forested area.",
+ "Text on screen: 'Bell 360 Invictus'.",
+ "Helicopter shown on the ground with 'Bell 360 Invictus' text.",
+ "Night-time scene of the helicopter in flight.",
+ "Audio description: 'This is the United States insane new super advanced stealth helicopter, the Bell 360 Invictus.'",
+ "Audio description: 'The Bell 360 is a proposed helicopter design intended to meet the United States Army requirement for a future attack reconnaissance helicopter.'",
+ "Background music accompanying the audio.",
+ "Chaotic scene inside a helicopter with military personnel in camouflage uniforms and helmets.",
+ "Camera pans to an exterior view showing soldiers jumping out with parachutes.",
+ "Soldiers descending over a vast forested area.",
+ "Multiple soldiers parachuting down in formation.",
+ "Focus shifts back to a sleek, gray helicopter in flight.",
+ "Man in a white protective suit cleaning a large, light green panel inside a facility.",
+ "Green radar screen with a sweeping motion.",
+ "Close-up of helicopter\u2019s rotor blades highlighted by a red arrow.",
+ "Black helicopter taking off from a grassy field, flying low over it with mountains and buildings in the background.",
+ "Close-up view of a tan helicopter in flight, showcasing its design.",
+ "Gray helicopter in flight.",
+ "Detailed view of helicopter engine with text 'General Electric T901'.",
+ "Audio description: 'A distinguishing feature of the chopper is its advanced radar-absorbing materials that will greatly reduce its radar signature and its stub wings that are intended to provide lift equivalent to approximately 50% of the weight of the aircraft.'",
+ "Audio description: 'The aircraft will be powered by a General Electric T901, one turboshaft engine.'",
+ "Background music accompanying the audio.",
+ "Detailed engine view showcasing its design.",
+ "Black helicopter flies over a snowy landscape and forest.",
+ "Text '300 km/h' appears on screen.",
+ "Helicopter flying at high speed demonstrating advanced performance.",
+ "Helicopter flying at high speed.",
+ "Side view of the helicopter revealing a vast landscape below with forests and water bodies.",
+ "Text 'Over 4,500 Meters' appears on screen.",
+ "Scenes of the helicopter on the ground shown from various angles in a well-lit area.",
+ "View of another black helicopter emphasizing precise construction and maintenance with visible 'Bell' logo.",
+ "Dark green helicopter marked with 'Bell' logo equipped with missiles.",
+ "Audio description: 'The engine is capable of producing three thousand horsepower, which will enable the aircraft to fly at a maximum speed of over 300 kilometers per hour, and an altitude of over fifteen thousand feet.'",
+ "Sound of a helicopter in the background.",
+ "Transition to a green radar screen highlighting its functionality.",
+ "Massive desert explosion sending debris and smoke skywards.",
+ "Rocket launch leaving a smoke trail into the sky.",
+ "Text 'Subscribe' and bell icon suggesting viewers subscribe to a channel.",
+ "Missile explodes in the sky while the 'Subscribe' text and bell icon remain on screen.",
+ "Audio description: 'The Bell 360 will also incorporate an active horizontal stabilizer, a tandem cockpit, and will carry its weapons internally to further minimize radar signature.'",
+ "Audio description: 'Speaking of weapons, it will also be armed with advanced guided rockets and missiles and a 20-millimeter extremely high-rate gun turret system.'",
+ "Background music accompanying the audio, creating a tense atmosphere."
+ ]
+ },
+ {
+ "video": "video/226.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video begins with a woman standing in a room with light-colored walls, beside a curtain featuring circular patterns on the left side. She is dressed in a white outfit and has a pink shawl draped over her shoulders. At the bottom of the screen, the text \"Teachers X Discipline Part 1\" is displayed in red and black boxes. The woman gestures energetically, pointing upwards, touching her hair, and pointing towards the camera, indicating that she is actively explaining something. The audio reveals an agitated tone, with a woman exclaiming: \"`This is not a fashion show. This is school. What is all this? Put hairpins and come. If your hair is below this, you tie ponies and come. This fashion and all you do in college then you go.`\"\n\nThe scene continues with the woman maintaining the same location and appearance as before. She continues to make dynamic hand gestures such as crossing her arms and adjusting her shawl, emphasizing her engagement in the explanation. The audio continues in a high-pitched, strident tone: \"I want to see ponies tomorrow. Whole put ribbon in hair. How many times should I tell you? Put oil in hairs. Nice hairstyle you have done for your hair.\"\n\nThe woman remains in focus, continuing her expressive hand movements and body language. The narrative builds as she also adjusts her hair and makes a counting gesture on her fingers, further illustrating her involvement. The audio follows with a strict voice: \"Come to a party or what? You can't do chotis from here. Remove that. And long hair, choti double. Take it up. Don't worry. Two more standards. 9th, 10th. Then you go in college and do whatever you want to do.\"\n\nThe room and the woman's appearance are consistent as she repeats similar gestures with fervent body language. The text \"Teachers X Discipline Part 1\" persists at the screen's bottom. The soundscape shifts, with the woman continuing her tirade. 
She barks: \"But till the time you are in school, you will have to follow rules. Look at these girls behaving like college girls nowadays.\"\n\nPersisting in the same setting and attire, the woman's gestures remain expressive. She raises her hand, pointing and explaining, signifying thorough engagement. The combative tone in the audio persists as the woman scolds: \"`Shameless Only becoming every day. All this layer cut and step cut and all these types of cut now you do when you go to college. oh so your hair is natural brown huh.`\"\n\nThe description concludes in the same room; the woman's white outfit and pink shawl are unchanged as she makes dynamic gestures such as raising two fingers and crossing her arms. Continuing with her fervent explanation, the audio culminates in a rapid, scolding address: \"`Suddenly today it has turned brown. Yesterday it was black only. Yesterday it was not brown. You color your hair, no? You call mother. I want to talk to her.`\"\n\nThroughout the video, the recurring theme highlights the teacher's exasperation and attempts to enforce strict discipline, especially about student appearances, against a backdrop of contrastingly casual or rebellious student behavior."
+ }
+ ],
+ "events": [
+ "A woman stands in a room with light-colored walls.",
+ "There is a curtain with circular patterns on the left side.",
+ "The woman is dressed in a white outfit and has a pink shawl draped over her shoulders.",
+ "Text 'Teachers X Discipline Part 1' appears at the bottom of the screen in red and black boxes.",
+ "The woman gestures energetically, pointing upwards, touching her hair, and pointing towards the camera.",
+ "The audio reveals an agitated tone with a woman exclaiming about school rules and hair fashion.",
+ "The woman continues making dynamic hand gestures like crossing her arms and adjusting her shawl.",
+ "The audio continues in a high-pitched, strident tone, insisting on proper hairstyles and use of hairpins.",
+ "The woman adjusts her hair and makes a counting gesture on her fingers.",
+ "The audio gives strict instructions about hairstyles and mocks inappropriate school behavior.",
+ "The text 'Teachers X Discipline Part 1' persists at the screen's bottom.",
+ "The audio shifts to a woman barking about following school rules and criticizing 'college girls' behavior.",
+ "The woman raises her hand, pointing and scolding about layers and haircuts.",
+ "The woman expresses frustration about students coloring their hair and insists on calling a student's mother.",
+ "The video consistently showcases the teacher's exasperation and attempts to enforce strict discipline on student appearances.",
+ "Throughout, the teacher scolds and expresses anger over students' casual or rebellious behavior."
+ ]
+ },
+ {
+ "video": "video/227.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nGive a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
+ },
+ {
+ "from": "gpt",
+ "value": "The video showcases a stand-up comedy performance at a venue called 'Comedy Corner.' The male comedian is wearing a blue shirt and black pants, energetically holding a microphone on stage with a dark background illuminated by stage lights. Throughout the performance, subtitles in yellow and white text transcribe his routine.\n\nIn the video, the comedian dramatically conveys a situation involving his wife. The focus is tightly on the comedian as he humorously describes his wife as extremely angry, with 'steam coming out of her ears.' The crowd's laughter crescendoes as he explains that he made a judgment call to 'shut it off,' only to have his wife question his decision: 'Why did you shut that off? Why did you shut it off?'. The comedian humorously quotes himself explaining to his wife, 'Because you're about 3 minutes away from yelling at me like I produced this music'. The audience laughs; the comedian pauses, looks at the camera, and shows a look of exasperation. He continues, 'Let's just forget it. We'll watch something else', with the audience's laughter and cheers now audible, though they are not visible in this segment.\n\nHis narrative advances as he sets the stage for his wife's persistence: 'No, why can't we just watch the rest of this and then discuss it afterwards?' After saying this, he throws his head back. His frustration is animated and clear. 'It's like, I don't want to f*cking do that.' The crowd's laughter punctuates the performance, although the scene remains focused exclusively on the comedian, who also starts laughing so hard that he bends over.\n\nNext, the comedian stands up again and repeats: 'I just wanna watch a documentary! I don't want to have a f*cking lecture because of these f*cking a*sholes.' He then humorously laments, 'I didn't do anything! I'm just sitting here watching TV!' The palpable tension builds up to his humorous realization: 'Not to mention, we're just gonna get into an argument.' 
The comedian's mimicking tone and expressions provide comic relief, and the crowd's delight is evident though unseen.\n\nThe discussion with his wife intensifies: 'Why would we get into an argument?' His tone is resigned as he continues, 'All right, you know what? Because not for nothing, there was some black people in there saying some sh*t that I didn't agree with, you know? I'm not huffing and puffing.' ...\n\nThen, he looks at the camera with a helpless laugh and admits, 'Yeah, I said that.' The audience laughs, with the scene transitioning to a wider stage view. The comedian is shown from behind, saying 'Like a f*cking idiot' as he continues his performance. The audience, now visible in the background, is clearly amused. \u201cAnd she just looks at me. She's like,\u201d he finishes with his wife's incredulous rebuttal, and the shot cuts back to a close-up, 'Like what black people?' eliciting a final wave of laughter from the crowd.\n\nThroughout the video, the comedian's expressive storytelling and animated gestures dominate the stage, drawing continuous laughter and reactions from the audience, making for an engaging and entertaining stand-up comedy performance."
+ }
+ ],
+ "events": [
+ "Video showcases a stand-up comedy performance at 'Comedy Corner.'",
+ "Male comedian is wearing a blue shirt and black pants.",
+ "Comedian energetically holds a microphone on stage.",
+ "Stage background is dark, illuminated by stage lights.",
+ "Subtitles in yellow and white text transcribe the routine.",
+ "Video focuses tightly on the comedian as he describes a situation involving his wife.",
+ "Comedian humorously describes his wife as extremely angry, with 'steam coming out of her ears'.",
+ "Crowd's laughter crescendoes.",
+ "Comedian states he made a judgment call to 'shut it off'.",
+ "Comedian imitates wife's questioning: 'Why did you shut that off? Why did you shut it off?'.",
+ "Comedian humorously quotes himself explaining: 'Because you're about 3 minutes away from yelling at me like I produced this music'.",
+ "Audience laughs.",
+ "Comedian pauses, looks at the camera, and shows exasperation.",
+ "Comedian says, 'Let's just forget it. We'll watch something else'.",
+ "Audience's laughter and cheers are audible.",
+ "Comedian humorously describes his wife's persistence: 'No, why can't we just watch the rest of this and then discuss it afterwards?'.",
+ "Comedian throws his head back in frustration.",
+ "Comedian exclaims, 'It's like, I don't want to f*cking do that.'.",
+ "Crowd's laughter punctuates the performance.",
+ "Scene remains focused exclusively on the comedian.",
+ "Comedian starts laughing so hard that he bends over.",
+ "Comedian stands up and repeats: 'I just wanna watch a documentary! I don't want to have a f*cking lecture because of these f*cking a*sholes.'.",
+ "Comedian humorously laments, 'I didn't do anything! I'm just sitting here watching TV!'.",
+ "Tension builds up to comedian's humorous realization: 'Not to mention, we're just gonna get into an argument.'.",
+ "Comedian's mimicking tone and expressions provide comic relief.",
+ "Crowd's delight is evident though unseen.",
+ "Discussion with his wife intensifies: 'Why would we get into an argument?'.",
+ "Comedian resignedly says: 'All right, you know what? Because not for nothing, there was some black people in there saying some sh*t that I didn't agree with, you know? I'm not huffing and puffing.'.",
+ "Comedian looks at the camera with a helpless laugh and admits: 'Yeah, I said that.'.",
+ "Audience laughs.",
+ "Scene transitions to a wider stage view.",
+ "Comedian is shown from behind continuing his performance: 'Like a f*cking idiot'.",
+ "Audience, now visible in the background, is clearly amused.",
+ "Comedian finishes with his wife's incredulous rebuttal: 'Like what black people?'.",
+ "Final wave of laughter from the crowd.",
+ "Comedian's expressive storytelling and animated gestures dominate the stage.",
+ "Continuous laughter and reactions from the audience.",
+ "Stand-up comedy performance is engaging and entertaining."
+ ]
+ },
+ {
+ "video": "video/228.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "\nOffer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync."
+ },
+ {
+ "from": "gpt",
+ "value": "The video takes place in a grocery store aisle filled with various types of bread. A man wearing a distinctive red t-shirt with the text 'Bacon is my favorite wrapper' and light blue jeans guides us through his visual and verbal explanation of different breads.\n\nThe video starts with this man walking down the aisle, gesturing energetically with his hands, clapping, and pointing toward the loaves of bread. A lively cartoon image of a loaf of bread appears with a heart hovering above its head. The text \"WHOLE GRAIN GAINS\" appears on the screen. He picks up a loaf of 'Dave's Killer Bread' with a green label reading '21 Whole Grains and Seeds,' presenting it to the camera. The word 'LOOK' appears in colorful letters above his head. The accompanying audio features a man enthusiastically stating: \"What is the best bread to buy at the grocery store? It's easy. It's whole-grain bread. And a lot of people reach for this. And yes, I love the logo of Dave on his guitar. It's kick butt. Look at the ingredients, it's whole grain.\" with upbeat music playing in the background.\n\nContinuing in the same aisle, the man continues to hold the loaf of 'Dave's Killer Bread'. He gestures as an overlay of the bread's ingredients list appears on the screen, highlighting its organic components. However, the word 'BUT' appears in large, colorful letters over the list, emphasizing a downside. The nutritional information reveals 110 calories, 1.5g of total fat, 0g of cholesterol, 170mg of sodium, 22g of total carbohydrates, 5g of dietary fiber, 5g of total sugars (including 5g of added sugars), and 5g of protein per slice. As he continues to speak and gesture, a cartoon image of a sandwich appears above his head. The audio features the man pointing out: \"Fantastic. But there's sugar in this bread to the tune of five grams of added sugar per slice. Make a sandwich for your kids with two slices. 
That's two and a half teaspoons of added sugar.\"\n\nThe man then places 'Dave's Killer Bread' back on the shelf and reaches for a different loaf with a yellow label. A cartoon loaf showing the number '2' appears above his head, transitioning to the text 'Bye Felicia' in pink letters. He explains as an overlay displays the bread's ingredients, emphasizing his point by pointing toward the bread and shelves. In the audio, the man explains: \"So I would put this one back here also read the ingredients because most breads are made with enriched wheat or non-whole grain flour at that point it's just a simple carbohydrate.\"\n\nThe narrative progresses with the man holding the bread with the yellow label, then placing it back as the word 'meh' appears above his head. As he swaps it for another loaf with an orange label, the text 'THE BEST' appears in orange letters. He introduces 'Food for Life Ezekiel 4:9 Flourless Sprouted Grain Bread' while an overlay shows its label and organic components. He punctuates his talk with gestures toward the shelves and bread. The audio advocates sprouted whole grain bread as the best option: \"The best you can do at the grocery store is sprouted whole grain. Ezekiel is the OG. It's organic, it's sprouted, and sprouted grains are more easily digestible by your body.\"\n\nVisuals continue with the man holding 'Food for Life Ezekiel 4:9 Flourless Sprouted Grain Bread', emphasizing its benefits as a cartoon image of a smiling pastry and an unlocked padlock appear above. The green speech bubble 'this is healthy' highlights his statements. He points towards the bread and shelves, while a cartoon of a determined animal appears. 
He finalizes his explanation by placing the loaf back, while the audio describes: \"And when you sprout a grain it unlocks more nutrition the problem is this is healthy and kind of has that healthiest texture the kids might not like which is why Silver Hills sprouted bread...\"\n\nIn conclusion, the man holds a loaf of 'Silver Hills Sprouted Power Little Big Bread' with a blue label, ardently talking and pointing towards the shelves and bread. An overlay reiterates the bread\u2019s name and label. A bread image appears at the top right of the screen. The text '100% organic' with a plant icon and the message 'must try' frame the final product review. The audio endorses: \"And the blue bag is light and pillowy. It's organic. It's sprouted. It makes a killer grilled cheese sandwich. And this is the kind of bread you want to be eating for you and your family.\"\n\nThrough vibrant animations, informative overlays, and compelling personal narration, this video effectively introduces the different kinds of grain bread and offers advice on choosing bread."
+ }
+ ],
+ "events": [
+ "Video takes place in a grocery store aisle filled with various types of bread.",
+ "Man wearing a red t-shirt with 'Bacon is my favorite wrapper' and light blue jeans guides viewers through different breads.",
+ "Man walks down the aisle gesturing energetically, clapping, and pointing at loaves of bread.",
+ "Cartoon image of a loaf of bread with a heart appears.",
+ "Text 'WHOLE GRAIN GAINS' appears on screen.",
+ "Man picks up a loaf of 'Dave's Killer Bread' with a green label reading '21 Whole Grains and Seeds.'",
+ "He presents the bread to the camera.",
+ "Text 'LOOK' appears above his head.",
+ "Audio features a man stating the benefits of whole grain bread and enthusiasm for the brand logo.",
+ "Upbeat music plays in the background.",
+ "Ingredients list overlay appears on screen, highlighting organic components.",
+ "Text 'BUT' appears over the ingredients list.",
+ "Nutritional information is shown: 110 calories, 1.5g fat, 0g cholesterol, 170mg sodium, 22g carbohydrates, 5g fiber, 5g sugars (including 5g added), 5g protein per slice.",
+ "Cartoon image of a sandwich appears above his head.",
+ "Audio points out the added sugars in the bread.",
+ "Man places 'Dave's Killer Bread' back on the shelf.",
+ "Man reaches for a different loaf with a yellow label.",
+ "Cartoon loaf with number '2' and text 'Bye Felicia' appears above his head.",
+ "Overlay displays the bread's ingredients.",
+ "Audio explains enriched wheat or non-whole grain flour is a simple carbohydrate.",
+ "Man places bread with yellow label back on the shelf.",
+ "Text 'meh' appears above his head.",
+ "Man swaps bread for another loaf with an orange label.",
+ "Text 'THE BEST' appears.",
+ "Man introduces 'Food for Life Ezekiel 4:9 Flourless Sprouted Grain Bread'.",
+ "Overlay shows the bread's label and organic components.",
+ "Audio advocates sprouted whole grain bread as the best option.",
+ "Cartoon image of a smiling pastry and unlocked padlock appear.",
+ "Text 'this is healthy' appears in a green speech bubble.",
+ "Cartoon of a determined animal appears.",
+ "Man puts 'Food for Life Ezekiel 4:9 Flourless Sprouted Grain Bread' back on the shelf.",
+ "Audio describes the benefits of sprouting grains.",
+ "Man holds a loaf of 'Silver Hills Sprouted Power Little Big Bread' with a blue label.",
+ "Overlay reiterates the bread's name and label.",
+ "Bread image appears on the top right of the screen.",
+ "Text '100% organic' with a plant icon and 'must try' message appear.",
+ "Audio endorses the blue bag bread as light, pillow-like, organic, sprouted, and ideal for grilled cheese sandwiches.",
+ "Video concludes with the endorsement for 'Silver Hills Sprouted Power Little Big Bread'.",
+ "Video utilizes vibrant animations, informative overlays, and compelling personal narration."
+ ]
+ },
+ {
+ "video": "video/229.mp4",
+ "conversations": [
+ {
+ "from": "human",
+ "value": "